class String
- String
- Reference
- Object
Overview
A String represents an immutable sequence of UTF-8 characters.
A String is typically created with a string literal, enclosing UTF-8 characters in double quotes:
"hello world"
See String literals in the language reference.
A backslash can be used to denote some characters inside the string:
"\"" # double quote
"\\" # backslash
"\e" # escape
"\f" # form feed
"\n" # newline
"\r" # carriage return
"\t" # tab
"\v" # vertical tab
You can use a backslash followed by a u and four hexadecimal digits to denote a Unicode code point:
"\u0041" # == "A"
Or you can use curly braces and specify up to six hexadecimal digits (0 to 10FFFF):
"\u{41}" # == "A"
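As a minimal sketch of how the two escape forms relate, the snippet below compares them and uses the curly-brace form for a code point that does not fit in four hexadecimal digits. The variable names are illustrative only.

```crystal
# Both escape forms denote the same code point, U+0041.
short = "\u0041"
braces = "\u{41}"

# Code points above U+FFFF need the curly-brace form.
emoji = "\u{1F600}"

short == braces # => true
emoji.size      # => 1 (one character)
emoji.bytesize  # => 4 (four bytes in UTF-8)
emoji[0].ord    # => 128512 (0x1F600)
```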
A string can span multiple lines:
"hello
      world" # same as "hello\n      world"
Note that in the above example trailing and leading spaces, as well as newlines, end up in the resulting string. To avoid this, you can split a string into multiple lines by joining multiple literals with a backslash:
"hello " \
"world, " \
"no newlines" # same as "hello world, no newlines"
Alternatively, a backslash followed by a newline can be inserted inside the string literal:
"hello \
     world, \
     no newlines" # same as "hello world, no newlines"
In this case, leading whitespace is not included in the resulting string.
If you need to write a string that has many double quotes, parentheses, or similar characters, you can use alternative literals:
# Supports double quotes and nested parentheses
%(hello ("world")) # same as "hello (\"world\")"
# Supports double quotes and nested brackets
%[hello ["world"]] # same as "hello [\"world\"]"
# Supports double quotes and nested curlies
%{hello {"world"}} # same as "hello {\"world\"}"
# Supports double quotes and nested angles
%<hello <"world">> # same as "hello <\"world\">"
To create a String with embedded expressions, you can use string interpolation:
a = 1
b = 2
"sum = #{a + b}" # "sum = 3"
This ends up invoking Object#to_s(IO) on each expression enclosed by #{...}.
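To illustrate how interpolation relies on `to_s(IO)`, here is a small sketch with a hypothetical `Point` struct (not part of the standard library) that defines that method. The interpolated string picks up whatever the method writes to the IO.

```crystal
# Hypothetical type, used only for illustration.
struct Point
  def initialize(@x : Int32, @y : Int32)
  end

  # Interpolation appends the value to an IO through this method.
  def to_s(io : IO) : Nil
    io << "(" << @x << ", " << @y << ")"
  end
end

point = Point.new(1, 2)
message = "point = #{point}"
message # => "point = (1, 2)"
```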
If you need to dynamically build a string, use String.build or IO::Memory.
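A short sketch of both approaches, producing the same result. The content being built is arbitrary and only serves as an example.

```crystal
# Build a string piece by piece inside a String.build block.
built = String.build do |io|
  io << "items:"
  3.times do |i|
    io << ' ' << i
  end
end

# The same result using an explicit IO::Memory buffer.
buffer = IO::Memory.new
buffer << "items:"
3.times { |i| buffer << ' ' << i }
from_memory = buffer.to_s

built       # => "items: 0 1 2"
from_memory # => "items: 0 1 2"
```

String.build is convenient when the string is produced in one place, while an IO::Memory can be passed to any method that accepts an IO.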
Non UTF-8 valid strings
A string might end up being composed of bytes which form an invalid byte sequence according to UTF-8. This can happen if the string is created via one of the constructors that accept bytes, or when getting a string from String.build or IO::Memory. No exception will be raised, but every byte that doesn't start a valid UTF-8 byte sequence is interpreted as though it encodes the Unicode replacement character (U+FFFD) by itself. For example:
# here 255 is not a valid byte value in the UTF-8 encoding
string = String.new(Bytes[255, 97])
string.valid_encoding? # => false
# The first char here is the unicode replacement char
string.chars # => ['�', 'a']
One can also create strings with specific byte values in them by using octal and hexadecimal escape sequences:
# Octal escape sequences
"\101" # => "A"
"\12" # => "\n"
"\1" # string with one character with code point 1
"\377" # string with one byte with value 255
# Hexadecimal escape sequences
"\x41" # => "A"
"\xFF" # string with one byte with value 255
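The following sketch shows one way to check what such escapes produce, by inspecting the raw bytes and the encoding validity of the resulting strings. The variable names are illustrative.

```crystal
valid = "\x41"   # a single byte that is also valid UTF-8
invalid = "\xFF" # a single byte that is not valid UTF-8

valid == "A"            # => true
valid.valid_encoding?   # => true
invalid.valid_encoding? # => false
invalid.bytes           # => [255]
invalid.bytesize        # => 1
```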
The reason for allowing strings that don't have a valid UTF-8 sequence is that the world is full of content that isn't properly encoded, and having a program raise an exception or stop because of this is not good. It's better if programs are more resilient, but show a replacement character when there's an error in incoming data.
Note that this interpretation only applies to methods inside Crystal; calling #to_slice or #to_unsafe, e.g. when passing a string to a C library, will expose the invalid UTF-8 byte sequences. In particular, Regex's underlying engine may reject strings that are not valid UTF-8, or it may invoke undefined behavior on invalid strings. If this is undesired, #scrub can be used to replace the offending byte sequences first.
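As a rough sketch of sanitizing a string before handing it to code that expects valid UTF-8, the example below reuses the invalid string from earlier and calls #scrub, first with its default replacement and then with a custom one. The choice of '?' as a replacement is arbitrary.

```crystal
# 255 is not a valid byte value in the UTF-8 encoding.
string = String.new(Bytes[255, 97])

# By default each invalid byte becomes the Unicode replacement character.
scrubbed = string.scrub

# A different replacement can be passed explicitly.
custom = string.scrub('?')

string.valid_encoding?   # => false
scrubbed.valid_encoding? # => true
scrubbed.chars           # => ['�', 'a']
custom                   # => "?a"
```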
Included Modules
- Comparable(String)
Defined in:
core_ext/string_to_yaml.cr
Instance Method Summary
- #to_yaml(yaml : YAML::Nodes::Builder)
  Override default ScalarStyle for String serialisation from ANY to LITERAL to preserve original string formatting for multiline strings.
Instance Method Detail
Override default ScalarStyle for String serialisation from ANY to LITERAL to preserve original string formatting for multiline strings.
This ensures that round-tripping multiline strings through YAML.parse and YAML.dump does not clobber styles from
key: |-
  secret
  value
to
key: 'secret

  value'
which is ugly and unreadable, even if it is exactly identical in usage.