class String

Overview

A String represents an immutable sequence of UTF-8 characters.

A String is typically created with a string literal, enclosing UTF-8 characters in double quotes:

"hello world"

See String literals in the language reference.

A backslash can be used to denote some characters inside the string:

"\"" # double quote
"\\" # backslash
"\e" # escape
"\f" # form feed
"\n" # newline
"\r" # carriage return
"\t" # tab
"\v" # vertical tab

You can use a backslash followed by an u and four hexadecimal characters to denote a unicode codepoint written:

"\u0041" # == "A"

Or you can use curly braces and specify up to six hexadecimal numbers (0 to 10FFFF):

"\u{41}" # == "A"

A string can span multiple lines:

"hello
      world" # same as "hello\n      world"

Note that in the above example trailing and leading spaces, as well as newlines, end up in the resulting string. To avoid this, you can split a string into multiple lines by joining multiple literals with a backslash:

"hello " \
"world, " \
"no newlines" # same as "hello world, no newlines"

Alternatively, a backslash followed by a newline can be inserted inside the string literal:

"hello \
     world, \
     no newlines" # same as "hello world, no newlines"

In this case, leading whitespace is not included in the resulting string.

If you need to write a string that has many double quotes, parentheses, or similar characters, you can use alternative literals:

# Supports double quotes and nested parentheses
%(hello ("world")) # same as "hello (\"world\")"

# Supports double quotes and nested brackets
%[hello ["world"]] # same as "hello [\"world\"]"

# Supports double quotes and nested curlies
%{hello {"world"}} # same as "hello {\"world\"}"

# Supports double quotes and nested angles
%<hello <"world">> # same as "hello <\"world\">"

To create a String with embedded expressions, you can use string interpolation:

a = 1
b = 2
"sum = #{a + b}" # "sum = 3"

This ends up invoking Object#to_s(IO) on each expression enclosed by #{...}.

If you need to dynamically build a string, use String#build or IO::Memory.

Non UTF-8 valid strings

A string might end up being composed of bytes which form an invalid byte sequence according to UTF-8. This can happen if the string is created via one of the constructors that accept bytes, or when getting a string from String.build or IO::Memory. No exception will be raised, but every byte that doesn't start a valid UTF-8 byte sequence is interpreted as though it encodes the Unicode replacement character (U+FFFD) by itself. For example:

# here 255 is not a valid byte value in the UTF-8 encoding
string = String.new(Bytes[255, 97])
string.valid_encoding? # => false

# The first char here is the unicode replacement char
string.chars # => ['�', 'a']

One can also create strings with specific byte value in them by using octal and hexadecimal escape sequences:

# Octal escape sequences
"\101" # # => "A"
"\12"  # # => "\n"
"\1"   # string with one character with code point 1
"\377" # string with one byte with value 255

# Hexadecimal escape sequences
"\x41" # # => "A"
"\xFF" # string with one byte with value 255

The reason for allowing strings that don't have a valid UTF-8 sequence is that the world is full of content that isn't properly encoded, and having a program raise an exception or stop because of this is not good. It's better if programs are more resilient, but show a replacement character when there's an error in incoming data.

Note that this interpretation only applies to methods inside Crystal; calling #to_slice or #to_unsafe, e.g. when passing a string to a C library, will expose the invalid UTF-8 byte sequences. In particular, Regex's underlying engine may reject strings that are not valid UTF-8, or it may invoke undefined behavior on invalid strings. If this is undesired, #scrub could be used to remove the offending byte sequences first.

Included Modules

Defined in:

extlib.cr

Instance Method Summary

Instance Method Detail

def hex_to_bigint : BigInt #

[View source]
def hex_to_bytes : Bytes #

converts hex string to Bytes in which the leftmost part of the string represents the most significant bytes and the rightmost part of the string represents the least significant bytes Returns Bytes object formatted in big endian if the byte sequence was considered a number - most significant byte in the lowest address and least significant byte in the largest address.


[View source]
def to_period #

[View source]