class SHAInet::Tokenizer
- SHAInet::Tokenizer
- Reference
- Object
Overview
A very small tokenizer intended for toy examples. It builds a vocabulary of words from the given text, encodes sentences into arrays of token IDs, and decodes arrays of token IDs back into words.
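As a rough illustration of the workflow described above, the following sketch builds a vocabulary, encodes a sentence, and decodes it again. It assumes the shard is loaded with `require "shainet"` and that `SHAInet::Tokenizer.new` takes no arguments; neither is confirmed by this page. The sample sentences are invented for the example.

```crystal
require "shainet" # assumption: the shard's top-level require

# Assumption: the constructor takes no arguments.
tokenizer = SHAInet::Tokenizer.new

# Populate the vocabulary from whitespace-separated words.
tokenizer.build("the quick brown fox")

# Encode a sentence into token IDs, then map the IDs back to words.
ids = tokenizer.encode("the fox")
words = tokenizer.decode(ids)

puts ids.inspect   # one Int32 per word; exact values depend on the implementation
puts words.inspect # per the method descriptions, this should give back the two words
```

The concrete ID values are left unspecified on purpose, since this page does not document how IDs are assigned.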
Defined in:
shainet/text/tokenizer.cr
Constructors
Instance Method Summary
- #build(text : String)
  Update the vocabulary with all unique words from the given text.
- #decode(ids : Array(Int32)) : Array(String)
  Convert an array of token IDs back to their corresponding words.
- #encode(text : String) : Array(Int32)
  Convert a string into an array of token IDs.
- #inv_vocab : Array(String)
- #vocab : Hash(String, Int32)
Constructor Detail
Instance Method Detail
def build(text : String)
Update the vocabulary with all unique words from the given text. Splits the text on whitespace.
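A small sketch of what `build` does to the tokenizer's state, under the same assumptions as any usage of this class here (a `require "shainet"` entry point and a no-argument constructor, neither confirmed by this page). It inspects the `vocab` and `inv_vocab` getters after building from a sentence that repeats a word.

```crystal
require "shainet" # assumption: the shard's top-level require

tokenizer = SHAInet::Tokenizer.new # assumption: no-argument constructor

# "the" appears twice but should only be stored once,
# since build records unique words.
tokenizer.build("the cat saw the dog")

puts tokenizer.vocab.has_key?("cat") # word => ID lookup
puts tokenizer.vocab.size            # number of distinct words known so far
puts tokenizer.inv_vocab.inspect     # words addressable by ID
```

Because splitting is on whitespace only, punctuation stays attached to words, so `"dog"` and `"dog."` would be treated as different tokens.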
def decode(ids : Array(Int32)) : Array(String)
Convert an array of token IDs back to their corresponding words. Unknown IDs are returned as an empty string.
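The empty-string fallback can be seen by decoding an ID that was never assigned. This is a sketch under the assumptions of a `require "shainet"` entry point and a no-argument constructor; the unassigned ID is derived from the current vocabulary rather than guessed, so it does not depend on how IDs are numbered.

```crystal
require "shainet" # assumption: the shard's top-level require

tokenizer = SHAInet::Tokenizer.new # assumption: no-argument constructor
tokenizer.build("hello world")

known_id = tokenizer.vocab["hello"]
# One past the largest assigned ID, so it cannot be in the vocabulary.
unknown_id = tokenizer.vocab.values.max + 1

decoded = tokenizer.decode([known_id, unknown_id])
puts decoded.inspect # per the description: the known word, then an empty string
```

Since `decode` does not raise on unknown IDs, callers that need to detect them have to check for `""` in the result themselves.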
def encode(text : String) : Array(Int32)
Convert a string into an array of token IDs. Unknown tokens are added to the vocabulary.
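Because `encode` adds unknown tokens to the vocabulary, it has a side effect on the tokenizer. The sketch below makes that visible by comparing the vocabulary size before and after encoding a sentence that contains a new word. It assumes a `require "shainet"` entry point and a no-argument constructor, neither of which is confirmed by this page.

```crystal
require "shainet" # assumption: the shard's top-level require

tokenizer = SHAInet::Tokenizer.new # assumption: no-argument constructor
tokenizer.build("red green")

size_before = tokenizer.vocab.size

# "blue" was never passed to build, so encode should register it.
ids = tokenizer.encode("red blue")

size_after = tokenizer.vocab.size

puts tokenizer.vocab.has_key?("blue") # the new word is now in the vocabulary
puts size_after - size_before         # number of words added by encode
```

One consequence is that encoding held-out or user-supplied text grows the vocabulary, so code that sizes an embedding or output layer from `vocab.size` should read the size only after all text has been encoded.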