class SHAInet::BPETokenizer
- SHAInet::BPETokenizer
- Reference
- Object
Overview
Simple byte-pair encoding tokenizer. It can train a vocabulary from text and encode/decode using the learned merges.
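A minimal usage sketch of the train/encode/decode cycle described above. It assumes the shard is loaded with `require "shainet"` and that the class has a no-argument constructor; neither is stated on this page. The corpus and vocabulary size are arbitrary illustration values.

```crystal
require "shainet" # assumption: standard entry point of the shard

# Assumption: BPETokenizer can be built without arguments.
tokenizer = SHAInet::BPETokenizer.new

# Learn merges from a tiny corpus, creating at most 50 tokens.
tokenizer.train("low lower lowest low low", 50)

# Encode text into token IDs, then map the IDs back to text.
ids = tokenizer.encode("low lower") # Array(Int32)
text = tokenizer.decode(ids)        # String

puts ids.inspect
puts text

# The learned state is exposed through the getters listed in the summary.
puts tokenizer.vocab.size  # entries in the token => ID table
puts tokenizer.merges.size # number of learned merge rules
```

Because #encode adds unknown tokens to the vocabulary, encoding text that was not seen during training may grow #vocab beyond its size after #train.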
Defined in:
shainet/text/bpe_tokenizer.cr
Constructors
Instance Method Summary
- #decode(ids : Array(Int32)) : String
  Decode an array of token IDs back into a string.
- #encode(text : String) : Array(Int32)
  Encode a string into token IDs.
- #inv_vocab : Array(String)
- #merges : Array(Tuple(String, String))
- #train(text : String, vocab_size : Int32)
  Train the tokenizer vocabulary from the given text using the byte-pair encoding algorithm.
- #vocab : Hash(String, Int32)
Constructor Detail
Instance Method Detail
def encode(text : String) : Array(Int32)
Encode a string into token IDs. Unknown tokens are added to the vocabulary.
def train(text : String, vocab_size : Int32)
Train the tokenizer vocabulary from the given text using the byte-pair encoding algorithm. vocab_size sets the maximum number of unique tokens that will be created.
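To illustrate what byte-pair encoding training does in general, here is a self-contained toy sketch. It is not SHAInet's implementation: `toy_bpe_merges` is a hypothetical helper, and details such as whitespace splitting and the absence of an end-of-word marker are simplifying assumptions. It repeatedly merges the most frequent adjacent symbol pair until the vocabulary reaches `vocab_size` or no pairs remain.

```crystal
# Toy BPE training loop (illustrative only, not the library's code).
# Returns the learned merge rules in the order they were created.
def toy_bpe_merges(text : String, vocab_size : Int32) : Array(Tuple(String, String))
  # Start with every word split into single-character symbols.
  words = text.split.map { |w| w.chars.map(&.to_s) }
  vocab = words.flatten.to_set
  merges = [] of Tuple(String, String)

  while vocab.size < vocab_size
    # Count every adjacent symbol pair across all words.
    counts = Hash(Tuple(String, String), Int32).new(0)
    words.each do |symbols|
      symbols.each_cons(2) { |pair| counts[{pair[0], pair[1]}] += 1 }
    end
    break if counts.empty? # nothing left to merge

    # Select the most frequent pair and record it as a merge rule.
    best = counts.max_by { |_, n| n }[0]
    merges << best
    vocab << best[0] + best[1]

    # Rewrite each word, replacing the pair with the merged symbol.
    words = words.map do |symbols|
      merged = [] of String
      i = 0
      while i < symbols.size
        if i + 1 < symbols.size && symbols[i] == best[0] && symbols[i + 1] == best[1]
          merged << best[0] + best[1]
          i += 2
        else
          merged << symbols[i]
          i += 1
        end
      end
      merged
    end
  end

  merges
end

p toy_bpe_merges("aa aa aa ab", 3)
```

In this sketch the loop stops as soon as the vocabulary holds `vocab_size` symbols, which mirrors the "at most" wording above: a small corpus can run out of pairs before the limit is reached.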