class SHAInet::BPETokenizer

Overview

Simple byte-pair encoding tokenizer. It can train a vocabulary from text and encode/decode using the learned merges.

Defined in:

shainet/text/bpe_tokenizer.cr

Constructors

Instance Method Summary

Constructor Detail

def self.new

Instance Method Detail

def decode(ids : Array(Int32)) : String

Decode an array of token IDs back into a string.


def encode(text : String) : Array(Int32)

Encode a string into token IDs. Tokens not already in the vocabulary are added to it, so encoding can grow the vocabulary.
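As a rough usage sketch, a tokenizer might be trained and then used for a round trip as follows. The require path and the sample strings are assumptions made for illustration; only new, train, encode, and decode as listed on this page are called. Whether decode reproduces the input exactly depends on how the tokenizer treats whitespace, so no expected output is shown.

```crystal
require "shainet" # assumed require path for the SHAInet shard

tokenizer = SHAInet::BPETokenizer.new

# Learn merges from a small sample corpus (the text is illustrative only).
tokenizer.train("low lower lowest low low", 30)

# encode returns Array(Int32); tokens missing from the vocabulary
# are added to it during this call, per the description on this page.
ids = tokenizer.encode("lower low")

# decode maps the IDs back to a String.
text = tokenizer.decode(ids)

puts ids.inspect
puts text
```

Because encode may add entries to the vocabulary, calling it on unseen text after training can change what vocab and inv_vocab return.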


def inv_vocab : Array(String)

def merges : Array(Tuple(String, String))

def train(text : String, vocab_size : Int32)

Train the tokenizer vocabulary from the given text using the byte-pair encoding algorithm. vocab_size is the maximum number of unique tokens the trained vocabulary will contain.
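To make the training step concrete, here is a minimal, self-contained sketch of a generic byte-pair encoding merge loop. It is not SHAInet's source: the helper names (most_frequent_pair, merge_pair, train_sketch) are hypothetical, and it works on characters of whitespace-separated words rather than raw bytes, which is a simplifying assumption. It repeatedly merges the most frequent adjacent pair until the vocabulary reaches vocab_size or no pairs remain.

```crystal
# Hypothetical helper: find the most frequent adjacent symbol pair.
def most_frequent_pair(words : Array(Array(String))) : Tuple(String, String)?
  counts = Hash(Tuple(String, String), Int32).new(0)
  words.each do |symbols|
    symbols.each_cons(2) do |pair|
      counts[{pair[0], pair[1]}] += 1
    end
  end
  return nil if counts.empty?
  counts.max_by { |_, count| count }[0]
end

# Hypothetical helper: replace every occurrence of `pair` with one merged symbol.
def merge_pair(symbols : Array(String), pair : Tuple(String, String)) : Array(String)
  merged = [] of String
  i = 0
  while i < symbols.size
    if i + 1 < symbols.size && symbols[i] == pair[0] && symbols[i + 1] == pair[1]
      merged << pair[0] + pair[1]
      i += 2
    else
      merged << symbols[i]
      i += 1
    end
  end
  merged
end

# Sketch of a training loop: start from single characters and merge
# until the vocabulary holds vocab_size tokens or nothing is left to merge.
def train_sketch(text : String, vocab_size : Int32)
  words = text.split.map { |word| word.chars.map(&.to_s) }
  vocab = words.flatten.uniq
  merges = [] of Tuple(String, String)

  while vocab.size < vocab_size
    pair = most_frequent_pair(words)
    break unless pair
    merges << pair
    vocab << pair[0] + pair[1]
    words = words.map { |symbols| merge_pair(symbols, pair) }
  end

  {vocab, merges}
end

vocab, merges = train_sketch("aa aa ab", 3)
puts vocab.inspect
puts merges.inspect
```

In this sketch the learned pairs are collected in the same shape as the merges accessor on this page, Array(Tuple(String, String)), though the real tokenizer's internal bookkeeping may differ.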


def vocab : Hash(String, Int32)
