class SHAInet::BPETokenizer

SHAInet::BPETokenizer
Reference
Object

Overview

A simple byte-pair encoding (BPE) tokenizer. It trains a vocabulary from text, then uses the learned merges to encode strings into token IDs and decode token IDs back into strings.
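As a rough illustration of the train/encode/decode cycle described above, the following sketch uses only the methods listed on this page. The `require "shainet"` line, the sample corpus, and the vocabulary cap of 50 are assumptions made for the example; the exact IDs and the decoded string depend on the library's implementation, so no output is shown.

```crystal
# Assumes the shard is installed and exposed as "shainet".
require "shainet"

tokenizer = SHAInet::BPETokenizer.new

# Learn merges from a tiny corpus; 50 is an arbitrary cap chosen for this example.
tokenizer.train("low lower lowest low low", 50)

ids = tokenizer.encode("low lower") # Array(Int32)
text = tokenizer.decode(ids)        # String

puts ids.inspect
puts text
puts tokenizer.vocab.size
```

Training once and then reusing the same instance for both encoding and decoding keeps the token IDs consistent, since the ID assignments live in that instance's vocabulary.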

Defined in:

shainet/text/bpe_tokenizer.cr

Constructors

.new

Instance Method Summary

#decode(ids : Array(Int32)) : String
Decode an array of token IDs back into a string.
#encode(text : String) : Array(Int32)
Encode a string into token IDs.
#inv_vocab : Array(String)
#merges : Array(Tuple(String, String))
#train(text : String, vocab_size : Int32)
Train the tokenizer vocabulary from the given text using the byte-pair encoding algorithm.
#vocab : Hash(String, Int32)

Constructor Detail

def self.new

Instance Method Detail

def decode(ids : Array(Int32)) : String

Decode an array of token IDs back into a string.

def encode(text : String) : Array(Int32)

Encode a string into token IDs. Tokens that are not yet in the vocabulary are added to it, so encoding can grow the vocabulary.
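Because encode adds unknown tokens to the vocabulary, calling it can change the tokenizer's state. A small sketch of how one might observe this, assuming the shard is required as "shainet" and that the characters in "xyz" do not occur in the training text; how many entries are actually added depends on the implementation:

```crystal
# Assumes the shard is installed and exposed as "shainet".
require "shainet"

tokenizer = SHAInet::BPETokenizer.new
tokenizer.train("hello world", 20)

size_before = tokenizer.vocab.size
ids = tokenizer.encode("xyz") # characters assumed unseen during training
size_after = tokenizer.vocab.size

puts "ids: #{ids.inspect}"
puts "vocabulary grew by #{size_after - size_before} entries"
```

If a fixed vocabulary matters (for example, to match an embedding table sized at training time), it may be worth checking vocab.size after encoding new text.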

def inv_vocab : Array(String)

def merges : Array(Tuple(String, String))

def train(text : String, vocab_size : Int32)

Train the tokenizer vocabulary from the given text using the byte-pair encoding algorithm. vocab_size is the upper bound on the number of unique tokens that will be created.
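To make the role of vocab_size concrete, here is a self-contained sketch of a textbook BPE training loop. It is not SHAInet's source: the helper name sketch_bpe_train is hypothetical, and splitting on whitespace and starting from single characters are assumptions made for the example. The loop repeatedly merges the most frequent adjacent pair and stops once the vocabulary reaches vocab_size or no pairs remain.

```crystal
# Hypothetical helper, not part of SHAInet: a generic BPE training sketch.
def sketch_bpe_train(text : String, vocab_size : Int32) : Tuple(Hash(String, Int32), Array(Tuple(String, String)))
  # Assumption: words are whitespace-separated and start as single characters.
  words = text.split.map { |w| w.chars.map(&.to_s) }

  # Seed the vocabulary with every distinct character.
  vocab = Hash(String, Int32).new
  words.each do |symbols|
    symbols.each { |s| vocab[s] = vocab.size unless vocab.has_key?(s) }
  end

  merges = [] of Tuple(String, String)

  while vocab.size < vocab_size
    # Count adjacent symbol pairs across all words.
    counts = Hash(Tuple(String, String), Int32).new(0)
    words.each do |symbols|
      symbols.each_cons_pair { |a, b| counts[{a, b}] += 1 }
    end
    break if counts.empty? # nothing left to merge

    best = counts.max_by { |_, n| n }[0]
    merged = best[0] + best[1]

    # Replace each occurrence of the best pair with the merged symbol.
    words = words.map do |symbols|
      result = [] of String
      i = 0
      while i < symbols.size
        if i + 1 < symbols.size && symbols[i] == best[0] && symbols[i + 1] == best[1]
          result << merged
          i += 2
        else
          result << symbols[i]
          i += 1
        end
      end
      result
    end

    merges << best
    vocab[merged] = vocab.size unless vocab.has_key?(merged)
  end

  {vocab, merges}
end

vocab, merges = sketch_bpe_train("ab ab ab", 10)
puts merges.inspect # one merge: the pair ("a", "b")
puts vocab.size     # 3 tokens: "a", "b", "ab"
```

In this run the cap of 10 is never reached: after "a" and "b" merge into "ab" there are no adjacent pairs left, which is why vocab_size is best read as a maximum rather than a guaranteed size.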

def vocab : Hash(String, Int32)
