class SHAInet::Tokenizer

Overview

A very small whitespace tokenizer intended for toy examples. It builds a vocabulary of words from the given text, encodes sentences into arrays of token IDs, and decodes arrays of token IDs back into words.

Defined in:

shainet/text/tokenizer.cr

Constructor Detail

def self.new

Instance Method Detail

def build(text : String)

Update the vocabulary with all unique words from the given text. Splits the text on whitespace.
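
As an illustration of what building a word vocabulary from whitespace-separated text can look like, here is a minimal sketch in plain Crystal. It does not use or reproduce SHAInet's source: the local names `words`, `unique`, and `vocab` are hypothetical, the vocabulary starts empty, and numbering IDs from 0 in order of first appearance is an assumption. How the library itself treats leading or trailing whitespace is not stated on this page.

```crystal
# Hypothetical sketch, not SHAInet's implementation.
text = "the cat\tsaw  the\ndog"

# String#split with no arguments splits on runs of whitespace
# (spaces, tabs, newlines) and drops empty strings.
words = text.split

# Array#uniq keeps the first occurrence of each word, in order.
unique = words.uniq

# Assumption: IDs are assigned sequentially, starting at 0.
vocab = {} of String => Int32
unique.each_with_index { |word, id| vocab[word] = id }

p words  # => ["the", "cat", "saw", "the", "dog"]
p vocab  # => {"the" => 0, "cat" => 1, "saw" => 2, "dog" => 3}
```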


def decode(ids : Array(Int32)) : Array(String)

Convert an array of token IDs back to their corresponding words. Each unknown ID is returned as an empty string.
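
The empty-string fallback described above can be sketched with a plain array lookup. This is only an illustration under assumptions: the table `inv_vocab` below is a hypothetical local variable, and the lookup shown is not necessarily how the library implements `decode`.

```crystal
# Hypothetical ID -> word table; index 0 is "hello", index 1 is "world".
inv_vocab = ["hello", "world"]

# 7 is out of range, so it stands in for an unknown ID.
ids = [1, 0, 7]

# Array#[]? returns nil for an out-of-range index instead of raising,
# so `|| ""` substitutes an empty string for unknown IDs.
words = ids.map { |id| inv_vocab[id]? || "" }

p words  # => ["world", "hello", ""]
```

Crystal's `Array#[]?` counts negative indices from the end of the array, so this sketch would resolve `-1` to the last word rather than treat it as unknown.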


def encode(text : String) : Array(Int32)

Convert a string into an array of token IDs. Unknown tokens are added to the vocabulary.
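
To make the "unknown tokens are added" behavior concrete, the sketch below defines a small stand-in class. `ToyTokenizer` is hypothetical and exists only for this example: it mirrors the documented signature of `encode` and the types of `vocab` and `inv_vocab`, but giving each new word the next unused ID is an assumption, not something confirmed by this page.

```crystal
# Hypothetical stand-in, not SHAInet::Tokenizer itself.
class ToyTokenizer
  getter vocab : Hash(String, Int32) = {} of String => Int32
  getter inv_vocab : Array(String) = [] of String

  def encode(text : String) : Array(Int32)
    text.split.map do |word|
      if id = @vocab[word]?
        # Known word: reuse its existing ID.
        id
      else
        # Assumption: an unknown word receives the next unused ID
        # and is recorded in both lookup tables.
        new_id = @inv_vocab.size
        @vocab[word] = new_id
        @inv_vocab << word
        new_id
      end
    end
  end
end

tok = ToyTokenizer.new
first = tok.encode("to be or not to be")
second = tok.encode("not to worry")

p first           # => [0, 1, 2, 3, 0, 1]
p second          # => [3, 0, 4]
p tok.vocab.size  # => 5
```

Because encoding grows the vocabulary as a side effect, the second call reuses the IDs assigned during the first one and adds only the word it has not seen before.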


def inv_vocab : Array(String)

def vocab : Hash(String, Int32)
