class Llama::Vocab

Overview

Wrapper for the llama_vocab structure

Defined in:

llama/vocab.cr

Constructors

Instance Method Summary

Constructor Detail

def self.new(handle : Pointer(LibLlama::LlamaVocab), model : Model)

Creates a new Vocab instance from a raw pointer

Note: This constructor is intended for internal use. Users should obtain Vocab instances through Model#vocab.



Instance Method Detail

def add_bos? : Bool

Returns whether the model adds a BOS token by default


def add_eos? : Bool

Returns whether the model adds an EOS token by default


def add_sep? : Bool

Returns whether the model adds a SEP token by default


def bos : Int32

Returns the beginning-of-sentence token ID


def control?(token : Int32) : Bool

Checks if a token is a control token


def detokenize(tokens : Array(Int32), remove_special : Bool = true, unparse_special : Bool = false) : String

Converts a token sequence into text.

This is the inverse of #tokenize and should be preferred over joining token text entries when reconstructing generated output.
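
A minimal sketch of the round trip described above, using only #tokenize and #detokenize with their default arguments. The helper name round_trip is hypothetical, and vocab is assumed to be a Llama::Vocab obtained through Model#vocab; the parameter is left untyped so the snippet compiles on its own. The result is not guaranteed to match the input byte for byte, since that depends on the tokenizer.

```crystal
# Hypothetical helper: tokenizes text and converts the tokens back to a String.
# `vocab` is assumed to be a Llama::Vocab (for example from Model#vocab).
def round_trip(vocab, text : String) : String
  tokens = vocab.tokenize(text) # Array(Int32) of token IDs
  vocab.detokenize(tokens)      # defaults: remove_special = true, unparse_special = false
end

# Hypothetical call site, assuming `model` is an already loaded Llama::Model:
#   puts round_trip(model.vocab, "Hello, world")
```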


def eog?(token : Int32) : Bool

Checks if a token is an end-of-generation token


def eos : Int32

Returns the end-of-sentence token ID


def eot : Int32

Returns the end-of-turn token ID


def format_token(token : Int32, show_id : Bool = true, show_text : Bool = true) : String

Formats a token for display

Parameters:

  • token: The token to format
  • show_id: Whether to show the token ID
  • show_text: Whether to show the token text

Returns:

  • A formatted string representation of the token
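
As an illustration, #format_token can be mapped over the result of #tokenize to inspect how a prompt was split. The helper name describe_tokens is hypothetical, and vocab is assumed to be a Llama::Vocab; the exact layout of each formatted string is defined by the library and is not shown here.

```crystal
# Hypothetical helper: returns one display string per token of `text`.
# `vocab` is assumed to be a Llama::Vocab (for example from Model#vocab).
def describe_tokens(vocab, text : String) : Array(String)
  vocab.tokenize(text).map do |token|
    vocab.format_token(token) # defaults: show_id = true, show_text = true
  end
end

# Hypothetical call site, assuming `model` is an already loaded Llama::Model:
#   describe_tokens(model.vocab, "Hello").each { |line| puts line }
```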

def mask : Int32

Returns the mask token ID (if defined by the tokenizer)


def n_tokens : Int32

Returns the number of tokens in the vocabulary


def nl : Int32

Returns the newline token ID


def pad : Int32

Returns the padding token ID


def to_unsafe : Pointer(Llama::LibLlama::LlamaVocab)

Returns the raw pointer to the underlying llama_vocab structure


def token_text(token : Int32) : String

Returns the raw vocabulary text entry for a token.

This is a direct wrapper for llama_vocab_get_text. It is useful for inspecting vocabulary entries, but it is not detokenization. Use #detokenize for token sequences and #token_to_piece for rendering one token as output text.
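
To make the distinction concrete, the sketch below returns both forms for one token so they can be compared side by side. The helper name compare_renderings is hypothetical, and vocab is assumed to be a Llama::Vocab; whether the two strings differ for a given token depends on the tokenizer.

```crystal
# Hypothetical helper: pairs the raw vocabulary entry with the rendered piece.
# `vocab` is assumed to be a Llama::Vocab (for example from Model#vocab).
def compare_renderings(vocab, token : Int32) : NamedTuple(raw: String, piece: String)
  {
    raw:   vocab.token_text(token),     # raw vocabulary entry, for inspection
    piece: vocab.token_to_piece(token), # rendered as output text
  }
end

# Hypothetical call site, assuming `model` is an already loaded Llama::Model:
#   pp compare_renderings(model.vocab, model.vocab.nl)
```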


def token_to_piece(token : Int32, lstrip : Int32 = 0, special : Bool = false) : String

Converts a token to a piece of text

Parameters:

  • token: The token to convert
  • lstrip: Number of leading spaces to skip in the rendered piece (0 = keep all)
  • special: Whether to render special tokens

Returns:

  • The rendered token piece
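
One possible use is writing tokens to an output stream one piece at a time and stopping at the first end-of-generation token. This is a sketch under assumptions: the helper name stream_pieces is hypothetical, vocab is assumed to be a Llama::Vocab, and the tokens are assumed to come from a generation loop that is not shown.

```crystal
# Hypothetical helper: writes each token's piece to `io`, stopping at the
# first end-of-generation token.
# `vocab` is assumed to be a Llama::Vocab (for example from Model#vocab).
def stream_pieces(vocab, tokens : Array(Int32), io : IO) : Nil
  tokens.each do |token|
    break if vocab.eog?(token)         # do not render the end-of-generation token
    io << vocab.token_to_piece(token)  # defaults: lstrip = 0, special = false
  end
end

# Hypothetical call site, assuming `generated` is an Array(Int32) of sampled tokens:
#   stream_pieces(model.vocab, generated, STDOUT)
```

For a complete token sequence that is already in memory, #detokenize remains the simpler choice.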

def token_to_text(token : Int32) : String

Returns the raw vocabulary text entry for a token.

Prefer #token_text for vocabulary inspection. Prefer #detokenize or #token_to_piece when reconstructing model output.

DEPRECATED: Use #token_text for vocabulary inspection, #token_to_piece for a single rendered token, or #detokenize for token sequences.


def tokenize(text : String, add_special : Bool = true, parse_special : Bool = true) : Array(Int32)

Tokenizes a string into an array of token IDs
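
A small sketch of one common use, checking whether a prompt fits a token budget before sending it to the model. The helper name fits_budget? and the max_tokens parameter are hypothetical, and vocab is assumed to be a Llama::Vocab; the count includes any tokens the tokenizer adds under the default arguments.

```crystal
# Hypothetical helper: true when `text` tokenizes to at most `max_tokens` tokens.
# `vocab` is assumed to be a Llama::Vocab (for example from Model#vocab).
def fits_budget?(vocab, text : String, max_tokens : Int32) : Bool
  vocab.tokenize(text).size <= max_tokens # defaults: add_special = true, parse_special = true
end

# Hypothetical call site, assuming `model` is an already loaded Llama::Model:
#   puts fits_budget?(model.vocab, "Hello, world", 512)
```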

