class Llamero::BaseModel

Overview

The primary client for interacting directly with models that are available on the local computer.

This class allows the app to switch out specific fine-tuning aspects that improve the model's accuracy for specific tasks, including the chat template wrappers described below.

Logs are always disabled with --log-disable when running from within this app

The chat_template_* properties wrap the system and user prompts when generating the final prompt sent to the LLM. Chat models are the most commonly used type of model and the most likely to be used with this class. You can find the wrapper tokens you need in the HF repo you downloaded your model from, under its prompt-template section.
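
For example, a minimal sketch of a child class that wires these wrappers up through the constructor; the ChatML-style tokens below are assumptions, so copy the real ones from your model's HF repo:

    class MyChatModel < Llamero::BaseModel
      def initialize
        super(
          model_name: "meta-llama-3-8b-instruct-Q6_K.gguf",
          # Assumed ChatML-style wrappers; your model's may differ:
          chat_template_system_prompt_opening_wrapper: "<|im_start|>system",
          chat_template_system_prompt_closing_wrapper: "<|im_end|>",
          chat_template_user_prompt_opening_wrapper: "<|im_start|>user",
          chat_template_user_prompt_closing_wrapper: "<|im_end|>",
          chat_template_end_of_generation_token: "<|im_end|>"
        )
      end
    end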

Included Modules

Llamero::Tokenizer

Defined in:

models/base_model.cr

Constructors

Instance Method Summary

Instance methods inherited from module Llamero::Tokenizer

model_name : String
model_root_path : Path
tokenize(text_to_tokenize : IO | String) : Array(String)

Constructor Detail

def self.new(
  model_name : String,
  grammar_root_path : Path | Nil = nil,
  lora_root_path : Path | Nil = nil,
  model_root_path : Path | Nil = nil,
  repeat_penalty : Float32 | Nil = nil,
  top_k_sampling : Int32 | Nil = nil,
  threads : Int32 | Nil = nil,
  grammer_file : String | Nil = nil,
  context_size : Int32 | Nil = nil,
  temperature : Float32 | Nil = nil,
  keep : String | Nil = nil,
  n_predict : Int32 | Nil = nil,
  chat_template_system_prompt_opening_wrapper : String | Nil = nil,
  chat_template_system_prompt_closing_wrapper : String | Nil = nil,
  chat_template_user_prompt_opening_wrapper : String | Nil = nil,
  chat_template_user_prompt_closing_wrapper : String | Nil = nil,
  unique_token_at_the_end_of_the_prompt_to_split_on : String | Nil = nil,
  chat_template_end_of_generation_token : String | Nil = nil
) #

Pass any of these arguments to override the default values set in the child class.
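
For illustration, a hypothetical direct instantiation that overrides a few of those defaults (the path and values here are assumptions, not recommendations):

    model = Llamero::BaseModel.new(
      model_name: "meta-llama-3-8b-instruct-Q6_K.gguf",
      model_root_path: Path["/Users/me/models"], # hypothetical location
      temperature: 0.2_f32, # lower than the 0.9 default for steadier output
      context_size: 4096,
      threads: 8 # match your machine's physical core count
    )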


[View source]

Instance Method Detail

def chat(
  prompt_chain : Llamero::BasePrompt,
  grammar_class : Llamero::BaseGrammar,
  grammar_file : String | Path = Path.new,
  timeout : Time::Span = Time::Span.new(minutes: 2),
  max_retries : Int32 = 5,
  temperature : Float32 | Nil = nil,
  max_tokens : Int32 | Nil = nil,
  repeat_penalty : Float32 | Nil = nil,
  top_k_sampling : Int32 | Nil = nil,
  n_predict : Int32 | Nil = nil
) #

This is the primary method for interacting with the LLM. It takes a prompt chain, sends the prompt to the LLM, and uses concurrency to wait for the response or to retry once the timeout threshold passes.

Default timeout: 2 minutes. Default max retries: 5.
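
A usage sketch, assuming `model` was constructed as above, `prompt` is a Llamero::BasePrompt built elsewhere, and `my_grammar` is an instance of a Llamero::BaseGrammar subclass:

    response = model.chat(
      prompt,
      my_grammar,
      timeout: Time::Span.new(minutes: 3), # override the 2-minute default
      max_retries: 3
    )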


[View source]
def chat_template_end_of_generation_token : String #

[View source]
def chat_template_end_of_generation_token=(chat_template_end_of_generation_token : String) #

[View source]
def chat_template_system_prompt_closing_wrapper : String #

[View source]
def chat_template_system_prompt_closing_wrapper=(chat_template_system_prompt_closing_wrapper : String) #

[View source]
def chat_template_system_prompt_opening_wrapper : String #

The chat_template_* properties are used to wrap the system and user prompts. This is used to generate the prompts for the LLM.


[View source]
def chat_template_system_prompt_opening_wrapper=(chat_template_system_prompt_opening_wrapper : String) #

The chat_template_* properties are used to wrap the system and user prompts. This is used to generate the prompts for the LLM.


[View source]
def chat_template_user_prompt_closing_wrapper : String #

[View source]
def chat_template_user_prompt_closing_wrapper=(chat_template_user_prompt_closing_wrapper : String) #

[View source]
def chat_template_user_prompt_opening_wrapper : String #

[View source]
def chat_template_user_prompt_opening_wrapper=(chat_template_user_prompt_opening_wrapper : String) #

[View source]
def context_size : Int32 #

Most Llama models are trained with a 2048-token context window. Default: 2048.


[View source]
def context_size=(context_size : Int32) #

Most Llama models are trained with a 2048-token context window. Default: 2048.


[View source]
def grammar_file : Path #

This is just the name of the grammar file, relative to the grammar_root_path. If it is blank, it is not included in the executed command.


[View source]
def grammar_file=(grammar_file : Path) #

This is just the name of the grammar file, relative to the grammar_root_path. If it is blank, it is not included in the executed command.


[View source]
def grammar_root_path : Path #

The directory where any grammar files will be located

Defaults to /Users/#{whoami.strip}/grammars


[View source]
def grammar_root_path=(grammar_root_path : Path) #

The directory where any grammar files will be located

Defaults to /Users/#{whoami.strip}/grammars


[View source]
def keep : String #

This can be set using the Llamero::Tokenizer#tokenize method from the included Tokenizer module.
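
A sketch of one way that might look; mapping the re-joined tokens onto keep is an assumption here:

    # tokenize comes from the included Llamero::Tokenizer module and
    # returns an Array(String).
    leading_tokens = model.tokenize("You are a helpful assistant.")
    model.keep = leading_tokens.join # assumption: keep takes the re-joined tokens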


[View source]
def keep=(keep : String) #

This can be set using the Llamero::Tokenizer#tokenize method from the included Tokenizer module.


[View source]
def lora_root_path : Path #

The directory where any LoRA filters will be located. This is optional, but required if you want to use LoRA filters. LoRA filters are specific to the model they were fine-tuned from.

Default: /Users/#{whoami.strip}/loras


[View source]
def lora_root_path=(lora_root_path : Path) #

The directory where any LoRA filters will be located. This is optional, but required if you want to use LoRA filters. LoRA filters are specific to the model they were fine-tuned from.

Default: /Users/#{whoami.strip}/loras


[View source]
def model_name : String #

This should be the full filename of the model, including the .gguf file extension.

Example: meta-llama-3-8b-instruct-Q6_K.gguf


[View source]
def model_name=(model_name : String) #

This should be the full filename of the model, including the .gguf file extension.

Example: meta-llama-3-8b-instruct-Q6_K.gguf


[View source]
def model_root_path : Path #

The directory where the model files will be located. This is required.

Default: /Users/#{whoami.strip}/models


[View source]
def model_root_path=(model_root_path : Path) #

The directory where the model files will be located. This is required.

Default: /Users/#{whoami.strip}/models


[View source]
def n_predict : Int32 #

Sets how many tokens the model attempts to predict at a time. Setting this to -1 generates tokens indefinitely, but causes the context window to reset frequently. Setting this to -2 stops generating as soon as the context window fills up. Default: 512


[View source]
def n_predict=(n_predict : Int32) #

Sets how many tokens the model attempts to predict at a time. Setting this to -1 generates tokens indefinitely, but causes the context window to reset frequently. Setting this to -2 stops generating as soon as the context window fills up. Default: 512


[View source]
def quick_chat(
  prompt_chain : Array(NamedTuple(role: String, content: String)),
  grammar_class : Llamero::BaseGrammar | Nil = nil,
  grammar_file : String | Path = Path.new,
  temperature : Float32 | Nil = nil,
  max_tokens : Int32 | Nil = nil,
  repeat_penalty : Float32 | Nil = nil,
  top_k_sampling : Int32 | Nil = nil,
  n_predict : Int32 | Nil = nil,
  timeout : Time::Span = Time::Span.new(minutes: 5),
  max_retries : Int32 = 5
) : String #

This is a convenience method for interacting with the LLM. It takes an array of role/content messages and returns the LLM's response as a String.

Default timeout: 5 minutes. Default max retries: 5.
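
A minimal sketch of a quick_chat call:

    reply = model.quick_chat([
      {role: "system", content: "You are a terse assistant."},
      {role: "user", content: "Name one Crystal web framework."},
    ])
    puts reply # the response comes back as a String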


[View source]
def repeat_penalty : Float32 #

Adjust up to penalize repetition more harshly; adjust down to allow more repetitive, monotonous responses. Default: 1.1


[View source]
def repeat_penalty=(repeat_penalty : Float32) #

Adjust up to penalize repetition more harshly; adjust down to allow more repetitive, monotonous responses. Default: 1.1


[View source]
def temperature : Float32 #

Adjust up for more creative responses, down for more predictable ones. Default: 0.9


[View source]
def temperature=(temperature : Float32) #

Adjust up for more creative responses, down for more predictable ones. Default: 0.9


[View source]
def threads : Int32 #

The number of threads to run on. Set this to the number of physical cores, not logical cores. Default: 12, but it should be configured per system for optimal performance.


[View source]
def threads=(threads : Int32) #

The number of threads to run on. Set this to the number of physical cores, not logical cores. Default: 12, but it should be configured per system for optimal performance.


[View source]
def tmp_grammar_file_path : Path #

Sometimes we'll need to use a temporary file to pass the grammar to the LLM. This is the path to that file. It is cleared after use, and it isn't meant to be used outside of this class.


[View source]
def tmp_grammar_file_path=(tmp_grammar_file_path : Path) #

Sometimes we'll need to use a temporary file to pass the grammar to the LLM. This is the path to that file. It is cleared after use, and it isn't meant to be used outside of this class.


[View source]
def top_k_sampling : Int32 #

Adjust up to get more unique responses; adjust down to get more "probable" responses. Default: 80


[View source]
def top_k_sampling=(top_k_sampling : Int32) #

Adjust up to get more unique responses; adjust down to get more "probable" responses. Default: 80


[View source]
def unique_token_at_the_end_of_the_prompt_to_split_on : String #

This is a unique token at the end of the prompt that is used to split off and parse the response from the LLM.


[View source]
def unique_token_at_the_end_of_the_prompt_to_split_on=(unique_token_at_the_end_of_the_prompt_to_split_on : String) #

This is a unique token at the end of the prompt that is used to split off and parse the response from the LLM.


[View source]