class Llamero::BaseModel
Llamero::BaseModel < Reference < Object
Overview
The primary client for interacting directly with models that are available on the local computer.
This class allows the app to switch out specific fine-tuning aspects that improve the model's accuracy for specific tasks. This includes:
- adding custom grammars for customizing response formats
- adding or switching out LoRAs
- managing the model's context window (token length)
Logs are always disabled with --log-disable when running from within this app
The chat_template_* properties are used to wrap the system and user prompts. This is used to generate the prompts for the LLM.
Chat models are the most commonly used type of model, and they are the most likely to be used with this class.
You can get the symbols you need from the HF repo you got your model from, under the prompt-template section.
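As a concrete sketch (not taken from the library's own docs), the snippet below configures the wrappers for a Llama 3 Instruct style template using the documented setters. The no-argument constructor, the require path, and the exact template tokens are assumptions; copy the real tokens from the prompt-template section of your model's HF repo.
```crystal
require "llamero" # require path assumed from the shard name

# Assumes a no-argument constructor is available, since every documented
# property has a default value.
model = Llamero::BaseModel.new

# Illustrative Llama 3 Instruct chat-template tokens; verify against the
# prompt-template section of the model's Hugging Face repo.
model.chat_template_system_prompt_opening_wrapper = "<|start_header_id|>system<|end_header_id|>"
model.chat_template_system_prompt_closing_wrapper = "<|eot_id|>"
model.chat_template_user_prompt_opening_wrapper   = "<|start_header_id|>user<|end_header_id|>"
model.chat_template_user_prompt_closing_wrapper   = "<|eot_id|>"
model.chat_template_end_of_generation_token       = "<|eot_id|>"
```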
Included Modules
Llamero::Tokenizer
Defined in:
models/base_model.cr
Constructors
Instance Method Summary
- #chat(prompt_chain : Llamero::BasePrompt, grammar_class : Llamero::BaseGrammar, grammar_file : String | Path = Path.new, timeout : Time::Span = Time::Span.new(minutes: 2), max_retries : Int32 = 5, temperature : Float32 | Nil = nil, max_tokens : Int32 | Nil = nil, repeat_penalty : Float32 | Nil = nil, top_k_sampling : Int32 | Nil = nil, n_predict : Int32 | Nil = nil)
This is the primary method for interacting with the LLM.
- #chat_template_end_of_generation_token : String
- #chat_template_end_of_generation_token=(chat_template_end_of_generation_token : String)
- #chat_template_system_prompt_closing_wrapper : String
- #chat_template_system_prompt_closing_wrapper=(chat_template_system_prompt_closing_wrapper : String)
- #chat_template_system_prompt_opening_wrapper : String
The chat_template_* properties are used to wrap the system and user prompts.
- #chat_template_system_prompt_opening_wrapper=(chat_template_system_prompt_opening_wrapper : String)
The chat_template_* properties are used to wrap the system and user prompts.
- #chat_template_user_prompt_closing_wrapper : String
- #chat_template_user_prompt_closing_wrapper=(chat_template_user_prompt_closing_wrapper : String)
- #chat_template_user_prompt_opening_wrapper : String
- #chat_template_user_prompt_opening_wrapper=(chat_template_user_prompt_opening_wrapper : String)
- #context_size : Int32
Most Llama models use a 2048 context window for their training data.
- #context_size=(context_size : Int32)
Most Llama models use a 2048 context window for their training data.
- #grammar_file : Path
This is just the name of the grammar file, relative to the grammar_root_path.
- #grammar_file=(grammar_file : Path)
This is just the name of the grammar file, relative to the grammar_root_path.
- #grammar_root_path : Path
The directory where any grammar files will be located.
- #grammar_root_path=(grammar_root_path : Path)
The directory where any grammar files will be located.
- #keep : String
This can be set by using the Llamero::Tokenizer#tokenize method from the Llamero::Tokenizer module.
- #keep=(keep : String)
This can be set by using the Llamero::Tokenizer#tokenize method from the Llamero::Tokenizer module.
- #lora_root_path : Path
The directory where any lora filters will be located.
- #lora_root_path=(lora_root_path : Path)
The directory where any lora filters will be located.
- #model_name : String
This should be the full filename of the model, including the .gguf file extension.
- #model_name=(model_name : String)
This should be the full filename of the model, including the .gguf file extension.
- #model_root_path : Path
The directory where the model files will be located.
- #model_root_path=(model_root_path : Path)
The directory where the model files will be located.
- #n_predict : Int32
Sets how many tokens the model tries to predict at a time.
- #n_predict=(n_predict : Int32)
Sets how many tokens the model tries to predict at a time.
- #quick_chat(prompt_chain : Array(NamedTuple(role: String, content: String)), grammar_class : Llamero::BaseGrammar | Nil = nil, grammar_file : String | Path = Path.new, temperature : Float32 | Nil = nil, max_tokens : Int32 | Nil = nil, repeat_penalty : Float32 | Nil = nil, top_k_sampling : Int32 | Nil = nil, n_predict : Int32 | Nil = nil, timeout : Time::Span = Time::Span.new(minutes: 5), max_retries : Int32 = 5) : String
This is the main method for interacting with the LLM.
- #repeat_penalty : Float32
Adjust up to punish repetitions more harshly, lower for more monotonous responses.
- #repeat_penalty=(repeat_penalty : Float32)
Adjust up to punish repetitions more harshly, lower for more monotonous responses.
- #temperature : Float32
Adjust up or down to play with creativity.
- #temperature=(temperature : Float32)
Adjust up or down to play with creativity.
- #threads : Int32
Number of threads.
- #threads=(threads : Int32)
Number of threads.
- #tmp_grammar_file_path : Path
Sometimes we'll need to use a temporary file to pass the grammar to the LLM.
- #tmp_grammar_file_path=(tmp_grammar_file_path : Path)
Sometimes we'll need to use a temporary file to pass the grammar to the LLM.
- #top_k_sampling : Int32
Adjust up to get more unique responses, adjust down to get more "probable" responses.
- #top_k_sampling=(top_k_sampling : Int32)
Adjust up to get more unique responses, adjust down to get more "probable" responses.
- #unique_token_at_the_end_of_the_prompt_to_split_on : String
This is a unique token at the end of the prompt that is used to split off and parse the response from the LLM.
- #unique_token_at_the_end_of_the_prompt_to_split_on=(unique_token_at_the_end_of_the_prompt_to_split_on : String)
This is a unique token at the end of the prompt that is used to split off and parse the response from the LLM.
Instance methods inherited from module Llamero::Tokenizer
model_name : String, model_root_path : Path, tokenize(text_to_tokenize : IO | String) : Array(String)
Constructor Detail
Override any of the default values that are set in the child class.
Instance Method Detail
This is the primary method for interacting with the LLM. It takes a prompt chain, sends the prompt to the LLM and uses concurrency to wait for the response or retry after a timeout threshold.
Default timeout: 2 minutes. Maximum retries: 5.
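A hedged sketch of a #chat call: prompt and grammar stand in for an already-built Llamero::BasePrompt and Llamero::BaseGrammar (their construction is documented on their own classes), model is the instance configured in the overview example, and the override values are illustrative.
```crystal
# prompt : Llamero::BasePrompt and grammar : Llamero::BaseGrammar are built elsewhere.
response = model.chat(
  prompt_chain: prompt,
  grammar_class: grammar,
  timeout: Time::Span.new(minutes: 3), # override the 2-minute default
  max_retries: 3,                      # override the default of 5
  temperature: 0.2_f32                 # Float32 | Nil, so use an f32 literal
)
```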
The chat_template_* properties are used to wrap the system and user prompts. This is used to generate the prompts for the LLM.
The chat_template_* properties are used to wrap the system and user prompts. This is used to generate the prompts for the LLM.
Most Llama models use a 2048 context window for their training data. Default: 2048.
Most Llama models use a 2048 context window for their training data. Default: 2048.
This is just the name of the grammar file, relative to the grammar_root_path. If it's blank, it's not included in the executed command.
This is just the name of the grammar file, relative to the grammar_root_path. If it's blank, it's not included in the executed command.
The directory where any grammar files will be located.
Defaults to /Users/#{`whoami`.strip}/grammars
The directory where any grammar files will be located.
Defaults to /Users/#{`whoami`.strip}/grammars
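A small sketch of how the two grammar properties combine; the user name and grammar filename are hypothetical.
```crystal
# The grammar file passed to the LLM is grammar_root_path / grammar_file.
model.grammar_root_path = Path["/Users/alice/grammars"] # hypothetical user
model.grammar_file      = Path["json_response.gbnf"]    # hypothetical grammar file
```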
This can be set by using the Llamero::Tokenizer#tokenize method from the Llamero::Tokenizer module.
This can be set by using the Llamero::Tokenizer#tokenize method from the Llamero::Tokenizer module.
The directory where any lora filters will be located. This is optional, but if you want to use lora filters, you will need to specify this. Lora filters are specific to the model they were fine-tuned from.
Default: /Users/#{`whoami`.strip}/loras
The directory where any lora filters will be located. This is optional, but if you want to use lora filters, you will need to specify this. Lora filters are specific to the model they were fine-tuned from.
Default: /Users/#{`whoami`.strip}/loras
This should be the full filename of the model, including the .gguf file extension.
Example: meta-llama-3-8b-instruct-Q6_K.gguf
This should be the full filename of the model, including the .gguf file extension.
Example: meta-llama-3-8b-instruct-Q6_K.gguf
The directory where the model files will be located. This is required.
Default: /Users/#{`whoami`.strip}/models
The directory where the model files will be located. This is required.
Default: /Users/#{`whoami`.strip}/models
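A sketch of pointing the model at a local GGUF file; the user name is hypothetical and the filename is the example from these docs.
```crystal
# The model that gets loaded is model_root_path / model_name.
model.model_root_path = Path["/Users/alice/models"] # hypothetical user
model.model_name      = "meta-llama-3-8b-instruct-Q6_K.gguf"
```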
Sets how many tokens the model tries to predict at a time. Setting this to -1 generates tokens indefinitely, but causes the context window to reset frequently. Setting this to -2 stops generating as soon as the context window fills up. Default: 512
Sets how many tokens the model tries to predict at a time. Setting this to -1 generates tokens indefinitely, but causes the context window to reset frequently. Setting this to -2 stops generating as soon as the context window fills up. Default: 512
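A short sketch of the three modes described above, using the documented setter.
```crystal
model.n_predict = 512 # default: predict up to 512 tokens per generation
model.n_predict = -1  # generate indefinitely; the context window resets frequently
model.n_predict = -2  # stop as soon as the context window fills up
```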
This is the main method for interacting with the LLM. It takes in an array of messages, and returns the response from the LLM.
Default timeout: 5 minutes. Maximum retries: 5.
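A hedged sketch of a #quick_chat call using the documented message format; model is a configured Llamero::BaseModel and the message content is illustrative.
```crystal
messages = [
  {role: "system", content: "You are a terse assistant."},
  {role: "user", content: "Name three prime numbers."}
]

answer = model.quick_chat(messages) # returns the LLM's response as a String
puts answer
```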
Adjust up to punish repetitions more harshly, lower for more monotonous responses. Default: 1.1
Adjust up to punish repetitions more harshly, lower for more monotonous responses. Default: 1.1
Number of threads. Should be set to the number of physical cores, not logical cores. Default is 12, but should be configured per system for optimal performance.
Number of threads. Should be set to the number of physical cores, not logical cores. Default is 12, but should be configured per system for optimal performance.
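One rough way to pick a value, as a sketch: System.cpu_count reports logical cores, so on a hyperthreaded CPU halving it approximates the physical core count.
```crystal
# System.cpu_count returns logical cores; halving it is a rough stand-in for
# physical cores on hyperthreaded CPUs. Tune per machine.
model.threads = (System.cpu_count // 2).to_i
```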
Sometimes we'll need to use a temporary file to pass the grammar to the LLM. This is the path to that file. We'll clear it after we're done with it, but this isn't meant to be used outside of this class. :nodoc:
Sometimes we'll need to use a temporary file to pass the grammar to the LLM. This is the path to that file. We'll clear it after we're done with it, but this isn't meant to be used outside of this class. :nodoc:
Adjust up to get more unique responses, adjust down to get more "probable" responses. Default: 80
Adjust up to get more unique responses, adjust down to get more "probable" responses. Default: 80
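A sketch of tuning the sampling-related properties through their documented setters; the values are illustrative, not recommendations.
```crystal
model.temperature    = 0.7_f32 # higher = more creative, lower = more deterministic
model.top_k_sampling = 40      # below the default of 80, for more "probable" tokens
model.repeat_penalty = 1.2_f32 # slightly stronger than the 1.1 default
```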
This is a unique token at the end of the prompt that is used to split off and parse the response from the LLM.
This is a unique token at the end of the prompt that is used to split off and parse the response from the LLM.