llama.cr
Crystal bindings for llama.cpp, a C/C++ implementation of LLaMA, Falcon, GPT-2, and other large language models.
Please check the LLAMA_VERSION file for the current compatible version of llama.cpp.
This project is under active development and may change rapidly.
Features
- Low-level bindings to the llama.cpp C API
- High-level Crystal wrapper classes for easy usage
- Memory management for C resources
- Simple text generation interface
- Advanced sampling methods (Min-P, Typical, Mirostat, etc.)
- Batch processing for efficient token handling
- KV cache management for optimized inference
- State saving and loading
Installation
Prerequisites
You need the llama.cpp shared library (libllama) available on your system.
1. Download Prebuilt Binary (Recommended)
You can download a prebuilt binary matching the required version (read from the LLAMA_VERSION file):
Linux/macOS (bash):
LLAMA_VERSION=$(cat LLAMA_VERSION)
wget "https://github.com/ggerganov/llama.cpp/releases/download/v${LLAMA_VERSION}/llama-linux-x64.zip"
unzip "llama-linux-x64.zip"
# Move libllama.so to ./lib or set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
You can also specify the library location without installing:
crystal build examples/simple.cr --link-flags="-L/path/to/lib"
LD_LIBRARY_PATH=/path/to/lib ./simple /path/to/model.gguf "Your prompt here"
2. Build from Source (advanced users)
You can build llama.cpp from source if you prefer:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout v$(cat ../LLAMA_VERSION)
mkdir build && cd build
cmake ..
cmake --build . --config Release
sudo cmake --install .
sudo ldconfig
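Whichever installation path you choose, a quick sanity check is to compile a tiny program against the library. This is a minimal sketch; it only uses Llama.system_info, which is described under Utilities below.
# check.cr - confirms that libllama can be linked and loaded
require "llama"
puts Llama.system_info
Build it with the same flags as the examples, e.g. crystal build check.cr --link-flags="-L/path/to/lib", then run it with LD_LIBRARY_PATH pointing at the directory containing libllama.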
Obtaining GGUF Model Files
You'll need a model file in GGUF format. For testing, smaller quantized models (1-3B parameters) with Q4_K_M quantization are recommended.
Quantized GGUF models for many popular architectures are available on Hugging Face; search for "GGUF" to find suitable options.
Adding to Your Project
Add the dependency to your shard.yml:
dependencies:
  llama:
    github: kojix2/llama.cr
Then run shards install.
Usage
Basic Text Generation
require "llama"
# Load a model
model = Llama::Model.new("/path/to/model.gguf")
# Create a context
context = model.context
# Generate text
response = context.generate("Once upon a time", max_tokens: 100, temperature: 0.8)
puts response
# Or use the convenience method
response = Llama.generate("/path/to/model.gguf", "Once upon a time")
puts response
Advanced Sampling
require "llama"
model = Llama::Model.new("/path/to/model.gguf")
context = model.context
# Create a sampler chain with multiple sampling methods
chain = Llama::SamplerChain.new
chain.add(Llama::Sampler::TopK.new(40))
chain.add(Llama::Sampler::MinP.new(0.05, 1))
chain.add(Llama::Sampler::Temp.new(0.8))
chain.add(Llama::Sampler::Dist.new(42))
# Generate text with the custom sampler chain
result = context.generate_with_sampler("Write a short poem about AI:", chain, 150)
puts result
Chat Conversations
require "llama"
require "llama/chat"
model = Llama::Model.new("/path/to/model.gguf")
context = model.context
# Create a chat conversation
messages = [
Llama::ChatMessage.new("system", "You are a helpful assistant."),
Llama::ChatMessage.new("user", "Hello, who are you?")
]
# Generate a response
response = context.chat(messages)
puts "Assistant: #{response}"
# Continue the conversation
messages << Llama::ChatMessage.new("assistant", response)
messages << Llama::ChatMessage.new("user", "Tell me a joke")
response = context.chat(messages)
puts "Assistant: #{response}"
Embeddings
require "llama"
model = Llama::Model.new("/path/to/model.gguf")
# Create a context with embeddings enabled
context = model.context(embeddings: true)
# Get embeddings for text
text = "Hello, world!"
tokens = model.vocab.tokenize(text)
batch = Llama::Batch.get_one(tokens)
context.decode(batch)
embeddings = context.get_embeddings_seq(0)
puts "Embedding dimension: #{embeddings.size}"
Utilities
System Info
puts Llama.system_info
Tokenization Utility
model = Llama::Model.new("/path/to/model.gguf")
puts Llama.tokenize_and_format(model.vocab, "Hello, world!", ids_only: true)
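A common use of the tokenizer is counting tokens before sending a prompt, for example to check that it fits in the context window. The sketch below relies only on Vocab#tokenize, which also appears in the Embeddings example above.
require "llama"

model = Llama::Model.new("/path/to/model.gguf")
tokens = model.vocab.tokenize("Hello, world!")
puts "Token count: #{tokens.size}"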
Examples
The examples directory contains sample code demonstrating various features:
- simple.cr - Basic text generation
- chat.cr - Chat conversations with models
- tokenize.cr - Tokenization and vocabulary features
API Documentation
See kojix2.github.io/llama.cr for full API docs.
Core Classes
- Llama::Model - Represents a loaded LLaMA model
- Llama::Context - Handles inference state for a model
- Llama::Vocab - Provides access to the model's vocabulary
- Llama::Batch - Manages batches of tokens for efficient processing
- Llama::KvCache - Controls the key-value cache for optimized inference
- Llama::State - Handles saving and loading model state
- Llama::SamplerChain - Combines multiple sampling methods
Samplers
- Llama::Sampler::TopK - Keeps only the top K most likely tokens
- Llama::Sampler::TopP - Nucleus sampling (keeps tokens until cumulative probability exceeds P)
- Llama::Sampler::Temp - Applies temperature to logits
- Llama::Sampler::Dist - Samples from the final probability distribution
- Llama::Sampler::MinP - Keeps tokens with probability >= P * max_probability
- Llama::Sampler::Typical - Selects tokens based on their "typicality" (entropy)
- Llama::Sampler::Mirostat - Dynamically adjusts sampling to maintain target entropy
- Llama::Sampler::Penalties - Applies penalties to reduce repetition
Development
See DEVELOPMENT.md for development guidelines.
Contributing
- Fork it (https://github.com/kojix2/llama.cr/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
License
This project is available under the MIT License. See the LICENSE file for more info.