The current cgo bindings for loading LoRAs only support the older
GGLA file format, which llama.cpp no longer supports. This switches to
the functions that load the newer GGUF LoRA format.
Cgo supports passing pointers through C to another Go function by
using integer handles that reference the Go memory. However, it is
not legal to cast such a handle to a void * aux argument in C. Doing
so results in panics about invalid pointers on the stack in certain
circumstances.
Instead, we should pass a pointer to the handle and pin that in
memory. It would probably also be safe to directly pin the Go
function pointer and pass that rather than using the handle since
it is an opaque blob to C. However, using a handle is the more
generally correct solution and there is no need to get clever.
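A minimal sketch of the pattern, using cgo.Handle and runtime.Pinner from the standard library; set_abort_callback is a stand-in for whichever C API ends up holding the void *aux pointer, not the real binding:

```go
/*
// stand-in declaration for a C API that stores a void *aux pointer
void set_abort_callback(void *aux);
*/
import "C"

import (
	"runtime"
	"runtime/cgo"
	"unsafe"
)

func installAbortCallback(fn func() bool) (cgo.Handle, *runtime.Pinner) {
	handle := cgo.NewHandle(fn) // integer handle referencing the Go value

	// Pin a pointer to the handle and pass that pointer to C. Casting
	// the handle itself to void * is what triggers the invalid-pointer
	// panics described above.
	pinner := new(runtime.Pinner)
	pinner.Pin(&handle)
	C.set_abort_callback(unsafe.Pointer(&handle))

	// Keep both alive until C is done with the pointer, then:
	//   pinner.Unpin(); handle.Delete()
	return handle, pinner
}

// On the callback side, the Go value is recovered from the aux pointer:
//   fn := (*(*cgo.Handle)(aux)).Value().(func() bool)
```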
Command line options to the runner that control resource usage
(mmap, mlock, tensor split) are passed by Ollama but not currently
implemented. This adds support for them while ignoring other options
that have no meaning in this context.
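For illustration, the wiring might look like the following; the flag names and the parsing helper are assumptions for the sketch, not the runner's exact spelling:

```go
import (
	"flag"
	"strconv"
	"strings"
)

var (
	mlock       = flag.Bool("mlock", false, "lock model memory so it is never swapped")
	useMmap     = flag.Bool("mmap", true, "memory-map the model file")
	tensorSplit = flag.String("tensor-split", "", "comma-separated per-GPU fractions of the model")
)

// parseTensorSplit turns a string like "3,1" into []float32{3, 1}
// for the model parameters.
func parseTensorSplit(s string) []float32 {
	if s == "" {
		return nil
	}
	var out []float32
	for _, part := range strings.Split(s, ",") {
		f, err := strconv.ParseFloat(strings.TrimSpace(part), 32)
		if err != nil {
			continue // ignore malformed entries in this sketch
		}
		out = append(out, float32(f))
	}
	return out
}
```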
The embeddings endpoint takes only a single input and produces a
single output, rather than the multiple inputs and outputs the current
implementation expected. Fixing this also allows the implementation to
be simplified and a few embedding-specific issues to be addressed.
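A sketch of the simplified shape; the Server type, the s.embed helper, and the JSON field names are hypothetical, the point is one input string mapping to one embedding vector:

```go
import (
	"encoding/json"
	"net/http"
)

type EmbeddingRequest struct {
	Content string `json:"content"` // exactly one input
}

type EmbeddingResponse struct {
	Embedding []float32 `json:"embedding"` // exactly one output
}

func (s *Server) embeddings(w http.ResponseWriter, r *http.Request) {
	var req EmbeddingRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// s.embed stands in for running the model over the single input
	resp := EmbeddingResponse{Embedding: s.embed(req.Content)}
	json.NewEncoder(w).Encode(resp)
}
```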
There are multiple causes and paths that result in a sequence
ending, and not all of them free the sampling context or reset the
pieces slice. This factors the removal code out into a single path
so that every path releases its resources.
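A sketch of the factored-out cleanup; the field and method names are illustrative rather than the runner's exact layout:

```go
// removeSequence is the single place a sequence is torn down, so
// resources are released no matter which path ended the sequence.
func (s *Server) removeSequence(i int, reason string) {
	seq := s.seqs[i]

	seq.samplingCtx.Free() // hypothetical wrapper over the sampling context
	seq.pieces = nil       // reset so the slot can be reused cleanly

	close(seq.responses) // signal the waiting handler that we stopped
	s.seqs[i] = nil
}
```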
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.
This uses the algorithm from llama.cpp that Ollama currently exercises
through the server.cpp code. llama.cpp has other strategies, but they
are never turned on through Ollama, so this restores parity.
The algorithm is:
- Retain a configurable number of tokens at the beginning (for things
  like beginning of sequence tokens)
- Drop the oldest half of the remaining tokens
- Shift the remaining new tokens to the back of the cache
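A sketch of that algorithm in the runner, assuming thin Go wrappers around llama_kv_cache_seq_rm and llama_kv_cache_seq_add; the wrapper and field names are illustrative:

```go
// shiftContext frees up room in the KV cache for one sequence once
// the cache is full, so generation can continue.
func (s *Server) shiftContext(seqID int, numPast *int) {
	numLeft := *numPast - s.numKeep // tokens eligible for eviction
	numDiscard := numLeft / 2       // drop the oldest half of them

	// delete the discarded window, keeping the first numKeep tokens
	// (e.g. a beginning-of-sequence token)
	s.lc.KvCacheSeqRm(seqID, s.numKeep, s.numKeep+numDiscard)

	// slide the surviving tokens back over the hole so the cache is
	// contiguous again and new tokens append at the end
	s.lc.KvCacheSeqAdd(seqID, s.numKeep+numDiscard, *numPast, -numDiscard)

	*numPast -= numDiscard
}
```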
The cgo binding for llama_token_to_piece uses a fixed 12-byte buffer,
which is usually but not always enough to hold a token. This increases
the buffer size when needed, similar to what llama.cpp does internally.
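A sketch of the grow-and-retry pattern; tokenToPiece stands in for the cgo call, which (like llama_token_to_piece itself) returns the negative of the required length when the buffer is too small:

```go
func (m *Model) TokenToPiece(token int) string {
	buf := make([]byte, 12) // enough for the common case
	n := tokenToPiece(m, token, buf)
	if n < 0 {
		// the token needs more space: grow to the reported size and retry
		buf = make([]byte, -n)
		n = tokenToPiece(m, token, buf)
	}
	return string(buf[:n])
}
```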
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed-size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
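Illustratively, the fix amounts to sizing the buffer from the parsed flag instead of a constant:

```go
batchSize := flag.Int("batch-size", 512, "number of tokens per batch")
flag.Parse()

// previously (sketch): a fixed array overflowed once --batch-size > 512
//   var tokens [512]int
// now the buffer tracks the configured size:
tokens := make([]int, 0, *batchSize)
```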