The current cgo bindings for loading LoRAs only support the older
GGLA file format, which is no longer supported. This switches to
the functions that load the newer GGUF LoRAs.
Cgo handles passing pointers through C to another Go function by
using integer handles that refer to the Go memory. However, it is
not legal to cast such a handle to a void * and use it as aux data
in C. Doing this results in panics about invalid pointers on the
stack in certain circumstances.
Instead, we should pass a pointer to the handle and pin that in
memory. It would probably also be safe to directly pin the Go
function pointer and pass that rather than using the handle since
it is an opaque blob to C. However, using a handle is the more
generally correct solution and there is no need to get clever.
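A minimal sketch of the pattern, assuming a hypothetical C API that takes a progress callback plus a void * of user data (function names here are illustrative; the C function definitions would live in a separate .c file, since a file that uses //export may only declare, not define, C functions in its preamble):

    package llama

    /*
    // Declarations only; the definitions live in a separate .c file.
    extern void goProgressBridge(float progress, void *user_data);
    void load_with_progress(void *user_data);
    */
    import "C"

    import (
        "runtime"
        "runtime/cgo"
        "unsafe"
    )

    //export goProgressBridge
    func goProgressBridge(progress C.float, userData unsafe.Pointer) {
        // userData points at the cgo.Handle; a handle is just an integer, so
        // the memory it lives in contains no Go pointers.
        h := *(*cgo.Handle)(userData)
        h.Value().(func(float32))(float32(progress))
    }

    func loadWithProgress(cb func(float32)) {
        h := cgo.NewHandle(cb)
        defer h.Delete()

        // Pass a pointer to the handle instead of casting the handle itself
        // to void *, and pin it so C may hold the pointer while the call runs.
        var pinner runtime.Pinner
        pinner.Pin(&h)
        defer pinner.Unpin()

        C.load_with_progress(unsafe.Pointer(&h))
    }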
Command line options to the runner that control resource usage
(mmap, mlock, tensor split) are used by Ollama but not currently
implemented by the runner. This implements support for these options
while ignoring others that have no meaning in this context.
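A sketch of how those options might be wired up; the flag names and the modelOptions struct are illustrative, not necessarily the runner's exact interface:

    package main

    import (
        "flag"
        "strconv"
        "strings"
    )

    // modelOptions mirrors the resource-related llama.cpp model parameters
    // the runner forwards (use_mmap, use_mlock, tensor_split).
    type modelOptions struct {
        UseMmap     bool
        UseMlock    bool
        TensorSplit []float32
    }

    func parseResourceFlags() modelOptions {
        mlock := flag.Bool("mlock", false, "lock model weights in memory")
        noMmap := flag.Bool("no-mmap", false, "do not memory-map the model")
        tensorSplit := flag.String("tensor-split", "", "comma-separated fraction of the model to place on each GPU")
        flag.Parse()

        opts := modelOptions{UseMmap: !*noMmap, UseMlock: *mlock}
        for _, f := range strings.Split(*tensorSplit, ",") {
            if v, err := strconv.ParseFloat(strings.TrimSpace(f), 32); err == nil {
                opts.TensorSplit = append(opts.TensorSplit, float32(v))
            }
        }
        return opts
    }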
The embeddings endpoint only takes a single input and provides a
single output, rather than the multiple inputs and outputs the current
implementation expected. Fixing this also allows the implementation to
be simplified and a few embedding-specific issues to be addressed.
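The simplified shape, roughly (type and field names illustrative):

    // EmbeddingRequest carries exactly one input per call.
    type EmbeddingRequest struct {
        Content string `json:"content"`
    }

    // EmbeddingResponse carries the single embedding for that input.
    type EmbeddingResponse struct {
        Embedding []float32 `json:"embedding"`
    }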
The health endpoint needs a little more work to show progress as Ollama
expects but we can at least return the right status and have comments
for the future.
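A minimal sketch of what the endpoint returns today, with a placeholder for load progress later (field names illustrative):

    package main

    import (
        "encoding/json"
        "net/http"
    )

    type HealthResponse struct {
        Status string `json:"status"` // "ok" once the model is loaded
        // TODO: report model load progress in the way Ollama expects.
    }

    func healthHandler(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(HealthResponse{Status: "ok"})
    }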
pieces[] is used to cache pending responses and is currently being
passed around to different functions. Move it into the sequences
where it logically belongs.
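Illustrative sketch of the sequence after the move (field names approximate):

    // Sequence tracks one in-flight request. Pending response pieces now live
    // here instead of in a separate slice threaded through every function.
    type Sequence struct {
        inputs       []int       // prompt tokens not yet placed into a batch
        pieces       []string    // decoded text waiting to be sent to the client
        responses    chan string // channel the HTTP handler streams from
        numPredicted int         // tokens generated so far
    }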
If a connection is closed while a sequence is being decoded, tokens
will continue to be added to the channel without anyone to read them.
This will result in the sender blocking, which will in turn block
all other decoding and sending for other sequences.
This is not limited to just the connection between Ollama and the
runner process. If the connection to the Ollama API is closed by
the user then Ollama will close the connection to the runner,
triggering this issue.
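One way to express the fix, assuming the sequence gains a quit channel that the HTTP handler closes when the client goes away; the run loop then never blocks sending to a reader that no longer exists:

    // In the HTTP handler, done is r.Context().Done().
    func stream(done <-chan struct{}, seq *Sequence, send func(string)) {
        for {
            select {
            case piece, ok := <-seq.responses:
                if !ok {
                    return
                }
                send(piece)
            case <-done:
                close(seq.quit) // tell the run loop this sequence is gone
                return
            }
        }
    }

    // In the run loop, a send that cannot block forever; on false the caller
    // removes the sequence and frees its resources.
    func deliver(seq *Sequence, piece string) bool {
        select {
        case seq.responses <- piece:
            return true
        case <-seq.quit:
            return false
        }
    }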
There are multiple causes and paths that result in a sequence
ending. Not all of these free the sampling context or reset the
pieces slice. This factors out the removal code so that all
paths release resources.
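Sketch of the factored-out path, using the runner's sequence fields from the sketches above and a hypothetical wrapper around the llama.cpp sampling context:

    // removeSequence is the single place a sequence is retired, whatever the
    // reason: completion, hitting a limit, an error, or a closed connection.
    func (s *Server) removeSequence(i int, reason string) {
        seq := s.seqs[i]

        seq.doneReason = reason
        close(seq.responses)   // unblocks the HTTP handler
        seq.samplingCtx.Free() // hypothetical wrapper over the sampling context
        seq.pieces = nil       // drop any cached, unsent pieces
        s.seqs[i] = nil        // free the slot for the next request
    }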
Currently the entire KV cache is shared by all parallel requestors.
This gives maximum resource utilization but there is a potential for
overflow and unfairness if multiple requests are trying to use
significant context. Instead, it is better to have a hard partition
of KV cache space.
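Sketch of the partition: the cache is sized for all slots up front and each sequence is capped at its own share, so one large request cannot overflow into another's space.

    // The llama.cpp context is created with numCtxPerSeq * numParallel cache
    // slots; fitsPartition is the per-sequence check used when scheduling work.
    func fitsPartition(numPast, newTokens, numCtxPerSeq int) bool {
        return numPast+newTokens <= numCtxPerSeq
    }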
It is not safe to hold a mutex only while we are waiting for the
condition variable to signal that a new sequence has been added. It's
possible that a sequence could be added in the middle of batch
processing. For example, if a new sequence is added while Decode()
is running, it will get picked up for sampling, despite not having
been added to the original batch.
This change holds a mutex for the majority of the time when active
processing is happening, releasing it only for a brief period each
time around the loop. Depending on the workload and the scheduler,
this may result in unfairness between different requests. However,
this was not actually observed in testing.
This addresses the correctness issue; better performance and fairness
can be achieved with additional improvements in the future.
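A sketch of the locking pattern with sync.Cond: the lock is held across batch construction, decoding, and sampling, released while idle by cond.Wait, and dropped only briefly each time around the loop so handlers can add or cancel sequences.

    package main

    import "sync"

    type Sequence struct{ /* prompt tokens, cache slot, sampling state, ... */ }

    type Server struct {
        mu   sync.Mutex
        cond *sync.Cond
        seqs []*Sequence
    }

    func newServer() *Server {
        s := &Server{}
        s.cond = sync.NewCond(&s.mu)
        return s
    }

    func (s *Server) add(seq *Sequence) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.seqs = append(s.seqs, seq)
        s.cond.Signal()
    }

    func (s *Server) run() {
        s.mu.Lock()
        defer s.mu.Unlock()
        for {
            for len(s.seqs) == 0 {
                s.cond.Wait() // releases the lock while sleeping
            }

            // Briefly let handlers in; a sequence added here is only sampled
            // after it has been placed into a batch on a later iteration.
            s.mu.Unlock()
            s.mu.Lock()

            // Lock held across batch construction, llama_decode and sampling.
            s.processBatch()
        }
    }

    func (s *Server) processBatch() { /* build batch, decode, sample */ }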
We should process a batch of tokens for each parallel request, rather
than having a shared pool. Otherwise, a single request can fill the
batch and then subsequent ones will fail or get starved.
Server.cpp used the KV cache size allocated for each parallel request
as the allocated size for the batch. This is the upper bound for the
batch, but since we know how many tokens we will actually put in a
batch, there is no need to over-allocate.
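Sketch of the budgeting: the batch is allocated for the most tokens one iteration can actually contain (batchSize per parallel slot), and each sequence can contribute at most batchSize tokens, so one request cannot starve the rest. Batch and its Add method stand in for the cgo llama_batch wrapper.

    // Allocated once at startup: batch := NewBatch(batchSize*numParallel, numParallel)
    //
    // fillBatch adds up to batchSize tokens per sequence; the caller advances
    // seq.inputs and seq.numPast by the number of tokens consumed.
    func fillBatch(batch *Batch, seqs []*Sequence, batchSize int) {
        for i, seq := range seqs {
            if seq == nil {
                continue
            }
            n := min(len(seq.inputs), batchSize) // per-sequence budget
            for j, tok := range seq.inputs[:n] {
                // Request logits only for the final token of a fully consumed prompt.
                wantLogits := n == len(seq.inputs) && j == n-1
                batch.Add(tok, seq.numPast+j, i, wantLogits)
            }
        }
    }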
The fields for inference parameters are very similar between the
Ollama API and Ollama/runners. However, some of the names are
slightly different. For these fields (such as NumKeep and
NumPredict), the values from Ollama were never read properly and
defaults were always used.
In the future, we can share a single interface rather than duplicating
structs. However, this keeps the interface consistent with minimal
changes in Ollama as long as we continue to use server.cpp.
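The mismatch is in the JSON field names. A rough sketch of the two shapes for the affected fields (tags on the runner side follow what server.cpp expects):

    // As the Ollama API names them...
    type APIOptions struct {
        NumKeep    int `json:"num_keep"`
        NumPredict int `json:"num_predict"`
    }

    // ...versus what the completion request to the runner uses. Without the
    // mapping, n_keep and n_predict silently fell back to their defaults.
    type RunnerOptions struct {
        NumKeep    int `json:"n_keep"`
        NumPredict int `json:"n_predict"`
    }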
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.
This uses the algorithm from llama.cpp that is currently used by Ollama
with the server.cpp code. There are others but they are never turned
on through Ollama, so this restores parity.
The algorithm is:
- Retain a configurable number of tokens at the beginning (for things
  like beginning of sequence tokens)
- Drop the oldest half of the remaining tokens
- Shift the remaining new tokens to the back of the cache
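Sketch of that shift for one sequence, assuming thin wrappers over llama.cpp's llama_kv_cache_seq_rm and llama_kv_cache_seq_add:

    // shiftContext frees roughly half of a full sequence's cache, keeping
    // numKeep tokens at the front, and returns the sequence's new numPast.
    func (s *Server) shiftContext(seqID, numPast, numKeep int) int {
        numLeft := numPast - numKeep
        numDiscard := numLeft / 2

        // Drop the oldest half of the non-retained tokens...
        s.lc.KvCacheSeqRm(seqID, numKeep, numKeep+numDiscard)
        // ...then slide the remaining tokens back so positions are contiguous.
        s.lc.KvCacheSeqAdd(seqID, numKeep+numDiscard, numPast, -numDiscard)

        return numPast - numDiscard
    }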
If nothing has been added to a batch then decoding will fail if
attempted. This can happen, for example, if the run loop is woken
up but we realize that we have hit the generation limit.
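The guard is small (Context and Batch stand in for the cgo wrappers):

    // decodeBatch skips llama_decode entirely when the batch is empty, which
    // would otherwise return an error.
    func decodeBatch(lc *Context, batch *Batch) error {
        if batch.NumTokens() == 0 {
            return nil // e.g. woken up, but every sequence hit its limit
        }
        return lc.Decode(batch)
    }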
If the number of input tokens exceeds the size of the batch, multiple
batches will be submitted but they will all contain the first tokens.
This processes the input tokens as expected so that each batch has
the next set of tokens.
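Sketch of walking the prompt in batch-sized chunks so each decode call sees the next slice of tokens rather than the first slice again (wrapper types as above):

    func processPrompt(lc *Context, batch *Batch, seqID int, inputs []int, batchSize int) error {
        for start := 0; start < len(inputs); start += batchSize {
            end := min(start+batchSize, len(inputs))

            batch.Clear()
            for i, tok := range inputs[start:end] {
                last := end == len(inputs) && start+i == len(inputs)-1
                batch.Add(tok, start+i, seqID, last) // token, position, sequence, want logits
            }
            if err := lc.Decode(batch); err != nil {
                return err
            }
        }
        return nil
    }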
The cgo binding for llama_token_to_piece uses a fixed 12-byte buffer,
which is usually but not always enough to hold a token. This increases
the buffer size if needed, similar to what llama.cpp does internally.
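Sketch of the retry; llamaTokenToPiece stands in for the cgo call to llama_token_to_piece, which returns the number of bytes written, or the required size as a negative value when the buffer is too small:

    func tokenToPiece(model *Model, token int) string {
        buf := make([]byte, 12) // enough for the common case
        n := llamaTokenToPiece(model, token, buf)
        if n < 0 {
            buf = make([]byte, -n) // grow to the exact size needed and retry
            n = llamaTokenToPiece(model, token, buf)
        }
        return string(buf[:n])
    }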
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
When adding tokens to a batch, the index is zero-based but is only
checked for being greater than the maximum batch size, rather than
greater than or equal to it. This results in an out-of-bounds access
when the final token is added.
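Both fixes are small; batchSize is the configured value rather than a constant, and Batch again stands in for the cgo wrapper:

    // Scratch buffers follow the configuration instead of a fixed 512 entries.
    func newTokenBuffer(batchSize int) []int {
        return make([]int, 0, batchSize)
    }

    // The index is zero-based, so the last valid slot is batchSize-1; the
    // check must be >=, not >.
    func addToken(batch *Batch, tok, pos, seqID, batchSize int) bool {
        if batch.NumTokens() >= batchSize {
            return false // batch is full
        }
        batch.Add(tok, pos, seqID, false)
        return true
    }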
tokenize() passes a string length longer than the actual data into
llama_tokenize(). This entire string length gets scanned in the
C++ code despite there being a NULL terminator in the correct
location (because it gets converted into std::string). The result
is a read of uninitialized memory, which, depending on the contents
of that memory, fails the check for partial multi-byte UTF8
characters.
In addition, if there is not enough space in the passed buffer for
token output then llama_tokenize() returns the required space as
a negative number. We should convert this to a positive number
before reallocating.
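Sketch of the corrected call; llamaTokenize stands in for the cgo call to llama_tokenize, which returns the token count on success or the required capacity as a negative value when the output slice is too small:

    func tokenize(model *Model, prompt string) []int {
        // Pass the prompt's actual byte length, not the size of a larger
        // backing buffer, so the C++ side never scans past the real data.
        tokens := make([]int, len(prompt)+2)
        n := llamaTokenize(model, prompt, len(prompt), tokens)
        if n < 0 {
            // Convert the required size back to a positive number before
            // reallocating, then retry.
            tokens = make([]int, -n)
            n = llamaTokenize(model, prompt, len(prompt), tokens)
        }
        return tokens[:n]
    }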
The first problem results in the following splat:
libc++abi: terminating due to uncaught exception of type std::invalid_argument: failed to convert utf8 to codepoint
SIGABRT: abort
PC=0x193cd55f0 m=11 sigcode=0
signal arrived during cgo execution
goroutine 27 gp=0x14000708700 m=11 mp=0x14000584908 [syscall]:
runtime.cgocall(0x105549e68, 0x140000c6bf8)
/opt/homebrew/Cellar/go/1.22.5/libexec/src/runtime/cgocall.go:157 +0x44 fp=0x140000c6bc0 sp=0x140000c6b80 pc=0x104b372c4
github.com/ollama/ollama/llm._Cfunc_llama_tokenize(0x15180f400, 0x152009a00, 0x5aa, 0x140002e8800, 0x5aa, 0x1, 0x1)
_cgo_gotypes.go:270 +0x34 fp=0x140000c6bf0 sp=0x140000c6bc0 pc=0x104ef7664
github.com/ollama/ollama/llm.tokenize.func2(0x140001dd800?, 0x152009a00, 0x5aa, 0x1400012cdc0?)
/Users/jesse/ollama/llm/llm.go:74 +0x8c fp=0x140000c6c50 sp=0x140000c6bf0 pc=0x104ef83cc
github.com/ollama/ollama/llm.tokenize(0x140003f7da0, {0x140001dd800, 0x5a8})
/Users/jesse/ollama/llm/llm.go:74 +0xb4 fp=0x140000c6d90 sp=0x140000c6c50 pc=0x104ef7f94
github.com/ollama/ollama/llm.(*llmServer).Tokenize(0x140000c6df8?, {0x105516574?, 0x5a8?}, {0x140001dd800?, 0x140000c6d00?})
/Users/jesse/ollama/llm/server.go:963 +0x2c fp=0x140000c6dc0 sp=0x140000c6d90 pc=0x104ef6b6c
github.com/ollama/ollama/llm.LlamaServer.Tokenize-fm({0x105e876f0?, 0x140001e5c70?}, {0x140001dd800?, 0x140000350e0?})
<autogenerated>:1 +0x50 fp=0x140000c6e00 sp=0x140000c6dc0 pc=0x105532fc0
github.com/ollama/ollama/server.chatPrompt({0x105e876f0, 0x140001e5c70}, 0x14000616480, 0x140000c7508, 0x1400013e000, {0x1400014e008, 0x7, 0x7}, {0x0, 0x0, ...})
/Users/jesse/ollama/server/prompt.go:36 +0x2a0 fp=0x140000c7100 sp=0x140000c6e00 pc=0x1055165a0
github.com/ollama/ollama/server.(*Server).ChatHandler(0x1400000e9c0, 0x1400011c100)
/Users/jesse/ollama/server/routes.go:1340 +0x478 fp=0x140000c7610 sp=0x140000c7100 pc=0x105523318
github.com/ollama/ollama/server.(*Server).ChatHandler-fm(0x9?)
<autogenerated>:1 +0x30 fp=0x140000c7630 sp=0x140000c7610 pc=0x105533130
If the runner subprocess encounters an error, it will close the HTTP
connection, which causes Ollama to free the instance of the model that
it has open. When Ollama exits, it will again try to free the models
for all of the runners that were open, resulting in a double free.
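One way to make the free path idempotent on the Ollama side is to guard it with sync.Once, so the connection-error path and the shutdown path cannot both free the same model (a sketch; llmServer's real fields are elided):

    package main

    import "sync"

    type llmServer struct {
        closeOnce sync.Once
        // model handle, runner subprocess, ports, etc.
    }

    // Close frees the loaded model exactly once, whether triggered by the
    // runner's HTTP connection dropping or by Ollama closing all runners on exit.
    func (s *llmServer) Close() error {
        s.closeOnce.Do(func() {
            // free the model / terminate the runner subprocess here
        })
        return nil
    }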