The fields for inference parameters are very similar between the
Ollama API and Ollama/runners. However, some of the names are
slightly different. For these fields (such as NumKeep and
NumPredict), the values sent by Ollama were never read, so the
defaults were always used.
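As a rough illustration (struct and tag names here are hypothetical),
the runner's JSON tags have to match the wire names that Ollama
actually sends; otherwise json.Unmarshal silently leaves the fields
at their defaults:

    // Sketch only: if the request carries {"n_keep": ..., "n_predict": ...}
    // but the runner declares different tags, unmarshaling never
    // touches these fields and the defaults win.
    type Options struct {
        NumKeep    int `json:"n_keep"`
        NumPredict int `json:"n_predict"`
    }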
In the future, we can share a single interface rather than duplicating
structs. However, this keeps the interface consistent with minimal
changes in Ollama as long as we continue to use server.cpp.
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.
This uses the algorithm from llama.cpp that is currently used by Ollama
with the server.cpp code. llama.cpp implements other strategies as
well, but they are never enabled through Ollama, so this restores
parity.
The algorithm (sketched below) is:
- Retain a configurable number of tokens at the beginning (for things
like beginning-of-sequence tokens)
- Drop the oldest half of the remaining tokens
- Shift the remaining new tokens to the back of the cache
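A minimal sketch of the shift, operating on a plain slice of token
IDs (names are illustrative; the real code also has to apply the
same shift to the llama.cpp KV cache itself):

    // shiftContext keeps the first numKeep tokens, drops the oldest
    // half of what remains, and compacts the rest so that new tokens
    // can be appended at the back of the cache.
    func shiftContext(cache []int, numKeep int) []int {
        discard := (len(cache) - numKeep) / 2
        return append(cache[:numKeep], cache[numKeep+discard:]...)
    }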
If nothing has been added to a batch then decoding will fail if
attempted. This can happen, for example, if the run loop is woken
up but we realize that we have hit the generation limit.
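A guard along these lines (field name hypothetical) avoids the
failure:

    // If the run loop woke up but there is nothing to decode, e.g.
    // because the generation limit was already reached, skip the
    // llama_decode call rather than failing on an empty batch.
    if batch.NumTokens == 0 {
        continue
    }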
If the number of input tokens exceeds the size of the batch, multiple
batches will be submitted, but they would all contain the first
tokens again. This change advances through the input so that each
batch contains the next set of tokens.
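A sketch of the corrected loop, assuming a decode helper that
submits a single batch (all names hypothetical):

    // processPrompt walks through the prompt in batchSize steps so
    // that each submitted batch carries the next slice of tokens
    // rather than tokens[:batchSize] every time.
    func processPrompt(tokens []int, batchSize int, decode func([]int) error) error {
        for i := 0; i < len(tokens); i += batchSize {
            end := i + batchSize
            if end > len(tokens) {
                end = len(tokens)
            }
            if err := decode(tokens[i:end]); err != nil {
                return err
            }
        }
        return nil
    }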
The cgo binding for llama_token_to_piece uses a fixed 12-byte
buffer, which is usually, but not always, enough to hold a token.
This increases the buffer size if needed, similar to what llama.cpp
does internally.
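The retry pattern looks roughly like this, with tokenToPiece
standing in for the cgo binding; as in llama.cpp, a negative return
value is taken to be the required buffer length:

    // Start with the small common-case buffer and grow on demand.
    buf := make([]byte, 12)
    n := tokenToPiece(token, buf) // hypothetical wrapper around llama_token_to_piece
    if n < 0 {
        // The token did not fit; -n is the size it needs.
        buf = make([]byte, -n)
        n = tokenToPiece(token, buf)
    }
    piece := string(buf[:n])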
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed-size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
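Sketch of the change, with hypothetical names: allocate the buffer
from the runtime configuration instead of a constant:

    // Before: make([]Token, 512) assumed the default batch size.
    // Size the per-batch buffer from the configured value so that a
    // larger batch size no longer overruns it.
    batch.tokens = make([]Token, numBatch)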
When adding tokens to a batch, the index is zero-based, but the
bounds check only rejects indexes greater than the maximum batch
size. An index equal to the batch size therefore passes the check,
resulting in an out-of-bounds access when the final token is added.
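The fix is to reject the index once it reaches the capacity (sketch,
names hypothetical):

    // The index is zero-based, so index == batchSize is already out
    // of bounds; the old check used > and let the final token through.
    if batch.NumTokens >= batchSize {
        return ErrBatchFull
    }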