The embeddings endpoint takes only a single input and provides a
single output, rather than the multiple inputs and outputs that the
current implementation expected. Fixing this also allows the
implementation to be simplified
and a few embedding-specific issues to be addressed.
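A minimal sketch of the single-input shape on the runner side; the struct and field names here are hypothetical, not the actual API:

```go
// Hypothetical request/response types for the embedding endpoint:
// one prompt in, one embedding out, rather than parallel slices.
type EmbeddingRequest struct {
	Content string `json:"content"`
}

type EmbeddingResponse struct {
	Embedding []float32 `json:"embedding"`
}
```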
There are multiple causes and paths that result in a sequence
ending. Not all of these free the sampling context or reset the
pieces slice. This factors out the removal code so that all
paths release resources.
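A minimal sketch of the factored-out cleanup, with illustrative type and field names rather than the runner's real ones; the point is that every sequence-ending path funnels through one function:

```go
// Illustrative types only; the real runner's names may differ.
type SamplingContext struct{ /* C-side state behind a cgo handle */ }

func (c *SamplingContext) Free() { /* releases the underlying C object */ }

type Sequence struct {
	samplingCtx *SamplingContext
	pieces      []string // decoded pieces accumulated for this sequence
}

type Server struct {
	seqs []*Sequence
}

// removeSequence is the one place a sequence is torn down, so no code
// path can skip freeing the sampling context or resetting the pieces.
func (s *Server) removeSequence(seqIndex int, reason string) {
	seq := s.seqs[seqIndex]
	if seq == nil {
		return
	}

	seq.samplingCtx.Free() // free the C-side sampling context
	seq.pieces = nil       // reset the pieces slice

	s.seqs[seqIndex] = nil
}
```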
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.
This uses the algorithm from llama.cpp that Ollama currently relies on
via the server.cpp code. llama.cpp has other strategies as well, but
they are never enabled through Ollama, so this restores parity.
The algorithm is:
- Retain a configurable number of tokens at the beginning (for things
like beginning of sequence tokens)
- Drop the oldest half of the remaining tokens
- Shift the remaining tokens back so they follow the retained prefix,
leaving the freed space at the end of the cache
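A minimal sketch of that shift, assuming hypothetical cgo wrapper methods KvCacheSeqRm and KvCacheSeqAdd around llama.cpp's llama_kv_cache_seq_rm and llama_kv_cache_seq_add; names and parameters are illustrative:

```go
// kvCache captures the two llama.cpp operations the shift needs; the
// method names are assumptions standing in for the real cgo wrappers.
type kvCache interface {
	KvCacheSeqRm(seqID, p0, p1 int)
	KvCacheSeqAdd(seqID, p0, p1, delta int)
}

// shiftContext frees space for one sequence once the cache is full:
// keep numKeep tokens at the start (e.g. BOS), drop the oldest half of
// the rest, and slide the newest tokens back to follow the retained
// prefix. Returns the new number of tokens in the cache.
func shiftContext(lc kvCache, seqID, numPast, numKeep int) int {
	numDiscard := (numPast - numKeep) / 2

	// Remove cache entries at positions [numKeep, numKeep+numDiscard).
	lc.KvCacheSeqRm(seqID, numKeep, numKeep+numDiscard)

	// Shift positions [numKeep+numDiscard, numPast) down by numDiscard.
	lc.KvCacheSeqAdd(seqID, numKeep+numDiscard, numPast, -numDiscard)

	return numPast - numDiscard
}
```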
The cgo binding for llama_token_to_piece uses a fixed 12-byte buffer,
which is usually but not always enough to hold a token. This increases
the buffer size if needed, similar to what llama.cpp does internally.
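A minimal sketch of the retry, relying on the llama.cpp convention that llama_token_to_piece returns the negated required length when the buffer is too small; the exact argument list of the C call is an assumption and varies between llama.cpp versions, and the surrounding Model type is the existing binding's:

```go
func (m *Model) TokenToPiece(token int) string {
	buf := make([]byte, 12) // enough for most tokens, but not all
	for {
		n := int(C.llama_token_to_piece(
			m.c,
			C.llama_token(token),
			(*C.char)(unsafe.Pointer(&buf[0])),
			C.int32_t(len(buf)),
			true,
		))
		if n >= 0 {
			return string(buf[:n])
		}
		// A negative return means the piece did not fit; -n is the
		// size it needs, so grow the buffer and convert again.
		buf = make([]byte, -n)
	}
}
```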
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed-size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
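A minimal sketch of sizing the buffer from the configuration rather than a compile-time constant; the type and field names are illustrative:

```go
// Illustrative only: the buffer that used to be a fixed-size array is
// now allocated from the configured batch size.
type Server struct {
	batchSize int // from the batch-size option; defaults to 512
}

func (s *Server) newTokenBatch() []int32 {
	// Capacity follows the configuration, so raising the batch size
	// above 512 no longer overruns the buffer.
	return make([]int32, 0, s.batchSize)
}
```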