40 Commits

Author SHA1 Message Date
Jesse Gross
52e88ab7b3 runner.go: Health endpoint comments
The health endpoint needs a little more work to show progress as Ollama
expects but we can at least return the right status and have comments
for the future.
2024-09-03 21:15:14 -04:00
Jesse Gross
4ca8579428 runner.go: Cleanups
Rename variables to Go style, remove commented-out code, and fix typos.
2024-09-03 21:15:14 -04:00
Jesse Gross
d022cfc9e6 runner.go: Move pieces[] into sequence
pieces[] is used to cache pending responses and is currently being
passed around to different functions. Move it into the sequences
where it logically belongs.
2024-09-03 21:15:14 -04:00
Jesse Gross
6ccd0644e1 runner.go: Fix deadlock if a connection is closed during decoding
If a connection is closed while a sequence is being decoded, tokens
will continue to be added to the channel without anyone to read them.
This will result in the sender blocking, which will in turn block
all other decoding and sending for other sequences.

This is not limited to just the connection between Ollama and the
runner process. If the connection to the Ollama API is closed by
the user then Ollama will close the connection to the runner,
triggering this issue.
2024-09-03 21:15:14 -04:00
Jesse Gross
0b73cca386 runner.go: Fix resource leaks when removing sequences
There are multiple causes and paths that result in a sequence
ending. Not all of these free the sampling context or reset the
pieces slice. This factors out the removal code so that all
paths release resources.
2024-09-03 21:15:14 -04:00
Jesse Gross
55fb0633db runner.go: Separate KV cache and context sizes
Currently the entire KV cache is shared by all parallel requestors.
This gives maximum resource utilization but there is a potential for
overflow and unfairness if multiple requests are trying to use
significant context. Instead, it is better to have a hard partition
of KV cache space.
2024-09-03 21:15:14 -04:00
Jesse Gross
53b600921e runner.go: Hold mutex for entire time when processing batch
It is not safe to hold a mutex only while we are waiting for the
condition variable to signal that a new sequence has been added. It's
possible that a sequence could be added in the middle of batch
processing. For example, if a new sequence is added while Decode()
is running, it will get picked up for sampling, despite not having
been added to the original batch.

This change holds a mutex for the majority of the time when active
processing is happening, releasing it only for a brief period each
time around the loop. Depending on the workload and the scheduler,
this may result in unfairness between different requests. However,
this was not actually observed in testing.

This addresses the correctness issue - better performance and fairness
can be achieved with additional improvements in the future.
2024-09-03 21:15:14 -04:00
Jesse Gross
8e1554c91d runner.go: Scale batches to be processed by numParallel
We should process a batch of tokens for each parallel request, rather
than having a shared pool. Otherwise, a single request can fill the
batch and then subsequent ones will fail or get starved.

Server.cpp used the KV cache size allocated for each parallel request
as the allocated size for the batch. This is the upper bound for the
batch but since we know how many tokens we will actually put in a batch
there is no need to over allocate.
2024-09-03 21:15:14 -04:00
Jesse Gross
76718ead40 runner.go: Support MinP parameter
MinP is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
2024-09-03 21:15:14 -04:00
Jesse Gross
90d25d3b0a runner.go: Check for incomplete UTF-8 character
Generated text can contain a partial multi-byte Unicode character at
the end. Check for this and hold it over until the next token is
produced.
2024-09-03 21:15:14 -04:00
Jesse Gross
477f529d26 runner.go: Implement RepeatLastN to penalize repeated tokens
RepeatLastN is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
2024-09-03 21:15:14 -04:00
Jesse Gross
eccd4dd8d2 runner.go: Use correct JSON field names for runners
The fields for inference parameters are very similar between the
Ollama API and Ollama/runners. However, some of the names are
slightly different. For these fields (such as NumKeep and
NumPredict), the values from Ollama were never read properly and
defaults were always used.

In the future, we can share a single interface rather than duplicating
structs. However, this keeps the interface consistent with minimal
changes in Ollama as long as we continue to use server.cpp.
2024-09-03 21:15:14 -04:00
Jesse Gross
69cc5795a7 runner.go: Shift context window when KV cache space is exceeded
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.

This uses the algorithm from llama.cpp that is currently used by Ollama
with the server.cpp code. There are others but they are never turned
on through Ollama, so this restores parity.

The algorithm is:
 - Retain a configurable number of tokens at the beginning (for things
   like beginning-of-sequence tokens)
 - Drop the oldest half of the remaining tokens
 - Shift the remaining new tokens to the back of the cache
2024-09-03 21:15:14 -04:00
Jesse Gross
5a441d227a runner.go: Don't decode if nothing has been added to the batch
If nothing has been added to a batch then decoding will fail if
attempted. This can happen, for example, if the run loop is woken
up but we realize that we have reached the generation limit.
2024-09-03 21:15:14 -04:00
Jesse Gross
8aa97b5e83 llama.go: Advance through tokens when processing multiple batches
If the number of input tokens exceeds the size of the batch, multiple
batches will be submitted but they will all contain the first tokens.
This processes the input tokens as expected so that each batch has
the next set of tokens.
2024-09-03 21:15:14 -04:00
Jesse Gross
5d34320b7c runner.go: Fix off by one in batch size check
When adding tokens to a batch, the index is zero based but is
checked against being greater than the max batch size. This results
in an out-of-bounds access when the final token is added.
2024-09-03 21:15:14 -04:00
Jesse Gross
0c2f95f3de runner: Initialize numPredict
numPredict is used to enforce a limit on the number of tokens to
generate. It is passed in from Ollama but it is never stored to
be checked.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
47b0e81219 fix dolphin-mistral 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
751009a5d7 Runtime selection of new or old runners
This adjusts the new runners to commingle with the existing runners so we can
use an env var to toggle the new runners on.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
8527028bf4 Implement timings response in Go server
This implements the fields necessary for `run --verbose`
to generate timing information.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
e0241118d0 Get embeddings working
Truncation doesn't pass, but the other embeddings tests pass
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
f97ee8c506 Fix parallel requests 2024-09-03 21:15:13 -04:00
jmorganca
1da6c40f4f lint 2024-09-03 21:15:13 -04:00
jmorganca
de634b7fd7 fix issues with runner 2024-09-03 21:15:13 -04:00
jmorganca
8f79a2e86a cleanup stop code 2024-09-03 21:15:13 -04:00
jmorganca
7d0a452938 num predict 2024-09-03 21:15:13 -04:00
jmorganca
43efc893d7 basic progress 2024-09-03 21:15:13 -04:00
jmorganca
20afaae020 add more runner params 2024-09-03 21:15:13 -04:00
jmorganca
72f3fe4b94 truncate stop properly 2024-09-03 21:15:13 -04:00
jmorganca
a379d68aa9 wip stop tokens 2024-09-03 21:15:13 -04:00
jmorganca
b2ef3bf490 embeddings 2024-09-03 21:15:12 -04:00
jmorganca
ce15ed6d69 remove dependency on llm 2024-09-03 21:15:12 -04:00
jmorganca
c0b94376b2 grammar 2024-09-03 21:15:12 -04:00
jmorganca
72be8e27c4 sampling 2024-09-03 21:15:12 -04:00
jmorganca
d12db0568e better example module, add port 2024-09-03 21:15:12 -04:00
jmorganca
ec17359a68 wip 2024-09-03 21:15:12 -04:00
jmorganca
fbc8572859 add llava to runner 2024-09-03 21:15:12 -04:00
jmorganca
a8f91d3cc1 add llava 2024-09-03 21:15:12 -04:00
jmorganca
9fe48978a8 move runner package down 2024-09-03 21:15:12 -04:00
jmorganca
01ccbc07fe replace static build in llm 2024-09-03 21:15:12 -04:00