40 Commits

Author SHA1 Message Date
Jesse Gross
52e88ab7b3 runner.go: Health endpoint comments
The health endpoint needs a little more work to show progress as Ollama
expects but we can at least return the right status and have comments
for the future.
2024-09-03 21:15:14 -04:00
Jesse Gross
4ca8579428 runner.go: Cleanups
Rename variables to Go style, remove commented-out code, and fix typos.
2024-09-03 21:15:14 -04:00
Jesse Gross
d022cfc9e6 runner.go: Move pieces[] into sequence
pieces[] is used to cache pending responses and is currently being
passed around to different functions. Move it into the sequences
where it logically belongs.
2024-09-03 21:15:14 -04:00
Jesse Gross
6ccd0644e1 runner.go: Fix deadlock if a connection is closed during decoding
If a connection is closed while a sequence is being decoded, tokens
will continue to be added to the channel without anyone to read them.
This will result in the sender blocking, which will in turn block
all other decoding and sending for other sequences.

This is not limited to just the connection between Ollama and the
runner process. If the connection to the Ollama API is closed by
the user then Ollama will close the connection to the runner,
triggering this issue.
2024-09-03 21:15:14 -04:00
Jesse Gross
0b73cca386 runner.go: Fix resource leaks when removing sequences
There are multiple causes and paths that result in a sequence
ending. Not all of these free the sampling context or reset the
pieces slice. This factors out the removal code so that all
paths release resources.
2024-09-03 21:15:14 -04:00
Jesse Gross
55fb0633db runner.go: Separate KV cache and context sizes
Currently the entire KV cache is shared by all parallel requestors.
This gives maximum resource utilization but there is a potential for
overflow and unfairness if multiple requests are trying to use
significant context. Instead, it is better to have a hard partition
of KV cache space.
2024-09-03 21:15:14 -04:00
Jesse Gross
53b600921e runner.go: Hold mutex for entire time when processing batch
It is not safe to hold a mutex only while we are waiting for the
condition variable to signal that a new sequence has been added. It's
possible that a sequence could be added in the middle of batch
processing. For example, if a new sequence is added while Decode()
is running, it will get picked up for sampling, despite not having
been added to the original batch.

This change holds a mutex for the majority of the time when active
processing is happening, releasing it only for a brief period each
time around the loop. Depending on the workload and the scheduler,
this may result in unfairness between different requests. However,
this was not actually observed in testing.

This addresses the correctness issue - better performance and fairness
can be achieved with additional improvements in the future.
2024-09-03 21:15:14 -04:00
Jesse Gross
8e1554c91d runner.go: Scale batches to be processed by numParallel
We should process a batch of tokens for each parallel request, rather
than having a shared pool. Otherwise, a single request can fill the
batch and then subsequent ones will fail or get starved.

Server.cpp used the KV cache size allocated for each parallel request
as the allocated size for the batch. This is the upper bound for the
batch but since we know how many tokens we will actually put in a batch
there is no need to over allocate.
2024-09-03 21:15:14 -04:00
Jesse Gross
76718ead40 runner.go: Support MinP parameter
MinP is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
2024-09-03 21:15:14 -04:00
Jesse Gross
90d25d3b0a runner.go: Check for incomplete UTF-8 character
Generated text can contain a partial multi-byte Unicode character at
the end. Check for this and hold it over until the next token is
produced.
2024-09-03 21:15:14 -04:00
Jesse Gross
477f529d26 runner.go: Implement RepeatLastN to penalize repeated tokens
RepeatLastN is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
2024-09-03 21:15:14 -04:00
Jesse Gross
eccd4dd8d2 runner.go: Use correct JSON field names for runners
The fields for inference parameters are very similar between the
Ollama API and Ollama/runners. However, some of the names are
slightly different. For these fields (such as NumKeep and
NumPredict), the values from Ollama were never read properly and
defaults were always used.

In the future, we can share a single interface rather than duplicating
structs. However, this keeps the interface consistent with minimal
changes in Ollama as long as we continue to use server.cpp.
2024-09-03 21:15:14 -04:00
Jesse Gross
69cc5795a7 runner.go: Shift context window when KV cache space is exceeded
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.

This uses the algorithm from llama.cpp that is currently used by Ollama
with the server.cpp code. There are others but they are never turned
on through Ollama, so this restores parity.

The algorithm is:
 - Retain a configurable number of tokens at the beginning (for things
   like beginning-of-sequence tokens)
 - Drop the oldest half of the remaining tokens
 - Shift the remaining new tokens to the back of the cache
2024-09-03 21:15:14 -04:00
Jesse Gross
5a441d227a runner.go: Don't decode if nothing has been added to the batch
If nothing has been added to a batch then decoding will fail if
attempted. This can happen, for example, if the run loop is woken
up but we realize that we have reached the generation limit.
2024-09-03 21:15:14 -04:00
Jesse Gross
8aa97b5e83 llama.go: Advance through tokens when processing multiple batches
If the number of input tokens exceeds the size of the batch, multiple
batches will be submitted but they will all contain the first tokens.
This processes the input tokens as expected so that each batch has
the next set of tokens.
2024-09-03 21:15:14 -04:00
Jesse Gross
5d34320b7c runner.go: Fix off by one in batch size check
When adding tokens to a batch, the index is zero based but is
checked against being greater than the max batch size. This results
in an out-of-bounds access when the final token is added.
2024-09-03 21:15:14 -04:00
Jesse Gross
0c2f95f3de runner: Initialize numPredict
numPredict is used to enforce a limit on the number of tokens to
generate. It is passed in from Ollama but it is never stored to
be checked.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
47b0e81219 fix dolphin-mistral 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
751009a5d7 Runtime selection of new or old runners
This adjusts the new runners to commingle with the existing runners so we can
use an env var to toggle the new runners on.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
8527028bf4 Implement timings response in Go server
This implements the fields necessary for `run --verbose`
to generate timing information.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
e0241118d0 Get embeddings working
Truncation doesn't pass, but the other embeddings tests pass
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
f97ee8c506 Fix parallel requests 2024-09-03 21:15:13 -04:00
jmorganca
1da6c40f4f lint 2024-09-03 21:15:13 -04:00
jmorganca
de634b7fd7 fix issues with runner 2024-09-03 21:15:13 -04:00
jmorganca
8f79a2e86a cleanup stop code 2024-09-03 21:15:13 -04:00
jmorganca
7d0a452938 num predict 2024-09-03 21:15:13 -04:00
jmorganca
43efc893d7 basic progress 2024-09-03 21:15:13 -04:00
jmorganca
20afaae020 add more runner params 2024-09-03 21:15:13 -04:00
jmorganca
72f3fe4b94 truncate stop properly 2024-09-03 21:15:13 -04:00
jmorganca
a379d68aa9 wip stop tokens 2024-09-03 21:15:13 -04:00
jmorganca
b2ef3bf490 embeddings 2024-09-03 21:15:12 -04:00
jmorganca
ce15ed6d69 remove dependency on llm 2024-09-03 21:15:12 -04:00
jmorganca
c0b94376b2 grammar 2024-09-03 21:15:12 -04:00
jmorganca
72be8e27c4 sampling 2024-09-03 21:15:12 -04:00
jmorganca
d12db0568e better example module, add port 2024-09-03 21:15:12 -04:00
jmorganca
ec17359a68 wip 2024-09-03 21:15:12 -04:00
jmorganca
fbc8572859 add llava to runner 2024-09-03 21:15:12 -04:00
jmorganca
a8f91d3cc1 add llava 2024-09-03 21:15:12 -04:00
jmorganca
9fe48978a8 move runner package down 2024-09-03 21:15:12 -04:00
jmorganca
01ccbc07fe replace static build in llm 2024-09-03 21:15:12 -04:00