3541 Commits

Author SHA1 Message Date
Jesse Gross
53b600921e runner.go: Hold mutex for entire time when processing batch
It is not safe to hold a mutex only while we are waiting for the
condition variable to signal that a new sequence has been added. It's
possible that a sequence could be added in the middle of batch
processing. For example, if a new sequence is added while Decode()
is running, it will get picked up for sampling, despite not having
been added to the original batch.

This change holds a mutex for the majority of the time when active
processing is happening, releasing it only for a brief period each
time around the loop. Depending on the workload and the scheduler,
it may result in unfairness between different requests. However,
this was not actually observed in testing.

This addresses the correctness issue - better performance and fairness
can be achieved with additional improvements in the future.
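A minimal Go sketch of the locking pattern described above; the names (server, processBatch) are illustrative, not the actual runner.go code:

package runner

import "sync"

type sequence struct{}

type server struct {
	mu   sync.Mutex
	cond *sync.Cond // created with sync.NewCond(&s.mu)
	seqs []*sequence
}

func (s *server) processBatch() {
	// build the batch, call Decode(), sample outputs
}

func (s *server) run() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for {
		// Wait releases the mutex until a sequence is added.
		for len(s.seqs) == 0 {
			s.cond.Wait()
		}
		// The mutex stays held across batch construction, Decode(),
		// and sampling, so a sequence added mid-batch cannot be picked
		// up for sampling before it has been added to a batch.
		s.processBatch()
		// Release briefly each time around the loop so new sequences
		// can be added.
		s.mu.Unlock()
		s.mu.Lock()
	}
}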
2024-09-03 21:15:14 -04:00
Jesse Gross
8e1554c91d runner.go: Scale batches to be processed by numParallel
We should process a batch of tokens for each parallel request, rather
than having a shared pool. Otherwise, a single request can fill the
batch and then subsequent ones will fail or get starved.

Server.cpp used the KV cache size allocated for each parallel request
as the allocated size for the batch. This is the upper bound for the
batch, but since we know how many tokens we will actually put in a batch,
there is no need to over-allocate.
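A sketch of the sizing change; newBatch, batchSize, and numParallel are stand-ins for the real llama.go API, not its actual names:

package runner

type batch struct{ capacity, maxSeqs int }

func newBatch(capacity, maxSeqs int) *batch { return &batch{capacity, maxSeqs} }

func makeBatch(batchSize, numParallel int) *batch {
	// Before (sketch): sized to the KV cache allocated per parallel
	// request, a correct upper bound but an over-allocation:
	//   return newBatch(kvSize/numParallel, numParallel)

	// After: one batch worth of tokens per parallel request, since we
	// know how many tokens will actually be placed in the batch.
	return newBatch(batchSize*numParallel, numParallel)
}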
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
f52d4b9879 Make new tokenizer logic conditional (#6395)
Only use the new cgo tokenizer/detokenizer if we're
using the new runners
2024-09-03 21:15:14 -04:00
Jesse Gross
76718ead40 runner.go: Support MinP parameter
MinP is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
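For reference, min-p sampling as implemented in llama.cpp keeps only tokens whose probability is at least minP times the most likely token's probability. A sketch over a plain probability slice, not the real sampling types:

package sample

// applyMinP returns the indices of tokens that survive min-p filtering:
// those with probability >= minP * max(probs).
func applyMinP(probs []float64, minP float64) []int {
	var max float64
	for _, p := range probs {
		if p > max {
			max = p
		}
	}
	threshold := minP * max
	keep := make([]int, 0, len(probs))
	for i, p := range probs {
		if p >= threshold {
			keep = append(keep, i)
		}
	}
	return keep
}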
2024-09-03 21:15:14 -04:00
Jesse Gross
90d25d3b0a runner.go: Check for incomplete UTF-8 character
Generated text can contain a partial multi-byte Unicode character at
the end. Check for this and hold it over until the next token is
produced.
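One way to detect such a tail in Go (a sketch, not necessarily the exact check used here): scan backwards for the start byte of the final character and compare its declared length against the bytes actually present:

package runner

import "unicode/utf8"

// incompleteUTF8Suffix returns the number of trailing bytes of s that
// form the start of a multi-byte character whose remaining bytes have
// not been generated yet; 0 means the string ends cleanly.
func incompleteUTF8Suffix(s string) int {
	for i := 1; i <= utf8.UTFMax && i <= len(s); i++ {
		b := s[len(s)-i]
		if b&0xC0 == 0x80 {
			continue // continuation byte; keep scanning backwards
		}
		var want int // bytes the rune starting at b should occupy
		switch {
		case b&0x80 == 0x00:
			want = 1
		case b&0xE0 == 0xC0:
			want = 2
		case b&0xF0 == 0xE0:
			want = 3
		case b&0xF8 == 0xF0:
			want = 4
		}
		if want > i {
			return i // hold these bytes over until the next token
		}
		return 0
	}
	return 0
}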
2024-09-03 21:15:14 -04:00
Jesse Gross
477f529d26 runner.go: Implement RepeatLastN to penalize repeated tokens
RepeatLastN is a user-facing parameter that is exposed through the
APIs but is not currently plumbed through.
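A sketch of the penalty in the style of llama.cpp's repetition penalty, with illustrative names:

package sample

// applyRepeatPenalty scales down the logits of any token that appears
// in the last repeatLastN sampled tokens.
func applyRepeatPenalty(logits []float32, history []int, repeatLastN int, penalty float32) {
	if repeatLastN > len(history) {
		repeatLastN = len(history)
	}
	for _, tok := range history[len(history)-repeatLastN:] {
		if logits[tok] > 0 {
			logits[tok] /= penalty // shrink positive logits
		} else {
			logits[tok] *= penalty // push negative logits further down
		}
	}
}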
2024-09-03 21:15:14 -04:00
Jesse Gross
eccd4dd8d2 runner.go: Use correct JSON field names for runners
The fields for inference parameters are very similar between the
Ollama API and Ollama/runners. However, some of the names are
slightly different. For these fields (such as NumKeep and
NumPredict), the values from Ollama were never read properly and
defaults were always used.

In the future, we can share a single interface rather than duplicating
structs. However, this keeps the interface consistent with minimal
changes in Ollama as long as we continue to use server.cpp.
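The failure mode, illustrated with encoding/json (the tag values here are examples, not the actual field names on either side):

package runner

// Without an explicit tag, encoding/json matches on the Go field name,
// so a request that sends a differently spelled key never populates the
// field and the default silently wins.
type completionRequest struct {
	// Broken: Ollama sends a different key, so this stays zero.
	// NumKeep int

	// Fixed: the tag matches what Ollama actually sends (example tags).
	NumKeep    int `json:"n_keep"`
	NumPredict int `json:"n_predict"`
}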
2024-09-03 21:15:14 -04:00
Jesse Gross
69cc5795a7 runner.go: Shift context window when KV cache space is exceeded
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.

This uses the algorithm from llama.cpp that is currently used by Ollama
with the server.cpp code. There are others but they are never turned
on through Ollama, so this restores parity.

The algorithm, sketched in Go below, is:
 - Retain a configurable number of tokens at the beginning (for things
like beginning-of-sequence tokens)
 - Drop the oldest half of the remaining tokens
 - Shift the remaining new tokens to the back of the cache
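A minimal sketch of that shift over a plain token slice (the real code also has to shift positions in the KV cache itself):

package runner

// shiftContext retains the first numKeep tokens, discards the oldest
// half of what follows, and moves the newer half up behind the retained
// prefix. append copies in place over the same backing array.
func shiftContext(cache []int, numKeep int) []int {
	discard := (len(cache) - numKeep) / 2
	return append(cache[:numKeep], cache[numKeep+discard:]...)
}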
2024-09-03 21:15:14 -04:00
Jesse Gross
5a441d227a runner.go: Don't decode if nothing has been added to the batch
If nothing has been added to a batch then decoding will fail if
attempted. This can happen, for example, if the run loop is woken
up but we realize that we have hit the generation limit.
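The guard amounts to a single check before decoding; a sketch with stand-in types:

package runner

type tokenBatch struct{ tokens []int }

func decode(b *tokenBatch) error { return nil } // stands in for llama_decode

func step(b *tokenBatch) error {
	// Decoding an empty batch fails, so skip it, e.g. when the loop is
	// woken only to find every sequence at its generation limit.
	if len(b.tokens) == 0 {
		return nil
	}
	return decode(b)
}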
2024-09-03 21:15:14 -04:00
Jesse Gross
8aa97b5e83 llama.go: Advance through tokens when processing multiple batches
If the number of input tokens exceeds the size of the batch, multiple
batches will be submitted but they will all contain the first tokens.
This processes the input tokens as expected so that each batch has
the next set of tokens.
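A sketch of the corrected chunking, advancing through the input instead of resubmitting the first tokens:

package runner

// processPrompt submits tokens in batchSize chunks; each batch gets the
// next slice of the input rather than tokens[:batchSize] every time.
func processPrompt(tokens []int, batchSize int, submit func([]int)) {
	for i := 0; i < len(tokens); i += batchSize {
		end := i + batchSize
		if end > len(tokens) {
			end = len(tokens)
		}
		submit(tokens[i:end])
	}
}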
2024-09-03 21:15:14 -04:00
Jesse Gross
523d84c563 llama.go: Use dynamic buffer for TokenToPiece
The cgo binding for llama_token_to_piece uses a fixed 12-byte buffer,
which is usually but not always enough to hold a token. This increases
the buffer size if needed, similar to what llama.cpp does internally.
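A sketch of the retry, with tokenToPiece standing in for the cgo llama_token_to_piece call (which reports the required size as a negative number when the buffer is too small):

package llama

// pieceFor converts a token to text, growing the buffer when the first
// call reports that 12 bytes were not enough.
func pieceFor(tokenToPiece func(buf []byte) int) string {
	buf := make([]byte, 12) // enough for most, but not all, tokens
	n := tokenToPiece(buf)
	if n < 0 {
		buf = make([]byte, -n) // -n is the size actually required
		n = tokenToPiece(buf)
	}
	return string(buf[:n])
}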
2024-09-03 21:15:14 -04:00
Jesse Gross
ed19fad862 llama.go: Make batch memory allocation match configuration
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed-size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
2024-09-03 21:15:14 -04:00
Jesse Gross
5d34320b7c runner.go: Fix off by one in batch size check
When adding tokens to a batch, the index is zero-based but was
checked for being greater than the max batch size rather than greater
than or equal to it. This results in an out-of-bounds access when the
final token is added.
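The fix is the comparison operator; a sketch with stand-in types:

package runner

import "errors"

var errBatchFull = errors.New("batch full")

type tokenBatch struct {
	tokens []int
	n      int
}

// add rejects an index equal to the capacity: with zero-based indexing
// the check must be >=, not >, or the final token writes one slot past
// the end of the array.
func (b *tokenBatch) add(tok int) error {
	if b.n >= len(b.tokens) { // previously the effect of: b.n > len(b.tokens)
		return errBatchFull
	}
	b.tokens[b.n] = tok
	b.n++
	return nil
}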
2024-09-03 21:15:14 -04:00
Jesse Gross
1c36f36c41 llm: Fix array out-of-bounds memory access when tokenizing
tokenize() passes a string length longer than the actual data into
llama_tokenize(). This entire string length gets scanned in the
C++ code despite there being a NULL terminator in the correct
location (because it gets converted into std::string). The result
is a read of uninitialized memory which, depending on the contents
of that memory, fails the check for partial multi-byte UTF8
characters.

In addition, if there is not enough space in the passed buffer for
token output then llama_tokenize() returns the required space as
a negative number. We should convert this to a positive number
before reallocating.

The first problem results in the following splat:
libc++abi: terminating due to uncaught exception of type std::invalid_argument: failed to convert utf8 to codepoint
SIGABRT: abort
PC=0x193cd55f0 m=11 sigcode=0
signal arrived during cgo execution

goroutine 27 gp=0x14000708700 m=11 mp=0x14000584908 [syscall]:
runtime.cgocall(0x105549e68, 0x140000c6bf8)
	/opt/homebrew/Cellar/go/1.22.5/libexec/src/runtime/cgocall.go:157 +0x44 fp=0x140000c6bc0 sp=0x140000c6b80 pc=0x104b372c4
github.com/ollama/ollama/llm._Cfunc_llama_tokenize(0x15180f400, 0x152009a00, 0x5aa, 0x140002e8800, 0x5aa, 0x1, 0x1)
	_cgo_gotypes.go:270 +0x34 fp=0x140000c6bf0 sp=0x140000c6bc0 pc=0x104ef7664
github.com/ollama/ollama/llm.tokenize.func2(0x140001dd800?, 0x152009a00, 0x5aa, 0x1400012cdc0?)
	/Users/jesse/ollama/llm/llm.go:74 +0x8c fp=0x140000c6c50 sp=0x140000c6bf0 pc=0x104ef83cc
github.com/ollama/ollama/llm.tokenize(0x140003f7da0, {0x140001dd800, 0x5a8})
	/Users/jesse/ollama/llm/llm.go:74 +0xb4 fp=0x140000c6d90 sp=0x140000c6c50 pc=0x104ef7f94
github.com/ollama/ollama/llm.(*llmServer).Tokenize(0x140000c6df8?, {0x105516574?, 0x5a8?}, {0x140001dd800?, 0x140000c6d00?})
	/Users/jesse/ollama/llm/server.go:963 +0x2c fp=0x140000c6dc0 sp=0x140000c6d90 pc=0x104ef6b6c
github.com/ollama/ollama/llm.LlamaServer.Tokenize-fm({0x105e876f0?, 0x140001e5c70?}, {0x140001dd800?, 0x140000350e0?})
	<autogenerated>:1 +0x50 fp=0x140000c6e00 sp=0x140000c6dc0 pc=0x105532fc0
github.com/ollama/ollama/server.chatPrompt({0x105e876f0, 0x140001e5c70}, 0x14000616480, 0x140000c7508, 0x1400013e000, {0x1400014e008, 0x7, 0x7}, {0x0, 0x0, ...})
	/Users/jesse/ollama/server/prompt.go:36 +0x2a0 fp=0x140000c7100 sp=0x140000c6e00 pc=0x1055165a0
github.com/ollama/ollama/server.(*Server).ChatHandler(0x1400000e9c0, 0x1400011c100)
	/Users/jesse/ollama/server/routes.go:1340 +0x478 fp=0x140000c7610 sp=0x140000c7100 pc=0x105523318
github.com/ollama/ollama/server.(*Server).ChatHandler-fm(0x9?)
	<autogenerated>:1 +0x30 fp=0x140000c7630 sp=0x140000c7610 pc=0x105533130
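Both fixes together, as a sketch with tokenizeC standing in for the cgo llama_tokenize call: pass the real data length, and negate the negative return value (the required token count) before growing the output buffer:

package llm

func tokenize(tokenizeC func(text string, out []int32) int, text string) []int32 {
	out := make([]int32, len(text)+2) // rough initial capacity
	// The length passed to C must be the actual data length; passing a
	// longer length makes the C++ side scan past the terminator into
	// uninitialized memory.
	n := tokenizeC(text, out)
	if n < 0 {
		out = make([]int32, -n) // convert to positive before reallocating
		n = tokenizeC(text, out)
	}
	return out[:n]
}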
2024-09-03 21:15:14 -04:00
Jesse Gross
0c2f95f3de runner: Initialize numPredict
numPredict is used to enforce a limit on the number of tokens to
generate. It is passed in from Ollama but is never stored so that
it can be checked.
2024-09-03 21:15:14 -04:00
Jesse Gross
ebdf781397 server: Fix double free on runner subprocess error.
If the runner subprocess encounters an error, it will close the HTTP
connection, which causes Ollama to free the instance of the model that
it has open. When Ollama exits, it will again try to free the models
for all of the runners that were open, resulting in a double free.
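One way to make the release idempotent (an illustration, not necessarily the exact fix applied): guard it with sync.Once so the subprocess error path and the shutdown path can both safely call it:

package server

import "sync"

type runnerRef struct {
	freeOnce sync.Once
}

func (r *runnerRef) free() {
	// release the model instance exactly once
}

func (r *runnerRef) Close() {
	r.freeOnce.Do(r.free)
}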
2024-09-03 21:15:14 -04:00
Jesse Gross
23c7c1326e llm: Fix lint 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
8fe30d161c Fix filename for non darwin arm builds 2024-09-03 21:15:14 -04:00
jmorganca
a483a4c4ed lint 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
b267ab92b0 Add missing vendor headers to ggml sync 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
189ca38f1d Wire up native source file dependencies
This should make sure incremental builds correctly identify
when to rebuild components based on which native files
are modified.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
80db43b7b4 Bump llama sync to 1e6f65 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
47b0e81219 fix dolphin-mistral 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
21947d5c1b harden integration tests 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
751009a5d7 Runtime selection of new or old runners
This adjusts the new runners to commingle with existing runners so we can use an
env var to toggle the new runners on.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
8527028bf4 Implement timings response in Go server
This implements the fields necessary for `run --verbose`
to generate timing information.
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
e0241118d0 Get embeddings working
Truncation doesn't pass, but the other embeddings tests pass
2024-09-03 21:15:14 -04:00
Daniel Hiltgen
f97ee8c506 Fix parallel requests 2024-09-03 21:15:13 -04:00
Daniel Hiltgen
e9dd656ff5 Update sync with latest llama.cpp layout, and run against b3485 2024-09-03 21:15:13 -04:00
Daniel Hiltgen
6c0d892498 Prefix all build artifacts with an OS/ARCH dir
This will help keep incremental builds from stomping on each other and make it
easier to stitch together the final runner payloads
2024-09-03 21:15:13 -04:00
Daniel Hiltgen
13348e3629 Get linux building
Still needs a bit more refinement to (auto)detect cuda/hip and fall back
gracefully if not detected.
2024-09-03 21:15:13 -04:00
jmorganca
3d5a08c315 add note in readme 2024-09-03 21:15:13 -04:00
jmorganca
a29851bc9b clean up metal code 2024-09-03 21:15:13 -04:00
jmorganca
8dda9293fa fix Makefile on windows 2024-09-03 21:15:13 -04:00
jmorganca
b3c62dcafd remove printing 2024-09-03 21:15:13 -04:00
jmorganca
9b8b7cd9b5 dont apply license to stb_image.h and json.hpp 2024-09-03 21:15:13 -04:00
jmorganca
1da6c40f4f lint 2024-09-03 21:15:13 -04:00
jmorganca
76ca2de06e update sync header 2024-09-03 21:15:13 -04:00
jmorganca
0eabc2e34d remove unused script 2024-09-03 21:15:13 -04:00
jmorganca
dded27dcfa fix metal 2024-09-03 21:15:13 -04:00
jmorganca
080b600865 add header to not edit 2024-09-03 21:15:13 -04:00
jmorganca
d6b6de9a5a add header to not edit 2024-09-03 21:15:13 -04:00
jmorganca
24a741424f fix build on windows 2024-09-03 21:15:13 -04:00
jmorganca
4d476d894e fix Makefile 2024-09-03 21:15:13 -04:00
jmorganca
bd94ddfc56 fix README.md 2024-09-03 21:15:13 -04:00
jmorganca
f1f54c5bd5 fix README.md 2024-09-03 21:15:13 -04:00
jmorganca
18662d1180 consistent whitespace 2024-09-03 21:15:13 -04:00
jmorganca
3d1f3569cf update .gitattributes 2024-09-03 21:15:13 -04:00
jmorganca
083a9e9b4e link metal 2024-09-03 21:15:13 -04:00
jmorganca
d0703eaf44 wip 2024-09-03 21:15:13 -04:00