86 Commits

Author SHA1 Message Date
Jesse Gross
76718ead40 runner.go: Support MinP parameter
MinP is a user-facing parameter that is exposed through the APIs but
is not currently plumbed through.
2024-09-03 21:15:14 -04:00
Jesse Gross
477f529d26 runner.go: Implement RepeatLastN to penalize repeated tokens
RepeatLastN is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
2024-09-03 21:15:14 -04:00
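A minimal sketch of the repeat penalty in llama.cpp style, assuming the conventional form: tokens seen in the last N generated tokens have positive logits divided by the penalty and negative logits multiplied by it, pushing both toward "less likely". Function and variable names are illustrative, not the runner's actual API.

```go
package main

import "fmt"

// applyRepeatPenalty penalizes each unique token that appeared in the
// last n entries of history by adjusting its logit in place.
func applyRepeatPenalty(logits []float64, history []int, n int, penalty float64) {
	start := len(history) - n
	if start < 0 {
		start = 0
	}
	seen := map[int]bool{}
	for _, tok := range history[start:] {
		seen[tok] = true
	}
	for tok := range seen {
		if tok < 0 || tok >= len(logits) {
			continue
		}
		if logits[tok] > 0 {
			logits[tok] /= penalty
		} else {
			logits[tok] *= penalty
		}
	}
}

func main() {
	logits := []float64{2.0, -1.0, 0.5}
	// Tokens 0 and 1 were recently generated; token 2 was not.
	applyRepeatPenalty(logits, []int{0, 1}, 64, 2.0)
	fmt.Println(logits) // tokens 0 and 1 are penalized, token 2 untouched
}
```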
Jesse Gross
69cc5795a7 runner.go: Shift context window when KV cache space is exceeded
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.

This uses the algorithm from llama.cpp that Ollama currently uses with
the server.cpp code. There are other algorithms, but they are never
enabled through Ollama, so this restores parity.

The algorithm is:
 - Retain a configurable number of tokens at the beginning (for things
like beginning-of-sequence tokens)
 - Drop the oldest half of the remaining tokens
 - Shift the remaining new tokens to the back of the cache
2024-09-03 21:15:14 -04:00
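The three steps above can be sketched on a plain token slice (the real change operates on the llama.cpp KV cache, not a Go slice, so this is only a model of the bookkeeping):

```go
package main

import "fmt"

// shiftContext retains numKeep tokens at the front, drops the oldest
// half of the remaining tokens, and shifts what is left up behind the
// kept prefix.
func shiftContext(tokens []int, numKeep int) []int {
	if numKeep > len(tokens) {
		numKeep = len(tokens)
	}
	discard := (len(tokens) - numKeep) / 2
	out := make([]int, 0, len(tokens)-discard)
	out = append(out, tokens[:numKeep]...)
	return append(out, tokens[numKeep+discard:]...)
}

func main() {
	// Keep 2 prefix tokens; of the remaining 6, drop the oldest 3.
	fmt.Println(shiftContext([]int{1, 2, 3, 4, 5, 6, 7, 8}, 2)) // [1 2 6 7 8]
}
```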
Jesse Gross
523d84c563 llama.go: Use dynamic buffer for TokenToPiece
The cgo binding for llama_token_to_piece uses a fixed 12-byte buffer,
which is usually but not always enough to hold a token. This increases
the buffer size if needed, similar to what llama.cpp does internally.
2024-09-03 21:15:14 -04:00
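The grow-on-demand pattern can be sketched without cgo. Here `fill` is a hypothetical stand-in for the llama_token_to_piece call, assuming its llama.cpp-style contract of returning the negated required size when the buffer is too small:

```go
package main

import "fmt"

// fill mimics the assumed llama_token_to_piece contract: copy the piece
// into buf, or return the negated required size if buf is too small.
func fill(buf []byte, piece string) int {
	if len(buf) < len(piece) {
		return -len(piece)
	}
	return copy(buf, piece)
}

// tokenToPiece tries the old fixed 12-byte buffer first; if the call
// reports it needs more space, it retries once with a buffer of
// exactly the required size.
func tokenToPiece(piece string) string {
	buf := make([]byte, 12)
	n := fill(buf, piece)
	if n < 0 {
		buf = make([]byte, -n)
		n = fill(buf, piece)
	}
	return string(buf[:n])
}

func main() {
	fmt.Println(tokenToPiece("short"))                // fits in 12 bytes
	fmt.Println(tokenToPiece("a-much-longer-token!")) // forces a retry
}
```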
Jesse Gross
ed19fad862 llama.go: Make batch memory allocation match configuration
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed-size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
2024-09-03 21:15:14 -04:00
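The fix amounts to sizing the batch buffers from configuration instead of a constant. A minimal sketch, with illustrative field names that are not the runner's actual types:

```go
package main

import "fmt"

type batch struct {
	tokens []int
	pos    []int
}

// newBatch sizes the token and position buffers from the configured
// batch size, so a larger configured batch cannot overrun fixed arrays.
func newBatch(nBatch int) batch {
	return batch{
		tokens: make([]int, 0, nBatch),
		pos:    make([]int, 0, nBatch),
	}
}

func main() {
	// e.g. a user-configured batch size above the 512 default
	b := newBatch(2048)
	fmt.Println(cap(b.tokens), cap(b.pos)) // 2048 2048
}
```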
jmorganca
a483a4c4ed lint 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
e9dd656ff5 Update sync with latest llama.cpp layout, and run against b3485 2024-09-03 21:15:13 -04:00
Daniel Hiltgen
6c0d892498 Prefix all build artifacts with an OS/ARCH dir
This will help keep incremental builds from stomping on each other and make it
easier to stitch together the final runner payloads
2024-09-03 21:15:13 -04:00
jmorganca
a29851bc9b clean up metal code 2024-09-03 21:15:13 -04:00
jmorganca
8dda9293fa fix Makefile on windows 2024-09-03 21:15:13 -04:00
jmorganca
b3c62dcafd remove printing 2024-09-03 21:15:13 -04:00
jmorganca
1da6c40f4f lint 2024-09-03 21:15:13 -04:00
jmorganca
dded27dcfa fix metal 2024-09-03 21:15:13 -04:00
jmorganca
24a741424f fix build on windows 2024-09-03 21:15:13 -04:00
jmorganca
083a9e9b4e link metal 2024-09-03 21:15:13 -04:00
jmorganca
d0703eaf44 wip 2024-09-03 21:15:13 -04:00
jmorganca
ce00e387c3 wip meta 2024-09-03 21:15:13 -04:00
jmorganca
763d7b601c sync 2024-09-03 21:15:13 -04:00
jmorganca
4d0e6c55b0 remove perl docs 2024-09-03 21:15:13 -04:00
jmorganca
3375b82c56 remove build scripts 2024-09-03 21:15:13 -04:00
jmorganca
a632a04426 fix output 2024-09-03 21:15:13 -04:00
jmorganca
110f37ffb0 arch build 2024-09-03 21:15:13 -04:00
jmorganca
f2f03ff7f2 add temporary makefile 2024-09-03 21:15:13 -04:00
jmorganca
9966a055e5 fix cgo flags for darwin amd64 2024-09-03 21:15:13 -04:00
jmorganca
43efc893d7 basic progress 2024-09-03 21:15:13 -04:00
jmorganca
20afaae020 add more runner params 2024-09-03 21:15:13 -04:00
jmorganca
b2ef3bf490 embeddings 2024-09-03 21:15:12 -04:00
jmorganca
ce15ed6d69 remove dependency on llm 2024-09-03 21:15:12 -04:00
jmorganca
c0b94376b2 grammar 2024-09-03 21:15:12 -04:00
jmorganca
72be8e27c4 sampling 2024-09-03 21:15:12 -04:00
jmorganca
d12db0568e better example module, add port 2024-09-03 21:15:12 -04:00
jmorganca
ec17359a68 wip 2024-09-03 21:15:12 -04:00
jmorganca
fbc8572859 add llava to runner 2024-09-03 21:15:12 -04:00
jmorganca
28bedcd807 wip 2024-09-03 21:15:12 -04:00
jmorganca
b22d78720e cuda linux 2024-09-03 21:15:12 -04:00
jmorganca
9547aa53ff disable log file 2024-09-03 21:15:12 -04:00
jmorganca
a8f91d3cc1 add llava 2024-09-03 21:15:12 -04:00
jmorganca
e86db9381a avx2 should only add avx2 2024-09-03 21:15:12 -04:00
jmorganca
9fe48978a8 move runner package down 2024-09-03 21:15:12 -04:00
jmorganca
01ccbc07fe replace static build in llm 2024-09-03 21:15:12 -04:00
jmorganca
0110994d06 Initial llama Go module 2024-09-03 21:15:12 -04:00
jmorganca
2ef3a217d1 add sync of llama.cpp 2024-09-03 21:15:12 -04:00
Michael Yang
fccf8d179f partial decode ggml bin for more info 2023-08-10 09:23:10 -07:00
Bruce MacDonald
984c9c628c fix embeddings invalid values 2023-08-09 16:50:53 -04:00
Bruce MacDonald
09d8bf6730 fix build errors 2023-08-09 10:45:57 -04:00
Bruce MacDonald
7a5f3616fd embed text document in modelfile 2023-08-09 10:26:19 -04:00
Michael Yang
f2074ed4c0 Merge pull request #306 from jmorganca/default-keep-system
automatically set num_keep if num_keep < 0
2023-08-08 09:25:34 -07:00
Bruce MacDonald
a6f6d18f83 embed text document in modelfile 2023-08-08 11:27:17 -04:00
Jeffrey Morgan
5eb712f962 trim whitespace before checking stop conditions
Fixes #295
2023-08-08 00:29:19 -04:00
Michael Yang
4dc5b117dd automatically set num_keep if num_keep < 0
num_keep defines how many tokens to keep in the context when truncating
inputs. If left at its default value of -1, the server will calculate
num_keep to be the length of the system instructions.
2023-08-07 16:19:12 -07:00
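The default described above can be sketched as a small resolver. `numSystemTokens` is a hypothetical stand-in for the token count of the system instructions, not an actual server variable:

```go
package main

import "fmt"

// resolveNumKeep returns the configured num_keep, or, when it is left
// at its -1 default, the number of tokens in the system instructions.
func resolveNumKeep(numKeep, numSystemTokens int) int {
	if numKeep < 0 {
		return numSystemTokens
	}
	return numKeep
}

func main() {
	fmt.Println(resolveNumKeep(-1, 32)) // defaults to the system prompt length
	fmt.Println(resolveNumKeep(4, 32))  // an explicit value is kept as-is
}
```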