92 Commits

Author SHA1 Message Date
jmorganca
f443dd7b81 llama: sync llama.cpp to commit 8962422 2024-09-03 21:23:46 -04:00
Jesse Gross
8db94469e0 runner.go: Support GGUF LoRAs
The current cgo bindings for loading LoRAs only support the older
GGLA file format, which is no longer supported. This switches to
the functions that load the newer GGUF LoRA format.
2024-09-03 21:15:14 -04:00
Jesse Gross
c989321509 runner.go: Don't cast a Go handle to a C void *
Cgo supports passing pointers through C back to a Go function by
using integer handles that refer to the Go memory. However, it is
not legal to cast such a handle to a void * aux data pointer in C.
Doing this results in panics about invalid pointers on the stack in
certain circumstances.

Instead, we should pass a pointer to the handle and pin that in
memory. It would probably also be safe to directly pin the Go
function pointer and pass that rather than using the handle since
it is an opaque blob to C. However, using a handle is the more
generally correct solution and there is no need to get clever.
2024-09-03 21:15:14 -04:00
Jesse Gross
e4a091bafd runner.go: Support resource usage command line options
Command line options to the runner that control resource usage
(mmap, mlock, tensor split) are used by Ollama but not currently
implemented. This implements support for these while ignoring
others that have no meaning in this context.
2024-09-03 21:15:14 -04:00
Jesse Gross
46a7c682f2 runner.go: Fix embeddings endpoint
The embeddings endpoint only takes a single input and provides a
single output, instead of multiple as the current implementation
expected. Fixing this also allows the implementation to be simplified
and a few embedding-specific issues to be addressed.
2024-09-03 21:15:14 -04:00
Jesse Gross
0b73cca386 runner.go: Fix resource leaks when removing sequences
There are multiple causes and paths that result in a sequence
ending. Not all of these free the sampling context or reset the
pieces slice. This factors out the removal code so that all
paths release resources.
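The factored-out cleanup might look like this sketch; every type and field name here is a hypothetical stand-in for the real runner.go structures:

```go
package main

import "fmt"

// samplingContext stands in for the C-side sampling state that must be
// freed exactly once per sequence (hypothetical stub).
type samplingContext struct{ freed bool }

func (c *samplingContext) free() { c.freed = true }

type sequence struct {
	sampling *samplingContext
	pieces   []string
}

type server struct{ seqs []*sequence }

// removeSequence centralizes the cleanup so that every path that ends a
// sequence releases the sampling context and resets the pieces slice.
func (s *server) removeSequence(i int) {
	seq := s.seqs[i]
	seq.sampling.free()
	seq.pieces = nil
	s.seqs[i] = nil
}

func main() {
	srv := &server{seqs: []*sequence{
		{sampling: &samplingContext{}, pieces: []string{"hi"}},
	}}
	srv.removeSequence(0)
	fmt.Println(srv.seqs[0] == nil) // prints "true"
}
```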
2024-09-03 21:15:14 -04:00
Jesse Gross
76718ead40 runner.go: Support MinP parameter
MinP is a user-facing parameter that is exposed through the APIs but
is not currently plumbed through.
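As a rough illustration of what the min-p sampler does, a sketch of the standard algorithm (not the llama.cpp API): tokens whose probability falls below `p` times the top token's probability are filtered out.

```go
package main

import "fmt"

// minP returns the indices of tokens that survive min-p filtering: only
// tokens with probability >= p * max(probs) are kept (sketch of the
// standard algorithm, not the real sampler API).
func minP(probs []float64, p float64) []int {
	maxProb := 0.0
	for _, pr := range probs {
		if pr > maxProb {
			maxProb = pr
		}
	}
	var kept []int
	for i, pr := range probs {
		if pr >= p*maxProb {
			kept = append(kept, i)
		}
	}
	return kept
}

func main() {
	probs := []float64{0.5, 0.3, 0.15, 0.05}
	// With min_p = 0.2 the cutoff is 0.2 * 0.5 = 0.1, so token 3 is dropped.
	fmt.Println(minP(probs, 0.2)) // prints [0 1 2]
}
```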
2024-09-03 21:15:14 -04:00
Jesse Gross
477f529d26 runner.go: Implement RepeatLastN to penalize repeated tokens
RepeatLastN is a user-facing parameter that is exposed through the APIs
but is not currently plumbed through.
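A sketch of the classic repetition-penalty scheme the parameter controls, with illustrative names (the real sampler lives in llama.cpp): logits of tokens seen in the last `repeatLastN` tokens are penalized, dividing positive logits by the penalty and multiplying negative ones.

```go
package main

import "fmt"

// applyRepeatPenalty penalizes tokens that appear in the last repeatLastN
// entries of recent, following the common llama.cpp scheme: positive
// logits are divided by the penalty, negative ones multiplied (sketch).
func applyRepeatPenalty(logits []float64, recent []int, repeatLastN int, penalty float64) {
	start := len(recent) - repeatLastN
	if start < 0 {
		start = 0
	}
	for _, tok := range recent[start:] {
		if logits[tok] > 0 {
			logits[tok] /= penalty
		} else {
			logits[tok] *= penalty
		}
	}
}

func main() {
	logits := []float64{2.0, -1.0, 3.0}
	// Tokens 0 and 1 were generated recently; token 2 is untouched.
	applyRepeatPenalty(logits, []int{0, 1}, 64, 2.0)
	fmt.Println(logits) // prints [1 -2 3]
}
```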
2024-09-03 21:15:14 -04:00
Jesse Gross
69cc5795a7 runner.go: Shift context window when KV cache space is exceeded
Currently, once the KV cache is full, text generation stops. Instead,
we should shift out the oldest context so that new generation can
continue based on more recent context.

This uses the algorithm from llama.cpp that Ollama currently relies on
through the server.cpp code. llama.cpp offers other strategies, but
Ollama never enables them, so this restores parity.

The algorithm is:
 - Retain a configurable number of tokens at the beginning (for things
like beginning-of-sequence tokens)
 - Drop the oldest half of the remaining tokens
 - Shift the remaining new tokens to the back of the cache
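The steps above can be sketched as follows, using a simplified model that treats the cache as a plain token slice (the real implementation shifts KV cache positions through llama.cpp rather than copying tokens):

```go
package main

import "fmt"

// shiftContext applies the context-shift algorithm: keep numKeep tokens
// at the front, drop the oldest half of the rest, and move the remaining
// newer tokens back so generation can continue.
func shiftContext(cache []int, numKeep int) []int {
	rest := cache[numKeep:]
	discard := len(rest) / 2 // drop the oldest half of the non-retained tokens

	shifted := append([]int{}, cache[:numKeep]...)
	return append(shifted, rest[discard:]...)
}

func main() {
	// Token 1 stands in for a beginning-of-sequence token we must retain.
	cache := []int{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(shiftContext(cache, 1)) // prints [1 5 6 7 8]
}
```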
2024-09-03 21:15:14 -04:00
Jesse Gross
523d84c563 llama.go: Use dynamic buffer for TokenToPiece
The cgo binding for llama_token_to_piece uses a fixed 12-byte buffer,
which is usually but not always enough to hold a token. This increases
the buffer size when needed, similar to what llama.cpp does internally.
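The grow-and-retry pattern might look like this; `tokenToPieceC` is a hypothetical pure-Go stand-in for the actual cgo call, which by convention returns a negative value whose magnitude is the required size when the buffer is too small:

```go
package main

import "fmt"

// tokenToPieceC stands in for the cgo call to llama_token_to_piece: it
// returns the number of bytes written, or a negative value whose
// magnitude is the required size when the buffer is too small (stub).
func tokenToPieceC(token int, buf []byte) int {
	piece := "a-token-longer-than-12-bytes"
	if len(buf) < len(piece) {
		return -len(piece)
	}
	copy(buf, piece)
	return len(piece)
}

// tokenToPiece starts with a small fixed buffer and retries once with a
// buffer grown to the size the C side reports, similar to what llama.cpp
// does internally (sketch, not the real llama.go code).
func tokenToPiece(token int) string {
	buf := make([]byte, 12) // fixed 12-byte guess, usually enough
	n := tokenToPieceC(token, buf)
	if n < 0 {
		buf = make([]byte, -n) // grow to the required size and retry
		n = tokenToPieceC(token, buf)
	}
	return string(buf[:n])
}

func main() {
	fmt.Println(tokenToPiece(42)) // prints "a-token-longer-than-12-bytes"
}
```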
2024-09-03 21:15:14 -04:00
Jesse Gross
ed19fad862 llama.go: Make batch memory allocation match configuration
Batch size defaults to 512 but is configurable. However, llama.go uses
a fixed-size buffer, causing crashes if the batch size is increased.
This changes the array size to follow the configuration.
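A sketch of sizing the per-token buffers from the configuration rather than a hard-coded constant (the type and constructor names are illustrative, not the real llama.go API):

```go
package main

import "fmt"

// batch holds per-token buffers whose capacity must match the configured
// batch size; a fixed-size array crashes once nBatch is raised above it.
type batch struct {
	tokens []int
	pos    []int
}

// newBatch allocates the buffers from the runtime configuration instead
// of assuming the 512 default (sketch of the fix, not the real code).
func newBatch(nBatch int) *batch {
	return &batch{
		tokens: make([]int, 0, nBatch),
		pos:    make([]int, 0, nBatch),
	}
}

func main() {
	b := newBatch(1024) // configured batch size, not the 512 default
	fmt.Println(cap(b.tokens), cap(b.pos)) // prints "1024 1024"
}
```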
2024-09-03 21:15:14 -04:00
jmorganca
a483a4c4ed lint 2024-09-03 21:15:14 -04:00
Daniel Hiltgen
e9dd656ff5 Update sync with latest llama.cpp layout, and run against b3485 2024-09-03 21:15:13 -04:00
Daniel Hiltgen
6c0d892498 Prefix all build artifacts with an OS/ARCH dir
This will help keep incremental builds from stomping on each other and make it
easier to stitch together the final runner payloads
2024-09-03 21:15:13 -04:00
jmorganca
a29851bc9b clean up metal code 2024-09-03 21:15:13 -04:00
jmorganca
8dda9293fa fix Makefile on windows 2024-09-03 21:15:13 -04:00
jmorganca
b3c62dcafd remove printing 2024-09-03 21:15:13 -04:00
jmorganca
1da6c40f4f lint 2024-09-03 21:15:13 -04:00
jmorganca
dded27dcfa fix metal 2024-09-03 21:15:13 -04:00
jmorganca
24a741424f fix build on windows 2024-09-03 21:15:13 -04:00
jmorganca
083a9e9b4e link metal 2024-09-03 21:15:13 -04:00
jmorganca
d0703eaf44 wip 2024-09-03 21:15:13 -04:00
jmorganca
ce00e387c3 wip meta 2024-09-03 21:15:13 -04:00
jmorganca
763d7b601c sync 2024-09-03 21:15:13 -04:00
jmorganca
4d0e6c55b0 remove perl docs 2024-09-03 21:15:13 -04:00
jmorganca
3375b82c56 remove build scripts 2024-09-03 21:15:13 -04:00
jmorganca
a632a04426 fix output 2024-09-03 21:15:13 -04:00
jmorganca
110f37ffb0 arch build 2024-09-03 21:15:13 -04:00
jmorganca
f2f03ff7f2 add temporary makefile 2024-09-03 21:15:13 -04:00
jmorganca
9966a055e5 fix cgo flags for darwin amd64 2024-09-03 21:15:13 -04:00
jmorganca
43efc893d7 basic progress 2024-09-03 21:15:13 -04:00
jmorganca
20afaae020 add more runner params 2024-09-03 21:15:13 -04:00
jmorganca
b2ef3bf490 embeddings 2024-09-03 21:15:12 -04:00
jmorganca
ce15ed6d69 remove dependency on llm 2024-09-03 21:15:12 -04:00
jmorganca
c0b94376b2 grammar 2024-09-03 21:15:12 -04:00
jmorganca
72be8e27c4 sampling 2024-09-03 21:15:12 -04:00
jmorganca
d12db0568e better example module, add port 2024-09-03 21:15:12 -04:00
jmorganca
ec17359a68 wip 2024-09-03 21:15:12 -04:00
jmorganca
fbc8572859 add llava to runner 2024-09-03 21:15:12 -04:00
jmorganca
28bedcd807 wip 2024-09-03 21:15:12 -04:00
jmorganca
b22d78720e cuda linux 2024-09-03 21:15:12 -04:00
jmorganca
9547aa53ff disable log file 2024-09-03 21:15:12 -04:00
jmorganca
a8f91d3cc1 add llava 2024-09-03 21:15:12 -04:00
jmorganca
e86db9381a avx2 should only add avx2 2024-09-03 21:15:12 -04:00
jmorganca
9fe48978a8 move runner package down 2024-09-03 21:15:12 -04:00
jmorganca
01ccbc07fe replace static build in llm 2024-09-03 21:15:12 -04:00
jmorganca
0110994d06 Initial llama Go module 2024-09-03 21:15:12 -04:00
jmorganca
2ef3a217d1 add sync of llama.cpp 2024-09-03 21:15:12 -04:00
Michael Yang
fccf8d179f partial decode ggml bin for more info 2023-08-10 09:23:10 -07:00
Bruce MacDonald
984c9c628c fix embeddings invalid values 2023-08-09 16:50:53 -04:00