69 Commits

Author SHA1 Message Date
Roy Han
23d5beeb9c input 2024-07-16 15:19:13 -07:00
Roy Han
eb7cc2d1ce image embeddings 2024-07-15 12:13:06 -07:00
Roy Han
8f6d0242b6 refactoring 2024-07-09 16:19:02 -07:00
Roy Han
c697eb2a9b fix hanging on single string 2024-07-09 15:51:55 -07:00
Roy Han
bcb63e6e0e touches 2024-07-09 13:37:00 -07:00
royjhan
b7c622dd32
Merge branch 'main' into royh-batchembed 2024-07-08 15:10:52 -07:00
Jeffrey Morgan
d8def1ff94
llm: allow gemma 2 to context shift (#5534) 2024-07-07 13:41:51 -04:00
Jeffrey Morgan
0e09c380fc
llm: print caching notices in debug only (#5533) 2024-07-07 12:38:04 -04:00
Jeffrey Morgan
2cc854f8cb
llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511)
* Revert "fix cmake build (#5505)"

This reverts commit 4fd5f3526a116d05cd74cfcc7217d4e6326e1bea.

* llm: fix missing dylibs by restoring old build behavior

* crlf -> lf
2024-07-05 21:48:31 -04:00
Jeffrey Morgan
4fd5f3526a
fix cmake build (#5505) 2024-07-05 19:07:01 -04:00
Jeffrey Morgan
8f8e736b13
update llama.cpp submodule to d7fd29f (#5475) 2024-07-05 13:25:58 -04:00
Jeffrey Morgan
d89454de80
Use slot with cached prompt instead of least recently used (#5492)
* Use common prefix to select slot

* actually report `longest`
2024-07-05 12:32:47 -04:00
Roy Han
17de2b4405 Refactoring of legacy and new 2024-07-03 14:02:25 -07:00
royjhan
3b5a4a77f3
Return Correct Prompt Eval Count Regardless of Cache Prompt (#5371)
* openai compatibility

* Revert "openai compatibility"

This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.

* remove erroneous subtraction of prompt cache
2024-07-03 13:46:23 -07:00
royjhan
a5f23d766e
Merge branch 'main' into royh-batchembed 2024-07-03 11:20:24 -07:00
Roy Han
512e0a7bde Clean up 2024-07-01 16:29:54 -07:00
Roy Han
aee25acb5b move normalization to go 2024-07-01 14:10:58 -07:00
Jeffrey Morgan
717f7229eb
Do not shift context for sliding window models (#5368)
* Do not shift context for sliding window models

* truncate prompt > 2/3 tokens

* only target gemma2
2024-06-28 19:39:31 -07:00
Roy Han
80c1a3f812 playing around with truncate stuff 2024-06-28 18:17:09 -07:00
Roy Han
c111d8bb51 normalization 2024-06-28 17:19:04 -07:00
Roy Han
5213c12354 clean up 2024-06-28 15:26:58 -07:00
Roy Han
49e341147d add server function 2024-06-28 15:03:53 -07:00
Roy Han
c406fa7a4c api/embed draft 2024-06-28 14:54:21 -07:00
Roy Han
ff191d7cba Initial Draft 2024-06-25 13:29:47 -07:00
Michael Yang
9d91e5e587 remove confusing log message 2024-06-19 11:14:11 -07:00
Daniel Hiltgen
fb9cdfa723 Fix server.cpp for the new cuda build macros 2024-06-14 14:51:40 -07:00
Jeffrey Morgan
ead259d877
llm: fix seed value not being applied to requests (#4986) 2024-06-11 14:24:41 -07:00
Jeffrey Morgan
34f142797a
llm: always add bos token to prompt (#4941)
* fix embedding by adding fixes from llama.cpp upstream

* remove assert

---------

Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
2024-06-08 18:47:10 -07:00
Michael Yang
829ff87bd1
revert tokenize ffi (#4761)
* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65dbb88004cd046c8acc0c8e889816e1828.

* Revert "vocab only"

This reverts commit bf54c845e9ea63ec58762a991dcea78d2c934b47.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a04108f6cae625802e69faa4b48480bc208.
2024-05-31 18:54:21 -07:00
Michael Yang
de781b37c8 rm unused infill 2024-05-29 11:26:47 -07:00
Michael Yang
3e21799377 rm unused system prompt 2024-05-29 11:26:47 -07:00
Michael Yang
26a00a0410 use ffi for tokenizing/detokenizing 2024-05-29 11:26:47 -07:00
Michael Yang
714adb8bd1
bump (#4597) 2024-05-23 14:16:26 -07:00
Daniel Hiltgen
b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Sam
e15307fdf4
feat: add support for flash_attn (#4120)
* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Michael Yang
58876091f7 log clean up 2024-05-09 14:55:36 -07:00
Daniel Hiltgen
920a4b0794 Merge remote-tracking branch 'upstream/main' into pr3702 2024-05-08 16:44:35 -07:00
Michael Yang
44869c59d6 omit prompt and generate settings from final response 2024-05-03 17:00:02 -07:00
jmorganca
fcf4d60eee llm: add back check for empty token cache 2024-04-30 17:38:44 -04:00
Jeffrey Morgan
18d9a7e1f1
update llama.cpp submodule to f364eb6 (#4060) 2024-04-30 17:25:39 -04:00
Daniel Hiltgen
23d23409a0
Update llama.cpp (#4036)
* Bump llama.cpp to b2761

* Adjust types for bump
2024-04-29 23:18:48 -04:00
ManniX-ITA
c942e4a07b
Fixed startup sequence to report model loading 2024-04-17 17:40:32 +02:00
Jeffrey Morgan
7c9792a6e0
Support unicode characters in model path (#3681)
* parse wide argv characters on windows

* cleanup

* move cleanup to end of `main`
2024-04-16 17:00:12 -04:00
Daniel Hiltgen
0a0e9f3e0f Apply 01-cache.diff 2024-04-01 16:48:18 -07:00
Daniel Hiltgen
58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems.  This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
Jeffrey Morgan
f5ca7f8c8e
add license in file header for vendored llama.cpp code (#3351) 2024-03-26 16:23:23 -04:00
Daniel Hiltgen
43799532c1 Bump llama.cpp to b2474
The release just before ggml-cuda.cu refactoring
2024-03-23 09:54:56 +01:00
Jeffrey Morgan
e95ffc7448
llama: remove server static assets (#3174) 2024-03-15 19:24:12 -07:00
Daniel Hiltgen
85129d3a32 Adapt our build for imported server.cpp 2024-03-12 14:57:15 -07:00
Daniel Hiltgen
9ac6440da3 Import server.cpp as of b2356 2024-03-12 13:58:06 -07:00