ollama

Author	SHA1	Message	Date
Roy Han	23d5beeb9c	input	2024-07-16 15:19:13 -07:00
Roy Han	eb7cc2d1ce	image embeddings	2024-07-15 12:13:06 -07:00
Roy Han	8f6d0242b6	refactoring	2024-07-09 16:19:02 -07:00
Roy Han	c697eb2a9b	fix hanging on single string	2024-07-09 15:51:55 -07:00
Roy Han	bcb63e6e0e	touches	2024-07-09 13:37:00 -07:00
royjhan	b7c622dd32	Merge branch 'main' into royh-batchembed	2024-07-08 15:10:52 -07:00
Jeffrey Morgan	d8def1ff94	llm: allow gemma 2 to context shift (#5534)	2024-07-07 13:41:51 -04:00
Jeffrey Morgan	0e09c380fc	llm: print caching notices in debug only (#5533)	2024-07-07 12:38:04 -04:00
Jeffrey Morgan	2cc854f8cb	llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511) * Revert "fix cmake build (#5505)" This reverts commit 4fd5f3526a116d05cd74cfcc7217d4e6326e1bea. * llm: fix missing dylibs by restoring old build behavior * crlf -> lf	2024-07-05 21:48:31 -04:00
Jeffrey Morgan	4fd5f3526a	fix cmake build (#5505)	2024-07-05 19:07:01 -04:00
Jeffrey Morgan	8f8e736b13	update llama.cpp submodule to `d7fd29f` (#5475)	2024-07-05 13:25:58 -04:00
Jeffrey Morgan	d89454de80	Use slot with cached prompt instead of least recently used (#5492) * Use common prefix to select slot * actually report `longest`	2024-07-05 12:32:47 -04:00
Roy Han	17de2b4405	Refactoring of legacy and new	2024-07-03 14:02:25 -07:00
royjhan	3b5a4a77f3	Return Correct Prompt Eval Count Regardless of Cache Prompt (#5371) * openai compatibility * Revert "openai compatibility" This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c. * remove erroneous subtraction of prompt cache	2024-07-03 13:46:23 -07:00
royjhan	a5f23d766e	Merge branch 'main' into royh-batchembed	2024-07-03 11:20:24 -07:00
Roy Han	512e0a7bde	Clean up	2024-07-01 16:29:54 -07:00
Roy Han	aee25acb5b	move normalization to go	2024-07-01 14:10:58 -07:00
Jeffrey Morgan	717f7229eb	Do not shift context for sliding window models (#5368) * Do not shift context for sliding window models * truncate prompt > 2/3 tokens * only target gemma2	2024-06-28 19:39:31 -07:00
Roy Han	80c1a3f812	playing around with truncate stuff	2024-06-28 18:17:09 -07:00
Roy Han	c111d8bb51	normalization	2024-06-28 17:19:04 -07:00
Roy Han	5213c12354	clean up	2024-06-28 15:26:58 -07:00
Roy Han	49e341147d	add server function	2024-06-28 15:03:53 -07:00
Roy Han	c406fa7a4c	api/embed draft	2024-06-28 14:54:21 -07:00
Roy Han	ff191d7cba	Initial Draft	2024-06-25 13:29:47 -07:00
Michael Yang	9d91e5e587	remove confusing log message	2024-06-19 11:14:11 -07:00
Daniel Hiltgen	fb9cdfa723	Fix server.cpp for the new cuda build macros	2024-06-14 14:51:40 -07:00
Jeffrey Morgan	ead259d877	llm: fix seed value not being applied to requests (#4986)	2024-06-11 14:24:41 -07:00
Jeffrey Morgan	34f142797a	llm: always add bos token to prompt (#4941) * fix embedding by adding fixes from llama.cpp upstream * remove assert --------- Co-authored-by: Jesper Ek <deadbeef84@gmail.com>	2024-06-08 18:47:10 -07:00
Michael Yang	829ff87bd1	revert tokenize ffi (#4761) * Revert "use `int32_t` for call to tokenize (#4738)" This reverts commit 763bb65dbb88004cd046c8acc0c8e889816e1828. * Revert "vocab only" This reverts commit bf54c845e9ea63ec58762a991dcea78d2c934b47. * Revert "use ffi for tokenizing/detokenizing" This reverts commit 26a00a04108f6cae625802e69faa4b48480bc208.	2024-05-31 18:54:21 -07:00
Michael Yang	de781b37c8	rm unused infill	2024-05-29 11:26:47 -07:00
Michael Yang	3e21799377	rm unused system prompt	2024-05-29 11:26:47 -07:00
Michael Yang	26a00a0410	use ffi for tokenizing/detokenizing	2024-05-29 11:26:47 -07:00
Michael Yang	714adb8bd1	bump (#4597)	2024-05-23 14:16:26 -07:00
Daniel Hiltgen	b37b496a12	Wire up load progress This doesn't expose a UX yet, but wires the initial server portion of progress reporting during load	2024-05-23 13:36:48 -07:00
Sam	e15307fdf4	feat: add support for flash_attn (#4120) * feat: enable flash attention if supported * feat: enable flash attention if supported * feat: enable flash attention if supported * feat: add flash_attn support	2024-05-20 13:36:03 -07:00
Michael Yang	58876091f7	log clean up	2024-05-09 14:55:36 -07:00
Daniel Hiltgen	920a4b0794	Merge remote-tracking branch 'upstream/main' into pr3702	2024-05-08 16:44:35 -07:00
Michael Yang	44869c59d6	omit prompt and generate settings from final response	2024-05-03 17:00:02 -07:00
jmorganca	fcf4d60eee	llm: add back check for empty token cache	2024-04-30 17:38:44 -04:00
Jeffrey Morgan	18d9a7e1f1	update llama.cpp submodule to `f364eb6` (#4060)	2024-04-30 17:25:39 -04:00
Daniel Hiltgen	23d23409a0	Update llama.cpp (#4036) * Bump llama.cpp to b2761 * Adjust types for bump	2024-04-29 23:18:48 -04:00
ManniX-ITA	c942e4a07b	Fixed startup sequence to report model loading	2024-04-17 17:40:32 +02:00
Jeffrey Morgan	7c9792a6e0	Support unicode characters in model path (#3681) * parse wide argv characters on windows * cleanup * move cleanup to end of `main`	2024-04-16 17:00:12 -04:00
Daniel Hiltgen	0a0e9f3e0f	Apply 01-cache.diff	2024-04-01 16:48:18 -07:00
Daniel Hiltgen	58d95cc9bd	Switch back to subprocessing for llama.cpp This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shutdown when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.	2024-04-01 16:48:18 -07:00
Jeffrey Morgan	f5ca7f8c8e	add license in file header for vendored llama.cpp code (#3351)	2024-03-26 16:23:23 -04:00
Daniel Hiltgen	43799532c1	Bump llama.cpp to b2474 The release just before ggml-cuda.cu refactoring	2024-03-23 09:54:56 +01:00
Jeffrey Morgan	e95ffc7448	llama: remove server static assets (#3174)	2024-03-15 19:24:12 -07:00
Daniel Hiltgen	85129d3a32	Adapt our build for imported server.cpp	2024-03-12 14:57:15 -07:00
Daniel Hiltgen	9ac6440da3	Import server.cpp as of b2356	2024-03-12 13:58:06 -07:00

69 Commits