We should process a batch of tokens for each parallel request, rather than having a shared pool. Otherwise, a single request can fill the batch and subsequent requests will fail or be starved. server.cpp used the KV cache size allocated for each parallel request as the allocated size of the batch. That is an upper bound for the batch, but since we know how many tokens we will actually put in a batch, there is no need to over-allocate.
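A rough sketch of the sizing change follows; all of the names here (Batch, Sequence, numBatch, numCtx) are illustrative only and not the actual runner or binding API. The point is that each sequence allocates its own batch with capacity for numBatch tokens, rather than for the per-sequence KV cache size:

```go
package runner

// Batch holds the token slots for a single decode call. This is a sketch of
// the sizing principle only; the real runner wraps llama.cpp batches.
type Batch struct {
	Tokens []int32 // tokens submitted in one decode call
	Pos    []int32 // position of each token within its sequence
}

// newBatch allocates room for at most numBatch tokens. numBatch is the most
// we will ever submit in one decode call for a sequence, so there is no need
// to allocate numCtx (the per-sequence KV cache size) slots.
func newBatch(numBatch int) *Batch {
	return &Batch{
		Tokens: make([]int32, 0, numBatch),
		Pos:    make([]int32, 0, numBatch),
	}
}

// Sequence owns its own batch, so one request cannot starve the others by
// filling a shared pool.
type Sequence struct {
	batch *Batch
}

func newSequence(numBatch int) *Sequence {
	return &Sequence{batch: newBatch(numBatch)}
}
```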
runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.
./runner -model <model binary>
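A minimal sketch of how the server side might be wired up with Go's net/http. Only the -model flag, the port, and the endpoint paths come from the examples in this README; the handler internals are stubs standing in for actual model loading and inference.

```go
package main

import (
	"encoding/json"
	"flag"
	"log"
	"net/http"
)

// request mirrors the JSON body used in the curl examples below.
type request struct {
	Prompt string `json:"prompt"`
}

// handle decodes the request and echoes the prompt back; in the real runner,
// inference against the loaded model would happen here instead.
func handle(w http.ResponseWriter, r *http.Request) {
	var req request
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(map[string]string{"prompt": req.Prompt})
}

func main() {
	model := flag.String("model", "", "path to the model binary")
	flag.Parse()
	log.Printf("loading model: %s", *model)

	http.HandleFunc("/completion", handle)
	http.HandleFunc("/embeddings", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```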
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings
TODO
- Parallelization
- More tests