We should process a batch of tokens for each parallel request, rather than having a shared pool. Otherwise, a single request can fill the batch and subsequent requests will fail or be starved. server.cpp used the KV cache size allocated for each parallel request as the allocated size of the batch. That is an upper bound for the batch, but since we know how many tokens we will actually put in a batch, there is no need to over-allocate.
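A rough sketch of the sizing change follows; all of the names here (Batch, Sequence, numBatch, numCtx) are illustrative only and not the actual runner or binding API. The point is that each sequence allocates its own batch with capacity for numBatch tokens, rather than for the per-sequence KV cache size:

```go
package runner

// Batch holds the token slots for a single decode call. This is a sketch of
// the sizing principle only; the real runner wraps llama.cpp batches.
type Batch struct {
	Tokens []int32 // tokens submitted in one decode call
	Pos    []int32 // position of each token within its sequence
}

// newBatch allocates room for at most numBatch tokens. numBatch is the most
// we will ever submit in one decode call for a sequence, so there is no need
// to allocate numCtx (the per-sequence KV cache size) slots.
func newBatch(numBatch int) *Batch {
	return &Batch{
		Tokens: make([]int32, 0, numBatch),
		Pos:    make([]int32, 0, numBatch),
	}
}

// Sequence owns its own batch, so one request cannot starve the others by
// filling a shared pool.
type Sequence struct {
	batch *Batch
}

func newSequence(numBatch int) *Sequence {
	return &Sequence{batch: newBatch(numBatch)}
}
```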
runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.
./runner -model <model binary>
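A minimal sketch of how the server side might be wired up with Go's net/http. Only the -model flag, the port, and the endpoint paths come from the examples in this README; the handler internals are stubs standing in for actual model loading and inference.

```go
package main

import (
	"encoding/json"
	"flag"
	"log"
	"net/http"
)

// request mirrors the JSON body used in the curl examples below.
type request struct {
	Prompt string `json:"prompt"`
}

// handle decodes the request and echoes the prompt back; in the real runner,
// inference against the loaded model would happen here instead.
func handle(w http.ResponseWriter, r *http.Request) {
	var req request
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(map[string]string{"prompt": req.Prompt})
}

func main() {
	model := flag.String("model", "", "path to the model binary")
	flag.Parse()
	log.Printf("loading model: %s", *model)

	http.HandleFunc("/completion", handle)
	http.HandleFunc("/embeddings", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```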
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings
TODO
- Parallelization
- More tests