# runner

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

```
./runner -model <model binary>
```

If the number of input tokens exceeds the batch size, the input is split across multiple batches, each containing the next set of tokens, so the full prompt is processed as expected.
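For orientation, here is a minimal sketch of that chunking behavior. The names (`splitIntoBatches`, plain integer token IDs) are illustrative, not the runner's actual internals:

```go
package main

import "fmt"

// splitIntoBatches splits a token sequence into batch-size chunks so that
// each submitted batch carries the next slice of the prompt.
func splitIntoBatches(tokens []int, batchSize int) [][]int {
	var batches [][]int
	for start := 0; start < len(tokens); start += batchSize {
		end := start + batchSize
		if end > len(tokens) {
			end = len(tokens)
		}
		batches = append(batches, tokens[start:end])
	}
	return batches
}

func main() {
	tokens := []int{1, 2, 3, 4, 5, 6, 7}
	// With a batch size of 3, the prompt is processed as [1 2 3], [4 5 6], [7].
	for i, b := range splitIntoBatches(tokens, 3) {
		fmt.Printf("batch %d: %v\n", i, b)
	}
}
```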
## Completion

```
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
```
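As a rough sketch, a handler for this endpoint could be wired up with Go's `net/http` as below. Only the `prompt` request field comes from the curl example above; the `content` response field and the `generate` helper are assumptions, not the runner's actual implementation:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type completionRequest struct {
	Prompt string `json:"prompt"`
}

func completionHandler(w http.ResponseWriter, r *http.Request) {
	// Decode the JSON body, e.g. {"prompt": "hi"}.
	var req completionRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// generate is a hypothetical stand-in for running inference on the model.
	out := generate(req.Prompt)
	json.NewEncoder(w).Encode(map[string]string{"content": out})
}

// Placeholder; the real runner would run the loaded model here.
func generate(prompt string) string { return "..." }

func main() {
	http.HandleFunc("/completion", completionHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```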
## Embeddings

```
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings
```
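The same request can be made from Go. In this small client sketch the response body is printed raw, since its exact JSON shape is not documented here:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Request body matching the curl example above.
	body := []byte(`{"prompt": "turn me into an embedding"}`)
	resp, err := http.Post("http://localhost:8080/embeddings", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print the raw JSON response.
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```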
## TODO

- Parallelization
- More tests