Daniel Hiltgen 923b329481 llama: wire up builtin runner
This adds a new entrypoint into the ollama CLI to run the cgo-built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest-common-denominator CPU build. After we fully transition
to the new Go runners, more tech debt can be removed and we can stop building
the "default" runner via make, relying on the builtin always.
2024-10-29 09:56:08 -07:00

runner

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings