Currently, sampling uses an internal interface from the llama.cpp examples, which tends to change from release to release. This is the only such interface used for text models, though llava and clip are also used for image processing. This change switches to the stable llama.cpp interfaces, reducing the work needed for future llama.cpp bumps. It also significantly reduces the amount of code we need to vendor (much of which is unused but pulled in as a dependency). The sampling logic is unchanged for the parameters we support and is still done at the CGo layer; if there are benefits to reconfiguring it in the future, we can expose the primitives to native Go code.
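For illustration, here is a sketch of what sampling through the stable interface can look like at the CGo layer, assuming llama.cpp's `llama_sampler` chain API from `llama.h`. The function names are from the stable C API, but the wrapper itself (`sampleToken`, its parameters, and the cgo flags) is hypothetical and not the code in this change:

```go
package llama

/*
#cgo CFLAGS: -I.
#include "llama.h"
*/
import "C"

// sampleToken is an illustrative sketch: it builds a sampler chain with
// llama.cpp's stable C API and draws one token. Each stage filters or
// transforms the logits before the final probabilistic draw.
func sampleToken(ctx *C.struct_llama_context, topK int32, topP, temp float32, seed uint32) C.llama_token {
	chain := C.llama_sampler_chain_init(C.llama_sampler_chain_default_params())
	defer C.llama_sampler_free(chain)

	C.llama_sampler_chain_add(chain, C.llama_sampler_init_top_k(C.int32_t(topK)))
	C.llama_sampler_chain_add(chain, C.llama_sampler_init_top_p(C.float(topP), 1))
	C.llama_sampler_chain_add(chain, C.llama_sampler_init_temp(C.float(temp)))
	C.llama_sampler_chain_add(chain, C.llama_sampler_init_dist(C.uint32_t(seed)))

	// -1 samples from the logits of the most recently decoded token.
	return C.llama_sampler_sample(chain, ctx, -1)
}
```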
# runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.

```
./runner -model <model binary>
```
### Completion

```
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
```
### Embeddings

```
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings
```
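For orientation, a minimal sketch of how a runner like this could wire up the two endpoints above using Go's standard `net/http` package. The handler bodies, the flag handling, and anything beyond the `prompt` request field shown in the curl examples are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"encoding/json"
	"flag"
	"log"
	"net/http"
)

// request mirrors the JSON body shown in the curl examples above.
type request struct {
	Prompt string `json:"prompt"`
}

func main() {
	model := flag.String("model", "", "path to the model binary")
	flag.Parse()
	log.Printf("loading model: %s", *model) // model loading itself is elided

	http.HandleFunc("/completion", func(w http.ResponseWriter, r *http.Request) {
		var req request
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Run inference on req.Prompt and write the generated text here.
	})

	http.HandleFunc("/embeddings", func(w http.ResponseWriter, r *http.Request) {
		var req request
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Compute and return an embedding for req.Prompt here.
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```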