Currently, sampling uses an internal interface from the llama.cpp examples, which tends to change from release to release. This is the only such interface used for text models, though llava and clip are also used for image processing. This change switches to the stable llama.cpp interfaces, reducing the work needed for future llama.cpp bumps. It also significantly reduces the amount of code we need to vendor (much of which is unused but pulled in as a dependency). The sampling logic is unchanged for the parameters we support and is still done at the CGo layer; if there are benefits to reconfiguring it in the future, we can expose the primitives to native Go code.
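For illustration, here is a sketch of what sampling through the stable interface can look like at the CGo layer, assuming llama.cpp's `llama_sampler` chain API from `llama.h`. The function names are from the stable C API, but the wrapper itself (`sampleToken`, its parameters, and the cgo flags) is hypothetical and not the code in this change:

```go
package llama

/*
#cgo CFLAGS: -I.
#include "llama.h"
*/
import "C"

// sampleToken is an illustrative sketch: it builds a sampler chain with
// llama.cpp's stable C API and draws one token. Each stage filters or
// transforms the logits before the final probabilistic draw.
func sampleToken(ctx *C.struct_llama_context, topK int32, topP, temp float32, seed uint32) C.llama_token {
	chain := C.llama_sampler_chain_init(C.llama_sampler_chain_default_params())
	defer C.llama_sampler_free(chain)

	C.llama_sampler_chain_add(chain, C.llama_sampler_init_top_k(C.int32_t(topK)))
	C.llama_sampler_chain_add(chain, C.llama_sampler_init_top_p(C.float(topP), 1))
	C.llama_sampler_chain_add(chain, C.llama_sampler_init_temp(C.float(temp)))
	C.llama_sampler_chain_add(chain, C.llama_sampler_init_dist(C.uint32_t(seed)))

	// -1 samples from the logits of the most recently decoded token.
	return C.llama_sampler_sample(chain, ctx, -1)
}
```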
# runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.

```
./runner -model <model binary>
```
### Completion

```
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
```
### Embeddings

```
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings
```
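For orientation, a minimal sketch of how a runner like this could wire up the two endpoints above using Go's standard `net/http` package. The handler bodies, the flag handling, and anything beyond the `prompt` request field shown in the curl examples are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"encoding/json"
	"flag"
	"log"
	"net/http"
)

// request mirrors the JSON body shown in the curl examples above.
type request struct {
	Prompt string `json:"prompt"`
}

func main() {
	model := flag.String("model", "", "path to the model binary")
	flag.Parse()
	log.Printf("loading model: %s", *model) // model loading itself is elided

	http.HandleFunc("/completion", func(w http.ResponseWriter, r *http.Request) {
		var req request
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Run inference on req.Prompt and write the generated text here.
	})

	http.HandleFunc("/embeddings", func(w http.ResponseWriter, r *http.Request) {
		var req request
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Compute and return an embedding for req.Prompt here.
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```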