Jesse Gross 53b600921e runner.go: Hold mutex for entire time when processing batch
It is not safe to hold a mutex only while we are waiting for the
condition variable to signal that a new sequence has been added. It's
possible that a sequence could be added in the middle of batch
processing. For example, if a new sequence is added while Decode()
is running, it will get picked up for sampling, despite not having
been added to the original batch.

This change holds a mutex for the majority of the time when active
processing is happening, releasing it only for a brief period each
time around the loop. Depending on the workload and the scheduler
it may result in unfairness between different requests. However,
this was not actually observed in testing.

This addresses the correctness issue - better performance and fairness
can be achieved with additional improvements in the future.
2024-09-03 21:15:14 -04:00

runner

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings

TODO

  • Parallelization
  • More tests