Matt Williams e2389b63aa add examples of streaming in python and node
Signed-off-by: Matt Williams <m@technovangelist.com>
2023-09-14 07:12:09 -07:00


API

Endpoints

Conventions

Model names

Model names follow a model:tag format. Some examples are orca-mini:3b-q4_1 and llama2:70b. The tag is optional; if not provided, it defaults to latest. The tag identifies a specific version of the model.

Durations

All durations are returned in nanoseconds.

Streams

Many API responses are streams of JSON objects showing the current status. For examples of working with streams in various languages, see streaming.md.

Generate a completion

POST /api/generate

Generate a response for a given prompt with a provided model. This is a streaming endpoint, so the reply is a series of responses. The final response object includes statistics and additional data from the request.

Parameters

  • model: (required) the model name
  • prompt: the prompt to generate a response for

Advanced parameters:

  • options: additional model parameters listed in the documentation for the Modelfile such as temperature
  • system: system prompt to use (overrides what is defined in the Modelfile)
  • template: the full prompt or prompt template (overrides what is defined in the Modelfile)
  • context: the context parameter returned from a previous request to /generate; this can be used to keep a short conversational memory

Request

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2:7b",
  "prompt": "Why is the sky blue?"
}'

Response

A stream of JSON objects:

{
  "model": "llama2:7b",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "response": "The",
  "done": false
}
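The streamed body is newline-delimited JSON, so a client can parse it line by line and concatenate the partial response fields. A minimal sketch (the function name is illustrative; with a real server the input would be the lines of the HTTP response body, e.g. from a streaming HTTP client):

```python
import json

def collect_response(lines):
    """Concatenate the partial "response" fields from a stream of
    newline-delimited JSON objects, stopping at the object with
    "done": true."""
    parts = []
    for line in lines:
        obj = json.loads(line)
        if obj.get("response"):
            parts.append(obj["response"])
        if obj.get("done"):
            break
    return "".join(parts)
```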

The final response in the stream also includes additional data about the generation:

  • total_duration: time spent in nanoseconds generating the response
  • load_duration: time spent in nanoseconds loading the model
  • sample_count: number of samples generated
  • sample_duration: time spent in nanoseconds generating samples
  • prompt_eval_count: number of tokens in the prompt
  • prompt_eval_duration: time spent in nanoseconds evaluating the prompt
  • eval_count: number of tokens in the response
  • eval_duration: time spent in nanoseconds generating the response
  • context: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory

To calculate how fast the response is generated in tokens per second (token/s), divide eval_count by eval_duration and multiply by 10^9 (durations are in nanoseconds).

{
  "model": "llama2:7b",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "context": [1, 2, 3],
  "done": true,
  "total_duration": 5589157167,
  "load_duration": 3013701500,
  "sample_count": 114,
  "sample_duration": 81442000,
  "prompt_eval_count": 46,
  "prompt_eval_duration": 1160282000,
  "eval_count": 113,
  "eval_duration": 1325948000
}
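Using the eval_count and eval_duration values from the final response above, the calculation works out as:

```python
# Values taken from the example final response in this document.
eval_count = 113            # tokens generated
eval_duration = 1325948000  # nanoseconds spent generating them

# Durations are in nanoseconds, so scale by 1e9 to get tokens per second.
tokens_per_second = eval_count / eval_duration * 1e9
print(round(tokens_per_second, 1))
```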

Create a Model

POST /api/create

Create a model from a Modelfile

Parameters

  • name: name of the model to create
  • path: path to the Modelfile

Request

curl -X POST http://localhost:11434/api/create -d '{
  "name": "mario",
  "path": "~/Modelfile"
}'

Response

A stream of JSON objects. When finished, status is success.

{
  "status": "parsing modelfile"
}
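A client can drain these status objects until the final success arrives. A minimal sketch, assuming one JSON object per line of the response body (the function name is illustrative):

```python
import json

def wait_for_success(lines):
    """Read streamed status objects until the final "success" status,
    returning the list of statuses seen along the way."""
    statuses = []
    for line in lines:
        status = json.loads(line)["status"]
        statuses.append(status)
        if status == "success":
            break
    return statuses
```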

List Local Models

GET /api/tags

List models that are available locally.

Request

curl http://localhost:11434/api/tags

Response

{
  "models": [
    {
      "name": "llama2:7b",
      "modified_at": "2023-08-02T17:02:23.713454393-07:00",
      "size": 3791730596
    },
    {
      "name": "llama2:13b",
      "modified_at": "2023-08-08T12:08:38.093596297-07:00",
      "size": 7323310500
    }
  ]
}
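Since this is a plain GET returning a single JSON object, a few lines of standard-library Python suffice. A sketch assuming an Ollama server on the default port (list_models and model_names are hypothetical helper names, not part of the API):

```python
import json
import urllib.request

def model_names(payload):
    """Extract the model names from a parsed /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_models(base_url="http://localhost:11434"):
    # Assumes a local Ollama server on the default port.
    with urllib.request.urlopen(base_url + "/api/tags") as resp:
        return model_names(json.load(resp))
```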

Copy a Model

POST /api/copy

Copy a model. Creates a model with another name from an existing model.

Request

curl http://localhost:11434/api/copy -d '{
  "source": "llama2:7b",
  "destination": "llama2-backup"
}'

Delete a Model

DELETE /api/delete

Delete a model and its data.

Parameters

  • name: name of the model to delete

Request

curl -X DELETE http://localhost:11434/api/delete -d '{
  "name": "llama2:13b"
}'

Pull a Model

POST /api/pull

Download a model from the model registry. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress.

Parameters

  • name: name of the model to pull

Request

curl -X POST http://localhost:11434/api/pull -d '{
  "name": "llama2:7b"
}'

Response

{
  "status": "downloading digestname",
  "digest": "digestname",
  "total": 2142590208
}
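The total field is a raw byte count, so a client displaying pull progress will usually want to format it. A small display helper, not part of the API (the function name is illustrative):

```python
def human_size(n):
    """Format a byte count like the "total" field above for display."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return "%.1f %s" % (n, unit)
        n /= 1024
```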

Generate Embeddings

POST /api/embeddings

Generate embeddings from a model

Parameters

  • model: name of model to generate embeddings from
  • prompt: text to generate embeddings for

Advanced parameters:

  • options: additional model parameters listed in the documentation for the Modelfile such as temperature

Request

curl -X POST http://localhost:11434/api/embeddings -d '{
  "model": "llama2:7b",
  "prompt": "Here is an article about llamas..."
}'

Response

{
  "embeddings": [
    0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313,
    0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
  ]
}
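Embedding vectors like the one above are typically compared with cosine similarity; two texts with similar meaning should produce vectors with a similarity close to 1. A minimal sketch, not part of the API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```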