Compare commits

...

29 Commits

Author SHA1 Message Date
Blake Mizerany
67691e410d
cmd: preserve exact bytes when displaying template/system layers (#7586) 2024-11-13 23:53:30 -08:00
Jesse Gross
d7eb05b936 runner.go: Fix off-by-one for num predicted 2024-11-12 11:35:57 -08:00
Daniel Hiltgen
636a743c2b
CI: give windows lint more time (#7635)
It looks like 8 minutes isn't quite enough and we're seeing sporadic timeouts
2024-11-12 11:22:39 -08:00
Daniel Hiltgen
df011054fa
Jetpack support for Go server (#7217)
This adds support for the Jetson JetPack variants into the Go runner
2024-11-12 10:31:52 -08:00
Daniel Hiltgen
ac07160c8d
doc: capture numeric group requirement (#6941)
Docker uses the container filesystem for name resolution, so we can't guide users
to use the name of the host group.  Instead they must specify the numeric ID.
2024-11-12 09:13:23 -08:00
Daniel Hiltgen
6606e4243c
docs: Capture docker cgroup workaround (#7519)
GPU support can break on some systems after a while.  This captures a
known workaround to solve the problem.
2024-11-12 09:12:50 -08:00
Jesse Gross
65973ceb64 runner.go: Make KV entry accounting more robust
The structure of the accounting for KV cache shifting was carried
over from the old runner but it now doesn't feel natural with the new
runner. There are a number of invariants that should hold true but
are difficult to reason about. There is at least one bug report
that would imply that the invariants are not holding.

This reduces the number of implicit assumptions and is more forgiving
of unexpected situations. It also improves behavior around which input
tokens are kept when truncation occurs.

Bug #7545
2024-11-11 20:23:03 -08:00
Joey Zheng
bebef1e50d
readme: add aichat terminal app to community integrations (#7418) 2024-11-11 16:44:46 -08:00
Evan
d48c1c5a44
api: fix typos in Go Doc comments (#7620) 2024-11-11 16:21:58 -08:00
Prasad Bhalerao
36a8372b28
readme: add GoLamify to community integrations (#7521) 2024-11-10 22:38:18 -08:00
Ivo Stoykov
4e94227b5d
readme: add browser extension that enables using Ollama for interacting with web pages (#5827) 2024-11-10 22:14:22 -08:00
frances720
479d551766
docs: add mentions of Llama 3.2 (#7517) 2024-11-10 19:04:23 -08:00
Evan
76b2b723b2
api: fix typo in python ClientFromEnvironment docs (#7604) 2024-11-10 17:30:27 -08:00
Arhan Busam
b8d77cdeab
readme: add llama3.2-vision to model list (#7580) 2024-11-10 13:36:25 -08:00
Jesse Gross
c2e8cbaa14 runner.go: Check for zero length images
If we get a request with a zero length image, it will result in
an out-of-bounds error when we pass the data to the image encoder.
2024-11-08 09:39:32 -08:00
Edward J. Schwartz
771fab1dd8
docs: update langchainpy.md with proper model name (#7527) 2024-11-08 09:36:17 -08:00
Daniel Hiltgen
3a5239e6bf
Set macos min version for all architectures (#7579) 2024-11-08 09:27:04 -08:00
Daniel Hiltgen
3d25e7bf8c
win: remove preview title from installer (#7529)
This should have been in #7347 but was overlooked.
2024-11-07 14:26:47 -08:00
Daniel Hiltgen
1618700c5a
Workaround buggy P2P ROCm copy on windows (#7466)
This enables the workaround code only for windows which should help windows users with multiple AMD GPUs
2024-11-07 14:26:31 -08:00
Daniel Hiltgen
b111aa5a91
Debug logging for nvcuda init (#7532)
Some users are reporting crashes during nvcuda.dll initialization
on windows.  This should help narrow down where things are going bad.
2024-11-07 14:25:53 -08:00
Daniel Hiltgen
9e83e550e1
Align rocm compiler flags (#7467)
Bring consistency with the old generate script behavior
2024-11-07 10:20:50 -08:00
Daniel Hiltgen
fc2a0715df
Be explicit for gpu library link dir (#7560)
On linux nvcc isn't automatically linking to the same cuda version.
2024-11-07 09:20:40 -08:00
Jesse Gross
3020d2dc58 docs: OLLAMA_NEW_RUNNERS no longer exists 2024-11-06 14:39:02 -08:00
Jesse Gross
a909417602 runner.go: Remove unused arguments
Now that server.cpp is gone, we don't need to keep passing arguments
that were only ignored and only kept for compatibility.
2024-11-06 13:32:18 -08:00
Jesse Gross
6cd566872b sched: Lift parallel restriction for multimodal models except mllama
The Go runner does not have a problem with supporting parallel
requests for most multimodal models. Now that we won't be potentially
falling back to server.cpp, this restriction can be lifted.

However, the new mllama model can't support parallel requests, so we
will need to keep a restriction for that.
2024-11-06 13:32:18 -08:00
RAPID ARCHITECT
9d71bcc3e2
Update README.md (#7516)
added reddit rate below hexabot, ollama powered reddit search and analysis with streamlit for the interface
2024-11-05 15:07:25 -08:00
Daniel Hiltgen
a4c70fe157
One corrupt manifest should not wedge model operations (#7515)
One potential failure mode is an empty file which bubbles up as an EOF error,
leading to all pulls and listing operations failing.  Instead, continue and
warn about the corrupt manifest.  This also allows re-pulling the corrupt
manifest to repair the system.
2024-11-05 14:21:45 -08:00
Jesse Gross
34a75102f7 prompt: Use a single token when estimating mllama context size
Currently we assume that images take 768 tokens of context size for
the purposes of clipping old messages that exceed the context window.
However, our mllama implementation stores the full image embedding
in a single token. As a result, there is significant waste of context
space.

Ideally, we would handle this more generically and have the
implementation report the number of tokens. However, at the moment
this would just result in a similar set of 'if' conditions in the
runner plus APIs to report it back. So for now, we just keep this
simple.
2024-11-05 10:11:50 -08:00
Med Marrouchi
4157d1f7b6
readme: add Hexabot to the list of community integrations 2024-11-05 09:06:38 -08:00
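The corrupt-manifest commit above (a4c70fe157) describes warning and continuing when one manifest cannot be read, rather than failing every pull and list operation. The following is a minimal sketch of that pattern, not the project's code: the `manifest` type, the `loadManifests` helper, and the in-memory `raw` map are all hypothetical stand-ins for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// manifest is a hypothetical, pared-down stand-in for a model manifest.
type manifest struct {
	SchemaVersion int `json:"schemaVersion"`
}

// loadManifests parses every entry and skips, with a warning, any entry
// that fails to parse instead of returning the first error.
func loadManifests(raw map[string][]byte) map[string]manifest {
	out := make(map[string]manifest, len(raw))
	for name, data := range raw {
		var m manifest
		if err := json.Unmarshal(data, &m); err != nil {
			log.Printf("warning: skipping corrupt manifest %q: %v", name, err)
			continue
		}
		out[name] = m
	}
	return out
}

func main() {
	raw := map[string][]byte{
		"good":  []byte(`{"schemaVersion": 2}`),
		"empty": {}, // simulates a zero-byte manifest file
	}
	// Only the parseable manifest survives; the empty one is logged and skipped.
	fmt.Println(len(loadManifests(raw)))
}
```

Because the bad entry is skipped rather than treated as fatal, the remaining manifests stay usable, and the corrupt one can be replaced later by pulling it again.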
35 changed files with 265 additions and 168 deletions


@ -281,7 +281,7 @@ jobs:
shell: bash shell: bash
- uses: golangci/golangci-lint-action@v6 - uses: golangci/golangci-lint-action@v6
with: with:
args: --timeout 8m0s -v args: --timeout 10m0s -v
test: test:
strategy: strategy:
matrix: matrix:


@ -5,6 +5,8 @@ ARG CUDA_V11_ARCHITECTURES="50;52;53;60;61;62;70;72;75;80;86"
ARG CUDA_VERSION_12=12.4.0 ARG CUDA_VERSION_12=12.4.0
ARG CUDA_V12_ARCHITECTURES="60;61;62;70;72;75;80;86;87;89;90;90a" ARG CUDA_V12_ARCHITECTURES="60;61;62;70;72;75;80;86;87;89;90;90a"
ARG ROCM_VERSION=6.1.2 ARG ROCM_VERSION=6.1.2
ARG JETPACK_6=r36.2.0
ARG JETPACK_5=r35.4.1
### To create a local image for building linux binaries on mac or windows with efficient incremental builds ### To create a local image for building linux binaries on mac or windows with efficient incremental builds
# #
@ -13,7 +15,7 @@ ARG ROCM_VERSION=6.1.2
# #
### Then incremental builds will be much faster in this container ### Then incremental builds will be much faster in this container
# #
# make -C llama -j 10 && go build -trimpath -o dist/linux-amd64/ollama . # make -j 10 && go build -trimpath -o dist/linux-amd64/ollama .
# #
FROM --platform=linux/amd64 rocm/dev-centos-7:${ROCM_VERSION}-complete AS unified-builder-amd64 FROM --platform=linux/amd64 rocm/dev-centos-7:${ROCM_VERSION}-complete AS unified-builder-amd64
ARG CMAKE_VERSION ARG CMAKE_VERSION
@ -76,9 +78,9 @@ ARG CUDA_V12_ARCHITECTURES
ARG OLLAMA_FAST_BUILD ARG OLLAMA_FAST_BUILD
RUN --mount=type=cache,target=/root/.ccache \ RUN --mount=type=cache,target=/root/.ccache \
if grep "^flags" /proc/cpuinfo|grep avx>/dev/null; then \ if grep "^flags" /proc/cpuinfo|grep avx>/dev/null; then \
make -C llama -j $(expr $(nproc) / 2 ) ; \ make -j $(expr $(nproc) / 2 ) ; \
else \ else \
make -C llama -j 5 ; \ make -j 5 ; \
fi fi
FROM --platform=linux/arm64 unified-builder-arm64 AS runners-arm64 FROM --platform=linux/arm64 unified-builder-arm64 AS runners-arm64
@ -90,7 +92,46 @@ ARG CUDA_V11_ARCHITECTURES
ARG CUDA_V12_ARCHITECTURES ARG CUDA_V12_ARCHITECTURES
ARG OLLAMA_FAST_BUILD ARG OLLAMA_FAST_BUILD
RUN --mount=type=cache,target=/root/.ccache \ RUN --mount=type=cache,target=/root/.ccache \
make -C llama -j 8 make -j 5
# Jetsons need to be built in discrete stages
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK_5} AS runners-jetpack5-arm64
ARG GOLANG_VERSION
RUN apt-get update && apt-get install -y git curl ccache && \
curl -s -L https://dl.google.com/go/go${GOLANG_VERSION}.linux-arm64.tar.gz | tar xz -C /usr/local && \
ln -s /usr/local/go/bin/go /usr/local/bin/go && \
ln -s /usr/local/go/bin/gofmt /usr/local/bin/gofmt && \
apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /go/src/github.com/ollama/ollama/
COPY . .
ARG CGO_CFLAGS
ENV GOARCH arm64
RUN --mount=type=cache,target=/root/.ccache \
make -j 5 cuda_v11 \
CUDA_ARCHITECTURES="72;87" \
GPU_RUNNER_VARIANT=_jetpack5 \
CGO_EXTRA_LDFLAGS_LINUX=-L/usr/local/cuda/lib64/stubs \
DIST_LIB_DIR=/go/src/github.com/ollama/ollama/dist/linux-arm64-jetpack5/lib/ollama \
DIST_GPU_RUNNER_DEPS_DIR=/go/src/github.com/ollama/ollama/dist/linux-arm64-jetpack5/lib/ollama/cuda_jetpack5
FROM --platform=linux/arm64 nvcr.io/nvidia/l4t-jetpack:${JETPACK_6} AS runners-jetpack6-arm64
ARG GOLANG_VERSION
RUN apt-get update && apt-get install -y git curl ccache && \
curl -s -L https://dl.google.com/go/go${GOLANG_VERSION}.linux-arm64.tar.gz | tar xz -C /usr/local && \
ln -s /usr/local/go/bin/go /usr/local/bin/go && \
ln -s /usr/local/go/bin/gofmt /usr/local/bin/gofmt && \
apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /go/src/github.com/ollama/ollama/
COPY . .
ARG CGO_CFLAGS
ENV GOARCH arm64
RUN --mount=type=cache,target=/root/.ccache \
make -j 5 cuda_v12 \
CUDA_ARCHITECTURES="87" \
GPU_RUNNER_VARIANT=_jetpack6 \
CGO_EXTRA_LDFLAGS_LINUX=-L/usr/local/cuda/lib64/stubs \
DIST_LIB_DIR=/go/src/github.com/ollama/ollama/dist/linux-arm64-jetpack6/lib/ollama \
DIST_GPU_RUNNER_DEPS_DIR=/go/src/github.com/ollama/ollama/dist/linux-arm64-jetpack6/lib/ollama/cuda_jetpack6
# Intermediate stages used for ./scripts/build_linux.sh # Intermediate stages used for ./scripts/build_linux.sh
@ -134,12 +175,20 @@ FROM --platform=linux/arm64 builder-arm64 AS build-arm64
COPY . . COPY . .
COPY --from=runners-arm64 /go/src/github.com/ollama/ollama/dist/ dist/ COPY --from=runners-arm64 /go/src/github.com/ollama/ollama/dist/ dist/
COPY --from=runners-arm64 /go/src/github.com/ollama/ollama/build/ build/ COPY --from=runners-arm64 /go/src/github.com/ollama/ollama/build/ build/
COPY --from=runners-jetpack5-arm64 /go/src/github.com/ollama/ollama/dist/ dist/
COPY --from=runners-jetpack5-arm64 /go/src/github.com/ollama/ollama/build/ build/
COPY --from=runners-jetpack6-arm64 /go/src/github.com/ollama/ollama/dist/ dist/
COPY --from=runners-jetpack6-arm64 /go/src/github.com/ollama/ollama/build/ build/
ARG GOFLAGS ARG GOFLAGS
ARG CGO_CFLAGS ARG CGO_CFLAGS
RUN --mount=type=cache,target=/root/.ccache \ RUN --mount=type=cache,target=/root/.ccache \
go build -trimpath -o dist/linux-arm64/bin/ollama . go build -trimpath -o dist/linux-arm64/bin/ollama .
RUN cd dist/linux-$GOARCH && \ RUN cd dist/linux-$GOARCH && \
tar --exclude runners -cf - . | pigz --best > ../ollama-linux-$GOARCH.tgz tar --exclude runners -cf - . | pigz --best > ../ollama-linux-$GOARCH.tgz
RUN cd dist/linux-$GOARCH-jetpack5 && \
tar --exclude runners -cf - . | pigz --best > ../ollama-linux-$GOARCH-jetpack5.tgz
RUN cd dist/linux-$GOARCH-jetpack6 && \
tar --exclude runners -cf - . | pigz --best > ../ollama-linux-$GOARCH-jetpack6.tgz
FROM --platform=linux/amd64 scratch AS dist-amd64 FROM --platform=linux/amd64 scratch AS dist-amd64
COPY --from=build-amd64 /go/src/github.com/ollama/ollama/dist/ollama-linux-*.tgz / COPY --from=build-amd64 /go/src/github.com/ollama/ollama/dist/ollama-linux-*.tgz /
@ -180,16 +229,23 @@ RUN rm -rf \
FROM --platform=linux/amd64 ubuntu:22.04 AS runtime-amd64 FROM --platform=linux/amd64 ubuntu:22.04 AS runtime-amd64
RUN apt-get update && \ RUN apt-get update && \
apt-get install -y ca-certificates && \ apt-get install -y ca-certificates && \
rm -rf /var/lib/apt/lists/* apt-get clean && rm -rf /var/lib/apt/lists/*
COPY --from=container-build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/bin/ /bin/ COPY --from=container-build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/bin/ /bin/
COPY --from=runners-cuda-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/lib/ /lib/ COPY --from=runners-cuda-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/lib/ /lib/
FROM --platform=linux/arm64 ubuntu:22.04 AS runtime-arm64 FROM --platform=linux/arm64 ubuntu:22.04 AS runtime-arm64
COPY --from=build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64-jetpack5/lib/ /lib/
COPY --from=build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64-jetpack6/lib/ /lib/
RUN apt-get update && \ RUN apt-get update && \
apt-get install -y ca-certificates && \ apt-get install -y ca-certificates && \
rm -rf /var/lib/apt/lists/* apt-get clean && rm -rf /var/lib/apt/lists/*
COPY --from=container-build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/bin/ /bin/ COPY --from=container-build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/bin/ /bin/
COPY --from=runners-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/ COPY --from=cpu-build-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/
COPY --from=cuda-11-build-runner-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/
COPY --from=cuda-12-build-runner-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/
COPY --from=cuda-build-jetpack5-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/
COPY --from=cuda-build-jetpack6-arm64 /go/src/github.com/ollama/ollama/dist/linux-arm64/lib/ /lib/
# ROCm libraries larger so we keep it distinct from the CPU/CUDA image # ROCm libraries larger so we keep it distinct from the CPU/CUDA image
FROM --platform=linux/amd64 ubuntu:22.04 AS runtime-rocm FROM --platform=linux/amd64 ubuntu:22.04 AS runtime-rocm
@ -198,7 +254,7 @@ FROM --platform=linux/amd64 ubuntu:22.04 AS runtime-rocm
COPY --from=build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64-rocm/lib/ /lib/ COPY --from=build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64-rocm/lib/ /lib/
RUN apt-get update && \ RUN apt-get update && \
apt-get install -y ca-certificates && \ apt-get install -y ca-certificates && \
rm -rf /var/lib/apt/lists/* apt-get clean && rm -rf /var/lib/apt/lists/*
COPY --from=container-build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/bin/ /bin/ COPY --from=container-build-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/bin/ /bin/
COPY --from=runners-rocm-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/lib/ /lib/ COPY --from=runners-rocm-amd64 /go/src/github.com/ollama/ollama/dist/linux-amd64/lib/ /lib/


@ -48,9 +48,11 @@ Ollama supports a list of models available on [ollama.com/library](https://ollam
Here are some example models that can be downloaded: Here are some example models that can be downloaded:
| Model | Parameters | Size | Download | | Model | Parameters | Size | Download |
| ------------------ | ---------- | ----- | ------------------------------ | | ------------------ | ---------- | ----- | -------------------------------- |
| Llama 3.2 | 3B | 2.0GB | `ollama run llama3.2` | | Llama 3.2 | 3B | 2.0GB | `ollama run llama3.2` |
| Llama 3.2 | 1B | 1.3GB | `ollama run llama3.2:1b` | | Llama 3.2 | 1B | 1.3GB | `ollama run llama3.2:1b` |
| Llama 3.2 Vision | 11B | 7.9GB | `ollama run llama3.2-vision` |
| Llama 3.2 Vision | 90B | 55GB | `ollama run llama3.2-vision:90b` |
| Llama 3.1 | 8B | 4.7GB | `ollama run llama3.1` | | Llama 3.1 | 8B | 4.7GB | `ollama run llama3.1` |
| Llama 3.1 | 70B | 40GB | `ollama run llama3.1:70b` | | Llama 3.1 | 70B | 40GB | `ollama run llama3.1:70b` |
| Llama 3.1 | 405B | 231GB | `ollama run llama3.1:405b` | | Llama 3.1 | 405B | 231GB | `ollama run llama3.1:405b` |
@ -331,6 +333,8 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [ARGO](https://github.com/xark-argo/argo) (Locally download and run Ollama and Huggingface models with RAG on Mac/Windows/Linux) - [ARGO](https://github.com/xark-argo/argo) (Locally download and run Ollama and Huggingface models with RAG on Mac/Windows/Linux)
- [G1](https://github.com/bklieger-groq/g1) (Prototype of using prompting strategies to improve the LLM's reasoning through o1-like reasoning chains.) - [G1](https://github.com/bklieger-groq/g1) (Prototype of using prompting strategies to improve the LLM's reasoning through o1-like reasoning chains.)
- [Ollama App](https://github.com/JHubi1/ollama-app) (Modern and easy-to-use multi-platform client for Ollama) - [Ollama App](https://github.com/JHubi1/ollama-app) (Modern and easy-to-use multi-platform client for Ollama)
- [Hexabot](https://github.com/hexastack/hexabot) (A conversational AI builder)
- [Reddit Rate](https://github.com/rapidarchitect/reddit_analyzer) (Search and Rate Reddit topics with a weighted summation)
### Terminal ### Terminal
@ -357,6 +361,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [Ollama eBook Summary](https://github.com/cognitivetech/ollama-ebook-summary/) - [Ollama eBook Summary](https://github.com/cognitivetech/ollama-ebook-summary/)
- [Ollama Mixture of Experts (MOE) in 50 lines of code](https://github.com/rapidarchitect/ollama_moe) - [Ollama Mixture of Experts (MOE) in 50 lines of code](https://github.com/rapidarchitect/ollama_moe)
- [vim-intelligence-bridge](https://github.com/pepo-ec/vim-intelligence-bridge) Simple interaction of "Ollama" with the Vim editor - [vim-intelligence-bridge](https://github.com/pepo-ec/vim-intelligence-bridge) Simple interaction of "Ollama" with the Vim editor
- [aichat](https://github.com/sigoden/aichat) All-in-one LLM CLI tool featuring Shell Assistant, Chat-REPL, RAG, AI tools & agents, with access to OpenAI, Claude, Gemini, Ollama, Groq, and more.
### Apple Vision Pro ### Apple Vision Pro
- [Enchanted](https://github.com/AugustDev/enchanted) - [Enchanted](https://github.com/AugustDev/enchanted)
@ -413,6 +418,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [Ollama PHP](https://github.com/ArdaGnsrn/ollama-php) - [Ollama PHP](https://github.com/ArdaGnsrn/ollama-php)
- [Agents-Flex for Java](https://github.com/agents-flex/agents-flex) with [example](https://github.com/agents-flex/agents-flex/tree/main/agents-flex-llm/agents-flex-llm-ollama/src/test/java/com/agentsflex/llm/ollama) - [Agents-Flex for Java](https://github.com/agents-flex/agents-flex) with [example](https://github.com/agents-flex/agents-flex/tree/main/agents-flex-llm/agents-flex-llm-ollama/src/test/java/com/agentsflex/llm/ollama)
- [Ollama for Swift](https://github.com/mattt/ollama-swift) - [Ollama for Swift](https://github.com/mattt/ollama-swift)
- [GoLamify](https://github.com/prasad89/golamify)
### Mobile ### Mobile
@ -450,6 +456,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [Discord-Ollama Chat Bot](https://github.com/kevinthedang/discord-ollama) (Generalized TypeScript Discord Bot w/ Tuning Documentation) - [Discord-Ollama Chat Bot](https://github.com/kevinthedang/discord-ollama) (Generalized TypeScript Discord Bot w/ Tuning Documentation)
- [Discord AI chat/moderation bot](https://github.com/rapmd73/Companion) Chat/moderation bot written in python. Uses Ollama to create personalities. - [Discord AI chat/moderation bot](https://github.com/rapmd73/Companion) Chat/moderation bot written in python. Uses Ollama to create personalities.
- [Headless Ollama](https://github.com/nischalj10/headless-ollama) (Scripts to automatically install ollama client & models on any OS for apps that depends on ollama server) - [Headless Ollama](https://github.com/nischalj10/headless-ollama) (Scripts to automatically install ollama client & models on any OS for apps that depends on ollama server)
- [Local AI Helper](https://github.com/ivostoykov/localAI) (Chrome and Firefox extensions that enable interactions with the active tab and customisable API endpoints. Includes secure storage for user prompts.)
- [vnc-lm](https://github.com/jk011ru/vnc-lm) (A containerized Discord bot with support for attachments and web links) - [vnc-lm](https://github.com/jk011ru/vnc-lm) (A containerized Discord bot with support for attachments and web links)
- [LSP-AI](https://github.com/SilasMarvin/lsp-ai) (Open-source language server for AI-powered functionality) - [LSP-AI](https://github.com/SilasMarvin/lsp-ai) (Open-source language server for AI-powered functionality)
- [QodeAssist](https://github.com/Palm1r/QodeAssist) (AI-powered coding assistant plugin for Qt Creator) - [QodeAssist](https://github.com/Palm1r/QodeAssist) (AI-powered coding assistant plugin for Qt Creator)


@ -55,7 +55,7 @@ func checkError(resp *http.Response, body []byte) error {
// ClientFromEnvironment creates a new [Client] using configuration from the // ClientFromEnvironment creates a new [Client] using configuration from the
// environment variable OLLAMA_HOST, which points to the network host and // environment variable OLLAMA_HOST, which points to the network host and
// port on which the ollama service is listenting. The format of this variable // port on which the ollama service is listening. The format of this variable
// is: // is:
// //
// <scheme>://<host>:<port> // <scheme>://<host>:<port>


@ -12,7 +12,7 @@ import (
"time" "time"
) )
// StatusError is an error with and HTTP status code. // StatusError is an error with an HTTP status code and message.
type StatusError struct { type StatusError struct {
StatusCode int StatusCode int
Status string Status string
@ -57,7 +57,7 @@ type GenerateRequest struct {
Template string `json:"template"` Template string `json:"template"`
// Context is the context parameter returned from a previous call to // Context is the context parameter returned from a previous call to
// Generate call. It can be used to keep a short conversational memory. // [Client.Generate]. It can be used to keep a short conversational memory.
Context []int `json:"context,omitempty"` Context []int `json:"context,omitempty"`
// Stream specifies whether the response is streaming; it is true by default. // Stream specifies whether the response is streaming; it is true by default.
@ -90,14 +90,14 @@ type ChatRequest struct {
// Messages is the messages of the chat - can be used to keep a chat memory. // Messages is the messages of the chat - can be used to keep a chat memory.
Messages []Message `json:"messages"` Messages []Message `json:"messages"`
// Stream enable streaming of returned response; true by default. // Stream enables streaming of returned responses; true by default.
Stream *bool `json:"stream,omitempty"` Stream *bool `json:"stream,omitempty"`
// Format is the format to return the response in (e.g. "json"). // Format is the format to return the response in (e.g. "json").
Format string `json:"format"` Format string `json:"format"`
// KeepAlive controls how long the model will stay loaded into memory // KeepAlive controls how long the model will stay loaded into memory
// followin the request. // following the request.
KeepAlive *Duration `json:"keep_alive,omitempty"` KeepAlive *Duration `json:"keep_alive,omitempty"`
// Tools is an optional list of tools the model has access to. // Tools is an optional list of tools the model has access to.
@ -203,8 +203,8 @@ type Metrics struct {
EvalDuration time.Duration `json:"eval_duration,omitempty"` EvalDuration time.Duration `json:"eval_duration,omitempty"`
} }
// Options specified in [GenerateRequest], if you add a new option here add it // Options specified in [GenerateRequest]. If you add a new option here, also
// to the API docs also. // add it to the API docs.
type Options struct { type Options struct {
Runner Runner
@ -236,7 +236,7 @@ type Runner struct {
NumGPU int `json:"num_gpu,omitempty"` NumGPU int `json:"num_gpu,omitempty"`
MainGPU int `json:"main_gpu,omitempty"` MainGPU int `json:"main_gpu,omitempty"`
LowVRAM bool `json:"low_vram,omitempty"` LowVRAM bool `json:"low_vram,omitempty"`
F16KV bool `json:"f16_kv,omitempty"` F16KV bool `json:"f16_kv,omitempty"` // Deprecated: This option is ignored
LogitsAll bool `json:"logits_all,omitempty"` LogitsAll bool `json:"logits_all,omitempty"`
VocabOnly bool `json:"vocab_only,omitempty"` VocabOnly bool `json:"vocab_only,omitempty"`
UseMMap *bool `json:"use_mmap,omitempty"` UseMMap *bool `json:"use_mmap,omitempty"`
@ -613,7 +613,6 @@ func DefaultOptions() Options {
NumGPU: -1, // -1 here indicates that NumGPU should be set dynamically NumGPU: -1, // -1 here indicates that NumGPU should be set dynamically
NumThread: 0, // let the runtime decide NumThread: 0, // let the runtime decide
LowVRAM: false, LowVRAM: false,
F16KV: true,
UseMLock: false, UseMLock: false,
UseMMap: nil, UseMMap: nil,
}, },
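The `Stream *bool` fields in the hunks above are documented as true by default, which is why they are pointers: a nil pointer can stand for "not specified". A minimal sketch of how a caller might resolve that default, assuming a hypothetical helper named `streamEnabled` that is not part of the diff:

```go
package main

import "fmt"

// streamEnabled is a hypothetical helper. It treats a nil pointer as
// "not specified", which the doc comment says means streaming is on.
func streamEnabled(stream *bool) bool {
	return stream == nil || *stream
}

func main() {
	off := false
	// An unset field streams; an explicit false does not.
	fmt.Println(streamEnabled(nil), streamEnabled(&off))
}
```

A plain `bool` could not distinguish an omitted field from an explicit `false`, since both would decode to the zero value.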


@ -136,7 +136,7 @@ Type: filesandordirs; Name: "{%TEMP}\ollama*"
Type: filesandordirs; Name: "{%LOCALAPPDATA}\Programs\Ollama" Type: filesandordirs; Name: "{%LOCALAPPDATA}\Programs\Ollama"
[Messages] [Messages]
WizardReady=Ollama Windows Preview WizardReady=Ollama
ReadyLabel1=%nLet's get you up and running with your own large language models. ReadyLabel1=%nLet's get you up and running with your own large language models.
SetupAppRunningError=Another Ollama installer is running.%n%nPlease cancel or finish the other installer, then click OK to continue with this install, or Cancel to exit. SetupAppRunningError=Another Ollama installer is running.%n%nPlease cancel or finish the other installer, then click OK to continue with this install, or Cancel to exit.


@ -800,9 +800,9 @@ func ShowHandler(cmd *cobra.Command, args []string) error {
case "parameters": case "parameters":
fmt.Println(resp.Parameters) fmt.Println(resp.Parameters)
case "system": case "system":
fmt.Println(resp.System) fmt.Print(resp.System)
case "template": case "template":
fmt.Println(resp.Template) fmt.Print(resp.Template)
} }
return nil return nil
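The switch from `fmt.Println` to `fmt.Print` in the hunk above matters because `Println` always appends a newline, so the displayed bytes no longer match the stored layer. A small sketch of the difference, using a hypothetical `render` helper and a made-up template string:

```go
package main

import (
	"bytes"
	"fmt"
)

// render writes s to a buffer the way the old and new code paths would,
// so the two outputs can be compared byte for byte.
func render(s string, exact bool) []byte {
	var buf bytes.Buffer
	if exact {
		fmt.Fprint(&buf, s) // writes s unchanged
	} else {
		fmt.Fprintln(&buf, s) // appends a trailing '\n'
	}
	return buf.Bytes()
}

func main() {
	tmpl := "{{ .Prompt }}" // hypothetical template text with no trailing newline
	fmt.Printf("%q\n", render(tmpl, false))
	fmt.Printf("%q\n", render(tmpl, true))
}
```

Only the exact variant can be redirected to a file and compared against the original layer without a spurious trailing newline.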


@ -350,7 +350,7 @@ func AMDGetGPUInfo() ([]RocmGPUInfo, error) {
return nil, err return nil, err
} }
} }
gpuInfo.DependencyPath = libDir gpuInfo.DependencyPath = []string{libDir}
if gfxOverride == "" { if gfxOverride == "" {
// Only load supported list once // Only load supported list once


@ -111,7 +111,7 @@ func AMDGetGPUInfo() ([]RocmGPUInfo, error) {
UnreliableFreeMemory: true, UnreliableFreeMemory: true,
ID: strconv.Itoa(i), // TODO this is probably wrong if we specify visible devices ID: strconv.Itoa(i), // TODO this is probably wrong if we specify visible devices
DependencyPath: libDir, DependencyPath: []string{libDir},
MinimumMemory: rocmMinimumMemory, MinimumMemory: rocmMinimumMemory,
Name: name, Name: name,
Compute: gfx, Compute: gfx,


@ -240,7 +240,7 @@ func GetGPUInfo() GpuInfoList {
Library: "cpu", Library: "cpu",
Variant: cpuCapability.String(), Variant: cpuCapability.String(),
ID: "0", ID: "0",
DependencyPath: depPath, DependencyPath: []string{depPath},
}, },
CPUs: details, CPUs: details,
}, },
@ -293,11 +293,11 @@ func GetGPUInfo() GpuInfoList {
gpuInfo.DriverMinor = driverMinor gpuInfo.DriverMinor = driverMinor
variant := cudaVariant(gpuInfo) variant := cudaVariant(gpuInfo)
if depPath != "" { if depPath != "" {
gpuInfo.DependencyPath = depPath gpuInfo.DependencyPath = []string{depPath}
// Check for variant specific directory // Check for variant specific directory
if variant != "" { if variant != "" {
if _, err := os.Stat(filepath.Join(depPath, "cuda_"+variant)); err == nil { if _, err := os.Stat(filepath.Join(depPath, "cuda_"+variant)); err == nil {
gpuInfo.DependencyPath = filepath.Join(depPath, "cuda_"+variant) gpuInfo.DependencyPath = []string{filepath.Join(depPath, "cuda_"+variant), depPath}
} }
} }
} }
@ -370,7 +370,7 @@ func GetGPUInfo() GpuInfoList {
gpuInfo.FreeMemory = uint64(memInfo.free) gpuInfo.FreeMemory = uint64(memInfo.free)
gpuInfo.ID = C.GoString(&memInfo.gpu_id[0]) gpuInfo.ID = C.GoString(&memInfo.gpu_id[0])
gpuInfo.Name = C.GoString(&memInfo.gpu_name[0]) gpuInfo.Name = C.GoString(&memInfo.gpu_name[0])
gpuInfo.DependencyPath = depPath gpuInfo.DependencyPath = []string{depPath}
oneapiGPUs = append(oneapiGPUs, gpuInfo) oneapiGPUs = append(oneapiGPUs, gpuInfo)
} }
} }
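With `DependencyPath` now a slice, the CUDA hunk above can list the variant-specific directory first and the base directory second. A minimal sketch of that ordering and of how such a list might be flattened into a search path; `dependencyPaths` and `searchPath` are hypothetical helpers, and how the real consumer joins the entries is an assumption here:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// dependencyPaths mirrors the lookup order in the hunk above: the
// variant-specific directory first (if it exists), then the base directory.
func dependencyPaths(depPath, variant string) []string {
	if depPath == "" {
		return nil
	}
	paths := []string{depPath}
	if variant != "" {
		variantDir := filepath.Join(depPath, "cuda_"+variant)
		if _, err := os.Stat(variantDir); err == nil {
			paths = []string{variantDir, depPath}
		}
	}
	return paths
}

// searchPath joins the entries with the platform's list separator
// (':' on Unix, ';' on Windows). This joining step is an assumption.
func searchPath(paths []string) string {
	return strings.Join(paths, string(filepath.ListSeparator))
}

func main() {
	base, err := os.MkdirTemp("", "deps")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)
	if err := os.Mkdir(filepath.Join(base, "cuda_jetpack6"), 0o755); err != nil {
		panic(err)
	}
	// The variant directory is listed ahead of the base directory.
	fmt.Println(searchPath(dependencyPaths(base, "jetpack6")))
}
```

Keeping the base directory as a fallback means libraries shared across variants still resolve when the variant directory holds only the variant-specific ones.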


@ -4,6 +4,7 @@
#include "gpu_info_nvcuda.h" #include "gpu_info_nvcuda.h"
void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) { void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
LOG(resp->ch.verbose, "initializing %s\n", nvcuda_lib_path);
CUresult ret; CUresult ret;
resp->err = NULL; resp->err = NULL;
resp->num_devices = 0; resp->num_devices = 0;
@ -57,8 +58,10 @@ void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
resp->cudaErr = -1; resp->cudaErr = -1;
return; return;
} }
LOG(resp->ch.verbose, "dlsym: %s - %p\n", l[i].s, *l[i].p);
} }
LOG(resp->ch.verbose, "calling cuInit\n");
ret = (*resp->ch.cuInit)(0); ret = (*resp->ch.cuInit)(0);
if (ret != CUDA_SUCCESS) { if (ret != CUDA_SUCCESS) {
LOG(resp->ch.verbose, "cuInit err: %d\n", ret); LOG(resp->ch.verbose, "cuInit err: %d\n", ret);
@ -75,15 +78,18 @@ void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
resp->ch.driver_minor = 0; resp->ch.driver_minor = 0;
// Report driver version if we're in verbose mode, ignore errors // Report driver version if we're in verbose mode, ignore errors
LOG(resp->ch.verbose, "calling cuDriverGetVersion\n");
ret = (*resp->ch.cuDriverGetVersion)(&version); ret = (*resp->ch.cuDriverGetVersion)(&version);
if (ret != CUDA_SUCCESS) { if (ret != CUDA_SUCCESS) {
LOG(resp->ch.verbose, "cuDriverGetVersion failed: %d\n", ret); LOG(resp->ch.verbose, "cuDriverGetVersion failed: %d\n", ret);
} else { } else {
LOG(resp->ch.verbose, "raw version 0x%x\n", version);
resp->ch.driver_major = version / 1000; resp->ch.driver_major = version / 1000;
resp->ch.driver_minor = (version - (resp->ch.driver_major * 1000)) / 10; resp->ch.driver_minor = (version - (resp->ch.driver_major * 1000)) / 10;
LOG(resp->ch.verbose, "CUDA driver version: %d.%d\n", resp->ch.driver_major, resp->ch.driver_minor); LOG(resp->ch.verbose, "CUDA driver version: %d.%d\n", resp->ch.driver_major, resp->ch.driver_minor);
} }
LOG(resp->ch.verbose, "calling cuDeviceGetCount\n");
ret = (*resp->ch.cuDeviceGetCount)(&resp->num_devices); ret = (*resp->ch.cuDeviceGetCount)(&resp->num_devices);
if (ret != CUDA_SUCCESS) { if (ret != CUDA_SUCCESS) {
LOG(resp->ch.verbose, "cuDeviceGetCount err: %d\n", ret); LOG(resp->ch.verbose, "cuDeviceGetCount err: %d\n", ret);
@ -94,6 +100,7 @@ void nvcuda_init(char *nvcuda_lib_path, nvcuda_init_resp_t *resp) {
resp->cudaErr = ret; resp->cudaErr = ret;
return; return;
} }
LOG(resp->ch.verbose, "device count %d\n", resp->num_devices);
} }
const int buflen = 256; const int buflen = 256;
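The new "raw version" log line in the hunk above prints the value before it is split into major and minor parts. The CUDA driver reports its version as a single integer of the form 1000 * major + 10 * minor, and the arithmetic in `nvcuda_init` undoes that encoding. The same calculation as a Go sketch, with `decodeDriverVersion` as a hypothetical name:

```go
package main

import "fmt"

// decodeDriverVersion splits the packed integer using the same arithmetic
// as the C code above: major = v / 1000, minor = (v - major*1000) / 10.
func decodeDriverVersion(version int) (major, minor int) {
	major = version / 1000
	minor = (version - major*1000) / 10
	return major, minor
}

func main() {
	major, minor := decodeDriverVersion(12040)
	fmt.Printf("CUDA driver version: %d.%d\n", major, minor) // prints "CUDA driver version: 12.4"
}
```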


@ -25,7 +25,7 @@ type GpuInfo struct { // TODO better name maybe "InferenceProcessor"?
MinimumMemory uint64 `json:"-"` MinimumMemory uint64 `json:"-"`
// Any extra PATH/LD_LIBRARY_PATH dependencies required for the Library to operate properly // Any extra PATH/LD_LIBRARY_PATH dependencies required for the Library to operate properly
DependencyPath string `json:"lib_path,omitempty"` DependencyPath []string `json:"lib_path,omitempty"`
// Extra environment variables specific to the GPU as list of [key,value] // Extra environment variables specific to the GPU as list of [key,value]
EnvWorkarounds [][2]string `json:"envs,omitempty"` EnvWorkarounds [][2]string `json:"envs,omitempty"`


@ -355,7 +355,6 @@ curl http://localhost:11434/api/generate -d '{
"num_gpu": 1, "num_gpu": 1,
"main_gpu": 0, "main_gpu": 0,
"low_vram": false, "low_vram": false,
"f16_kv": true,
"vocab_only": false, "vocab_only": false,
"use_mmap": true, "use_mmap": true,
"use_mlock": false, "use_mlock": false,


@ -108,7 +108,7 @@ Custom CPU settings are not currently supported in the new Go server build but w
#### Containerized Linux Build #### Containerized Linux Build
If you have Docker available, you can build linux binaries with `OLLAMA_NEW_RUNNERS=1 ./scripts/build_linux.sh` which has the CUDA and ROCm dependencies included. The resulting binary is placed in `./dist` If you have Docker available, you can build linux binaries with `./scripts/build_linux.sh` which has the CUDA and ROCm dependencies included. The resulting binary is placed in `./dist`
### Windows ### Windows

@ -32,7 +32,7 @@ ollama run my-model
Ollama supports importing adapters based on several different model architectures including: Ollama supports importing adapters based on several different model architectures including:
* Llama (including Llama 2, Llama 3, and Llama 3.1); * Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2);
* Mistral (including Mistral 1, Mistral 2, and Mixtral); and * Mistral (including Mistral 1, Mistral 2, and Mixtral); and
* Gemma (including Gemma 1 and Gemma 2) * Gemma (including Gemma 1 and Gemma 2)
@ -67,14 +67,12 @@ ollama run my-model
Ollama supports importing models for several different architectures including: Ollama supports importing models for several different architectures including:
* Llama (including Llama 2, Llama 3, and Llama 3.1); * Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2);
* Mistral (including Mistral 1, Mistral 2, and Mixtral); * Mistral (including Mistral 1, Mistral 2, and Mixtral);
* Gemma (including Gemma 1 and Gemma 2); and * Gemma (including Gemma 1 and Gemma 2); and
* Phi3 * Phi3
This includes importing foundation models as well as any fine tuned models which which have been _fused_ with a foundation model. This includes importing foundation models as well as any fine tuned models which have been _fused_ with a foundation model.
## Importing a GGUF based model or adapter ## Importing a GGUF based model or adapter
If you have a GGUF based model or adapter it is possible to import it into Ollama. You can obtain a GGUF model or adapter by: If you have a GGUF based model or adapter it is possible to import it into Ollama. You can obtain a GGUF model or adapter by:

@ -120,7 +120,7 @@ FROM <model directory>
The model directory should contain the Safetensors weights for a supported architecture. The model directory should contain the Safetensors weights for a supported architecture.
Currently supported model architectures: Currently supported model architectures:
* Llama (including Llama 2, Llama 3, and Llama 3.1) * Llama (including Llama 2, Llama 3, Llama 3.1, and Llama 3.2)
* Mistral (including Mistral 1, Mistral 2, and Mixtral) * Mistral (including Mistral 1, Mistral 2, and Mixtral)
* Gemma (including Gemma 1 and Gemma 2) * Gemma (including Gemma 1 and Gemma 2)
* Phi3 * Phi3

@ -95,7 +95,9 @@ If none of those resolve the problem, gather additional information and file an
On linux, AMD GPU access typically requires `video` and/or `render` group membership to access the `/dev/kfd` device. If permissions are not set up correctly, Ollama will detect this and report an error in the server log. On linux, AMD GPU access typically requires `video` and/or `render` group membership to access the `/dev/kfd` device. If permissions are not set up correctly, Ollama will detect this and report an error in the server log.
When running in a container, in some Linux distributions and container runtimes, the ollama process may be unable to access the GPU. Use `ls -ld /dev/kfd /dev/dri /dev/dri/*` on the host system to determine the group assignments on your system, and pass additional `--group-add ...` arguments to the container so it can access the required devices. When running in a container, in some Linux distributions and container runtimes, the ollama process may be unable to access the GPU. Use `ls -lnd /dev/kfd /dev/dri /dev/dri/*` on the host system to determine the **numeric** group IDs on your system, and pass additional `--group-add ...` arguments to the container so it can access the required devices. For example, in the following output `crw-rw---- 1 0 44 226, 0 Sep 16 16:55 /dev/dri/card0` the group ID column is `44`
If Ollama initially works on the GPU in a docker container, but then switches to running on CPU after some period of time with errors in the server log reporting GPU discovery failures, this can be resolved by disabling systemd cgroup management in Docker. Edit `/etc/docker/daemon.json` on the host and add `"exec-opts": ["native.cgroupdriver=cgroupfs"]` to the docker configuration.
If you are experiencing problems getting Ollama to correctly discover or use your GPU for inference, the following may help isolate the failure. If you are experiencing problems getting Ollama to correctly discover or use your GPU for inference, the following may help isolate the failure.
- `AMD_LOG_LEVEL=3` Enable info log levels in the AMD HIP/ROCm libraries. This can help show more detailed error codes that can help troubleshoot problems - `AMD_LOG_LEVEL=3` Enable info log levels in the AMD HIP/ROCm libraries. This can help show more detailed error codes that can help troubleshoot problems
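To make the numeric group ID step above concrete, here is a minimal Go sketch that pulls the group column out of one line of `ls -lnd` output and prints the matching `--group-add` argument. The helper name `groupIDFromLs` is hypothetical and not part of Ollama; it only assumes the standard long-listing column order (mode, links, owner, group).

```go
package main

import (
	"fmt"
	"strings"
)

// groupIDFromLs is a hypothetical helper: it returns the fourth
// whitespace-separated column of an `ls -lnd` line, which is the
// numeric group ID in the standard long-listing layout.
func groupIDFromLs(line string) (string, error) {
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return "", fmt.Errorf("unexpected ls output: %q", line)
	}
	return fields[3], nil
}

func main() {
	// Sample line taken from the troubleshooting text above.
	gid, err := groupIDFromLs("crw-rw---- 1 0 44 226, 0 Sep 16 16:55 /dev/dri/card0")
	if err != nil {
		panic(err)
	}
	// The result is what would be passed to the container runtime.
	fmt.Printf("--group-add %s\n", gid)
}
```

Passing the number rather than a group name sidesteps the name lookup inside the container filesystem, which is the problem the doc change describes.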

@ -10,7 +10,7 @@ This sounds like a typical censored response, but even llama2-uncensored gives a
So let's figure out how we can use **LangChain** with Ollama to ask our question to the actual document, the Odyssey by Homer, using Python. So let's figure out how we can use **LangChain** with Ollama to ask our question to the actual document, the Odyssey by Homer, using Python.
Let's start by asking a simple question that we can get an answer to from the **Llama2** model using **Ollama**. First, we need to install the **LangChain** package: Let's start by asking a simple question that we can get an answer to from the **Llama3** model using **Ollama**. First, we need to install the **LangChain** package:
`pip install langchain_community` `pip install langchain_community`

@ -21,6 +21,8 @@ package llama
#cgo cuda CFLAGS: -fPIE -DGGML_USE_CUDA -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_BUILD=1 #cgo cuda CFLAGS: -fPIE -DGGML_USE_CUDA -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_BUILD=1
#cgo cuda CXXFLAGS: -DGGML_USE_CUDA -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_BUILD=1 #cgo cuda CXXFLAGS: -DGGML_USE_CUDA -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_BUILD=1
#cgo cuda CXXFLAGS: -DGGML_USE_CUDA -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_BUILD=1 #cgo cuda CXXFLAGS: -DGGML_USE_CUDA -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_BUILD=1
#cgo cuda_jetpack5 LDFLAGS: -lggml_cuda_jetpack5 -L/usr/local/cuda-11/lib64
#cgo cuda_jetpack6 LDFLAGS: -lggml_cuda_jetpack6 -L/usr/local/cuda-12/lib64
#cgo cuda_v11 LDFLAGS: -lggml_cuda_v11 -L/usr/local/cuda-11/lib64 #cgo cuda_v11 LDFLAGS: -lggml_cuda_v11 -L/usr/local/cuda-11/lib64
#cgo cuda_v12 LDFLAGS: -lggml_cuda_v12 -L/usr/local/cuda-12/lib64 #cgo cuda_v12 LDFLAGS: -lggml_cuda_v12 -L/usr/local/cuda-12/lib64
#cgo darwin,amd64 CFLAGS: -Wno-incompatible-pointer-types-discards-qualifiers #cgo darwin,amd64 CFLAGS: -Wno-incompatible-pointer-types-discards-qualifiers
@ -36,8 +38,8 @@ package llama
#cgo linux CXXFLAGS: -D_GNU_SOURCE #cgo linux CXXFLAGS: -D_GNU_SOURCE
#cgo linux,amd64 LDFLAGS: -L${SRCDIR}/build/Linux/amd64 #cgo linux,amd64 LDFLAGS: -L${SRCDIR}/build/Linux/amd64
#cgo linux,amd64 LDFLAGS: -L${SRCDIR}/build/Linux/amd64 #cgo linux,amd64 LDFLAGS: -L${SRCDIR}/build/Linux/amd64
#cgo linux,arm64 CFLAGS: -D__aarch64__ -D__ARM_NEON -D__ARM_FEATURE_FMA -D__ARM_FEATURE_MATMUL_INT8 #cgo linux,arm64 CFLAGS: -D__aarch64__ -D__ARM_NEON -D__ARM_FEATURE_FMA
#cgo linux,arm64 CXXFLAGS: -D__aarch64__ -D__ARM_NEON -D__ARM_FEATURE_FMA -D__ARM_FEATURE_MATMUL_INT8 #cgo linux,arm64 CXXFLAGS: -D__aarch64__ -D__ARM_NEON -D__ARM_FEATURE_FMA
#cgo linux,arm64 LDFLAGS: -L${SRCDIR}/build/Linux/arm64 #cgo linux,arm64 LDFLAGS: -L${SRCDIR}/build/Linux/arm64
#cgo linux,arm64,sve CFLAGS: -march=armv8.6-a+sve #cgo linux,arm64,sve CFLAGS: -march=armv8.6-a+sve
#cgo linux,arm64,sve CXXFLAGS: -march=armv8.6-a+sve #cgo linux,arm64,sve CXXFLAGS: -march=armv8.6-a+sve

@ -58,6 +58,8 @@ endif
GPU_COMPILER_CUFLAGS = \ GPU_COMPILER_CUFLAGS = \
$(GPU_COMPILER_FPIC) \ $(GPU_COMPILER_FPIC) \
$(addprefix -m,$(GPU_RUNNER_CPU_FLAGS)) \ $(addprefix -m,$(GPU_RUNNER_CPU_FLAGS)) \
-mf16c \
-mfma \
-parallel-jobs=2 \ -parallel-jobs=2 \
-c \ -c \
-O3 \ -O3 \
@ -77,6 +79,9 @@ GPU_COMPILER_CUFLAGS = \
-D_CRT_SECURE_NO_WARNINGS \ -D_CRT_SECURE_NO_WARNINGS \
-D_GNU_SOURCE \ -D_GNU_SOURCE \
-D_XOPEN_SOURCE=600 \ -D_XOPEN_SOURCE=600 \
-DUSE_PROF_API=1 \
-std=gnu++14 \
-x hip \
-mllvm=-amdgpu-early-inline-all=true \ -mllvm=-amdgpu-early-inline-all=true \
-mllvm=-amdgpu-function-calls=false \ -mllvm=-amdgpu-function-calls=false \
-Wno-expansion-to-defined \ -Wno-expansion-to-defined \
@ -87,6 +92,12 @@ GPU_COMPILER_CUFLAGS = \
-Wno-unused-result \ -Wno-unused-result \
-I. -I.
# Workaround buggy P2P copy on some windows multi-GPU setups
# This workaround breaks linux systems with small system RAM, so only enable on windows
ifeq ($(OS),windows)
GPU_COMPILER_CUFLAGS += -DGGML_CUDA_NO_PEER_COPY=1
endif
include make/gpu.make include make/gpu.make
# Adjust the rules from gpu.make to handle the ROCm dependencies properly # Adjust the rules from gpu.make to handle the ROCm dependencies properly

@ -20,7 +20,7 @@ GPU_COMPILER_CFLAGS_LINUX = $(CFLAGS) -Xcompiler -fPIC -D_GNU_SOURCE
GPU_COMPILER_CXXFLAGS_WIN = $(CXXFLAGS) -D_WIN32_WINNT=0x602 GPU_COMPILER_CXXFLAGS_WIN = $(CXXFLAGS) -D_WIN32_WINNT=0x602
GPU_COMPILER_CXXFLAGS_LINUX = $(CXXFLAGS) -Xcompiler -fPIC -D_GNU_SOURCE GPU_COMPILER_CXXFLAGS_LINUX = $(CXXFLAGS) -Xcompiler -fPIC -D_GNU_SOURCE
GPU_LIBS = $(sort $(wildcard $(addsuffix *.$(SHARED_EXT)*,$(addprefix $(GPU_LIB_DIR)/$(SHARED_PREFIX),$(GPU_RUNNER_LIBS_SHORT))))) GPU_LIBS = $(sort $(wildcard $(addsuffix *.$(SHARED_EXT)*,$(addprefix $(GPU_LIB_DIR)/$(SHARED_PREFIX),$(GPU_RUNNER_LIBS_SHORT)))))
GPU_DIST_DEPS_LIBS= $(sort $(addprefix $(DIST_LIB_DIR)/,$(notdir $(GPU_LIBS)))) GPU_DIST_DEPS_LIBS= $(sort $(addprefix $(DIST_GPU_RUNNER_DEPS_DIR)/,$(notdir $(GPU_LIBS))))
ifeq ($(OS),linux) ifeq ($(OS),linux)
CUDA_PATH?=/usr/local/cuda CUDA_PATH?=/usr/local/cuda

@ -85,7 +85,7 @@ $(RUNNERS_BUILD_DIR)/$(GPU_RUNNER_NAME)/ollama_llama_server$(EXE_EXT): $(RUNNERS
GOARCH=$(ARCH) CGO_LDFLAGS="$(TARGET_CGO_LDFLAGS)" go build -buildmode=pie $(GPU_GOFLAGS) -trimpath -tags $(subst $(space),$(comma),$(GPU_RUNNER_CPU_FLAGS) $(GPU_RUNNER_GO_TAGS)) -o $@ ./runner GOARCH=$(ARCH) CGO_LDFLAGS="$(TARGET_CGO_LDFLAGS)" go build -buildmode=pie $(GPU_GOFLAGS) -trimpath -tags $(subst $(space),$(comma),$(GPU_RUNNER_CPU_FLAGS) $(GPU_RUNNER_GO_TAGS)) -o $@ ./runner
$(RUNNERS_BUILD_DIR)/$(GPU_RUNNER_NAME)/$(SHARED_PREFIX)ggml_$(GPU_RUNNER_NAME).$(SHARED_EXT): $(GPU_RUNNER_OBJS) $(DIST_GPU_RUNNER_LIB_DEPS) $(COMMON_HDRS) $(GPU_RUNNER_HDRS) $(RUNNERS_BUILD_DIR)/$(GPU_RUNNER_NAME)/$(SHARED_PREFIX)ggml_$(GPU_RUNNER_NAME).$(SHARED_EXT): $(GPU_RUNNER_OBJS) $(DIST_GPU_RUNNER_LIB_DEPS) $(COMMON_HDRS) $(GPU_RUNNER_HDRS)
@-mkdir -p $(dir $@) @-mkdir -p $(dir $@)
$(CCACHE) $(GPU_COMPILER) --shared $(GPU_RUNNER_DRIVER_LIB_LINK) -L${DIST_GPU_RUNNER_DEPS_DIR} $(foreach lib, $(GPU_RUNNER_LIBS_SHORT), -l$(lib)) $(GPU_RUNNER_OBJS) -o $@ $(CCACHE) $(GPU_COMPILER) --shared -L$(GPU_LIB_DIR) $(GPU_RUNNER_DRIVER_LIB_LINK) -L${DIST_GPU_RUNNER_DEPS_DIR} $(foreach lib, $(GPU_RUNNER_LIBS_SHORT), -l$(lib)) $(GPU_RUNNER_OBJS) -o $@
# Distribution targets # Distribution targets
$(RUNNERS_DIST_DIR)/%: $(RUNNERS_BUILD_DIR)/% $(RUNNERS_DIST_DIR)/%: $(RUNNERS_BUILD_DIR)/%

@ -2,6 +2,7 @@ package main
import ( import (
"errors" "errors"
"fmt"
"log/slog" "log/slog"
"reflect" "reflect"
"time" "time"
@ -22,7 +23,11 @@ type InputCache struct {
lc *llama.Context lc *llama.Context
} }
func NewInputCache(lc *llama.Context, kvSize int, numSlots int, multiUserCache bool) *InputCache { func NewInputCache(lc *llama.Context, kvSize int, numSlots int, multiUserCache bool) (*InputCache, error) {
if kvSize/numSlots < 1 {
return nil, fmt.Errorf("must have at least one kv cache entry per parallel sequence (kv: %v parallel: %v)", kvSize, numSlots)
}
slots := make([]InputCacheSlot, numSlots) slots := make([]InputCacheSlot, numSlots)
for i := range slots { for i := range slots {
@ -37,7 +42,7 @@ func NewInputCache(lc *llama.Context, kvSize int, numSlots int, multiUserCache b
slots: slots, slots: slots,
multiUserCache: multiUserCache, multiUserCache: multiUserCache,
lc: lc, lc: lc,
} }, nil
} }
// Locking: Operations on InputCacheSlot (including finding one // Locking: Operations on InputCacheSlot (including finding one
@ -58,7 +63,7 @@ type InputCacheSlot struct {
lastUsed time.Time lastUsed time.Time
} }
func (c *InputCache) LoadCacheSlot(prompt []input, cachePrompt bool) (*InputCacheSlot, []input, int, error) { func (c *InputCache) LoadCacheSlot(prompt []input, cachePrompt bool) (*InputCacheSlot, []input, error) {
var slot *InputCacheSlot var slot *InputCacheSlot
var numPast int var numPast int
var err error var err error
@ -75,7 +80,7 @@ func (c *InputCache) LoadCacheSlot(prompt []input, cachePrompt bool) (*InputCach
slot, numPast, err = c.findBestCacheSlot(prompt) slot, numPast, err = c.findBestCacheSlot(prompt)
} }
if err != nil { if err != nil {
return nil, nil, 0, err return nil, nil, err
} }
if !cachePrompt { if !cachePrompt {
@ -102,7 +107,7 @@ func (c *InputCache) LoadCacheSlot(prompt []input, cachePrompt bool) (*InputCach
prompt = prompt[numPast:] prompt = prompt[numPast:]
slot.Inputs = slot.Inputs[:numPast] slot.Inputs = slot.Inputs[:numPast]
return slot, prompt, numPast, nil return slot, prompt, nil
} }
func (c *InputCache) findLongestCacheSlot(prompt []input) (*InputCacheSlot, int, error) { func (c *InputCache) findLongestCacheSlot(prompt []input) (*InputCacheSlot, int, error) {
@ -194,14 +199,30 @@ func countCommonPrefix(a []input, b []input) int {
return count return count
} }
func (c *InputCache) ShiftCacheSlot(slot *InputCacheSlot, numKeep int, numDiscard int, numPast int) { // Frees up space in the KV cache by deleting the oldest half of history and shifting
// TODO (jessegross): KV cache removal can fail for certain types of models // the newest half into that space (saving numKeep inputs at the beginning).
// server.cpp doesn't handle this, though we can be more graceful //
c.lc.KvCacheSeqRm(slot.Id, numKeep, numKeep+numDiscard) // Assumes that at least 1 entry can be freed up by shifting (i.e. numKeep < numCtx)
c.lc.KvCacheSeqAdd(slot.Id, numKeep+numDiscard, numPast, -numDiscard) func (c *InputCache) ShiftCacheSlot(slot *InputCacheSlot, numKeep int) {
targetFree := (c.numCtx - numKeep) / 2
targetFree = max(targetFree, 1)
for i := numKeep + numDiscard; i < len(slot.Inputs); i++ { currentFree := c.numCtx - len(slot.Inputs)
slot.Inputs[i-numDiscard] = slot.Inputs[i] discard := targetFree - currentFree
if discard <= 0 {
return
} }
slot.Inputs = slot.Inputs[:len(slot.Inputs)-numDiscard]
slog.Debug("context limit hit - shifting", "limit", c.numCtx, "input", len(slot.Inputs),
"keep", numKeep, "discard", discard)
// TODO (jessegross): KV cache removal can fail for certain types of models
c.lc.KvCacheSeqRm(slot.Id, numKeep, numKeep+discard)
c.lc.KvCacheSeqAdd(slot.Id, numKeep+discard, len(slot.Inputs), -discard)
for i := numKeep + discard; i < len(slot.Inputs); i++ {
slot.Inputs[i-discard] = slot.Inputs[i]
}
slot.Inputs = slot.Inputs[:len(slot.Inputs)-discard]
} }
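The arithmetic in the new `ShiftCacheSlot` can be followed more easily on a plain slice. The sketch below is an illustration only: `shiftInputs` is a hypothetical stand-in that uses `int` values instead of `input` structs and omits the `KvCacheSeqRm`/`KvCacheSeqAdd` calls on the llama context. Like the original, it assumes `numKeep < numCtx`.

```go
package main

import "fmt"

// shiftInputs is a hypothetical, standalone version of the slice
// bookkeeping in ShiftCacheSlot. It frees roughly half of the
// shiftable region while preserving the first numKeep entries.
func shiftInputs(inputs []int, numCtx, numKeep int) []int {
	// Aim to free half of the region after numKeep, but at least one entry.
	targetFree := (numCtx - numKeep) / 2
	if targetFree < 1 {
		targetFree = 1
	}

	// Only discard what is still missing to reach the target.
	currentFree := numCtx - len(inputs)
	discard := targetFree - currentFree
	if discard <= 0 {
		return inputs
	}

	// Slide the newest entries down over the discarded range, then truncate.
	copy(inputs[numKeep:], inputs[numKeep+discard:])
	return inputs[:len(inputs)-discard]
}

func main() {
	// A full cache of 8 entries with the first 2 pinned.
	inputs := []int{0, 1, 2, 3, 4, 5, 6, 7}
	fmt.Println(shiftInputs(inputs, 8, 2))
}
```

With a full context of 8 and `numKeep` of 2, the target is 3 free entries, so entries 2 through 4 are dropped and the remainder slides down. Because the discard count is derived from `len(slot.Inputs)` rather than a separately tracked `numPast`, there is one less counter that can drift out of sync.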

@ -68,6 +68,10 @@ func (c *ImageContext) NewEmbed(llamaContext *llama.Context, data []byte, aspect
return nil, nil return nil, nil
} }
if len(data) <= 0 {
return nil, errors.New("received zero length image")
}
hash := c.hashImage(data) hash := c.hashImage(data)
c.mu.Lock() c.mu.Lock()

@ -34,9 +34,6 @@ type input struct {
} }
type Sequence struct { type Sequence struct {
// number of inputs evaluated
numPast int
// batch index // batch index
iBatch int iBatch int
@ -112,21 +109,15 @@ func (s *Server) NewSequence(prompt string, images []ImageData, params NewSequen
params.numKeep = len(inputs) params.numKeep = len(inputs)
} }
if !params.embedding { if s.model.AddBOSToken() {
// Subtracting 4 ensures that at least 1 input can be discarded during shift params.numKeep += 1
params.numKeep = min(params.numKeep, s.cache.numCtx-4)
params.numKeep += s.bosToken
} else {
// Embeddings are 1 shot - just truncate to the context window, without ever shifting
params.numKeep = min(params.numKeep, s.cache.numCtx)
} }
// truncate to fit in context window // Ensure that at least 1 input can be discarded during shift
params.numKeep = min(params.numKeep, s.cache.numCtx-1)
if len(inputs) > s.cache.numCtx { if len(inputs) > s.cache.numCtx {
slog.Warn("truncating input prompt", "limit", s.cache.numCtx, "prompt", len(inputs), "numKeep", params.numKeep) slog.Warn("input exceeds context length", "prompt", len(inputs), "limit", s.cache.numCtx)
newInputs := inputs[:params.numKeep]
newInputs = append(newInputs, inputs[len(inputs)-s.cache.numCtx+params.numKeep:]...)
inputs = newInputs
} }
var sc *llama.SamplingContext var sc *llama.SamplingContext
@ -231,9 +222,6 @@ type Server struct {
// KV cache // KV cache
cache *InputCache cache *InputCache
// does this model require a beginning of sequence token?
bosToken int
// next sequence for prompt processing to avoid starvation // next sequence for prompt processing to avoid starvation
nextSeq int nextSeq int
@ -258,18 +246,6 @@ func (s *Server) allNil() bool {
return true return true
} }
func (s *Server) shiftContext(seq *Sequence) {
numLeft := seq.numPast - seq.numKeep
numDiscard := numLeft / 2
slog.Debug("context limit hit - shifting", "limit", s.cache.numCtx, "numPast", seq.numPast,
"numKeep", seq.numKeep, "numLeft", numLeft, "numDiscard", numDiscard)
s.cache.ShiftCacheSlot(seq.cache, seq.numKeep, numDiscard, seq.numPast)
seq.numPast -= numDiscard
}
func flushPending(seq *Sequence) bool { func flushPending(seq *Sequence) bool {
joined := strings.Join(seq.pendingResponses, "") joined := strings.Join(seq.pendingResponses, "")
seq.pendingResponses = []string{} seq.pendingResponses = []string{}
@ -369,17 +345,24 @@ func (s *Server) processBatch(tokenBatch *llama.Batch, embedBatch *llama.Batch)
} }
// if past the num predict limit // if past the num predict limit
if seq.numPredict > 0 && seq.numPredicted > seq.numPredict { if seq.numPredict > 0 && seq.numPredicted >= seq.numPredict {
s.removeSequence(seqIdx, "limit") s.removeSequence(seqIdx, "limit")
continue continue
} }
if seq.numPast+len(seq.inputs) > s.cache.numCtx { var numInputsProcessed int
s.shiftContext(seq) shifted := false
for i, input := range seq.inputs {
if len(seq.cache.Inputs)+1 > s.cache.numCtx {
if !shifted {
s.cache.ShiftCacheSlot(seq.cache, seq.numKeep)
shifted = true
} else {
break
}
} }
var numInputsProcessed int
for i, input := range seq.inputs {
embedding := input.embed != nil embedding := input.embed != nil
// If we don't currently have a batch, use one of the correct type and // If we don't currently have a batch, use one of the correct type and
@ -403,13 +386,12 @@ func (s *Server) processBatch(tokenBatch *llama.Batch, embedBatch *llama.Batch)
} }
crossAttention = seq.crossAttention crossAttention = seq.crossAttention
batch.Add(input.token, input.embed, seq.numPast, numInputsProcessed+1 == len(seq.inputs), seq.cache.Id) batch.Add(input.token, input.embed, len(seq.cache.Inputs), i+1 == len(seq.inputs), seq.cache.Id)
seq.numPast++ seq.cache.Inputs = append(seq.cache.Inputs, input)
numInputsProcessed++ numInputsProcessed++
} }
if numInputsProcessed > 0 { if numInputsProcessed > 0 {
seq.cache.Inputs = append(seq.cache.Inputs, seq.inputs[:numInputsProcessed]...)
seq.inputs = seq.inputs[numInputsProcessed:] seq.inputs = seq.inputs[numInputsProcessed:]
seq.iBatch = batch.NumTokens() - 1 seq.iBatch = batch.NumTokens() - 1
} }
@ -632,7 +614,7 @@ func (s *Server) completion(w http.ResponseWriter, r *http.Request) {
s.mu.Lock() s.mu.Lock()
for i, sq := range s.seqs { for i, sq := range s.seqs {
if sq == nil { if sq == nil {
seq.cache, seq.inputs, seq.numPast, err = s.cache.LoadCacheSlot(seq.inputs, req.CachePrompt) seq.cache, seq.inputs, err = s.cache.LoadCacheSlot(seq.inputs, req.CachePrompt)
if err != nil { if err != nil {
s.mu.Unlock() s.mu.Unlock()
http.Error(w, fmt.Sprintf("Failed to load cache: %v", err), http.StatusInternalServerError) http.Error(w, fmt.Sprintf("Failed to load cache: %v", err), http.StatusInternalServerError)
@ -715,7 +697,7 @@ func (s *Server) embeddings(w http.ResponseWriter, r *http.Request) {
s.mu.Lock() s.mu.Lock()
for i, sq := range s.seqs { for i, sq := range s.seqs {
if sq == nil { if sq == nil {
seq.cache, seq.inputs, seq.numPast, err = s.cache.LoadCacheSlot(seq.inputs, req.CachePrompt) seq.cache, seq.inputs, err = s.cache.LoadCacheSlot(seq.inputs, req.CachePrompt)
if err != nil { if err != nil {
s.mu.Unlock() s.mu.Unlock()
http.Error(w, fmt.Sprintf("Failed to load cache: %v", err), http.StatusInternalServerError) http.Error(w, fmt.Sprintf("Failed to load cache: %v", err), http.StatusInternalServerError)
@ -802,10 +784,6 @@ func (s *Server) loadModel(
} }
} }
if s.model.AddBOSToken() {
s.bosToken = 1
}
if ppath != "" { if ppath != "" {
var err error var err error
s.image, err = NewImageContext(s.lc, ppath) s.image, err = NewImageContext(s.lc, ppath)
@ -814,7 +792,10 @@ func (s *Server) loadModel(
} }
} }
s.cache = NewInputCache(s.lc, kvSize, s.parallel, multiUserCache) s.cache, err = NewInputCache(s.lc, kvSize, s.parallel, multiUserCache)
if err != nil {
panic(err)
}
s.status = ServerStatusReady s.status = ServerStatusReady
s.ready.Done() s.ready.Done()
@ -837,14 +818,8 @@ func main() {
mlock := flag.Bool("mlock", false, "force system to keep model in RAM rather than swapping or compressing") mlock := flag.Bool("mlock", false, "force system to keep model in RAM rather than swapping or compressing")
tensorSplit := flag.String("tensor-split", "", "fraction of the model to offload to each GPU, comma-separated list of proportions") tensorSplit := flag.String("tensor-split", "", "fraction of the model to offload to each GPU, comma-separated list of proportions")
multiUserCache := flag.Bool("multiuser-cache", false, "optimize input cache algorithm for multiple users") multiUserCache := flag.Bool("multiuser-cache", false, "optimize input cache algorithm for multiple users")
// Expose requirements as a JSON output to stdout
requirements := flag.Bool("requirements", false, "print json requirement information") requirements := flag.Bool("requirements", false, "print json requirement information")
// These are either ignored by llama.cpp or have no significance to us
_ = flag.Bool("embedding", false, "enable embedding vector output (default: disabled)")
_ = flag.Bool("log-disable", false, "disables logging to a file")
_ = flag.Bool("memory-f32", false, "use f32 instead of f16 for memory key+value (default: disabled) not recommended: doubles context memory required and no measurable increase in quality")
flag.Parse() flag.Parse()
if *requirements { if *requirements {
printRequirements(os.Stdout) printRequirements(os.Stdout)

@ -186,7 +186,6 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, ggml *GGML, adapter
"--model", model, "--model", model,
"--ctx-size", strconv.Itoa(opts.NumCtx), "--ctx-size", strconv.Itoa(opts.NumCtx),
"--batch-size", strconv.Itoa(opts.NumBatch), "--batch-size", strconv.Itoa(opts.NumBatch),
"--embedding",
} }
if opts.NumGPU >= 0 { if opts.NumGPU >= 0 {
@ -218,10 +217,6 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, ggml *GGML, adapter
params = append(params, "--threads", strconv.Itoa(defaultThreads)) params = append(params, "--threads", strconv.Itoa(defaultThreads))
} }
if !opts.F16KV {
params = append(params, "--memory-f32")
}
flashAttnEnabled := envconfig.FlashAttention() flashAttnEnabled := envconfig.FlashAttention()
for _, g := range gpus { for _, g := range gpus {
@ -311,9 +306,9 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, ggml *GGML, adapter
// Note: we always put the dependency path first // Note: we always put the dependency path first
// since this was the exact version we compiled/linked against // since this was the exact version we compiled/linked against
if gpus[0].DependencyPath != "" { if gpus[0].DependencyPath != nil {
// assume gpus from the same library have the same dependency path // assume gpus from the same library have the same dependency path
libraryPaths = append([]string{gpus[0].DependencyPath}, libraryPaths...) libraryPaths = append(gpus[0].DependencyPath, libraryPaths...)
} }
server := filepath.Join(dir, "ollama_llama_server") server := filepath.Join(dir, "ollama_llama_server")

@ -440,7 +440,6 @@ func TestParseFileParameters(t *testing.T) {
"num_gpu 1": {"num_gpu", "1"}, "num_gpu 1": {"num_gpu", "1"},
"main_gpu 1": {"main_gpu", "1"}, "main_gpu 1": {"main_gpu", "1"},
"low_vram true": {"low_vram", "true"}, "low_vram true": {"low_vram", "true"},
"f16_kv true": {"f16_kv", "true"},
"logits_all true": {"logits_all", "true"}, "logits_all true": {"logits_all", "true"},
"vocab_only true": {"vocab_only", "true"}, "vocab_only true": {"vocab_only", "true"},
"use_mmap true": {"use_mmap", "true"}, "use_mmap true": {"use_mmap", "true"},

@ -6,10 +6,6 @@ set -e
mkdir -p dist mkdir -p dist
for TARGETARCH in arm64 amd64; do
echo "Building Go runner darwin $TARGETARCH"
rm -rf llama/build
GOOS=darwin ARCH=$TARGETARCH GOARCH=$TARGETARCH make -C llama -j 8
# These require Xcode v13 or older to target MacOS v11 # These require Xcode v13 or older to target MacOS v11
# If installed to an alternate location use the following to enable # If installed to an alternate location use the following to enable
# export SDKROOT=/Applications/Xcode_12.5.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk # export SDKROOT=/Applications/Xcode_12.5.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
@ -17,6 +13,11 @@ for TARGETARCH in arm64 amd64; do
export CGO_CFLAGS=-mmacosx-version-min=11.3 export CGO_CFLAGS=-mmacosx-version-min=11.3
export CGO_CXXFLAGS=-mmacosx-version-min=11.3 export CGO_CXXFLAGS=-mmacosx-version-min=11.3
export CGO_LDFLAGS=-mmacosx-version-min=11.3 export CGO_LDFLAGS=-mmacosx-version-min=11.3
for TARGETARCH in arm64 amd64; do
echo "Building Go runner darwin $TARGETARCH"
rm -rf llama/build
GOOS=darwin ARCH=$TARGETARCH GOARCH=$TARGETARCH make -C llama -j 8
CGO_ENABLED=1 GOOS=darwin GOARCH=$TARGETARCH go build -trimpath -o dist/ollama-darwin-$TARGETARCH CGO_ENABLED=1 GOOS=darwin GOARCH=$TARGETARCH go build -trimpath -o dist/ollama-darwin-$TARGETARCH
CGO_ENABLED=1 GOOS=darwin GOARCH=$TARGETARCH go build -trimpath -cover -o dist/ollama-darwin-$TARGETARCH-cov CGO_ENABLED=1 GOOS=darwin GOARCH=$TARGETARCH go build -trimpath -cover -o dist/ollama-darwin-$TARGETARCH-cov
done done

@ -690,7 +690,8 @@ func CopyModel(src, dst model.Name) error {
} }
func deleteUnusedLayers(deleteMap map[string]struct{}) error { func deleteUnusedLayers(deleteMap map[string]struct{}) error {
manifests, err := Manifests() // Ignore corrupt manifests to avoid blocking deletion of layers that are freshly orphaned
manifests, err := Manifests(true)
if err != nil { if err != nil {
return err return err
} }
@ -853,8 +854,8 @@ func PullModel(ctx context.Context, name string, regOpts *registryOptions, fn fu
manifest, _, err := GetManifest(mp) manifest, _, err := GetManifest(mp)
if errors.Is(err, os.ErrNotExist) { if errors.Is(err, os.ErrNotExist) {
// noop // noop
} else if err != nil && !errors.Is(err, os.ErrNotExist) { } else if err != nil {
return err slog.Warn("pulling model with bad existing manifest", "name", name, "error", err)
} else { } else {
for _, l := range manifest.Layers { for _, l := range manifest.Layers {
deleteMap[l.Digest] = struct{}{} deleteMap[l.Digest] = struct{}{}

@ -106,7 +106,8 @@ func (l *Layer) Remove() error {
return nil return nil
} }
ms, err := Manifests() // Ignore corrupt manifests to avoid blocking deletion of layers that are freshly orphaned
ms, err := Manifests(true)
if err != nil { if err != nil {
return err return err
} }

@ -123,7 +123,7 @@ func WriteManifest(name model.Name, config Layer, layers []Layer) error {
return json.NewEncoder(f).Encode(m) return json.NewEncoder(f).Encode(m)
} }
func Manifests() (map[model.Name]*Manifest, error) { func Manifests(continueOnError bool) (map[model.Name]*Manifest, error) {
manifests, err := GetManifestPath() manifests, err := GetManifestPath()
if err != nil { if err != nil {
return nil, err return nil, err
@ -145,22 +145,29 @@ func Manifests() (map[model.Name]*Manifest, error) {
if !fi.IsDir() { if !fi.IsDir() {
rel, err := filepath.Rel(manifests, match) rel, err := filepath.Rel(manifests, match)
if err != nil { if err != nil {
if !continueOnError {
return nil, fmt.Errorf("%s %w", match, err)
}
slog.Warn("bad filepath", "path", match, "error", err) slog.Warn("bad filepath", "path", match, "error", err)
continue continue
} }
n := model.ParseNameFromFilepath(rel) n := model.ParseNameFromFilepath(rel)
if !n.IsValid() { if !n.IsValid() {
if !continueOnError {
return nil, fmt.Errorf("%s %w", rel, err)
}
slog.Warn("bad manifest name", "path", rel) slog.Warn("bad manifest name", "path", rel)
continue continue
} }
m, err := ParseNamedManifest(n) m, err := ParseNamedManifest(n)
if syntax := &(json.SyntaxError{}); errors.As(err, &syntax) { if err != nil {
if !continueOnError {
return nil, fmt.Errorf("%s %w", n, err)
}
slog.Warn("bad manifest", "name", n, "error", err) slog.Warn("bad manifest", "name", n, "error", err)
continue continue
} else if err != nil {
return nil, fmt.Errorf("%s: %w", n, err)
} }
ms[n] = m ms[n] = m
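The `continueOnError` switch follows a common pattern: strict callers get the first failure wrapped with context, lenient callers get whatever parsed cleanly. A minimal sketch of that pattern, assuming a hypothetical `parseAll` helper that works on in-memory JSON strings rather than manifest files on disk:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseAll is a hypothetical helper illustrating the strict/lenient
// split. It is not ollama's Manifests function and reads no files.
func parseAll(raw map[string]string, continueOnError bool) (map[string]map[string]any, error) {
	out := make(map[string]map[string]any)
	for name, body := range raw {
		var m map[string]any
		if err := json.Unmarshal([]byte(body), &m); err != nil {
			if !continueOnError {
				// Strict mode: stop at the first bad entry and say which one.
				return nil, fmt.Errorf("%s: %w", name, err)
			}
			// Lenient mode: skip the bad entry and keep going.
			continue
		}
		out[name] = m
	}
	return out, nil
}

func main() {
	raw := map[string]string{
		"good": `{"schemaVersion": 2}`,
		"bad":  `{`,
	}

	ms, err := parseAll(raw, true)
	fmt.Println(len(ms), err)

	_, err = parseAll(raw, false)
	fmt.Println(err != nil)
}
```

In the diff, listing and deletion paths pass `true` so one corrupt manifest cannot block them, while `Serve` passes `false` as a pre-check and skips pruning if anything fails to parse.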

@ -112,7 +112,7 @@ func TestManifests(t *testing.T) {
createManifest(t, d, p) createManifest(t, d, p)
} }
ms, err := Manifests() ms, err := Manifests(true)
if err != nil { if err != nil {
t.Fatal(err) t.Fatal(err)
} }

@ -27,6 +27,16 @@ func chatPrompt(ctx context.Context, m *Model, tokenize tokenizeFunc, opts *api.
isMllama := checkMllamaModelFamily(m) isMllama := checkMllamaModelFamily(m)
var imageNumTokens int
// TODO: Ideally we would compute this from the projector metadata but some pieces are implementation dependent
if isMllama {
// Our mllama implementation packs all of the embeddings into a single token
imageNumTokens = 1
} else {
// Clip images are represented as 768 tokens, each an embedding
imageNumTokens = 768
}
n := len(msgs) - 1 n := len(msgs) - 1
// in reverse, find all messages that fit into context window // in reverse, find all messages that fit into context window
for i := n; i >= 0; i-- { for i := n; i >= 0; i-- {
@ -59,9 +69,7 @@ func chatPrompt(ctx context.Context, m *Model, tokenize tokenizeFunc, opts *api.
ctxLen := len(s) ctxLen := len(s)
if m.ProjectorPaths != nil { if m.ProjectorPaths != nil {
for _, m := range msgs[i:] { for _, m := range msgs[i:] {
// images are represented as 768 sized embeddings ctxLen += imageNumTokens * len(m.Images)
// TODO: get embedding length from project metadata
ctxLen += 768 * len(m.Images)
} }
} }
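The effect of the per-family image cost on the context budget can be shown with a short sketch. `imageTokens` is a hypothetical helper, not a function in the repository; the constants 1 and 768 are the ones that appear in the hunk above.

```go
package main

import "fmt"

// imageTokens is a hypothetical helper returning how many context
// entries a message's images are assumed to occupy.
func imageTokens(isMllama bool, numImages int) int {
	// Clip images are counted as 768 embeddings each.
	perImage := 768
	if isMllama {
		// The mllama path packs all embeddings into a single token.
		perImage = 1
	}
	return perImage * numImages
}

func main() {
	fmt.Println(imageTokens(false, 2), imageTokens(true, 2))
}
```

Counting an mllama image as 768 entries would overestimate the prompt length and could drop older messages from the context window earlier than necessary.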

@ -622,7 +622,7 @@ func (s *Server) PushHandler(c *gin.Context) {
} }
func checkNameExists(name model.Name) error { func checkNameExists(name model.Name) error {
names, err := Manifests() names, err := Manifests(true)
if err != nil { if err != nil {
return err return err
} }
@ -894,7 +894,7 @@ func getKVData(digest string, verbose bool) (llm.KV, error) {
} }
func (s *Server) ListHandler(c *gin.Context) { func (s *Server) ListHandler(c *gin.Context) {
ms, err := Manifests() ms, err := Manifests(true)
if err != nil { if err != nil {
c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
return return
@ -1211,6 +1211,9 @@ func Serve(ln net.Listener) error {
} }
if !envconfig.NoPrune() { if !envconfig.NoPrune() {
if _, err := Manifests(false); err != nil {
slog.Warn("corrupt manifests detected, skipping prune operation. Re-pull or delete to clear", "error", err)
} else {
// clean up unused layers and manifests // clean up unused layers and manifests
if err := PruneLayers(); err != nil { if err := PruneLayers(); err != nil {
return err return err
@ -1225,6 +1228,7 @@ func Serve(ln net.Listener) error {
return err return err
} }
} }
}
ctx, done := context.WithCancel(context.Background()) ctx, done := context.WithCancel(context.Background())
schedCtx, schedDone := context.WithCancel(ctx) schedCtx, schedDone := context.WithCancel(ctx)

@ -130,11 +130,11 @@ func (s *Scheduler) processPending(ctx context.Context) {
continue continue
} }
numParallel := int(envconfig.NumParallel()) numParallel := int(envconfig.NumParallel())
// TODO (jmorganca): multimodal models don't support parallel yet // TODO (jmorganca): mllama doesn't support parallel yet
// see https://github.com/ollama/ollama/issues/4165 // see https://github.com/ollama/ollama/issues/4165
if len(pending.model.ProjectorPaths) > 0 && numParallel != 1 { if checkMllamaModelFamily(pending.model) && numParallel != 1 {
numParallel = 1 numParallel = 1
slog.Warn("multimodal models don't support parallel requests yet") slog.Warn("mllama doesn't support parallel requests yet")
} }
for { for {