fix(docs): update FA FAQ wording slightly

2024-11-14 09:37:23 +11:00 · 66839c3bd7
commit 66839c3bd7
parent 7d787ba90d
1 changed file with 1 addition and 1 deletion
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -291,7 +291,7 @@ Installing multiple GPUs of the same brand can be a great way to increase your a

 Flash Attention is a feature of most (but not all) modern models that can significantly reduce memory usage as the context size grows.  To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server.

-> Note: If you're using an uncommon quantization type with CUDA, advanced users may benefit from building Ollama and passing `GGML_CUDA_FA_ALL_QUANTS=1` to the llama.cpp build to enable FA for all combinations of quantisation types. More information on this can be found in [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/fb4a0ec0833c71cff5a1a367ba375447ce6106eb/ggml/src/ggml-cuda/fattn-common.cuh#L575).
+> Note: Advanced users using CUDA may benefit from building Ollama and passing `GGML_CUDA_FA_ALL_QUANTS=1` to the llama.cpp build to enable FA for all combinations of quantisation types. More information on this can be found in [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/fb4a0ec0833c71cff5a1a367ba375447ce6106eb/ggml/src/ggml-cuda/fattn-common.cuh#L575).

 ## How can I set the quantization type for the K/V cache?