addressing new comments after merge

Signed-off-by: Matt Williams <m@technovangelist.com>
Matt Williams 2023-10-15 14:17:23 -07:00
parent b2974a7095
commit 4522109b11
2 changed files with 17 additions and 2 deletions


@@ -124,7 +124,7 @@ PARAMETER <parameter> <parametervalue>
| repeat_last_n | Sets how far back the model looks back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 |
| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 |
| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 |
| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. | int | seed 42 |
| stop | Sets the stop sequences to use. When this pattern is encountered, the LLM will stop generating text and return. | string | stop "AI assistant:" |
| tfs_z | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (Default: 1) | float | tfs_z 1 |
| num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 |
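For instance, a Modelfile that sets a few of these values might look like the sketch below; the base model name and the particular values are only for illustration.

```
FROM llama2
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.1
PARAMETER num_predict 128
PARAMETER stop "AI assistant:"
```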


@@ -1,6 +1,6 @@
# How to Quantize a Model
Sometimes the model you want to work with is not available at [https://ollama.ai/library](https://ollama.ai/library). If you want to try out that model before we have a chance to quantize it, you can use this process.
Sometimes the model you want to work with is not available at [https://ollama.ai/library](https://ollama.ai/library).
## Figure out if we can run the model
@@ -37,6 +37,20 @@ This will output two files into the directory. First is a f16.bin file that is t
You can find the repository for the Docker container here: [https://github.com/mxyng/quantize](https://github.com/mxyng/quantize)
For instance, if you wanted to convert the Mistral 7B model to a Q4 quantized model, then you could go through the following steps:
1. First verify the model will potentially work.
2. Now clone Mistral 7B to your machine. You can find the command to run when you click the three vertical dots button on the model page, then click **Clone Repository**.
   1. For this repo, the command is:

      ```shell
      git lfs install
      git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
      ```

   2. Navigate into the new directory and run `docker run --rm -v .:/repo ollama/quantize -q q4_0 /repo`
   3. Now you can create a Modelfile using the q4_0.bin file that was created (see the example below).
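As a rough sketch, assuming the quantized weights ended up as `q4_0.bin` in the current directory and you want to call the model `mistral-7b-q4` (both names are just placeholders), the Modelfile could be as simple as:

```
FROM ./q4_0.bin
```

and the model can then be built and tried out with `ollama create mistral-7b-q4 -f Modelfile` followed by `ollama run mistral-7b-q4`.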
## Convert and Quantize Manually
### Clone llama.cpp to your machine
@@ -48,6 +62,7 @@ If we know the model has a chance of working, then we need to convert and quanti
[`git clone https://github.com/ggerganov/llama.cpp.git`](https://github.com/ggerganov/llama.cpp.git)
1. If you don't have git installed, download this zip file and unzip it to that location: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip
3. Install the Python dependencies: `pip install torch transformers sentencepiece`
4. Run `make` to build the project and the quantize executable.
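Taken together, and assuming `git`, Python, and `make` are already installed on your machine, those steps would look something like this in a shell:

```shell
# clone llama.cpp and move into it
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# install the Python dependencies used by the conversion scripts
pip install torch transformers sentencepiece

# build llama.cpp, including the quantize executable
make
```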
### Convert the model to GGUF