<a name="top"></a>
<h1 align="center">
mistral.rs
</h1>
<h3 align="center">
Blazingly fast LLM inference.
</h3>
<p align="center">
| <a href="https://ericlbuehler.github.io/mistral.rs/mistralrs/"><b>Rust Documentation</b></a> | <a href="https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-pyo3/API.md"><b>Python Documentation</b></a> | <a href="https://discord.gg/SZrecqK8qw"><b>Discord</b></a> | <a href="https://matrix.to/#/#mistral.rs:matrix.org"><b>Matrix</b></a> |
</p>
Please submit requests for new models [here](https://github.com/EricLBuehler/mistral.rs/issues/156).
## Get started fast
1) [Install](#installation-and-build)
2) [Get models](#getting-models)
3) Deploy with our easy to use APIs
- [Python](examples/python)
- [Rust](mistralrs/examples)
- [OpenAI compatible HTTP server](docs/HTTP.md)
## Quick examples
*After following the installation instructions*
- Check out UQFF for prequantized models of various quantization methods!
  - Models can be found [here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).
- Try out AFQ ISQ (2, 3, 4, 6, 8 bit) for blazingly fast Metal performance in all models!
```
./mistralrs-server -i --isq afq8 plain -m meta-llama/Llama-3.2-3B-Instruct
```
- Easily add web search capabilities to your models! Compatible with OpenAI's `web_search_options` parameter: [documentation](docs/WEB_SEARCH.md)
```
./mistralrs-server --enable-search --port 1234 --isq q4k plain -m NousResearch/Hermes-3-Llama-3.1-8B
```
- Run the **Llama 4** models with long context length and vision support: [documentation](docs/LLAMA4.md)
```
./mistralrs-server -i --isq q4k vision-plain -m meta-llama/Llama-4-Scout-17B-16E-Instruct -a llama4
```
- Run the **Qwen 3** hybrid reasoning models with full tool calling support: [documentation](docs/QWEN3.md)
```
./mistralrs-server -i --isq 4 plain -m Qwen/Qwen3-8B
```
- Run the entire **Gemma 3** model family (1b, 4b, 12b, 27b) with 128k context length and vision support: [documentation](docs/GEMMA3.md)
```
./mistralrs-server -i vision-plain -m google/gemma-3-4b-it -a gemma3
```
- Run the **Mistral 3** Model with 128k context length and strong vision support: [documentation](docs/MISTRAL3.md)
```
./mistralrs-server -i --isq q4k vision-plain -m mistralai/Mistral-Small-3.1-24B-Instruct-2503 -a mistral3
```
- Run the **Deepseek R1/V3** model with automatic **tensor parallelism**: [documentation](docs/DEEPSEEKV3.md)
```
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1
```
- Run the **Deepseek R1** [distillations](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) out of the box
```
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
```
- Run the **Llama 3.2 Vision** Model: [documentation and guide here](docs/VLLAMA.md)
<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "400" height = "267">
<h6><a href = "https://www.nhmagazine.com/mount-washington/">Credit</a></h6>
```
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
```
- Run the **Phi 4 Multimodal** model: [documentation and guide here](docs/PHI4MM.md)
```
./mistralrs-server -i vision-plain -m microsoft/Phi-4-multimodal-instruct -a phi4mm
```
- Run the new **Phi 4/Phi 4 Mini** models with 128K context window
```
./mistralrs-server -i plain -m microsoft/Phi-4-mini-instruct -a phi3
```
- Enhance ISQ by collecting an imatrix from calibration data: [documentation](docs/IMATRIX.md)
```
./mistralrs-server -i --isq Q4K plain -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt
```
- Run the **FLUX.1** diffusion model: [documentation and guide here](docs/FLUX.md)
<img src="https://github.com/user-attachments/assets/82bf5009-e3e9-402b-acf9-c48a52c7721b" width = "400" height = "267">
```
./mistralrs-server --port 1234 diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
```
- Other models: [see a support matrix](#support-matrix) and [how to run them](#run-with-the-cli)
Mistral.rs supports several model categories:
- Text to Text
- Text+Image to Text: Vision (see [the docs](docs/VISION_MODELS.md))
- Text to Image: Image Generation (see [the docs](docs/IMAGEGEN_MODELS.md))
## Description
**Easy**:
- Lightweight OpenAI API compatible HTTP server
- Python API
- Grammar support with JSON Schema, Regex, Lark, and Guidance via [LLGuidance library](https://github.com/microsoft/llguidance)
- [ISQ](docs/ISQ.md) (In situ quantization): run `.safetensors` models directly from Hugging Face by quantizing in-place
- Enhance performance with an [imatrix](docs/IMATRIX.md)!
- Automatic [device mapping](docs/DEVICE_MAPPING.md) to easily load and run models across multiple GPUs and CPU.
- Specify custom chat templates easily: [chat templates](docs/CHAT_TOK.md)
**Fast**:
- Apple silicon support: ARM NEON, Accelerate, Metal
- Accelerated CPU inference with MKL, AVX support
- CUDA support with FlashAttention and cuDNN.
- Automatic tensor-parallelism support with NCCL: [distributed documentation](docs/DISTRIBUTED.md)
**Quantization**:
- [Details](docs/QUANTS.md)
- GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with imatrix support
- GPTQ: 2-bit, 3-bit, 4-bit and 8-bit, with [Marlin](https://github.com/IST-DASLab/marlin) kernel support in 4-bit and 8-bit.
- AFQ: 2-bit, 3-bit, 4-bit, 6-bit and 8-bit, designed to be fast on Metal!
- HQQ: 4-bit and 8 bit, with ISQ support
- FP8
- BNB: bitsandbytes int8, fp4, nf4 support
- Easily run MLX prequantized models
- Automatic ISQ to select the fastest and most accurate quantization method.
**Powerful**:
- LoRA support with weight merging
- The first X-LoRA inference platform, with first-class support
- [AnyMoE](docs/ANYMOE.md): Build a memory-efficient MoE model from anything, in seconds
- Various [sampling and penalty](docs/SAMPLING.md) methods
- Native tool calling support for Llama, Mistral Small, Mistral Nemo, Hermes, and DeepSeek models: [docs](docs/TOOL_CALLING.md)
- Prompt chunking: process large prompts in a more manageable way
**Advanced features**:
- [PagedAttention](docs/PAGED_ATTENTION.md) and continuous batching (CUDA and Metal support)
- [FlashAttention](docs/FLASH_ATTENTION.md) V2/V3
- Prefix caching, including support for multimodal prefix caching
- [Topology](docs/TOPOLOGY.md): Configure ISQ and device mapping easily
- [UQFF](docs/UQFF.md): Quantized file format for easy mixing of quants, [collection here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter activation with adapter preloading: [examples and docs](docs/ADAPTER_MODELS.md#adapter-model-dynamic-adapter-activation)
- Integrated agentic web search capabilities, enabling models to easily access the internet.
**Documentation for mistral.rs can be found [here](docs/README.md).**
This is a demo of interactive mode with streaming, running Phi 3 128k mini quantized to Q4K via ISQ.
<!-- Mistral GGUF demo, old API -->
<!-- https://github.com/EricLBuehler/mistral.rs/assets/65165915/3396abcd-8d44-4bf7-95e6-aa532db09415 -->
https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c
## Architecture Support matrix
> Note: See [supported models](#supported-models) for more information
|Model|Supports quantization|Supports adapters|Supports device mapping|Supported by AnyMoE|
|--|--|--|--|--|
|Mistral v0.1/v0.2/v0.3|✅|✅|✅|✅|
|Gemma|✅|✅|✅|✅|
|Llama 3.1/3.2|✅|✅|✅|✅|
|Mixtral|✅|✅|✅| |
|Phi 2|✅|✅|✅|✅|
|Phi 3|✅|✅|✅|✅|
|Phi 3.5 MoE|✅| |✅| |
|Qwen 2.5|✅| |✅|✅|
|Phi 3 Vision|✅| |✅|✅|
|Idefics 2|✅| |✅|✅|
|Gemma 2|✅|✅|✅|✅|
|Starcoder 2|✅|✅|✅|✅|
|LLaVa Next|✅| |✅|✅|
|LLaVa|✅| |✅|✅|
|Llama 3.2 Vision|✅| |✅| |
|Qwen2-VL|✅| |✅| |
|Idefics 3|✅| |✅|✅|
|DeepseekV2|✅| |✅| |
|DeepseekV3|✅| |✅| |
|MiniCPM-O 2.6|✅| |✅| |
|Phi 4 Multimodal|✅| |✅| |
|Qwen2.5-VL|✅| |✅| |
|Gemma 3|✅| |✅|✅|
|Mistral 3|✅| |✅|✅|
|Llama 4|✅| |✅| |
|Qwen 3|✅| |✅| |
## APIs and Integrations
### Rust Crate
Rust multithreaded/async API for easy integration into any application.
- [Docs](https://ericlbuehler.github.io/mistral.rs/mistralrs/)
- [Examples](mistralrs/examples/)
- To install: add `mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }` to your `Cargo.toml` dependencies
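For orientation, here is a minimal sketch of a chat request using the crate's builder-style API. The type and method names (`TextModelBuilder`, `IsqType`, `TextMessages`, `send_chat_request`) are assumptions based on the bundled examples and may differ between versions; treat [the examples](mistralrs/examples/) as the source of truth.
```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load a model from the Hugging Face Hub and quantize it in-place via ISQ.
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .build()
        .await?;

    // Build a chat request and send it.
    let messages = TextMessages::new()
        .add_message(TextMessageRole::System, "You are a helpful assistant.")
        .add_message(TextMessageRole::User, "Explain ISQ in one sentence.");
    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```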
### Python API
Python API for mistral.rs.
- [Installation including PyPI](mistralrs-pyo3/_README.md)
- [Docs](mistralrs-pyo3/API.md)
- [Examples](examples/python)
- [Cookbook](examples/python/cookbook.ipynb)
### HTTP Server
OpenAI API compatible HTTP server.
- [API Docs](docs/HTTP.md).
- [Running](README.md#run-with-the-cli)
- [Example](examples/server/chat.py)
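Because the server follows the standard OpenAI chat completions protocol, any OpenAI-compatible client can talk to it. A minimal sketch in Rust using `reqwest` and `serde_json` (the port matches the examples in this README; the `model` field value is illustrative):
```rust
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes a server started with: ./mistralrs-server --port 1234 ...
    // The model name below is illustrative; the server serves whichever model it was launched with.
    let body = json!({
        "model": "default",
        "messages": [
            { "role": "user", "content": "Say hello in one sentence." }
        ]
    });

    let response: Value = reqwest::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}
```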
### Llama Index integration (Python)
- Docs: https://docs.llamaindex.ai/en/stable/examples/llm/mistral_rs/
---
## Supported accelerators
- NVIDIA GPUs (CUDA):
- Compile with the `cuda` feature: `--features cuda`
- FlashAttention support: compile with the `flash-attn` feature
- cuDNN support: compile with the `cudnn` feature: `--features cudnn`
- Apple Silicon GPU (Metal):
- Compile with the `metal` feature: `--features metal`
- CPU:
- Intel MKL: compile with the `mkl` feature: `--features mkl`
- Apple Accelerate: compile with the `accelerate` feature: `--features accelerate`
- ARM NEON and AVX are used automatically
Enabling features is done by passing `--features ...` to the build system. When using `cargo run` or `maturin develop`, pass the `--features` flag before the `--` separating build flags from runtime flags.
- To enable a single feature like `metal`: `cargo build --release --features metal`.
- To enable multiple features, specify them in quotes: `cargo build --release --features "cuda flash-attn cudnn"`.
## Installation and Build
> Note: You can use our [Docker containers here](https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs).
> Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
- Install the [Python package here](mistralrs-pyo3/_README.md).
- The Python package has [wheels on PyPi](mistralrs-pyo3/_README.md#installation-from-pypi)!
1) Install required packages:
- `OpenSSL` (*Example on Ubuntu:* `sudo apt install libssl-dev`)
- <b>*Linux only:*</b> `pkg-config` (*Example on Ubuntu:* `sudo apt install pkg-config`)
2) Install Rust: https://rustup.rs/
*Example on Ubuntu:*
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
3) <b>*Optional:*</b> Set the HF token correctly (skip if it is already set, if your model is not gated, or if you want to use the `token_source` parameters in Python or on the command line).
- Note: you can install `huggingface-cli` as documented [here](https://huggingface.co/docs/huggingface_hub/en/installation).
```bash
huggingface-cli login
```
4) Download the code:
```bash
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
```
5) Build or install:
- Base build command
```bash
cargo build --release
```
- Build with CUDA support
```bash
cargo build --release --features cuda
```
- Build with CUDA and Flash Attention V2 support
```bash
cargo build --release --features "cuda flash-attn"
```
- Build with Metal support
```bash
cargo build --release --features metal
```
- Build with Accelerate support
```bash
cargo build --release --features accelerate
```
- Build with MKL support
```bash
cargo build --release --features mkl
```
- Install with `cargo install` for easy command line usage
Pass the same values to `--features` as you would for `cargo build`
```bash
cargo install --path mistralrs-server --features cuda
```
6) The build process outputs the binary `mistralrs-server` at `./target/release/mistralrs-server`. Switch to that directory so the binary can be run as `./mistralrs-server`:
*Example on Ubuntu:*
```
cd target/release
```
7) Use our APIs and integrations:
[APIs and integrations list](#apis-and-integrations)
## Getting models
There are 2 ways to get models with mistral.rs:
- From Hugging Face Hub (easiest)
- From local files
- Running a GGUF model
- Specify local paths
### Getting models from Hugging Face Hub
Mistral.rs can automatically download models from HF Hub. To access gated models, you should provide a token source. It may be one of:
- `literal:<value>`: Load from a specified literal
- `env:<value>`: Load from a specified environment variable
- `path:<value>`: Load from a specified file
- `cache` (**default**): Load the HF token from `~/.cache/huggingface/token` or equivalent.
- `none`: Use no HF token
This is passed in the following ways:
- Command line:
```bash
./mistralrs-server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```
- Python:
[Here](examples/python/token_source.py) is an example of setting the token source.
If the token cannot be loaded, no token will be used (i.e., effectively `none`).
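From the Rust crate, the token source can be set programmatically. A sketch under the assumption that the builder exposes a `with_token_source` method and a `TokenSource` enum mirroring the variants above; check the [Rust docs](https://ericlbuehler.github.io/mistral.rs/mistralrs/) for the exact names:
```rust
use anyhow::Result;
use mistralrs::{TextModelBuilder, TokenSource};

#[tokio::main]
async fn main() -> Result<()> {
    // Mirrors `env:HF_TOKEN` on the command line; the method and variant
    // names here are assumptions -- verify them against the crate docs.
    let _model = TextModelBuilder::new("microsoft/Phi-3-mini-128k-instruct")
        .with_token_source(TokenSource::EnvVar("HF_TOKEN".to_string()))
        .build()
        .await?;
    Ok(())
}
```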
### Loading models from local files
You can also instruct mistral.rs to load models fully locally by modifying the `*_model_id` arguments or options:
```bash
./mistralrs-server --port 1234 plain -m . -a mistral
```
Throughout mistral.rs, any model ID argument or option may be a local path; the path should contain the following files, depending on the model ID option:
- `--model-id` (server) or `model_id` (python/rust) or `--tok-model-id` (server) or `tok_model_id` (python/rust):
- `config.json`
- `tokenizer_config.json`
- `tokenizer.json` (if not specified separately)
- `.safetensors`/`.bin`/`.pth`/`.pt` files (defaults to `.safetensors`)
- `preprocessor_config.json` (required for vision models).
- `processor_config.json` (optional for vision models).
- `--quantized-model-id` (server) or `quantized_model_id` (python/rust):
- Specified `.gguf` or `.ggml` file.
- `--x-lora-model-id` (server) or `xlora_model_id` (python/rust):
- `xlora_classifier.safetensors`
- `xlora_config.json`
- Adapters `.safetensors` and `adapter_config.json` files in their respective directories
- `--adapters-model-id` (server) or `adapters_model_id` (python/rust):
- Adapters `.safetensors` and `adapter_config.json` files in their respective directories
### Running GGUF models
To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.
You must also specify either `-i` for interactive mode or `--port` to launch a server, just like when [running a non-GGUF model with the CLI](#run-with-the-cli).
GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options:
1) [With a specified tokenizer](#with-a-specified-tokenizer)
2) [With the builtin tokenizer](#with-the-builtin-tokenizer)
#### With a specified tokenizer
Running with a tokenizer model ID enables you to specify the model ID to source the tokenizer from:
```bash
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf -t microsoft/Phi-3.5-mini-instruct
```
If the specified tokenizer model ID contains a `tokenizer.json`, then it will be used over the GGUF tokenizer.
#### With the builtin tokenizer
Using the builtin tokenizer:
```bash
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf
```
(or using a local file):
```bash
./mistralrs-server gguf -m path/to/files -f Phi-3.5-mini-instruct-Q4_K_M.gguf
```
There are a few more things you can configure:
**Chat template:**
The chat template will be automatically detected and loaded from the GGUF file if no other chat template source (including the tokenizer model ID) is specified.
If that does not work, you can either [provide a tokenizer](#with-a-specified-tokenizer) (recommended), or specify a custom chat template.
```bash
./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3.5-mini-instruct-Q4_K_M.gguf
```
**Tokenizer**
The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in the examples above, where the tokenizer is sourced from Hugging Face.
**Supported GGUF tokenizer types**
- `llama` (sentencepiece)
- `gpt2` (BPE)
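The same GGUF flow is also available from the Rust crate. A sketch assuming a `GgufModelBuilder` that takes the quantized model ID plus filenames and an optional tokenizer model ID (names follow the crate's builder pattern and may differ; see [the examples](mistralrs/examples/)):
```rust
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Quantized weights from the GGUF repo, tokenizer sourced from the original model.
    let model = GgufModelBuilder::new(
        "bartowski/Phi-3.5-mini-instruct-GGUF",
        vec!["Phi-3.5-mini-instruct-Q4_K_M.gguf"],
    )
    .with_tok_model_id("microsoft/Phi-3.5-mini-instruct")
    .with_logging()
    .build()
    .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "What does the GGUF format store?");
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```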
## Run with the CLI
Mistral.rs uses subcommands to control the model type. Please run `./mistralrs-server --help` to see the subcommands which categorize the models by kind.
### Architecture for plain models
> Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`). For quantized models (gguf/ggml), you may specify a data type of `f32` or `bf16` (`f16` is not recommended due to its lower precision in quantized inference).
If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
- `mistral`
- `gemma`
- `mixtral`
- `llama`
- `phi2`
- `phi3`
- `phi3.5moe`
- `qwen2`
- `gemma2`
- `starcoder2`
- `deepseekv2`
- `deepseekv3`
- `qwen3`
- `qwen3moe`
### Architecture for vision models
> Note: for vision models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`vision-plain`).
- `phi3v`
- `idefics2`
- `llava_next`
- `llava`
- `vllama`
- `qwen2vl`
- `idefics3`
- `minicpmo`
- `phi4mm`
- `qwen2_5vl`
- `gemma3`
- `mistral3`
- `llama4`
### Supported GGUF architectures
**Plain:**
- `llama`
- `phi2`
- `phi3`
- `starcoder2`
- `qwen2`
**With adapters:**
- `llama`
- `phi3`
### Interactive mode
You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:
```bash
./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```
Vision models work too:
```bash
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
```
And even diffusion models:
```bash
./mistralrs-server -i diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
```
On Apple Silicon (`Metal`), run with a throughput log, PagedAttention limited to 4 GB for the KV cache, and bf16 for the KV cache and attention:
```bash
cargo build --release --features metal
./target/release/mistralrs-server -i --throughput --paged-attn --pa-gpu-mem 4096 gguf --dtype bf16 -m /Users/Downloads/ -f Phi-3.5-mini-instruct-Q4_K_M.gguf
```
### OpenAI HTTP server
You can launch an HTTP server:
```bash
./mistralrs-server --port 1234 plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe
```
### Structured selection with a `.toml` file
We provide a method to select models with a `.toml` file. The keys are the same as the command line, with `no_kv_cache` and `tokenizer_json` being "global" keys.
Example:
```bash
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml
```
---
Please submit more benchmarks by raising an issue!
## Supported models
**Quantization support**
|Model|GGUF|GGML|ISQ|
|--|--|--|--|
|Mistral|✅| |✅|
|Gemma| | |✅|
|Llama|✅|✅|✅|
|Mixtral|✅| |✅|
|Phi 2|✅| |✅|
|Phi 3|✅| |✅|
|Phi 3.5 MoE| | |✅|
|Qwen 2.5| | |✅|
|Phi 3 Vision| | |✅|
|Idefics 2| | |✅|
|Gemma 2| | |✅|
|Starcoder 2| |✅|✅|
|LLaVa Next| | |✅|
|LLaVa| | |✅|
|Llama 3.2 Vision| | |✅|
|Qwen2-VL| | |✅|
|Idefics 3| | |✅|
|Deepseek V2| | |✅|
|Deepseek V3| | |✅|
|MiniCPM-O 2.6| | |✅|
|Qwen2.5-VL| | |✅|
|Gemma 3| | |✅|
|Mistral 3| | |✅|
|Llama 4| | |✅|
|Qwen 3| | |✅|
**Device mapping support**
|Model category|Supported|
|--|--|
|Plain|✅|
|GGUF|✅|
|GGML| |
|Vision Plain|✅|
**X-LoRA and LoRA support**
|Model|X-LoRA|X-LoRA+GGUF|X-LoRA+GGML|
|--|--|--|--|
|Mistral|✅|✅| |
|Gemma|✅| | |
|Llama|✅|✅|✅|
|Mixtral|✅|✅| |
|Phi 2|✅| | |
|Phi 3|✅|✅| |
|Phi 3.5 MoE| | | |
|Qwen 2.5| | | |
|Phi 3 Vision| | | |
|Idefics 2| | | |
|Gemma 2|✅| | |
|Starcoder 2|✅| | |
|LLaVa Next| | | |
|LLaVa| | | |
|Qwen2-VL| | | |
|Idefics 3| | | |
|Deepseek V2| | | |
|Deepseek V3| | | |
|MiniCPM-O 2.6| | | |
|Qwen2.5-VL| | | |
|Gemma 3| | | |
|Mistral 3| | | |
|Llama 4| | | |
|Qwen 3| | | |
**AnyMoE support**
|Model|AnyMoE|
|--|--|
|Mistral 7B|✅|
|Gemma|✅|
|Llama|✅|
|Mixtral| |
|Phi 2|✅|
|Phi 3|✅|
|Phi 3.5 MoE| |
|Qwen 2.5|✅|
|Phi 3 Vision| |
|Idefics 2| |
|Gemma 2|✅|
|Starcoder 2|✅|
|LLaVa Next|✅|
|LLaVa|✅|
|Llama 3.2 Vision| |
|Qwen2-VL| |
|Idefics 3|✅|
|Deepseek V2| |
|Deepseek V3| |
|MiniCPM-O 2.6| |
|Qwen2.5-VL| |
|Gemma 3|✅|
|Mistral 3|✅|
|Llama 4| |
|Qwen 3| |
### Using derivative models
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand. For example, when using a model other than the default, specify the following for each type of model:
- **Plain**: Model id
- **Quantized**: Quantized model id, quantized filename, and tokenizer id
- **X-LoRA**: Model id, X-LoRA ordering
- **X-LoRA quantized**: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
- **LoRA**: Model id
- **LoRA quantized**: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
- **Vision Plain**: Model id
See [this](#adapter-ordering-file) section to determine whether you need to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules, the architecture, or the adapter order changed.
It is also important to check the chat template style of the model. If the HF hub repo has a `tokenizer_config.json` file, it is not necessary to specify one. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.
For example, when using a Zephyr model:
`./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf`
### Adapter model support: X-LoRA and LoRA
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the `x-lora-*` architecture, and LoRA support by selecting the `lora-*` architecture. Please find docs for adapter models [here](docs/ADAPTER_MODELS.md). Examples may be found [here](docs/LORA_XLORA.md).
### Chat Templates and Tokenizer
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation [here](docs/CHAT_TOK.md).
## Contributing
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request.
If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
## FAQ
- Setting the environment variable `MISTRALRS_DEBUG=1` enables the following debugging behavior:
- If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor.
- `mistralrs_gguf_tensors.txt` or `mistralrs_ggml_tensors.txt`
- More logging.
- Setting the CUDA compiler path:
- Set the `NVCC_CCBIN` environment variable during build.
- Error: `recompile with -fPIE`:
- Some Linux distributions require compiling with `-fPIE`.
- Set the `CUDA_NVCC_FLAGS` environment variable to `-fPIE` during build: `CUDA_NVCC_FLAGS=-fPIE`
- Error `CUDA_ERROR_NOT_FOUND` or symbol not found when using a normal or vision model:
- For non-quantized models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device.
- What is the minimum supported CUDA compute cap?
- The minimum CUDA compute cap is **5.3**.
- Metal not found (error: unable to find utility "metal", not a developer tool or in PATH)
1) Install Xcode: `xcode-select --install`
2) Set the active developer directory: `sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer`
## Credits
This project would not be possible without the excellent work at [`candle`](https://github.com/huggingface/candle). Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.
<p align="right">
<a href="#top">ā¬ļø Back to Top</a>
</p>
", Assign "at most 3 tags" to the expected json: {"id":"9685","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"