Reliable model swapping for any local OpenAI/Anthropic compatible server - llama.cpp, vllm, etc.

![llama-swap header image](docs/assets/hero3.webp)

![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/mostlygeek/llama-swap/total)
![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/mostlygeek/llama-swap/go-ci.yml)
![GitHub Repo stars](https://img.shields.io/github/stars/mostlygeek/llama-swap)

# llama-swap

Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.

Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.

## Features:

- ✅ Easy to deploy and configure: one binary, one configuration file, no external dependencies
- ✅ On-demand model switching
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.) - future-proof, upgrade your inference servers at any time
- ✅ OpenAI API supported endpoints:
  - `v1/completions`
  - `v1/chat/completions`
  - `v1/embeddings`
  - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
  - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
- ✅ Anthropic API supported endpoints:
  - `v1/messages`
- ✅ llama-server (llama.cpp) supported endpoints:
  - `v1/rerank`, `v1/reranking`, `/rerank`
  - `/infill` - for code infilling
  - `/completion` - for the completion endpoint
- ✅ llama-swap API:
  - `/ui` - web UI
  - `/upstream/:model_id` - direct access to upstream server ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
  - `/models/unload` - manually unload running models ([#58](https://github.com/mostlygeek/llama-swap/issues/58))
  - `/running` - list currently running models ([#61](https://github.com/mostlygeek/llama-swap/issues/61))
  - `/log` - remote log monitoring
  - `/health` - just returns "OK"
- ✅ Customizable:
  - Run multiple models at once with `groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
  - Automatic unloading of models after a timeout by setting a `ttl`
  - Reliable Docker and Podman support using `cmd` and `cmdStop` together
  - Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))

### Web UI

llama-swap includes a real-time web interface for monitoring logs and controlling models:

<img width="1164" height="745" alt="image" src="https://github.com/user-attachments/assets/bacf3f9d-819f-430b-9ed2-1bfaa8d54579" />

The Activity Page shows recent requests:

<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />

## Installation

llama-swap can be installed in multiple ways:

1. Docker
2. Homebrew (macOS and Linux)
3. WinGet
4. From release binaries
5. From source

### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))

Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.), including [non-root variants with improved security](docs/container-security.md).
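The `docker run` examples below mount a `config.yaml` into the container. As a rough sketch, such a file might look like the following; the llama-server location and model file are placeholders, so adjust them for the image and models volume you actually use:

```yaml
# mounted into the container as /app/config.yaml
models:
  my-model:
    # placeholder paths: point these at the llama-server binary inside
    # your chosen image and at a model file in the mounted /models volume
    cmd: llama-server --port ${PORT} -m /models/my-model.gguf
```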
```shell
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
  -v /path/to/models:/models \
  -v /path/to/custom/config.yaml:/app/config.yaml \
  ghcr.io/mostlygeek/llama-swap:cuda

# configuration hot reload supported with a
# directory volume mount
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
  -v /path/to/models:/models \
  -v /path/to/custom/config.yaml:/app/config.yaml \
  -v /path/to/config:/config \
  ghcr.io/mostlygeek/llama-swap:cuda -config /config/config.yaml -watch-config
```

<details>
<summary>more examples</summary>

```shell
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa

# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795

# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
```

</details>

### Homebrew Install (macOS/Linux)

```shell
brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
```

### WinGet Install (Windows)

> [!NOTE]
> WinGet is maintained by community contributor [Dvd-Znf](https://github.com/Dvd-Znf) ([#327](https://github.com/mostlygeek/llama-swap/issues/327)). It is not an official part of llama-swap.

```shell
# install
C:\> winget install llama-swap

# upgrade
C:\> winget upgrade llama-swap
```

### Pre-built Binaries

Binaries are available on the [release](https://github.com/mostlygeek/llama-swap/releases) page for Linux, macOS, Windows and FreeBSD.

### Building from source

1. Building requires Go and Node.js (for the UI).
2. `git clone https://github.com/mostlygeek/llama-swap.git`
3. `make clean all`
4. Look in the `build/` subdirectory for the llama-swap binary.

## Configuration

```yaml
# minimum viable config.yaml
models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf
```

That's all you need to get started:

1. `models` - holds all model configurations
2. `model1` - the ID used in API calls
3. `cmd` - the command to run to start the server
4. `${PORT}` - an automatically assigned port number

Almost all configuration settings are optional and can be added one step at a time:

- Advanced features
  - `groups` to run multiple models at once
  - `hooks` to run things on startup
  - `macros` for reusable snippets
- Model customization
  - `ttl` to automatically unload models
  - `aliases` to use familiar model names (e.g., "gpt-4o-mini")
  - `env` to pass custom environment variables to inference servers
  - `cmdStop` to gracefully stop Docker/Podman containers
  - `useModelName` to override model names sent to upstream servers
  - `${PORT}` automatic port variables for dynamic port assignment
  - `filters` to rewrite parts of requests before they are sent to the upstream server

See the [configuration documentation](docs/configuration.md) for all options.

## How does llama-swap work?

When a request is made to an OpenAI compatible endpoint, llama-swap extracts the `model` value and loads the appropriate server configuration to serve it. If the wrong upstream server is running, it is replaced with the correct one. This is where the "swap" part comes in: the upstream server is automatically swapped to handle the request correctly.

In the most basic configuration, llama-swap handles one model at a time.
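For example, with a two-model config like this sketch (model names and paths are placeholders), a request for `"model": "qwen"` stops `llama` if it is running, starts `qwen`, and then proxies the request:

```yaml
# hypothetical two-model setup; llama-swap runs only one at a time
models:
  llama:
    cmd: llama-server --port ${PORT} -m /path/to/llama-3.1-8b.gguf
    ttl: 300              # unload after 300 seconds of inactivity
  qwen:
    cmd: llama-server --port ${PORT} -m /path/to/qwen2.5-7b.gguf
    aliases:
      - gpt-4o-mini       # clients can also request this name
```

Requesting `llama` again swaps the servers back. The `ttl` and `aliases` settings are optional and shown only to illustrate the options listed above.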
For more advanced use cases, the `groups` feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.

## Reverse Proxy Configuration (nginx)

If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses, which breaks Server-Sent Events (SSE) and streaming chat completions. ([#236](https://github.com/mostlygeek/llama-swap/issues/236))

Recommended nginx configuration snippets:

```nginx
# SSE for UI events/logs
location /api/events {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

# Streaming chat completions (stream=true)
location /v1/chat/completions {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}
```

As a safeguard, llama-swap also sets `X-Accel-Buffering: no` on SSE responses. However, explicitly disabling `proxy_buffering` at your reverse proxy is still recommended for reliable streaming behavior.

## Monitoring Logs on the CLI

```shell
# sends up to the last 10KB of logs
curl 'http://host/logs'

# streams combined logs
curl -Ns 'http://host/logs/stream'

# just llama-swap's logs
curl -Ns 'http://host/logs/stream/proxy'

# just upstream's logs
curl -Ns 'http://host/logs/stream/upstream'

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'
```

## Do I need to use llama.cpp's server (llama-server)?

Any OpenAI compatible server will work. llama-swap was originally designed for llama-server and it is the best supported. For Python based inference servers like vllm or tabbyAPI, it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to `SIGTERM` signals for proper shutdown.
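For example, here is a sketch of a containerized model entry using `cmd` and `cmdStop` together. The image name, model, and container port are assumptions for illustration, not a tested recipe; check the configuration docs for the exact options your setup needs:

```yaml
models:
  vllm-qwen:
    # hypothetical vllm setup; add GPU flags (e.g. --gpus all) as needed
    cmd: |
      docker run --name vllm-qwen --init --rm -p ${PORT}:8000
      -v /path/to/hf-cache:/root/.cache/huggingface
      vllm/vllm-openai:latest --model Qwen/Qwen2.5-7B-Instruct
    # cmdStop lets llama-swap shut the container down cleanly when swapping
    cmdStop: docker stop vllm-qwen
```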
## Star History

> [!NOTE]
> ⭐️ Star this project to help others discover it!

[![Star History Chart](https://api.star-history.com/svg?repos=mostlygeek/llama-swap&type=Date)](https://www.star-history.com/#mostlygeek/llama-swap&Date)