Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud.

<div align="center">
    <h1>🦾 OpenLLM: Self-Hosting LLMs Made Easy</h1>

[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202-green.svg)](https://github.com/bentoml/OpenLLM/blob/main/LICENSE)
[![Releases](https://img.shields.io/pypi/v/openllm.svg?logo=pypi&label=PyPI&logoColor=gold)](https://pypi.org/project/openllm)
[![CI](https://results.pre-commit.ci/badge/github/bentoml/OpenLLM/main.svg)](https://results.pre-commit.ci/latest/github/bentoml/OpenLLM/main)
[![X](https://badgen.net/badge/icon/@bentomlai/000000?icon=twitter&label=Follow)](https://twitter.com/bentomlai)
[![Community](https://badgen.net/badge/icon/Community/562f5d?icon=slack&label=Join)](https://l.bentoml.com/join-slack)
</div>

OpenLLM allows developers to run **any open-source LLMs** (Llama 3.3, Qwen2.5, Phi3 and [more](#supported-models)) or **custom models** as **OpenAI-compatible APIs** with a single command. It features a [built-in chat UI](#chat-ui), state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployments with Docker, Kubernetes, and [BentoCloud](#deploy-to-bentocloud).

Understand the [design philosophy of OpenLLM](https://www.bentoml.com/blog/from-ollama-to-openllm-running-llms-in-the-cloud).

## Get Started

Run the following commands to install OpenLLM and explore it interactively.

```bash
pip install openllm  # or pip3 install openllm
openllm hello
```

![hello](https://github.com/user-attachments/assets/5af19f23-1b34-4c45-b1e0-a6798b4586d1)

## Supported models

OpenLLM supports a wide range of state-of-the-art open-source LLMs. You can also add a [model repository to run custom models](#set-up-a-custom-repository) with OpenLLM.
<table>
  <tr><th>Model</th><th>Parameters</th><th>Required GPU</th><th>Start a Server</th></tr>
  <tr><td>deepseek</td><td>r1-671b</td><td>80Gx16</td><td><code>openllm serve deepseek:r1-671b</code></td></tr>
  <tr><td>gemma2</td><td>2b</td><td>12G</td><td><code>openllm serve gemma2:2b</code></td></tr>
  <tr><td>gemma3</td><td>3b</td><td>12G</td><td><code>openllm serve gemma3:3b</code></td></tr>
  <tr><td>jamba1.5</td><td>mini-ff0a</td><td>80Gx2</td><td><code>openllm serve jamba1.5:mini-ff0a</code></td></tr>
  <tr><td>llama3.1</td><td>8b</td><td>24G</td><td><code>openllm serve llama3.1:8b</code></td></tr>
  <tr><td>llama3.2</td><td>1b</td><td>24G</td><td><code>openllm serve llama3.2:1b</code></td></tr>
  <tr><td>llama3.3</td><td>70b</td><td>80Gx2</td><td><code>openllm serve llama3.3:70b</code></td></tr>
  <tr><td>llama4</td><td>17b16e</td><td>80Gx8</td><td><code>openllm serve llama4:17b16e</code></td></tr>
  <tr><td>mistral</td><td>8b-2410</td><td>24G</td><td><code>openllm serve mistral:8b-2410</code></td></tr>
  <tr><td>mistral-large</td><td>123b-2407</td><td>80Gx4</td><td><code>openllm serve mistral-large:123b-2407</code></td></tr>
  <tr><td>phi4</td><td>14b</td><td>80G</td><td><code>openllm serve phi4:14b</code></td></tr>
  <tr><td>pixtral</td><td>12b-2409</td><td>80G</td><td><code>openllm serve pixtral:12b-2409</code></td></tr>
  <tr><td>qwen2.5</td><td>7b</td><td>24G</td><td><code>openllm serve qwen2.5:7b</code></td></tr>
  <tr><td>qwen2.5-coder</td><td>3b</td><td>24G</td><td><code>openllm serve qwen2.5-coder:3b</code></td></tr>
  <tr><td>qwq</td><td>32b</td><td>80G</td><td><code>openllm serve qwq:32b</code></td></tr>
</table>

For the full model list, see the [OpenLLM models repository](https://github.com/bentoml/openllm-models).

## Start an LLM server

To start an LLM server locally, use the `openllm serve` command and specify the model version.

> [!NOTE]
> OpenLLM does not store model weights. A Hugging Face token (HF_TOKEN) is required for gated models.
>
> 1. Create your Hugging Face token [here](https://huggingface.co/settings/tokens).
> 2. Request access to the gated model, such as [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
> 3. Set your token as an environment variable by running:
>    ```bash
>    export HF_TOKEN=<your token>
>    ```

```bash
openllm serve llama3.2:1b
```

The server will be accessible at [http://localhost:3000](http://localhost:3000/), providing OpenAI-compatible APIs for interaction. You can call the endpoints with different frameworks and tools that support OpenAI-compatible APIs. Typically, you may need to specify the following:

- **The API host address**: By default, the LLM is hosted at [http://localhost:3000](http://localhost:3000/).
- **The model name**: The name can be different depending on the tool you use.
- **The API key**: The API key used for client authentication. This is optional.
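Before wiring up a client, you can confirm the server is reachable and check the exact model name it exposes by querying the OpenAI-compatible `/v1/models` endpoint. This is a minimal sketch, assuming the server runs at the default http://localhost:3000 and the `requests` package is installed:

```python
# Quick check that the OpenLLM server is up and which model it serves.
# Assumes the default local address; adjust BASE_URL if you changed the port.
import requests

BASE_URL = "http://localhost:3000/v1"

response = requests.get(f"{BASE_URL}/models", timeout=10)
response.raise_for_status()

# The OpenAI-compatible response lists models under the "data" key.
for model in response.json().get("data", []):
    print(model["id"])  # e.g. meta-llama/Llama-3.2-1B-Instruct
```

The printed `id` is the value to pass as `model` in the client examples below.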
Here are some examples:

<details>

<summary>OpenAI Python client</summary>

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following function to get the available models:
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```

</details>

<details>

<summary>LlamaIndex</summary>

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(api_base="http://localhost:3000/v1", model="meta-llama/Llama-3.2-1B-Instruct", api_key="dummy")
...
```

</details>

## Chat UI

OpenLLM provides a chat UI at the `/chat` endpoint of the running LLM server, available at http://localhost:3000/chat.

<img width="800" alt="openllm_ui" src="https://github.com/bentoml/OpenLLM/assets/5886138/8b426b2b-67da-4545-8b09-2dc96ff8a707">

## Chat with a model in the CLI

To start a chat conversation in the CLI, use the `openllm run` command and specify the model version.

```bash
openllm run llama3:8b
```

## Model repository

A model repository in OpenLLM represents a catalog of available LLMs that you can run. OpenLLM provides a default model repository that includes the latest open-source LLMs like Llama 3, Mistral, and Qwen2, hosted at [this GitHub repository](https://github.com/bentoml/openllm-models). To see all available models from the default and any added repository, use:

```bash
openllm model list
```

To ensure your local list of models is synchronized with the latest updates from all connected repositories, run:

```bash
openllm repo update
```

To review a model's information, run:

```bash
openllm model get llama3.2:1b
```

### Add a model to the default model repository

You can contribute to the default model repository by adding new models that others can use. This involves creating and submitting a Bento of the LLM. For more information, check out this [example pull request](https://github.com/bentoml/openllm-models/pull/1).

### Set up a custom repository

You can add your own repository to OpenLLM with custom models. To do so, follow the format of the default OpenLLM model repository, which uses a `bentos` directory to store custom LLMs. You need to [build your Bentos with BentoML](https://docs.bentoml.com/en/latest/guides/build-options.html) and submit them to your model repository.

First, prepare your custom models in a `bentos` directory following the guidelines provided by [BentoML to build Bentos](https://docs.bentoml.com/en/latest/guides/build-options.html). Check out the [default model repository](https://github.com/bentoml/openllm-repo) for an example and read the [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) for details.

Then, register your custom model repository with OpenLLM:

```bash
openllm repo add <repo-name> <repo-url>
```

**Note**: Currently, OpenLLM only supports adding public repositories.

## Deploy to BentoCloud

OpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework, and BentoCloud, an AI inference platform for enterprise AI teams. BentoCloud provides fully managed infrastructure optimized for LLM inference with autoscaling, model orchestration, observability, and more, allowing you to run any AI model in the cloud.
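The deployment steps follow below. Because a BentoCloud deployment exposes the same OpenAI-compatible API as a local server, existing client code only needs a different base URL and credentials. This is a hedged sketch: the deployment URL and the `BENTOCLOUD_API_TOKEN` variable are hypothetical placeholders; use the actual endpoint URL and access token from your BentoCloud console.

```python
import os

from openai import OpenAI

# Hypothetical deployment URL -- replace with the endpoint URL shown on the
# BentoCloud console for your OpenLLM deployment, plus your access token.
client = OpenAI(
    base_url="https://my-openllm-deployment.example.bentoml.ai/v1",
    api_key=os.environ.get("BENTOCLOUD_API_TOKEN", "na"),
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello from the cloud!"}],
)
print(completion.choices[0].message.content)
```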
[Sign up for BentoCloud](https://www.bentoml.com/) for free and [log in](https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html). Then, run `openllm deploy` to deploy a model to BentoCloud:

```bash
openllm deploy llama3.2:1b --env HF_TOKEN
```

> [!NOTE]
> If you are deploying a gated model, make sure to set HF_TOKEN in environment variables.

Once the deployment is complete, you can run model inference on the BentoCloud console:

<img width="800" alt="bentocloud_ui" src="https://github.com/bentoml/OpenLLM/assets/65327072/4f7819d9-73ea-488a-a66c-f724e5d063e6">

## Community

OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 [Join our Slack community!](https://l.bentoml.com/join-slack)

## Contributing

As an open-source project, we welcome contributions of all kinds, such as new features, bug fixes, and documentation. Here are some of the ways to contribute:

- Report a bug by [creating a GitHub issue](https://github.com/bentoml/OpenLLM/issues/new/choose).
- [Submit a pull request](https://github.com/bentoml/OpenLLM/compare) or help review other developers' [pull requests](https://github.com/bentoml/OpenLLM/pulls).
- Add an LLM to the OpenLLM default model repository so that other users can run your model. See the [pull request template](https://github.com/bentoml/openllm-models/pull/1).
- Check out the [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) to learn more.

## Acknowledgements

This project uses the following open-source projects:

- [bentoml/bentoml](https://github.com/bentoml/bentoml) for production-level model serving
- [vllm-project/vllm](https://github.com/vllm-project/vllm) for production-level LLM backends
- [blrchen/chatgpt-lite](https://github.com/blrchen/chatgpt-lite) for a fancy web chat UI
- [astral-sh/uv](https://github.com/astral-sh/uv) for blazing-fast dependency installation

We are grateful to the developers and contributors of these projects for their hard work and dedication.
", Assign "at most 3 tags" to the expected json: {"id":"4191","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"