<div align="center">
<p>
<img width="100%" src="assets/banner.png" alt="Nexa AI Banner">
</p>
<p align="center">
<a href="https://docs.nexa.ai">
<img src="https://img.shields.io/badge/docs-website-brightgreen?logo=readthedocs" alt="Documentation">
</a>
<a href="https://x.com/nexa_ai"><img alt="X account" src="https://img.shields.io/twitter/url/https/twitter.com/diffuserslib.svg?style=social&label=Follow%20%40Nexa_AI"></a>
<a href="https://discord.com/invite/nexa-ai">
<img src="https://img.shields.io/discord/1192186167391682711?color=5865F2&logo=discord&logoColor=white&style=flat-square" alt="Join us on Discord">
</a>
<a href="https://join.slack.com/t/nexa-ai-community/shared_invite/zt-3837k9xpe-LEty0disTTUnTUQ4O3uuNw">
<img src="https://img.shields.io/badge/slack-join%20chat-4A154B?logo=slack&logoColor=white" alt="Join us on Slack">
</a>
</p>


</div>
# Nexa SDK

On-device AI inference in minutes—now for MLX, GGUF, and Qualcomm NPU, with Android and iOS coming soon.
Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, and NPUs, with backend support for CUDA, Metal, Vulkan, and Qualcomm NPU. It handles multiple input modalities including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON schema-based function calling and streaming. It supports model formats such as GGUF, MLX, and Nexa AI's own `.nexa` format, enabling efficient quantized inference across diverse platforms.
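Because the server is OpenAI-compatible, standard OpenAI-style requests should work against it. Here is a minimal sketch; the `/v1/chat/completions` route and the model name are assumptions based on OpenAI compatibility rather than documented endpoints, so adjust them to your setup:

```bash
# Start the OpenAI-compatible server (see the CLI Reference below).
nexa serve --host 127.0.0.1:8080

# In another terminal, send a streaming chat request.
# NOTE: the route and model name below are assumptions; adjust as needed.
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-4B-4bit-MLX",
    "messages": [{"role": "user", "content": "Hello from Nexa SDK!"}],
    "stream": true
  }'
```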
## Qualcomm NPU PC Demos
<table>
<tr>
<td width="50%">
<img width="100%" src="assets/PC_demo_2_image.gif" alt="Multi-Image Reasoning Demo">
<p align="center"><b>🖼️ Multi-Image Reasoning</b><br>Spot the difference across two images in multi-round dialogue.</p>
</td>
<td width="50%">
<img width="100%" src="assets/PC_Demo_Agent.gif" alt="Image + Audio Function Call Demo">
<p align="center"><b>🎤 Image + Text → Function Call</b><br>Snap a poster, add a voice note, and AI agent creates a calendar event.</p>
</td>
</tr>
<tr>
<td colspan="2" align="center">
<img width="50%" src="assets/PC_Demo_Audio.gif" alt="Multi-Audio Comparison Demo">
<p align="center"><b>🎶 Multi-Audio Comparison</b><br>Tell the difference between two music clips locally.</p>
</td>
</tr>
</table>
## Recent Updates
#### 📣 **2025.08.20: Qualcomm NPU Support**
- Qualcomm NPU support for GGUF models.
- OmniNeural-4B is the **first multimodal AI model built natively for NPUs**, handling text, images, and audio in one model.
- Check out the model and demos in the [Hugging Face repo](https://huggingface.co/NexaAI/OmniNeural-4B).
- Read our [OmniNeural-4B technical blog](https://nexa.ai/blogs/omnineural-4b).
- Download the [arm64 installer with Qualcomm NPU support](https://nexa-model-hub-bucket.s3.us-west-1.amazonaws.com/public/nexa_sdk/downloads/nexa-cli_windows_arm64.exe) and try it!
#### 📣 **2025.08.12: ASR & TTS Support in MLX Format**
- ASR & TTS model support in the MLX format.
- New `/mic` mode to transcribe live speech directly in your terminal.
## Installation
### macOS
* [arm64](https://nexa-model-hub-bucket.s3.us-west-1.amazonaws.com/public/nexa_sdk/downloads/nexa-cli_macos_arm64.pkg)
* [x86_64](https://nexa-model-hub-bucket.s3.us-west-1.amazonaws.com/public/nexa_sdk/downloads/nexa-cli_macos_x86_64.pkg)
### Windows
* [arm64 with Qualcomm NPU support](https://nexa-model-hub-bucket.s3.us-west-1.amazonaws.com/public/nexa_sdk/downloads/nexa-cli_windows_arm64.exe)
* [x86_64](https://nexa-model-hub-bucket.s3.us-west-1.amazonaws.com/public/nexa_sdk/downloads/nexa-cli_windows_x86_64.exe)
### Linux
```bash
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
```
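After the script completes, verify the CLI is available (see the CLI Reference below for all commands):

```bash
nexa -h   # show all CLI commands
```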
## Supported Models
You can run any compatible GGUF, MLX, or `.nexa` model from 🤗 Hugging Face by using its `<full repo name>`.
### Qualcomm NPU models
> [!TIP]
> You need to download the [arm64 installer with Qualcomm NPU support](https://nexa-model-hub-bucket.s3.us-west-1.amazonaws.com/public/nexa_sdk/downloads/nexa-cli_windows_arm64.exe) and make sure your laptop has a Snapdragon® X Elite chip.
#### Quick Start (Windows arm64, Snapdragon X Elite)
1. **Login & Get Access Token (required for Pro Models)**
- Create an account at [sdk.nexa.ai](https://sdk.nexa.ai)
- Go to **Deployment → Create Token**
- Run this once in your terminal (replace with your token):
```bash
nexa config set license '<your_token_here>'
```
2. **Run and chat with our multimodal model, OmniNeural-4B, or other models on the NPU**
```bash
nexa infer omni-neural
nexa infer NexaAI/OmniNeural-4B
nexa infer NexaAI/qwen3-1.7B-npu
```
### GGUF models
> [!TIP]
> GGUF runs on macOS, Linux, and Windows.
📝 Run and chat with LLMs, e.g. Qwen3:
```bash
nexa infer ggml-org/Qwen3-1.7B-GGUF
```
🖼️ Run and chat with Multimodal models, e.g. Qwen2.5-Omni:
```bash
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```
### MLX models
> [!TIP]
> MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably.
> We recommend starting with models from our curated [NexaAI Collection](https://huggingface.co/NexaAI/collections) for best results. For example:
📝 Run and chat with LLMs, e.g. Qwen3:
```bash
nexa infer NexaAI/Qwen3-4B-4bit-MLX
```
🖼️ Run and chat with Multimodal models, e.g. Gemma3n:
```bash
nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX
```
## CLI Reference
| Essential Command | What it does |
|----------------------------------|----------------------------------------------------------------------|
| `nexa -h` | Show all CLI commands |
| `nexa pull <repo>` | Interactive download & cache of a model |
| `nexa infer <repo>` | Local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <repo>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch OpenAI‑compatible REST server |
| `nexa run <repo>` | Chat with a model via an existing server |
👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!
See [CLI Reference](https://nexaai.mintlify.app/nexa-sdk-go/NexaCLI) for full commands.
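Putting the essentials together, a typical model-management flow looks like this (the repo name is just an example taken from this README):

```bash
nexa pull NexaAI/Qwen3-4B-4bit-MLX    # interactive download & cache of the model
nexa list                             # show all cached models with sizes
nexa serve --host 127.0.0.1:8080      # launch the OpenAI-compatible REST server
nexa run NexaAI/Qwen3-4B-4bit-MLX     # chat with the model via the running server
nexa remove NexaAI/Qwen3-4B-4bit-MLX  # delete the cached model when finished
```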
## Acknowledgements
We would like to thank the following projects:
- [llama.cpp](https://github.com/ggml-org/llama.cpp)
- [mlx-lm](https://github.com/ml-explore/mlx-lm)
- [mlx-vlm](https://github.com/Blaizzy/mlx-vlm)
- [mlx-audio](https://github.com/Blaizzy/mlx-audio)
", Assign "at most 3 tags" to the expected json: {"id":"12239","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"