base on A modular voice assistant application for experimenting with state-of-the-art transcription, response generation, and text-to-speech models. Supports OpenAI, Groq, Elevanlabs, CartesiaAI, and Deepgram APIs, plus local models via Ollama. Ideal for research and development in voice technology. # VERBI - Voice Assistant šŸŽ™ļø <p align="center"> <a href="https://trendshift.io/repositories/11584" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11584" alt="PromtEngineer%2FVerbi | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> </p> [![GitHub Stars](https://img.shields.io/github/stars/PromtEngineer/Verbi?style=social)](https://github.com/PromtEngineer/Verbi/stargazers) [![GitHub Forks](https://img.shields.io/github/forks/PromtEngineer/Verbi?style=social)](https://github.com/PromtEngineer/Verbi/network/members) [![GitHub Issues](https://img.shields.io/github/issues/PromtEngineer/Verbi)](https://github.com/PromtEngineer/Verbi/issues) [![GitHub Pull Requests](https://img.shields.io/github/issues-pr/PromtEngineer/Verbi)](https://github.com/PromtEngineer/Verbi/pulls) [![License](https://img.shields.io/github/license/PromtEngineer/Verbi)](https://github.com/PromtEngineer/Verbi/blob/main/LICENSE) ## Motivation ✨✨✨ Welcome to the Voice Assistant project! šŸŽ™ļø Our goal is to create a modular voice assistant application that allows you to experiment with state-of-the-art (SOTA) models for various components. The modular structure provides flexibility, enabling you to pick and choose between different SOTA models for transcription, response generation, and text-to-speech (TTS). This approach facilitates easy testing and comparison of different models, making it an ideal platform for research and development in voice assistant technologies. Whether you're a developer, researcher, or enthusiast, this project is for you! ## Features 🧰 - **Modular Design**: Easily switch between different models for transcription, response generation, and TTS. - **Support for Multiple APIs**: Integrates with OpenAI, Groq, and Deepgram APIs, along with placeholders for local models. - **Audio Recording and Playback**: Record audio from the microphone and play generated speech. - **Configuration Management**: Centralized configuration in `config.py` for easy setup and management. ## Project Structure šŸ“‚ ```plaintext voice_assistant/ ā”œā”€ā”€ voice_assistant/ │ ā”œā”€ā”€ __init__.py │ ā”œā”€ā”€ audio.py │ ā”œā”€ā”€ api_key_manager.py │ ā”œā”€ā”€ config.py │ ā”œā”€ā”€ transcription.py │ ā”œā”€ā”€ response_generation.py │ ā”œā”€ā”€ text_to_speech.py │ ā”œā”€ā”€ utils.py │ ā”œā”€ā”€ local_tts_api.py │ ā”œā”€ā”€ local_tts_generation.py ā”œā”€ā”€ .env ā”œā”€ā”€ run_voice_assistant.py ā”œā”€ā”€ piper_server.py ā”œā”€ā”€ setup.py ā”œā”€ā”€ requirements.txt └── README.md ``` ## Setup Instructions šŸ“‹ #### Prerequisites āœ… - Python 3.10 or higher - Virtual environment (recommended) #### Step-by-Step Instructions šŸ”¢ 1. šŸ“„ **Clone the repository** ```shell git clone https://github.com/PromtEngineer/Verbi.git cd Verbi ``` 2. šŸ **Set up a virtual environment** Using `venv`: ```shell python -m venv venv source venv/bin/activate # On Windows use `venv\Scripts\activate` ``` Using `conda`: ```shell conda create --name verbi python=3.10 conda activate verbi ``` 3. šŸ“¦ **Install the required packages** ```shell pip install -r requirements.txt ``` 4. šŸ› ļø **Set up the environment variables** Create a `.env` file in the root directory and add your API keys: ```shell OPENAI_API_KEY=your_openai_api_key GROQ_API_KEY=your_groq_api_key DEEPGRAM_API_KEY=your_deepgram_api_key LOCAL_MODEL_PATH=path/to/local/model PIPER_SERVER_URL=server_url ``` 5. 🧩 **Configure the models** Edit config.py to select the models you want to use: ```shell class Config: # Model selection TRANSCRIPTION_MODEL = 'groq' # Options: 'openai', 'groq', 'deepgram', 'fastwhisperapi' 'local' RESPONSE_MODEL = 'groq' # Options: 'openai', 'groq', 'ollama', 'local' TTS_MODEL = 'deepgram' # Options: 'openai', 'deepgram', 'elevenlabs', 'local', 'melotts', 'piper' # API keys and paths OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") GROQ_API_KEY = os.getenv("GROQ_API_KEY") DEEPGRAM_API_KEY = os.getenv("DEEPGRAM_API_KEY") LOCAL_MODEL_PATH = os.getenv("LOCAL_MODEL_PATH") ``` If you are running LLM locally via [Ollama](https://ollama.com/), make sure the Ollama server is runnig before starting verbi. 6. šŸ”Š **Configure ElevenLabs Jarvis' Voice** - Voice samples [here](https://github.com/PromtEngineer/Verbi/tree/main/voice_samples). - Follow this [link](https://elevenlabs.io/app/voice-lab/share/de3746fa51a09e771604d74b5d1ff6797b6b96a5958f9de95cef544dde31dad9/WArWzu0z4mbSyy5BfRKM) to add the Jarvis voice to your ElevenLabs account. - Name the voice 'Paul J.' or, if you prefer a different name, ensure it matches the ELEVENLABS_VOICE_ID variable in the text_to_speech.py file. 7. šŸƒ **Run the voice assistant** ```shell python run_voice_assistant.py ``` 8. šŸŽ¤ **Install FastWhisperAPI** _Optional step if you need a local transcription model_ ***Clone the repository*** ```shell cd.. git clone https://github.com/3choff/FastWhisperAPI.git cd FastWhisperAPI ``` ***Install the required packages:*** ```shell pip install -r requirements.txt ``` ***Run the API*** ```shell fastapi run main.py ``` ***Alternative Setup and Run Methods*** The API can also run directly on a Docker container or in Google Colab. ***Docker:*** ***Build a Docker container:*** ```shell docker build -t fastwhisperapi . ``` ***Run the container*** ```shell docker run -p 8000:8000 fastwhisperapi ``` Refer to the repository documentation for the Google Colab method: https://github.com/3choff/FastWhisperAPI/blob/main/README.md 8. šŸŽ¤ **Install Local TTS - MeloTTS** _Optional step if you need a local Text to Speech model_ ***Install MeloTTS from Github*** Use the following [link](https://github.com/myshell-ai/MeloTTS/blob/main/docs/install.md#linux-and-macos-install) to install MeloTTS for your operating system. Once the package is installed on your local virtual environment, you can start the api server using the following command. ```shell python voice_assistant/local_tts_api.py ``` The `local_tts_api.py` file implements as fastapi server that will listen to incoming text and will generate audio using MeloTTS model. In order to use the local TTS model, you will need to update the `config.py` file by setting: ```shell TTS_MODEL = 'melotts' # Options: 'openai', 'deepgram', 'elevenlabs', 'local', 'melotts', 'piper' ``` 9. šŸŽ¤ **Install Local TTS - Piper** _A faster and lightweight alternative to MeloTTS_ ***Download the Piper Binary and the voice from Github*** Use the following [link](https://github.com/rhasspy/piper) to install Piper Binary for your operating system. Use the following [link](https://github.com/rhasspy/piper?tab=readme-ov-file#voices) to download Piper voices. Each voice will have two files: | `.onnx` | Actual voice model | | `.onnx.json` | Model configuration | For example: ```shell models/en_US-lessac-medium/ ā”œā”€ā”€ en_US-lessac-medium.onnx ā”œā”€ā”€ en_US-lessac-medium.onnx.json ``` Once the binary and voice is downloaded on your system, edit the `piper_server.py` and provide the binary and voice paths. ```shell piper_executable = "./piper/piper" #example path to the piper binary model_path = "en_US-lessac-medium.onnx" #example path to the .onnx file ``` You can start the api server using the following command. ```shell python piper_server.py ``` The `piper_server.py` file implements as fastapi server that will listen to incoming text and will generate audio using Piper model. In order to use the local TTS model, you will need to update the `config.py` file by setting: ```shell TTS_MODEL = 'piper' # Options: 'openai', 'deepgram', 'elevenlabs', 'local', 'melotts','piper' ``` You can run the main file to start using verbi with local models. ## Model Options āš™ļø #### Transcription Models šŸŽ¤ - **OpenAI**: Uses OpenAI's Whisper model. - **Groq**: Uses Groq's Whisper-large-v3 model. - **Deepgram**: Uses Deepgram's transcription model. - **FastWhisperAPI**: Uses FastWhisperAPI, a local transcription API powered by Faster Whisper. - **Local**: Placeholder for a local speech-to-text (STT) model. #### Response Generation Models šŸ’¬ - **OpenAI**: Uses OpenAI's GPT-4 model. - **Groq**: Uses Groq's LLaMA model. - **Ollama**: Uses any model served via Ollama. - **Local**: Placeholder for a local language model. #### Text-to-Speech (TTS) Models šŸ”Š - **OpenAI**: Uses OpenAI's TTS model with the 'fable' voice. - **Deepgram**: Uses Deepgram's TTS model with the 'aura-angus-en' voice. - **ElevenLabs**: Uses ElevenLabs' TTS model with the 'Paul J.' voice. - **Local**: Placeholder for a local TTS model. ## Detailed Module Descriptions šŸ“˜ - **`run_verbi.py`**: Main script to run the voice assistant. - **`voice_assistant/config.py`**: Manages configuration settings and API keys. - **`voice_assistant/api_key_manager.py`**: Handles retrieval of API keys based on configured models. - **`voice_assistant/audio.py`**: Functions for recording and playing audio. - **`voice_assistant/transcription.py`**: Manages audio transcription using various APIs. - **`voice_assistant/response_generation.py`**: Handles generating responses using various language models. - **`voice_assistant/text_to_speech.py`**: Manages converting text responses into speech. - **`voice_assistant/utils.py`**: Contains utility functions like deleting files. - **`voice_assistant/local_tts_api.py`**: Contains the api implementation to run the MeloTTS model. - **`voice_assistant/local_tts_generation.py`**: Contains the code to use the MeloTTS api to generated audio. - **`voice_assistant/__init__.py`**: Initializes the `voice_assistant` package. ## Roadmap šŸ›¤ļøšŸ›¤ļøšŸ›¤ļø Here's what's next for the Voice Assistant project: 1. **Add Support for Streaming**: Enable real-time streaming of audio input and output. 2. **Add Support for ElevenLabs and Enhanced Deepgram for TTS**: Integrate additional TTS options for higher quality and variety. 3. **Add Filler Audios**: Include background or filler audios while waiting for model responses to enhance user experience. 4. **Add Support for Local Models Across the Board**: Expand support for local models in transcription, response generation, and TTS. ## Contributing šŸ¤ We welcome contributions from the community! If you'd like to help improve this project, please follow these steps: 1. Fork the repository. 2. Create a new branch (`git checkout -b feature-branch`). 3. Make your changes and commit them (`git commit -m 'Add new feature'`). 4. Push to the branch (`git push origin feature-branch`). 5. Open a pull request detailing your changes. ## Star History ✨✨✨ [![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/Verbi&type=Date)](https://star-history.com/#PromtEngineer/Verbi&Date) ", Assign "at most 3 tags" to the expected json: {"id":"11584","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"