Audio processing
MOSS-TTS-Nano is an open-source, tiny multilingual speech generation model from MOSI.AI and the OpenMOSS team. With only 0.1B parameters, it is designed for real-time speech generation, can run directly on CPU without a GPU, and keeps the deployment stack simple enough for local demos, web serving, and lightweight product integration.
Fine-tune Gemma 4 and 3n with audio, images and text on Apple Silicon, using PyTorch and Metal Performance Shaders.
Turn any content into a personalized AI podcast. NotebookLM-style, except you control the script, voices, and hosts. Listen in Apple Podcasts, Spotify, or any podcast app.
Give Claude the ability to watch and understand videos — Claude Code plugin with frame extraction and multimodal audio analysis
Local meeting transcription → Obsidian vault. No cloud, no API keys.
100% private on-device voice models for speech-to-text and meeting transcription on macOS
Transcribe microphone and computer audio to markdown.
Agentic Hours-Long Video Editing via Music Synchronization
Every meeting, every idea, every voice note — searchable by your AI. Open-source, privacy-first conversation memory layer.
Personal project in Rust that aims to help you understand a foreign language better. Uses VAD + Whisper to transcribe, then translates according to a custom dictionary.
Fully local voice AI for iOS
Fully local meeting transcription with speaker diarization, AI summaries, and PDF output
The Infinite Crate is a DAW plugin built on JUCE, React, and the Lyria RealTime live music model
npm library to transcribe audio and video entirely in the browser with WebGPU and WebCodecs. 100% private and offline, with WASM fallbacks.
Real-time transcription and AI assistant for Meta Ray-Ban smart glasses. Live speech-to-text, speaker diarization, Gemini Live vision+voice, and WebRTC streaming.
✨✨[ICML 2026] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Muesli - local meeting transcription + dictation for macOS (Granola + WisprFlow alternative)
SOTA industrial-grade voice activity detection and audio event detection, supporting 100+ languages and outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD