# WhisperJAV

<p align="center">
  <img src="https://img.shields.io/badge/version-1.7.4-blue.svg" alt="Version">
  <img src="https://img.shields.io/badge/python-3.9--3.12-green.svg" alt="Python">
  <img src="https://img.shields.io/badge/license-MIT-orange.svg" alt="License">
</p>

A subtitle generator for Japanese Adult Videos.

---

### What is the idea

Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the **spontaneous and noisy domain of JAV**. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.

#### 1. The Acoustic Profile

JAV audio is defined by "acoustic hell" and a low signal-to-noise ratio (SNR), characterized by:

* **Non-Verbal Vocalizations (NVVs):** A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure.
* **Spectral Mimicry:** These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., *fu*), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
* **Extreme Dynamics:** Volatile shifts in audio intensity, ranging from faint whispers (*sasayaki*) to high-decibel screams, which confuse standard gain control and attention mechanisms.
* **Linguistic Variance:** The prevalence of theatrical onomatopoeia and *role language* (*yakuwarigo*) containing exaggerated intonations and slang absent from standard corpora.

#### 2. Temporal Drift and Hallucination

While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes **contextual drift** and error accumulation. Specifically, extended periods of "ambiguous audio" (silence or rhythmic breathing) cause the Transformer's attention mechanism to collapse, triggering repetitive **hallucination loops** where the model generates unrelated text to fill the acoustic void.

#### 3. The Pre-processing Paradox & Fine-Tuning Risks

Standard audio engineering intuition, such as aggressive denoising or vocal separation, often fails in this domain. Because Whisper relies on specific **log-Mel spectrogram** features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in "domain shift" and erroneous transcriptions. Consequently, audio processing requires a "surgical," multi-stage approach (like VAD clamping) rather than blanket filtering.

Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of **overfitting**. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and producing inconsistent, "hit or miss" output quality.

**WhisperJAV** is an attempt to address the failure points above. Its inference pipelines:

1. **Acoustic Filtering:** Deploy **scene-based segmentation** and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
2. **Linguistic Adaptation:** Normalize **domain-specific terminology** and preserve onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in *Kansai-ben*) that standard BPE tokenizers fail to parse [4, 5].
3. **Defensive Decoding:** Tune **log-probability thresholding** and `no_speech_threshold` to systematically discard low-confidence outputs (hallucinations), while using regex filters to clean non-lexical markers (e.g., `(moans)`) from the final subtitle track [6, 7]. A sketch of this idea follows below.
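To make point 3 concrete, here is a minimal sketch of defensive decoding. It is illustrative only, not WhisperJAV's actual code: the `Segment` dataclass, the marker regex, and the threshold values are assumptions, but the drop rule mirrors Whisper's own behaviour of skipping a window only when `no_speech_prob` is high *and* the average log-probability is low.

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    start: float           # seconds
    end: float              # seconds
    text: str
    avg_logprob: float      # mean token log-probability reported by the decoder
    no_speech_prob: float   # probability that the window contains no speech

# Non-lexical markers sometimes emitted for NVVs; extend the list as needed.
NON_LEXICAL = re.compile(r"[\(（](?:moans?|sighs?|breathing|喘ぎ|ため息)[\)）]", re.IGNORECASE)

def defensive_filter(segments, logprob_threshold=-1.0, no_speech_threshold=0.6):
    """Drop likely hallucinations and strip non-lexical markers from the rest."""
    kept = []
    for seg in segments:
        # Discard windows that look like hallucinations over silence or breathing.
        if seg.no_speech_prob > no_speech_threshold and seg.avg_logprob < logprob_threshold:
            continue
        cleaned = NON_LEXICAL.sub("", seg.text).strip()
        if cleaned:
            kept.append(Segment(seg.start, seg.end, cleaned,
                                seg.avg_logprob, seg.no_speech_prob))
    return kept
```

Tightening `no_speech_threshold` (sensitivity "conservative") trades missed quiet lines for fewer hallucinations; loosening it ("aggressive") does the opposite.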
---

## Quick Start

### GUI (Recommended for most users)

```bash
whisperjav-gui
```

A window opens. Add your files, pick a mode, click Start.

### Command Line

```bash
# Basic usage
whisperjav video.mp4

# Specify mode and sensitivity
whisperjav audio.mp3 --mode balanced --sensitivity aggressive

# Process a folder
whisperjav /path/to/media_folder --output-dir ./subtitles
```

---

## Features

### Processing Modes

| Mode | Backend | Scene Detection | VAD | Best For |
|------|---------|-----------------|-----|----------|
| **faster** | stable-ts (turbo) | No | No | Speed priority, clean audio |
| **fast** | stable-ts | Yes | No | General use, mixed quality |
| **balanced** | faster-whisper | Yes | Yes | Default. Noisy audio, dialogue-heavy |
| **fidelity** | OpenAI Whisper | Yes | Yes (Silero) | Maximum accuracy, slower |
| **transformers** | HuggingFace | Optional | Internal | Japanese-optimized model, customizable |

### Sensitivity Settings

- **Conservative**: Higher thresholds, fewer hallucinations. Good for noisy content.
- **Balanced**: Default. Works for most content.
- **Aggressive**: Lower thresholds, catches more dialogue. Good for whisper/ASMR content.

### Transformers Mode (New in v1.7)

Uses HuggingFace's `kotoba-tech/kotoba-whisper-v2.2` model, which is optimized for Japanese conversational speech:

```bash
whisperjav video.mp4 --mode transformers

# Customize parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20
```

**Transformers-specific options:**

- `--hf-model-id`: Model (default: `kotoba-tech/kotoba-whisper-v2.2`)
- `--hf-chunk-length`: Seconds per chunk (default: 15)
- `--hf-beam-size`: Beam search width (default: 5)
- `--hf-temperature`: Sampling temperature (default: 0.0)
- `--hf-scene`: Scene detection method (`none`, `auditok`, `silero`, `semantic`)

### Two-Pass Ensemble Mode (New in v1.7)

Runs your video through two different pipelines and merges the results; different models catch different things.

```bash
# Pass 1 with transformers, Pass 2 with balanced
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced

# Custom sensitivity per pass
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass1-sensitivity aggressive --pass2-pipeline fidelity
```

**Merge strategies:**

- `smart_merge` (default): Intelligent overlap detection (see the sketch below)
- `pass1_primary` / `pass2_primary`: Prioritize one pass, fill gaps from the other
- `full_merge`: Combine everything from both passes
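The easiest way to picture overlap-based merging is as interval resolution between two subtitle tracks. The following is a simplified sketch under that assumption, not WhisperJAV's actual merge code: pass 1 is kept as the primary track, and pass 2 cues are added only where they do not collide with an existing cue.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str

def overlaps(a: Cue, b: Cue, min_overlap: float = 0.3) -> bool:
    """True if two cues share at least `min_overlap` seconds."""
    return min(a.end, b.end) - max(a.start, b.start) >= min_overlap

def merge_passes(pass1: list[Cue], pass2: list[Cue]) -> list[Cue]:
    """Keep pass 1 intact, then fill gaps with pass 2 cues that overlap nothing."""
    merged = list(pass1)
    for cue in pass2:
        if not any(overlaps(cue, kept) for kept in merged):
            merged.append(cue)
    return sorted(merged, key=lambda c: c.start)
```

`full_merge` would skip the overlap check entirely, while `pass2_primary` simply swaps the roles of the two tracks.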
### Speech Enhancement Tools (New in v1.7.3)

Pre-process audio scenes before transcription. When an enhancer is selected, it runs per scene after scene detection.

Note: use this only for surgical reasons. In general, any audio processing that alters the mel-spectrogram has the potential to introduce more artifacts and hallucinations.

```bash
# ClearVoice denoising (48kHz, best quality)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice

# ClearVoice with specific 16kHz model
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice:FRCRN_SE_16K

# FFmpeg DSP filters (lightweight, always available)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer ffmpeg-dsp:loudnorm,denoise

# ZipEnhancer (lightweight SOTA)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer zipenhancer

# BS-RoFormer vocal isolation
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer bs-roformer

# Ensemble with different enhancers per pass
whisperjav video.mp4 --ensemble \
    --pass1-pipeline balanced --pass1-speech-enhancer clearvoice \
    --pass2-pipeline transformers --pass2-speech-enhancer none
```

**Available backends:**

| Backend | Description | Models/Options |
|---------|-------------|----------------|
| `none` | No enhancement (default) | - |
| `ffmpeg-dsp` | FFmpeg audio filters | `loudnorm`, `denoise`, `compress`, `highpass`, `lowpass`, `deess` |
| `clearvoice` | ClearerVoice denoising | `MossFormer2_SE_48K` (default), `FRCRN_SE_16K` |
| `zipenhancer` | ZipEnhancer 16kHz | `torch` (GPU), `onnx` (CPU) |
| `bs-roformer` | Vocal isolation | `vocals`, `other` |

**Syntax:** `--pass1-speech-enhancer <backend>` or `--pass1-speech-enhancer <backend>:<model>`

### GUI Parameter Customization

The GUI has three tabs:

1. **Transcription Mode**: Select pipeline, sensitivity, language
2. **Advanced Options**: Model override, scene detection method, debug settings
3. **Two-Pass Ensemble**: Configure both passes with full parameter customization via a JSON editor

The Ensemble tab lets you customize beam size, temperature, VAD thresholds, and other ASR parameters without editing config files.

### AI Translation

Generate subtitles and translate them in one step:

```bash
# Generate and translate
whisperjav video.mp4 --translate

# Or translate existing subtitles
whisperjav-translate -i subtitles.srt --provider deepseek
```

Supports DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, and OpenRouter.

**Resume Support**: If translation is interrupted, just run the same command again. It automatically resumes from where it left off using the `.subtrans` project file.

---

## What Makes It Work for JAV

### Scene Detection

Splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting off sentences mid-word. Three methods are available:

- **Auditok** (default): Energy-based detection, fast and reliable
- **Silero**: Neural VAD-based detection, better for noisy audio
- **Semantic** (new in v1.7.4): Texture-based clustering using MFCC features, groups acoustically similar segments together

### Voice Activity Detection (VAD)

Identifies when someone is actually speaking vs. background noise or music. Reduces false transcriptions during quiet moments.

### Japanese Post-Processing

- Handles sentence-ending particles (ね, よ, わ, の)
- Preserves aizuchi (うん, はい, ええ)
- Recognizes dialect patterns (Kansai-ben, feminine/masculine speech)
- Filters out common Whisper hallucinations

### Hallucination Removal

Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns, as illustrated in the sketch below.
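As a rough illustration of what repetition removal means in practice, here is a minimal sketch that collapses runs of identical (normalized) subtitle lines. It is an assumption-level example, not WhisperJAV's actual filter; the `max_repeats` value and the normalization rule are illustrative.

```python
import re

def collapse_repetition_loops(lines: list[str], max_repeats: int = 2) -> list[str]:
    """Keep at most `max_repeats` consecutive copies of the same subtitle line."""
    def normalize(text: str) -> str:
        # Ignore whitespace/punctuation differences when comparing lines.
        return re.sub(r"[\s。、．，!?！？…]+", "", text)

    cleaned: list[str] = []
    previous = None
    run = 0
    for line in lines:
        key = normalize(line)
        run = run + 1 if key == previous else 1
        previous = key
        if run <= max_repeats:
            cleaned.append(line)
    return cleaned

# Example: a loop such as "ご視聴ありがとうございました" repeated ten times
# over a silent stretch collapses to two lines.
```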
---

## Content-Specific Recommendations

| Content Type | Mode | Sensitivity | Notes |
|--------------|------|-------------|-------|
| Drama / Dialogue Heavy | balanced | aggressive | Or try transformers mode |
| Group Scenes | faster | conservative | Speed matters, less precision needed |
| Amateur / Homemade | fast | conservative | Variable audio quality |
| ASMR / VR / Whisper | fidelity | aggressive | Maximum accuracy for quiet speech |
| Heavy Background Music | balanced | conservative | VAD helps filter music |
| Maximum Accuracy | ensemble | varies | Two-pass with different pipelines |

---

## Installation

### Windows Installer (Easiest)

Download and run: **WhisperJAV-1.7.4-Windows-x86_64.exe**

This installs everything you need, including Python and all dependencies.

### Upgrading from Previous Installer Versions

If you installed v1.5.x or v1.6.x via the Windows installer:

1. Download [upgrade_whisperjav.bat](https://github.com/meizhong986/whisperjav/raw/main/installer/upgrade_whisperjav.bat)
2. Double-click to run
3. Wait 1-2 minutes

This updates WhisperJAV without re-downloading PyTorch (~2.5GB) or your AI models (~3GB).

### Install from Source

Requires Python 3.9-3.12, FFmpeg, and Git.

**Recommended: Use the install scripts** (they handle dependency conflicts automatically and auto-detect your GPU):

<details>
<summary><b>Windows</b></summary>

```batch
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav

installer\install_windows.bat              # Auto-detects GPU and CUDA version
installer\install_windows.bat --cpu-only   # Force CPU only
installer\install_windows.bat --cuda118    # Force CUDA 11.8
installer\install_windows.bat --cuda124    # Force CUDA 12.4
installer\install_windows.bat --minimal    # Minimal install (no speech enhancement)
installer\install_windows.bat --dev        # Development/editable install
```

The script automatically:

- Detects your NVIDIA GPU and selects the optimal CUDA version
- Falls back to CPU-only if no GPU is found
- Checks for the WebView2 runtime (required for the GUI)
- Logs installation to `install_log_windows.txt`
- Retries failed downloads up to 3 times

</details>

<details>
<summary><b>Linux / macOS</b></summary>

```bash
# Install system dependencies first (Linux only)
# Debian/Ubuntu:
sudo apt-get install -y python3-dev build-essential ffmpeg libsndfile1
# Fedora/RHEL:
sudo dnf install python3-devel gcc ffmpeg libsndfile

git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
chmod +x installer/install_linux.sh

./installer/install_linux.sh              # Auto-detects GPU
./installer/install_linux.sh --cpu-only   # Force CPU only
./installer/install_linux.sh --minimal    # Minimal install
```

</details>

<details>
<summary><b>Cross-Platform Python Script</b></summary>

```bash
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav

python install.py             # Auto-detects GPU, defaults to CUDA 12.1
python install.py --cpu-only  # CPU only
python install.py --cuda118   # CUDA 11.8
python install.py --cuda121   # CUDA 12.1
python install.py --cuda124   # CUDA 12.4
python install.py --minimal   # Minimal install (no speech enhancement)
python install.py --dev       # Development/editable install
```

</details>

**Alternative: Manual pip install** (may encounter dependency conflicts):

```bash
# Install PyTorch with GPU support first (NVIDIA example)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

# Then install WhisperJAV
pip install git+https://github.com/meizhong986/whisperjav.git@main
```
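Whichever install path you take, it is worth confirming that the PyTorch build you ended up with actually sees an accelerator, since a CPU-only wheel is the most common cause of very slow runs (see Troubleshooting). A quick check from Python, assuming a recent PyTorch (1.12+ for the MPS check):

```python
import torch

# CUDA (NVIDIA): availability and the CUDA version the wheel was built against
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)

# Apple Silicon (MPS) availability
print("MPS available:", torch.backends.mps.is_available())

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```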
**Platform Notes:**

- **Apple Silicon (M1/M2/M3/M4)**: Just `pip install torch torchaudio` - MPS acceleration works automatically
- **AMD GPU (ROCm)**: Experimental. Use `--mode balanced` for best compatibility
- **CPU only**: Works but slow. Use `--accept-cpu-mode` to skip the GPU warning
- **Linux server (no GPU)**: The install scripts auto-detect this and switch to CPU-only
- **Linux (Debian/Ubuntu)**: Install system dependencies first: `sudo apt-get install -y python3-dev build-essential ffmpeg libsndfile1`

### Prerequisites

- **Python 3.9-3.12** (3.13+ is not compatible with openai-whisper)
- **FFmpeg** in your system PATH
- **GPU recommended**: NVIDIA CUDA, Apple MPS, or AMD ROCm
- **8GB+ disk space** for installation

<details>
<summary>Detailed Windows Prerequisites</summary>

#### NVIDIA GPU Setup

1. Install the latest [NVIDIA drivers](https://www.nvidia.com/drivers)
2. Install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) matching your driver version
3. Install [cuDNN](https://developer.nvidia.com/cudnn) matching your CUDA version

#### FFmpeg

1. Download from [gyan.dev/ffmpeg/builds](https://www.gyan.dev/ffmpeg/builds)
2. Extract to `C:\ffmpeg`
3. Add `C:\ffmpeg\bin` to your PATH

#### Python

Download from [python.org](https://www.python.org/downloads/windows/). Check "Add Python to PATH" during installation.

</details>

---

## CLI Reference

```bash
# Basic usage
whisperjav video.mp4
whisperjav video.mp4 --mode balanced --sensitivity aggressive

# All modes: faster, fast, balanced, fidelity, transformers
whisperjav video.mp4 --mode fidelity

# Transformers mode with custom parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20

# Two-pass ensemble
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass2-pipeline fidelity --merge-strategy smart_merge

# Output options
whisperjav video.mp4 --output-dir ./subtitles
whisperjav video.mp4 --subs-language english-direct

# Batch processing
whisperjav /path/to/folder --output-dir ./subtitles
whisperjav /path/to/folder --skip-existing   # Resume interrupted batch (skip already processed)

# Debugging
whisperjav video.mp4 --debug --keep-temp

# Translation
whisperjav video.mp4 --translate --translate-provider deepseek
whisperjav-translate -i subtitles.srt --provider gemini
```

Run `whisperjav --help` for all options.

---

## Troubleshooting

**FFmpeg not found**: Install FFmpeg and add it to your PATH.

**Slow processing / GPU warning**: Your PyTorch might be CPU-only. Reinstall with GPU support:

```bash
pip uninstall torch torchvision torchaudio
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
```

**model.bin error in faster mode**: Enable Windows Developer Mode or run as Administrator, then delete the cached model folder:

```powershell
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--Systran--faster-whisper-large-v2"
```

---

## Performance

Rough estimates for processing time per hour of video:

| Platform | Time |
|----------|------|
| NVIDIA GPU (CUDA) | 5-10 minutes |
| Apple Silicon (MPS) | 8-15 minutes |
| AMD GPU (ROCm) | 10-20 minutes |
| CPU only | 30-60 minutes |

---

## Contributing

Contributions welcome. See `CONTRIBUTING.md` for guidelines.

```bash
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
pip install -e .[dev]
python -m pytest tests/
```

---

## License

MIT License. See the [LICENSE](LICENSE) file.
---

## Citation and Credits

- "Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio." (2025). arXiv:2501.11378.
- "Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down." (2025). arXiv:2505.12969.
- "PromptASR for Contextualized ASR with Controllable Style." (2024). arXiv:2309.07414.
- "In-Context Learning Boosts Speech Recognition." (2025). arXiv:2505.1
- Koenecke, A., et al. (2024). "Careless Whisper: Speech-to-Text Hallucination Harms." ACM FAccT 2024.
- Bain, M., et al. (2023). "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio." arXiv:2303.00747.

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) - The underlying ASR model
- [stable-ts](https://github.com/jianfch/stable-ts) - Timestamp refinement
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) - Optimized CTranslate2 inference
- [HuggingFace Transformers](https://github.com/huggingface/transformers) - Transformers pipeline backend
- [Kotoba-Whisper](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2) - Japanese-optimized Whisper model
- The testing community for feedback and bug reports

---

## Disclaimer

This tool generates accessibility subtitles. Users are responsible for compliance with applicable laws regarding the content they process.
"},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"