# WhisperJAV
<p align="center">
<img src="https://img.shields.io/badge/version-1.7.4-blue.svg" alt="Version">
<img src="https://img.shields.io/badge/python-3.9--3.12-green.svg" alt="Python">
<img src="https://img.shields.io/badge/license-MIT-orange.svg" alt="License">
</p>
A subtitle generator for Japanese Adult Videos.
---
### What is the idea?
Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the **spontaneous and noisy domain of JAV**. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.
#### 1. The Acoustic Profile
JAV audio is defined by "acoustic hell" and a low Signal-to-Noise Ratio (SNR), characterized by:
* **Non-Verbal Vocalisations (NVVs):** A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure.
* **Spectral Mimicry:** These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., *fu*), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
* **Extreme Dynamics:** Volatile shifts in audio intensity, ranging from faint whispers (*sasayaki*) to high-decibel screams, which confuse standard gain control and attention mechanisms.
* **Linguistic Variance:** The prevalence of theatrical onomatopoeia and *Role Language* (*Yakuwarigo*) containing exaggerated intonations and slang absent from standard corpora.
#### 2. Temporal Drift and Hallucination
While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes **contextual drift** and error accumulation. Specifically, extended periods of "ambiguous audio" (silence or rhythmic breathing) cause the Transformer's attention mechanism to collapse, triggering repetitive **hallucination loops** where the model generates unrelated text to fill the acoustic void.
#### 3. The Pre-processing Paradox & Fine-Tuning Risks
Standard audio engineering intuition—such as aggressive denoising or vocal separation—often fails in this domain. Because Whisper relies on specific **log-Mel spectrogram** features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in "domain shift" and erroneous transcriptions. Consequently, audio processing requires a "surgical," multi-stage approach (like VAD clamping) rather than blanket filtering.
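As a concrete (illustrative) example of this feature-level damage, the sketch below compares the log-Mel spectrogram Whisper actually consumes before and after a blanket low-pass filter. The filter choice and file name are stand-ins for demonstration, not part of WhisperJAV:
```python
# Illustrative sketch (not WhisperJAV code): how a blanket filter shifts the
# log-Mel features Whisper consumes. Uses helpers shipped with openai-whisper.
import numpy as np
import whisper
from scipy.signal import butter, sosfilt

audio = whisper.load_audio("scene.wav")          # 16 kHz mono float32
audio = whisper.pad_or_trim(audio)               # 30-second analysis window

# "Generic normalization": an aggressive low-pass that strips the
# high-frequency transients carrying consonant information.
sos = butter(8, 3000, btype="low", fs=16000, output="sos")
filtered = sosfilt(sos, audio).astype(np.float32)

mel_raw = whisper.log_mel_spectrogram(audio)
mel_filtered = whisper.log_mel_spectrogram(filtered)

# The upper Mel bands diverge sharply -> the model sees a shifted domain.
print("mean abs difference in log-Mel space:",
      float((mel_raw - mel_filtered).abs().mean()))
```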
Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of **overfitting**. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and leading to inconsistent "hit or miss" quality outputs.
**WhisperJAV** is an attempt to address the failure points above. The inference pipelines:
1. **Acoustic Filtering:** Deploys **scene-based segmentation** and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
2. **Linguistic Adaptation:** Normalizes **domain-specific terminology** and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in *Kansai-ben*) that standard BPE tokenizers fail to parse [4, 5].
3. **Defensive Decoding:** Tunes **log-probability thresholding** and `no_speech_threshold` to systematically discard low-confidence outputs (hallucinations), while utilizing regex filters to clean non-lexical markers (e.g., `(moans)`) from the final subtitle track [6, 7].
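For orientation, the sketch below expresses the same defensive-decoding idea through openai-whisper's public `transcribe()` options plus a regex cleanup. The threshold values are illustrative, not WhisperJAV's tuned presets:
```python
# Illustrative only: defensive decoding via openai-whisper's transcribe()
# options. Threshold values are examples, not WhisperJAV's shipped settings.
import re
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(
    "scene.wav",
    language="ja",
    logprob_threshold=-1.0,            # discard segments with low average log-probability
    no_speech_threshold=0.6,           # treat confident "no speech" predictions as silence
    condition_on_previous_text=False,  # limits context carry-over that feeds loops
)

# Strip non-lexical markers such as (moans) or （喘ぎ声） from the subtitle text.
MARKER = re.compile(r"[（(][^）)]*[）)]")
lines = [MARKER.sub("", seg["text"]).strip() for seg in result["segments"]]
lines = [l for l in lines if l]        # drop cues that were nothing but markers
```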
---
## Quick Start
### GUI (Recommended for most users)
```bash
whisperjav-gui
```
A window opens. Add your files, pick a mode, click Start.
### Command Line
```bash
# Basic usage
whisperjav video.mp4
# Specify mode and sensitivity
whisperjav audio.mp3 --mode balanced --sensitivity aggressive
# Process a folder
whisperjav /path/to/media_folder --output-dir ./subtitles
```
---
## Features
### Processing Modes
| Mode | Backend | Scene Detection | VAD | Best For |
|------|---------|-----------------|-----|----------|
| **faster** | stable-ts (turbo) | No | No | Speed priority, clean audio |
| **fast** | stable-ts | Yes | No | General use, mixed quality |
| **balanced** | faster-whisper | Yes | Yes | Default. Noisy audio, dialogue-heavy |
| **fidelity** | OpenAI Whisper | Yes | Yes (Silero) | Maximum accuracy, slower |
| **transformers** | HuggingFace | Optional | Internal | Japanese-optimized model, customizable |
### Sensitivity Settings
- **Conservative**: Higher thresholds, fewer hallucinations. Good for noisy content.
- **Balanced**: Default. Works for most content.
- **Aggressive**: Lower thresholds, catches more dialogue. Good for whisper/ASMR content.
### Transformers Mode (New in v1.7)
Uses HuggingFace's `kotoba-tech/kotoba-whisper-v2.2` model, which is optimized for Japanese conversational speech:
```bash
whisperjav video.mp4 --mode transformers
# Customize parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20
```
**Transformers-specific options:**
- `--hf-model-id`: Model (default: `kotoba-tech/kotoba-whisper-v2.2`)
- `--hf-chunk-length`: Seconds per chunk (default: 15)
- `--hf-beam-size`: Beam search width (default: 5)
- `--hf-temperature`: Sampling temperature (default: 0.0)
- `--hf-scene`: Scene detection method (`none`, `auditok`, `silero`, `semantic`)
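Under the hood this mode builds on the standard HuggingFace ASR pipeline. A minimal stand-alone sketch of an equivalent call is shown below; it is illustrative only (depending on the model revision, `trust_remote_code=True` may be required for the model's bundled pipeline), and WhisperJAV additionally wraps it in scene detection and post-processing:
```python
# Minimal stand-alone sketch of the HuggingFace ASR pipeline this mode builds on.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v2.2",
    torch_dtype=torch.float16,
    device="cuda:0",
    trust_remote_code=True,            # may be needed for the model's custom pipeline
)

result = asr(
    "scene.wav",
    chunk_length_s=15,                 # mirrors --hf-chunk-length
    return_timestamps=True,
    generate_kwargs={"language": "ja", "task": "transcribe", "num_beams": 5},
)
print(result["text"])
```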
### Two-Pass Ensemble Mode (New in v1.7)
Runs your video through two different pipelines and merges results. Different models catch different things.
```bash
# Pass 1 with transformers, Pass 2 with balanced
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced
# Custom sensitivity per pass
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass1-sensitivity aggressive --pass2-pipeline fidelity
```
**Merge strategies:**
- `smart_merge` (default): Intelligent overlap detection
- `pass1_primary` / `pass2_primary`: Prioritize one pass, fill gaps from other
- `full_merge`: Combine everything from both passes
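The exact merge logic lives inside the ensemble pipeline; the sketch below is only a simplified illustration of the overlap test a smart merge relies on: keep a pass-2 cue only when it does not substantially overlap any pass-1 cue.
```python
# Simplified illustration of overlap-based merging, not WhisperJAV's actual code.
# A cue is (start_seconds, end_seconds, text).
from typing import List, Tuple

Cue = Tuple[float, float, str]

def overlaps(a: Cue, b: Cue, min_ratio: float = 0.5) -> bool:
    """True if the shared span covers at least min_ratio of the shorter cue."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    shortest = min(a[1] - a[0], b[1] - b[0])
    return shortest > 0 and shared > 0 and shared / shortest >= min_ratio

def smart_merge(pass1: List[Cue], pass2: List[Cue]) -> List[Cue]:
    merged = list(pass1)
    for cue in pass2:
        if not any(overlaps(cue, kept) for kept in pass1):
            merged.append(cue)          # pass 2 fills gaps pass 1 missed
    return sorted(merged, key=lambda c: c[0])
```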
### Speech Enhancement tools (New in v1.7.3)
Pre-processes audio scenes. When enabled, enhancement runs per scene, after scene detection.
Note: use this only when the audio genuinely needs surgical intervention. In general, any processing that alters the mel-spectrogram can introduce additional artefacts and hallucinations.
```bash
# ClearVoice denoising (48kHz, best quality)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice
# ClearVoice with specific 16kHz model
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice:FRCRN_SE_16K
# FFmpeg DSP filters (lightweight, always available)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer ffmpeg-dsp:loudnorm,denoise
# ZipEnhancer (lightweight SOTA)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer zipenhancer
# BS-RoFormer vocal isolation
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer bs-roformer
# Ensemble with different enhancers per pass
whisperjav video.mp4 --ensemble \
--pass1-pipeline balanced --pass1-speech-enhancer clearvoice \
--pass2-pipeline transformers --pass2-speech-enhancer none
```
**Available backends:**
| Backend | Description | Models/Options |
|---------|-------------|----------------|
| `none` | No enhancement (default) | - |
| `ffmpeg-dsp` | FFmpeg audio filters | `loudnorm`, `denoise`, `compress`, `highpass`, `lowpass`, `deess` |
| `clearvoice` | ClearerVoice denoising | `MossFormer2_SE_48K` (default), `FRCRN_SE_16K` |
| `zipenhancer` | ZipEnhancer 16kHz | `torch` (GPU), `onnx` (CPU) |
| `bs-roformer` | Vocal isolation | `vocals`, `other` |
**Syntax:** `--pass1-speech-enhancer <backend>` or `--pass1-speech-enhancer <backend>:<model>`
### GUI Parameter Customization
The GUI has three tabs:
1. **Transcription Mode**: Select pipeline, sensitivity, language
2. **Advanced Options**: Model override, scene detection method, debug settings
3. **Two-Pass Ensemble**: Configure both passes with full parameter customization via JSON editor
The Ensemble tab lets you customize beam size, temperature, VAD thresholds, and other ASR parameters without editing config files.
### AI Translation
Generate subtitles and translate them in one step:
```bash
# Generate and translate
whisperjav video.mp4 --translate
# Or translate existing subtitles
whisperjav-translate -i subtitles.srt --provider deepseek
```
Supports DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, and OpenRouter.
**Resume Support**: If translation is interrupted, just run the same command again. It automatically resumes from where it left off using the `.subtrans` project file.
---
## What Makes It Work for JAV
### Scene Detection
Splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting off sentences mid-word.
Three methods are available:
- **Auditok** (default): Energy-based detection, fast and reliable
- **Silero**: Neural VAD-based detection, better for noisy audio
- **Semantic** (new in v1.7.4): Texture-based clustering using MFCC features, groups acoustically similar segments together
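As an illustration of the semantic method, the sketch below groups fixed-length windows by MFCC texture; the shipped implementation may differ in features, window sizes, and clustering thresholds.
```python
# Illustrative sketch of texture-based grouping with MFCCs, not the exact
# implementation shipped in WhisperJAV.
import librosa
import numpy as np
from sklearn.cluster import AgglomerativeClustering

audio, sr = librosa.load("scene.wav", sr=16000, mono=True)
frame = sr * 2                                     # 2-second analysis windows
windows = [audio[i:i + frame] for i in range(0, len(audio) - frame, frame)]

# One MFCC "texture" vector per window: mean and std of 13 coefficients.
feats = []
for w in windows:
    mfcc = librosa.feature.mfcc(y=w, sr=sr, n_mfcc=13)
    feats.append(np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]))
feats = np.array(feats)

labels = AgglomerativeClustering(n_clusters=4).fit_predict(feats)
# Consecutive windows with the same label form one acoustically uniform scene.
```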
### Voice Activity Detection (VAD)
Identifies when someone is actually speaking vs. background noise or music. Reduces false transcriptions during quiet moments.
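For reference, a stand-alone Silero VAD pass looks like the sketch below; WhisperJAV runs an equivalent check per scene rather than over the whole file.
```python
# Stand-alone Silero VAD example; WhisperJAV applies an equivalent pass per scene.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, *_) = utils

wav = read_audio("scene.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
# Each entry is {"start": sample_index, "end": sample_index}; only these spans
# are handed to the ASR model, so silence and music never reach the decoder.
print(speech[:3])
```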
### Japanese Post-Processing
- Handles sentence-ending particles (ね, よ, わ, の)
- Preserves aizuchi (うん, はい, ええ)
- Recognizes dialect patterns (Kansai-ben, feminine/masculine speech)
- Filters out common Whisper hallucinations
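A hypothetical example of the phrase-level part of this step; the blocklist and rules here are illustrative, not WhisperJAV's actual lists:
```python
# Hypothetical illustration of a Japanese post-filter. The phrase blocklist is
# an example, not WhisperJAV's shipped list.
from typing import Optional

COMMON_HALLUCINATIONS = (
    "ご視聴ありがとうございました",   # "thanks for watching" — frequent Whisper filler on silence
    "チャンネル登録お願いします",      # "please subscribe"
)
AIZUCHI = {"うん", "はい", "ええ"}      # backchannel responses are preserved

def filter_line(line: str) -> Optional[str]:
    text = line.strip()
    if text in AIZUCHI:
        return text                     # aizuchi pass through untouched
    if any(phrase in text for phrase in COMMON_HALLUCINATIONS):
        return None                     # drop hallucinated boilerplate
    return text or None
```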
### Hallucination Removal
Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns.
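A simplified illustration of the repetition check (the shipped detector uses more signals, such as timing and confidence):
```python
# Simplified illustration of detecting Whisper repetition loops in consecutive
# subtitle cues; not the exact heuristics WhisperJAV ships.
from itertools import groupby
from typing import List

def drop_repetition_loops(texts: List[str], max_repeats: int = 2) -> List[str]:
    """Collapse runs of identical consecutive cues down to max_repeats copies."""
    out: List[str] = []
    for _, run in groupby(texts):
        out.extend(list(run)[:max_repeats])
    return out

# Example: a hallucination loop repeating the same line 20 times collapses to 2.
cues = ["はい"] * 20 + ["それで、どうする？"]
print(drop_repetition_loops(cues))
```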
---
## Content-Specific Recommendations
| Content Type | Mode | Sensitivity | Notes |
|--------------|------|-------------|-------|
| Drama / Dialogue Heavy | balanced | aggressive | Or try transformers mode |
| Group Scenes | faster | conservative | Speed matters, less precision needed |
| Amateur / Homemade | fast | conservative | Variable audio quality |
| ASMR / VR / Whisper | fidelity | aggressive | Maximum accuracy for quiet speech |
| Heavy Background Music | balanced | conservative | VAD helps filter music |
| Maximum Accuracy | ensemble | varies | Two-pass with different pipelines |
---
## Installation
### Windows Installer (Easiest)
Download and run: **WhisperJAV-1.7.4-Windows-x86_64.exe**
This installs everything you need, including Python and all dependencies.
### Upgrading from Previous Installer Versions
If you installed v1.5.x or v1.6.x via the Windows installer:
1. Download [upgrade_whisperjav.bat](https://github.com/meizhong986/whisperjav/raw/main/installer/upgrade_whisperjav.bat)
2. Double-click to run
3. Wait 1-2 minutes
This updates WhisperJAV without re-downloading PyTorch (~2.5GB) or your AI models (~3GB).
### Install from Source
Requires Python 3.9-3.12, FFmpeg, and Git.
**Recommended: Use the install scripts** (handles dependency conflicts automatically, auto-detects GPU):
<details>
<summary><b>Windows</b></summary>
```batch
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
installer\install_windows.bat # Auto-detects GPU and CUDA version
installer\install_windows.bat --cpu-only # Force CPU only
installer\install_windows.bat --cuda118 # Force CUDA 11.8
installer\install_windows.bat --cuda124 # Force CUDA 12.4
installer\install_windows.bat --minimal # Minimal install (no speech enhancement)
installer\install_windows.bat --dev # Development/editable install
```
The script automatically:
- Detects your NVIDIA GPU and selects optimal CUDA version
- Falls back to CPU-only if no GPU found
- Checks for WebView2 runtime (required for GUI)
- Logs installation to `install_log_windows.txt`
- Retries failed downloads up to 3 times
</details>
<details>
<summary><b>Linux / macOS</b></summary>
```bash
# Install system dependencies first (Linux only)
# Debian/Ubuntu:
sudo apt-get install -y python3-dev build-essential ffmpeg libsndfile1
# Fedora/RHEL:
sudo dnf install python3-devel gcc ffmpeg libsndfile
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
chmod +x installer/install_linux.sh
./installer/install_linux.sh # Auto-detects GPU
./installer/install_linux.sh --cpu-only # Force CPU only
./installer/install_linux.sh --minimal # Minimal install
```
</details>
<details>
<summary><b>Cross-Platform Python Script</b></summary>
```bash
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
python install.py # Auto-detects GPU, defaults to CUDA 12.1
python install.py --cpu-only # CPU only
python install.py --cuda118 # CUDA 11.8
python install.py --cuda121 # CUDA 12.1
python install.py --cuda124 # CUDA 12.4
python install.py --minimal # Minimal install (no speech enhancement)
python install.py --dev # Development/editable install
```
</details>
**Alternative: Manual pip install** (may encounter dependency conflicts):
```bash
# Install PyTorch with GPU support first (NVIDIA example)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
# Then install WhisperJAV
pip install git+https://github.com/meizhong986/whisperjav.git@main
```
**Platform Notes:**
- **Apple Silicon (M1/M2/M3/M4)**: Just `pip install torch torchaudio` - MPS acceleration works automatically
- **AMD GPU (ROCm)**: Experimental. Use `--mode balanced` for best compatibility
- **CPU only**: Works but slow. Use `--accept-cpu-mode` to skip the GPU warning
- **Linux server (no GPU)**: The install scripts auto-detect and switch to CPU-only
- **Linux (Debian/Ubuntu)**: Install system dependencies first: `sudo apt-get install -y python3-dev build-essential ffmpeg libsndfile1`
### Prerequisites
- **Python 3.9-3.12** (3.13+ not compatible with openai-whisper)
- **FFmpeg** in your system PATH
- **GPU recommended**: NVIDIA CUDA, Apple MPS, or AMD ROCm
- **8GB+ disk space** for installation
<details>
<summary>Detailed Windows Prerequisites</summary>
#### NVIDIA GPU Setup
1. Install latest [NVIDIA drivers](https://www.nvidia.com/drivers)
2. Install [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) matching your driver version
3. Install [cuDNN](https://developer.nvidia.com/cudnn) matching your CUDA version
#### FFmpeg
1. Download from [gyan.dev/ffmpeg/builds](https://www.gyan.dev/ffmpeg/builds)
2. Extract to `C:\ffmpeg`
3. Add `C:\ffmpeg\bin` to your PATH
#### Python
Download from [python.org](https://www.python.org/downloads/windows/). Check "Add Python to PATH" during installation.
</details>
---
## CLI Reference
```bash
# Basic usage
whisperjav video.mp4
whisperjav video.mp4 --mode balanced --sensitivity aggressive
# All modes: faster, fast, balanced, fidelity, transformers
whisperjav video.mp4 --mode fidelity
# Transformers mode with custom parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20
# Two-pass ensemble
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass2-pipeline fidelity --merge-strategy smart_merge
# Output options
whisperjav video.mp4 --output-dir ./subtitles
whisperjav video.mp4 --subs-language english-direct
# Batch processing
whisperjav /path/to/folder --output-dir ./subtitles
whisperjav /path/to/folder --skip-existing # Resume interrupted batch (skip already processed)
# Debugging
whisperjav video.mp4 --debug --keep-temp
# Translation
whisperjav video.mp4 --translate --translate-provider deepseek
whisperjav-translate -i subtitles.srt --provider gemini
```
Run `whisperjav --help` for all options.
---
## Troubleshooting
**FFmpeg not found**: Install FFmpeg and add it to your PATH.
**Slow processing / GPU warning**: Your PyTorch might be CPU-only. Reinstall with GPU support:
```bash
pip uninstall torch torchvision torchaudio
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
```
**model.bin error in faster mode**: Enable Windows Developer Mode or run as Administrator, then delete the cached model folder:
```powershell
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--Systran--faster-whisper-large-v2"
```
---
## Performance
Rough estimates for processing time per hour of video:
| Platform | Time |
|----------|------|
| NVIDIA GPU (CUDA) | 5-10 minutes |
| Apple Silicon (MPS) | 8-15 minutes |
| AMD GPU (ROCm) | 10-20 minutes |
| CPU only | 30-60 minutes |
---
## Contributing
Contributions welcome. See `CONTRIBUTING.md` for guidelines.
```bash
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
pip install -e .[dev]
python -m pytest tests/
```
---
## License
MIT License. See [LICENSE](LICENSE) file.
---
## Citation and credits
- "Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio." (2025). arXiv:2501.11378.
- "Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down." (2025). arXiv:2505.12969.
- "PromptASR for Contextualized ASR with Controllable Style." (2024). arXiv:2309.07414.
- "In-Context Learning Boosts Speech Recognition." (2025). arXiv:2505.1
- Koenecke, A., et al. (2024). "Careless Whisper: Speech-to-Text Hallucination Harms." ACM FAccT 2024.
- Bain, M., et al. (2023). "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio." arXiv:2303.00747.
## Acknowledgments
- [OpenAI Whisper](https://github.com/openai/whisper) - The underlying ASR model
- [stable-ts](https://github.com/jianfch/stable-ts) - Timestamp refinement
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) - Optimized CTranslate2 inference
- [HuggingFace Transformers](https://github.com/huggingface/transformers) - Transformers pipeline backend
- [Kotoba-Whisper](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2) - Japanese-optimized Whisper model
- The testing community for feedback and bug reports
---
## Disclaimer
This tool generates accessibility subtitles. Users are responsible for compliance with applicable laws regarding the content they process.
", Assign "at most 3 tags" to the expected json: {"id":"16603","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"