# Separate Anything You Describe

[![arXiv](https://img.shields.io/badge/arXiv-Paper-red.svg)](https://arxiv.org/abs/2308.05037) [![GitHub Stars](https://img.shields.io/github/stars/Audio-AGI/AudioSep?style=social)](https://github.com/Audio-AGI/AudioSep/) [![githubio](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://audio-agi.github.io/Separate-Anything-You-Describe) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Audio-AGI/AudioSep/blob/main/AudioSep_Colab.ipynb) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Audio-AGI/AudioSep) [![Replicate](https://replicate.com/cjwbw/audiosep/badge)](https://replicate.com/cjwbw/audiosep)

This repository contains the official implementation of ["Separate Anything You Describe"](https://audio-agi.github.io/Separate-Anything-You-Describe/AudioSep_arXiv.pdf).

We introduce AudioSep, a foundation model for open-domain sound separation with natural language queries. AudioSep demonstrates strong separation performance and impressive zero-shot generalization on numerous tasks, such as audio event separation, musical instrument separation, and speech enhancement. Check out the separated audio examples on the [Demo Page](https://audio-agi.github.io/Separate-Anything-You-Describe/)!

<p align="center">
  <img align="middle" width="800" src="assets/results.png"/>
</p>

<hr>

## Setup

Clone the repository and set up the conda environment:

```shell
git clone https://github.com/Audio-AGI/AudioSep.git && \
cd AudioSep && \
conda env create -f environment.yml && \
conda activate AudioSep
```

Download the [model weights](https://huggingface.co/spaces/Audio-AGI/AudioSep/tree/main/checkpoint) and place them under `checkpoint/`.

If you are using this checkpoint for the DCASE 2024 Task 9 challenge, please note that it was trained on 32 kHz audio, with a window size of 2048 points and a hop size of 320 points in the STFT operation, which differs from the provided challenge baseline system (16 kHz, window size 1024, hop size 160).

<hr>

## Inference

```python
from pipeline import build_audiosep, inference
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = build_audiosep(
    config_yaml='config/audiosep_base.yaml',
    checkpoint_path='checkpoint/audiosep_base_4M_steps.ckpt',
    device=device)

audio_file = 'path_to_audio_file'
text = 'textual_description'
output_file = 'separated_audio.wav'

# AudioSep processes the audio at a 32 kHz sampling rate
inference(model, audio_file, text, output_file, device)
```

<hr>

To load the model directly from Hugging Face:

```python
from models.audiosep import AudioSep
from pipeline import inference
from utils import get_ss_model
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

ss_model = get_ss_model('config/audiosep_base.yaml')

model = AudioSep.from_pretrained("nielsr/audiosep-demo", ss_model=ss_model)

audio_file = 'path_to_audio_file'
text = 'textual_description'
output_file = 'separated_audio.wav'

# AudioSep processes the audio at a 32 kHz sampling rate
inference(model, audio_file, text, output_file, device)
```

<hr>

Use chunk-based inference to save memory:

```python
inference(model, audio_file, text, output_file, device, use_chunk=True)
```
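Since the model operates at a 32 kHz sampling rate, you may want to resample inputs yourself before calling `inference`. The snippet below is an optional pre-processing sketch, not part of the official pipeline; it assumes `torchaudio` is installed, and the file paths are placeholders:

```python
# Optional pre-processing sketch (assumes torchaudio is available).
# Resamples an arbitrary input file to the 32 kHz rate AudioSep expects.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load('path_to_audio_file')  # placeholder path
if sr != 32000:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=32000)
torchaudio.save('audio_32k.wav', waveform, sample_rate=32000)
```

The resampled file (`audio_32k.wav` here) can then be passed as `audio_file` above.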
## Training

To train AudioSep on your own audio-text paired dataset:

1. Format your dataset to match our JSON structure. Refer to the provided template at `datafiles/template.json` (a rough illustrative sketch of building such a datafile appears at the end of this section).
2. Update the `config/audiosep_base.yaml` file by listing your formatted JSON data files under `datafiles`. For example:

```yaml
data:
    datafiles:
        - 'datafiles/your_datafile_1.json'
        - 'datafiles/your_datafile_2.json'
        ...
```

Train AudioSep from scratch (an empty `--resume_checkpoint_path` means no checkpoint is resumed):

```shell
python train.py --workspace workspace/AudioSep --config_yaml config/audiosep_base.yaml --resume_checkpoint_path ''
```

Finetune AudioSep from a pretrained checkpoint:

```shell
python train.py --workspace workspace/AudioSep --config_yaml config/audiosep_base.yaml --resume_checkpoint_path path_to_checkpoint
```
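For step 1, `datafiles/template.json` is the authoritative schema; purely as an illustration of the audio-text pairing idea, a datafile could be generated along the lines of the hypothetical sketch below (the `wav` and `caption` field names are assumptions, so defer to the template):

```python
# Hypothetical sketch of writing one audio-text datafile entry.
# Field names are assumptions; check datafiles/template.json for the real schema.
import json

datafile = {
    "data": [
        {
            "wav": "path_to_audio_file.wav",                # path to the audio clip
            "caption": "textual description of the source",  # language query paired with it
        }
    ]
}

with open('datafiles/your_datafile_1.json', 'w') as f:
    json.dump(datafile, f, indent=4)
```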
<hr>

## Benchmark Evaluation

Download the [evaluation data](https://drive.google.com/drive/folders/1PbCsuvdrzwAZZ_fwIzF0PeVGZkTk0-kL?usp=sharing) under the `evaluation/data` folder. The data should be organized as follows:

```yaml
evaluation:
    data:
        - audioset/
        - audiocaps/
        - vggsound/
        - music/
        - clotho/
        - esc50/
```

Run the benchmark inference script; the results will be saved under `eval_logs/`:

```shell
python benchmark.py --checkpoint_path audiosep_base_4M_steps.ckpt
```

Expected evaluation results:

| Benchmark | Avg SDRi | SISDR |
| --------- | -------- | ----- |
| VGGSound  | 9.144    | 9.043 |
| MUSIC     | 10.508   | 9.425 |
| ESC-50    | 10.040   | 8.810 |
| AudioSet  | 7.739    | 6.903 |
| AudioCaps | 8.220    | 7.189 |
| Clotho    | 6.850    | 5.242 |

## Cite this work

If you find this tool useful, please consider citing:

```bibtex
@article{liu2023separate,
  title={Separate Anything You Describe},
  author={Liu, Xubo and Kong, Qiuqiang and Zhao, Yan and Liu, Haohe and Yuan, Yi and Liu, Yuzhuo and Xia, Rui and Wang, Yuxuan and Plumbley, Mark D and Wang, Wenwu},
  journal={arXiv preprint arXiv:2308.05037},
  year={2023}
}
```

```bibtex
@inproceedings{liu22w_interspeech,
  title={Separate What You Describe: Language-Queried Audio Source Separation},
  author={Liu, Xubo and Liu, Haohe and Kong, Qiuqiang and Mei, Xinhao and Zhao, Jinzheng and Huang, Qiushi and Plumbley, Mark D and Wang, Wenwu},
  booktitle={Proc. Interspeech},
  pages={1801--1805},
  year={2022}
}
```

## Contributors

<a href="https://github.com/Audio-AGI/AudioSep/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=Audio-AGI/AudioSep" />
</a>