# Sonic
Official implementation of "Sonic: Shifting Focus to Global Audio Perception in Portrait Animation".
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation, CVPR 2025.
<a href='https://jixiaozhong.github.io/Sonic/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href="http://demo.sonic.jixiaozhong.online/" style="margin: 0 2px;">
<img src='https://img.shields.io/badge/Demo-Gradio-gold?style=flat&logo=Gradio&logoColor=red' alt='Demo'>
</a>
<a href='https://openaccess.thecvf.com/content/CVPR2025/papers/Ji_Sonic_Shifting_Focus_to_Global_Audio_Perception_in_Portrait_Animation_CVPR_2025_paper.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href="https://huggingface.co/spaces/xiaozhongji/Sonic" style="margin: 0 2px;">
<img src='https://img.shields.io/badge/Space-ZeroGPU-orange?style=flat&logo=Gradio&logoColor=red' alt='Demo'>
</a>
<a href="https://raw.githubusercontent.com/jixiaozhong/Sonic/refs/heads/main/LICENSE" style="margin: 0 2px;">
<img src='https://img.shields.io/badge/License-CC BY--NC--SA--4.0-lightgreen?style=flat&logo=Lisence' alt='License'>
</a>
<p align="center">
Join our <a href="examples/image/QQ2.jpg" target="_blank">QQ Chat Group</a>
</p>
## NEWS
**`2025/05/06`**: We have open-sourced [**DICE-Talk**](https://github.com/toto222/DICE-Talk), a portrait-driven system with emotional expression. Feel free to try it out!
**`2025/03/14`**: Super stoked to share that Sonic has been accepted to CVPR 2025! See you in Nashville!
**`2025/02/08`**: Many thanks to the open-source community contributors for making the ComfyUI version of Sonic a reality. Your efforts are truly appreciated! [**ComfyUI version of Sonic**](https://github.com/smthemex/ComfyUI_Sonic)
**`2025/02/06`**: Commercialization: Note that our license is **non-commercial**. If commercialization is required, please use Tencent Cloud Video Creation Large Model: [**Introduction**](https://cloud.tencent.com/product/vclm) / [**API documentation**](https://cloud.tencent.com/document/api/1616/109378)
**`2025/01/17`**: Our [**Online Hugging Face Demo**](https://huggingface.co/spaces/xiaozhongji/Sonic/) is released.
**`2025/01/17`**: Thank you to NewGenAI for promoting our Sonic and creating a Windows-based tutorial on [**YouTube**](https://www.youtube.com/watch?v=KiDDtcvQyS0).
**`2024/12/16`**: Our [**Online Demo**](http://demo.sonic.jixiaozhong.online/) is released.
## Demo
| Input | Output | Input | Output |
|----------------------|-----------------------|----------------------|-----------------------|
|<img src="examples/image/anime1.png" width="360">|<video src="https://github.com/user-attachments/assets/636c3ff5-210e-44b8-b901-acf828071133" width="360"> </video>|<img src="examples/image/female_diaosu.png" width="360">|<video src="https://github.com/user-attachments/assets/e8207300-2569-47d1-9ad4-4b4c9b0f0bd4" width="360"> </video>|
|<img src="examples/image/hair.png" width="360">|<video src="https://github.com/user-attachments/assets/dcb755c1-de01-4afe-8b4f-0e0b2c2439c1" width="360"> </video>|<img src="examples/image/leonnado.jpg" width="360">|<video src="https://github.com/user-attachments/assets/b50e61bb-62d4-469d-b402-b37cda3fbd27" width="360"> </video>|
For more visual demos, please visit our [**Page**](https://jixiaozhong.github.io/Sonic/).
## Community Contributions
If you develop with or use Sonic in your projects, please let us know.
- ComfyUI version of Sonic: [**ComfyUI_Sonic**](https://github.com/smthemex/ComfyUI_Sonic)
## Updates
**`2025/01/14`**: Our inference code and weights are released. Stay tuned, we will continue to polish the model.
## Requirements
* An NVIDIA GPU with CUDA support is required.
* The model is tested on a single 32 GB GPU.
* Tested operating system: Linux
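The requirements above can be checked before running anything heavy. The following is a minimal sketch, not part of the Sonic code base: the function name `check_environment` and the `min_gpu_gb` parameter are hypothetical, and the check assumes PyTorch is the framework in use, as the installation step suggests.

```python
from __future__ import annotations

import platform


def check_environment(min_gpu_gb: float = 32.0) -> list[str]:
    """Return human-readable warnings; an empty list means every check passed."""
    warnings = []

    # The README lists Linux as the only tested operating system.
    if platform.system() != "Linux":
        warnings.append(f"Untested operating system: {platform.system()}")

    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        warnings.append("PyTorch is not installed")
        return warnings

    if not torch.cuda.is_available():
        warnings.append("No CUDA-capable NVIDIA GPU detected")
        return warnings

    # total_memory is in bytes. Drivers report slightly less than the nominal
    # size, so round to the nearest GB before comparing.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if round(total_gb) < min_gpu_gb:
        warnings.append(
            f"GPU 0 has about {total_gb:.1f} GB; the model was tested with {min_gpu_gb:.0f} GB"
        )
    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("WARNING:", message)
```

A warning here does not necessarily mean inference will fail; it only flags a setup that differs from the tested one.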
## Inference
### Installation
- Install PyTorch, then install the remaining requirements:
```shell
pip3 install -r requirements.txt
```
- All models are stored in `checkpoints` by default, and the file structure is as follows:
```shell
Sonic
├──checkpoints
│  ├──Sonic
│  │  ├──audio2bucket.pth
│  │  ├──audio2token.pth
│  │  ├──unet.pth
│  ├──stable-video-diffusion-img2vid-xt
│  │  ├──...
│  ├──whisper-tiny
│  │  ├──...
│  ├──RIFE
│  │  ├──flownet.pkl
│  ├──yoloface_v5m.pt
├──...
```
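To confirm that a local `checkpoints` folder matches the layout above, a small check like the following may help. It is a sketch only: the helper name `missing_checkpoints` is hypothetical, and because the tree elides the contents of the `stable-video-diffusion-img2vid-xt` and `whisper-tiny` folders, the check only verifies that those folders exist and are non-empty.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "Sonic/audio2bucket.pth",
    "Sonic/audio2token.pth",
    "Sonic/unet.pth",
    "RIFE/flownet.pkl",
    "yoloface_v5m.pt",
]
EXPECTED_DIRS = [
    "stable-video-diffusion-img2vid-xt",
    "whisper-tiny",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected entries that are absent (or empty) under `root`."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    for name in EXPECTED_DIRS:
        folder = root / name
        # Folder contents are not listed in the README, so only require non-empty.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name + "/")
    return missing


if __name__ == "__main__":
    for entry in missing_checkpoints():
        print("missing:", entry)
```

Running it from the repository root prints one line per missing entry and nothing when the layout is complete.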
Download the models with `huggingface-cli` as follows:
```shell
python3 -m pip install "huggingface_hub[cli]"
huggingface-cli download LeonJoe13/Sonic --local-dir checkpoints
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt --local-dir checkpoints/stable-video-diffusion-img2vid-xt
huggingface-cli download openai/whisper-tiny --local-dir checkpoints/whisper-tiny
```
Alternatively, manually download the [pretrained model](https://drive.google.com/drive/folders/1oe8VTPUy0-MHHW2a_NJ1F8xL-0VN5G7W?usp=drive_link), [svd-xt](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) and [whisper-tiny](https://huggingface.co/openai/whisper-tiny) to `checkpoints/`.
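The same three downloads can also be scripted from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch under assumptions: the repository IDs and target folders are copied from the commands above, while the names `DOWNLOADS` and `download_all` and the `fetch` parameter (which allows a stand-in function for dry runs) are hypothetical.

```python
# (repo_id, local_dir) pairs copied from the huggingface-cli commands above.
DOWNLOADS = [
    ("LeonJoe13/Sonic", "checkpoints"),
    (
        "stabilityai/stable-video-diffusion-img2vid-xt",
        "checkpoints/stable-video-diffusion-img2vid-xt",
    ),
    ("openai/whisper-tiny", "checkpoints/whisper-tiny"),
]


def download_all(downloads=DOWNLOADS, fetch=None):
    """Download each repository snapshot into its target folder."""
    if fetch is None:
        # Imported lazily so this module loads without huggingface_hub installed.
        from huggingface_hub import snapshot_download as fetch
    return [
        fetch(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in downloads
    ]
```

Calling `download_all()` with no arguments performs the real downloads, so it needs network access and enough disk space for all three repositories.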
### Run demo
```shell
python3 demo.py \
'/path/to/input_image' \
'/path/to/input_audio' \
'/path/to/output_video'
```
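To animate one portrait with several audio clips, the command above can be wrapped in a short loop. The sketch below assumes only what the command shows, namely that `demo.py` takes the image, audio, and output paths as three positional arguments. The helper names, the `*.wav` pattern, and the `.mp4` output extension are illustrative assumptions.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image, audio, output, script="demo.py"):
    """Build the argument list for one demo.py run: image, audio, output."""
    return [sys.executable, str(script), str(image), str(audio), str(output)]


def animate_many(image, audio_dir, output_dir, pattern="*.wav"):
    """Run demo.py once per audio file in `audio_dir`, reusing one portrait."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for audio in sorted(Path(audio_dir).glob(pattern)):
        output = output_dir / (audio.stem + ".mp4")
        # check=True raises CalledProcessError if demo.py exits with an error.
        subprocess.run(build_command(image, audio, output), check=True)
        outputs.append(output)
    return outputs
```

For example, `animate_many("examples/image/hair.png", "my_audio", "results")` would write one video per clip into `results/`, where `my_audio` is a placeholder folder name. Each run starts a fresh process, so the models are reloaded every time.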
## Citation
If you find our work helpful for your research, please consider citing our work.
```bibtex
@inproceedings{ji2025sonic,
title={Sonic: Shifting focus to global audio perception in portrait animation},
author={Ji, Xiaozhong and Hu, Xiaobin and Xu, Zhihong and Zhu, Junwei and Lin, Chuming and He, Qingdong and Zhang, Jiangning and Luo, Donghao and Chen, Yi and Lin, Qin and others},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={193--203},
year={2025}
}
@article{ji2024realtalk,
title={Realtalk: Real-time and realistic audio-driven face generation with 3d facial prior-guided identity alignment network},
author={Ji, Xiaozhong and Lin, Chuming and Ding, Zhonggan and Tai, Ying and Zhu, Junwei and Hu, Xiaobin and Luo, Donghao and Ge, Yanhao and Wang, Chengjie},
journal={arXiv preprint arXiv:2406.18284},
year={2024}
}
@article{tan2025dicetalk,
title={Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation},
author={Tan, Weipeng and Lin, Chuming and Xu, Chengming and Xu, FeiFan and Hu, Xiaobin and Ji, Xiaozhong and Zhu, Junwei and Wang, Chengjie and Fu, Yanwei},
journal={arXiv preprint arXiv:2504.18087},
year={2025}
}
```
## Related Works
Explore our related research:
- **[Super-fast talk: real-time and less GPU computation]** [Realtalk: Real-time and realistic audio-driven face generation with 3d facial prior-guided identity alignment network](https://arxiv.org/pdf/2406.18284)
## Star History
[Star History Chart](https://star-history.com/#jixiaozhong/Sonic&Date)
", Assign "at most 3 tags" to the expected json: {"id":"13467","tags":[]} "only from the tags list I provide: []" returns me the "expected json"