<p align="center">
<img src="https://s21.ax1x.com/2025/06/03/pVCBdw8.png" width="200"/>
</p>
<h2 align="center">
<a href="https://arxiv.org/abs/2506.03147">
UniWorld-V1: High-Resolution Semantic Encoders for <br> Unified Visual Understanding and Generation
</a>
</h2>
[Discord](https://discord.gg/YyMBeR4bfS)
[Demo](https://github.com/user-attachments/assets/e187584a-f096-44df-b26b-f85aae838a18)<br>
[arXiv](https://arxiv.org/abs/2506.03147)
[HF Paper](https://huggingface.co/papers/2506.03147)
[HF Model](https://huggingface.co/LanguageBind/UniWorld-V1)
[HF Dataset](https://huggingface.co/datasets/LanguageBind/UniWorld-V1)
[License](https://github.com/PKU-YuanGroup/UniWorld-V1/blob/main/LICENSE)
[X (Twitter)](https://x.com/LinBin46984/status/1929905024349679682) <br>
[Online Demo 1](http://8.130.165.159:8800/)
[Online Demo 2](http://8.130.165.159:8801/)
[Online Demo 3](http://8.130.165.159:8802/)
[Online Demo 4](http://8.130.165.159:8803/)
[Online Demo 5](http://8.130.165.159:8804/)
[Online Demo 6](http://8.130.165.159:8805/)
[Online Demo 7](http://8.130.165.159:8806/)
[Online Demo 8](http://8.130.165.159:8807/) <br>
[Stars](https://github.com/PKU-YuanGroup/UniWorld-V1/stargazers)
[Forks](https://github.com/PKU-YuanGroup/UniWorld-V1/network)
[Watchers](https://github.com/PKU-YuanGroup/UniWorld-V1/watchers)
[Download](https://github.com/PKU-YuanGroup/UniWorld-V1/archive/refs/heads/main.zip) <br>
[Contributors](https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors)
[Commits](https://github.com/PKU-YuanGroup/UniWorld-V1/commits/main/)
[Pull Requests](https://github.com/PKU-YuanGroup/UniWorld-V1/pulls)
[Open Issues](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aopen+is%3Aissue)
[Closed Issues](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aissue+is%3Aclosed)
# 📣 News
* **[2025.06.03]** 🤗 We release UniWorld-V1, a unified framework for understanding, generation, and editing. All [data](https://huggingface.co/datasets/LanguageBind/UniWorld-V1), [models](https://huggingface.co/LanguageBind/UniWorld-V1), [training code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-training), and [evaluation code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-evaluation) are open-sourced. Check our [report](https://arxiv.org/abs/2506.03147) for more details, and feel free to **watch** 👀 this repository for the latest updates.
<p align="center">
<img src="https://github.com/user-attachments/assets/e187584a-f096-44df-b26b-f85aae838a18" width="200"/>
</p>
<br>
<details open><summary>💡 We also have other image editing projects that may interest you ✨. </summary><p>
> [**ImgEdit: A Unified Image Editing Dataset and Benchmark**](https://arxiv.org/abs/2505.20275) <br>
> Yang Ye, Xianyi He, et al. <br>
> [GitHub](https://github.com/PKU-YuanGroup/ImgEdit) [arXiv](https://arxiv.org/abs/2505.20275) <br>
> [**WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation**](https://arxiv.org/abs/2503.07265) <br>
> Yuwei Niu, Munan Ning, et al. <br>
> [GitHub](https://github.com/PKU-YuanGroup/WISE) [arXiv](https://arxiv.org/abs/2503.07265) <br>
> [**Open-Sora Plan: Open-Source Large Video Generation Model**](https://arxiv.org/abs/2412.00131) <br>
> Bin Lin, Yunyang Ge, Xinhua Cheng, et al. <br>
> [GitHub](https://github.com/PKU-YuanGroup/Open-Sora-Plan) [arXiv](https://arxiv.org/abs/2412.00131) <br>
</p></details>
# 😍 Gallery
UniWorld-V1 delivers strong performance across **20+** tasks.
**Click to play**
<p align="left">
<a href="https://www.youtube.com/watch?v=77U0PKH7uxs" target="_blank">
<img src="https://github.com/user-attachments/assets/dbb2acf7-3a54-44b5-9bca-b30cb3385056" width="850" style="margin-bottom: 0.2;"/>
</a>
</p>
<p align="left">
<img src="https://s21.ax1x.com/2025/06/03/pVCB6ln.png" width="850" style="margin-bottom: 0.2;"/>
</p>
# 😮 Highlights
### 1. All Resources Fully Open-Sourced
- We fully open-source the models, data, training and evaluation code to facilitate rapid community exploration of unified architectures.
- We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, segmentation, and more.
- We annotate 286K long-caption samples using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). We use GPT-4o to filter [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit), resulting in 724K high-quality editing samples (all with a short edge ≥ 1024 pixels). Additionally, we organize and filter existing open-source datasets. Details can be found [here](https://github.com/PKU-YuanGroup/UniWorld-V1/tree/main?tab=readme-ov-file#data-details).
### 2. Contrastive Semantic Encoders as Reference Control Signals
- Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.
- For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.
### 3. Image Priors via VLM Encoding Without Learnable Tokens
- We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format `<instruction><image>` is particularly important.
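To make the ordering concrete, here is a minimal illustrative sketch, not the UniWorld-V1 inference pipeline, that renders a prompt with the underlying Qwen2.5-VL chat template; the instruction text and image path are hypothetical placeholders.
```python
# Minimal sketch of the <instruction><image> ordering discussed above.
# This is NOT the UniWorld-V1 inference code; it only shows how the underlying
# Qwen2.5-VL chat template renders the instruction tokens before the image tokens.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        # Instruction first, then the reference image: with causal attention,
        # the image tokens can attend to the full instruction.
        {"type": "text", "text": "Replace the red car with a blue bicycle."},  # hypothetical instruction
        {"type": "image", "image": "path/to/reference.jpg"},                   # hypothetical path
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the instruction text appears before the image placeholder tokens
```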
<p align="left">
<img src="https://s21.ax1x.com/2025/06/03/pVCB5Y4.jpg" width="850" style="margin-bottom: 0.2;"/>
</p>
# 🔥 Quick Start
1. Set up the environment
```bash
git clone https://github.com/PKU-YuanGroup/UniWorld-V1
cd UniWorld-V1
conda create -n univa python=3.10 -y
conda activate univa
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
```
2. Download pretrained checkpoints
```bash
huggingface-cli download --resume-download LanguageBind/UniWorld-V1 --local-dir ${MODEL_PATH}
huggingface-cli download --resume-download black-forest-labs/FLUX.1-dev --local-dir ${FLUX_PATH}
huggingface-cli download --resume-download google/siglip2-so400m-patch16-512 --local-dir ${SIGLIP_PATH}
```
3. Run with the CLI
```bash
MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=0 python -m univa.serve.cli \
--model_path ${MODEL_PATH} \
--flux_path ${FLUX_PATH} \
--siglip_path ${SIGLIP_PATH}
```
4. Run with Gradio
We highly recommend trying our web demo with the following command.
```bash
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH}
```
For a 24 GB VRAM GPU on Linux, use NF4 quantization (many thanks to [@gluttony-10](https://github.com/gluttony-10) for the contribution). Then run the following command:
```bash
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4
```
Alternatively, download [wikeeyang/UniWorld-V1-NF4](https://huggingface.co/wikeeyang/UniWorld-V1-NF4) to `${MODEL_PATH}` and [diffusers/FLUX.1-dev-bnb-4bit](https://huggingface.co/diffusers/FLUX.1-dev-bnb-4bit) to `${FLUX_PATH}` instead.
For a 24 GB VRAM GPU on Windows, use NF4 quantization with offloading, which requires only about 20 GB of VRAM. Then run the following command:
```bash
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4 --offload
```
To use the Chinese-language interface, add the `--zh` flag.
5. Run with ComfyUI
Many thanks to [@judian17](https://github.com/judian17) for the contribution! [ComfyUI-UniWorld-jd17](https://github.com/judian17/ComfyUI-UniWorld-jd17) is a ComfyUI implementation provided by the open-source community. Note that the required `transformers` version is 4.50.0.
# 🗝️ Training
### Data preparation
Download the data from [LanguageBind/UniWorld-V1](https://huggingface.co/datasets/LanguageBind/UniWorld-V1). The dataset consists of two parts: source images and annotation JSON files.
Prepare a `data.txt` file in the following format:
1. The first column is the root path of the images.
2. The second column is the corresponding annotation JSON file.
3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to `true` for editing data and `false` for the rest.
```
data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
```
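As a quick sanity check, the sketch below (a hypothetical helper, not part of this repository) parses rows in the format above into their three fields and verifies that the paths exist.
```python
# Hypothetical helper (not part of this repo): parse and sanity-check data.txt rows.
from pathlib import Path
from typing import List, NamedTuple


class DataEntry(NamedTuple):
    image_root: Path        # column 1: root path of the images
    anno_json: Path         # column 2: annotation JSON file
    region_weighting: bool  # column 3: enable the region-weighting strategy


def load_data_txt(path: str) -> List[DataEntry]:
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        image_root, anno_json, flag = (col.strip() for col in line.split(","))
        entries.append(DataEntry(Path(image_root), Path(anno_json), flag.lower() == "true"))
    return entries


if __name__ == "__main__":
    for entry in load_data_txt("data.txt"):
        assert entry.image_root.exists(), f"missing image root: {entry.image_root}"
        assert entry.anno_json.exists(), f"missing annotation file: {entry.anno_json}"
```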
We have also prepared a `data.txt` file for ImgEdit for reference.
<details><summary>`data.txt` for ImgEdit</summary><p>
```
data/imgedit/action/action,json/imgedit/pandam_action_edit.json,true
data/imgedit/action/action_part2,json/imgedit/pandam2_action_edit.json,true
data/imgedit/action/action_part3,json/imgedit/pandam3_action_edit.json,true
data/imgedit/action/action_part4,json/imgedit/pandam4_action_edit.json,true
data/imgedit/add/add_part0,json/imgedit/laion_add_part0_edit.json,true
data/imgedit/add/add_part1,json/imgedit/laion_add_part1_edit.json,true
data/imgedit/add/add_part4,json/imgedit/results_add_laion_part4_edit.json,true
data/imgedit/add/add_part5,json/imgedit/results_add_laion_part5_edit.json,true
data/imgedit/adjust/adjust_part0,json/imgedit/results_adjust_canny_laion_part0_edit.json,true
data/imgedit/adjust/adjust_part2,json/imgedit/results_adjust_canny_laion_part2_edit.json,true
data/imgedit/adjust/adjust_part3,json/imgedit/results_adjust_canny_laion_part3_edit.json,true
data/imgedit/adjust/adjust_part4,json/imgedit/laion_adjust_canny_part4_edit.json,true
data/imgedit/background/background_part0,json/imgedit/results_background_laion_part0_edit.json,true
data/imgedit/background/background_part2,json/imgedit/results_background_laion_part2_edit.json,true
data/imgedit/background/background_part3,json/imgedit/laion_background_part3_edit.json,true
data/imgedit/background/background_part5,json/imgedit/laion_background_part5_edit.json,true
data/imgedit/background/background_part7,json/imgedit/laion_background_part7_edit.json,true
data/imgedit/compose/compose_part0,json/imgedit/results_compose_part0_edit.json,false
data/imgedit/compose/compose_part2,json/imgedit/results_compose_part2_edit.json,false
data/imgedit/compose/compose_part6,json/imgedit/results_compose_part6_fix_edit.json,false
data/imgedit/refine_replace/refine_replace_part1,json/imgedit/results_extract_ref_part1_refimg_edit.json,true
data/imgedit/remove/remove_part0,json/imgedit/laion_remove_part0_edit.json,true
data/imgedit/remove/remove_part1,json/imgedit/results_remove_laion_part1_edit.json,true
data/imgedit/remove/remove_part4,json/imgedit/results_remove_laion_part4_edit.json,true
data/imgedit/remove/remove_part5,json/imgedit/results_remove_laion_part5_edit.json,true
data/imgedit/replace/replace_part0,json/imgedit/laion_replace_part0_edit.json,true
data/imgedit/replace/replace_part1,json/imgedit/laion_replace_part1_edit.json,true
data/imgedit/replace/replace_part4,json/imgedit/results_replace_laion_part4_edit.json,true
data/imgedit/replace/replace_part5,json/imgedit/results_replace_laion_part5_edit.json,true
data/imgedit/transfer/transfer,json/imgedit/results_style_transfer_edit.json,false
data/imgedit/transfer/transfer_part0,json/imgedit/results_style_transfer_part0_cap36472_edit.json,false
```
</p></details>
We provide a simple online verification tool to check whether the paths in your `data.txt` are set correctly.
```bash
python univa/serve/check_data.py
```
<p align="left">
<img src="https://s21.ax1x.com/2025/05/30/pV9iP8f.png" width="850" style="margin-bottom: 0.2;"/>
</p>
### Data details
<details><summary>Text-to-Image Generation</summary><p>
- [BLIP3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k): We add text-to-image instructions to half of the data. [108 GB storage usage.]
- [OSP1024-286k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/OSP1024-286k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), with captions generated using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]
</p></details>
<details><summary>Image Editing</summary><p>
- [imgedit-724k](https://huggingface.co/datasets/sysuyy/ImgEdit/tree/main): Data is filtered using GPT-4o, retaining approximately half. [2.8T storage usage.]
- [OmniEdit-368k](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M): For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
- [SEED-Data-Edit-Part1-Openimages-65k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part1-Openimages): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- [SEED-Data-Edit-Part2-3-12k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part2-3): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- [PromptfixData-18k](https://huggingface.co/datasets/yeates/PromptfixData): For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
- [StyleBooth-11k](https://huggingface.co/scepter-studio/stylebooth): For transfer style data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
- [Ghibli-36k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/Ghibli-36k): For transfer style data, images have a short side ≥ 1024 pixels. **Warning: This data has not been quality filtered.** [170 GB storage usage.]
</p></details>
<details><summary>Extract & Try-on</summary><p>
- [viton_hd-23k](https://huggingface.co/datasets/forgeml/viton_hd): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- [deepfashion-27k](https://huggingface.co/datasets/lirus18/deepfashion): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- [shop_product-23k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/shop_product-23k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]
</p></details>
<details><summary>Image Perception</summary><p>
- [coco2017_caption_canny-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_canny): img->canny & canny->img [25 GB storage usage.]
- [coco2017_caption_depth-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_depth): img->depth & depth->img [8 GB storage usage.]
- [coco2017_caption_hed-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_hed): img->hed & hed->img [13 GB storage usage.]
- [coco2017_caption_mlsd-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_mlsd): img->mlsd & mlsd->img [ GB storage usage.]
- [coco2017_caption_normal-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_normal): img->normal & normal->img [10 GB storage usage.]
- [coco2017_caption_openpose-62k](https://huggingface.co/datasets/wangherr/coco2017_caption_openpose): img->pose & pose->img [2 GB storage usage.]
- [coco2017_caption_sketch-236k](https://huggingface.co/datasets/wangherr/coco2017_caption_sketch): img->sketch & sketch->img [15 GB storage usage.]
- [unsplash_canny-20k](https://huggingface.co/datasets/wtcherr/unsplash_10k_canny): img->canny & canny->img [2 GB storage usage.]
- [open_pose-40k](https://huggingface.co/datasets/raulc0399/open_pose_controlnet): img->pose & pose->img [4 GB storage usage.]
- [mscoco-controlnet-canny-less-colors-236k](https://huggingface.co/datasets/hazal-karakus/mscoco-controlnet-canny-less-colors): img->canny & canny->img [13 GB storage usage.]
- [coco2017_seg_box-448k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/coco2017_seg_box-448k): img->detection & img->segmentation (mask); instances with regions smaller than 1/100 were filtered out. We visualize masks overlaid on the original image as the ground-truth image. [39 GB storage usage.]
- [viton_hd-11k](https://huggingface.co/datasets/forgeml/viton_hd): img->pose [1 GB storage usage.]
- [deepfashion-13k](https://huggingface.co/datasets/lirus18/deepfashion): img->pose [1 GB storage usage.]
</p></details>
### Training
#### Prepare pretrained weights
Download [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) to `$FLUX_PATH`.
Download [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) to `$QWENVL_PATH`. We also support other sizes of Qwen2.5-VL.
```bash
SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
python scripts/make_univa_qwen2p5vl_weight.py \
--origin_flux_ckpt_path $FLUX_PATH \
--origin_qwenvl_ckpt_path $QWENVL_PATH \
--save_path ${SAVE_PATH}
```
#### Stage 1
Set `pretrained_lvlm_name_or_path` to `${SAVE_PATH}` in `flux_qwen2p5vl_7b_vlm_stage1_512.yaml`.
We recommend using `optimizer: prodigy` with `learning_rate: 1.0` in `flux_qwen2p5vl_7b_vlm_stage1_512.yaml`.
Training with 512×512 images (batch size 1) consumes about 74 GB of GPU memory on one node (8 GPUs).
Setting `ema_pretrained_lvlm_name_or_path: null` saves memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size.
```bash
# stage 1
# if using prodigy, first run: pip install prodigy
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh
```
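Before launching Stage 1, a minimal sketch like the one below can double-check the settings discussed above; it assumes the keys sit at the top level of the YAML file, which may not match the actual config layout.
```python
# Assumption: the keys below sit at the top level of the YAML config;
# the real file may nest them differently.
import yaml  # pip install pyyaml

with open("flux_qwen2p5vl_7b_vlm_stage1_512.yaml") as f:
    cfg = yaml.safe_load(f)

# Should point at the merged weights saved to ${SAVE_PATH}.
print("pretrained_lvlm_name_or_path:", cfg.get("pretrained_lvlm_name_or_path"))
# Recommended: optimizer 'prodigy' with learning_rate 1.0.
print("optimizer:", cfg.get("optimizer"), "learning_rate:", cfg.get("learning_rate"))
# Set to null (None here) to save memory at higher resolutions or batch sizes.
print("ema_pretrained_lvlm_name_or_path:", cfg.get("ema_pretrained_lvlm_name_or_path"))
```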
#### Stage 2
Download [flux-redux-siglipv2-512.bin](https://huggingface.co/LanguageBind/UniWorld-V1/resolve/main/flux-redux-siglipv2-512.bin?download=true) and set its path as `pretrained_siglip_mlp_path` in `flux_qwen2p5vl_7b_vlm_stage2_512.yaml`. The weights are sourced from [ostris/Flex.1-alpha-Redux](https://huggingface.co/ostris/Flex.1-alpha-Redux); we only re-organize them.
Download [google/siglip2-so400m-patch16-512](https://huggingface.co/google/siglip2-so400m-patch16-512) and set its path to `pretrained_siglip_name_or_path` in `flux_qwen2p5vl_7b_vlm_stage2_512.yaml`.
You also need to specify `pretrained_mlp2_path`, which is produced by Stage 1 training.
Training with 512×512 images (batch size 1) consumes about **78 GB** of GPU memory on one node (8 GPUs).
Setting `ema_pretrained_lvlm_name_or_path: null` saves memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size. Using more nodes can also save memory, because we use ZeRO-2 for the main model in Stage 2.
```bash
# stage 2
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh
```
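Similarly, the hypothetical sketch below fills in the Stage 2 paths described above and writes the config back; the flat key layout and the Stage 1 output filename are assumptions, so adjust them to the actual files.
```python
# Assumptions: the flat YAML key layout and the Stage 1 output filename are
# illustrative only; adapt both to the actual config and checkpoint names.
import yaml  # pip install pyyaml

cfg_path = "flux_qwen2p5vl_7b_vlm_stage2_512.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["pretrained_siglip_mlp_path"] = "path/to/flux-redux-siglipv2-512.bin"      # downloaded above
cfg["pretrained_siglip_name_or_path"] = "path/to/siglip2-so400m-patch16-512"   # SigLIP2 checkpoint
cfg["pretrained_mlp2_path"] = "path/to/stage1_output/mlp2.pt"                  # hypothetical Stage 1 artifact

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```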
# ⚡️ Evaluation
### Text-to-Image Generation
<details><summary>GenEval</summary><p>
```
cd univa/eval/geneval
# follow the instruction in univa/eval/geneval/README.md
```
</p></details>
<details><summary>WISE</summary><p>
```
cd univa/eval/wise
# follow the instruction in univa/eval/wise/README.md
```
</p></details>
<details><summary>GenAI-Bench</summary><p>
```
cd univa/eval/genai
# follow the instruction in univa/eval/genai/README.md
```
</p></details>
<details><summary>DPG-Bench</summary><p>
```
cd univa/eval/dpgbench
# follow the instruction in univa/eval/dpgbench/README.md
```
</p></details>
### Image Editing
<details><summary>ImgEdit</summary><p>
```
cd univa/eval/imgedit
# follow the instruction in univa/eval/imgedit/README.md
```
</p></details>
<details><summary>GEdit</summary><p>
We discuss the scores related to GEdit-Bench [here](https://github.com/PKU-YuanGroup/UniWorld-V1/issues/6#issuecomment-2939392328).
```
cd univa/eval/gdit
# follow the instruction in univa/eval/gdit/README.md
```
</p></details>
# 📊 Benchmarks
<p align="left">
<img src="https://s21.ax1x.com/2025/06/03/pVPFuTJ.png" width="850" style="margin-bottom: 0.2;"/>
</p>
# 💡 How to Contribute
We greatly appreciate your contributions to the UniWorld-V1 open-source community and your help in making it even better!
For more details, please refer to the [Contribution Guidelines](docs/Contribution_Guidelines.md).
# 👍 Acknowledgement and Related Work
* [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit): ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
* [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): An open‑source text-to-image/video foundation model, which provides a lot of caption data.
* [SEED-Data-Edit](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit): A hybrid dataset for instruction-guided image editing.
* [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): The new flagship vision-language model of Qwen.
* [FLUX.1-Redux-dev](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev): Given an input image, FLUX.1 Redux can reproduce the image with slight variation, which makes it useful for refining a given image.
* [SigLIP 2](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md): New multilingual vision-language encoders.
* [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit): A state-of-the-art image editing model.
* [BLIP3-o](https://github.com/JiuhaiChen/BLIP3o): A unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models.
* [BAGEL](https://github.com/ByteDance-Seed/Bagel): An open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.
# 🧐 FAQ
1. **Visual Encoder:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/5 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/15 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/18
2. **Data Setup:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/17
3. **Editing Evaluation:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/6 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/16
4. **Training Process and Analysis:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/3 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/9 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/14 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/28
# 🔒 License
* See [LICENSE](LICENSE) for details. The FLUX weights fall under the [FLUX.1 [dev] Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).
# ✏️ Citing
```bibtex
@article{lin2025uniworld,
title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation},
author={Lin, Bin and Li, Zongjian and Cheng, Xinhua and Niu, Yuwei and Ye, Yang and He, Xianyi and Yuan, Shenghai and Yu, Wangbo and Wang, Shaodong and Ge, Yunyang and others},
journal={arXiv preprint arXiv:2506.03147},
year={2025}
}
@article{ye2025imgedit,
title={ImgEdit: A Unified Image Editing Dataset and Benchmark},
author={Ye, Yang and He, Xianyi and Li, Zongjian and Lin, Bin and Yuan, Shenghai and Yan, Zhiyuan and Hou, Bohan and Yuan, Li},
journal={arXiv preprint arXiv:2505.20275},
year={2025}
}
@article{niu2025wise,
title={Wise: A world knowledge-informed semantic evaluation for text-to-image generation},
author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
journal={arXiv preprint arXiv:2503.07265},
year={2025}
}
@article{yan2025gpt,
title={Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation},
author={Yan, Zhiyuan and Ye, Junyan and Li, Weijia and Huang, Zilong and Yuan, Shenghai and He, Xiangyang and Lin, Kaiqing and He, Jun and He, Conghui and Yuan, Li},
journal={arXiv preprint arXiv:2504.02782},
year={2025}
}
@article{lin2024open,
title={Open-Sora Plan: Open-Source Large Video Generation Model},
author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
journal={arXiv preprint arXiv:2412.00131},
year={2024}
}
```
# 🤝 Community contributors
<a href="https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors">
<img src="https://contrib.rocks/image?repo=PKU-YuanGroup/UniWorld-V1" />
</a>
# ✨ Star History
[Star History Chart](https://www.star-history.com/#PKU-YuanGroup/UniWorld-V1&Date)
", Assign "at most 3 tags" to the expected json: {"id":"13961","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"