<p align="center">
<img src="https://s21.ax1x.com/2025/06/03/pVCBdw8.png" width="200"/>
</p>
<h2 align="center">
<a href="https://arxiv.org/abs/2506.03147">
UniWorld-V1: High-Resolution Semantic Encoders for <br> Unified Visual Understanding and Generation
</a>
</h2>
[Discord](https://discord.gg/YyMBeR4bfS)
[Demo](https://github.com/user-attachments/assets/e187584a-f096-44df-b26b-f85aae838a18)<br>
[arXiv](https://arxiv.org/abs/2506.03147)
[HF Paper](https://huggingface.co/papers/2506.03147)
[HF Model](https://huggingface.co/LanguageBind/UniWorld-V1)
[HF Dataset](https://huggingface.co/datasets/LanguageBind/UniWorld-V1)
[License](https://github.com/PKU-YuanGroup/UniWorld-V1/blob/main/LICENSE)
[X (Twitter)](https://x.com/LinBin46984/status/1929905024349679682) <br>
[Online Demo 1](http://8.130.165.159:8800/)
[Online Demo 2](http://8.130.165.159:8801/)
[Online Demo 3](http://8.130.165.159:8802/)
[Online Demo 4](http://8.130.165.159:8803/)
[Online Demo 5](http://8.130.165.159:8804/)
[Online Demo 6](http://8.130.165.159:8805/)
[Online Demo 7](http://8.130.165.159:8806/)
[Online Demo 8](http://8.130.165.159:8807/) <br>
[Stars](https://github.com/PKU-YuanGroup/UniWorld-V1/stargazers)
[Forks](https://github.com/PKU-YuanGroup/UniWorld-V1/network)
[Watchers](https://github.com/PKU-YuanGroup/UniWorld-V1/watchers)
[Download](https://github.com/PKU-YuanGroup/UniWorld-V1/archive/refs/heads/main.zip) <br>
[Contributors](https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors)
[Commits](https://github.com/PKU-YuanGroup/UniWorld-V1/commits/main/)
[Pull Requests](https://github.com/PKU-YuanGroup/UniWorld-V1/pulls)
[Open Issues](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aopen+is%3Aissue)
[Closed Issues](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aissue+is%3Aclosed)
# 📣 News
* **[2025.06.03]** 🤗 We release UniWorld-V1, a unified framework for understanding, generation, and editing. All [data](https://huggingface.co/datasets/LanguageBind/UniWorld-V1), [models](https://huggingface.co/LanguageBind/UniWorld-V1), [training code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-training), and [evaluation code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-evaluation) are open-sourced. Check our [report](https://arxiv.org/abs/2506.03147) for more details, and feel free to **watch** 👀 this repository for the latest updates.
<p align="center">
<img src="https://github.com/user-attachments/assets/e187584a-f096-44df-b26b-f85aae838a18" width="200"/>
</p>
<br>
<details open><summary>💡 We also have other image editing projects that may interest you ✨. </summary><p>
> [**ImgEdit: A Unified Image Editing Dataset and Benchmark**](https://arxiv.org/abs/2505.20275) <br>
> Yang Ye, Xianyi He, et al. <br>
> [GitHub](https://github.com/PKU-YuanGroup/ImgEdit) [arXiv](https://arxiv.org/abs/2505.20275) <br>
> [**WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation**](https://arxiv.org/abs/2503.07265) <br>
> Yuwei Niu, Munan Ning, et al. <br>
> [GitHub](https://github.com/PKU-YuanGroup/WISE) [arXiv](https://arxiv.org/abs/2503.07265) <br>
> [**Open-Sora Plan: Open-Source Large Video Generation Model**](https://arxiv.org/abs/2412.00131) <br>
> Bin Lin, Yunyang Ge, Xinhua Cheng, et al. <br>
> [GitHub](https://github.com/PKU-YuanGroup/Open-Sora-Plan) [arXiv](https://arxiv.org/abs/2412.00131) <br>
</p></details>
# 😍 Gallery
UniWorld-V1 delivers strong performance across **20+** tasks.
**Click to play**
<p align="left">
<a href="https://www.youtube.com/watch?v=77U0PKH7uxs" target="_blank">
<img src="https://github.com/user-attachments/assets/dbb2acf7-3a54-44b5-9bca-b30cb3385056" width="850" style="margin-bottom: 0.2;"/>
</a>
</p>
<p align="left">
<img src="https://s21.ax1x.com/2025/06/03/pVCB6ln.png" width="850" style="margin-bottom: 0.2;"/>
</p>
# 😮 Highlights
### 1. All Resources Fully Open-Sourced
- We fully open-source the models, data, training and evaluation code to facilitate rapid community exploration of unified architectures.
- We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, segmentation, and more.
- We annotate 286K long-caption samples using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). We use GPT-4o to filter [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit), resulting in 724K high-quality editing samples (all with a short edge ≥ 1024 pixels). Additionally, we organize and filter existing open-source datasets. Details can be found [here](https://github.com/PKU-YuanGroup/UniWorld-V1/tree/main?tab=readme-ov-file#data-details).
### 2. Contrastive Semantic Encoders as Reference Control Signals
- Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.
- For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.
### 3. Image Priors via VLM Encoding Without Learnable Tokens
- We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format `<instruction><image>` is particularly important.
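To make the ordering concrete, here is a minimal illustrative sketch, not the UniWorld-V1 inference pipeline, that renders a prompt with the underlying Qwen2.5-VL chat template; the instruction text and image path are hypothetical placeholders.
```python
# Minimal sketch of the <instruction><image> ordering discussed above.
# This is NOT the UniWorld-V1 inference code; it only shows how the underlying
# Qwen2.5-VL chat template renders the instruction tokens before the image tokens.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        # Instruction first, then the reference image: with causal attention,
        # the image tokens can attend to the full instruction.
        {"type": "text", "text": "Replace the red car with a blue bicycle."},  # hypothetical instruction
        {"type": "image", "image": "path/to/reference.jpg"},                   # hypothetical path
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the instruction text appears before the image placeholder tokens
```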
<p align="left">
<img src="https://s21.ax1x.com/2025/06/03/pVCB5Y4.jpg" width="850" style="margin-bottom: 0.2;"/>
</p>
# 🔥 Quick Start
1. Set up the environment
```bash
git clone https://github.com/PKU-YuanGroup/UniWorld-V1
cd UniWorld-V1
conda create -n univa python=3.10 -y
conda activate univa
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
```
2. Download pretrained checkpoints
```bash
huggingface-cli download --resume-download LanguageBind/UniWorld-V1 --local-dir ${MODEL_PATH}
huggingface-cli download --resume-download black-forest-labs/FLUX.1-dev --local-dir ${FLUX_PATH}
huggingface-cli download --resume-download google/siglip2-so400m-patch16-512 --local-dir ${SIGLIP_PATH}
```
3. Run with the CLI
```bash
MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=0 python -m univa.serve.cli \
--model_path ${MODEL_PATH} \
--flux_path ${FLUX_PATH} \
--siglip_path ${SIGLIP_PATH}
```
4. Run with Gradio
We highly recommend trying our web demo with the following command.
```bash
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH}
```
For a 24 GB VRAM GPU on Linux, use NF4 quantization (many thanks to [@gluttony-10](https://github.com/gluttony-10) for the contribution). Then run the following command:
```bash
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4
```
Alternatively, download [wikeeyang/UniWorld-V1-NF4](https://huggingface.co/wikeeyang/UniWorld-V1-NF4) to `${MODEL_PATH}` and [diffusers/FLUX.1-dev-bnb-4bit](https://huggingface.co/diffusers/FLUX.1-dev-bnb-4bit) to `${FLUX_PATH}` instead.
For a 24 GB VRAM GPU on Windows, use NF4 quantization with offloading, which requires only about 20 GB of VRAM. Then run the following command:
```bash
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4 --offload
```
To use the Chinese-language interface, add the `--zh` flag.
5. Run with ComfyUI
Many thanks to [@judian17](https://github.com/judian17) for the contribution! [ComfyUI-UniWorld-jd17](https://github.com/judian17/ComfyUI-UniWorld-jd17) is a ComfyUI implementation provided by the open-source community. Note that the required `transformers` version is 4.50.0.
# 🗝️ Training
### Data preparation
Download the data from [LanguageBind/UniWorld-V1](https://huggingface.co/datasets/LanguageBind/UniWorld-V1). The dataset consists of two parts: source images and annotation JSON files.
Prepare a `data.txt` file in the following format:
1. The first column is the root path of the images.
2. The second column is the corresponding annotation JSON file.
3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to `true` for editing data and `false` for the rest.
```
data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
```
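As a quick sanity check, the sketch below (a hypothetical helper, not part of this repository) parses rows in the format above into their three fields and verifies that the paths exist.
```python
# Hypothetical helper (not part of this repo): parse and sanity-check data.txt rows.
from pathlib import Path
from typing import List, NamedTuple


class DataEntry(NamedTuple):
    image_root: Path        # column 1: root path of the images
    anno_json: Path         # column 2: annotation JSON file
    region_weighting: bool  # column 3: enable the region-weighting strategy


def load_data_txt(path: str) -> List[DataEntry]:
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        image_root, anno_json, flag = (col.strip() for col in line.split(","))
        entries.append(DataEntry(Path(image_root), Path(anno_json), flag.lower() == "true"))
    return entries


if __name__ == "__main__":
    for entry in load_data_txt("data.txt"):
        assert entry.image_root.exists(), f"missing image root: {entry.image_root}"
        assert entry.anno_json.exists(), f"missing annotation file: {entry.anno_json}"
```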
We have also prepared a `data.txt` file for ImgEdit for reference.
<details><summary>`data.txt` for ImgEdit</summary><p>
```
data/imgedit/action/action,json/imgedit/pandam_action_edit.json,true
data/imgedit/action/action_part2,json/imgedit/pandam2_action_edit.json,true
data/imgedit/action/action_part3,json/imgedit/pandam3_action_edit.json,true
data/imgedit/action/action_part4,json/imgedit/pandam4_action_edit.json,true
data/imgedit/add/add_part0,json/imgedit/laion_add_part0_edit.json,true
data/imgedit/add/add_part1,json/imgedit/laion_add_part1_edit.json,true
data/imgedit/add/add_part4,json/imgedit/results_add_laion_part4_edit.json,true
data/imgedit/add/add_part5,json/imgedit/results_add_laion_part5_edit.json,true
data/imgedit/adjust/adjust_part0,json/imgedit/results_adjust_canny_laion_part0_edit.json,true
data/imgedit/adjust/adjust_part2,json/imgedit/results_adjust_canny_laion_part2_edit.json,true
data/imgedit/adjust/adjust_part3,json/imgedit/results_adjust_canny_laion_part3_edit.json,true
data/imgedit/adjust/adjust_part4,json/imgedit/laion_adjust_canny_part4_edit.json,true
data/imgedit/background/background_part0,json/imgedit/results_background_laion_part0_edit.json,true
data/imgedit/background/background_part2,json/imgedit/results_background_laion_part2_edit.json,true
data/imgedit/background/background_part3,json/imgedit/laion_background_part3_edit.json,true
data/imgedit/background/background_part5,json/imgedit/laion_background_part5_edit.json,true
data/imgedit/background/background_part7,json/imgedit/laion_background_part7_edit.json,true
data/imgedit/compose/compose_part0,json/imgedit/results_compose_part0_edit.json,false
data/imgedit/compose/compose_part2,json/imgedit/results_compose_part2_edit.json,false
data/imgedit/compose/compose_part6,json/imgedit/results_compose_part6_fix_edit.json,false
data/imgedit/refine_replace/refine_replace_part1,json/imgedit/results_extract_ref_part1_refimg_edit.json,true
data/imgedit/remove/remove_part0,json/imgedit/laion_remove_part0_edit.json,true
data/imgedit/remove/remove_part1,json/imgedit/results_remove_laion_part1_edit.json,true
data/imgedit/remove/remove_part4,json/imgedit/results_remove_laion_part4_edit.json,true
data/imgedit/remove/remove_part5,json/imgedit/results_remove_laion_part5_edit.json,true
data/imgedit/replace/replace_part0,json/imgedit/laion_replace_part0_edit.json,true
data/imgedit/replace/replace_part1,json/imgedit/laion_replace_part1_edit.json,true
data/imgedit/replace/replace_part4,json/imgedit/results_replace_laion_part4_edit.json,true
data/imgedit/replace/replace_part5,json/imgedit/results_replace_laion_part5_edit.json,true
data/imgedit/transfer/transfer,json/imgedit/results_style_transfer_edit.json,false
data/imgedit/transfer/transfer_part0,json/imgedit/results_style_transfer_part0_cap36472_edit.json,false
```
</p></details>
We provide a simple online verification tool to check whether the paths in your `data.txt` are set correctly.
```bash
python univa/serve/check_data.py
```
<p align="left">
<img src="https://s21.ax1x.com/2025/05/30/pV9iP8f.png" width="850" style="margin-bottom: 0.2;"/>
</p>
### Data details
<details><summary>Text-to-Image Generation</summary><p>
- [BLIP3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k): We add text-to-image instructions to half of the data. [108 GB storage usage.]
- [OSP1024-286k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/OSP1024-286k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), with captions generated using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]
</p></details>
<details><summary>Image Editing</summary><p>
- [imgedit-724k](https://huggingface.co/datasets/sysuyy/ImgEdit/tree/main): Data is filtered using GPT-4o, retaining approximately half. [2.8T storage usage.]
- [OmniEdit-368k](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M): For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
- [SEED-Data-Edit-Part1-Openimages-65k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part1-Openimages): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- [SEED-Data-Edit-Part2-3-12k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part2-3): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- [PromptfixData-18k](https://huggingface.co/datasets/yeates/PromptfixData): For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
- [StyleBooth-11k](https://huggingface.co/scepter-studio/stylebooth): For transfer style data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
- [Ghibli-36k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/Ghibli-36k): For transfer style data, images have a short side ≥ 1024 pixels. **Warning: This data has not been quality filtered.** [170 GB storage usage.]
</p></details>
<details><summary>Extract & Try-on</summary><p>
- [viton_hd-23k](https://huggingface.co/datasets/forgeml/viton_hd): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- [deepfashion-27k](https://huggingface.co/datasets/lirus18/deepfashion): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- [shop_product-23k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/shop_product-23k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]
</p></details>
<details><summary>Image Perception</summary><p>
- [coco2017_caption_canny-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_canny): img->canny & canny->img [25 GB storage usage.]
- [coco2017_caption_depth-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_depth): img->depth & depth->img [8 GB storage usage.]
- [coco2017_caption_hed-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_hed): img->hed & hed->img [13 GB storage usage.]
- [coco2017_caption_mlsd-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_mlsd): img->mlsd & mlsd->img [ GB storage usage.]
- [coco2017_caption_normal-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_normal): img->normal & normal->img [10 GB storage usage.]
- [coco2017_caption_openpose-62k](https://huggingface.co/datasets/wangherr/coco2017_caption_openpose): img->pose & pose->img [2 GB storage usage.]
- [coco2017_caption_sketch-236k](https://huggingface.co/datasets/wangherr/coco2017_caption_sketch): img->sketch & sketch->img [15 GB storage usage.]
- [unsplash_canny-20k](https://huggingface.co/datasets/wtcherr/unsplash_10k_canny): img->canny & canny->img [2 GB storage usage.]
- [open_pose-40k](https://huggingface.co/datasets/raulc0399/open_pose_controlnet): img->pose & pose->img [4 GB storage usage.]
- [mscoco-controlnet-canny-less-colors-236k](https://huggingface.co/datasets/hazal-karakus/mscoco-controlnet-canny-less-colors): img->canny & canny->img [13 GB storage usage.]
- [coco2017_seg_box-448k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/coco2017_seg_box-448k): img->detection & img->segmentation (mask); instances with regions smaller than 1/100 were filtered out. We visualize masks overlaid on the original image as the ground-truth image. [39 GB storage usage.]
- [viton_hd-11k](https://huggingface.co/datasets/forgeml/viton_hd): img->pose [1 GB storage usage.]
- [deepfashion-13k](https://huggingface.co/datasets/lirus18/deepfashion): img->pose [1 GB storage usage.]
</p></details>
### Training
#### Prepare pretrained weights
Download [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) to `$FLUX_PATH`.
Download [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) to `$QWENVL_PATH`. We also support other sizes of Qwen2.5-VL.
```bash
SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
python scripts/make_univa_qwen2p5vl_weight.py \
--origin_flux_ckpt_path $FLUX_PATH \
--origin_qwenvl_ckpt_path $QWENVL_PATH \
--save_path ${SAVE_PATH}
```
#### Stage 1
Set `pretrained_lvlm_name_or_path` to `${SAVE_PATH}` in `flux_qwen2p5vl_7b_vlm_stage1_512.yaml`.
We recommend using `optimizer: prodigy` with `learning_rate: 1.0` in `flux_qwen2p5vl_7b_vlm_stage1_512.yaml`.
Training with 512×512 images (batch size 1) consumes about 74 GB of GPU memory on one node (8 GPUs).
Setting `ema_pretrained_lvlm_name_or_path: null` saves memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size.
```bash
# stage 1
# if using prodigy, first run: pip install prodigy
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh
```
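Before launching Stage 1, a minimal sketch like the one below can double-check the settings discussed above; it assumes the keys sit at the top level of the YAML file, which may not match the actual config layout.
```python
# Assumption: the keys below sit at the top level of the YAML config;
# the real file may nest them differently.
import yaml  # pip install pyyaml

with open("flux_qwen2p5vl_7b_vlm_stage1_512.yaml") as f:
    cfg = yaml.safe_load(f)

# Should point at the merged weights saved to ${SAVE_PATH}.
print("pretrained_lvlm_name_or_path:", cfg.get("pretrained_lvlm_name_or_path"))
# Recommended: optimizer 'prodigy' with learning_rate 1.0.
print("optimizer:", cfg.get("optimizer"), "learning_rate:", cfg.get("learning_rate"))
# Set to null (None here) to save memory at higher resolutions or batch sizes.
print("ema_pretrained_lvlm_name_or_path:", cfg.get("ema_pretrained_lvlm_name_or_path"))
```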
#### Stage 2
Download [flux-redux-siglipv2-512.bin](https://huggingface.co/LanguageBind/UniWorld-V1/resolve/main/flux-redux-siglipv2-512.bin?download=true) and set its path as `pretrained_siglip_mlp_path` in `flux_qwen2p5vl_7b_vlm_stage2_512.yaml`. The weights are sourced from [ostris/Flex.1-alpha-Redux](https://huggingface.co/ostris/Flex.1-alpha-Redux); we only re-organize them.
Download [google/siglip2-so400m-patch16-512](https://huggingface.co/google/siglip2-so400m-patch16-512) and set its path to `pretrained_siglip_name_or_path` in `flux_qwen2p5vl_7b_vlm_stage2_512.yaml`.
You also need to specify `pretrained_mlp2_path`, which is produced by Stage 1 training.
Training with 512×512 images (batch size 1) consumes about **78 GB** of GPU memory on one node (8 GPUs).
Setting `ema_pretrained_lvlm_name_or_path: null` saves memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size. Using more nodes can also save memory, because we use ZeRO-2 for the main model in Stage 2.
```bash
# stage 2
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh
```
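Similarly, the hypothetical sketch below fills in the Stage 2 paths described above and writes the config back; the flat key layout and the Stage 1 output filename are assumptions, so adjust them to the actual files.
```python
# Assumptions: the flat YAML key layout and the Stage 1 output filename are
# illustrative only; adapt both to the actual config and checkpoint names.
import yaml  # pip install pyyaml

cfg_path = "flux_qwen2p5vl_7b_vlm_stage2_512.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["pretrained_siglip_mlp_path"] = "path/to/flux-redux-siglipv2-512.bin"      # downloaded above
cfg["pretrained_siglip_name_or_path"] = "path/to/siglip2-so400m-patch16-512"   # SigLIP2 checkpoint
cfg["pretrained_mlp2_path"] = "path/to/stage1_output/mlp2.pt"                  # hypothetical Stage 1 artifact

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```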
# ⚡️ Evaluation
### Text-to-Image Generation
<details><summary>GenEval</summary><p>
```
cd univa/eval/geneval
# follow the instruction in univa/eval/geneval/README.md
```
</p></details>
<details><summary>WISE</summary><p>
```
cd univa/eval/wise
# follow the instruction in univa/eval/wise/README.md
```
</p></details>
<details><summary>GenAI-Bench</summary><p>
```
cd univa/eval/genai
# follow the instruction in univa/eval/genai/README.md
```
</p></details>
<details><summary>DPG-Bench</summary><p>
```
cd univa/eval/dpgbench
# follow the instruction in univa/eval/dpgbench/README.md
```
</p></details>
### Image Editing
<details><summary>ImgEdit</summary><p>
```
cd univa/eval/imgedit
# follow the instruction in univa/eval/imgedit/README.md
```
</p></details>
<details><summary>GEdit</summary><p>
We discuss the scores related to GEdit-Bench [here](https://github.com/PKU-YuanGroup/UniWorld-V1/issues/6#issuecomment-2939392328).
```
cd univa/eval/gdit
# follow the instruction in univa/eval/gdit/README.md
```
</p></details>
# 📊 Benchmarks
<p align="left">
<img src="https://s21.ax1x.com/2025/06/03/pVPFuTJ.png" width="850" style="margin-bottom: 0.2;"/>
</p>
# 💡 How to Contribute
We greatly appreciate your contributions to the UniWorld-V1 open-source community and your help in making it even better!
For more details, please refer to the [Contribution Guidelines](docs/Contribution_Guidelines.md).
# 👍 Acknowledgement and Related Work
* [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit): ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
* [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): An open‑source text-to-image/video foundation model, which provides a lot of caption data.
* [SEED-Data-Edit](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit): A hybrid dataset for instruction-guided image editing.
* [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): The new flagship vision-language model of Qwen.
* [FLUX.1-Redux-dev](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev): Given an input image, FLUX.1 Redux can reproduce the image with slight variation, which makes it useful for refining a given image.
* [SigLIP 2](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md): New multilingual vision-language encoders.
* [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit): A state-of-the-art image editing model.
* [BLIP3-o](https://github.com/JiuhaiChen/BLIP3o): A unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models.
* [BAGEL](https://github.com/ByteDance-Seed/Bagel): An open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.
# 🧐 FAQ
1. **Visual Encoder:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/5 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/15 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/18
2. **Data Setup:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/17
3. **Editing Evaluation:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/6 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/16
4. **Training Process and Analysis:** https://github.com/PKU-YuanGroup/UniWorld-V1/issues/3 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/9 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/14 https://github.com/PKU-YuanGroup/UniWorld-V1/issues/28
# 🔒 License
* See [LICENSE](LICENSE) for details. The FLUX weights fall under the [FLUX.1 [dev] Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).
# ✏️ Citing
```bibtex
@article{lin2025uniworld,
title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation},
author={Lin, Bin and Li, Zongjian and Cheng, Xinhua and Niu, Yuwei and Ye, Yang and He, Xianyi and Yuan, Shenghai and Yu, Wangbo and Wang, Shaodong and Ge, Yunyang and others},
journal={arXiv preprint arXiv:2506.03147},
year={2025}
}
@article{ye2025imgedit,
title={ImgEdit: A Unified Image Editing Dataset and Benchmark},
author={Ye, Yang and He, Xianyi and Li, Zongjian and Lin, Bin and Yuan, Shenghai and Yan, Zhiyuan and Hou, Bohan and Yuan, Li},
journal={arXiv preprint arXiv:2505.20275},
year={2025}
}
@article{niu2025wise,
title={Wise: A world knowledge-informed semantic evaluation for text-to-image generation},
author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
journal={arXiv preprint arXiv:2503.07265},
year={2025}
}
@article{yan2025gpt,
title={Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation},
author={Yan, Zhiyuan and Ye, Junyan and Li, Weijia and Huang, Zilong and Yuan, Shenghai and He, Xiangyang and Lin, Kaiqing and He, Jun and He, Conghui and Yuan, Li},
journal={arXiv preprint arXiv:2504.02782},
year={2025}
}
@article{lin2024open,
title={Open-Sora Plan: Open-Source Large Video Generation Model},
author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
journal={arXiv preprint arXiv:2412.00131},
year={2024}
}
```
# 🤝 Community contributors
<a href="https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors">
<img src="https://contrib.rocks/image?repo=PKU-YuanGroup/UniWorld-V1" />
</a>
# ✨ Star History
[Star History Chart](https://www.star-history.com/#PKU-YuanGroup/UniWorld-V1&Date)
", Assign "at most 3 tags" to the expected json: {"id":"13961","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"