# [CVPR2024] Osprey: Pixel Understanding with Visual Instruction Tuning

<p align="center" width="100%">
<img src="assets/osprey.png" width="90%">
</p>
<div align=center>
![Static Badge](https://img.shields.io/badge/Osprey-v1-F7C97E) [![arXiv preprint](https://img.shields.io/badge/arxiv-2312.10032-ECA8A7?logo=arxiv)](https://arxiv.org/pdf/2312.10032.pdf) [![Dataset](https://img.shields.io/badge/Dataset-Hugging_Face-CFAFD4)](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K) [![video](https://img.shields.io/badge/Watch_Video-36600E?logo=youtube&logoColor=green)](https://youtu.be/YsxqHBBnDfk) [![Static Badge](https://img.shields.io/badge/Try_Demo-6B88E3?logo=youtubegaming&logoColor=DAE4EE)](http://111.0.123.204:8000/)
</div>
<div align=center>
Demo username & password: <b>osprey</b>
</div>
---
<div align=center>
<img src="./assets/qmsht.gif" />
<br>
A part of <i>Along the River During the Qingming Festival</i> (清明上河图)
<br>
<img src="./assets/qyqx.gif" />
<br>
<i>Spirited Away</i> (千与千寻)
<br>
</div>
## Updates 📌
- [2024/3/29] 🔥 We released the [Osprey-Chat](https://huggingface.co/sunshine-lwt/Osprey-Chat-7b/tree/main) model, which exhibits better conversation and image-level understanding & reasoning capabilities.
- [2024/2/27] 🔥 Osprey has been accepted to CVPR 2024!
- [2024/1/15] 🔥 We released the [evaluation](./osprey/eval/README.md) code.
- [2023/12/29] 🔥 We released the training code and the [Osprey-724K](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K) dataset.
- [2023/12/18] 🔥 We released the code, the [Osprey-7b model](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main) and the [online demo](http://111.0.123.204:8000/) for Osprey.
## What is Osprey 👀
Osprey is a mask-text instruction-tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling **fine-grained visual understanding**. Given an input mask region, Osprey generates semantic descriptions at two levels of granularity: a **short description** and a **detailed description**.

Osprey can seamlessly integrate with [SAM](https://github.com/facebookresearch/segment-anything) in point-prompt, box-prompt and segment-everything modes to generate the semantics associated with specific parts or objects.
<img src="./assets/framework.png" width="800px">
## Watch Video Demo 🎥
<p align="center"> <a href="https://youtu.be/YsxqHBBnDfk"><img src="assets/video_cover.png" width="70%"></a> </p>
## Try Our Demo 🕹️
### Online demo
**Click** 👇 **to try our demo online.**
[**web demo**](http://111.0.123.204:8000/)
```
username: osprey
password: osprey
```
<table>
<tr>
<td style="text-align: center"><br>Point<br></td>
<td><img src="./assets/demo_point.gif" width="700"></td>
</tr>
<tr>
<td style="text-align: center"><br>Box<br></td>
<td><img src="./assets/demo_box.gif" width="700"></td>
</tr>
<tr>
<td style="text-align: center"><br>Everything<br></td>
<td><img src="./assets/demo_all.gif" width="700"></td>
</tr>
</table>
### Offline demo
💻 **Requirements:** this demo needs about `17GB` of GPU memory: roughly 15GB for Osprey and 2GB for SAM.
1. First install [Gradio-Osprey-Demo](https://github.com/LiWentomng/gradio-osprey-demo).
2. Install Segment Anything.
```
pip install git+https://github.com/facebookresearch/segment-anything.git
```
3. Download all the checkpoints:
- [Osprey-7b](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main)
- [CLIP-convnext](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin)
- [ViT-B SAM model](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)
The default paths of all the checkpoints:
```
├── demo
    ├── checkpoints
    │   ├── Osprey_7b
    │   └── sam_vit_b_01ec64.pth
    └── open_clip_pytorch_model.bin
```
Alternatively, change the `mm_vision_tower` field in the `config.json` of the Osprey-7b model to the absolute path of `open_clip_pytorch_model.bin` (an illustrative snippet follows these steps).
4. Run `app.py`.
```
cd demo
python app.py --model checkpoints/Osprey_7b
```
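For reference, the alternative mentioned in step 3 amounts to changing a single field in the model's `config.json`. A small helper like the following performs the edit programmatically; the paths below are placeholders for your local layout, and all other fields are left unchanged.

```python
import json
from pathlib import Path

# Point the vision tower at the absolute path of the CLIP-ConvNeXt weights.
cfg_path = Path("checkpoints/Osprey_7b/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["mm_vision_tower"] = "/absolute/path/to/open_clip_pytorch_model.bin"
cfg_path.write_text(json.dumps(cfg, indent=2))
```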
## Install 🛠️
1. Clone this repository and navigate to the Osprey folder
```
git clone https://github.com/CircleRadon/Osprey.git
cd Osprey
```
2. Install packages
```
conda create -n osprey python=3.10 -y
conda activate osprey
pip install --upgrade pip # enable PEP 660 support
pip install -e .
```
3. Install additional packages for training
```
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
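As a quick sanity check after installation, the package should import cleanly; this assumes the editable install registers it under the name `osprey`, as the repository layout suggests.
```
python -c "import osprey; print('Osprey import OK')"
```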
## Dataset 🌟
All datasets used for training can be found in [Dataset preparation](./dataset.md).
**Osprey-724K**: 🤗[Hugging Face](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K)
`Osprey-724K` is an instruction dataset with mask-text pairs, containing around 724K GPT-generated multimodal dialogues that push MLLMs toward fine-grained pixel-level image understanding. It contains object-level and part-level samples, plus additional instruction samples for robustness and flexibility.
<img src="./assets/data.png" />
## Training 🚀
- **Stage1: Image-Text Alignment Pre-training**
- The pretrained projector weights for Convnext-large-CLIP can be found in [projector weights](https://huggingface.co/sunshine-lwt/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/tree/main).
- **Stage2: Mask-Text Alignment Pre-training**
- Download [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main).
- Download projector weights trained in stage1: [projector weights](https://huggingface.co/sunshine-lwt/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/tree/main).
- Set `model_name_or_path` in `stage2.sh` to the path of `vicuna-7b-v1.5`.
- Set `pretrain_mm_mlp_adapter` in `stage2.sh` to the path of `mm_projector`.
- Set `vision_tower` in `stage2.sh` to the path of [Convnext-large-CLIP-model](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin).
- Run `sh scripts/stage2.sh`.
- **Stage3: End-to-End Fine-tuning**
- Set `model_name_or_path` in `stage3.sh` to the path of the stage2 checkpoint.
- Set `vision_tower` in `stage3.sh` to the path of [Convnext-large-CLIP-model](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin).
- Run `sh scripts/stage3.sh`.
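For orientation, the settings listed above are passed as launch arguments inside the stage scripts. The excerpt below is a rough sketch of `scripts/stage2.sh` with placeholder paths; the training entry point is an assumption based on the LLaVA-style codebase this repo builds on, so check the actual script for the exact command and the remaining flags.

```
# Illustrative excerpt of scripts/stage2.sh (placeholder paths; entry point
# assumed from the LLaVA-style layout, see the actual script in the repo).
deepspeed osprey/train/train_mem.py \
    --model_name_or_path ./checkpoints/vicuna-7b-v1.5 \
    --pretrain_mm_mlp_adapter ./checkpoints/osprey-pretrain/mm_projector.bin \
    --vision_tower ./checkpoints/open_clip_pytorch_model.bin
# (remaining training arguments omitted)
```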
## Checkpoints 🤖
Osprey-7b model🤗: [model](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main)
We also provide the intermediate stage2 checkpoint; see [model](https://huggingface.co/sunshine-lwt/Osprey-7b-stage2/tree/main).
<div align=center>
<img src="./assets/performance.png" />
</div>
## Evaluation 🔎
See [evaluation](./osprey/eval/README.md) for details.
## TODO List 📝
- [x] Release the checkpoints, inference codes and demo.
- [x] Release the dataset and training scripts.
- [x] Release the evaluation code.
- [x] Release the code for data generation pipeline.
## Acknowledgement 💌
- [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [SAM](https://github.com/facebookresearch/segment-anything): the demo uses segmentation results from SAM as the input to Osprey.
## BibTeX 🖊️
```
@misc{Osprey,
      title={Osprey: Pixel Understanding with Visual Instruction Tuning},
      author={Yuqian Yuan and Wentong Li and Jian Liu and Dongqi Tang and Xinjie Luo and Chi Qin and Lei Zhang and Jianke Zhu},
      year={2023},
      eprint={2312.10032},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
", Assign "at most 3 tags" to the expected json: {"id":"6205","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"