<div style="display: flex; align-items: center;">
  <a href="https://arxiv.org/abs/2403.06199">
    <h1>LLaVA-Phi & Mipha: Towards Multimodal Small Language Models</h1>
  </a>
</div>

<div align="center">
  <img src="docs/mipha.jpg" width="20%">
</div>

* **LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model** <br>
  [![arXiv](https://img.shields.io/badge/Arxiv-2401.02330-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2401.02330)
* **Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models** <br>
  [![arXiv](https://img.shields.io/badge/Arxiv-2403.06199-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.06199)

## 📸 Release

* **`Mar. 23rd, 2024`**: 🔥🔥🔥 [LLaVA-Phi](https://arxiv.org/abs/2401.02330) has been accepted by the ACMMM 2024 Workshop, and [Mipha](https://arxiv.org/abs/2403.06199) has been accepted by the AAAI 2025 Main Track.
* **`Mar. 23rd, 2024`**: 🔥🔥🔥 Our model **Mipha-3B** and the corresponding training code are released.
* **`Jan. 26th, 2024`**: You can now download our [model weights](#mipha-weights).
* **`Jan. 15th, 2024`**: Our model and training code are released.
* **`Jan. 5th, 2024`**: Our code is currently undergoing an internal review and will be released shortly (expected next week).

## Model Zoo

### Mipha & LLaVA-Phi

| Model        | LLM          | VQAv2    | GQA      | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE     | MME<sup>P</sup> | MMB      |
|--------------|--------------|---------:|---------:|----------------:|----------------:|---------:|----------------:|---------:|
| LLaVA-Phi-3B | Phi-2-2.7B   | 71.4     | -        | 68.4            | 48.6            | 85.0     | 1335.1          | 59.8     |
| Mipha-1.6B   | Phi-1.5-1.3B | 77.5     | 62.7     | 58.3            | 45.6            | **86.9** | 1203.1          | 57.7     |
| Mipha-2.4B   | Gemma-2B     | 79.5     | 63.3     | 65.3            | 52.4            | 86.6     | 1397.1          | 59.4     |
| Mipha-3B     | Phi-2-2.7B   | **81.3** | **63.9** | **70.9**        | **56.6**        | 86.7     | **1488.9**      | **69.7** |

## Contents
- [Install](#install)
- [Mipha Weights](#mipha-weights)
- [Train](#train)
- [Evaluation](#evaluation)

## Install

1. Clone this repository and navigate to the Mipha folder
```bash
git clone https://github.com/zhuyiche/Mipha.git
cd Mipha
```
2. Install the package
```Shell
conda create -n mipha python=3.10 -y
conda activate mipha
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

## Mipha Weights

Download Mipha-3B from [Hugging Face](https://huggingface.co/zhumj34/Mipha-3B).

## Train

Mipha training consists of two stages: (1) feature alignment stage: use the [LLaVA-1.5](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) 558K subset of the LAION-CC-SBU dataset to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following samples, plus around 515K VQA samples from academic-oriented tasks, to teach the model to follow multimodal instructions.

### Hyperparameters

The hyperparameters used in pretraining and finetuning are provided below.

1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------|------------------:|--------------:|-------:|-----------:|-------------:|
| Mipha          | 256               | 1e-3          | 1      | 2048       | 0            |

2. Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------|------------------:|--------------:|-------:|-----------:|-------------:|
| Mipha          | 128               | 2e-5          | 2      | 2048       | 0            |

### Download base checkpoints

Our base LLM is phi-2. Download the weights from [here](https://huggingface.co/susnato/phi-2) and change `--model_name_or_path` in [`get_base_model.sh`](https://github.com/zhuyiche/Mipha/blob/main/scripts/mipha/get_base_model.sh) accordingly. <br>
Our vision encoder is SigLIP-SO (0.4B). Download the weights from [here](https://huggingface.co/google/siglip-so400m-patch14-384).
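If you prefer to fetch these checkpoints from a script rather than through the web UI, a minimal sketch using `snapshot_download` from the `huggingface_hub` package is shown below. The package is an assumption on our part (`pip install huggingface_hub` if it is not already in your environment); any download method that leaves the weights in a local folder works equally well.

```python
# Hedged sketch: download the two base checkpoints with huggingface_hub.
# The repo IDs come from the links above; the library manages the cache location.
from huggingface_hub import snapshot_download

phi2_dir = snapshot_download(repo_id="susnato/phi-2")                        # base LLM
siglip_dir = snapshot_download(repo_id="google/siglip-so400m-patch14-384")   # vision encoder

print("phi-2:", phi2_dir)        # pass this path as --model_name_or_path in get_base_model.sh
print("SigLIP-SO:", siglip_dir)
```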
### Integrate the model

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain). <br>
Then, integrate phi-2 and SigLIP-SO into a single model by running the following script:

```bash
bash ./scripts/mipha/get_base_model.sh
```

### Pretrain (feature alignment)

```bash
bash ./scripts/mipha/pretrain.sh
```

### Visual Instruction Tuning

Please refer [here](https://github.com/haotian-liu/LLaVA/blob/9a26bd1435b4ac42c282757f2c16d34226575e96/README.md#visual-instruction-tuning) to prepare the instruction tuning data.

Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/zhuyiche/Mipha/blob/main/scripts/mipha/finetune.sh).

```bash
bash ./scripts/mipha/finetune.sh
```

## Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding. See [Evaluation.md](https://github.com/zhuyiche/Mipha/blob/main/docs/Evaluation.md).

## CLI Inference Guide

You can chat about images with Mipha without a Gradio interface. Here is an example command:

```bash
python -m mipha.serve.cli \
    --model-path /path/to/mipha-3B \
    --image-file "mipha/serve/examples/extreme_ironing.jpg" \
    --conv-mode phi
```

## Citation

If you find LLaVA-Phi or Mipha useful in your research or applications, please consider giving a star ⭐ and citing with the following BibTeX:

```
@inproceedings{zhu2024llava,
  title={Llava-phi: Efficient multi-modal assistant with small language model},
  author={Zhu, Yichen and Zhu, Minjie and Liu, Ning and Xu, Zhiyuan and Peng, Yaxin},
  booktitle={Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources},
  pages={18--22},
  year={2024}
}

@inproceedings{zhu2024comprehensive,
  title={A Comprehensive Overhaul of Multimodal Assistant with Small Language Models},
  author={Zhu, Minjie and Zhu, Yichen and Liu, Xin and Liu, Ning and Xu, Zhiyuan and Shen, Chaomin and Peng, Yaxin and Ou, Zhicai and Feng, Feifei and Tang, Jian},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2024}
}
```

## Acknowledgement

We build our project based on:
- [LLaVA](https://github.com/haotian-liu/LLaVA): an amazing open-source project for vision-language assistants
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): we use this codebase to finetune SLMs
- [Safe-RLHF](https://github.com/PKU-Alignment/safe-rlhf): we use this codebase to instruction-tune SLMs
[{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"