# Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Codebase for merging language models (ICML 2024).
<div align="center">
<img src="figures/icon.jpeg" width="25%">
</div>
This repository is built for the paper [Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch](https://arxiv.org/abs/2311.03099).
If you have any questions or suggestions, please feel free to let us know.
You can directly email [Le Yu](https://yule-buaa.github.io/) at [email protected] or open an issue on this repository.
## News
- [**May 2, 2024**] Our paper is accepted at ICML 2024! The camera-ready version is coming soon.
- [**February 9, 2024**] Special thanks to [Sourab Mangrulkar](https://github.com/pacman100) for integrating our work into the [huggingface/peft project](https://github.com/huggingface/peft)!
- [**January 28, 2024**] Our merged model [supermario_v2](https://huggingface.co/vanillaOVO/supermario_v2) ranks first among 7B models on the Open LLM Leaderboard!
- [**December 4, 2023**] We appreciate [Minhajul Hoque](https://medium.com/@minh.hoque) for sharing our work on [Medium](https://medium.com/@minh.hoque/paper-explained-language-models-are-super-mario-2ebce6c2cf35)!
- [**November 29, 2023**] Special thanks to [papersread.ai](https://papersread.ai/) for sharing [our work](https://papersread.ai/e/language-models-are-super-mario-absorbing-abilities-from-homologous-models-as-a-free-lunch/)!
- [**November 29, 2023**] We appreciate [martyn](https://github.com/martyn) for extending our work to [Stable Diffusion models](https://github.com/martyn/safetensors-merge-supermario)!
- [**November 27, 2023**] Special thanks to [brucethemoose](https://huggingface.co/brucethemoose) for applying our work to this [model](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties) on Hugging Face!
- [**November 26, 2023**] We appreciate [cg123](https://github.com/cg123) for integrating our work into the [mergekit project](https://github.com/arcee-ai/mergekit)!
- [**November 25, 2023**] Special thanks to [fly51fly](https://twitter.com/fly51fly) for sharing our work on [Twitter](https://twitter.com/fly51fly/status/1728159826742755588)!
- [**November 24, 2023**] We appreciate [uukuguy](https://github.com/uukuguy) for integrating our work into the [Multi-LoRAs project](https://pypi.org/project/multi-loras/0.2.0)!
- [**November 23, 2023**] Special thanks to [WizardLM](https://twitter.com/WizardLM_AI) for sharing our work on [Twitter](https://twitter.com/WizardLM_AI/status/1727672799391842468)!
- [**November 21, 2023**] We appreciate [PaperWeekly](http://www.paperweekly.info) for sharing our work on [WeChat](https://mp.weixin.qq.com/s/YiqWovBUXIbzmUbL6uT-8g) and [Zhihu](https://zhuanlan.zhihu.com/p/668152236)!
- [**November 11, 2023**] Special thanks to [Xi Xiaoyao](https://xixiaoyao.github.io/about/) for sharing our work on [WeChat](https://mp.weixin.qq.com/s?__biz=MzIwNzc2NTk0NQ%3D%3D&mid=2247565881&idx=2&sn=57985427fdb6751d617df801ca7fd810) and [Zhihu](https://zhuanlan.zhihu.com/p/666363702)!
- [**November 6, 2023**] Our paper is available on [arXiv](https://arxiv.org/abs/2311.03099), [Papers With Code](https://paperswithcode.com/paper/language-models-are-super-mario-absorbing), and [Hugging Face](https://huggingface.co/papers/2311.03099).
## Overview
In this work, we uncover that Language Models (LMs), whether encoder- or decoder-based, can **obtain new capabilities by assimilating the parameters of homologous models without any retraining or GPUs**.
1. We introduce a novel operation called **DARE** (Drop And REscale), which directly sets most of the delta parameters (90% or even 99%) to zero without affecting the capabilities of SFT LMs.
2. We use DARE as a **general preprocessing technique** to sparsify the delta parameters of multiple homologous SFT models, and then merge them into a single model by parameter averaging.
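For intuition, here is a minimal sketch of the DARE operation on a single weight tensor, assuming PyTorch tensors of matching shape. It is illustrative only and not the implementation used in this repository.
```{python}
# Illustrative sketch of DARE on a single weight tensor (not the repository's code).
import torch

def dare(pretrained: torch.Tensor, finetuned: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Randomly drop delta parameters and rescale the survivors by 1 / (1 - drop_rate)."""
    delta = finetuned - pretrained                                    # delta parameters learned by SFT
    mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))   # keep each delta with probability 1 - drop_rate
    return pretrained + delta * mask / (1.0 - drop_rate)              # rescale and add back to the pre-trained weights
```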
The workflow is shown below:
<div align="center">
<img src="figures/framework.jpg" width="80%">
</div>
By conducting extensive experiments, we find that:
1. DARE is effective for SFT models whose delta parameter values lie in a relatively small range (e.g., within 0.005), and it can eliminate even 99% of the delta parameters. Larger models can tolerate a higher proportion of discarded parameters, indicating that SFT naturally learns an extremely sparse set of delta parameters and that nearly all abilities originate from the pre-trained LMs. See (a) in the figure below.
2. DARE can merge multiple task-specific LMs into one LM with diverse abilities that possesses the functionalities of all the SFT models. For instance, merging WizardLM and WizardMath increases the GSM8K accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following capabilities while surpassing WizardMath's original performance of 64.2. See (b) in the figure below.
<div align="center">
<img src="figures/introduction_llms_merge.jpg" width="80%">
</div>
## Language Models and Datasets
We conduct experiments on both encoder- and decoder-based LMs.
* For encoder-based LMs, we choose bert-base-uncased and roberta-base as pre-trained backbones. Eight datasets from the GLUE benchmark are used, including CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, and RTE.
* For decoder-based LMs, we choose LLaMA, Llama 2, and Code Llama as pre-trained backbones. WizardLM, WizardMath, WizardCoder-Python, and Code Alpaca are used as fine-tuned models.
We evaluate three tasks on five datasets: AlpacaEval (instruction following), GSM8K and MATH (mathematical reasoning), and HumanEval and MBPP (code generation).
Note that we provide the GSM8K, MATH, and MBPP datasets in the ```math_code_data/``` folder; they are obtained from the [WizardLM repository](https://github.com/nlpxucan/WizardLM).
The other datasets are downloaded automatically by our code. The language models can be downloaded either manually or by our code.
You can also modify ```cache_dir``` in the ```utils/load_config.py``` file to specify your own path for saving datasets and models.
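For instance, the setting to adjust might look like the snippet below; this is a hypothetical illustration, and the actual variable layout in ```utils/load_config.py``` may differ.
```{python}
# Hypothetical example of the cache path setting in utils/load_config.py;
# the real variable name and structure may differ in the repository.
cache_dir = "/path/to/your/cache"  # where downloaded datasets and models are saved
```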
## Model Merging Methods
We provide implementations of five model merging methods in this repository:
[Average Merging](https://arxiv.org/abs/2203.05482),
[Task Arithmetic](https://arxiv.org/abs/2212.04089),
[Fisher Merging](https://arxiv.org/abs/2111.09832),
[RegMean](https://arxiv.org/abs/2212.09849), and
[TIES-Merging](https://arxiv.org/abs/2306.01708).
We also combine the proposed [DARE](https://arxiv.org/abs/2311.03099) with the above methods to improve merging performance.
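As a rough sketch of how DARE serves as a preprocessing step for these methods, the snippet below combines it with Average Merging and Task Arithmetic over full state dicts. It is illustrative only and reuses the hypothetical ```dare``` helper sketched in the Overview.
```{python}
# Illustrative sketch of merging with DARE-sparsified deltas (not the repository's code).
# `dare` is the hypothetical helper from the Overview; all state dicts are assumed
# to share identical keys and tensor shapes.
import torch

def merge_with_dare(pretrained_sd, finetuned_sds, drop_rate=0.9, method="average", scaling=1.0):
    merged = {}
    for name, base in pretrained_sd.items():
        # Sparsify each model's delta parameters with DARE, then recover the deltas.
        deltas = [dare(base, sd[name], drop_rate) - base for sd in finetuned_sds]
        if method == "average":
            # Average Merging: pre-trained weights plus the mean of the sparsified deltas
            merged[name] = base + torch.stack(deltas).mean(dim=0)
        else:
            # Task Arithmetic: pre-trained weights plus a scaled sum of the sparsified deltas
            merged[name] = base + scaling * torch.stack(deltas).sum(dim=0)
    return merged
```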
## Environments
The required packages are:
[PyTorch 2.0.1](https://pytorch.org/),
[transformers 4.33.1](https://huggingface.co/docs/transformers/index),
[datasets 2.13.1](https://huggingface.co/docs/datasets/index),
[vllm 0.1.4](https://github.com/vllm-project/vllm),
[human_eval](https://github.com/openai/human-eval),
[numpy](https://github.com/numpy/numpy), and
[tqdm](https://github.com/tqdm/tqdm).
## Executing Scripts for Encoder-based LMs
For encoder-based LMs, we first fine-tune them on the GLUE benchmark (both single-task and multi-task settings are supported)
and then run inference with them. We also provide scripts to merge encoder-based LMs with five model merging methods.
### Scripts for Fine-Tuning on GLUE
* Example of fine-tuning *roberta-base* on *CoLA* dataset under single-task setting:
```{bash}
python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --learning_rate 1e-5 --num_runs 5
```
* Example of fine-tuning *roberta-base* on *CoLA* and *RTE* datasets under multi-task setting:
```{bash}
python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --multitask_training --auxiliary_dataset_name rte --learning_rate 1e-5 --num_runs 5
```
### Scripts for Inference with DARE and Other Variants
* Example of direct inference on *roberta-base* (drop rate 0.0):
```{bash}
python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.0
```
* Example of inference on *roberta-base* with DARE (drop rate 0.9):
```{bash}
python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale
```
* Example of inference on *roberta-base* with DropOnly (drop rate 0.9):
```{bash}
python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9
```
* Example of inference on *roberta-base* with magnitude-based pruning (drop rate 0.9):
```{bash}
python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --mask_strategy magnitude
```
* Example of inference on *roberta-base* with masking fine-tuned parameters (drop rate 0.9):
```{bash}
python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale --weight_format finetuned_weight
```
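The variants above differ only in how the delta parameters are masked (randomly or by magnitude) and whether the surviving deltas are rescaled; the ```--weight_format finetuned_weight``` variant instead masks the fine-tuned parameters themselves rather than the deltas. Below is a rough sketch of the delta-based variants, assuming PyTorch tensors; it is illustrative only and not the repository's implementation.
```{python}
# Illustrative contrast between the delta-based masking variants above (not the repository's code).
import torch

def mask_delta(pretrained, finetuned, drop_rate, strategy="random", rescale=True):
    delta = finetuned - pretrained
    if strategy == "random":
        # DropOnly (rescale=False) and DARE (rescale=True): drop deltas uniformly at random
        mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    else:
        # Magnitude-based pruning: keep only the deltas with the largest absolute values
        num_kept = max(1, int(delta.numel() * (1.0 - drop_rate)))
        threshold = delta.abs().flatten().topk(num_kept).values.min()
        mask = (delta.abs() >= threshold).float()
    if rescale:
        delta = delta / (1.0 - drop_rate)   # DARE's rescaling of the surviving deltas
    return pretrained + delta * mask
```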
### Scripts for Merging Models
* Example of merging pairwise fine-tuned *roberta-base* with Average Merging:
```{bash}
python merge_plms_glue.py --merging_method_name average_merging --language_model_name roberta-base
```
* Example of merging pairwise fine-tuned *roberta-base* with Fisher Merging:
```{bash}
python merge_plms_glue.py --merging_method_name fisher_merging --normalize_fisher_weight --language_model_name roberta-base
```
* Example of merging pairwise fine-tuned *roberta-base* with Average Merging and DARE:
```{bash}
python merge_plms_glue.py --merging_method_name mask_merging --use_weight_rescale --language_model_name roberta-base --mask_apply_method average_merging
```
## Executing Scripts for Decoder-based LMs
Since the decoder-based LMs we use have already been fine-tuned, they can be directly utilized for inference.
We also provide scripts to merge decoder-based LMs with two model merging methods (Average Merging and Task Arithmetic).
### Scripts for Inference with DARE and Other Variants
* Example of direct inference on *WizardMath-7B-V1.0* on *GSM8K* (drop rate 0.0):
```{bash}
python inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.0
```
* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with DARE (drop rate 0.9):
```{bash}
python inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --use_weight_rescale
```
* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with DropOnly (drop rate 0.9):
```{bash}
python inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9
```
* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with magnitude-based pruning (drop rate 0.9):
```{bash}
python inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --mask_strategy magnitude
```
* Example of inference on *WizardMath-7B-V1.0* on *GSM8K* with masking fine-tuned parameters (drop rate 0.9):
```{bash}
python inference_llms_instruct_math_code.py --dataset_name gsm8k --finetuned_model_name WizardMath-7B-V1.0 --tensor_parallel_size 1 --weight_mask_rate 0.9 --use_weight_rescale --weight_format finetuned_weight
```
### Scripts for Merging Models
* Example of merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Average Merging:
```{bash}
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name average_merging --tensor_parallel_size 1
```
* Example of merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Task Arithmetic:
```{bash}
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name task_arithmetic --scaling_coefficient 1.0 --tensor_parallel_size 1
```
* Example of merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Average Merging and DARE (drop rate 0.2):
```{bash}
python merge_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1
```
**Note 1**: When merging decoder-based LMs, the number of GPUs to allocate equals ```num_models_to_merge * tensor_parallel_size```.
For example, if we want to merge *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with ```tensor_parallel_size == 1```, then we should allocate 2 * 1 = 2 GPUs.
**Note 2**: If vllm raises an "AssertionError: data parallel group is already initialized" error on your device, please try running ```direct_inference_merged_llms_instruct_math_code.py``` with the corresponding setting.
For example, if this error occurs when merging *WizardLM-13B-V1.2* and *WizardMath-13B-V1.0* with Average Merging and DARE (drop rate 0.2), please run the following commands to evaluate on the instruction- or math-related task:
```{bash}
python direct_inference_merged_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1 --evaluate_task instruct
python direct_inference_merged_llms_instruct_math_code.py --merge_instruct --merge_math --merging_method_name mask_merging --use_weight_rescale --weight_mask_rate 0.2 --mask_apply_method average_merging --tensor_parallel_size 1 --evaluate_task math
```
### Evaluation Process for AlpacaEval, HumanEval and MBPP
For AlpacaEval, HumanEval, and MBPP, our code stores the generated files; please additionally run the following evaluation commands to obtain the final metrics.
* For AlpacaEval:
We use ```chatgpt_fn``` in the [alpaca_eval repository](https://github.com/tatsu-lab/alpaca_eval) to compute the win rate. First, please refer to the [alpaca_eval repository](https://github.com/tatsu-lab/alpaca_eval) to install the environment.
Then, if you want to evaluate the generated *WizardLM-13B-V1.2_inference_mask_0.2_rescale_True.json* file, please run
```{bash}
alpaca_eval --model_outputs ./save_gen_instruct_responses_results/alpaca_eval/WizardLM-13B-V1.2_inference_mask_0.2_rescale_True.json --annotators_config chatgpt_fn --name WizardLM-13B-V1.2_inference_mask_0.2_rescale_True
```
* For HumanEval:
First, please refer to the [human-eval repository](https://github.com/openai/human-eval) to install the environment.
Then, if you want to evaluate the generated *WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl* file, please run
```{bash}
evaluate_functional_correctness ./save_gen_codes_results/human_eval/WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl
```
* For MBPP:
First, please refer to the [bigcode-evaluation-harness repository](https://github.com/bigcode-project/bigcode-evaluation-harness) to install the environment.
Then, if you want to evaluate the generated *WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl* file, please run
```{bash}
accelerate launch ./bigcode-evaluation-harness/main.py --tasks mbpp --allow_code_execution --load_generations_path ./save_gen_codes_results/mbpp/WizardCoder-Python-13B-V1.0_inference_mask_0.2_rescale_True.jsonl
```
## Acknowledgments
We are grateful to the authors of [WizardLM](https://github.com/nlpxucan/WizardLM) for making their project codes publicly available.
## Citation
Please consider citing our paper when using this project.
```{bibtex}
@inproceedings{yu2024language,
title={Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch},
author={Yu, Le and Yu, Bowen and Yu, Haiyang and Huang, Fei and Li, Yongbin},
booktitle={International Conference on Machine Learning},
year={2024},
organization={PMLR}
}
```
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=yule-BUAA/MergeLM&type=Timeline)](https://star-history.com/#yule-BUAA/MergeLM&Timeline)
", Assign "at most 3 tags" to the expected json: {"id":"5339","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"