# `EvalPlus(📖) => 📚`

> Rigorous evaluation of LLM-synthesized code (NeurIPS 2023 & COLM 2024)

<p align="center">
    <a href="https://evalplus.github.io"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a>
    <a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg"></a>
    <a href="https://openreview.net/forum?id=IBCBMeAhmC"><img src="https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg"></a>
    <a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-evalplus-%23ff8811.svg"></a>
    <a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a>
    <a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a>
</p>

<p align="center">
    <a href="#-about">📙About</a> •
    <a href="#-quick-start">🔥Quick Start</a> •
    <a href="#-llm-backends">🚀LLM Backends</a> •
    <a href="#-documents">📚Documents</a> •
    <a href="#-citation">📜Citation</a> •
    <a href="#-acknowledgement">🙏Acknowledgement</a>
</p>

## 📢 News

Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams, including:

* [Meta Llama 3.1 and 3.3](https://ai.meta.com/blog/meta-llama-3-1/)
* [Allen AI TÜLU 1/2/3](https://github.com/allenai/open-instruct/blob/main/docs/tulu1_tulu2.md#benchmark-based-eval)
* [Qwen2.5-Coder](https://qwenlm.github.io/blog/qwen2.5-coder-family/)
* [CodeQwen 1.5](https://qwenlm.github.io/blog/codeqwen1.5/)
* [DeepSeek-Coder V2](https://arxiv.org/pdf/2406.11931)
* [Qwen2](https://arxiv.org/pdf/2407.10671)
* [Snowflake Arctic](https://www.snowflake.com/en/data-cloud/arctic/)
* [StarCoder2](https://arxiv.org/pdf/2402.19173)
* [Magicoder](https://arxiv.org/pdf/2312.02120)
* [WizardCoder](https://arxiv.org/pdf/2306.08568)

Notable updates to EvalPlus:

- **[2024-10-20 `v0.3.1`]**: EvalPlus `v0.3.1` is officially released! Highlights: *(i)* code efficiency evaluation via EvalPerf, *(ii)* one command to run everything: generation + post-processing + evaluation, and *(iii)* support for more inference backends such as Google Gemini and Anthropic.
- **[2024-06-09 pre `v0.3.0`]**: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to [EvalArena](https://github.com/crux-eval/eval-arena).
- **[2024-04-17 pre `v0.3.0`]**: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks). An improvement of roughly 4 percentage points in pass@1 can be expected.

<details><summary>Earlier news <i>:: click to expand ::</i></summary>
<div>

- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6).
- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ Mini is released for ultra-fast evaluation when you have too many samples!
- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimized user experience: evaluation speed, PyPI package, Docker, etc.
- ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!

</div>
</details>

## 📙 About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

- ✨ **HumanEval+**: 80x more tests than the original HumanEval!
- ✨ **MBPP+**: 35x more tests than the original MBPP!
- ✨ **EvalPerf**: evaluating the efficiency of LLM-generated code!
- ✨ **Framework**: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?

- ✨ **Precise evaluation**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for the latest LLM rankings before & after rigorous evaluation.
- ✨ **Coding rigor**: Look at the score differences before & after applying EvalPlus tests! A smaller drop means the model generates more robust code, while a bigger drop means the generated code tends to be fragile.
- ✨ **Code efficiency**: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.

Want to know more details? Read our papers & materials!

- **EvalPlus**: [NeurIPS'23 paper](https://openreview.net/forum?id=1qvx610Cu7), [Slides](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs), [Poster](https://jw-liu.xyz/assets/pdf/EvalPlus_Poster.pdf), [Leaderboard](https://evalplus.github.io/leaderboard.html)
- **EvalPerf**: [COLM'24 paper](https://openreview.net/forum?id=IBCBMeAhmC), [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf), [Documentation](./docs/evalperf.md), [Leaderboard](https://evalplus.github.io/evalperf.html)

## 🔥 Quick Start

### Code Correctness Evaluation: HumanEval(+) or MBPP(+)

```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```

<details><summary>🛡️ Safe code execution within Docker <i>:: click to expand ::</i></summary>
<div>

```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```

</div>
</details>

### Code Efficiency Evaluation: EvalPerf (*nix only)

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```

<details><summary>🛡️ Safe code execution within Docker <i>:: click to expand ::</i></summary>
<div>

```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples \
           /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```

</div>
</details>

## 🚀 LLM Backends

### HuggingFace models

- `transformers` backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```

> [!Note]
>
> EvalPlus uses different prompts for base and chat models.
> By default, the mode is detected via `tokenizer.chat_template` when using the `hf`/`vllm` backends.
> For other backends, only chat mode is allowed.
>
> Therefore, if your base model comes with a `tokenizer.chat_template`,
> please add `--force-base-prompt` to avoid being evaluated in chat mode
> (a quick chat-template check is sketched further below).

<details><summary>Enable Flash Attention 2 <i>:: click to expand ::</i></summary>
<div>

```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
```

</div>
</details>

- `vllm` backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```

- `openai` compatible servers (e.g., [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)):

```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --base-url http://localhost:8000/v1 \
                  --backend openai --greedy

# GPTQModel
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp] \
                  --backend gptqmodel --greedy
```

### OpenAI models

- Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)

```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
                  --dataset [humaneval|mbpp] \
                  --backend openai \
                  --greedy
```

### Anthropic models

- Access Anthropic APIs from [Anthropic Console](https://console.anthropic.com/)

```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp] \
                  --backend anthropic \
                  --greedy
```

### Google Gemini models

- Access Gemini APIs from [Google AI Studio](https://aistudio.google.com/)

```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
                  --dataset [humaneval|mbpp] \
                  --backend google \
                  --greedy
```

### Amazon Bedrock models

- [Amazon Bedrock](https://aws.amazon.com/bedrock/)

```bash
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp] \
                  --backend bedrock \
                  --greedy
```
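As noted in the HuggingFace backend section above, the `hf`/`vllm` backends pick between base and chat prompts by checking `tokenizer.chat_template`. Below is a minimal sketch (not an EvalPlus command) for inspecting that field before deciding whether to pass `--force-base-prompt`; it assumes the `transformers` package is installed, and the model name is only an example.

```python
# Minimal sketch (assumes `transformers` is installed; the model name is an example).
# Prints whether the checkpoint ships a chat template, which is what the
# hf/vllm backends inspect to choose between base and chat prompts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")

# `chat_template` may be absent on older transformers versions, hence getattr.
if getattr(tokenizer, "chat_template", None) is None:
    print("No chat template found: the base (completion) prompt will be used.")
else:
    print("Chat template found: the chat prompt will be used "
          "unless --force-base-prompt is passed.")
```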
### Ollama backend

- [Ollama](https://ollama.com/)

```bash
evalplus.evaluate --model "mistral:7b" \
                  --dataset [humaneval|mbpp] \
                  --backend ollama \
                  --base-url http://localhost:11434/v1 \
                  --greedy
```

### Intel® Gaudi® Accelerator

- [Intel® Gaudi®](https://docs.habana.ai/en/latest/index.html)

To run the `hf` backend on Intel Gaudi, install [optimum-habana](https://github.com/huggingface/optimum-habana) first.

```bash
pip install git+https://github.com/huggingface/optimum-habana.git

evalplus.evaluate --model "qwen/CodeQwen1.5-7B-Chat" \
                  --dataset [humaneval|mbpp] \
                  --backend hf_gaudi \
                  --greedy \
                  --torch_compile
```

Or in Lazy Mode:

```bash
PT_HPU_LAZY_MODE=1 evalplus.evaluate --model "qwen/CodeQwen1.5-7B-Chat" \
                                     --dataset [humaneval|mbpp] \
                                     --backend hf_gaudi \
                                     --greedy
```

To run the `vllm` backend on Intel Gaudi, install [HabanaAI vllm](https://github.com/HabanaAI/vllm-fork) first.

```bash
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
pip install --upgrade pip
pip install -r requirements-hpu.txt
python setup.py develop
```

Then run:

```bash
PT_HPU_LAZY_MODE=1 evalplus.evaluate --model "qwen/CodeQwen1.5-7B-Chat" \
                                     --dataset [humaneval|mbpp] \
                                     --backend vllm \
                                     --greedy
```

You can check out the generations and results at `evalplus_results/[humaneval|mbpp]/`.

<details><summary>⏬ Using EvalPlus as a local repo? <i>:: click to expand ::</i></summary>
<div>

```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

</div>
</details>

## 📚 Documents

To learn more about how to use EvalPlus, please refer to:

- [EvalPlus Commands](./docs/cli.md)
- [EvalPerf](./docs/evalperf.md)
- [Program Execution](./docs/execution.md)

## 📜 Citation

```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct?
           Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```

## 🙏 Acknowledgement

- [HumanEval](https://github.com/openai/human-eval)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)