base on The Universe of Evaluation. All about the evaluation for LLMs. <div align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="assets/Evalverse_White.png" width=300>
<source media="(prefers-color-scheme: light)" srcset="assets/Evalverse_Color.png" width=300>
<img alt="Evalverse" src="assets/Evalverse_Color.png" width=300>
</picture>
The Universe of Evaluation.
All about the evaluation for LLMs. </br>
Upstage Solar is powered by Evalverse! Try at Upstage [Console](https://console.upstage.ai/)!
[π€HugginFace Space](https://huggingface.co/spaces/upstage/evalverse-space) β’ [πDocs](https://evalverse.gitbook.io/evalverse-docs) β’ [πPaper](https://arxiv.org/abs/2404.00943)
[Examples](https://github.com/UpstageAI/evalverse/tree/main/examples) β’ [FAQ](https://evalverse.gitbook.io/evalverse-docs/documents/faqs) β’ [Contribution Guide](https://github.com/UpstageAI/evalverse/blob/main/contribution/CONTRIBUTING.md) β’ [Contact](mailto:
[email protected]) β’ [Discord](https://discord.gg/D3bBj66K)
</div>
### π Newly updated
- [2024.05.10] LLM-Evaluation Report of Evalverse is now available on [HuggingFace Space](https://huggingface.co/spaces/upstage/evalverse-space).
<div align="center"><img alt="overview" src="assets/overview.png" width=500></div>
## π Welcome to Evalverse!
Evalverse is a freely accessible, open-source project designed to support your LLM (Large Language Model) evaluation needs. We provide a simple, standardized, and user-friendly solution for the processing and management of LLM evaluations, catering to the needs of AI research engineers and scientists. We also support no-code evaluation processes for people who may have less experience working with LLMs. Moreover, you will receive a well-organized report with figures summarizing the evaluation results.
### With Evalverse, you are empowered to
- access various evaluation methods without juggling multiple libraries.
- receive insightful report about the evaluation results that helps you to compare the varied scores across different models.
- initiate evaluation and generate reports without any code via Slack bot.
### Architecture of Evalverse
<div align="center"><img alt="architecture" src="assets/architecture.png" width=700></div>
### Key Features of Evalverse
- **Unified evaluation with Submodules**: Evalverse extends its evaluation capabilities through Git submodules, effortlessly incorporating frameworks like [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [FastChat](https://github.com/lm-sys/FastChat). Swiftly add new tools and keep pace with the latest in LLM evaluation.
- **No-code evaluation request**: With Evalverse, request LLM evaluations without any code, simply by sending `Request!` in a direct message or Slack channel with an activate Evalverse Slack bot. Enter the model name in the Huggingface hub or local model directory path in Slack, and let the bot handle the rest.
- **LLM evaluation report**: Obtain comprehensive, no-code reports from Evalverse. Request with a simple command -`Report!`-, select the model and evaluation criteria, and receive detailed reports with scores, rankings, and visuals, all generated from the stored score database.
If you want to know more about Evalverse, please checkout our [docs](https://evalverse.gitbook.io/evalverse-docs). </br>
By clicking below image, it'll take you to a short intro video!
[![Brief Introduction](./assets/intro-evalverse.png)](https://www.youtube.com/watch?v=-VviAutjpgM)
</br>
## π Installation
### π Option 1: Git clone
Before cloning, please make sure you've registered proper SSH keys linked to your GitHub account.
#### 1. Clone the Evalverse repository
- Notes: add `--recursive` option to also clone submodules
```
git clone --recursive https://github.com/UpstageAI/evalverse.git
```
#### 2. Install requirement packages
```
cd evalverse
pip install -e .
```
### π Option 2: Install via Pypi *(WIP)*
> Currently, installation via Pypi is not supported. Please install Evalverse with option 1.
</br>
## π Configuration
You have to set an API key and/or Token in the `.env` file (rename `.env_sample` to `.env`) to use all features of Evalverse.
- OpenAI API Key (required for `mt_bench`)
- Slack BOT/APP Token (required for slack reporter)
```
OPENAI_API_KEY=sk-...
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
```
</br>
## π Quickstart
More detailed tutorials are [here](https://github.com/UpstageAI/evalverse/tree/main/examples).
- [basic_usage.ipynb](https://github.com/UpstageAI/evalverse/tree/main/examples/01_basic_usage.ipynb): Very basic usage, like how to use `Evaluator` for evaluation and `Reporter` for generating report.
- [advanced_usage.ipynb](https://github.com/UpstageAI/evalverse/tree/main/examples/02_advanced_usage.ipynb): Introduces methods for evaluating each benchmark and all benchmarks collectively.
### π Evaluation
#### π« Evaluation with Library
The following code is a simple example to evaluate the [SOLAR-10.7B-Instruct-v1.0 model](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) on the `h6_en` ([Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)) benchmark.
```python
import evalverse as ev
evaluator = ev.Evaluator()
model = "upstage/SOLAR-10.7B-Instruct-v1.0"
benchmark = "h6_en"
evaluator.run(model=model, benchmark=benchmark)
```
#### π« Evaluation with CLI
Here is a CLI script that produces the same result as the above code:
```bash
cd evalverse
python3 evaluator.py \
--h6_en \
--ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```
### π Report
Currently, generating a report is only available through the library. We will work on a Command Line Interface (CLI) version as soon as possible.
```python
import evalverse as ev
db_path = "./db"
output_path = "./results"
reporter = ev.Reporter(db_path=db_path, output_path=output_path)
reporter.update_db(save=True)
model_list = ["SOLAR-10.7B-Instruct-v1.0", "Llama-2-7b-chat-hf"]
benchmark_list = ["h6_en"]
reporter.run(model_list=model_list, benchmark_list=benchmark_list)
```
<img alt="architecture" src="assets/sample_report.png" width=700>
| Model | Ranking | total_avg | H6-ARC | H6-Hellaswag | H6-MMLU | H6-TruthfulQA | H6-Winogrande | H6-GSM8k |
|--------------------------:|--------:|----------:|-------:|-------------:|--------:|--------------:|--------------:|---------:|
| SOLAR-10.7B-Instruct-v1.0 | 1 | 74.62 | 71.33 | 88.19 | 65.52 | 71.72 | 83.19 | 67.78 |
| Llama-2-7b-chat-hf | 2 | 53.51 | 53.16 | 78.59 | 47.38 | 45.31 | 72.69 | 23.96 |
</br>
## π Supported Evaluations
We currently support four evaluation methods. If you have suggestions for new methods, we welcome your input!
| Evaluation | Original Repository |
|---------------------------|--------------------------------------------|
| H6 (Open LLM Leaderboard) | [EleutherAI](https://github.com/EleutherAI)/[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)|
| MT-bench | [lm-sys](https://github.com/lm-sys)/[FastChat](https://github.com/lm-sys/FastChat)|
| IFEval | [google-research](https://github.com/google-research/google-research/tree/master)/[instruction_following_eval](https://github.com/google-research/google-research/tree/master/instruction_following_eval)|
| EQ-Bench | [EQ-bench](https://github.com/EQ-bench)/[EQ-Bench](https://github.com/EQ-bench/EQ-Bench)|
</br>
## π Evalverse use-case
> If you have any use-cases of your own, please feel free to let us know. </br>We would love to hear about them and possibly feature your case.
*β¨* [`Upstage`](https://www.upstage.ai/) is using Evalverse for evaluating [Solar](https://console.upstage.ai/services/solar?utm_source=upstage.ai&utm_medium=referral&utm_campaign=Main+hero+Solar+card&utm_term=Try+API+for+Free&utm_content=home). </br>
*β¨* [`Upstage`](https://www.upstage.ai/) is using Evalverse for evaluating models at [Open Ko-LLM Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard).
</br>
## π Contributors
<a href="https://github.com/UpstageAI/evalverse/graphs/contributors">
<img src="https://contrib.rocks/image?repo=UpstageAI/evalverse"/>
</a>
## π Acknowledgements
Evalverse is an open-source project orchestrated by the **Data-Centric LLM Team** at `Upstage`, designed as an ecosystem for LLM evaluation. Launched in April 2024, this initiative stands at the forefront of advancing evaluation handling in the realm of large language models (LLMs).
## π License
Evalverse is completely freely-accessible open-source and licensed under the Apache License 2.0.
## π Citation
If you want to cite our π Evalverse project, feel free to use the following bibtex. You can check our paper via [link](https://arxiv.org/abs/2404.00943).
```bibtex
@misc{kim2024evalverse,
title={Evalverse: Unified and Accessible Library for Large Language Model Evaluation},
author={Jihoo Kim and Wonho Song and Dahyun Kim and Yunsu Kim and Yungi Kim and Chanjun Park},
year={2024},
eprint={2404.00943},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
", Assign "at most 3 tags" to the expected json: {"id":"9111","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"