# šŸš€ SwiftInfer

## šŸ”— Table of Contents

- [šŸš€ SwiftInfer](#-swiftinfer)
  - [šŸ”— Table of Contents](#-table-of-contents)
  - [šŸ“Œ Overview](#-overview)
  - [šŸš— Quick Start](#-quick-start)
    - [šŸ›  Installation](#-installation)
    - [šŸ•¹ Run Llama example](#-run-llama-example)
  - [āš–ļø Benchmark](#-benchmark)
  - [šŸ—ŗ Roadmap](#-roadmap)
  - [šŸ“ƒ Acknowledgement](#-acknowledgement)
  - [šŸ“ Citation](#-citation)

## šŸ“Œ Overview

[**Streaming-LLM**](https://github.com/mit-han-lab/streaming-llm) is a technique that supports infinite input length for LLM inference. It leverages [**Attention Sink**](https://arxiv.org/abs/2309.17453) to prevent model collapse when the attention window shifts. The original work is implemented in PyTorch; we offer **SwiftInfer**, a TensorRT implementation that makes StreamingLLM more production-grade. Our implementation is built upon the recently released [**TensorRT-LLM**](https://github.com/NVIDIA/TensorRT-LLM) project.

## šŸš— Quick Start

### šŸ›  Installation

We use the API in [**TensorRT-LLM**](https://github.com/NVIDIA/TensorRT-LLM) to construct the model and run inference. As the API of TensorRT-LLM is not stable and is changing rapidly, we pin our implementation to the `42af740db51d6f11442fd5509ef745a4c043ce51` commit, which corresponds to version `v0.6.0`. We may upgrade this repository as TensorRT-LLM's APIs become more stable.

If you have already built **TensorRT-LLM v0.6.0**, simply run:

```bash
git clone https://github.com/hpcaitech/SwiftInfer.git
cd SwiftInfer
pip install .
```

Otherwise, you should install TensorRT-LLM first.

#### Install TensorRT-LLM with Docker

If you are using Docker, you can follow the [TensorRT-LLM Installation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md) guide to install **TensorRT-LLM v0.6.0**. With Docker, you can then install SwiftInfer by simply running:

```bash
git clone https://github.com/hpcaitech/SwiftInfer.git
cd SwiftInfer
pip install .
```

#### Install TensorRT-LLM without Docker

If you are not using Docker, we provide a script to install TensorRT-LLM automatically.

**Prerequisites**

Please ensure that you have installed the following packages:

- python
- build essentials, including gcc/g++, make, cmake
- CUDA toolkit
- cuDNN
- NCCL
- TensorRT
- PyTorch

Make sure the TensorRT version is >= 9.1.0 and the CUDA toolkit version is >= 12.2.

To install TensorRT:

```bash
ARCH=$(uname -m)
if [ "$ARCH" = "arm64" ];then ARCH="aarch64";fi
if [ "$ARCH" = "amd64" ];then ARCH="x86_64";fi
if [ "$ARCH" = "aarch64" ];then OS="ubuntu-22.04"; else OS="linux";fi

wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.1.0/tars/tensorrt-9.1.0.4.$OS.$ARCH-gnu.cuda-12.2.tar.gz
tar xzvf tensorrt-9.1.0.4.$OS.$ARCH-gnu.cuda-12.2.tar.gz
PY_VERSION=$(python -c 'import sys; print(".".join(map(str, sys.version_info[0:2])))')
PARSED_PY_VERSION=$(echo "${PY_VERSION//./}")
pip install TensorRT-9.1.0.4/python/tensorrt-*-cp${PARSED_PY_VERSION}-*.whl
export TRT_ROOT=$(realpath TensorRT-9.1.0.4)
```

To download NCCL, follow the [NCCL download page](https://developer.nvidia.com/nccl/nccl-download).

To download cuDNN, follow the [cuDNN download page](https://developer.nvidia.com/rdp/cudnn-download).

**Commands**

Before running the following commands, please ensure that you have set up `nvcc` correctly. To check it, run:

```bash
nvcc --version
```

To install TensorRT-LLM and SwiftInfer, run:

```bash
git clone https://github.com/hpcaitech/SwiftInfer.git
cd SwiftInfer
TRT_ROOT=xxx NCCL_ROOT=xxx CUDNN_ROOT=xxx pip install .
```
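After installation, it may be worth sanity-checking the toolchain before building any engines. Below is a minimal sketch (an optional check, not part of the official instructions), assuming both `tensorrt` and `tensorrt_llm` expose a `__version__` attribute in your environment:

```python
# Minimal post-install sanity check.
# Assumption: both packages expose __version__ in this environment.
import tensorrt
import tensorrt_llm

print(f"TensorRT version:     {tensorrt.__version__}")      # should be >= 9.1.0
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")  # this repo is pinned to v0.6.0
```

If the reported TensorRT version is below 9.1.0 or the TensorRT-LLM version differs from the pinned `v0.6.0`, revisit the steps above before proceeding.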
### šŸ•¹ Run Llama example

To run the Llama example, you first need to clone the Hugging Face repository for the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model or another Llama-based variant such as [lmsys/vicuna-7b-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3).

Then, you can run the following command to build the TensorRT engine. **You need to replace `<model-dir>` with the actual path to the Llama model.**

```bash
cd examples/llama
python build.py \
    --model_dir <model-dir> \
    --dtype float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --max_input_len 2048 \
    --max_output_len 1024 \
    --output_dir ./output/7B-streaming-8k-1k-4-2000/trt_engines/fp16/1-gpu/ \
    --max_batch_size 1
```

Next, you need to download the [MT-Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md#mt-bench) data provided by [LMSYS-FastChat](https://github.com/lm-sys/FastChat).

```bash
mkdir mt_bench_data
wget -P ./mt_bench_data https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl
```

Finally, you are ready to run the Llama example with the following command.

ā—ļøā—ļøā—ļø **Before that, please note that:**

1. The `only_n_first` argument is used to control the number of samples to be evaluated. If you want to evaluate all samples, please remove this argument.

```bash
python ../run_conversation.py \
    --max_input_length 2048 \
    --max_output_len 1024 \
    --tokenizer_dir <model-dir> \
    --engine_dir ./output/7B-streaming-8k-1k-4-2000/trt_engines/fp16/1-gpu/ \
    --input_file ./mt_bench_data/question.jsonl \
    --streaming_llm_start_size 4 \
    --only_n_first 5
```

You should expect to see the generation output as follows:

![generation output](./assets/inference-result.png)

## āš–ļø Benchmark

We have benchmarked our implementation of Streaming-LLM against the [original PyTorch version](https://github.com/mit-han-lab/streaming-llm). The benchmark command for our implementation is given in the [Run Llama Example](#-run-llama-example) section, while that for the original PyTorch implementation is given in the [torch_streamingllm](./examples/torch_streamingllm/) folder.

The hardware used is listed below:

- GPU: Nvidia H800 (80GB)
- CPU: Intel(R) Xeon(R) Platinum 8468
- RAM: 2TB

The results (20 rounds of conversations) are:

![performance](./assets/performance.jpg)

We are still working on further performance improvements and on adapting to the TensorRT-LLM v0.7.1 APIs. We also notice that TensorRT-LLM has integrated StreamingLLM in their [example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-streamingllm), but it seems more suitable for single text generation than for multi-round conversations.

## šŸ—ŗ Roadmap

- [x] Streaming-LLM attention implementation based on TRT-LLM APIs
- [x] KV cache adaptation (the eviction policy it follows is sketched after this list)
- [x] Early stop adaptation
- [x] Contiguous tensor fix
- [x] Llama example for multi-round conversation
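For readers who want a concrete picture of the KV cache adaptation: StreamingLLM keeps a handful of attention-sink tokens from the very start of the sequence plus a sliding window of the most recent tokens, and evicts everything in between once the cache budget is exceeded. The sketch below is a conceptual illustration in plain Python, not the SwiftInfer or TensorRT-LLM API; the `evict_kv_cache` helper and its default budget are hypothetical, with `start_size` corresponding to `--streaming_llm_start_size` in the run command above.

```python
# Conceptual sketch of the StreamingLLM eviction policy (not the SwiftInfer API).
from typing import List


def evict_kv_cache(positions: List[int], start_size: int = 4, recent_size: int = 2000) -> List[int]:
    """Return the token positions that stay in the KV cache under StreamingLLM."""
    if len(positions) <= start_size + recent_size:
        return positions                   # within budget: nothing is evicted
    sinks = positions[:start_size]         # the first few tokens act as attention sinks
    recent = positions[-recent_size:]      # sliding window of the most recent tokens
    return sinks + recent                  # everything in between is dropped


# Example: a 10,000-token history with a 4 + 2000 budget keeps 2004 entries.
kept = evict_kv_cache(list(range(10_000)))
print(len(kept), kept[:6], kept[-3:])      # 2004 [0, 1, 2, 3, 8000, 8001] [9997, 9998, 9999]
```

Because the sink tokens are always retained, attention scores have a stable place to concentrate, which is what prevents the collapse described in the Overview when the window slides across long multi-round conversations.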
## šŸ“ƒ Acknowledgement

This work is inspired by Streaming-LLM and aims to make it usable in production. Throughout development, we have referenced the following materials, and we wish to acknowledge their efforts and contributions to the open-source community and academia.

- Streaming-LLM
  - [Paper](https://arxiv.org/abs/2309.17453)
  - [Slides](https://github.com/mit-han-lab/streaming-llm/blob/main/assets/StreamingLLM.pdf)
  - [GitHub Repository](https://github.com/mit-han-lab/streaming-llm)
- TensorRT-LLM
  - [Documentation](https://nvidia.github.io/TensorRT-LLM/)
  - [GitHub Repository](https://github.com/NVIDIA/TensorRT-LLM)

## šŸ“ Citation

If you find StreamingLLM and our TensorRT implementation useful, please kindly cite our repository and the original work proposed by [Xiao et al.](https://github.com/Guangxuan-Xiao) from [MIT Han Lab](https://github.com/mit-han-lab).

```bibtex
# our repository
# NOTE: the listed authors have equal contribution
@misc{streamingllmtrt2023,
    title = {SwiftInfer},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/hpcaitech/SwiftInfer}},
}

# Xiao's original paper
@article{xiao2023streamingllm,
    title={Efficient Streaming Language Models with Attention Sinks},
    author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
    journal={arXiv},
    year={2023}
}

# TensorRT-LLM repository
# as the TensorRT-LLM team does not provide a bibtex entry,
# please let us know if any change is needed
@misc{trtllm2023,
    title = {TensorRT-LLM},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/NVIDIA/TensorRT-LLM}},
}
```
"},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"