Trendshift - Ask AI

base on Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM <div align="center"> <h1> <img width="40" alt="tool icon" src="https://github.com/user-attachments/assets/f9b86465-aefa-4625-a09b-54e158efcf96" /> <span style="font-size:80px;">LLM Compressor</span> </h1> [![docs](https://img.shields.io/badge/docs-LLM--Compressor-blue)](https://docs.vllm.ai/projects/llm-compressor/en/latest/) [![PyPI](https://img.shields.io/pypi/v/llmcompressor.svg)](https://pypi.org/project/llmcompressor/) </div> `llmcompressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including: * Comprehensive set of quantization algorithms for weight-only and activation quantization * Seamless integration with Hugging Face models and repositories * `safetensors`-based file format compatible with `vllm` * Large model support via `accelerate` **✨ Read the announcement blog [here](https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/)! ✨** <p align="center"> <img alt="LLM Compressor Flow" src="https://github.com/user-attachments/assets/adf07594-6487-48ae-af62-d9555046d51b" width="80%" /> </p> ## 🚀 What's New! Big updates have landed in LLM Compressor! To get a more in-depth look, check out the [deep-dive](https://x.com/RedHat_AI/status/1937865425687093554). Some of the exciting new features include: * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization. * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py). * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM. * **FP4 Quantization - now with MoE and non-uniform support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4. * **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py). * **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor). ### Supported Formats * Activation Quantization: W8A8 (int8 and fp8) * Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support) * 2:4 Semi-structured and Unstructured Sparsity ### Supported Algorithms * Simple PTQ * GPTQ * AWQ * SmoothQuant * SparseGPT ### When to Use Which Optimization Please refer to [compression_schemes.md](./docs/guides/compression_schemes.md) for detailed information about available optimization schemes and their use cases. ## Installation ```bash pip install llmcompressor ``` ## Get Started ### End-to-End Examples Applying quantization with `llmcompressor`: * [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md) * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md) * [Activation quantization to `fp4`](examples/quantization_w4a4_fp4/llama3_example.py) * [Weight only quantization to `fp4`](examples/quantization_w4a16_fp4/llama3_example.py) * [Weight only quantization to `int4` using GPTQ](examples/quantization_w4a16/README.md) * [Weight only quantization to `int4` using AWQ](examples/awq/README.md) * [Quantizing MoE LLMs](examples/quantizing_moe/README.md) * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md) * [Quantizing Audio-Language Models](examples/multimodal_audio/README.md) * [Quantizing Models Non-uniformly](examples/quantization_non_uniform/README.md) ### User Guides Deep dives into advanced usage of `llmcompressor`: * [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md) ## Quick Tour Let's quantize `TinyLlama` with 8 bit weights and activations using the `GPTQ` and `SmoothQuant` algorithms. Note that the model can be swapped for a local or remote HF-compatible checkpoint and the `recipe` may be changed to target different quantization algorithms or formats. ### Apply Quantization Quantization is applied by selecting an algorithm and calling the `oneshot` API. ```python from llmcompressor.modifiers.smoothquant import SmoothQuantModifier from llmcompressor.modifiers.quantization import GPTQModifier from llmcompressor import oneshot # Select quantization algorithm. In this case, we: # * apply SmoothQuant to make the activations easier to quantize # * quantize the weights to int8 with GPTQ (static per channel) # * quantize the activations to int8 (dynamic per token) recipe = [ SmoothQuantModifier(smoothing_strength=0.8), GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]), ] # Apply quantization using the built in open_platypus dataset. # * See examples for demos showing how to pass a custom calibration set oneshot( model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dataset="open_platypus", recipe=recipe, output_dir="TinyLlama-1.1B-Chat-v1.0-INT8", max_seq_length=2048, num_calibration_samples=512, ) ``` ### Inference with vLLM The checkpoints created by `llmcompressor` can be loaded and run in `vllm`: Install: ```bash pip install vllm ``` Run: ```python from vllm import LLM model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8") output = model.generate("My name is") ``` ## Questions / Contribution - If you have any questions or requests open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation. - We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md). ## Citation If you find LLM Compressor useful in your research or projects, please consider citing it: ```bibtex @software{llmcompressor2024, title={{LLM Compressor}}, author={Red Hat AI and vLLM Project}, year={2024}, month={8}, url={https://github.com/vllm-project/llm-compressor}, } ``` ", Assign "at most 3 tags" to the expected json: {"id":"13381","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"

AI prompts