AI prompts
base on A fast communication-overlapping library for tensor/expert parallelism on GPUs. <div align="center">
👋 Hi, everyone!
<br>
We are <b>ByteDance Seed team.</b>
</div>
<p align="center">
You can get to know us better through the following channels👇
<br>
<a href="https://team.doubao.com/">
<img src="https://img.shields.io/badge/Website-%231e37ff?style=for-the-badge&logo=bytedance&logoColor=white"></a>
<a href="https://github.com/user-attachments/assets/93481cda-a7f3-47f3-b333-fe6b3da86b78">
<img src="https://img.shields.io/badge/WeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white"></a>
<a href="https://www.xiaohongshu.com/user/profile/668e7e15000000000303157d?xsec_token=ABl2-aqekpytY6A8TuxjrwnZskU-6BsMRE_ufQQaSAvjc%3D&xsec_source=pc_search">
<img src="https://img.shields.io/badge/Xiaohongshu-%23FF2442?style=for-the-badge&logo=xiaohongshu&logoColor=white"></a>
<a href="https://www.zhihu.com/org/dou-bao-da-mo-xing-tuan-dui/">
<img src="https://img.shields.io/badge/zhihu-%230084FF?style=for-the-badge&logo=zhihu&logoColor=white"></a>
</p>

# Flux: Fine-grained Computation-communication Overlapping GPU Kernel Library
<p align="center">
<a href="https://github.com/bytedance/flux">
<img src="https://img.shields.io/badge/Flux-Project Page-yellow"></a>
<a href="https://arxiv.org/pdf/2502.19811">
<img src="https://img.shields.io/badge/Flux-Tech Report-red"></a>
<a href="XXX">
<img src="https://img.shields.io/badge/License-Apache-blue"></a>
<br>
<a href="https://github.com/user-attachments/assets/d3fcb3bf-466b-4efe-8c3f-5f85258202ae">
<img src="https://img.shields.io/badge/Flux-Wechat Communication Group-07C160"></a>
</p>
Flux is a communication-overlapping library for dense/MoE models on GPUs, providing high-performance and pluggable kernels to support various parallelisms in model training/inference.
Flux's efficient kernels are compatible with Pytorch and can be integrated into existing frameworks easily, supporting various Nvidia GPU architectures and data types.
Welcome to join the [Wechat](https://github.com/bytedance/flux/blob/main/docs/assets/comet_wechat_group.JPG) group and stay tuned!
# News
[2025/03/10]🔥We have released **COMET: Computation-communication Overlapping for Mixture-of-Experts**.
## Getting started
Install Flux either from source or from PyPI.
### Install from Source
```bash
git clone --recursive https://github.com/bytedance/flux.git && cd flux
# Install dependencies
bash ./install_deps.sh
# For Ampere(sm80) GPU
./build.sh --arch 80 --nvshmem
# For Ada Lovelace(sm89) GPU
./build.sh --arch 89 --nvshmem
# For Hopper(sm90) GPU
./build.sh --arch 90 --nvshmem
```
#### Install in a virtual environment
Here is a snippet to install Flux in a virtual environment. Let's finish the installation in an virtual environment with CUDA 12.4, torch 2.6.0 and python 3.11.
```bash
conda create -n flux python=3.11
conda activate flux
pip3 install packaging
pip3 install ninja
pip3 install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
./build.sh --clean-all
./build.sh --arch "80;89;90" --nvshmem --package
```
Then you would expect a wheel package under `dist/` folder that is suitable for your virtual environment.
### Install from PyPI
We also provide some pre-built wheels for Flux, and you can directly install with pip if your wanted version is available. Currently we provide wheels for the following configurations: torch(2.4.0, 2.5.0, 2.6.0), python(3.10, 3.11), cuda(12.4).
```bash
# Make sure that PyTorch is installed.
pip install byte-flux
```
### Customized Installation
#### Build options for source installation
1. Add `--nvshmem` to build Flux with NVSHMEM support. It is essential for the MoE kernels.
2. If you are tired of the cmake process, you can set environment variable `FLUX_BUILD_SKIP_CMAKE` to 1 to skip cmake if `build/CMakeCache.txt` already exists.
3. If you want to build a wheel package, add `--package` to the build command. find the output wheel file under dist/
#### Dependencies
The core dependencies of Flux are NCCL, CUTLASS, and NVSHMEM, which are located under the 3rdparty folder.
1. NCCL: Managed by git submodule automatically.
2. NVSHMEM: Downloaded from https://developer.nvidia.com/nvshmem. The current version is 3.2.5-1.
3. CUTLASS: Flux leverages CUTLASS to generate high-performance GEMM kernels. We currently use CUTLASS 3.7.0 and a tiny patch should be applied to CUTLASS.
## Quick Start
Below are commands to run some basic demos once you have installed Flux successfully.
```bash
# gemm only
python3 test/python/gemm_only/test_gemm_only.py 4096 12288 6144 --dtype=float16
# all-gather fused with gemm (dense MLP layer0)
./launch.sh test/python/ag_gemm/test_ag_kernel.py 4096 49152 12288 --dtype=float16 --iters=10
# gemm fused with reduce-scatter (dense MLP layer1)
./launch.sh test/python/gemm_rs/test_gemm_rs.py 4096 12288 49152 --dtype=float16 --iters=10
# all-gather fused with grouped gemm (MoE MLP layer0)
./launch.sh test/python/moe_ag_scatter/test_moe_ag.py
# grouped gemm fused with reduce-scatter (MoE MLP layer1)
./launch.sh test/python/moe_gather_rs/test_moe_gather_rs.py
```
You can check out the documentations for more details!
* For a more detailed usage on MoE kernels, please refer to [Flux MoE Usage](https://github.com/bytedance/flux/blob/main/docs/moe_usage.md). Try some [examples](https://github.com/bytedance/flux/blob/main/examples) as a quick start. A [minimal MoE layer](https://github.com/bytedance/flux/blob/main/examples/moe_flux_only.py) can be implemented within only a few tens of lines of code using Flux!
* For some performance numbers, please refer to [Performance Doc](https://github.com/bytedance/flux/blob/main/docs/performance.md).
* To learn more about the design principles of Flux, please refer to [Design Doc](https://github.com/bytedance/flux/blob/main/docs/design.md).
## [License](./LICENSE)
The Flux Project is under the Apache License v2.0.
## Citation
If you use Flux in a scientific publication, we encourage you to add the following reference
to the related papers:
```
@misc{chang2024flux,
title={FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion},
author={Li-Wen Chang and Wenlei Bao and Qi Hou and Chengquan Jiang and Ningxin Zheng and Yinmin Zhong and Xuanrun Zhang and Zuquan Song and Ziheng Jiang and Haibin Lin and Xin Jin and Xin Liu},
year={2024},
eprint={2406.06858},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{zhang2025comet,
title={Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts},
author={Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen and Xin Liu},
year={2025},
eprint={2502.19811},
archivePrefix={arXiv},
primaryClass={cs.DC}
}
```
## Reference
* [ArXiv Paper (Flux)](http://arxiv.org/abs/2406.06858)
* [ArXiv Paper (Comet)](https://arxiv.org/abs/2502.19811)
# About [ByteDance Seed Team](https://team.doubao.com/)
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.
", Assign "at most 3 tags" to the expected json: {"id":"13105","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"