g023/cuda_inf — GitHub trending stats & insights


g023/cuda_inf

A self-contained CUDA inference engine for LiquidAI/LFM2.5-8B-A1B (a hybrid convolution + GQA attention MoE with 8.5B total parameters, 1B active per token) targeting a single RTX 3060 (12 GB). No Python and no frameworks at runtime: just a single .cu engine plus a header-only byte-level BPE tokenizer.
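The gap between 8.5B total and 1B active parameters comes from mixture-of-experts routing: for each token a small router scores the experts, and only the few selected experts' feed-forward weights are used for that token. Below is a minimal C++ sketch of top-k routing. It assumes a softmax over just the selected experts' logits. The names `route_top_k` and `moe_forward`, the toy per-expert scale standing in for an expert FFN, and the value of k are hypothetical and not taken from the repository or the LFM2.5 model card.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// One selected expert and its gate weight (hypothetical type).
struct ExpertPick {
    int expert;
    float weight;
};

// Hypothetical helper: pick the top-k experts from raw router logits.
// Requires 1 <= k <= logits.size(). Returned weights sum to 1.
std::vector<ExpertPick> route_top_k(const std::vector<float>& logits, int k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Move the k largest logits to the front, in descending order.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the selected experts only (max-subtracted for stability).
    const float max_logit = logits[idx[0]];
    std::vector<ExpertPick> picks;
    float denom = 0.0f;
    for (int i = 0; i < k; ++i) {
        const float w = std::exp(logits[idx[i]] - max_logit);
        picks.push_back({idx[i], w});
        denom += w;
    }
    for (auto& p : picks) p.weight /= denom;
    return picks;
}

// Toy MoE layer: each "expert" just scales the hidden vector (stand-in for an FFN).
// Only the k routed experts are evaluated; the others cost nothing for this token.
std::vector<float> moe_forward(const std::vector<float>& x,
                               const std::vector<float>& router_logits,
                               const std::vector<float>& expert_scale, int k) {
    std::vector<float> out(x.size(), 0.0f);
    for (const auto& p : route_top_k(router_logits, k)) {
        for (std::size_t d = 0; d < x.size(); ++d)
            out[d] += p.weight * expert_scale[p.expert] * x[d];
    }
    return out;
}
```

On a GPU, the same idea means that each token only streams the selected experts' weights from memory. That is what makes a model of this size practical within 12 GB of VRAM bandwidth budgets.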


CUDA

1 contributor

MIT License


Social mentions

Recent discussions about this repository across the web

Started out working on a structured sparse-attention idea and ended up focusing on a pure C inferencing project w/flash-decoding, so here is my glorious attempt for anyone else to use as they wish…

@g023dev · x.com
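The post above mentions flash-decoding. As it is usually described, the technique splits a long KV cache into chunks that are processed in parallel, then merges the per-chunk softmax partials with a log-sum-exp rescale. Below is a host-side C++ sketch of that math for one query head. It is not the repository's kernel. The names `attend_chunk` and `flash_decode` and the chunk size are hypothetical; on a GPU, each chunk would typically map to its own thread block. Under GQA, several query heads share one KV head (kv_head = q_head / (n_q_heads / n_kv_heads)), so the same routine runs for each query head against its group's K/V.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Softmax partial for one chunk of the KV sequence.
struct Partial {
    float m;               // max score in the chunk
    float l;               // sum of exp(score - m)
    std::vector<float> o;  // sum of exp(score - m) * v (unnormalized)
};

// Hypothetical helper: one query attends over keys/values in [begin, end).
Partial attend_chunk(const std::vector<float>& q,
                     const std::vector<std::vector<float>>& K,
                     const std::vector<std::vector<float>>& V,
                     std::size_t begin, std::size_t end, float scale) {
    Partial p{-std::numeric_limits<float>::infinity(), 0.0f,
              std::vector<float>(V[0].size(), 0.0f)};
    std::vector<float> s(end - begin);
    for (std::size_t t = begin; t < end; ++t) {
        float dot = 0.0f;
        for (std::size_t d = 0; d < q.size(); ++d) dot += q[d] * K[t][d];
        s[t - begin] = dot * scale;
        p.m = std::max(p.m, s[t - begin]);
    }
    for (std::size_t t = begin; t < end; ++t) {
        const float w = std::exp(s[t - begin] - p.m);
        p.l += w;
        for (std::size_t d = 0; d < p.o.size(); ++d) p.o[d] += w * V[t][d];
    }
    return p;
}

// Split the sequence into chunks, then merge the partials by rescaling each
// chunk to the global max. The result equals single-pass softmax attention.
std::vector<float> flash_decode(const std::vector<float>& q,
                                const std::vector<std::vector<float>>& K,
                                const std::vector<std::vector<float>>& V,
                                std::size_t chunk) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));
    std::vector<Partial> parts;
    for (std::size_t b = 0; b < K.size(); b += chunk)
        parts.push_back(attend_chunk(q, K, V, b, std::min(b + chunk, K.size()), scale));

    float M = -std::numeric_limits<float>::infinity();
    for (const auto& p : parts) M = std::max(M, p.m);

    float L = 0.0f;
    std::vector<float> out(V[0].size(), 0.0f);
    for (const auto& p : parts) {
        const float r = std::exp(p.m - M);  // rescale factor for this chunk
        L += p.l * r;
        for (std::size_t d = 0; d < out.size(); ++d) out[d] += p.o[d] * r;
    }
    for (auto& x : out) x /= L;
    return out;
}
```

Because the merge is exact up to floating-point rounding, any chunk size gives the same output as a single pass. The chunk size only trades parallelism across the sequence against the cost of the final merge.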
