Claude Skill
WeianMao/triattention
TriAttention uses trigonometric KV cache compression to enable efficient long reasoning and local deployment of OpenClaw on memory-constrained GPUs.
Install this Skill
git clone https://github.com/WeianMao/triattention.git
cd triattention
pip install -e .
pip install flash-attn --no-build-isolation  # recommended (takes about 105 minutes on DGX Spark / GB10)
Summary
TriAttention is a technique for efficient long reasoning that compresses the KV cache trigonometrically to reduce memory usage, enabling local deployment of OpenClaw on memory-constrained GPUs.
TriAttention: efficient long reasoning through trigonometric KV cache compression. Supports local deployment of OpenClaw on memory-constrained GPUs.
Key features
- Trigonometric KV cache compression for reduced memory footprint
- Enables long-context reasoning on memory-constrained GPUs
- Supports local deployment of OpenClaw
- Runs on vLLM and SGLang backends, per the README news entries
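To give a sense of what the memory saving means in practice, here is a rough back-of-the-envelope sketch. It uses the standard size formula for an uncompressed KV cache (one key and one value vector per token, per layer, per KV head) and applies the 10.7x compression ratio quoted in the README excerpt. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values, 32,768 tokens) and the helper name `kv_cache_bytes` are assumptions chosen for illustration; they are not taken from the TriAttention repository, and the sketch does not implement the compression itself.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed for an uncompressed KV cache.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model shape, for illustration only (not from the repo).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

# Compression ratio quoted in the README excerpt.
COMPRESSION_RATIO = 10.7
compressed = full / COMPRESSION_RATIO

GIB = 1024 ** 3
print(f"full cache:       {full / GIB:.2f} GiB")        # 4.00 GiB
print(f"compressed cache: {compressed / GIB:.2f} GiB")  # 0.37 GiB
```

Under these assumed numbers, a single 32K-token sequence drops from about 4 GiB of cache to under 0.4 GiB, which is the kind of headroom that matters on a memory-constrained GPU. Actual savings depend on the model and on how the repository applies the compression.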
Use cases
- Running large language models locally on consumer GPUs
- Long-document analysis and summarization
- Memory-efficient AI reasoning for edge devices
README excerpt
<div align="center">

# TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

[arXiv](https://arxiv.org/abs/2604.04921) [Project Page](https://weianmao.github.io/tri-attention-project-page/) [License](LICENSE) [Python](https://www.python.org/downloads/)

*Compress KV cache by 10.7x and boost throughput by 2.5x on long reasoning tasks -- with no accuracy loss.*

[Weian Mao](https://scholar.google.com/citations?user=Qu-QXTsAAAAJ)<sup>1*</sup>, [Xi Lin](https://profile.erix025.me/)<sup>3*</sup>, [Wei Huang](https://aaron-weihuang.com/)<sup>2*</sup>, Yuxin Xie<sup>1</sup>, Tianfu Fu<sup>1</sup>, [Bohan Zhuang](https://bohanzhuang.github.io)<sup>3</sup>, [Song Han](http://songhan.mit.edu/)<sup>1,2</sup>, [Yukang Chen](https://yukangchen.com/)<sup>2</sup>

<sup>1</sup>MIT, <sup>2</sup>NVIDIA, <sup>3</sup>ZJU

<sup>*</sup>Equal contribution

</div>

https://github.com/user-attachments/assets/768e59bb-897e-41bf-81b8-e7376aa72056

## News

- **[2026-04-21]** SGLang backend support added — TriAttention now runs on SGLang in addition to vLLM. See [SGLang Integration](docs/sglang.md).
- **[2026-04-14]** Community DGX Spark (GB10/sm-121) enablement by [@dscain](https://github.com/dscain) — vLLM support merged, non-vLLM path in progress.
- **[2026-04-12]** TriAttention now supports AR video generation with KV cache compression. See [LongLive README](longlive/README.md).
- **[2026-04-11]** Community C/ggml port for llama.cpp (HIP/ROCm) by [@domvox](https://github.com/domvox) — enables TriAttention on AMD GPUs via llama.c