Claude Skill
WeianMao/triattention
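TriAttention uses trigonometric KV cache compression to enable efficient long reasoning and local OpenClaw deployment on memory-constrained GPUs.
Overview
Repository info
Install this Skill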
git clone https://github.com/WeianMao/triattention.git
Registry info
git clone https://github.com/WeianMao/triattention.git
cd triattention
pip install -e .
pip install flash-attn --no-build-isolation  # recommended (takes about 105 minutes on DGX Spark / GB10)
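After the install commands finish, a quick import check can confirm that the packages are visible to the active Python environment. The following is a minimal sketch, not part of the repository: `flash_attn` is the import name of the flash-attn package, while `triattention` is only an assumed import name for this project, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "triattention" is an assumed import name; adjust it if the package
    # installs under a different top-level module.
    print(missing_modules(["triattention", "flash_attn"]))
```

An empty list means both modules were found. Because flash-attn is marked as recommended rather than required, its absence may be acceptable depending on the setup.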
Project description
TriAttention is a technique for efficient long reasoning. It reduces memory use through trigonometric KV cache compression, which enables local deployment of OpenClaw on memory-constrained GPUs.
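Key points
- Trigonometric KV cache compression that reduces memory use
- Long-context reasoning on memory-constrained GPUs
- Support for local deployment of OpenClaw
- Efficient inference optimized for limited hardware resources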
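To make the memory argument concrete, the sketch below estimates the size of an uncompressed KV cache and what remains after a flat 10.7x reduction, the headline ratio quoted in the README excerpt on this page. It is back-of-the-envelope arithmetic only and does not implement TriAttention's compression. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,768-token sequence are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to store keys and values for every layer and token."""
    # The leading factor of 2 accounts for storing both K and V.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


GIB = 1024 ** 3

# Hypothetical model shape and sequence length (assumptions, not from the repo).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

# 10.7x is the compression ratio quoted in the README excerpt, applied here
# as a flat divisor purely for illustration.
compressed = full / 10.7

print(f"full:       {full / GIB:.2f} GiB")        # full:       4.00 GiB
print(f"compressed: {compressed / GIB:.2f} GiB")  # compressed: 0.37 GiB
```

Under these assumptions a 4 GiB cache shrinks to well under 1 GiB per sequence, which is the kind of headroom that matters on a GPU with limited memory.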
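Use cases
- Running large language models locally on consumer GPUs
- Long-document analysis and summarization
- Memory-efficient AI inference for edge devices
README excerpt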
<div align="center">

# TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

[arXiv](https://arxiv.org/abs/2604.04921) [Project Page](https://weianmao.github.io/tri-attention-project-page/) [License](LICENSE) [Python](https://www.python.org/downloads/)

*Compress KV cache by 10.7x and boost throughput by 2.5x on long reasoning tasks, with no accuracy loss.*

[Weian Mao](https://scholar.google.com/citations?user=Qu-QXTsAAAAJ)<sup>1*</sup>, [Xi Lin](https://profile.erix025.me/)<sup>3*</sup>, [Wei Huang](https://aaron-weihuang.com/)<sup>2*</sup>, Yuxin Xie<sup>1</sup>, Tianfu Fu<sup>1</sup>, [Bohan Zhuang](https://bohanzhuang.github.io)<sup>3</sup>, [Song Han](http://songhan.mit.edu/)<sup>1,2</sup>, [Yukang Chen](https://yukangchen.com/)<sup>2</sup>

<sup>1</sup>MIT, <sup>2</sup>NVIDIA, <sup>3</sup>ZJU

<sup>*</sup>Equal contribution

</div>

https://github.com/user-attachments/assets/768e59bb-897e-41bf-81b8-e7376aa72056

## News

- **[2026-04-21]** SGLang backend support added: TriAttention now runs on SGLang in addition to vLLM. See [SGLang Integration](docs/sglang.md).
- **[2026-04-14]** Community DGX Spark (GB10/sm-121) enablement by [@dscain](https://github.com/dscain): vLLM support merged, non-vLLM path in progress.
- **[2026-04-12]** TriAttention now supports AR video generation with KV cache compression. See [LongLive README](longlive/README.md).
- **[2026-04-11]** Community C/ggml port for llama.cpp (HIP/ROCm) by [@domvox](https://github.com/domvox); enables TriAttention on AMD GPUs via llama.c