Claude Skill

WeianMao/triattention

TriAttention uses trigonometric KV cache compression to enable efficient long reasoning and local deployment of OpenClaw on memory-constrained GPUs.

Overview

Stars: 811
Forks: 77
Language: Python
Last pushed: 2026-07-02
Last synced: 2026-07-03

Repository

Owner: WeianMao
Repository: triattention
Full name: WeianMao/triattention
Repo ID: 1,200,960,609

Install this Skill

git clone https://github.com/WeianMao/triattention.git

Registry

Type: openclaw_skill
Quality score: 75/100
Verification: readme_parsed
Last verified: 2026-06-06
Platforms: OpenClaw
Capabilities: pdf, memory, image, video, terminal
Detected files: README.md, docs, requirements.txt
Install methods
  • git clone https://github.com/WeianMao/triattention.git
  • pip install -e .
  • pip install flash-attn --no-build-isolation # recommended (build takes about 105 min on DGX Spark / GB10)
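
Before running the install steps above, it can help to confirm that the environment matches what the README states. The sketch below checks the Python 3.10+ requirement shown in the README badge and whether the recommended flash-attn package is importable. The helper name `check_environment` is hypothetical and not part of the repository; `flash_attn` is the import name of the flash-attn package.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report whether this interpreter meets the README's stated requirements.

    Hypothetical helper for illustration; not part of the triattention repo.
    """
    return {
        # The README badge lists Python 3.10+.
        "python_ok": sys.version_info >= min_python,
        # flash-attn is recommended, not required; its import name is flash_attn.
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

A `False` for `flash_attn_installed` only means the recommended package is missing; the listing marks it as recommended rather than required.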

Summary

TriAttention compresses the KV cache with a trigonometric method to cut memory use during long reasoning, which makes local deployment of OpenClaw feasible on memory-constrained GPUs. The README reports 10.7x KV cache compression and 2.5x higher throughput on long reasoning tasks with no accuracy loss.

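Chinese description (translated)

TriAttention: efficient long reasoning through trigonometric KV cache compression. Supports local deployment of OpenClaw on memory-constrained GPUs.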

Key features

  • Trigonometric KV cache compression for reduced memory footprint
  • Enables long-context reasoning on memory-constrained GPUs
  • Supports local deployment of OpenClaw models
  • Optimized for efficient inference with limited hardware
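
To make the memory-footprint claim concrete, the sketch below estimates the size of an uncompressed KV cache with the standard formula (keys plus values, per layer, per KV head, per token) and then applies the 10.7x ratio quoted in the README. The model shape is hypothetical, and applying the ratio uniformly is a simplifying assumption; actual savings depend on the model and on TriAttention's settings. This does not implement the compression itself.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes for an uncompressed KV cache: keys and values for every layer and token."""
    # The factor of 2 covers keys plus values; bytes_per_value=2 assumes fp16/bf16.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def compressed_bytes(full_bytes, compression_ratio=10.7):
    """Apply a compression ratio; 10.7 is the figure quoted in the README."""
    return full_bytes / compression_ratio


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
    small = compressed_bytes(full)
    gib = 1024 ** 3
    print(f"uncompressed: {full / gib:.2f} GiB")  # uncompressed: 4.00 GiB
    print(f"compressed:   {small / gib:.2f} GiB")  # compressed:   0.37 GiB
```

For this assumed shape, a 32,768-token context drops from 4 GiB of cache to under 0.4 GiB, which is the kind of headroom that matters on a memory-constrained GPU.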

Use cases

  • Running large language models locally on consumer GPUs
  • Long-document analysis and summarization
  • Memory-efficient AI reasoning for edge devices

README excerpt

<div align="center">

# TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2604.04921) [![Project Page](https://img.shields.io/badge/Project-Page-teal)](https://weianmao.github.io/tri-attention-project-page/) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/downloads/)

*Compress KV cache by 10.7x and boost throughput by 2.5x on long reasoning tasks -- with no accuracy loss.*

[Weian Mao](https://scholar.google.com/citations?user=Qu-QXTsAAAAJ)<sup>1*</sup>, [Xi Lin](https://profile.erix025.me/)<sup>3*</sup>, [Wei Huang](https://aaron-weihuang.com/)<sup>2*</sup>, Yuxin Xie<sup>1</sup>, Tianfu Fu<sup>1</sup>, [Bohan Zhuang](https://bohanzhuang.github.io)<sup>3</sup>, [Song Han](http://songhan.mit.edu/)<sup>1,2</sup>, [Yukang Chen](https://yukangchen.com/)<sup>2</sup>

<sup>1</sup>MIT, <sup>2</sup>NVIDIA, <sup>3</sup>ZJU &nbsp;&nbsp; <sup>*</sup>Equal contribution

</div>

https://github.com/user-attachments/assets/768e59bb-897e-41bf-81b8-e7376aa72056

## News

- **[2026-04-21]** SGLang backend support added: TriAttention now runs on SGLang in addition to vLLM. See [SGLang Integration](docs/sglang.md).
- **[2026-04-14]** Community DGX Spark (GB10/sm-121) enablement by [@dscain](https://github.com/dscain): vLLM support merged, non-vLLM path in progress.
- **[2026-04-12]** TriAttention now supports AR video generation with KV cache compression. See [LongLive README](longlive/README.md).
- **[2026-04-11]** Community C/ggml port for llama.cpp (HIP/ROCm) by [@domvox](https://github.com/domvox): enables TriAttention on AMD GPUs via llama.c


Data from GitHub. Synced on 2026-07-03