prefill stage¶
the prefill stage suffers from the quadratic complexity of the attention mechanism in sequence length.
adaptive layerwise sparsity¶
introduce a layer-wise adaptive ratio assignment scheme for important tokens, instead of a predefined, uniform fixed ratio across all layers.
prefill (for the j-th token)¶
\(p\) is the number of selected important tokens
\(\mathbf{T}\) is the index set of the selected important tokens; \(\mathrm{nnz}(A_{:,j})\) is the number of non-zero entries in the j-th column of the attention matrix \(A\), used as a proxy for the total attention score received by the j-th token
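a minimal NumPy sketch of how such a layer-wise adaptive selection could look; the threshold `tau`, the column-sum importance score, and the function name are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def select_important_tokens(A: np.ndarray, tau: float = 0.95):
    """Layer-wise adaptive selection of important tokens (sketch).

    A   : (n, n) post-softmax attention matrix of one layer
          (one head, or averaged over heads).
    tau : assumed hyperparameter -- fraction of total attention mass
          the selected tokens must cover.
    Returns (p, T): number of important tokens and their indices.
    """
    # importance of token j = total attention it receives,
    # i.e. the sum of the j-th column of A
    scores = A.sum(axis=0)                    # (n,)
    order = np.argsort(scores)[::-1]          # descending importance
    csum = np.cumsum(scores[order])
    # smallest p whose top-p tokens cover tau of the total mass;
    # p therefore adapts per layer instead of being a fixed ratio
    p = int(np.searchsorted(csum, tau * csum[-1]) + 1)
    T = np.sort(order[:p])                    # index set of important tokens
    return p, T
```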
decode (newly generated token)¶
re-apply the prefill-style sparsification every 100 generated tokens
algorithm¶
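the paper's full algorithm isn't reproduced in these notes; below is a hypothetical end-to-end sketch combining the two stages, assuming a `model` object with `prefill`/`decode_step` methods (invented names) and the `select_important_tokens` helper above:

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens, tau=0.95, reselect_every=100):
    """Schematic ZipVL-style generation loop (invented `model` API)."""
    # prefill: run dense attention once, then pick the important set
    kv_cache, attn = model.prefill(prompt_ids)      # attn: (n, n) scores
    p, T = select_important_tokens(attn, tau)

    tokens = list(prompt_ids)
    for step in range(1, max_new_tokens + 1):
        # decode: the new token attends only to the important tokens' KV;
        # `attn` is assumed to come back updated for the longer sequence
        logits, kv_cache, attn = model.decode_step(tokens[-1], kv_cache, keep=T)
        tokens.append(int(np.argmax(logits)))
        # re-run the prefill-style selection every 100 generated tokens
        if step % reselect_every == 0:
            p, T = select_important_tokens(attn, tau)
    return tokens
```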
ground¶
this allows the sparsity ratio to adapt to task complexity, rather than being tuned per task.
pros¶
this approach seamlessly integrates with existing fast attention implementations without requiring custom GPU kernels.
decode stage¶
In the decoding phase, each new token interacts with all preceding tokens, requiring the full key-value (KV) cache to be fetched from memory.
reuse important token set from prefill stage¶
the same set of important tokens selected during prefill is reused to compress the KV cache: high-bit quantization for the caches of important tokens and low-bit quantization for those of lesser importance.
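a minimal sketch of such mixed-precision, per-token quantization; the 8-bit/4-bit widths, the min-max scaling, and uint8 storage (real low-bit storage would pack codes) are illustrative assumptions, not necessarily what ZipVL uses:

```python
import numpy as np

def quantize_kv(kv, T, hi_bits=8, lo_bits=4):
    """Mixed-precision per-token quantization of one layer's K or V cache.

    kv : (n, d) float cache; T : indices of important tokens.
    The 8/4-bit split and min-max scaling are illustrative assumptions.
    """
    n = kv.shape[0]
    bits = np.full(n, lo_bits)
    bits[T] = hi_bits                            # important tokens keep precision
    qmax = 2.0 ** bits - 1                       # quantization levels per token
    lo = kv.min(axis=1, keepdims=True)
    scale = (kv.max(axis=1, keepdims=True) - lo) / qmax[:, None]
    scale = np.where(scale == 0, 1.0, scale)     # guard all-constant rows
    codes = np.round((kv - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    # reconstruct an approximate cache for attention at decode time
    return codes * scale + lo
```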
(figure: ZipVL overview)
Further readings¶
ViT¶
unstructured/semi-structured/structured sparse attention¶
KV cache compression¶
- token dropping-based
- token merging-based
- quantization-based