kenna@lemm.ee (mod) to star technology and research @lemm.ee · English · edited 2 years ago

Efficient Streaming Language Models with Attention Sinks - #AI

arxiv.org

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
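
The mechanism described in the abstract is simple to sketch in code: the KV cache permanently retains the first few tokens (the attention sinks) plus a rolling window of the most recent tokens, evicting everything in between. Below is a minimal, hypothetical Python sketch of that eviction policy; it is not the authors' mit-han-lab/streaming-llm implementation, and the names (SinkKVCache, kv_entry) are illustrative. The default of 4 sink tokens follows the paper.

```python
from collections import deque

class SinkKVCache:
    """Minimal sketch of StreamingLLM-style KV-cache eviction:
    keep the KV states of the first `n_sink` tokens (attention sinks)
    plus a rolling window of the `n_recent` most recent tokens."""

    def __init__(self, n_sink: int = 4, n_recent: int = 1020):
        self.n_sink = n_sink
        self.sinks = []                        # permanent: KV of the initial tokens
        self.recent = deque(maxlen=n_recent)   # rolling: deque evicts its oldest entry itself

    def append(self, kv_entry):
        """Record the KV state of one newly decoded token."""
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv_entry)        # the first tokens become permanent sinks
        else:
            self.recent.append(kv_entry)       # middle tokens eventually fall out of the window

    def view(self):
        """KV states attention sees at this step: sinks + recent window (bounded size)."""
        return self.sinks + list(self.recent)
```

One detail omitted above: per the paper, positional encodings are assigned relative to positions within the cache rather than in the original text, which is what lets the model run past its training window.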

star technology and research @lemm.ee

star_adjacent@lemm.ee

For content generally related to, or applicable to, the design, engineering, or use of socially therapeutic robots.

Visibility: Public

This community can be federated to other instances, and their users can post and comment in it.

  • 2 users / day
  • 2 users / week
  • 2 users / month
  • 2 users / 6 months
  • 1 local subscriber
  • 56 subscribers
  • 41 Posts
  • 0 Comments
  • mods:
  • kenna@lemm.ee