LLM Decode On An NPU, Without Hand-Waving

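Sources: what this article is based on

This article is based on the deep-research package at /Users/USER/Documents/NPU_LLM_Decode_Research_20260605. The most important evidence is primary material. ATOM and vLLM-RBLN behavior follows the RBLN docs and whitepapers, the GPU execution model follows NVIDIA documentation, and the Decode arithmetic-intensity and dataflow principles follow published papers.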

RBLN ATOM architecture: Neural Engine, Command Processor, local/global SRAM, NoC, PCIe, and the GDDR6 hierarchy [1], [2].

RBLN compiler and runtime: compile-time SRAM placement, dependency analysis, tiling, command scheduling, and optimized RBLN graph generation [3], [13].

vLLM and LLM decode: arithmetic intensity of GEMV versus GEMM, the Prefill/Decode difference, and KV-cache pressure [5], [11].

GPU baseline: the CUDA SIMT model, shared memory as a user-managed scratchpad, and H100 bandwidth/product context [9], [10].

Dataflow background: the TPU's software-managed memory and Eyeriss's data-movement-centric accelerator design [7], [8].

Spatial decode comparison: WaferLLM, which shows that Prefill and Decode can require different placement strategies [6].

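Scope note: public material alone cannot reveal every microarchitectural detail of RBLN ATOM. Wherever this article says "the schedule wants" or "a plausible mapping", it is an interpretation built on published memory, compiler, and runtime facts. It is not an assertion about undisclosed internal implementation.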

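Mental model: a small live vector, huge static weights, and an ever-growing dynamic cache

During Decode, each request produces only one new token at a time. Call that token's hidden vector x_t. The model weights are large and mostly fixed. The KV cache, by contrast, is both large and dynamic: it grows with sequence length, and its physical layout changes as requests arrive, finish, or share a prefix. Outputs and partial sums are temporary. They matter for a very short time, and once the computation finishes the result moves on as the next layer's input.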

Four kinds of data with different characters live inside one Decode step

x_t: a small live activation. It is reused across many output rows.

W_tile: huge static weights, usually streamed tile by tile.

psum: partial sums. They need to stay close to compute until the reduction finishes.

KV[0:t]: an ever-growing dynamic cache. Attention reads it, and every token appends to it.

x_{t+1}: the next live state, handed to the next layer or to the logits head.

Key point: Decode is not just "one matmul". It is a small live vector passing between huge static tensors and a dynamic memory structure.

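This one picture explains most of the hardware view. A GPU asks: can enough parallel work be issued to keep the machine busy while memory arrives? An NPU asks a slightly different question: can the graph be laid out ahead of time so that data movement is predictable, stays local, and overlaps with compute? Both care about locality. The difference lies in how much of that locality is discovered dynamically by runtime caches and schedulers, and how much is arranged by the compiler and runtime before execution.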

End-to-end lifecycle: one token is a data-movement program

One example runs through the rest of this article. Suppose request req_A has just finished Prefill over 128 prompt tokens. It now decodes tok_129. The current hidden state is x_129, and the cache already holds the blocks for the 128 prompt tokens (tokens 0 through 127).

One Decode token passing through one transformer layer

Layer input: x_129 arrives from the previous layer.

Linear projections: stream the Wq/Wk/Wv tiles while x_129 is held in place or reused.

Attention over the cache: read the existing K/V blocks and append the new K/V for token 129.

MLP projections: repeat the stream, accumulate, reduce pattern.

Handoff: emit the next hidden state, or the logits used for sampling.

Here "compute" is surrounded by movement: weights are fetched from off-chip memory, temporary sums are built in SRAM, KV blocks are read through dynamic addresses, and layer state is handed on.

The hidden cost in this lifecycle is not the multiplication itself. If the operands already sit next to the compute units, multiplication is cheap. The real cost is operand delivery: moving the needed bytes to the needed place at the needed time, and not writing them back to expensive memory before all useful reuse has been extracted.

Why Prefill and Decode demand different execution styles

The vLLM report explains the difference with a very simple calculation. For a BF16 matrix-vector product y = W x, with W of shape m x n, the accelerator performs roughly 2mn FLOPs while reading roughly 2mn bytes of weights. Arithmetic intensity is therefore about 1 FLOP/byte [5]. For a matrix-matrix product Y = W X with batch size b, the same weight tile can be reused across b vectors, so intensity rises in proportion to b [5].

Prefill: many tokens processed at once

X: prompt token matrix

W: one weight tile, used repeatedly

Y: many outputs

This case is close to a GEMM shape. The same weight bytes serve many token columns, so the compute units are more likely to stay busy.

Decode: one token per request

x_t: one live vector

W: still huge weights

y_t: one output vector

Unless many requests are batched together, this shape is close to GEMV. With little reuse across tokens, weight reads become the bottleneck much sooner.

WaferLLM draws the same distinction from a spatial, local-memory perspective. It argues that Prefill and Decode need different strategies, and that placement in particular has to change for GEMV-dominated Decode [6]. This does not mean ATOM implements the same scheme as WaferLLM. At the broad architectural level, though, the lesson is the same: Decode is not "a smaller Prefill" but a different dataflow problem.

Tensor shapes: what actually moves within one layer

Answering follow-up questions requires concrete tensor shapes. Consider a Llama-family decoder layer with hidden size H, n_q query heads, n_kv KV heads, head dimension d_h, MLP intermediate size I, prompt length T, and decode batch B. In Prefill, B * T token vectors pass through the layer. In Decode, the live token matrix thins out to roughly B x H.

Object	Prefill shape	Decode shape	Why it matters to the NPU
Layer input	`[B, T, H]`	`[B, 1, H]` or `[B, H]`	Prefill has many token columns, while Decode supplies only a thin live vector per request.
Q projection	`[B*T, H] @ [H, n_q*d_h]`	`[B, H] @ [H, n_q*d_h]`	When `B` is small, the opportunity to reuse the same weights shrinks sharply.
K/V projection	`[B*T, H] @ [H, n_kv*d_h]`	`[B, H] @ [H, n_kv*d_h]`	Decode must write one new K/V vector per request into persistent cache state.
Attention cache read	Mostly built while prompt chunks are processed	`K,V: [B, T_so_far, n_kv, d_h]`	As `T_so_far` grows, attention becomes state traversal as well as compute.
MLP up/gate/down	Large GEMM over `B*T` rows	GEMV or small GEMM over `B` rows	MLP matrices are large and see little reuse in Decode, so they can generate heavy weight traffic.
Partial sums	Temporary reductions over many rows	Few rows, many output channels	Keeping partial sums in SRAM avoids writing unfinished values to GDDR6.

This table is the reference for answering "why is Decode different?" precisely. The transformer layer itself has not changed; the same layer simply runs with a different shape. Prefill is thick in the token dimension, while Decode is thin in the live token dimension but has a cache dimension that keeps growing. The NPU compiler and runtime must schedule both together: a regular graph over weights and a dynamic state structure over KV.
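
The shape table can be turned into a small sketch that prints the matmul shapes for each phase. The `LayerConfig` class and `projection_shapes` helper are hypothetical names introduced only for illustration, and the example dimensions (H=4096, n_q=32, n_kv=8, d_h=128, I=14336) are assumed Llama-style values rather than figures taken from the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """Illustrative decoder-layer dimensions (assumed, not from the sources)."""
    H: int      # hidden size
    n_q: int    # query heads
    n_kv: int   # KV heads
    d_h: int    # head dimension
    I: int      # MLP intermediate size


def projection_shapes(cfg, B, T, phase):
    """Return (activation shape, weight shape) per projection for one layer."""
    # Prefill pushes B*T token rows through the layer; Decode pushes only B.
    rows = B * T if phase == "prefill" else B
    return {
        "q_proj": ((rows, cfg.H), (cfg.H, cfg.n_q * cfg.d_h)),
        "kv_proj": ((rows, cfg.H), (cfg.H, cfg.n_kv * cfg.d_h)),
        "mlp_up": ((rows, cfg.H), (cfg.H, cfg.I)),
    }


cfg = LayerConfig(H=4096, n_q=32, n_kv=8, d_h=128, I=14336)
for phase in ("prefill", "decode"):
    shapes = projection_shapes(cfg, B=4, T=128, phase=phase)
    # The weight shapes are identical in both phases; only the row count changes.
    print(phase, shapes["q_proj"], shapes["mlp_up"])
```

The row count of the activation matrix is also the number of token rows that can share each loaded weight element, which is why the same weights see far less reuse in Decode.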

Same layer, different tensor geometry

Prefill geometry

[B*T, H] is wide enough that one weight tile can be reused across many prompt-token rows. This is where a GEMM dataflow fits naturally.

weight tile loaded once -> used by token rows 0..T-1

Decode geometry

[B, H] can be small. The weight tile is still large, but there are few live token rows, so the schedule has to fight memory traffic.

weight tile loaded once -> used by only the current B tokens

Decode roofline arithmetic: why asking about peak TOPS first is not enough

An explanation of Decode is easier to defend when it cites a roofline-style bound. In a BF16 matrix-vector product y = W x, a matrix of shape m x n needs roughly 2mn FLOPs. Reading the BF16 weights also takes roughly 2mn bytes, even before counting activation, output, metadata, and cache traffic. Looking at weight traffic alone, arithmetic intensity is therefore about 1 FLOP / byte [5].

GEMV decode, BF16 weights:
FLOPs ~= 2 * m * n
weight bytes ~= 2 * m * n
arithmetic intensity ~= 1 FLOP / byte

With Y = W X over b token vectors, the same weight bytes serve b vectors, so a simple calculation gives an intensity of roughly b FLOPs / byte [5]. That is why Prefill approaches a compute-friendly GEMM, while Decode hits bandwidth and locality limits much sooner.

GEMM-like prefill or large decode batch:
FLOPs ~= 2 * m * n * b
weight bytes ~= 2 * m * n
arithmetic intensity ~= b FLOPs / byte
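
The two formula blocks above can be checked numerically. The sketch below is only arithmetic: `weight_only_intensity` reproduces the weight-traffic ratio, and `total_traffic_intensity` (an added variant, not from the sources) also counts the input and output vectors to show how little they change the picture for a large matrix. Function names and the 4096 x 4096 example shape are assumptions.

```python
def weight_only_intensity(m, n, b=1, bytes_per_element=2):
    """FLOPs per weight byte for Y = W X, with W of shape m x n and b columns."""
    flops = 2 * m * n * b
    weight_bytes = bytes_per_element * m * n
    return flops / weight_bytes


def total_traffic_intensity(m, n, b=1, bytes_per_element=2):
    """Same ratio, but also counting the input X (n x b) and output Y (m x b)."""
    flops = 2 * m * n * b
    total_bytes = bytes_per_element * (m * n + n * b + m * b)
    return flops / total_bytes


for b in (1, 8, 64):
    print(
        f"b={b}: weight-only {weight_only_intensity(4096, 4096, b):.2f}, "
        f"with activations {total_traffic_intensity(4096, 4096, b):.2f} FLOPs/byte"
    )
```

With BF16 weights (2 bytes per element) the weight-only ratio equals b exactly; the variant that counts activations stays slightly below it.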

Question	Weak answer	Better answer
"Is an NPU fast enough for LLM Decode?"	"It has high TOPS."	"Peak TOPS only matters when data arrives on time. Decode is often bound by weight and KV movement, so SRAM locality and graph scheduling matter."
"Can't the Prefill kernel be reused as is?"	"Decode is smaller."	"The tensor shape changes from many token rows to one row per request. Unless the serving batch is large, weight reuse drops sharply."
"What can an NPU actually improve?"	"It makes matmul fast."	"It can cut wasted movement: keep activations and partial sums local, fetch weights in predictable tiles, overlap DMA with compute, and handle KV blocks through a controlled path."

So framing "GPU or NPU, which is better?" as a simple ranking is misleading. GPUs have enormous bandwidth and a flexible kernel ecosystem. An inference NPU may have lower off-chip bandwidth, but it can lean on compiler-planned local movement and power efficiency at specific serving shapes. A fair comparison measures tokens/s/W at the actual batch size, context length, cache hit rate, and model size.
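
A back-of-the-envelope bound makes the bandwidth argument concrete. The sketch below assumes every decode step streams all weights from off-chip memory exactly once and that nothing else limits throughput, so it is an optimistic ceiling and not a performance prediction. The 256 GB/s figure is the whitepaper bandwidth [2]; the 1-billion-parameter model and the function name are hypothetical.

```python
def decode_tokens_per_second_bound(weight_bytes, bandwidth_bytes_per_s,
                                   batch=1, kv_bytes_per_step=0):
    """Upper bound on decode throughput when off-chip traffic is the only limit.

    Assumes one full pass over the weights per decode step, shared by the
    whole batch, plus an optional amount of KV-cache bytes read per step.
    """
    step_seconds = (weight_bytes + kv_bytes_per_step) / bandwidth_bytes_per_s
    return batch / step_seconds


BANDWIDTH = 256e9              # 256 GB/s, from the ATOM whitepaper [2]
WEIGHT_BYTES = 1e9 * 2         # hypothetical 1B-parameter model in BF16

for batch in (1, 8):
    bound = decode_tokens_per_second_bound(WEIGHT_BYTES, BANDWIDTH, batch=batch)
    print(f"batch={batch}: at most {bound:.0f} tokens/s from weight traffic alone")
```

The bound scales with batch size because the same weight pass serves every request in the batch, while any KV bytes added per step pull it back down.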

GPU versus NPU: a difference in how control is exercised

CUDA presents the GPU as a flexible SIMT execution model whose basic units are grids, blocks, threads, warps, registers, global memory, caches, and shared memory [9]. An H100-class GPU adds very high memory bandwidth and Tensor Core throughput on top of that [10]. The usual performance question is whether a kernel exposes enough parallel work and memory coalescing to turn that flexibility into real performance.

The published RBLN ATOM architecture puts its weight elsewhere. The key pieces are the Neural Engines, local SRAM, shared SRAM, the NoC, DMA commands, the task manager, and a compiler that plans SRAM placement and dependencies before execution [1], [2], [3]. This does not mean parallelism matters less. It means data movement is scheduled more explicitly as part of the model graph.

Two scheduling instincts for the same Decode layer

The GPU instinct

Launch kernels: use SIMT parallelism and library kernels.

Hide latency: run enough warps while memory requests are outstanding.

Use caches/shared memory: the kernel author tiles hot data wherever possible.

The NPU instinct

Compile graph: fix the graph shape, batch variants, and device placement.

Schedule movement: plan DMA, SRAM residency, and compute dependencies.

Admit controlled dynamic state: handle the KV block table and runtime address resolution.

A blunt split into "GPUs are dynamic, NPUs are static" does not hold. More precisely, GPU flexibility shows up through kernels and runtime scheduling, while ATOM expresses more of its execution as movement planned by the compiler and runtime over fixed SRAM resources.

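The ATOM memory hierarchy: where should a byte live?

According to the RBLN architecture documentation, each Neural Engine in ATOM has 4 MB of local SRAM, all Neural Engines share 32 MB of global SRAM, and the device has 16 GB of off-chip GDDR6 [1]. The ATOM whitepaper likewise lists 64 MB of on-chip SRAM, 8 Neural Engines, 16 GB of GDDR6, and 256 GB/s of memory bandwidth for the RBLN-CA12 [2]. The two sets of figures are consistent: 8 x 4 MB of local SRAM plus 32 MB of shared SRAM adds up to 64 MB. These numbers are more than a spec listing. They hint at which objects should be staged close to compute and which should be streamed.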

A memory hierarchy for understanding Decode

Compute lanes (use now): MAC/SIMD/MIMD elements inside a Neural Engine.

Local SRAM (4 MB / NE [1]): per-Neural-Engine scratchpad for the current tile, activations, and partial sums.

Shared SRAM (32 MB [1]): cross-engine workspace, tiled attention state, and shared staging space.

GDDR6 (16 GB [1], [2]): home of the large model weights and KV cache blocks that cannot realistically live in SRAM.

Host / cluster (PCIe / RSD): serving runtime, multi-device execution, and running models larger than one device through RSD.

SRAM is not just "fast memory". On an accelerator like this, it is the workspace where the compiler's schedule turns into actual physical placement.

A common mistake is to treat only the 16 GB of GDDR6 as "accelerator memory" and think of SRAM as a bonus. In Decode, SRAM is the schedule's workbench: the place where x_t, W_tile, psum, and local attention fragments meet. GDDR6 is where the large static and dynamic objects that cannot fit on chip reside.
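
A rough capacity check shows why some objects can sit in SRAM while others have to be streamed. The sketch compares byte sizes against the published 4 MB local and 32 MB shared SRAM figures [1]. Everything else is an assumption made for illustration: megabytes are treated as binary (1024 * 1024 bytes), the dimensions are Llama-style example values, the 256-row tile is arbitrary, and the check ignores double buffering and anything else a real compiler must reserve.

```python
MIB = 1024 * 1024
LOCAL_SRAM_BYTES = 4 * MIB     # per Neural Engine [1]; binary MB is an assumption
SHARED_SRAM_BYTES = 32 * MIB   # shared by all Neural Engines [1]


def nbytes(shape, bytes_per_element=2):
    """Size in bytes of a dense tensor; defaults to BF16 (2 bytes per element)."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


def placement(size_bytes):
    """Naive capacity-only classification of where an object could live."""
    if size_bytes <= LOCAL_SRAM_BYTES:
        return "local SRAM"
    if size_bytes <= SHARED_SRAM_BYTES:
        return "shared SRAM"
    return "stream from GDDR6"


H, I = 4096, 14336             # assumed hidden and MLP intermediate sizes
objects = {
    "x_t [H]": nbytes((H,)),
    "W_tile [256, H]": nbytes((256, H)),
    "psum [256] fp32": nbytes((256,), bytes_per_element=4),
    "W_up [H, I]": nbytes((H, I)),
}
for name, size in objects.items():
    print(f"{name}: {size} bytes -> {placement(size)}")
```

Under these assumptions the live vector, one weight tile, and its partial sums fit comfortably in a single local SRAM, while a full MLP weight matrix does not fit on chip at all.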

Linear layer walkthrough: the real question is stationarity

Take the Decode projection y = W x_t. The key question is not whether the hardware can multiply; of course it can. The real question is which object to keep stationary, which to stream, and where to hold the partial sums until the final output is ready.

A plausible tiled GEMV dataflow for Decode

1. Pin x_t: the current token vector is small. Keep it close or broadcast it.

2. Stream W_tile: the weights are large. Bring one tile at a time from off-chip memory into SRAM.

3. Accumulate psum: keep partial outputs local until every tile has contributed.

4. Handoff y: pass only the finished output to the next operator.

The word "matmul" hides this scheduling problem. The verbs that matter here are pin, stream, accumulate, reduce, and hand off.

Pseudo-schedule: decode projection, not RBLN source code

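for out_tile in output_tiles:
    # One accumulator per output tile, held in local SRAM.
    psum = local_sram.zeros(out_tile)

    for k_tile in reduction_tiles:
        # Stream one weight tile from off-chip GDDR6 into SRAM.
        W_tile = dma_gddr6_to_sram(W[out_tile, k_tile])
        # x_t is already resident; take the matching slice.
        x_tile = local_view(x_t[k_tile])
        psum += neural_engine_mac(W_tile, x_tile)

    # Only the finished output tile leaves local memory.
    y[out_tile] = reduce_and_write(psum)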

What it means: the accelerator is not trying to hold all of W on chip. It is trying to get the most out of each tile while that tile is close, and to keep psum local until the moment the final output has to be written.
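
The pseudo-schedule can be mirrored by a runnable sketch that uses plain Python lists so it has no dependencies. It is only a functional model of the loop order (output tiles outside, reduction tiles inside, one accumulator per output tile). Slicing stands in for DMA, and the function name and default tile sizes are hypothetical; nothing here reflects how RBLN actually maps the operator.

```python
def tiled_gemv(W, x, out_tile=2, k_tile=3):
    """Compute y = W x with an output-tile / reduction-tile loop nest.

    W is a list of rows, x is a list of numbers. Each output tile keeps its
    partial sums in a small local list until all reduction tiles are done.
    """
    m, n = len(W), len(x)
    y = [0.0] * m
    for o in range(0, m, out_tile):
        rows = range(o, min(o + out_tile, m))
        psum = [0.0 for _ in rows]                  # stays "local" for this tile
        for k in range(0, n, k_tile):
            x_tile = x[k:k + k_tile]                # slice of the resident vector
            for i, r in enumerate(rows):
                w_tile_row = W[r][k:k + k_tile]     # stands in for a streamed tile
                psum[i] += sum(w * v for w, v in zip(w_tile_row, x_tile))
        for i, r in enumerate(rows):
            y[r] = psum[i]                          # write only finished outputs
    return y


W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
x = [1, 0, -1, 2]
print(tiled_gemv(W, x))
```

Each element of W is touched exactly once and each partial sum is written out once, which is the property the schedule is trying to preserve on real hardware.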

Example: if x_t is reused across many output rows, fetching it repeatedly from off-chip memory is wasteful. If psum is spilled on every tile, values that are not even final generate extra traffic.

Common misconception: "activation stationary" and "output stationary" are not slogans. They are concrete choices about which bytes stay put and which bytes flow around them.

Attention path: Decode reads old state and writes new state

The linear projection story explains weight traffic. Attention adds a second pressure: persistent KV state. At token t, the layer produces q_t, k_t, and v_t. The new k_t and v_t must be appended to the cache, and then q_t attends over the cached K[0:t] and V[0:t]. Unlike weights, this cache is not a fixed tensor whose addresses are neatly settled at compile time. It depends on the request history.

One Decode attention step is stateful data movement

x_t: the hidden vector of the current token.

Q,K,V projections: weight tiles are streamed and the current token produces its new Q/K/V.

Cache read: existing K/V blocks are fetched through block-table addresses.

Softmax/value mix: tile reductions stay on chip where possible.

Append K/V: the cache state is updated for the next token.

Explaining attention concretely means separating the memory actions that usually get lumped into "run attention". First, project the current token into Q/K/V. Second, locate this request's existing KV blocks. Third, compute attention scores and value accumulation over a tiled working set. Fourth, write the new K/V vectors to the cache. The third step is compute-heavy enough that tiling is worth discussing. The second and fourth are serving-state problems.

Attention substep	Data object	Movement problem	NPU-friendly framing
Q/K/V projection	Current `x_t`, projection weights	Large weights, small current-token batch	Use the same tiled projection logic as GEMV or small GEMM.
KV lookup	Block table, physical KV blocks	Logical order differs from physical memory order	Resolve block addresses on a controlled dynamic path. RBLN describes dynamic DMA and CP address evaluation [4].
Score and softmax	`q_t`, tiled `K`, online softmax state	A long context does not fit in local SRAM as a whole	Tile the context and keep the running max/sum and the active fragment local.
Value accumulation	Softmax probabilities and tiled `V`	Partial outputs must be accumulated across all context blocks	Keep the partial attention output local until the context scan finishes.
KV append	New `k_t`, `v_t`	The cache allocator decides the next slot/block location	Write the new vectors into block-structured cache state.

The similarity to the linear layer is visible: partial sums are best not spilled, and a tile should be used as fully as possible while it is on chip. Attention adds one twist, though. The sequence dimension grows with every token, so cache reads get more expensive as context length increases. Long-context Decode therefore becomes a combined problem of SRAM tiling, cache layout, and request scheduling.
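
The "running max/sum" idea from the table can be sketched for a single head and a single query vector. The function below scans the context one KV block at a time and keeps only three small pieces of state between blocks: a running maximum, a running denominator, and a partial output. This is the standard online-softmax formulation written in plain Python; the function name and data layout are hypothetical and it is not a description of the RBLN kernel.

```python
import math


def tiled_decode_attention(q, k_blocks, v_blocks):
    """Single-query attention computed block by block with online softmax.

    q is a list of floats. k_blocks and v_blocks are lists of blocks, and each
    block is a list of key or value vectors for consecutive tokens.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(v_blocks[0][0])           # partial attention output

    for k_block, v_block in zip(k_blocks, v_blocks):
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier contributions to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Because only `running_max`, `running_sum`, and `acc` survive from one block to the next, the working set is bounded by the block size rather than by the context length.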

The compiler is part of the architecture

The RBLN compiler material explains that memory allocation, SRAM use, and dependency management are handled at compile time [3]. The Optimum RBLN Llama documentation says that conversion moves the Hugging Face checkpoint weights into an optimized RBLN graph and then compiles that graph [13]. vLLM-RBLN exposes settings that reflect the same physical pressures, such as max_model_len, block_size, prefix_block_size, decoder_batch_sizes, and RBLN device placement [12].

From model graph to executable movement at Decode time

HF model: the Llama graph and checkpoint tensors.

RBLN graph: weights are moved into an optimized graph [13].

Compiler: tiling, fusion, memory allocation, and dependency analysis [3].

Commands: ordered so that DMA and compute overlap as far as dependencies allow.

Runtime: picks the graph variant, fixes device placement, and manages cache state.

On this path, "compilation" is not a mere packaging step. It is the point where SRAM and movement decisions become part of the executable.

The important point is that dynamic serving does not disappear. Request lengths and batch composition keep changing. The runtime, however, tries to express that dynamism in shapes the compiled system can handle. A visible example is decoder_batch_sizes: RBLN can compile several decoder graphs and select the smallest one that fits the number of in-flight requests [12]. That is quite different from a single, fully generic kernel path.

KV cache: where the regular graph meets irregular serving

The KV cache is the trickiest part of this story. The model graph is regular, but the serving state is not. Requests differ in length, finished requests return their blocks, and a shared prefix may allow earlier cache to be reused. Attention must still read the logical sequence in order even when the physical memory blocks are not contiguous.

Logical token blocks mapped to physical KV blocks

Logical block	Physical block	Tokens
blk 0	P17	0-31
blk 1	P04	32-63
blk 2	P23	64-95
blk 3	P08	96-127

Other physical blocks in the pool: P05 is free, and P31 belongs to another request.

PagedAttention reduces memory fragmentation by decoupling logical sequence order from physical block placement [11]. The NPU-specific question for RBLN is this: how can a kernel follow that indirection without giving up planned movement?

The RBLN LLM-serving whitepaper gives an important clue. RBLN's PagedAttention path is compatible with the vLLM block table, passes the block table into the kernel, and uses dynamic DMA so that the Command Processor can evaluate addresses on the fly and access arbitrary DRAM locations [4]. This is the compromise: high-level execution stays compiled and SRAM-aware, while the KV cache, a genuinely dynamic structure, is granted a controlled dynamic address path.
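
The indirection itself is simple to model. The sketch below uses the mapping from the diagram above (logical blocks 0 to 3 of req_A living in physical blocks 17, 4, 23, and 8, with 32 tokens per block and block 5 free). The dictionary layout and both helper functions are hypothetical illustrations of the lookup and append steps; they are not the vLLM or RBLN data structures or APIs.

```python
BLOCK_SIZE = 32                                  # tokens per KV block, as in the diagram

block_table = {"req_A": [17, 4, 23, 8]}          # logical block index -> physical block
free_blocks = [5]                                # physical blocks not owned by anyone


def physical_slot(table, req, token_idx, block_size=BLOCK_SIZE):
    """Translate a logical token index into (physical block, offset in block)."""
    logical_block, offset = divmod(token_idx, block_size)
    return table[req][logical_block], offset


def append_slot(table, req, num_cached_tokens, free, block_size=BLOCK_SIZE):
    """Return the slot for the next token, taking a free block when needed."""
    logical_block, offset = divmod(num_cached_tokens, block_size)
    blocks = table[req]
    if logical_block == len(blocks):             # current blocks are full
        blocks.append(free.pop())
    return blocks[logical_block], offset


print(physical_slot(block_table, "req_A", 40))   # token 40 lives in logical block 1
print(append_slot(block_table, "req_A", 128, free_blocks))
```

Attention walks the logical blocks in order and resolves each one through the table, so the physical order (17, 4, 23, 8) never has to be contiguous.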

NPU-friendly KV cache handling is controlled irregularity

Block table: the logical token blocks of req_A.

CP evaluates: runtime addresses are resolved inside the execution path [4].

Dynamic DMA: fetches physical KV blocks that are not contiguous.

SRAM tile: the active KV fragment is used close to attention compute.

Append new K/V: token 129 is written into cache state.

The runtime does not assume the cache is static. It makes the irregularity block-structured enough for the NPU path to handle.

Serving loop: why batching and graph variants matter

A single Decode token is already a memory program. Real serving wraps another loop around it: requests arrive, Prefill runs, requests move to Decode, some finish, and new ones join. The live Decode batch therefore changes from iteration to iteration. The NPU cannot assume the batch is one fixed number forever. At the same time, a compiler-planned accelerator does not want unlimited shape freedom either.

Dynamic serving wrapped around a compiled Decode path

Requests arrive: prompt length and output limit differ per request.

Prefill chunks: prompt tokens become the initial KV cache state.

Decode batch: each active request generates one token per step.

Graph choice: the runtime picks the matching compiled decoder variant [12].

Cache update: KV blocks are read, appended, shared, and freed.

The RBLN-specific clue is decoder_batch_sizes. The vLLM-RBLN documentation explains that one decoder graph is compiled per listed batch size, and that the smallest decoder graph fitting the number of in-flight requests is selected [12]. This is a good example of "controlled dynamism": the serving workload is dynamic, but the hardware-facing shape is bucketed into a small number of precompiled options.
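
The "smallest graph that fits" rule can be written as a few lines of bucketing logic. Only the setting name decoder_batch_sizes comes from the documentation [12]. The function below, its name, and its behavior when the batch exceeds every compiled size (it raises an error) are assumptions made for illustration, not the vLLM-RBLN implementation.

```python
import bisect


def select_decoder_batch_size(compiled_sizes, in_flight):
    """Pick the smallest compiled decoder batch size that fits the live batch."""
    if in_flight < 1:
        raise ValueError("need at least one in-flight request")
    sizes = sorted(compiled_sizes)
    idx = bisect.bisect_left(sizes, in_flight)   # first size >= in_flight
    if idx == len(sizes):
        raise ValueError(f"no compiled decoder graph fits batch {in_flight}")
    return sizes[idx]


decoder_batch_sizes = [1, 2, 4, 8]               # hypothetical compiled variants
for live in (1, 3, 5, 8):
    bucket = select_decoder_batch_size(decoder_batch_sizes, live)
    print(f"{live} in-flight -> graph for batch {bucket}, {bucket - live} padded rows")
```

The padded rows are the price of bucketing: a coarser list of sizes means fewer graphs to compile but more wasted rows when the live batch falls between buckets.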

Serving event	What changes	Why the NPU cares
A new request enters Prefill	Prompt tokens create many K/V entries	Prefill can be chunked and compiled to fit the max sequence length and cache-block constraints.
A request moves to Decode	Only one new token per step	The graph becomes GEMV/small-GEMM dominated, so SRAM reuse has to be captured carefully.
Batch size changes	The number of live token rows grows or shrinks	The runtime can pick a different decoder graph bucket instead of one fully generic shape [12].
Prefix cache hit	Some KV blocks can be reused	Block granularity has to match Prefill chunking and the KV-cache block constraints [12].
A request finishes	Its KV blocks can be reclaimed	Even when physical blocks are recycled, the next attention step must still see the logical sequence unchanged.

This topic fits the assignment well because it connects architecture directly to the RBLN SDK, vLLM-based LLM inference, compilation, serving, and benchmarking. It does not merely say "NPUs have SRAM". It explains how a real serving stack turns irregular LLM requests into shapes and block movements that an inference NPU can execute.

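Common pitfalls: plausible but shallow explanations

The selection notice says it looks for whether applicants understand the principles and can explain them in their own words. The following phrasings are best avoided in the presentation material and in the interview.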

Shallow phrasing	Why it is weak	Better phrasing
"Decode is slow because it is sequential."	True but incomplete. It does not explain the hardware bottleneck inside each step.	"Decode is sequential across generated tokens, and each step has low weight reuse and growing KV-cache reads."
"NPUs are fast because they have SRAM."	If the compiler schedule produces bad spills and reloads, SRAM alone solves nothing.	"SRAM pays off when the schedule keeps activations, partial sums, and active KV tiles local and streams the large weights predictably."
"Prefill is compute-bound, Decode is memory-bound."	Too absolute. Batch size, model size, precision, and cache length all play a role.	"Prefill moves toward high-arithmetic-intensity GEMM, while small-batch Decode moves toward low-reuse GEMV and KV traffic."
"PagedAttention solves the KV cache problem."	It reduces allocation waste and sharing problems, but block-table indirection remains.	"PagedAttention makes cache memory block-structured. The NPU still needs a way to follow block addresses efficiently and to tile attention."
"The compiler just converts the model."	It hides the core idea of the NPU.	"In ATOM-style execution, compilation plans graph lowering, tiling, SRAM placement, dependencies, and command scheduling [3], [13]."

Expected question: if Decode is memory-bound, why does batching help? Because one weight tile can serve more live token vectors before it is evicted. In the simple vLLM calculation, GEMV intensity is roughly 1 FLOP/byte, and a batch of b vectors pushes it toward b FLOPs/byte [5]. Batching is not free, though: latency can rise and cache scheduling gets more complicated.

Expected question: what exactly does "dataflow" mean here? It is the choice of where weights, activations, partial sums, and KV blocks reside while the computation proceeds. The dataflow choices that matter in Decode are activation reuse, weight streaming, partial-sum locality, attention tiling, and block-addressed KV movement.

Expected question: is this claiming that RBLN ATOM uses a particular stationarity? No. The public material does not disclose the exact internal mapping of every operator. This article uses standard dataflow terminology to explain what the published ATOM facts suggest. The basis is software-managed SRAM, compiler scheduling, DMA/compute overlap, and dynamic DMA for KV block addresses [1], [3], [4].

A small trace: the path token 129 takes through an ATOM-style NPU

The mental model above can be compressed into one concrete trace. This is not an exact RBLN execution dump. It is a source-grounded execution shape suggested by the published architecture.

The runtime selects a decoder graph: req_A joins the in-flight batch. If several decoder batch sizes have been compiled, the runtime can pick the smallest decoder variant that fits [12].

The compiler schedule already knows the memory plan: SRAM allocation, dependency ordering, and DMA/compute overlap are not improvised at token 129 [3].

The linear projection streams weight tiles: x_129 is small and reusable. W_tile is streamed through local/shared SRAM, and psum stays local until it is complete.

Attention consults the block table: the logical KV history is mapped to physical blocks. Dynamic DMA resolves locations without forcing a fixed contiguous allocation [4], [11].

The new K/V is appended: the cache changes. This is why Decode is both a linear algebra problem and a state-management problem.

The next layer receives compact state: the finished hidden vector moves forward. Temporary tiles and partial sums should not linger once they are no longer useful.

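Compressing this into a 5-slide presentation

The submitted PDF must be exactly 5 slides: 4 slides on the topic and 1 slide on the model to try on an NPU. The slides do not need to squeeze in this whole document. Keep this document as backup knowledge and show one clean thread in the presentation: Decode is a data-movement problem, and an ATOM-style NPU handles it with SRAM-aware compiled execution and controlled KV-cache dynamism.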

Slide 1: Decode is not a smaller Prefill

Key claim: Prefill has many token rows, while Decode has only one live row per request.

Figure: two tensor blocks, [B,T,H] for Prefill versus [B,H] for Decode.

Line to say: "The transformer layer is the same, but the shape changes. When the token dimension thins out, weight reuse collapses."

Slide 2: roofline intuition for Decode

Key claim: GEMV-like Decode can drop to around 1 FLOP/byte in terms of BF16 weight traffic [5].

Figure: 2mn FLOPs next to 2mn bytes, with batch b raising reuse.

Line to say: "The first question is not peak TOPS. It is whether operands arrive with enough reuse."

Slide 3: what the NPU wants to hold locally

Key claim: SRAM is the schedule's workspace. x_t, W_tile, psum, and the active KV fragment meet there.

Figure: a memory ladder of local SRAM, shared SRAM, GDDR6, and host/cluster.

Line to say: "The verbs that matter are pin, stream, accumulate, reduce, and hand off."

Slide 4: the KV cache is controlled irregularity

Key claim: Decode attention must read the existing K/V blocks and append new K/V on every token.

Figure: a logical block table mapped to non-contiguous physical blocks.

Line to say: "PagedAttention makes memory block-structured, and RBLN describes dynamic DMA and command-processor address evaluation for access to arbitrary DRAM locations [4], [11]."

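Slide 5: the model to try on an NPU

Model: Llama 3.1 8B Instruct, or a Gemma-family instruct model, depending on the RBLN environment provided.

Reason, about 200 characters:

The model I would like to run on an NPU is Llama 3.1 8B Instruct. It is close to a real production LLM, and it exposes the Prefill and Decode bottlenecks, KV cache management, and changing batch sizes all at once. Compiling, serving, and benchmarking it with the RBLN SDK and vLLM should show most clearly which data movement the NPU handles well and where it gets stuck.

Deck discipline: no title slide, agenda slide, or generic NPU intro. Each of the first 4 slides should consist of one claim, one concrete diagram, and one sentence that can be defended under questioning.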

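Practice questions: checking for real understanding

If the questions below feel comfortable, the mental model has settled in reasonably well. If they feel vague, go back to the diagrams rather than memorizing terminology.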

1. Stationarity

When computing y = W x_t at batch size 1, which of W, x_t, and psum should be kept stationary? What changes as the batch size grows?

2. SRAM pressure

If a tile does not fit in local SRAM, what bad outcomes should be avoided? Which costs arise among extra off-chip reads, partial-sum spills, and idle compute?

3. KV blocks

Why is a block table good for serving yet awkward for dataflow? What does dynamic DMA restore?

References

The full citation registry and evidence ledger are in sources.jsonl and evidence.jsonl next to this HTML file. The sources cited in this walkthrough are listed below.

[1] Rebellions. RBLN NPU Architecture. docs.rbln.ai

[2] Rebellions. ATOM Architecture: Finding the Sweet Spot for GenAI. PDF

[3] Rebellions. Understanding RBLN Compiler. rebellions.ai

[4] Rebellions. LLM Serving with NPU: Re-engineered, Built for Scale and Efficiency. PDF

[5] Kwon et al. vLLM: An Efficient Inference Engine for Large Language Models. PDF

[6] He et al. WaferLLM: Large Language Model Inference at Wafer Scale. PDF

[7] Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. Google Research

[8] Chen, Emer, and Sze. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow. NVIDIA Research

[9] NVIDIA. CUDA Programming Guide: Writing CUDA SIMT Kernels. docs.nvidia.com

[10] NVIDIA. H100 Tensor Core GPU. nvidia.com

[11] Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv

[12] Rebellions. vLLM RBLN Configuration Guide. docs.rbln.ai

[13] Rebellions. Optimum RBLN Llama Documentation. docs.rbln.ai