@nmoinvaz
Last active February 19, 2026 03:25
Zlib-ng PR 2167 analysis

Assembly Analysis: Keep bi_buf/bi_valid in Registers Across compress_block

Change

Hoist s->bi_buf and s->bi_valid into local variables in compress_block() and pass them by pointer to the emit functions. This eliminates redundant load/store pairs between zng_emit_lit and zng_emit_dist calls within the main compression loop.
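The shape of this change can be sketched in C. This is a simplified illustration under stated assumptions, not zlib-ng's actual code: `sketch_state`, `put_bits`, `emit_field`, the two `compress_block_*` variants, and the fixed 9-bit/13-bit code widths are all hypothetical stand-ins (the real emit functions look codes up in Huffman tables).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for deflate_state. */
typedef struct {
    uint8_t  pending_buf[256];
    uint32_t pending;   /* bytes already written to pending_buf */
    uint64_t bi_buf;    /* bit accumulator */
    uint32_t bi_valid;  /* number of valid bits in bi_buf, always < 64 */
} sketch_state;

/* Append the low `len` bits (1..32) of `val` to a caller-held accumulator,
 * flushing one little-endian 64-bit word to pending_buf when it fills up. */
static inline void put_bits(sketch_state *s, uint64_t *bi_buf, uint32_t *bi_valid,
                            uint64_t val, uint32_t len) {
    assert(len > 0 && len <= 32);
    uint64_t buf = *bi_buf;
    uint32_t valid = *bi_valid;
    uint32_t total = valid + len;
    val &= (((uint64_t)1 << len) - 1);
    buf |= val << valid;
    if (total < 64) {
        valid = total;
    } else {
        assert(s->pending + 8 <= sizeof s->pending_buf);
        for (int i = 0; i < 8; i++)
            s->pending_buf[s->pending++] = (uint8_t)(buf >> (8 * i));
        buf = val >> (64 - valid);  /* bits that did not fit; valid >= 32 here */
        valid = total - 64;
    }
    *bi_buf = buf;
    *bi_valid = valid;
}

/* Baseline shape: every emit call loads both fields and stores them back. */
static void emit_field(sketch_state *s, uint64_t val, uint32_t len) {
    uint64_t bi_buf = s->bi_buf;
    uint32_t bi_valid = s->bi_valid;
    put_bits(s, &bi_buf, &bi_valid, val, len);
    s->bi_buf = bi_buf;
    s->bi_valid = bi_valid;
}

static void compress_block_fields(sketch_state *s, const uint16_t *lits,
                                  const uint16_t *dists, size_t n) {
    for (size_t i = 0; i < n; i++) {
        emit_field(s, lits[i], 9);    /* assumed 9-bit literal code */
        emit_field(s, dists[i], 13);  /* assumed 13-bit distance code */
    }
}

/* Hoisted shape: load once, pass locals by pointer, store once at exit. */
static void compress_block_locals(sketch_state *s, const uint16_t *lits,
                                  const uint16_t *dists, size_t n) {
    uint64_t bi_buf = s->bi_buf;
    uint32_t bi_valid = s->bi_valid;
    for (size_t i = 0; i < n; i++) {
        put_bits(s, &bi_buf, &bi_valid, lits[i], 9);
        put_bits(s, &bi_buf, &bi_valid, dists[i], 13);
    }
    s->bi_buf = bi_buf;
    s->bi_valid = bi_valid;
}

/* Returns 1 if both shapes leave identical output and accumulator state. */
static int sketch_variants_agree(void) {
    uint16_t lits[16], dists[16];
    for (size_t i = 0; i < 16; i++) {
        lits[i] = (uint16_t)((i * 37u) & 0x1FF);
        dists[i] = (uint16_t)((i * 401u) & 0x1FFF);
    }
    sketch_state a, b;
    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    compress_block_fields(&a, lits, dists, 16);
    compress_block_locals(&b, lits, dists, 16);
    return a.pending == b.pending && a.bi_buf == b.bi_buf &&
           a.bi_valid == b.bi_valid &&
           memcmp(a.pending_buf, b.pending_buf, a.pending) == 0;
}
```

Calling `sketch_variants_agree()` from a small `main` checks that the two shapes produce the same bytes and the same final accumulator state. The only intended difference is where the struct fields are read and written, which is what the memory-operation counts below measure.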

Results

bi_buf/bi_valid Memory Operations (offsets 168/176 from deflate_state*)

| Architecture | Develop (baseline) | HEAD (registers) | Saved |
|---|---|---|---|
| AArch64 | 4 loads + 4 stores = 8 | 2 loads + 2 stores = 4 | 4 fewer mem ops |
| x86-64 | 4 loads + 4 stores = 8 | 2 loads + 2 stores = 4 | 4 fewer mem ops |

Total instruction count is unchanged on both architectures (169 on AArch64, 166 on x86-64). In place of the removed loads and stores, the compiler keeps the values in registers, so the change adds no instructions.

What was eliminated

In develop, each emit function loaded bi_buf/bi_valid from the struct at entry and stored them back at exit. Between consecutive calls in the loop (e.g. zng_emit_lit followed by zng_emit_dist), this created a store-then-reload round trip through memory for values that were already in registers.

In HEAD, only 2 accesses remain per field: 1 load at function entry, 1 store at function exit.

AArch64 — develop (8 bi_buf/bi_valid mem ops)

; mid-loop store-back after emit_lit
str  w3, [x0, #176]          ; STORE bi_valid
str  x17, [x0, #168]         ; STORE bi_buf

; mid-loop reload for emit_dist
ldr  w4, [x0, #176]          ; LOAD  bi_valid
ldr  x5, [x0, #168]          ; LOAD  bi_buf

; mid-loop reload for emit_end_block (after emit_dist)
ldr  w3, [x0, #176]          ; LOAD  bi_valid
ldr  x17, [x0, #168]         ; LOAD  bi_buf

; final write-back
str  w9, [x0, #176]          ; STORE bi_valid
str  x8, [x0, #168]          ; STORE bi_buf

AArch64 — HEAD (4 bi_buf/bi_valid mem ops)

; function entry — load once
ldr  x4, [x0, #168]          ; LOAD  bi_buf
ldr  w3, [x0, #176]          ; LOAD  bi_valid

; function exit — store once
str  x8, [x0, #168]          ; STORE bi_buf
str  w9, [x0, #176]          ; STORE bi_valid

x86-64 — develop (8 bi_buf/bi_valid mem ops)

; mid-loop store-back after emit_lit
movl  %r8d, 176(%rdi)        ; STORE bi_valid
movq  %r12, 168(%rdi)        ; STORE bi_buf

; mid-loop reload for emit_dist
movl  176(%rdi), %eax        ; LOAD  bi_valid
movq  168(%rdi), %r13        ; LOAD  bi_buf

; mid-loop reload for emit_end_block (after emit_dist)
movl  176(%rdi), %r8d        ; LOAD  bi_valid
movq  168(%rdi), %r12        ; LOAD  bi_buf

; final write-back
movl  %edx, 176(%rdi)        ; STORE bi_valid
movq  %rax, 168(%rdi)        ; STORE bi_buf

x86-64 — HEAD (4 bi_buf/bi_valid mem ops)

; function entry — load once
movq  168(%rdi), %r13        ; LOAD  bi_buf
movl  176(%rdi), %eax        ; LOAD  bi_valid

; function exit — store once
movq  %rax, 168(%rdi)        ; STORE bi_buf
movl  %edx, 176(%rdi)        ; STORE bi_valid

Compilation

Compiled with clang -O2 -std=c11 -DDISABLE_RUNTIME_CPU_DETECTION -DNDEBUG:

  • AArch64: -arch arm64 (Apple clang, native)
  • x86-64: -target x86_64-apple-macos -march=x86-64-v2