Nathan Moinvaziri (nmoinvaz)

  • Phoenix, United States
@nmoinvaz
nmoinvaz / benchmark_compress_results.md
Created February 21, 2026 00:19
zlib-ng compress benchmark: improvements/tally-v2 vs develop

Compress Benchmark: HEAD (improvements/tally-v2) vs develop

Environment

  • Platform: macOS Darwin 24.6.0, Apple Silicon (ARM64)
  • CPU: 8 cores, L1D 64 KiB, L1I 128 KiB, L2 4096 KiB
  • Build: CMake Release, static libs

Commits

  • HEAD (improvements/tally-v2): c51ce99e — Combine extra_lbits/base_length and extra_dbits/base_dist lookup tables (see the sketch after this list)
  • develop: 1b880ba9 — Make extra length/distance bits computation branchless using bit masking
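
To make the HEAD change concrete, here is a minimal sketch of folding two parallel lookup arrays into one table of pairs so a length code resolves with a single indexed load. The struct layout and values below are illustrative, not zlib-ng's actual extra_lbits/base_length tables.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: fold two parallel arrays (extra bits and base value per
 * length code) into one array of pairs, so each code needs a single indexed
 * load instead of two. The entries below are made up; the real tables cover
 * all 29 length codes and 30 distance codes. */
struct len_entry {
    uint8_t  extra_bits;   /* was extra_lbits[code] */
    uint16_t base;         /* was base_length[code] */
};

static const struct len_entry length_table[4] = {
    {0, 0}, {0, 1}, {1, 8}, {2, 16}
};

int main(void) {
    unsigned code = 2;
    /* One lookup yields both fields. */
    printf("code %u: extra=%u base=%u\n", code,
           (unsigned)length_table[code].extra_bits,
           (unsigned)length_table[code].base);
    return 0;
}
```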
@nmoinvaz
nmoinvaz / compress_block_bi_buf_register_optimization.md
Last active February 19, 2026 03:25
Zlib-ng PR 2167 analysis

Assembly Analysis: Keep bi_buf/bi_valid in Registers Across compress_block

Change

Hoist s->bi_buf and s->bi_valid into local variables in compress_block() and pass them by pointer to the emit functions. This eliminates redundant load/store pairs between zng_emit_lit and zng_emit_dist calls within the main compression loop.
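
A minimal sketch of that hoisting pattern, using simplified stand-ins for deflate_state and the emit helpers (none of the names or signatures below are zlib-ng's real ones):

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins; not zlib-ng's real deflate_state or emit helpers. */
typedef struct {
    uint64_t bi_buf;   /* pending output bits */
    int bi_valid;      /* number of valid bits in bi_buf */
} state;

/* Emit helpers take the hoisted values by pointer instead of the state
 * struct, so no load/store of s->bi_buf / s->bi_valid happens per call.
 * (Bit flushing is omitted; this demo stays well under 64 bits.) */
static inline void emit_bits(uint64_t *bi_buf, int *bi_valid,
                             uint64_t value, int length) {
    *bi_buf |= value << *bi_valid;
    *bi_valid += length;
}

static void compress_block_like(state *s) {
    uint64_t bi_buf = s->bi_buf;   /* hoist once on entry */
    int bi_valid = s->bi_valid;

    for (int i = 0; i < 8; i++) {
        emit_bits(&bi_buf, &bi_valid, 0x3u, 2);   /* e.g. a literal code */
        emit_bits(&bi_buf, &bi_valid, 0x5u, 3);   /* e.g. a distance code */
    }

    s->bi_buf = bi_buf;            /* write back once on exit */
    s->bi_valid = bi_valid;
}

int main(void) {
    state s = {0, 0};
    compress_block_like(&s);
    printf("bi_valid=%d bi_buf=%#llx\n", s.bi_valid, (unsigned long long)s.bi_buf);
    return 0;
}
```

The point is that the compiler can keep bi_buf and bi_valid in registers for the whole loop, touching the state struct in memory only at entry and exit.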

Results

bi_buf/bi_valid Memory Operations (offsets 168/176 from deflate_state*)

@nmoinvaz
nmoinvaz / conditional-preload-pr-2088.md
Created February 19, 2026 02:34
Zlib-ng PR 2088 conditional preload AI analysis

Conditional Preload Optimization Analysis

Comparison of develop (08fa4859) vs HEAD (conditional preload with MIN_HAVE=15).

The patch decodes the next iteration's Huffman symbol before performing the chunk copy, allowing the table lookup latency to overlap with copy operations. A can_preload flag skips the preload when the bit accumulator is low (the UNLIKELY 2+ literal path), keeping INFLATE_FAST_MIN_HAVE at 15 instead of 22.
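
A rough sketch of the scheduling idea, using a toy decode table and a plain memcpy instead of zlib-ng's real inflate_fast internals (every name below is illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define TBL_BITS 4u
#define TBL_MASK ((1u << TBL_BITS) - 1u)

/* Toy decode table entry: consumed bits + decoded symbol. */
typedef struct { uint8_t len; uint8_t sym; } entry;

int main(void) {
    entry tbl[1u << TBL_BITS];
    for (unsigned i = 0; i <= TBL_MASK; i++) {
        tbl[i].len = TBL_BITS;
        tbl[i].sym = (uint8_t)('A' + (i & 7u));
    }

    uint64_t hold = 0x0123456789abcdefULL;  /* bit accumulator */
    unsigned bits = 64;                     /* valid bits in hold */
    uint8_t window[16] = "chunk-copy-data";
    uint8_t out[64];
    uint8_t *dst = out;

    entry here = tbl[hold & TBL_MASK];      /* symbol for this iteration */
    for (int iter = 0; iter < 4; iter++) {
        hold >>= here.len;
        bits -= here.len;

        /* Preload: if enough bits remain, look up the next symbol *before*
         * the copy so the table-load latency overlaps with the memcpy. */
        int can_preload = bits >= TBL_BITS;
        entry next = can_preload ? tbl[hold & TBL_MASK] : here;

        memcpy(dst, window, 8);             /* stands in for the chunk copy */
        dst += 8;
        printf("emit %c\n", here.sym);

        if (!can_preload)                   /* low-bits path: decode after copy */
            next = tbl[hold & TBL_MASK];    /* (real code would refill hold first) */
        here = next;
    }
    return 0;
}
```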

Benchmark Results

@nmoinvaz
nmoinvaz / variant-matrix-pr-2139.md
Created February 18, 2026 05:42
Zlib-ng variant matrix for PR 2139

Functable Dispatch Matrix — x86 -march Variants

Extracted by inspecting undefined symbols in functable.c.o for each build — these are the function pointers the functable actually assigns at runtime. Builds use clang -target x86_64-apple-macos with runtime CPU detection enabled (the default).

-march native features

| -march  | SSE2   | SSSE3  | SSE4.1 | SSE4.2 | PCLMUL | AVX2 | AVX-512 | AVX512VNNI | VPCLMUL |
| ------- | ------ | ------ | ------ | ------ | ------ | ---- | ------- | ---------- | ------- |
| x86-64  | -      | -      | -      | -      | -      | -    | -       | -          | -       |
| nehalem | native | native | native | native | -      | -    | -       | -          | -       |
@nmoinvaz
nmoinvaz / deflate_sym_macros.h
Created February 11, 2026 21:51
Zlib-ng deflate symbol macros
/* ===========================================================================
 * Symbol buffer write/read macros.
 *
 * The symbol buffer stores literal and distance/length pairs. The storage
 * format differs based on LIT_MEM (separate buffers) vs sym_buf (interleaved),
 * and on whether the platform supports fast unaligned 32-bit access
 * (OPTIMAL_CMP >= 32), which allows packing a 3-byte symbol into a single
 * 32-bit write/read.
 *
 * SYM_WRITE_LIT and SYM_WRITE_DIST write a symbol and advance sym_next.
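
As a rough illustration of the OPTIMAL_CMP >= 32 case described above, the sketch below packs a 16-bit distance and an 8-bit length/literal code into one unaligned 32-bit write that still advances the cursor by only 3 bytes. It assumes a little-endian target, and the helper name is made up, not the gist's SYM_WRITE_DIST macro.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the gist's SYM_WRITE_DIST macro): pack dist low
 * byte, dist high byte, and lc into one little-endian 32-bit value and store
 * it with a single unaligned write, advancing the cursor by 3 bytes. */
static inline void sym_write_dist_packed(uint8_t *sym_buf, uint32_t *sym_next,
                                         uint16_t dist, uint8_t lc) {
    uint32_t packed = (uint32_t)dist | ((uint32_t)lc << 16);
    memcpy(sym_buf + *sym_next, &packed, sizeof(packed)); /* 4-byte store */
    *sym_next += 3;                                       /* 3-byte symbol */
}

int main(void) {
    uint8_t buf[16] = {0};
    uint32_t next = 0;
    sym_write_dist_packed(buf, &next, 0x0102, 0x7f);
    printf("bytes: %02x %02x %02x, next=%u\n", buf[0], buf[1], buf[2], (unsigned)next);
    return 0;
}
```

The 4-byte store writes one scratch byte past the 3-byte symbol, which is why this trick needs a byte of slack at the end of the buffer (or the next symbol simply overwrites it).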
@nmoinvaz
nmoinvaz / comparebench.py
Last active February 8, 2026 07:04
Deflatebench comparison
import re, sys

def parse_results(path):
    results = {}
    with open(path) as f:
        for line in f:
            # Match per-level result rows ("  1 ...") and the avg/avg2 summary rows.
            m = re.match(r'\s+(\d+|avg\d?)\s', line)
            if not m:
                continue
            level = m.group(1)
@nmoinvaz
nmoinvaz / benchmark.yml
Last active February 8, 2026 06:51
Zlib-ng benchmark workflow
name: Benchmark
on:
  issue_comment:
    types: [created]
  workflow_dispatch:
    inputs:
      pr_number:
        description: 'PR number to benchmark (results posted as PR comment)'
        required: false
        type: number
@nmoinvaz
nmoinvaz / functable_part.c
Created February 2, 2026 01:57
Zlib-ng functable without fallbacks
    // Set up generic C code fallbacks
#ifndef WITH_ALL_FALLBACKS
    // Only use necessary generic functions when no suitable simd versions are available.
    // These conditions mirror the native_* defines in arch/*_functions.h headers.
# if (defined(X86_SSE2) && defined(__SSE2__)) || (defined(ARCH_X86) && defined(ARCH_64BIT))
    ft.adler32 = &adler32_c;
    ft.adler32_copy = &adler32_copy_c;
    ft.crc32 = &crc32_braid;
    ft.crc32_copy = &crc32_copy_braid;
@nmoinvaz
nmoinvaz / benchmark_tally.cc
Created February 1, 2026 01:10
Zlib-ng benchmark for deflate tallying
/* benchmark_tally.cc -- benchmark sym_buf read/write strategies
 * Copyright (C) 2024 zlib-ng contributors
 * For conditions of distribution and use, see copyright notice in zlib.h
 *
 * Compares:
 * 1. LIT_MEM (separate d_buf/l_buf arrays)
 * 2. sym_buf with zng_memread_4/zng_memwrite_4 (batched)
 * 3. sym_buf with byte-by-byte access (original)
 */
@nmoinvaz
nmoinvaz / quick-bench-count-matches.cc
Last active January 26, 2026 04:06
Benchmark count matching bytes
#include <benchmark/benchmark.h>
#include <cstdint>

static inline uint32_t count_matching_bytes_ctzll(uint64_t mask) {
    return __builtin_ctzll(mask);
}

static inline uint32_t count_matching_bytes_ctz32(uint64_t mask) {
    uint32_t lo = (uint32_t)mask;
    if (lo)
        return __builtin_ctz(lo);
    return 32 + __builtin_ctz((uint32_t)(mask >> 32));
}