Nathan Moinvaziri (nmoinvaz)

  • Phoenix, United States
@nmoinvaz
nmoinvaz / benchmark_compress_results.md
Created February 21, 2026 00:19
zlib-ng compress benchmark: improvements/tally-v2 vs develop

Compress Benchmark: HEAD (improvements/tally-v2) vs develop

Environment

  • Platform: macOS Darwin 24.6.0, Apple Silicon (ARM64)
  • CPU: 8 cores, L1D 64 KiB, L1I 128 KiB, L2 4096 KiB
  • Build: CMake Release, static libs

Commits

  • HEAD (improvements/tally-v2): c51ce99e — Combine extra_lbits/base_length and extra_dbits/base_dist lookup tables (see the sketch after this list)
  • develop: 1b880ba9 — Make extra length/distance bits computation branchless using bit masking
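
To make the HEAD change concrete, here is a minimal sketch of folding two parallel lookup arrays into one table of pairs so a length code resolves with a single indexed load. The struct layout and values below are illustrative, not zlib-ng's actual extra_lbits/base_length tables.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: fold two parallel arrays (extra bits and base value per
 * length code) into one array of pairs, so each code needs a single indexed
 * load instead of two. The entries below are made up; the real tables cover
 * all 29 length codes and 30 distance codes. */
struct len_entry {
    uint8_t  extra_bits;   /* was extra_lbits[code] */
    uint16_t base;         /* was base_length[code] */
};

static const struct len_entry length_table[4] = {
    {0, 0}, {0, 1}, {1, 8}, {2, 16}
};

int main(void) {
    unsigned code = 2;
    /* One lookup yields both fields. */
    printf("code %u: extra=%u base=%u\n", code,
           (unsigned)length_table[code].extra_bits,
           (unsigned)length_table[code].base);
    return 0;
}
```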
@nmoinvaz
nmoinvaz / compress_block_bi_buf_register_optimization.md
Last active February 19, 2026 03:25
Zlib-ng PR 2167 analysis

Assembly Analysis: Keep bi_buf/bi_valid in Registers Across compress_block

Change

Hoist s->bi_buf and s->bi_valid into local variables in compress_block() and pass them by pointer to the emit functions. This eliminates redundant load/store pairs between zng_emit_lit and zng_emit_dist calls within the main compression loop.
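
A minimal sketch of that hoisting pattern, using simplified stand-ins for deflate_state and the emit helpers (none of the names or signatures below are zlib-ng's real ones):

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins; not zlib-ng's real deflate_state or emit helpers. */
typedef struct {
    uint64_t bi_buf;   /* pending output bits */
    int bi_valid;      /* number of valid bits in bi_buf */
} state;

/* Emit helpers take the hoisted values by pointer instead of the state
 * struct, so no load/store of s->bi_buf / s->bi_valid happens per call.
 * (Bit flushing is omitted; this demo stays well under 64 bits.) */
static inline void emit_bits(uint64_t *bi_buf, int *bi_valid,
                             uint64_t value, int length) {
    *bi_buf |= value << *bi_valid;
    *bi_valid += length;
}

static void compress_block_like(state *s) {
    uint64_t bi_buf = s->bi_buf;   /* hoist once on entry */
    int bi_valid = s->bi_valid;

    for (int i = 0; i < 8; i++) {
        emit_bits(&bi_buf, &bi_valid, 0x3u, 2);   /* e.g. a literal code */
        emit_bits(&bi_buf, &bi_valid, 0x5u, 3);   /* e.g. a distance code */
    }

    s->bi_buf = bi_buf;            /* write back once on exit */
    s->bi_valid = bi_valid;
}

int main(void) {
    state s = {0, 0};
    compress_block_like(&s);
    printf("bi_valid=%d bi_buf=%#llx\n", s.bi_valid, (unsigned long long)s.bi_buf);
    return 0;
}
```

The point is that the compiler can keep bi_buf and bi_valid in registers for the whole loop, touching the state struct in memory only at entry and exit.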

Results

bi_buf/bi_valid Memory Operations (offsets 168/176 from deflate_state*)

@nmoinvaz
nmoinvaz / conditional-preload-pr-2088.md
Created February 19, 2026 02:34
Zlib-ng PR 2088 conditional preload AI analysis

Conditional Preload Optimization Analysis

Comparison of develop (08fa4859) vs HEAD (conditional preload with MIN_HAVE=15).

The patch decodes the next iteration's Huffman symbol before performing the chunk copy, allowing the table lookup latency to overlap with copy operations. A can_preload flag skips the preload when the bit accumulator is low (the UNLIKELY 2+ literal path), keeping INFLATE_FAST_MIN_HAVE at 15 instead of 22.
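
A rough sketch of the scheduling idea, using a toy decode table and a plain memcpy instead of zlib-ng's real inflate_fast internals (every name below is illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define TBL_BITS 4u
#define TBL_MASK ((1u << TBL_BITS) - 1u)

/* Toy decode table entry: consumed bits + decoded symbol. */
typedef struct { uint8_t len; uint8_t sym; } entry;

int main(void) {
    entry tbl[1u << TBL_BITS];
    for (unsigned i = 0; i <= TBL_MASK; i++) {
        tbl[i].len = TBL_BITS;
        tbl[i].sym = (uint8_t)('A' + (i & 7u));
    }

    uint64_t hold = 0x0123456789abcdefULL;  /* bit accumulator */
    unsigned bits = 64;                     /* valid bits in hold */
    uint8_t window[16] = "chunk-copy-data";
    uint8_t out[64];
    uint8_t *dst = out;

    entry here = tbl[hold & TBL_MASK];      /* symbol for this iteration */
    for (int iter = 0; iter < 4; iter++) {
        hold >>= here.len;
        bits -= here.len;

        /* Preload: if enough bits remain, look up the next symbol *before*
         * the copy so the table-load latency overlaps with the memcpy. */
        int can_preload = bits >= TBL_BITS;
        entry next = can_preload ? tbl[hold & TBL_MASK] : here;

        memcpy(dst, window, 8);             /* stands in for the chunk copy */
        dst += 8;
        printf("emit %c\n", here.sym);

        if (!can_preload)                   /* low-bits path: decode after copy */
            next = tbl[hold & TBL_MASK];    /* (real code would refill hold first) */
        here = next;
    }
    return 0;
}
```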

Benchmark Results

@nmoinvaz
nmoinvaz / variant-matrix-pr-2139.md
Created February 18, 2026 05:42
Zlib-ng variant matrix for PR 2139

Functable Dispatch Matrix — x86 -march Variants

Extracted by inspecting undefined symbols in functable.c.o for each build — these are the function pointers the functable actually assigns at runtime. Builds use clang -target x86_64-apple-macos with runtime CPU detection enabled (the default).

-march native features

| -march  | SSE2   | SSSE3  | SSE4.1 | SSE4.2 | PCLMUL | AVX2 | AVX-512 | AVX512VNNI | VPCLMUL |
| ------- | ------ | ------ | ------ | ------ | ------ | ---- | ------- | ---------- | ------- |
| x86-64  | -      | -      | -      | -      | -      | -    | -       | -          | -       |
| nehalem | native | native | native | native | -      | -    | -       | -          | -       |
@nmoinvaz
nmoinvaz / deflate_sym_macros.h
Created February 11, 2026 21:51
Zlib-ng deflate symbol macros
/* ===========================================================================
 * Symbol buffer write/read macros.
 *
 * The symbol buffer stores literal and distance/length pairs. The storage
 * format differs based on LIT_MEM (separate buffers) vs sym_buf (interleaved),
 * and on whether the platform supports fast unaligned 32-bit access
 * (OPTIMAL_CMP >= 32), which allows packing a 3-byte symbol into a single
 * 32-bit write/read.
 *
 * SYM_WRITE_LIT and SYM_WRITE_DIST write a symbol and advance sym_next.
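
As a rough illustration of the OPTIMAL_CMP >= 32 case described above, the sketch below packs a 16-bit distance and an 8-bit length/literal code into one unaligned 32-bit write that still advances the cursor by only 3 bytes. It assumes a little-endian target, and the helper name is made up, not the gist's SYM_WRITE_DIST macro.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the gist's SYM_WRITE_DIST macro): pack dist low
 * byte, dist high byte, and lc into one little-endian 32-bit value and store
 * it with a single unaligned write, advancing the cursor by 3 bytes. */
static inline void sym_write_dist_packed(uint8_t *sym_buf, uint32_t *sym_next,
                                         uint16_t dist, uint8_t lc) {
    uint32_t packed = (uint32_t)dist | ((uint32_t)lc << 16);
    memcpy(sym_buf + *sym_next, &packed, sizeof(packed)); /* 4-byte store */
    *sym_next += 3;                                       /* 3-byte symbol */
}

int main(void) {
    uint8_t buf[16] = {0};
    uint32_t next = 0;
    sym_write_dist_packed(buf, &next, 0x0102, 0x7f);
    printf("bytes: %02x %02x %02x, next=%u\n", buf[0], buf[1], buf[2], (unsigned)next);
    return 0;
}
```

The 4-byte store writes one scratch byte past the 3-byte symbol, which is why this trick needs a byte of slack at the end of the buffer (or the next symbol simply overwrites it).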
@nmoinvaz
nmoinvaz / comparebench.py
Last active February 8, 2026 07:04
Deflatebench comparison
import re, sys

def parse_results(path):
    results = {}
    with open(path) as f:
        for line in f:
            # Match per-level result rows ("  1 ...") and the avg/avg2 summary rows.
            m = re.match(r'\s+(\d+|avg\d?)\s', line)
            if not m:
                continue
            level = m.group(1)
@nmoinvaz
nmoinvaz / benchmark.yml
Last active February 8, 2026 06:51
Zlib-ng benchmark workflow
name: Benchmark
on:
  issue_comment:
    types: [created]
  workflow_dispatch:
    inputs:
      pr_number:
        description: 'PR number to benchmark (results posted as PR comment)'
        required: false
        type: number
@nmoinvaz
nmoinvaz / functable_part.c
Created February 2, 2026 01:57
Zlib-ng functable without fallbacks
    // Set up generic C code fallbacks
#ifndef WITH_ALL_FALLBACKS
    // Only use necessary generic functions when no suitable simd versions are available.
    // These conditions mirror the native_* defines in arch/*_functions.h headers.
# if (defined(X86_SSE2) && defined(__SSE2__)) || (defined(ARCH_X86) && defined(ARCH_64BIT))
    ft.adler32 = &adler32_c;
    ft.adler32_copy = &adler32_copy_c;
    ft.crc32 = &crc32_braid;
    ft.crc32_copy = &crc32_copy_braid;
@nmoinvaz
nmoinvaz / benchmark_tally.cc
Created February 1, 2026 01:10
Zlib-ng benchmark for deflate tallying
/* benchmark_tally.cc -- benchmark sym_buf read/write strategies
 * Copyright (C) 2024 zlib-ng contributors
 * For conditions of distribution and use, see copyright notice in zlib.h
 *
 * Compares:
 * 1. LIT_MEM (separate d_buf/l_buf arrays)
 * 2. sym_buf with zng_memread_4/zng_memwrite_4 (batched)
 * 3. sym_buf with byte-by-byte access (original)
 */
@nmoinvaz
nmoinvaz / quick-bench-count-matches.cc
Last active January 26, 2026 04:06
Benchmark count matching bytes
#include <benchmark/benchmark.h>
#include <cstdint>

static inline uint32_t count_matching_bytes_ctzll(uint64_t mask) {
    return __builtin_ctzll(mask);
}

static inline uint32_t count_matching_bytes_ctz32(uint64_t mask) {
    uint32_t lo = (uint32_t)mask;
    if (lo)
        return __builtin_ctz(lo);
    return 32 + __builtin_ctz((uint32_t)(mask >> 32));
}