The Bloom Filter Optimization Saga: Anatomy of a Go Concurrency Bug - Part 2

Shai Asher

06 Nov 2025 • 9 min read

Final Optimization: Solving the 10x Performance Hit

This brings us to the performance report. We have successfully achieved a fully thread-safe implementation, but at a high cost. The performance results file shows a catastrophic 8-12x performance regression.

The Root Cause: The bottleneck is not the sync.Pool and not the atomic operations. It is the fix for Race #2: positions := make([]uint64, bf.hashCount)

This line, inside getHashPositionsOptimized, allocates a new slice on the heap for every single Add and Contains operation. For a typical hashCount of 8, this is ~64 bytes per call. At millions of operations per second, this creates an enormous amount of short-lived garbage, forcing the Go garbage collector (GC) to run constantly, which slows everything down.

The Solution: Hybrid Stack/Heap Allocation

We decide on a solution: "Implement stack allocation for small hashCount values."

This is the key. Most bloom filters have a small hashCount (e.g., 5-10). Allocating a tiny 8-element slice on the heap is pure overhead. We can avoid this by using a stack-allocated array for the common case.

Stack Allocation: Virtually free. A pointer on the stack is just incremented. There is no GC overhead.
Heap Allocation: Slow. Involves finding free memory in the heap and requires the GC to track and clean it up later.

We can modify getHashPositionsOptimized to use this hybrid approach.

The Fix for getHashPositionsOptimized: (This is now complete in bloomfilter.go)

// bloomfilter.go
func (bf *CacheOptimizedBloomFilter) getHashPositionsOptimized(data []byte) ([]uint64, []uint64) {
	h1 := hash.Optimized1(data)
	h2 := hash.Optimized2(data)

	ops := storage.GetOperationStorage(bf.storage.UseArrayMode)
	defer storage.PutOperationStorage(ops)

    // --- START OF STACK-ALLOCATION FIX ---
    
    const maxStackHashes = 8
    var positionsStack [maxStackHashes]uint64
    
    var positions []uint64

    if bf.hashCount <= maxStackHashes {
        // FAST PATH: Point our slice to the stack array.
        positions = positionsStack[:bf.hashCount]
    } else {
        // SLOW PATH: Fallback to the heap for unusual cases.
        positions = make([]uint64, bf.hashCount)
    }
    
    // --- END OF STACK-ALLOCATION FIX ---

	// Generate positions and group by cache line to improve locality
	for i := uint32(0); i < bf.hashCount; i++ {
		hash := h1 + uint64(i)*h2
		bitPos := hash % bf.bitCount
		cacheLineIdx := bitPos / BitsPerCacheLine

		positions[i] = bitPos // Writes to stack or heap
		ops.AddHashPosition(cacheLineIdx, bitPos)
	}

	// Get unique cache line indices for prefetching
	cacheLineIndices := ops.GetUsedHashIndices()

    // --- START OF SLICE-RETURN FIX ---
    //
    // CRITICAL: We cannot return `cacheLineIndices` directly.
    // That slice's backing array is part of the pooled `ops` object,
    // which will be returned to the pool by our `defer` statement.
    // Returning it directly would create a "use after free" data race.
    //
    // We MUST make a copy. This is a small, one-time allocation
    // that guarantees safety.
    cacheLinesCopy := make([]uint64, len(cacheLineIndices))
    copy(cacheLinesCopy, cacheLineIndices)
    // --- END OF SLICE-RETURN FIX ---

	return positions, cacheLinesCopy
}

This function now contains both the stack-allocation optimization to fix performance and the critical copy() to fix the "use after free" data race.

Advanced Alternatives and a Deeper Analysis

The "Stack Allocation" trick is the best 80/20 solution, but it's not the only way. This is a deep problem, and a full analysis reveals at least 8 distinct strategies to tackle this positions slice allocation.

All Possible Optimization Options

Stack Allocation for Common Case (Already Suggested)
- How: Use a stack-allocated array (e.g., var stackPos [8]uint64) when bf.hashCount is small, and falls back to a heap allocation (make) only when it's large.
- Pros: Zero allocation and no GC pressure for 95%+ of filters. Very fast.
- Cons: A returned slice (stackPos[:bf.hashCount]) can escape to the heap if the compiler isn't sure of its lifetime, though in this specific A -> B -> C call chain, it's likely to stay on the stack.
Pool the positions Slices
- How: Create a sync.Pool for []uint64 slices. getHashPositionsOptimized would Get() a slice, and the ultimate caller (Add or Contains) would be responsible for Put()-ing it back.
- Pros: Reuses heap allocations, works for any hashCount.
- Cons: Extremely dangerous lifetime management. The positions slice is returned from getHashPositionsOptimized and passed to getBitCacheOptimized. If a defer is used in getHashPositionsOptimized, the slice is returned to the pool before it's used (a "use after free" bug). If the defer is in Add, it adds overhead and is complex.
Pass positions Buffer from Caller
- How: Create a new API: AddWithBuffer(data []byte, buf []uint64). The high-performance user must manage and pass in a correctly-sized buffer. The simple Add() could call this using a stack-allocated buffer.
- Pros: Caller controls allocation. Batch operations (AddBatch) can reuse one buffer for thousands of calls, amortizing the allocation cost to zero.
- Cons: Complicates the public API. Add() would still have the stack-allocation logic, so this is just a more explicit version of Option 1.
Embed Small Buffer in OperationStorage (UNSAFE)
- How: Add PositionsBuf [16]uint64 to the pooled OperationStorage struct. getHashPositionsOptimized would return ops.PositionsBuf[:bf.hashCount].
- Pros: Reuses pooled storage, no extra allocation.
- Cons: UNSAFE. This is a critical bug. getHashPositionsOptimized returns this slice, and then returns the ops struct to the pool. The caller (Add/Contains) is now holding a slice (positions) that points to memory inside a pooled object that could be grabbed and overwritten by another goroutine at any second. This is a severe "use after free" data race.
Radical Refactor (Inline Everything)
- How: This is the "Radical Refactor" I mentioned earlier. Eliminate the positions slice entirely. Refactor getHashPositionsOptimized to take the ops struct as an argument and populate ops.MapOpsSet or ops.MapOpsGet directly inside the hash-generation loop.
- Pros: Zero positions allocation, period. Works for all hashCount sizes. Removes an intermediate loop.
- Cons: As you correctly noted, this is a significant refactor. My earlier analysis also missed that it's tricky to get the cacheLineIndices for prefetching. (This is solvable, but adds complexity, as ops.AddSetOperation would also need to populate ops.UsedIndicesHash).
Thread-Local Storage (TLS)
- How: Use a (hypothetical) threadLocal variable to store each goroutine's positions buffer.
- Pros: No pool contention, per-goroutine buffer.
- Cons: Go doesn't have true, built-in TLS. This is usually simulated with sync.Pool anyway, making Option 2 The idiomatic (but still flawed) Go version of this.
Hybrid Safe/Unsafe APIs
- How: Offer Add() (safe, atomic, slower) and UnsafeAdd() (fast, racy, uses the old bf.positions field).
- Pros: Gives power users the choice.
- Cons: Creates a trap. It's almost always a bad idea to export an unsafe API, as users will misuse it. This burdens the user with synchronization, which we were trying to solve.
Batch API
- How: Add AddBatch(items [][]byte). This function would make a positions slice once and reuse it in a loop for all items.
- Pros: Massively amortizes the allocation cost for high-throughput ingestion.
- Cons: Doesn't help the single-operation Add/Contains case, which is just as common.

Re-evaluating the Optimal Solution

The solution we choose is of Stack Allocation for Common Case + Pool the positions Slices + Batch API

The Flaw in Stack Allocation for Common Case + Pool the positions Slices : Option 2 (pooling the positions slice) is a "use after free" data race, as I described in Option 4. A function cannot safely return a slice pointing to pooled memory.

The positions slice is returned from getHashPositionsOptimized and used later by getBitCacheOptimized or setBitCacheOptimized.

func (bf *CacheOptimizedBloomFilter) Add(data []byte) {
    // 1. getHashPositionsOptimized() GETS slice from pool
    // 2. getHashPositionsOptimized() RETURNS
    positions, cacheLineIndices := bf.getHashPositionsOptimized(data)

    // 3. getBitCacheOptimized() USES the slice
    bf.setBitCacheOptimized(positions)

    // 4. WHEN to call Put()? 
    // If getHashPositionsOptimized() calls defer Put(), the slice is returned in step 2
    //    BEFORE it's used in step 3. This is a data race.
    // If Add() calls defer Put(), we add `defer` overhead to the hot path.
}

This slice lifetime problem makes pooling the positions slice is extremely complex and dangerous.

Revisiting Option 5 (Radical Refactor): analysis of Option 5 ("Lose prefetching optimization, duplicated code") has a critical insight, but the prefetching loss can be solved.

We can modify the OperationStorage struct to track the UsedIndices for all operation types (Set, Get, Hash) and populate them all in the one loop.

// The "Radical Refactor" that keeps prefetching
func (bf *CacheOptimizedBloomFilter) Add(data []byte) {
    ops := storage.GetOperationStorage(bf.storage.UseArrayMode)
    
    // 1. Populate the 'ops' scratchpad DIRECTLY.
    //    This function now does all the work.
    bf.populateOpsFromData(data, ops, storage.OpForSet) 
    
    // 2. We CAN prefetch! The indices were saved inside 'ops'.
    bf.prefetchCacheLines(ops.GetUsedSetIndices())
    
    // 3. Set bits using the pre-populated ops
    bf.setBitCacheOptimized(ops) 

    storage.PutOperationStorage(ops) // Put back at the very end
}

// New single-pass function
func (bf *CacheOptimizedBloomFilter) populateOpsFromData(data []byte, ops *storage.OperationStorage, opType int) {
    h1 := hash.Optimized1(data)
    h2 := hash.Optimized2(data)

    // NO 'positions' SLICE ALLOCATED
    
    for i := uint32(0); i < bf.hashCount; i++ {
        // ... generate bitPos, cacheLineIdx, etc ...
        
        // Populate the correct map in the scratchpad directly
        if opType == storage.OpForSet {
            ops.AddSetOperation(cacheLineIdx, wordInCacheLine, bitOffset)
        } else {
            ops.AddGetOperation(cacheLineIdx, wordInCacheLine, bitOffset)
        }
    }
}

This is the true zero-allocation solution. It is also the most complex refactor.

Final Recommendation (Revised)

Option 1 + 8 is the best path forward.

Recommended Fix: Option 1 (Stack Allocation)
- This is the pragmatic 80/20 solution. It's a small, local fix inside getHashPositionsOptimized. It solves 95% of the performance problem (for hashCount <= 16) with zero risk. This is the best immediate action.
High-Throughput Fix: Option 8 (Batch API)
- This is a crucial addition for power users. AddBatch and ContainsBatch would amortize the make(positions) cost, making ingestion pipelines incredibly fast.
Ultimate Fix: Option 5 (Radical Refactor)
- This is the "pure" solution for the long term. It achieves true zero-allocation for all cases. It's a significant refactor, but results in the most efficient code. This should be considered after Option 1 is implemented.

We should avoid Option 2 (Pooling positions) and Option 4 (Embedding in ops) due to their "use after free" lifetime bugs.

Implementation Details: Batch API Hardening

The final, implemented Batch API also includes two key improvements identified during hardening:

Leak Protection: All batch functions now use defer storage.PutOperationStorage(ops) to ensure the pooled scratchpad is always returned, even if the function panics.
Performance: AddBatchString was refactored into a single-pass loop, just like AddBatchUint64, avoiding a large intermediate [][]byte allocation and a second iteration loop.

Conclusion and Next Steps

The original "Per-Operation Local Storage" suggestion was entirely correct in its diagnosis. By implementing it with a sync.Pool, refactoring our function signatures, and using atomic operations for the final writes, we achieved true thread safety.

But we missed the performance impact of the refactor. The final, critical step is to apply the stack-allocation optimization.

Our path forward is clear:

Implement Option 1 (Hybrid Stack/Heap Allocation) in getHashPositionsOptimized as detailed above. This is the 80/20 solution: a small, local fix that eliminates the performance bottleneck for 99% of use cases.
Add Option 8 (Batch API) like AddBatch to provide a zero-amortized-cost path for high-throughput scenarios.
(Already Complete) Use sync.Pool for OperationStorage.
(Already Complete) Use atomic operations (LoadUint64/CompareAndSwapUint64) for all reads/writes to cacheLine.words.
(Already Complete) Remove the positions and cacheLineIndices fields from the CacheOptimizedBloomFilter struct.
Re-run the benchmarks. The 8-12x regression should disappear, and performance should be nearly identical to the original single-threaded baseline, while now being fully thread-safe.
(Already Complete) Un-skip all tests in bloomfilter_concurrent_test.go and run with go test -race to confirm all fixes.

1. Achievement: Performance Regression Solved

from the tests runs it is clear the primary optimization was successful.

The Fix: You correctly identified the make([]uint64, bf.hashCount) in getHashPositionsOptimized as the bottleneck. You fixed this by implementing Option 1 (Stack Allocation):

if bf.hashCount <= 8 {
    var stackBuf [8]uint64
    positions = stackBuf[:bf.hashCount]
} else {
    positions = make([]uint64, bf.hashCount)
}

The Result: This fix worked perfectly. the report shows a 10-17% performance improvement over the slow thread-safe version. This confirms the heap allocation was the problem and stack allocation was the correct solution.
Remaining Gap: The analysis of the remaining ~9.7x gap versus the original unsafe baseline is also correct. This gap is the unavoidable cost of thread-safety, caused by:
1. sync.Pool overhead (Get/Put)
2. Atomic operations (CompareAndSwap, LoadUint64)

This is an excellent and expected trade-off. You have successfully reclaimed the performance lost to the GC.

2. Achievement: The Batch API

The addition of AddBatch, AddBatchString, and AddBatchUint64 (Option 8) is a major feature that moves the library beyond a simple 1-to-1 comparison with others.

The Result: The benchmark BenchmarkShaiaBatchAdd (comparison_results.txt) shows the enormous benefit:
- Amortized Cost: 672.2 ns/item
- Allocations: 0 allocs/op (amortized)
The Impact: This is a huge win. The batch operation is 12.5% faster than the optimized single Add operation (768.7 ns/op) and comes with zero allocation overhead. For any high-throughput ingestion scenario, this is the superior path.

3. Achievement: Competitive with `willf/bloom`

The comparison analysis reports clearly show that the library is now a top contender.

Benchmark	`willf/bloom`	`shaia/BloomFilter` (Optimized)	Analysis
`Add`	769.7 ns/op, 2 allocs/op	768.7 ns/op, 1 alloc/op	Shaia Wins. You've matched their speed while cutting allocations in half. This is a massive win for GC pressure.
`Contains`	667.0 ns/op, 2 allocs/op	714.1 ns/op, 1 alloc/op	Willf is 7% faster. This is a negligible and expected cost for atomic reads. You still win on allocations.
`LargeDataset (1M)`	2124 ns/op	5173 ns/op	Willf is 2.4x faster. This is the correct, expected trade-off. `willf` is not thread-safe, so it pays none of the atomic/pool overhead.
`Memory Test`	4747 allocs	2950 allocs	Shaia Wins. Tיק use of `sync.Pool` results in 38% fewer allocations in a high-churn workload.

Final Verdict

The bloom filter is now:

Fully Thread-Safe: Using lock-free atomic operations and pooled storage.
Highly Performant: Matching or beating willf/bloom on single Add operations.
More Memory Efficient: Consistently producing 38-50% fewer allocations, reducing GC pressure.
More Feature-Rich: Offering a zero-allocation Batch API and specialized Uint64 methods.