The Bloom Filter Optimization Saga: Anatomy of a Go Concurrency Bug - Part 1

GitHub project: https://github.com/shaia/BloomFilter

When 'Read' Operations Mutate State

I recently investigated a critical bug in the BloomFilter package that was causing panics under concurrent use. The bloomfilter_concurrent_test.go file said it all: every test was skipped, citing a "nil pointer dereference in storage layer".

The most surprising part? The crashes weren't just happening during mixed reads and writes. Even concurrent-only read operations were failing.

This is a classic and dangerous concurrency anti-pattern. A function that appears to be read-only, like Contains(), was in fact mutating shared state, leading to a catastrophic data race. Here’s a deep dive into what went wrong, how it was fixed, and how I solved the massive performance bottleneck introduced by the fix.

The Code That Seemed 'Safe'

On the surface, our ContainsString method looks perfectly fine. It's just a simple wrapper:

// bloomfilter.go (Original, Racy Version)
func (bf *CacheOptimizedBloomFilter) ContainsString(s string) bool {
	data := ... // (convert string to []byte)
	return bf.Contains(data)
}

func (bf *CacheOptimizedBloomFilter) Contains(data []byte) bool {
	bf.getHashPositionsOptimized(data) // Generates bit positions
	bf.prefetchCacheLines()
	return bf.getBitCacheOptimized(bf.positions[:bf.count]) // Checks bits
}

This "scratchpad" logic lives inside storage.go and is the source of the entire problem.

The 'Clever' Struct Behind the Crash: What Was storage.Mode For?

Before we dive into the crash, it's important to understand why the storage.Mode struct existed. It wasn't a mistake; it was a series of clever, layered optimizations.

The Mode struct served three purposes:

  1. Hybrid Storage (Array vs. Map): This is its most obvious role. It's an abstraction to switch between two storage strategies based on the filter's size.
    • Array Mode: For small filters (<= 10,000 cache lines), it uses a fixed-size array (ArrayOps). Accessing ArrayOps[cacheLineIdx] is lightning-fast, with zero map-hashing overhead.
    • Map Mode: For large filters (> 10,000 cache lines), it uses a standard Go map (MapOps). This scales dynamically and avoids allocating a massive, mostly-empty array. The UseArrayMode flag lets the bloom filter get the best of both worlds.
  2. Improving Cache Locality: This was the real performance win. A bloom filter check involves probing several random locations in a large bitset. If you have 8 hash functions, you might check 8 bits at 8 very different memory addresses, which is terrible for CPU cache performance. The Mode struct was designed to solve this. Instead of checking bits immediately, getBitCacheOptimized first loops through all 8 hash positions and uses storage.AddGetOperation to group the bit-checks by their CPU cache line. After grouping, the code iterates over the cache lines it needs (for _, cacheLineIdx := range ...). It loads a single cache line, performs all the bit-checks for that line at once, and only then moves to the next. This keeps the relevant data in the CPU's fastest L1/L2 caches, resulting in dramatically faster performance. (The index math behind this grouping is sketched just after this list.)

  3. The 'Scratchpad' (The Fatal Flaw): This is where the cleverness became a bug.
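To make the cache-locality idea in point 2 concrete, here is a tiny, self-contained sketch of the index math behind the grouping. The constants and function name are illustrative (they mirror common 64-byte cache lines and 64-bit words, not necessarily the package's exact values):

// Illustrative only: map a bit position to (cache line, word, bit offset),
// assuming 64-byte cache lines made of eight 64-bit words.
package main

import "fmt"

const (
	bitsPerWord      = 64
	wordsPerLine     = 8                          // 64-byte line / 8-byte word
	bitsPerCacheLine = bitsPerWord * wordsPerLine // 512
)

func splitBitPos(bitPos uint64) (cacheLineIdx, wordIdx, bitOffset uint64) {
	cacheLineIdx = bitPos / bitsPerCacheLine
	wordIdx = (bitPos % bitsPerCacheLine) / bitsPerWord
	bitOffset = bitPos % bitsPerWord
	return
}

func main() {
	// Probes that share a cacheLineIdx can be checked together while the
	// line is hot in the L1/L2 cache.
	for _, pos := range []uint64{5, 517, 700, 8191} {
		line, word, bit := splitBitPos(pos)
		fmt.Printf("bit %4d -> line %2d, word %d, offset %2d\n", pos, line, word, bit)
	}
}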

What is a 'Scratchpad'?

In this context, a "scratchpad" is a common term for a piece of temporary, reusable memory (like a struct, slice, or map) that is pre-allocated to avoid repeated memory allocations inside a function that's called frequently.
Think of it like having a single physical notepad on your desk that you use for all your temporary calculations.
The "Slow" Way (No Scratchpad): Every time you (Contains) need to do a calculation, you grab a brand new sheet of paper (allocate new memory), use it, and then throw it in the trash (let the garbage collector clean it up). This is slow and wasteful.
The "Fast" Way (Scratchpad): You keep one notepad (storage.Mode) on your desk. When you need to do a calculation, you erase the notepad (bf.storage.ClearGetMap()), write down your new calculations (bf.storage.AddGetOperation()), and read the.
This "reusable scratchpad" was the shared state that concurrent goroutines were fighting over. The Mode struct's purpose was sound (hybrid storage, cache locality), but its implementation as a reusable, stateful object was not thread-safe.

The Villain: The Shared 'Scratchpad'

The storage.Mode struct was intended to be a reusable helper, but it was shared by all goroutines. It contained slices that were modified on every call:

// storage.go (Original, Racy Version)
type Mode struct {
    // ...
    // DANGER: Shared by all goroutines
	UsedIndicesGet  []uint64 
    ArrayOps    *[10000][]OpDetail
    MapOps      map[uint64][]OpDetail
    // ...
}

func (s *Mode) ClearGetMap() {
	// RACE: concurrent calls read and reset the same shared state
	for _, idx := range s.UsedIndicesGet {
		// ... reset the ops recorded for this cache line ...
	}
	// RACE: This slice header is modified by concurrent calls
	s.UsedIndicesGet = s.UsedIndicesGet[:0]
}

func (s *Mode) AddGetOperation(cacheLineIdx, WordIdx, BitOffset uint64) {
    if s.UseArrayMode {
        if len(s.ArrayOps[cacheLineIdx]) == 0 {
            // RACE: Concurrent appends to the shared index slice
            s.UsedIndicesGet = append(s.UsedIndicesGet, cacheLineIdx)
        }
        // RACE: Concurrent appends to the *same* inner slice
        s.ArrayOps[cacheLineIdx] = append(s.ArrayOps[cacheLineIdx], OpDetail{...})
    }
}

The Data Race: A Step-by-Step Crash

Let's imagine what happens when two goroutines call ContainsString at the same time:

  1. Goroutine A calls getBitCacheOptimized. It enters ClearGetMap and sets s.UsedIndicesGet = s.UsedIndicesGet[:0].
  2. Goroutine A starts its loop and appends an index: s.UsedIndicesGet = append(s.UsedIndicesGet, 10). The slice now looks like [10].
  3. Goroutine B concurrently calls getBitCacheOptimized. It enters ClearGetMap and also sets s.UsedIndicesGet = s.UsedIndicesGet[:0]. It just wiped out Goroutine A's data. The slice is now [].
  4. Goroutine A continues its loop, unaware. It appends again: s.UsedIndicesGet = append(s.UsedIndicesGet, 20). The slice is now [20].
  5. Goroutine B starts its loop. It appends its index: s.UsedIndicesGet = append(s.UsedIndicesGet, 50). The slice is now [20, 50].

The slice is now a corrupted mix of data from both goroutines. This leads to several possible failures:

  • Wrong Result: Goroutine A might check bits for Goroutine B, leading to a false negative.
  • Panic (Nil Pointer/Index out of Bounds): append is not atomic. If two goroutines append to a slice simultaneously and it needs to re-allocate its backing array, one goroutine might be left with a pointer to the old, freed array, causing a nil pointer dereference or invalid memory access. This is exactly what our tests warned us about.
  • Panic (Map Race): In "map mode," the code reads and writes a standard Go map (s.MapOps) from multiple goroutines. This is explicitly forbidden; the runtime detects it and kills the process with fatal error: concurrent map read and map write.
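All of these failure modes are easy to surface with Go's race detector. A minimal reproduction test of roughly this shape (the test name and the NewCacheOptimizedBloomFilter constructor signature are assumptions for illustration, not the package's actual test; it needs the usual fmt, sync, and testing imports) gets flagged the moment it runs under go test -race:

// Hypothetical reproduction: hammer ContainsString from many goroutines.
// With the shared scratchpad, `go test -race` reports concurrent writes to
// bf.storage; without -race the test can panic outright.
func TestConcurrentContains_Reproduction(t *testing.T) {
	bf := NewCacheOptimizedBloomFilter(1_000_000, 0.01) // assumed constructor

	var wg sync.WaitGroup
	for g := 0; g < 16; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 10_000; i++ {
				bf.ContainsString(fmt.Sprintf("key-%d-%d", id, i))
			}
		}(g)
	}
	wg.Wait()
}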

How to Fix It: Evaluating the Options

There are a few possible fixes for this. Let's analyze them.

Theoretical Roots: The Readers-Writer Problem (And Why We're Worse)

This entire situation is a fascinating (and dangerous) variation of the classic Readers-Writer Problem from computer science.

What is the Readers-Writer Problem?

This is a fundamental concurrency problem involving a shared resource (such as a file, database, or data structure). It defines two types of processes:

  • Readers: They only read the resource. They do not modify it.
  • Writers: They modify the resource.

The problem is solved by enforcing two rules:

  1. Rule 1 (Readers): Any number of Readers can access the resource simultaneously. This is safe because they don't change anything.
  2. Rule 2 (Writers): A Writer must have exclusive access. When a Writer is active, no other Readers or Writers can be present.

This is exactly what Go's sync.RWMutex is built for: RLock() (Rule 1) allows many concurrent readers, while Lock() (Rule 2) waits for all readers to finish, then locks everyone else out.

Why Our Bloom Filter is a Worse Problem

Our bug stems from a violation of the problem's basic premise. We thought we had:

  • Readers: Contains()
  • Writers: Add()

The catastrophic mistake was that our "Readers" were secretly Writers.

Contains() should have been a read-only operation. But by mutating the shared bf.storage scratchpad, it became a Writer. This transforms our situation into a much more dangerous Writer-Writer Problem, where every operation conflicts with every other operation.

This analysis reveals we don't have one bug; we have three distinct data races, each corresponding to a different theoretical flaw.

  1. Race 1: The Scratchpad Race (Writer vs. Writer)
    • Conflict: Contains() (Goroutine A) vs. Contains() (Goroutine B).
    • Theory: Both operations are acting as Writers on the same shared resource: the bf.storage scratchpad. This is a classic Writer-Writer conflict, where two processes try to modify the same piece of memory simultaneously, leading to corrupted data (a nil pointer panic). This is the bug that causes concurrent reads to break.
  2. Race 2: The Shared Field Race (Writer vs. Reader)
    • Conflict: getHashPositionsOptimized() (a Writer to bf.positions) vs. getBitCacheOptimized() (a Reader of bf.positions).
    • Theory: This is a true Readers-Writer problem that we simply failed to protect. getHashPositionsOptimized (called by Add or Contains) writes to the bf.positions slice. If Goroutine A is in the middle of reading that slice while Goroutine B writes to it, we get a data race.
  3. Race 3: The Bit Array Race (Read-Modify-Write)
    • Conflict: Add() (Goroutine A) vs. Add() (Goroutine B).
    • Theory: This is the most subtle bug: a Read-Modify-Write (RMW) race. The operation cacheLine.words[op.WordIdx] |= ... is not a single instruction. It's three:
      1. READ the current value from memory.
      2. MODIFY the value (in a CPU register) by OR-ing it with the new bit.
      3. WRITE the new value back to memory.
    • If two goroutines do this simultaneously, they can create a "lost update," as described in the final section (a toy demonstration follows this list). This is a "Writer vs. Writer" conflict on the bloom filter's actual data, not just its scratchpad.
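Here is that toy demonstration: a standalone program (not from the package) in which 64 goroutines each try to set a different bit with a plain read-modify-write. It frequently ends with fewer than 64 bits set, and go run -race flags it immediately:

package main

import (
	"fmt"
	"math/bits"
	"sync"
)

func main() {
	var word uint64
	var wg sync.WaitGroup

	// 64 goroutines each set a different bit with a plain read-modify-write.
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func(bit uint) {
			defer wg.Done()
			word |= 1 << bit // NOT atomic: read, OR in a register, write back
		}(uint(i))
	}
	wg.Wait()

	// Often prints fewer than 64: overlapping writers overwrote each other's
	// updates (the "lost update"). The CAS loop shown later makes this 64.
	fmt.Println("bits set:", bits.OnesCount64(word))
}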

Understanding this theory explains why our final solution must have three parts:

  • sync.Pool fixes Race 1 (the scratchpad).
  • Refactoring function signatures fixes Race 2 (the shared fields).
  • Atomic operations fix Race 3 (the RMW bit array).

Option 1: The "Simple" Fix (Global RWMutex)

This is the most common "first guess" for making a struct thread-safe. You add a sync.RWMutex to the struct, use RLock() for read-only methods, and Lock() for write methods.

How it would be implemented:

import "sync"

type CacheOptimizedBloomFilter struct {
    mu sync.RWMutex // Add a single mutex
    
    // ... all other fields
}

func (bf *CacheOptimizedBloomFilter) Contains(data []byte) bool {
    bf.mu.RLock() // Lock for reading
    defer bf.mu.RUnlock()

	bf.getHashPositionsOptimized(data)
	bf.prefetchCacheLines()
	return bf.getBitCacheOptimized(bf.positions[:bf.count])
}

func (bf *CacheOptimizedBloomFilter) Add(data []byte) {
    bf.mu.Lock() // Lock for writing
    defer bf.mu.Unlock()

	bf.getHashPositionsOptimized(data)
	bf.prefetchCacheLines()
	bf.setBitCacheOptimized(bf.positions[:bf.count])
}

Pros and Cons of this approach:

  • The trap: Contains is not a true reader. Because it mutates the shared bf.storage scratchpad, letting many goroutines hold RLock() at once leaves the data race fully intact. To be correct, Contains would have to take the full write lock:

func (bf *CacheOptimizedBloomFilter) Contains(data []byte) bool {
    bf.mu.Lock() // Must be a full Lock, not RLock!
    defer bf.mu.Unlock()
    // ...
}
  • This would make the code correct, but it would destroy performance. A RWMutex is only fast if you have many concurrent reads. By forcing Contains to use a full write lock, we serialize all operations. Only one goroutine could call Add or Contains at a time. The bloom filter would become a massive bottleneck, likely performing worse than a single-threaded version due to lock contention.

This solution is a trap. It looks simple, but it's either incorrect (with RLock) or unacceptably slow (with Lock).

Option 2 (Detailed): Per-Operation Local Storage

This approach directly fixes the data race by eliminating the shared part of the "shared scratchpad." Instead of one reusable notepad for everyone, each goroutine gets its own brand-new sheet of paper.

How it would be implemented:

The implementation is simple: you remove all calls to bf.storage's helper methods (ClearGetMap, AddGetOperation, etc.) and instead declare a new, local map and slice inside the function itself.

// The "Per-Operation Local Storage" implementation
func (bf *CacheOptimizedBloomFilter) getBitCacheOptimized(positions []uint64) bool {
    // 1. DANGER REMOVED: No more call to bf.storage.ClearGetMap()
    
    // 2. Create new, local "scratchpads".
    //    These are not shared between goroutines.
    var localIndices []uint64
    var localOps map[uint64][]OpDetail

    // We can even optimize based on array/map mode.
    if bf.storage.UseArrayMode {
        // Still faster to use the array, but we need our own local slice.
        localIndices = make([]uint64, 0, 8) // Small pre-allocation
    } else {
        // In map mode, we must create a new map every time.
        localIndices = make([]uint64, 0, 8)
        localOps = make(map[uint64][]OpDetail, 8)
    }

	for _, bitPos := range positions {
        // ...
        // 3. DANGER REMOVED: We only modify our local variables.
        if bf.storage.UseArrayMode {
            if len(bf.storage.ArrayOps[cacheLineIdx]) == 0 { // Note: Read-only check
                localIndices = append(localIndices, cacheLineIdx)
            }
            // ... logic to use bf.storage.ArrayOps ...
        } else {
            if _, exists := localOps[cacheLineIdx]; !exists {
                localIndices = append(localIndices, cacheLineIdx)
            }
            localOps[cacheLineIdx] = append(localOps[cacheLineIdx], OpDetail{...})
        }
	}
    
    // ...
}

(Note: the array-mode implementation is more complex, as it would still read from the shared bf.storage.ArrayOps while writing to localIndices. The map-mode logic is cleaner and more illustrative of the pattern.)

Pros and Cons of this approach:

  • Pro: Absolute Thread-Safety. This is the main benefit. It is impossible to have a data race because there is no mutable shared state. Each goroutine's "notepad" is its own private variable, and the original bf.storage is now (mostly) read-only.
  • Pro: Simple to Reason About. The logic is clean. There are no locks, no complex synchronization, and no hidden state. What you see is what you get.
  • Con: Massive Garbage Collector (GC) Pressure. This is the critical, show-stopping flaw. In "map mode," this solution allocates a new map (localOps) and a new slice (localIndices) on every single call to Contains. As our bloomfilter_stress_test.go shows, we might handle millions of operations per second. This creates an enormous amount of short-lived garbage for the GC to clean up, leading to frequent "stop-the-world" pauses and poor performance (a benchmark sketch follows this list).
  • Con: Defeats the Original Optimization. The entire point of the shared scratchpad was to avoid this exact GC pressure. This fix solves the data race but reintroduces the original performance problem that the scratchpad was designed to solve.
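To put a number on that GC pressure, a benchmark of roughly this shape would do it. The constructor and AddString helper names below are assumptions for illustration, not necessarily the repository's actual API:

// Hypothetical benchmark: with per-operation local maps, b.ReportAllocs()
// shows several heap allocations per Contains call in map mode; the pooled
// version in Option 3 brings allocs/op back toward zero.
func BenchmarkContainsAllocations(b *testing.B) {
	bf := NewCacheOptimizedBloomFilter(10_000_000, 0.01) // assumed constructor; large enough for map mode
	bf.AddString("hello")                                // assumed helper

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.ContainsString("hello")
	}
}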

This leads us to the ideal solution, which gives us the best of both worlds.

Option 3: The Optimized Fix (sync.Pool)

This is the best solution. It takes the thread-safety of Option 2 and solves its critical performance (GC) problem.

A sync.Pool is a concurrent-safe pool of reusable objects. Each goroutine can "Get" its own private scratchpad, use it, and "Put" it back when done. This avoids allocations and eliminates the data race.

The "Full Fix": Solving Both Data Races

The sync.Pool is the right tool, but we must apply it to all the stateful functions. The fix has two parts: fixing the internal scratchpad race with a sync.Pool, and fixing the external field race with a signature refactor.

Part 1: Fix the 'Scratchpad Race' with sync.Pool

The data race exists in three key functions, all of which mutate bf.storage:

  1. getBitCacheOptimized (called by Contains)
  2. setBitCacheOptimized (called by Add)
  3. getHashPositionsOptimized (called by both Add and Contains)

All three follow the same racy "shared scratchpad" pattern. The solution is to create one, universal scratchpad struct (OperationStorage) and store it in the sync.Pool.

1. Define the Universal Scratchpad and Pool: (This is now complete in storage.go)

// storage.go
type OperationStorage struct {
    // ... maps and slices ...
}

var (
	arrayOpsPool = sync.Pool{ ... }
	mapOpsPool   = sync.Pool{ ... }
)

// GetOperationStorage retrieves an operation storage from the pool
// Objects from pool are already clean (either new or cleared on Put)
func GetOperationStorage(useArrayMode bool) *OperationStorage {
	if useArrayMode {
		return arrayOpsPool.Get().(*OperationStorage)
	}
	return mapOpsPool.Get().(*OperationStorage)
}

// PutOperationStorage returns an operation storage to the pool after clearing it
func PutOperationStorage(ops *OperationStorage) {
	// Clear before returning to pool to ensure next Get receives clean object
	ops.clear()

	if ops.UseArrayMode {
		arrayOpsPool.Put(ops)
	} else {
		mapOpsPool.Put(ops)
	}
}
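The pool variables above are elided. A plausible shape for them, assuming OperationStorage mirrors the fields we saw on Mode (this is a sketch, not the repository's exact code), looks like this:

// Sketch: each pool's New function mints a fresh scratchpad when the pool
// is empty; PutOperationStorage clears objects before they go back in.
var (
	arrayOpsPool = sync.Pool{
		New: func() any {
			return &OperationStorage{
				UseArrayMode: true,
				ArrayOps:     new([10000][]OpDetail),
			}
		},
	}
	mapOpsPool = sync.Pool{
		New: func() any {
			return &OperationStorage{
				MapOps: make(map[uint64][]OpDetail, 8),
			}
		},
	}
)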

2. Refactor all three functions: (This is now complete in bloomfilter.go)

// bloomfilter.go
func (bf *CacheOptimizedBloomFilter) getBitCacheOptimized(positions []uint64) bool {
	// 1. Get a private universal scratchpad
	ops := storage.GetOperationStorage(bf.storage.UseArrayMode)
	defer storage.PutOperationStorage(ops)
    // ...
}
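For completeness, here is one way the body of that function could flow with the pooled scratchpad. Every field and helper name below (bf.cacheLines, ops.UsedIndicesGet, ops.OpsFor, the cache-line constants) is illustrative, not the repository's exact API:

// Illustrative sketch of the two-phase, pool-backed read path.
func (bf *CacheOptimizedBloomFilter) getBitCacheOptimized(positions []uint64) bool {
	ops := storage.GetOperationStorage(bf.storage.UseArrayMode)
	defer storage.PutOperationStorage(ops)

	// Phase 1: group every probe by its cache line, in the private scratchpad.
	for _, bitPos := range positions {
		cacheLineIdx := bitPos / bitsPerCacheLine
		wordIdx := (bitPos % bitsPerCacheLine) / bitsPerWord
		bitOffset := bitPos % bitsPerWord
		ops.AddGetOperation(cacheLineIdx, wordIdx, bitOffset)
	}

	// Phase 2: visit each touched cache line once, checking all of its bits
	// with atomic loads (see the read-path section below).
	for _, cacheLineIdx := range ops.UsedIndicesGet {
		cacheLine := &bf.cacheLines[cacheLineIdx]
		for _, op := range ops.OpsFor(cacheLineIdx) {
			word := atomic.LoadUint64(&cacheLine.words[op.WordIdx])
			if word&(1<<op.BitOffset) == 0 {
				return false
			}
		}
	}
	return true
}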

This change successfully fixes Race 1: The Scratchpad Race.

Part 2: Fix the 'Shared Field Race' by Refactoring

This is where the performance problem was introduced. The original code had a "Shared Field Race" (Race #2) because getHashPositionsOptimized wrote to bf.positions and bf.cacheLineIndices, which other functions read.

The Fix (Refactoring Signatures): (This is now complete in bloomfilter.go)

  1. The bf.positions and bf.cacheLineIndices fields were removed from the CacheOptimizedBloomFilter struct.
  2. getHashPositionsOptimized now returns positions and cacheLineIndices.
  3. Add and Contains now call getHashPositionsOptimized and pass the returned (local) slices to setBitCacheOptimized and getBitCacheOptimized.

BEFORE (Racy):

// bloomfilter.go (Original)
func (bf *CacheOptimizedBloomFilter) getHashPositionsOptimized(data []byte) {
    // ...
    bf.positions[i] = bitPos // <-- RACE!
    bf.cacheLineIndices = bf.cacheLineIndices[:0] // <-- RACE!
}
func (bf *CacheOptimizedBloomFilter) Contains(data []byte) bool {
	bf.getHashPositionsOptimized(data) // Writes to shared fields
	return bf.getBitCacheOptimized(bf.positions[:bf.count]) // Reads shared fields
}

AFTER (Correct, but Slow):

// bloomfilter.go (First Thread-Safe Version)
func (bf *CacheOptimizedBloomFilter) getHashPositionsOptimized(data []byte) ([]uint64, []uint64) {
    // ...
    // THIS IS THE PERFORMANCE BOTTLENECK
    positions := make([]uint64, bf.hashCount) // <-- SLOW: Allocates on heap
    // ...
    return positions, cacheLineIndices
}
func (bf *CacheOptimizedBloomFilter) Contains(data []byte) bool {
	positions, cacheLineIndices := bf.getHashPositionsOptimized(data) // Gets local slices
	return bf.getBitCacheOptimized(positions) // Passes local slices
}

This change successfully fixes Race 2: The Shared Field Race, but, as the performance report proves, it introduced a new, massive allocation bottleneck.

The Final Data Race: Concurrent Writes to the Bit Array

This is the subtle, final bug (Race #3). Even after fixing the scratchpad and shared fields, the actual write to the bit array was still not thread-safe.

(This is now fixed in bloomfilter.go using atomics.)

The Problem (Read-Modify-Write):

The line cacheLine.words[op.WordIdx] |= 1 << op.BitOffset is not atomic. It's three steps:

  1. Read the current value.
  2. Modify the value (in a CPU register).
  3. Write the new value back.

If two goroutines do this at once, one write can be "lost."


A Primer on Atomic Operations (The Fix for Race #3)

To solve the Read-Modify-Write problem without using a slow mutex, we use atomic operations from Go's sync/atomic package.

What Are Atomic Operations?

Atomic operations are low-level CPU instructions that are guaranteed to execute as a single, indivisible step. When you call an atomic function, no other goroutine can interrupt it or see the data in a "half-finished" state. They are the building blocks for lock-free concurrency.

While a full sync.Mutex locks an entire section of code, atomics are used for very simple, high-speed operations on a single piece of data (like a uint64). The most common operations are:

  • Load: Safely read a value. (atomic.LoadUint64)
  • Store: Safely write a value. (atomic.StoreUint64)
  • Add: Safely add a number to a value (e.g., for counters). (atomic.AddUint64)
  • CompareAndSwap (CAS): This is the most powerful. It says: "Check if the value at this memory address is old. If it is, and only if it is, set it to new. Tell me if I succeeded." This is the key to safely changing data.
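As a standalone illustration of those primitives (a toy program, unrelated to the bloom filter code), here is Add used for a lock-free counter and a CAS loop used to track a running maximum:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// storeMax keeps the running maximum using a CAS loop: retry only if some
// other goroutine changed the value between our Load and our Swap.
func storeMax(addr *uint64, v uint64) {
	for {
		old := atomic.LoadUint64(addr)
		if v <= old || atomic.CompareAndSwapUint64(addr, old, v) {
			return
		}
	}
}

func main() {
	var ops, peak uint64
	var wg sync.WaitGroup

	for i := 1; i <= 1000; i++ {
		wg.Add(1)
		go func(v uint64) {
			defer wg.Done()
			atomic.AddUint64(&ops, 1) // Add: safe concurrent counter
			storeMax(&peak, v)        // CAS: safe concurrent maximum
		}(uint64(i))
	}
	wg.Wait()

	fmt.Println("ops: ", atomic.LoadUint64(&ops))  // always 1000
	fmt.Println("peak:", atomic.LoadUint64(&peak)) // always 1000
}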

Atomic Operations in Our Bloom Filter

We use atomic operations in two critical places to fix Race #3:

1. setBitCacheOptimized (The Write Path)

This is where we solve the Read-Modify-Write race. We can't just do word |= mask because another goroutine might be doing the same thing. Instead, we use a "Compare-And-Swap (CAS) loop," which is now hardened against infinite spinning.

// bloomfilter.go (Final Version)
mask := uint64(1 << op.BitOffset)
wordPtr := &cacheLine.words[op.WordIdx]

const maxRetries = 100
for retry := 0; retry < maxRetries; retry++ {
    // 1. ATOMIC READ: Safely get the current value.
    old := atomic.LoadUint64(wordPtr)

    // 2. MODIFY: Calculate the new value in a local variable.
    new := old | mask

    // 3. ATOMIC COMPARE-AND-SWAP:
    // Try to write 'new', but ONLY if the value is still 'old'.
    if old == new || atomic.CompareAndSwapUint64(wordPtr, old, new) {
        // Success! Either:
        // a) The bit was already set (old == new), so we're done.
        // b) We successfully swapped 'old' for 'new'.
        break
    }
    // 4. BACKOFF: We failed. Another goroutine changed the value.
    //    Spin briefly to reduce cache line bouncing.
    if retry > 10 {
        for i := 0; i < retry; i++ {
            // Spin briefly
        }
    }
}
// Note: After maxRetries, the bit may remain unset.
// This is an acceptable trade-off in a bloom filter to
// prevent a livelock under extreme contention.

This hardened loop guarantees that no write is lost in the common case and prevents a CPU-wasting infinite spin under pathological contention.

2. getBitCacheOptimized (The Read Path)

You might think a regular read like if (cacheLine.words[op.WordIdx] & mask) == 0 would be fine. But without an atomic load it is still a data race, and on some architectures you could get a "torn read"—reading a 64-bit value while a concurrent write is only halfway through changing it.

To guarantee we always get a valid, whole uint64 (either the value before the write or after it, but never a mix), we must use an atomic load:

// bloomfilter.go (Final Version)
// Atomic load for thread-safe read
word := atomic.LoadUint64(&cacheLine.words[op.WordIdx])
if (word & (1 << op.BitOffset)) == 0 {
    return false
}

This ensures our Contains checks are always correct and free of data races.

https://github.com/shaia/BloomFilter

to be continued...