Amr Mubarak
<- Back to blog
Database Internals||15 min read

Building an LSM-Tree Storage Engine in Go, Part 3: The SSTable

Where We Are

Post 1 built the WAL — every write is fsynced to disk before being acknowledged. Post 2 built the memtable — an in-memory skip list that keeps keys sorted and serves reads in O(log n). When the memtable hits its size limit, we need to write it to disk permanently. That is what this post builds.

The structure we write to disk is called an SSTable — Sorted String Table. It is the on-disk representation of a frozen memtable. Once written, it is never modified. New writes go into a new memtable. The SSTable sits on disk, immutable, until compaction merges it with other SSTables.


The Problem With Naive Flushing

The simplest thing you could do when the memtable is full is dump every key-value pair to a file sequentially:

user:1 → Alice
user:2 → Bob
user:3 → Carol
...

Writing this is fast — sequential writes to disk are the fastest I/O pattern. But reading from it is a disaster. To find user:2, you scan from the beginning until you hit it. That is O(n) — the same problem the WAL has. A 64MB file with 4KB average values has 16,000 entries. Finding one key means reading up to 16,000 entries.

The SSTable solves this with three structures layered on top of the raw data: an index block for O(log n) lookups, a bloom filter to skip files that cannot contain a key, and a footer that tells you where the index and bloom filter are so you can read them without scanning the file.


Part 1: The Data Block

The data block is the raw payload — key-value pairs written sequentially in sorted order. Sorted order comes for free because the memtable's Iter() already returns entries sorted.

Each entry in the data block has this layout:

┌──────────────┬──────────────┬────────────┬──────────────┬──────────┐
│ KeyLen       │ ValueLen     │ Deleted    │ Key          │ Value    │
│ uint32       │ uint32       │ uint8      │ KeyLen bytes │ ValueLen │
│ 4 bytes      │ 4 bytes      │ 1 byte     │              │ bytes    │
└──────────────┴──────────────┴────────────┴──────────────┴──────────┘

KeyLen and ValueLen — 4 bytes each. Tells us exactly how many bytes to read for the key and value. Without these, we would need a delimiter to know where the key ends and the value begins. Delimiters require scanning. Length prefixes let us read the exact bytes in one call.

Deleted — 1 byte, either 0 or 1. Tombstones from the memtable must be written to the SSTable so they can shadow older versions of the same key in older SSTables. A tombstone entry has Deleted=1 and ValueLen=0.

Key and Value — raw bytes. No encoding, no escaping.

Total overhead per entry: 9 bytes of header regardless of key/value size.


Part 2: The Index Block

The index block maps keys to their byte offsets in the data block. When you want to find a key, you binary search the index to find the offset, then seek directly to that position in the data block.

Each index entry:

┌──────────────┬──────────────┬──────────────┐
│ KeyLen       │ Offset       │ Key          │
│ uint32       │ uint64       │ KeyLen bytes │
│ 4 bytes      │ 8 bytes      │              │
└──────────────┴──────────────┴──────────────┘

Offset — 8 bytes, uint64. The byte position in the file where this entry starts in the data block. After binary searching the index and finding the right key, we Seek(offset) directly to it.

The index is written after all data blocks. It is compact — keys only, no values — so it fits in memory even for large SSTables. When we open an SSTable for reading, we load the entire index into memory once and keep it there. All subsequent lookups binary search the in-memory index.


Part 3: The Bloom Filter

The index tells us where a key is. The bloom filter tells us whether a key is even worth looking for.

A bloom filter is a probabilistic data structure that answers one question: "has this key been inserted into this filter?" It has two possible answers:

  • Definitely no — the key is 100% not in this SSTable. Skip the file entirely.
  • Probably yes — the key might be in this SSTable. Check the index to be sure.

It never produces false negatives — if a key was inserted, the filter will always say "probably yes." It can produce false positives — it might say "probably yes" for a key that was never inserted. The false positive rate is tunable and typically set to 1%.

How it works internally:

A bloom filter is a bit array of m bits, all initially 0. To insert a key, you hash it k different ways and set the bits at the k resulting positions to 1:

Insert "user:1":
  hash1("user:1") = 3  → set bit 3
  hash2("user:1") = 7  → set bit 7
  hash3("user:1") = 12 → set bit 12

bit array: 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
                 ↑           ↑           ↑
                 3           7           12

To check if "user:2" exists:

hash1("user:2") = 3  → bit 3 is 1 ✓
hash2("user:2") = 9  → bit 9 is 0 ✗ → DEFINITELY NOT in the filter

If any of the k bit positions is 0, the key was definitely never inserted. If all k bits are 1, the key was probably inserted — but it is possible that other keys happened to set those same bits (false positive).

The false positive rate depends on three parameters:

  • m — number of bits. More bits = lower false positive rate.
  • k — number of hash functions. Optimal k = (m/n) × ln(2), where n is the number of inserted keys.
  • n — number of keys inserted.

For a target false positive rate of 1% with n keys, you need approximately m = 10 × n bits. We will use this formula.

We implement the bloom filter using double hashing — two actual hash functions combined linearly to simulate k independent hash functions. This is a standard technique that avoids implementing k different hash functions:

hash_i(key) = (hash1(key) + i × hash2(key)) % m

Part 4: The Footer

The footer is a fixed-size block at the very end of the file. It stores the byte offsets of the index block and bloom filter so we know where to read them when opening the file.

┌────────────────────┬──────────────────────┬──────────────────────────┐
│ IndexOffset        │ BloomOffset          │ Magic                    │
│ uint64 (8 bytes)   │ uint64 (8 bytes)     │ uint64 (8 bytes)         │
└────────────────────┴──────────────────────┴──────────────────────────┘
                                              total: 24 bytes

Magic — a fixed constant (0xDEADC0FFEE) written at the end of every SSTable. When opening a file, we read the last 8 bytes and check for the magic number. If it is wrong, the file is corrupt or truncated. This is a quick sanity check before doing anything else.

Because the footer is fixed-size and at the end, opening an SSTable is always: seek to fileSize - 24, read the footer, seek to IndexOffset, read the index, seek to BloomOffset, read the bloom filter. Three seeks total. No scanning.


Implementation

The Bloom Filter

// sstable/bloom.go

package sstable

import (
	"encoding/binary"
	"math"
)

// bloomFilter is a space-efficient probabilistic data structure for
// testing set membership. It guarantees no false negatives and a
// tunable false positive rate.
type bloomFilter struct {
	bits []byte // bit array, stored as bytes (8 bits per byte)
	m    uint64 // total number of bits
	k    uint64 // number of hash functions
}

// newBloomFilter creates a bloom filter sized for n expected keys
// at the given false positive rate (e.g. 0.01 for 1%).
//
// Formula for m (number of bits):
//   m = -(n × ln(p)) / (ln(2))²
// Formula for k (number of hash functions):
//   k = (m/n) × ln(2)
//
// At p=0.01 and n keys: m ≈ 10n bits, k ≈ 7 hash functions.
func newBloomFilter(n int, falsePositiveRate float64) *bloomFilter {
	m := uint64(math.Ceil(-float64(n) * math.Log(falsePositiveRate) / (math.Log(2) * math.Log(2))))
	k := uint64(math.Ceil(float64(m) / float64(n) * math.Log(2)))

	// Round m up to the nearest byte boundary.
	// We store bits in a []byte, so m must be a multiple of 8.
	m = ((m + 7) / 8) * 8

	return &bloomFilter{
		bits: make([]byte, m/8),
		m:    m,
		k:    k,
	}
}

// add inserts a key into the bloom filter.
// It sets k bits in the bit array using double hashing.
func (bf *bloomFilter) add(key []byte) {
	h1, h2 := hash128(key)
	for i := uint64(0); i < bf.k; i++ {
		// Double hashing: hash_i = (h1 + i*h2) % m
		// This simulates k independent hash functions using only two.
		pos := (h1 + i*h2) % bf.m
		bf.bits[pos/8] |= 1 << (pos % 8)
	}
}

// mayContain returns false if the key is definitely not in the filter,
// true if it is probably in the filter (may be a false positive).
func (bf *bloomFilter) mayContain(key []byte) bool {
	h1, h2 := hash128(key)
	for i := uint64(0); i < bf.k; i++ {
		pos := (h1 + i*h2) % bf.m
		if bf.bits[pos/8]&(1<<(pos%8)) == 0 {
			return false // this bit is 0 — key definitely not inserted
		}
	}
	return true
}

// encode serializes the bloom filter to bytes for writing to disk.
// Layout: [k:8][m:8][bits:m/8]
func (bf *bloomFilter) encode() []byte {
	buf := make([]byte, 16+len(bf.bits))
	binary.LittleEndian.PutUint64(buf[0:8], bf.k)
	binary.LittleEndian.PutUint64(buf[8:16], bf.m)
	copy(buf[16:], bf.bits)
	return buf
}

// decodeBloomFilter deserializes a bloom filter from bytes read off disk.
func decodeBloomFilter(data []byte) *bloomFilter {
	k := binary.LittleEndian.Uint64(data[0:8])
	m := binary.LittleEndian.Uint64(data[8:16])
	bits := make([]byte, m/8)
	copy(bits, data[16:])
	return &bloomFilter{bits: bits, m: m, k: k}
}

// hash128 produces two independent 64-bit hashes of key using FNV-1a.
// These two hashes are used as the basis for double hashing.
// FNV-1a is fast and has good distribution for short keys.
func hash128(key []byte) (uint64, uint64) {
	const (
		fnvPrime1  = 1099511628211
		fnvPrime2  = 1099511628211 * 1099511628211
		fnvOffset1 = 14695981039346656037
		fnvOffset2 = 14695981039346656037 ^ 0xdeadbeef
	)

	h1, h2 := uint64(fnvOffset1), uint64(fnvOffset2)
	for _, b := range key {
		h1 ^= uint64(b)
		h1 *= fnvPrime1
		h2 ^= uint64(b)
		h2 *= fnvPrime2
	}
	return h1, h2
}

The SSTable Writer

The writer takes a sorted sequence of entries (from memtable.Iter()) and produces an SSTable file. It writes in one pass: data blocks first, then the index, then the bloom filter, then the footer.

// sstable/writer.go

package sstable

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"os"

	"github.com/amrrdev/lsm-engine/memtable"
)

const (
	// magic is written at the end of every SSTable file.
	// On open, we verify this value to detect truncated or corrupt files.
	magic = uint64(0xDEADC0FFEE)

	// footerSize is the fixed byte size of the SSTable footer.
	// IndexOffset(8) + BloomOffset(8) + Magic(8) = 24 bytes.
	footerSize = 24

	// falsePositiveRate is the target bloom filter false positive rate.
	// 1% means 1 in 100 bloom filter checks on a key that does not exist
	// in the file will incorrectly say "probably yes", causing an unnecessary
	// index lookup. This is a standard production value.
	falsePositiveRate = 0.01
)

// indexEntry records the key and its byte offset in the data block section.
// The full index is held in memory during writing and flushed after all data.
type indexEntry struct {
	key    string
	offset uint64 // byte offset of this entry in the file
}

// Writer writes a single SSTable file from a sorted sequence of entries.
type Writer struct {
	file    *os.File
	buf     *bufio.Writer
	index   []indexEntry
	bloom   *bloomFilter
	offset  uint64 // current write position in the file
}

// NewWriter creates an SSTable writer for the given file path.
// The file must not already exist — SSTables are always new files.
func NewWriter(path string) (*Writer, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644)
	if err != nil {
		return nil, fmt.Errorf("create sstable %q: %w", path, err)
	}
	return &Writer{
		file: f,
		buf:  bufio.NewWriterSize(f, 64*1024), // 64KB write buffer
	}, nil
}

// Flush writes all entries from the memtable to the SSTable file.
// entries must be sorted in ascending key order — memtable.Iter() guarantees this.
// After Flush returns, the file is complete and durable on disk.
func (w *Writer) Flush(entries []memtable.Entry) error {
	if len(entries) == 0 {
		return fmt.Errorf("cannot flush empty memtable")
	}

	// Initialize the bloom filter sized for the number of entries.
	w.bloom = newBloomFilter(len(entries), falsePositiveRate)

	// Phase 1: write all data blocks sequentially.
	// Track each entry's byte offset for the index.
	for _, entry := range entries {
		if err := w.writeEntry(entry); err != nil {
			return fmt.Errorf("write entry %q: %w", entry.Key, err)
		}
	}

	// Phase 2: write the index block.
	// Record where the index starts so we can write it in the footer.
	indexOffset := w.offset
	if err := w.writeIndex(); err != nil {
		return fmt.Errorf("write index: %w", err)
	}

	// Phase 3: write the bloom filter.
	bloomOffset := w.offset
	if err := w.writeBloom(); err != nil {
		return fmt.Errorf("write bloom: %w", err)
	}

	// Phase 4: write the footer.
	if err := w.writeFooter(indexOffset, bloomOffset); err != nil {
		return fmt.Errorf("write footer: %w", err)
	}

	// Flush the userspace buffer to the kernel, then fsync to disk.
	// Both steps are required — see WAL post for the full explanation.
	if err := w.buf.Flush(); err != nil {
		return fmt.Errorf("flush buffer: %w", err)
	}
	if err := w.file.Sync(); err != nil {
		return fmt.Errorf("fsync: %w", err)
	}

	return w.file.Close()
}

// writeEntry encodes one key-value entry into the data block format and
// writes it to the file. It records the entry's offset in the index and
// adds the key to the bloom filter.
func (w *Writer) writeEntry(entry memtable.Entry) error {
	key := []byte(entry.Key)
	value := entry.Value

	var deleted uint8
	if entry.Deleted {
		deleted = 1
		value = nil // tombstones carry no value
	}

	// Record offset before writing — this is where this entry starts.
	w.index = append(w.index, indexEntry{
		key:    entry.Key,
		offset: w.offset,
	})

	// Add key to bloom filter — both live and deleted keys are added.
	// A tombstone is a real entry; readers need to find it.
	w.bloom.add(key)

	// Encode: [KeyLen:4][ValueLen:4][Deleted:1][Key][Value]
	header := make([]byte, 9)
	binary.LittleEndian.PutUint32(header[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(header[4:8], uint32(len(value)))
	header[8] = deleted

	if _, err := w.buf.Write(header); err != nil {
		return err
	}
	if _, err := w.buf.Write(key); err != nil {
		return err
	}
	if _, err := w.buf.Write(value); err != nil {
		return err
	}

	w.offset += uint64(9 + len(key) + len(value))
	return nil
}

// writeIndex encodes the index block and writes it to the file.
// Each index entry: [KeyLen:4][Offset:8][Key]
func (w *Writer) writeIndex() error {
	for _, entry := range w.index {
		key := []byte(entry.key)
		buf := make([]byte, 12+len(key))
		binary.LittleEndian.PutUint32(buf[0:4], uint32(len(key)))
		binary.LittleEndian.PutUint64(buf[4:12], entry.offset)
		copy(buf[12:], key)

		if _, err := w.buf.Write(buf); err != nil {
			return err
		}
		w.offset += uint64(len(buf))
	}
	return nil
}

// writeBloom encodes the bloom filter and writes it to the file.
func (w *Writer) writeBloom() error {
	encoded := w.bloom.encode()
	if _, err := w.buf.Write(encoded); err != nil {
		return err
	}
	w.offset += uint64(len(encoded))
	return nil
}

// writeFooter writes the 24-byte footer at the end of the file.
// [IndexOffset:8][BloomOffset:8][Magic:8]
func (w *Writer) writeFooter(indexOffset, bloomOffset uint64) error {
	footer := make([]byte, footerSize)
	binary.LittleEndian.PutUint64(footer[0:8], indexOffset)
	binary.LittleEndian.PutUint64(footer[8:16], bloomOffset)
	binary.LittleEndian.PutUint64(footer[16:24], magic)

	_, err := w.buf.Write(footer)
	return err
}

The SSTable Reader

The reader is used during Get operations. It opens an SSTable file, reads the footer to locate the index and bloom filter, loads them into memory, and then answers point lookups using the bloom filter first and the index second.

// sstable/reader.go

package sstable

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
	"sort"
)

// Reader reads from a single SSTable file.
// The index and bloom filter are loaded into memory on Open.
// Data blocks are read on demand via direct seeks.
type Reader struct {
	file  *os.File
	index []indexEntry  // full index loaded into memory
	bloom *bloomFilter  // bloom filter loaded into memory
}

// Open opens an SSTable file for reading.
// It reads and validates the footer, then loads the index and bloom filter
// into memory. These stay resident for the lifetime of the Reader.
func Open(path string) (*Reader, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open sstable %q: %w", path, err)
	}

	r := &Reader{file: f}

	if err := r.readFooterAndLoad(); err != nil {
		f.Close()
		return nil, err
	}

	return r, nil
}

// readFooterAndLoad seeks to the footer, validates the magic number,
// then reads the index and bloom filter from their recorded offsets.
func (r *Reader) readFooterAndLoad() error {
	info, err := r.file.Stat()
	if err != nil {
		return err
	}
	if info.Size() < footerSize {
		return fmt.Errorf("file too small to be a valid sstable")
	}

	// Read footer from the end of the file.
	footer := make([]byte, footerSize)
	if _, err := r.file.ReadAt(footer, info.Size()-footerSize); err != nil {
		return fmt.Errorf("read footer: %w", err)
	}

	indexOffset := binary.LittleEndian.Uint64(footer[0:8])
	bloomOffset := binary.LittleEndian.Uint64(footer[8:16])
	fileMagic := binary.LittleEndian.Uint64(footer[16:24])

	if fileMagic != magic {
		return fmt.Errorf("invalid magic number: file is corrupt or not an sstable")
	}

	// Load index — spans from indexOffset to bloomOffset.
	indexData := make([]byte, bloomOffset-indexOffset)
	if _, err := r.file.ReadAt(indexData, int64(indexOffset)); err != nil {
		return fmt.Errorf("read index: %w", err)
	}
	r.index = decodeIndex(indexData)

	// Load bloom filter — spans from bloomOffset to start of footer.
	bloomSize := uint64(info.Size()) - footerSize - bloomOffset
	bloomData := make([]byte, bloomSize)
	if _, err := r.file.ReadAt(bloomData, int64(bloomOffset)); err != nil {
		return fmt.Errorf("read bloom filter: %w", err)
	}
	r.bloom = decodeBloomFilter(bloomData)

	return nil
}

// decodeIndex parses the raw index bytes into a slice of indexEntry.
// Each entry: [KeyLen:4][Offset:8][Key]
func decodeIndex(data []byte) []indexEntry {
	var entries []indexEntry
	pos := 0
	for pos < len(data) {
		keyLen := int(binary.LittleEndian.Uint32(data[pos : pos+4]))
		offset := binary.LittleEndian.Uint64(data[pos+4 : pos+12])
		key := string(data[pos+12 : pos+12+keyLen])
		entries = append(entries, indexEntry{key: key, offset: offset})
		pos += 12 + keyLen
	}
	return entries
}

// Get retrieves a key from the SSTable.
// Returns (value, true) if found, (nil, false) if not found.
// A tombstone returns (nil, false) with deleted=true — the caller
// must distinguish "not found" from "found but deleted".
//
// The read path:
//   1. Check bloom filter — if definitely not present, return immediately.
//   2. Binary search the index — find the byte offset of the key.
//   3. Seek to that offset — read the entry from the data block.
func (r *Reader) Get(key string) (value []byte, deleted bool, err error) {
	// Step 1: bloom filter check.
	// This is O(k) where k is the number of hash functions (~7).
	// If the filter says no, we skip the file entirely — no disk I/O.
	if !r.bloom.mayContain([]byte(key)) {
		return nil, false, nil
	}

	// Step 2: binary search the in-memory index.
	// The index is sorted by key, so sort.Search gives us O(log n).
	// We are looking for the first index entry whose key >= our target key.
	idx := sort.Search(len(r.index), func(i int) bool {
		return r.index[i].key >= key
	})

	if idx >= len(r.index) || r.index[idx].key != key {
		// Key is not in this SSTable — bloom filter gave a false positive.
		return nil, false, nil
	}

	// Step 3: seek to the data block entry and read it.
	return r.readEntryAt(r.index[idx].offset)
}

// readEntryAt seeks to the given byte offset and reads one data block entry.
// Entry format: [KeyLen:4][ValueLen:4][Deleted:1][Key][Value]
func (r *Reader) readEntryAt(offset uint64) (value []byte, deleted bool, err error) {
	header := make([]byte, 9)
	if _, err := r.file.ReadAt(header, int64(offset)); err != nil {
		return nil, false, fmt.Errorf("read entry header at offset %d: %w", offset, err)
	}

	keyLen := binary.LittleEndian.Uint32(header[0:4])
	valueLen := binary.LittleEndian.Uint32(header[4:8])
	isDeleted := header[8] == 1

	// Read key and value in one ReadAt call to minimize syscalls.
	payload := make([]byte, keyLen+valueLen)
	if _, err := r.file.ReadAt(payload, int64(offset)+9); err != nil {
		if err == io.EOF && uint32(len(payload)) == keyLen+valueLen {
			// ReadAt on the last entry may return EOF alongside the data.
			// This is valid — treat it as a successful read.
		} else {
			return nil, false, fmt.Errorf("read entry payload: %w", err)
		}
	}

	if isDeleted {
		return nil, true, nil
	}
	return payload[keyLen:], false, nil
}

// Close closes the underlying file.
func (r *Reader) Close() error {
	return r.file.Close()
}

The Flush Path

The flush path is the bridge between the memtable and the SSTable. It is called by db.go when memtable.ShouldFlush() returns true. We show it here as a standalone function — it will move into db.go in Post 5.

// sstable/flush.go

package sstable

import (
	"fmt"
	"path/filepath"

	"github.com/amrrdev/lsm-engine/memtable"
)

// Flush writes the memtable contents to a new SSTable file.
// The filename encodes the sequence number so SSTables can be ordered
// from newest to oldest during reads — newer SSTables shadow older ones.
//
// seqNum must be monotonically increasing across all flushes.
// It is the caller's responsibility to track and increment seqNum.
func Flush(dir string, seqNum uint64, mem *memtable.Memtable) (*Reader, error) {
	entries := mem.Iter()
	if len(entries) == 0 {
		return nil, fmt.Errorf("nothing to flush")
	}

	// Name format: %016d.sst — zero-padded decimal sequence number.
	// Lexicographic order matches numeric order, so filepath.Glob + sort
	// gives SSTables oldest to newest without parsing filenames.
	path := filepath.Join(dir, fmt.Sprintf("%016d.sst", seqNum))

	w, err := NewWriter(path)
	if err != nil {
		return nil, err
	}

	if err := w.Flush(entries); err != nil {
		return nil, fmt.Errorf("flush to %q: %w", path, err)
	}

	// Open the newly written SSTable immediately so the caller can
	// add it to the active reader list without a separate Open call.
	return Open(path)
}

Reading Across Multiple SSTables

Once you have multiple SSTables on disk, a Get must check them in the right order. The rule is: newest SSTable first. A newer SSTable always shadows an older one — if a key appears in both, the newer version wins.

memtable         → check first (most recent writes)
SSTable seq=3    → check second (most recent flush)
SSTable seq=2    → check third
SSTable seq=1    → check last (oldest flush)

This is the read path that will live in db.go:

// preview of db.go read path

func (db *DB) Get(key string) ([]byte, bool, error) {
	// 1. Check memtable first — it has the most recent writes.
	if value, ok := db.mem.Get(key); ok {
		return value, true, nil
	}

	// 2. Check SSTables newest to oldest.
	// The first SSTable that contains the key wins.
	for i := len(db.readers) - 1; i >= 0; i-- {
		value, deleted, err := db.readers[i].Get(key)
		if err != nil {
			return nil, false, err
		}
		if deleted {
			// Found a tombstone — key was deleted. Stop searching.
			// Do NOT continue to older SSTables — the tombstone is the answer.
			return nil, false, nil
		}
		if value != nil {
			return value, true, nil
		}
	}

	return nil, false, nil
}

The tombstone handling is critical. When we find a tombstone in an SSTable, we stop and return "not found" — we do not continue searching older SSTables. The tombstone is the authoritative answer that this key was deleted. If we kept searching, we would find the old value in an older SSTable and incorrectly return it.


The Complete File Layout

After a flush, an SSTable file on disk looks like this:

Byte 0:
┌─────────────────────────────────────────────┐
│  DATA BLOCK                                 │
│  [KeyLen:4][ValueLen:4][Deleted:1][Key][Val]│  ← entry 1
│  [KeyLen:4][ValueLen:4][Deleted:1][Key][Val]│  ← entry 2
│  ...                                        │
│  [KeyLen:4][ValueLen:4][Deleted:1][Key][Val]│  ← entry n
├─────────────────────────────────────────────┤
│  INDEX BLOCK                                │
│  [KeyLen:4][Offset:8][Key]                  │  ← index entry 1
│  [KeyLen:4][Offset:8][Key]                  │  ← index entry 2
│  ...                                        │
├─────────────────────────────────────────────┤
│  BLOOM FILTER                               │
│  [k:8][m:8][bits: m/8 bytes]               │
├─────────────────────────────────────────────┤
│  FOOTER (always 24 bytes)                   │
│  [IndexOffset:8][BloomOffset:8][Magic:8]    │
└─────────────────────────────────────────────┘

Reading a key from this file:

  1. ReadAt(fileSize - 24) → footer → get IndexOffset, BloomOffset
  2. ReadAt(BloomOffset) → bloom filter → load into memory
  3. ReadAt(IndexOffset) → index → load into memory
  4. bloom.mayContain(key) → false? return immediately
  5. sort.Search(index, key) → get Offset
  6. ReadAt(Offset) → read the actual entry

Maximum 4 ReadAt calls to answer any point lookup in any SSTable. Steps 1–3 happen once at open time and are cached. Steps 4–6 happen on every Get.


Design Decisions and Tradeoffs

Loading the full index into memory. We read the entire index when opening an SSTable and keep it resident in memory. The alternative is a two-level index — a sparse index of index blocks — which reduces memory usage at the cost of an extra seek per lookup. For SSTables up to 64MB with 4KB average values, the index holds at most 16,000 entries × ~30 bytes per entry = ~480KB. Keeping 480KB per SSTable in memory is acceptable. RocksDB uses a similar approach with its block-based table format.

ReadAt instead of Read. We use file.ReadAt(buf, offset) rather than seeking and then reading. ReadAt is atomic — it reads from the given offset without changing the file's current position. This matters for concurrent readers: multiple goroutines can call ReadAt on the same file concurrently without interfering. Seek + Read is a two-step operation — not atomic, not safe for concurrent use without a mutex.

Sequence number in filename vs. timestamp. SSTable files are named by a monotonically increasing sequence number, not a timestamp. Timestamps are not monotonic — clock skew, NTP adjustments, or two flushes in the same millisecond can produce collisions or incorrect ordering. A sequence number managed by the engine is always strictly ordered.

False positive rate of 1%. At 1% FPR, roughly 1 in 100 Get calls for a non-existent key will incorrectly pass the bloom filter and perform an index lookup. The cost of a false positive is one index binary search + one ReadAt. Reducing FPR to 0.1% requires 10x more bits — the memory cost doubles. 1% is the standard production default (used by RocksDB, Cassandra, LevelDB).

O_EXCL on SSTable creation. Like the WAL segments, SSTable files are created with O_EXCL — fail if the file exists. A new SSTable must never already exist. If it does, a previous flush crashed after creating the file but before completing it, leaving a corrupt partial file. Failing loudly on O_EXCL prevents silently appending to a corrupt file.


What's Next

We now have three components:

  • WAL — durability
  • Memtable — fast in-memory writes and reads
  • SSTable — durable, queryable on-disk storage

The problem SSTables introduce: over time, you accumulate many SSTable files. A Get must check all of them. Read performance degrades linearly with the number of SSTables. Deleted keys waste space — the tombstone and the original value both exist on disk across different files.

Post 4 solves this with compaction: merging multiple SSTables into one, physically removing tombstones and overwritten values, and bounding the number of files a read must check.


The full code is on GitHub.