Technology

Built from scratch.
For today's hardware.

How Rust, io_uring, zero-copy deserialization, and thread-per-core architecture combine to deliver sub-millisecond median latency, tight P99.9 tails, and multi-GB/s throughput on a single node.

Apache Iggy is not an adaptation of an existing engine. It is a ground-up reimagining of what message streaming looks like when you start with today's hardware constraints and work backward.

< 1ms

P99 latency

Multi GB/s

per node throughput

Zero

GC pauses, ever

4

innovations compounding

The Demand

The AI Era changed what streaming must do.

AI systems don't just process data; they chain together. Every agent, every inference call, every retrieval feeds the next. A single latency spike at step two compounds through every downstream step. The streaming layer cannot be the bottleneck.

LLM Inference Pipelines

Token streams, completion routing, and context assembly require microsecond handoff, not milliseconds.

Real-Time RAG

Retrieval-augmented generation needs live data freshness. Stale retrieval = stale answers. The pipeline must stay current.

Agent Chains

Multi-agent orchestration multiplies latency. A p99 event at one step becomes the worst case for every step downstream (see the back-of-envelope after this list).

Model Monitoring

Drift detection and anomaly alerts need real-time signal. Batch telemetry pipelines miss the moment that matters.

Feature Stores

Online feature computation for real-time ML inference requires features fresh to the millisecond, not pre-computed approximations.

Embedding Pipelines

Continuous re-embedding of live data at model cadence demands high throughput with zero tolerance for head-of-line blocking.
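
A quick back-of-envelope shows why tails compound across chained steps. Assuming each hop independently blows its p99 budget 1% of the time (independence is an illustrative assumption, not a measured property), the chance a chain hits at least one p99 event grows fast with depth:

```rust
// Probability that a chain of N sequential hops hits at least one p99
// event, assuming each hop exceeds its p99 independently 1% of the time.
fn main() {
    for hops in [1, 5, 10, 20] {
        let p_all_clean = 0.99f64.powi(hops);
        println!(
            "{hops:>2} hops: {:.1}% chance of at least one p99 event",
            (1.0 - p_all_clean) * 100.0
        );
    }
}
```

Ten hops already push the odds near 10%: the tail, not the median, sets the user experience.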

Hardware

The hardware changed. We changed with it.

Four fundamental shifts in computing infrastructure happened over the last decade. Apache Iggy is designed to exploit every one of them.

Storage

Page cache (fsync gambles) → NVMe SSD

~1 ms P99

Built for NVMe from day one. Sequential append-only writes, parallel hardware queues, predictable throughput. Modern SSDs are the floor we design for.

Parallelism

4–8 cores (per machine) → 64–128 cores

16× more cores

Thread-per-core architecture pins one thread to one CPU core, scaling linearly with every additional core.

I/O Model

select / epoll (readiness-based) → io_uring

Zero syscalls per I/O

Shared ring buffers between user space and kernel eliminate per-operation syscalls and context switches.

Runtime

JVM + GC (stop-the-world pauses) → Rust

Zero GC pauses

Rust's ownership model enforces memory safety at compile time, so no garbage collector ever touches latency at runtime.

Architecture

Single binary. Append-only log.

No ZooKeeper. No external coordination service. No hidden dependencies. Apache Iggy is one binary that runs everything.

Data hierarchy

STREAM

Top-level namespace

TOPIC

Logical category in a stream

PARTITION

Parallel ordered log

SEGMENT

.log + .index on disk

O(1) offset lookup

Binary index maps message offsets to byte positions in constant time (see the sketch after this list).

Consumer groups

Multiple independent consumers track their own offset per partition.

Configurable retention

Time-based or size-based retention per stream with no performance penalty.
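
A minimal sketch of how a fixed-size binary index yields constant-time offset lookup. The 16-byte entry layout and the names here are illustrative assumptions, not Iggy's actual on-disk format:

```rust
// Each index entry is fixed-size, so the slot for a given offset is pure
// arithmetic: no binary search, no scan. Assumed layout per entry:
// 8-byte relative offset + 8-byte byte position in the .log file.
const INDEX_ENTRY_SIZE: usize = 16;

/// Maps a message offset to its byte position in the segment's .log file,
/// given the segment's in-memory copy of the .index file.
fn lookup(index: &[u8], base_offset: u64, offset: u64) -> Option<u64> {
    let slot = offset.checked_sub(base_offset)? as usize * INDEX_ENTRY_SIZE;
    let entry = index.get(slot..slot + INDEX_ENTRY_SIZE)?;
    Some(u64::from_le_bytes(entry[8..16].try_into().ok()?))
}

fn main() {
    // Two entries: offset 100 at byte 0, offset 101 at byte 512.
    let mut index = Vec::new();
    for (rel, pos) in [(0u64, 0u64), (1, 512)] {
        index.extend_from_slice(&rel.to_le_bytes());
        index.extend_from_slice(&pos.to_le_bytes());
    }
    assert_eq!(lookup(&index, 100, 101), Some(512));
}
```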

Innovation 01

Zero-Copy Deserialization

Traditional serialization copies every field into a new struct on every message. At millions of messages per second, those allocations dominate CPU time. Iggy eliminates them entirely.

IggyMessageView is a reference into the original byte buffer, not a parsed copy. IggyMessageBatch is flat bytes with index offsets. The message is the bytes. No transformation. No owned struct.
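A minimal sketch of the view pattern, assuming a toy wire layout (8-byte offset, 4-byte payload length, payload). The type echoes the idea behind IggyMessageView rather than its actual definition:

```rust
// A view borrows the original buffer; it never parses into an owned struct.
struct MessageView<'a> {
    buf: &'a [u8],
}

impl<'a> MessageView<'a> {
    fn offset(&self) -> u64 {
        u64::from_le_bytes(self.buf[0..8].try_into().unwrap())
    }

    fn payload(&self) -> &'a [u8] {
        let len = u32::from_le_bytes(self.buf[8..12].try_into().unwrap()) as usize;
        // A reference into the original buffer, not a copy.
        &self.buf[12..12 + len]
    }
}

fn main() {
    // One flat buffer stands in for bytes straight off the wire.
    let mut wire = Vec::new();
    wire.extend_from_slice(&42u64.to_le_bytes());
    wire.extend_from_slice(&5u32.to_le_bytes());
    wire.extend_from_slice(b"hello");

    let view = MessageView { buf: &wire };
    assert_eq!(view.offset(), 42);
    assert_eq!(view.payload(), b"hello"); // zero parse, zero alloc
}
```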

Before: Traditional Serialization

Network buffer
→ Parse every field
→ Allocate struct
→ Copy values
→ Store
→ Re-serialize
→ Copy to send

Every field copied. New struct allocated. Every message. Every time.

Iggy: Zero-Copy Data Flow

One flat buffer. Zero transformations.

header
payload bytes
index
IggyMessageBatch: flat bytes, no per-message alloc
IggyMessageView: reference, not copy

Wire

arrives as bytes

Cache

same buffer in memory

Disk

appended directly

Client

served from cache

0 copies · 0 allocations · same bytes end to end

Innovation 03

io_uring: True Async I/O

epoll works on readiness: "tell me when the file descriptor is ready." But regular files are always "ready", so disk I/O syscalls block the OS thread no matter what the event loop promises. io_uring works on completions: the kernel does the work while your thread does something else.

Iggy submits batches of I/O operations to a shared ring buffer. The kernel drains them asynchronously. No syscall overhead. No context switches. And because Iggy's executor is custom-built for io_uring's completion model, the driver and scheduler are in perfect lockstep, something Tokio's readiness-based epoll model fundamentally cannot do.
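
For a feel of the completion model, here is a minimal sketch using the tokio-rs io-uring crate. It is an illustration, not Iggy's production driver, and segment.log is a hypothetical file:

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?; // shared SQ/CQ rings, 8 entries
    let file = File::open("segment.log")?; // hypothetical segment file
    let mut buf = vec![0u8; 4096];

    // Describe the read and push it onto the submission queue: no syscall yet.
    let read = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .user_data(0x42);
    unsafe { ring.submission().push(&read).expect("submission queue full") };

    // One submit call flushes a whole batch; with SQPOLL even this goes away.
    ring.submit_and_wait(1)?;

    // Harvest the completion: the kernel already did the work.
    let cqe = ring.completion().next().expect("completion queue empty");
    assert_eq!(cqe.user_data(), 0x42);
    println!("read {} bytes", cqe.result());
    Ok(())
}
```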

epoll: Readiness model

1. App submits I/O request
2. Thread waits for readiness signal (thread blocked)
3. OS signals "fd is ready"
4. App issues actual syscall (1 syscall per I/O)
5. Thread waits again (context switch)

Disk I/O blocks the OS thread; Tokio cannot help with this.

io_uring: Completion model

Shared Memory (User ↔ Kernel)

Submission Queue

Completion Queue

Zero syscalls per I/O · kernel polls ring directly

1. App batch-submits to SQ (no syscall)
2. Kernel drains SQ asynchronously (thread free)
3. Completions land in CQ
4. App reads CQ (no syscall)

Iggy's custom thread-per-core executor is built around this completion model from the start, so driver and scheduler stay in lockstep instead of being bolted together.

Innovation 04

Thread-per-Core: Shared Nothing

Work-stealing schedulers are brilliant at keeping cores busy. They're terrible at keeping latency tails short, because they move tasks between cores, destroying the L1/L2 cache warmth that makes fast paths fast.

Iggy pins one thread to one CPU core, and that thread owns its partition's data exclusively. No Arc<Mutex<>>. No cross-core contention. The L1 and L2 caches stay hot, every read is cache-local, and P99.9 latency drops to a fraction of what work-stealing delivers on the same hardware.
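
A minimal sketch of the pattern using the core_affinity crate (illustrative, not Iggy's actual executor): pin one thread per core and give it exclusive ownership of its partition, so no lock ever appears:

```rust
use std::thread;

// Each partition is owned by exactly one pinned thread: no Arc, no Mutex.
struct Partition {
    id: usize,
    log: Vec<Vec<u8>>,
}

fn main() {
    let core_ids = core_affinity::get_core_ids().expect("no cores detected");

    let handles: Vec<_> = core_ids
        .into_iter()
        .enumerate()
        .map(|(id, core)| {
            thread::spawn(move || {
                // Pin this OS thread to one core so its L1/L2 working set
                // is never invalidated by a migration.
                core_affinity::set_for_current(core);

                // Shared-nothing: the partition never leaves this thread,
                // so every access is lock-free by construction.
                let mut partition = Partition { id, log: Vec::new() };
                partition.log.push(b"hello".to_vec());
                println!("core {}: {} message(s)", partition.id, partition.log.len());
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```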

Work Stealing: Shared State

Core 0 · Arc<Mutex<State>>
T0 · T1 · T2
Tasks can migrate across cores

Core 1 · Arc<Mutex<State>>
T0 · T1 · T2
Tasks can migrate across cores

Core 2 · Arc<Mutex<State>>
T0 · T1 · T2
Tasks can migrate across cores

Lock contention, cache misses, unpredictable tail latency.

Thread-per-Core: Shared Nothing

Core 0, pinned
L1/L2 hot

Partition 0

owns its own data

No cross-core traffic. No locks. No contention.

Core 1, pinned
L1/L2 hot

Partition 1

owns its own data

No cross-core traffic. No locks. No contention.

Core 2, pinned
L1/L2 hot

Partition 2

owns its own data

No cross-core traffic. No locks. No contention.

CPU affinity + shared-nothing = predictable latency at every percentile.

P99.9 tail latency: Thread-per-Core vs Work Stealing

Benchmarked on AWS i3en.3xlarge · Intel Xeon 8259CL @ 2.50 GHz. Thread-per-core architecture delivers materially lower and more consistent tail latencies at every percentile.

See the benchmarks →

Progress

We keep beating ourselves.

Every optimization is measurable. Every commit is a benchmark run. The numbers improve with each release, and we publish them publicly so you can verify every claim.

Q1 2025

Foundation

feat(server): append-only log foundation

Persistent append-only log with Stream → Topic → Partition → Segment hierarchy. The baseline everything else is measured against.

Apr 2025

Zero-Copy

↓ memory pressure · ↑ throughput

feat(server): implement zero-copy message handling

IggyMessageView and IggyMessageBatch eliminate every per-message allocation. Same bytes flow wire → cache → disk → client with zero copies.

Nov 2025

io_uring + Thread-per-Core

↓ P99.9 tail latency · ↑ CPU efficiency

feat(io_uring): implement thread per core io_uring

Custom completion-model executor paired with CPU-pinned, shared-nothing partitions. The combination that drives sub-ms median and tight P99.9 tails.

2026 →

Always Improving

Ongoing

Chasing Microseconds

Viewstamped Replication, custom allocators, NUMA awareness. There is always one more thing to optimize, and we will find it.

benchmarks.iggy.apache.org

Fully reproducible. Open methodology. Run them yourself.

See what the engineering delivers.

Sub-millisecond P99 latency and multi-GB/s throughput on a single node. Start with a free deployment and measure it yourself.