Technology
Built from scratch.
For today's hardware.
How Rust, io_uring, zero-copy deserialization, and thread-per-core architecture combine to deliver sub-millisecond median latency, tight P99.9 tails, and multi-GB/s throughput on a single node.
Apache Iggy is not an adaptation of an existing engine. It is a ground-up reimagining of what message streaming looks like when you start with today's hardware constraints and work backward.
< 1ms
P99 latency
Multi GB/s
per node throughput
Zero
GC pauses, ever
4
innovations compounding
The Demand
The AI Era changed what streaming must do.
AI systems don't just process data; they chain together. Every agent, every inference call, every retrieval feeds the next. A single latency spike at step two compounds through every downstream step. The streaming layer cannot be the bottleneck.
LLM Inference Pipelines
Token streams, completion routing, and context assembly require handoffs measured in microseconds, not milliseconds.
Real-Time RAG
Retrieval-augmented generation needs live data freshness. Stale retrieval = stale answers. The pipeline must stay current.
Agent Chains
Multi-agent orchestration multiplies latency. A P99 event at one step becomes the worst case for every step downstream.
Model Monitoring
Drift detection and anomaly alerts need real-time signal. Batch telemetry pipelines miss the moment that matters.
Feature Stores
Online feature computation for real-time ML inference requires sub-millisecond feature freshness, not pre-computed approximations.
Embedding Pipelines
Continuous re-embedding of live data at model cadence demands high throughput with zero tolerance for head-of-line blocking.
Hardware
The hardware changed. We changed with it.
Four fundamental shifts in computing infrastructure happened over the last decade. Apache Iggy is designed to exploit every one of them.
Storage
~1 ms P99
Built for NVMe from day one. Sequential append-only writes, parallel hardware queues, predictable throughput. Modern SSDs are the floor we design for.
Parallelism
20× more cores
Thread-per-core architecture pins one thread to one CPU core, scaling linearly with every additional core.
I/O Model
Zero syscalls per I/O
Shared ring buffers between user space and kernel eliminate per-operation syscalls and context switches.
Runtime
Zero GC pauses
Rust's ownership model provides memory safety at compile time, with no garbage collector to disturb latency at runtime.
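A minimal sketch of what that means in practice, purely illustrative: ownership is checked by the compiler, so buffers are freed deterministically and there is no collector to pause the hot path.

```rust
fn main() {
    let payload = vec![0u8; 1024]; // heap buffer owned by `payload`
    let moved = payload;           // ownership moves: no copy, no refcount, no GC bookkeeping
    // println!("{}", payload.len()); // would not compile: `payload` was moved out
    println!("{} bytes", moved.len()); // freed deterministically when `moved` leaves scope
}
```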
Architecture
Single binary. Append-only log.
No ZooKeeper. No external coordination service. No hidden dependencies. Apache Iggy is one binary that runs everything.
Data hierarchy
STREAM
Top-level namespace
TOPIC
Logical category in a stream
PARTITION
Parallel ordered log
SEGMENT
.log + .index on disk
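A rough sketch of that hierarchy as types. The names and fields below are illustrative, not Iggy's internal structs; the point is that a segment is just an append-only .log file paired with an .index file.

```rust
#![allow(dead_code)] // type sketch only; not Iggy's actual types or on-disk layout
use std::path::PathBuf;

struct Stream { id: u32, topics: Vec<Topic> }          // top-level namespace
struct Topic { id: u32, partitions: Vec<Partition> }   // logical category in a stream
struct Partition { id: u32, segments: Vec<Segment> }   // parallel ordered log
struct Segment {
    start_offset: u64, // first message offset stored in this segment (assumed naming)
    log: PathBuf,      // .log file: raw message bytes, append-only
    index: PathBuf,    // .index file: maps offsets to byte positions
}
```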
O(1) offset lookup
Binary index maps message offsets to byte positions in constant time (see the sketch below).
Consumer groups
Multiple independent consumers track their own offset per partition.
Configurable retention
Time-based or size-based retention per stream with no performance penalty.
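How a constant-time lookup like that can work, as a hedged sketch: with fixed-size index entries, one per message in this illustration, finding the byte position of an offset is pure arithmetic. Names and layout are illustrative, not Iggy's on-disk format.

```rust
struct IndexEntry {
    position: u32, // byte position of the message in the segment's .log file
}

struct SegmentIndex {
    base_offset: u64,         // offset of the first message in the segment
    entries: Vec<IndexEntry>, // fixed-size entries, one per message in this sketch
}

impl SegmentIndex {
    /// O(1): the entry for an offset is located by arithmetic, not by searching.
    fn position_of(&self, offset: u64) -> Option<u32> {
        let relative = offset.checked_sub(self.base_offset)? as usize;
        self.entries.get(relative).map(|entry| entry.position)
    }
}

fn main() {
    let index = SegmentIndex {
        base_offset: 1_000,
        entries: vec![IndexEntry { position: 0 }, IndexEntry { position: 128 }],
    };
    assert_eq!(index.position_of(1_001), Some(128)); // constant time, no scan
}
```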
Zero-Copy Deserialization
Traditional deserialization copies every field into a newly allocated struct for every message. At millions of messages per second, those allocations dominate CPU time. Iggy eliminates them entirely.
IggyMessageView is a reference into the original byte buffer, not a parsed copy. IggyMessageBatch is flat bytes with index offsets. The message is the bytes. No transformation. No owned struct.
Before: Traditional Serialization
Every field copied. New struct allocated. Every message. Every time.
Iggy: Zero-Copy Data Flow
One flat buffer. Zero transformations.
Wire
arrives as bytes
Cache
same buffer in memory
Disk
appended directly
Client
served from cache
0 copies · 0 allocations · same bytes end to end
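To make the idea concrete, here is a simplified, illustrative view type in the spirit of IggyMessageView. The header layout is assumed for this sketch and is not Iggy's actual wire format; the point is that accessors return slices into the original buffer instead of allocating owned structs.

```rust
// Illustrative zero-copy view: borrows the wire buffer, never copies fields out of it.
struct MessageView<'a> {
    buffer: &'a [u8], // the exact bytes that arrived from the socket
}

impl<'a> MessageView<'a> {
    // Header layout assumed for this sketch: 8-byte offset, 4-byte payload length.
    fn offset(&self) -> u64 {
        u64::from_le_bytes(self.buffer[0..8].try_into().unwrap())
    }

    fn payload(&self) -> &'a [u8] {
        let len = u32::from_le_bytes(self.buffer[8..12].try_into().unwrap()) as usize;
        &self.buffer[12..12 + len] // a slice into the original buffer, no allocation
    }
}

fn main() {
    // One flat buffer: header + payload, exactly as it could sit in the cache or on disk.
    let mut frame = Vec::new();
    frame.extend_from_slice(&42u64.to_le_bytes());  // offset
    frame.extend_from_slice(&5u32.to_le_bytes());   // payload length
    frame.extend_from_slice(b"hello");              // payload bytes
    let view = MessageView { buffer: &frame };
    assert_eq!(view.offset(), 42);
    assert_eq!(view.payload(), b"hello"); // zero copies, zero owned structs
}
```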
io_uring: True Async I/O
epoll works on readiness: "tell me when the file descriptor is ready." But regular files are effectively always ready, so disk I/O still blocks the OS thread no matter what the scheduler does. io_uring works on completions: the kernel does the work while your thread does something else.
Iggy submits batches of I/O operations to a shared ring buffer. The kernel drains them asynchronously. No per-operation syscalls. No context switches. And because Iggy's executor is custom-built for io_uring's completion model, the driver and scheduler stay in perfect lockstep, something Tokio's readiness-based epoll model fundamentally cannot achieve.
epoll: Readiness model
Disk I/O blocks the OS thread; Tokio cannot help with this.
io_uring: Completion model
Submission Queue
Completion Queue
Zero syscalls per I/O · kernel polls ring directly
Iggy ships a custom thread-per-core executor built on io_uring, decoupled from Tokio's readiness model, so the driver and executor work in perfect lockstep.
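A minimal sketch of the completion model, written against the community io-uring crate rather than Iggy's internal executor (exact method signatures vary slightly between crate versions): describe the operation, push it onto the submission ring, and later harvest the result from the completion ring.

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

fn read_at(file: &std::fs::File, offset: u64, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut ring = IoUring::new(8)?;

    // Describe the read once; the kernel performs it asynchronously.
    let sqe = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as u32)
        .offset(offset)
        .build()
        .user_data(1);

    // Push onto the shared submission ring; a whole batch can go in before one submit.
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit_and_wait(1)?; // a real executor would poll completions instead of blocking here

    // Harvest the completion; `result()` carries the syscall-style return value.
    let cqe = ring.completion().next().expect("completion queue empty");
    if cqe.result() < 0 {
        return Err(std::io::Error::from_raw_os_error(-cqe.result()));
    }
    Ok(cqe.result() as usize)
}
```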
Thread-per-Core: Shared Nothing
Work-stealing schedulers are brilliant at keeping cores busy. They're terrible at keeping latency tails short, because they move tasks between cores, destroying the L1/L2 cache warmth that makes fast paths fast.
Iggy pins one thread to one CPU core, and that thread owns its partition's data exclusively. No Arc<Mutex<>>. No cross-core contention. The L1 and L2 caches stay hot, every read is cache-local, and P99.9 latency drops to a fraction of what work-stealing delivers on the same hardware.
Work Stealing: Shared State
Tasks can migrate across cores
Tasks can migrate across cores
Tasks can migrate across cores
Lock contention, cache misses, unpredictable tail latency.
Thread-per-Core: Shared Nothing
Partition 0
owns its own data
No cross-core traffic. No locks. No contention.
Partition 1
owns its own data
No cross-core traffic. No locks. No contention.
Partition 2
owns its own data
No cross-core traffic. No locks. No contention.
CPU affinity + shared-nothing = predictable latency at every percentile.
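A hedged sketch of the pinning idea using the core_affinity crate (Iggy's actual executor and partition ownership are more involved): one OS thread per core, each owning its partition's state outright, so nothing is ever shared or locked across cores.

```rust
use std::thread;

fn main() {
    // One thread per core reported by the OS.
    let core_ids = core_affinity::get_core_ids().expect("failed to enumerate cores");

    let handles: Vec<_> = core_ids
        .into_iter()
        .enumerate()
        .map(|(partition_id, core_id)| {
            thread::spawn(move || {
                // Pin this thread so the scheduler never migrates it; caches stay warm.
                core_affinity::set_for_current(core_id);
                // Shared nothing: this partition's log lives only on this thread.
                let mut partition_log: Vec<Vec<u8>> = Vec::new();
                partition_log.push(format!("message for partition {partition_id}").into_bytes());
                // ... the partition's event loop would run here ...
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```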
P99.9 tail latency: Thread-per-Core vs Work Stealing
Benchmarked on AWS i3en.3xlarge · Intel Xeon 8259CL @ 2.50 GHz. Thread-per-core architecture delivers materially lower and more consistent tail latencies at every percentile.
Progress
We keep beating our own benchmarks.
Every optimization is measurable. Every commit is a benchmark run. The numbers improve with each release, and we publish them publicly so you can verify every claim.
Q1 2025
Foundation
feat(server): append-only log foundation
Persistent append-only log with Stream → Topic → Partition → Segment hierarchy. The baseline everything else is measured against.
Apr 2025
Zero-Copy
↓ memory pressure · ↑ throughput
feat(server): implement zero-copy message handling
IggyMessageView and IggyMessageBatch eliminate every per-message allocation. Same bytes flow wire → cache → disk → client with zero copies.
Nov 2025
io_uring + Thread-per-Core
↓ P99.9 tail latency · ↑ CPU efficiency
feat(io_uring): implement thread per core io_uring
Custom completion-model executor paired with CPU-pinned, shared-nothing partitions. The combination that drives sub-ms median and tight P99.9 tails.
2026 →
Always Improving
Ongoing
Chasing Microseconds
Viewstamped Replication, custom allocators, NUMA awareness. There is always one more thing to optimize, and we will find it.
Fully reproducible. Open methodology. Run them yourself.
See what the engineering delivers.
Sub-millisecond P99+ latency and multi-GB/s throughput on a single node. Start with a free deployment and measure it yourself.