Performance Tuning

This guide covers strategies for maximising throughput and efficiency: worker pool sizing, channel depth, memory management, cache tuning, and profiling.


Worker pool sizing

WorkerPool bounds concurrency for a set of services. The optimal size depends on whether work is I/O-bound or CPU-bound.

I/O-bound services (HTTP, browser, AI APIs)

Network operations spend most of their time waiting for remote responses. A large pool hides latency with concurrency:

```rust
// Rule of thumb: 10–100× logical CPU count
let concurrency = num_cpus::get() * 50;
let queue_depth = concurrency * 4; // back-pressure buffer

let pool = WorkerPool::new(concurrency, queue_depth);
```

Start with 50× and adjust based on:

  • Target API rate limits (reduce if hitting 429s)
  • Available file descriptors (ulimit -n)
  • Memory pressure (every in-flight request buffers a response body)

CPU-bound services (parsing, transforms, NLP)

CPU work cannot overlap on the same core. Match the pool size to the core count to avoid context-switch overhead (note that num_cpus::get() counts logical cores; num_cpus::get_physical() excludes SMT siblings):

```rust
// Rule of thumb: 1–2× logical CPU count
let concurrency = num_cpus::get();
let queue_depth = concurrency * 2;

let pool = WorkerPool::new(concurrency, queue_depth);
```

Mixed workloads

When a pipeline mixes HTTP fetching and CPU-heavy extraction, use separate pools per service type:

```rust
let http_pool = WorkerPool::new(num_cpus::get() * 40, 512);
let cpu_pool  = WorkerPool::new(num_cpus::get(), 32);
```

Back-pressure

queue_depth controls how many tasks accumulate before callers block. A shallow queue applies back-pressure sooner, limiting memory. A deep queue smooths bursts but can hide overload. A ratio of 4–8× concurrency is a good starting point.
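WorkerPool's internals are not shown here, but the blocking behaviour it relies on can be demonstrated with the standard library's bounded channel. In this sketch, try_send stands in for the moment a blocking send() would park the caller:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Returns how many tasks a bounded queue accepts before pushing back.
fn fill_until_full(depth: usize, tasks: usize) -> usize {
    let (tx, _rx) = sync_channel::<usize>(depth);
    let mut accepted = 0;
    for t in 0..tasks {
        match tx.try_send(t) {
            Ok(()) => accepted += 1,
            // Queue full: a blocking send() would park here — back-pressure.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

fn main() {
    // With depth 4, exactly 4 tasks are accepted before callers would block.
    assert_eq!(fill_until_full(4, 10), 4);
}
```

With a queue of depth `concurrency * 4`, producers run ahead of workers by at most four waves of work before they are forced to wait.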


Channel sizing

Stygian uses bounded tokio::sync::mpsc channels internally.

| Channel depth | Throughput | Latency | Memory |
|---|---|---|---|
| 1 | Low | Low | Minimal |
| 64 (default) | Medium | Medium | Low |
| 512 | High | Higher | Moderate |
| Unbounded | Max | Variable | Uncapped |

Always use bounded channels in library code. Unbounded channels are only appropriate for producers with a known-small, finite rate (e.g. a timer).

Detect saturation at runtime: when capacity - len consistently approaches zero, the downstream consumer is a bottleneck — either scale it out or increase depth.
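One dependency-free way to expose that signal is to wrap a bounded channel with an occupancy counter. This is an illustrative sketch, not Stygian's own instrumentation:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::Arc;

// Bounded channel wrapper that tracks occupancy so saturation
// (remaining capacity near zero) is observable at runtime.
struct GaugedSender<T> {
    tx: SyncSender<T>,
    len: Arc<AtomicUsize>,
    depth: usize,
}

impl<T> GaugedSender<T> {
    fn send(&self, v: T) {
        self.tx.send(v).unwrap();
        self.len.fetch_add(1, Relaxed);
    }

    /// Remaining slots. Persistently near zero ⇒ the consumer lags.
    /// (A real consumer would fetch_sub the counter on every recv.)
    fn remaining(&self) -> usize {
        self.depth - self.len.load(Relaxed).min(self.depth)
    }
}

fn gauged<T>(depth: usize) -> (GaugedSender<T>, Receiver<T>) {
    let (tx, rx) = sync_channel(depth);
    let len = Arc::new(AtomicUsize::new(0));
    (GaugedSender { tx, len, depth }, rx)
}

fn main() {
    let (tx, _rx) = gauged::<u32>(8);
    for i in 0..6 {
        tx.send(i);
    }
    // 6 of 8 slots used: remaining capacity is 2 — nearing saturation.
    assert_eq!(tx.remaining(), 2);
}
```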


Memory optimisation

Limit response body size

Unbounded response buffering will exhaust memory on large responses:

```rust
const MAX_BODY_BYTES: usize = 8 * 1024 * 1024; // 8 MiB

// Stream the body and enforce the cap per chunk, so an oversized
// response is rejected before it is ever fully buffered.
// (Checking bytes().len() after the fact would buffer it anyway.)
let mut body = Vec::new();
while let Some(chunk) = response.chunk().await? {
    if body.len() + chunk.len() > MAX_BODY_BYTES {
        return Err(ScrapingError::ResponseTooLarge(body.len() + chunk.len()));
    }
    body.extend_from_slice(&chunk);
}
```

Arena allocation for wave processing

Allocate all intermediate data for a wave from a single arena and free it in one shot:

```toml
# Cargo.toml
[dependencies]
bumpalo = "3"
```

```rust
use bumpalo::Bump;

async fn process_wave(inputs: &[ServiceInput]) {
    let arena = Bump::new();
    let urls: Vec<&str> = inputs
        .iter()
        .map(|i| arena.alloc_str(&i.url))
        .collect();
    // ... process `urls` for this wave ...
    // arena dropped here — all scratch freed in one operation
}
```

Buffer pooling

Avoid allocating a fresh Vec<u8> for every HTTP response — reuse from a pool:

```rust
use tokio::sync::Mutex;

struct BufferPool {
    pool:     Mutex<Vec<Vec<u8>>>,
    buf_size: usize,
}

impl BufferPool {
    async fn acquire(&self) -> Vec<u8> {
        let mut p = self.pool.lock().await;
        p.pop().unwrap_or_else(|| Vec::with_capacity(self.buf_size))
    }

    async fn release(&self, mut buf: Vec<u8>) {
        buf.clear();
        self.pool.lock().await.push(buf);
    }
}
```
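The same pattern works outside async code with std::sync::Mutex. A minimal synchronous sketch, showing that a recycled buffer comes back empty but keeps its allocation:

```rust
use std::sync::Mutex;

// Synchronous variant of the pool above: reuse cleared buffers
// instead of allocating a fresh Vec<u8> per response.
struct SyncBufferPool {
    pool: Mutex<Vec<Vec<u8>>>,
    buf_size: usize,
}

impl SyncBufferPool {
    fn new(buf_size: usize) -> Self {
        Self { pool: Mutex::new(Vec::new()), buf_size }
    }

    fn acquire(&self) -> Vec<u8> {
        self.pool
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.buf_size))
    }

    fn release(&self, mut buf: Vec<u8>) {
        buf.clear(); // drop contents, keep the capacity
        self.pool.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = SyncBufferPool::new(4096);
    let mut buf = pool.acquire();
    buf.extend_from_slice(b"response body");
    pool.release(buf);

    // The recycled buffer is empty but its capacity survives.
    let reused = pool.acquire();
    assert!(reused.is_empty());
    assert!(reused.capacity() >= 4096);
}
```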

Cache tuning

Measure hit rate first

A cache that rarely hits wastes memory and adds lookup overhead:

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

struct CacheMetrics { hits: AtomicU64, misses: AtomicU64 }

fn hit_rate(metrics: &CacheMetrics) -> f64 {
    let hits   = metrics.hits.load(Relaxed)   as f64;
    let misses = metrics.misses.load(Relaxed) as f64;
    if hits + misses == 0.0 { return 0.0; }
    hits / (hits + misses)
}
```

Target ≥ 0.80 for URL-keyed caches before increasing capacity.

LRU vs DashMap

| | BoundedLruCache | DashMapCache |
|---|---|---|
| Eviction policy | LRU (capacity-based) | TTL-based (background task) |
| Best for | Fixed working set | Time-sensitive data |
| Concurrency overhead | Higher (LRU list mutex) | Lower (sharded map) |

For AI extraction results with a 24 h freshness requirement, use DashMapCache with ttl = Duration::from_secs(86_400). For deduplication of seen URLs, BoundedLruCache is the right choice.
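DashMapCache's real API lives in the crate; to make the TTL semantics concrete, here is a minimal stdlib sketch in which entries older than ttl simply read as misses (the real type is sharded and evicts via a background task):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal TTL cache: a stale entry is treated as a miss on read.
struct TtlCache<V> {
    ttl: Duration,
    map: HashMap<String, (Instant, V)>,
}

impl<V> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, map: HashMap::new() }
    }

    fn insert(&mut self, key: String, value: V) {
        self.map.insert(key, (Instant::now(), value));
    }

    fn get(&self, key: &str) -> Option<&V> {
        self.map
            .get(key)
            .and_then(|(written, v)| (written.elapsed() < self.ttl).then_some(v))
    }
}

fn main() {
    // 24 h freshness, as for AI extraction results.
    let mut fresh = TtlCache::new(Duration::from_secs(86_400));
    fresh.insert("https://example.com".into(), "extracted");
    assert_eq!(fresh.get("https://example.com"), Some(&"extracted"));

    // A zero TTL makes every entry immediately stale.
    let mut stale = TtlCache::new(Duration::ZERO);
    stale.insert("https://example.com".into(), "extracted");
    assert_eq!(stale.get("https://example.com"), None);
}
```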


Rayon vs Tokio

| | Tokio | Rayon |
|---|---|---|
| Use for | I/O-bound work (network, disk) | CPU-bound work (parsing, transforms) |
| Blocking? | No — async tasks never block threads | Yes — tasks block Rayon threads |
| Thread pool | Shared Tokio runtime | Separate Rayon thread pool |

Rule: never call synchronous CPU-heavy work directly in a Tokio task. Offload with tokio::task::spawn_blocking or rayon::spawn:

```rust
// CPU-heavy HTML parsing — do NOT run this inline in an async fn
let html = response.text().await?;
let extracted = tokio::task::spawn_blocking(move || {
    heavy_parsing(html)
}).await??;
```

Profiling

Flamegraph with cargo-flamegraph

```sh
cargo install flamegraph

# Profile a benchmark
cargo flamegraph --bench dag_executor -- --bench
```

Criterion benchmarks

The stygian-graph crate ships Criterion benchmarks in benches/:

```sh
cargo bench                        # run all benchmarks
cargo bench dag_executor           # run a specific group
cargo bench -- --save-baseline v0  # save baseline for comparison
cargo bench -- --load-baseline v0  # compare against baseline
```
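Criterion bench targets must opt out of libtest's default harness in the crate manifest. A typical registration looks like this (illustrative — check stygian-graph's actual Cargo.toml for the exact names and versions):

```toml
# Cargo.toml — illustrative bench registration
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "dag_executor"   # matches benches/dag_executor.rs
harness = false         # let Criterion provide main()
```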

Key benchmark targets:

| Benchmark | What it measures |
|---|---|
| dag_executor/wave_10 | 10-node wave execution overhead |
| dag_executor/wave_100 | 100-node wave execution overhead |
| http_adapter/single | Single HTTP request round-trip |
| cache/lru_hit | LRU cache read under contention |

Benchmarks (Apple M4 Pro)

| Operation | Latency |
|---|---|
| DAG executor overhead per wave | ~50 µs |
| HTTP adapter (cached DNS) | ~2 ms |
| Browser acquisition (warm pool) | <100 ms |
| LRU cache read (1 M ops/s) | ~1 µs |