Introduction
stygian is a high-performance web scraping toolkit for Rust, delivered as two complementary crates in a single workspace.
| Crate | Purpose |
|---|---|
| stygian-graph | Graph-based scraping engine — DAG pipelines, AI extraction, distributed execution |
| stygian-browser | Anti-detection browser automation — stealth profiles, browser pooling, CDP automation |
Both crates share a common philosophy: zero-cost abstractions, extreme composability, and secure defaults.
At a glance
Design goals
- Hexagonal architecture — the domain core has zero I/O dependencies; all external capabilities are declared as port traits and injected via adapters.
- DAG execution — scraping pipelines are directed acyclic graphs. Nodes run concurrently within each topological wave, maximising parallelism.
- AI-first extraction — Claude, GPT-4o, Gemini, GitHub Copilot, and Ollama are first-class adapters. Structured data flows out of raw HTML without writing parsers.
- Anti-bot resilience — the browser crate ships stealth scripts that pass Cloudflare, DataDome, PerimeterX, and Akamai checks at the Advanced stealth level.
- Fault-tolerant — circuit breakers, retry policies, and idempotency keys are built into the execution path, not bolted on.
Minimum supported Rust version
1.94.0 — Rust 2024 edition. Requires stable toolchain only.
Installation
Add both crates to Cargo.toml:
```toml
[dependencies]
stygian-graph = "0.2"
stygian-browser = "0.2"  # optional — only needed for JS-rendered pages
tokio = { version = "1", features = ["full"] }
serde_json = "1"
```
Enable optional feature groups on stygian-graph:
```toml
stygian-graph = { version = "0.2", features = ["browser", "ai-claude", "distributed"] }
```
Available features:
| Feature | Includes |
|---|---|
| browser | BrowserAdapter backed by stygian-browser |
| ai-claude | Anthropic Claude adapter |
| ai-openai | OpenAI adapter |
| ai-gemini | Google Gemini adapter |
| ai-copilot | GitHub Copilot adapter |
| ai-ollama | Ollama (local) adapter |
| distributed | Redis/Valkey work queue adapter |
| metrics | Prometheus metrics export |
Quick start — scraping pipeline
```rust
use stygian_graph::{Pipeline, adapters::HttpAdapter};
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = json!({
        "nodes": [
            {"id": "fetch", "service": "http"},
            {"id": "extract", "service": "ai_claude"}
        ],
        "edges": [{"from": "fetch", "to": "extract"}]
    });

    let pipeline = Pipeline::from_config(config)?;
    let results = pipeline.execute(json!({"url": "https://example.com"})).await?;
    println!("{}", serde_json::to_string_pretty(&results)?);
    Ok(())
}
```
Quick start — browser automation
```rust
use stygian_browser::{BrowserConfig, BrowserPool, WaitUntil};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pool = BrowserPool::new(BrowserConfig::default()).await?;
    let handle = pool.acquire().await?;
    let mut page = handle.browser().new_page().await?;

    page.navigate(
        "https://example.com",
        WaitUntil::Selector("body".to_string()),
        Duration::from_secs(30),
    ).await?;

    println!("Title: {}", page.title().await?);
    handle.release().await;
    Ok(())
}
```
Repository layout
```text
stygian/
├── crates/
│   ├── stygian-graph/      # Scraping engine
│   └── stygian-browser/    # Browser automation
├── book/                   # This documentation (mdBook)
├── docs/                   # Architecture reference docs
├── examples/               # Example pipeline configs (.toml)
└── .github/workflows/      # CI, release, security, docs
```
Source, issues, and pull requests live at github.com/greysquirr3l/stygian.
Documentation
| Resource | URL |
|---|---|
| This guide | greysquirr3l.github.io/stygian |
| API reference (stygian-graph) | greysquirr3l.github.io/stygian/api/stygian_graph |
| API reference (stygian-browser) | greysquirr3l.github.io/stygian/api/stygian_browser |
| crates.io (stygian-graph) | crates.io/crates/stygian-graph |
| crates.io (stygian-browser) | crates.io/crates/stygian-browser |
Architecture
stygian-graph is built around the Hexagonal Architecture (Ports & Adapters) pattern.
The domain core is pure Rust with zero I/O dependencies. All external capabilities — HTTP, AI,
caching, queues — are declared as port traits and injected from the outside.
Layer diagram
```text
┌──────────────────────────────────────────────────────┐
│                  CLI / Entry Points                  │
│               (src/bin/stygian.rs)                   │
├──────────────────────────────────────────────────────┤
│                 Application Layer                    │
│     ServiceRegistry · PipelineParser · Metrics       │
├──────────────────────────────────────────────────────┤
│                    Domain Core                       │
│        Pipeline · DagExecutor · WorkerPool           │
│        IdempotencyKey · PipelineTypestate            │
├──────────────────────────────────────────────────────┤
│                       Ports                          │
│   ScrapingService · AIProvider · CachePort · ...     │
├──────────────────────────────────────────────────────┤
│                      Adapters                        │
│   http · claude · openai · gemini · browser · ...    │
└──────────────────────────────────────────────────────┘
```
The dependency arrow always points inward:
CLI → Application → Domain ← Ports ← Adapters
The domain never imports from adapters. Adapters implement port traits and are injected at startup. This lets you swap any adapter — or mock it in tests — without touching business logic.
Domain layer (src/domain/)
Pure Rust. Only std, serde, and arithmetic/pure-data crates allowed. No tokio, no
reqwest, no file I/O.
| Type | File | Purpose |
|---|---|---|
| Pipeline | domain/graph.rs | Owned graph of Nodes and Edges; wraps a petgraph DAG |
| DagExecutor | domain/graph.rs | Topological sort → wave-based concurrent execution |
| WorkerPool | domain/executor.rs | Bounded Tokio worker pool with back-pressure |
| IdempotencyKey | domain/idempotency.rs | ULID-based key for safe retries |
Pipeline typestate
Pipelines enforce their lifecycle at compile time using the typestate pattern:
```rust
use stygian_graph::domain::pipeline::{
    PipelineUnvalidated, PipelineValidated, PipelineExecuting, PipelineComplete,
};

let p: PipelineUnvalidated = PipelineUnvalidated::new("crawl", metadata);
let p: PipelineValidated = p.validate()?;     // rejects invalid graphs here
let p: PipelineExecuting = p.execute();       // only valid pipelines may execute
let p: PipelineComplete = p.complete(result); // only executing pipelines may complete
```
Out-of-order transitions are compiler errors. Phantom types carry zero runtime cost.
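The mechanism behind this is plain Rust. A minimal self-contained sketch of the typestate pattern — the type and method names here are illustrative, not stygian's actual API:

```rust
use std::marker::PhantomData;

// States are zero-sized marker types; they exist only at compile time.
struct Unvalidated;
struct Validated;

// The pipeline carries its lifecycle state as a phantom type parameter.
struct Pipeline<S> {
    id: String,
    _state: PhantomData<S>,
}

impl Pipeline<Unvalidated> {
    fn new(id: &str) -> Self {
        Pipeline { id: id.to_string(), _state: PhantomData }
    }

    // Consumes the unvalidated pipeline; only the validated form survives.
    fn validate(self) -> Result<Pipeline<Validated>, String> {
        if self.id.is_empty() {
            return Err("pipeline id must not be empty".into());
        }
        Ok(Pipeline { id: self.id, _state: PhantomData })
    }
}

impl Pipeline<Validated> {
    // execute() exists only on the validated state, so calling it on an
    // unvalidated pipeline is a compile-time error, not a runtime check.
    fn execute(&self) -> String {
        format!("executing {}", self.id)
    }
}
```

Because `PhantomData<S>` is zero-sized, the state parameter adds no runtime cost.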
Ports layer (src/ports.rs)
Port traits are the only interface the domain exposes to infrastructure. No adapter code ever leaks inward.
```rust
/// Any scraping backend — HTTP, browser, Playwright, custom.
pub trait ScrapingService: Send + Sync + 'static {
    fn name(&self) -> &'static str;
    async fn execute(&self, input: ServiceInput) -> Result<ServiceOutput>;
}

/// Any LLM — cloud APIs or local Ollama.
pub trait AIProvider: Send + Sync {
    fn capabilities(&self) -> ProviderCapabilities;
    async fn extract(&self, prompt: String, schema: Value) -> Result<Value>;
    async fn stream_extract(
        &self,
        prompt: String,
        schema: Value,
    ) -> Result<BoxStream<'static, Result<Value>>>;
}

/// Any cache backend — LRU, DashMap, Redis.
pub trait CachePort: Send + Sync {
    async fn get(&self, key: &str) -> Result<Option<String>>;
    async fn set(&self, key: &str, value: String, ttl: Option<Duration>) -> Result<()>;
    async fn invalidate(&self, key: &str) -> Result<()>;
    async fn exists(&self, key: &str) -> Result<bool>;
}

/// Circuit breaker abstraction.
pub trait CircuitBreaker: Send + Sync {
    fn state(&self) -> CircuitState;
    fn record_success(&self);
    fn record_failure(&self);
    fn attempt_reset(&self) -> bool;
}

/// Token-bucket rate limiter abstraction.
pub trait RateLimiter: Send + Sync {
    async fn check_rate_limit(&self, key: &str) -> Result<bool>;
    async fn record_request(&self, key: &str) -> Result<()>;
}
```
Adapters layer (src/adapters/)
Adapters implement port traits and handle real I/O. They are never imported by the domain.
| Adapter | Port | Notes |
|---|---|---|
| HttpAdapter | ScrapingService | reqwest, UA rotation, cookie jar, retry |
| BrowserAdapter | ScrapingService | chromiumoxide via stygian-browser |
| ClaudeProvider | AIProvider | Anthropic API, streaming |
| OpenAiProvider | AIProvider | OpenAI Chat Completions API |
| GeminiProvider | AIProvider | Google Gemini API |
| CopilotProvider | AIProvider | GitHub Copilot API |
| OllamaProvider | AIProvider | Local LLM via Ollama HTTP API |
| BoundedLruCache | CachePort | LRU eviction, NonZeroUsize capacity limit |
| DashMapCache | CachePort | Concurrent hash map with TTL cleanup task |
| CircuitBreakerImpl | CircuitBreaker | Sliding-window failure threshold |
| NoopCircuitBreaker | CircuitBreaker | Passthrough — useful in tests |
| NoopRateLimiter | RateLimiter | Always allows — useful in tests |
Adding a new adapter
- Identify the port trait your adapter will implement (e.g. `ScrapingService`).
- Create `src/adapters/my_adapter.rs`.
- Implement the trait — use native `async fn` (Rust 2024; no `#[async_trait]` wrapper needed).
- Re-export from `src/adapters/mod.rs`.
- Register via `ServiceRegistry` at startup.
No changes to domain code are required.
Application layer (src/application/)
Orchestrates adapters, holds runtime configuration, and owns the ServiceRegistry.
| Module | Role |
|---|---|
| ServiceRegistry | Runtime map of name → Arc<dyn ScrapingService> |
| PipelineParser | Parses JSON/TOML pipeline configs into Pipeline |
| MetricsCollector | Prometheus counter/histogram/gauge facade |
| DagExecutor (app) | Top-level entry point: parse → validate → execute → collect |
| Config | Environment-driven configuration with validation |
DAG execution model
Pipeline execution proceeds in topological waves:
- Parse — JSON/TOML config → `Pipeline` domain object.
- Validate — check node uniqueness, edge validity, absence of cycles (Kahn's algorithm).
- Sort — compute topological order; group into waves of independent nodes.
- Execute wave-by-wave — all nodes in a wave run concurrently via `tokio::spawn`.
- Collect — outputs from each wave become inputs to the next.
Wave-based execution maximises parallelism while respecting data dependencies.
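Wave grouping is a small variation on Kahn's algorithm: repeatedly peel off all nodes whose dependencies are already satisfied. A self-contained sketch — illustrative only, not the `DagExecutor` source:

```rust
use std::collections::HashMap;

/// Group DAG nodes into topological "waves": every node in a wave has all
/// of its dependencies in earlier waves, so a wave can run concurrently.
/// Returns None on a cycle (or an edge naming an unknown node).
fn topological_waves(
    nodes: &[&str],
    edges: &[(&str, &str)],
) -> Option<Vec<Vec<String>>> {
    let mut indegree: HashMap<&str, usize> = nodes.iter().map(|&n| (n, 0)).collect();
    let mut children: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in edges {
        *indegree.get_mut(to)? += 1; // unknown target ⇒ None
        children.entry(from).or_default().push(to);
    }

    // Seed with all nodes that have no incoming edges.
    let mut ready: Vec<&str> = nodes.iter().copied().filter(|n| indegree[n] == 0).collect();
    let mut waves = Vec::new();
    let mut seen = 0;
    while !ready.is_empty() {
        let wave: Vec<&str> = std::mem::take(&mut ready);
        seen += wave.len();
        for n in &wave {
            for &c in children.get(n).unwrap_or(&Vec::new()) {
                let d = indegree.get_mut(c).unwrap();
                *d -= 1;
                if *d == 0 {
                    ready.push(c); // all deps satisfied ⇒ next wave
                }
            }
        }
        waves.push(wave.iter().map(|s| s.to_string()).collect());
    }

    // Nodes left with non-zero indegree mean the graph has a cycle.
    if seen == nodes.len() { Some(waves) } else { None }
}
```

For the diamond graph fetch → {a, b} → merge this yields three waves, with a and b sharing the middle wave.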
Building Pipelines
A stygian pipeline is a directed acyclic graph (DAG) of service nodes. Pipelines can be defined in JSON, TOML, or built programmatically in Rust.
JSON pipeline format
```json
{
  "id": "product-scraper",
  "nodes": [
    {
      "id": "fetch_html",
      "service": "http",
      "config": {
        "timeout_ms": 10000,
        "user_agent": "Mozilla/5.0 (compatible; stygian/0.1)"
      }
    },
    {
      "id": "render_js",
      "service": "browser",
      "config": {
        "wait_strategy": "network_idle",
        "stealth_level": "advanced"
      }
    },
    {
      "id": "extract_data",
      "service": "ai_claude",
      "config": {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 2048,
        "schema": {
          "title": "string",
          "price": "number",
          "availability": "boolean",
          "images": ["string"]
        }
      }
    }
  ],
  "edges": [
    {"from": "fetch_html", "to": "render_js"},
    {"from": "render_js", "to": "extract_data"}
  ]
}
```
Field reference
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | no | Human-readable pipeline identifier |
| nodes[].id | string | yes | Unique node identifier within the pipeline |
| nodes[].service | string | yes | Registered service name (must exist in ServiceRegistry) |
| nodes[].config | object | no | Service-specific configuration (passed as ServiceInput.config) |
| edges[].from | string | yes | Source node id |
| edges[].to | string | yes | Target node id |
TOML pipeline format
The same structure works in TOML:
```toml
id = "product-scraper"

[[nodes]]
id = "fetch_html"
service = "http"

[nodes.config]
timeout_ms = 10000

[[nodes]]
id = "extract_data"
service = "ai_claude"

[nodes.config]
model = "claude-3-5-sonnet-20241022"
max_tokens = 2048

[[edges]]
from = "fetch_html"
to = "extract_data"
```
Programmatic builder
For pipelines constructed at runtime:
```rust
use stygian_graph::domain::pipeline::PipelineUnvalidated;
use stygian_graph::domain::graph::{Node, Edge};
use serde_json::json;

let pipeline = PipelineUnvalidated::builder()
    .id("product-scraper")
    .node(Node { id: "fetch".into(), service: "http".into(), config: json!({}) })
    .node(Node {
        id: "extract".into(),
        service: "ai_claude".into(),
        config: json!({ "model": "claude-3-5-sonnet-20241022" }),
    })
    .edge(Edge { from: "fetch".into(), to: "extract".into() })
    .build()?;

let validated = pipeline.validate()?;
let results = validated.execute().await?;
```
Pipeline validation
validate() runs four checks before the pipeline may execute:
- Node uniqueness — all `node.id` values are distinct.
- Edge validity — every edge references nodes that exist.
- Cycle detection — Kahn's topological sort; fails if a cycle is detected.
- Connectivity — all nodes are reachable from at least one source node.
Any validation failure returns a typed PipelineError — never panics.
```rust
use stygian_graph::domain::error::PipelineError;

match pipeline.validate() {
    Ok(validated) => { /* proceed */ }
    Err(PipelineError::DuplicateNode(id)) => eprintln!("duplicate node: {id}"),
    Err(PipelineError::CycleDetected) => eprintln!("pipeline has a cycle"),
    Err(PipelineError::UnknownEdge { from, to }) =>
        eprintln!("edge {from} → {to} references unknown nodes"),
    Err(e) => eprintln!("validation failed: {e}"),
}
```
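For intuition, the first two checks reduce to set membership. A standalone sketch with a hypothetical error enum (not stygian's actual `PipelineError`):

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum ValidationError {
    DuplicateNode(String),
    UnknownEdge { from: String, to: String },
}

/// Sketch of the node-uniqueness and edge-validity checks: every node id
/// must be distinct, and every edge must reference known nodes.
fn check_graph(nodes: &[&str], edges: &[(&str, &str)]) -> Result<(), ValidationError> {
    let mut seen = HashSet::new();
    for &n in nodes {
        if !seen.insert(n) {
            // insert() returns false when the id was already present
            return Err(ValidationError::DuplicateNode(n.to_string()));
        }
    }
    for &(from, to) in edges {
        if !seen.contains(from) || !seen.contains(to) {
            return Err(ValidationError::UnknownEdge {
                from: from.to_string(),
                to: to.to_string(),
            });
        }
    }
    Ok(())
}
```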
Idempotency
Every execution is assigned an IdempotencyKey — a ULID that acts as a deduplication
token across retries:
```rust
use stygian_graph::domain::idempotency::IdempotencyKey;

// Auto-generated (recommended)
let key = IdempotencyKey::new();

// Deterministic from a stable input (replays return the same result)
let key = IdempotencyKey::from_input("pipeline-1", "https://example.com/product/123");
```
Pass the key to execute_idempotent(). If the same key is seen again within the TTL,
the cached result is returned immediately without re-executing.
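The TTL-deduplication behaviour can be sketched with a plain map — illustrative only; in stygian the store sits behind a cache adapter, and this sketch is synchronous:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal dedup cache: the first caller for a key computes the result;
/// repeat callers within the TTL get the cached value without re-executing.
struct IdempotencyCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl IdempotencyCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn execute_idempotent<F: FnOnce() -> String>(&mut self, key: &str, run: F) -> String {
        if let Some((at, cached)) = self.entries.get(key) {
            if at.elapsed() < self.ttl {
                return cached.clone(); // replay: skip re-execution
            }
        }
        let result = run();
        self.entries.insert(key.to_string(), (Instant::now(), result.clone()));
        result
    }
}
```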
Branching and fan-out
A node can have multiple outgoing edges. All downstream nodes receive the same output:
```json
{
  "nodes": [
    {"id": "fetch", "service": "http"},
    {"id": "store_raw", "service": "s3"},
    {"id": "extract", "service": "ai_claude"}
  ],
  "edges": [
    {"from": "fetch", "to": "store_raw"},
    {"from": "fetch", "to": "extract"}
  ]
}
```
store_raw and extract run concurrently in the same wave.
Conditional execution
Nodes support an optional condition field (JSONPath expression):
```json
{
  "id": "render_js",
  "service": "browser",
  "condition": "$.content_type == 'application/javascript'"
}
```
When the condition evaluates to false the node is skipped and its outputs are forwarded
as empty, allowing downstream nodes to handle the gap gracefully.
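A toy evaluator for the equality form shown above. The real engine supports full JSONPath; this sketch handles only flat `$.field == 'literal'` comparisons against a string context:

```rust
use std::collections::HashMap;

/// Evaluate a tiny subset of the condition syntax: `$.field == 'literal'`.
/// Anything it cannot parse evaluates to false.
fn eval_condition(expr: &str, ctx: &HashMap<String, String>) -> bool {
    let Some((lhs, rhs)) = expr.split_once("==") else { return false };
    let field = lhs.trim().trim_start_matches("$.");
    let literal = rhs.trim().trim_matches('\'');
    ctx.get(field).map(|v| v == literal).unwrap_or(false)
}
```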
Built-in Adapters
stygian-graph ships adapters for every major scraping and AI workload. All adapters
implement the port traits defined in src/ports.rs and are registered by name in the
ServiceRegistry.
HTTP Adapter
The default content-fetching adapter. Uses reqwest with connection pooling, automatic
redirect following, and configurable retry logic.
```rust
use stygian_graph::adapters::{HttpAdapter, HttpConfig};
use std::time::Duration;

let adapter = HttpAdapter::with_config(HttpConfig {
    timeout: Duration::from_secs(15),
    user_agent: Some("stygian/0.1".to_string()),
    follow_redirects: true,
    max_redirects: 10,
    ..Default::default()
});
```
Registered service name: "http"
| Config field | Default | Description |
|---|---|---|
| timeout | 30 s | Per-request timeout |
| user_agent | None | Override User-Agent header |
| follow_redirects | true | Follow 3xx responses |
| max_redirects | 10 | Redirect chain limit |
| proxy | None | HTTP/HTTPS/SOCKS5 proxy URL |
REST API Adapter
Purpose-built for structured JSON REST APIs. Handles authentication, automatic multi-strategy pagination, JSON response extraction, and retry — without the caller needing to manage any of that manually.
```rust
use stygian_graph::adapters::rest_api::{RestApiAdapter, RestApiConfig};
use stygian_graph::ports::{ScrapingService, ServiceInput};
use serde_json::json;
use std::time::Duration;

let adapter = RestApiAdapter::with_config(RestApiConfig {
    timeout: Duration::from_secs(20),
    max_retries: 3,
    ..Default::default()
});

let input = ServiceInput {
    url: "https://api.github.com/repos/rust-lang/rust/issues".to_string(),
    params: json!({
        "auth": { "type": "bearer", "token": "${env:GITHUB_TOKEN}" },
        "query": { "state": "open", "per_page": "100" },
        "pagination": { "strategy": "link_header", "max_pages": 10 },
        "response": { "data_path": "" }
    }),
};

// let output = adapter.execute(input).await?;
```
Registered service name: "rest-api"
Config fields
| Field | Default | Description |
|---|---|---|
| timeout | 30 s | Per-request timeout |
| max_retries | 3 | Retry attempts on transient errors (429, 5xx, network) |
| retry_base_delay | 1 s | Base for exponential backoff |
| proxy_url | None | HTTP/HTTPS/SOCKS5 proxy URL |
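retry_base_delay seeds an exponential backoff: the delay for attempt n is base · 2ⁿ, capped. A sketch of that computation — the adapter's exact cap and jitter policy may differ:

```rust
use std::time::Duration;

/// Exponential backoff delay: base · 2^attempt, saturating, capped at `cap`.
fn backoff_delay(base: Duration, attempt: u32, cap: Duration) -> Duration {
    let exp = base.saturating_mul(2u32.saturating_pow(attempt));
    exp.min(cap)
}
```

With a 1 s base, attempts 0, 1, 2, 3 wait 1 s, 2 s, 4 s, 8 s until the cap kicks in.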
ServiceInput.params contract
| Param | Required | Default | Description |
|---|---|---|---|
| method | — | "GET" | GET, POST, PUT, PATCH, DELETE, HEAD |
| body | — | — | JSON body for POST/PUT/PATCH |
| body_raw | — | — | Raw string body (takes precedence over body) |
| headers | — | — | Extra request headers object |
| query | — | — | Extra query string parameters object |
| accept | — | "application/json" | Accept header |
| auth | — | none | Authentication object (see below) |
| response.data_path | — | full body | Dot path into the JSON response to extract |
| response.collect_as_array | — | false | Force multi-page results into a JSON array |
| pagination.strategy | — | "none" | "none", "offset", "cursor", "link_header" |
| pagination.max_pages | — | 1 | Maximum pages to fetch |
Authentication
```toml
# Bearer token
[nodes.params.auth]
type = "bearer"
token = "${env:API_TOKEN}"

# HTTP Basic
[nodes.params.auth]
type = "basic"
username = "${env:API_USER}"
password = "${env:API_PASS}"

# API key in header
[nodes.params.auth]
type = "api_key_header"
header = "X-Api-Key"
key = "${env:API_KEY}"

# API key in query string
[nodes.params.auth]
type = "api_key_query"
param = "api_key"
key = "${env:API_KEY}"
```
Pagination strategies
| Strategy | How it works | Best for |
|---|---|---|
| "none" | Single request | Simple endpoints |
| "offset" | Increments page_param from start_page | REST APIs with ?page=N |
| "cursor" | Extracts next cursor from cursor_field (dot path), sends as cursor_param | GraphQL-REST hybrids, Stripe-style |
| "link_header" | Follows RFC 8288 Link: <url>; rel="next" | GitHub API, GitLab API |
Offset example
```toml
[nodes.params.pagination]
strategy = "offset"
page_param = "page"
page_size_param = "per_page"
page_size = 100
start_page = 1
max_pages = 20
```
Cursor example
```toml
[nodes.params.pagination]
strategy = "cursor"
cursor_param = "after"
cursor_field = "meta.next_cursor"
max_pages = 50
```
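Either strategy boils down to a fetch loop. An offset-style sketch, with a closure standing in for the real HTTP request:

```rust
/// Offset-pagination sketch: call `fetch(page)` from `start_page`,
/// stopping at `max_pages` or on the first empty page.
fn paginate_offset<F>(start_page: u32, max_pages: u32, mut fetch: F) -> Vec<String>
where
    F: FnMut(u32) -> Vec<String>,
{
    let mut all = Vec::new();
    for page in start_page..start_page + max_pages {
        let items = fetch(page);
        if items.is_empty() {
            break; // empty page ⇒ no more data
        }
        all.extend(items);
    }
    all
}
```

A cursor strategy differs only in how the "next page" token is derived: from the previous response body instead of a counter.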
Output
ServiceOutput.data — pretty-printed JSON string of the extracted data.
ServiceOutput.metadata:
```json
{
  "url": "https://...",
  "page_count": 3
}
```
OpenAPI Adapter
Purpose-built for APIs with an OpenAPI 3.x
specification document. At runtime the adapter fetches and caches the spec, resolves the
target operation by operationId or "METHOD /path" syntax, binds arguments to path/
query/body parameters, and delegates the actual HTTP call to the inner RestApiAdapter.
```rust
use stygian_graph::adapters::openapi::{OpenApiAdapter, OpenApiConfig};
use stygian_graph::adapters::rest_api::RestApiConfig;
use stygian_graph::ports::{ScrapingService, ServiceInput};
use serde_json::json;
use std::time::Duration;

let adapter = OpenApiAdapter::with_config(OpenApiConfig {
    rest: RestApiConfig {
        timeout: Duration::from_secs(20),
        max_retries: 3,
        ..Default::default()
    },
});

let input = ServiceInput {
    url: "https://petstore3.swagger.io/api/v3/openapi.json".to_string(),
    params: json!({
        "operation": "findPetsByStatus",
        "args": { "status": "available" },
        "auth": {
            "type": "api_key_header",
            "header": "api_key",
            "key": "${env:PETSTORE_API_KEY}"
        },
    }),
};

// let output = adapter.execute(input).await?;
```
Registered service name: "openapi"
ServiceInput contract
| Field | Required | Description |
|---|---|---|
| url | ✅ | URL of the OpenAPI spec document (.json or .yaml) |
| params.operation | ✅ | operationId (e.g. "listPets") or "METHOD /path" (e.g. "GET /pet/findByStatus") |
| params.args | — | Key/value map of path, query, and body arguments (all merged; adapter classifies them) |
| params.auth | — | Same shape as REST API auth |
| params.server.url | — | Override spec's servers[0].url at runtime |
| params.rate_limit | — | Proactive rate throttle (see below) |
Operation resolution
```toml
# By operationId
[nodes.params]
operation = "listPets"

# By HTTP method + path
[nodes.params]
operation = "GET /pet/findByStatus"
```
Argument binding
The adapter classifies each key in params.args against the operation's declared
parameters:
| Parameter location | What happens |
|---|---|
| in: path | Value is substituted into the URL template ({petId} → 42) |
| in: query | Value is appended to the query string |
| in: body (requestBody) | All remaining keys are collected into the JSON request body |
```toml
[nodes.params]
operation = "getPetById"

[nodes.params.args]
petId = 42  # path parameter — substituted into /pet/{petId}
```
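The in: path binding step amounts to template replacement. A sketch — the real adapter also validates and percent-encodes values:

```rust
use std::collections::HashMap;

/// Substitute `{name}` placeholders in an OpenAPI path template with the
/// matching argument values.
fn bind_path(template: &str, args: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (k, v) in args {
        // "{{{k}}}" renders as "{petId}" etc.
        out = out.replace(&format!("{{{k}}}"), v);
    }
    out
}
```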
Rate limiting
Two independent layers operate simultaneously:
- Proactive — an optional `params.rate_limit` block enforced before the HTTP call:

  ```toml
  [nodes.params.rate_limit]
  max_requests = 100
  window_secs = 60
  strategy = "token_bucket"  # or "sliding_window" (default)
  max_delay_ms = 30000
  ```

- Reactive — inherited from `RestApiAdapter`: a 429 response with a `Retry-After` header triggers an automatic sleep-and-retry without any additional configuration.
Spec caching
The parsed OpenAPI document is cached per spec URL for the lifetime of the adapter
instance (or any clone of it — all clones share the same cache Arc). The cache is
populated on the first call and reused on every subsequent call, so spec fetching is
a one-time cost per URL.
Full TOML example
```toml
[[services]]
name = "openapi"
kind = "openapi"

# Unauthenticated list
[[nodes]]
id = "list-pets"
service = "openapi"
url = "https://petstore3.swagger.io/api/v3/openapi.json"

[nodes.params]
operation = "findPetsByStatus"

[nodes.params.args]
status = "available"

# API key auth + server override + proactive rate limit
[[nodes]]
id = "get-pet"
service = "openapi"
url = "https://petstore3.swagger.io/api/v3/openapi.json"

[nodes.params]
operation = "getPetById"

[nodes.params.args]
petId = 1

[nodes.params.auth]
type = "api_key_header"
header = "api_key"
key = "${env:PETSTORE_API_KEY}"

[nodes.params.server]
url = "https://petstore3.swagger.io/api/v3"

[nodes.params.rate_limit]
max_requests = 50
window_secs = 60
strategy = "token_bucket"
```
Output
ServiceOutput.data — pretty-printed JSON string returned by the resolved endpoint.
ServiceOutput.metadata:
```json
{
  "url": "https://...",
  "page_count": 1,
  "openapi_spec_url": "https://petstore3.swagger.io/api/v3/openapi.json",
  "operation_id": "findPetsByStatus",
  "method": "GET",
  "path_template": "/pet/findByStatus",
  "server_url": "https://petstore3.swagger.io/api/v3",
  "resolved_url": "https://petstore3.swagger.io/api/v3/pet/findByStatus"
}
```
Browser Adapter
Delegates to stygian-browser for JavaScript-rendered pages. Requires the browser
feature flag and a Chrome binary.
```rust
use stygian_graph::adapters::{BrowserAdapter, BrowserAdapterConfig};
use stygian_browser::StealthLevel;
use std::time::Duration;

let adapter = BrowserAdapter::with_config(BrowserAdapterConfig {
    headless: true,
    stealth_level: StealthLevel::Advanced,
    timeout: Duration::from_secs(30),
    viewport_width: 1920,
    viewport_height: 1080,
    ..Default::default()
});
```
Registered service name: "browser"
See the Browser Automation section for the full feature set.
AI Adapters
All AI adapters implement AIProvider and perform structured extraction: they receive raw
HTML (or text) from an upstream scraping node and return a typed JSON object matching the
schema declared in the node config.
Claude (Anthropic)
```rust
use stygian_graph::adapters::ClaudeAdapter;

let adapter = ClaudeAdapter::new(
    std::env::var("ANTHROPIC_API_KEY")?,
    "claude-3-5-sonnet-20241022",
);
```
Registered service name: "ai_claude"
| Config field | Description |
|---|---|
| model | Model ID (e.g. claude-3-5-sonnet-20241022) |
| max_tokens | Max response tokens (default 4096) |
| system_prompt | Optional system-level instruction |
| schema | JSON schema for structured output |
OpenAI
```rust
use stygian_graph::adapters::OpenAiAdapter;

let adapter = OpenAiAdapter::new(
    std::env::var("OPENAI_API_KEY")?,
    "gpt-4o",
);
```
Registered service name: "ai_openai"
Gemini (Google)
```rust
use stygian_graph::adapters::GeminiAdapter;

let adapter = GeminiAdapter::new(
    std::env::var("GOOGLE_API_KEY")?,
    "gemini-2.0-flash",
);
```
Registered service name: "ai_gemini"
GitHub Copilot
Uses the Copilot API with your personal access token (PAT) or GitHub App credentials.
```rust
use stygian_graph::adapters::CopilotAdapter;

let adapter = CopilotAdapter::new(
    std::env::var("GITHUB_TOKEN")?,
    "gpt-4o",
);
```
Registered service name: "ai_copilot"
Ollama (local)
Run any GGUF model locally without sending data to an external API.
```rust
use stygian_graph::adapters::OllamaAdapter;

let adapter = OllamaAdapter::new(
    "http://localhost:11434",
    "llama3.3",
);
```
Registered service name: "ai_ollama"
AI fallback chain
Adapters can be wrapped in a fallback chain. If the primary provider fails (rate-limit, outage), the next in the list is tried:
```rust
use std::sync::Arc;
use stygian_graph::adapters::{AiFallbackChain, ClaudeAdapter, OpenAiAdapter, OllamaAdapter};

let chain = AiFallbackChain::new(vec![
    Arc::new(ClaudeAdapter::new(api_key.clone(), "claude-3-5-sonnet-20241022")),
    Arc::new(OpenAiAdapter::new(openai_key, "gpt-4o")),
    Arc::new(OllamaAdapter::new("http://localhost:11434", "llama3.3")),
]);
```
Registered service name: "ai_fallback"
Resilience adapters
Wrap any ScrapingService with circuit breaker and retry logic without touching the
underlying implementation:
```rust
use std::sync::Arc;
use std::time::Duration;
use stygian_graph::adapters::resilience::{
    CircuitBreakerImpl, RetryPolicy, ResilientAdapter,
};

let cb = CircuitBreakerImpl::new(
    5,                        // open after 5 consecutive failures
    Duration::from_secs(120), // half-open attempt after 2 min
);

let policy = RetryPolicy::exponential(
    3,                          // max 3 attempts
    Duration::from_millis(200), // initial back-off
    Duration::from_secs(5),     // max back-off
);

let resilient = ResilientAdapter::new(
    Arc::new(http_adapter),
    Arc::new(cb),
    policy,
);
```
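The state machine inside a consecutive-failure breaker looks roughly like this — the shipped implementation uses a sliding window; this sketch shows only the Closed → Open → HalfOpen transitions:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
enum CircuitState { Closed, Open, HalfOpen }

struct Breaker {
    state: CircuitState,
    failures: u32,
    threshold: u32,
    opened_at: Option<Instant>,
    cooldown: Duration,
}

impl Breaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { state: CircuitState::Closed, failures: 0, threshold, opened_at: None, cooldown }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.state = CircuitState::Closed;
        self.opened_at = None;
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.state = CircuitState::Open;
            self.opened_at = Some(Instant::now());
        }
    }

    /// Open → HalfOpen once the cooldown elapses; the caller then lets a
    /// single probe request through. Returns whether requests may proceed.
    fn attempt_reset(&mut self) -> bool {
        if self.state == CircuitState::Open
            && self.opened_at.map_or(false, |t| t.elapsed() >= self.cooldown)
        {
            self.state = CircuitState::HalfOpen;
        }
        self.state != CircuitState::Open
    }
}
```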
Cache adapters
Two in-process cache implementations are included. Both implement CachePort.
BoundedLruCache
Thread-safe LRU with a hard capacity limit:
```rust
use stygian_graph::adapters::BoundedLruCache;
use std::num::NonZeroUsize;

let cache = BoundedLruCache::new(NonZeroUsize::new(10_000).unwrap());
```
DashMapCache
Concurrent hash-map backed cache with a background TTL cleanup task:
```rust
use stygian_graph::adapters::DashMapCache;
use std::time::Duration;

let cache = DashMapCache::new(Duration::from_secs(300)); // 5-minute default TTL
```
GraphQL adapter
The GraphQlService adapter executes queries against any GraphQL endpoint using
GraphQlTargetPlugin implementations registered in a GraphQlPluginRegistry.
For most APIs, use GenericGraphQlPlugin via the fluent builder rather than writing
a dedicated struct:
```rust
use stygian_graph::adapters::graphql_plugins::generic::GenericGraphQlPlugin;
use stygian_graph::adapters::graphql_throttle::CostThrottleConfig;

let plugin = GenericGraphQlPlugin::builder()
    .name("github")
    .endpoint("https://api.github.com/graphql")
    .bearer_auth("${env:GITHUB_TOKEN}")
    .header("X-Github-Next-Global-ID", "1")
    .cost_throttle(CostThrottleConfig::default())
    .page_size(30)
    .build()
    .expect("name and endpoint required");
```
For runtime-rotating credentials inject an AuthPort:
```rust
use std::sync::Arc;
use stygian_graph::adapters::graphql::{GraphQlConfig, GraphQlService};
use stygian_graph::ports::auth::{EnvAuthPort, ErasedAuthPort};

let service = GraphQlService::new(GraphQlConfig::default(), Some(Arc::new(registry)))
    .with_auth_port(Arc::new(EnvAuthPort::new("MY_API_TOKEN")) as Arc<dyn ErasedAuthPort>);
```
See the GraphQL Plugins page for the full builder reference,
AuthPort implementation guide, proactive cost throttling, and custom plugin examples.
Cloudflare Browser Rendering adapter
Submits a multi-page crawl job to the Cloudflare Browser Rendering API, polls until it completes, and returns the aggregated content. All page rendering is done inside Cloudflare's infrastructure — no local Chrome binary needed.
Feature flag: `cloudflare-crawl` (not included in `default` or `browser`; add it explicitly or use `full`).
Quick start
```toml
# Cargo.toml
[dependencies]
stygian-graph = { version = "0.1", features = ["cloudflare-crawl"] }
```
```rust
use stygian_graph::adapters::cloudflare_crawl::{
    CloudflareCrawlAdapter, CloudflareCrawlConfig,
};
use std::time::Duration;

let adapter = CloudflareCrawlAdapter::with_config(CloudflareCrawlConfig {
    poll_interval: Duration::from_secs(3),
    job_timeout: Duration::from_secs(120),
    ..Default::default()
});
```
Registered service name: "cloudflare-crawl"
ServiceInput.params contract
All per-request options are passed via ServiceInput.params. account_id and
api_token are required; the rest are optional and forwarded verbatim to the
Cloudflare API.
| Param key | Required | Default | Description |
|---|---|---|---|
| account_id | ✅ | — | Cloudflare account ID |
| api_token | ✅ | — | Cloudflare API token with Browser Rendering permission |
| output_format | — | "markdown" | "markdown", "html", or "raw" |
| max_depth | — | API default | Maximum crawl depth from the seed URL |
| max_pages | — | API default | Maximum pages to crawl |
| url_pattern | — | API default | Regex or glob restricting which URLs are followed |
| modified_since | — | API default | ISO-8601 timestamp; skip pages not modified since |
| max_age_seconds | — | API default | Skip cached pages older than this many seconds |
| static_mode | — | false | Set "true" to skip JS execution (faster, static HTML only) |
Config fields
| Field | Default | Description |
|---|---|---|
| poll_interval | 2 s | How often to poll for job completion |
| job_timeout | 5 min | Hard timeout per crawl job; returns ServiceError::Timeout if exceeded |
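The poll loop has this shape — a blocking sketch for clarity; the real adapter awaits an async sleep instead:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum PollError { Timeout }

/// Poll `check` every `interval` until it yields a result or `timeout`
/// elapses.
fn poll_until<T, F: FnMut() -> Option<T>>(
    mut check: F,
    interval: Duration,
    timeout: Duration,
) -> Result<T, PollError> {
    let start = Instant::now();
    loop {
        if let Some(result) = check() {
            return Ok(result); // job finished
        }
        if start.elapsed() + interval > timeout {
            return Err(PollError::Timeout); // next poll would exceed the budget
        }
        std::thread::sleep(interval);
    }
}
```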
Output
ServiceOutput.data contains the page content of all crawled pages joined by newlines.
ServiceOutput.metadata is a JSON object:
```json
{
  "job_id": "some-uuid",
  "pages_crawled": 12,
  "output_format": "markdown"
}
```
TOML pipeline usage
```toml
[[nodes]]
id = "crawl"
type = "scrape"
target = "https://docs.example.com"

[nodes.params]
account_id = "${env:CF_ACCOUNT_ID}"
api_token = "${env:CF_API_TOKEN}"
output_format = "markdown"
max_depth = "3"
max_pages = "50"
url_pattern = "https://docs.example.com/**"

[nodes.service]
name = "cloudflare-crawl"
```
Error mapping
| Condition | StygianError variant |
|---|---|
| Missing account_id or api_token | ServiceError::Unavailable |
| Cloudflare API non-2xx | ServiceError::Unavailable (with CF error code) |
| Job still pending after job_timeout | ServiceError::Timeout |
| Unexpected response shape | ServiceError::InvalidResponse |
Request signing adapters
The SigningPort trait lets any adapter attach signatures, HMAC tokens,
device attestation headers, or OAuth material to outbound requests without
coupling the adapter to the scheme.
| Adapter | Use case |
|---|---|
| NoopSigningAdapter | Passthrough — no headers added; useful as a default or in unit tests |
| HttpSigningAdapter | Delegate to any external sidecar (Frida RPC bridge, AWS SigV4 server, OAuth 1.0a service, …) |
```rust
use std::sync::Arc;
use stygian_graph::adapters::signing::{HttpSigningAdapter, HttpSigningConfig};
use stygian_graph::ports::signing::ErasedSigningPort;

let signer: Arc<dyn ErasedSigningPort> = Arc::new(
    HttpSigningAdapter::new(HttpSigningConfig {
        endpoint: "http://localhost:27042/sign".to_string(),
        ..Default::default()
    })
);
```
See Request Signing for the full sidecar wire format, Frida RPC bridge example, and guide to implementing a pure-Rust SigningPort.
Request Signing
The SigningPort trait decouples request signing from the adapters that make HTTP calls.
Any adapter that needs signed outbound requests holds an Arc<dyn ErasedSigningPort> and
calls sign() with the request material before dispatching it.
This separation means the calling adapter never knows or cares how requests are signed — whether by a Frida RPC bridge, an AWS SDK, a pure-Rust HMAC function, or a lightweight Python sidecar.
When to use SigningPort
| Signing scheme | Typical use |
|---|---|
| Frida RPC bridge | Hook native .so signing code inside a running mobile app (Tinder, Snapchat, …) via a thin HTTP sidecar |
| AWS Signature V4 | Sign S3 / API Gateway requests; keep IAM credentials out of the graph pipeline |
| OAuth 1.0a | Generate per-request oauth_signature for Twitter/X API v1 endpoints |
| Custom HMAC | Add X-Request-Signature + X-Signed-At headers required by trading or payment APIs |
| Timestamp + nonce | Anti-replay headers for any API that validates request freshness |
| Device attestation | Attach Play Integrity / Apple DeviceCheck tokens to every request |
| mTLS client credentials | Surface a client certificate thumbprint as a header when TLS termination is upstream |
Input and output types
SigningInput
The request material passed to the signer:
```rust
use stygian_graph::ports::signing::SigningInput;
use serde_json::json;

let input = SigningInput {
    method: "POST".to_string(),
    url: "https://api.example.com/v2/messages".to_string(),
    headers: Default::default(),            // headers already present before signing
    body: Some(b"{\"text\":\"hello\"}".to_vec()),
    context: json!({ "nonce_seed": 42 }),   // arbitrary caller data
};
```
| Field | Type | Description |
|---|---|---|
method | String | HTTP method ("GET", "POST", …) |
url | String | Fully-qualified target URL |
headers | HashMap<String, String> | Headers already present on the request |
body | Option<Vec<u8>> | Raw request body; None for bodyless methods |
context | serde_json::Value | Caller-supplied metadata (nonce seeds, session tokens, …) |
SigningOutput
The material to merge into the request:
```rust
use stygian_graph::ports::signing::SigningOutput;
use std::collections::HashMap;

let mut headers = HashMap::new();
headers.insert("Authorization".to_string(), "HMAC-SHA256 sig=abc123".to_string());
headers.insert("X-Signed-At".to_string(), "1710676800000".to_string());

let output = SigningOutput {
    headers,
    query_params: vec![],
    body_override: None,
};
```
| Field | Type | Description |
|---|---|---|
headers | HashMap<String, String> | Headers to add or override on the request |
query_params | Vec<(String, String)> | Query parameters to append to the URL |
body_override | Option<Vec<u8>> | If Some, replaces the request body (for digest-in-body schemes) |
All fields default to empty — a default SigningOutput is a valid no-op.
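Since a default output is a no-op, the merge an adapter performs can be pictured with plain std types. The `Output` struct and `apply` function below are illustrative stand-ins for this sketch, not the crate's real types:

```rust
use std::collections::HashMap;

// Stand-in for SigningOutput's three fields (illustrative only).
#[derive(Default)]
struct Output {
    headers: HashMap<String, String>,
    query_params: Vec<(String, String)>,
    body_override: Option<Vec<u8>>,
}

// Merge signing material into an existing request (sketch).
fn apply(
    headers: &mut HashMap<String, String>,
    url: &mut String,
    body: &mut Option<Vec<u8>>,
    out: Output,
) {
    // Headers add or override existing entries.
    headers.extend(out.headers);
    // Query params are appended to the URL.
    for (k, v) in &out.query_params {
        url.push(if url.contains('?') { '&' } else { '?' });
        url.push_str(&format!("{k}={v}"));
    }
    // body_override replaces the body only when Some.
    if let Some(b) = out.body_override {
        *body = Some(b);
    }
}

fn main() {
    let mut headers = HashMap::from([("Accept".to_string(), "*/*".to_string())]);
    let mut url = "https://api.example.com/v2".to_string();
    let mut body: Option<Vec<u8>> = None;

    // A default Output is a valid no-op: nothing changes.
    apply(&mut headers, &mut url, &mut body, Output::default());
    assert_eq!(headers.len(), 1);
    assert_eq!(url, "https://api.example.com/v2");
    assert!(body.is_none());
}
```

The same three merge rules (add-or-override headers, append query params, optionally replace the body) are what the table above describes.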
Built-in adapters
NoopSigningAdapter
Passes requests through unsigned. Use as a default when signing is optional, or to disable signing in tests:
```rust
use stygian_graph::adapters::signing::NoopSigningAdapter;
use stygian_graph::ports::signing::{SigningPort, SigningInput};
use serde_json::json;

let signer = NoopSigningAdapter;
let output = signer.sign(SigningInput {
    method: "GET".to_string(),
    url: "https://example.com".to_string(),
    headers: Default::default(),
    body: None,
    context: json!({}),
}).await.unwrap();

assert!(output.headers.is_empty());
```
HttpSigningAdapter
Delegates signing to any external HTTP sidecar. The sidecar receives a JSON payload describing the request and returns the headers / query params / body override to apply.
```rust
use stygian_graph::adapters::signing::{HttpSigningAdapter, HttpSigningConfig};
use std::time::Duration;

let signer = HttpSigningAdapter::new(HttpSigningConfig {
    endpoint: "http://localhost:27042/sign".to_string(),
    timeout: Duration::from_secs(5),
    bearer_token: Some("sidecar-secret".to_string()),
    ..Default::default()
});
```
Config fields
| Field | Default | Description |
|---|---|---|
endpoint | "http://localhost:27042/sign" | Full URL of the sidecar's sign endpoint |
timeout | 10 s | Per-request timeout when calling the sidecar |
bearer_token | None | Bearer token used to authenticate with the sidecar |
extra_headers | {} | Static headers forwarded on every sidecar call |
Sidecar wire format
The sidecar receives a POST with this JSON body:
```json
{
  "method": "GET",
  "url": "https://api.tinder.com/v2/profile",
  "headers": { "Content-Type": "application/json" },
  "body_b64": null,
  "context": {}
}
```
The request body (if any) is base64-encoded in body_b64. The sidecar must
respond with a JSON object:
```json
{
  "headers": { "X-Auth-Token": "abc123", "X-Signed-At": "1710676800000" },
  "query_params": [],
  "body_b64": null
}
```
All response fields are optional — omit any field that your scheme does not use.
Frida RPC bridge
The most common use of HttpSigningAdapter is hooking a mobile app's native signing function via Frida and exposing it
through a thin HTTP sidecar.
```text
┌─────────── Your machine ────────────────────────────────┐
│                                                         │
│  stygian-graph pipeline                                 │
│  └─ HttpSigningAdapter → POST http://localhost:27042    │─ adb forward ─►┐
│                                                         │                │
└─────────────────────────────────────────────────────────┘                │
                                                                           ▼
                                        ┌──── Android device / emulator ────┐
                                        │                                   │
                                        │  Frida sidecar (Python/Flask)     │
                                        │  └─ frida.attach("com.example")   │
                                        │     └─ libauth.so!computeHMAC()   │
                                        │                                   │
                                        └───────────────────────────────────┘
```
Example Python sidecar (frida_sidecar.py):
```python
import frida
from flask import Flask, request, jsonify
import base64

session = frida.get_usb_device().attach("com.example.app")
script = session.create_script("""
const computeHmac = new NativeFunction(
    Module.findExportByName("libauth.so", "_Z11computeHMACPKcS0_"), // gitleaks:allow
    'pointer', ['pointer', 'pointer']
);
rpc.exports.sign = (url, body) =>
    computeHmac(
        Memory.allocUtf8String(url),
        Memory.allocUtf8String(body)
    ).readUtf8String();
""")
script.load()

app = Flask(__name__)

@app.route("/sign", methods=["POST"])
def sign():
    req = request.json
    body = base64.b64decode(req["body_b64"]).decode() if req.get("body_b64") else ""
    sig = script.exports_sync.sign(req["url"], body)
    return jsonify({"headers": {"X-Request-Signature": sig}})

app.run(host="0.0.0.0", port=27042)
```
Forward the port and point the adapter at it:
```sh
adb forward tcp:27042 tcp:27042
```
```rust
use stygian_graph::adapters::signing::{HttpSigningAdapter, HttpSigningConfig};

let signer = HttpSigningAdapter::new(HttpSigningConfig {
    endpoint: "http://localhost:27042/sign".to_string(),
    ..Default::default()
});
```
Implementing a custom SigningPort
For pure-Rust schemes, implement SigningPort directly — no sidecar needed:
```rust
use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};
use stygian_graph::ports::signing::{SigningError, SigningInput, SigningOutput, SigningPort};

pub struct TimestampNonceAdapter {
    secret: Vec<u8>,
}

impl SigningPort for TimestampNonceAdapter {
    async fn sign(&self, _input: SigningInput) -> Result<SigningOutput, SigningError> {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map_err(|e| SigningError::Other(e.to_string()))?
            .as_millis()
            .to_string();

        let mut headers = HashMap::new();
        headers.insert("X-Timestamp".to_string(), ts);
        headers.insert("X-Nonce".to_string(), uuid::Uuid::new_v4().to_string());
        // Add HMAC over (method + url + ts) using self.secret here …

        Ok(SigningOutput { headers, ..Default::default() })
    }
}
```
Follow the Custom Adapters guide for the full checklist.
Wiring into a pipeline
Use Arc<dyn ErasedSigningPort> to hold any signer at runtime:
```rust
use std::sync::Arc;
use stygian_graph::adapters::signing::{HttpSigningAdapter, HttpSigningConfig};
use stygian_graph::ports::signing::ErasedSigningPort;

let signer: Arc<dyn ErasedSigningPort> = Arc::new(
    HttpSigningAdapter::new(HttpSigningConfig {
        endpoint: "http://localhost:27042/sign".to_string(),
        ..Default::default()
    })
);

// Pass `signer` to any adapter or service that accepts `Arc<dyn ErasedSigningPort>`
```
ErasedSigningPort is the object-safe version of SigningPort that enables dynamic
dispatch. The blanket impl<T: SigningPort> ErasedSigningPort for T means any concrete
SigningPort implementation can be erased without additional boilerplate.
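The pattern behind this is ordinary Rust. A minimal stand-in (hypothetical `Port`/`ErasedPort` traits, not the real stygian ones) shows how a blanket impl erases any concrete implementation without per-type boilerplate:

```rust
use std::sync::Arc;

// Stand-in for a generic port trait (illustrative, not the real SigningPort).
trait Port {
    fn describe(&self) -> String;
}

// Object-safe "erased" counterpart, suitable for `dyn` dispatch.
trait ErasedPort {
    fn describe_erased(&self) -> String;
}

// Blanket impl: every concrete Port is automatically an ErasedPort.
impl<T: Port> ErasedPort for T {
    fn describe_erased(&self) -> String {
        self.describe()
    }
}

struct Noop;
impl Port for Noop {
    fn describe(&self) -> String {
        "noop".to_string()
    }
}

fn main() {
    // No extra code needed to erase the concrete type.
    let erased: Arc<dyn ErasedPort> = Arc::new(Noop);
    assert_eq!(erased.describe_erased(), "noop");
}
```

In the real crate the split also works around the object-safety restrictions of `async fn` trait methods; the sketch only shows the erasure mechanism itself.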
Error handling
SigningError converts to StygianError::Service(ServiceError::AuthenticationFailed)
via the From trait, so signing failures surface as authentication errors in the pipeline.
| Variant | Meaning |
|---|---|
BackendUnavailable(msg) | Sidecar is unreachable (network error, DNS failure) |
InvalidResponse(msg) | Sidecar returned an unexpected HTTP status or malformed JSON |
CredentialsMissing(msg) | Signing key or secret was not configured |
Timeout(ms) | Sidecar did not respond within the configured timeout |
Other(msg) | Catch-all for any other signing failure |
GraphQL Plugins
stygian-graph ships a generic, builder-based GraphQL plugin system built on top of
the GraphQlTargetPlugin port trait. Instead of writing a dedicated struct for each
API you want to query, reach for GenericGraphQlPlugin.
GenericGraphQlPlugin
GenericGraphQlPlugin implements GraphQlTargetPlugin and is configured entirely
via a fluent builder. Only name and endpoint are required; everything else is
optional with sensible defaults.
```rust
use stygian_graph::adapters::graphql_plugins::generic::GenericGraphQlPlugin;
use stygian_graph::adapters::graphql_throttle::CostThrottleConfig;

let plugin = GenericGraphQlPlugin::builder()
    .name("github")
    .endpoint("https://api.github.com/graphql")
    .bearer_auth("${env:GITHUB_TOKEN}")
    .header("X-Github-Next-Global-ID", "1")
    .cost_throttle(CostThrottleConfig::default())
    .page_size(30)
    .description("GitHub GraphQL API v4")
    .build()
    .expect("name and endpoint are required");
```
Builder reference
| Method | Required | Description |
|---|---|---|
.name(impl Into<String>) | yes | Plugin identifier used in the registry |
.endpoint(impl Into<String>) | yes | Full GraphQL endpoint URL |
.bearer_auth(impl Into<String>) | no | Shorthand: sets a Bearer auth token |
.auth(GraphQlAuth) | no | Full auth struct (Bearer, API key, or custom header) |
.header(key, value) | no | Add a single request header (repeatable) |
.headers(HashMap<String, String>) | no | Bulk-replace all headers |
.cost_throttle(CostThrottleConfig) | no | Enable proactive point-budget throttling |
.rate_limit(RateLimitConfig) | no | Enable client-side request rate limiting |
.page_size(usize) | no | Default page size for paginated queries (default 50) |
.description(impl Into<String>) | no | Human-readable description |
.build() | — | Returns Result<GenericGraphQlPlugin, BuildError> |
Auth options
```rust
use stygian_graph::adapters::graphql_plugins::generic::GenericGraphQlPlugin;
use stygian_graph::ports::{GraphQlAuth, GraphQlAuthKind};

// Bearer token (most common)
let plugin = GenericGraphQlPlugin::builder()
    .name("shopify")
    .endpoint("https://my-store.myshopify.com/admin/api/2025-01/graphql.json")
    .bearer_auth("${env:SHOPIFY_ACCESS_TOKEN}")
    .build()
    .unwrap();

// Custom header (e.g. X-Shopify-Access-Token)
let plugin = GenericGraphQlPlugin::builder()
    .name("shopify-legacy")
    .endpoint("https://my-store.myshopify.com/admin/api/2025-01/graphql.json")
    .auth(GraphQlAuth {
        kind: GraphQlAuthKind::Header,
        token: "${env:SHOPIFY_ACCESS_TOKEN}".to_string(),
        header_name: Some("X-Shopify-Access-Token".to_string()),
    })
    .build()
    .unwrap();
```
Tokens containing ${env:VAR_NAME} are expanded by the pipeline template processor
(expand_template) or by GraphQlService::apply_auth, not by EnvAuthPort itself.
EnvAuthPort reads the env var value directly — no template syntax is needed when
using it as a runtime auth port.
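As a rough sketch of that substitution step, here is a simplified `${env:VAR}` expander. This is a hypothetical helper, not the real `expand_template`, which supports more placeholder kinds; the resolver is a closure so the logic is testable without touching process env vars (in real use you would pass `|v| std::env::var(v).ok()`):

```rust
// Simplified `${env:VAR}` expansion (sketch only). Unresolved variables
// expand to the empty string here; the real implementation may differ.
fn expand_env(template: &str, resolve: impl Fn(&str) -> Option<String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("${env:") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 6..];
        match after.find('}') {
            Some(end) => {
                out.push_str(&resolve(&after[..end]).unwrap_or_default());
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: emit verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let resolve = |var: &str| (var == "GITHUB_TOKEN").then(|| "ghp_example".to_string());
    assert_eq!(expand_env("Bearer ${env:GITHUB_TOKEN}", &resolve), "Bearer ghp_example");
    assert_eq!(expand_env("no placeholders here", &resolve), "no placeholders here");
}
```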
AuthPort — runtime credential management
For credentials that rotate, expire, or need a refresh flow, implement the
AuthPort trait and inject it into GraphQlService.
```rust
use stygian_graph::ports::auth::{AuthPort, AuthError, TokenSet};

pub struct MyOAuthPort { /* ... */ }

impl AuthPort for MyOAuthPort {
    async fn load_token(&self) -> Result<Option<TokenSet>, AuthError> {
        // Return None if no cached token exists; Some(token) if you have one.
        Ok(Some(TokenSet {
            access_token: fetch_stored_token().await?,
            refresh_token: Some(fetch_stored_refresh_token().await?),
            expires_at: Some(std::time::SystemTime::now() + std::time::Duration::from_secs(3600)),
            scopes: vec!["read".to_string()],
        }))
    }

    async fn refresh_token(&self) -> Result<TokenSet, AuthError> {
        // Exchange the refresh token for a new access token.
        Ok(TokenSet {
            access_token: exchange_refresh_token().await?,
            refresh_token: Some(fetch_new_refresh_token().await?),
            expires_at: Some(std::time::SystemTime::now() + std::time::Duration::from_secs(3600)),
            scopes: vec!["read".to_string()],
        })
    }
}
```
Wiring into GraphQlService
```rust
use std::sync::Arc;
use stygian_graph::adapters::graphql::{GraphQlConfig, GraphQlService};
use stygian_graph::ports::auth::ErasedAuthPort;

let service = GraphQlService::new(GraphQlConfig::default(), None)
    .with_auth_port(Arc::new(MyOAuthPort { /* ... */ }) as Arc<dyn ErasedAuthPort>);
```
The service calls resolve_token before each request. If the token is expired (or
within 60 seconds of expiry), refresh_token is called automatically.
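That refresh decision amounts to a small predicate. The sketch below assumes the 60-second skew described above; `needs_refresh` is a hypothetical helper, not crate API, and the real service may treat a missing expiry differently:

```rust
use std::time::{Duration, SystemTime};

// Refresh when the token is expired, or within 60 seconds of expiring.
fn needs_refresh(expires_at: Option<SystemTime>, now: SystemTime) -> bool {
    match expires_at {
        // No expiry recorded: treat the token as long-lived (assumption).
        None => false,
        Some(at) => now + Duration::from_secs(60) >= at,
    }
}

fn main() {
    let now = SystemTime::now();
    // Expires in 10 minutes: still fresh.
    assert!(!needs_refresh(Some(now + Duration::from_secs(600)), now));
    // Expires in 30 seconds: inside the 60 s skew window, so refresh.
    assert!(needs_refresh(Some(now + Duration::from_secs(30)), now));
}
```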
EnvAuthPort — zero-config static token
For non-rotating tokens, EnvAuthPort reads a bearer token from an environment
variable at load time:
```rust
use stygian_graph::ports::auth::EnvAuthPort;

let auth = EnvAuthPort::new("GITHUB_TOKEN");
```
If GITHUB_TOKEN is not set, EnvAuthPort::load_token returns Ok(None). An error
(AuthError::TokenNotFound) is only raised later when resolve_token is called and
finds no token available.
Cost throttling
GraphQL APIs that expose extensions.cost.throttleStatus (Shopify Admin API,
Jobber, and others) can be configured for proactive point-budget management.
CostThrottleConfig
```rust
use stygian_graph::ports::graphql_plugin::CostThrottleConfig;

let config = CostThrottleConfig {
    max_points: 10_000.0,              // bucket capacity
    restore_per_sec: 500.0,            // points restored per second
    min_available: 50.0,               // don't send if fewer points remain
    max_delay_ms: 30_000,              // wait at most 30 s before giving up
    estimated_cost_per_request: 100.0, // pessimistic per-request reservation
};
```
| Field | Default | Description |
|---|---|---|
max_points | 10_000.0 | Total bucket capacity |
restore_per_sec | 500.0 | Points/second restored |
min_available | 50.0 | Points threshold below which we pre-sleep |
max_delay_ms | 30_000 | Hard ceiling on proactive sleep duration (ms) |
estimated_cost_per_request | 100.0 | Pessimistic reservation per request to prevent concurrent tasks from all passing the pre-flight check simultaneously |
Attach config to a plugin via .cost_throttle(config) on the builder, or override
GraphQlTargetPlugin::cost_throttle_config() on a custom plugin implementation.
How budget tracking works
- Pre-flight reserve: `pre_flight_reserve` inspects the current `LiveBudget` for the plugin. If the projected available points (net of in-flight reservations) fall below `min_available + estimated_cost_per_request`, it sleeps for the exact duration needed to restore enough points, up to `max_delay_ms`. It then atomically reserves `estimated_cost_per_request` points so concurrent tasks immediately see a reduced balance and cannot all pass the pre-flight check simultaneously.
- Post-response: `update_budget` parses `extensions.cost.throttleStatus` out of the response JSON and updates the per-plugin `LiveBudget` to the true server-reported balance. `release_reservation` is then called to remove the in-flight reservation.
- Reactive back-off: If a request is throttled anyway (HTTP 429 or `extensions.cost` signals exhaustion), `reactive_backoff_ms` computes an exponential delay.
The budgets are stored in a HashMap<String, PluginBudget> keyed by plugin name
and protected by a tokio::sync::RwLock, so all concurrent requests share the
same view of remaining points.
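The pre-flight sleep in step 1 reduces to simple arithmetic: how many points are missing, divided by the restore rate, capped at `max_delay_ms`. A sketch (hypothetical `preflight_sleep_ms`, not the crate's internals):

```rust
// How long to wait until enough points have been restored.
// Returns 0 when the budget already covers the projected cost.
fn preflight_sleep_ms(
    available: f64,
    min_available: f64,
    estimated_cost: f64,
    restore_per_sec: f64,
    max_delay_ms: u64,
) -> u64 {
    let needed = (min_available + estimated_cost) - available;
    if needed <= 0.0 {
        return 0; // enough points already: send immediately
    }
    let ms = (needed / restore_per_sec * 1000.0).ceil() as u64;
    ms.min(max_delay_ms) // never sleep past the hard ceiling
}

fn main() {
    // 100 points left, need 50 + 100 = 150: 50 missing at 500 pts/s = 100 ms.
    assert_eq!(preflight_sleep_ms(100.0, 50.0, 100.0, 500.0, 30_000), 100);
    // Plenty of budget: no sleep.
    assert_eq!(preflight_sleep_ms(5_000.0, 50.0, 100.0, 500.0, 30_000), 0);
}
```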
Request rate limiting
While cost throttling manages point budgets returned by the server, request rate limiting operates entirely client-side: it counts requests (or tokens) before each request leaves the process. The two systems are complementary and can both be active at the same time.
Enable rate limiting by returning a RateLimitConfig from
GraphQlTargetPlugin::rate_limit_config(). The default implementation returns None
(disabled).
RateLimitConfig
```rust
use std::time::Duration;
use stygian_graph::ports::graphql_plugin::{RateLimitConfig, RateLimitStrategy};

let config = RateLimitConfig {
    max_requests: 100,                          // requests allowed per window
    window: Duration::from_secs(60),            // rolling window duration
    max_delay_ms: 30_000,                       // hard cap on pre-flight sleep (ms)
    strategy: RateLimitStrategy::SlidingWindow, // algorithm (see below)
};
```
| Field | Default | Description |
|---|---|---|
max_requests | 100 | Maximum requests allowed inside any rolling window |
window | 60 s | Rolling window duration |
max_delay_ms | 30 000 | Maximum pre-flight sleep before giving up with an error |
strategy | SlidingWindow | Rate-limiting algorithm — see below |
RateLimitStrategy
RateLimitStrategy selects which algorithm protects outgoing requests. Both variants
also honour server-returned Retry-After headers regardless of which is active.
| Variant | Behaviour | Best for |
|---|---|---|
SlidingWindow (default) | Counts requests in a rolling time window; blocks new requests once max_requests is reached until old timestamps expire | APIs with strict fixed-window quotas (e.g. "100 req / 60 s") |
TokenBucket | Refills tokens at a rate of max_requests / window; absorbs bursts up to max_requests capacity before blocking | APIs that advertise burst allowances — allows short spikes then throttles gracefully |
Sliding window example
```rust
use std::time::Duration;
use stygian_graph::ports::graphql_plugin::{RateLimitConfig, RateLimitStrategy};

// GitHub GraphQL API: 5 000 points/hour with a hard req/s ceiling
let config = RateLimitConfig {
    max_requests: 10,
    window: Duration::from_secs(1),
    max_delay_ms: 5_000,
    strategy: RateLimitStrategy::SlidingWindow,
};
```
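The sliding-window algorithm itself fits in a few lines of std Rust. This is an illustrative stand-in, not the crate's implementation:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Minimal sliding-window limiter (sketch).
struct SlidingWindow {
    max_requests: usize,
    window: Duration,
    timestamps: VecDeque<Instant>,
}

impl SlidingWindow {
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Drop timestamps that have aged out of the rolling window.
        while let Some(&front) = self.timestamps.front() {
            if now.duration_since(front) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() < self.max_requests {
            self.timestamps.push_back(now);
            true
        } else {
            false // window full: caller must wait
        }
    }
}

fn main() {
    let mut rl = SlidingWindow {
        max_requests: 2,
        window: Duration::from_secs(1),
        timestamps: VecDeque::new(),
    };
    let t0 = Instant::now();
    assert!(rl.try_acquire(t0));
    assert!(rl.try_acquire(t0));
    assert!(!rl.try_acquire(t0)); // window full
    // One second later the old timestamps have expired.
    assert!(rl.try_acquire(t0 + Duration::from_secs(1)));
}
```

The real limiter additionally sleeps up to `max_delay_ms` instead of returning `false`, and honours server `Retry-After` headers.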
Token bucket example
```rust
use std::time::Duration;
use stygian_graph::ports::graphql_plugin::{RateLimitConfig, RateLimitStrategy};

// Shopify Admin API: 2 req/s sustained, bursts up to 40
let config = RateLimitConfig {
    max_requests: 40,                // bucket depth = burst capacity
    window: Duration::from_secs(20), // refill rate = 40 / 20 = 2 req/s
    max_delay_ms: 30_000,
    strategy: RateLimitStrategy::TokenBucket,
};
```
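The refill arithmetic behind that config is worth seeing in isolation. `tokens_after` is an illustrative helper, not crate API:

```rust
// Token-bucket refill for the Shopify-style config above:
// depth 40, refilled at max_requests / window = 40 / 20 s = 2 tokens/s.
fn tokens_after(capacity: f64, refill_per_sec: f64, tokens: f64, elapsed_secs: f64) -> f64 {
    // Tokens accrue linearly with elapsed time, clamped to the bucket depth.
    (tokens + refill_per_sec * elapsed_secs).min(capacity)
}

fn main() {
    let (capacity, refill) = (40.0, 40.0 / 20.0);
    // A drained bucket recovers 2 tokens per second …
    assert_eq!(tokens_after(capacity, refill, 0.0, 1.0), 2.0);
    // … and never overflows its burst capacity.
    assert_eq!(tokens_after(capacity, refill, 39.0, 60.0), 40.0);
}
```

This is why the strategy tolerates a 40-request burst and then settles at 2 req/s: a full bucket drains instantly, but refills only at the sustained rate.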
Wiring into a custom plugin
```rust
use std::time::Duration;
use stygian_graph::ports::graphql_plugin::{
    GraphQlTargetPlugin, RateLimitConfig, RateLimitStrategy,
};

pub struct ShopifyPlugin { token: String }

impl GraphQlTargetPlugin for ShopifyPlugin {
    fn name(&self) -> &str { "shopify" }

    fn endpoint(&self) -> &str {
        "https://my-store.myshopify.com/admin/api/2025-01/graphql.json"
    }

    fn rate_limit_config(&self) -> Option<RateLimitConfig> {
        Some(RateLimitConfig {
            max_requests: 40,
            window: Duration::from_secs(20),
            max_delay_ms: 30_000,
            strategy: RateLimitStrategy::TokenBucket,
        })
    }

    // ...other methods omitted
}
```
For GenericGraphQlPlugin, pass it via the builder:
```rust
use stygian_graph::adapters::graphql_plugins::generic::GenericGraphQlPlugin;
use stygian_graph::ports::graphql_plugin::{RateLimitConfig, RateLimitStrategy};
use std::time::Duration;

let plugin = GenericGraphQlPlugin::builder()
    .name("shopify")
    .endpoint("https://my-store.myshopify.com/admin/api/2025-01/graphql.json")
    .bearer_auth("${env:SHOPIFY_ACCESS_TOKEN}")
    .rate_limit(RateLimitConfig {
        max_requests: 40,
        window: Duration::from_secs(20),
        max_delay_ms: 30_000,
        strategy: RateLimitStrategy::TokenBucket,
    })
    .build()
    .expect("name and endpoint are required");
```
Rate limiting vs cost throttling
| | Request rate limiting | Cost throttling |
|---|---|---|
| What it counts | Raw request count | Query complexity points |
| Data source | Local state only | Server extensions.cost response |
| Algorithm options | Sliding window, token bucket | Leaky-bucket (token refill) |
| Reactive to 429? | Yes — honours Retry-After | Yes — exponential back-off |
| Enabled by | rate_limit_config() | cost_throttle_config() |
Writing a custom plugin
For complex APIs — multi-tenant endpoints, per-request header mutations, non-standard
auth flows — implement GraphQlTargetPlugin directly:
```rust
use std::collections::HashMap;
use stygian_graph::ports::{GraphQlAuth, GraphQlAuthKind};
use stygian_graph::ports::graphql_plugin::{CostThrottleConfig, GraphQlTargetPlugin};

pub struct AcmeApi {
    token: String,
}

impl GraphQlTargetPlugin for AcmeApi {
    fn name(&self) -> &str { "acme" }

    fn endpoint(&self) -> &str { "https://api.acme.io/graphql" }

    fn version_headers(&self) -> HashMap<String, String> {
        [("Acme-Api-Version".to_string(), "2025-01".to_string())]
            .into_iter()
            .collect()
    }

    fn default_auth(&self) -> Option<GraphQlAuth> {
        Some(GraphQlAuth {
            kind: GraphQlAuthKind::Bearer,
            token: self.token.clone(),
            header_name: None,
        })
    }

    fn default_page_size(&self) -> usize { 25 }

    fn description(&self) -> &str { "Acme Corp GraphQL API" }

    fn supports_cursor_pagination(&self) -> bool { true }

    // opt-in to proactive throttling
    fn cost_throttle_config(&self) -> Option<CostThrottleConfig> {
        Some(CostThrottleConfig::default())
    }
}
```
Register it the same way as any built-in plugin:
```rust
use std::sync::Arc;
use stygian_graph::application::graphql_plugin_registry::GraphQlPluginRegistry;

let mut registry = GraphQlPluginRegistry::new();
registry.register(Arc::new(AcmeApi { token: /* ... */ }));
```
Custom Adapters
This guide walks through building a production-quality custom adapter from scratch.
The worked example implements a hypothetical PlaywrightService — a browser automation
adapter that drives the Playwright protocol instead of CDP.
By the end you will know how to:
- Implement any port trait as a new adapter
- Register the adapter in the service registry
- Wrap the adapter in resilience primitives
- Write integration tests against it
Prerequisites
Read Architecture first. Adapters live in src/adapters/ and implement
traits defined in src/ports.rs. The domain never imports adapters — only ports.
Step 1: Choose the right port
| Port | Use when |
|---|---|
ScrapingService | Fetching or processing content in a new way |
AIProvider | Adding a new LLM or language model API |
CachePort | Adding a new cache backend (Redis, Memcached, …) |
SigningPort | Attaching signatures, HMAC tokens, or authentication material to outgoing requests |
PlaywrightService fetches rendered HTML, so it implements ScrapingService.
Step 2: Scaffold the adapter
Create src/adapters/playwright.rs:
```rust
//! Playwright browser adapter.

use std::time::Duration;
use serde_json::Value;

use crate::domain::error::{StygianError, ServiceError};
use crate::ports::{ScrapingService, ServiceInput, ServiceOutput};

// ── Configuration ────────────────────────────────────────────────────────────

/// Configuration for the Playwright adapter.
///
/// # Example
///
/// ```rust
/// use stygian_graph::adapters::playwright::PlaywrightConfig;
///
/// let cfg = PlaywrightConfig {
///     ws_endpoint: "ws://localhost:3000/playwright".into(),
///     default_timeout: std::time::Duration::from_secs(30),
///     headless: true,
/// };
/// ```
#[derive(Debug, Clone)]
pub struct PlaywrightConfig {
    pub ws_endpoint: String,
    pub default_timeout: Duration,
    pub headless: bool,
}

impl Default for PlaywrightConfig {
    fn default() -> Self {
        Self {
            ws_endpoint: "ws://localhost:3000/playwright".into(),
            default_timeout: Duration::from_secs(30),
            headless: true,
        }
    }
}

// ── Adapter ──────────────────────────────────────────────────────────────────

/// Browser adapter backed by a Playwright JSON-RPC server.
pub struct PlaywrightService {
    config: PlaywrightConfig,
}

impl PlaywrightService {
    pub fn new(config: PlaywrightConfig) -> Self {
        Self { config }
    }
}

// ── Port implementation ──────────────────────────────────────────────────────

impl ScrapingService for PlaywrightService {
    fn name(&self) -> &'static str {
        "playwright"
    }

    /// Navigate to `input.url`, wait for network idle, return rendered HTML.
    ///
    /// # Errors
    ///
    /// `ServiceError::Unavailable` — Playwright server is unreachable.
    /// `ServiceError::Timeout` — navigation exceeded `default_timeout`.
    async fn execute(&self, input: ServiceInput) -> crate::domain::error::Result<ServiceOutput> {
        // Real implementation would:
        // 1. Connect to self.config.ws_endpoint via WebSocket
        // 2. Send Browser.newPage
        // 3. Navigate to input.url with self.config.default_timeout
        // 4. Await "networkidle" lifecycle event
        // 5. Call Page.getContent() for rendered HTML
        // 6. Close the page
        Err(StygianError::Service(ServiceError::Unavailable(
            format!("PlaywrightService not connected (url={})", input.url),
        )))
    }
}
```
Step 3: Re-export
Add a pub mod playwright; line to src/adapters/mod.rs:
```rust
// src/adapters/mod.rs
pub mod http;
pub mod browser;
pub mod claude;
// ... existing adapters ...
pub mod playwright; // ← add this line
```
Step 4: Register in the service registry
In your binary entry point or application/executor.rs startup code:
```rust
use std::sync::Arc;
use stygian_graph::adapters::playwright::{PlaywrightConfig, PlaywrightService};
use stygian_graph::application::registry::ServiceRegistry;

let registry = ServiceRegistry::new();

let config = PlaywrightConfig {
    ws_endpoint: std::env::var("PLAYWRIGHT_WS_ENDPOINT")
        .unwrap_or_else(|_| "ws://localhost:3000/playwright".into()),
    ..Default::default()
};

registry.register(
    "playwright".into(),
    Arc::new(PlaywrightService::new(config)),
);
```
Pipelines can now reference the adapter with service = "playwright" in any node.
Step 5: Add resilience wrappers
Wrap before registering to get circuit-breaker and retry behaviour for free:
```rust
use std::sync::Arc;
use std::time::Duration;
use stygian_graph::adapters::resilience::{CircuitBreakerImpl, RetryPolicy, ResilientAdapter};

let cb = CircuitBreakerImpl::new(
    5,                        // open after 5 consecutive failures
    Duration::from_secs(120), // half-open probe after 2 min
);

let policy = RetryPolicy::exponential(
    3,                          // max 3 attempts
    Duration::from_millis(200), // initial back-off
    Duration::from_secs(5),     // cap
);

let resilient = ResilientAdapter::new(
    Arc::new(PlaywrightService::new(config)),
    Arc::new(cb),
    policy,
);

registry.register("playwright".into(), Arc::new(resilient));
```
Step 6: Integration tests
Use stygian's built-in mock transport for unit tests that don't require a real server:
```rust
#[cfg(test)]
mod tests {
    use super::*;
    use stygian_graph::ports::ServiceInput;

    fn make_input(url: &str) -> ServiceInput {
        ServiceInput {
            url: url.into(),
            config: serde_json::Value::Null,
            ..Default::default()
        }
    }

    #[tokio::test]
    async fn returns_unavailable_when_not_connected() {
        let svc = PlaywrightService::new(PlaywrightConfig::default());
        let err = svc.execute(make_input("https://example.com")).await.unwrap_err();
        assert!(err.to_string().contains("not connected"));
    }
}
```
For full integration tests that require a real Playwright server, mark them with
#[ignore = "requires playwright server"] so CI passes without the dependency.
Adapter checklist
Before merging a new adapter:
- Implements the correct port trait
- `name()` returns a lowercase, kebab-case identifier
- All public types have doc comments with an example
- `Default` impl provided where sensible
- Config can be loaded from environment variables
- At least one unit test that does not require external services
- Integration tests marked `#[ignore = "requires …"]`
- Re-exported from `src/adapters/mod.rs`
- Registered in the service registry example in the binary
Distributed Execution
stygian-graph supports horizontal scaling via a Redis/Valkey-backed work queue.
Multiple worker processes can consume from the same queue, enabling throughput that
scales linearly with the number of nodes.
Overview
In distributed mode, the DistributedDagExecutor splits pipeline execution across a
shared work queue:
```text
Producer process          Redis/Valkey          Worker processes
─────────────────         ──────────────        ─────────────────────
PipelineParser                                  DistributedDagExecutor
     ↓                                                ↓
DagExecutor ──────── work items ────►           ServiceRegistry
     ↓                                                ↓
IdempotencyKey ◄──── results ────────           ScrapingService / AI
```
Setup
1. Start Redis or Valkey
```sh
# Docker — Valkey (Redis-compatible, open-source fork)
docker run -d --name valkey -p 6379:6379 valkey/valkey:8

# Or Redis
docker run -d --name redis -p 6379:6379 redis:7
```
2. Enable the distributed feature
```toml
# Cargo.toml
stygian-graph = { version = "0.2", features = ["distributed"] }
```
3. Create a work queue and executor
```rust
use std::sync::Arc;

use stygian_graph::adapters::{DistributedDagExecutor, RedisWorkQueue};
use stygian_graph::application::registry::ServiceRegistry;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let queue = RedisWorkQueue::new("redis://localhost:6379").await?;
    let registry = ServiceRegistry::default_with_env()?;

    // Spawn 10 concurrent workers on this process
    let executor = DistributedDagExecutor::new(Arc::new(queue), registry, 10);

    // Execute a wave of tasks
    let results = executor
        .execute_wave("pipeline-id-1", tasks, &services)
        .await?;

    for result in results {
        println!("{}", serde_json::to_string_pretty(&result)?);
    }
    Ok(())
}
```
Work queue operations
RedisWorkQueue implements the WorkQueue port trait:
```rust
pub trait WorkQueue: Send + Sync {
    /// Enqueue a task and return its unique ID.
    async fn enqueue(&self, task: Task) -> Result<TaskId>;

    /// Block until a task is available; returns `None` if queue is empty and closed.
    async fn dequeue(&self) -> Result<Option<Task>>;

    /// Acknowledge successful completion.
    async fn ack(&self, id: &TaskId) -> Result<()>;

    /// Return a task to the queue for another worker to retry.
    async fn nack(&self, id: &TaskId) -> Result<()>;
}
```
Tasks that are nack-ed (e.g. worker crashes mid-execution) are requeued automatically
after a visibility timeout.
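The visibility-timeout mechanism can be sketched with a toy in-memory queue (a stand-in for the Redis-backed implementation; the struct and field names are hypothetical):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// A dequeued task becomes invisible until its deadline; if it is never
// acked, a sweep returns it to the queue for another worker.
struct Queue {
    ready: VecDeque<u64>,
    in_flight: Vec<(u64, Instant)>, // (task id, visibility deadline)
    visibility: Duration,
}

impl Queue {
    fn dequeue(&mut self, now: Instant) -> Option<u64> {
        let id = self.ready.pop_front()?;
        self.in_flight.push((id, now + self.visibility));
        Some(id)
    }

    fn ack(&mut self, id: u64) {
        self.in_flight.retain(|(t, _)| *t != id);
    }

    // Requeue tasks whose visibility deadline has passed (crashed worker).
    fn sweep(&mut self, now: Instant) {
        let mut keep = Vec::new();
        for (id, deadline) in self.in_flight.drain(..) {
            if now >= deadline {
                self.ready.push_back(id);
            } else {
                keep.push((id, deadline));
            }
        }
        self.in_flight = keep;
    }
}

fn main() {
    let t0 = Instant::now();
    let mut q = Queue {
        ready: VecDeque::from([1]),
        in_flight: Vec::new(),
        visibility: Duration::from_secs(30),
    };
    assert_eq!(q.dequeue(t0), Some(1));
    assert!(q.ready.is_empty());
    // Worker crashes; after the visibility timeout the sweep requeues it.
    q.sweep(t0 + Duration::from_secs(31));
    assert_eq!(q.ready.front(), Some(&1));
}
```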
Idempotency in distributed mode
Every task carries an IdempotencyKey. Before executing, the worker checks a shared
idempotency store (backed by the same Redis instance):
- Key not seen — execute, store result under key.
- Key already present — return stored result immediately, skip execution.
This makes distributed task execution safe to retry — duplicate network deliveries and worker restarts produce the same observable outcome.
```rust
use stygian_graph::domain::idempotency::IdempotencyKey;

// Deterministic key from pipeline id + input URL — replays the same result
let key = IdempotencyKey::from_input("pipeline-1", "https://example.com/item/42");
```
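Why determinism matters can be shown with a toy key function, using std hashing as a stand-in (the real `IdempotencyKey::from_input` may derive keys differently):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The same (pipeline id, input) pair always maps to the same key, so a
// replayed task hits the stored result instead of re-executing.
fn idempotency_key(pipeline_id: &str, input: &str) -> u64 {
    let mut h = DefaultHasher::new();
    pipeline_id.hash(&mut h);
    input.hash(&mut h);
    h.finish()
}

fn main() {
    let a = idempotency_key("pipeline-1", "https://example.com/item/42");
    let b = idempotency_key("pipeline-1", "https://example.com/item/42");
    let c = idempotency_key("pipeline-1", "https://example.com/item/43");
    assert_eq!(a, b); // duplicate delivery → same key → cached result
    assert_ne!(a, c); // different input → different key
}
```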
Pipeline config for distributed mode
Enable distributed execution in a TOML pipeline:
```toml
[execution]
mode = "distributed"
queue = "redis://localhost:6379"
workers = 20

[[nodes]]
id = "fetch"
service = "http"

[[nodes]]
id = "extract"
service = "ai_claude"

[[edges]]
from = "fetch"
to = "extract"
```
Scaling tips
- Worker count — start at `num_cpus * 4` for I/O-heavy pipelines; reduce for CPU-heavy extraction workloads.
- Redis connection pool — the adapter maintains a pool internally; set `REDIS_MAX_CONNECTIONS` to tune (default `20`).
- Backpressure — the work queue has a configurable maximum depth. When full, `enqueue()` blocks the producer, preventing runaway memory growth.
- Multiple queues — use separate queue names per pipeline type to prevent high-volume crawls from starving high-priority extractions.
Observability
stygian-graph exposes structured metrics and distributed tracing out of the box.
Both are opt-in: neither requires code changes to your adapters or domain logic.
Prometheus metrics
Enable the metrics feature flag:
```toml
stygian-graph = { version = "0.2", features = ["metrics"] }
```
Creating a collector
```rust
use stygian_graph::application::MetricsCollector;

let metrics = MetricsCollector::new();
```
MetricsCollector registers counters, histograms, and gauges on the global Prometheus
registry automatically. It is Clone + Send + Sync and safe to share across threads.
Exposing /metrics
Attach the Prometheus scrape handler to any HTTP server. Example with Axum:
```rust
use axum::{Router, routing::get};
use stygian_graph::application::MetricsCollector;

let metrics = MetricsCollector::new();
let handler = metrics.prometheus_handler();

let app = Router::new()
    .route("/metrics", get(handler))
    .route("/health", get(|| async { "ok" }));

axum::serve(listener, app).await?;
```
Available metrics
| Metric name | Type | Labels | Description |
|---|---|---|---|
stygian_requests_total | counter | service, status | Total requests per adapter |
stygian_request_duration_seconds | histogram | service | Request latency distribution |
stygian_errors_total | counter | service, error_kind | Errors by type |
stygian_worker_pool_active | gauge | pool | Active workers |
stygian_worker_pool_queued | gauge | pool | Queued tasks |
stygian_circuit_breaker_state | gauge | service | 0=closed, 1=open, 2=half-open |
stygian_cache_hits_total | counter | cache | Cache hits |
stygian_cache_misses_total | counter | cache | Cache misses |
Structured tracing
stygian-graph instruments all hot paths with the tracing crate.
Any compatible subscriber (JSON, OTLP, Jaeger) receives full span trees.
Basic JSON logging
```rust
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

tracing_subscriber::registry()
    .with(EnvFilter::from_default_env()
        .add_directive("stygian_graph=debug".parse()?)
        .add_directive("stygian_browser=info".parse()?))
    .with(tracing_subscriber::fmt::layer().json())
    .init();
```
Set RUST_LOG=stygian_graph=trace at runtime for full span output.
OpenTelemetry export (Jaeger / OTLP)
```toml
[dependencies]
opentelemetry = "0.22"
opentelemetry-otlp = { version = "0.15", features = ["grpc-tonic"] }
tracing-opentelemetry = "0.23"
```
```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(
        opentelemetry_otlp::new_exporter()
            .tonic()
            .with_endpoint("http://localhost:4317"),
    )
    .install_batch(opentelemetry::runtime::Tokio)?;

tracing_subscriber::registry()
    .with(EnvFilter::from_default_env())
    .with(OpenTelemetryLayer::new(tracer))
    .init();
```
Key spans
| Span | Attributes | Emitted by |
|---|---|---|
dag_execute | pipeline_id, node_count, wave_count | DagExecutor |
wave_execute | wave, node_ids[] | DagExecutor |
service_call | service, url | ServiceRegistry |
ai_extract | provider, model, tokens_in, tokens_out | AI adapters |
cache_lookup | hit, key_prefix | Cache adapters |
circuit_breaker | service, state_transition | CircuitBreakerImpl |
Health checks
MetricsCollector exposes a health-check endpoint that reports the state of every
registered service:
```rust
let health_json = metrics.health_check(&registry).await;
// {"status":"ok","services":{"http":"healthy","ai_claude":"healthy"}}
```
A service is reported as "degraded" when its circuit breaker is half-open, and
"unhealthy" when it is open.
Performance Tuning
This guide covers strategies for maximising throughput and efficiency: worker pool sizing, channel depth, memory management, cache tuning, and profiling.
Worker pool sizing
WorkerPool bounds concurrency for a set of services. The optimal size depends on whether
work is I/O-bound or CPU-bound.
I/O-bound services (HTTP, browser, AI APIs)
Network operations spend most of their time waiting for remote responses. A large pool hides latency with concurrency:
```rust
// Rule of thumb: 10–100× logical CPU count
let concurrency = num_cpus::get() * 50;
let queue_depth = concurrency * 4; // back-pressure buffer

let pool = WorkerPool::new(concurrency, queue_depth);
```
Start with 50× and adjust based on:
- Target API rate limits (reduce if hitting 429s)
- Available file descriptors (`ulimit -n`)
- Memory pressure (every in-flight request buffers a response body)
CPU-bound services (parsing, transforms, NLP)
CPU work cannot overlap on the same core. Match the pool to physical cores to avoid context-switch overhead:
```rust
// Rule of thumb: 1–2× logical CPU count
let concurrency = num_cpus::get();
let queue_depth = concurrency * 2;

let pool = WorkerPool::new(concurrency, queue_depth);
```
Mixed workloads
When a pipeline mixes HTTP fetching and CPU-heavy extraction, use separate pools per service type:
```rust
let http_pool = WorkerPool::new(num_cpus::get() * 40, 512);
let cpu_pool = WorkerPool::new(num_cpus::get(), 32);
```
Back-pressure
queue_depth controls how many tasks accumulate before callers block.
A shallow queue applies back-pressure sooner, limiting memory. A deep queue
smooths bursts but can hide overload. A ratio of 4–8× concurrency is a
good starting point.
Channel sizing
Stygian uses bounded tokio::sync::mpsc channels internally.
| Channel depth | Throughput | Latency | Memory |
|---|---|---|---|
| 1 | Low | Low | Minimal |
| 64 (default) | Medium | Medium | Low |
| 512 | High | Higher | Moderate |
| Unbounded | Max | Variable | Uncapped |
Always use bounded channels in library code. Unbounded channels are only appropriate for producers with a known-small, finite rate (e.g. a timer).
Detect saturation at runtime: when capacity - len consistently approaches zero,
the downstream consumer is a bottleneck — either scale it out or increase depth.
Memory optimisation
Limit response body size
Unbounded response buffering will exhaust memory on large responses:
```rust
const MAX_BODY_BYTES: usize = 8 * 1024 * 1024; // 8 MiB

let body = response.bytes().await?;
if body.len() > MAX_BODY_BYTES {
    return Err(ScrapingError::ResponseTooLarge(body.len()));
}
```
Arena allocation for wave processing
Allocate all intermediate data for a wave from a single arena and free it in one shot:
```toml
# Cargo.toml
bumpalo = "3"
```

```rust
use bumpalo::Bump;

async fn process_wave(inputs: &[ServiceInput]) {
    let arena = Bump::new();
    let urls: Vec<&str> = inputs
        .iter()
        .map(|i| arena.alloc_str(&i.url))
        .collect();
    // arena dropped here — all scratch freed in one operation
}
```
Buffer pooling
Avoid allocating a fresh Vec<u8> for every HTTP response — reuse from a pool:
```rust
use tokio::sync::Mutex;

struct BufferPool {
    pool: Mutex<Vec<Vec<u8>>>,
    buf_size: usize,
}

impl BufferPool {
    async fn acquire(&self) -> Vec<u8> {
        let mut p = self.pool.lock().await;
        p.pop().unwrap_or_else(|| Vec::with_capacity(self.buf_size))
    }

    async fn release(&self, mut buf: Vec<u8>) {
        buf.clear();
        self.pool.lock().await.push(buf);
    }
}
```
Cache tuning
Measure hit rate first
A cache that rarely hits wastes memory and adds lookup overhead:
```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

struct CacheMetrics {
    hits: AtomicU64,
    misses: AtomicU64,
}

fn hit_rate(metrics: &CacheMetrics) -> f64 {
    let hits = metrics.hits.load(Relaxed) as f64;
    let misses = metrics.misses.load(Relaxed) as f64;
    if hits + misses == 0.0 {
        return 0.0;
    }
    hits / (hits + misses)
}
```
Target ≥ 0.80 for URL-keyed caches before increasing capacity.
LRU vs DashMap
| | BoundedLruCache | DashMapCache |
|---|---|---|
| Eviction policy | LRU (capacity-based) | TTL-based (background task) |
| Best for | Fixed working set | Time-sensitive data |
| Concurrency overhead | Higher (LRU list mutex) | Lower (sharded map) |
For AI extraction results with a 24 h freshness requirement, use DashMapCache with
ttl = Duration::from_secs(86_400). For deduplication of seen URLs, BoundedLruCache
is the right choice.
Rayon vs Tokio
| | Tokio | Rayon |
|---|---|---|
| Use for | I/O-bound work (network, disk) | CPU-bound work (parsing, transforms) |
| Blocking? | No — async tasks never block threads | Yes — tasks block Rayon threads |
| Thread pool | Shared Tokio runtime | Separate Rayon thread pool |
Rule: never call synchronous CPU-heavy work directly in a Tokio task. Offload with
tokio::task::spawn_blocking or rayon::spawn:
```rust
// CPU-heavy HTML parsing — do NOT do this in an async fn directly
let html = response.text().await?;

let extracted = tokio::task::spawn_blocking(move || {
    heavy_parsing(html)
}).await??;
```
Profiling
Flamegraph with cargo-flamegraph
```sh
cargo install flamegraph

# Profile a benchmark
cargo flamegraph --bench dag_executor -- --bench
```
Criterion benchmarks
The stygian-graph crate ships Criterion benchmarks in benches/:
```sh
cargo bench                           # run all benchmarks
cargo bench dag_executor              # run a specific group
cargo bench -- --save-baseline v0     # save baseline for comparison
cargo bench -- --load-baseline v0     # compare against baseline
```
Key benchmark targets:
| Benchmark | What it measures |
|---|---|
dag_executor/wave_10 | 10-node wave execution overhead |
dag_executor/wave_100 | 100-node wave execution overhead |
http_adapter/single | Single HTTP request round-trip |
cache/lru_hit | LRU cache read under contention |
Benchmarks (Apple M4 Pro)
| Operation | Latency |
|---|---|
| DAG executor overhead per wave | ~50 µs |
| HTTP adapter (cached DNS) | ~2 ms |
| Browser acquisition (warm pool) | <100 ms |
| LRU cache read (1 M ops/s) | ~1 µs |
Browser Automation Overview
stygian-browser is a high-performance, anti-detection browser automation library for Rust.
It is built on the Chrome DevTools Protocol
via chromiumoxide and ships a comprehensive suite
of stealth features for bypassing modern bot-detection systems.
Feature summary
| Feature | Description |
|---|---|
| Browser pooling | Warm pool with configurable min/max, LRU eviction, backpressure, and per-context segregation |
| Anti-detection | navigator spoofing, canvas noise, WebGL randomisation, UA patching |
| Headless mode | HeadlessMode::New default (--headless=new) — shares Chrome's headed rendering pipeline, harder to fingerprint-detect |
| Human behaviour | Bézier-curve mouse paths, realistic keystroke timing, typo simulation |
| CDP leak protection | Hides Runtime.enable artifacts that expose automation |
| WebRTC control | Block, proxy-route, or allow — prevents IP leaks |
| Fingerprint generation | Statistically-weighted device profiles (Windows, Mac, Linux, mobile) |
| Stealth levels | None / Basic / Advanced — tune evasion vs. performance |
| Resource filtering | Block images, fonts, media per-tab to speed up text scraping |
| Cookie persistence | Save/restore full session state (cookies + localStorage); inject_cookies() for seeding individual tokens |
Use cases
| Scenario | Recommended config |
|---|---|
| Public HTML scraping (no bot detection) | StealthLevel::None, HTTP adapter preferred |
| Single-page app rendering | StealthLevel::Basic, WaitUntil::NetworkIdle |
| Cloudflare / DataDome protected sites | StealthLevel::Advanced |
| Price monitoring (authenticated sessions) | StealthLevel::Basic + cookie persistence |
| CAPTCHA-adjacent flows | StealthLevel::Advanced + human behaviour + proxy |
Quick start
```rust
use stygian_browser::{BrowserConfig, BrowserPool, WaitUntil};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Default config: headless, Advanced stealth, 2 warm browsers
    let pool = BrowserPool::new(BrowserConfig::default()).await?;

    let handle = pool.acquire().await?; // < 100 ms from warm pool
    let mut page = handle.browser().new_page().await?;

    page.navigate(
        "https://example.com",
        WaitUntil::Selector("body".to_string()),
        Duration::from_secs(30),
    ).await?;

    println!("Title: {}", page.title().await?);
    println!("HTML: {} bytes", page.content().await?.len());

    handle.release().await; // returns browser to pool
    Ok(())
}
```
Performance targets
| Operation | Target |
|---|---|
| Browser acquisition (warm pool) | < 100 ms |
| Browser launch (cold start) | < 2 s |
| Advanced stealth injection overhead | 10–30 ms per page |
| Pool health check | < 5 ms |
Platform support
| Platform | Status |
|---|---|
| macOS (Apple Silicon / Intel) | Fully supported, actively tested |
| Linux (x86-64, ARM64) | Fully supported, CI tested |
| Windows | Depends on chromiumoxide backend; not actively tested |
| Headless CI (GitHub Actions) | Supported — default config is headless: true |
Installation
```toml
[dependencies]
stygian-browser = "0.2"
tokio = { version = "1", features = ["full"] }
```
To disable stealth features for a minimal build:
```toml
stygian-browser = { version = "0.2", default-features = false }
```
Chrome 120+ must be available on the system or specified via STYGIAN_CHROME_PATH.
On CI, install it with apt-get install google-chrome-stable or use the
browser-actions/setup-chrome GitHub Action.
Configuration
All browser behaviour is controlled through BrowserConfig. Every field can be set
programmatically or overridden at runtime via environment variables — no recompilation needed.
Builder pattern
```rust
use stygian_browser::{BrowserConfig, HeadlessMode, StealthLevel};
use stygian_browser::config::PoolConfig;
use stygian_browser::webrtc::{WebRtcConfig, WebRtcPolicy};
use std::time::Duration;

let config = BrowserConfig::builder()
    // ── Browser ──────────────────────────────────────────────────────────
    .headless(true)
    .headless_mode(HeadlessMode::New) // default; --headless=new shares headed rendering
    .window_size(1920, 1080)
    // .chrome_path("/usr/bin/google-chrome".into()) // auto-detect if omitted
    // .user_data_dir("/tmp/my-profile") // omit for auto unique temp dir per instance
    // ── Stealth ──────────────────────────────────────────────────────────
    .stealth_level(StealthLevel::Advanced)
    // ── Network ──────────────────────────────────────────────────────────
    // .proxy("http://user:pass@proxy.example.com:8080".to_string())
    .webrtc(WebRtcConfig {
        policy: WebRtcPolicy::DisableNonProxied,
        ..Default::default()
    })
    // ── Pool ─────────────────────────────────────────────────────────────
    .pool(PoolConfig {
        min_size: 2,
        max_size: 10,
        idle_timeout: Duration::from_secs(300),
        acquire_timeout: Duration::from_secs(10),
    })
    .build();
```
Field reference
Browser settings
| Field | Type | Default | Description |
|---|---|---|---|
headless | bool | true | Run without visible window |
headless_mode | HeadlessMode | New | New = --headless=new (full Chromium rendering, default since Chrome 112); Legacy = classic --headless flag for Chromium < 112 |
window_size | Option<(u32, u32)> | (1920, 1080) | Browser viewport dimensions |
chrome_path | Option<PathBuf> | auto-detect | Path to Chrome/Chromium binary |
stealth_level | StealthLevel | Advanced | Anti-detection level |
proxy | Option<String> | None | Proxy URL (http://, https://, socks5://) |
user_data_dir | Option<PathBuf> | auto-generated | Per-instance temp dir ($TMPDIR/stygian-<id>); set explicitly to share a persistent profile. Auto-generation prevents SingletonLock races between concurrent pools. |
args | Vec<String> | [] | Additional Chrome command-line flags |
Pool settings (PoolConfig)
| Field | Type | Default | Description |
|---|---|---|---|
min_size | usize | 2 | Browsers kept warm at all times |
max_size | usize | 10 | Maximum concurrent browsers |
idle_timeout | Duration | 5 min | Evict idle browser after this duration |
acquire_timeout | Duration | 30 s | Max wait for a pool slot |
WebRTC settings (WebRtcConfig)
| Field | Type | Default | Description |
|---|---|---|---|
policy | WebRtcPolicy | DisableNonProxied | WebRTC IP leak policy |
location | Option<ProxyLocation> | None | Simulated geo-location for WebRTC |
WebRtcPolicy variants:
| Variant | Behaviour |
|---|---|
Allow | Default browser behaviour — real IPs may leak |
DisableNonProxied | Block direct connections; only proxied paths allowed |
BlockAll | Block all WebRTC — safest for anonymous scraping |
Environment variable overrides
All config values can be overridden without touching source code:
| Variable | Default | Description |
|---|---|---|
STYGIAN_CHROME_PATH | auto-detect | Path to Chrome/Chromium binary |
STYGIAN_HEADLESS | true | Set false for headed mode |
STYGIAN_HEADLESS_MODE | new | new (--headless=new) or legacy (classic --headless for Chromium < 112) |
STYGIAN_STEALTH_LEVEL | advanced | none, basic, advanced |
STYGIAN_POOL_MIN | 2 | Minimum warm browsers |
STYGIAN_POOL_MAX | 10 | Maximum concurrent browsers |
STYGIAN_POOL_IDLE_SECS | 300 | Idle timeout before browser eviction |
STYGIAN_POOL_ACQUIRE_SECS | 5 | Seconds to wait for a pool slot |
STYGIAN_LAUNCH_TIMEOUT_SECS | 10 | Browser launch timeout |
STYGIAN_CDP_TIMEOUT_SECS | 30 | Per-operation CDP timeout |
STYGIAN_CDP_FIX_MODE | addBinding | addBinding, isolatedworld, enabledisable |
STYGIAN_PROXY | — | Proxy URL |
STYGIAN_PROXY_BYPASS | — | Comma-separated proxy bypass list (e.g. <local>,localhost) |
STYGIAN_DISABLE_SANDBOX | auto-detect | true inside containers, false on bare metal |
Examples
Minimal — fast text scraping
```rust
use stygian_browser::{BrowserConfig, StealthLevel};

let config = BrowserConfig::builder()
    .headless(true)
    .stealth_level(StealthLevel::None) // no overhead
    .build();
```
Headed debugging session
```rust
let config = BrowserConfig::builder()
    .headless(false)
    .stealth_level(StealthLevel::Basic)
    .build();
```
Proxy with full stealth
```rust
use stygian_browser::webrtc::{WebRtcConfig, WebRtcPolicy};

let config = BrowserConfig::builder()
    .headless(true)
    .stealth_level(StealthLevel::Advanced)
    .proxy("socks5://user:pass@proxy.example.com:1080".to_string())
    .webrtc(WebRtcConfig {
        policy: WebRtcPolicy::BlockAll,
        ..Default::default()
    })
    .build();
```
Anti-detection for JS-heavy sites (X/Twitter, LinkedIn)
`StealthLevel::Advanced` combined with `HeadlessMode::New` is the most evasion-resistant configuration. `HeadlessMode::New` has been the default since v0.1.11, so existing code picks up the stronger mode automatically.
```rust
use stygian_browser::{BrowserConfig, HeadlessMode, StealthLevel};
use stygian_browser::webrtc::{WebRtcConfig, WebRtcPolicy};

let config = BrowserConfig::builder()
    .headless(true)
    .headless_mode(HeadlessMode::New) // default; shared rendering pipeline with headed Chrome
    .stealth_level(StealthLevel::Advanced)
    .webrtc(WebRtcConfig {
        policy: WebRtcPolicy::BlockAll,
        ..Default::default()
    })
    .build();
```
For Chromium ≥ 112 (all modern Chrome / Chromium builds), New is the right
choice. Legacy falls back to the classic --headless flag which uses an older
rendering pipeline — use it only when targeting Chromium < 112.
```rust
// Only needed for Chromium < 112
let config = BrowserConfig::builder()
    .headless_mode(HeadlessMode::Legacy)
    .build();
```
Stealth & Anti-Detection
stygian-browser implements a layered anti-detection system. Each layer targets a different
class of bot-detection signal.
Stealth levels
| Level | navigator spoof | Canvas noise | WebGL random | CDP protection | Human behaviour |
|---|---|---|---|---|---|
None | — | — | — | — | — |
Basic | ✓ | — | — | ✓ | — |
Advanced | ✓ | ✓ | ✓ | ✓ | ✓ |
Trade-offs:
- `None` — maximum performance; no evasion. Suitable for internal services or sites with no bot detection.
- `Basic` — hides `navigator.webdriver`, masks the headless User-Agent, enables CDP protection. Adds < 1 ms overhead. Appropriate for most scraping workloads.
- `Advanced` — full fingerprint injection (canvas, WebGL, audio, fonts, hardware concurrency, device memory) plus human-like mouse and keyboard events. Adds 10–30 ms per page but passes all major detection suites.
Headless mode
The classic --headless flag (HeadlessMode::Legacy) is a well-known detection signal:
sites like X/Twitter and LinkedIn inspect the Chrome renderer version string and reject
old-headless sessions before any session state is even checked.
Since v0.1.11, stygian-browser defaults to --headless=new (HeadlessMode::New),
which shares the same rendering pipeline as headed Chrome and is significantly
harder to fingerprint-detect.
```rust
use stygian_browser::{BrowserConfig, HeadlessMode};

// Default since v0.1.11 — no change needed for existing code
let config = BrowserConfig::builder()
    .headless_mode(HeadlessMode::New)
    .build();

// Legacy mode: only needed for Chromium < 112
let config = BrowserConfig::builder()
    .headless_mode(HeadlessMode::Legacy)
    .build();
```
Or via env var (no recompilation):
```sh
STYGIAN_HEADLESS_MODE=legacy cargo run   # opt back to old behaviour
```
navigator spoofing
Executed on every new document context before any page script runs.
- Sets `navigator.webdriver` → `undefined`
- Patches `navigator.plugins` with a realistic `PluginArray`
- Sets `navigator.languages`, `navigator.language`, `navigator.vendor`
- Aligns `navigator.hardwareConcurrency` and `navigator.deviceMemory` with the chosen device fingerprint
Two layers of protection prevent webdriver detection:
- Instance patch — `Object.defineProperty(navigator, 'webdriver', { get: () => undefined })` hides the flag from direct access (`navigator.webdriver === undefined`).
- Prototype patch — `Object.defineProperty(Navigator.prototype, 'webdriver', ...)` hides the underlying getter from `Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver')`, which some scanners (e.g. pixelscan.net, Akamai) probe directly.
Both patches are injected into every new document context before any page script runs.
The fingerprint is drawn from statistically-weighted device profiles:
```rust
use stygian_browser::fingerprint::{DeviceProfile, Platform};

let profile = DeviceProfile::random(); // weighted: Windows 60%, Mac 25%, Linux 15%

println!("Platform: {:?}", profile.platform);
println!("CPU cores: {}", profile.hardware_concurrency);
println!("Device RAM: {} GB", profile.device_memory);
println!("Screen: {}×{}", profile.screen_width, profile.screen_height);
```
Canvas fingerprint noise
HTMLCanvasElement.toDataURL() and CanvasRenderingContext2D.getImageData() are patched
to add sub-pixel noise (< 1 px) — visually indistinguishable but unique per page load,
preventing cross-site canvas fingerprint correlation.
The noise function is applied via JavaScript injection into each document:
```js
// Simplified representation of the injected script
const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function(...args) {
  const result = origToDataURL.apply(this, args);
  return injectNoise(result, sessionNoiseSeed);
};
```
WebGL randomisation
GPU-based fingerprinting reads RENDERER and VENDOR strings from WebGL. These are
intercepted and replaced with plausible — but randomised — GPU family names:
| Real value | Spoofed value (example) |
|---|---|
ANGLE (Apple, ANGLE Metal Renderer: Apple M4 Pro, Unspecified Version) | ANGLE (NVIDIA, ANGLE Metal Renderer: NVIDIA GeForce RTX 3070 Ti) |
Google SwiftShader | ANGLE (Intel, ANGLE Metal Renderer: Intel Iris Pro) |
The spoofed values are consistent within a session and coherent with the chosen device profile.
CDP leak protection
The Chrome DevTools Protocol itself can expose automation. Three modes are available,
set via STYGIAN_CDP_FIX_MODE or BrowserConfig::cdp_fix_mode:
| Mode | Protection | Compatibility |
|---|---|---|
AddBinding (default) | Wraps calls to hide Runtime.enable side-effects | Best overall |
IsolatedWorld | Runs injection in a separate execution context | Moderate |
EnableDisable | Toggles enable/disable around each command | Broad |
Human behaviour simulation (Advanced only)
Mouse movement — MouseSimulator
Generates Bézier-curve paths with natural arc shapes:
- Distance-aware step counts (12 steps for < 100 px, up to 120 for > 1 000 px)
- Perpendicular control-point offsets for curved trajectories
- Sub-pixel micro-tremor jitter (± 0.3 px per step)
- 10–50 ms inter-event delays
```rust
use stygian_browser::behavior::MouseSimulator;

let sim = MouseSimulator::new();

// Move from (100, 200) to (450, 380) with realistic arc
sim.move_to(&page, 100.0, 200.0, 450.0, 380.0).await?;
sim.click(&page, 450.0, 380.0).await?;
```
Keyboard — TypingSimulator
Models realistic typing cadence:
- Per-key WPM variation (70–130 WPM base rate)
- Configurable typo-and-correct probability
- Burst/pause rhythm typical of human typists
```rust
use stygian_browser::behavior::TypingSimulator;

let typer = TypingSimulator::new()
    .wpm(90)
    .typo_rate(0.03); // 3% typo probability

typer.type_into(&page, "#search-input", "rust async web scraping").await?;
```
Network Information API spoofing
navigator.connection (Network Information API) reveals connection quality and type.
Headless browsers return null here, which is an immediate headless signal on connection-aware scanners.
Advanced stealth injects a realistic NetworkInformation-like object:
| Property | Spoofed value |
|---|---|
effectiveType | "4g" |
type | "wifi" |
downlink | Seeded from performance.timeOrigin (stable per session, ≈ 10 Mbps range) |
rtt | Seeded jitter (50–100 ms range) |
saveData | false |
Battery Status API spoofing
navigator.getBattery() returns null in headless Chrome — a clear automation signal
for scanners that enumerate battery state.
Advanced stealth overrides getBattery() to resolve with a plausible disconnected-battery state:
| Property | Spoofed value |
|---|---|
charging | false |
chargingTime | Infinity |
dischargingTime | Seeded (≈ 3600–7200 s) |
level | Seeded (0.65–0.95) |
The seed values are derived from performance.timeOrigin so they are stable within a page
load but differ across sessions, preventing replay detection.
Fingerprint consistency
All spoofed signals are derived from a single DeviceProfile generated at browser
launch. The profile is consistent across tabs and across the entire session, preventing
inconsistency-based detection (e.g. a Windows User-Agent combined with macOS font metrics).
Browser Pool
The pool maintains a configurable number of warm browser instances, enforces a maximum concurrency limit, and applies backpressure when all slots are occupied.
How it works
```text
BrowserPool
├── [Browser 0] — idle   ← returned immediately on acquire()
├── [Browser 1] — active ← in use by a caller
└── [Browser 2] — idle   ← available

acquire() → lease one idle browser → returns BrowserHandle
release() → return browser to pool → health check, keep or discard
```
When all browsers are active and the pool is at max_size, callers block in
acquire() until one becomes available or acquire_timeout expires.
Creating a pool
```rust
use stygian_browser::{BrowserConfig, BrowserPool};
use stygian_browser::config::PoolConfig;
use std::time::Duration;

let config = BrowserConfig::builder()
    .pool(PoolConfig {
        min_size: 2, // launch 2 browsers immediately
        max_size: 8, // cap at 8 concurrent browsers
        idle_timeout: Duration::from_secs(300),
        acquire_timeout: Duration::from_secs(10),
    })
    .build();

let pool = BrowserPool::new(config).await?;
```
BrowserPool::new() launches min_size browsers in parallel and returns once they are
all ready. Subsequent calls to acquire() return warm instances with no launch overhead.
Acquiring and releasing
```rust
// Acquire — blocks if pool is saturated
let handle = pool.acquire().await?;

// Do work on the browser
let mut page = handle.browser().new_page().await?;
page.navigate("https://example.com", WaitUntil::Load, Duration::from_secs(30)).await?;

// Release — returns browser to pool; discards if health check fails
handle.release().await;
```
BrowserHandle also implements Drop: if you forget to call release(), the browser is still returned to the pool automatically. Calling release() explicitly is preferred in async code, since Drop runs synchronously and cannot await the return path.
Pool stats
```rust
let stats = pool.stats();
println!("available : {}", stats.available); // idle, ready to use
println!("active    : {}", stats.active);    // currently leased out
println!("max       : {}", stats.max);       // pool capacity
println!("total     : {}", stats.total);     // active + available
```
Note: `stats().idle` always returns 0 — this counter is intentionally not maintained on the hot acquire/release path to avoid contention. Use `available` and `active` instead.
Health checks
When a browser is released, the pool runs a lightweight health check:
- Check the CDP connection is still open (no round-trip required).
- Verify that the browser process is alive.
- If either fails, discard the browser and spawn a replacement asynchronously.
Discarded browsers are replaced in the background — the pool never drops below min_size
for long, and callers are never blocked waiting for a replace that will never arrive.
Idle eviction
Browsers that have been idle longer than idle_timeout are gracefully closed and removed
from the pool. This reclaims system memory when scraping activity is low.
If eviction would drop the pool below min_size, eviction is skipped for those browsers
(they stay warm).
Eviction applies to both the shared queue and all per-context queues. Empty context queues are pruned automatically.
Context segregation
When multiple bots or tenants share a single pool, use acquire_for() to keep their
browser instances isolated. Browsers acquired for one context are never returned to a
different context.
```rust
// Bot A and Bot B use the same pool, but their browsers never mix
let a = pool.acquire_for("bot-a").await?;
let b = pool.acquire_for("bot-b").await?;

// Each handle knows its context
assert_eq!(a.context_id(), Some("bot-a"));
assert_eq!(b.context_id(), Some("bot-b"));

a.release().await;
b.release().await;
```
The global max_size still governs total capacity across every context. If the pool is
full, acquire_for() blocks just like acquire().
Releasing a context
When a bot or tenant is deprovisioned, drain its idle browsers:
```rust
let shut_down = pool.release_context("bot-a").await;
println!("Closed {shut_down} browsers for bot-a");
```
Active handles for that context are unaffected; they will be disposed normally when released or dropped.
Listing contexts
```rust
let ids = pool.context_ids().await;
println!("Active contexts with idle browsers: {ids:?}");
```
Shared vs scoped
| Method | Queue | Reuse scope |
|---|---|---|
acquire() | shared | any acquire() caller |
acquire_for("x") | scoped to "x" | only acquire_for("x") |
Both paths share the same semaphore and max_size, so global backpressure
is applied regardless of how browsers were acquired.
Cold start behaviour
If acquire() is called when the pool is empty (e.g. on first call before min_size
browsers have launched) or when all browsers are active and total < max_size, a
new browser is launched on demand. Cold starts take < 2 s on modern hardware.
Graceful shutdown
```rust
// Closes all browsers gracefully — waits for active handles to be released first
pool.shutdown().await;
```
shutdown() signals the pool to stop accepting new acquire() calls, waits for all
active handles to be released (or times out after acquire_timeout), then closes every
browser.
Page Operations
A Page represents a single browser tab. You get one by calling browser.new_page() on a
BrowserHandle.
Navigation
```rust
use stygian_browser::WaitUntil;
use std::time::Duration;

// Wait for a specific CSS selector to appear
page.navigate("https://example.com", WaitUntil::Selector("h1".into()), Duration::from_secs(30)).await?;

// Wait for the DOM to be parsed (fast; before images/CSS load)
page.navigate("https://example.com", WaitUntil::DomContentLoaded, Duration::from_secs(30)).await?;

// Wait for network to go idle (good for SPAs — ≤ 2 in-flight requests for 500 ms)
page.navigate("https://example.com", WaitUntil::NetworkIdle, Duration::from_secs(30)).await?;
```
WaitUntil variants from fastest to safest:
| Variant | Condition |
|---|---|
DomContentLoaded | HTML fully parsed; DOM ready (fires before images/stylesheets) |
NetworkIdle | Load event fired and ≤ 2 in-flight requests for 500 ms |
Selector(css) | document.querySelector(selector) returns a non-null element |
Reading page content
```rust
// Full page HTML
let html = page.content().await?;

// Page title
let title = page.title().await?;

// Current URL (may differ from navigated URL after redirects)
let url = page.url().await?;

// HTTP status code of the last navigation, if available
// (None if no navigation has committed yet, the URL is non-HTTP such as file://,
// or network events were not captured)
if let Some(status) = page.status_code()? {
    println!("HTTP {status}");
}
```
JavaScript evaluation
```rust
// Evaluate an expression; return type must implement serde::DeserializeOwned
let title: String = page.eval("document.title").await?;
let is_auth: bool = page.eval("!!document.cookie.match(/session=/)").await?;
let item_count: u32 = page.eval("document.querySelectorAll('.item').length").await?;

// Execute a statement (return value ignored)
page.eval::<serde_json::Value>("window.scrollTo(0, document.body.scrollHeight)").await?;
```
Element interaction
High-level click/type helpers are provided by the human behaviour module
(stygian_browser::behavior) when the stealth feature is enabled. These simulate
realistic mouse paths and typing cadence. See the Stealth & Anti-Detection
page for full usage.
```rust
use stygian_browser::behavior::{MouseSimulator, TypingSimulator};

let mouse = MouseSimulator::new();
mouse.move_to(&page, 100.0, 200.0, 450.0, 380.0).await?;
mouse.click(&page, 450.0, 380.0).await?;

let typer = TypingSimulator::new().wpm(90);
typer.type_into(&page, "#search-input", "rust async scraping").await?;

// For selector-based waiting without mouse simulation:
page.wait_for_selector(".results", Duration::from_secs(10)).await?;
```
Screenshots
```rust
// Full-page screenshot — returns raw PNG bytes
let png: Vec<u8> = page.screenshot().await?;
tokio::fs::write("screenshot.png", &png).await?;
```
Resource filtering
Block resource types to reduce bandwidth and speed up text-only scraping:
```rust
use stygian_browser::page::{ResourceFilter, ResourceType};

// Block images, fonts, CSS, and media
page.set_resource_filter(ResourceFilter::block_media()).await?;

// Block only images and fonts (keep styles for layout-sensitive work)
page.set_resource_filter(ResourceFilter::block_images_and_fonts()).await?;

// Custom filter
page.set_resource_filter(
    ResourceFilter::default()
        .block(ResourceType::Image)
        .block(ResourceType::Font)
        .block(ResourceType::Stylesheet)
).await?;
```
Must be called before navigate() to take effect.
Cookie management
Session persistence is handled via the session module. Save and restore full session
state (cookies + localStorage) across runs, or inject individual cookies without a full
round-trip.
```rust
use stygian_browser::session::{save_session, restore_session, SessionSnapshot, SessionCookie};

// Save full session state after login
let snapshot: SessionSnapshot = save_session(&page).await?;
snapshot.save_to_file("session.json")?;

// Restore in a later run
let snapshot = SessionSnapshot::load_from_file("session.json")?;
restore_session(&page, &snapshot).await?;

// Inject individual cookies without a full snapshot (e.g. seed a known token)
let cookies = vec![SessionCookie {
    name: "session".to_string(),
    value: "abc123".to_string(),
    domain: ".example.com".to_string(),
    path: "/".to_string(),
    expires: -1.0, // session cookie
    http_only: true,
    secure: true,
    same_site: "Lax".to_string(),
}];
page.inject_cookies(&cookies).await?;
```
Check whether a saved snapshot is still fresh before restoring:
```rust
let mut snapshot = SessionSnapshot::load_from_file("session.json")?;
snapshot.ttl_secs = Some(3600); // 1-hour TTL

if snapshot.is_expired() {
    // re-authenticate
} else {
    restore_session(&page, &snapshot).await?;
}
```
Closing a tab
```rust
page.close().await?;
```
Always close pages explicitly when done. Unreleased pages count against the browser's internal tab limit and may degrade performance.
Complete example
```rust
use stygian_browser::{BrowserConfig, BrowserPool, WaitUntil};
use stygian_browser::page::ResourceFilter;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pool = BrowserPool::new(BrowserConfig::default()).await?;
    let handle = pool.acquire().await?;
    let mut page = handle.browser().new_page().await?;

    // Block images to reduce bandwidth
    page.set_resource_filter(ResourceFilter::block_media()).await?;

    page.navigate(
        "https://example.com/products",
        WaitUntil::Selector(".product-list".to_string()),
        Duration::from_secs(30),
    ).await?;

    let count: u32 = page.eval("document.querySelectorAll('.product').length").await?;
    println!("{count} products found");

    let url = page.url().await?;
    let status = page.status_code()?;
    match status {
        Some(code) => println!("{url} → HTTP {code}"),
        None => println!("{url} → HTTP status unknown"),
    }

    let html = page.content().await?;
    // … pass html to an AI extraction node …

    page.close().await?;
    handle.release().await;
    Ok(())
}
```
Proxy Rotation Overview
stygian-proxy is a high-performance, resilient proxy rotation library for the Stygian
scraping ecosystem. It manages a pool of proxy endpoints, tracks per-proxy health and
latency metrics, and integrates directly with both stygian-graph HTTP adapters and
stygian-browser page contexts.
Feature summary
| Feature | Description |
|---|---|
| Rotation strategies | Round-robin, random, weighted (by proxy weight), least-used (by request count) |
| Per-proxy metrics | Atomic latency and success-rate tracking — zero lock contention |
| Async health checker | Configurable-interval background task; each proxy probed concurrently via JoinSet |
| Circuit breaker | Per-proxy lock-free FSM: Closed → Open → HalfOpen; auto-recovery after cooldown |
| In-memory pool | No external database required; satisfies the ProxyStoragePort trait |
| graph integration | ProxyManagerPort trait for stygian-graph HTTP adapters (feature graph) |
| browser integration | Per-context proxy binding for stygian-browser (feature browser) |
| SOCKS support | Socks4 and Socks5 proxy types (feature socks) |
Quick start
Add the dependency:
```toml
[dependencies]
stygian-proxy = { version = "0.1", features = ["graph"] }
```
Build a pool and make a request:
```rust
use std::sync::Arc;
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};
use stygian_proxy::types::{Proxy, ProxyType};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let storage = Arc::new(MemoryProxyStore::default());
    let manager = Arc::new(
        ProxyManager::with_round_robin(storage, ProxyConfig::default())?
    );

    // Register proxies at startup (or dynamically at any time)
    manager.add_proxy(Proxy {
        url: "http://proxy1.example.com:8080".into(),
        proxy_type: ProxyType::Http,
        username: None,
        password: None,
        weight: 1,
        tags: vec!["us-east".into()],
    }).await?;

    // Start background health checks
    let (cancel, _task) = manager.start();

    // Acquire a proxy for a request
    let handle = manager.acquire_proxy().await?;
    println!("using proxy: {}", handle.proxy_url);

    // Signal success — omitting this counts as a failure toward the circuit breaker
    handle.mark_success();

    cancel.cancel(); // stop health checker
    Ok(())
}
```
Architecture
```
┌───────────────────────────────────────┐
│             ProxyManager              │
│                                       │
│ ┌──────────────┐  ┌─────────────────┐ │
│ │ HealthChecker│  │ CircuitBreakers │ │
│ │ (background) │  │   (per proxy)   │ │
│ └──────────────┘  └─────────────────┘ │
│                                       │
│ ┌──────────────────────────────────┐  │
│ │         RotationStrategy         │  │
│ │  RoundRobin / Random / Weighted  │  │
│ │           / LeastUsed            │  │
│ └──────────────────────────────────┘  │
│                                       │
│ ┌──────────────────────────────────┐  │
│ │         ProxyStoragePort         │  │
│ │   MemoryProxyStore (built-in)    │  │
│ └──────────────────────────────────┘  │
└───────────────────────────────────────┘
          │                  │
          ▼                  ▼
    stygian-graph      stygian-browser
    HTTP adapters       page contexts
```
ProxyManager is the main entry point. It composes a storage backend, a rotation strategy,
a background health checker, and a map of per-proxy circuit breakers. Callers interact only
with acquire_proxy() → ProxyHandle → mark_success().
Cargo features
| Feature | Enables |
|---|---|
| (default: none) | Core pool, strategies, health checker, circuit breaker |
graph | ProxyManagerPort trait + blanket impl + NoopProxyManager |
browser | BrowserProxySource trait + ProxyManagerBridge |
socks | ProxyType::Socks4 and ProxyType::Socks5 variants |
ProxyConfig defaults
| Field | Default | Description |
|---|---|---|
health_check_url | https://httpbin.org/ip | URL probed to verify liveness |
health_check_interval | 60 s | How often to run checks |
health_check_timeout | 5 s | Per-probe HTTP timeout |
circuit_open_threshold | 5 | Consecutive failures before circuit opens |
circuit_half_open_after | 30 s | Cooldown before attempting recovery |
Rotation Strategies
stygian-proxy ships four built-in rotation strategies. All implement the
RotationStrategy trait and operate on a slice of ProxyCandidate values
built from the live pool. Strategies that find zero healthy candidates return
ProxyError::AllProxiesUnhealthy rather than panicking.
Comparison
| Strategy | Best for | Notes |
|---|---|---|
RoundRobinStrategy | Even distribution across identical proxies | Atomic counter, lock-free |
RandomStrategy | Spreading load unpredictably | rand::rng() per call; no shared state |
WeightedStrategy | Prioritising faster or higher-quota proxies | Weighted random sampling; O(n) |
LeastUsedStrategy | Never overloading a single proxy | Picks the candidate with the lowest total request count |
RoundRobinStrategy
Distributes requests evenly across healthy proxies in insertion order. Uses an atomic
counter so there is no Mutex on the hot path.
```rust
use std::sync::Arc;
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};

let storage = Arc::new(MemoryProxyStore::default());
let manager = ProxyManager::with_round_robin(storage, ProxyConfig::default()).unwrap();
```
ProxyManager::with_round_robin is the recommended default. The counter wraps safely
at u64::MAX, which at 1 million requests per second takes ~585,000 years.
RandomStrategy
Picks a healthy proxy at random on every call. Useful when you want to avoid any predictable rotation pattern that fingerprinting could detect.
```rust
use std::sync::Arc;
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};
use stygian_proxy::strategy::RandomStrategy;

let storage = Arc::new(MemoryProxyStore::default());
let manager = ProxyManager::builder()
    .storage(storage)
    .strategy(Arc::new(RandomStrategy))
    .config(ProxyConfig::default())
    .build()
    .unwrap();
```
WeightedStrategy
Each Proxy has a weight: u32 field (default 1). WeightedStrategy performs weighted
random sampling so proxies with higher weights are selected proportionally more often.
```rust
use stygian_proxy::types::{Proxy, ProxyType};

// This proxy is 3× more likely to be selected than a weight-1 proxy.
let fast_proxy = Proxy {
    url: "http://fast.example.com:8080".into(),
    proxy_type: ProxyType::Http,
    weight: 3,
    ..Default::default()
};
```
Use this strategy when proxies have different capacities, quotas, or observed speeds.
LeastUsedStrategy
Selects the healthy proxy with the lowest total request count at the time of the call. This maximises even distribution over time even when proxies are added dynamically.
```rust
use std::sync::Arc;
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};
use stygian_proxy::strategy::LeastUsedStrategy;

let storage = Arc::new(MemoryProxyStore::default());
let manager = ProxyManager::builder()
    .storage(storage)
    .strategy(Arc::new(LeastUsedStrategy))
    .config(ProxyConfig::default())
    .build()
    .unwrap();
```
Custom strategies
Implement RotationStrategy to plug in your own selection logic:
```rust
use async_trait::async_trait;
use stygian_proxy::error::ProxyResult;
use stygian_proxy::strategy::{ProxyCandidate, RotationStrategy};

/// Always pick the proxy with the best success rate.
pub struct BestSuccessRateStrategy;

#[async_trait]
impl RotationStrategy for BestSuccessRateStrategy {
    async fn select<'a>(
        &self,
        candidates: &'a [ProxyCandidate],
    ) -> ProxyResult<&'a ProxyCandidate> {
        candidates
            .iter()
            .filter(|c| c.healthy)
            .max_by(|a, b| {
                let ra = a.metrics.success_rate();
                let rb = b.metrics.success_rate();
                ra.partial_cmp(&rb).unwrap_or(std::cmp::Ordering::Equal)
            })
            .ok_or(stygian_proxy::error::ProxyError::AllProxiesUnhealthy)
    }
}
```
Pass it to ProxyManager::builder().storage(storage).strategy(Arc::new(BestSuccessRateStrategy)).config(config).build().unwrap().
ProxyCandidate fields
| Field | Type | Description |
|---|---|---|
id | Uuid | Stable proxy identifier |
weight | u32 | Relative selection weight |
metrics | Arc<ProxyMetrics> | Shared atomics: requests, failures, latency |
healthy | bool | Result of the last health check |
ProxyMetrics exposes success_rate() -> f64 and avg_latency_ms() -> f64 computed from
the atomic counters without any locking.
Health Checking & Circuit Breaker
stygian-proxy keeps the pool fresh through two complementary mechanisms: an
async health checker that periodically probes each proxy, and a per-proxy
circuit breaker that trips automatically when a proxy starts failing live
requests.
Health checker
HealthChecker runs a background tokio task that probes every registered
proxy on a configurable interval. Probes run concurrently via JoinSet so a
slow or timing-out proxy does not delay checks for healthy ones.
Starting the health checker
```rust
use std::sync::Arc;
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};

let storage = Arc::new(MemoryProxyStore::default());
let config = ProxyConfig {
    health_check_url: "https://httpbin.org/ip".into(),
    health_check_interval: std::time::Duration::from_secs(60),
    health_check_timeout: std::time::Duration::from_secs(5),
    ..ProxyConfig::default()
};

let manager = Arc::new(
    ProxyManager::with_round_robin(storage, config)?
);
let (cancel, _task) = manager.start();

// When shutting down:
cancel.cancel();
```
On-demand check
Call health_checker.check_once().await to run a single probe cycle without
spawning a background task. Useful in tests or before the first request batch.
Health state
Each proxy's health state is stored in HealthMap — an Arc<DashMap<Uuid, bool>>.
The rotation strategy reads this map when building the ProxyCandidate slice; proxies
marked unhealthy are filtered out before selection.
Circuit breaker
Every proxy gets its own CircuitBreaker when it is added to the pool. The
circuit breaker is a lock-free atomic FSM with three states:
```
              failure ≥ threshold
  CLOSED ──────────────────────► OPEN
    ▲                              │
    │ success on probe             │ half_open_after elapsed
    │                              ▼
    └──────────────────────── HALF-OPEN
          success on probe
```
| State | Behaviour |
|---|---|
| Closed | Proxy is selectable; failures increment counter |
| Open | Proxy is excluded from selection; requests are not attempted |
| Half-Open | One probe attempt allowed; success → Closed, failure → Open (timer reset) |
How it integrates with ProxyHandle
acquire_proxy() returns a ProxyHandle. The handle holds an Arc<CircuitBreaker>.
- `handle.mark_success()` — resets the failure counter, moves the circuit to Closed.
- Drop without `mark_success()` — records a failure; opens the circuit after `circuit_open_threshold` consecutive failures.
```rust
let handle = manager.acquire_proxy().await?;
match do_request(&handle.proxy_url).await {
    Ok(_) => handle.mark_success(),
    Err(e) => {
        // handle is dropped here → failure recorded automatically
        eprintln!("request failed: {e}");
    }
}
```
ProxyHandle::direct()
For code paths that conditionally use a proxy, ProxyHandle::direct() returns a
sentinel handle with an empty URL and a noop circuit breaker that can never trip.
Pass it wherever a ProxyHandle is expected when no proxy should be used.
```rust
use stygian_proxy::manager::ProxyHandle;

let handle = if use_proxy {
    manager.acquire_proxy().await?
} else {
    ProxyHandle::direct()
};
```
PoolStats
manager.pool_stats().await returns a PoolStats snapshot:
```rust
let stats = manager.pool_stats().await?;
println!("total={} healthy={} open_circuits={}", stats.total, stats.healthy, stats.open);
```
| Field | Description |
|---|---|
total | Total proxies registered |
healthy | Proxies that passed the last health check |
open | Proxies whose circuit breaker is currently Open |
Tuning recommendations
| Scenario | Recommendation |
|---|---|
| High-churn scraping (many short requests) | Lower circuit_open_threshold to 2–3; shorter circuit_half_open_after (10–15 s) |
| Long-lived connections | Raise circuit_open_threshold to 10+; extend circuit_half_open_after (60–120 s) |
| Residential proxies (naturally flaky) | Use WeightedStrategy with lower weights for known flaky proxies; keep threshold at default (5) |
| Dev / testing | health_check_interval = 5 s; health_check_url pointing to a local echo server |
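As an example, the high-churn row of the table translates into a config like this (a sketch built from the `ProxyConfig` fields documented above):

```rust
use std::time::Duration;
use stygian_proxy::ProxyConfig;

// High-churn profile: trip fast on flaky proxies, retry quickly.
let high_churn = ProxyConfig {
    circuit_open_threshold: 3,
    circuit_half_open_after: Duration::from_secs(10),
    ..ProxyConfig::default()
};
```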
Ecosystem Integrations
stygian-proxy can be wired into both stygian-graph HTTP adapters and
stygian-browser page contexts via optional Cargo features.
stygian-graph (graph feature)
Enable the graph feature to get ProxyManagerPort — the trait that decouples
graph HTTP adapters from any specific proxy implementation.
```toml
[dependencies]
stygian-proxy = { version = "0.1", features = ["graph"] }
```
ProxyManagerPort
```rust
use stygian_proxy::graph::ProxyManagerPort;

// ProxyManager already implements ProxyManagerPort via a blanket impl.
// Inside any HTTP adapter:
async fn fetch(
    proxy_src: &dyn ProxyManagerPort,
    url: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let handle = proxy_src.acquire_proxy().await?;
    // ... make request through handle.proxy_url ...
    handle.mark_success();
    Ok(String::new())
}
```
BoxedProxyManager
BoxedProxyManager is a type alias for Arc<dyn ProxyManagerPort>. Use it to
store a proxy source in HttpAdapter or any other adapter without naming the
concrete type:
```rust
use std::sync::Arc;
use stygian_proxy::graph::{BoxedProxyManager, ProxyManagerPort};
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};

let storage = Arc::new(MemoryProxyStore::default());
let manager = Arc::new(
    ProxyManager::with_round_robin(storage, ProxyConfig::default()).unwrap()
);

// Coerce to the trait object
let boxed: BoxedProxyManager = manager;
```
NoopProxyManager
When no proxying is needed, pass NoopProxyManager instead. It returns
ProxyHandle::direct() on every call — a noop handle with an empty URL.
```rust
use std::sync::Arc;
use stygian_proxy::graph::{BoxedProxyManager, NoopProxyManager};

let no_proxy: BoxedProxyManager = Arc::new(NoopProxyManager);
```
This avoids Option<BoxedProxyManager> branches in adapter code: always pass a
BoxedProxyManager, just swap implementations at construction time.
stygian-browser (browser feature)
Enable the browser feature to bind a specific proxy to each browser page context.
```toml
[dependencies]
stygian-proxy = { version = "0.1", features = ["browser"] }
```
ProxyManagerBridge
ProxyManagerBridge wraps a ProxyManager and exposes bind_proxy(), which
acquires one proxy and returns (proxy_url, ProxyHandle). The URL can be passed
directly to chromiumoxide when launching a browser context.
```rust
use std::sync::Arc;
use stygian_proxy::{MemoryProxyStore, ProxyConfig, ProxyManager};
use stygian_proxy::browser::ProxyManagerBridge;

let storage = Arc::new(MemoryProxyStore::default());
let manager = Arc::new(
    ProxyManager::with_round_robin(storage, ProxyConfig::default()).unwrap()
);
let bridge = ProxyManagerBridge::new(Arc::clone(&manager));

// In your browser context setup:
let (proxy_url, handle) = bridge.bind_proxy().await?;
// Pass proxy_url to chromiumoxide BrowserConfig::builder().proxy(proxy_url) // ...
handle.mark_success(); // after the page session completes
```
BrowserProxySource
BrowserProxySource is the trait implemented by ProxyManagerBridge. Implement
it directly to plug in any proxy source without depending on ProxyManager:
```rust
use async_trait::async_trait;
use stygian_proxy::browser::BrowserProxySource;
use stygian_proxy::manager::ProxyHandle;
use stygian_proxy::error::ProxyResult;

pub struct MyProxySource;

#[async_trait]
impl BrowserProxySource for MyProxySource {
    async fn bind_proxy(&self) -> ProxyResult<(String, ProxyHandle)> {
        // return (proxy_url, handle)
        Ok((
            "http://my-proxy.example.com:8080".into(),
            ProxyHandle::direct(),
        ))
    }
}
```
Failure tracking across integrations
ProxyHandle uses RAII to track request outcomes. The same contract applies
regardless of whether it came from a graph or browser integration:
- Acquire a handle from `acquire_proxy()` (graph) or `bind_proxy()` (browser).
- Perform the I/O operation.
- Call `handle.mark_success()` on success.
- Drop the handle (success or failure is recorded in the circuit breaker).
If the handle is dropped without mark_success(), the failure counter
increments. After circuit_open_threshold consecutive failures the circuit
opens and the proxy is skipped until after the circuit_half_open_after cooldown.
Environment Variables
Configuration for stygian-browser and stygian-graph can be overridden at runtime via environment variables; no recompilation is required. stygian-proxy is the exception: it is configured in code (see below).
stygian-browser
| Variable | Default | Description |
|---|---|---|
STYGIAN_CHROME_PATH | auto-detect | Absolute path to Chrome or Chromium binary |
STYGIAN_HEADLESS | true | false for headed mode (displays browser window) |
STYGIAN_HEADLESS_MODE | new | new (--headless=new) or legacy (classic --headless for Chromium < 112) |
STYGIAN_STEALTH_LEVEL | advanced | none, basic, or advanced |
STYGIAN_POOL_MIN | 2 | Minimum warm browser instances |
STYGIAN_POOL_MAX | 10 | Maximum concurrent browser instances |
STYGIAN_POOL_IDLE_SECS | 300 | Idle timeout before a browser is evicted |
STYGIAN_POOL_ACQUIRE_SECS | 5 | Seconds to wait for a pool slot before error |
STYGIAN_LAUNCH_TIMEOUT_SECS | 10 | Browser launch timeout |
STYGIAN_CDP_TIMEOUT_SECS | 30 | Per-operation CDP command timeout |
STYGIAN_CDP_FIX_MODE | addBinding | addBinding, isolatedworld, or enabledisable |
STYGIAN_PROXY | — | Proxy URL (http://, https://, or socks5://) |
STYGIAN_PROXY_BYPASS | — | Comma-separated proxy bypass list (e.g. <local>,localhost) |
STYGIAN_DISABLE_SANDBOX | auto-detect | true inside containers, false on bare metal |
Chrome binary auto-detection order
When STYGIAN_CHROME_PATH is not set, the library searches:
1. `google-chrome` on `$PATH`
2. `chromium` on `$PATH`
3. `/usr/bin/google-chrome`
4. `/usr/bin/chromium-browser`
5. `/Applications/Google Chrome.app/Contents/MacOS/Google Chrome` (macOS)
6. `C:\Program Files\Google\Chrome\Application\chrome.exe` (Windows)
stygian-proxy
stygian-proxy is configured by constructing ProxyConfig directly in code.
There are no environment variable overrides; pass the struct to
ProxyManager::with_round_robin or ProxyManager::with_strategy.
| Field | Default | Description |
|---|---|---|
health_check_url | https://httpbin.org/ip | URL probed to verify proxy liveness |
health_check_interval | 60 s | How often the background task runs |
health_check_timeout | 5 s | Per-probe HTTP timeout |
circuit_open_threshold | 5 | Consecutive failures before circuit opens |
circuit_half_open_after | 30 s | Cooldown before attempting recovery |
stygian-graph
| Variable | Default | Description |
|---|---|---|
STYGIAN_LOG | info | Log filter applied when RUST_LOG is unset; RUST_LOG takes precedence |
STYGIAN_WORKERS | num_cpus * 4 | Default worker pool concurrency |
STYGIAN_QUEUE_DEPTH | workers * 4 | Worker pool channel depth |
STYGIAN_CACHE_CAPACITY | 10000 | Default LRU cache entry limit |
STYGIAN_CACHE_TTL_SECS | 300 | Default DashMap cache TTL in seconds |
STYGIAN_HTTP_TIMEOUT_SECS | 30 | HTTP adapter request timeout |
STYGIAN_HTTP_MAX_REDIRECTS | 10 | HTTP redirect chain limit |
AI provider variables
| Variable | Used by |
|---|---|
ANTHROPIC_API_KEY | ClaudeAdapter |
OPENAI_API_KEY | OpenAiAdapter |
GOOGLE_API_KEY | GeminiAdapter |
GITHUB_TOKEN | CopilotAdapter |
OLLAMA_BASE_URL | OllamaAdapter (default: http://localhost:11434) |
Distributed execution variables
| Variable | Default | Description |
|---|---|---|
REDIS_URL | redis://localhost:6379 | Redis/Valkey connection URL |
REDIS_MAX_CONNECTIONS | 20 | Redis connection pool size |
STYGIAN_QUEUE_NAME | stygian:work | Default work queue key |
STYGIAN_VISIBILITY_TIMEOUT_SECS | 60 | Task visibility timeout for in-flight items |
Tracing and logging
| Variable | Example | Description |
|---|---|---|
RUST_LOG | stygian_graph=debug,stygian_browser=info | Log level per crate |
RUST_LOG | trace | Enable all tracing (very verbose) |
OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP trace export endpoint |
OTEL_SERVICE_NAME | stygian-scraper | Service name in trace metadata |
Testing & Coverage
Running tests
```sh
# All workspace tests (no Chrome required)
cargo test --workspace

# All features including browser integration tests
cargo test --workspace --all-features

# Also run the Chrome-gated tests (normally marked #[ignore])
cargo test --workspace --all-features -- --include-ignored

# Specific crate
cargo test -p stygian-graph
cargo test -p stygian-browser
```
Test organisation
stygian-graph
Tests live alongside their module in src/ (unit tests) and in crates/stygian-graph/tests/
(integration tests).
| Layer | Location | Approach |
|---|---|---|
| Domain | src/domain/*/tests | Pure Rust, no I/O |
| Adapters | src/adapters/*/tests | Mock port implementations |
| Application | src/application/*/tests | In-process service registry |
| Integration | tests/ | End-to-end with real HTTP (httpbin) |
All tests in the graph crate pass without any external services.
stygian-browser
Browser tests fall into two categories:
| Category | Attribute | Runs in CI |
|---|---|---|
| Pure-logic tests (config, stealth scripts, math) | none | ✅ always |
| Integration tests (real Chrome required) | #[ignore = "requires Chrome"] | ❌ opt-in |
To run integration tests locally:
```sh
# Ensure Chrome 120+ is on PATH, then:
cargo test -p stygian-browser --all-features -- --include-ignored
```
Coverage
Coverage is measured with cargo-tarpaulin.
Install
```sh
cargo install cargo-tarpaulin
```
Measure
```sh
# Workspace summary (excludes Chrome-gated tests)
cargo tarpaulin --workspace --all-features --ignore-tests --out Lcov

# stygian-graph only
cargo tarpaulin -p stygian-graph --all-features --ignore-tests --out Lcov

# stygian-browser logic-only (no Chrome)
cargo tarpaulin -p stygian-browser --lib --ignore-tests --out Lcov
```
Current numbers
| Scope | Line coverage | Notes |
|---|---|---|
| Workspace | 65.74 % | 2 882 / 4 384 lines · 209 tests |
stygian-graph | ~72 % | All unit and integration logic covered |
stygian-browser | structurally bounded | Chrome-gated tests excluded from CI |
High-coverage modules in stygian-graph:
| Module | Coverage |
|---|---|
application/config.rs | ~100 % |
application/executor.rs | ~100 % |
domain/idempotency.rs | ~100 % |
application/registry.rs | ~100 % |
adapters/claude.rs | ~95 % |
adapters/openai.rs | ~95 % |
adapters/gemini.rs | ~95 % |
Why browser coverage is bounded
Every test that launches a real Chrome instance is annotated:
```rust
#[tokio::test]
#[ignore = "requires Chrome"]
async fn pool_acquire_release() {
    // ...
}
```
This keeps cargo test fast and green in CI environments without a Chrome binary.
Pure-logic coverage (fingerprint generation, stealth script math, config validation,
simulator algorithms) is high; only the CDP I/O paths are excluded.
Adding tests
Unit test pattern (graph)
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn my_adapter_returns_error_on_empty_url() {
        let adapter = MyAdapter::default();
        let result = adapter.execute(ServiceInput::default()).await;
        assert!(result.is_err());
    }
}
```
Browser test pattern (logic only)
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn stealth_level_debug_display() {
        assert_eq!(format!("{:?}", StealthLevel::Advanced), "Advanced");
    }

    #[tokio::test]
    #[ignore = "requires Chrome"]
    async fn pool_warm_start() {
        let pool = BrowserPool::new(BrowserConfig::default()).await.unwrap();
        assert!(pool.stats().available >= 1);
        pool.shutdown().await;
    }
}
```
CI test matrix
The GitHub Actions CI workflow (.github/workflows/ci.yml) runs:
| Job | Runs on | Command |
|---|---|---|
test | ubuntu-latest | cargo test --workspace --all-features |
clippy | ubuntu-latest | cargo clippy --workspace --all-features -- -D warnings |
fmt | ubuntu-latest | cargo fmt --check |
docs | ubuntu-latest | cargo doc --workspace --no-deps --all-features |
msrv | ubuntu-latest | cargo +1.94.0 check --workspace |
cross-platform | windows-latest, macos-latest | cargo test --workspace |
Browser integration tests (--include-ignored) are not run in CI — they require a
Chrome binary and a display server which are not available in the default runner images.