stygian_charon/vendor_classifier/mod.rs
//! Vendor fingerprinting confidence classifier (T89).
//!
//! Identifies likely anti-bot vendor(s) for a target and produces
//! a confidence-scored evidence bundle for policy routing. The
//! classifier consumes cookies, response headers, challenge URLs,
//! and body markers; each piece of evidence is labelled by
//! [`EvidenceSource`] so the diagnostic payload can be audited
//! without re-running the match.
//!
//! ## Vendor taxonomy
//!
//! The four **Tier 1** vendors ship with signal catalogues
//! embedded at compile time:
//!
//! | `VendorId`   | Display name                | TOML file                       |
//! |--------------|-----------------------------|---------------------------------|
//! | `DataDome`   | DataDome                    | `data/vendors/datadome.toml`    |
//! | `PerimeterX` | PerimeterX / HUMAN Security | `data/vendors/perimeter_x.toml` |
//! | `Akamai`     | Akamai Bot Manager          | `data/vendors/akamai.toml`      |
//! | `Cloudflare` | Cloudflare                  | `data/vendors/cloudflare.toml`  |
//!
//! Tier 2 vendors ([`VendorId::Hcaptcha`], [`VendorId::Recaptcha`],
//! [`VendorId::Kasada`], [`VendorId::FingerprintCom`],
//! [`VendorId::ShapeSecurity`], [`VendorId::Imperva`]) are present
//! in the enum so downstream T88/T90 layers can name them, but no
//! baseline signals ship for them. Operators register their own
//! catalogues via [`VendorDefinition`].
//!
//! [`VendorId::Unknown`] is the catch-all returned when no vendor
//! matches or no classification can be produced. It must remain the
//! **last** variant so that it sorts last under the deterministic
//! tie-break rule.
//!
//! ## Determinism
//!
//! The classifier is fully deterministic:
//!
//! 1. Patterns are case-folded at load time and at the match site,
//!    so a vendor's score is byte-stable across runs.
//! 2. The top-score tie-break is **`VendorId` discriminant order**:
//!    the earlier a variant is declared in [`VendorId`] (that is,
//!    the smaller its discriminant), the higher its priority when
//!    scores are equal.
//! 3. Confidence is `top_score / (top_score + second_score)`, so
//!    a single matched vendor always reports `1.0`.
//! 4. The `ranked` output is a `Vec` sorted by `(score DESC,
//!    VendorId ASC)`. The `evidence` bundle is sorted by
//!    `(source, signal)` and deduplicated so the JSON form is
//!    byte-stable.
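//!
//! The ranking and confidence rules above can be sketched in
//! isolation. This is a minimal illustration, not the crate's
//! implementation: `SketchVendor`, `rank`, and `confidence` are
//! hypothetical names, scores are assumed to be `u32`, and the
//! `0.0` result for an empty or all-zero ranking is an assumption
//! (the rules above do not cover that case).
//!
//! ```rust
//! // Hypothetical stand-in for `VendorId`. Only declaration order matters:
//! // `derive(Ord)` on an enum compares variants by that order.
//! #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
//! enum SketchVendor {
//!     DataDome,
//!     Cloudflare,
//!     Unknown, // declared last, so it loses every tie
//! }
//!
//! /// Sorts by `(score DESC, vendor ASC)`, mirroring the `ranked` rule.
//! fn rank(mut scores: Vec<(SketchVendor, u32)>) -> Vec<(SketchVendor, u32)> {
//!     scores.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
//!     scores
//! }
//!
//! /// `top / (top + second)`; a missing runner-up counts as zero.
//! /// Assumption: an empty or all-zero ranking yields `0.0`.
//! fn confidence(ranked: &[(SketchVendor, u32)]) -> f64 {
//!     let top = ranked.first().map_or(0, |r| r.1);
//!     let second = ranked.get(1).map_or(0, |r| r.1);
//!     if top == 0 {
//!         return 0.0;
//!     }
//!     f64::from(top) / (f64::from(top) + f64::from(second))
//! }
//! ```
//!
//! With the scores `[(Cloudflare, 3), (DataDome, 3)]`, `rank` puts
//! `DataDome` first because it is declared earlier, and
//! `confidence` yields `0.5`; with a single entry it yields `1.0`.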
//!
//! ## High-confidence threshold
//!
//! The classifier carries a configurable threshold
//! ([`DEFAULT_HIGH_CONFIDENCE_THRESHOLD`] = 0.60). The
//! [`VendorClassification::is_high_confidence`] flag is set when
//! the top vendor's confidence crosses the threshold. Callers can
//! override the threshold via
//! [`VendorClassifier::with_threshold`].
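//!
//! To make the threshold concrete, the sketch below combines it
//! with the confidence ratio from the Determinism section. It is
//! illustrative only: `sketch_is_high_confidence` is a hypothetical
//! helper, scores are assumed to be `u32`, "crosses" is assumed to
//! mean `>=` (the real comparison may be strict), and a zero top
//! score is assumed to be never high-confidence.
//!
//! ```rust
//! /// Hypothetical helper, not part of this crate.
//! fn sketch_is_high_confidence(top_score: u32, second_score: u32, threshold: f64) -> bool {
//!     if top_score == 0 {
//!         return false; // nothing matched (assumed handling)
//!     }
//!     let top = f64::from(top_score);
//!     // Documented ratio: top_score / (top_score + second_score).
//!     let confidence = top / (top + f64::from(second_score));
//!     // Assumption: the flag uses `>=`.
//!     confidence >= threshold
//! }
//! ```
//!
//! Under these assumptions, a threshold of 0.60 means the top score
//! must be at least 1.5 times the runner-up: scores of 3 and 1 give
//! 0.75 and set the flag, while 4 and 3 give roughly 0.57 and do not.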
//!
//! ## Feature flag
//!
//! The module is **default-on** and lives in
//! `crates/stygian-charon/src/vendor_classifier/`. It adds two new
//! public types ([`VendorClassification`] and the underlying
//! [`VendorScore`]) and a single optional field on
//! [`crate::bundle::DiagnosticBundle`], annotated with
//! `#[serde(default, skip_serializing_if = "Option::is_none")]` so
//! that it is omitted from the serialized form when `None` and
//! payloads without it still deserialize. No new feature gate is
//! introduced because the changes are purely additive.
//!
//! # Example
//!
//! ```
//! use stygian_charon::vendor_classifier::{EvidenceSource, VendorClassifier, VendorId};
//! use std::collections::BTreeMap;
//!
//! let classifier = VendorClassifier::with_builtin_defaults();
//!
//! // Evidence observed on one response: cookies, headers, body, and URL.
//! let cookies = vec!["datadome=xyz; path=/".to_string()];
//! let mut headers = BTreeMap::new();
//! headers.insert("x-datadome".to_string(), "protected".to_string());
//! headers.insert("x-datadome-cid".to_string(), "abc".to_string());
//! let body = Some("captcha-delivery.com iframe");
//! let url = "https://www.example.com/cdn-cgi/challenge-platform";
//!
//! let classification = classifier.classify(&cookies, &headers, body, url);
//!
//! // DataDome is the top-ranked vendor and clears the default threshold.
//! assert_eq!(classification.top_vendor, VendorId::DataDome);
//! assert!(classification.is_high_confidence);
//! // The evidence summary records that a cookie contributed to the match.
//! assert!(classification.evidence.source_summary.contains_key(&EvidenceSource::Cookie));
//! ```

mod builtins;
mod classifier;
mod error;
mod evidence;
mod vendor;

pub use classifier::{
    DEFAULT_HIGH_CONFIDENCE_THRESHOLD, VendorClassification, VendorClassifier, VendorScore,
};
pub use error::VendorError;
pub use evidence::{Evidence, EvidenceBundle, EvidenceSource};
pub use vendor::{VendorDefinition, VendorId, VendorSignal, parse_vendor_definition};