The Scrubber
The Scrubber is the static analysis and transformation engine of papertowel. It identifies “AI fingerprints” (stylistic tells that suggest a piece of code was generated by an LLM) and provides mechanisms to remove or modify them.
Overview
The Scrubber operates on a pluggable architecture. Instead of one giant monolithic parser, it uses a series of independent Detectors. Each detector is responsible for a specific category of “slop.”
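As a rough illustration of the pluggable design, the sketch below models a detector as anything with a name and an analyze method, and runs a list of detectors independently over one source string. The names Finding, Detector, ToyLexicalDetector, and run_detectors, along with their fields, are assumptions made for this example; papertowel's actual types may look quite different.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    """One suspected fingerprint (field names are illustrative)."""
    detector: str
    line: int
    message: str


class Detector(Protocol):
    """Hypothetical detector interface: a name plus an analyze method."""
    name: str

    def analyze(self, source: str) -> list[Finding]: ...


class ToyLexicalDetector:
    """Toy detector that flags lines containing the word 'utilize'."""
    name = "lexical-toy"

    def analyze(self, source: str) -> list[Finding]:
        return [
            Finding(self.name, number, "slop word: utilize")
            for number, text in enumerate(source.splitlines(), start=1)
            if "utilize" in text.lower()
        ]


def run_detectors(source: str, detectors: list[Detector]) -> list[Finding]:
    """Run each detector independently and merge their findings."""
    findings: list[Finding] = []
    for detector in detectors:
        findings.extend(detector.analyze(source))
    return findings
```

Because each detector only sees the source text and returns findings, adding a new category of slop means adding one more object to the list rather than touching a shared parser.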
The Detection Pipeline
- Scanning: The Scrubber traverses the target directory, respecting .papertowelignore files and [exclude] patterns in .papertowel.toml.
- Directive Check: Files containing a // papertowel:ignore-file comment are skipped entirely.
- Analysis: Each enabled detector runs against the source code, producing a set of Finding objects.
- Suppression: Findings on lines preceded by // papertowel:ignore-next-line are removed.
- Scoring: Remaining findings are assigned a severity (Low, Medium, High) based on how strongly they correlate with AI generation.
- Transformation: When running in scrub mode, the Scrubber applies transformations to the code to resolve the findings.
See Configuration and Ignoring for full details on suppression options.
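The two comment directives from the pipeline can be sketched as a pair of small functions. This is a minimal illustration that assumes findings are identified by 1-based line numbers and that a directive counts if it appears anywhere on the relevant line; the function names is_file_ignored and drop_suppressed are hypothetical, and the real matching rules may be stricter.

```python
IGNORE_FILE = "// papertowel:ignore-file"
IGNORE_NEXT_LINE = "// papertowel:ignore-next-line"


def is_file_ignored(source: str) -> bool:
    """Directive check: True if the file opts out of scanning entirely."""
    return any(IGNORE_FILE in line for line in source.splitlines())


def drop_suppressed(source: str, finding_lines: list[int]) -> list[int]:
    """Suppression: keep only findings whose preceding line has no directive."""
    lines = source.splitlines()
    kept = []
    for number in finding_lines:
        # lines[number - 2] is the line immediately before a 1-based finding.
        has_previous = 2 <= number <= len(lines) + 1
        previous = lines[number - 2] if has_previous else ""
        if IGNORE_NEXT_LINE not in previous:
            kept.append(number)
    return kept
```

In the pipeline order described above, the file-level check would run before any detector, while the next-line filter runs after analysis and before scoring.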
Pluggable Detectors
papertowel ships with several built-in detectors, each targeting a different aspect of AI style.
Recipe-Based Detection
The newest and most flexible detector is the recipe system. Recipes are TOML files that describe pattern groups (words, phrases, regex), cluster-scoring rules, and optional applies_to/excludes glob gating. Built-in recipes cover slop vocabulary, phrase patterns, and comment boilerplate. You can add your own in .papertowel/recipes/. See Recipes for the full format.
Lexical Analysis
The lexical detector searches for “slop vocabulary.” LLMs have a strong preference for certain words that humans rarely use in a technical context unless they are writing marketing copy. Examples include:
- Adjectives: robust, comprehensive, streamlined, intuitive, performant, granular.
- Verbs: utilize, facilitate, leverage, delve.
- Phrases: “it’s worth noting,” “in order to,” “under the hood.”
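A vocabulary scan of this kind can be approximated with a single case-insensitive regular expression. The sketch below uses only the example terms listed above, not papertowel's full word list, and the function name find_slop is an assumption. It matches whole words only, so inflected forms such as "leveraged" are not caught.

```python
import re

SLOP_WORDS = [
    "robust", "comprehensive", "streamlined", "intuitive", "performant",
    "granular", "utilize", "facilitate", "leverage", "delve",
]
SLOP_PHRASES = ["it's worth noting", "in order to", "under the hood"]

# Phrases come first so multi-word terms are tried before single words.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in SLOP_PHRASES + SLOP_WORDS) + r")\b",
    re.IGNORECASE,
)


def find_slop(text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, matched term) pairs in document order."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Normalize curly apostrophes so typographic "it's" still matches.
        normalized = line.replace("\u2019", "'")
        for match in _PATTERN.finditer(normalized):
            hits.append((number, match.group(1).lower()))
    return hits
```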
Comment Thinning
AI-generated comments often suffer from “stating the obvious.” A human might explain why a complex regex is used; an AI will write // This function uses a regex to validate the email. The comments detector scores comments based on their redundancy and “cookie-cutter” phrasing.
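One simple way to approximate redundancy is to measure how many of a comment's content words already appear in the code it annotates. The sketch below is a stand-in heuristic, not the detector's actual scoring: the stopword list, the tokenizer, and the function name redundancy are all assumptions.

```python
import re

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "this", "to", "of", "is", "and", "uses", "function"}


def _tokens(text: str) -> set[str]:
    """Lowercased alphabetic tokens, splitting snake_case and camelCase."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return {token.lower() for token in re.findall(r"[A-Za-z]+", spaced)}


def redundancy(comment: str, code_line: str) -> float:
    """Fraction of the comment's content words that also appear in the code."""
    words = _tokens(comment) - STOPWORDS
    if not words:
        return 0.0
    return len(words & _tokens(code_line)) / len(words)
```

A comment that merely restates the identifier scores close to 1.0, while a comment that explains a reason tends to introduce words the code does not contain and scores near 0.0.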
Structural Fingerprints
The structure detector looks for suspiciously uniform code organization—such as perfectly balanced module layouts or a lack of the “scar tissue” (slight inconsistencies) that typically accumulates as a human refactors code over time.
Metadata and Boilerplate
The metadata detector identifies the “instant project” syndrome: when a repository appears with a perfect CONTRIBUTING.md, CODE_OF_CONDUCT.md, and SECURITY.md all in the very first commit.
Security Vulnerabilities
The security detector flags insecure patterns frequently produced by AI code generation. It includes 15 regex-based rules covering OWASP Top 10 categories: SQL/shell injection, weak cryptography, disabled TLS verification, unsafe deserialization, hardcoded secrets, and more. Each rule carries a per-rule confidence score. See Security Vulnerabilities for the full rule reference.
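The shape of a regex rule with a per-rule confidence can be sketched as follows. The three rules, their identifiers, and their confidence values are invented for illustration and target Python source; they are not taken from papertowel's rule set, which is documented in the rule reference.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical security rule: an id, a pattern, and a confidence."""
    rule_id: str
    pattern: re.Pattern
    confidence: float


# Illustrative rules only; ids and confidence values are assumptions.
RULES = [
    Rule("tls-verify-disabled", re.compile(r"verify\s*=\s*False"), 0.9),
    Rule("weak-hash-md5", re.compile(r"\bhashlib\.md5\s*\("), 0.6),
    Rule("subprocess-shell", re.compile(r"shell\s*=\s*True"), 0.7),
]


def scan_security(source: str) -> list[tuple[int, str, float]]:
    """Return (1-based line number, rule id, confidence) for every match."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                hits.append((number, rule.rule_id, rule.confidence))
    return hits
```

Keeping the confidence on the rule rather than on the detector lets a noisy pattern contribute less than a near-certain one when findings are scored.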
Additional Detectors
The following detectors run automatically alongside those described above. They can each be toggled individually in .papertowel.toml under [detectors].
| Detector | Category | What it detects |
|---|---|---|
| commit_pattern | History | Machine-clean git history: perfectly uniform commit cadence, 100% conventional message format, zero recovery commits (wip, oops, fixup). |
| tests | Testing | Missing or shallow test coverage: AI often generates a tests/ directory with one trivial smoke test and nothing else. |
| workflow | Workflow | Template-burst GitHub Actions workflows: multiple .github/workflows/*.yml files appearing in the first commit that contain obvious template boilerplate. |
| promotion | Promotion | Disproportionate marketing language in README relative to actual codebase size: more sales copy than technical content. |
| maintenance | Maintenance | Hollow repo shape: many code/config files but empty or placeholder docs, indicating a generated scaffold that was never actually used. |
| name_credibility | Credibility | Generic or AI-flavoured project names (e.g. ai-tool-app, nextgen-scaffold) combined with repetitive self-promotional usage in the README. |
| idiom_mismatch | Style | Language-specific idiom violations, e.g. getter/setter patterns in Rust where idiomatic code would use direct field access or a builder. |
| prompt | Prompt Leakage | Residual LLM prompt fragments in source files or docs: phrases like “As an AI language model”, “Assistant:”, or “Let’s break this down”. |
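To make one of these rows concrete, the sketch below approximates two of the commit_pattern signals from a list of commit subject lines: the share of messages in conventional format and the number of recovery commits. It ignores commit cadence entirely, and the type list, the five-commit minimum, and the function names are assumptions rather than the detector's actual thresholds.

```python
import re

# Common Conventional Commits types; the exact set accepted is an assumption.
CONVENTIONAL = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?!?: \S"
)
RECOVERY = re.compile(r"\b(wip|oops|fixup)\b", re.IGNORECASE)


def commit_stats(subjects: list[str]) -> tuple[float, int]:
    """Return (conventional-format ratio, number of recovery commits)."""
    if not subjects:
        return 0.0, 0
    conventional = sum(1 for s in subjects if CONVENTIONAL.match(s))
    recovery = sum(1 for s in subjects if RECOVERY.search(s))
    return conventional / len(subjects), recovery


def looks_machine_clean(subjects: list[str], minimum: int = 5) -> bool:
    """Flag histories that are fully conventional with no recovery commits."""
    ratio, recovery = commit_stats(subjects)
    return len(subjects) >= minimum and ratio == 1.0 and recovery == 0
```

The minimum commit count avoids flagging a brand-new repository whose two or three commits happen to be tidy.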
Using the Scrubber
Scanning
To identify fingerprints without modifying your code, use the scan command:
papertowel scan .
This will output a report of all findings, categorized by severity.
Scrubbing
To automatically fix the identified fingerprints, use the scrub command:
papertowel scrub .
Pro Tip: Always use the --dry-run flag first to see what changes would be made before applying them to your source.