The Scrubber

The Scrubber is the static analysis and transformation engine of papertowel. Its purpose is to identify “AI fingerprints”—stylistic tells that suggest a piece of code was generated by an LLM—and provide mechanisms to remove or modify them.

Overview

The Scrubber operates on a pluggable architecture. Rather than a single monolithic parser, it runs a series of independent detectors, each responsible for a specific category of “slop.”

The Detection Pipeline

Scanning: The Scrubber traverses the target directory, respecting .papertowelignore files and [exclude] patterns in .papertowel.toml.
Directive Check: Files containing a // papertowel:ignore-file comment are skipped entirely.
Analysis: Each enabled detector runs against the source code, producing a set of Finding objects.
Suppression: Findings on lines preceded by // papertowel:ignore-next-line are removed.
Scoring: Remaining findings are assigned a severity (Low, Medium, High) based on how strongly they correlate with AI generation.
Transformation: When running in scrub mode, the Scrubber applies transformations to the code to resolve the findings.

See Configuration and Ignoring for full details on suppression options.
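The suppression steps can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not papertowel's implementation: the `Finding` tuple and the `suppress` function are hypothetical names, and the sketch assumes a directive counts whenever it appears anywhere on a line.

```python
from typing import NamedTuple

IGNORE_FILE = "// papertowel:ignore-file"
IGNORE_NEXT = "// papertowel:ignore-next-line"


class Finding(NamedTuple):
    """Hypothetical stand-in for a detector result."""
    line: int      # 1-based line number
    message: str


def suppress(source: str, findings: list[Finding]) -> list[Finding]:
    """Drop findings covered by an ignore directive (illustrative only)."""
    lines = source.splitlines()
    # In the pipeline above the file-level check runs before analysis;
    # it is folded in here to keep the sketch short.
    if any(IGNORE_FILE in text for text in lines):
        return []
    # Line numbers that directly follow an ignore-next-line comment.
    muted = {i + 2 for i, text in enumerate(lines) if IGNORE_NEXT in text}
    return [f for f in findings if f.line not in muted]


source = "\n".join([
    "// papertowel:ignore-next-line",
    "let robust = leverage(x);",
    "let y = utilize(z);",
])
findings = [Finding(2, "slop word: robust"), Finding(3, "slop word: utilize")]
kept = suppress(source, findings)
print(kept)
```

Only the finding on line 3 survives, because line 2 is preceded by the directive.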

Pluggable Detectors

papertowel ships with several built-in detectors, each targeting a different aspect of AI style.

Recipe-Based Detection

The newest and most flexible detector is the recipe system. Recipes are TOML files that describe pattern groups (words, phrases, regex), cluster-scoring rules, and optional applies_to/excludes glob gating. Built-in recipes cover slop vocabulary, phrase patterns, and comment boilerplate. You can add your own in .papertowel/recipes/. See Recipes for the full format.
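To make the moving parts concrete, here is a rough Python sketch of glob gating plus a simple cluster threshold. Everything in it is an assumption made for illustration: the dictionary keys mirror the concepts named above but are not the real TOML schema, `min_hits` is an invented parameter, and "cluster scoring" is read here as "report only when several hits occur together in one block of text."

```python
import re
from fnmatch import fnmatch

# Hypothetical in-memory recipe; the real TOML field layout may differ.
RECIPE = {
    "applies_to": ["*.rs", "*.md"],
    "excludes": ["vendor/*"],
    "words": ["robust", "leverage"],
    "phrases": ["it's worth noting"],
    "regex": [r"\bseamless(?:ly)?\b"],
    "min_hits": 2,  # assumed cluster threshold
}


def recipe_applies(path: str, recipe: dict) -> bool:
    """Glob gating: the path must match applies_to and no excludes pattern."""
    included = any(fnmatch(path, pat) for pat in recipe["applies_to"])
    excluded = any(fnmatch(path, pat) for pat in recipe["excludes"])
    return included and not excluded


def count_hits(text: str, recipe: dict) -> int:
    """Count matches across all three pattern groups, case-insensitively."""
    lowered = text.lower()
    hits = sum(
        len(re.findall(rf"\b{re.escape(word)}\b", lowered))
        for word in recipe["words"]
    )
    hits += sum(lowered.count(phrase) for phrase in recipe["phrases"])
    hits += sum(len(re.findall(rx, lowered)) for rx in recipe["regex"])
    return hits


def is_cluster(text: str, recipe: dict) -> bool:
    """Report only when enough hits occur together in one block of text."""
    return count_hits(text, recipe) >= recipe["min_hits"]


print(recipe_applies("src/main.rs", RECIPE))
print(is_cluster("This robust helper will leverage caching.", RECIPE))
```

A threshold like this keeps a single stray "robust" from being reported while still catching paragraphs where the vocabulary piles up.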

Lexical Analysis

The lexical detector searches for “slop vocabulary.” LLMs have a strong preference for certain words that humans rarely use in a technical context unless they are writing marketing copy. Examples include:

Adjectives: robust, comprehensive, streamlined, intuitive, performant, granular.
Verbs: utilize, facilitate, leverage, delve.
Phrases: “it’s worth noting,” “in order to,” “under the hood.”
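A vocabulary scan of this kind can be approximated with one compiled regular expression. The sketch below uses only the example terms listed above; the function name `find_slop` and the `(line, term)` result shape are hypothetical, and the real detector's vocabulary and matching rules may differ.

```python
import re

# Terms taken from the examples above; a real vocabulary would be larger.
SLOP_TERMS = [
    "robust", "comprehensive", "streamlined", "intuitive", "performant",
    "granular", "utilize", "facilitate", "leverage", "delve",
    "it's worth noting", "in order to", "under the hood",
]

# One alternation, longest terms first so phrases win over shorter overlaps.
_ordered = sorted(SLOP_TERMS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in _ordered) + r")\b",
    re.IGNORECASE,
)


def find_slop(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_term) pairs, 1-based, in reading order."""
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            results.append((number, match.group(1).lower()))
    return results


sample = "We utilize a robust cache.\nIn order to delve deeper, read on."
for line_number, term in find_slop(sample):
    print(line_number, term)
```

Because of the word boundaries, inflected forms such as "leverages" are not matched by this sketch; a fuller version would need stemming or explicit variants.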

Comment Thinning

AI-generated comments often state the obvious. A human might explain why a complex regex is used; an AI will write // This function uses a regex to validate the email. The comments detector scores comments on their redundancy and “cookie-cutter” phrasing.
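One simple way to approximate "redundancy" is to measure how many of a comment's content words already appear in the code it annotates. The Python below is a hypothetical heuristic written for this page, not the detector's actual scoring; the stopword list and the `redundancy` function are assumptions.

```python
import re

# Filler words that carry no information about the code (assumed list).
STOPWORDS = {"a", "an", "the", "this", "to", "of", "is", "uses", "function", "and", "with"}


def tokens(text: str) -> set[str]:
    """Lowercased word pieces, splitting snake_case and camelCase."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return {piece.lower() for piece in re.findall(r"[A-Za-z]+", spaced)}


def redundancy(comment: str, code: str) -> float:
    """Share of the comment's content words already present in the code."""
    words = tokens(comment) - STOPWORDS
    if not words:
        return 0.0
    return len(words & tokens(code)) / len(words)


code = "fn validate_email(input: &str) -> bool { EMAIL_REGEX.is_match(input) }"
obvious = "// This function uses a regex to validate the email."
why = "// RFC 5322 allows quoted local parts, so a naive split on '@' fails"

print(redundancy(obvious, code))
print(redundancy(why, code))
```

The first comment restates identifiers already visible in the code and scores at the top of the range; the second explains a reason and shares no words with the code at all.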

Structural Fingerprints

The structure detector looks for suspiciously uniform code organization—such as perfectly balanced module layouts or a lack of the “scar tissue” (slight inconsistencies) that typically accumulates as a human refactors code over time.
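As a rough illustration of what "suspiciously uniform" could mean numerically, the sketch below compares the spread of module sizes using the coefficient of variation. This is an assumed metric chosen for the example, not a description of how the structure detector works; the `0.05` cutoff and the sample line counts are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(sizes: list[int]) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(sizes) / mean(sizes)


def is_suspiciously_uniform(sizes: list[int], threshold: float = 0.05) -> bool:
    """True when unit sizes vary by less than `threshold` (an assumed cutoff)."""
    if len(sizes) < 3 or mean(sizes) == 0:
        return False  # too little data to judge
    return coefficient_of_variation(sizes) < threshold


generated = [40, 41, 40, 39]   # line counts per module, hypothetical
organic = [12, 85, 230, 7]
print(is_suspiciously_uniform(generated), is_suspiciously_uniform(organic))
```

Size is only one possible signal; a real structural check would likely combine several such measurements.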

Metadata and Boilerplate

The metadata detector identifies the “instant project” syndrome: when a repository appears with a perfect CONTRIBUTING.md, CODE_OF_CONDUCT.md, and SECURITY.md all in the very first commit.
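The core of this check is a set comparison against the files in the root commit. The following Python is a minimal sketch under that assumption; `is_instant_project` is a hypothetical name, and the list of paths is expected to come from whatever lists the files in the repository's first commit.

```python
COMMUNITY_FILES = {"CONTRIBUTING.md", "CODE_OF_CONDUCT.md", "SECURITY.md"}


def is_instant_project(first_commit_paths: list[str]) -> bool:
    """True when the root commit already contains every community file."""
    # Compare by file name so that .github/SECURITY.md also counts.
    names = {path.rsplit("/", 1)[-1] for path in first_commit_paths}
    return COMMUNITY_FILES <= names


scaffolded = ["README.md", "CONTRIBUTING.md", "CODE_OF_CONDUCT.md", ".github/SECURITY.md"]
grown = ["README.md", "src/main.rs"]
print(is_instant_project(scaffolded), is_instant_project(grown))
```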

Security Vulnerabilities

The security detector flags insecure patterns frequently produced by AI code generation. It includes 15 regex-based rules covering OWASP Top 10 categories: SQL/shell injection, weak cryptography, disabled TLS verification, unsafe deserialization, hardcoded secrets, and more. Each rule carries its own confidence score. See Security Vulnerabilities for the full rule reference.
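The general shape of a regex rule with a per-rule confidence can be sketched as follows. The three rules, their identifiers, and their confidence values are invented for this example and are not taken from papertowel's rule set; the 0.0 to 1.0 scale is likewise an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str          # hypothetical identifier
    pattern: re.Pattern
    confidence: float     # assumed scale: 0.0 (weak) to 1.0 (strong)


# Illustrative rules only; not the tool's actual patterns.
RULES = [
    Rule("tls-verify-disabled", re.compile(r"verify\s*=\s*False"), 0.9),
    Rule("weak-hash-md5", re.compile(r"\bmd5\s*\(", re.IGNORECASE), 0.6),
    Rule(
        "hardcoded-secret",
        re.compile(r"(?i)\b(password|api_key|secret)\s*=\s*[\"'][^\"']+[\"']"),
        0.7,
    ),
]


def scan_line(line: str) -> list[tuple[str, float]]:
    """Return (rule_id, confidence) for every rule that matches the line."""
    return [(r.rule_id, r.confidence) for r in RULES if r.pattern.search(line)]


print(scan_line("requests.get(url, verify=False)"))
print(scan_line('api_key = "sk-123"'))
```

Attaching a confidence to each rule lets noisy patterns, such as a bare `md5(` call that may be used for non-security checksums, rank below near-certain ones.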

Additional Detectors

The following detectors run automatically alongside those described above. They can each be toggled individually in .papertowel.toml under [detectors].

| Detector | Category | What it detects |
| --- | --- | --- |
| `commit_pattern` | History | Machine-clean git history: perfectly uniform commit cadence, 100% conventional message format, zero recovery commits (`wip`, `oops`, `fixup`). |
| `tests` | Testing | Missing or shallow test coverage: AI often generates a `tests/` directory with one trivial smoke test and nothing else. |
| `workflow` | Workflow | Template-burst GitHub Actions workflows: multiple `.github/workflows/*.yml` files appearing in the first commit that contain obvious template boilerplate. |
| `promotion` | Promotion | Disproportionate marketing language in README relative to actual codebase size: more sales copy than technical content. |
| `maintenance` | Maintenance | Hollow repo shape: many code/config files but empty or placeholder docs, indicating a generated scaffold that was never actually used. |
| `name_credibility` | Credibility | Generic or AI-flavoured project names (e.g. `ai-tool-app`, `nextgen-scaffold`) combined with repetitive self-promotional usage in the README. |
| `idiom_mismatch` | Style | Language-specific idiom violations, e.g. getter/setter patterns in Rust where idiomatic code would use direct field access or a builder. |
| `prompt` | Prompt Leakage | Residual LLM prompt fragments in source files or docs: phrases like “As an AI language model”, “Assistant:”, or “Let’s break this down”. |
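To give a feel for one of these checks, the sketch below approximates two of the `commit_pattern` signals: fully conventional commit subjects and the absence of recovery commits. It is a hypothetical simplification that ignores commit cadence entirely; the function name, the list of conventional types, and the all-or-nothing rule are assumptions.

```python
import re

# Common Conventional Commits types; the accepted set is an assumption.
CONVENTIONAL = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?!?: .+"
)
# Recovery markers named in the table above.
RECOVERY = re.compile(r"\b(wip|oops|fixup)\b", re.IGNORECASE)


def looks_machine_clean(subjects: list[str]) -> bool:
    """True when every subject is conventional and none is a recovery commit."""
    if not subjects:
        return False
    all_conventional = all(CONVENTIONAL.match(s) for s in subjects)
    any_recovery = any(RECOVERY.search(s) for s in subjects)
    return all_conventional and not any_recovery


tidy = ["feat: add parser", "fix(cli): handle empty input", "docs: update readme"]
messy = ["feat: add parser", "wip", "oops forgot file"]
print(looks_machine_clean(tidy), looks_machine_clean(messy))
```

A history that is tidy by this measure is not proof of generation on its own, which is presumably why it is one signal among several.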

Using the Scrubber

Scanning

To identify fingerprints without modifying your code, use the scan command:

papertowel scan .

This will output a report of all findings, categorized by severity.

Scrubbing

To automatically fix the identified fingerprints, use the scrub command:

papertowel scrub .

Pro Tip: Always use the --dry-run flag first to see what changes would be made before applying them to your source.
