===============================================================================
Code Complexity & Clone Detection Tool
===============================================================================
An extensible CLI for measuring code complexity and detecting repeated code
fragments. Designed to be language-agnostic with low overhead.
This document outlines:
- Goals and metrics
- Algorithms used
- Project structure
- Outputs
- Philosophy
===============================================================================
GOALS
- Provide actionable complexity and duplication insights without heavy parsing.
- Support any programming language.
- Keep runtime fast: linear or near-linear where possible.
- Keep architecture modular so deeper analysis can be plugged in later.
- Produce both human-readable and machine-readable reports.
===============================================================================
SCOPE
The MVP focuses on high-signal, low-complexity algorithms that require only
tokenization, not full AST parsing. This keeps the implementation small and
the performance high.
===============================================================================
ALGORITHMS
-------------------------------------------------------------------------------
1. Cyclomatic Complexity (McCabe)
- Measures independent control-flow paths in a function.
- Works by building a simple CFG (control-flow graph) from tokens or indentation.
- Formula:
    CC = E - N + 2P
  where:
    E = number of edges
    N = number of nodes
    P = number of connected components (usually 1)
Purpose:
- Flags overly complex functions.
- Well understood by developers.
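As an illustration: for a single function (P = 1), the formula reduces to the number of branch points plus one, which a token-level pass can approximate without building the full CFG. A minimal sketch, assuming a placeholder decision-keyword list (not the tool's actual table):

```rust
// Sketch: token-counting approximation of McCabe's metric.
// For a single-component CFG (P = 1), CC = E - N + 2 equals the number of
// branching constructs plus one, so we count decision tokens directly.
fn cyclomatic_approx(tokens: &[&str]) -> usize {
    // Illustrative subset of decision tokens; a real table would be larger.
    const DECISIONS: [&str; 6] = ["if", "while", "for", "case", "&&", "||"];
    1 + tokens.iter().filter(|t| DECISIONS.contains(*t)).count()
}

fn main() {
    let tokens = ["fn", "f", "if", "x", "&&", "y", "while", "z"];
    println!("{}", cyclomatic_approx(&tokens)); // 3 decisions + 1 = 4
}
```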
-------------------------------------------------------------------------------
2. Token-based Clone Detection (Rabin-Karp Rolling Hash)
- Detects repeated token sequences across files.
- Uses a rolling hash window (e.g., 20–50 tokens).
- Language-agnostic; extremely fast.
Purpose:
- Quickly identifies copy/paste blocks and boilerplate.
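A minimal sketch of the windowed fingerprinting idea, assuming tokens are already mapped to integer IDs; the base constant is an arbitrary illustrative choice, and a real pass would verify candidate pairs token-by-token to rule out hash collisions:

```rust
use std::collections::HashMap;

// Sketch: fingerprint every window of `w` tokens with a polynomial rolling
// hash and report windows sharing a fingerprint as clone candidates.
fn clone_candidates(tokens: &[u64], w: usize) -> Vec<(usize, usize)> {
    const B: u64 = 1_000_003; // hash base (arbitrary)
    if tokens.len() < w {
        return Vec::new();
    }
    // b_pow = B^(w-1), used to remove the outgoing token from the window.
    let b_pow = (0..w - 1).fold(1u64, |acc, _| acc.wrapping_mul(B));
    let mut h: u64 = tokens[..w]
        .iter()
        .fold(0, |acc, &t| acc.wrapping_mul(B).wrapping_add(t));
    let mut seen: HashMap<u64, usize> = HashMap::new();
    let mut pairs = Vec::new();
    for i in 0..=tokens.len() - w {
        if i > 0 {
            // Slide the window: drop tokens[i-1], add tokens[i+w-1].
            h = h
                .wrapping_sub(tokens[i - 1].wrapping_mul(b_pow))
                .wrapping_mul(B)
                .wrapping_add(tokens[i + w - 1]);
        }
        if let Some(&j) = seen.get(&h) {
            pairs.push((j, i)); // two windows with the same fingerprint
        } else {
            seen.insert(h, i);
        }
    }
    pairs
}

fn main() {
    // Token IDs; positions 0..3 repeat at 5..8.
    let toks = [1, 2, 3, 9, 9, 1, 2, 3];
    println!("{:?}", clone_candidates(&toks, 3)); // [(0, 5)]
}
```

Each slide is O(1), so the whole scan stays linear in the token count.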
-------------------------------------------------------------------------------
3. Lines of Code (LOC)
- Counts physical, logical, comment, and blank lines.
- Token-based classification for accuracy.
- Supports ranking by any metric (logical, physical, comments, blank).
- Can aggregate and rank by directory.
- Required for future Maintainability Index calculations.
Purpose:
- Baseline size metric.
- Identify largest files and directories.
- Track codebase growth and comment density.
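A simplified sketch of the line classification, assuming only `//` line comments; the actual implementation classifies via the tokenizer, which would also handle block comments and strings:

```rust
// Sketch: classify physical lines into blank / comment / code.
#[derive(Default, Debug, PartialEq)]
struct LocCounts {
    physical: usize,
    blank: usize,
    comment: usize,
    code: usize,
}

fn count_loc(src: &str) -> LocCounts {
    let mut c = LocCounts::default();
    for line in src.lines() {
        c.physical += 1;
        let t = line.trim();
        if t.is_empty() {
            c.blank += 1;
        } else if t.starts_with("//") {
            c.comment += 1;
        } else {
            c.code += 1;
        }
    }
    c
}

fn main() {
    let src = "// header\n\nfn main() {\n}\n";
    println!("{:?}", count_loc(src));
}
```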
-------------------------------------------------------------------------------
4. Halstead Complexity Metrics
- Based on counts of operators and operands.
- Computes vocabulary, volume, difficulty, and effort.
Purpose:
- Complements Cyclomatic Complexity with lexical complexity.
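The standard formulas, given the four counts produced by the tokenizer (the counts in the usage below are made up for illustration):

```rust
// Sketch of the standard Halstead formulas, given:
//   n1     = distinct operators, n2     = distinct operands,
//   cap_n1 = total operators,    cap_n2 = total operands.
fn halstead(n1: f64, n2: f64, cap_n1: f64, cap_n2: f64) -> (f64, f64, f64, f64) {
    let vocabulary = n1 + n2;                    // n  = n1 + n2
    let length = cap_n1 + cap_n2;                // N  = N1 + N2
    let volume = length * vocabulary.log2();     // V  = N * log2(n)
    let difficulty = (n1 / 2.0) * (cap_n2 / n2); // D  = (n1/2) * (N2/n2)
    let effort = difficulty * volume;            // E  = D * V
    (vocabulary, volume, difficulty, effort)
}

fn main() {
    let (voc, vol, dif, eff) = halstead(10.0, 7.0, 30.0, 25.0);
    println!("vocab={voc} volume={vol:.1} difficulty={dif:.1} effort={eff:.1}");
}
```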
===============================================================================
FUTURE
-------------------------------------------------------------------------------
See todo.txt
===============================================================================
CRATES
--------------------------------------------------------------------------------
core
  tokenizer
    - minimal language-agnostic tokenizer
    - operator/operand extractor (Halstead)
    - whitespace, comment, and string handling
  complexity
    cyclomatic
      - CFG builder
      - node/edge counter
    halstead
      - operator + operand tables
    loc
      - physical and logical LOC
  cloner
    - rolling hash (Rabin-Karp)
    - w-shingling and fingerprint index
    - minimum token length filter
  reporter
    - JSON output
    - plaintext/ANSI output
    - sorting, filtering, severity thresholds
  loader
    - config loading
    - file walker + globbing
    - ignore rules
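A rough sketch of what the tokenizer crate's core loop might look like; string literals and block comments are omitted here, though the crate is expected to handle them:

```rust
// Sketch: split on whitespace, strip `//` line comments, and emit
// punctuation as single-character tokens. Real handling of string literals
// and block comments is omitted for brevity.
fn tokenize(src: &str) -> Vec<String> {
    let mut out = Vec::new();
    for line in src.lines() {
        // Drop everything after a `//` line comment.
        let code = line.split("//").next().unwrap_or("");
        for word in code.split_whitespace() {
            let mut buf = String::new();
            for ch in word.chars() {
                if ch.is_alphanumeric() || ch == '_' {
                    buf.push(ch); // accumulate an identifier/number
                } else {
                    if !buf.is_empty() {
                        out.push(std::mem::take(&mut buf));
                    }
                    out.push(ch.to_string()); // punctuation token
                }
            }
            if !buf.is_empty() {
                out.push(buf);
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", tokenize("let x = 1; // comment"));
}
```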
--------------------------------------------------------------------------------
cli
  - commands: analyze, clones, complexity, loc, dump-config
    - analyze: full analysis (complexity + clones + LOC)
    - clones: detect code clones only
    - complexity: analyze cyclomatic complexity and LOC
    - loc: focused LOC analysis with ranking (by file or directory)
    - dump-config: display or save current configuration
  - flags: --json, --threshold, --min-tokens, --rank-by, --rank-dirs, --path
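A bare-bones sketch of the subcommand dispatch using only std; the real CLI would likely use a proper argument parser, and the return strings here are placeholders:

```rust
use std::env;

// Sketch: map the first positional argument to a subcommand.
fn dispatch(args: &[String]) -> &'static str {
    match args.first().map(String::as_str) {
        Some("analyze") => "complexity + clones + LOC",
        Some("clones") => "clone detection only",
        Some("complexity") => "cyclomatic complexity + LOC",
        Some("loc") => "LOC ranking",
        Some("dump-config") => "show configuration",
        _ => "usage: <analyze|clones|complexity|loc|dump-config> [flags]",
    }
}

fn main() {
    let args: Vec<String> = env::args().skip(1).collect();
    println!("{}", dispatch(&args));
}
```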
===============================================================================
OUTPUT FORMATS
--------------------------------------------------------------------------------
Plaintext:
FILE: src/server/routes.rs
  Cyclomatic: 14 (warning)
  Halstead Volume: 312.9
  LOC: 128

FILE: src/server/auth.rs
  Cyclomatic: 5
  Halstead Volume: 102.3
  LOC: 41

CLONES:
  - HashMatch #7
    - src/server/routes.rs:41-78
    - src/server/router.rs:12-49
    Length: 32 tokens
--------------------------------------------------------------------------------
JSON:
{
  "files": [
    {
      "path": "src/server/routes.rs",
      "cyclomatic": 14,
      "halstead": { "volume": 312.9, "difficulty": 12.8 },
      "loc": 128
    }
  ],
  "clones": [
    {
      "id": 7,
      "length": 32,
      "locations": [
        { "file": "src/server/routes.rs", "start": 41, "end": 78 },
        { "file": "src/server/router.rs", "start": 12, "end": 49 }
      ]
    }
  ]
}
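A hand-rolled sketch of serializing one clone record to this schema; a real reporter would more likely derive this with serde, which is left out to keep the example dependency-free:

```rust
// Sketch: the clone record from the JSON schema above, serialized by hand.
struct Location {
    file: String,
    start: u32,
    end: u32,
}

struct CloneRecord {
    id: u32,
    length: u32,
    locations: Vec<Location>,
}

fn to_json(c: &CloneRecord) -> String {
    let locs: Vec<String> = c
        .locations
        .iter()
        .map(|l| {
            format!(
                r#"{{ "file": "{}", "start": {}, "end": {} }}"#,
                l.file, l.start, l.end
            )
        })
        .collect();
    format!(
        r#"{{ "id": {}, "length": {}, "locations": [{}] }}"#,
        c.id,
        c.length,
        locs.join(", ")
    )
}

fn main() {
    let c = CloneRecord {
        id: 7,
        length: 32,
        locations: vec![Location {
            file: "src/server/routes.rs".into(),
            start: 41,
            end: 78,
        }],
    };
    println!("{}", to_json(&c));
}
```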
===============================================================================
DESIGN PRINCIPLES
1. Zero language assumptions in the MVP.
2. Pluggable architecture (token → AST → CFG → semantic).
3. High performance by default.
4. No global state; everything streamed and incremental.
5. Developer-friendly reporting and clear severity levels.
6. Configurable thresholds.
===============================================================================
REFERENCES
Cyclomatic Complexity (McCabe, 1976):
https://www.literateprogramming.com/mccabe.pdf
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication500-235.pdf
AST clone detection:
https://www.cs.ucdavis.edu/~su/publications/icse07.pdf
LSH-based clone detection:
https://journals.riverpublishers.com/index.php/JCSANDM/article/download/18799/18101/68821
https://www.sei.cmu.edu/library/code-similarity-detection-using-syntax-agnostic-locality-sensitive-hashing/
Halstead:
https://www.researchgate.net/publication/317317114_Software_Complexity_Analysis_Using_Halstead_Metrics
https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.1990.pdf