code complexity & repetition analysis tool
1===============================================================================
2Code Complexity & Clone Detection Tool
3===============================================================================
4
5An extensible CLI for measuring code complexity and detecting repeated code
6fragments. Designed to be language-agnostic with low-overhead.
7
8This document outlines:
9 - Goals and metrics
10 - Algorithms used
11 - Project structure
12 - Outputs
13 - Philosophy
14
15===============================================================================
16GOALS
17
18- Provide actionable complexity and duplication insights without heavy parsing.
19- Support any programming language
20- Keep runtime fast: linear or near-linear where possible.
21- Keep architecture modular so deeper analysis can be plugged in later.
22- Produce both human-readable and machine-readable reports.
23
24===============================================================================
25SCOPE
26
27The MVP focuses on high-signal, low-complexity algorithms that require only
28tokenization—not full AST parsing. This keeps the implementation small and
29the performance excellent.
30
31===============================================================================
32Algorithms
33-------------------------------------------------------------------------------
341. Cyclomatic Complexity (McCabe)
35- Measures independent control-flow paths in a function.
36- Works by building a simple CFG (control-flow graph) from tokens or indentation.
37- Formula:
38 CC = E - N + 2P
39 where:
40 E = number of edges
41 N = number of nodes
42 P = number of connected components (usually 1)
43
44Purpose:
45- Flags overly complex functions.
46- Well understood by developers.
47-------------------------------------------------------------------------------
482. Token-based Clone Detection (Rabin-Karp Rolling Hash)
49- Detects repeated token sequences across files.
50- Uses a rolling hash window (e.g., 20–50 tokens).
51- Language-agnostic; extremely fast.
52
53Purpose:
54- Quickly identifies copy/paste blocks and boilerplate.
55-------------------------------------------------------------------------------
563. Lines of Code (LOC)
57- Counts physical, logical, comment, and blank lines.
58- Token-based classification for accuracy.
59- Supports ranking by any metric (logical, physical, comments, blank).
60- Can aggregate and rank by directory.
61- Required for future Maintainability Index calculations.
62
63Purpose:
64- Baseline size metric.
65- Identify largest files and directories.
66- Track codebase growth and comment density.
67-------------------------------------------------------------------------------
684. Halstead Complexity Metrics
69- Based on counts of operators and operands.
70- Computes vocabulary, volume, difficulty, and effort.
71
72Purpose:
73- Complements Cyclomatic Complexity with lexical complexity.
74===============================================================================
75FUTURE
76-------------------------------------------------------------------------------
77See todo.txt
78===============================================================================
79CRATES
80--------------------------------------------------------------------------------
81core
82 tokenizer
83 - minimal language-agnostic tokenizer
84 - operator/operand extractor (Halstead)
85 - whitespace, comments, string handling
86 complexity
87 cyclomatic
88 - CFG builder
89 - node/edge counter
90 halstead
91 - operator+operand tables
92 loc
93 - physical and logical LOC
94 cloner
95 - rolling hash (Rabin-Karp)
96 - w-shingling and fingerprint index
97 - min token length filter
98 reporter
99 - JSON output
100 - plaintext/ANSI output
101 - sorting, filtering, severity thresholds
102 loader
103 - config loading
104 - file walker + globbing
105 - ignore rules
106
107--------------------------------------------------------------------------------
108cli
109 - commands: analyze, clones, complexity, loc, dump-config
110 - analyze: full analysis (complexity + clones + LOC)
111 - clones: detect code clones only
112 - complexity: analyze cyclomatic complexity and LOC
113 - loc: focused LOC analysis with ranking (by file or directory)
114 - dump-config: display or save current configuration
115 - flags: --json, --threshold, --min-tokens, --rank-by, --rank-dirs, --path
116
117===============================================================================
118OUTPUT FORMATS
119--------------------------------------------------------------------------------
120Plaintext:
121
122FILE: src/server/routes.rs
123 Cyclomatic: 14 (warning)
124 Halstead Volume: 312.9
125 LOC: 128
126
127FILE: src/server/auth.rs
128 Cyclomatic: 5
129 Halstead Volume: 102.3
130 LOC: 41
131
132CLONES:
133 - HashMatch #7
134 - src/server/routes.rs:41-78
135 - src/server/router.rs:12-49
136 Length: 32 tokens
137
138--------------------------------------------------------------------------------
139JSON:
140
141{
142 "files": [
143 {
144 "path": "src/server/routes.rs",
145 "cyclomatic": 14,
146 "halstead": { "volume": 312.9, "difficulty": 12.8 },
147 "loc": 128
148 }
149 ],
150 "clones": [
151 {
152 "id": 7,
153 "length": 32,
154 "locations": [
155 { "file": "src/server/routes.rs", "start": 41, "end": 78 },
156 { "file": "src/server/router.rs", "start": 12, "end": 49 }
157 ]
158 }
159 ]
160}
161
162===============================================================================
163DESIGN PRINCIPLES
164
1651. Zero language assumptions in the MVP.
1662. Pluggable architecture (token → AST → CFG → semantic).
1673. High performance by default.
1684. No global state; everything streamed and incremental.
1695. Developer-friendly reporting and clear severity levels.
1706. Configurable thresholds
171
172===============================================================================
173REFERENCES
174Cyclomatic Complexity (McCabe, 1976):
175 https://www.literateprogramming.com/mccabe.pdf
176 https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication500-235.pdf
177
178AST clone detection:
179 https://www.cs.ucdavis.edu/~su/publications/icse07.pdf
180
181LSH-based clone detection:
182 https://journals.riverpublishers.com/index.php/JCSANDM/article/download/18799/18101/68821
183 https://www.sei.cmu.edu/library/code-similarity-detection-using-syntax-agnostic-locality-sensitive-hashing/
184
185Halstead:
186 https://www.researchgate.net/publication/317317114_Software_Complexity_Analysis_Using_Halstead_Metrics
187 https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.1990.pdf