code complexity & repetition analysis tool
at main 187 lines 6.3 kB view raw
1=============================================================================== 2Code Complexity & Clone Detection Tool 3=============================================================================== 4 5An extensible CLI for measuring code complexity and detecting repeated code 6fragments. Designed to be language-agnostic with low-overhead. 7 8This document outlines: 9 - Goals and metrics 10 - Algorithms used 11 - Project structure 12 - Outputs 13 - Philosophy 14 15=============================================================================== 16GOALS 17 18- Provide actionable complexity and duplication insights without heavy parsing. 19- Support any programming language 20- Keep runtime fast: linear or near-linear where possible. 21- Keep architecture modular so deeper analysis can be plugged in later. 22- Produce both human-readable and machine-readable reports. 23 24=============================================================================== 25SCOPE 26 27The MVP focuses on high-signal, low-complexity algorithms that require only 28tokenization—not full AST parsing. This keeps the implementation small and 29the performance excellent. 30 31=============================================================================== 32Algorithms 33------------------------------------------------------------------------------- 341. Cyclomatic Complexity (McCabe) 35- Measures independent control-flow paths in a function. 36- Works by building a simple CFG (control-flow graph) from tokens or indentation. 37- Formula: 38 CC = E - N + 2P 39 where: 40 E = number of edges 41 N = number of nodes 42 P = number of connected components (usually 1) 43 44Purpose: 45- Flags overly complex functions. 46- Well understood by developers. 47------------------------------------------------------------------------------- 482. Token-based Clone Detection (Rabin-Karp Rolling Hash) 49- Detects repeated token sequences across files. 50- Uses a rolling hash window (e.g., 20–50 tokens). 51- Language-agnostic; extremely fast. 52 53Purpose: 54- Quickly identifies copy/paste blocks and boilerplate. 55------------------------------------------------------------------------------- 563. Lines of Code (LOC) 57- Counts physical, logical, comment, and blank lines. 58- Token-based classification for accuracy. 59- Supports ranking by any metric (logical, physical, comments, blank). 60- Can aggregate and rank by directory. 61- Required for future Maintainability Index calculations. 62 63Purpose: 64- Baseline size metric. 65- Identify largest files and directories. 66- Track codebase growth and comment density. 67------------------------------------------------------------------------------- 684. Halstead Complexity Metrics 69- Based on counts of operators and operands. 70- Computes vocabulary, volume, difficulty, and effort. 71 72Purpose: 73- Complements Cyclomatic Complexity with lexical complexity. 74=============================================================================== 75FUTURE 76------------------------------------------------------------------------------- 77See todo.txt 78=============================================================================== 79CRATES 80-------------------------------------------------------------------------------- 81core 82 tokenizer 83 - minimal language-agnostic tokenizer 84 - operator/operand extractor (Halstead) 85 - whitespace, comments, string handling 86 complexity 87 cyclomatic 88 - CFG builder 89 - node/edge counter 90 halstead 91 - operator+operand tables 92 loc 93 - physical and logical LOC 94 cloner 95 - rolling hash (Rabin-Karp) 96 - w-shingling and fingerprint index 97 - min token length filter 98 reporter 99 - JSON output 100 - plaintext/ANSI output 101 - sorting, filtering, severity thresholds 102 loader 103 - config loading 104 - file walker + globbing 105 - ignore rules 106 107-------------------------------------------------------------------------------- 108cli 109 - commands: analyze, clones, complexity, loc, dump-config 110 - analyze: full analysis (complexity + clones + LOC) 111 - clones: detect code clones only 112 - complexity: analyze cyclomatic complexity and LOC 113 - loc: focused LOC analysis with ranking (by file or directory) 114 - dump-config: display or save current configuration 115 - flags: --json, --threshold, --min-tokens, --rank-by, --rank-dirs, --path 116 117=============================================================================== 118OUTPUT FORMATS 119-------------------------------------------------------------------------------- 120Plaintext: 121 122FILE: src/server/routes.rs 123 Cyclomatic: 14 (warning) 124 Halstead Volume: 312.9 125 LOC: 128 126 127FILE: src/server/auth.rs 128 Cyclomatic: 5 129 Halstead Volume: 102.3 130 LOC: 41 131 132CLONES: 133 - HashMatch #7 134 - src/server/routes.rs:41-78 135 - src/server/router.rs:12-49 136 Length: 32 tokens 137 138-------------------------------------------------------------------------------- 139JSON: 140 141{ 142 "files": [ 143 { 144 "path": "src/server/routes.rs", 145 "cyclomatic": 14, 146 "halstead": { "volume": 312.9, "difficulty": 12.8 }, 147 "loc": 128 148 } 149 ], 150 "clones": [ 151 { 152 "id": 7, 153 "length": 32, 154 "locations": [ 155 { "file": "src/server/routes.rs", "start": 41, "end": 78 }, 156 { "file": "src/server/router.rs", "start": 12, "end": 49 } 157 ] 158 } 159 ] 160} 161 162=============================================================================== 163DESIGN PRINCIPLES 164 1651. Zero language assumptions in the MVP. 1662. Pluggable architecture (token → AST → CFG → semantic). 1673. High performance by default. 1684. No global state; everything streamed and incremental. 1695. Developer-friendly reporting and clear severity levels. 1706. Configurable thresholds 171 172=============================================================================== 173REFERENCES 174Cyclomatic Complexity (McCabe, 1976): 175 https://www.literateprogramming.com/mccabe.pdf 176 https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication500-235.pdf 177 178AST clone detection: 179 https://www.cs.ucdavis.edu/~su/publications/icse07.pdf 180 181LSH-based clone detection: 182 https://journals.riverpublishers.com/index.php/JCSANDM/article/download/18799/18101/68821 183 https://www.sei.cmu.edu/library/code-similarity-detection-using-syntax-agnostic-locality-sensitive-hashing/ 184 185Halstead: 186 https://www.researchgate.net/publication/317317114_Software_Complexity_Analysis_Using_Halstead_Metrics 187 https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.1990.pdf