===============================================================================
Code Complexity & Clone Detection Tool
===============================================================================

An extensible CLI for measuring code complexity and detecting repeated code
fragments. Designed to be language-agnostic with low overhead.

This document outlines:
- Goals and metrics
- Algorithms used
- Project structure
- Outputs
- Philosophy

===============================================================================
GOALS

- Provide actionable complexity and duplication insights without heavy parsing.
- Support any programming language.
- Keep runtime fast: linear or near-linear where possible.
- Keep the architecture modular so deeper analysis can be plugged in later.
- Produce both human-readable and machine-readable reports.

===============================================================================
SCOPE

The MVP focuses on high-signal, low-complexity algorithms that require only
tokenization, not full AST parsing. This keeps the implementation small and
the performance excellent.

===============================================================================
ALGORITHMS

Minimal, illustrative Rust sketches of each algorithm follow this section.

-------------------------------------------------------------------------------
1. Cyclomatic Complexity (McCabe)

- Measures independent control-flow paths in a function.
- Works by building a simple CFG (control-flow graph) from tokens or
  indentation.
- Formula: CC = E - N + 2P, where:
    E = number of edges
    N = number of nodes
    P = number of connected components (usually 1)

Purpose:
- Flags overly complex functions.
- Well understood by developers.

-------------------------------------------------------------------------------
2. Token-based Clone Detection (Rabin-Karp Rolling Hash)

- Detects repeated token sequences across files.
- Uses a rolling hash window (e.g., 20–50 tokens).
- Language-agnostic; extremely fast.

Purpose:
- Quickly identifies copy/paste blocks and boilerplate.

-------------------------------------------------------------------------------
3. Lines of Code (LOC)

- Counts physical, logical, comment, and blank lines.
- Token-based classification for accuracy.
- Supports ranking by any metric (logical, physical, comment, blank).
- Can aggregate and rank by directory.
- Required for future Maintainability Index calculations.

Purpose:
- Baseline size metric.
- Identify the largest files and directories.
- Track codebase growth and comment density.

-------------------------------------------------------------------------------
4. Halstead Complexity Metrics

- Based on counts of operators and operands.
- Computes vocabulary, volume, difficulty, and effort.

Purpose:
- Complements Cyclomatic Complexity with lexical complexity.
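-------------------------------------------------------------------------------
Sketch: Cyclomatic Complexity

A minimal Rust sketch of the two ways to obtain CC described above: from CFG
edge/node counts via CC = E - N + 2P, and directly from the token stream (for
structured code, CC equals one plus the number of decision points). Function
names and the decision-token set are illustrative, not the complexity crate's
actual API.

  // CC = E - N + 2P, computed from a CFG's edge, node, and component counts.
  fn cyclomatic_from_cfg(edges: usize, nodes: usize, components: usize) -> isize {
      edges as isize - nodes as isize + 2 * components as isize
  }

  // Token-level shortcut: for structured code, CC equals one plus the number
  // of decision points, so a full CFG is not required for a first estimate.
  fn cyclomatic_from_tokens(tokens: &[&str]) -> usize {
      // Tokens treated as decision points; a real table would be configurable.
      const DECISIONS: &[&str] =
          &["if", "elif", "for", "while", "case", "when", "&&", "||", "?"];
      1 + tokens.iter().filter(|t| DECISIONS.contains(t)).count()
  }

  fn main() {
      // A CFG with 9 edges, 7 nodes, 1 connected component: CC = 9 - 7 + 2 = 4.
      assert_eq!(cyclomatic_from_cfg(9, 7, 1), 4);

      // Tokens for `if a { } else if b { } while c { }`: 3 decisions, CC = 4.
      let tokens = ["if", "a", "{", "}", "else", "if", "b", "{", "}",
                    "while", "c", "{", "}"];
      assert_eq!(cyclomatic_from_tokens(&tokens), 4);
      println!("ok");
  }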
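-------------------------------------------------------------------------------
Sketch: Rabin-Karp Clone Fingerprinting

A minimal Rust sketch of rolling-hash fingerprinting over a token stream,
assuming tokens arrive as strings from the tokenizer. The window size, hash
base, and function names are illustrative; a real pass would also verify
colliding windows token-by-token (fingerprints can collide), merge overlapping
windows into clone regions, and map token offsets back to file/line locations.

  use std::collections::HashMap;
  use std::collections::hash_map::DefaultHasher;
  use std::hash::{Hash, Hasher};

  const WINDOW: usize = 20;     // tokens per fingerprint window
  const BASE: u64 = 1_000_003;  // polynomial base; arithmetic wraps mod 2^64

  fn token_hash(tok: &str) -> u64 {
      let mut h = DefaultHasher::new();
      tok.hash(&mut h);
      h.finish()
  }

  // Returns pairs of window start offsets whose fingerprints collide.
  fn find_clone_candidates(tokens: &[&str]) -> Vec<(usize, usize)> {
      if tokens.len() < WINDOW {
          return Vec::new();
      }
      // BASE^(WINDOW-1), used to remove the outgoing token when sliding.
      let top = (0..WINDOW - 1).fold(1u64, |acc, _| acc.wrapping_mul(BASE));

      // Hash of the first window.
      let mut hash = 0u64;
      for tok in &tokens[..WINDOW] {
          hash = hash.wrapping_mul(BASE).wrapping_add(token_hash(tok));
      }

      let mut seen: HashMap<u64, usize> = HashMap::new();
      let mut matches = Vec::new();
      seen.insert(hash, 0);

      for start in 1..=tokens.len() - WINDOW {
          let outgoing = token_hash(tokens[start - 1]);
          let incoming = token_hash(tokens[start + WINDOW - 1]);
          // Slide the window: drop the outgoing token, append the incoming one.
          hash = hash
              .wrapping_sub(outgoing.wrapping_mul(top))
              .wrapping_mul(BASE)
              .wrapping_add(incoming);
          match seen.get(&hash) {
              Some(&earlier) => matches.push((earlier, start)),
              None => {
                  seen.insert(hash, start);
              }
          }
      }
      matches
  }

  fn main() {
      // Synthetic stream: the same 25-token run appears twice, noise between.
      let run: Vec<String> = (0..25).map(|i| format!("tok{i}")).collect();
      let noise: Vec<String> = (0..7).map(|i| format!("noise{i}")).collect();
      let stream: Vec<&str> = run
          .iter()
          .chain(noise.iter())
          .chain(run.iter())
          .map(|s| s.as_str())
          .collect();
      for (a, b) in find_clone_candidates(&stream) {
          println!("windows at token {a} and token {b} share a fingerprint");
      }
  }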
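-------------------------------------------------------------------------------
Sketch: LOC Classification

A minimal Rust sketch of physical/blank/comment line counting. The comment
markers are hard-coded here purely for illustration; the real classifier is
token-based and per-language, and logical LOC needs statement-level
information from the tokenizer, so it is omitted.

  #[derive(Debug, Default)]
  struct LocCounts {
      physical: usize,
      blank: usize,
      comment: usize,
      code: usize,
  }

  fn count_loc(source: &str) -> LocCounts {
      let mut counts = LocCounts::default();
      for line in source.lines() {
          counts.physical += 1;
          let trimmed = line.trim();
          if trimmed.is_empty() {
              counts.blank += 1;
          } else if trimmed.starts_with("//") || trimmed.starts_with('#') {
              // Line-comment markers; illustrative, not a complete rule set.
              counts.comment += 1;
          } else {
              counts.code += 1;
          }
      }
      counts
  }

  fn main() {
      let src = "// add two numbers\nfn add(a: i32, b: i32) -> i32 {\n\n    a + b\n}\n";
      println!("{:?}", count_loc(src));
  }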
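-------------------------------------------------------------------------------
Sketch: Halstead Metrics

A minimal Rust sketch of the standard Halstead formulas, assuming the
tokenizer has already produced distinct and total operator/operand counts
(n1, n2, N1, N2). Names are illustrative.

  struct Halstead {
      vocabulary: f64,
      length: f64,
      volume: f64,
      difficulty: f64,
      effort: f64,
  }

  fn halstead(n1: usize, n2: usize, big_n1: usize, big_n2: usize) -> Halstead {
      let vocabulary = (n1 + n2) as f64;        // n = n1 + n2
      let length = (big_n1 + big_n2) as f64;    // N = N1 + N2
      let volume = length * vocabulary.log2();  // V = N * log2(n)
      let difficulty =                          // D = (n1 / 2) * (N2 / n2)
          (n1 as f64 / 2.0) * (big_n2 as f64 / n2 as f64);
      let effort = difficulty * volume;         // E = D * V
      Halstead { vocabulary, length, volume, difficulty, effort }
  }

  fn main() {
      // Example counts: 10 distinct operators, 7 distinct operands,
      // 27 total operator occurrences, 15 total operand occurrences.
      let h = halstead(10, 7, 27, 15);
      println!("n={} N={} V={:.1} D={:.1} E={:.1}",
               h.vocabulary, h.length, h.volume, h.difficulty, h.effort);
  }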
===============================================================================
FUTURE
-------------------------------------------------------------------------------
See todo.txt

===============================================================================
CRATES
-------------------------------------------------------------------------------
core
  tokenizer
    - minimal language-agnostic tokenizer
    - operator/operand extractor (Halstead)
    - whitespace, comment, and string handling

  complexity
    cyclomatic
      - CFG builder
      - node/edge counter
    halstead
      - operator + operand tables
    loc
      - physical and logical LOC

  cloner
    - rolling hash (Rabin-Karp)
    - w-shingling and fingerprint index
    - minimum token length filter

  reporter
    - JSON output
    - plaintext/ANSI output
    - sorting, filtering, severity thresholds

  loader
    - config loading
    - file walker + globbing
    - ignore rules

-------------------------------------------------------------------------------
cli
  - commands: analyze, clones, complexity, loc, dump-config
      analyze:     full analysis (complexity + clones + LOC)
      clones:      detect code clones only
      complexity:  analyze cyclomatic complexity and LOC
      loc:         focused LOC analysis with ranking (by file or directory)
      dump-config: display or save the current configuration
  - flags: --json, --threshold, --min-tokens, --rank-by, --rank-dirs, --path

===============================================================================
OUTPUT FORMATS
-------------------------------------------------------------------------------
Plaintext:

  FILE: src/server/routes.rs
    Cyclomatic: 14 (warning)
    Halstead Volume: 312.9
    LOC: 128

  FILE: src/server/auth.rs
    Cyclomatic: 5
    Halstead Volume: 102.3
    LOC: 41

  CLONES:
    - HashMatch #7
        src/server/routes.rs:41-78
        src/server/router.rs:12-49
        Length: 32 tokens

-------------------------------------------------------------------------------
JSON:

  {
    "files": [
      {
        "path": "src/server/routes.rs",
        "cyclomatic": 14,
        "halstead": { "volume": 312.9, "difficulty": 12.8 },
        "loc": 128
      }
    ],
    "clones": [
      {
        "id": 7,
        "length": 32,
        "locations": [
          { "file": "src/server/routes.rs", "start": 41, "end": 78 },
          { "file": "src/server/router.rs", "start": 12, "end": 49 }
        ]
      }
    ]
  }

A sketch of Rust report types matching this JSON schema appears at the end of
this document.

===============================================================================
DESIGN PRINCIPLES

1. Zero language assumptions in the MVP.
2. Pluggable architecture (token → AST → CFG → semantic).
3. High performance by default.
4. No global state; everything streamed and incremental.
5. Developer-friendly reporting and clear severity levels.
6. Configurable thresholds.

===============================================================================
REFERENCES

Cyclomatic complexity (McCabe, 1976):
  https://www.literateprogramming.com/mccabe.pdf
  https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication500-235.pdf

AST-based clone detection:
  https://www.cs.ucdavis.edu/~su/publications/icse07.pdf

LSH-based clone detection:
  https://journals.riverpublishers.com/index.php/JCSANDM/article/download/18799/18101/68821
  https://www.sei.cmu.edu/library/code-similarity-detection-using-syntax-agnostic-locality-sensitive-hashing/

Halstead metrics:
  https://www.researchgate.net/publication/317317114_Software_Complexity_Analysis_Using_Halstead_Metrics
  https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.1990.pdf
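-------------------------------------------------------------------------------
Sketch: JSON Report Types

A minimal Rust sketch of report types matching the JSON example in OUTPUT
FORMATS, assuming serde (with the derive feature) and serde_json as
dependencies of the reporter crate. Type and field names mirror the JSON keys
but are illustrative, not the crate's actual API.

  use serde::Serialize;

  #[derive(Serialize)]
  struct Report {
      files: Vec<FileReport>,
      clones: Vec<CloneGroup>,
  }

  #[derive(Serialize)]
  struct FileReport {
      path: String,
      cyclomatic: u32,
      halstead: HalsteadSummary,
      loc: u32,
  }

  #[derive(Serialize)]
  struct HalsteadSummary {
      volume: f64,
      difficulty: f64,
  }

  #[derive(Serialize)]
  struct CloneGroup {
      id: u32,
      length: u32,               // clone length in tokens
      locations: Vec<CloneLocation>,
  }

  #[derive(Serialize)]
  struct CloneLocation {
      file: String,
      start: u32,                // first line of the cloned region
      end: u32,                  // last line of the cloned region
  }

  fn main() -> serde_json::Result<()> {
      // Data taken from the JSON example above.
      let report = Report {
          files: vec![FileReport {
              path: "src/server/routes.rs".into(),
              cyclomatic: 14,
              halstead: HalsteadSummary { volume: 312.9, difficulty: 12.8 },
              loc: 128,
          }],
          clones: vec![CloneGroup {
              id: 7,
              length: 32,
              locations: vec![
                  CloneLocation { file: "src/server/routes.rs".into(), start: 41, end: 78 },
                  CloneLocation { file: "src/server/router.rs".into(), start: 12, end: 49 },
              ],
          }],
      };
      println!("{}", serde_json::to_string_pretty(&report)?);
      Ok(())
  }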