fork of https://github.com/tree-sitter/tree-sitter-graph
at main 21 kB view raw
1// -*- coding: utf-8 -*- 2// ------------------------------------------------------------------------------------------------ 3// Copyright © 2021, tree-sitter authors. 4// Licensed under either of Apache License, Version 2.0, or MIT license, at your option. 5// Please see the LICENSE-APACHE or LICENSE-MIT files in this distribution for license details. 6// ------------------------------------------------------------------------------------------------ 7 8//! This section defines the graph DSL implemented by this library. 9//! 10//! [tree-sitter]: https://docs.rs/tree-sitter/ 11//! [tree-sitter-python]: https://docs.rs/tree-sitter-python/ 12//! 13//! # Overview 14//! 15//! You can use [tree-sitter][] to parse the content of source code into a _concrete syntax tree_. 16//! For instance, using the [tree-sitter-python][] grammar, you can parse Python source code: 17//! 18//! ``` python 19//! # test.py 20//! from one.two import d, e.c 21//! import three 22//! print(d, e.c) 23//! print three.f 24//! ``` 25//! 26//! ``` text 27//! $ tree-sitter parse test.py 28//! (module [0, 0] - [4, 0] 29//! (import_from_statement [0, 0] - [0, 26] 30//! module_name: (dotted_name [0, 5] - [0, 12] 31//! (identifier [0, 5] - [0, 8]) 32//! (identifier [0, 9] - [0, 12])) 33//! name: (dotted_name [0, 20] - [0, 21] 34//! (identifier [0, 20] - [0, 21])) 35//! name: (dotted_name [0, 23] - [0, 26] 36//! (identifier [0, 23] - [0, 24]) 37//! (identifier [0, 25] - [0, 26]))) 38//! (import_statement [1, 0] - [1, 12] 39//! name: (dotted_name [1, 7] - [1, 12] 40//! (identifier [1, 7] - [1, 12]))) 41//! (expression_statement [2, 0] - [2, 13] 42//! (call [2, 0] - [2, 13] 43//! function: (identifier [2, 0] - [2, 5]) 44//! arguments: (argument_list [2, 5] - [2, 13] 45//! (identifier [2, 6] - [2, 7]) 46//! (attribute [2, 9] - [2, 12] 47//! object: (identifier [2, 9] - [2, 10]) 48//! attribute: (identifier [2, 11] - [2, 12]))))) 49//! (print_statement [3, 0] - [3, 13] 50//! argument: (attribute [3, 6] - [3, 13] 51//! object: (identifier [3, 6] - [3, 11]) 52//! attribute: (identifier [3, 12] - [3, 13])))) 53//! ``` 54//! 55//! There are many interesting use cases where you want to use this parsed syntax tree to create 56//! some _other_ graph-like structure. This library lets you do that, using a declarative DSL to 57//! identify patterns in the parsed syntax tree, along with rules for which graph nodes and edges 58//! to create for the syntax nodes that match those patterns. 59//! 60//! # Terminology 61//! 62//! One confusing aspect of the graph DSL is that we have to talk about two different kinds of 63//! "node": the nodes of the concrete syntax tree that is our input when executing a graph DSL 64//! file, and the nodes of the graph that we produce as an output. We will be careful to always 65//! use "syntax node" and "graph node" in this documentation to make it clear which kind of node we 66//! mean. 67//! 68//! # High-level structure 69//! 70//! A graph DSL file consists of one or more **_stanzas_**. Each stanza starts with a tree-sitter 71//! [**_query pattern_**][query], which identifies a portion of the concrete syntax tree. The 72//! query pattern should use [**_captures_**][captures] to identify particular syntax nodes in the 73//! matched portion of the tree. Following the query is a **_block_**, which is a sequence of 74//! statements that construct **_graph nodes_**, link them with **_edges_**, and annotate both with 75//! **_attributes_**. 76//! 77//! [query]: https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries 78//! [captures]: https://tree-sitter.github.io/tree-sitter/using-parsers#capturing-nodes 79//! 80//! Query patterns can be suffixed by quantification operators. If a pattern with a quantification 81//! suffix is captured, the suffix determines the capture value. In the case of `?`, the capture value 82//! is a null value, or a syntax node. In the case of `+` and `*`, the value is a list of syntax nodes. 83//! 84//! [quantification]: https://tree-sitter.github.io/tree-sitter/using-parsers#quantification-operators 85//! 86//! Comments start with a semicolon, and extend to the end of the line. 87//! 88//! Identifiers start with either an ASCII letter or underscore, and all remaining characters are 89//! ASCII letters, numbers, underscores, or hyphens. (More precisely, they satisfy the regular 90//! expression `/[a-zA-Z_][a-zA-Z0-9_-]*/`.) Identifiers are used as the names of 91//! [attributes](#attributes), [functions](#functions), and [variables](#variables). 92//! 93//! To execute a graph DSL file against a concrete syntax tree, we execute each stanza in the graph 94//! DSL file exhaustively. For each stanza, we identify each place where the concrete syntax tree 95//! matches the query pattern. For each of these places, we end up with a different set of syntax 96//! nodes assigned to the query pattern's captures. We execute the block of statements for each of 97//! these capture assignments, creating any graph nodes, edges, or attributes mentioned in the 98//! block. 99//! 100//! Regular execution will apply the stanzas _in order_, and it is important to make sure that scoped variables 101//! have been assigned before they are used. This is not a requirement when using the lazy evaluation strategy, 102//! which handles this implicitly. The lazy evaluation strategy is also more efficient when there are many stanzas, 103//! because it can reduce tree traversals. Therefore, using the lazy evaluation strategy is recommended, and will 104//! likely become the only supported strategy in future releases. 105//! 106//! For instance, the following stanza would match all of the identifiers in our example syntax 107//! tree: 108//! 109//! ``` tsg 110//! (identifier) @id 111//! { 112//! ; Some statements that will be executed for each identifier in the 113//! ; source file. You can use @id to refer to the matched identifier 114//! ; syntax node. 115//! } 116//! ``` 117//! 118//! # Expressions 119//! 120//! The value of an expression in the graph DSL can be any of the following: 121//! 122//! - null 123//! - a boolean 124//! - a string 125//! - an integer (unsigned, 32 bits) 126//! - a reference to a syntax node 127//! - a reference to a graph node 128//! - an ordered list of values 129//! - a list comprehension 130//! - an unordered set of values 131//! - a set comprehension 132//! 133//! The null value is spelled `#null`. 134//! 135//! The boolean literals are spelled `#true` and `#false`. 136//! 137//! String constants are enclosed in double quotes. They can contain backslash escapes `\\`, `\0`, 138//! `\n`, `\r`, and `\t`: 139//! 140//! - `"a string"` 141//! - `"a string with\na newline"` 142//! - `"a string with\\a backslash"` 143//! 144//! Integer constants are encoded in ASCII decimal: 145//! 146//! - `0` 147//! - `10` 148//! - `42` 149//! 150//! Lists consist of zero or more expressions, separated by commas, enclosed in square brackets. 151//! The elements of a list do not have to have the same type: 152//! 153//! ``` tsg 154//! [0, "string", 0, #true, @id] 155//! ``` 156//! 157//! Sets have the same format, but are enclosed in curly braces instead of square brackets: 158//! 159//! ``` tsg 160//! {0, "string", #true, @id} 161//! ``` 162//! 163//! Both lists and sets both allow trailing commas after the final element, if you prefer that 164//! style to play more nicely with source control diffs: 165//! 166//! ``` tsg 167//! [ 168//! 0, 169//! "string", 170//! #true, 171//! @id, 172//! ] 173//! ``` 174//! 175//! List comprehensions allow mapping over a list and producing a new list with elements based on the 176//! given element expression: 177//! 178//! ``` tsg 179//! [ (some-function x) for x in @xs ] 180//! [ @x.scoped_var for x in @xs ] 181//! ``` 182//! 183//! Set comprehensions have similar syntax, but the resulting value will be a set instead of an ordered list: 184//! 185//! ``` tsg 186//! { (some-function x) for x in @xs } 187//! { @x.scoped_var for x in @xs } 188//! ``` 189//! 190//! List and set comprehensions are subject to the same restrictions as for loops, which means the list 191//! value that is iterated over must be local. It is therefore not possible to iterator over the value 192//! of a scoped variable. Using scoped variables in the element expression however is no problem. 193//! 194//! # Syntax nodes 195//! 196//! Syntax nodes are identified by tree-sitter query captures (`@name`). For instance, in our 197//! example stanza, whose query is `(identifier) @id`, `@id` would refer to the `identifier` syntax 198//! node that the stanza matched against. 199//! 200//! Unused query captures are considered errors, unless they start with an underscode. For example, 201//! a capture `@id` must be used within the stanza, but `@_id` does not. 202//! 203//! # Variables 204//! 205//! You can use variables to pass information between different stanzas and statements in a graph 206//! DSL file. There are three kinds of variables: 207//! 208//! - **_Global_** variables are provided externally by whatever process is executing the graph 209//! DSL file. You can use this, for instance, to pass in the path of the source file being 210//! analyzed, to use as part of the graph structure that you create. 211//! 212//! - **_Local_** variables are only visible within the current execution of the current stanza. 213//! Once all of the statements in the stanza have been executed, all local variables disappear. 214//! 215//! - **_Scoped_** variables are "attached to" syntax nodes. Their values carry over from stanza 216//! to stanza. Scoped variables are referenced by using a syntax node expression (typically a 217//! query capture) and a variable name, separated by a period: `@node.variable`. 218//! 219//! Global variables are declared using a `global` declaration. The external process that executes the 220//! graph DSL file must provide values for all declared global variables. (It is not possible to define 221//! a global variable and give it a value from within the graph DSL.) The name of the global variable can 222//! be suffixed by a quantifier: '*' and '+' for lists, and '?' for optional values, which allows them to 223//! be used in iteration and conditional statements, respectively. 224//! 225//! Local and scoped variables are created using `var` or `let` statements. A `let` statement 226//! creates an **_immutable variable_**, whose value cannot be changed. A `var` statement creates 227//! a **_mutable variable_**. You use a `set` statement to change the value of a mutable variable. 228//! Local variables are not allowed to have the same name as a declared global variable. 229//! 230//! Local variables are block scoped. For example, a local variable defined in a `scan` arm is not 231//! visible in other scan arms, or after the `scan` statement. If you need to persist a value for use 232//! after a block, introduce a mutable variable before the block and assign to it inside the block. 233//! 234//! ``` tsg 235//! global global_variable 236//! 237//! (identifier) @id 238//! { 239//! let local_variable = "a string" 240//! ; The following would be an error, since `let` variables are immutable: 241//! ; set local_variable = "a new value" 242//! 243//! ; The following is also an error, since you can't mutate a variable that 244//! ; doesn't exist: 245//! ; set missing_variable = 42 246//! 247//! ; The following is an error, since you cannot hide a global variable with 248//! ; a local one: 249//! ; let global_variable = "a new value" 250//! 251//! var mutable_variable = "first value" 252//! set mutable_variable = global_variable ; we can refer to the global variable 253//! 254//! var @id.kind = "id" 255//! } 256//! ``` 257//! 258//! Variables can be referenced anywhere that you can provide an expression. It's an error if you 259//! try to reference a variable that hasn't been defined. 260//! 261//! # Functions 262//! 263//! The process executing a graph DSL file can provide **_functions_** that can be called from 264//! within graph DSL stanzas. 265//! 266//! Function calls use a Lisp-like syntax, where the name of the function being called is _inside_ 267//! of the parentheses. The parameters to a function call are arbitrary expressions. For 268//! instance, if the executing process provides a function named `+`, you could call it as: 269//! 270//! ``` tsg 271//! (identifier) @id 272//! { 273//! let x = 4 274//! let @id.nine = (+ x 5) 275//! } 276//! ``` 277//! 278//! Note that it's the process executing the graph DSL file that decides which functions are 279//! available. We do define a [standard library][], and most of the time those are the functions 280//! that are available, but you should double-check the documentation of whatever graph DSL tool 281//! you're using to make sure. 282//! 283//! [standard library]: functions/index.html 284//! 285//! # Graph nodes 286//! 287//! You can use this graph DSL to create any graph structure that you want. There are no 288//! limitations on which graph nodes you create, nor on how you use edges to connect the graph 289//! nodes. You are not limited to creating a tree, and in particular, you are not limited to 290//! creating a tree that "lines" up with the parsed syntax tree. 291//! 292//! There are two ways to create a new graph node. The first, and most common, is to use a `node` 293//! statement: 294//! 295//! ``` tsg 296//! (identifier) @id 297//! { 298//! node @id.node 299//! } 300//! ``` 301//! 302//! This creates a graph node, and assigns a reference to the new node to a new immutable variable 303//! named `new_node`. 304//! 305//! The second way to create a graph node is to call the [`node`][] function: 306//! 307//! [`node`]: functions/index.html#node 308//! 309//! ``` tsg 310//! (identifier) @id 311//! { 312//! let @id.node = (node) 313//! } 314//! ``` 315//! 316//! Note that these two examples do exactly the same thing! In fact, the `node` statement in the 317//! first example is just syntactic sugar for the `let` statement in the second example. Since the 318//! most common pattern is to create a graph node and immediately assign it to an immutable scoped 319//! variable, we provide the `node` statement as a convenient shorthand. If you need to do 320//! anything more complex, such as assigning the graph node reference to a _mutable_ variable, you 321//! can call the [`node`][] function directly. 322//! 323//! By attaching a graph node to a syntax node using a [scoped variable](#variables), you can refer 324//! to them from multiple stanzas: 325//! 326//! ``` tsg 327//! (identifier) @id 328//! { 329//! node @id.node 330//! } 331//! 332//! (dotted_name (identifier) @dotted_element) 333//! { 334//! ; We will learn more about the attr statement below 335//! attr (@dotted_element.node) kind = "dotted" 336//! } 337//! ``` 338//! 339//! In this example, we will create a graph node for _every_ `identifier` syntax node. Each of 340//! those syntax nodes will have a `node` scoped variable, containing a reference to its graph 341//! node. But only the graph nodes of those identifiers that appear inside of a `dotted_name` will 342//! have a `kind` attribute. And even though we used different capture names, inside of different 343//! queries, to find those `identifier` nodes, the graph node references in both stanzas refer to 344//! the same graph nodes. 345//! 346//! # Edges 347//! 348//! Edges are created via an `edge` statement, which specifies the two graph nodes that should be 349//! connected. Edges are directed, and the `->` arrow in the `edge` statement indicates the 350//! direction of the edge. 351//! 352//! ``` tsg 353//! (import_statement name: (_) @name) 354//! { 355//! node @name.source 356//! node @name.sink 357//! edge @name.source -> @name.sink 358//! } 359//! ``` 360//! 361//! There can be at most one edge connecting any particular source and sink graph node in the 362//! graph. If multiple stanzas create edges between the same graph nodes, those are "collapsed" 363//! into a single edge. 364//! 365//! # Attributes 366//! 367//! Graph nodes and edges have an associated set of **_attributes_**. Each attribute has a name 368//! (which is an identifier), and a value. 369//! 370//! You add attributes to a graph node or edge using an `attr` statement: 371//! 372//! ``` tsg 373//! (import_statement name: (_) @name) 374//! { 375//! node @name.source 376//! node @name.sink 377//! attr (@name.sink) kind = "module" 378//! attr (@name.source -> @name.sink) precedence = 10 379//! } 380//! ``` 381//! 382//! Note that you have to have already created the graph node or edge, and the graph node or edge 383//! must not already have an attribute with the same name. 384//! 385//! (Attributes might seem similar to scoped variables, but they are quite different. Attributes 386//! are attached to graph nodes and edges, while scoped variables are attached to syntax nodes. 387//! More importantly, scoped variables only exist while executing the graph DSL file. Once the 388//! execution has completed, the variables disappear. Attributes, on the other hand, are part of 389//! the output produced by the graph DSL file, and live on after execution has finished.) 390//! 391//! ## Attribute shorthands 392//! 393//! Commonly used combinations of attributes can be captured in **_shorthands_**. Each shorthand defines 394//! the attribute name, a variable which captures the attribute value, and a list of attributes to which 395//! it expands. 396//! 397//! Attribute shorthands are defined at the same level as stanzas. For example, the following shorthand 398//! takes a syntax node as argument, and expands to attributes for its source text and child index: 399//! 400//! ``` tsg 401//! attribute node_props = node => node_text = (source-text node), node_index = (named-child-index node) 402//! 403//! (argument (_)@expr) { 404//! attr (@expr) node_props = @expr 405//! } 406//! ``` 407//! 408//! # Regular expressions 409//! 410//! You can use a `scan` statement to match the content of a string value against a set of regular 411//! expressions. 412//! 413//! Starting at the beginning of the string, we determine the first character where one of the 414//! regular expressions matches. If more than one regular expression matches at this earliest 415//! character, we use the regular expression that appears first in the `scan` statement. We then 416//! execute the statements in the "winning" regular expression's block. 417//! 418//! After executing the matching block, we try to match all of the regular expressions again, 419//! starting _after_ the text that was just matched. We continue this process, applying the 420//! earliest matching regular expression in each iteration, until we have exhausted the entire 421//! string, or none of the regular expressions match. 422//! 423//! Within each regular expression's block, you can use `$0`, `$1`, etc., to refer to any capture 424//! groups in the regular expression. 425//! 426//! The value being scanned must be local, which means it cannot be derived from scoped variables. 427//! 428//! For example, if `filepath` is a global variable containing the path of a Python source file, 429//! you could use the following `scan` statement to construct graph nodes for the name of the 430//! module defined in the file: 431//! 432//! ``` tsg 433//! global filepath 434//! 435//! (module) @mod 436//! { 437//! var new_node = #null 438//! var current_node = (node) 439//! 440//! scan filepath { 441//! "([^/]+)/" 442//! { 443//! ; This arm will match any directory component of the file path. In 444//! ; Python, this gives you the sequence of packages that the module 445//! ; lives in. 446//! 447//! ; Note that we keep appending additional tag names to the `current` 448//! ; graph node to create an arbitrary number of graph nodes linked to 449//! ; the @mod syntax node. 450//! set new_node = (node) 451//! attr (new_node) name = $1 452//! edge current_node -> new_node 453//! set current_node = new_node 454//! } 455//! 456//! "__init__\\.py$" 457//! { 458//! ; This arm will match a trailing __init__.py, indicating that the 459//! ; module's name comes from the last directory component. 460//! 461//! ; Expose the graph node that we created for that component as a 462//! ; scoped variable that later stanzas can see. 463//! let @mod.root = current_node 464//! } 465//! 466//! "([^/]+)\\.py$" 467//! { 468//! ; This arm will match any other trailing module name. Note that 469//! ; __init__.py also matches this regular expression, but since it 470//! ; appears later, the __init__.py clause will take precedence. 471//! 472//! set new_node = (node) 473//! attr (new_node) name = $1 474//! edge current_node -> new_node 475//! let @mod.root = new_node 476//! } 477//! } 478//! } 479//! ``` 480//! 481//! # Conditionals 482//! 483//! You can use `if` statements to make blocks of statements conditional on optional values. 484//! Conditions are comma-separated lists of clauses. The clause `some EXPRESSION` indicates 485//! that the optional value must be present. The clause `none EXPRESSION` indicates that the 486//! optional value is absent. A bare expression is evaluated as to boolean. All values in 487//! conditions must be local, which means they cannot be derived from scoped variables. 488//! 489//! ``` tsg 490//! (lexical_declaration type:(_)? @type value:(_)? @value) 491//! { 492//! if some @type, none @value { 493//! ; ... 494//! } elif some @value { 495//! ; ... 496//! } else { 497//! ; ... 498//! } 499//! } 500//! ``` 501//! 502//! # List iteration 503//! 504//! You can use a `for` statement to execute blocks of statements for every element in list 505//! values. The list value must be local, which means it cannot be derived from scoped variables. 506//! 507//! ```tsg 508//! (module (_)* @stmts) 509//! { 510//! for stmt in @stmts { 511//! print stmt 512//! } 513//! } 514//! ``` 515//! 516//! # Debugging 517//! 518//! To support members of the Ancient and Harmonious Order of Printf Debuggers, you can use `print` 519//! statements to print out (to `stderr`) the content of any expressions during the execution of a graph DSL 520//! file: 521//! 522//! ``` tsg 523//! (identifier) @id 524//! { 525//! let x = 4 526//! print "Hi! x = ", x 527//! } 528//! ``` 529 530pub mod functions;