src/reference/mod.rs at main · oppi.li/tree-sitter-graph

fork of https://github.com/tree-sitter/tree-sitter-graph
tree-sitter-graph / src / reference / mod.rs
at main 21 kB view raw
  1// -*- coding: utf-8 -*-
  2// ------------------------------------------------------------------------------------------------
  3// Copyright © 2021, tree-sitter authors.
  4// Licensed under either of Apache License, Version 2.0, or MIT license, at your option.
  5// Please see the LICENSE-APACHE or LICENSE-MIT files in this distribution for license details.
  6// ------------------------------------------------------------------------------------------------
  7
  8//! This section defines the graph DSL implemented by this library.
  9//!
 10//! [tree-sitter]: https://docs.rs/tree-sitter/
 11//! [tree-sitter-python]: https://docs.rs/tree-sitter-python/
 12//!
 13//! # Overview
 14//!
 15//! You can use [tree-sitter][] to parse the content of source code into a _concrete syntax tree_.
 16//! For instance, using the [tree-sitter-python][] grammar, you can parse Python source code:
 17//!
 18//! ``` python
 19//! # test.py
 20//! from one.two import d, e.c
 21//! import three
 22//! print(d, e.c)
 23//! print three.f
 24//! ```
 25//!
 26//! ``` text
 27//! $ tree-sitter parse test.py
 28//! (module [0, 0] - [4, 0]
 29//!   (import_from_statement [0, 0] - [0, 26]
 30//!     module_name: (dotted_name [0, 5] - [0, 12]
 31//!       (identifier [0, 5] - [0, 8])
 32//!       (identifier [0, 9] - [0, 12]))
 33//!     name: (dotted_name [0, 20] - [0, 21]
 34//!       (identifier [0, 20] - [0, 21]))
 35//!     name: (dotted_name [0, 23] - [0, 26]
 36//!       (identifier [0, 23] - [0, 24])
 37//!       (identifier [0, 25] - [0, 26])))
 38//!   (import_statement [1, 0] - [1, 12]
 39//!     name: (dotted_name [1, 7] - [1, 12]
 40//!       (identifier [1, 7] - [1, 12])))
 41//!   (expression_statement [2, 0] - [2, 13]
 42//!     (call [2, 0] - [2, 13]
 43//!       function: (identifier [2, 0] - [2, 5])
 44//!       arguments: (argument_list [2, 5] - [2, 13]
 45//!         (identifier [2, 6] - [2, 7])
 46//!         (attribute [2, 9] - [2, 12]
 47//!           object: (identifier [2, 9] - [2, 10])
 48//!           attribute: (identifier [2, 11] - [2, 12])))))
 49//!   (print_statement [3, 0] - [3, 13]
 50//!     argument: (attribute [3, 6] - [3, 13]
 51//!       object: (identifier [3, 6] - [3, 11])
 52//!       attribute: (identifier [3, 12] - [3, 13]))))
 53//! ```
 54//!
 55//! There are many interesting use cases where you want to use this parsed syntax tree to create
 56//! some _other_ graph-like structure.  This library lets you do that, using a declarative DSL to
 57//! identify patterns in the parsed syntax tree, along with rules for which graph nodes and edges
 58//! to create for the syntax nodes that match those patterns.
 59//!
 60//! # Terminology
 61//!
 62//! One confusing aspect of the graph DSL is that we have to talk about two different kinds of
 63//! "node": the nodes of the concrete syntax tree that is our input when executing a graph DSL
 64//! file, and the nodes of the graph that we produce as an output.  We will be careful to always
 65//! use "syntax node" and "graph node" in this documentation to make it clear which kind of node we
 66//! mean.
 67//!
 68//! # High-level structure
 69//!
 70//! A graph DSL file consists of one or more **_stanzas_**.  Each stanza starts with a tree-sitter
 71//! [**_query pattern_**][query], which identifies a portion of the concrete syntax tree.  The
 72//! query pattern should use [**_captures_**][captures] to identify particular syntax nodes in the
 73//! matched portion of the tree.  Following the query is a **_block_**, which is a sequence of
 74//! statements that construct **_graph nodes_**, link them with **_edges_**, and annotate both with
 75//! **_attributes_**.
 76//!
 77//! [query]: https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries
 78//! [captures]: https://tree-sitter.github.io/tree-sitter/using-parsers#capturing-nodes
 79//!
 80//! Query patterns can be suffixed by quantification operators. If a pattern with a quantification
 81//! suffix is captured, the suffix determines the capture value. In the case of `?`, the capture value
 82//! is a null value, or a syntax node. In the case of `+` and `*`, the value is a list of syntax nodes.
 83//!
 84//! [quantification]: https://tree-sitter.github.io/tree-sitter/using-parsers#quantification-operators
 85//!
 86//! Comments start with a semicolon, and extend to the end of the line.
 87//!
 88//! Identifiers start with either an ASCII letter or underscore, and all remaining characters are
 89//! ASCII letters, numbers, underscores, or hyphens.  (More precisely, they satisfy the regular
 90//! expression `/[a-zA-Z_][a-zA-Z0-9_-]*/`.)  Identifiers are used as the names of
 91//! [attributes](#attributes), [functions](#functions), and [variables](#variables).
 92//!
 93//! To execute a graph DSL file against a concrete syntax tree, we execute each stanza in the graph
 94//! DSL file exhaustively.  For each stanza, we identify each place where the concrete syntax tree
 95//! matches the query pattern.  For each of these places, we end up with a different set of syntax
 96//! nodes assigned to the query pattern's captures.  We execute the block of statements for each of
 97//! these capture assignments, creating any graph nodes, edges, or attributes mentioned in the
 98//! block.
 99//!
100//! Regular execution will apply the stanzas _in order_, and it is important to make sure that scoped variables
101//! have been assigned before they are used.  This is not a requirement when using the lazy evaluation strategy,
102//! which handles this implicitly.  The lazy evaluation strategy is also more efficient when there are many stanzas,
103//! because it can reduce tree traversals.  Therefore, using the lazy evaluation strategy is recommended, and will
104//! likely become the only supported strategy in future releases.
105//!
106//! For instance, the following stanza would match all of the identifiers in our example syntax
107//! tree:
108//!
109//! ``` tsg
110//! (identifier) @id
111//! {
112//!   ; Some statements that will be executed for each identifier in the
113//!   ; source file.  You can use @id to refer to the matched identifier
114//!   ; syntax node.
115//! }
116//! ```
117//!
118//! # Expressions
119//!
120//! The value of an expression in the graph DSL can be any of the following:
121//!
122//!   - null
123//!   - a boolean
124//!   - a string
125//!   - an integer (unsigned, 32 bits)
126//!   - a reference to a syntax node
127//!   - a reference to a graph node
128//!   - an ordered list of values
129//!   - a list comprehension
130//!   - an unordered set of values
131//!   - a set comprehension
132//!
133//! The null value is spelled `#null`.
134//!
135//! The boolean literals are spelled `#true` and `#false`.
136//!
137//! String constants are enclosed in double quotes.  They can contain backslash escapes `\\`, `\0`,
138//! `\n`, `\r`, and `\t`:
139//!
140//!   - `"a string"`
141//!   - `"a string with\na newline"`
142//!   - `"a string with\\a backslash"`
143//!
144//! Integer constants are encoded in ASCII decimal:
145//!
146//!   - `0`
147//!   - `10`
148//!   - `42`
149//!
150//! Lists consist of zero or more expressions, separated by commas, enclosed in square brackets.
151//! The elements of a list do not have to have the same type:
152//!
153//! ``` tsg
154//! [0, "string", 0, #true, @id]
155//! ```
156//!
157//! Sets have the same format, but are enclosed in curly braces instead of square brackets:
158//!
159//! ``` tsg
160//! {0, "string", #true, @id}
161//! ```
162//!
163//! Both lists and sets both allow trailing commas after the final element, if you prefer that
164//! style to play more nicely with source control diffs:
165//!
166//! ``` tsg
167//! [
168//!   0,
169//!   "string",
170//!   #true,
171//!   @id,
172//! ]
173//! ```
174//!
175//! List comprehensions allow mapping over a list and producing a new list with elements based on the
176//! given element expression:
177//!
178//! ``` tsg
179//! [ (some-function x) for x in @xs ]
180//! [ @x.scoped_var for x in @xs ]
181//! ```
182//!
183//! Set comprehensions have similar syntax, but the resulting value will be a set instead of an ordered list:
184//!
185//! ``` tsg
186//! { (some-function x) for x in @xs }
187//! { @x.scoped_var for x in @xs }
188//! ```
189//!
190//! List and set comprehensions are subject to the same restrictions as for loops, which means the list
191//! value that is iterated over must be local.  It is therefore not possible to iterator over the value
192//! of a scoped variable. Using scoped variables in the element expression however is no problem.
193//!
194//! # Syntax nodes
195//!
196//! Syntax nodes are identified by tree-sitter query captures (`@name`).  For instance, in our
197//! example stanza, whose query is `(identifier) @id`, `@id` would refer to the `identifier` syntax
198//! node that the stanza matched against.
199//!
200//! Unused query captures are considered errors, unless they start with an underscode. For example,
201//! a capture `@id` must be used within the stanza, but `@_id` does not.
202//!
203//! # Variables
204//!
205//! You can use variables to pass information between different stanzas and statements in a graph
206//! DSL file.  There are three kinds of variables:
207//!
208//!   - **_Global_** variables are provided externally by whatever process is executing the graph
209//!     DSL file.  You can use this, for instance, to pass in the path of the source file being
210//!     analyzed, to use as part of the graph structure that you create.
211//!
212//!   - **_Local_** variables are only visible within the current execution of the current stanza.
213//!     Once all of the statements in the stanza have been executed, all local variables disappear.
214//!
215//!   - **_Scoped_** variables are "attached to" syntax nodes.  Their values carry over from stanza
216//!     to stanza.  Scoped variables are referenced by using a syntax node expression (typically a
217//!     query capture) and a variable name, separated by a period: `@node.variable`.
218//!
219//! Global variables are declared using a `global` declaration.  The external process that executes the
220//! graph DSL file must provide values for all declared global variables.  (It is not possible to define
221//! a global variable and give it a value from within the graph DSL.) The name of the global variable can
222//! be suffixed by a quantifier: '*' and '+' for lists, and '?' for optional values, which allows them to
223//! be used in iteration and conditional statements, respectively.
224//!
225//! Local and scoped variables are created using `var` or `let` statements.  A `let` statement
226//! creates an **_immutable variable_**, whose value cannot be changed.  A `var` statement creates
227//! a **_mutable variable_**.  You use a `set` statement to change the value of a mutable variable.
228//! Local variables are not allowed to have the same name as a declared global variable.
229//!
230//! Local variables are block scoped.  For example, a local variable defined in a `scan` arm is not
231//! visible in other scan arms, or after the `scan` statement.  If you need to persist a value for use
232//! after a block, introduce a mutable variable before the block and assign to it inside the block.
233//!
234//! ``` tsg
235//! global global_variable
236//!
237//! (identifier) @id
238//! {
239//!   let local_variable = "a string"
240//!   ; The following would be an error, since `let` variables are immutable:
241//!   ; set local_variable = "a new value"
242//!
243//!   ; The following is also an error, since you can't mutate a variable that
244//!   ; doesn't exist:
245//!   ; set missing_variable = 42
246//!
247//!   ; The following is an error, since you cannot hide a global variable with
248//!   ; a local one:
249//!   ; let global_variable = "a new value"
250//!
251//!   var mutable_variable = "first value"
252//!   set mutable_variable = global_variable ; we can refer to the global variable
253//!
254//!   var @id.kind = "id"
255//! }
256//! ```
257//!
258//! Variables can be referenced anywhere that you can provide an expression.  It's an error if you
259//! try to reference a variable that hasn't been defined.
260//!
261//! # Functions
262//!
263//! The process executing a graph DSL file can provide **_functions_** that can be called from
264//! within graph DSL stanzas.
265//!
266//! Function calls use a Lisp-like syntax, where the name of the function being called is _inside_
267//! of the parentheses.  The parameters to a function call are arbitrary expressions.  For
268//! instance, if the executing process provides a function named `+`, you could call it as:
269//!
270//! ``` tsg
271//! (identifier) @id
272//! {
273//!    let x = 4
274//!    let @id.nine = (+ x 5)
275//! }
276//! ```
277//!
278//! Note that it's the process executing the graph DSL file that decides which functions are
279//! available.  We do define a [standard library][], and most of the time those are the functions
280//! that are available, but you should double-check the documentation of whatever graph DSL tool
281//! you're using to make sure.
282//!
283//! [standard library]: functions/index.html
284//!
285//! # Graph nodes
286//!
287//! You can use this graph DSL to create any graph structure that you want.  There are no
288//! limitations on which graph nodes you create, nor on how you use edges to connect the graph
289//! nodes.  You are not limited to creating a tree, and in particular, you are not limited to
290//! creating a tree that "lines" up with the parsed syntax tree.
291//!
292//! There are two ways to create a new graph node.  The first, and most common, is to use a `node`
293//! statement:
294//!
295//! ``` tsg
296//! (identifier) @id
297//! {
298//!   node @id.node
299//! }
300//! ```
301//!
302//! This creates a graph node, and assigns a reference to the new node to a new immutable variable
303//! named `new_node`.
304//!
305//! The second way to create a graph node is to call the [`node`][] function:
306//!
307//! [`node`]: functions/index.html#node
308//!
309//! ``` tsg
310//! (identifier) @id
311//! {
312//!   let @id.node = (node)
313//! }
314//! ```
315//!
316//! Note that these two examples do exactly the same thing!  In fact, the `node` statement in the
317//! first example is just syntactic sugar for the `let` statement in the second example.  Since the
318//! most common pattern is to create a graph node and immediately assign it to an immutable scoped
319//! variable, we provide the `node` statement as a convenient shorthand.  If you need to do
320//! anything more complex, such as assigning the graph node reference to a _mutable_ variable, you
321//! can call the [`node`][] function directly.
322//!
323//! By attaching a graph node to a syntax node using a [scoped variable](#variables), you can refer
324//! to them from multiple stanzas:
325//!
326//! ``` tsg
327//! (identifier) @id
328//! {
329//!   node @id.node
330//! }
331//!
332//! (dotted_name (identifier) @dotted_element)
333//! {
334//!   ; We will learn more about the attr statement below
335//!   attr (@dotted_element.node) kind = "dotted"
336//! }
337//! ```
338//!
339//! In this example, we will create a graph node for _every_ `identifier` syntax node.  Each of
340//! those syntax nodes will have a `node` scoped variable, containing a reference to its graph
341//! node.  But only the graph nodes of those identifiers that appear inside of a `dotted_name` will
342//! have a `kind` attribute.  And even though we used different capture names, inside of different
343//! queries, to find those `identifier` nodes, the graph node references in both stanzas refer to
344//! the same graph nodes.
345//!
346//! # Edges
347//!
348//! Edges are created via an `edge` statement, which specifies the two graph nodes that should be
349//! connected.  Edges are directed, and the `->` arrow in the `edge` statement indicates the
350//! direction of the edge.
351//!
352//! ``` tsg
353//! (import_statement name: (_) @name)
354//! {
355//!   node @name.source
356//!   node @name.sink
357//!   edge @name.source -> @name.sink
358//! }
359//! ```
360//!
361//! There can be at most one edge connecting any particular source and sink graph node in the
362//! graph.  If multiple stanzas create edges between the same graph nodes, those are "collapsed"
363//! into a single edge.
364//!
365//! # Attributes
366//!
367//! Graph nodes and edges have an associated set of **_attributes_**.  Each attribute has a name
368//! (which is an identifier), and a value.
369//!
370//! You add attributes to a graph node or edge using an `attr` statement:
371//!
372//! ``` tsg
373//! (import_statement name: (_) @name)
374//! {
375//!   node @name.source
376//!   node @name.sink
377//!   attr (@name.sink) kind = "module"
378//!   attr (@name.source -> @name.sink) precedence = 10
379//! }
380//! ```
381//!
382//! Note that you have to have already created the graph node or edge, and the graph node or edge
383//! must not already have an attribute with the same name.
384//!
385//! (Attributes might seem similar to scoped variables, but they are quite different.  Attributes
386//! are attached to graph nodes and edges, while scoped variables are attached to syntax nodes.
387//! More importantly, scoped variables only exist while executing the graph DSL file.  Once the
388//! execution has completed, the variables disappear.  Attributes, on the other hand, are part of
389//! the output produced by the graph DSL file, and live on after execution has finished.)
390//!
391//! ## Attribute shorthands
392//!
393//! Commonly used combinations of attributes can be captured in **_shorthands_**.  Each shorthand defines
394//! the attribute name, a variable which captures the attribute value, and a list of attributes to which
395//! it expands.
396//!
397//! Attribute shorthands are defined at the same level as stanzas.  For example, the following shorthand
398//! takes a syntax node as argument, and expands to attributes for its source text and child index:
399//!
400//! ``` tsg
401//! attribute node_props = node => node_text = (source-text node), node_index = (named-child-index node)
402//!
403//! (argument (_)@expr) {
404//!   attr (@expr) node_props = @expr
405//! }
406//! ```
407//!
408//! # Regular expressions
409//!
410//! You can use a `scan` statement to match the content of a string value against a set of regular
411//! expressions.
412//!
413//! Starting at the beginning of the string, we determine the first character where one of the
414//! regular expressions matches.  If more than one regular expression matches at this earliest
415//! character, we use the regular expression that appears first in the `scan` statement.  We then
416//! execute the statements in the "winning" regular expression's block.
417//!
418//! After executing the matching block, we try to match all of the regular expressions again,
419//! starting _after_ the text that was just matched.  We continue this process, applying the
420//! earliest matching regular expression in each iteration, until we have exhausted the entire
421//! string, or none of the regular expressions match.
422//!
423//! Within each regular expression's block, you can use `$0`, `$1`, etc., to refer to any capture
424//! groups in the regular expression.
425//!
426//! The value being scanned must be local, which means it cannot be derived from scoped variables.
427//!
428//! For example, if `filepath` is a global variable containing the path of a Python source file,
429//! you could use the following `scan` statement to construct graph nodes for the name of the
430//! module defined in the file:
431//!
432//! ``` tsg
433//! global filepath
434//!
435//! (module) @mod
436//! {
437//!   var new_node = #null
438//!   var current_node = (node)
439//!
440//!   scan filepath {
441//!     "([^/]+)/"
442//!     {
443//!       ; This arm will match any directory component of the file path.  In
444//!       ; Python, this gives you the sequence of packages that the module
445//!       ; lives in.
446//!
447//!       ; Note that we keep appending additional tag names to the `current`
448//!       ; graph node to create an arbitrary number of graph nodes linked to
449//!       ; the @mod syntax node.
450//!       set new_node = (node)
451//!       attr (new_node) name = $1
452//!       edge current_node -> new_node
453//!       set current_node = new_node
454//!     }
455//!
456//!     "__init__\\.py$"
457//!     {
458//!       ; This arm will match a trailing __init__.py, indicating that the
459//!       ; module's name comes from the last directory component.
460//!
461//!       ; Expose the graph node that we created for that component as a
462//!       ; scoped variable that later stanzas can see.
463//!       let @mod.root = current_node
464//!     }
465//!
466//!     "([^/]+)\\.py$"
467//!     {
468//!       ; This arm will match any other trailing module name.  Note that
469//!       ; __init__.py also matches this regular expression, but since it
470//!       ; appears later, the __init__.py clause will take precedence.
471//!
472//!       set new_node = (node)
473//!       attr (new_node) name = $1
474//!       edge current_node -> new_node
475//!       let @mod.root = new_node
476//!     }
477//!   }
478//! }
479//! ```
480//!
481//! # Conditionals
482//!
483//! You can use `if` statements to make blocks of statements conditional on optional values.
484//! Conditions are comma-separated lists of clauses.  The clause `some EXPRESSION` indicates
485//! that the optional value must be present.  The clause `none EXPRESSION` indicates that the
486//! optional value is absent.  A bare expression is evaluated as to boolean.  All values in
487//! conditions must be local, which means they cannot be derived from scoped variables.
488//!
489//! ``` tsg
490//! (lexical_declaration type:(_)? @type value:(_)? @value)
491//! {
492//!   if some @type, none @value {
493//!     ; ...
494//!   } elif some @value {
495//!     ; ...
496//!   } else {
497//!     ; ...
498//!   }
499//! }
500//! ```
501//!
502//! # List iteration
503//!
504//! You can use a `for` statement to execute blocks of statements for every element in list
505//! values.  The list value must be local, which means it cannot be derived from scoped variables.
506//!
507//! ```tsg
508//! (module (_)* @stmts)
509//! {
510//!   for stmt in @stmts {
511//!     print stmt
512//!   }
513//! }
514//! ```
515//!
516//! # Debugging
517//!
518//! To support members of the Ancient and Harmonious Order of Printf Debuggers, you can use `print`
519//! statements to print out (to `stderr`) the content of any expressions during the execution of a graph DSL
520//! file:
521//!
522//! ``` tsg
523//! (identifier) @id
524//! {
525//!    let x = 4
526//!    print "Hi! x = ", x
527//! }
528//! ```
529
530pub mod functions;