[Tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)
queries allow you to search for patterns in syntax trees,
much like a regex would in text. Combine that with some Rust
glue to write simple, custom linters.

### Tree-sitter syntax trees

Here is a quick crash course on the syntax trees that
tree-sitter generates. They are represented as
S-expressions. The generated S-expression for the following
Rust code,

```rust
fn main() {
    let x = 2;
}
```

would be:

```scheme
(source_file
  (function_item
    name: (identifier)
    parameters: (parameters)
    body:
      (block
        (let_declaration
          pattern: (identifier)
          value: (integer_literal)))))
```

Syntax trees generated by tree-sitter have another cool
property: they are _lossless_ syntax trees. Given a
lossless syntax tree, you can regenerate the original source
code in its entirety. Consider the following addition to our
example:

```rust
 fn main() {
+    // a comment goes here
     let x = 2;
 }
```

The tree-sitter syntax tree preserves the comment, while the
typical abstract syntax tree wouldn't:

```scheme
 (source_file
   (function_item
     name: (identifier)
     parameters: (parameters)
     body:
       (block
+        (line_comment)
         (let_declaration
           pattern: (identifier)
           value: (integer_literal)))))
```

### Tree-sitter queries

Tree-sitter provides a DSL to match over concrete syntax
trees (CSTs). These queries resemble our S-expression syntax
trees. Here is a query to match all line comments in a Rust
CST:

```scheme
(line_comment)

; matches the following rust code
; // a comment goes here
```

Neat, eh? But don't take my word for it, give it a go on the
[tree-sitter
playground](https://tree-sitter.github.io/tree-sitter/playground).
Type in a query like so:

```scheme
; the web playground requires you to specify a "capture"
; you will notice the capture and the nodes it captured
; turn blue
(line_comment) @capture
```

Here's another to match `let` declarations that
bind an integer to an identifier:

```scheme
(let_declaration
  pattern: (identifier)
  value: (integer_literal))

; matches:
; let foo = 2;
```

We can _capture_ nodes into variables:

```scheme
(let_declaration
  pattern: (identifier) @my-capture
  value: (integer_literal))

; matches:
; let foo = 2;

; captures:
; foo
```

And apply certain _predicates_ to captures:

```scheme
((let_declaration
  pattern: (identifier) @my-capture
  value: (integer_literal))
  (#eq? @my-capture "foo"))

; matches:
; let foo = 2;

; and not:
; let bar = 2;
```

The `#match?` predicate checks if a capture matches a regex:

```scheme
((let_declaration
  pattern: (identifier) @my-capture
  value: (integer_literal))
  (#match? @my-capture "foo|bar"))

; matches both `foo` and `bar`:
; let foo = 2;
; let bar = 2;
```

Exhibit indifference, as a stoic programmer would, with the
_wildcard_ pattern:

```scheme
(let_declaration
  pattern: (identifier)
  value: (_))

; matches:
; let foo = "foo";
; let foo = 42;
; let foo = bar;
```

[The
documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)
does the tree-sitter query DSL more justice, but we now know
enough to write our first lint.

### Write you a tree-sitter lint

Strings in `std::env` functions are error-prone:

```rust
std::env::remove_var("RUST_BACKTACE");
                            // ^^^^ "TACE" instead of "TRACE"
```

I prefer this instead:

```rust
// somewhere in a module that is well spellchecked
static BACKTRACE: &str = "RUST_BACKTRACE";

// rest of the codebase
std::env::remove_var(BACKTRACE);
```

Let's write a lint to find `std::env` functions that use
strings. Put aside the effectiveness of this lint for the
moment, and take a stab at writing a tree-sitter query. For
reference, a function call like so:

```rust
remove_var("RUST_BACKTRACE")
```

produces the following S-expression:

```scheme
(call_expression
  function: (identifier)
  arguments: (arguments (string_literal)))
```

We are definitely looking for a `call_expression`:

```scheme
(call_expression) @raise
```

Whose function name matches `std::env::var` or
`std::env::remove_var` at the very least (I know, I know,
this isn't the most optimal regex):

```scheme
((call_expression
  function: (_) @fn-name) @raise
  (#match? @fn-name "std::env::(var|remove_var)"))
```

Let's make that `std::` prefix optional:

```scheme
((call_expression
  function: (_) @fn-name) @raise
  (#match? @fn-name "(std::|)env::(var|remove_var)"))
```

And ensure that the sole argument is a string literal:

```scheme
((call_expression
  function: (_) @fn-name
  arguments: (arguments (string_literal))) @raise
  (#match? @fn-name "(std::|)env::(var|remove_var)"))
```
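
One caveat worth keeping in mind: `#match?` patterns are unanchored, so the regex above fires on any function path that merely contains `env::var`, such as `dotenv::var` or `env::var_os`. The sketch below is plain Rust with no tree-sitter involved; `matches_unanchored` and `matches_anchored` are hypothetical helpers that emulate the pattern with string operations, to show what anchoring it as `^(std::)?env::(var|remove_var)$` would change.

```rust
/// Emulates the unanchored pattern `(std::|)env::(var|remove_var)`:
/// without anchors, the optional `std::` prefix is irrelevant and any
/// path that contains one of the two suffixes is a hit.
fn matches_unanchored(fn_name: &str) -> bool {
    fn_name.contains("env::var") || fn_name.contains("env::remove_var")
}

/// Emulates the anchored pattern `^(std::)?env::(var|remove_var)$`:
/// the whole path must be exactly one of the four accepted spellings.
fn matches_anchored(fn_name: &str) -> bool {
    // Drop a single leading `std::` if present, then compare exactly.
    let rest = fn_name.strip_prefix("std::").unwrap_or(fn_name);
    matches!(rest, "env::var" | "env::remove_var")
}
```

Whether `env::var_os` deserves a warning is a judgment call; the point is only that the unanchored pattern makes that decision for you.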

### Running our linter

We could always plug our query into the web playground, but
let's go a step further:

```bash
cargo new --bin toy-lint
```

Add `tree-sitter` and `tree-sitter-rust` to your
dependencies:

```toml
# within Cargo.toml
[dependencies]
tree-sitter = "0.20"

# the snippets below use the 0.20-era API; if the grammar's HEAD
# no longer matches it, pin a compatible `rev` or tag here
[dependencies.tree-sitter-rust]
git = "https://github.com/tree-sitter/tree-sitter-rust"
```

Let's load in some Rust code to work with. As [an ode to
Gödel](https://en.wikipedia.org/wiki/Self-reference)
(G`ode`l?), why not load in our linter itself:

```rust
fn main() {
    let src = include_str!("main.rs");
}
```

Most tree-sitter APIs require a reference to a `Language`
struct. We will be working with Rust, if you haven't
already guessed:

```rust
use tree_sitter::Language;

let rust_lang: Language = tree_sitter_rust::language();
```

Enough scaffolding, let's parse some Rust:

```rust
use tree_sitter::Parser;

let mut parser = Parser::new();
parser.set_language(rust_lang).unwrap();

let parse_tree = parser.parse(&src, None).unwrap();
```

The second argument to `Parser::parse` may be of interest.
Tree-sitter has this cool feature that allows for quick
reparsing of existing parse trees if they contain edits. If
you do happen to want to reparse a source file, you can pass
in a reference to the old tree:

```rust
// if you wish to reparse instead of parse
// (`old_tree` must be a `mut` binding for `edit`)
old_tree.edit(/* redacted */);

// generate shiny new reparsed tree
let new_tree = parser.parse(&src, Some(&old_tree)).unwrap();
```
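
The redacted argument is a description of what changed: `Tree::edit` expects an `InputEdit` holding the start, old end, and new end of the edit, each as both a byte offset and a row/column point. As a rough, tree-sitter-free sketch of how those six values could be computed for a single replacement, here are two hypothetical helpers. `EditSpan` is a stand-in struct invented for this example (it is not the crate's type), and columns are counted in bytes.

```rust
/// Zero-based (row, column) of `byte` within `src`; the column is a
/// byte count from the start of the line.
fn point_at(src: &str, byte: usize) -> (usize, usize) {
    let before = &src.as_bytes()[..byte];
    let row = before.iter().filter(|&&b| b == b'\n').count();
    // The line starts just after the last newline before `byte`.
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    (row, byte - line_start)
}

/// Hypothetical stand-in for the six values an edit needs.
#[derive(Debug, PartialEq)]
struct EditSpan {
    start_byte: usize,
    old_end_byte: usize,
    new_end_byte: usize,
    start_point: (usize, usize),
    old_end_point: (usize, usize),
    new_end_point: (usize, usize),
}

/// Replaces `old_src[start..old_end]` with `replacement`, returning
/// the edited source alongside a description of the edit.
fn replace_range(
    old_src: &str,
    start: usize,
    old_end: usize,
    replacement: &str,
) -> (String, EditSpan) {
    let mut new_src = String::with_capacity(old_src.len() + replacement.len());
    new_src.push_str(&old_src[..start]);
    new_src.push_str(replacement);
    new_src.push_str(&old_src[old_end..]);

    let new_end = start + replacement.len();
    let span = EditSpan {
        start_byte: start,
        old_end_byte: old_end,
        new_end_byte: new_end,
        // Start and old end are measured in the old source,
        // the new end in the edited source.
        start_point: point_at(old_src, start),
        old_end_point: point_at(old_src, old_end),
        new_end_point: point_at(&new_src, new_end),
    };
    (new_src, span)
}
```

These values would then be copied into the matching fields of `InputEdit` before calling `old_tree.edit(...)` and reparsing the edited source.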

Anyhow ([hah!](http://github.com/dtolnay/anyhow)), now that
we have a parse tree, we can inspect it:

```rust
println!("{}", parse_tree.root_node().to_sexp());
```

Or better yet, run a query on it:

```rust
use tree_sitter::Query;

let query = Query::new(
    rust_lang,
    r#"
    ((call_expression
      function: (_) @fn-name
      arguments: (arguments (string_literal))) @raise
      (#match? @fn-name "(std::|)env::(var|remove_var)"))
    "#
)
.unwrap();
```

A `QueryCursor` is tree-sitter's way of maintaining state as
we iterate through the matches or captures produced by
running a query on the parse tree. Observe:

```rust
use tree_sitter::QueryCursor;

let mut query_cursor = QueryCursor::new();
let all_matches = query_cursor.matches(
    &query,
    parse_tree.root_node(),
    src.as_bytes(),
);
```

We begin by passing our query to the cursor, followed by the
"root node", which is another way of saying, "start from the
top", and lastly, the source itself. If you have already
taken a look at the C API, you will notice that the last
argument, the source (known as the `TextProvider`), is not
required. The Rust bindings seem to require this argument to
provide predicate functionality such as `#match?` and
`#eq?`.

Do something with the matches:

```rust
// get the index of the capture named "raise"
let raise_idx = query.capture_index_for_name("raise").unwrap();

for each_match in all_matches {
    // iterate over all captures called "raise"
    // ignore captures such as "fn-name"
    for capture in each_match
        .captures
        .iter()
        .filter(|c| c.index == raise_idx)
    {
        let range = capture.node.range();
        let text = &src[range.start_byte..range.end_byte];
        let line = range.start_point.row;
        let col = range.start_point.column;
        println!(
            "[Line: {}, Col: {}] Offending source code: `{}`",
            line, col, text
        );
    }
}
```
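
Note that tree-sitter's rows and columns are zero-based, and the column is a byte offset into the line, whereas most editors count from one. If you want diagnostics you can jump to directly, a small formatting helper along these lines could do the shift. `format_diagnostic` is a hypothetical name, and the message layout simply mirrors the `println!` above.

```rust
/// Formats a lint hit for humans: converts tree-sitter's zero-based
/// row/column into the one-based numbering that editors display.
fn format_diagnostic(row: usize, column: usize, text: &str) -> String {
    format!(
        "[Line: {}, Col: {}] Offending source code: `{}`",
        row + 1,
        column + 1,
        text
    )
}
```

Inside the loop, this would be called as `format_diagnostic(line, col, text)` and the result printed.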

Lastly, add the following line to your source code, to get
the linter to catch something:

```rust
    env::remove_var("RUST_BACKTRACE");
```

And `cargo run`:

```shell
λ cargo run
   Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint)
    Finished dev [unoptimized + debuginfo] target(s) in 0.74s
     Running `target/debug/toy-lint`
[Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")`
```

Thank you tree-sitter!

### Bonus

Keen readers will notice that I avoided `std::env::set_var`.
That is because `set_var` is called with two arguments, a
"key" and a "value", unlike `env::var` and `env::remove_var`.
As a result, it requires more juggling:

```scheme
((call_expression
  function: (_) @fn-name
  arguments: (arguments . (string_literal)? . (string_literal) .)) @raise
  (#match? @fn-name "(std::|)env::(var|remove_var|set_var)"))
```

The interesting part of this query is the humble `.`, the
_anchor_ operator. Anchors help constrain child nodes in
certain ways. In this case, it ensures that we match exactly
two `string_literal`s that are siblings, or exactly one
`string_literal` with no siblings. Unfortunately, this query
also matches the following invalid Rust code:

```rust
// remove_var accepts only 1 arg!
std::env::remove_var("RUST_BACKTRACE", "1");
```
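
One way around this is to keep the query permissive and settle the argument count in the Rust glue. The sketch below is a minimal, tree-sitter-free illustration: `expected_args` and `has_expected_arity` are hypothetical helpers, and it assumes you have already extracted the text of `@fn-name` and counted the arguments of the matched call yourself.

```rust
/// Number of arguments each flagged function takes, keyed by the last
/// path segment so both `env::var` and `std::env::var` resolve.
fn expected_args(fn_name: &str) -> Option<usize> {
    match fn_name.rsplit("::").next()? {
        "var" | "remove_var" => Some(1),
        "set_var" => Some(2),
        // Not a function this lint knows about.
        _ => None,
    }
}

/// True when a matched call has the argument count its function expects.
fn has_expected_arity(fn_name: &str, arg_count: usize) -> bool {
    expected_args(fn_name) == Some(arg_count)
}
```

A match for `std::env::remove_var` with two arguments would then be skipped, or reported as a different kind of problem.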

### Notes

All in all, the query DSL does a great job of lowering the
barrier to writing language tools. The knowledge gained from
mastering the query DSL can be applied to other languages
that have tree-sitter grammars too. This query
detects `to_json` methods that do not accept additional
arguments, in Ruby:

```scheme
((method
  name: (identifier) @fn
  !parameters)
  (#eq? @fn "to_json"))
```