doc/ref/impl.md at master · mvdan.cc/cue

mvdan.cc / cue
this repo has no description
cue / doc / ref / impl.md
at master 336 lines 14 kB view raw view rendered
  1# Implementing CUE
  2
  3
  4> NOTE: this is a working document attempting to describe CUE in a way
  5> relatable to existing graph unification systems. It is mostly
  6> redundant to [the spec](./spec.md). Unless one is interested in
  7> understanding how to implement CUE or how it relates to the existing
  8> body of research, read the spec instead.
  9
 10
 11CUE is modeled after typed feature structure and graph unification systems
 12such as LKB.
 13There is a wealth of research related to such systems and graph unification in
 14general.
 15This document describes the core semantics of CUE in a notation
 16that allows relating it to this existing body of research.
 17
 18
 19## Background
 20
 21CUE was inspired by a formalism known as
 22typed attribute structures [Carpenter 1992] or
 23typed feature structures [Copestake 2002],
 24which are used in linguistics to encode grammars and
 25lexicons. Being able to effectively encode large amounts of data in a rigorous
 26manner, this formalism seemed like a great fit for large-scale configuration.
 27
 28Although CUE configurations are specified as trees, not graphs, implementations
 29can benefit from considering them as graphs when dealing with cycles,
 30and effectively turning them into graphs when applying techniques like
 31structure sharing.
 32Dealing with cycles is well understood for typed attribute structures
 33and as CUE configurations are formally closely related to them,
 34we can benefit from this knowledge without reinventing the wheel.
 35
 36## Formal Definition
 37
 38
 39<!--
 40The previous section is equivalent to the below text with the main difference
 41that it is only defined for trees. Technically, structs are more akin dags,
 42but that is hard to explain at this point and also unnecessarily pedantic.
 43 We keep the definition closer to trees and will layer treatment
 44of cycles on top of these definitions to achieve the same result (possibly
 45without the benefits of structure sharing of a dag).
 46
 47A _field_ is a field name, or _label_ and a protype.
 48A _struct_ is a set of _fields_ with unique labels for each field.
 49-->
 50
 51A CUE configuration can be defined in terms of constraints, which are
 52analogous to typed attribute structures referred to above.
 53
 54### Definition of basic values
 55
 56> A _basic value_ is any CUE value that is not a struct (or, by
 57> extension, a list).
 58> All basic values are partially ordered in a lattice, such that for any
 59> basic value `a` and `b` there is a unique greatest lower bound
 60> defined for the subsumption relation `a ⊑ b`.
 61
 62```
 63Basic values
 64null
 65true
 66bool
 673.14
 68string
 69"Hello"
 70>=0
 71<8
 72re("Hello .*!")
 73```
 74
 75The basic values correspond to their respective types defined earlier.
 76
 77Struct (and by extension lists), are represented by the abstract notion of
 78a typed feature structure.
 79Each node in a configuration, including the root node,
 80is associated with a constraint.
 81
 82
 83### Definition of a typed feature structures and substructures
 84
 85<!-- jba: This isn't adding understanding. I'd rather you omitted it and
 86   added a bit of rigor to the above spec. Or at a minimum, translate the
 87   formalism into the terms you use above.
 88-->
 89
 90> A typed feature structure_ defined for a finite set of labels `Label`
 91> is directed acyclic graph with labeled
 92> arcs and values, represented by a tuple `C = <Q, q0, υ, δ>`, where
 93>
 94> 1. `Q` is the finite set of nodes,
 95> 1. `q0 ∈ Q`, is the root node,
 96> 1. `υ: Q → T` is the total node typing function,
 97>     for a finite set of possible terms `T`.
 98> 1. `δ: Q × Label → Q` is the partial feature function,
 99>
100> subject to the following conditions:
101>
102> 1. there is no node `q` or label `l` such that `δ(q, l) = q0` (root)
103> 2. for every node `q` in `Q` there is a path `π` (i.e. a sequence of
104>    members of Label) such that `δ(q0, π) = q` (unique root, correctness)
105> 3. there is no node `q` or non-empty path `π` such that `δ(q, π) = q` (no cycles)
106>
107> where `δ` is extended to be defined on paths as follows:
108>
109> 1. `δ(q, ϵ) = q`, where `ϵ` is the empty path
110> 2. `δ(q, l∙π) = δ(δ(q, l), π)`
111>
112> The _substructures_ of a typed feature structure are the
113> typed feature structures rooted at each node in the structure.
114>
115> The set of all possible typed feature structures for a given label
116> set is denoted as `𝒞`<sub>`Label`</sub>.
117>
118> The set of _terms_ for label set `Label` is recursively defined as
119>
120> 1. every basic value: `P ⊆ T`
121> 1. every constraint in `𝒞`<sub>`Label`</sub> is a term: `𝒞`<sub>`Label`</sub>` ⊆ T`
122>    a _reference_ may refer to any substructure of `C`.
123> 1. for every `n` values `t₁, ..., tₙ`, and every `n`-ary function symbol
124>    `f ∈ F_n`, the value `f(t₁,...,tₙ) ∈ T`.
125>
126
127
128This definition has been taken and modified from [Carpenter, 1992]
129and [Copestake, 2002].
130
131Without loss of generality, we will henceforth assume that the given set
132of labels is constant and denote `𝒞`<sub>`Label`</sub> as `𝒞`.
133
134In CUE configurations, the abstract constraints implicated by `υ`
135are CUE expressions.
136Literal structs can be treated as part of the original typed feature structure
137and do not need evaluation.
138Any other expression is evaluated and unified with existing values of that node.
139
140References in expressions refer to other nodes within the `C` and represent
141a copy of the substructure `C'` of `C` rooted at these nodes.
142Any references occurring in terms assigned to nodes of `C'` are be updated to
143point to the equivalent node in a copy of `C'`.
144<!-- TODO: define formally. Right now this is implied already by the
145definition of evaluation functions and unification: unifying
146the original TFS' structure of the constraint with the current node
147preserves the structure of the original graph by definition.
148This is getting very implicit, though.
149-->
150The functions defined by `F` correspond to the binary and unary operators
151and interpolation construct of CUE, as well as builtin functions.
152
153CUE allows duplicate labels within a struct, while the definition of
154typed feature structures does not.
155A duplicate label `l` with respective values `a` and `b` is represented in
156a constraint as a single label with term `&(a, b)`,
157the unification of `a` and `b`.
158Multiple labels may be recursively combined in any order.
159
160<!-- unnecessary, probably.
161#### Definition of evaluated value
162
163> A fully evaluated value, `T_evaluated ⊆ T` is a subset of `T` consisting
164> only of atoms, typed attribute structures and constraint functions.
165>
166> A value is called _ground_ if it is an atom or typed attribute structure.
167
168#### Unification of evaluated values
169
170> A fully evaluated value, `T_evaluated ⊆ T` is a subset of `T` consisting
171> only of atoms, typed attribute structures and constraint functions.
172>
173> A value is called _ground_ if it is an atom or typed attribute structure.
174-->
175
176### Definition of subsumption and unification on typed attribute structure
177
178> For a given collection of constraints `𝒞`,
179> we define `π ≡`<sub>`C`</sub> `π'` to mean that typed feature structure `C ∈ 𝒞`
180> contains path equivalence between the paths `π` and `π'`
181> (i.e. `δ(q0, π) = δ(q0, π')`, where `q0` is the root node of `C`);
182> and `𝒫`<sub>`C`</sub>`(π) = c` to mean that
183> the typed feature structure at the path `π` in `C`
184> is `c` (i.e. `𝒫`<sub>`C`</sub>`(π) = c` if and only if `υ(δ(q0, π)) == c`,
185> where `q0` is the root node of `C`).
186> Subsumption is then defined as follows:
187> `C ∈ 𝒞` subsumes `C' ∈ 𝒞`, written `C' ⊑ C`, if and only if:
188>
189> - `π ≡`<sub>`C`</sub> `π'` implies  `π ≡`<sub>`C'`</sub> `π'`
190> - `𝒫`<sub>`C`</sub>`(π) = c` implies`𝒫`<sub>`C'`</sub>`(π) = c` and  `c' ⊑ c`
191>
192> The unification of `C` and  `C'`, denoted `C ⊓ C'`,
193> is the greatest lower bound of `C` and `C'` in `𝒞` ordered by subsumption.
194
195<!-- jba: So what does this get you that you don't already have from the
196various "instance-of" definitions in the main spec? I thought those were
197sufficiently precise. Although I admit that references and cycles
198are still unclear to me. -->
199
200Like with the subsumption relation for basic values,
201the subsumption relation for constraints determines the mutual placement
202of constraints within the partial order of all values.
203
204
205### Evaluation function
206
207> The evaluation function is given by `E: T -> 𝒞`.
208> The unification of two typed feature structures is evaluated as defined above.
209> All other functions are evaluated according to the definitions found earlier
210> in this spec.
211> An error is indicated by `_|_`.
212
213#### Definition of well-formedness
214
215> We say that a given typed feature structure `C = <Q, q0, υ, δ> ∈ 𝒞` is
216> a _well-formed_ typed feature structure if and only if for all nodes `q ∈ Q`,
217> the substructure `C'` rooted at `q`,
218> is such that `E(υ(q)) ∈ 𝒞` and `C' = <Q', q, δ', υ'> ⊑ E(υ(q))`.
219
220<!-- Also, like Copestake, define appropriate features?
221Appropriate features are useful for detecting unused variables.
222
223Appropriate features could be introduced by distinguishing between:
224
225a: MyStruct // appropriate features are MyStruct
226a: {a : 1}
227
228and
229
230a: MyStruct & { a: 1 } // appropriate features are those of MyStruct + 'a'
231
232This is way too subtle, though.
233
234Alternatively: use Haskell's approach:
235
236#a: MyStruct // define a to be MyStruct any other features are allowed but
237             // discarded from the model. Unused features are an error.
238
239Let's first try to see if we can get away with static usage analysis.
240A variant would be to define appropriate features unconditionally, but enforce
241them only for unused variables, with some looser definition of unused.
242-->
243
244The _evaluation_ of a CUE configuration represented by `C`
245is defined as the process of making `C` well-formed.
246
247<!--
248ore abstractly, we can define this structure as the tuple
249`<≡, 𝒫>`, where
250
251- `≡ ⊆ Path × Path` where `π ≡ π'` if and only if `Δ(π, q0) = Δ(π', q0)` (path equivalence)
252- `P: Path → ℙ` is `υ(Δ(π, q))` (path value).
253
254A struct `a = <≡, 𝒫>` subsumes a struct `b = <≡', 𝒫'>`, or `a ⊑ b`,
255if and only if
256
257- `π ≡ π'` implied `π ≡' π'`, and
258- `𝒫(π) = v` implies `𝒫'(π) = v'` and `v' ⊑ v`
259-->
260
261### References
262Theory:
263- [1992] Bob Carpenter, "The logic of typed feature structures.";
264  Cambridge University Press, ISBN:0-521-41932-8
265- [2002] Ann Copestake, "Implementing Typed Feature Structure Grammars.";
266  CSLI Publications, ISBN 1-57586-261-1
267
268Some graph unification algorithms:
269
270- [1985] Fernando C. N. Pereira, "A structure-sharing representation for
271  unification-based grammar formalisms."; In Proc. of the 23rd Annual Meeting of
272  the Association for Computational Linguistics. Chicago, IL
273- [1991] H. Tomabechi, "Quasi-destructive graph unifications.."; In Proceedings
274  of the 29th Annual Meeting of the ACL. Berkeley, CA
275- [1992] Hideto Tomabechi, "Quasi-destructive graph unifications with structure-
276   sharing."; In Proceedings of the 15th International Conference on
277   Computational Linguistics (COLING-92), Nantes, France.
278- [2001] Marcel van Lohuizen, "Memory-efficient and thread-safe
279  quasi-destructive graph unification."; In Proceedings of the 38th Meeting of
280  the Association for Computational Linguistics. Hong Kong, China.
281
282
283## Implementation
284
285The _evaluation_ of a CUE configuration `C` is defined as the process of
286making `C` well-formed.
287
288
289This section does not define any operational semantics.
290As the unification operation is communitive, transitive, and reflexive,
291implementations have a considerable amount of leeway in
292choosing an evaluation strategy.
293Although most algorithms for the unification of typed attribute structure
294that have been proposed are near `O(n)`, there can be considerable performance
295benefits of choosing one of the many proposed evaluation strategies over the
296other.
297Implementations will need to be verified against the above formal definition.
298
299
300### Constraint functions
301
302A _constraint function_ is a unary function `f` which for any input `a` only
303returns values that are an instance of `a`. For instance, the constraint
304function `f` for `string` returns `"foo"` for `f("foo")` and `_|_` for `f(1)`.
305Constraint functions may take other constraint functions as arguments to
306produce a more restricting constraint function.
307For instance, the constraint function `f` for `<=8` returns `5` for `f(5)`,
308`>=5 & <=8` for `f(>=5)`, and `_|_` for `f("foo")`.
309
310
311Constraint functions play a special role in unification.
312The unification function `&(a, b)` is defined as
313
314- `a & b` if `a` and `b` are two atoms
315- `a & b` if `a` and `b` are two nodes, respresenting struct
316- `a(b)` or `b(a)` if either `a` or `b` is a constraint function, respectively.
317
318Implementations are free to pick which constraint function is applied if
319both `a` and `b` are constraint functions, as the properties of unification
320will ensure this produces identical results.
321
322
323### References
324
325A distinguising feature of CUE's unification algorithm is the use of references.
326In conventional graph unification for typed feature structures, the structures
327that are unified into the existing graph are independent and pre-evaluated.
328In CUE, the typed feature structures indicated by references may still need to
329be evaluated.
330Some conventional evaluation strategies may not cope well with references that
331refer to each other.
332The simple solution is to deploy a breadth-first evaluation strategy, rather than
333the more traditional depth-first approach.
334Other approaches are possible, however, and implementations are free to choose
335which approach is deployed.
336