Constructor of incremental parsers in Scala using PEG grammars.
Papa Carlo was designed to construct programming language parsers that may be
used in a wide range of code editors, from simple syntax highlighters to
full-featured IDEs like Eclipse or IntelliJ IDEA.
Please visit the Demo page to see an incremental
parser in action.
Key features:
- Incremental parsing. Parsing performance is proportional to the changes end
users make to the source code, even if the source code consists of thousands
of lines.
- Automatic AST construction. No rule actions are needed to construct the
Abstract Syntax Tree that represents the source code's structure. The parser
builds the AST out of the box.
- Error-recovery mechanism. The parser can recognise source code even if
the code contains syntax errors.
- Language definition using a DSL in Scala. No external files are needed. The
developer defines the parser using the library's API directly in Scala code.
- PEG grammars. A modern and easy-to-understand grammar formalism for language
definition.
- Expression parsing with the Pratt algorithm. Complicated expression syntax
with a number of unary/binary operators of various precedences can be defined
in a few lines of code.
The project is published on GitHub
under the terms of the
Apache License 2.0.
The library's build artifacts are hosted in the
Sonatype
and Maven Central
remote repositories. They are ready to use in SBT-based projects.
Developing code editors and IDEs for programming languages is a complicated problem. Modern IDEs like Eclipse, Visual Studio and IntelliJ IDEA provide the end user with a number of handy code manipulation features: code completion, refactoring, jump-to-definition and even semantic analysis.
At the heart of these systems are specific code parsers that differ from the ordinary parsers designed for programming language compilers. Such parsers should be able to:
- Build an internal object-based representation (AST) of source code that may contain syntax errors.
- Keep this representation in sync with the source code while the user makes changes to it, and do it quickly regardless of the size of the source code.
Unfortunately there is a lack of tools for building parsers of this type. Most existing parser generators like YACC and ANTLR, and combinators like Scala Parser Combinators, don't provide the features described above, because the resulting parsers are designed to parse whole files at once. This approach suits ordinary compilers well, but not code editors. As a result, developers of language support plugins for IDEs usually need to write such parsers more or less manually.
Papa Carlo was made to solve this problem. Another goal of the library is to provide an easy-to-use toolkit with a number of handy features that language parser developers usually need, out of the box.
Here is how Papa Carlo accomplishes this:
- The user defines the programming language specification using Papa Carlo's API directly in Scala code. This approach is close to the one used in many modern parser combinators.
- The resulting parser builds an Abstract Syntax Tree out of the box. No rule actions are needed.
- The parser can efficiently recover from syntax errors and build a parsing result from incomplete code.
- Most importantly, Papa Carlo caches parsing results per small source code fragment, so it can parse a file incrementally while the user edits it, without significant performance penalties.
It is also important to mention when Papa Carlo does not fit well. If you are developing a standalone programming language compiler that must parse a large number of files with thousands of lines of source code at once, and speed is vital, you should probably look for something different, for example a parser generator like ANTLR. Due to the nature of Papa Carlo's built-in features such as caching and error recovery, its cold-start performance may not be as good as that of parsers generated with ANTLR.
Let's have a look at an example JSON parser. A parser usually consists of three main components:
- Language lexis specification.
- Language syntax specification.
- Fragment specifications. This is about caching and will be described later in more detail.
Normally the parser can be defined in one Scala file, with a structure like this:
object Json {
  private def tokenizer = {
    val tokenizer = new Tokenizer()
    import tokenizer._
    import Matcher._

    ...

    tokenizer
  }

  private def contextualizer = {
    val contextualizer = new Contextualizer

    import contextualizer._

    ...

    contextualizer
  }

  def lexer = new Lexer(tokenizer, contextualizer)

  def syntax(lexer: Lexer) = new {
    val syntax = new Syntax(lexer)

    import syntax._
    import Rule._

    ...
  }.syntax
}
...

val lexer = Json.lexer
val syntax = Json.syntax(lexer)

syntax.onNodeMerge.bind {node => println(node.prettyPrint())}

lexer.input(""" {"x": ["hello world", 1234], "y": 5678} """)
This is in fact all we need to obtain a full-featured incremental parser. As easy as pie.
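To see the incremental behaviour in practice, the same lexer can simply be fed an updated version of the document. The minimal sketch below (not part of the JSON example) re-sends the full modified text; the onNodeMerge handler bound above then reports only the AST branches that were actually rebuilt, while untouched fragments are taken from the cache:

// A minimal sketch: re-feeding an edited version of the same document.
// Only the fragments affected by the change are expected to be re-parsed.
lexer.input(""" {"x": ["hello world", 1234, 5678], "y": 5678} """)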
You don't really need to know how the parser works under the hood to build one, but knowledge of its internal processes may improve your understanding of Papa Carlo's top-level API.
When text data comes into the Lexer it passes through several transformation stages and finally becomes an object-based hierarchical structure, the so-called Abstract Syntax Tree. Each component responsible for a specific stage is loosely coupled to the components of the previous stages, and the connections between components are based on a slot-signal model: each component has its own public signal fields that can be subscribed to with appropriate anonymous functions. This technique is also useful for debugging individual stages of your own parser during development. In particular, Papa Carlo's own functional tests use these signals to build a parser log for each test case.
Source code can be (and in practice should be) fed to the parser continuously while the end user makes changes to it. Each component of the parser can use the results obtained from the previous input phase to optimise parsing of the next one. This is the so-called "incremental parsing", the essential feature of Papa Carlo.
Here are the stages, and the components responsible for handling them:
- Selecting the part of the code that differs from the source code parsed before - Lexer.
- Splitting this part into lines - Lexer.
- Splitting each line into tokens - Tokenizer.
- Replacing part of the existing tokens from the previous phase with new tokens - TokenCollection.
- Splitting tokens into contextual fragments - FragmentController and Contextualizer.
- Parsing fragments using syntax rules - Syntax.
- Building an Abstract Syntax Tree based on the Nodes for these fragments, and replacing the appropriate branches of the existing tree with updated branches - Syntax.
This sounds quite complicated, but fortunately most of the time the developer doesn't need to work with all of this directly, because Papa Carlo takes the hard work on its side. All the developer needs to do is specify parsing rules for the Tokenizer, Contextualizer and Syntax using the prepared and easy-to-use API.
Now it's time to describe this API.
In the first parsing stage the source code is split into small pieces (each token's "value") and classified into groups (each token's "kind"). For example, this code: {"x": 1234}
will be parsed into the following sequence of tokens:
| Kind        | Value   | Skip |
|-------------|---------|------|
| Open brace  | {       |      |
| String      | "x"     |      |
| Colon       | :       |      |
| Whitespace  | (space) | +    |
| Number      | 1234    |      |
| Close brace | }       |      |
Further syntax parsing rules are applied to these token kinds, not their values. Tokens flagged as "skippable" are ignored by the syntax parser.
As a first step we need to define Token Category parsing rules that match character sequences and produce tokens of the appropriate kinds. Such rules are, in a nutshell, represented as finite-state machines, so they can also be seen as regular expressions defined with a combination of functions. Here is an example of the tokenizer for the Math expression parser:
...

private def tokenizer = {
  val tokenizer = new Tokenizer()
  import tokenizer._
  import Matcher._

  tokenCategory(
    "whitespace",
    oneOrMore(anyOf(" \t\f\n"))
  ).skip

  tokenCategory(
    "number",
    choice(
      chunk("0"),
      sequence(rangeOf('1', '9'), zeroOrMore(rangeOf('0', '9')))
    )
  )

  terminals("(", ")", "%", "+", "-", "*", "/")

  tokenizer
}

...
Two token categories are defined here: "whitespace" for spaces between meaningful tokens, and "number" for the meaningful tokens. The whitespace token category is marked as skippable with the .skip method.
For those who are familiar with regular expressions, here is a table of correspondences between Papa Carlo's Matchers and regexp rules:

| Papa Carlo Matcher | Similar Regexp | Description |
|---|---|---|
| zeroOrMore(x) | x* | Repetition of x. |
| oneOrMore(x) | x+ | Repetition of x at least one time. |
| optional(x) | x? | Matches x at most one time. |
| repeat(x, y) | x{y, y} | Repetition of x exactly y times. |
| anyOf("xyz") | [xyz] | Any character from the given string. |
| anyExceptOf("xyz") | [^xyz] | Any character except those in the given string. |
| any() | . | Any character. |
| rangeOf(x, y) | [x-y] | Any character between characters x and y. |
| chunk("foobar") | foobar | Matches the given string. |
| test(x) | | Predicate. Checks whether matcher x applies here, but does not actually consume input. |
| testNot(x) | | Predicate. Checks whether matcher x does not apply here. |
| sequence(x, y, z) | x y z | Sequence of rules. |
| choice(x, y, z) | x\|y\|z | Ordered choice between rules. In contrast with regular expressions, a further rule is applied if and only if the preceding rules have failed. |
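As an illustration of how these matchers compose, here is a hedged sketch of a token category for double-quoted string literals. The category name and the escaping rules are made up for this example and are not taken from any Papa Carlo grammar shown above:

// A hedged sketch: a "string" token category composed from the matchers above.
// Matches an opening quote, then either backslash escapes or any characters
// except quote and backslash, then a closing quote.
tokenCategory(
  "string",
  sequence(
    chunk("\""),
    zeroOrMore(
      choice(
        sequence(chunk("\\"), any()),
        anyExceptOf("\"\\")
      )
    ),
    chunk("\"")
  )
)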
Besides the generic token categories described above, it is also important to have a way to define static categories, whose names and values are equal. There are two kinds of such categories: terminals and keywords.
Terminals can be defined like this: tokenizer.terminals("{", "}", "+"). Terminals are equivalent to the corresponding generic token categories:
tokenizer.tokenCategory("{", chunk("{"))
tokenizer.tokenCategory("}", chunk("}"))
tokenizer.tokenCategory("+", chunk("+"))
The keyword definition API is similar to that of terminals: tokenizer.keywords("def", "class", "private"). But keywords work in a different way: each keyword rule changes the kind of a token produced by a tokenCategory rule when the token's value matches the keyword's name. Let's look at an example from the JSON parser:
...

tokenCategory(
  "alphanum",
  oneOrMore(rangeOf('a', 'z'))
)

...

keywords("true", "false", "null")

...
Here a string like abc=true cde=null will be tokenized as ["alphanum", "=", "true", "whitespace", "alphanum", "=", "null"], whereas the string abc=truecde=null is tokenized as ["alphanum", "=", "alphanum", "=", "null"]. The string truecde contains the substring true but does not match it as a whole token, so it cannot be turned into an independent keyword token.
Another thing that should be mentioned about the tokenization process is the importance of the order of token category definitions. The token category rules are applied in the order of their definition, so it is important to be careful with rules for tokens that could be substrings of other tokens.
For example, terminal rules defined as tokenizer.terminals("+", "++") will parse the text ++ + ++ as ["+", "+", "+", "+", "+"]: each ++ substring is tokenized as two independent tokens. To fix this, the definition order should be changed to tokenizer.terminals("++", "+"). Now the input string is parsed as expected: ["++", "+", "++"].
Before the syntax parsing stage the parser selects simple syntactical Fragments between pairs of specific tokens, and uses these fragments to cache parsing results during the syntax parsing stage. Such fragments can be, for example, code blocks between "{" and "}" tokens in C++, or function definitions between the keywords "def" and "end" in Ruby. Most programming languages have such pairs. The main property of a Fragment is that its meaning (or, more specifically, the syntax rule that can be applied to it) in the context of the language's syntax should be invariant to its content.
Let's illustrate this with an example. Here is a Java snippet:
public class FibCalculator {
  public static int fibonacci(int fibIndex) {
    if (memoized.containsKey(fibIndex)) {
      return memoized.get(fibIndex);
    } else {
      int answer = fibonacci(fibIndex - 1) + fibonacci(fibIndex - 2);
      memoized.put(fibIndex, answer);
      return answer;
    }
  }
}
There are four code fragments between curly braces "{" and "}": the class body, the method body, the success branch of the conditional operator and the failure branch of that operator. Each of these fragments retains its type regardless of its internal content. A method body remains a method body even if the programmer adds another code statement, or even writes something syntactically incorrect in it. Of course the meaning of nested fragments may potentially change depending on changes made to the parent fragment, but that does not matter here.
To summarise, Papa Carlo needs the developer to define code Fragments with the following properties:
- They can be simply determined as a code sequence between two tokens.
- Their syntactical meaning is invariant to their internal content, at least in most cases.
- Normally it is expected that the code contains a lot of such fragments.
Fortunately, most modern programming languages meet these requirements.
Defining fragments is extremely simple. That is how it is done in the JSON parser example:
private def contextualizer = {
  val contextualizer = new Contextualizer

  import contextualizer._

  trackContext("[", "]").allowCaching
  trackContext("{", "}").allowCaching

  contextualizer
}
Calling the .allowCaching method here indicates that fragments of this kind may potentially be bound to cached parts of the resulting Abstract Syntax Tree. The caching process is described in more detail in the next topic.
Another useful point about fragments is changing token flags. For example, it is possible to define that a specific fragment should be completely ignored during syntax parsing. This is useful for defining code comments:
trackContext("/*", "*/").forceSkip.topContext
trackContext(""", """).topContext
Here is a definition of two fragment types: the first for code comments like /* comments to be ignored */, and the second for string literals (potentially multiline strings). The .forceSkip method indicates that all tokens inside the fragment should be ignored. The .topContext method indicates that no fragments can be nested inside it; for example, the first closing curly brace token in the string { x = "abc } def"; } should not be considered the ending token of the curly-brace fragment.
Like the other parsing stages that utilise the caching approach, the fragmentation stage is no exception: the computational effort the parser needs to update the fragment system is proportional to the changes the end user makes to the source code. In practice, building fragments is a fast process. Moreover, the parser is able to detect invalid fragments (with missing opening or closing tokens) and keep the valid ones intact. For example, in this modified Java snippet:
public class FibCalculator {
  public static int fibonacci(int fibIndex) {
    if (memoized.containsKey(fibIndex)) {
      return memoized.get(fibIndex);
    } else {
      int answer = fibonacci(fibIndex - 1) + fibonacci(fibIndex - 2);
      memoized.put(fibIndex, answer);
      return answer;
      damaged line
    }
  }
the parser is still able to track the fragments of the class body, the method body and even the conditional operator's success-branch code block.
The final result of the parsing process is a hierarchical object representation of the source code, the Abstract Syntax Tree. Each Node of the AST has the following main properties:
- kind: String name of the node's category. Usually the same as the name of the rule that builds this node.
- begin and end: First and last tokens of the source code segment represented by the node.
- id: Globally unique identifier of the node. Each node of the AST may be obtained using this identifier.
- branches: Multimap of child nodes. Each branch of the node has its own string tag. Usually each branch has a unique tag, but sometimes one tag may refer to several branches. For example, a node representing a JSON Array may have a sequence of child nodes representing the array's elements, all referred to by the "element" tag. Branches can be read from the current node with the getBranches(tag: String) method.
- references: Multimap of selected tokens from the node's source code. Sometimes it is useful to select specific tokens from the source in order to simplify further semantic analysis of the AST. This multimap is formed similarly to the branches multimap, and the tokens can be read with the getValue(tag: String) method. The result of this method is a string of all tagged token values joined together.
The Syntax parser is responsible for creating, removing and building the whole AST, and usually the developer doesn't need to care about manual operations on nodes. However, Papa Carlo provides a way to override the default node construction behaviour. This technique is described in the next chapter.
API
To define the actual syntax parser, the developer decomposes the programming language syntax into named rules. Each rule parses a specific code construction and is responsible for building the appropriate AST Node for that construction. These rules form a programming language grammar of the Parsing Expression Grammar class.
Rules are defined using methods of an instance of the Syntax class: .rule(ruleName: String)(ruleBody: => Rule). Such a rule builds nodes of the same kind as ruleName. The rule that represents the top-level node of the AST hierarchy should be defined with the mainRule
method instead.
The reference returned by the rule definition method can be used to refer to this rule from other rules. Hence the rules can refer to each other, forming a graph of rules. Note that rule bodies are defined lazily, so it is safe to define circular references between them. In the JSON example, jsonObject refers to objectEntry, objectEntry refers to jsonValue, and jsonValue refers to jsonObject again.
Remember that syntax rules are applied only to tokens that are not marked as skipped. Thus whitespace, line breaks and comments are ignored by default, according to the language's lexical specification. This approach simplifies syntax rule definitions, as the developer deals with meaningful tokens only.
An example of a syntax rule for parsing the Array construction of the JSON language:
...

val jsonArray = rule("array") {
  sequence(
    token("["),
    zeroOrMore(
      branch("value", jsonValue),
      separator =
        recover(token(","), "array entries must be separated with , sign")
    ),
    recover(token("]"), "array must end with ] sign")
  )
}

...
Full example can be found here: JSON parser.
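To show where such a rule fits, here is a hedged sketch of a possible top-level rule that refers to the objectEntry rule shown later in this guide; the real JSON example may define its main rule differently:

// A hedged sketch: a top-level rule built with mainRule. Because rule bodies
// are evaluated lazily, objectEntry can in turn refer to jsonValue, which may
// refer back to this object rule, forming a cycle.
mainRule("object") {
  sequence(
    token("{"),
    zeroOrMore(
      branch("entry", objectEntry),
      separator = recover(token(","), "entries must be separated with , sign")
    ),
    recover(token("}"), "object must end with } sign")
  )
}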
Operators
Like the lexical rules, syntactical rule bodies can be defined by composing built-in primitive rules. These primitive rules in fact represent PEG operators.
Token matcher operators
| Operator | Description |
|---|---|
| token(x) | Matches a token of kind x. |
| tokensUntil(x) | Matches all tokens starting from the current position until the token x is met. x is included. |
Composition operators
| Operator | Description |
|---|---|
| optional(x) | Matches subrule x zero or one time. |
| zeroOrMore(x, separator) | Optionally matches subrule x multiple times. The code pieces matched by subrule x may be divided by another piece of code defined by the optional separator parameter. |
| oneOrMore(x, separator) | Matches subrule x one or more times. x and separator have the same meaning as in the zeroOrMore method. |
| repeat(x, y) | Matches x repeated exactly y times. |
| sequence(x, y, z) | Matches a sequence of rules. |
| choice(x, y, z) | Ordered choice between rules. The next subrule is matched if and only if the preceding ones have failed. |
Node construction operators
| Operator | Description |
|---|---|
| capture(tag, x) | Matches x and writes the tokens matched by x to the node's references multimap under the given tag. |
| branch(tag, x) | Forces subrules of x to write their resulting nodes to the current node's branches multimap, using tag as the multimap key. |
Note that if capture, branch or any rule that includes them fails, the references and branches multimaps of the node under construction are reverted to their initial state, as if the rules had not been applied at all.
Miscellaneous
| Operator | Description |
|---|---|
| name(label, x) | Convenient way to reuse a composite syntax rule. Useful for referring to the same rule multiple times instead of duplicating its code. In contrast with the .rule() and .mainRule() methods, .name() does not define a node constructor; it simply returns the nested rule's result. Note that, unlike rule and mainRule, x is defined eagerly, so named rules cannot refer to each other cyclically. |
| recover(x, exception) | Allows subrule x to be recovered, producing the error message exception if it fails. |
| expression(tag, atom) | Defines an expression parsing rule constructor. tag is an optional parameter that binds the top-level node of the expression's node subtree to the current node's branches multimap. |
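As an illustration of the name operator, here is a hedged sketch that extracts a reusable list-of-values subrule; the label and surrounding rules are made up for this example, and since name evaluates its body eagerly, jsonValue is assumed to be defined already:

// A hedged sketch: reusing a composite subrule via name(). Unlike rule(),
// name() does not construct a node of its own; it only labels the subrule.
val valueList = name(
  "valueList",
  zeroOrMore(
    branch("value", jsonValue),
    separator = recover(token(","), "values must be separated with , sign")
  )
)

val jsonArray = rule("array") {
  sequence(
    token("["),
    valueList,
    recover(token("]"), "array must end with ] sign")
  )
}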
Error recovery mechanisms
Parsers made with Papa Carlo are able to recover from syntax errors. There are two techniques:
- Erasing mismatched tokens.
- Skipping missing required tokens.
The first is designed to skip parts of the source code that the parser recognises as completely inappropriate in that place. The second is for parsing code parts that contain syntax errors but can still be recognised as meaningful. For example, the block of code if (a == b) {doSmthng(); is missing the closing brace "}", but can still be recognised as an operator block.
The first technique is provided by the parser out of the box. If the parser cannot reach some tokens of the code fragment, it temporarily removes the leftmost unreached token (as if it were marked as skipped) and reruns the parsing process from the beginning of the code Fragment. Fortunately, Papa Carlo's parser uses packrat memoization, so rerunning the parser on the same fragment of code doesn't lead to performance issues.
To take advantage of the second technique, the developer should manually specify which syntax rules can be skipped in case of failure.
Skipping failed syntax rules
Application of a syntax rule, whether it is a primitive operator or a named rule defined with the .rule or .mainRule methods, produces one of the following matching results:
- Matching was successful.
- There were syntax errors during token matching, but they are not crucial. A so-called recoverable result.
- Application completely failed. A non-recoverable result.
The second and third results produce syntax error messages that can later be read using the .getErrors method of the Syntax class. But the first and second results are interpreted by parent rules as successful. For example, sequence(x, y, z) is considered failed if and only if one of its subrules completely fails. If any of the nested rules finish with recoverable results and no subrule fails, the parent rule returns a recoverable result too.
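For example, after feeding input, the accumulated error messages can be inspected; this is a minimal sketch, and the exact shape of the objects returned by getErrors is an assumption:

// A minimal sketch: printing syntax errors collected by the Syntax instance.
lexer.input(""" {"x": [1, 2 } """)
for (error <- syntax.getErrors) println(error)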
To mark a rule as potentially skippable and its content as recoverable, the developer needs to wrap the rule in the .recover(subrule, errorMessage) operator. This operator never fails: if the nested subrule fails, the whole composition finishes as recoverable and the recover rule produces a syntax error at the current position with the error message defined by the errorMessage parameter.
It is recommended to use this technique to recover missing control tokens such as semicolons ; at the end of statements, commas , between list elements, and missing closing tokens like ) or }.
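For instance, a statement rule for a hypothetical C-like language could tolerate a missing trailing semicolon like this; the rule, the "identifier" token kind and the expressionRule subrule are illustrative and not taken from any Papa Carlo example:

// A hedged sketch: recovering from a missing trailing semicolon.
val assignment = rule("assignment") {
  sequence(
    capture("name", token("identifier")),
    token("="),
    branch("value", expressionRule),  // expressionRule is a hypothetical subrule
    recover(token(";"), "statement must end with a ; sign")
  )
}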
Note that it is important not to overuse this recovery mechanism, especially in cases where missing tokens may completely change the meaning of the code construction and produce grammar ambiguity. For example, the closing parenthesis ) of an if operator's condition in Java is not crucial for recognising the whole construction:
if (a == b {
  doSomething();
}
But the opening brace { of the operator block is vital. Otherwise every single-statement operator would be recognised as an operator block with a missing opening brace.
Syntax parsing rules are in fact applied to source code Fragments, not to the whole code. The FragmentController is responsible for determining which Fragment was invalidated when the user makes changes to the source code, and it notifies the Syntax parser about this Fragment. The Syntax parser looks for the cachable named rule that was applied to this fragment last time and applies it again. If there is no such rule (for example because it failed last time, or was not marked as cachable), the parser chooses an enclosing fragment that matches these requirements. Also, during the parsing process the Syntax parser reuses AST nodes bound to nested fragments that were not affected by the source code changes, instead of parsing those fragments again.
There always exists one top-level Root Fragment that covers the whole source code, and a Main rule, defined by the .mainRule() method, for the top-level AST node. So if no fragments were parsed by the FragmentController, the Syntax parser uses the Root Fragment by default to parse the whole file.
That is how Papa Carlo's caching mechanism works.
To mark syntax rules that are intended to be cached, use the cachable() method of the Syntax class. From the JSON parser example:
...
cachable(jsonObject, jsonArray)
...
Note that appropriate code fragments must also be marked as cachable in the fragment specification:
...
trackContext("[", "]").allowCaching
trackContext("{", "}").allowCaching
...
The classical Parsing Expression Grammar approach to defining expressions with infix/postfix/prefix operators like (1 + 2 * 3) / -4
suggests using rules like this:
Value <- atomicOperand | '(' Expression ')'
Product <- Expression (('*' | '/') Expression )*
Sum <- Expression (('+' | '-') Expression )*
Expression <- Product | Sum | Value
In the simple case described in the pseudo-code above this looks pretty simple. But for real programming languages with numerous operators the grammar can become much more complicated: the rules must be implemented in accordance with operator precedence and associativity. Moreover, a compiler usually needs the Nodes in the form of a binary tree, so an additional step to transform the resulting tree into binary form would be required.
To reduce this complexity, Papa Carlo provides a builder of operator-precedence parsing rules with an easy-to-use API. The output node of such a rule forms a binary tree by default.
From the Calculator example:
...

mainRule("expression") {
  val rule =
    expression(branch("operand", recover(number, "operand required")))

  group(rule, "(", ")")

  postfix(rule, "%", 1)

  prefix(rule, "+", 2)
  prefix(rule, "-", 2)

  infix(rule, "*", 3)
  infix(rule, "/", 3, rightAssociativity = true)

  infix(rule, "+", 4)
  infix(rule, "-", 4)

  rule
}

...
Here the .expression(tag: String, operand: Rule) method defines an expression rule parser. tag is an optional parameter that is "result" by default; its meaning is the same as the tag parameter of the .branch() method. operand is the atomic operand of the expression; in this math expression parser it is a numeric literal.
Four functions are used to configure the parser:

| Operator constructor | Expression example | Description |
|---|---|---|
| group(expression, open, close) | ( x + y ) * 2 | Defines grouping operators. |
| infix(expression, operator, precedence) | x + y + z | Left-associative infix binary operator with the given precedence. |
| infix(expression, operator, precedence, rightAssociativity = true) | x = y = z | Right-associative infix binary operator with the given precedence. |
| prefix(expression, operator, precedence) | ( - x) + y | Prefix unary operator with the given precedence. |
| postfix(expression, operator, precedence) | x = i ++ | Postfix unary operator with the given precedence. |
The order of the constructor applications does not matter; they can be applied in any order. So to define the expression grammar of a programming language, the developer can directly translate operator precedence tables like this one: Python operator precedence table.
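For instance, a small fragment of a C-like precedence table could be translated directly into constructor calls. This is a hedged sketch: the operator set, the precedence numbers and the operandRule subrule are illustrative, not taken from any Papa Carlo example (as in the Calculator example above, a smaller precedence number binds tighter):

// A hedged sketch: translating a precedence table into expression constructors.
val rule = expression(branch("operand", operandRule))

prefix(rule, "!", 1)                            // logical not
infix(rule, "*", 2)                             // multiplicative
infix(rule, "/", 2)
infix(rule, "+", 3)                             // additive
infix(rule, "-", 3)
infix(rule, "&&", 4)                            // logical and
infix(rule, "||", 5)                            // logical or
infix(rule, "=", 6, rightAssociativity = true)  // assignment, right-associative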
Internally the parser uses the Pratt algorithm, so anyone familiar with this parsing technique may implement their own extensions.
From the Postfix operator constructor source code:
...

def postfix(rule: ExpressionRule, operator: String, precedence: Int) {
  rule.parselet(operator)
    .leftBindingPower(Int.MaxValue - precedence * 10)
    .leftDenotation {
      (expression, left, operatorReference) =>
        val node = new Node(operator, operatorReference, operatorReference)
        node.branches += "operand" -> List(left)
        node
    }
}

...
Here is a brief description of this low-level API:
- .parselet() defines a new "token", in terms of Pratt's algorithm, for the specific operator.
- .leftBindingPower() defines the lbp (left binding power) parameter of the token.
- .leftDenotation() defines the led handler of the token.
- .nullDenotation() defines the nud handler of the token.
- expression.parseRight(rbp) applies the parser, inside the body of a led or nud handler, to the right part of the expression with the given right binding power rbp.
- expression.consume(tokenKind) forces the parser to read a specific token at the current position.
Although Papa Carlo's parser usually constructs Nodes automatically based on the defined rules, it is still possible to intercept this process in specific cases, to apply manual changes to the nodes or even to reconfigure a whole branch of the tree.
To do that, the developer may call the .intercept method of the Syntax class:
val objectEntry = rule("entry") {
  sequence(
    capture("key", token("string")),
    token(":"),
    branch("value", jsonValue)
  )
}

intercept(objectEntry) {
  entryNode =>
    val accessor = new NodeAccessor(entryNode)
    accessor.setEnd(entryNode.getBegin)

    entryNode
}
During the parsing process the id property of a node under construction is not yet assigned. That means the Syntax does not track these nodes yet, and they can be freely mutated using the NodeAccessor class. New nodes can also be created manually and configured in the same way using NodeAccessor.
There are three important things the developer should keep in mind about manual node creation/manipulation:
- It is possible to mutate only nodes that are not yet part of the final AST. Nodes already bound to the AST become immutable.
- One Node instance must have no more than one parent. In other words, the developer should not bind the same Node instance as a branch to different supernodes.
- Syntax rules that have an explicit interception routine do not support caching.
There are a number of approaches to debugging programming language parsers, and probably no single best solution. This article describes the approach used by Papa Carlo to test and debug its own example parsers. It can be adopted for third-party parsers made with Papa Carlo.
In the test directory of the repository you can find two functional ScalaTest specs for the example parsers. Each Test Spec loads different versions of prepared source code files, listens to the logs of the various parser components, and compares them with prototype logs.
The utils directory contains a number of so-called Monitor classes. Each monitor is responsible for listening to the logs of a specific parser component. The abstract class ParserSpec is a ready-to-use prototype of a configurable Test Spec that utilises these monitors to produce the resulting output logs and compare them with the appropriate prototype logs.
Example of test spec for Calculator parser:
class CalculatorSpec extends ParserSpec(
  // Name of the directory that contains the appropriate resource files
  // and json configuration.
  parserName = "calculator",

  // Function-constructor of the lexical parser.
  lexerConstructor = Calculator.lexer _,

  // Function-constructor of the syntax parser.
  syntaxConstructor = Calculator.syntax
)
This test spec reads configuration details from the corresponding json file placed in the resource dir:
{
  "expressions": {
    "steps": 3,
    "independentSteps": true,
    "monitors": ["node", "error", "token"]
  },

  "recovery": {
    "steps": 2,
    "independentSteps": true,
    "monitors": ["node", "error", "token"]
  }
}
The configuration lists specific test suites and their options. The name of a test suite should match the name of the corresponding resource directory containing the input and prototype files. In this example input files should be placed in the src/test/resources/fixtures/calculator/expressions/input directory, and prototypes in the src/test/resources/fixtures/calculator/expressions/prototype directory.
Here is a description of the possible test suite configuration options:
| Option | Description |
|---|---|
| steps | Number of input source code versions. Optional; defaults to the number of files found in the "input" directory. By convention each input file is placed in the "input" dir and named step<n>.txt, where n is the version number starting from 0. |
| monitors | Array of Monitor string names that should be used for this suite. By default all monitors are turned on. |
| shortOutput | Boolean flag indicating whether the output logs should be simplified. Turned off by default. |
| outputFrom | Number of the first input file that should be logged. All preceding input files are processed by the parser, but their logs are ignored. 0 by default. |
| independentSteps | Boolean flag indicating whether the parser should be reset on each input step. Turned off by default. Turning it on disables the incremental parsing feature. |
The list of provided Monitors:
- "token": Logs all tokens and their fragment context. Useful to debug Tokenizer and Fragment specification of the parser.
- "fragment": Logs fragments that were created, deleted and invalidated by FragmentController.
- "cache": Logs fragment and appropriate node creation/deletion/invalidation events of internal AST caching process.
- "node": Logs Node's creation and deletion events of the Syntax parser. And also the AST branch that were updated(so called "merge") during the parse phase.
- "error": Logs syntax errors happens during the Syntax parsing step.
The TestSpec collects the Monitors' output, compares it with the prototype logs obtained from the appropriate resource files, and updates the output files in the relative "output" resource dir. The developer can then manually review the log output and/or update the appropriate prototype files using these output files.