Introducing Glush: a robust, human readable, top-down parser compiler
Written by Magnus Holm
It’s been 45 years since Stephen Johnson wrote Yacc (Yet another compiler-compiler), a parser generator that made it possible for anyone to write fast, efficient parsers. Yacc, and its many derivatives, quickly became popular and were included in many Unix distributions. You would imagine that in 45 years we would have further perfected the art of creating parsers and would have standardized on a single tool. A lot of progress has been made, but there are still annoyances and problems affecting every tool out there.
We’re happy to announce that there’s now another tool for all our toolboxes: Glush is a parser toolkit that lets you create efficient parsers in multiple languages (currently JavaScript, Go, and Ruby) from a declarative and expressive grammar format. Glush focuses on supporting every type of grammar you can throw at it. This means you can write a grammar that’s easy for humans to read and that can double as a readable specification. While supporting a wide range of grammar features, Glush maintains best-in-class performance.
Introduction
In this article we will go into the philosophy behind Glush and its reason for existence. We’re going to start by explaining exactly the type of parser we’re interested in creating. We’ll look at the major parser techniques and tools that exist today and discuss how well they fit into our requirements. In the final section we’ll explain the inner workings of the algorithm behind Glush.
This article covers a wide range of fascinating topics, and some of the sections below could easily be expanded into full chapters to be fully explained, but given the time available (ours and yours) we’re leaving it at a few paragraphs of discussion. If you find something interesting we hope that we’ve provided enough references and links for you to learn more about it.
Background: The making of GROQ
Why did we need Glush? Well, we needed it because of GROQ. GROQ (Graph-Relational Object Queries) is a flexible query language designed for working with JSON data. It was created by the team behind Sanity.io, where it functions as the primary way of querying structured content. We’re pretty proud of how it has turned out, but to be honest, designing and implementing a query language is one of the tasks we wish we didn’t have to do. The need for GROQ appeared back in 2016 when we were dreaming about the optimal way to structure content. We had a very clear plan:
- We want to structure our content as plain old JSON with its flexibility of nested arrays and objects.
- We want to put that JSON into a database. This database is the single source of truth.
- We want to be able to write short, efficient queries for extracting, combining and shaping exactly the content we need.
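For a taste of what such queries ended up looking like, here is a small GROQ query; the document fields (movie, releaseYear, director) are purely illustrative. It filters JSON documents by type and a field, follows a reference, and shapes the result into a new object:

```
*[_type == "movie" && releaseYear > 2000]{
  title,
  releaseYear,
  "director": director->name
}
```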
Fast forward to 2019 and it turns out that GROQ has worked out very well for us and our users. Yes, it’s a new query language to learn, but it is surprisingly expressive. One concern many users have had is that learning and writing GROQ locks them into our platform. At some level this is true, since Sanity.io is currently the only service supporting GROQ, but we feel that it’s important to stress that we don’t want GROQ to be a lock-in. We didn’t create GROQ because we wanted a proprietary technology that would lock users into our platform; we created GROQ because we had strong opinions about how a query language should work and we didn’t want to settle for anything less. As such, earlier this year we announced that we’re open sourcing GROQ.
As a part of this initiative we want to make everything about GROQ as open as possible. The obvious first part, then, is the parser. How can we claim that GROQ is open source if we’re sitting on the only parser? I guess we should open source our parser then? Well, there are a few problems with just open sourcing our parser.
The first problem is that we didn’t actually have a well-defined specification of the syntax. By open sourcing our parser we would essentially create a situation where the de-facto specification of the syntax becomes whatever that parser does. This is especially suboptimal because that parser is 2000 lines of handwritten Go. It’s unfortunate if people need to dig through exactly how it works in order to replicate every little strange behavior they find.
So no, we think that open sourcing our existing parser would be a rather meaningless gesture that doesn’t address the real obstacles to an open query language.
What does the optimal parser look like?
What would the optimal open source parser for GROQ look like? Let’s dream a bit!
First of all: We want a declarative grammar file. What we really want is a human-readable specification of the syntax which happens to be executable as well. That makes it important for the grammar language to be flexible: We want to write the grammar in a way that is clear for the human reading and writing it, even if that comes at the expense of making it harder for the parser.
We want the grammar file to specify both the lexing and the parsing. Many parser algorithms are based on the concept that you parse in two steps: first a lexer step where you look at the input and produce tokens (e.g. 1 + 1 would be converted into NUMBER PLUS NUMBER), and then a parser step where you look at the tokens and determine the structure. (Yes, it’s a bit confusing that sometimes "parse" means "do the whole job" and other times it only means the step which looks at tokens.) We don’t mind this separation per se, but it’s important for us to be able to specify all the rules in a single file. If the grammar file can only cover the parsing step, we need a separate solution for how to write down the lexer rules. A parser which does both steps at the same time is often called a scannerless parser.
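To make the two steps concrete, here is a minimal sketch in Go of a lexer handling just this 1 + 1 example. It is purely illustrative (it is not Glush, nor our actual parser); it only shows the kind of work the lexer step does before the parser step takes over:

```go
// A minimal sketch of the traditional two-step split: a lexer that
// turns the characters "1 + 1" into NUMBER PLUS NUMBER tokens,
// which a separate parser step would then consume.
package main

import (
	"fmt"
	"unicode"
)

// Token is one unit handed from the lexer to the parser.
type Token struct {
	Kind string // "NUMBER" or "PLUS"
	Text string
}

// lex is the lexer step: it only looks at raw characters and produces tokens.
func lex(input string) []Token {
	var tokens []Token
	for _, r := range input {
		switch {
		case unicode.IsDigit(r):
			tokens = append(tokens, Token{"NUMBER", string(r)})
		case r == '+':
			tokens = append(tokens, Token{"PLUS", string(r)})
		case unicode.IsSpace(r):
			// The lexer throws whitespace away; the parser never sees it.
		}
	}
	return tokens
}

func main() {
	// The parser step would look only at this token stream,
	// never at the original characters.
	fmt.Println(lex("1 + 1")) // [{NUMBER 1} {PLUS +} {NUMBER 1}]
}
```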
We want to be able to generate parsers for multiple programming languages. Initially we want to generate a JavaScript parser. Later we want to generate a Go parser which can replace our current internal parser. Other than that we just want to make it easy to add support for more programming languages over time.
We don’t need super-fast performance. The primary function of this parser is to correctly parse GROQ queries in different languages. For most use cases raw parsing speed doesn’t matter much, since the execution time of the query will completely dominate the time spent parsing. In the cases where you really do care about the performance of the parser, you might not be able to use this parser. That is okay. You can still use this parser to verify that you’re parsing everything correctly.
Even though we don’t "care" about performance, it must still have reliable performance. We want to be able to run this parser in production, and it’s important that it doesn’t collapse (e.g. suddenly takes exponential time) on pathological cases.
What about the existing tools and techniques?
Parsing is an interesting field in computer science. I recently discussed different approaches to verifying the correctness of parser algorithms with a computer science professor, and his first statement was "Why are you working on a parsing algorithm? That’s a solved problem!" This is true at a certain level, as most grammars (for programming and query languages) can be written as LALR(1) grammars, which can be parsed efficiently. A parser is also something you write once and then never need to touch again. It might be hard work to write the initial grammar, but it doesn’t matter so much because you only need to do it once.
However, in practice it turns out that LALR(1) isn’t always the answer. For instance, GCC, Rust, and Go have all chosen to use handwritten parsers that are not based on a declarative grammar file. I find this disappointing: We have decades of experience with specifying languages in a declarative format, and apparently it’s still easier to write parsers by hand. Obviously there’s something lacking in the algorithms we have today.
This is a problem which has irked me for a long time. One of my weaknesses in getting things done is that I often get distracted by wanting to improve all of the things. I have this great idea for a web application, but as I start working on it I find that the web framework has some annoying quirks. Then I’ll start playing with ideas for a better web framework, but as I start working on it I find that the programming language has some annoying quirks. Then I’ll start playing with ideas for a better programming language, but as I start working on it I find that the parser toolkits have some annoying quirks. Now I just need to build the best parser toolkit so I can complete the best programming language so I can complete the best web framework, and then I’ll finish the great web application in no time.
The result of this endeavour is that I’ve tried many parser toolkits and read up on pretty much all of the algorithms that are out there. I’m constantly on the lookout for an approach that can handle all of the requirements in the previous section. The algorithm behind Glush is, at this moment in time, what I believe is the most promising solution, but I will be very honest and admit that it’s been a long journey and that it might end up being another dead end.
In this section I will summarize the various parser algorithms and tools that I’ve looked at. This will show how I discovered a technique, learnt something from it, and then eventually ended up discarding it for some reason. Unsurprisingly, this section will slowly converge towards the machinery behind Glush.
Context-free grammars
The classical way of creating a parser involves creating a context-free grammar, which is a set of recursive rules that describe the language. There are many different types of grammars, but it turns out that context-free grammars have a nice combination of expressiveness and ease of parsing, and therefore they have become ubiquitous. Typically you see context-free grammars written in BNF style:
A context-free grammar for basic arithmetic of single-digit numbers.
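As an illustration, one possible BNF-style formulation of such a grammar is the following (the exact notation varies between tools):

```
Expr   ::= Expr "+" Term | Expr "-" Term | Term
Term   ::= Term "*" Factor | Term "/" Factor | Factor
Factor ::= "(" Expr ")" | Digit
Digit  ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
```

Each rule names a symbol on the left and lists the alternative ways it can be expanded on the right. Note that Expr and Term refer to themselves: this recursion is what lets a small set of rules describe arbitrarily nested expressions, and it also encodes that * and / bind tighter than + and -.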