— 20 min read
Building an LSIF indexer for a low-code platform
Written by Brian Shih, Airkit Engineering
In the past 2.5 years of working at Airkit, I’ve been delving deeper and deeper into the fascinating realm of low-code. Amidst this eye-opening yet rather winding expedition, I’ve begun to appreciate the beauty of abstractions. By examining the Airkit’s abstract syntax tree (AST) and my journey of building a Language Server Index Format (LSIF) indexer, we will explore the power of abstractions in the context of a low-code platform.
What is low-code?
To understand what low-code is, we must first examine what computer programs and programming languages are. Computer programs are sets of instructions for the computer, and programming languages are abstractions, in the form of formal languages, for these sets of instructions. The amount of abstraction provided to a programming language defines how “high-level” it is. Effectively, abstractions are what make it possible to write programs in higher-level code without thinking in terms of lower-level code, to write programs in C without thinking in terms of assembly code, and to write assembly code without thinking in terms of logic gates.
At Airkit, we treat low-code as yet another layer of abstraction — a high-level programming language that happens to be visual. As opposed to traditional text-based programming languages that require scanning and parsing to convert the source code into a parsed tree, which later gets converted into AST (abstract syntax tree), visual programming languages have structural editors that directly build the AST.
This exemplifies what low-code is — a visual way of coding. As seen in this example, the user can actually toggle back and forth between the visual editor mode and the more traditional text editor mode while reading and modifying the underlying AST.
The Airkit AST
The goal of this blog isn’t to perform a deep dive on the Airkit AST. However, having an intuition of what an Airkit application looks like under the hood is useful to understanding how Airkit works.
The AST for the button in Image 1 looks like this:
This button is an instance of an Airkit Element, which is the most primitive building block for Airkit applications. There are many different kinds of elements, and compound elements are built up from simpler ones. We can see that the button AST is actually a compound element comprised of elements that describe its text, style, and events.
The right-most panel in Image 1 is the inspector. An inspector is a collection of editors that allow users to modify an element and its sub-components. We can see that the button’s inspector has corresponding editors for the text, style, and events properties of the button element.
Using React as an analogy
We have seen earlier how visual editors that comprise the Airkit studio are just syntactic sugar over Airkit elements. Similarly, in React, JSX is just syntactic sugar over React elements. For instance, below is how we could declare a button in React with JSX.
The JSX is then compiled into:
Which gets evaluated into a React element like this:
Just as how JSX is simply a visual aid for building React elements, the visual editors in Airkit Studio are just visual aids for building Airkit elements. Both frameworks are simply abstractions that allow users to construct lower-level elements with higher-level programming languages.
Airkit Studio: the Airkit IDE
The Airkit Studio is an IDE, just as XCode, VSCode, and IntelliJ are. An IDE normally consists of the source editor, debugger, navigational/refactoring tools, error trays, etc. These are tools that help a programmer write syntactically and semantically correct programs and are all features in Airkit Studio that we work on.
In the remaining post, I will unveil the mystery behind how Airkit supports navigational and refactoring tools such as finding references, renaming variables, and more. Before we dive into the technical details, here is a quick demo of renaming a variable.
So how does Airkit provide accurate code navigation and refactoring to users within milliseconds? The answer lies in Language Server Index Format (LSIF).
What is LSIF?
LSIF is a standard format for programming tools (language servers, IDEs, etc) to persist knowledge about a code space. A good analogy for LSIF is a database index, which allows a database query to efficiently retrieve data from a database at the cost of additional writes and storage space to maintain the index. LSIF is an index that allows a code viewing client (e.g., a code editor) to provide features like autocomplete or go-to-definition without requiring a language analyzer to perform those computations in real-time.
If you ever wonder how IDEs like Visual Studio Code provide the same set of refactoring/navigational tools to so many different programming languages, it is because these IDEs have clients that are powered by language-agnostic protocols like LSIF. So as long as there is an LSIF indexer for a programming language to convert files of that language to the LSIF data format, the IDE can work with that language.
Choosing the LSIF as Airkit’s data model for indexing was easy. After some research, we felt, as opposed to building our own in-house protocol or using another indexing format, LSIF was simple to understand, is supported in many languages, and has ongoing support by several companies like Sourcegraph and Github/Microsoft.
Quick LSIF Example
To illustrate how LSIF supports efficient code navigation, we will use this code snippet as an example (from official LSIF specification).
As you can see, LSIF is just a collection of nodes and edges. Figure 2 below shows a visual representation of the generated LSIF graph.
The bar function declaration is labeled as bar[def] in the visual graph. We can also find the corresponding node in the JSON output: the node with id 9. The bar[def] node is a range node because it represents a symbol in the program ranging from line 0 character 9 to line 0 character 12. Range nodes are uniquely identified with starting and ending line/characters.
So how does VSCode se LSIF index allow a user to find all references for the bar declaration? The user would first hover over the bar declaration and then click the find-all-references button. Note that hovering over position denoting the “b” in bar and the “r” in bar will return the same range node. The VSCode client would submit a language server request, containing the cursor position, to the language server. The algorithm to find all references for a node is as follows:
- Use the cursor position to identify the range node with the line number and character count. The identified node is bar[def].
- Traverse to [result set] by taking the next edge. The result set has outgoing edges corresponding to the type of query a user can make about a node, such as references or definition.
- Take the textDocument/references node to the referenceResult node. The referenceResult node contains one or more item edges corresponding to the references of a node.
- Following the two item edges bring us to bar[def] and bar[ref], the two references for bar[def].
For those of you who aren’t familiar with how Airkit’s studio works, I suggest taking a look at this quick Airkit 101 video.
To demonstrate how LSIF is used in Airkit Studio, we will continue to work with our button example from earlier. To start off, we need a pair of definitions and references to work with.
We will first create a definition node. In figure 3, we create a variable called foo under Web Page, which is the parent element of the Button. The tree on the left of Figure 3 is the UI hierarchy of the Web Page. It also signifies the scope for each element. In general, any child elements are inside the scope of the parent element, so it will have all variables declared by the parent. Therefore, the Button now has access to the foo variable we just created.
This is the AST for the variable declaration:
Next, let’s create a reference node for the foo variable. Instead of just a plain “Click Me”, we now use a dynamic expression that contains the foo variable.
The updated AST for the button is:
Now that we have a variable definition and a variable reference, LSIF will connect the two range nodes as shown in Figure 5.
The entire app model is split into many small connected components like Figure 5. Each connected component connects a definition node with a list of reference nodes.
Finally, now that we have an LSIF index connecting the foo definition and reference, we can now query for the references of the foo variable in O(1) time. Figure 6 shows how a user could make an LSIF query to fetch references for the foo variable and Figure 7 shows the output of the query.
The remainder of this blog post will focus on the interesting technical challenges we faced when implementing LSIF. The first challenge we had to solve was how to modify the LSIF protocol to work with a non-textual programming language. LSIF is built for text-based programming languages: files, line numbers, and starting characters are first-class properties for nodes. In contrast, we store the ASTs of an application as one JSON tree.
This wasn’t difficult though — we just had to find a way to uniquely identify any Airkit element in the JSON tree. While normal files can use line numbers and character count to uniquely identify symbols, we can use the data path, which is the absolute path from the root of the JSON tree to uniquely identify an element. After figuring out the protocol, we needed to design the LSIF indexer.
Designing the indexer
LSIF is just a data format; there isn’t any documentation on how to build the LSIF index. Here are some of the major decisions we had to make about the indexer:
- Do we update the index synchronously or asynchronously?
- Does the index live on the client or the server?
- Do we persist the index?
Our team discussed many approaches and here are some of the tradeoffs for each of them.
Approach #1: synchronous updates on the client side
Currently, each client has a state manager (redux) that holds the app definition. When a user edits an app, a command is dispatched to the state manager to apply the commands to the app definition. Our initial approach was to also have the LSIF indexer live on the client. Whenever the state manager received a command, it would pass the new app definition to the LSIF indexer to rebuild the index each time.
The advantage of synchronous updates is that the LSIF index is always guaranteed to be up-to-date. Figure 8 shows the communication between the different components of the system: the user, the state manager, and the LSIF indexer. Time flows from left to right. Each request or response message is represented by an arrow.
In the example, the update to the LSIF index is performed synchronously. The coordinator waits for the indexer to finish indexing before reporting success to the user. This guarantees that when the user makes a request to the LSIF Indexer, the LSIF Indexer is up-to-date with the latest changes the user submitted.
Approach #2: asynchronous updates – server side indexing
To not block the UI thread when the LSIF indexer updates the index, we can move the LSIF indexer onto a server, making updates to the LSIF asynchronous.
With asynchronous updates, there is a problem. As seen in Figure 9, if the user queries LSIF shortly after making a write, the data changes may not yet have reached the LSIF index. If the user queries the LSIF index, it will look like the data they submitted was lost, so they will be confused.
In this scenario, we need read-after-write consistency, also known as read-your-writes consistency. This consistency model guarantees that if a user makes a change to the app model and subsequently makes a request to the LSIF index, the index will be up-to-date with the changes made by the user.
There are various ways to achieve read-after-write consistency with asynchronous updates. One way is to have the client remember the timestamp of its most recent write. Then we just need to make sure the LSIF index won’t respond to any requests until it is up-to-date with the timestamp of the request. Instead of the actual system clock, we can use a logical timestamp, such as the log sequence number formed by the commands the user submits to the state manager. The advantage of asynchronous updates is that the state manager can respond successfully to the user before the LSIF indexer finishes updating the index. The disadvantage is the additional cost of the servers and the latency when making requests to the server. In addition, we would need to make the indexer fault-tolerant. And, if we wanted to protect the indexers against a single node failure or network outage, we would need to make the indexers distributed.
Approach #3: asynchronous updates with background thread (Web Worker)
A similar approach to server-side indexing is to store and build the LSIF index in a background thread, such as a web worker (the worker thread can perform tasks without interfering with the user interface). The problem of this approach is similar to the problems of server-side indexing — if the system doesn’t guarantee read-after-write consistency, the LSIF may be outdated for a split second.
This is the flow for renaming variables:
- user submits a rename variable request to state manager to change the variable name from foo to bar
- state manager renames the variable, then makes a request to LSIF index for all references of foo
- after retrieving all references, state manager renames variables then responds to user with success
- state manager then updates the LSIF index
Suppose a user queries LSIF before the state manager updates the LSIF in stage 4, then the user would get outdated data. This is seen in Figure 10 when the user renames a foo variable to bar, then queries the references of bar, but doesn’t get the renamed bar variables in return.
The advantage of background thread is that laborious processing can be performed on a separate thread, allowing the UI thread to run without being blocked. The disadvantage of a web worker is the additional complexity to guarantee read-after-write consistency and the additional complexity to prevent memory leaks, etc.
Approach #4: incremental updates
Instead of rebuilding the LSIF index each time there is a write, we could perform incremental updates in real-time. This way, we could synchronously update the LSIF index. However, incremental updates come with some interesting technical challenges.
Suppose we have two global variables.
The user inserts a baz variable (Figure 13) to an array (Figure 12). How might we incrementally update the LSIF index to the changes made that takes us from Figure 12 to Figure 13?
You might be thinking since an array is a tree such that each child has its index as its key, can’t we just traverse to the array, increment the index of foo and bar by 1 and insert the baz variable into the graph? Unfortunately, things aren’t as easy as it seems. While the three variables — baz, foo, bar — are semantically connected by the array, they are actually not connected in the LSIF graph. Figure 14 shows that foo and bar are actually two disconnected components.
The challenge of building LSIF incrementally is that nodes in the app model that have semantic connections might not be connected in the LSIF graph. This makes it difficult to update data paths when we insert them into an array. It also makes it tricky to garbage collect the graph when a parent node is deleted since we need to recursively find all its children and delete them.
To solve this problem, we need a reverse index where we can search for a node by data path with a prefix. In the example provided in Figure 13, we need to search foo and bar by the data path elements/ and shift them by one. A good data structure for this would be a trie. Tries are ordered tree data structures for strings, each node is associated with a string inferred from the position of the node in the tree. Tries are also called prefix trees and can be searched by prefixes. All descendants of a node have a common prefix of the string associated with that node. The search/insertion/deletion time is linear to the length of term and key. Using a trie would allow us to find the child of a parent node given the parent node’s data path very easily.
The disadvantage of using a trie is the time and space complexity to index it. Furthermore, as of now, we decided to not store the LSIF index to reduce the scope of this project. This would increase the initial load time of an application by O(mn) where m is the number of nodes in an application. The number of nodes of an application can range from 200~1000 for normal apps, which makes the load time not trivial.
Choosing an approach
In an ideal world, if we could develop a robust and efficient incremental update algorithm, we would be able to avoid tricky race conditions with fast, synchronous updates. The problem with the approach is that it is error-prone and difficult to implement and maintain. In addition, unlike some projects that can be shipped piece by piece without fully completing, an LSIF indexer either works or it doesn’t. As a result, the incremental update approach is even riskier.
The initial prototype we built was to rebuild the entire LSIF index synchronously after each user update. Driven by the KISS (keep it simple, stupid) approach, we decided to test how far this approach would go. To test the performance, we tested apps with JSON files ranging 0.1 MB to 4 MB. Most apps are less than 1 MB, and it takes roughly 10ms to 100ms to index apps of that size, with roughly increasing order of time against the size of the app. The biggest app we tested was 4 MB and had more than 180,000 lines of JSON and over 37,000 Airkit elements. To rebuild the LSIF index would take roughly 600ms on average.
Given that one of Airkit’s core values is impact, we ultimately feel that for now, the performance of the naive LSIF indexer is good enough for most apps. It also allows us to not have to deal with race conditions or to maintain additional servers/web workers.
We may have explored many winding paths to build the LSIF indexer, but in the end, we were able to find a practical solution that allowed us to deliver the functionalities we were hoping to deliver. While this solution wasn’t the most optimal in terms of performance, it allowed us to make the biggest impact in a relatively short period of time.
One of the most exciting things about working at Airkit is that we get to build tools and abstractions that we enjoy using as programmers ourselves, such as type checkers, IDE tools, UI frameworks, etc. In addition, building a visual programming language and a distributed runtime allows us to apply theoretical knowledge from and dive deep into fascinating computer science domains such as compilers, type theory, distributed systems, databases, etc. If these kinds of technical challenges sound interesting to you, we’re hiring! Check out our open positions here.
Special thanks to Ari, Fabian, Aria, Aaron, Joe, Sean, Warren, and Daniel for reviewing and helping out with the content!