Airscript – a programming language designed to deal with sensitive data
data-binding-tag.gif

Airkit is a low-code platform that allows users to build and deploy enterprise web apps. Airscript is the programming language we developed for building Airkit apps.

You’re probably wondering: why did we build our own programming language instead of using a more widely adopted language like JavaScript?

The answer is that having end-to-end control over the language allows us to build first-class primitives into the language to make it easier for developers to build the type of applications Airkit is designed for – distributed, compliant web apps.

For example, Airkit apps are distributed and need to synchronize across devices and users. Building distributed web apps (think Google Docs or Figma) is time-consuming and difficult. Airscript’s runtime has built-in state synchronization mechanisms which enable every app built with Airscript to have multiplayer capabilities.

Second, making an app compliant requires a lot of manual work and overhead. Airscript’s runtime has built-in mechanisms to automatically track sensitive data and emit audit events. In this post, we will focus on how Airscript makes it easier for developers to build compliant apps that deal with sensitive customer data.

Compliance!

When developers build applications, they want to focus on building the app and solving the problem at hand. Compliance, although not necessarily an afterthought, is often viewed by developers as just a required checkbox that needs to be checked before deploying the app. One of Airscript’s design goals is to minimize the amount of time developers spend on making their app compliant.

What constitutes a compliant application? The exact requirements for data compliance are industry-specific. For example, PCI DSS covers payment card details, HIPAA covers healthcare data, and a range of privacy rules cover personally identifiable information (PII). What these regimes have in common is that they require audit logs that track how sensitive customer data flows into and out of the application.

Let’s suppose an insurance company built a web app to collect the last four digits of a user’s social security number (SSN) before sending them to its CRM. To be compliant, the application must emit two audit events: one for the application capturing the user’s SSN and one for the application sending the SSN to the CRM. Traditionally, web developers need to manually add code throughout the codebase to fire off these audit events. Doing that is not only time-consuming but also error-prone, particularly for apps with complex data processing logic.

With Airscript, developers don’t need to emit audit events manually. Instead, the application runtime automatically tracks sensitive data and fires off audit events at the right time. All developers need to do is mark data as sensitive as it flows into the application and point to where they want the audit events to go. Let’s first look at how Airscript tracks sensitive data!

Data tags

In Airkit’s platform, developers can attach data tags to a runtime value. Data tags are arbitrary metadata that provide additional information about a runtime value. Examples of data tags include PII, PCI, and HIPAA. Once a runtime value is tagged, the Airscript runtime is able to preserve the data tags even if it gets transformed or combined with another runtime value. Let’s look at a few examples!

First, imagine that there is a variable called pii_var that holds data tagged as PII.

Screen Shot 2023-03-09 at 12.47.52 AM.png

The output of the expression above is a text tagged as PII. This is because an UPPERCASE-ed version of a PII text is still PII.

In the following example, we concatenate a PII text and a PCI text. The output of the expression is a value tagged as both PCI and PII, since the combined string contains substrings that are PII and PCI.

Screen Shot 2023-03-09 at 12.54.26 AM.png

If a runtime value is copied to another variable, the data tags are copied as well. In the following LET-IN expression, we create two alias variables, a and b, and assign them to pii_var and pci_var respectively. The output of the expression is an object containing two key-value pairs: “foo”, which is mapped to data tagged as PII, and “bar”, which is mapped to data tagged as PCI.

Screen Shot 2023-03-09 at 12.56.01 AM.png

In Airscript, assignment (which we call “Set Variable”) also retains data tags. In the following example, we set the variable foo to a nested object containing sensitive data.

Screen Shot 2023-03-13 at 8.50.48 AM.png

If we look at Airkit’s debugger below, we can see that all the data tags now residing in the nested object are retained.

toggle-nested.gif

Runtime values can only be tagged when data flows into the application runtime. The figure below is a text input whose job is to collect someone’s driver’s license number. In this example, whatever the user types will be stored in the variable pii_var which will hold a runtime value tagged as PII.

Screen Shot 2023-03-13 at 5.40.30 PM.png

In the next section, we uncover how Airscript’s runtime is able to track data tags for its runtime values with a technique called metadata boxing.

Metadata Boxing

In programming languages, boxing is the process of converting a primitive type to an object type. In Java, you may choose to convert the primitive int into a boxed Integer. The exact meaning and behavior of boxing depend on the language you’re using. At Airkit, we developed a special form of boxing called metadata boxing for Airscript.

The core idea behind metadata boxing is that the runtime not only boxes all primitive data types, it also attaches a bag of metadata to each boxed object, called data tags. The data tags of each boxed value can be used to store arbitrary metadata about the underlying data. In Airscript’s case, the boxed value stores the sensitivity of the data (e.g. PII, PCI, HIPAA). By having the data tags attached to the raw value, the tags will automatically be propagated alongside the data across function calls.
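To make the idea concrete, here is a minimal TypeScript sketch of a boxed value that carries a bag of tags. It is not Airkit’s implementation: Boxed, Tags, and sensitivityOf are hypothetical names chosen for illustration, and the tag values are placeholders.

```typescript
// Illustrative only: Boxed, Tags, and sensitivityOf are hypothetical names.
type Tags = { readonly [tagName: string]: unknown };

class Boxed<T> {
  readonly raw: T;     // the underlying primitive value
  readonly tags: Tags; // the bag of metadata that travels with it

  constructor(raw: T, tags: Tags = {}) {
    this.raw = raw;
    this.tags = tags;
  }
}

// Any function that receives a box also receives its metadata,
// without the caller having to pass the tags separately.
function sensitivityOf(value: Boxed<string>): string {
  const tagNames = Object.keys(value.tags);
  return tagNames.length > 0 ? "sensitive: " + tagNames.join(", ") : "not sensitive";
}

const ssnLastFour = new Boxed("1234", { PII: true });
const favoriteColor = new Boxed("blue");

console.log(sensitivityOf(ssnLastFour));   // sensitive: PII
console.log(sensitivityOf(favoriteColor)); // not sensitive
```

Because the tags live on the same object as the raw value, there is no separate lookup table that could drift out of sync with the data.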

Airscript is compiled into JSON and interpreted by a runtime built in TypeScript. Airscript has primitive data types such as number, string, boolean, date, and null. However, the runtime doesn’t use JavaScript primitives as the runtime representation for these data types. Instead, it uses a boxed data type called AirValue. Each Airscript data type has a corresponding AirValue constructor that instantiates the boxed object.

For example, instead of using JavaScript’s primitive number as the runtime value, the runtime constructs a boxed object with the AirNumber constructor. Similarly, constructors such as AirBoolean, AirString, AirRecord, AirList, and AirNull create the runtime values for the other data types.

To develop a better understanding of what this means, let’s take a look at the following expression:

Screen Shot 2023-01-20 at 11.46.39 PM.png

This Airscript expression is eventually compiled down to the following JS code:

AirNumber(1).plus(AirNumber(3))

Here, the primitive Airscript numbers 1 and 3 are both instantiated as instances of AirNumber. Note that because we don’t use JavaScript primitives, we can’t use JavaScript’s built-in operators like +. Instead, we built our own methods that work on AirValue instances.
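As a rough illustration of what such methods could look like, the sketch below defines a stand-in class with plus and times. SketchNumber is a hypothetical name, and keeping the tags of both operands is an assumption based on the tagging behavior described earlier in this post.

```typescript
// Hypothetical stand-in for AirNumber; the real class is not shown in this post.
type Tags = { readonly [tagName: string]: unknown };

class SketchNumber {
  readonly raw: number;
  readonly tags: Tags;

  constructor(raw: number, tags: Tags = {}) {
    this.raw = raw;
    this.tags = tags;
  }

  // Stands in for the + operator: adds the raw numbers and keeps both operands' tags.
  plus(other: SketchNumber): SketchNumber {
    return new SketchNumber(this.raw + other.raw, { ...this.tags, ...other.tags });
  }

  // Stands in for the * operator in the same way.
  times(other: SketchNumber): SketchNumber {
    return new SketchNumber(this.raw * other.raw, { ...this.tags, ...other.tags });
  }
}

const sum = new SketchNumber(1).plus(new SketchNumber(3));
const taggedProduct = new SketchNumber(2, { PII: true }).times(sum);

console.log(sum.raw);                         // 4
console.log(taggedProduct.raw);               // 8
console.log(Object.keys(taggedProduct.tags)); // contains "PII"
```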

Let’s look at a more complex example – Let-In.

Screen Shot 2023-02-09 at 3.47.22 PM.png

Under the hood, the above Airscript expression is compiled and evaluated as follows (simplified version):

(function(units: AirNumber, unit_price: AirNumber) {
   return AirRecord({
      "total_price": units.times(unit_price).times(AirNumber(1.08)),
      "units": units
   })
})(a.plus(b), c.get(0))

This example is quite complex, but the takeaway is that functions in the Airscript runtime take in AirValue instances as inputs. The function body is also implemented with the API AirValue exposes. This essentially means that Airscript compiles down into lower-level AirValue methods.

Boxing Airscript primitives comes with trade-offs. JavaScript engines can usually store primitive values more compactly than objects, which have to be allocated on the heap. As a result, boxing can be slower and more memory-intensive. It also requires us to build a suite of methods for each AirValue type to implement Airscript’s evaluator. However, having end-to-end control over the runtime data structure unlocks data tagging, which we will cover in the next section.

Data Tagging Internals

Earlier, we mentioned that each AirValue instance has a bag of metadata, otherwise known as data tags. Data tagging is the process of adding a tag to an AirValue instance’s tags.

Each AirValue instance has a setTag method to add a tag to an AirValue instance. It also has a getTag method to retrieve the tag’s value.

const tagged = airValue.setTag("tagName", "tagValue")
const tagValue = tagged.getTag("tagName") // returns "tagValue"
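One way to read the snippet above is that setTag returns a new tagged instance rather than changing the receiver. The post does not say whether the real setTag copies or mutates, so the copy-on-write behavior in this sketch is an assumption, and SketchValue is a hypothetical name.

```typescript
// Hypothetical sketch; copy-on-write tagging is an assumption, not a documented behavior.
type Tags = { readonly [tagName: string]: unknown };

class SketchValue {
  readonly raw: unknown;
  private readonly tags: Tags;

  constructor(raw: unknown, tags: Tags = {}) {
    this.raw = raw;
    this.tags = tags;
  }

  // Returns a new instance carrying the extra tag; the receiver is not modified.
  setTag(tagName: string, tagValue: unknown): SketchValue {
    return new SketchValue(this.raw, { ...this.tags, [tagName]: tagValue });
  }

  // Returns the tag's value, or undefined when the tag is absent.
  getTag(tagName: string): unknown {
    return this.tags[tagName];
  }
}

const untagged = new SketchValue("hello");
const taggedValue = untagged.setTag("tagName", "tagValue");

console.log(taggedValue.getTag("tagName")); // tagValue
console.log(untagged.getTag("tagName"));    // undefined
```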

To tag data as sensitive, developers can just call setTag as follows:

const taggedPii = airValue.setTag("PII", ...)
const taggedPci = airValue.setTag("PCI", ...)
const taggedHipaa = airValue.setTag("HIPAA", ...)

Earlier, we demonstrated that AirValue instances have methods like plus, times, get, etc. All AirValue methods are implemented in a way that preserves the data tags of the AirValue instances. To showcase this, let’s look at the get method in action:

const airValue = AirRecord({ "foo": "bar" })
const tagged = airValue.setTag("PII", "abc123")
const childAirValue = tagged.get("foo") // AirString
childAirValue.getTag("PII") // abc123

In this code snippet, we create an AirRecord with the value { "foo": "bar" }. We then tag the entire object as PII. If we access the foo property, the child AirValue also has the PII tag with the same tag value. In this example, "abc123" is just an arbitrary string to demonstrate that a tag’s value is also propagated to its child.
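A possible shape for this parent-to-child propagation is sketched below. SketchRecord and SketchString are hypothetical stand-ins for AirRecord and AirString, and letting a child’s own tag win over the record’s tag of the same name is an assumption the post does not address.

```typescript
// Hypothetical stand-ins for AirString and AirRecord.
type Tags = { readonly [tagName: string]: unknown };

class SketchString {
  readonly raw: string;
  readonly tags: Tags;

  constructor(raw: string, tags: Tags = {}) {
    this.raw = raw;
    this.tags = tags;
  }
}

class SketchRecord {
  readonly fields: { readonly [key: string]: SketchString };
  readonly tags: Tags;

  constructor(fields: { readonly [key: string]: SketchString }, tags: Tags = {}) {
    this.fields = fields;
    this.tags = tags;
  }

  setTag(tagName: string, tagValue: unknown): SketchRecord {
    return new SketchRecord(this.fields, { ...this.tags, [tagName]: tagValue });
  }

  // The returned child carries the record's tags in addition to its own.
  // Assumption: on a name clash, the child's own tag value wins.
  get(key: string): SketchString | undefined {
    const child = this.fields[key];
    if (child === undefined) return undefined;
    return new SketchString(child.raw, { ...this.tags, ...child.tags });
  }
}

const record = new SketchRecord({ foo: new SketchString("bar") }).setTag("PII", "abc123");
const child = record.get("foo");

console.log(child?.raw);         // bar
console.log(child?.tags["PII"]); // abc123
```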

Similarly, if you concatenate two AirString instances with the concat method, the resulting AirString instance combines the tags of the two respective AirValue instances.

const piiString = AirString("foo").setTag("PII", 123)
const pciString = AirString("bar").setTag("PCI", 456)
const combined = piiString.concat(pciString) // "foobar"
combined.getTag("PII") // 123
combined.getTag("PCI") // 456

In the example above, we create two AirString instances and tag them as PII and PCI, respectively. We then concatenate the two AirString instances to yield the combined AirString instance. The combined AirString is both PII and PCI.

In case you’re curious, here is the simplified implementation of the concat method for AirString.

class AirString {
   ...

   public concat(input: AirString): AirString {
      const rawJS = this.getRawJS() + input.getRawJS()
      const mergedTags = mergeTags(this.getTags(), input.getTags())
      // Wrap the concatenated string (not the input) and attach the merged tags
      return AirString(rawJS).setTags(mergedTags)
   }
}

We can see here that concat merges the tags of the two AirString instances. The returned AirString instance contains the combined tags.
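The post does not show mergeTags itself. A minimal sketch, assuming tags are stored as a plain object keyed by tag name, could be a simple union. The rule for two tags with the same name is an assumption here (the left value is kept); the section on combining sensitive data later in this post describes a richer rule based on tagIds.

```typescript
// Hypothetical sketch of a tag merge; the real mergeTags is not shown in the post.
type Tags = { readonly [tagName: string]: unknown };

// Union of both tag bags. Assumption: when both sides carry the same
// tag name, the left-hand value is kept.
function mergeTags(left: Tags, right: Tags): Tags {
  return { ...right, ...left };
}

const merged = mergeTags({ PII: 123 }, { PCI: 456, PII: 999 });

console.log(merged["PII"]); // 123
console.log(merged["PCI"]); // 456
```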

Now that we have a deeper understanding of how AirValue’s APIs preserve data tags, let’s take a look at how Airscript’s evaluator uses the AirValue API.

More Airscript expressions!

In the following expression, assume piiVar is a variable that holds PII data and pciVar is a variable that holds PCI data.

Screen Shot 2023-03-07 at 4.43.55 PM.png

The Airscript evaluator takes the expression above and returns an AirString that is both PII and PCI. Text Interpolation is Airscript’s counterpart to JavaScript’s template literals. It breaks the expression down into the list of subexpressions that form the string: the literal text “My address is”, UPPERCASE(piiVar), the literal text “and my primary doctor is”, and pciVar. The runtime evaluates each subexpression and feeds the results to the TextInterpolation evaluator.

Here is the simplified version of Text Interpolation’s evaluator:

function TextInterpolation(sections: AirString[]): AirString {
   // Start from an empty, untagged AirString and append each section in order
   return sections.reduce((acc, section) => acc.concat(section), AirString(""))
}

We can see that the evaluator simply loops over the sub-text and calls the concat method that we covered in the previous section to stitch them together. Since concat retains the tags of the AirString it deals with, Text Interpolation’s evaluator will also retain the tags of its subcomponents.

UPPERCASE is a built-in Airscript function, and here is a simplified version of it:

function UpperCase(airValue: AirString): AirString {
   const rawJS = airValue.getRawJS()
   const upperCased = rawJS.toLocaleUpperCase()
   const tags = airValue.getTags()
   return AirString(upperCased).setTags(tags)
}

As we can see here, UpperCase takes in an AirString and extracts the raw JS string with the getRawJS method. It uppercases that string and then creates a new AirString instance carrying the tags from the original AirString. This is generally how built-in functions ensure that their output retains the tags in a way that makes semantic sense.
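The same unwrap, transform, re-wrap pattern can be factored into a small helper. The sketch below is only an illustration: SketchString and preservingTags are hypothetical names, and it uses toUpperCase rather than toLocaleUpperCase so that the printed result does not depend on the host locale.

```typescript
// Hypothetical stand-in for AirString.
type Tags = { readonly [tagName: string]: unknown };

class SketchString {
  readonly raw: string;
  readonly tags: Tags;

  constructor(raw: string, tags: Tags = {}) {
    this.raw = raw;
    this.tags = tags;
  }
}

// Wraps a plain string function so that its result carries the input's tags.
function preservingTags(
  transform: (raw: string) => string
): (input: SketchString) => SketchString {
  return (input) => new SketchString(transform(input.raw), input.tags);
}

const upperCase = preservingTags((raw) => raw.toUpperCase());
const trimText = preservingTags((raw) => raw.trim());

const address = new SketchString("  12 oak road ", { PII: true });
const cleaned = upperCase(trimText(address));

console.log(cleaned.raw);               // 12 OAK ROAD
console.log(Object.keys(cleaned.tags)); // contains "PII"
```

A helper like this only fits functions whose output is exactly as sensitive as their input. A function that, say, redacts a value would need its own rule for which tags to keep.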

These examples serve as a summary of how Airscript’s runtime works. Expressions are compiled down into lower-level AirValue methods that manipulate AirValue instances. These lower-level AirValue methods have mechanisms to ensure that tags are retained and combined. Because of this, developers who write Airscript don’t need to worry about how sensitive data flows through the runtime. The runtime will automatically propagate data tags associated with each runtime value.

Data Lineage

Developers want to know the lifecycle of sensitive data across the application runtime. They are interested in learning how the data was captured, how the data was transformed, and whether or not the data was sent to third-party services.

Metadata boxing already allows us to track sensitive data across the application runtime. But we still need a way to group audit events by the piece of data they relate to. In Airkit’s runtime, we assign a unique tagId to each piece of sensitive data. Each time an audit event is emitted, we attach the tagId to the audit event. This way, the audit events can be grouped together by tagId.

The concept of tagId is inspired by distributed tracing. Distributed tracing uses traceId and spanId to trace how task execution propagates across microservices. Instead of traceId and spanId, we use tagId and parentTagIds to track how sensitive data flows across the runtime and external services. So how is tagId attached to AirValue instances? The setTag method we covered earlier automatically returns a tagged AirValue instance with a unique tagId.

The following two figures show two audit events with the same tagId: a “Create” event emitted when sensitive data flows into the runtime and a “Read” event emitted when the sensitive data leaves the application runtime for a third-party service. Note that the two audit events have the same sensitiveDataTagId.

Screen Shot 2023-03-14 at 10.43.38 AM.png
Audit event for PII data flowing into application runtime
Screen Shot 2023-03-14 at 10.46.09 AM.png
Audit event for PII data flowing out of application runtime
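Given events like these, reconstructing the lineage of one piece of data comes down to grouping by sensitiveDataTagId. The sketch below assumes a simplified event shape: only the sensitiveDataTagId field name comes from this post, while AuditEventSketch, action, destination, and groupByTagId are hypothetical.

```typescript
// Hypothetical event shape; only sensitiveDataTagId is named in the post.
interface AuditEventSketch {
  action: "Create" | "Read";
  sensitiveDataTagId: string;
  destination?: string;
}

// Buckets audit events by the tagId of the data they refer to, preserving order.
function groupByTagId(
  events: AuditEventSketch[]
): { [tagId: string]: AuditEventSketch[] } {
  const groups: { [tagId: string]: AuditEventSketch[] } = {};
  for (const auditEvent of events) {
    const bucket = groups[auditEvent.sensitiveDataTagId] || [];
    bucket.push(auditEvent);
    groups[auditEvent.sensitiveDataTagId] = bucket;
  }
  return groups;
}

const lineage = groupByTagId([
  { action: "Create", sensitiveDataTagId: "uuid1" },
  { action: "Create", sensitiveDataTagId: "uuid2" },
  { action: "Read", sensitiveDataTagId: "uuid1", destination: "CRM" },
]);

// The full history of the value tagged uuid1: created, then read by the CRM.
console.log((lineage["uuid1"] || []).map((e) => e.action));
```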

Combining sensitive data

Sometimes, we need to combine tags. For example, developers may choose to concatenate a PII AirString with a HIPAA AirString. In this situation, we would like the new AirString instance to point back to the two tagIds of the AirValue instances it was derived from. To do this, we introduce parentTagIds, an array of the tagIds from which the new AirValue instance was derived. We call a tag created as a result of merging AirValue instances a “derived tag”. Let’s look at how derived tags are generated with the following code snippet.

const piiString1 = AirString("foo").setTag("PII", ...)
const tag1 = piiString1.getTag("PII") // { tagId: "uuid1", parentTagIds: [] }

const piiString2 = AirString("bar").setTag("PII", ...)
const tag2 = piiString2.getTag("PII") // { tagId: "uuid2", parentTagIds: [] }

const combinedString = piiString1.concat(piiString2)
const tag3 = combinedString.getTag("PII")
// { tagId: "uuid3", parentTagIds: ["uuid1", "uuid2"] }

As we can see here, piiString1 and piiString2 are given unique tagIds. When we concatenate the two AirString instances, the combined AirString gets its own tagId plus parentTagIds that point to the data it was derived from, namely piiString1 and piiString2.
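One way such a merge could be implemented is sketched below. This is an assumption-laden illustration, not Airkit’s code: SensitivityTag, newTagId, freshTag, and mergeWithLineage are hypothetical names, a counter replaces UUIDs so the output is predictable, and tags present on only one side are passed through unchanged, which the post does not specify.

```typescript
// Hypothetical tag shape mirroring the { tagId, parentTagIds } objects shown above.
interface SensitivityTag {
  tagId: string;
  parentTagIds: string[];
}
type Tags = { readonly [tagName: string]: SensitivityTag };

// Deterministic ids for the demo; a real runtime would presumably use UUIDs.
let nextId = 1;
function newTagId(): string {
  return "tag-" + nextId++;
}

function freshTag(): SensitivityTag {
  return { tagId: newTagId(), parentTagIds: [] };
}

// When both sides carry the same tag name, create a derived tag that points
// at both parents. Assumption: one-sided tags are passed through unchanged.
function mergeWithLineage(left: Tags, right: Tags): Tags {
  const merged: { [tagName: string]: SensitivityTag } = { ...left };
  for (const tagName of Object.keys(right)) {
    const fromLeft = left[tagName];
    const fromRight = right[tagName];
    if (fromRight === undefined) continue;
    merged[tagName] =
      fromLeft !== undefined
        ? { tagId: newTagId(), parentTagIds: [fromLeft.tagId, fromRight.tagId] }
        : fromRight;
  }
  return merged;
}

const tag1 = freshTag(); // tag-1
const tag2 = freshTag(); // tag-2
const combined = mergeWithLineage({ PII: tag1 }, { PII: tag2 });

console.log(combined["PII"]); // tagId "tag-3" with parents "tag-1" and "tag-2"
```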

The following audit event is emitted when PII data created by merging two other pieces of PII data leaves the application runtime. The parentTagIds that are part of the audit event are highlighted.

Screen Shot 2023-03-14 at 9.03.08 PM.png

Emitting Audit events

One thing we haven’t covered is how the Airscript runtime automatically emits audit events when sensitive data leaves the application runtime.

AirValue is the runtime representation of the data types in Airscript. However, when data leaves the platform (for example, when the developer makes a POST request), the AirValue instance needs to be unwrapped into a raw JavaScript value so that it can be serialized and sent across the network. This is the perfect place to automatically emit audit events because, unless data is flowing to external services across the network, there is no need to unwrap AirValue instances into raw JS values.
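The sketch below illustrates the idea of making the unwrap step the single place where audit events are emitted. It is a simplified assumption of how this could work: SketchString, unwrapFor, ReadAuditEvent, and the example URL are hypothetical, and only the sensitiveDataTagId and parentTagIds field names come from this post.

```typescript
// Hypothetical shapes; only sensitiveDataTagId and parentTagIds appear in the post.
interface SensitivityTag {
  tagId: string;
  parentTagIds: string[];
}

interface ReadAuditEvent {
  action: "Read";
  tagName: string;
  sensitiveDataTagId: string;
  parentTagIds: string[];
  destination: string;
}

class SketchString {
  readonly tags: { readonly [tagName: string]: SensitivityTag };
  private readonly raw: string; // private: the raw value is only reachable via unwrapFor

  constructor(raw: string, tags: { readonly [tagName: string]: SensitivityTag } = {}) {
    this.raw = raw;
    this.tags = tags;
  }

  // Unwraps the raw string for an outbound request and emits one
  // "Read" event per tag attached to the value.
  unwrapFor(destination: string, emit: (auditEvent: ReadAuditEvent) => void): string {
    for (const tagName of Object.keys(this.tags)) {
      const tag = this.tags[tagName];
      if (tag === undefined) continue;
      emit({
        action: "Read",
        tagName,
        sensitiveDataTagId: tag.tagId,
        parentTagIds: tag.parentTagIds,
        destination,
      });
    }
    return this.raw;
  }
}

const emitted: ReadAuditEvent[] = [];
const ssn = new SketchString("1234", { PII: { tagId: "uuid1", parentTagIds: [] } });

// Serializing for the network forces an unwrap, which records the event.
const payload = JSON.stringify({
  ssn: ssn.unwrapFor("https://crm.example.com", (e) => emitted.push(e)),
});

console.log(payload);        // {"ssn":"1234"}
console.log(emitted.length); // 1
```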

In the example below, when the sensitive data is sent to Salesforce, a READ audit event is generated.

salesforce-read-event.gif

Storage

State management is one of the core components in Airscript’s runtime. Something unique about the runtime is that all application state is automatically persisted in Airkit’s distributed document store (built on top of Redis and Postgres). This means that any data set in the state store automatically becomes accessible to concurrent clients and servers in near real time. This is similar to Google Docs or Figma, in which the document state is automatically persisted. For example, if a user is typing into a text input in an Airkit app and accidentally closes the page, the data will still be there when the user reopens the link.

Storing application state poses a new challenge for tracking sensitive data. When the application state contains sensitive data, we need to make sure the sensitivity of the data is persisted along with the data. This way, when the persisted state is accessed by a concurrent client, the instantiated AirValue instances have the correct sensitivity tag in them.

To achieve this, we defined a new data format to store the application state.

For example, suppose we have the following raw data, where we assume SSN is tagged as PII.

{
 "SSN": "123456",
 "pets": ["dog", "cat"]
}

Airscript’s runtime instead stores it as follows:

{
   $$type: "object",
   tags: {},
   properties: {
      SSN: {
         $$type: "string",
         tags: {
            "PII": { tagId: "uuid3", parentTagIds: ["uuid1", "uuid2"] }
         },
         value: "123456"
      },
      pets: {
         $$type: "array",
         tags: {},
         value: [
            { $$type: "string", tags: {}, value: "dog" },
            { $$type: "string", tags: {}, value: "cat" }
         ]
      }
   }
}

This data format stores the tags of the data alongside the value. Each value has a $$type to specify the runtime value type and a tags key-value pair to store metadata about the value, such as its sensitivity. This is the same data format the Airscript runtime uses to serialize and deserialize application state when it is sent between servers and clients under the hood.
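To show how a consumer might work with this format, the sketch below models it as a TypeScript union and recovers the raw data from it. The field names ($$type, tags, value, properties) follow the example above, while the Stored type and the toRaw function are hypothetical, and only the three value types that appear in the example are covered.

```typescript
type Tags = { [tagName: string]: unknown };

// Hypothetical typing of the stored format, limited to the types in the example.
type Stored =
  | { $$type: "string"; tags: Tags; value: string }
  | { $$type: "array"; tags: Tags; value: Stored[] }
  | { $$type: "object"; tags: Tags; properties: { [key: string]: Stored } };

// Strips the metadata and rebuilds the plain data.
function toRaw(node: Stored): unknown {
  switch (node.$$type) {
    case "string":
      return node.value;
    case "array":
      return node.value.map((item) => toRaw(item));
    case "object": {
      const out: { [key: string]: unknown } = {};
      for (const key of Object.keys(node.properties)) {
        const child = node.properties[key];
        if (child !== undefined) out[key] = toRaw(child);
      }
      return out;
    }
  }
}

const stored: Stored = {
  $$type: "object",
  tags: {},
  properties: {
    SSN: {
      $$type: "string",
      tags: { PII: { tagId: "uuid3", parentTagIds: ["uuid1", "uuid2"] } },
      value: "123456",
    },
    pets: {
      $$type: "array",
      tags: {},
      value: [
        { $$type: "string", tags: {}, value: "dog" },
        { $$type: "string", tags: {}, value: "cat" },
      ],
    },
  },
};

// Simulate sending the state to another client as JSON.
const received = JSON.parse(JSON.stringify(stored)) as Stored;
const receivedSsn = received.$$type === "object" ? received.properties["SSN"] : undefined;

console.log(JSON.stringify(toRaw(received)));                  // {"SSN":"123456","pets":["dog","cat"]}
console.log(receivedSsn ? Object.keys(receivedSsn.tags) : []); // contains "PII"
```

Because the tags are part of the serialized document, the receiving side can rebuild tagged values without consulting any other source.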

Conclusion

Programming language design is driven by the core values of the language. For example, one of Rust’s core values is memory safety. As a result, Rust’s designers built ownership into the language to govern how Rust programs manage memory and to push programmers toward writing memory-safe code. Because one of Airscript’s core values is to make it easy for developers to build compliant apps, we built metadata boxing as a first-class runtime primitive to track sensitive data flowing across the runtime and automatically emit audit events.

Having end-to-end control over a programming language is filled with possibilities. It allows us to tailor the language for certain use cases, thus gaining a competitive advantage over more generic programming languages.

Special thanks to Cam Kennedy and Sean Lynch for their huge contributions to the data tagging project!
