
From Untrusted Input to Strict Model with Layered JSON Parsing, using FPrimitive in F# & C#

How should we accept untrusted input and transform it into a strict domain model? This post describes a possible process in which JSON content is loaded, checked for syntax, sanitized, deserialized, and validated before we can be certain of a strict model.

Introduction

Json.NET is of course the recommended library for the JSON tokenization; but I will also use a library of my own called FPrimitive, because it contains building blocks to create strict domain models and other functionality to help you model your domain (such as optional, read-once, railway-oriented types).

If you consider that the most common source of security issues is improper input validation (e.g. injection), we should invest a great deal of time in determining what is ‘correct’ and ‘incorrect’ in our application. What is allowed? What makes sense? What do you trust?

Untrusted Input

What is ‘untrusted input’? Sometimes I get a bit paranoid about this, but it’s scary how paranoia can link up with reality. A good starting point is to look at everything that your application can’t control. This means everything that you receive from an HTTP request, but also from a local file, a configuration value, an environment variable, … everything outside your application. Even a custom implementation of a publicly exposed interface should be considered ‘unsafe’ or ‘untrusted’.

This post will look at how the following JSON can be received and transformed into a strict domain model:

{ "author": "Philip K. Dick", "isbn13": "978-0-575-07921-2", "pages": 202 }

The JSON represents a book with an author, the ISBN13 code, and the number of pages the book has. Of course this is too small to be real; but I assure you that it’s big enough to make my point.

Pre-Guards

Before we go any further, keep in mind that we start from the content itself and not from an HttpRequestMessage or FileInfo. It’s important that before you even touch the content, you check the size of the content, the content type, the encoding, …
I don’t want to accept 5 GB of text input for the author name, for example ({ "author": "aaaaaaaaaaaaaaaaaaaaaaa..." }). We can check the length of separate fields later; but the entire size, type, encoding, etc. should be checked beforehand.
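Such pre-guards could be sketched as follows. This is a minimal illustration, not the post’s original code: the 1 MB limit and the strict UTF-8/`application/json` requirements are assumptions you should tailor to your own domain.

```csharp
using System;
using System.Net.Http;

static class PreGuards
{
    // Illustrative limit -- choose a value that fits your domain.
    const long MaxContentLength = 1024 * 1024; // 1 MB

    public static bool Accept(HttpRequestMessage request)
    {
        var headers = request.Content?.Headers;
        if (headers is null) return false;

        // Reject unknown or excessive sizes before ever reading the body.
        long? length = headers.ContentLength;
        if (length is null || length <= 0 || length > MaxContentLength)
            return false;

        // Only accept the media type and encoding we expect.
        var contentType = headers.ContentType;
        if (contentType?.MediaType != "application/json")
            return false;
        if (contentType.CharSet != null &&
            !contentType.CharSet.Equals("utf-8", StringComparison.OrdinalIgnoreCase))
            return false;

        return true;
    }
}
```

Only when these cheap header checks pass do we spend any effort on the body itself.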

Lexical Content of Data

If you have checked the content’s size, type, encoding, … we can start by loading the JSON into a workable model, using JObject.Load from Newtonsoft.Json.Linq.

You could debate whether or not to throw for the arguments, or whether the content is seekable, but that’s beside the point. Note that at this stage it’s not important how the JSON content is structured or what it represents.

After we have determined that we don’t have a huge piece of content and can safely load it without blowing up our memory, we need to find out whether it is indeed something that can be considered valid JSON.

  • Outcome is a type from the FPrimitive package that allows you to work with Success/Failure values instead of throwing exceptions. Everything we do here is related to reading content, so in a web app we could respond with a BadRequest. There is no need to throw exceptions to control the flow.
  • The JsonTextReader is initialized with a set of settings that ignore all additional info and throw on duplicate property entries. So we definitely have something that is valid JSON, and we have made sure that we fail fast on things we can already rule out as invalid.
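A sketch of such a fail-fast loading step could look like this, using Json.NET’s JsonLoadSettings (the helper name LoadStrict is mine, not the post’s):

```csharp
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static class JsonLoading
{
    // Load untrusted text into a JObject, failing fast on anything
    // that is not well-formed JSON.
    public static JObject LoadStrict(TextReader content)
    {
        var settings = new JsonLoadSettings
        {
            // Reject documents that define the same property twice.
            DuplicatePropertyNameHandling = DuplicatePropertyNameHandling.Error,
            // Ignore additional information such as comments and line info.
            CommentHandling = CommentHandling.Ignore,
            LineInfoHandling = LineInfoHandling.Ignore
        };

        using var reader = new JsonTextReader(content);
        // Throws JsonReaderException on malformed input or duplicates,
        // which the caller can translate into a Failure/BadRequest.
        return JObject.Load(reader, settings);
    }
}
```

In the post’s setup the thrown JsonReaderException is caught at the boundary and mapped to an Outcome failure rather than bubbling up through the application.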

Note that well-formed JSON content does not yet mean we have valid JSON syntax or a valid JSON model. That comes later. The following inputs are still valid, even though we know they are not correct JSON representations of a book:

  • { 'product': 'Microsoft Surface Pro 6', 'price': '729' }
  • {}
  • {'👍': ''}

But we have already rejected some inputs: malformed JSON and duplicated properties are already handled.

  • { 'author': 'Philip K. Dick', 'author': 'Richard K. Morgan' }
  • <Book><Author>Philip K. Dick</Author></Book>
  • åß∂ƒ©˙∆˚¬…æœ∑®†¥¨ˆøπ“‘¡™£¢∞§¶•ªº–≠

For the F# devs: the Outcome type is a C# representation of F#’s built-in Result type.

Syntax of Data

After we have something that can be considered valid JSON, we then check if the content matches a predefined JSON syntax or schema.

Firstly, scan through the data to see that it contains the expected structure without focusing too much on the details. You’re checking whether you have the expected JSON fields, and whether those fields contain a value that CAN be valid.
An example of this is the number of pages in the book. If the JSON contains a field "pages" with a string value instead of a number, we can already rule out that the JSON adheres to the expected schema, without closely inspecting, deserializing, or validating any value.

With the Newtonsoft.Json.Schema package, we can use the JSchema model to create a schema for our book model.
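Such a schema for the book model might be sketched as follows. The exact constraints (field lengths, the 1000-page maximum) are assumptions matching the rejection examples below, not necessarily the author’s original values:

```csharp
using Newtonsoft.Json.Schema;

// Illustrative JSON schema for the book wire format:
// all three fields required, no extra fields allowed,
// each field constrained to a plausible shape.
JSchema bookSchema = JSchema.Parse(@"{
  'type': 'object',
  'additionalProperties': false,
  'required': [ 'author', 'isbn13', 'pages' ],
  'properties': {
    'author': { 'type': 'string', 'minLength': 1, 'maxLength': 100 },
    'isbn13': { 'type': 'string', 'minLength': 13, 'maxLength': 17 },
    'pages':  { 'type': 'integer', 'minimum': 1, 'maximum': 1000 }
  }
}");
```

A JObject can then be checked against it with the IsValid extension method, which can also collect the individual error messages: `jObject.IsValid(bookSchema, out IList<string> errors)`.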

The previous inputs that were considered ‘valid JSON’ are now all rejected. On top of that, we now have a first syntax verification of our data, which makes the following invalid as well:

  • { 'author': '', 'title': 'Ubik', 'isbn13': '978-0-575-07921-2000000', 'pages': 1001 }
    • Author should not be empty
    • ‘Title’ field is not recognized
    • ISBN13 field is too long
    • Pages is above 1000
  • { 'isbn13': 9780575079212, 'pages': '202' }
    • Author field is not found
    • ISBN13 is a number, not a string
    • Pages is a string, not a number

Note that we have yet to create any specific class, data-transfer object, or domain model; and yet we have already inspected our content before any domain validation has taken place.

Sanitization

After the JObject is validated with jobject.IsValid(schema), we know that we have valid JSON content that respects the syntax of our book model. Now we can deserialize the JSON content into a model of our own creation, for easier access and to eventually create our strict domain model. Note that we DON’T make our domain model serializable.

It’s important to note that the object that is transferred over the wire is not the same as the model we use internally in our application. The model over the wire can be represented as a totally different structure than the domain model, which means that external changes will not affect how we have tightened our domain model.

Very important to keep in mind.
The following serialization class shows how the book can be structured into a data-transfer object, or DTO. This class will later be used when we map to a domain model. Before we do that, there is also a place for sanitization: restructuring data before we actually use it. This can mean removing characters, only allowing certain characters, adding headers/trailers, etc.

Personally, I’m more comfortable placing this inside the DTO object, because this model is the first one that stores the values separately. Sanitization can also happen at a previous level, of course, by removing invalid characters from the entire document. In XML, this could be a known set of invalid characters.
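A minimal sketch of such a DTO, with sanitization applied on assignment, could look like this. The class and property names mirror the wire format; the specific sanitization (trimming, stripping control characters) is my own illustration rather than FPrimitive’s built-in functionality:

```csharp
using System.Linq;
using Newtonsoft.Json;

// Data-transfer object for the book wire format.
// Sanitization happens here, before any domain validation.
public class BookJson
{
    private string _author = string.Empty;

    [JsonProperty("author")]
    public string Author
    {
        get => _author;
        // Restructure the value on assignment:
        // trim whitespace and drop control characters.
        set => _author = new string(
            (value ?? string.Empty).Trim()
                                   .Where(c => !char.IsControl(c))
                                   .ToArray());
    }

    [JsonProperty("isbn13")]
    public string Isbn13 { get; set; }

    [JsonProperty("pages")]
    public int Pages { get; set; }
}
```

Note that this DTO is deliberately dumb: it holds sanitized values, but the real rules about what a valid book is still live in the domain model.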

The message is we should be ‘liberal’ about what we receive and ‘conservative’ about what we send.

The FPrimitive library contains a basic set of sanitization functionality to blacklist/whitelist and remove/replace characters, restructuring content before domain validation. It’s all about layering, not about the strength of each separate layer.

Deserialization

Now, starting from a JObject, we can deserialize to a DTO object with a specific JsonSerializer, setting some extra options to control the deserialization process. TypeNameHandling = None is an important setting. The SerializationBinder, which controls the types that may be deserialized, is also interesting to investigate.
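This step could be sketched as follows; the BookJson DTO name is assumed from the later sections, and MissingMemberHandling is an extra hardening choice of mine, not mandated by the post:

```csharp
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static class Deserialization
{
    // Deserialize an already schema-validated JObject into the DTO.
    public static BookJson ToBookJson(JObject jObject)
    {
        var serializer = JsonSerializer.Create(new JsonSerializerSettings
        {
            // Never let the payload decide which .NET types get created:
            // TypeNameHandling other than None enables well-known
            // deserialization attacks.
            TypeNameHandling = TypeNameHandling.None,
            // Fail instead of silently ignoring unexpected members.
            MissingMemberHandling = MissingMemberHandling.Error
        });

        return jObject.ToObject<BookJson>(serializer);
    }
}
```

Since the schema check already ran, this conversion should only fail on programming errors, not on attacker-controlled input.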

Semantics of Data

At this point, we have valid JSON that adheres to a predefined syntax and has been sanitized into something that is ready to be validated inside the domain. We have a BookJson data-transfer object that will pass the information along to the domain model of our book. But first we have to create that book model.

The following model again uses the FPrimitive library to describe a set of specifications for when the book model is considered valid.

Note that I skipped the equality comparison, string representation, etc., because they are automatically generated in the later F# variant.
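To show the shape of such a strict model, here is a hand-rolled sketch with the same intent; in the post the manual guard clauses are expressed with FPrimitive’s specification building blocks instead, and the concrete rules (the ‘978’ prefix, the 1000-page maximum) are assumptions drawn from the earlier rejection examples:

```csharp
using System;

// Strict domain model: instances can only be created through Create,
// so an existing Book is valid by construction.
public sealed class Book
{
    public string Author { get; }
    public string Isbn13 { get; }
    public int Pages { get; }

    private Book(string author, string isbn13, int pages)
    {
        Author = author;
        Isbn13 = isbn13;
        Pages = pages;
    }

    public static Book Create(string author, string isbn13, int pages)
    {
        if (string.IsNullOrWhiteSpace(author))
            throw new ArgumentException("Author should not be blank", nameof(author));
        if (isbn13 is null || !isbn13.StartsWith("978-"))
            throw new ArgumentException("ISBN13 should start with '978-'", nameof(isbn13));
        if (pages < 1 || pages > 1000)
            throw new ArgumentOutOfRangeException(nameof(pages), "Pages should be between 1 and 1000");

        return new Book(author, isbn13, pages);
    }
}
```

With FPrimitive, the throwing guard clauses become composable specifications whose failures accumulate into an Outcome instead of an exception.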

FPrimitive already helps a lot in this regard; but let me show you how the domain model is represented in F#. All the previous code is almost exactly the same in F#, except the domain model, because we can use some F#-specific functionality.

F# as a language is by default a lot safer, more correct, and more secure than the C# variant. So my first recommendation would be to consider using F#, or a combination of F# and C#, in your next project.

Conclusion

Probably the most common approach to parsing JSON in code is the JsonConvert.DeserializeObject<> method: parsing the content directly into a class that you use in the rest of your code, which leaves a lot of possible injection and other security risks in your application. Use a layered approach instead, where you start with verifying the entire content size, content type, and encoding, and then check the:

  1. Lexical Content of Data: to determine if the content is indeed something that is considered valid JSON,
  2. Syntax of Data: to determine if the content is structured in the correct format,
  3. Semantics of Data: to determine if the content is completely valid and “makes sense” in the current application,

This leaves the attacker with a smaller set of options at their disposal. Of course, it’s never entirely safe; but with this layered approach it’s definitely safer.

Thanks for reading and stay safe.
