Extending F# FPrimitive With European Character Input Validation

F# FPrimitive is a great library for domain modeling, validation, and specification definitions. In practice, though, you may run into gaps. Let's take a look at how we can extend the library with input validation for European characters.

The problem

During input validation on input fields, one has to come up with an ‘accept’ list, and ask: “What kind of characters should we allow?”

Generally, it’s much safer to define what you can accept than to enumerate what you should reject. We can already validate a lot before we even look at the actual characters: origin, content length, syntax, schema, lexical content, encoding. In this blog post, we’ll look at where these steps take place and at the actual content of the input.

So, what’s the problem?

Imagine validating the name of a city or place, for example. If it were an American or British city, we would probably be able to validate the name based on the alphabet, with A-Z and its lowercase variants. As a regular expression, this would be ^[A-Za-z]+$. This is already very strict and would allow us to filter out special characters, such as punctuation or other characters with meaning in code, that could enable injection. However, if we want to validate a Swedish city, such as Malmö, this pattern no longer works.
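A minimal sketch of that limitation, using only .NET's Regex class. The `isAsciiName` helper is a hypothetical name chosen for illustration:

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: accepts only the unaccented letters A-Z and a-z.
let isAsciiName (input: string) =
    Regex.IsMatch(input, "^[A-Za-z]+$")

printfn "%b" (isAsciiName "London")        // true
printfn "%b" (isAsciiName "Malmö")         // false: 'ö' is outside A-Z and a-z
printfn "%b" (isAsciiName "London; DROP")  // false: punctuation and spaces are rejected
```

The strictness that blocks the injection attempt is the same strictness that blocks a perfectly valid Swedish city name.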

European regular expression pattern

I took it upon myself to research all of the special characters used in Europe. Some countries have a lot of special characters, others do not. Nordic countries have å, ä, ö, ø, while Southern countries have ñ, ú, é, á. We are dealing with a large and diverse set of characters. For security experts, this makes things a bit harder, as we have to take all of these different ways of writing into account.

All of these special characters combined, with capitals included, would result in this regular expression pattern:

^[A-Za-zÁáĂăÂâÅåÄäǞǟÃãĄąĀāÆæĆćĈĉĊċÇçĎďḐḑĐđÐðÉéÊêĚěËëĖėĘęĒēĞğĜĝĠġĢģĤĥĦħİıÍíÌìÎîÏïĨĩĮįĪīIJijĴĵĶķĹĺĻļŁłĿŀŃńŇňÑñŅņŊŋÓóÒòÔôÖöȪȫŐőÕõȮȯØøǪǫŌōỌọOEoeĸŘřŔŕŖŗſŚśŜŝŠšŞşṢṣȘșẞßŤťŢţȚțŦŧÚúÙùŬŭÛûŮůÜüŰűŨũŲųŪūŴŵÝýŶŷŸÿȲȳŹźŽžŻżÞþªº]+$

One might wonder if this impacts performance. The pattern is very strict: it is a single anchored character class with one quantifier, so it leaves no room for excessive backtracking and only allows this specific set of characters. Combined with .NET's RegexOptions.Compiled, which trades a one-time compilation cost for faster matching, a reused instance of this pattern keeps the validation cheap.
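As a rough sketch of that setup, the pattern can be compiled once and reused for every check. The character class below is abbreviated for readability, and the names `europeanRegex` and `isEuropeanWord` are assumptions made for this example:

```fsharp
open System.Text.RegularExpressions

// Abbreviated character class; a real project would use the full pattern above.
let europeanPattern = "^[A-Za-zÅåÄäÖöØøÑñÚúÉéÁá]+$"

// Built once and shared, so the compilation cost of RegexOptions.Compiled is paid a single time.
let europeanRegex = Regex(europeanPattern, RegexOptions.Compiled)

// Hypothetical helper: null is rejected up front because Regex.IsMatch throws on null input.
let isEuropeanWord (input: string) =
    not (isNull input) && europeanRegex.IsMatch input

printfn "%b" (isEuropeanWord "Malmö")   // true
printfn "%b" (isEuropeanWord "Malmö!")  // false
```

Creating a new compiled Regex on every call would undo the benefit, which is why the instance lives outside the function.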

It’s also worth mentioning that validation can go hand in hand with sanitization. We can validate multiple words and reject two consecutive space characters, but sanitization can also prepare the data for us. It could limit the number of space or dash characters used, make everything lower-case, or keep only the first letter upper-case. This is too project-specific to discuss in detail here. However, it should be taken into account when validating the input.
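To make the idea concrete, here is one possible sanitization step, assuming a project that wants single spaces, single dashes, and only the first letter in upper-case. The `sanitizeName` function is hypothetical; the right rules depend on the project:

```fsharp
open System
open System.Text.RegularExpressions

// Hypothetical sanitizer: trims the input, collapses runs of spaces and dashes,
// lower-cases everything, and then upper-cases only the first character.
let sanitizeName (input: string) =
    let trimmed = input.Trim()
    let singleSpaced = Regex.Replace(trimmed, " {2,}", " ")
    let singleDashed = Regex.Replace(singleSpaced, "-{2,}", "-")
    if singleDashed.Length = 0 then
        singleDashed
    else
        let lower = singleDashed.ToLowerInvariant()
        string (Char.ToUpperInvariant(lower.[0])) + lower.Substring(1)

printfn "%s" (sanitizeName "  MALMÖ ")         // Malmö
printfn "%s" (sanitizeName "saint--denis")     // Saint-denis
```

Whether sanitization runs before or after the character validation is itself a design decision: sanitizing first makes the validation more forgiving, validating first keeps the accept list as the first line of defense.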

Validation extensibility model

All theory aside, we should determine how we can integrate this into our projects. This is a very generic, overarching validation. It could be widely reused across projects, but it’s probably not generic enough to be included in third-party libraries. A shared code library or an umbrella project could both be good places for these kinds of validations.

As an example, here’s how we could extend the F# FPrimitive library to include character validation on European characters:
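The sketch below only illustrates the shape of such an extension and does not call FPrimitive itself: `EuropeanSpec.european` and `CityName` are hypothetical stand-ins, and the character class is abbreviated. In a real project, the same check would be plugged into the library's own specification functions.

```fsharp
open System.Text.RegularExpressions

// Hypothetical stand-in for a reusable validation requirement.
module EuropeanSpec =
    // Abbreviated character class; a real project would use the full pattern.
    let private regex =
        Regex("^[A-Za-zÅåÄäÖöØøÑñÚúÉéÁá]+$", RegexOptions.Compiled)

    // Returns the input when it only contains accepted characters,
    // otherwise the given error message.
    let european (message: string) (input: string) : Result<string, string> =
        if not (isNull input) && regex.IsMatch input then Ok input
        else Error message

// Domain type with a private constructor, so a CityName can only be
// obtained through validation.
type CityName =
    private | CityName of string
    static member create (input: string) =
        EuropeanSpec.european "city name should only contain European letters" input
        |> Result.map CityName
    member this.Value =
        match this with
        | CityName value -> value

match CityName.create "Malmö" with
| Ok city -> printfn "valid: %s" city.Value
| Error message -> printfn "invalid: %s" message
```

Keeping the requirement in its own module is what makes it shareable: every domain type that needs European character validation can reuse `european` with its own error message.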

Let’s also include a purely C# example. I’ll choose FluentValidation here, as it’s a very popular library:

Conclusion

Input validation is a very important topic in software security. It’s prone to errors and should be looked into very closely. Determining which kinds of input could occur is an important aspect of this. This post looked into the possibility of validating European words. What you should not do is accept every kind of input, or rely only on the most basic types that your programming language provides. That doesn’t reflect the domain and carries risks on many levels.

Balancing how much to validate and how much to sanitize is a project-specific decision and should be discussed carefully. Allow as much as possible within the bounds of your domain, but be able to ‘help’ the input along with secure sanitization. That’s the sweet spot, in my opinion.

Thanks for reading,
Stijn
