Friday, September 07, 2012

Filtering out illegal characters using Guava

The other day, I needed to validate a text field against a whitelist of characters. Actually, lots of text fields each needed their own whitelist, but let's stick to one for the sake of example.
The text field "First name" may only contain these characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz°Žµ·žŒœŸÀÁÂÃÄÅÆ
ÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ
Þßàáâãäåæçèéêëìíîï    
ðñòóôõöøùúûüýþÿ
... and the user should be told which illegal characters she has entered in the field, if any.

The knee-jerk reaction to this kind of logic is regular expressions, but my knee-jerk reaction to regexps is to avoid them. The character set above probably doesn't map onto any of the predefined regexp character classes, so the pattern would probably be nasty anyway.

[Image: some guava fruits for coloring up this blog post]

So, let's put Google Guava to some use. It has some nifty utilities for working with sets and other collections.

If you consider the user's input characters as one set, and the whitelist characters above as a second set, you can find the characters that appear in the first set but not in the second (the set difference) like this (see the whole class here):

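The embedded gist is not reproduced here, so the following is only a minimal sketch of the set-difference idea, not the original class. It assumes Guava's `Sets.difference`, `ImmutableSet.copyOf` and `Chars.asList`; the class name `IllegalCharacterFinder` and its method names are hypothetical, and it assumes the space character is not part of the whitelist.

```java
import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Sets;
import com.google.common.primitives.Chars;

import java.util.Set;

// Hypothetical class name; a sketch, not the class from the original gist.
public class IllegalCharacterFinder {

    // Whitelist for the "First name" field, copied from the list above.
    // Assumption: the space character is not allowed.
    private static final String FIRST_NAME_WHITELIST =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "abcdefghijklmnopqrstuvwxyz°Žµ·žŒœŸÀÁÂÃÄÅÆ"
            + "ÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ"
            + "Þßàáâãäåæçèéêëìíîï"
            + "ðñòóôõöøùúûüýþÿ";

    private static final Set<Character> ALLOWED = toSet(FIRST_NAME_WHITELIST);

    // Turns a string into a set of its distinct characters,
    // keeping the order of first appearance.
    static Set<Character> toSet(String s) {
        return ImmutableSet.copyOf(Chars.asList(s.toCharArray()));
    }

    // Returns the characters that occur in the input but not in the whitelist.
    static Set<Character> illegalCharacters(String input) {
        return Sets.difference(toSet(input), ALLOWED);
    }
}
```

`Sets.difference` returns an unmodifiable view rather than a copy, which is fine for a short-lived validation result. An empty result means the input is valid.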

That's pretty much it. I've put a whole example project, along with tests, on GitHub.

Here are the tests showing off some usage:
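The test gist is not reproduced here either. As a stand-in, here is a small self-contained sketch of the kind of usage such tests might exercise, written as a plain `main` method rather than in the test framework the project uses. The class `FirstNameValidationDemo`, its methods and the shortened ASCII-only whitelist are all hypothetical; `Joiner` is Guava's string joiner, used to build the message shown to the user.

```java
import com.google.common.base.Joiner;
import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Sets;
import com.google.common.primitives.Chars;

import java.util.Set;

// Hypothetical demo class; names and whitelist are for illustration only.
public class FirstNameValidationDemo {

    // Shortened whitelist (plain letters only) to keep the example small.
    static final String WHITELIST =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Distinct characters of the input that are missing from the whitelist.
    static Set<Character> illegalCharacters(String input, String whitelist) {
        Set<Character> entered = ImmutableSet.copyOf(Chars.asList(input.toCharArray()));
        Set<Character> allowed = ImmutableSet.copyOf(Chars.asList(whitelist.toCharArray()));
        return Sets.difference(entered, allowed);
    }

    // Builds the message that tells the user which characters to remove.
    static String describe(String input, String whitelist) {
        Set<Character> illegal = illegalCharacters(input, whitelist);
        return illegal.isEmpty()
                ? "OK"
                : "Illegal characters: " + Joiner.on(", ").join(illegal);
    }

    public static void main(String[] args) {
        System.out.println(describe("John", WHITELIST));   // prints "OK"
        System.out.println(describe("J0hn!", WHITELIST));  // prints "Illegal characters: 0, !"
        System.out.println(describe("J00hn", WHITELIST));  // prints "Illegal characters: 0"
    }
}
```

Because the input is turned into a set first, each illegal character is reported only once, in the order the user first typed it.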

I will probably be adding some more Guava examples to that project over time.