Thursday 23 May 2019

Fuzzy Matching in IDM


Overview

Crumbs.  It's been a while since I last blogged!
I was recently working with a customer that asked if we had the capability to perform 'fuzzy' searching across the customer data that can be held in the ForgeRock Identity Platform.  In this case, the customer data was being managed in the ForgeRock Identity Management (IDM) component of the platform.  IDM requires an underlying datastore to hold the data, such as customer information.  IDM provides support for several different datastores, including PostgreSQL, MySQL, and ForgeRock Directory Services.  Each of these datastores have different approaches to storing and retrieving data which means they have differing query syntaxes.  IDM provides the developer an abstraction layer to these different technologies so that a consistent approach to storing and accessing data is provided.  It does this by translating RESTful IDM queries into datastore specific queries.  This translation is defined by a configuration file - a different one for each datastore technology supported.  The capability of the product allows basic querying with multiple fields, filtering, sorting and paging for large datasets.  The basic querying allows partial matches of data, for example, 'contains', 'starts with', 'less than', 'greater than' as well as equality.  This is great for most scenarios but does not offer 'fuzzy' matching.

What do I mean by 'fuzzy' matching?
Take the name 'Stephen' - is that the same, or different to 'Steven'?  Sara/Sarah?  Stuart/Stewart?  (I totally recommend watching this: https://www.youtube.com/watch?v=-n7gNLQoG6Y) Well of course they're different, but, as humans, we can also recognise that they sound the same despite being spelt differently.  
However computers don't naturally understand this nuance. If you ask it to find all people with the name Steven, it won't find people with the name Stephen.  Ok, well let's ask it to find everyone with the name that 'starts with' "Ste".  Great, we get everyone called Steven and Stephen.  And Stewart. And Stelios. And Stein. And Sterling…. you get the idea.

So what we need is fuzzy matching… everyone that sounds similar to 'Steven'.

Fuzzy Matching

Fortunately several methods have been devised to accommodate this.  Also, fortunately, several datastores provide native support for these methods.
Let's take one one of these methods: Soundex, and one of these datastores: PostgreSQL.
Firstly you need to enable fuzzy-matching in PostgreSQL by running this command in psql:
CREATE EXTENSION fuzzystrmatch;

Now we have access to fuzzy matching functions such as Soundex.  (See: https://www.postgresql.org/docs/9.6/fuzzystrmatch.html for the details on this extension and the other functions it offers)

So we can execute this query:
SELECT soundex('steven');
which returns the 4 character soundex code that represents 'steven':
S315

Compare that to:
SELECT soundex('stephen');
And note that the soundex code is identical:
S315

Also consider:
SELECT soundex('stelios');
And note the code is different:
S342

Now we can apply that to queries, for example:
SELECT * FROM customers WHERE soundex(firstName) = soundex('stephen');
This will return all customers that have firstName values that sound like 'stephen', which includes 'stephen' and 'steven', but not 'stelios'.

IDM Repo commands and parameterised queries

Great, how do we expose that in IDM?
Remember the configuration file that translates RESTful IDM queries into datastore specific queries?  Well, we're going to make use of that.
IDM exposes 'commands' defined within the configuration file on its REST endpoints.
IDM also supports parameterised queries on these repository commands. See here for more info: https://backstage.forgerock.com/docs/idm/6.5/integrators-guide/index.html#parameterized-queries

In the example from the link above we see that the repository configuration file has a 'parameterised command' defined: 
"query-all-ids" : "SELECT objectid FROM ${_dbSchema}.${_table} LIMIT ${int:_pageSize} OFFSET ${int:_pagedResultsOffset}",

And therefore, to reference this command on the REST endpoint you use this query string:
?_queryId=query-all-ids

Because it is parametrised the URL indicates the managed object on which this query should run, so this command would be called similar to this:
curl  --header "X-OpenIDM-Username: openidm-admin"  --header "X-OpenIDM-Password: openidm-admin"  "http://localhost:8080/openidm/managed/user?_queryId=query-all-ids"

In this case, the query would run against the managed user objects.  But the same query could be called like this:
curl  --header "X-OpenIDM-Username: openidm-admin"  --header "X-OpenIDM-Password: openidm-admin"  "http://localhost:8080/openidm/managed/role?_queryId=query-all-ids"
which would query managed roles instead.

What this means is that we can expose native datastore queries as 'parameterised commands' through the REST interface in a supported way.
So we need to put these two concepts together.  Let's write a parameterised command that leverages soundex which is then exposed through the REST interface.

Putting it together

Firstly, I'm going to assume a generic schema in the IDM repository, rather than fixed (for the definition, see here: https://backstage.forgerock.com/docs/idm/6.5/integrators-guide/index.html#explicit-generic-mapping)

Also note that as PostgreSQL supports JSON fields natively, and IDM makes use of this, when querying the IDM tables we need to take this into account.  For other datastores (for example MySQL, IDM uses separate property tables for searchable fields).
I also want to allow the calling application to decide which property on the managed object that should be searched for a fuzzy match.

So this is the definition of the parameterised query added as a command to the repository configuration file for PostgreSQL:
"fuzzy-match" : "SELECT * FROM ${_dbSchema}.${_mainTable} obj INNER JOIN ${_dbSchema}.objecttypes objtype ON objtype.id = obj.objecttypes_id WHERE soundex(json_extract_path_text(fullobject, ${propertyName})) = soundex(${propertyValue}) AND objtype.objecttype = ${_resource} LIMIT ${int:_pageSize} OFFSET ${int:_pagedResultsOffset}"
See how it uses the 'json_extract_path_text' function - this is part of the native JSON functionality in PostgreSQL.  Simply add this to the IDM repo configuration file.

And this is how you call it from curl:
curl -H "X-OpenIDM-Password: openidm-admin" -H "X-OpenIDM-Username: openidm-admin" "http://localhost:8080/openidm/managed/user?_queryId=fuzzy-match&propertyName=givenName&propertyValue='steven'"

Simples!

You may wish to refine this query to use different fuzzy algorithms, or maybe take account of the 'difference' function in PostgreSQL (which would enable matching 'steve' and 'steven').  It's also worth noting that the syntax for different datastores will be different so the actual query you write is up to you.  The point is that there is a supported and documented way of exposing that query as a RESTful API.

Pretty cool huh!