Wednesday, 29 October 2014

A Fuzzy Sublime-like String Matching Algorithm

Those of you who pay attention to what I'm up to, will know that one of my long running projects is to write my own code editor.

One of the features that's been implemented for a while was a basic filename search via a popup search box. This approach is seen in other editors like Sublime Text, and it's an efficient way to quickly navigate between files.

I wasn't happy with my search matching though, it used a hacky Levenshtein distance algorithm for ordering. It was slow, and to keep it usable I had to disregard many matches that the user might have been looking for.

After much thought I've come up with a much better algorithm for fuzzy matches, I'm going to call it the Kazade Ranking Algorithm, because I can.

Step 1 - Filter down the list

The first step is to have a list of all the strings you want to search, let's say for example you have the following haystack of strings:

1. project/main.py
2. project/tests.py
3. sitepackages/project2/tests.py
4. sitepackages/project2/python.py
5. templates/base.html
6. templates/project/other.html

Let's say that you've searched for the text 'oth'. The first step of the algorithm is to look to see if the letters you've typed appear in the correct order in each of the files. So let's go through them:

1. The letters 'o' and 't' appear in order, but no 'h' so disregard
2. The same, only 'o' and 't'
3. Same
4. The letters do appear in order!
5. Nope, not even an 'o'
6. Yep! All three letters appear in the order they were typed

So, we've narrowed down the list to the following strings:

1. sitepackages/project2/python.py
2. templates/project/other.html

Step 2 - Scoring matches

Right, so we've narrowed down the list to two matches. We need to order them though so they are useful, it's probably safe to assume that if someone types 'oth' they are probably expecting other.html to be the top result.

The scoring process works like this:
  1. Set score=10, total=0.
  2. Find the position of the first typed letter in the result, add score to total
  3. Now, move along until we find the next typed letter, each time we skip over a non-matching letter, decrement score.
  4. When we find the next letter, add score to total.
  5. Now reset score to 10. Go to 3.
What this algorithm does is add a score for each letter that was searched. In our example, we typed 3 letters, and so the top total would be 30. However, because we decrement the score when we skip over letters, we punish non-consecutive matching letters.

If we apply this to the two matches we have, we end up with the following scores:

1. sitepackages/project2/python.py - 10 + 7 + 5 = 22
2. templates/project/other.html - 10 + 7 + 7 = 24

So it worked! The file we were looking for has a higher score! But, wait a minute... surely the second file should have a score of 30? I can see the letters 'oth' right there next to each other!

Step 3 - Finding the highest total

The problem is that the first letter we found wasn't the 'o' of other, but it was the one in the middle of 'project', and also the 't' we found was the one in project too, not 'other'.

The solution is to perform the algorithm several times, once for each occurrence of the first typed letter. In the second match above, the letter 'o' appears twice, so we perform the score counting beginning at the first 'o', and then at the second 'o' and then keep the highest of the two.

Done!

This algorithm works really quite well. I rarely don't find what I need, I can be pretty fuzzy with what I type, and it's super fast. For more information you can see the rank function that I use here.

My ranking function also reduces the total score by the difference in length between what was typed, and the string we are checking. I find this improves the ranking a little, but it's not necessary by any means.

Let me know if you find the algorithm useful!


2 comments:

  1. Fuzzy string search algorithm is the best way for creating automatic spelling check systems. We used it in our project and here is report about it http://multi-programming.com/blog/trigram-method-in-automatic-spelling-correction.

    ReplyDelete
  2. Examine this tracking app - t-mobile familywhere that would be a great help for every parent.

    ReplyDelete