Wednesday 29 October 2014

A Fuzzy Sublime-like String Matching Algorithm

Those of you who pay attention to what I'm up to, will know that one of my long running projects is to write my own code editor.

One of the features that's been implemented for a while was a basic filename search via a popup search box. This approach is seen in other editors like Sublime Text, and it's an efficient way to quickly navigate between files.

I wasn't happy with my search matching though, it used a hacky Levenshtein distance algorithm for ordering. It was slow, and to keep it usable I had to disregard many matches that the user might have been looking for.

After much thought I've come up with a much better algorithm for fuzzy matches, I'm going to call it the Kazade Ranking Algorithm, because I can.

Step 1 - Filter down the list

The first step is to have a list of all the strings you want to search, let's say for example you have the following haystack of strings:

1. project/main.py
2. project/tests.py
3. sitepackages/project2/tests.py
4. sitepackages/project2/python.py
5. templates/base.html
6. templates/project/other.html

Let's say that you've searched for the text 'oth'. The first step of the algorithm is to look to see if the letters you've typed appear in the correct order in each of the files. So let's go through them:

1. The letters 'o' and 't' appear in order, but no 'h' so disregard
2. The same, only 'o' and 't'
3. Same
4. The letters do appear in order!
5. Nope, not even an 'o'
6. Yep! All three letters appear in the order they were typed

So, we've narrowed down the list to the following strings:

1. sitepackages/project2/python.py
2. templates/project/other.html

Step 2 - Scoring matches

Right, so we've narrowed down the list to two matches. We need to order them though so they are useful, it's probably safe to assume that if someone types 'oth' they are probably expecting other.html to be the top result.

The scoring process works like this:
  1. Set score=10, total=0.
  2. Find the position of the first typed letter in the result, add score to total
  3. Now, move along until we find the next typed letter, each time we skip over a non-matching letter, decrement score.
  4. When we find the next letter, add score to total.
  5. Now reset score to 10. Go to 3.
What this algorithm does is add a score for each letter that was searched. In our example, we typed 3 letters, and so the top total would be 30. However, because we decrement the score when we skip over letters, we punish non-consecutive matching letters.

If we apply this to the two matches we have, we end up with the following scores:

1. sitepackages/project2/python.py - 10 + 7 + 5 = 22
2. templates/project/other.html - 10 + 7 + 7 = 24

So it worked! The file we were looking for has a higher score! But, wait a minute... surely the second file should have a score of 30? I can see the letters 'oth' right there next to each other!

Step 3 - Finding the highest total

The problem is that the first letter we found wasn't the 'o' of other, but it was the one in the middle of 'project', and also the 't' we found was the one in project too, not 'other'.

The solution is to perform the algorithm several times, once for each occurrence of the first typed letter. In the second match above, the letter 'o' appears twice, so we perform the score counting beginning at the first 'o', and then at the second 'o' and then keep the highest of the two.

Done!

This algorithm works really quite well. I rarely don't find what I need, I can be pretty fuzzy with what I type, and it's super fast. For more information you can see the rank function that I use here.

My ranking function also reduces the total score by the difference in length between what was typed, and the string we are checking. I find this improves the ranking a little, but it's not necessary by any means.

Let me know if you find the algorithm useful!


11 comments:

  1. Fuzzy string search algorithm is the best way for creating automatic spelling check systems. We used it in our project and here is report about it http://multi-programming.com/blog/trigram-method-in-automatic-spelling-correction.

    ReplyDelete
  2. Examine this tracking app - t-mobile familywhere that would be a great help for every parent.

    ReplyDelete
  3. A sublime catching is here for the engagement for the citizens. All the shows of the https://techliveupdates.com/essential-devices-to-aid-remote-learning-and-teaching/ are invited for the terms. Path is fit for the hosting of the fussy and all sublime items for the future needs for the cafe.

    ReplyDelete
  4. Resume must be the mirror for your personality. All the chunks of the https://www.theodysseyonline.com/10-ways-that-you-will-get-caught-if-you-lie-on-your-resume are enhanced for the field. The partial theme is applied by this forum for the making of the good resumes.

    ReplyDelete
  5. What a beautiful article it is. spin dryer machine Check them out to learn more about spin dryer machines.

    ReplyDelete
  6. I'm fascinated by the concept of a Fuzzy Sublime-like String Matching Algorithm! It seems like a game-changer for data retrieval and analysis. I can see its potential applications in various fields, including healthcare, like optimizing search algorithms for medical procedures like Rhinoplasty surgery Dubai. The ability to find relevant information even with slight variations in the query is invaluable. Can't wait to see how this technology evolves and becomes more accessible. Great article!

    ReplyDelete
  7. zeed slot168 ของพวกเรา วันนี้สามารถ ช่วยเพิ่มเดิมพันที่เหมาะสมที่สุด ให้กับนักร้องเพลงทุกคน pg slot สำหรับเพื่อการเลือกเข้าใช้บริการตรงนี้ กันแน่ๆ ค้ำประกันได้เลยว่าการสร้างรายได้

    ReplyDelete
  8. ติดต่อเรา pg slot เว็บตรง ลูกค้าสามารถ ติดต่อมาและสอบถาม เนื้อหาการใช้แรงงานได้ทาง PG SLOT หรืออยากได้ อัพเดทข้อมูลโปรโมชั่นใหม่ ลูกค้าก็สามารถแอดไลน์ เพื่อติดตามข้อมูลได้ในทันที

    ReplyDelete
  9. Delving into the realms of programming, open source, games, Dreamcast, Linux, and Android is a diverse and thrilling journey - a digital odyssey where innovation meets endless possibilities. Just as these varied interests converge seamlessly, so does the efficiency and convenience of transactions through a pakistan payment gateway.

    ReplyDelete
  10. "TheDailyGuide">DailyGuide: Your go-to hub for daily insights and practical advice. Dive into topics covering wellness, productivity, relationships, and more. Empower yourself with actionable tips to elevate your daily life. Join our community and embark on a journey towards self-improvement and fulfillment, day by day.

    ReplyDelete
  11. บริการ สล็อต allbet 24 hr หนึ่งในเกมคาสิโนออนไลน์ที่ได้รับความนิยมอย่างมากในวงการพนันออนไลน์ pg slot ด้วยความหลากหลายที่มีในเกม รวมถึงกราฟิกที่สวยงามและการออกแบบที่น่าทึ่ง

    ReplyDelete