Proposed by Burkhard and Keller in 1973, the BK-Tree is a data structure used for spell checking based on the Levenshtein Distance between two words, which is basically the number of changes you need to make to a word to turn it into another word. Some example distances:
LevenshteinDistance(cook, book) -> 1 LevenshteinDistance(cook, books) -> 2 LevenshteinDistance(what, water) -> 3
To make the last example clear, the distance between what and water is based on three moves, the first is to drop the h, then add the e and finally add the r. You can use this distance to work out how close someone is to spelling a word correctly, for instance if I want to know out of the set of words [cook, book, books, what, water] which word the misspelling wat is closest to I could do a full scan of the list and find the words with the lowest distance:
LevenshteinDistance(wat, cook) -> 4 LevenshteinDistance(wat, book) -> 4 LevenshteinDistance(wat, books) -> 5 LevenshteinDistance(wat, what) -> 1 LevenshteinDistance(wat, water) -> 2
Based on that search I can determine the user probably meant the word what due to it having the lowest distance of 1.
This works in the case of a small number of words since O(n) isn’t bad in this case, however if I wanted to scan a full dictionary for the closest match O(n) isn’t optimal. This is where a BK-Tree comes in.
To build a BK-Tree all you have to do is take any word from your set and plop it in as your root node, and then add words to the tree based on their distance to the root. For instance if I started a tree with the word set [book, books, cake] then my tree would look like this if I started by making the word book my root node:
This is because:
LevenshteinDistance(book, books) -> 1 LevenshteinDistance(book, cake) -> 4
Of course now if we add the word boo to the mix we get a conflict because:
LevenshteinDistance(book, boo) -> 1
which collides with books. To handle this we now have to chain boo from books by making it a child of books based on their Levenshtein Distance.
LevenshteinDistance(books, boo) -> 2
We continue to use this strategy as we add words, so if we throw in [Cape, Boon, Cook, Cart] then we get this:
Now that we have our structure the obvious question is how do we search it? This is simple as now all we need to do is take our misspelled word and find matches within a certain level of tolerance, which we’ll call N. We do this by taking the Levenshtein Distance of our word and compare it to the root, then crawl all nodes that are that distance ±N.
For example, if we searched for the closest match to the misspelling caqe within the tolerance of 1 in the above structure we would crawl it like this:
1) LevenshteinDistance(caqe, book) -> 4 a) Check (4 <= 1) - Don't add book as a match b) Crawl all edges between 3 and 5 (4-1,4+1) 2) Crawl into edge path 4 from book 3) LevenshteinDistance(caqe, cake) -> 1 a) Check (1 <= 1) - Add cake as a match b) Crawl all edges between 0 and 2 (1-1, 1+1) 4) Crawl into edge path 1 from cake 5) LevenshteinDistance(caqe, cape) -> 1 a) (1 <= 1) - Add cape as a match 6) Crawl into edge path 2 from cake 7) LevenshteinDistance(caqe, cart) -> 2 a) Check (2 <= 1) - Don't add cart as a match
From this example it appears we can now find misspellings at a O(log n) time, which is better than our O(n) from before. This however doesn’t tell the whole story, because our tolerance can drastically increase the number of nodes we need to visit in order to do the search. Adiscussion over a BK-Tree implementation in Lucene seems to indicate a tolerance of 2 should limit a dictionary search to 5-10% of a tree, while someone else did some tests with random strings to get their own metrics (though random strings are a poor representation of real-world misspellings). In practice the speed improvement of this structure is going to be highly dependent on the set of words you are using and the typical inputs you run into so independent research is needed to see if it applies to any specific problem (such as matching misspelled city names).
As always here’s a C# implementation under the MIT license: