*lol*
Nowadays it’s very usual to find websites offering hints while you’re typing on a search box. Google is a pretty good example of it. But, how could it be implemented?
This feature could be implemented either in the client side or in the server side. If the word list is big (usually it is), it’s recommended to keep the lookup logic in the server side to save some bytes while transferring the page to the client and also to save some computing power using server-side caches (cool when you plan to serve many requests).
Either way, there should be a data structure somewhere containing the word list and an algorithm to do the lookup. The simplest approach may be to use a list to store the words and issue something like this when you want to get a list of hints for a given prefix:
filter(lambda x: x.startsWith(prefix), word_list)
That’s Python’s filter, but it works the same way the well-known Haskell’s first-order function filter does. It builds a new list with the elements of the original list (word_list) that match the predicate (the lambda function).
Although the results can (and should) be cached, the very first lookup (or when the cache expires) would be very inefficient because the entire list must be traversed and that operation will take linear time. Not bad, but when the size of the problem gets bigger (i.e. more and more words in the database) the lookup process may be too slow, especially whether you’re serving several users at the same time. If the list was sorted, the execution time could be improved a little bit by writing a more sophisticated algorithm, but let’s keep it that way for now.
Fortunately, there are better and faster ways to face the problem. If you don’t want to write code (usually the best choice) you may use some high-performance indexing engine such as Apache Lucene. But if you prefer the ‘do-it-yourself’ way (for learning purposes), a search tree (more specifically, a trie or a prefix tree) is a good approach.
I’ve poorly benchmarked both alternatives (the list and the tree) and as expected the tree is pretty quicker generating hints. What I did was to feed both data structures with the content of an American English word list holding ~640k words (debian package wamerican-insane).
So, assuming four is a reasonable minimum prefix length, I measured the time it would take to get a list of words prefixed by hous (yes, just one, remember I said this was a poor benchmark? ;). Unsurprisingly, it took around 230 times longer for the list alternative to generate the hints (438.96 ms vs 1.92 ms). Wow.
My implementation of the tree is as follows. The API is quite straightforward, the “hot” methods are put and get_hints. I’ve stripped off the test suite for space reasons.
Usage example:
>>> tree = HintSearchTree()
>>> tree.put("nacho")
>>> tree.put("nachos")
>>> tree.put("nachete")
>>> tree.get_hints("nach")
['nachete', 'nacho', 'nachos']
>>> tree.get_hints("nacho")
['nacho', 'nachos']
>>> tree.delete("nacho")
>>> tree.get_hints("nacho")
['nachos']
>>> tree.count_words()
2
>>> tree.get_hints("n")
['nachete', 'nachos']
>>> tree.is_indexed("nachete")
True
>>> tree.is_indexed("nach")
False
>>> tree.empty()
False
class HintSearchTreeNode(object):
class HintSearchTreeNode(object):
def __init__(self, parent=None, terminal=False):
self._children = {}
self._terminal = terminal
self._parent = parent
@property
def children(self):
return self._children
@property
def terminal(self):
return self._terminal
@terminal.setter
def terminal(self, value):
self._terminal = value
@property
def parent(self):
return self._parent
class HintSearchTree(object):
def __init__(self):
self._root = HintSearchTreeNode()
def put(self, word):
"""Adds a word to the tree."""
# TODO: Sanitize 'word'
if len(word) > 0:
self._put(self._root, word)
def count_words(self):
"""Retrieves the number of indexed words in the tree."""
return self._count_words(self._root)
def is_indexed(self, word):
"""Returns True if 'word' is indexed."""
node = self._find(self._root, word)
return node is not None and node.terminal is True
def get_hints(self, prefix):
"""Returns a list of words prefixed by 'prefix'."""
return self._match_prefix(self._root, prefix)
def delete(self, word):
"""Deletes 'word' (if exists) from the tree."""
terminal = self._find(self._root, word)
if terminal is not None:
terminal.terminal = False
self._prune(terminal.parent, word)
def empty(self):
"""Returns True if the tree contains no elements."""
return len(self._root.children) == 0
def _put(self, node, word, depth=0):
next_node = node.children.get(word[depth])
if next_node is None:
next_node = HintSearchTreeNode(parent=node)
node.children[word[depth]] = next_node
if len(word)-1 == depth:
next_node.terminal = True
else:
self._put(next_node, word, depth+1)
def _count_words(self, node):
words = 1 if node.terminal is True else 0
for k in node.children:
words += self._count_words(node.children[k])
return words
def _match_prefix(self, node, prefix):
terminal = self._find(node, prefix)
if terminal is not None:
return self._harvest_node(terminal, prefix)
else:
return []
def _harvest_node(self, node, prefix, path=""):
hints = []
if node.terminal is True:
hints.append(prefix + path)
for k in node.children:
hints.extend(self._harvest_node(node.children[k], prefix, path+k))
return hints
def _find(self, node, word, depth=0):
if depth == len(word):
return node
else:
child = node.children.get(word[depth])
if child is not None:
return self._find(child, word, depth+1)
else:
return None
def _prune(self, node, word):
if self._count_words(node.children[word[-1]]) == 0:
del node.children[word[-1]]
if len(node.children) == 0 and node.parent is not None \
and node.terminal is not True:
self._prune(node.parent, word[:-1])
The code is released in the public domain.