A Fast Algorithm for Finding the Nearest Neighbor of a Word in a Dictionary
Technical Report, Institut für Informatik und angewandte Mathematik, 1993
Download: Gzip archive (52Kb)
In this paper a new algorithm for string edit distance computation is proposed. It is based on the classical approach . However, while in  the two strings to be compared may be given online, our algorithm assumes that one of the two strings to be compared is a dictionary entry that is known a priori. This dictionary word is converted, in an off-line phase to be carried out beforehand, into a special type of deterministic finite state automaton. Now, given an input string corresponding to a word with possible OCR errors and the automaton derived from the dictionary word, the computation of the edit distance between the two strings corresponds to a traversal of the states of the automaton. This procedure needs time which is only linear in the length of the OCR word. It is independent of the length of the dictionary word. Given not only one but $N$ different dictionary words, their corresponding automata can be combined into a single deterministic finite state automaton. Thus the computation of the edit distance between the input word and each dictionary entry, and the determination of the nearest neighbor in the dictionary need time that is only linear in the length of the OCR word. Particularly, the time complexity is independent of the size of the dictionary. Categories and Subject Descriptors: I.7.0 [Text Processing]: General; I.7.1 [Text Processing]: Text Editing; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - dictionaries General Terms: algorithms, computational complexity Additional Key Words and Phrases: string edit distance, finite state automaton, nearest neighbor search, dictionary lookup.