RT Journal Article
JF Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405)
YR 2003
VO 00
SP 125
TI Distance Based Indexing for String Proximity Search
A1 Jai Macker
A1 Murat Tasan
A1 S. Cenk Sahinalp
A1 Z. Meral Ozsoyoglu
AB In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on a (weighted) count of (i) character edit or (ii) block edit operations needed to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal of this paper is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which were originally designed for metric distances, can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics. In order to analyze the performance of distance-based indexing methods (in particular VP trees) for strings, we then develop a model based on the distribution of pairwise distances. Based on this model, we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.
PB IEEE Computer Society, [URL:http://www.computer.org]
SN 1063-6382
LA English
DO 10.1109/ICDE.2003.1260787
LK http://doi.ieeecomputersociety.org/10.1109/ICDE.2003.1260787