Data structure for representation of maximal repeats in strings

Diploma thesis home page

Author: Michal Linhard

Thesis advisor: Tibor Hegedus

Last changed: May 01 2007

Abstract

In this diploma thesis we present data structure for representation of maximal repeats in strings
- R3 tree, based on well known data structure - suffix tree. It requires O(n) space and it can be
constructed in O(n) time and space for string of length n over constant-sized alphabet. We formalize repeat in
string S as triple (p1, p2, l), where p1, p2 are two distinct positions in S and l is the length of the repeat.
We formulate query for maximal repeats in S in the form of the function findPairs(p1, k, S)
that returns all pairs (p2, l) such that (p1, p2, l) is maximal repeat with l >= k. R3 tree allows
computation of findPairs queries in optimal time O(z), where z is the number of found pairs. We
also describe design and functionality of R3lib - library written in C, for finding maximal repeats
in arbitrary binary data, that works with proposed structure.

Documents

Name Last change Link
Diploma thesis - final version may 01 2007 pdf
Diploma thesis TeX sources + figures - final version may 01 2007 zip
Diploma thesis - working version apr 24 2007 pdf
Diploma thesis - working version apr 23 2007 pdf
Diploma thesis - working version apr 10 2007 pdf
Diploma thesis - working version mar 26 2007 pdf
Diploma thesis - working version mar 22 2007 pdf
Fourth presentation for seminar (slovak) feb 2 2007 pdf, odp
Third presentation for seminar (slovak) nov 29 2006 pdf, odp
Second presentation for seminar (slovak) sep 27 2006 pdf, odp
Presentation for diploma thesis seminar (slovak) jun 29 2006 ppt

References

  1. Brenda S. Baker, On finding duplication in strings and software, technical report, AT&T Bell Laboratories, February, 1993 [source] [.ps.gz]
  2. Manber, U., & Myers, G. (1993). Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5), 935--948 [source] [.ps]
  3. E. Ukkonen: On-line construction of suffix-trees. Algorithmica 14 (1995), 249-260 [source] [.pdf]
  4. M. Crochemore, W. Rytter: Text Algorithms. Oxford University Press, New York, 1994 [source]
  5. P. Ko and S. Aluru, "Space-efficient linear time construction of suffix arrays," Combinatorial Pattern Matching, pp. 200-210, 2003. [source] [.pdf]
  6. Juha Kärkkäinen and Peter Sanders: Simple linear work suffix array construction, in Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP '03). LNCS 2719, Springer, 2003, pp. 943-955 [source] [.pdf]
  7. S. Kurtz and C. Schleiermacher: REPuter: Fast Computation of Maximal Repeats in Complete Genomes, Bioinformatics, 15(5):426-427, 1999. [source][.pdf]
  8. S. Kurtz, and E. Ohlebusch: Replacing Suffix Trees with Enhanced Suffix Arrays, M.I. Abouelhoda, Journal of Discrete Algorithms, 2:53-86, 2004. [source][.pdf]
  9. M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch: The Enhanced Suffix Array and its Applications to Genome Analysis, In Proceedings of the Second Workshop on Algorithms in Bioinformatics, pages 449-463. Lecture Notes in Computer Science 2452, Springer-Verlag, 2002. [source][.pdf]
  10. R. Dementiev, J. Mehnert, and J. Kärkkäinen: Better External Memory Suffix Array Construction. In Workshop on Algorithm Engineering & Experiments, Vancouver, 2005. [source][.pdf]
  11. Jens Mehnert: External Memory Suffix Array Construction, Diploma Thesis [source][.pdf]
  12. R. Dementiev, L. Kettner, P. Sanders: Stxxl: Standard Template Library for XXL Data Sets. Technical Report 2005/18, Fakultät für Informatik, University of Karlsruhe [source][.pdf]
  13. Stefan Burkhardt, Juha Kärkkäinen: Fast Lightweight Suffix Array Construction and Checking, CPM 2003, LNCS 2676, pp. 55–69, 2003 [.pdf]
  14. Klaus-Bernd Schurmann, Jens Stoye: An Incomplex Algorithm for Fast Suffix Array Construction, Proceedings of the 7th Workshop on Algorithm Engineering and Experiments, 2005 [.pdf]
  15. S. Lee, K. Park, Efficient Implementations of Suffix Array Construction Algorithms, Proceedings of the 15th Australasian Workshop on Combinatorial Algorithms, pp. 64-72, 2004 [.ps]
  16. D. K. Kim, J. S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, pp. 186-199, 2003 [.ps]
  17. Juha Kärkkäinen, Peter Sanders and Stefan Burkhardt, Linear work suffix array construction, Journal of ACM, 2005. [.pdf]
  18. Giovanni Manzini, Two space saving tricks for linear time LCP computation, Technical Report TR-INF-2004-02-03-UNIPMN, [.pdf]
  19. Moritz G. Maaß. Linear Bidirectional On-line Construction of Affix Trees. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 1848 of LNCS, pages 320-334. Springer, jun 2000 [source][.ps.gz]
  20. J. Stoye. Affixbäume. Diploma thesis, 1995. [source][.ps]
  21. J. Stoye. Affix trees. Diploma thesis, 1995. [source][.ps]
  22. Shunsuke Inenaga, Hiromasa Hoshino, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs, Proc. 8th Symposium on String Processing and Information Retrieval (SPIRE2001), IEEE-CS Press, pp. 96-110, November 2001 [source][.pdf]

Relevant software

Links

Changelog

May 01 2007 - Final version of diploma thesis added.

April 30 2007 - Pre-Final version of diploma thesis added.

April 25 2007 - Link added.

April 24 2007 - New working version of thesis.

April 23 2007 - New working version of thesis.

April 10 2007 - R3lib project page link added, new working version of diploma thesis.

March 22 2007 - New version of diploma thesis home page in english. Old version can be found here : Old version of page.