We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1, n] is bounded by nH_k(T)+O((n log log n)/ log_(|Σ|) n) bits, where H_k(T) is the k-th order empirical entropy of T. The above bound holds simultaneously for all k ≤ α log_(|Σ|) n and 0 < α < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in analysis of the space occupancy. Using our index, the counting of the occurrences of an arbitrary pattern P[1,p] as a substring of T takes O(p log |Σ|) time. Locating each pattern occurrence takes O(log |Σ| (log~2 n/ log log n)) time. Reporting a text substring of length l takes O((l + log~2 n/ log log n) log |Σ|) time.
展开▼