Review of "Succinct: Enabling Queries on Compressed Data"

11 Oct 2015

Review of "Succinct: Enabling Queries on Compressed Data"

For traditional data stores, data scans incur high latency for large data sizes and have limited throughput since queries typically touch all machines. One can use index in this case, but index have high memory footprint which can be as much as 8x larger than the input size according to the paper. Thus, the author proposed Succinct, which is a distributed data store operates on compressed data directly which provides memory efficiency close to data scans and latency close to indexes.

The key idea is Succinct stores an entroy-compressed representation of the input data that allows random access, enabling efficient storage and retrieval of data. The data representation used by Succinct also supports count, search, range and wildcard queries without storing indexes – all the required information is embedded in the compressed data and execution doesn't require decompressing. By doing this, Succinct achieves similar or stronger functionality but requires 10-11x lower memory than data stores that uses indexes, while only loose part of the decompression throughput than traditional compression techniques.

More concretely, Succinct stores semi-structured data as compressed suffix arrays. Array of suffixes are compressed by using the fact that they are ordered and a help array called NextCharIdx. With this technique, essentially, you only need to store different characters used by your data once, along with the number of occurrences. By using NextCharIdx, Succinct has all the information to reconstruct the original suffix. Mapping between position of suffixes and the original substring are managed by Aos2Input and Input2Aos. Naive binary search would require reconstructing suffix at every step, whereas, in Succinct, it aggressively exploits the 2-d NextCharIdx representation, for a string of length of m, this algorithm only performs 2(m - t - 1) binary searches instead of two full binary searches on Aos2Input array.

Will this paper be influential in 10 years? I think so, it creatively applies compressed suffix array to store semi-structured data and achieves far better memory efficiency while maintaining similar performance of indexed data. It's a great example of applying theory advancement in algorithms to real-world problems!