Download PDFOpen PDF in browser

On Lowering Merge Costs of an LSM Tree

EasyChair Preprint no. 6129

6 pagesDate: July 21, 2021

Abstract

In column stores, which ingest large amounts of data into multiple column groups, query performance deteriorates. Commercial column stores use log-structured merge (LSM) tree on projections to ingest data rapidly. LSM improves ingestion performance, but in column stores the sort-merge phase is I/O-intensive, which slows concurrent queries and reduces overall throughput. In this paper, we aim to reduce the sorting and merging cost that arise when data is ingested in column stores. We present LDI, a learned distribution index for column stores. LDI learns a frequency-based data distribution and constructs a bucket worth of data based on the learned distribution. Filled buckets that conform to the distribution are written out to disk; unfilled buckets are retained to achieve the desired level of sortedness, thus avoiding the expensive sort-merge phase. We present an algorithm to learn and adapt to distributions, and a robust implementation that takes advantage of disk parallelism. We compare LDI with LSM and production columnar stores using real and synthetic datasets.

Keyphrases: approximate clustering, column-oriented, Learned Distribution Index, Write-optimized

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:6129,
  author = {Dai Hai Ton That and Mohammadsaleh Gharehdaghi and Alexander Rasin and Tanu Malik},
  title = {On Lowering Merge Costs of an LSM Tree},
  howpublished = {EasyChair Preprint no. 6129},

  year = {EasyChair, 2021}}
Download PDFOpen PDF in browser