Software

I am committed to efficient software production: my software is available on GitHub.

Data Structures

Efficient C++ implementations of the following data structures (see also related publications).

  • Compacted and Colored de Bruijn Graphs (RECOMB-2023, WABI-2023, ALGOMB-2024, RECOMB-2024, JCB-2024)
  • K-Mer Dictionaries (ISMB-2022, WABI-2022, GBIO-2023, ALGOMB-2023)
  • Minimal Perfect Hash Functions (SIGIR-2021, ISMB-2023, TKDE-2023, ESA-2024)
  • Segment-Trees and Fenwick-Trees (SPE-2021)
  • Mutable Bitmaps with Rank/Select (INFOSYS-2021)
  • Compressed Bitmaps (DCC-2021)
  • Tries (SIGIR-2017, TOIS-2019, TKDE-2021)
  • Inverted Indexes (TOIS-2017, WSDM-2019, TKDE-2020, SIGIR-2020, CSUR-2021)

Minimizers – A collection of minimizer-based sampling algorithms.
Reference publications: WABI-2024.

Fulgor – A fast and compact k-mer index for large-scale matching and color queries.
Reference publications: WABI-2023, ALGOMB-2024, RECOMB-2024, JCB-2024.

LPHash – Fast and compact locality-preserving minimal perfect hashing for k-mer sets.
Reference publications: ISMB-2023.

SSHash – A compressed, weighted, associative, and exact dictionary for k-mers. A membership-only version of SSHash is SSHash-Lite.
Reference publications: ISMB-2022, WABI-2022, GBIO-2023, ALGOMB-2023.

PTHash – Fast and compact minimal perfect hash functions.
Reference publications: SIGIR-2021, TKDE-2023, ESA-2024.

Mutable Rank and Select Queries – Mutable bitmaps with support for Rank and Select queries.
Reference publication: INFOSYS-2021.

Prefix-Sum Queries – A range of tree-shaped data structures for maintaining prefix-sums, including: binary Segment-Tree (top-down and bottom-up), b-ary Segment-Tree, Fenwick-Tree, b-ary Fenwick-Tree, blocked Fenwick-Tree, truncated Fenwick-Tree.
Reference publication: SPE-2021.

Query Auto-Completion – Efficienct and effective autocompletion framework, based on forward/inverted indexes, succinct RMQ, and string dictionaries (Front-Coding and tries).
Reference publication: SIGIR-2020.

Inverted Indexes Benchmark – A benchmarking suite for inverted index data structures, featuring the following compressors: Elias-Fano and partitioned Elias-Fano, Opt-PFor-Delta, Binary Interpolative, QMX, Simple, Variable-Byte and Opt-VByte, Gamma, Delta, Rice, Zeta, DINT.
Reference publication: CSUR-2021.

Interpolative Coding – An efficient implementation of the Binary Interpolative Coding algorithm.

Sliced Indexes – Compressed bitmap indexes that support fast intersection and union.
Reference publication: DCC-2021.

DINT – A Fast and compact dictionary-based decoder for inverted lists.
Reference publication: WSDM-2019.

Opt-VByte – Optimal partitioning of inverted lists compressed using binary vectors and point-wise encoders, like Variable-Byte.
Reference publication: TKDE-2020.

Indexes for RDF – Trie-based indexes for semantic data like RDF triples.
Reference publication: TKDE-2021.

Tongrams – Fast language model queries and estimation in compressed space.
Reference publications: SIGIR-2017, TOIS-2019.

Clustered Elias-Fano Indexes – Clustered Elias-Fano inverted indexes.
Reference publication: TOIS-2017.

Miscellanea

C++ utilities that I used as sub-modules for larger projects.

essentials – A C++ library providing essential core utilities for data structure design and benchmarking. More precisely:

  • benchmarking facilities, including: messages displaying local time, configurable timer class, function to prevent code elision by compiler, simple creation and printing of json documents;
  • functions to serialize-to and load-from disk data structures;
  • functions to compute the number of bytes consumed by data structures;
  • support for creating, removing, and iterate inside directories;
  • transparent support for contiguous memory allocation.

cmd_line_parser – Command line parser for C++17. It offers all handy features in just 150 lines of code.

mm_file – A self-contained, header-only, implementation of memory-mapped files in C++ for both reading and writing.