Software
I am committed to efficient software production: my software is available on GitHub.
Data Structures
Efficient C++ implementations of the following data structures (see also related publications).
- Compacted and Colored de Bruijn Graphs (RECOMB2023, WABI2023, ALGOMB2024, RECOMB2024)
- K-Mer Dictionaries (ISMB2022, WABI2022, GBIO2023, ALGOMB2023)
- Minimal Perfect Hash Functions (SIGIR2021, ISMB2023, TKDE2023)
- Segment-Trees and Fenwick-Trees (SPE2021)
- Mutable Bitmaps with Rank/Select (INFOSYS2021)
- Compressed Bitmaps (DCC2021)
- Tries (SIGIR2017, TOIS2019, TKDE2021)
- Inverted Indexes (TOIS2017, WSDM2019, BIC2019, TKDE2020, SIGIR2020, CSUR2021)
Fulgor –
A fast and compact k-mer index for large-scale matching and color queries.
Reference publications: WABI2023, ALGOMB2024, RECOMB2024.
LPHash –
Fast and compact locality-preserving minimal perfect hashing for k-mer sets.
Reference publications: ISMB2023.
SSHash –
A compressed, weighted, associative, and exact dictionary for k-mers.
A membership-only version of SSHash is SSHash-Lite.
Reference publications: ISMB2022, WABI2022, GBIO2023, ALGOMB2023.
PTHash –
Fast and compact minimal perfect hash functions.
Reference publications: SIGIR2021, TKDE2023.
Mutable Rank and Select Queries –
Mutable bitmaps with support for Rank and Select queries.
Reference publication: INFOSYS2021.
Prefix-Sum Queries –
A range of tree-shaped data structures for maintaining prefix-sums, including:
binary Segment-Tree (top-down and bottom-up),
b-ary Segment-Tree,
Fenwick-Tree,
b-ary Fenwick-Tree,
blocked Fenwick-Tree,
truncated Fenwick-Tree.
Reference publication: SPE2021.
Query Auto-Completion –
Efficienct and effective
autocompletion framework, based on forward/inverted indexes, succinct RMQ, and string dictionaries (Front-Coding and tries).
Reference publication: SIGIR2020.
Inverted Indexes Benchmark –
A benchmarking suite for inverted index data structures, featuring the following compressors:
Elias-Fano and partitioned Elias-Fano,
Opt-PFor-Delta,
Binary Interpolative,
QMX,
Simple,
Variable-Byte and Opt-VByte,
Gamma, Delta, Rice, Zeta,
DINT.
Reference publication: CSUR2021.
Interpolative Coding –
An efficient implementation of the Binary Interpolative Coding
algorithm.
Reference publication: BIC2019.
Sliced Indexes –
Compressed bitmap indexes that support fast intersection and union.
Reference publication: DCC2021.
DINT –
A Fast and compact dictionary-based decoder for inverted lists.
Reference publication: WSDM2019.
Opt-VByte –
Optimal partitioning of inverted lists compressed using binary vectors
and point-wise encoders, like Variable-Byte.
Reference publication: TKDE2020.
Indexes for RDF –
Trie-based indexes for semantic data like RDF triples.
Reference publication: TKDE2021.
Tongrams –
Fast language model queries and estimation in compressed space.
Reference publications: SIGIR2017, TOIS2019.
Clustered Elias-Fano Indexes –
Clustered Elias-Fano inverted indexes.
Reference publication: TOIS2017.
Miscellanea
C++ utilities that I used as sub-modules for larger projects.
essentials – A C++ library providing essential core utilities for data structure design and benchmarking. More precisely:
- benchmarking facilities, including: messages displaying local time, configurable timer class, function to prevent code elision by compiler, simple creation and printing of json documents;
- functions to serialize-to and load-from disk data structures;
- functions to compute the number of bytes consumed by data structures;
- support for creating, removing, and iterate inside directories;
- transparent support for contiguous memory allocation.
cmd_line_parser – Command line parser for C++17. It offers all handy features in just 150 lines of code.
mm_file – A self-contained, header-only, implementation of memory-mapped files in C++ for both reading and writing.