TurboPFor: Fastest Integer Compression [![Build Status](https://travis-ci.org/powturbo/TurboPFor.svg?branch=master)](https://travis-ci.org/powturbo/TurboPFor)
======================================

- 100% C (C++ compatible headers), without inline assembly

- Fastest **"Variable Byte"** implementation

- Novel **"Variable Simple"** faster than simple16 and more compact than simple64

- Scalar **"Bit Packing"** with bulk decoding as fast as SIMD FastPFor in realistic and practical (No "pure cache") scenarios
  - Bit Packing with **Direct/Random Access** without decompressing entire blocks
  - Access any single bit packed entry with **zero decompression**
  - Reducing **Cache Pollution**

- Novel **"TurboPFor"** (Patched Frame-of-Reference) scheme with direct access or bulk decoding. Outstanding compression

- Several times faster than other libraries

- Usage in C/C++ as easy as memcpy

- Most functions optimized for speed and others for high compression ratio

- **New:** Include more functions

- Instant access to compressed *frequency* and *position* data in inverted index with zero decompression

- **New:** Inverted Index Demo + Benchmarks: Intersection of lists of sorted integers.
  - More than **1000 queries per second** on gov2 (25 million documents) on a **SINGLE** core.
  - Decompress only the minimum necessary blocks.

# Benchmark:
i7-2600k at 3.4GHz, gcc 4.9, Ubuntu 14.10.
- Single thread
- Realistic and practical benchmark with large integer arrays.
- No PURE cache benchmark

#### Synthetic data:
coming soon!

#### data files
- clueweb09.sorted from FastPFor (http://lemire.me/data/integercompression2014.html)

      ./icbench -n10000000000 clueweb09.sorted

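| Size (bytes) | Ratio (%) | Bits/Integer | Compression (MB/s) | Decompression (MB/s) | Function |
|-------------:|----------:|-------------:|-------------------:|---------------------:|:---------|
| 514438405 | 8.16 | 2.61 | 357.22 | 1286.42 | TurboPFor |
| 514438405 | 8.16 | 2.61 | 358.09 | 309.70 | TurboPFor DA |
| 539841792 | 8.56 | 2.74 | 6.47 | 767.35 | OptP4 |
| 583184112 | 9.25 | 2.96 | 132.42 | 914.89 | Simple16 |
| 623548565 | 9.89 | 3.17 | 235.32 | 925.71 | SimpleV |
| 733365952 | 11.64 | 3.72 | 162.21 | 1312.15 | Simple64 |
| 862464289 | 13.68 | 4.38 | 1274.01 | 1980.55 | TurboPack |
| 862464289 | 13.68 | 4.38 | 1285.28 | 868.06 | TurboPack DA |
| 862465391 | 13.68 | 4.38 | 1402.12 | 2075.15 | SIMD-BitPack FPF |
| 6303089028 | 100.00 | 32.00 | 1257.50 | 1308.22 | copy |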
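The ratio and bits-per-integer columns follow from the size column alone. The sketch below shows that arithmetic, assuming the input consists of 32-bit integers (as the 32.00 bits/integer of the `copy` row suggests) and that the ratio is taken relative to the `copy` size. The function names `ratio_percent` and `bits_per_integer` are hypothetical helpers for illustration, not part of icbench.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of TurboPFor or icbench. */

/* Compressed size as a percentage of the uncompressed size. */
double ratio_percent(uint64_t compressed_bytes, uint64_t raw_bytes) {
    assert(raw_bytes > 0);
    return 100.0 * (double)compressed_bytes / (double)raw_bytes;
}

/* Average compressed bits per integer.
   Assumption: every input integer occupies 4 bytes (32 bits). */
double bits_per_integer(uint64_t compressed_bytes, uint64_t raw_bytes) {
    assert(raw_bytes >= 4);
    uint64_t count = raw_bytes / 4;   /* number of 32-bit integers in the input */
    return 8.0 * (double)compressed_bytes / (double)count;
}
```

For the TurboPFor row, `ratio_percent(514438405, 6303089028)` gives roughly 8.16 and `bits_per_integer(514438405, 6303089028)` roughly 2.61, matching the table.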

## Compile:

    make

## Benchmark

###### Synthetic data:

1. Test all functions

       ./icbench -a1.0 -m0 -x8 -n100000000

   - Zipfian distribution alpha = 1.0 (e.g. -a1.0=uniform, -a1.5=skewed distribution)
   - number of integers = 100000000
   - integer range from 0 to 255 (integer size = 0 to 8 bits)

2. Individual function test (e.g. copy, TurboPack, TurboPack Direct Access)

       ./icbench -a1.0 -m0 -x8 -ecopy/turbopack/turbopackda -n100000000

###### Data files:

- Data file benchmark (file format as in FastPFor)

      ./icbench gov2.sorted

###### Benchmarking intersections

- Download "gov2.sorted" (or clueweb09) and the query file "aol.txt" from "http://lemire.me/data/integercompression2014.html"
- Create the index file gov2.sorted.i

      ./idxcr gov2.sorted .

  This creates the inverted index file "gov2.sorted.i" in the current directory.
- Benchmark the intersections

      ./idxqry gov2.sorted.i aol.txt

  This runs the queries in file "aol.txt" over the index of the gov2 file.

A minimum of 8GB of RAM is required (16GB recommended for benchmarking "clueweb09" files).

## Function usage:

In general, compression/decompression functions are of the form:

    char *endptr = compress( unsigned *in, int n, char *out)

      endptr : set by compress to the next character in "out" after the compressed buffer
      in     : input integer array
      n      : number of elements
      out    : pointer to output buffer

    char *endptr = decompress( char *in, int n, unsigned *out)

      endptr : set by decompress to the next character in "in" after the compressed data that was read
      in     : pointer to input buffer
      n      : number of elements
      out    : output integer array

Header files with documentation:

    vint.h                  - variable byte
    vsimple.h               - variable simple
    vp4dc.h, vp4dd.h        - TurboPFor
    bitpack.h, bitunpack.h  - Bit Packing

## Reference:

- "SIMD-BitPack FPF" from FastPFor https://github.com/lemire/simdcomp
- Sorted integer datasets from http://lemire.me/data/integercompression2014.html
- OptP4 and Simple-16 from http://jinruhe.com/
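To illustrate the calling convention described under "Function usage", here is a minimal sketch of a plain variable-byte codec that follows the same shape: each function takes an input pointer, an element count and an output pointer, and returns the position just past the compressed data. The names `vb_compress` and `vb_decompress` are hypothetical and are not TurboPFor functions; the real, much faster routines are declared in the headers listed above. The sketch also uses `unsigned char *` instead of `char *` and assumes a 32-bit `unsigned`.

```c
#include <assert.h>

/* Hypothetical names: these are NOT TurboPFor functions. They only mimic
   the compress/decompress calling convention with a simple variable-byte
   scheme (7 payload bits per byte, high bit set when more bytes follow). */

/* Encode n integers from in[] into out[].
   Returns a pointer to the byte after the compressed data.
   out must have room for 5*n bytes (worst case for 32-bit values). */
unsigned char *vb_compress(const unsigned *in, int n, unsigned char *out) {
    assert(in && out && n >= 0);
    for (int i = 0; i < n; i++) {
        unsigned v = in[i];
        while (v >= 0x80) {
            *out++ = (unsigned char)(v | 0x80);  /* low 7 bits + continuation flag */
            v >>= 7;
        }
        *out++ = (unsigned char)v;               /* last byte: flag cleared */
    }
    return out;
}

/* Decode n integers from in[] into out[].
   Returns a pointer to the byte after the compressed data that was read. */
const unsigned char *vb_decompress(const unsigned char *in, int n, unsigned *out) {
    assert(in && out && n >= 0);
    for (int i = 0; i < n; i++) {
        unsigned v = 0;
        int shift = 0;
        while (*in & 0x80) {                     /* continuation flag set */
            v |= (unsigned)(*in++ & 0x7f) << shift;
            shift += 7;
        }
        v |= (unsigned)(*in++) << shift;         /* final byte of this value */
        out[i] = v;
    }
    return in;
}
```

A caller allocates the output buffer, calls `vb_compress(in, n, buf)`, and uses the difference between the returned pointer and `buf` as the compressed length. Decoding the same `n` elements with `vb_decompress` returns a pointer to the same end position, which is what makes it possible to walk through several compressed arrays stored back to back.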
