diff --git a/README.md b/README.md index f6f733c..2d0d123 100644 --- a/README.md +++ b/README.md @@ -42,7 +42,8 @@ CPU: Sandy bridge i7-2600k at 4.2GHz, gcc 5.1, ubuntu 15.04, single thread. ##### - Synthetic data: - Generate and test skewed distribution (100.000.000 integers, Block size=128). - >*./icbench -a1.5 -m0 -M255 -n100m* + + ./icbench -a1.5 -m0 -M255 -n100m |Size| Ratio % |Bits/Integer |C Time MI/s |D Time MI/s |Function | |--------:|-----:|----:|-------:|-------:|---------| @@ -66,9 +67,10 @@ MI/s: 1.000.000 integers/second ( = 4.000.000 bytes/sec )
**#BOLD** = pareto frontier ##### - Data files: - - gov2.sorted from [Document identifier data set](http://lemire.me/data/integercompression2014.html) Block size=128 (lz4+SimpleV 64k) + - gov2.sorted from [DocId data set](http://lemire.me/data/integercompression2014.html) Block size=128 (lz4+SimpleV 64k) - >*./icbench -c1 gov2.sorted* + + ./icbench -c1 gov2.sorted |Size |Ratio %|Bits/Integer|C Time MI/s|D Time MI/s|Function | |----------:|-----:|----:|------:|------:|---------------------| @@ -128,77 +130,89 @@ q/s: queries/second, ms/q:milliseconds/query ##### - Synthetic data: + test all functions
- >*./icbench -a1.0 -m0 -M255 -n100m* - - zipfian distribution alpha = 1.0 (Ex. -a1.0=uniform -a1.5=skewed distribution) - - number of integers = 100.000.000 - - integer range from 0 to 255 + ./icbench -a1.0 -m0 -M255 -n100m + + >*-zipfian distribution alpha = 1.0 (Ex. -a1.0=uniform -a1.5=skewed distribution)
+ -number of integers = 100.000.000
+ -integer range from 0 to 255
* + individual function test (ex. Copy TurboPack TurboPFor)
- >*./icbench -a1.5 -m0 -M255 -ecopy/turbopack/turbopfor -n100m* + + ./icbench -a1.5 -m0 -M255 -ecopy/turbopack/turbopfor -n100m ##### - Data files: - - Data file Benchmark (file format as in FastPFOR) + - Data file Benchmark (file from [DocId data set](http://lemire.me/data/integercompression2014.html)) - >*./icbench -c1 gov2.sorted* + + ./icbench -c1 gov2.sorted ##### - Intersections: - 1 - Download Gov2 (or ClueWeb09) + query files (Ex. "1mq.txt")
- from [Document identifier data set](http://lemire.me/data/integercompression2014.html)
+ 1 - Download Gov2 (or ClueWeb09) + query files (Ex. "1mq.txt") from [DocId data set](http://lemire.me/data/integercompression2014.html)
8GB RAM required (16GB recommended for benchmarking "clueweb09" files). 2 - Create index file - >*./idxcr gov2.sorted .* - create inverted index file "gov2.sorted.i" in the current directory + ./idxcr gov2.sorted . + + >*create inverted index file "gov2.sorted.i" in the current directory* 3 - Test intersections - >*./idxqry gov2.sorted.i 1mq.txt* - run queries in file "1mq.txt" over the index of gov2 file + ./idxqry gov2.sorted.i 1mq.txt + + >*run queries in file "1mq.txt" over the index of gov2 file* ##### - Parallel Query Processing: 1 - Create partitions + - >*./idxseg gov2.sorted . -26m -s8* + ./idxseg gov2.sorted . -26m -s8 + - create 8 (CPU hardware threads) partitions for a total of ~26 millions document ids + >*create 8 (CPU hardware threads) partitions for a total of ~26 million document ids* 2 - Create index file for each partition - >./idxcr gov2.sorted.s* - create inverted index file for all partitions "gov2.sorted.s00 - gov2.sorted.s07" in the current directory + ./idxcr gov2.sorted.s + + + >*create inverted index file for all partitions "gov2.sorted.s00 - gov2.sorted.s07" in the current directory* 3 - Intersections: - delete "idxqry.o" file and then type "make para" to compile "idxqry" w. multithreading + delete "idxqry.o" file and then type "make para" to compile "idxqry" with multithreading - >./idxqry gov2.sorted.s\*.i 1mq.txt* - run queries in file "1mq.txt" over the index of all gov2 partitions "gov2.sorted.s00.i - gov2.sorted.s07.i". + ./idxqry gov2.sorted.s\*.i 1mq.txt + + >*run queries in file "1mq.txt" over the index of all gov2 partitions "gov2.sorted.s00.i - gov2.sorted.s07.i".* ### Function usage: In general encoding/decoding functions are of the form: - **char *endptr = encode( unsigned *in, unsigned n, char *out, [unsigned start], [int b])**
- endptr : set by encode to the next character in "out" after the encoded buffer
- in : input integer array
- n : number of elements
- out : pointer to output buffer
- b : number of bits. Only for bit packing functions
- start : previous value. Only for integrated delta encoding functions + + >***char *endptr = encode( unsigned *in, unsigned n, char *out, [unsigned start], [int b])**
+ endptr : set by encode to the next character in "out" after the encoded buffer
+ in : input integer array
+ n : number of elements
+ out : pointer to output buffer
+ b : number of bits. Only for bit packing functions
+ start : previous value. Only for integrated delta encoding functions* + + - **char *endptr = decode( char *in, unsigned n, unsigned *out, [unsigned start], [int b])**
- endptr : set by decode to the next character in "in" after the decoded buffer
- in : pointer to input buffer
- n : number of elements
- out : output integer array
- b : number of bits. Only for bit unpacking functions
- start : previous value. Only for integrated delta decoding functions + >***char *endptr = decode( char *in, unsigned n, unsigned *out, [unsigned start], [int b])**
+ endptr : set by decode to the next character in "in" after the decoded buffer
+ in : pointer to input buffer
+ n : number of elements
+ out : output integer array
+ b : number of bits. Only for bit unpacking functions
+ start : previous value. Only for integrated delta decoding functions* header files to use with documentation :
@@ -220,11 +234,11 @@ header files to use with documentation :
- All TurboPFor functions are thread safe ### References: + + [FastPFor](https://github.com/lemire/FastPFor) + [Simdcomp](https://github.com/lemire/simdcomp): SIMDPackFPF, VbyteFPF + [Optimized Pfor-delta compression code](http://jinruhe.com): PForDelta: OptPFD or OptP4, Simple16 + [MaskedVByte](http://maskedvbyte.org/). See also: [Vectorized VByte Decoding](http://engineering.indeed.com/blog/2015/03/vectorized-vbyte-decoding-high-performance-vector-instructions/) + [Document identifier data set](http://lemire.me/data/integercompression2014.html) + **Publications:** - [SIMD Compression and the Intersection of Sorted Integers](http://arxiv.org/abs/1401.6399) - - [Quasi-Succinct Indices](http://arxiv.org/abs/1206.4300) - [Partitioned Elias-Fano Indexes](http://www.di.unipi.it/~ottavian/files/elias_fano_sigir14.pdf)
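
The encode/decode calling convention described under "Function usage" can be illustrated with a small sketch. The codec below is a hypothetical toy (plain variable-byte coding), not one of the TurboPFor functions; the names `toy_vb_encode`, `toy_vb_decode` and `toy_roundtrip_demo` are assumptions made for illustration only, and `unsigned char *` is used for the byte buffers instead of the `char *` shown in the README. It only shows the pattern of returning a pointer just past the bytes written (encode) or consumed (decode), without the optional `start` and `b` parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy codec, NOT a TurboPFor function: variable-byte coding,
   7 payload bits per byte, high bit set on every byte except the last. */
static unsigned char *toy_vb_encode(const unsigned *in, unsigned n, unsigned char *out) {
    for (unsigned i = 0; i < n; i++) {
        unsigned v = in[i];
        while (v >= 0x80) {                 /* emit continuation bytes */
            *out++ = (unsigned char)(v | 0x80);
            v >>= 7;
        }
        *out++ = (unsigned char)v;          /* final byte, high bit clear */
    }
    return out;  /* next byte in "out" after the encoded buffer */
}

static unsigned char *toy_vb_decode(unsigned char *in, unsigned n, unsigned *out) {
    for (unsigned i = 0; i < n; i++) {
        unsigned v = 0;
        int shift = 0;
        unsigned char c;
        do {
            c = *in++;
            v |= (unsigned)(c & 0x7f) << shift;
            shift += 7;
        } while (c & 0x80);
        out[i] = v;
    }
    return in;   /* next byte in "in" after the decoded buffer */
}

/* Round trip: encode, decode, check both end pointers agree.
   Returns the number of encoded bytes. */
static size_t toy_roundtrip_demo(void) {
    unsigned in[4] = {0, 127, 128, 300000};
    unsigned char buf[4 * 5];               /* worst case: 5 bytes per 32-bit value */
    unsigned out[4];

    unsigned char *enc_end = toy_vb_encode(in, 4, buf);
    unsigned char *dec_end = toy_vb_decode(buf, 4, out);

    assert(dec_end == enc_end);             /* decoder consumed exactly what was written */
    for (unsigned i = 0; i < 4; i++)
        assert(out[i] == in[i]);
    return (size_t)(enc_end - buf);
}
```

Returning the end pointer lets a caller compute the compressed size as `endptr - out` and chain several blocks into one buffer without tracking lengths separately.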