diff --git a/README.md b/README.md
index fa987f2..34073b6 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,7 @@
TurboPFor: Fastest Integer Compression [](https://travis-ci.org/powturbo/TurboPFor)
======================================
* **TurboPFor: The new synonym for "integer compression"**
+ * :new: (2019.7) All TurboPFor functions are now available on 64-bit ARMv8, including NEON SIMD.
* 100% C (C++ headers), as simple as memcpy
* :+1: **Java** Critical Natives/JNI. Access TurboPFor **incl. SIMD/AVX2!** from Java as fast as calling from C
* :sparkles: **FULL** range 8/16/32/64 bits scalar + 16/32/64 bits SIMD functions
@@ -267,7 +268,7 @@ using [900.000 multicore servers](https://www.cloudyn.com/blog/10-facts-didnt-kn
+ Unit test: test function from bit size 0 to 32
- ./icbench -m0 -M32 -eturbpfor
+ ./icbench -m0 -M32 -eturbpfor -fu
./icbench -m0 -M8 -eturbopack -fs -n1M
##### - Data files:
@@ -275,11 +276,11 @@ using [900.000 multicore servers](https://www.cloudyn.com/blog/10-facts-didnt-kn
./icbench file
./icapp file
- ./icapp -Fs file "16 bits binary file
- ./icapp -Fu file "32 bits binary file
- ./icapp -Fl file "64 bits binary file
- ./icapp -Ff file "32 bits floating point binary file
- ./icapp -Fd file "64 bits floating point binary file
+ ./icapp -Fs file "16 bits raw binary file
+ ./icapp -Fu file "32 bits raw binary file
+ ./icapp -Fl file "64 bits raw binary file
+ ./icapp -Ff file "32 bits raw floating point binary file
+ ./icapp -Fd file "64 bits raw floating point binary file
- Text file: 1 entry per line. [Test data: ts.txt(sorted) and lat.txt(unsorted)](https://github.com/zhenjl/encoding/tree/master/benchmark/data))
@@ -385,21 +386,22 @@ In general encoding/decoding functions are of the form:
compressed_size : number of bytes read from compressed input buffer in
### Function syntax:
- - {vb | p4 | bit | vs}[d | d1 | f | fm | z ]{enc/dec | pack/unpack}[| 128V | 256V][8 | 16 | 32 | 64]:
+ - {vb | p4 | bit | vs}[n][d | d1 | f | fm | z ]{enc/dec | pack/unpack}[| 128V | 256V][8 | 16 | 32 | 64]:
vb: variable byte
p4: turbopfor
vs: variable simple
bit: bit packing
-
- d: delta encoding for increasing integer lists (sorted w/ duplicate)
- d1: delta encoding for strictly increasing integer lists (sorted unique)
- f : FOR encoding for sorted integer lists
- fm: FOR encoding for unsorted integer lists
- z: ZigZag encoding for unsorted integer lists
+ n : high level array functions for large arrays.
+
+ '' : encoding for unsorted integer lists
+ 'd' : delta encoding for increasing integer lists (sorted w/ duplicate)
+ 'd1': delta encoding for strictly increasing integer lists (sorted unique)
+ 'f' : FOR encoding for sorted integer lists
+ 'z' : ZigZag encoding for unsorted integer lists
- enc/pack: encode
- dec/unpack:decode
- XX : integer size (8/16/32/64)
+ 'enc' or 'pack' : encode or bitpack
+ 'dec' or 'unpack': decode or bitunpack
+ 'NN' : integer size (8/16/32/64)
header files to use with documentation:
@@ -411,12 +413,15 @@ header files to use with documentation:
|bitpack.h|Bit Packing, For, +Direct Access| bitpack256v32/bitunpack256v32 bitforenc64/bitfordec64|
|eliasfano.h|Elias Fano| efanoenc256v32/efanoc256v32 |
+Note: Some low-level functions (like p4enc32) are limited to 128 (SSE) or 256 (AVX2) integers per call.
+
### Environment:
###### OS/Compiler (64 bits):
- Linux: GNU GCC (>=4.6)
- clang (>=3.2)
- Windows: MinGW-w64 (no parallel query processing demo app)
- Visual c++ (VS2008-VS2017)
+- Linux aarch64 (64-bit ARM CPU): gcc
###### Multithreading:
- All TurboPFor integer compression functions are thread safe
@@ -451,5 +456,112 @@ header files to use with documentation:
* [Small Polygon Compression](https://arxiv.org/abs/1509.05505) + [Poster](http://abhinavjauhri.me/publications/dcc_poster_2016.pdf) + [code](https://github.com/ajauhri/bignum_compression)
* [Parallel Graph Analysis (Lecture 18)](http://www.cs.rpi.edu/~slotag/classes/FA16/) + [code](http://www.cs.rpi.edu/~slotag/classes/FA16/handson/lec18-comp2.cpp)
-Last update: 09 Nov 2018
+Last update: 15 Jul 2019
+
+## APPENDIX: icbench Integer Compression Benchmark
+
+##### TurboPFor + external libraries
+
+TurboPFor            https://github.com/powturbo/TurboPFor
+FastPFor (FP)        https://github.com/lemire/FastPFor
+lz4                  https://github.com/Cyan4973/lz4
+LittleIntPacker (LI) https://github.com/lemire/LittleIntPacker
+MaskedVbyte          http://maskedvbyte.org
+Polycom (PC)         https://github.com/encode84/bcm
+simdcomp (SC)        https://github.com/lemire/simdcomp
+Simple-8b optimized  https://github.com/powturbo/TurboPFor
+Streamvbyte          https://github.com/lemire/streamvbyte
+VarintG8IU           https://github.com/lemire/FastPFor
+
+##### Functions integrated into 'icbench' for benchmarking
+
+Codec group:
+TURBOPFOR  TurboPFor library               TurboPFor256V/TurboPack256V/TurboPFor256N/TurboPFor/TurboPackV/TurboVByte/TurboPack/TurboForDA/EliasFano/VSimple/TurboPForN/TurboPackN/TurboPForDI
+DEFAULT    Default                         TurboPFor/TurboPackV/TurboVByte/TurboPack/TurboFor/TurboPForN/TurboPackN/TurboPForDI/TurboPFor256V/TurboPack256V/TurboPFor256N
+BENCH      Benchmark                       TurboPFor/TurboPackV/TurboVByte/TurboPack/QMX/FP.SimdFastPfor/FP.SimdOptPFor/MaskedVbyte/StreamVbyte
+EFFICIENT  Efficient                       TurboPFor/vsimple/turbovbyte
+TRANSFORM  transpose/shuffle,delta,zigzag  tpbyte4s/tpbyte,4/tpnibble,4/ZigZag_32/Delta_32/BitShuffle,4
+BITPACK    Bit Packing                     TurboPack256V/TurboPackV/TurboPackH/TurboPack/SC.SimdPack128/SC.SimdPack256
+VBYTE      Variable byte                   TurboVByte/FP.VByte/PC.Vbyte/VarintG8IU/MaskedVbyte/StreamVbyte
+SIMPLE     Simple Family                   simple8b/simple16/vsimple/qmx
+LZ4        lz4+bitshuffle/transpose 4,8    lz4_bitshufle/lz4_tp4/lz4_tp8
+LI         Little Integer                  LI_Pack/LI_TurboPack/LI_SuperPack/LI_HorPack
+
+
+Function         Description                                  level
+
+--------         -----------                                  -----
+TurboPFor        PFor (SSE2)
+TurboPForN       PFor (SSE2) large blocks
+TurboPFor256     PFor (AVX2)
+TurboPFor256N    PFor (AVX2) large blocks
+TurboPForDA      PFor direct access
+TurboPForDI      PFord min
+TurboPForZZ      PFor zigzag of delta
+TurboFor         FOR
+TurboForV        FOR (SIMD)
+TurboFor256V     FOR (AVX2)
+TurboForDA       FOR direct access
+TurboPackDA      Bit packing direct access
+TurboPack        Bit packing (scalar)
+TurboPackN       Bit packing (scalar) large blocks
+TurboPackV       Bit packing (SSE2 Vertical)
+TurboPackH       Bit packing (SSE2 Horizontal)
+TurboPackVN      Bit packing (SSE2 large block)
+TurboPack256V    Bit packing (AVX2 Vertical)
+TurboPack256N    Bit packing (AVX2 large block)
+TurboVByte       Variable byte (scalar)
+VSimple          Variable simple (scalar)
+EliasFano        Elias Fano (scalar)
+EliasFanoV       Elias Fano (SSE2)
+EliasFano256V    Elias Fano (AVX2)
+memcpy           memcpy
+copy             Integer copy
+tpbyte4s         Byte Transpose (scalar)
+tpbyte           Byte transpose (simd)                        2,4,8
+tpnibble         Nibble transpose (simd)                      2,4,8
+ZigZag32         ZigZag encoding (sse2)
+Delta32          Delta encoding (sse2)
+DDelta32         Delta of delta encoding (sse2)
+Xor32            Xor encoding (sse2)
+FP_PREV64        Floating point PFOR
+FP_FCM64         Floating point PFOR (FCM)
+FP_DFCM64        Floating point PFOR (DFCM)
+TurboPFor64      PFOR 64
+TurboPFor64V     PFOR 64
+Simple8b         64 bits Simple family (unstable)
+PC_Simple16      Simple 16. limited to 28 bits
+PC_OptPFD        OptPFD. limited to 28 bits
+PC_Vbyte         Variable byte
+PC_Rice          Rice coding (unstable)
+VarintG8IU       Variable byte SIMD
+MaskedVbyte      Variable byte SIMD
+StreamVbyte      Variable byte SIMD
+FP_FastPFor      PFor scalar (inefficient for small blocks)
+FP_SimdFastPFor  PFor SIMD (inefficient for small blocks)
+FP_OptPFor       OptPFor scalar
+FP_SIMDOptPFor   OptPFor SIMD
+FP_VByte         Variable byte
+FP_Simple8bRLE   Simple-8b + rle
+FP_GROUPSIMPLE   Group Simple
+SC_SIMDPack128   Bit packing (SSE4.1)
+SC_SIMDPack256   Bit packing (SSE4.1)
+SC_For           For (SSE4.1)
+SC_ForDA         For direct access (SSE4.1)
+LibFor_For       For
+LibFor_ForDA     For direct access
+LI_Pack          Bit packing (scalar)
+LI_TurboPack     Bit packing (scalar)
+LI_SuperPack     Bit packing (scalar)
+LI_HorPack       Bit packing (sse4.1 horizontal)
+LI_BMIPack256    Bit packing (avx2)
+lz4              lz4
+lz4_bit          Bitshuffle + [delta]+lz4                     2,4,8
+lz4_nibble       TurboPFor's [delta]+nibble transpose + lz4   2,4,8
+lz4_bitxor       Bitshuffle + [xor]+lz4                       2,4,8
+lz4_nibblexor    TurboPFor's [xor]+nibble transpose + lz4     2,4,8
+lz4_byte         TurboPFor's [delta]+byte transpose + lz4     2,4,8
+BitShuffle       Bit shuffle (simd)                           2,4,8
+
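The 'd', 'd1' and 'z' modifiers listed under "Function syntax" name the transform applied to the integers before they are packed. The sketch below shows what each transform computes on a small array. It is plain C written for illustration and is not TurboPFor code: `delta_encode`, `delta1_encode`, `zigzag_encode` and `zigzag_decode` are hypothetical helper names, and the `start` parameter (the value assumed to precede the first element) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; none of these are TurboPFor functions. */

/* 'd' style: gaps of an increasing list. Duplicates are allowed and give a gap of 0. */
void delta_encode(const uint32_t *in, size_t n, uint32_t start, uint32_t *out) {
    uint32_t prev = start;
    for (size_t i = 0; i < n; i++) {
        assert(in[i] >= prev);          /* input must be sorted */
        out[i] = in[i] - prev;
        prev = in[i];
    }
}

/* 'd1' style: gaps minus 1 of a strictly increasing list.
 * Every gap is at least 1, so subtracting 1 keeps the stored values smaller. */
void delta1_encode(const uint32_t *in, size_t n, uint32_t start, uint32_t *out) {
    uint32_t prev = start;
    for (size_t i = 0; i < n; i++) {
        assert(in[i] > prev);           /* input must be sorted and unique */
        out[i] = in[i] - prev - 1;
        prev = in[i];
    }
}

/* 'z' style: zigzag maps signed values to unsigned ones so that small
 * magnitudes stay small: 0, -1, 1, -2, 2 ... become 0, 1, 2, 3, 4 ... */
uint32_t zigzag_encode(int32_t v) {
    return ((uint32_t)v << 1) ^ (v < 0 ? UINT32_MAX : 0u);
}

/* Inverse of zigzag_encode (assumes a two's complement target). */
int32_t zigzag_decode(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1u)));
}
```

For the sorted, unique list {3, 4, 7, 8} with `start` 0, `delta_encode` produces {3, 1, 3, 1} and `delta1_encode` produces {2, 0, 2, 0}. For unsorted data, one common approach (assumed here, not confirmed by the README) is to pass each signed difference between neighbours through `zigzag_encode`, so that small negative gaps also become small unsigned values.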