Only support one level array now.
for example:
- nullable(array(nullable(tinyint))) is **support**.
- nullable(array(nullable(array(xx))) is **not support**.
ZSTD compression is fast with high compression ratio. It can be used to archive higher compression ratio
than default Lz4f codec for storing cost sensitive data such as logs.
Compared to Lz4f codec, we see zstd codec get 35% compressed size off, 30% faster at first time read without OS page
cache, 40% slower at second time read with OS page cache in the following comparison test.
test data: 25GB text log, 110 million rows
test table: test_table(ts varchar(30), log string)
test SQL: set enable_vectorized_engine=1; select sum(length(log)) from test_table
be.conf: disable_storage_page_cache = true
set this config to disable doris page cache to avoid all data cached in memory for test real decompression speed.
test result
master branch with lz4f codec result:
- compressed size 4.3G
- SQL first exec time(read data from disk + decompress + little computation) : 18.3s
- SQL second exec time(read data from OS pagecache + decompress + little computation) : 2.4s
this branch with zstd codec (hardcode enable it) result:
- compressed size: 2.8G
- SQL first exec time: 12.8s
- SQL second exec time: 3.4s
1. Fix Lru Cache MemTracker consumption value is negative.
2. Fix compaction Cache MemTracker has no track.
3. Add USE_MEM_TRACKER compile option.
4. Make sure the malloc/free hook is not stopped at any time.
* [Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner
1. fix bug of vjson scanner not support `range_from_file_path`
2. fix bug of vjson/vbrocker scanner core dump by src/dest slot nullable is different
3. fix bug of vparquest filter_block reference of column in not 1
4. refactor code to simple all the code
It only changed vectorized load, not original row based load.
Co-authored-by: lihaopeng <lihaopeng@baidu.com>
PaloExternalSourcesService is designed for es_scan_node using tcp protocol.
But es tcp protocol need deploy a tcp jar into es code. Both es version and lucene version are upgraded,
and the tcp jar is not maintained any more.
So that I remove all the related code and thrift definitions.
SQL with JOIN and columns ARRAY, will call function ColumnArray::replicate. At this pr,
we implement replicate for ARRAY type, to support SQL like this:
`SELECT count(lo_array),count(d_array),SUM(lo_extendedprice*lo_discount) AS REVENUE FROM lineorder, date WHERE lo_orderdate = d_datekey AND d_year = 1993 AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25;`
## Design:
For now, there are two categories of types in Doris, one is for scalar types (such as int, char and etc.) and the other is for composite types (array and etc.). For the sake of performance, we can cache type info of scalar types globally (unique objects) due to the limited number of scalar types. When we consider the composite types, normally, the type info is generated in runtime (we can also use some cache strategy to speed up). The memory thereby should be reclaimed when we create type info for composite types.
There are a lots of interfaces to get the type info of a specific type. I reorganized those as the following describes.
1. `const TypeInfo* get_scalar_type_info(FieldType field_type)`
The function is used to get the type info of scalar types. Due to the cache, the caller uses the result **WITHOUT** considering the problems about memory reclaim.
2. `const TypeInfo* get_collection_type_info(FieldType sub_type)`
The function is used to get the type info of array types with just **ONE** depth. Due to the cache, the caller uses the result **WITHOUT** considering the problems about memory reclaim.
3. `TypeInfoPtr get_type_info(segment_v2::ColumnMetaPB* column_meta_pb)`
4. `TypeInfoPtr get_type_info(const TabletColumn* col)`
These functions are used to get the type info of **BOTH** scalar types and composite types. The caller should be responsible to manage the resources returned.
#### About the new type `TypeInfoPtr`
`TypeInfoPtr` is an alias type to `unique_ptr` with a custom deleter.
1. For scalar types, the deleter does nothing.
2. For composite types, the deleter reclaim the memory.
By analyzing the callers of `get_type_info`, these classes should hold TypeInfoPtr:
1. `Field`
2. `ColumnReader`
3. `DefaultValueColumnIterator`
Other classes are either constructed by the foregoing classes or hold those, so they can just use the raw pointer of `TypeInfo` directly for the sake of performance.
1. `ScalarColumnWriter` - holds `Field`
1. `ZoneMapIndexWriter` - created by `ScalarColumnWriter`, use `type_info` from the field in `ScalarColumnWriter`
1. `IndexedColumnWriter` - created by `ZoneMapIndexWriter`, only uses scalar types.
2. `BitmapIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter`
1. `IndexedColumnWriter` - created by `BitmapIndexWriter`, uses `type_info` in `BitmapIndexWriter` and `BitmapIndexWriter` doesn't support `ArrayType`.
3. `BloomFilterIndexWriter` - created by `ScalarColumnWriter`, uses `type_info` from the field in `ScalarColumnWriter`
1. `IndexedColumnWriter` - created by `BloomFilterIndexWriter`, only uses scalar types.
2. `IndexedColumnReader` initializes `type_info` by the field type in meta (only scalar types).
3. `ColumnVectorBatch`
1. `ZoneMapIndexReader` creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `IndexedColumnReader`
2. `BitmapIndexReader` supports scalar types only and it creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `BitmapIndexReader`
3. `BloomFilterIndexWriter` supports scalar types only and it creates `ColumnVectorBatch`, `ColumnVectorBatch` uses `type_info` in `BloomFilterIndexWriter`
Currently, there are 2 status code in BE, one is common/Status.h,
and the other is olap/olap_define.h called OLAPStatus.
OLAPStatus is just an enum type, it is very simple and could not save many informations,
I will unify these code to common/Status.
1. solved the previous delayed unit test file size is too large (1.7G+) and the unit test link time is too long problem problems
2. Unify all unit tests into one file to significantly reduce unit test execution time to less than 3 mins
3. temporarily disable stream_load_test.cpp, metrics_action_test.cpp, load_channel_mgr_test.cpp because it will re-implement part of the code and affect other tests