199 Commits

Author SHA1 Message Date
Pxl
90d710e83d [Enchancement](function) optimize for padding function && add string length check on string op (#20363) 2023-06-02 21:24:41 +08:00
9e21318834 [refactor](dynamic table) Make segment_writer unaware of dynamic schema, and ensure parsing is exception-safe. (#19594)
1. make ColumnObject exception safe
2. introduce FlushContext and construct schema at memtable flush stage to make segment independent from dynamic schema
3. add more test cases
2023-06-01 10:25:04 +08:00
Pxl
d1d0d9e5e8 [Chore](build) adjust some compile diagnostic (#20162) 2023-05-29 19:19:01 +08:00
Pxl
15a7420661 [Chore](ub) fix some undefined behaviors (#19986)
/home/zcp/repo_center/doris_master/doris/be/src/olap/rowset/segment_v2/column_reader.cpp:895:21: runtime error: load of value 423208544, which is not a valid value for type 'doris::ReaderType'

/home/zcp/repo_center/doris_master/doris/be/src/vec/columns/column_decimal.cpp:260:33: runtime error: load of misaligned address 0x7fa3348b301c for type 'int64_t' (aka 'long'), which requires 8 byte alignment

/home/zcp/repo_center/doris_master/doris/be/src/olap/block_column_predicate.cpp:82:24: runtime error: variable length array bound evaluates to non-positive value 0

/home/zcp/repo_center/doris_master/doris/be/src/vec/columns/column_string.h:225:26: runtime error: null pointer passed as argument 2, which is declared to never be null
2023-05-26 14:08:40 +08:00
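The misaligned `int64_t` load in the sanitizer report above belongs to a common class of undefined behavior. A minimal sketch of the usual remedy, which is not necessarily what this commit does, is to copy the bytes with `std::memcpy` instead of dereferencing a cast pointer. The helper names `load_unaligned_i64` and `store_unaligned_i64` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads an int64_t from an address that may not be
// 8-byte aligned. memcpy has no alignment requirement, so this avoids the
// undefined behavior of *reinterpret_cast<const int64_t*>(p).
inline int64_t load_unaligned_i64(const void* p) {
    int64_t value;
    std::memcpy(&value, p, sizeof(value));
    return value;
}

// Hypothetical helper: writes an int64_t to a possibly misaligned address.
inline void store_unaligned_i64(void* p, int64_t value) {
    std::memcpy(p, &value, sizeof(value));
}
```

Compilers typically lower such a fixed-size `memcpy` to a single load or store on platforms that allow unaligned access, so the fix usually costs nothing at runtime.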
e9a4cbcdf9 [Refact](type system) refact column with arrow serde (#19091)
* refact arrow serde

* add date serde

* update arrow and fix nullable and date type
2023-05-04 15:28:46 +08:00
8eab20d3df [bugfix](low cardinality) cached code is wrong will result wrong query result when many null pages (#19221)
Sometimes the dict is not initialized when the comparison predicate runs here. For example, when a full page is null the reader skips reading it, so the dictionary is not initialized. Caching the code is wrong in this case, because a following page may not be null and the dict will contain items later.
As a result, queries on a dict string column return wrong results if the column contains many null values.
This change also adds regression tests for equal, greater-than, and less-than queries on dict columns.

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-29 21:28:41 +08:00
b81b470d4f [fix](planner) fix pr "using crchash replace murmurhash in the runtime filter" (#18759) 2023-04-23 10:33:35 +08:00
1ffd34f6f1 [Refact](type system)refact interconversion for jsonb with column (#18819)
* refact jsonb to column

* update

* fix format

* fixed

* fix file head for compile
2023-04-22 14:01:05 +08:00
8cc0af150a [Fix](dynamic table) fix dynamic table with insert into and column al… (#18808)
1. The num_rows should be correctly set
2. insert into has no dynamic column
2023-04-21 11:19:00 +08:00
e412dd12e8 [chore](build) Use include-what-you-use to optimize includes (PART II) (#18761)
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
2023-04-19 23:11:48 +08:00
0b379de602 [refactor](scan) optimize the agg function of count(1) (#18739) 2023-04-19 09:10:51 +08:00
79c446c89f [enhancement](exception) Column filter/replicate supports exception safety (#18503) 2023-04-18 19:23:09 +08:00
0b074ade02 [fix](const column) fix coredump caused by const column for some functions (#18737) 2023-04-18 13:57:55 +08:00
4335c9998f [chore](ARM) Add some vectorization compatibility code on aarch64 (#18553)
Update sse2neon to support more SSE code on ARM CPUs.
2023-04-13 10:15:33 +08:00
43392918cd [Optimization](functions)Optimize function call for const columns. (#18310) 2023-04-12 11:11:01 +08:00
Pxl
c9b4eaea76 [Chore](storage) change FieldType to enum class #18500 2023-04-10 08:53:44 +08:00
f38e00b4c0 [refactor](typesystem) using typeindex to create column instead of type name because type name is not stable (#18328)
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-09 18:08:31 +08:00
30f2abe5d3 [FIX](Map)fix calculate map offset in olap convertor (#18295)
Fix a BE core dump when loading a large amount of key-value data into one map row.
2023-04-07 17:04:08 +08:00
66a0c090b8 [fix](column) Add unimplemented replicate function in ColumnStruct (#18368) 2023-04-06 09:50:27 +08:00
f800ba8f4c [Exec](opt) Optimize function call for const columns (#18212) 2023-03-31 11:36:21 +08:00
d9fe5f7b67 [enhancement](memory) Remove MemPool and replace it with Arena (#17820)
Arena can replace MemPool in most scenarios. The exception is memory reuse: MemPool can reuse previously allocated chunks after clear(), but Arena cannot.

Some comparisons between MemPool and Arena:

 1. Expansion
     Arena: below 128M, chunk sizes grow exponentially with base 2; above 128M, it allocates 128M * n > `size`, where n is the smallest value that satisfies the expression.
     MemPool: below 512K, chunk sizes grow exponentially with base 2; above 512K, it allocates a separate chunk of length `size`.

     Once Arena has allocated a chunk larger than 128M, every later chunk is at least 128M, which may waste memory. MemPool behaves similarly: once a 512K chunk has been allocated, every later chunk is at least 512K.

 2. Alignment
     MemPool defaults to 16-byte alignment, because memtable and other places that use int128 require 16-byte alignment.
     Arena has no default alignment.

 3. Memory reuse
     Arena only supports `rollback`, which reuses the memory of the current chunk, usually the most recently requested memory.
     MemPool supports clear(), after which all chunks can be reused, or ReturnPartialAllocation(), which rolls back the most recently requested memory. If the last chunk has no free memory, it searches for the chunk with the most free space.

 4. Realloc
     Arena supports realloc of contiguous memory; it also supports reallocating contiguous memory from any position in the last allocation. The differences between `alloc_continue` and `realloc` are:
         1. `alloc_continue` does not need the old size to be specified; the default old size = head->pos - range_start
         2. `alloc_continue` supports expansion from range_start when additional_bytes is between head and pos, which is equivalent to reusing part of the memory, while `realloc` always allocates new memory
     MemPool does not support realloc, but supports transferring or absorbing chunks between two MemPools.

 5. Check mem limit
     MemPool checks the mem limit itself; Arena checks at the Allocator layer.

 6. Support for ASAN
     Arena does something extra

 7. Error handling
     MemPool returns the error message of a failed allocation directly through `Status`; Arena throws an Exception.

Tests that Arena can consider:

 1. Once the last allocated chunk is larger than 128M, every later chunk is at least 128M, which seems to waste memory.

 2. Support clear() and memory reuse.

 3. Add a large-allocation list: allocate memory larger than 128M with a size equal to `size`, so that the current chunk is not left partially unused, which is wasteful.

 4. In some cases, it may be possible to allocate backwards to find chunks t
2023-03-29 20:56:49 +08:00
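The Arena expansion rule described in this message can be sketched as follows. This is one reading of the text, not the actual Arena code: `next_arena_chunk_size` and `kLinearGrowthThreshold` are hypothetical names, and interpreting the base-2 growth below 128M as "double the current chunk, or use the request if it is larger" is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// 128M, the point where growth switches from exponential to linear
// (threshold value taken from the commit message; the name is hypothetical).
constexpr std::size_t kLinearGrowthThreshold = 128ULL * 1024 * 1024;

// Hypothetical sketch of the size of the next chunk to allocate.
// current_chunk_size: size of the most recently allocated chunk.
// requested: number of bytes the caller needs (`size` in the message).
inline std::size_t next_arena_chunk_size(std::size_t current_chunk_size,
                                         std::size_t requested) {
    if (current_chunk_size < kLinearGrowthThreshold) {
        // Assumed exponential phase: double the chunk, but never return
        // less than the request.
        return std::max(requested, current_chunk_size * 2);
    }
    // Linear phase: smallest n such that 128M * n > requested.
    const std::size_t n = requested / kLinearGrowthThreshold + 1;
    return n * kLinearGrowthThreshold;
}
```

Under this reading, a small request made after the threshold has been reached still yields a full 128M chunk, which is the memory-waste concern the message raises.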
6b6682cd96 [Enhancement](Expr) Opt In Set by small size fixed container to improve performance. (#17976) 2023-03-28 23:10:39 +08:00
78abb40fdc [improvement](string) throw exception instead of log fatal if string column exceed total size limit (#17989)
Throw an exception instead of logging a fatal error if a string column exceeds the total size limit, so that the exception can be caught and the query fails instead of the BE exiting.
2023-03-27 08:55:26 +08:00
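The pattern described above, failing one query rather than the whole process, might look roughly like this. Everything here is illustrative: `kMaxStringColumnBytes`, `check_string_column_size`, and `try_append` are hypothetical names, the limit value is made up, and `std::length_error` stands in for whatever exception type the codebase actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Assumed limit for illustration only; the real limit is not given in the message.
constexpr std::size_t kMaxStringColumnBytes = 1024;

// Hypothetical check: throws instead of aborting the process when the
// column would grow past the limit. Written to avoid size_t overflow.
inline void check_string_column_size(std::size_t current_bytes,
                                     std::size_t appended_bytes) {
    if (current_bytes > kMaxStringColumnBytes ||
        appended_bytes > kMaxStringColumnBytes - current_bytes) {
        throw std::length_error("string column exceeds total size limit");
    }
}

// Hypothetical caller: catches the exception so that only this operation
// fails. total_bytes is left unchanged on failure.
inline bool try_append(std::size_t& total_bytes, const std::string& value,
                       std::string* error) {
    try {
        check_string_column_size(total_bytes, value.size());
        total_bytes += value.size();
        return true;
    } catch (const std::length_error& e) {
        *error = e.what();
        return false;
    }
}
```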
Pxl
a8753faeb1 [Bug](function) fix column complex not resize after filter (#18043) 2023-03-25 21:48:13 +08:00
7ae51c856e [refactor](unify exception) unify exception definition and error code (#18006)
* [refactor](unify exception) unify exception definition and error code


---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-25 12:41:07 +08:00
e8b9587fe6 [Improvement](dict) compute hash only if needed (#18058) 2023-03-24 11:45:58 +08:00
1999cccde9 [feature](array-type) Unique table support array value (#17024)
Unique table support array value

---------

Co-authored-by: huangqixiang.871 <huangqixiang.871@bytedance.com>
2023-03-24 10:18:59 +08:00
Pxl
40ca250678 [Feature](materialized-view) support where clause on create materialized view (#17534)
support where clause on create materialized view
2023-03-22 11:25:13 +08:00
4193884a32 [feature](array_zip) Support array_zip function (#17696) 2023-03-21 18:44:30 +08:00
bd8e3e6405 [refactor](date) unify DateTimeValue and VecDateTimeValue (#17670) 2023-03-20 16:27:08 +08:00
dd53bc1c8d [unify type system](remove unused type desc) remove some code (#17921)
There are many type definitions in BE. We should unify the type system to simplify development.

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-19 14:05:02 +08:00
5c5dcfda78 Revert "[enhancement](memory) PODArray replaces MemPool in PredicateColumn (#17800)" (#17910)
This reverts commit 17d1c1bc7f6cc95eecd224eaa219c976b60fa17e.
2023-03-17 20:50:01 +08:00
043f77200f [Bug](dynamic-table) Fix column alignment logic and support filtering null values when slot is not null (#17842)
Before this PR, null values in columns specified as `NOT NULL` were not filtered; this behavior does not match the original load behavior.
Second, the column alignment logic has a bug:

```
template <typename ColumnInserterFn>
void align_variant_by_name_and_type(ColumnObject& dst, const ColumnObject& src, size_t row_cnt,
                                    ColumnInserterFn inserter) {
    CHECK(dst.is_finalized() && src.is_finalized());
    // Use rows() here instead of size(), since size() will check_consistency
    // but we could not check_consistency since num_rows will be upgraded even
    // if src and dst is empty, we just increase the num_rows of dst and fill
    // num_rows of default values when meet new data
    size_t num_rows = dst.rows();
```
2023-03-17 16:53:30 +08:00
5d3de05976 [feature](map) basic functions for map datatype (#16916)
basic functions for map datatype:
- MAP<K, V> map(K k1, V v1, ...)
- BIGINT map_size(MAP<K, V> m)
- BOOL map_contains_key(MAP<K, V> m, K k1)
- BOOL map_contains_value(MAP<K, V> m, V v1)
- ARRAY<K> map_keys(MAP<K, V> m)
- ARRAY<V> map_values(MAP<K, V> m)
2023-03-17 10:28:17 +08:00
17d1c1bc7f [enhancement](memory) PODArray replaces MemPool in PredicateColumn (#17800)
MemPool is about to be removed, replaced by Arena and PODArray.
2023-03-16 09:01:28 +08:00
5b39fa9843 [Feature](vec)(quantile_state): support quantile state in vectorized engine (#16562)
* [Feature](vectorized)(quantile_state): support vectorized quantile state functions
1. the quantile column currently supports only the not-nullable case
2. add some regression test cases
3. set default enable_quantile_state_type = true
---------

Co-authored-by: spaces-x <weixiang06@meituan.com>
2023-03-14 10:54:04 +08:00
9b7596f1c6 [Feature](Dynamic schema table) step1 support schema change expression (#17494)
1. introduce a new type `VARIANT` to encapsulate dynamically generated columns, hiding the details of the types and names of newly generated columns
2. introduce a new expression `SchemaChangeExpr` that performs schema changes, for extensibility
2023-03-13 15:12:42 +08:00
a79b8ede88 [Bug](ColumnArray) Fix array column replicate replicate_offsets not matched (#17616)
the input replicate_offsets should be the same size as ColumnArray's offset.
```
IColumn::Offsets replicate_offsets(get_offsets().size(), 0);
// |---------------------|-------------------------|-------------------------|
// [0, begin)             [begin, begin + count_sz)  [begin + count_sz, size())
//  do not need to copy    copy counts[n] times       do not need to copy
```

we should
2023-03-10 11:52:22 +08:00
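The size requirement on `replicate_offsets` can be illustrated with a simplified array column. This is a sketch under assumptions, not the ColumnArray implementation: the `ArrayColumn` struct is hypothetical, and it assumes that `replicate_offsets[i]` is the cumulative number of output rows after input row i, so that row i is copied `replicate_offsets[i] - replicate_offsets[i - 1]` times.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical simplified array column: row i holds
// data[offsets[i - 1] .. offsets[i]), with offsets[-1] taken as 0.
struct ArrayColumn {
    std::vector<int> data;
    std::vector<std::size_t> offsets;
};

// Copies each row (replicate_offsets[i] - replicate_offsets[i - 1]) times.
// replicate_offsets must have one entry per row, i.e. the same size as
// src.offsets; otherwise rows and counts would be mismatched.
inline ArrayColumn replicate(const ArrayColumn& src,
                             const std::vector<std::size_t>& replicate_offsets) {
    if (replicate_offsets.size() != src.offsets.size()) {
        throw std::invalid_argument("replicate_offsets size must match offsets size");
    }
    ArrayColumn dst;
    std::size_t prev_replicate = 0;
    std::size_t row_begin = 0;
    for (std::size_t i = 0; i < src.offsets.size(); ++i) {
        const std::size_t times = replicate_offsets[i] - prev_replicate;
        const std::size_t row_end = src.offsets[i];
        for (std::size_t t = 0; t < times; ++t) {
            dst.data.insert(dst.data.end(),
                            src.data.begin() + static_cast<std::ptrdiff_t>(row_begin),
                            src.data.begin() + static_cast<std::ptrdiff_t>(row_end));
            dst.offsets.push_back(dst.data.size());
        }
        prev_replicate = replicate_offsets[i];
        row_begin = row_end;
    }
    return dst;
}
```

For example, rows `[1, 2]`, `[]`, `[3]` with `replicate_offsets = {2, 2, 3}` produce `[1, 2]`, `[1, 2]`, `[3]`: the first row twice, the empty row dropped, and the last row once.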
06dee69174 [Refactor](map) remove using column array in map to reduce offset column (#17330)
1. remove column array in map
2. add offsets column in map
The aim is to remove the duplicate offsets of the key array and value array on disk.
2023-03-09 11:22:26 +08:00
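The layout this refactor describes, one offsets column shared by keys and values instead of two arrays that each carry their own offsets, can be sketched as below. The `MapColumn` struct is a hypothetical simplification with fixed key and value types; `map_size` and `map_contains_key` only borrow the names of the SQL functions and are not the BE implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified map column. Row i owns the entries in
// [offsets[i - 1], offsets[i]) of both keys and values, so a single
// offsets vector describes both nested columns.
struct MapColumn {
    std::vector<std::string> keys;
    std::vector<int> values;
    std::vector<std::size_t> offsets;
};

// First entry index of a row.
inline std::size_t row_begin(const MapColumn& col, std::size_t row) {
    return row == 0 ? 0 : col.offsets[row - 1];
}

// Number of key-value pairs in a row.
inline std::size_t map_size(const MapColumn& col, std::size_t row) {
    return col.offsets[row] - row_begin(col, row);
}

// Whether the given row contains the key (linear scan over the row's keys).
inline bool map_contains_key(const MapColumn& col, std::size_t row,
                             const std::string& key) {
    for (std::size_t i = row_begin(col, row); i < col.offsets[row]; ++i) {
        if (col.keys[i] == key) {
            return true;
        }
    }
    return false;
}
```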
368e6a4f9c [Bug](array filter) Fix bug due to ColumnArray::filter_generic invalid inplace size_at after set_end_ptr (#17554)
We should create a new PodArray and add items to it instead of doing it in place.
2023-03-09 10:59:29 +08:00
Pxl
e2ac06d6d6 [Chore](execution) change PipelineTaskState to enum class && remove some row-based code (#17300)
1. change PipelineTaskState to enum class
2. remove some row-based code on FoldConstantExecutor::_get_result
3. reduce memcpy on minmax runtime filter function(Now we can guarantee that the input data is aligned)
4. add Wunused-template check, and remove some unused function, change some static function to inline function.
2023-03-08 12:41:15 +08:00
8ccc805cd0 [Fix](Lightweight schema Change) query error caused by array default type is unsupported (#17331)
We already support the array type default [], but when lightweight schema change is used to add an array type column, the query fails as follows:

Fix "array default type is unsupported" error.
Fix the default value filling assignment digit problem.
2023-03-07 16:30:41 +08:00
caacee253d [fix](olap)Crashing caused by IS NULL expression (#17463)
Issue Number: close #17462
2023-03-07 15:32:52 +08:00
e82b827bc8 [optimize](vectorization)Optimize to_string's performance. (#17076) 2023-03-03 10:35:59 +08:00
48ef61780d [refactor](struct-type) refactor and clean unused code for struct type (#17257)
remove unused code for struct type
2023-03-01 15:49:31 +08:00
a1db5c6f52 [fix](vec) crash caused by not-implemented function in ColumnFixedLengthObject (#17215) 2023-02-28 15:27:06 +08:00
91fc9fae8e [Bug](complex-type) Fix is null predicate in delete stmt for array/struct/map type (#17018) 2023-02-23 15:06:49 +08:00
08adf914f9 [improvement](vec) avoid creating a new column while filtering mutable columns (#16850)
Currently, when filtering a column, a new column is created to store the filtering result, which causes some performance loss. With this change, ssb-flat without pushdown expr goes from 19s to 15s.
2023-02-21 09:47:21 +08:00
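The idea of filtering a mutable column without allocating a result column can be sketched with a plain vector. This is a generic compaction loop under assumed types, not the code from this commit; `filter_in_place` is a hypothetical name and the byte-per-row filter format is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Keeps data[i] where filter[i] != 0, compacting the survivors to the
// front of the same buffer. Returns the number of rows kept.
inline std::size_t filter_in_place(std::vector<int>& data,
                                   const std::vector<std::uint8_t>& filter) {
    if (data.size() != filter.size()) {
        throw std::invalid_argument("filter size must match column size");
    }
    std::size_t write = 0;
    for (std::size_t read = 0; read < data.size(); ++read) {
        if (filter[read] != 0) {
            // write never overtakes read, so no unread row is overwritten.
            data[write++] = data[read];
        }
    }
    data.resize(write);
    return write;
}
```

This only works when the caller owns the column exclusively; a column shared with other readers would still need a copy.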
ef2fdb79bb [Improvement](parquet-reader) Optimize and refactor parquet reader to improve performance. (#16818)
Optimize and refactor parquet reader to improve performance.
- Improve 2x performance for small dict string by aligned copying.
- Refactor code to decrease condition(if) checking.
- Don't call skip(0).
- Don't read page index if no condition.

**ssb-flat-100**: (single-machine, single-thread)
| Query        | before opt           | after opt  |
| ------------- |:-------------:| ---------:|
| SELECT count(lo_revenue) FROM lineorder_flat       | 9.23   | 9.12 |
| SELECT count(lo_linenumber) FROM lineorder_flat | 4.50    | 4.36 |
| SELECT count(c_name) FROM lineorder_flat             | 18.22 | 17.88| 
| **SELECT count(lo_shipmode) FROM lineorder_flat**     |**10.09** | **6.15**|
2023-02-20 11:42:29 +08:00
f08c1222cc [Opt](exec) Refactor the code and logical functions to SIMD the code (#16785) 2023-02-16 16:55:12 +08:00