Commit Graph

302 Commits

Author SHA1 Message Date
75e2bc8a25 [function](bitmap) support bitmap_to_base64 and bitmap_from_base64 (#23759) 2023-09-02 00:58:48 +08:00
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of date type. However hive supports date out of 1970~2038, leading wrong date value in tpcds benchmark.
How to fix:
1. Increase dictionary range: 1900 ~ 2038
2. The date out of 1900 ~ 2038 is regenerated.
2023-09-01 14:40:20 +08:00
62c075bf7e [improvement](Block) Replace Block(const PBlock&) with deserialize because it has heavy operations in ctor (#23672) 2023-08-31 14:44:17 +08:00
94a8fa6bc9 [bug](function) fix explode_number function return wrong rows (#23603)
before the explode_number function result is random with const value.
because the _cur_size is reset, so it's can't insert values to column.
2023-08-29 19:02:49 +08:00
5be8d57f52 [fix](be-ut) fix ColumnFixedLenghtObjectTest on 32 bits system (#23519) 2023-08-28 14:02:05 +08:00
f80b067990 [fix](column) add unimplemented function of ColumnFixedLengthObject (#23468) 2023-08-25 17:38:01 +08:00
9cacf9535a [Opt](functions) Use preloaded cache to accelerate timezone parsing (#22694)
* opt

* bugfix

* fix ut

* fix stylecheck
2023-08-25 10:00:48 +08:00
51ac92f65c Revert "[fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty (#21236)" (#23368)
This reverts commit 1c3cc77a54938ed948ad8186b8dea8385977d23c.
2023-08-23 18:27:35 +08:00
Pxl
8ed4045df9 [Chore](primitive-type) remove VecPrimitiveTypeTraits (#22842) 2023-08-23 08:37:40 +08:00
5ff7b57fc1 [fix](parquet) parquet reader confuses logical/physical/slot id of columns (#23198)
`ParquetReader` confuses logical/physical/slot id of columns. If only reading the scalar types, there's nothing wrong, but when reading complex types, `RowGroup` and `PageIndex` will get wrong statistics. Therefore, if the query contains complex types and pushed-down predicates, the probability of the result set is incorrect.
2023-08-22 13:35:29 +08:00
12075f9853 [pipelineX](projection) Support projection and blocking agg (#23256) 2023-08-21 22:23:02 +08:00
33dfa0c454 [Improve](serde) support text serde for nested type-array/map (#22738)
Now we can not support nested type array/map 
so this pr aim to:
1. add format option for string convert defined datatype to keep with origin from_string
2. support array map can nested array and map
2023-08-21 10:32:28 +08:00
1c3cc77a54 [fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty (#21236)
* [fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty

* add ut

* fix nereids

* fix regression-test
2023-08-18 14:37:49 +08:00
4e880288c6 [refactor]use clear concept to replace std::enable_if_t (#22801)
---------

Signed-off-by: flynn <fenglv15@mails.ucas.ac.cn>
2023-08-12 15:10:30 +08:00
1a8a1e5b16 [Feature](count_by_enum) support count_by_enum function (#22071)
count_by_enum(expr1, expr2, ... , exprN);

Treats the data in a column as an enumeration and counts the number of values in each enumeration. Returns the number of enumerated values for each column, and the number of non-null values versus the number of null values.
2023-08-06 16:05:14 +08:00
b122f9b80c [fix](concat) ColumnString::chars is resized with wrong size (#22610)
FunctionStringConcat::execute_impl resized with size that include string null terminator, which causes ColumnString::chars.size() does not match with ColumnString::offsets.back, this will cause problems for some string functions, e.g. like and regexp.
2023-08-04 19:13:35 +08:00
86e6f5d039 [FIX](decimal)fix decimal precision (#22364)
Now we make wrong for decimal parse from string
if given string precision is bigger than defined decimal precision, we will return a overflow error, but only digit part is bigger than typed digit length , we should return overflow error when we traverse given string to decimal value
2023-08-03 21:13:58 +08:00
f16a39aea1 [feature](time) using timev2 type to replace the old time type. (#22269) 2023-08-01 15:59:07 +08:00
3a11de889f [Opt](exec) opt the performance of date parquet convert by date dict (#22384)
before:

mysql> select count(l_commitdate) from lineitem;
+---------------------+
| count(l_commitdate) |
+---------------------+
| 600037902 |
+---------------------+
1 row in set (0.86 sec)
after:

mysql> select count(l_commitdate) from lineitem;
+---------------------+
| count(l_commitdate) |
+---------------------+
| 600037902 |
+---------------------+
1 row in set (0.36 sec)
2023-08-01 12:24:00 +08:00
d585a8acc1 [Improvement](shuffle) Accumulate rows in a batch for shuffling (#22218) 2023-08-01 09:55:06 +08:00
ec1a4d172b (vertical compaction) fix vertical compaction core (#22275)
* (vertical compaction) fix vertical compaction core
co-author:@zhannngchen
2023-07-28 16:41:00 +08:00
d4a4c172ea [Improve](serde)update serialize and deserialize text for data type (#21109) 2023-07-26 10:06:16 +08:00
Pxl
19ba6bec38 [Improvement](pipeline) support send eos on local exchange and remove some unused code (#22086)
support send eos on local exchange and remove some unused code
2023-07-24 09:25:32 +08:00
ce397a8d32 [FIX](map)fix arrow serde with map null key #21955 2023-07-19 12:09:34 +08:00
b35cfc5d5e [opt](join) Opt the performance of join probe (#21845) 2023-07-19 01:21:22 +08:00
c6063ed92f [Revert](lazy open) revert lazy open and add case (#21821) 2023-07-18 19:41:33 +08:00
cbddff0694 [FIX](map) fix map key-column nullable for arrow serde #21762
arrow is not support key column has null element , but doris default map key column is nullable , so need to deal with if doris map row if key column has null element , we put null to arrow
2023-07-14 00:30:07 +08:00
3163841a3a [FIX](serde)Fix decimal for arrow serde (#21716) 2023-07-12 19:15:48 +08:00
d0eb4d7da3 [Improve](hash-fun)improve nested hash with range #21699
Issue Number: close #xxx

when cal array hash, elem size is not need to seed hash
hash = HashUtil::zlib_crc_hash(reinterpret_cast<const char*>(&elem_size),
                                                   sizeof(elem_size), hash);
but we need to be care [[], [1]] vs [[1], []], when array nested array , and nested array is empty, we should make hash seed to
make difference
2. use range for one hash value to avoid virtual function call in loop.
which double the performance. I make it in ut

column: array[int64]
50 rows , and single array has 10w elements
2023-07-11 14:40:40 +08:00
Pxl
ca71048f7f [Chore](status) avoid empty error msg on status (#21454)
avoid empty error msg on status
2023-07-11 13:48:16 +08:00
736d6f3b4c [improvement](timezone) support mixed uppper-lower case of timezone names (#21572) 2023-07-11 09:37:14 +08:00
8973610543 [feature](datetime) "timediff" supports calculating microseconds (#21371) 2023-07-10 19:21:32 +08:00
7caab87bbe [FIX](serde) fix map/struct/array support arrow #21628
support map/struct support arrow format
fix string arrow format
fix largeInt 128 for arrow builder
2023-07-08 15:51:14 +08:00
b7d6a70868 [FIX](datatype) Implement hash func with array/map/struct type (#21334)
we do not Implement any hash functions in array/map/struct column , so we use sql like this will make be core

select * from (
        select
            bdp.nc_num,
            collect_list(distinct(bd.catalog_name)) as catalog_name,
            material_qty
        from
            dataease.bu_delivery_product bdp
            left join dataease.bu_trans_transfer btt on bdp.delivery_product_id = btt.delivery_product_id
            left join dataease.bu_delivery bd on bdp.delivery_id = bd.delivery_id
        where
            bd.val_status in ('10', '20', '30', '90')
            and bd.delivery_type in (0, 1, 2)
        group by
            nc_num,
            material_qty
        union
        ALL
        select
            bdp.nc_num,
            collect_list(distinct(bd.catalog_name)) as catalog_name,
            material_qty
        from
            dataease.bu_trans_transfer btt
            left join dataease.bu_delivery_product bdp on bdp.delivery_product_id = btt.delivery_product_id
            left join dataease.bu_delivery bd on bdp.delivery_id = bd.delivery_id
        where
            bd.val_status in ('10', '20', '30', '90')
            and bd.delivery_type in (0, 1, 2)
        group by
            nc_num,
            material_qty
) aa;
core :
2023-06-30 17:11:35 +08:00
d76fa427a3 [improve](jsonb)Invalid json path prompts an error instead of null (#19646)
1. Invalid json path prompts an error instead of null:
before:
```sql
mysql> SELECT jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[a]');
+-------------------------------------------------------------+
| jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[a]') |
+-------------------------------------------------------------+
| NULL                                                        |
+-------------------------------------------------------------+
1 row in set (0.01 sec)
```
now
```sql
mysql> SELECT jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[a]');
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Json path error: Invalid Json Path for value: $[a]
```
2. fix some problem: https://github.com/apache/doris/pull/19185
   a. support negative numbers
```sql
mysql> SELECT jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[-2]');
+--------------------------------------------------------------+
| jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[-2]') |
+--------------------------------------------------------------+
| "a"                                                          |
+--------------------------------------------------------------+
1 row in set (0.02 sec)
```
  b. Avoid using unnecessary memory
3. Supplementary regression test
2023-06-30 14:29:21 +08:00
0396f78590 [fix](memory) Remove ChunkAllocator & fix Allocator no use mmap (#21259) 2023-06-28 16:10:24 +08:00
Pxl
b6835840f7 [Bug](table-function) return InvalidArgument when explode_split meet empty delimiter (#20795)
return InvalidArgument when explode_split meet empty delimiter
2023-06-15 15:17:22 +08:00
31a4f96f01 [refactor](exprcontext) move close to expr context's dector method (#20747)
The close method does nothing. But I am not sure we could remove it. So that I add it to dector method and remove many many calls.
2023-06-14 18:01:07 +08:00
Pxl
a15a0b9193 [Chore](build) use file(GLOB_RECURSE xxx CONFIGURE_DEPENDS) to replace set cpp (#20461)
use file(GLOB_RECURSE xxx CONFIGURE_DEPENDS) to replace set cpp
2023-06-08 19:36:21 +08:00
c03a19ea23 [improvement](bitmap) Using set to store a small number of elements to improve performance (#19973)
Test on SSB 100g:

select lo_suppkey, count(distinct lo_linenumber) from lineorder group by lo_suppkey;
exec time: 4.388s

create materialized view:

create materialized view customer_uv as select lo_suppkey, bitmap_union(to_bitmap(lo_linenumber)) from lineorder group by lo_suppkey;
select lo_suppkey, count(distinct lo_linenumber) from lineorder group by lo_suppkey;
exec time: 12.908s

test with the patch, exec time: 5.790s
2023-05-31 16:13:42 +08:00
ab8125d56f [Improve](performance) introduce SchemaCache to cache TabletSchame & Schema (#20037)
* [Improve](performance) introduce SchemaCache to cache TabletSchame & Schema

1. When the system is under high-concurrency load with wide table point queries, the frequent memory allocation and deallocation of Schema become evident system bottlenecks. Additionally, the initialization of TabletSchema and Schema also becomes a CPU hotspot.Therefore, the introduction of a SchemaCache is implemented to cache these resources for reuse.

2. Make some variables wrapped with std::unique<unique_ptr>

Performance:
| 状态              | QPS | 平均响应时间 (avg) | P99 响应时间 |
|------------------|-----|------------------|-------------|
| 开启 SchemaCache | 501 | 20ms             | 34ms        |
| 关闭 SchemaCache | 321 | 31ms             | 61ms        |

* handle schema change with schema version

* remove useless header

* rebase
2023-05-29 17:34:53 +08:00
9f8de89659 [refactor](exec) replace the single pointer with an array of 'conjuncts' in ExecNode (#19758)
Refactoring the filtering conditions in the current ExecNode from an expression tree to an array can simplify the process of adding runtime filters. It eliminates the need for complex merge operations and removes the requirement for the frontend to combine expressions into a single entity.

By representing the filtering conditions as an array, each condition can be treated individually, making it easier to add runtime filters without the need for complex merging logic. The array can store the individual conditions, and the runtime filter logic can iterate through the array to apply the filters as needed.

This refactoring simplifies the codebase, improves readability, and reduces the complexity associated with handling filtering conditions and adding runtime filters. It separates the conditions into discrete entities, enabling more straightforward manipulation and management within the execution node.
2023-05-29 11:47:31 +08:00
9d44918036 [Improve](data-type) Clean datatype uselesscode (#20145)
* fix struct_export out data

* delete useless code with data type
2023-05-28 20:48:29 +08:00
93933308e6 [Feature-WIP](CCR): Add ccr doris interface (WIP) (#17881) 2023-05-26 23:40:49 +08:00
ee34b6de2d [Refact] (serde) refact mysql serde with data type (#19543)
refact mysql output (de)serialize with data type serde , avoid accoriding switch case Primitive type writed in mysqlWriter
2023-05-26 14:11:17 +08:00
Pxl
15a7420661 [Chore](ub) fix some undefined behaviors (#19986)
/home/zcp/repo_center/doris_master/doris/be/src/olap/rowset/segment_v2/column_reader.cpp:895:21: runtime error: load of value 423208544, which is not a valid value for type 'doris::ReaderType'

/home/zcp/repo_center/doris_master/doris/be/src/vec/columns/column_decimal.cpp:260:33: runtime error: load of misaligned address 0x7fa3348b301c for type 'int64_t' (aka 'long'), which requires 8 byte alignment

/home/zcp/repo_center/doris_master/doris/be/src/olap/block_column_predicate.cpp:82:24: runtime error: variable length array bound evaluates to non-positive value 0

/home/zcp/repo_center/doris_master/doris/be/src/vec/columns/column_string.h:225:26: runtime error: null pointer passed as argument 2, which is declared to never be null
2023-05-26 14:08:40 +08:00
a6bd014b8a [FIX](serde)pb ut is not stable #19907 2023-05-22 09:03:02 +08:00
fd4fa5c64e [Optimize](row store) optimize serialization and deserialization (#19691)
1. Get DataTypeSerde in advance to avoid get temporary DataTypeSerde iterate each column
2. Iterate the original row once is enoungh for deserializing by introducing a map for record the index of each column's unique id
2023-05-18 16:22:38 +08:00
294599ee45 [feature](jsonb) rename JSONB type name and function name to JSON (#19774)
To be more compatible with MySQL, rename JSONB type name and function name to JSON.

The old JSONB type name and jsonb_xx function can still be used for backward compatibility.

There is a function jsonb_extract remained since json_extract is used by json string function and more work need to change it. It will be changed further.
2023-05-18 16:16:52 +08:00
79d30cfe46 [feature](compact) Duplicate with no keys tables compaction coredump (#19490)
Co-authored-by: yuxianbing <yuxianbing@yy.com>
2023-05-17 22:22:14 +08:00