Commit Graph

1195 Commits

Author SHA1 Message Date
268c867679 [Improve](serde)replace function_cast from_string to serde (#24087)
Now we can not support streamload with column which is map/array nested map/array
serde can do this now , so we can replace it
Notice. if item data in complex type data is empty we just return error, instead of makeup default value , because now we can not define right default for complex type
2023-09-14 13:53:16 +08:00
563c3f75ff [feature](move-memtable) share delta writer v2 among sinks (#24066) 2023-09-13 14:39:29 +08:00
c7ae2a7d22 [Refactor & Bugfix](static variables) move some static vairables to exec_env (#24029) 2023-09-13 09:27:03 +08:00
d8ef9dda59 [feature](merge-cloud) Rewrite FS interface (#23953) 2023-09-12 19:20:25 +08:00
4bb9a12038 [function](bitmap) support bitmap_remove (#24190) 2023-09-12 14:52:04 +08:00
bdacefa734 [Fix](status)Fix leaky abstraction and shield the status code END_OF_FILE from upper layers (#24165) 2023-09-12 11:10:52 +08:00
1228995dec [improvement](segment) reduce memory footprint of column_reader and segment (#24140) 2023-09-11 21:54:00 +08:00
dbb9365556 [Enhance](ip)optimize priority_ network matching logic for be (#23795)
Issue Number: close #xxx

If the user has configured the wrong priority_network, direct startup failure to avoid users mistakenly assuming that the configuration is correct
If the user has not configured p_ n. Select only the first IP from the IPv4 list, rather than selecting from all IPs, to avoid users' servers not supporting IPv4
extends #23784
2023-09-11 18:32:31 +08:00
a0fcc30764 [Fix](Status) Handle status code correctly and add a new error code ENTRY_NOT_FOUND (#24139) 2023-09-11 09:32:11 +08:00
f9a75b5c4f [feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352)
1. append csv serde for serialize to csv and deserialize from csv.
2. let csvReader use csv serde not text_converter.
2023-09-10 00:16:21 +08:00
537369f4e2 [Fix](http) Fix curl return HTTP_ERROR && Add not_found HttpClientTest, fix (#23984)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-09-07 10:10:51 +08:00
2f8b075b71 [improvement](bitmap) support version for ser/deser of bitmap (#23959) 2023-09-07 09:55:29 +08:00
d183e08f6d [opt](MergedIO) optimize merge small IO, prevent amplified read (#23849)
There were two vulnerabilities in the previous fix(https://github.com/apache/doris/pull/20305):
1. `next_content` may not necessarily be a truly readable range
2. The last range of the merged data may be the hollow

This PR fundamentally solves the problem of reading amplification by rechecking the calculation range. According to the algorithm, there is only one possibility of generating read amplification, with only a small content of data within the 4k(`MIN_READ_SIZE `) range. However, 4k is generally the minimum IO size and there is no need for further segmentation.
2023-09-06 22:45:31 +08:00
95ae5376f3 [Fix](BinaryPrefixPage) stop to read values when current pos reached the end of the page in BinaryPrefixPageDecoder::next_batch (#23855) 2023-09-06 16:34:38 +08:00
Pxl
a96adc01aa [Chore](function) refactor of quantile_state (#23862)
refactor of quantile_state
2023-09-06 15:39:19 +08:00
09bcedb116 [feature](merge-cloud) Remove deprecated old cache (#23881)
* Remove deprecated old cache
2023-09-06 08:07:05 +08:00
a542f107db [feature](move-memtable) buffer messages in load stream stub (#23721) 2023-09-02 13:42:34 +08:00
75e2bc8a25 [function](bitmap) support bitmap_to_base64 and bitmap_from_base64 (#23759) 2023-09-02 00:58:48 +08:00
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of date type. However hive supports date out of 1970~2038, leading wrong date value in tpcds benchmark.
How to fix:
1. Increase dictionary range: 1900 ~ 2038
2. The date out of 1900 ~ 2038 is regenerated.
2023-09-01 14:40:20 +08:00
25b6e4deb2 [fix](daemon) Fix incorrect initialization order of daemon services (#23578)
Current initialization dependency:

      Daemon ───┬──► StorageEngine ──► ExecEnv ──► Disk/Mem/CpuInfo
                │
                │
BackendService ─┘
However, original code incorrectly initialize Daemon before StorageEngine.
This PR also stop and join threads of daemon services in their dtor, to ensure Daemon services release resources in reverse order of initialization via RAII.
2023-08-31 19:46:38 +08:00
b3a9c247af [refactor](move-memtable) add load stream stub (#23642) 2023-08-31 19:39:34 +08:00
f1e43fcaa4 [opt](cache) Support segment cache dynamic opening and closing (#23659)
Dynamically modify the config to clear the cache, each time the disable cache will only be cleared once.
TODO, Support page cache and other caches.

curl -X POST http://xxxx:8040/api/update_config?disable_segment_cache=true
2023-08-31 18:48:26 +08:00
62c075bf7e [improvement](Block) Replace Block(const PBlock&) with deserialize because it has heavy operations in ctor (#23672) 2023-08-31 14:44:17 +08:00
3e4ee3c1e6 [fix](jdbc catalog) fix jdbc driver cache load error (#23656)
log error:
`W20230830 11:19:47.495721 3046231 status.h:363] meet error status: [INTERNAL_ERROR]user function's name should be function_id.checksum[.file_name].file_type, now the all split parts are by delimiter(.): 7119053928154065546.20c8228267b6c9ce620fddb39467d3eb.postgresql-42.5.0.jar`

When the jdbc driver had `.` in its name we failed to split it properly
2023-08-31 10:17:15 +08:00
14310ad30b [improvement](move-memtable) wait StreamClose from remote (#23605)
* [fix](move-memtable) wait StreamClose from remote
2023-08-30 18:03:36 +08:00
1ac0ff0ea9 [feature](delete-predicate) support delete sub predicate v2 (#22442)
New structure for delete sub predicate.
Delete sub predicate uses a string type condition_str to stored temporarily now and fields will be extracted from it using std::regex, which may introduces stack overflow when matching a extremely large string(bug of libc).

Now we attempt to use a new PB structure to hold the delete sub predicate, to avoid that problem.

message DeleteSubPredicatePB {
    optional int32 column_unique_id = 1;
    optional string column_name = 2;
    optional string op = 3;
    optional string cond_value = 4;
}
Currently, 2 versions of sub predicate will both be filled. For query, we use the v2, and during compaction we still use v1. The old rowset meta with delete predicates which had sub predicate v1 will be attempted to convert to v2 when read from PB. Moreover, efforts will be made to rewrite these meta with the new delete sub predicate.

Make preparation to use column unique id to specify a column globally.
Using the column unique id rather than the column name to identify a column is vital for flexible schema change. The rewritten delete predicate will attach column unique id.
2023-08-29 19:37:23 +08:00
94a8fa6bc9 [bug](function) fix explode_number function return wrong rows (#23603)
before the explode_number function result is random with const value.
because the _cur_size is reset, so it's can't insert values to column.
2023-08-29 19:02:49 +08:00
da9eb79ac4 [Enhancement](Schema hash) Remove schema hash in tablet info (#23516) 2023-08-29 10:05:12 +08:00
9c65b7ab96 [improvement](column_reader) move load once to index reader to reduce (#23537)
memory footprint of column reader
2023-08-29 09:34:27 +08:00
5be8d57f52 [fix](be-ut) fix ColumnFixedLenghtObjectTest on 32 bits system (#23519) 2023-08-28 14:02:05 +08:00
e0bf621fe0 [chore](build) Fix compilation errors for BE UT (#23535)
Issue Number: close #23536

This issue was introduced by #23414 .
2023-08-27 11:52:13 +08:00
f80b067990 [fix](column) add unimplemented function of ColumnFixedLengthObject (#23468) 2023-08-25 17:38:01 +08:00
d8e499cb55 [fix](UT) fix flaky test in LoadStreamMgrTest (#23459) 2023-08-25 13:53:20 +08:00
9cacf9535a [Opt](functions) Use preloaded cache to accelerate timezone parsing (#22694)
* opt

* bugfix

* fix ut

* fix stylecheck
2023-08-25 10:00:48 +08:00
71071ba057 [feature](move-memtable)[4/7] add stream sink file writer (#23416)
Co-authored-by: laihui <1353307710@qq.com>
2023-08-25 00:08:27 +08:00
98d0a2f6c1 [feature](move-memtable)[3/7] add load stream manager and rpc service (#23415)
Co-authored-by: zhengyu <freeman.zhang1992@gmail.com>
Co-authored-by: Yongqiang YANG <dataroaring@gmail.com>
Co-authored-by: laihui <1353307710@qq.com>
2023-08-25 00:08:04 +08:00
51ac92f65c Revert "[fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty (#21236)" (#23368)
This reverts commit 1c3cc77a54938ed948ad8186b8dea8385977d23c.
2023-08-23 18:27:35 +08:00
Pxl
8ed4045df9 [Chore](primitive-type) remove VecPrimitiveTypeTraits (#22842) 2023-08-23 08:37:40 +08:00
bcdb481374 [refactor](fragment) refactor non pipeline fragment executor (#23281)
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-08-22 16:00:34 +08:00
5ff7b57fc1 [fix](parquet) parquet reader confuses logical/physical/slot id of columns (#23198)
`ParquetReader` confuses logical/physical/slot id of columns. If only reading the scalar types, there's nothing wrong, but when reading complex types, `RowGroup` and `PageIndex` will get wrong statistics. Therefore, if the query contains complex types and pushed-down predicates, the probability of the result set is incorrect.
2023-08-22 13:35:29 +08:00
0d7a61ae8c [fix](load) fix duplicate register of memtable writer in memory limiter (#23205) 2023-08-22 10:05:17 +08:00
12075f9853 [pipelineX](projection) Support projection and blocking agg (#23256) 2023-08-21 22:23:02 +08:00
33dfa0c454 [Improve](serde) support text serde for nested type-array/map (#22738)
Now we can not support nested type array/map 
so this pr aim to:
1. add format option for string convert defined datatype to keep with origin from_string
2. support array map can nested array and map
2023-08-21 10:32:28 +08:00
1c3cc77a54 [fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty (#21236)
* [fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty

* add ut

* fix nereids

* fix regression-test
2023-08-18 14:37:49 +08:00
330f369764 [enhancement](file-cache) limit the file cache handle num and init the file cache concurrently (#22919)
1. the real value of BE config `file_cache_max_file_reader_cache_size` will be the 1/3 of process's max open file number.
2. use thread pool to create or init the file cache concurrently.
    To solve the issue that when there are lots of files in file cache dir, the starting time of BE will be very slow because
    it will traverse all file cache dirs sequentially.
2023-08-17 16:52:08 +08:00
6cf1efc997 [refactor](load) use smart pointers to manage writers in memtable memory limiter (#23019) 2023-08-16 16:34:57 +08:00
4510e16845 [improvement](delete) support delete predicate on value column for merge-on-write unique table (#21933)
Previously, delete statement with conditions on value columns are only supported on duplicate tables. After we introduce delete sign mechanism to do batch delete, a delete statement with conditions on value columns on unique tables will be transformed into the corresponding insert into ..., __DELETE_SIGN__ select ... statement. However, for unique table with merge-on-write enabled, the overhead of inserting these data can be eliminated. So this PR add the ability to allow delete predicate on value columns for merge-on-write unique tables.
2023-08-16 12:18:05 +08:00
4e880288c6 [refactor]use clear concept to replace std::enable_if_t (#22801)
---------

Signed-off-by: flynn <fenglv15@mails.ucas.ac.cn>
2023-08-12 15:10:30 +08:00
b9b9071c9b [improvement](create partition) create partition require quorum replicas succ (#22554) 2023-08-11 11:59:05 +08:00
71807ceb5f [Enhancement](tvf) Table value function support reading local file (#17404)
I tested the local tvf with tpch queries. First, generate `lineitem` datasets with 6001215 rows, and load it into `lineitem` table by:
```
insert into lineitem select c11, c1, c4, c2, c3, c5, c6, c7, c8, c9, c10, c12, c13, c14, c15, c16 
from local(
        "file_path" = "tools/tpch-tools/bin/tpch-data/lineitem.tbl.1", 
        "backend_id" = "10003", 
        "format" = "csv", 
        "column_separator" = "|"
);
```
Then, run `q1` and `q16` tpch queries, the query result is correct.

It can also analyze the BE's log directly like:

```
mysql> select * from local(
        "file_path" = "log/be.out",
        "backend_id" = "10006",
        "format" = "csv")
       where c1 like "%start_time%" limit 10;
+--------------------------------------------------------+
| c1                                                     |
+--------------------------------------------------------+
| start time: 2023年 08月 07日 星期一 23:20:32 CST       |
| start time: 2023年 08月 07日 星期一 23:32:10 CST       |
| start time: 2023年 08月 08日 星期二 00:20:50 CST       |
| start time: 2023年 08月 08日 星期二 00:29:15 CST       |
+--------------------------------------------------------+
```
2023-08-10 20:07:42 +08:00