doris

Author	SHA1	Message	Date
Xinyi Zou	bd4bfa8f00	[fix](memtracker) Fix thread mem tracker try consume accuracy #12782	2022-09-21 09:20:41 +08:00
AlexYue	c72a19f410	[BugFix](VExprContext) capture error status to prevent incorrect func call which causes coredump #12779	2022-09-21 09:20:16 +08:00
AlexYue	f1539761e8	[Bugfix](string_functions) rearrange code to avoid global buffer overflow in FindInSetOp::execute (#12677 )	2022-09-21 09:19:38 +08:00
Zhengguo Yang	e70c298e0c	[Bugfix](mem) Fix memory limit check may overflow (#12776 ) The bug occurs because subtracting mixed signed and unsigned numbers may overflow when the result would be negative.	2022-09-20 18:18:23 +08:00
Ashin Gau	b837b2eb95	[feature-wip](parquet-reader) filter rows by page index (#12664 ) # Proposed changes [Parquet v1.11+ supports page skipping](https://github.com/apache/parquet-format/blob/master/PageIndex.md), which helps the scanner reduce the amount of data scanned, decompressed, decoded, and inserted. According to the performance FlameGraph, decompression takes up 20% of CPU time. If a page can be filtered as a whole, it does not need to be decompressed. However, the row numbers between pages are not aligned. Columns containing predicates can be filtered at page granularity, but other columns need to be skipped within pages, so non-predicate columns can only save the decoding and insertion time. An array column needs the repetition level to align with other columns, so it can also only save the decoding and insertion time. ## Explore `OffsetIndex` in the column metadata can locate the page position. Theoretically, a page can be completely skipped, including the time of reading from HDFS. However, the average size of a page is around 500KB. Skipping a page requires calling `skip`. The performance of `skip` is low when it is called frequently, and may not be better than continuous reading of large blocks of data (such as 4MB). If multiple consecutive pages are filtered, `skip` reading can be performed according to `OffsetIndex`. However, for the convenience of programming and readability, the data of all pages is loaded and filtered in turn.	2022-09-20 15:55:19 +08:00
slothever	d435f0de41	[feature-wip](parquet-reader) add page index row range (#12652 ) Add some utils and provide the candidate row range (generated from the skipped row ranges of each column) to read for the page index filter. This version supports binary operator filters. todo: - use context instead of structures in close() - process complex type filter - use this instead of row group minmax filter - refactor _eval_binary() for row group filter and page index filter	2022-09-20 10:36:19 +08:00
starocean999	ca3e52a0bb	[fix](agg)the output of window function's nullability should be consistent with output slot (#12607 ) FE may force a window function to output a nullable value in some cases; BE should follow this and change the nullability accordingly.	2022-09-20 09:29:44 +08:00
Xin Liao	41cf94498d	[feature-wip](unique-key-merge-on-write) fix that incremental clone may lead to loss of delete bitmap (#12721 )	2022-09-20 09:08:06 +08:00
Jibing-Li	5978fd9647	[refactor](file scanner)Refactor file scanner. (#12602 ) Refactor the scanners for hms external catalog, work in progress. Use VFileScanner, will remove NewFileParquetScanner, NewFileOrcScanner and NewFileTextScanner after fully tested. Query for parquet file has been tested, still need to add readers for orc file, text file and load logic as well.	2022-09-19 15:23:51 +08:00
luozenglin	d68b8cce1a	[fix](intersect) fix intersect query failed in row storage code (#12712 )	2022-09-19 11:47:50 +08:00
yiguolei	415721ef20	[enhancement](pred column) improve predicate column insert performance (#12690 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-09-19 10:53:48 +08:00
yinzhijian	fb9e48a34a	[fix](vstream load) Fix bug when load json with jsonpath (#12660 )	2022-09-19 10:13:18 +08:00
minghong	b608de668f	[fix](compile)compile error: open_telemetry_scop_wrapper.hpp cannot find 'UNLIKELY' (#12709 )	2022-09-19 09:18:04 +08:00
carlvinhust2012	fa8ed2bccc	[fix](array-type) fix the invalid format load for stream load (#12424 ) this pr is used to fix the invalid format load for stream load. before the change , we will get the error when we load the invalid array format. the origin file to load : 1 [1, 2, 3] 2 [4, 5, 6] 3 \N 4 [7, \N, 8] 5 10, 11, 12 [hugo@xafj-palo]$ sh curl_cmd.sh { "TxnId": 11035, "Label": "11c9f111-188e-4616-9a50-aec8b7814513", "TwoPhaseCommit": "false", "Status": "Fail", "Message": "Array does not start with '[' character, found '1'", "NumberTotalRows": 0, "NumberLoadedRows": 0, "NumberFilteredRows": 0, "NumberUnselectedRows": 0, "LoadBytes": 55, "LoadTimeMs": 7, "BeginTxnTimeMs": 0, "StreamLoadPutTimeMs": 2, "ReadDataTimeMs": 0, "WriteDataTimeMs": 3, "CommitAndPublishTimeMs": 0 } 3. after this change, we will get success and the error url which report the error line. [hugo@xafj-palo]$ sh curl_cmd.sh { "TxnId": 11046, "Label": "249808ee-55f4-4c08-b671-b3d82689d614", "TwoPhaseCommit": "false", "Status": "Success", "Message": "OK", "NumberTotalRows": 5, "NumberLoadedRows": 4, "NumberFilteredRows": 1, "NumberUnselectedRows": 0, "LoadBytes": 55, "LoadTimeMs": 39, "BeginTxnTimeMs": 0, "StreamLoadPutTimeMs": 2, "ReadDataTimeMs": 0, "WriteDataTimeMs": 19, "CommitAndPublishTimeMs": 16, "ErrorURL": "http://10.81.85.89:8502/api/_load_error_log?file=__shard_3/error_log_insert_stmt_8d4130f0c18aeb0a-ad7ffd4233c41893_8d4130f0c18aeb0a_ad7ffd4233c41893" } the sql select result: MySQL [example_db]> select * from array_test06; +------+--------------+ \| k1 \| k2 \| +------+--------------+ \| 1 \| [1, 2, 3] \| \| 2 \| [4, 5, 6] \| \| 3 \| NULL \| \| 4 \| [7, NULL, 8] \| +------+--------------+ 4 rows in set (0.019 sec) the url page show us: "Reason: Invalid format for array column(k2). src line [10, 11, 12]; " Issue Number: #7570	2022-09-19 08:52:59 +08:00
yixiutt	65cff8d40c	[enhancement](compaction) prevent quick_compaction&auto_compaction conflict (#12674 ) Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-19 08:39:27 +08:00
Mingyu Chen	bc38b2fdfb	[improvement](new-scan) graceful quit scanner scheduler (#12715 )	2022-09-19 08:39:08 +08:00
starocean999	3b7a04ee8b	[fix](inpredicate)always use PredicateColumn<TYPE_STRING> for CHAR, VARCHAR and STRING type (#12637 ) The predicate column type for char, varchar and string is PredicateColumnType<TYPE_STRING>, so _base_evaluate method should convert the input column to PredicateColumnType<TYPE_STRING> always.	2022-09-19 08:37:06 +08:00
luozenglin	cb06e67fba	[fix](tracing) Fix opentelemetry log output to be.out (#11856 )	2022-09-18 17:40:23 +08:00
Xinyi Zou	a73b28789d	Fix memory leak by calling in mem hook (#12708 ) After the consume mem tracker exceeds the mem limit in the mem hook, the boost stacktrace will be printed. A query/load will only be printed once, and the process tracker will only be printed once per second. After the process memory reaches the upper limit, the boost stacktrace will be printed every second. The observed phenomena are as follows: After query/load is canceled, the memory increases instantly; tcmalloc profile total physical memory is less than perf process memory; The process mem tracker is smaller than the perf process memory;	2022-09-18 10:04:15 +08:00
Xin Liao	bac58a4774	[feature-wip](unique-key-merge-on-write) fix calculate delete bitmap when flush memtable (#12668 )	2022-09-17 17:04:03 +08:00
HappenLee	35b97a5af0	[Opt](hash) Speed up insert from dict data map and not datetime (#12670 ) Speed up dict data read and not datetime. same target #12636	2022-09-17 17:02:43 +08:00
luozenglin	3030a3606a	[fix](load) fix stream load fail when setting strict mode (#12684 )	2022-09-17 17:02:11 +08:00
Xinyi Zou	3bb042e45c	[fix](memtracker) Process physical mem check does not include tc/jemalloc allocator cache (#12688 ) The tcmalloc/jemalloc allocator cache does not participate in the mem check as part of the process physical memory, because new/malloc triggers the mem hook when using the tcmalloc/jemalloc allocator cache but may not actually allocate physical memory, so failing in the mem hook is not expected in that case. In addition: the value of the tcmalloc/jemalloc allocator cache is used as a mem tracker whose parent is the process mem tracker, and it is updated every 1s. Modify the process default mem_limit to 90%. The mem tracker is expected to effectively limit the memory usage of the process.	2022-09-17 11:31:01 +08:00
Lightman	e01986b8b9	[feature](light-schema-change) fix light-schema-change and add more cases (#12160 ) Fix _delete_sign_idx and _seq_col_idx when append_column or build_schema when load. Tablet schema cache support recycle when schema sptr use count equals 1. Add a http interface for flink-connector to sync ddl. Improve tablet->tablet_schema() by max_version_schema.	2022-09-17 11:29:36 +08:00
Xinyi Zou	942b31038f	[fix](memory) Fix BE OOM when load -238 fail (#12666 ) When the flush is triggered when the load channel exceeds the mem limit, if the flush fails, an error message is returned and the load is terminated. Usually flush failure is -238 error code. Because the memtable is frequently flushed after the load channel exceeds the mem limit, the number of segments exceeds the max value.	2022-09-17 00:17:53 +08:00
Xinyi Zou	42b6532131	remove gc and fix print (#12682 )	2022-09-17 00:16:15 +08:00
Zhengguo Yang	b733a23cf7	[Bugfix](stack_over_flow) fix be may core dump because of stack-buffer-overflow when TBrokerOpenReaderResponse too large (#12658 )	2022-09-16 20:57:22 +08:00
HappenLee	9d6c199553	[Bug](vec) Fix avg overflow in clickbench (#12621 )	2022-09-16 14:43:40 +08:00
TengJianPing	8364165e30	[regression_test](testcase) add regression test cases for session variables skip_storage_engine_merge, skip_delete_predicate and show_hidden_columns (#12617 ) Also add this function to the new olap scan node.	2022-09-16 10:33:12 +08:00
Pxl	d44ec74988	[Enhancement](column) optimize for ColumnString::insert_many_dict_data (#12636 )	2022-09-16 10:23:04 +08:00
Gabriel	c05d736331	[Improvement](sort) fallback to partial sort small block if topN is small (#12604 )	2022-09-16 10:20:17 +08:00
yinzhijian	2a063355ad	[fix](vstream load) Fix the default value insertion problem when importing json (#12601 )	2022-09-16 09:54:45 +08:00
yinzhijian	a97f63141e	[fix](cast) Add validity check for date conversion for non-vectorization (#12608 ) Actual result: select cast("0.0000031417" as date); +------------------------------+ \| CAST('0.0000031417' AS DATE) \| +------------------------------+ \| 2000-00-00 \| +------------------------------+ Expected result: select cast("0.0000031417" as date); +------------------------------+ \| CAST('0.0000031417' AS DATE) \| +------------------------------+ \| NULL \| +------------------------------+	2022-09-16 09:08:53 +08:00
yixiutt	d906e97f1b	[bugfix](compression) fix lock bug in concurrent acquire context (#12638 ) Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-16 09:05:29 +08:00
yixiutt	3072e17b39	[Bugfix](primary-key) fix calc delete bitmap bug in concurrent memtable flush (#12605 ) Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-15 21:50:24 +08:00
Zhengguo Yang	c6c84a2784	[chore](build) add build param to version string (#12591 )	2022-09-15 17:09:22 +08:00
Gabriel	fc4298e85e	[feature](outfile) support parquet writer (#12492 )	2022-09-15 11:09:12 +08:00
zhangstar333	22a8d35999	[Feature](vectorized) support jdbc sink for insert into data to table (#12534 )	2022-09-15 11:08:41 +08:00
HappenLee	e413a2b8e9	[Opt](vectorized) Use new way to do hash shuffle to speed up query (#12586 )	2022-09-15 11:08:04 +08:00
starocean999	8e4374b7ec	[enhancement](agg)remove unnecessary mem alloc and dealloc in agg node (#12535 )	2022-09-15 11:07:06 +08:00
yixiutt	b136d80e1a	[enhancement](compress) reuse compression ctx and buffer (#12573 ) Reuse compression ctx and buffer. Use a global instance for every compression algorithm, and use a thread-safe buffer pool to reuse compression buffers. The pool size equals the max number of parallel compression threads, so it will not grow too large. Tests show this change improves data import and compaction performance by 5%. Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-15 10:59:46 +08:00
Zhengguo Yang	d8b6f09cc1	[Bugfix](string_functions) fix heap-buffer-overflow on find_in_set (#12613 )	2022-09-15 08:43:10 +08:00
lihangyu	f50054f547	[Enhancement](array-type) record offsets info to speed up the seek performance (#12293 ) Store the offset rather than the length in file for the data with array type. The new file format can improve the seek performance. Please refer to #12246 to get the performance report. Co-authored-by: xy720 <22125576+xy720@users.noreply.github.com>	2022-09-14 22:41:54 +08:00
Mingyu Chen	c5ad989065	[refactor](reader) refactor the interface of file reader (#12574 ) Currently, Doris has a variety of readers for different file formats, such as parquet reader, orc reader, csv reader, json reader and so on. The interfaces of these readers are not unified, which makes it impossible to call them through a unified method. In this PR, I added a `GenericReader` interface class, and other Readers will implement this interface class to use the `get_next_block()` method. This PR currently only modifies `arrow_reader` and `parquet reader`. Other readers will be modified one by one in subsequent PRs.	2022-09-14 22:31:11 +08:00
Pxl	0ead048b93	[Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456 ) 1. remove ColumnString terminating zero 2. add a data_version for pblock 3. change EncryptionMode to enum class	2022-09-14 21:25:22 +08:00
Jerry Hu	501e7b9132	[chore][config] increase the default value of doris_blocking_priority_queue_wait_timeout_ms (#12580 ) The default value of Config::doris_blocking_priority_queue_wait_timeout_ms causes high CPU usage (about 8%) in PriorityWorkStealingThreadPool::work_thread	2022-09-14 14:26:13 +08:00
Yongqiang YANG	5dcf933012	[Bug](column) ColumnNullable::replace_column_data should DCHECK size > sel… #12558	2022-09-14 08:42:15 +08:00
camby	56b2fc43d4	[enhancement](array-type) shrink column suffix zero for type ARRAY<CHAR> (#12443 ) In compute level, CHAR type will shrink suffix zeros. To keep the logic the same as CHAR type, we also shrink for ARRAY or ARRAY<ARRAY> types. Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-09-13 23:24:48 +08:00
HappenLee	d913ca5731	[Opt](vectorized) Speed up bucket shuffle join hash compute (#12407 )	2022-09-13 20:19:22 +08:00
AlexYue	58508aea13	[enhance](information_schema) show hll type and bitmap type instead of unknown (#12519 ) Before this pr, when querying data type of hll/bitmap column, 'unknown' would be returned instead of the correct data type of queried column.	2022-09-13 19:43:42 +08:00
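
Note on e70c298e0c ([Bugfix](mem), #12776): the message attributes the bug to subtracting signed and unsigned numbers. The following is a minimal C++ sketch of that general failure mode, not the actual Doris patch. The function names and parameters (`exceeds_limit_buggy`, `exceeds_limit_safe`, `limit`, `used`, `request`) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; nothing here is taken from the Doris source.

// Buggy form: the signed limit is converted to uint64_t before the
// subtraction, so when used > limit the difference wraps around to a huge
// positive value and the check wrongly reports that the request fits.
inline bool exceeds_limit_buggy(int64_t limit, uint64_t used, uint64_t request) {
    const uint64_t headroom = static_cast<uint64_t>(limit) - used;  // may wrap
    return request > headroom;
}

// Safer form: compare before subtracting so the subtraction never wraps.
inline bool exceeds_limit_safe(int64_t limit, uint64_t used, uint64_t request) {
    if (limit < 0) return true;  // a negative limit leaves no headroom
    const uint64_t ulimit = static_cast<uint64_t>(limit);
    if (used >= ulimit) return request > 0;  // already at or over the limit
    return request > ulimit - used;          // ulimit - used cannot wrap here
}

// Small self-check showing where the two forms disagree.
inline void self_check() {
    assert(!exceeds_limit_buggy(100, 150, 10));  // wrongly reports "fits"
    assert(exceeds_limit_safe(100, 150, 10));    // correctly reports "exceeds"
}
```

With a limit of 100 and 150 already used, the buggy check accepts a request of 10 while the safe check rejects it; when usage is below the limit both forms agree.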

2808 Commits
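
Note on d435f0de41 ([feature-wip](parquet-reader), #12652): the message says the candidate row range is generated from the skipped row ranges of each column. One way to picture that step is as taking the complement of the skipped ranges within a row group. The sketch below assumes half-open ranges and uses hypothetical names (`RowRange`, `candidate_ranges`); it is an illustration of the idea, not the code in the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type: a half-open row interval [first, last).
struct RowRange {
    int64_t first;
    int64_t last;
};

// Given the row ranges that page-index filtering allows each column to skip
// (all collected into one list), return the ranges that still must be read
// from a row group holding total_rows rows.
inline std::vector<RowRange> candidate_ranges(std::vector<RowRange> skipped,
                                              int64_t total_rows) {
    assert(total_rows >= 0);
    // Sort by start so overlapping or adjacent skipped ranges merge naturally.
    std::sort(skipped.begin(), skipped.end(),
              [](const RowRange& a, const RowRange& b) { return a.first < b.first; });

    std::vector<RowRange> out;
    int64_t cursor = 0;  // first row not yet classified as skipped or kept
    for (const RowRange& s : skipped) {
        if (cursor >= total_rows) break;
        if (s.first > cursor) {
            // Rows between the cursor and the next skipped range must be read.
            out.push_back({cursor, std::min(s.first, total_rows)});
        }
        cursor = std::max(cursor, s.last);
    }
    if (cursor < total_rows) out.push_back({cursor, total_rows});
    return out;
}
```

For example, if one column's predicate rules out rows [0, 100) and another's rules out rows [300, 400) in a 500-row group, the candidate ranges are [100, 300) and [400, 500).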