4837 Commits

Author SHA1 Message Date
50c1d55769 [Improve](dynamic schema) support filtering invalid data (#21160)
* [Improve](dynamic schema) support filtering invalid data

1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing of multi-dimensional arrays by default, since some bugs remain unresolved.
2023-06-26 19:32:43 +08:00
5fdd9b9254 [Bug](RuntimeFiter) Fix bf error change the murmurhash to crc32 in regression test p2 (#21167) 2023-06-26 16:39:45 +08:00
960e04b0ed [fix](inverted index) fix build inverted index failed but not return immediately (#21165) 2023-06-26 14:05:12 +08:00
66005570c9 [fix](regression) fix p1 test_backup_restore fail caused by http download 401 invalid token error #21107 2023-06-26 12:56:46 +08:00
1dec592e91 [improvement](fs_bench) optimize the usage of fs benchmark tool for hdfs (#21154)
Optimize the usage of fs benchmark tool:

1. Remove the `Open` benchmark; it is useless.
2. Remove the `Delete` benchmark; it is dangerous.
3. Add a `SingleRead` benchmark; the user can specify an existing file to test the read operation:

    `sh bin/run-fs-benchmark.sh --conf=conf/hdfs_read.conf --fs_type=hdfs --operation=single_read`

4. Modify `run-fs-benchmark.sh`: remove the `OPTS` section and use the options of `fs_benchmark_tool` directly.
5. Add some custom counters to the benchmark result, e.g.:

```
--------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              6864 ms         2385 ms            1 ReadRate=200.936M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              3919 ms         1828 ms            1 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              3839 ms         1819 ms            1 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_mean         4874 ms         2011 ms            3 ReadRate=304.054M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_median       3919 ms         1828 ms            3 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_stddev       1724 ms          324 ms            3 ReadRate=89.3768M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_cv          35.37 %         16.11 %             3 ReadRate=29.40%
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_max          6864 ms         2385 ms            3 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_min          3839 ms         1819 ms            3 ReadRate=200.936M/s
```

- For `open_read` and `single_read`, add `ReadRate`, in bytes per second.
- For `create_write`, add `WriteRate`, in bytes per second.
- For `exists` and `rename`, add `ExistsCost` and `RenameCost`, the time cost per operation.
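
The summary rows in the table above can be reproduced from the three individual runs. A minimal sketch in Python, assuming the `_stddev` row is the sample standard deviation (n - 1 denominator) and `_cv` is the standard deviation divided by the mean; the variable names are illustrative and not part of `fs_benchmark_tool`:

```python
import statistics

# Wall-clock times (ms) of the three HdfsReadBenchmark repetitions shown above.
times_ms = [6864, 3919, 3839]

mean_ms = statistics.mean(times_ms)
median_ms = statistics.median(times_ms)
stddev_ms = statistics.stdev(times_ms)  # sample standard deviation (n - 1)
cv = stddev_ms / mean_ms                # coefficient of variation

print(f"mean={mean_ms:.0f} ms median={median_ms} ms "
      f"stddev={stddev_ms:.0f} ms cv={cv:.2%}")
```

The large `_cv` value comes almost entirely from the first repetition, which is much slower than the other two.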
2023-06-26 11:37:14 +08:00
2e6d91aa99 [chore](block) temporarily disable DCHECK for column name equality in MutableBlock (#21116)
* temporarily disable DCHECK for column name equality in MutableBlock::add_rows

* num columns EQ to LE
2023-06-26 10:49:27 +08:00
28abeef72b [performace](colddata) opt cold data read performance (#21141)
In #10370, we tried to optimize string predicate evaluation by rewriting the predicate using dictionary values. This requires checking whether the string column is fully dictionary-encoded, so logic was added to read the last page of the string column to check it.

However, this performs badly on cold data, because it has to load the column's ordinal index and zone map index. Consider, for example, select * from table where pk_col=1. When the query condition is on the primary key, the result may be only a few rows but may have 100 columns, so loading these indices takes a long time. Much of the time is then spent in block_init_time.

In my test, with a table of 50 string columns queried by primary key:

The first read time is reduced from 220ms to 40ms.
2023-06-26 10:39:20 +08:00
6f7759b08d [fix](memory) fix mem tracker grace exit (#21136) 2023-06-26 10:28:24 +08:00
1ac8cdec7e [Fix](inverted index) fix inverted query cache for chinese tokenizer (#21106)
1. The query cache for the Chinese tokenizer gets confused when w_char is simply converted to char.
2. Separate query_type from inverted_index_reader to clean up the code.
2023-06-25 22:04:02 +08:00
76bdcf1d26 [improvement](pipeline) task group scan entity (#19924) 2023-06-25 14:43:35 +08:00
d49c412c59 [Feature](multi-catalog) Add hdfs benchmark tools. (#21074) 2023-06-25 09:35:27 +08:00
601120db04 [Bug](pipeline) access map may cause coredump in sink buffer (#21108) 2023-06-24 23:03:59 +08:00
691a988c97 [enhancement](merge-on-write) add async publish task when version is discontinuous for merge on write table when clone (#21025)
Version discontinuity may occur during clone. To handle this case, add an async publish task when the version is discontinuous.
2023-06-22 21:50:14 +08:00
a33521b2ce [enhancement](exchange) add filter for exchange node in BE (#21087) 2023-06-22 01:04:47 +08:00
49bbe88327 [fix](log) fix the too large warning log of BE (#21027) 2023-06-22 00:39:04 +08:00
3dfeee3946 [fix](typesystem) fix wrong return type argument cause type check fail (#21082) 2023-06-22 00:04:46 +08:00
2c9bdd64fa [fix](memory) arena support memory reuse after clear() (#21033) 2023-06-21 23:27:21 +08:00
2ce8cfbebd [profile](sort) add some metrics in profile (#21056) 2023-06-21 22:57:46 +08:00
661e1ae7c5 [fix](memory) no switch bthread context in UBSAN compile (#21064)
When compiled with UBSAN, all memory is tracked by the orphan (unknown) mem tracker, and the bthread context and mem tracker are no longer switched.

The supplementary fixes are as follows: #20999
2023-06-21 21:14:07 +08:00
b2c4e51be1 [fix](load) delete lazy open DCheck when unknown load id (#21083) 2023-06-21 20:42:31 +08:00
18a0824eb3 [fix](compaction)Modify time series compaction policy default config (#21079) 2023-06-21 20:29:58 +08:00
442a734ef5 [improvement](config) update be config max_runnings_transactions_per_txn_map default value (#21060) 2023-06-21 20:29:13 +08:00
6ac0bfeceb [Feature](inverted index) add unicode parser for inverted index (#21035) 2023-06-21 20:14:06 +08:00
84b97860a1 [fix](memory) Fix memory exceed limit and query has been canceled, Allocator will block 100ms (#20959) 2023-06-21 17:35:19 +08:00
85ce6a22c0 [enhancement](merge-on-write) some misc optimizations (#21039) 2023-06-21 16:16:06 +08:00
b65b821813 [enhancement](pk) add bvar stating cached io (#20977) 2023-06-21 15:02:10 +08:00
c5560b8f93 [fix](load) segcompaction does not signal waiters when an error happens (#21043)
This leads to a deadlock.
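
The general pattern behind this kind of fix can be sketched in Python: the worker notifies waiters in a `finally` block, so they wake up on failure as well as on success. This is an illustrative sketch only; `CompactionState` and `failing_job` are hypothetical names and do not reflect the actual BE code.

```python
import threading

class CompactionState:
    """Hypothetical holder for the completion state of a background job."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False
        self.error = None

    def run(self, job):
        try:
            job()
        except Exception as exc:
            self.error = exc  # record the failure for the waiters to inspect
        finally:
            # Always signal, even on error; otherwise waiters block forever.
            with self._cond:
                self._done = True
                self._cond.notify_all()

    def wait(self, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._done, timeout=timeout)

def failing_job():
    raise RuntimeError("simulated compaction failure")

state = CompactionState()
worker = threading.Thread(target=state.run, args=(failing_job,))
worker.start()
finished = state.wait(timeout=5)
worker.join()
print(finished, repr(state.error))
```

If the `notify_all()` call were only on the success path, `wait()` without a timeout would never return after a failed job.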
2023-06-21 14:56:34 +08:00
bad22dd4e2 [Fix](orc-reader) Fix orc dict filter null value issue in _convert_dict_cols_to_string_cols which caused incorrect result. (#21047)
Query results should not have empty values.
```
use regresssion.multi_catalog;
select commit_id from github_events_orc WHERE (event_type = 'CommitCommentEvent') AND commit_id != "" limit 10;
```
```
+------------------------------------------+
| commit_id                                |
+------------------------------------------+
| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 |
| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 |
| 4e3ab2ff2d2474f5d51334b9b0fdf17e9845a166 |
|                                          |
|                                          |
|                                          |
|                                          |
|                                          |
|                                          |
| 7191c20cb49da07a7fc16aa32dc0de4faff528b2 |
+------------------------------------------+
10 rows in set (0.54 sec) 
```
2023-06-21 14:54:01 +08:00
564b3533cf [enhancement](merge-on-write) update publish/streamload/compaction co… (#21040) 2023-06-21 14:49:51 +08:00
81abdeffbc [Improvement](pipeline) Improve shared scan performance (#20785) 2023-06-21 14:36:05 +08:00
Pxl
5f0bb49d46 [Feature](materialized-view) support create mv contain aggstate column (#20812)
support create mv contain aggstate column
2023-06-21 13:06:52 +08:00
5f760a8939 [fix](runtime_filter) remove incorrect DCHECK (#21050) 2023-06-21 11:27:53 +08:00
ef17289925 [feature](jni) add jni metrics and attach to BE profile automatically (#21004)
Add JNI metrics, for example:
```
-  HudiJniScanner:  0ns
  -  FillBlockTime:  31.29ms
  -  GetRecordReaderTime:  1m5s
  -  JavaScanTime:  35s991ms
  -  OpenScannerTime:  1m6s
```
Add three common performance metrics for the JNI scanner:
1. `OpenScannerTime`: time to initialize and open the JNI scanner
2. `JavaScanTime`: time to scan data and insert it into the vector table on the Java side
3. `FillBlockTime`: time to convert the Java vector table to a C++ block

User-defined metrics on the Java side are also supported. For example, if `OpenScannerTime` shows that the open process takes a long time and we want to determine which sub-process is responsible, we can add `GetRecordReaderTime` on the Java side.
User-defined metrics on the Java side are attached to the BE profile automatically.
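
To compare the metrics in the example above numerically, the duration strings can be converted to milliseconds. A minimal sketch, assuming durations are written as a sequence of number and unit pairs such as `1m5s` or `35s991ms`; the `to_ms` helper and the set of supported units are assumptions for illustration, not part of the BE profile API:

```python
import re

# Unit multipliers in milliseconds. Longer unit names come first in the regex
# so that "ms" is not read as "m" followed by a stray "s".
_UNIT_MS = {"ns": 1e-6, "us": 1e-3, "ms": 1.0, "s": 1e3, "m": 6e4, "h": 3.6e6}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")

def to_ms(text):
    """Convert a duration such as '1m5s' or '35s991ms' to milliseconds."""
    if not re.fullmatch(r"(?:\d+(?:\.\d+)?(?:ns|us|ms|s|m|h))+", text):
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(value) * _UNIT_MS[unit] for value, unit in _PART.findall(text))

# Values copied from the profile excerpt above.
metrics = {
    "FillBlockTime": "31.29ms",
    "GetRecordReaderTime": "1m5s",
    "JavaScanTime": "35s991ms",
    "OpenScannerTime": "1m6s",
}
millis = {name: to_ms(value) for name, value in metrics.items()}
share = millis["GetRecordReaderTime"] / millis["OpenScannerTime"]
print(f"GetRecordReaderTime is {share:.0%} of OpenScannerTime")
```

In this excerpt, nearly all of `OpenScannerTime` is spent in `GetRecordReaderTime`, which is the kind of breakdown the user-defined metric is meant to expose.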
2023-06-21 11:19:02 +08:00
0cf9de8cef [fix](decimalv3) fix result error when cast a round decimalv3 to double (#20678) 2023-06-21 00:02:48 +08:00
ca6f51fcd5 [Performance] disable mmap alloc for doris performance (#21034)
disable mmap alloc for some benchmark
2023-06-20 23:27:49 +08:00
6d579d924d [fix](profile) delete useless profile add_child #20989 2023-06-20 23:21:52 +08:00
b70a14d9c9 [fix](merge-on-write) fix that delete bitmap is not calculated correctly when has sequence column (#20955) 2023-06-20 21:36:47 +08:00
2c11ce0a02 [bugfix](topn) fix key topn merge block conflict with index predicate result columns (#20820) 2023-06-20 21:23:00 +08:00
7a58a69aa9 [Fix](inverted index) skip index compaction when src rs did not have inverted index (#21010) 2023-06-20 21:22:25 +08:00
ce1b39e79d [fix](profile) avoid unnecessary refresh profile of TabletsChannel
Previously, the TabletsChannel profile was refreshed in the LoadChannelMgr thread that refreshes memory statistics.

This meant the profile was refreshed even with enable_profile=false, causing performance loss in stress tests.
2023-06-20 21:09:43 +08:00
622ef63c69 [fix](memory) fix bthread_setspecific error in rpc done.run() (#20999) 2023-06-20 21:00:45 +08:00
55a6649da9 [fix](testcase) fix test case failure of insert null value into not null column (#20963) 2023-06-20 20:46:07 +08:00
190debaac9 [Improvement](load) single partition load optimize (#20876)
1. When creating a single partition, the partition and tablet are not looked up for each row of data.
2. Applies only to DISTRIBUTED BY random.
2023-06-20 20:29:39 +08:00
9eade148dd [enhancement](merge-on-write) add primary key data page size config (#20961) 2023-06-20 19:51:02 +08:00
ccba11d7ea [Fix](inverted index) remove IndexReader::indexExists, use fs interface (#20970) 2023-06-20 15:22:25 +08:00
012813b3f7 [fix](load) add missing flush context for BetaRowsetWriter::_add_block() (#20884) 2023-06-20 14:27:39 +08:00
c85271d2ae [Fix](orc-reader) Fix filter size mismatch in orc reader. (#20998)
Fix filter size mismatch in orc reader introduced by #20806
2023-06-20 12:27:16 +08:00
d05614ef51 [Fix](invert index)all directories use NoLock (#20962) 2023-06-20 12:12:16 +08:00
923f7edad0 [opt](hudi) using native reader to read the base file with no log file (#20988)
Two optimizations:
1. Insert string bytes directly to remove the decoding and encoding process.
2. Use the native reader to read the Hudi base file if it has no log file. Use `explain` to show how many splits are read natively.
2023-06-20 11:20:21 +08:00
824bc02603 [Function] Support date function: microsecond() (#20044) 2023-06-20 10:32:54 +08:00