11475 Commits

Author SHA1 Message Date
c19e35116b [fix](inverted index)fix transaction id not unique for one index change job when light index change (#21180) 2023-06-26 19:54:05 +08:00
50c1d55769 [Improve](dynamic schema) support filtering invalid data (#21160)
1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing of multi-dimensional arrays by default, since some bugs remain unresolved.
2023-06-26 19:32:43 +08:00
05d94e5a4c [typo](docs) add a create table as select sample (#21078) 2023-06-26 19:27:05 +08:00
eb2a08bdf2 [typo](docs) Update the audit document (#21185) 2023-06-26 19:25:10 +08:00
65d81c04e6 [Docs](inverted index) update docs for build index (#21184) 2023-06-26 19:24:44 +08:00
839ad8786a [typo](docs) improvement SQL manual ddl drop doc (#21188) 2023-06-26 18:51:28 +08:00
986f3b2176 [typo](docs) improvement SQL manual ddl alter doc (#21179) 2023-06-26 18:17:01 +08:00
5ebac73a93 [typo](docs) improvement SQL manual ddl create doc (#21181) 2023-06-26 18:16:50 +08:00
9c5a0cc471 [bug](jdbc catalog) fix getPrimaryKeys fun bug (#21137) 2023-06-26 17:13:50 +08:00
cdc2d42c3a [refactor](Nereids): adjust order of rewrite rules. (#21133)
Put the rules that eliminate plan nodes first so that they do not block other rules; this avoids invoking filter/limit pushdown again.
2023-06-26 16:47:33 +08:00
5fdd9b9254 [Bug](RuntimeFiter) Fix bf error change the murmurhash to crc32 in regression test p2 (#21167) 2023-06-26 16:39:45 +08:00
102b7f8873 remove useless case (#21166) 2023-06-26 16:27:32 +08:00
f2ed1bce1a [fix](nereids)change PushdownFilterThroughProject post processor from bottom up to top down rewrite (#21125)
1. Pass physicalProperties in the withChildren function.
2. Use top-down traversal in the PushdownFilterThroughProject post processor.
2023-06-26 15:34:41 +08:00
960e04b0ed [fix](inverted index) fix build inverted index failed but not return immediately (#21165) 2023-06-26 14:05:12 +08:00
2b3c82f57a [fix](multi-catalog)fix max compute scanner OOM and datetime (#20957)
1. Fix MC JNI scanner OOM.
2. Add the second datetime type for MC SDK timestamp.
3. Also make S3 URIs case-insensitive.
4. Optimize the MaxCompute scanner parallel model.
2023-06-26 13:53:29 +08:00
d4240ac21b [fix](multi-catalog)add oss sdk, supported oss properties (#21029) 2023-06-26 13:00:44 +08:00
5d2b69b06d [Enhancement](regression) let test case fail fast when job is cancelled (#20578) (#21103)
In Doris regression-test/suites, many test cases exit immediately only when the job state is "FINISHED"; otherwise they wait until timeout. For example:

while (max_try_secs--) {
    String res = getJobState(tbName1)
    if (res == "FINISHED") {
        sleep(3000)
        break
    } else {
        Thread.sleep(1000)
        if (max_try_secs < 1) {
            println "test timeout," + "state:" + res
            assertEquals("FINISHED", res)
        }
    }
}
This PR adds checks so that these test cases also exit immediately when the state is "CANCELLED", which is the only unchanging status other than "FINISHED".
2023-06-26 12:58:51 +08:00
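As a rough illustration of the fail-fast polling described in the entry above, here is a minimal Python sketch (the regression suites themselves are Groovy). The names `wait_for_job` and `get_state` and the injectable `sleep` parameter are hypothetical and exist only for this sketch; they are not part of the Doris regression framework.

```python
import time
from typing import Callable

# The two unchanging states named in the commit message.
TERMINAL_STATES = {"FINISHED", "CANCELLED"}


def wait_for_job(get_state: Callable[[], str], max_try_secs: int,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state() about once per second until a terminal state or timeout.

    get_state is a hypothetical stand-in for a helper such as getJobState(tbName1).
    Returns the last observed state; the caller asserts that it is "FINISHED".
    """
    state = get_state()
    for _ in range(max_try_secs):
        if state in TERMINAL_STATES:
            break  # stop immediately on FINISHED or CANCELLED
        sleep(1)
        state = get_state()
    return state


# Usage: a fake job that is cancelled on the third poll.
states = iter(["PENDING", "RUNNING", "CANCELLED", "FINISHED"])
polls = []  # records each simulated one-second sleep instead of really sleeping
result = wait_for_job(lambda: next(states), max_try_secs=60, sleep=polls.append)
```

With this shape a cancelled job is reported after two polls rather than after the full timeout, and the assertion on "FINISHED" then fails right away.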
66005570c9 [fix](regression) fix p1 test_backup_restore fail caused by http download 401 invalid token error #21107 2023-06-26 12:56:46 +08:00
1dec592e91 [improvement](fs_bench) optimize the usage of fs benchmark tool for hdfs (#21154)
Optimize the usage of fs benchmark tool:

1. Remove the `Open` benchmark; it is useless.
2. Remove the `Delete` benchmark; it is dangerous.
3. Add a `SingleRead` benchmark; users can specify an existing file to test the read operation:

    `sh bin/run-fs-benchmark.sh --conf=conf/hdfs_read.conf --fs_type=hdfs --operation=single_read`

4. Modify `run-fs-benchmark.sh`: remove the `OPTS` section and use the options in `fs_benchmark_tool` directly.
5. Add some custom counters to the benchmark result, e.g.:

```
--------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              6864 ms         2385 ms            1 ReadRate=200.936M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              3919 ms         1828 ms            1 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              3839 ms         1819 ms            1 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_mean         4874 ms         2011 ms            3 ReadRate=304.054M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_median       3919 ms         1828 ms            3 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_stddev       1724 ms          324 ms            3 ReadRate=89.3768M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_cv          35.37 %         16.11 %             3 ReadRate=29.40%
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_max          6864 ms         2385 ms            3 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_min          3839 ms         1819 ms            3 ReadRate=200.936M/s
```

- For `open_read` and `single_read`, add `ReadRate` as `bytes per second`.
- For `create_write`, add `WriteRate` as `bytes per second`.
- For `exists` and `rename`, add `ExistsCost` and `RenameCost` as `time cost per one operation`.
2023-06-26 11:37:14 +08:00
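The aggregate rows in the benchmark table above (`_mean`, `_median`, `_stddev`, `_cv`) can be checked by hand from the three repetitions. The following Python sketch recomputes them from the reported wall times; it assumes that the `_stddev` row is a sample standard deviation, which is what the numbers are consistent with. The helper `rate_bytes_per_sec` is a hypothetical illustration of a "bytes per second" counter, not code from the tool.

```python
import statistics

# Per-repetition wall times (ms) copied from the three HdfsReadBenchmark rows.
times_ms = [6864, 3919, 3839]

mean = statistics.mean(times_ms)
median = statistics.median(times_ms)
stddev = statistics.stdev(times_ms)  # sample standard deviation (n - 1 denominator)
cv = stddev / mean                   # coefficient of variation, shown as a percentage


def rate_bytes_per_sec(total_bytes: int, elapsed_sec: float) -> float:
    """A counter such as ReadRate is bytes transferred divided by elapsed seconds."""
    return total_bytes / elapsed_sec
```

The large spread (a coefficient of variation of about 35%) comes almost entirely from the first repetition, which is much slower than the other two.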
1138ed1d70 [doc](catalog) update and improve doc of multi catalog (#21105)
Update the document of multi catalog feature.
2023-06-26 11:36:44 +08:00
2e6d91aa99 [chore](block) temporarily disable DCHECK for column name equality in MutableBlock (#21116)
* temporarily disable DCHECK for column name equality in MutableBlock::add_rows

* num columns EQ to LE
2023-06-26 10:49:27 +08:00
28abeef72b [performace](colddata) opt cold data read performance (#21141)
In #10370, we tried to optimize string evaluation performance by rewriting the predicate using dict values. That rewrite has to check whether the string column is fully dict-encoded, so logic was added to read the last page of the string column to check it.

This performs badly for cold data, because it has to load the column's ordinal index and zone map index. Take select * from table where pk_col=1 as an example: the query condition is on the primary key, so the result may be just a few rows, but it may have 100 columns, and loading these indices costs a lot of time. Much of that time shows up in block_init_time.

In my test, with a table of 50 string columns and a query on the primary key, the first read time dropped from 220ms to 40ms.
2023-06-26 10:39:20 +08:00
baf9a2107b [fix](regression) fix case failure by adding sync after stream load (#21155) 2023-06-26 10:38:46 +08:00
6f7759b08d [fix](memory) fix mem tracker grace exit (#21136) 2023-06-26 10:28:24 +08:00
880252984b [typo](docs) fix jdbc catalog doc example err (#21152) 2023-06-26 10:14:17 +08:00
f8ef4ed18f [fix](log4j) fix some issues when modify log config (#21099)
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2023-06-26 08:46:33 +08:00
af51a31c21 [deps](benchmark) bump benchmark from 1.5.6 -> 1.8.0 (#21121)
To support some new methods used in #21074
2023-06-25 23:42:54 +08:00
Pxl
0122aa79df [Chore](vectorized) remove all isVectorized (#21076)
isVectorized is always true now
2023-06-25 23:13:34 +08:00
58b3e5ebdb [fix](nereids)scan node's smap should use materialized slots and project list as left and right expr list (#21142) 2023-06-25 22:34:43 +08:00
8f7a62c79b [improvement](mutil-catalog) PaimonColumnValue support short and Decimal (#20723) 2023-06-25 22:31:38 +08:00
2c2d56e8a0 [Feature](broker-load) Add priority info for ShowLoadStmt. (#20984)
Following PR #20628, add priority information for the load job.
2023-06-25 22:11:21 +08:00
1ac8cdec7e [Fix](inverted index) fix inverted query cache for chinese tokenizer (#21106)
1. The query cache for the Chinese tokenizer is confusing when just converting w_char to char.
2. Separate query_type from inverted_index_reader to clean up the code.
2023-06-25 22:04:02 +08:00
64790a3a86 [bugfix](workloadgroup) could not upgrade from 2.0 alpha (#21149)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-06-25 22:02:53 +08:00
2d1163c4d8 [refactor](nereids) update Agg stats derive method #21036
This PR has no effect on TPC-H queries.
Some TPC-DS queries are impacted.
They are 4/11/23/24/47/51/57/65/74, of which 4 and 51 are improved.
2023-06-25 21:47:32 +08:00
34b048a2bd [fix](nereids) update outer join estimation #21126
The row count of a left outer join should be no less than the left child's row count.
2023-06-25 21:37:55 +08:00
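The bound stated in the entry above can be written as a one-line clamp. This is a minimal sketch of the general idea, not the Nereids implementation; `left_outer_join_rows` and its `inner_join_rows` parameter (an estimate of the matching rows, obtained however the optimizer obtains it) are hypothetical names.

```python
def left_outer_join_rows(left_rows: float, inner_join_rows: float) -> float:
    """Estimate the output size of a left outer join.

    Every left row appears at least once in the output (padded with NULLs
    when it has no match), so the estimate is clamped from below by the
    left child's row count.
    """
    return max(left_rows, inner_join_rows)


# A very selective join condition cannot shrink the result below the left side.
selective = left_outer_join_rows(left_rows=1000, inner_join_rows=10)
# A join that fans out can still grow the result.
fan_out = left_outer_join_rows(left_rows=1000, inner_join_rows=5000)
```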
af2b67e65a [Fix](multi-catalog) Invalidate cache when enable auto refresh catalog. (#21070)
The default value of RefreshCatalogStmt.invalidCache is now false, but RefreshManager.RefreshTask does not invoke RefreshCatalogStmt.analyze(), so it does not invalidate the cache. This PR mainly fixes this problem.
2023-06-25 19:14:44 +08:00
638aa41988 [fix](planner) fix push filter through agg #21080
In the previous implementation, the check on the group-by exprs was skipped. This PR adds the necessary check so that the pushdown works correctly.

You can reproduce it by running the SQL below:

CREATE TABLE t_push_filter_through_agg (col1 varchar(11451) not null, col2 int not null, col3 int not null)
UNIQUE KEY(col1)
DISTRIBUTED BY HASH(col1)
BUCKETS 3
PROPERTIES(
    "replication_num"="1"
);

CREATE VIEW `view_i` AS 
SELECT 
    `b`.`col1` AS `col1`, 
    `b`.`col2` AS `col2`
FROM 
(
    SELECT 
        `col1` AS `col1`, 
        sum(`cost`) AS `col2`
    FROM 
    (
        SELECT 
            `col1` AS `col1`, 
            sum(CAST(`col3` AS INT)) AS `cost` 
        FROM 
            `t_push_filter_through_agg` 
        GROUP BY 
            `col1`
    ) a 
    GROUP BY 
        `col1`
) b;

SELECT SUM(`total_cost`) FROM view_a WHERE `dt` BETWEEN '2023-06-12' AND '2023-06-18' LIMIT 1;
2023-06-25 19:14:20 +08:00
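The kind of check the entry above refers to is commonly stated as follows: a filter may move below an aggregate only when it references nothing but group-by columns, because only those columns have the same value before and after aggregation. The sketch below illustrates that general rule; it is not the Doris planner code, and `split_pushable` and the `Predicate` shape are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# A predicate is modelled as (SQL text, set of column names it references).
Predicate = Tuple[str, Set[str]]


def split_pushable(predicates: Iterable[Predicate],
                   group_by_cols: Set[str]) -> Tuple[List[str], List[str]]:
    """Split predicates into those safe to push below the aggregate and the rest."""
    pushable, kept = [], []
    for text, cols in predicates:
        # Conservative choice: a predicate with no column references stays above.
        if cols and cols <= group_by_cols:
            pushable.append(text)
        else:
            kept.append(text)  # touches an aggregate output or a non-key column
    return pushable, kept


preds = [("col1 = 'a'", {"col1"}), ("col2 > 10", {"col2"})]
pushable, kept = split_pushable(preds, group_by_cols={"col1"})
```

Here `col2` is the output of `sum(...)`, so a filter on it has to stay above the aggregate, while the filter on the grouping key `col1` can move down.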
69d5adaee3 [Improvement](doc) improve ngram and inverted index documents #21091 2023-06-25 19:13:41 +08:00
ee2492dd78 [typo](doc)fix delete table associate to other table only support unique model (#21129)
Co-authored-by: smallhibiscus <844981280>
2023-06-25 19:04:27 +08:00
55e7af1e31 [fix](test) fix two case bug #21124 2023-06-25 18:53:20 +08:00
b6c9feb458 [fix](nereids) check table privilege when it's needed (#21130)
check privilege on LogicalOlapScan, LogicalEsScan, LogicalFileScan and LogicalSchemaScan
2023-06-25 18:35:39 +08:00
46f0295b78 [feature](load-refactor-with-tvf) S3 load with S3 tvf and native insert (#19937) 2023-06-25 17:45:31 +08:00
771b0cbb4c [fix](stats) Update analyze task execute time (#21026)
Before this PR, the last_execute_time of pending analyze jobs would be 1970-01-01; you can reproduce it by running show analyze.
2023-06-25 15:52:33 +08:00
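A date of 1970-01-01 is what a zero epoch timestamp looks like when formatted. The sketch below shows this under the assumption, not stated in the commit message, that the execute time is stored as epoch milliseconds and defaults to 0 until the job runs; `format_execute_time` is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_execute_time(epoch_millis: int) -> str:
    """Format an epoch-milliseconds timestamp as a UTC date-time string."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Assumed default for a job that has not executed yet.
unset = format_execute_time(0)
```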
cf66280e60 [opt](stats) Sampling when aggregate column stats (#21020)
In the previous implementation, when aggregating partition statistics into column statistics, the calculation of distinct values (ndv) for the entire column was performed without using sampling, resulting in reduced efficiency of the sampling process.

Before this PR, analyzing the table below, which has 1000000 rows, took 5.75 sec; after this PR, it takes 3.39 sec.


```sql
CREATE TABLE IF NOT EXISTS `duplicate_all` (
    `k3` int(11) null comment "",
    `k0` boolean null comment "",
    `k1` tinyint(4) null comment "",
    `k2` smallint(6) null comment "",
    `k4` bigint(20) null comment "",
    `k5` decimalv3(9, 3) null comment "",
    `k6` char(36) null comment "",
    `k10` date null comment "",
    `k11` datetime null comment "",
    `k7` varchar(64) null comment "",
    `k8` double null comment "",
    `k9` float null comment "",
    `k12` string  null comment "",
    `k13` largeint(40)  null comment ""
) engine=olap
DUPLICATE KEY(`k3`)
DISTRIBUTED BY HASH(`k3`) BUCKETS 5 properties("replication_num" = "3")
```
2023-06-25 15:52:01 +08:00
dd99468b8f [fix](stats) Fix jdbc timeout with multiple FE when execute analyze table (#21115)
When the client is connected to a follower node, SQL may be forwarded to the master for execution; in that case the result should be set to `StmtExecutor#proxyResultSet`.

Before this PR, in the above scenario, submitting analyze SQL through the MySQL client or JDBC would fail with "malformed packet" / "Communication failed" errors.
2023-06-25 15:49:36 +08:00
76bdcf1d26 [improvement](pipeline) task group scan entity (#19924) 2023-06-25 14:43:35 +08:00
80d54368e0 [minor](Nereids) replace some nullable field to Optional (#20967) 2023-06-25 12:02:25 +08:00
6896776034 [test](regression) update some case in p2 (#21094)
update some case in p2
2023-06-25 11:16:56 +08:00
207bc53b06 [functionpushdown](performance) move function pushdown as default false since its performance is not good (#21111)
Set enable function pushdown to false by default.
Enable it in fuzzy mode to test this feature.
We should remove function pushdown in the future, since we already have common expr pushdown.
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-06-25 10:36:20 +08:00
20b92b0812 [Feature](log)friendly hint for creating table failed (#20617) 2023-06-25 10:02:26 +08:00