Commit Graph

6507 Commits

Author SHA1 Message Date
00c672340d [improvement](memory) set TCMALLOC_HEAP_LIMIT_MB to control memory consumption of tcmalloc (#12981) 2022-09-28 15:44:18 +08:00
819aecb26c [DOC](datev2) Add documents for DateV2 (#12976) 2022-09-28 14:36:26 +08:00
1b1f13ec84 [optimization](array-type) optimize error prompts when the SQL parser reports an error (#12999)
Co-authored-by: hucheng01 <hucheng01@baidu.com>
2022-09-28 14:35:41 +08:00
16bb5cb430 [enhancement](memory) Jemalloc performance optimization and compatibility with MemTracker #12496 2022-09-28 12:04:29 +08:00
e627d285e0 [chore](regression-test) add default group(p0) for regression-test (#12977) 2022-09-28 11:47:19 +08:00
a79d2e592b [improvement](test) cache data from s3 to cacheDataPath (#13018)
Now, regression data is stored in sf1DataPath, which can be local or remote.
For performance reasons, we use a local dir for the community pipeline; however, we then need to prepare the data on every machine,
and this process is error-prone. So we transparently cache data from S3 locally; thus, we only need to configure one data source.
2022-09-28 10:43:55 +08:00
eef9367705 [feature](Nereids) use one stage aggregation if available (#12849)
Currently, we always disassemble aggregation into two stages: local and global. However, in some cases one-stage aggregation is enough, and it has two advantages:
1. it avoids an unnecessary exchange.
2. it gives a chance to do a colocate join on top of the aggregation.

This PR moves the AggregateDisassemble rule from the rewrite stage to the optimization stage and chooses one-stage or two-stage aggregation according to cost.
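A minimal sketch of the case this enables, using a hypothetical table `t` that is not taken from the PR: when the grouping key matches the table's distribution column, each group's rows already live on a single node, so one aggregation stage can produce the final result without an extra exchange.
```SQL
-- Hypothetical table distributed by user_id; not part of this PR.
CREATE TABLE t (
    user_id BIGINT,
    amount  BIGINT
) DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Grouping by the distribution column: rows for each user_id are
-- already co-located, so a single aggregation stage suffices.
SELECT user_id, SUM(amount)
FROM t
GROUP BY user_id;
```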
2022-09-28 10:38:03 +08:00
1ba9e4b568 [Improvement](sort) Reuse memory in sort node (#12921) 2022-09-28 09:44:35 +08:00
339877930d [fix](join)report 'natural join is not supported' instead of getting wrong result (#13008)
* [fix](join)report 'natural join is not supported' instead of getting wrong result

* add regression test
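A minimal illustration with hypothetical tables `t1` and `t2`: with this fix, the statement below is rejected with "natural join is not supported" instead of silently returning a wrong result.
```SQL
-- Hypothetical tables; previously this could produce a wrong result,
-- now it reports that natural join is not supported.
SELECT *
FROM t1 NATURAL JOIN t2;
```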
2022-09-28 09:08:56 +08:00
Pxl
ee3dd423b9 [Bug](function) core dump on substr #13007 2022-09-28 08:54:49 +08:00
2dafbda9de [chore](third-party) Fix compilation errors reported by clang-15 (#13016)
Add some compile flags to eliminate compilation errors reported by clang-15.
2022-09-27 23:46:43 +08:00
d8ec53c83f [enhancement](load) avoid duplicate reduce on same TabletsChannel #12975
Under the policy changed by PR #12716, when the hard limit is reached, multiple threads may pick the same LoadChannel and call reduce_mem_usage on the same TabletsChannel. Although a lock and condition variable prevent multiple threads from reducing memory usage concurrently, they can still perform the same reduce work on that channel multiple times, one after another, even if it has just been reduced.
2022-09-27 22:03:08 +08:00
d80b7b9689 [feature-wip](new-scan) support more load situation (#12953) 2022-09-27 21:48:32 +08:00
16f5204cab fix_md5sum_and_sm3sum (#13009) 2022-09-27 21:41:14 +08:00
9a38a9677a [feature](Nereids) Eliminate outer join (#12985)
Eliminate an outer join if we have a non-null predicate on slots from the inner side of the outer join (see the sketch below).

TODO:
1. use constant variables to handle it (so we can handle more cases, like nullSafeEqual ......)
2. use constant folding to handle null values; this is more general and does not require writing long logical judgments
3. handle null-safe equals (<=>)
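A sketch of the pattern this rule targets, using hypothetical tables `t1` and `t2` (not from the PR): the predicate `t2.amount > 0` rejects the NULL-extended rows produced by the left outer join, so the outer join can be converted into an inner join.
```SQL
-- Hypothetical query: the WHERE predicate on the inner (nullable)
-- side filters out NULL-extended rows, so
--   t1 LEFT OUTER JOIN t2  =>  t1 INNER JOIN t2
SELECT t1.id, t2.amount
FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id
WHERE t2.amount > 0;
```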
2022-09-27 21:09:25 +08:00
57570f2090 [feature](Nereids) Set pre-aggregation status for OLAP table scan (#12785)
This is the second step for #12303.

The previous PR #12464 added the framework to select the rollup index for OLAP tables, but pre-aggregation was turned on by default.
This PR sets the pre-aggregation status for OLAP table scans.

The main steps are as below:
1. Select the rollup index when an aggregate is present; this is handled by the `SelectRollupWithAggregate` rule. Expressions in aggregate functions, grouping expressions, and pushed-down predicates are used to check whether pre-aggregation should be turned off (see the sketch below).
2. When selecting from an OLAP table without an aggregate plan, it is handled by `SelectRollupWithoutAggregate`.
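A rough sketch of when pre-aggregation stays on or is turned off, using a hypothetical aggregate-key table and rollup that are not part of this PR:
```SQL
-- Hypothetical aggregate-key table with a SUM value column.
CREATE TABLE sales (
    site_id INT,
    city    INT,
    pv      BIGINT SUM
) AGGREGATE KEY(site_id, city)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

ALTER TABLE sales ADD ROLLUP r1 (site_id, pv);

-- The aggregate function matches the column's SUM aggregation:
-- pre-aggregation can stay on.
SELECT site_id, SUM(pv) FROM sales GROUP BY site_id;

-- COUNT over the SUM value column does not match its aggregation
-- semantics, so pre-aggregation would be turned off.
SELECT site_id, COUNT(pv) FROM sales GROUP BY site_id;
```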
2022-09-27 19:12:15 +08:00
Pxl
9607f60845 [Feature](serialize) move block_data_version to fe heart beat (#12667)
Move block_data_version from the BE config to the FE heartbeat
2022-09-27 18:25:54 +08:00
ba5705a589 [feature-wip](statistics) step6: statistics is available (#8864)
This pull request includes some implementations of statistics (https://github.com/apache/incubator-doris/issues/6370).

Execute SQL statements such as `ANALYZE`, `SHOW ANALYZE`, and `SHOW TABLE/COLUMN STATS ...` to collect statistics and query them.

The following are the changes in this PR:
1. Added the necessary test cases for statistics.
2. Statistics optimization. To ensure the validity of statistics, they can only be updated after a statistics task completes or when manually updated by SQL; the collected statistics should not be changed in any other way, so that they are not distorted.
3. Some code and comments have been adjusted to fix checkStyle problems.
4. Removed some code that was previously added because statistics were not available.
5. Added a configuration that indicates whether statistics are enabled. The current statistics may not be stable, so they are not enabled by default (`enable_cbo_statistics=false`). Currently, this is mainly used for CBO testing.

See PR #12766 for the syntax; some simple examples of statistics:
```SQL
-- enable statistics
SET enable_cbo_statistics=true;

-- collect statistics for all tables in the current database
ANALYZE;

-- collect all column statistics for table1
ANALYZE test.table1;

-- collect statistics for siteid of table1
ANALYZE test.table1(siteid);
ANALYZE test.table1(pv, citycode);

-- collect statistics for partition of table1
ANALYZE test.table1 PARTITION(p202208);
ANALYZE test.table1 PARTITIONS(p202208, p202209);

-- display table statistics
SHOW TABLE STATS test.table1;

-- display partition statistics of table1
SHOW TABLE STATS test.table1 PARTITION(p202208);

-- display column statistics of table1
SHOW COLUMN STATS test.table1;

-- display column statistics of partition
SHOW COLUMN STATS test.table1 PARTITION(p202208);

-- display the details of the statistics jobs
SHOW ANALYZE;
SHOW ANALYZE idxxxx; 
```
2022-09-27 17:24:14 +08:00
c21ecdd867 [enhancement](test) add tpcds_sf1000 to p2 (#12695) 2022-09-27 17:12:52 +08:00
eba71cf5da [enhancement](test) add tpch_sf10 cases to p2 (#12698) 2022-09-27 17:12:37 +08:00
Pxl
64988cb3d4 [Enhancement](optimize) optimize for insert_indices_from (#12807) 2022-09-27 15:49:15 +08:00
cbdef66757 [test](join)add join case5 #12854 2022-09-27 15:48:36 +08:00
3dfcfc69ee [regression-test](join)add join case5 #12854 2022-09-27 15:47:36 +08:00
907494760d [typo](docs) Add bitmap_count doc and adjust the function list (#12978) 2022-09-27 14:21:37 +08:00
722106805f [chore](build) Fix compilation errors reported by clang-15 (#13000)
Add a compile flag -Wno-unused-but-set-variable to build libGeo.a.
2022-09-27 14:04:44 +08:00
3f99dd5c4b [function](bitmap) support bitmap_hash64 (#12992) 2022-09-27 12:16:02 +08:00
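A hedged usage sketch, assuming `bitmap_hash64` mirrors the existing `bitmap_hash` but with a 64-bit hash:
```SQL
-- Assumed usage, by analogy with bitmap_hash: the function hashes its
-- input and returns a bitmap containing that single hash value.
SELECT bitmap_count(bitmap_hash64('hello'));      -- expected: 1
SELECT bitmap_to_string(bitmap_hash64('hello'));  -- the 64-bit hash value
```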
a6db5e63df [fix](projection)sort node's unmaterialized slots should be removed from resolvedTupleExprs (#12963) 2022-09-27 11:46:44 +08:00
429ac929fb [chore](build) Support building from source on ubuntu-22.04 (aarch64) (#12813)
Support building from source on ubuntu-22.04
2022-09-27 10:29:13 +08:00
1cc15ccfa3 [feature-wip](unique-key-merge-on-write) fix thread safe issue in BetaRowsetWriter (#12875) 2022-09-27 10:28:18 +08:00
c4341d3d43 [fix](like)prevent null pointer by unimplemented like_vec functions (#12910)
* [fix](like)prevent null pointer by unimplemented like_vec functions

* fix pushed like predicate on dict encoded column bug
2022-09-27 10:02:10 +08:00
e040dccbec [fix](remote)fix bug for delete s3 dir and list s3 dir (#12918)
* fix bug for delete s3 dir and list s3 dir
2022-09-27 09:54:37 +08:00
72b909b5e8 [enhancement](workflow) Enable the shellcheck workflow to comment the PRs (#12633)
> Due to the dangers inherent to automatic processing of PRs, GitHub’s standard pull_request workflow trigger by 
> default prevents write permissions and secrets access to the target repository. However, in some scenarios such
> access is needed to properly process the PR. To this end the pull_request_target workflow trigger was introduced.

According to the article [Keeping your GitHub Actions and workflows secure](https://securitylab.github.com/research/github-actions-preventing-pwn-requests/), the `pull_request` trigger condition in `shellcheck.yml` cannot comment on the PR because the workflow lacks write permissions.

Although the `ShellCheck` workflow checks out the source, it doesn't build or test the source code. I think it is safe
to change the trigger condition from `pull_request` to `pull_request_target`, which gives the workflow write
permissions to comment on the PR.
2022-09-27 09:08:12 +08:00
b14b178928 [enhancement](memory) Trigger load channel flush based on process physical memory to avoid OOM #12960
When the physical memory of the process reaches 90% of the mem limit, trigger the load channel mgr to flush.
The default value of mem_limit in be.conf is changed from 90% to 80%, prioritizing stability.
Fix a deadlock between arena_locks in BufferPool::BufferAllocator::ScavengeBuffers and _lock in DebugString.
2022-09-27 09:07:38 +08:00
df9dcba6db [regression-case](improve) improve regression test case (#12979) 2022-09-27 08:53:53 +08:00
wxy
c4b6d4d839 [enhancement](AuditLoaderPlugin): add audit queue capacity configurat… (#12887) 2022-09-27 08:50:30 +08:00
Pxl
12d6efa92b [Bug](function) fix substr return null on row-based engine #12906 2022-09-27 08:47:32 +08:00
5790d23624 [fix](transfer_thread) fix the loss of notification. (#12988) 2022-09-27 08:44:02 +08:00
Pxl
8731eea26e [Chore](clang) fix some build fail on clang15 (#12882)
remove unused variables
2022-09-26 23:13:28 +08:00
595a5337dc fix doc typos (#12967) 2022-09-26 20:11:26 +08:00
35076431ab [fix](column)fix get_shrinked_column misspell (#12961)
Fix misspell
2022-09-26 17:32:03 +08:00
7977bebfed [feature](Nereids) constant expression folding (#12151) 2022-09-26 17:16:23 +08:00
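A small, hedged illustration of what constant folding means in general (the exact set of foldable expressions is defined by the PR, not here): constant sub-expressions are evaluated once at plan time instead of per row.
```SQL
-- The constant arithmetic and the always-true comparison can be
-- folded at plan time, e.g. 100 + 200 -> 300 and 1 = 1 -> TRUE.
-- (t and k are hypothetical names.)
SELECT 100 + 200;
SELECT * FROM t WHERE 1 = 1 AND k > 100 + 200;
```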
3902b2bfad [refactor](fe-core src test catalog): refactor and replace use NIO #12818 (#12818) 2022-09-26 16:51:46 +08:00
1bb42a7bc0 [function](hash) add support of murmur_hash3_64 (#12923) 2022-09-26 14:23:37 +08:00
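A hedged usage sketch, assuming `murmur_hash3_64` follows the existing `murmur_hash3_32` signature (one or more string arguments, returning a 64-bit hash):
```SQL
-- Assumed usage, by analogy with murmur_hash3_32.
SELECT murmur_hash3_64('hello');
SELECT murmur_hash3_64('hello', 'world');
```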
72220440dc [fix](memtracker) Remove mem tracker record mem pool actual memory usage #12954
This avoids inconsistent mem tracker consumption values across multiple queries/loads, and the gap between the virtual memory returned by alloc and the physical memory actually used by the process.

Memory allocated in PODArray and MemPool is not recorded in the query/load mem tracker immediately, but is recorded gradually as the memory is used.

However, MemPool allocates memory from the chunk allocator. If a chunk is reused (used for the second time or later), it may already be backed by physical memory. The above mechanism therefore causes the load channel memory statistics to be smaller than the actual value.
2022-09-26 12:54:06 +08:00
9afa3cdb19 Optimized materialized view documentation (#12798)
Optimized materialized view documentation
2022-09-26 12:25:20 +08:00
18433d7105 Spark load import kerberos parameter modification (#12924)
Spark load import kerberos parameter modification
2022-09-26 12:24:43 +08:00
c809a21993 [feature](nereids) extract single table expression for push down (#12894)
In TPC-H q7, we have an expression like
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY') or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')

This expression implies
(n1.n_name = 'FRANCE' or n1.n_name = 'GERMANY')
The implied expression is logically redundant, but it can be used to reduce the number of tuples output by scan(n1) if Nereids pushes it down (as sketched below).

This PR introduces a rule to extract such expressions.

NOTE:
1. we only extract expressions on a single table.
2. if the extracted expression cannot be pushed down, e.g. it is on the right table of a left outer join, we need another rule to remove all the useless expressions.
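A sketch of the rewrite on the q7-style predicate, assuming the TPC-H `nation` table: the original condition is kept, and the implied single-table conditions are added so they can be pushed down to each scan.
```SQL
SELECT *
FROM nation n1, nation n2
WHERE ((n1.n_name = 'FRANCE'  AND n2.n_name = 'GERMANY')
    OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
  -- implied, logically redundant, but pushable to scan(n1) / scan(n2):
  AND (n1.n_name = 'FRANCE'  OR n1.n_name = 'GERMANY')
  AND (n2.n_name = 'GERMANY' OR n2.n_name = 'FRANCE');
```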
2022-09-26 11:19:37 +08:00
0fcb93aae2 [fix](parquet) fix write error data as parquet format. (#12864)
* [fix](parquet) fix write error data as parquet format.

Fix incorrect data conversion when writing tiny int and small int data
to parquet files in the non-vectorized engine.
2022-09-26 10:41:17 +08:00
9c03deb150 [fix](log)Audit log status is incorrect (#12824)
Audit log status is incorrect
2022-09-26 09:57:52 +08:00
978dae267e [typo](docs)Optimized string and date function doc (#12949) 2022-09-26 09:26:12 +08:00