Commit Graph

254 Commits

Author SHA1 Message Date
57656b2459 [Enhancement](java-udf) java-udf module split to sub modules (#20185)
The java-udf module has become increasingly large and difficult to manage, making it inconvenient to package and use as needed. It needs to be split into multiple sub-modules, such as java-common, java-udf, jdbc-scanner, hudi-scanner, and paimon-scanner.

Co-authored-by: lexluo <lexluo@tencent.com>
2023-06-13 09:41:22 +08:00
73ad885e19 [Feature][Fix](multi-catalog) Implements transactional hive full acid tables. (#20679)
After support for insert-only transactional hive tables (#19518, #19419), this PR supports transactional hive full acid tables.

Support hive3 transactional hive full acid tables.
Hive2 transactional hive full acid tables need to run major compactions.
2023-06-13 08:55:16 +08:00
0b228b3414 [fix](load)Support load json data with default value (#20624)
* support json default value

---------

Co-authored-by: duanxujian <duanxujian@jd.com>
2023-06-12 14:51:31 +08:00
a6f625676b [profile](remove child) child is for node, should not be used to organize counters (#20676)
Currently, many profiles use add child profile to organize a profile into blocks, but this is wrong: a child profile carries a total time counter. What we should use instead is just a label.

                          -  MemoryUsage:  
                              -  HashTable:  23.98  KB
                              -  SerializeKeyArena:  446.75  KB
Add a new macro ADD_LABEL_COUNTER to add just a label in the profile.

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-06-12 10:00:35 +08:00
9a83d78dfe [Enhancement](hudi) support hudi mor table, step2 follow #19909 (#20570)
PR(https://github.com/apache/doris/pull/19909) has implemented the framework of hudi reader for MOR table. This PR completes all functions of reading MOR table and enables end-to-end queries.
Key Implementations:
1. Use hudi meta information to generate the table schema, not from hive client.
2. Use hive client to list hudi partitions, so it strongly depends on the sync-tools(https://hudi.apache.org/docs/syncing_metastore/), which sync the partitions of hudi into hive metastore. However, we may get the hudi partitions directly from the .hoodie directory.
3. Remove `HudiHMSExternalCatalog`, because other catalogs like glue are compatible with hive catalog.
4. Read the COW table natively from C++.
5. Hudi RecordReader will use ProcessBuilder to start a hotspot debugger process, which may get stuck when attaching to the original JNI process, so I use a tricky method to kill this useless process.
2023-06-10 12:25:53 +08:00
656b9ad3da [enhancement](index) Nereids support no need to read raw data for index column that only in filter conditions (#20605) 2023-06-09 21:54:48 +08:00
93b53cf2f4 [improvement](exception-safe) create and prepare node/sink support exception safe (#20551) 2023-06-09 21:06:59 +08:00
195beec3a8 [Fix](external scan node)Use consistent hash to collect BE only when the file cache is enabled. #20560
Use consistent hash to collect BE only when the file cache is enabled. And move the consistent BE assign code to FederationBackendPolicy.
Fix explain split number and file size incorrect bug.
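
To illustrate the idea of assigning file splits by consistent hash only when the file cache is enabled, here is a minimal Python sketch. It is not the code in FederationBackendPolicy: the class `ConsistentHashRing`, the function `assign_backend`, the virtual-node count, and the round-robin fallback are all assumptions made for illustration.

```python
import bisect
import hashlib


def _stable_hash(key: str) -> int:
    # Python's built-in hash() is randomized per process, so use md5 for stability.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring that maps a file path to one backend name."""

    def __init__(self, backends, virtual_nodes=64):
        # Each backend is placed on the ring several times to even out the load.
        self._ring = sorted(
            (_stable_hash(f"{be}#{i}"), be)
            for be in backends
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, file_path: str) -> str:
        # First ring point clockwise from the file's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _stable_hash(file_path))
        return self._ring[idx % len(self._ring)][1]


def assign_backend(file_path, backends, ring, file_cache_enabled, counter):
    """Use the ring only when the file cache is enabled (assumed fallback: round-robin)."""
    if file_cache_enabled:
        # The same file always lands on the same backend, so its cache stays warm.
        return ring.pick(file_path)
    return backends[counter % len(backends)]
```

With the cache enabled, repeated queries over the same file are routed to the same backend; with it disabled, there is no locality to preserve, so a simpler policy can spread splits evenly.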
2023-06-09 08:43:12 +08:00
841094960f [fix](olapscanner) fix coredump caused by concurrent access of olap scan node _conjuncts (#20534)
==3073084==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60601897db80 at pc 0x55b2c993666e bp 0x7d1fbbfb66b0 sp 0x7d1fbbfb66a8
READ of size 8 at 0x60601897db80 thread T610 (_scanner_scan)
    #0 0x55b2c993666d in std::__shared_ptr<doris::vectorized::VExprContext, (__gnu_cxx::_Lock_policy)2>::get() const /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1291:16
    #1 0x55b2dae86ec5 in doris::vectorized::VExprContext::clone(doris::RuntimeState*, std::shared_ptr<doris::vectorized::VExprContext>&) /mnt/disk2/tengjianping/doris-master/be/src/vec/exprs/vexpr_context.cpp:98:5
    #2 0x55b2e757b6d8 in doris::vectorized::VScanner::prepare(doris::RuntimeState*, std::vector<std::shared_ptr<doris::vectorized::VExprContext>, std::allocator<std::shared_ptr<doris::vectorized::VExprContext>>> const&) /mnt/disk2/tengjianping/doris-master/be/src/vec/exec/scan/vscanner.cpp:47:13
    #3 0x55b2e78e8155 in doris::vectorized::NewOlapScanner::init() /mnt/disk2/tengjianping/doris-master/be/src/vec/exec/scan/new_olap_scanner.cpp:109:5
    #4 0x55b2e7551c81 in doris::vectorized::ScannerScheduler::_scanner_scan(doris::vectorized::ScannerScheduler*, doris::vectorized::ScannerContext*, std::shared_ptr<doris::vectorized::VScanner>) /mnt/disk2/tengjianping/doris-master/be/src/vec/exec/scan/scanner_scheduler.cpp:279:27
    #5 0x55b2e7554d5e in doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()::operator()() const /mnt/disk2/tengjianping/doris-master/be/src/vec/exec/scan/scanner_scheduler.cpp:202:31
    #6 0x55b2e7554c14 in void std::__invoke_impl<void, doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()&>(std::__invoke_other, doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14
    #7 0x55b2e7554bb4 in std::enable_if<is_invocable_r_v<void, doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()&>, void>::type std::__invoke_r<void, doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()&>(doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:111:2
    #8 0x55b2e7554a1c in std::_Function_handler<void (), doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_0::operator()() const::'lambda0'()>::_M_invoke(std::_Any_data const&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291:9
    #9 0x55b2c80f2cd2 in std::function<void ()>::operator()() const /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:560:9
    #10 0x55b2e755f3e4 in doris::PriorityWorkStealingThreadPool::work_thread(int) /mnt/disk2/tengjianping/doris-master/be/src/util/priority_work_stealing_thread_pool.hpp:135:17
    #11 0x55b2e7563c72 in void std::__invoke_impl<void, void (doris::PriorityWorkStealingThreadPool::* const&)(int), doris::PriorityWorkStealingThreadPool*&, int&>(std::__invoke_memfun_deref, void (doris::PriorityWorkStealingThreadPool::* const&)(int), doris::PriorityWorkStealingThreadPool*&, int&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:74:14
    #12 0x55b2e7563b44 in std::__invoke_result<void (doris::PriorityWorkStealingThreadPool::* const&)(int), doris::PriorityWorkStealingThreadPool*&, int&>::type std::__invoke<void (doris::PriorityWorkStealingThreadPool::* const&)(int), doris::PriorityWorkStealingThreadPool*&, int&>(void (doris::PriorityWorkStealingThreadPool::* const&)(int), doris::PriorityWorkStealingThreadPool*&, int&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14
    #13 0x55b2e7563b14 in decltype(std::__invoke((*this)._M_pmf, std::forward<doris::PriorityWorkStealingThreadPool*&>(fp), std::forward<int&>(fp))) std::_Mem_fn_base<void (doris::PriorityWorkStealingThreadPool::*)(int), true>::operator()<doris::PriorityWorkStealingThreadPool*&, int&>(doris::PriorityWorkStealingThreadPool*&, int&) const /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/functional:131:11
    #14 0x55b2e7563ae4 in void std::__invoke_impl<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)>&, doris::PriorityWorkStealingThreadPool*&, int&>(std::__invoke_other, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)>&, doris::PriorityWorkStealingThreadPool*&, int&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14
    #15 0x55b2e7563a54 in std::enable_if<is_invocable_r_v<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)>&, doris::PriorityWorkStealingThreadPool*&, int&>, void>::type std::__invoke_r<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)>&, doris::PriorityWorkStealingThreadPool*&, int&>(std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)>&, doris::PriorityWorkStealingThreadPool*&, int&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:111:2
    #16 0x55b2e75639c3 in void std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>::__call<void, 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/functional:570:11
    #17 0x55b2e756382d in void std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>::operator()<>() /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/functional:629:17
    #18 0x55b2e7563744 in void std::__invoke_impl<void, std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>>(std::__invoke_other, std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>&&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14
    #19 0x55b2e7563704 in std::__invoke_result<std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>>::type std::__invoke<std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>>(std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>&&) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14
    #20 0x55b2e75636dc in void std::thread::_Invoker<std::tuple<std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>>>::_M_invoke<0ul>(std::_Index_tuple<0ul>) /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13
    #21 0x55b2e75636b4 in std::thread::_Invoker<std::tuple<std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>>>::operator()() /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11
    #22 0x55b2e7563638 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind_result<void, std::_Mem_fn<void (doris::PriorityWorkStealingThreadPool::*)(int)> (doris::PriorityWorkStealingThreadPool*, int)>>>>::_M_run() /mnt/disk2/tengjianping/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13
    #23 0x55b2eb41d0ef in execute_native_thread_routine /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:82:18
    #24 0x7f1dfd4e1179 in start_thread pthread_create.c
    #25 0x7f1dfdd7bdf2 in clone (/lib64/libc.so.6+0xfcdf2) (BuildId: 20ee73ce1b6ac38a52440bab82ec7e28f0f5c5b9)
2023-06-07 17:00:29 +08:00
fe63a0a3bb [Feature](multi-catalog)support paimon catalog (#19681)
CREATE CATALOG paimon_n2 PROPERTIES (
"dfs.ha.namenodes.HDFS1006531" = "nn2,nn1",
"dfs.namenode.rpc-address.HDFS1006531.nn2" = "172.16.65.xx:4007",
"dfs.namenode.rpc-address.HDFS1006531.nn1" = "172.16.65.xx:4007",
"hive.metastore.uris" = "thrift://172.16.65.xx:7004",
"type" = "paimon",
"dfs.nameservices" = "HDFS1006531",
"hadoop.username" = "hadoop",
"paimon.catalog.type" = "hms",
"warehouse" = "hdfs://HDFS1006531/data/paimon1",
"dfs.client.failover.proxy.provider.HDFS1006531" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
2023-06-06 15:08:30 +08:00
c7888f4bfa [feature](profile)Add the filtering info of the in filter in profile #20321
Currently, it is difficult to obtain the id of some in filters, so the id of those in filters is shown as -1.
2023-06-06 10:24:59 +08:00
1fc48e83f2 [fix](executor)Fix duplicate timer and add open timer #20448
1. Currently, a node's total timer counter is timed twice (in Open and alloc_resource), which may make the timer in the profile incorrect.
2. Add more timers to find code that may take a long time.
2023-06-06 08:55:52 +08:00
b7fc17da68 [feature-wip](multi-catalog)(step2)support read max compute data by JNI (#19819)
Issue Number: #19679
2023-06-05 22:10:08 +08:00
f0513a861d [Improve](Scan) add a session variable to make scan run serial (#20220)
Parallel scanning can result in read amplification. For example, a query such as select * from xx limit 1 actually requires only one row of data, but because multiple tablets are scanned in parallel, more data is read than needed, leading to performance bottlenecks in high-concurrency scenarios. This PR adds a session variable to enforce serial scanning, which can help mitigate this issue.
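
The read amplification described above can be shown with a small simulation. This is only a sketch: `scan_with_limit`, the batch size, and the "one batch per tablet per round" model of parallel scanning are assumptions for illustration, not the scheduler's actual behavior.

```python
def scan_with_limit(tablets, limit, serial, batch_size=2):
    """Return (rows, rows_read) for a LIMIT query over in-memory 'tablets'.

    Simulation only: no real threads are used.
    """
    rows, rows_read = [], 0
    if serial:
        # Serial: read one batch at a time and stop as soon as the limit is met.
        for tablet in tablets:
            for start in range(0, len(tablet), batch_size):
                batch = tablet[start:start + batch_size]
                rows_read += len(batch)
                rows.extend(batch)
                if len(rows) >= limit:
                    return rows[:limit], rows_read
        return rows[:limit], rows_read

    # Parallel (modeled): in each round every tablet reads one batch,
    # and the limit is only checked after the whole round completes.
    offset = 0
    while len(rows) < limit and any(offset < len(t) for t in tablets):
        for tablet in tablets:
            batch = tablet[offset:offset + batch_size]
            rows_read += len(batch)
            rows.extend(batch)
        offset += batch_size
    return rows[:limit], rows_read
```

For three tablets of four rows each and `limit=1`, the serial path reads a single batch while the modeled parallel path reads one batch from every tablet, even though both return the same row.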
2023-06-01 15:06:35 +08:00
f9dfcb923d [Enhancement] Change Create Resource Group Grammar (#20249) 2023-05-31 15:23:24 +08:00
0c98355fff [fix](catalog) fix create catalog with resource replay issue and kerberos auth issue (#20137)
1. Fix create catalog with resource replay bug.
	If a user creates a catalog using `create catalog hive with resource xxx`, there is a bug when replaying the edit log:
	the resource may be dropped, causing an NPE, and FE will fail to start.

	In this PR, I add a new FE config `disallow_create_catalog_with_resource`, which defaults to true,
	so that `with resource` will not be allowed; it will be deprecated later.

	And also fix the replay bug to avoid NPE.

2. Fix issue when creating 2 hive catalogs to connect with and without kerberos authentication.

	When a user creates 2 hive catalogs, one using simple auth and the other using kerberos auth,
	the query may fail with an error like: `Server asks us to fall back to SIMPLE auth, but this client is configured to only allow secure connections.`

	So I add a default property for hive catalog: `"ipc.client.fallback-to-simple-auth-allowed" = "true"`.
	This property will be added automatically when a user creates a hive catalog, to avoid such problems.

3. Fix calling `hdfsExists()` issue

	When `hdfsExists()` returns a non-zero code, we should check whether it encountered an error or the file was simply not found.

4. Some code refactor

	Avoid importing `org.apache.parquet.Strings`
2023-05-30 16:57:39 +08:00
de08c4a57b [enhance](match) Support match query without inverted index (#19936) 2023-05-30 15:02:57 +08:00
ab8125d56f [Improve](performance) introduce SchemaCache to cache TabletSchema & Schema (#20037)
* [Improve](performance) introduce SchemaCache to cache TabletSchema & Schema

1. When the system is under high-concurrency load with wide table point queries, the frequent memory allocation and deallocation of Schema becomes an evident system bottleneck. Additionally, the initialization of TabletSchema and Schema also becomes a CPU hotspot. Therefore, a SchemaCache is introduced to cache these objects for reuse.

2. Wrap some variables with std::unique_ptr
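
As a rough illustration of the caching idea in item 1, the following Python sketch reuses an already-built schema object instead of rebuilding it on every query. It is not the Doris implementation: the class name `SchemaCacheSketch`, the `(index_id, schema_version)` key, the `build` callback, and the LRU eviction are assumptions. Including the version in the key reflects the "handle schema change with schema version" note in this commit.

```python
from collections import OrderedDict


class SchemaCacheSketch:
    """Hypothetical LRU cache of built schema objects."""

    def __init__(self, capacity=1024):
        self._capacity = capacity
        self._entries = OrderedDict()
        self.builds = 0  # how many times a schema actually had to be built

    def get_or_build(self, index_id, schema_version, build):
        # Including the version in the key means a schema change never
        # returns a stale cached object.
        key = (index_id, schema_version)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        schema = build()
        self.builds += 1
        self._entries[key] = schema
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return schema
```

Under a point-query workload that hits the same table repeatedly, every lookup after the first returns the cached object, which is the allocation and initialization cost the commit aims to avoid.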

Performance:
| Status               | QPS | Average response time | P99 response time |
|----------------------|-----|-----------------------|-------------------|
| SchemaCache enabled  | 501 | 20ms                  | 34ms              |
| SchemaCache disabled | 321 | 31ms                  | 61ms              |

* handle schema change with schema version

* remove useless header

* rebase
2023-05-29 17:34:53 +08:00
55ccddb62c [Conf](decimalv3) enable decimalv3 by default 2023-05-29 15:38:31 +08:00
Pxl
8376e5eefb [Chore](build) add non-virtual-dtor, remove no-embedded-directive/no-zero-length-array (#20118)
add non-virtual-dtor, remove no-embedded-directive/no-zero-length-array
2023-05-29 14:42:47 +08:00
9f8de89659 [refactor](exec) replace the single pointer with an array of 'conjuncts' in ExecNode (#19758)
Refactoring the filtering conditions in the current ExecNode from an expression tree to an array can simplify the process of adding runtime filters. It eliminates the need for complex merge operations and removes the requirement for the frontend to combine expressions into a single entity.

By representing the filtering conditions as an array, each condition can be treated individually, making it easier to add runtime filters without the need for complex merging logic. The array can store the individual conditions, and the runtime filter logic can iterate through the array to apply the filters as needed.

This refactoring simplifies the codebase, improves readability, and reduces the complexity associated with handling filtering conditions and adding runtime filters. It separates the conditions into discrete entities, enabling more straightforward manipulation and management within the execution node.
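
A minimal Python sketch of the array-of-conjuncts idea follows. The names `apply_conjuncts` and `add_runtime_filter` and the use of plain callables as predicates are assumptions for illustration; the actual ExecNode code works with expression contexts in C++.

```python
def apply_conjuncts(rows, conjuncts):
    """Keep a row only if every conjunct (an independent predicate) accepts it."""
    return [row for row in rows if all(predicate(row) for predicate in conjuncts)]


def add_runtime_filter(conjuncts, column, allowed_keys):
    """Adding a runtime filter is a plain append; no expression tree is rebuilt."""
    conjuncts.append(lambda row: row[column] in allowed_keys)
```

With a single expression tree, adding a filter means building a new AND node over the old root; with an array, the new condition is simply appended and evaluated alongside the others.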
2023-05-29 11:47:31 +08:00
Pxl
15a7420661 [Chore](ub) fix some undefined behaviors (#19986)
/home/zcp/repo_center/doris_master/doris/be/src/olap/rowset/segment_v2/column_reader.cpp:895:21: runtime error: load of value 423208544, which is not a valid value for type 'doris::ReaderType'

/home/zcp/repo_center/doris_master/doris/be/src/vec/columns/column_decimal.cpp:260:33: runtime error: load of misaligned address 0x7fa3348b301c for type 'int64_t' (aka 'long'), which requires 8 byte alignment

/home/zcp/repo_center/doris_master/doris/be/src/olap/block_column_predicate.cpp:82:24: runtime error: variable length array bound evaluates to non-positive value 0

/home/zcp/repo_center/doris_master/doris/be/src/vec/columns/column_string.h:225:26: runtime error: null pointer passed as argument 2, which is declared to never be null
2023-05-26 14:08:40 +08:00
92a6122f74 [feature](profile)Add the filtering information of the Bloom filter in profile. (#19789) 2023-05-26 10:56:58 +08:00
6efe6ef6e8 [Enhancement](scanner) allocate blocks in scanner_context on demand and free them on close (#19389)
Firstly, to reduce memory usage, we do not pre-allocate blocks; instead, a block is allocated lazily when the caller invokes get_free_block. When the caller invokes return_free_block to return a free block, the block is added to a queue for memory reuse, and the blocks in the queue are freed when the scanner_context is closed rather than when it is destructed.
Secondly, to limit the memory usage of the scanner, we introduce a variable _free_blocks_capacity to indicate the current number of free blocks available to the scanners. The number of scanners that can be scheduled will be calculated based on this value.
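
The two points above can be sketched as follows. This is an illustrative Python model, not the scanner_context code: the class name `ScannerContextSketch`, the use of a list as a "block", and the formula in `schedulable_scanners` are assumptions.

```python
from collections import deque


class ScannerContextSketch:
    """Hypothetical model of on-demand block allocation with a reuse queue."""

    def __init__(self, free_blocks_capacity):
        self._free_blocks = deque()  # returned blocks waiting to be reused
        self.free_blocks_capacity = free_blocks_capacity
        self.allocated = 0  # number of blocks actually allocated

    def get_free_block(self):
        self.free_blocks_capacity -= 1
        if self._free_blocks:
            return self._free_blocks.popleft()  # reuse instead of allocating
        self.allocated += 1  # lazy allocation: only when a caller asks
        return []

    def return_free_block(self, block):
        block.clear()
        self._free_blocks.append(block)
        self.free_blocks_capacity += 1

    def schedulable_scanners(self, blocks_per_scanner=1):
        # Assumed rule: schedule only as many scanners as the capacity allows.
        return max(0, self.free_blocks_capacity // blocks_per_scanner)

    def close(self):
        # Queued blocks are released on close, not on destruction.
        self._free_blocks.clear()
```

A caller would typically loop on `get_free_block`, fill the block, hand it upstream, and call `return_free_block` once the consumer is done, so a steady-state scan allocates only a handful of blocks.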

ssb flat test
previous
lineorder 1.2G:
load time: 3s, query time: 0.355s
lineorder 5.8G:
load time: 330s, query time: 0.970s
load time: 349s, query time: 0.949s
load time: 349s, query time: 0.955s
load time: 360s, query time: 0.889s (pipeline enabled)
after
lineorder 1.2G:
load time: 3s, query time: 0.349s
lineorder 5.8G:
load time: 342s, query time: 0.929s
load time: 337s, query time: 0.913s
load time: 345s, query time: 0.946s
load time: 346s, query time: 0.865s (pipeline enabled)
2023-05-23 18:17:21 +08:00
53ba46e404 [Fix][Refactor] Fix 'not member call on null pointer of type 'doris::TextConverter' error in ubsan env and refactor text converter. (#19849)
Fix 'not member call on null pointer of type doris::TextConverter' error in ubsan env and refactor text converter.
2023-05-22 21:00:19 +08:00
272a7565b8 [improvement](tracing) Remove useless span levels from be side tracing (#19665)
1. Instead of one span per exec node method, use one span per exec node;
2. Fix some problems with tracing in pipeline.
2023-05-17 19:04:52 +08:00
Pxl
7f73749b88 [Bug](pipeline) fix distributionColumnIds not updated correct when outputColumnUnique… (#19704)
fix distributionColumnIds not updated correctly when outputColumnUnique
2023-05-17 00:13:10 +08:00
92bf485abd [Bug] Fix doris pipeline shared scan and top n opt (#19599) 2023-05-15 10:00:44 +08:00
1d421a26d9 [bugfix](memory) merge block may allocate failed (#19507) 2023-05-11 10:42:47 +08:00
95833426e8 [BugFix](table-value-function) Fix backends() tvf (#19452)
Change the `Alive/SystemDecommissioned/ClusterDecommissioned` field type of the `backends()`tvf to bool
2023-05-11 07:49:27 +08:00
4483e3a6e1 [Improvement](scan) add a config for scan queue memory limit (#19439) 2023-05-10 13:14:23 +08:00
Pxl
5473795a51 [Bug](scan) forbid pushing down the in predicate when in_state->use_set is false (#19471)
forbid pushing down the in predicate when in_state->use_set is false
2023-05-10 11:12:20 +08:00
cf8ceb8586 [fix](scan) fix scanner mem tracker (#19354) 2023-05-10 09:56:41 +08:00
096aa25ca6 [improvement](orc-reader) Implements ORC lazy materialization (#18615)
- Implements ORC lazy materialization, integrate with the implementation of https://github.com/apache/doris-thirdparty/pull/56 and https://github.com/apache/doris-thirdparty/pull/62.
- Refactor code: Move `execute_conjuncts()` and `execute_conjuncts_and_filter_block()` in `parquet_group_reader` to `VExprContext`, used by parquet reader and orc reader.
- Add session variables `enable_parquet_lazy_materialization` and `enable_orc_lazy_materialization` to control whether enable lazy materialization.
- Modify `build.sh` to update apache-orc submodule or download package every time.
2023-05-09 23:33:33 +08:00
9edbfa37cd [Enhancement](Broker Load) New progress manager for showing loading progress status (#19170)
This work is at an early stage. The current progress is not accurate because the scan range granularity is too coarse for gathering information.
In addition, only the file scan node and import jobs support the new progress manager.

## How it works

for example, when we use the following load query:
```
LOAD LABEL test_broker_load
(
	DATA INFILE("XXX")
	INTO TABLE `XXX`
        ......
)
```

Initial Progress: the query calls `BrokerLoadJob` to create the job, then `coordinator` is called to calculate the scan ranges and their locations.
Update Progress: BE reports runtime_state to FE, and FE updates the progress status according to jobID and fragmentID.

We can use `show load` to see the progress:

PENDING:
```
         State: PENDING
      Progress: 0.00%
```

LOADING:
```
         State: LOADING
      Progress: 14.29% (1/7)
```

FINISH:
```
         State: FINISHED
      Progress: 100.00% (7/7)
```
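
The progress strings shown above can be reproduced with a small Python sketch. The class `JobProgressSketch` and the notion of a generic "unit" of work are assumptions for illustration; the commit only states that FE updates progress by jobID and fragmentID.

```python
def format_progress(finished, total):
    """Format progress the way `show load` displays it in the examples above."""
    if total == 0:
        # Nothing registered yet (PENDING): show a bare percentage.
        return "0.00%"
    return f"{finished / total * 100:.2f}% ({finished}/{total})"


class JobProgressSketch:
    """Hypothetical per-job tracker: each registered unit is either done or not."""

    def __init__(self):
        self._units = {}  # unit id -> finished flag

    def add_unit(self, unit_id):
        self._units[unit_id] = False

    def mark_finished(self, unit_id):
        if unit_id in self._units:
            self._units[unit_id] = True

    def __str__(self):
        finished = sum(1 for done in self._units.values() if done)
        return format_progress(finished, len(self._units))
```

Registering seven units and finishing one yields `14.29% (1/7)`, matching the LOADING example.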

Currently, the full output of `show load\G` looks like:

```
*************************** 1. row ***************************
         JobId: 25052
         Label: test_broker
         State: LOADING
      Progress: 0.00% (0/7)
          Type: BROKER
       EtlInfo: NULL
      TaskInfo: cluster:N/A; timeout(s):250000; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-05-03 20:53:13
  EtlStartTime: 2023-05-03 20:53:15
 EtlFinishTime: 2023-05-03 20:53:15
 LoadStartTime: 2023-05-03 20:53:15
LoadFinishTime: NULL
           URL: NULL
    JobDetails: {"Unfinished backends":{"5a9a3ecd203049bc-85e39a765c043228":[10080]},"ScannedRows":39611808,"TaskNumber":1,"LoadBytes":7398908902,"All backends":{"5a9a3ecd203049bc-85e39a765c043228":[10080]},"FileNumber":1,"FileSize":7895697364}
 TransactionId: 14015
  ErrorTablets: {}
          User: root
       Comment: 
```

## TODO:

1. The current partition granularity of the scan range is too large, resulting in uneven progress during loading.
2. Only broker load supports the new Progress Manager; support progress for other queries.
2023-05-06 22:44:40 +08:00
4e4fb33995 [refactor](conjuncts) simplify conjuncts in exec node (#19254)
Co-authored-by: yiguolei <yiguolei@gmail.com>
Currently, the exec node saves exprcontext**, but the object is owned by the object pool, which makes the code very unclear. We could just use exprcontext*.
2023-05-04 18:04:32 +08:00
c74c2a4f8e [fix](Metadata tvf) Metadata TVF supports read the specified columns from Fe (#19110) 2023-04-29 00:06:08 +08:00
28016c53f0 [profile](rf) refactor profile of runtime filters (#19134)
* [profile](rf) refactor profile of runtime filters


---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-04-28 08:46:42 +08:00
a262f42a28 [refactor](exceptionsafe) make scanner and scancontext exception safe (#19057) 2023-04-27 09:23:01 +08:00
aabcab9dbe [Improvement](runtime filter) Improve merge phase (#18828) 2023-04-26 21:01:20 +08:00
339d804ec4 [Refactor](exceptionsafe) add factory creator to some class (#19000) 2023-04-25 14:33:47 +08:00
b2c26e17e1 [Compile](vec) Fix compile by BHREAD_SCANNER (#18979) 2023-04-24 17:07:06 +08:00
8d7a9fd21b [refactor](exceptionsafe) add factory creator to some class (#18978)
make vexprcontext, vexpr, function, query context, and runtimestate thread safe.


---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-24 10:32:11 +08:00
8e4710079d [improvement](profile) Insert into add LoadChannel runtime profile (#18908)
TabletSink and LoadChannel in BE have an M:N relationship.
Every once in a while, a LoadChannel returns its own runtime profile to a random TabletSink, so each TabletSink usually ends up holding the runtime profiles of all LoadChannels, although the copies of the same LoadChannel profile held by different TabletSinks differ in freshness. Each TabletSink periodically reports all the LoadChannel profiles it holds to FE, and the latest LoadChannel profile is kept according to the timestamp.
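
The "keep the latest profile by timestamp" rule can be sketched in a few lines of Python. The function name `merge_reports` and the `(timestamp, profile)` tuple layout are assumptions for illustration, not the actual report format.

```python
def merge_reports(latest, report):
    """Merge one TabletSink report into the latest known LoadChannel profiles.

    Both arguments map channel_id -> (timestamp, profile). A reported profile
    replaces the stored one only if it is strictly newer.
    """
    for channel_id, (timestamp, profile) in report.items():
        current = latest.get(channel_id)
        if current is None or timestamp > current[0]:
            latest[channel_id] = (timestamp, profile)
    return latest
```

Because different TabletSinks may report copies of the same LoadChannel profile with different ages, comparing timestamps keeps an older copy arriving late from overwriting a newer one.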
2023-04-24 09:41:57 +08:00
3736530585 [refactor](query context) rename query fragments context to query context and make query context safe (#18950)
* [refactor](query context) rename query fragments context to query context and make query context safe

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-23 22:53:56 +08:00
63a76ed115 [refactor](exceptionsafe) disallow call new method explicitly (#18830)
Disallow calling new explicitly.
Force the use of create_shared or create_unique to obtain smart pointers.
Placement new is still allowed.
See https://abseil.io/tips/42 for adding factory methods to all classes.
I think we should follow this guide because if an exception is thrown in new, the program will terminate.

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-21 09:13:24 +08:00
b26e2d5d50 [bugfix](memoryleak) close expr after it is pushdown to storage layer (#18849) (#18852)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-21 05:21:16 +08:00
e412dd12e8 [chore](build) Use include-what-you-use to optimize includes (PART II) (#18761)
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
2023-04-19 23:11:48 +08:00
eb128753ac [Opt](pipeline) opt pipeline shared scan (#18715) 2023-04-17 13:06:39 +08:00
0f00ad4d2a [fix](executor)Fix scanner's _max_thread_num may == 0 #18465 2023-04-16 18:17:18 +08:00