Commit Graph

955 Commits

Author SHA1 Message Date
308ff9a16f [enhancement](memory) track LRU cache memory and page memory not in cache (#18361)
Report LRU cache memory statistics in metrics
Track page memory not in cache in the mem tracker
2023-04-07 14:22:44 +08:00
c32adba1cf [Refactor](Pipeline) Refactor pipeline code to improve coverage (#18376)
Refactor pipeline code to improve coverage
2023-04-07 13:09:44 +08:00
7f8d92656e [fix](streamload) fix stream load failing when profile is enabled (#18364)
#18015 enabled the stream load profile log; however, BE encounters an RPC failure when loading TPC-H data (see #18291). This is because when `is_report_success` is true, BE calls reportExecStatus to FE, but FE cannot find the QueryInfo in `coordinatorMap` and therefore returns an error to BE.
2023-04-05 01:01:46 +08:00
66bfd18601 [opt](file_reader) add prefetch buffer to read CSV & JSON files (#18301)
Co-authored-by: ByteYue <yj976240184@gmail.com>
This PR is an optimization for https://github.com/apache/doris/pull/17478:
1. Change the buffer size of `LineReader` to 4MB to align with the size of the prefetch buffer.
2. Lazily prefetch data in the first read to prevent wasted reading.
3. The S3 block size is only 32MB, which is too small for a file split, so set 128MB as the default file split size.
4. Add `_end_offset` for prefetch buffer to prevent wasted reading.

Query performance when reading data on object storage improves by more than 3x. The lazy-prefetch idea is sketched below.
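As a rough illustration of points 2 and 4 (hypothetical names, not the actual prefetch buffer class in Doris), a buffer that fills itself only on the first read and never reads past its end offset might look like:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch: a lazily-filled read buffer bounded by an end offset.
class LazyPrefetchBuffer {
public:
    LazyPrefetchBuffer(size_t capacity, size_t end_offset)
            : _buf(capacity), _end_offset(end_offset) {}

    // Copy up to `len` bytes at `offset` into `out`; returns bytes copied.
    size_t read_at(size_t offset, char* out, size_t len) {
        if (offset >= _end_offset) return 0;        // never read past the split
        if (!_filled || offset < _start || offset >= _start + _size) {
            fill(offset);                           // prefetch only when first needed
        }
        size_t pos = offset - _start;
        size_t n = std::min({len, _size - pos, _end_offset - offset});
        std::memcpy(out, _buf.data() + pos, n);
        return n;
    }

private:
    void fill(size_t offset) {
        // The real reader would issue one large read against S3/HDFS here.
        _start = offset;
        _size = std::min(_buf.size(), _end_offset - offset);
        _filled = true;
    }

    std::vector<char> _buf;
    size_t _end_offset;
    size_t _start = 0;
    size_t _size = 0;
    bool _filled = false;
};
```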
2023-04-04 19:05:22 +08:00
5e7ea5e305 [fix](memory) Fix bthread_setspecific log fatal on UBSAN build (#18274) 2023-03-31 19:46:53 +08:00
1027abe0d3 [enhancement](query exec) print the error status when a query meets an error (#18247)
If BE is under heavy load, a query may fail, and BE will try to connect to FE using thrift; if FE is also under heavy load, the thrift connection will fail too. The status is then rewritten at line 342, and the actual failure reason for the query is lost. The error status should be printed on every update.
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-31 14:08:24 +08:00
1b2aaab2f2 [vectorized](bug) fix some cases when fold constant is enabled (#17997)
Fix some cases when fold constant is enabled
2023-03-31 11:41:31 +08:00
4e1e0ce06d [bugfix](topn) fix topn optimization wrong result for NULL values (#18121)
1. add PassNullPredicate to fix topn wrong result for NULL values (see the sketch after this list)
2. refactor RuntimePredicate to avoid using TCondition
3. refactor using ordering_exprs in fe and vsort_node
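The NULL fix in item 1 can be pictured as letting NULL rows pass the topn runtime predicate unconditionally, so their final order is decided by the sort itself rather than by the min/max bound. A sketch of the idea only, not the actual PassNullPredicate class:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch only: NULL rows always pass the topn runtime predicate, so that
// NULLS FIRST / NULLS LAST ordering is not broken by the min/max bound.
template <typename T, typename Pred>
std::vector<uint8_t> evaluate_pass_null(const std::vector<std::optional<T>>& col,
                                        const Pred& inner) {
    std::vector<uint8_t> keep(col.size());
    for (size_t i = 0; i < col.size(); ++i) {
        keep[i] = !col[i].has_value() || inner(*col[i]);
    }
    return keep;
}
```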
2023-03-31 10:01:34 +08:00
e0f6083e73 [refactor](dynamic table) add get_type_as_tprimitive_type and get_type_as_primitive_type in IDataType to get PrimitiveType and TPrimitiveType (#18260) 2023-03-31 09:03:06 +08:00
2ee1468576 [improvement](executor) Support task group schedule in pipeline engine (#17615) 2023-03-30 10:49:50 +08:00
3094815f8f [enhancement](profile) add blocks produced profile to track if output block is very small (#18217) 2023-03-30 09:51:03 +08:00
01d012bab7 [fix](memory) Remove regular page cache clearing, disable jemalloc prof by default (#18218)
Remove regular page cache clearing
The page cache is now disabled by default. If a user enables it manually, we can assume they accept its memory usage; a manual cache-clear command could be added later.

Fix memory GC cancelling the query with the highest memory usage.

jemalloc prof is now disabled by default.
2023-03-30 09:39:37 +08:00
21895abfe7 [bugfix](buffercontrolblock) many queries become very slow in 1.2.3 (#18229)
The predicate in the wait is wrong; it should not check whether the query is cancelled.

            VDataBufferSender  (dst_fragment_instance_id=-39f306bf41e3bafb--5dc95f12d4afdcdb):
                  -  AppendBatchTime:  7s50ms
                      -  ResultRendTime:  7s5ms
                      -  TupleConvertTime:  41.829ms
                  -  NumSentRows:  38.114K  (38114)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-30 08:54:38 +08:00
d9fe5f7b67 [enhancement](memory) Remove MemPool and replace it with Arena (#17820)
Arena can replace MemPool in most scenarios, except for memory reuse: MemPool supports reusing previous memory chunks after clear(), but Arena does not.

Some comparisons between MemPool and Arena:

 1. Expansion
     Arena grows chunks exponentially (doubling) below 128M; for allocations above 128M it allocates 128M * n, where n is the smallest integer such that 128M * n > `size`.
     MemPool grows chunks exponentially (doubling) below 512K; for allocations above 512K it allocates a separate chunk of exactly `size`.

     Once Arena has allocated a chunk larger than 128M, every subsequent chunk is at least 128M, which may waste memory. MemPool behaves similarly: once a 512K chunk has been allocated, every subsequent chunk is at least 512K.

 2. Alignment
     MemPool defaults to 16-byte alignment, because the memtable and other places that use int128 require it;
     Arena has no default alignment;

 3. Memory reuse
     Arena only supports `rollback`, which reuses memory in the current chunk, usually the last allocation.
     MemPool supports clear(), after which all chunks can be reused; ReturnPartialAllocation() rolls back the last allocation; and if the last chunk has no free memory, it searches for the chunk with the most free space.

 4. Realloc
     Arena supports realloc of contiguous memory; it can also extend contiguous memory from any position within the last allocation. The differences between `alloc_continue` and `realloc` are:
         1. alloc_continue does not need the old size; it defaults to head->pos - range_start.
         2. alloc_continue can extend from range_start when additional_bytes falls between head and pos, which reuses part of the memory, whereas realloc always allocates new memory.
     MemPool does not support realloc, but supports transferring or absorbing chunks between two MemPools.

 5. check mem limit
     MemPool checks the mem limit itself; Arena leaves the check to the Allocator layer.

 6. Support for ASAN
     Arena does some extra work for ASAN.

 7. Error handling
     MemPool returns allocation-failure errors directly via `Status`, while Arena throws an Exception.
Improvements that Arena could consider:

 1. Once the last allocated chunk is larger than 128M, every subsequent chunk is at least 128M, which seems to waste memory;

 2. Support clear() for memory reuse;

 3. Add a large-allocation list: allocations larger than 128M get a chunk of exactly `size`, so the current chunk is not left partially used and wasted.

 4. In some cases, it may be possible to search backwards to find chunks t…
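To make the growth policy in item 1 of the comparison concrete, here is a toy arena following the rules described above (doubling below 128M, then multiples of 128M). This is a sketch of the policy only, not Doris's Arena:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Toy arena sketching the growth policy described above; not the real Arena.
class ToyArena {
    static constexpr size_t kLinear = 128ULL * 1024 * 1024; // 128M threshold

public:
    char* alloc(size_t size) {
        if (_chunks.empty() || _pos + size > _chunks.back().size) {
            size_t last = _chunks.empty() ? 0 : _chunks.back().size;
            size_t next = next_chunk_size(last, size);
            _chunks.push_back({std::make_unique<char[]>(next), next});
            _pos = 0;
        }
        char* p = _chunks.back().data.get() + _pos;
        _pos += size;
        return p;
    }

    // Reuse only the tail of the current chunk, like Arena's rollback().
    void rollback(size_t size) { _pos -= size; }

private:
    struct Chunk {
        std::unique_ptr<char[]> data;
        size_t size;
    };

    static size_t next_chunk_size(size_t last, size_t min_size) {
        if (min_size > kLinear) {
            // Linear phase: smallest multiple of 128M covering min_size.
            return ((min_size + kLinear - 1) / kLinear) * kLinear;
        }
        size_t next = std::max<size_t>(last * 2, 4096); // exponential phase
        while (next < min_size) next *= 2;
        return std::min(next, kLinear);
    }

    std::vector<Chunk> _chunks;
    size_t _pos = 0;
};
```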
2023-03-29 20:56:49 +08:00
05db6e9b55 [refactor](file-system)(step-2) remove env, file_utils and filesystem_utils (#18009)
Follow #17586.
This PR mainly changes:

Remove env/
Remove FileUtils/FilesystemUtils
Some methods are moved to LocalFileSystem
Remove olap/file_cache
Add an s3 client cache for the s3 file system (sketched after the test list below)
In my test, the time to open an s3 file is reduced significantly
Fix cold/hot separation bug for s3 fs.
This is the last PR of #17764.
After this, all IO operations should go through io/fs.

Besides the tests in #17586, I also tested some cases related to fs IO:

clone
concurrency query on local/s3/hdfs
load error log create and clean
disk metrics
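The s3 client cache can be pictured as a map from connection properties to a shared client, so repeated opens reuse one client instead of rebuilding it each time. The names below are assumptions, not Doris's actual implementation:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct S3Client {};  // stand-in for the real SDK client

// Sketch: reuse one client per connection key (endpoint/ak/bucket...),
// since constructing a client is expensive relative to opening a file.
class S3ClientCache {
public:
    std::shared_ptr<S3Client> get(const std::string& key) {
        std::lock_guard<std::mutex> l(_mu);
        auto it = _clients.find(key);
        if (it != _clients.end()) return it->second;
        auto client = std::make_shared<S3Client>(); // expensive to build
        _clients.emplace(key, client);
        return client;
    }

private:
    std::mutex _mu;
    std::map<std::string, std::shared_ptr<S3Client>> _clients;
};
```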
2023-03-29 09:00:52 +08:00
6b6682cd96 [Enhancement](Expr) Optimize the IN set with a small fixed-size container to improve performance. (#17976) 2023-03-28 23:10:39 +08:00
09e346e47c [fix](type) Data precision is lost when converting DOUBLE type data to DECIMAL (#17191) (#17562)
1. Fix bug when converting DOUBLE to DECIMAL;
2. Fix bug when converting DOUBLE to DECIMALV3 (a sketch of the precision issue follows);
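A typical instance of this class of precision loss, and the usual remedy, is rounding at the target scale instead of truncating the scaled double. This is an illustrative sketch, not the actual patch in #17562:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Illustrative: convert a double to a scaled decimal integer by rounding.
// Truncation loses precision because many decimals (e.g. 1.15) are not
// exactly representable in binary floating point.
int64_t double_to_decimal(double v, int scale) {
    return std::llround(v * std::pow(10.0, scale));
}

int main() {
    // 1.15 * 100 is ~114.99999999999999 in binary; truncation would give 114.
    std::printf("%lld\n", static_cast<long long>(double_to_decimal(1.15, 2))); // 115
}
```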
2023-03-28 09:46:43 +08:00
359f5be53e [refactor](cgroup) remove cgroup manager, as it is useless (#18124)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-27 23:02:18 +08:00
990479e177 [refactor](memory) Query waits for memory free in Allocator, after memory exceed limit. (#18075)
Previously, after memory exceeded the limit, a query waited for memory to be freed in the mem hook; now it waits in the Allocator instead.

This is more controllable and safer.
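Conceptually the change moves the wait from the malloc hook into the allocation path, roughly as sketched below (the `memory_exceeded` hook and the retry policy are assumptions, not the real Allocator code):

```cpp
#include <chrono>
#include <cstddef>
#include <new>
#include <thread>

// Assumed stand-in for the process-level mem tracker check.
static bool memory_exceeded() { return false; }

// Sketch: block the allocating query thread until memory GC frees space,
// instead of waiting inside the malloc hook.
void* alloc_with_wait(size_t size) {
    using namespace std::chrono_literals;
    int retries = 100;
    while (memory_exceeded() && retries-- > 0) {
        std::this_thread::sleep_for(100ms); // give memory GC time to free space
    }
    if (memory_exceeded()) return nullptr;  // still over limit: caller cancels the query
    return ::operator new(size, std::nothrow);
}
```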
2023-03-27 09:06:03 +08:00
7c0bcbdca1 [enhance](parquet-reader) cache file meta of parquet to speed up query (#18074)
Problem:
1. FE splits a parquet file into several splits, so one file can have multiple splits.
2. BE scans each split and reads the footer of the parquet file.
3. If two splits belong to the same parquet file, its footer is read twice.

This PR mainly changes:
1. Use a KV cache to cache the parquet file footer.
2. The KV cache belongs to a scan node, so all parquet readers under that scan node share the same cache.
3. In the cache, the key is "meta_file_path" and the value is the parsed thrift footer.

The KV cache is sharded into multiple sub-caches,
so different files use different sub-caches and avoid blocking each other (a sketch follows).

In my test, a query with 26 splits reduced the footer parse time from 4s to 1s.
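A sharded KV cache of this shape can be sketched as hashing the file path to pick a shard, each shard holding its own lock, so footer parses of different files do not serialize on one mutex (illustrative names, not the actual Doris cache classes):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Sketch of a sharded cache: the key hashes to a shard, so lookups of
// different files rarely contend on the same lock. Names are illustrative.
template <typename V, size_t kShards = 16>
class ShardedCache {
public:
    std::shared_ptr<V> get_or_create(const std::string& key,
                                     const std::function<std::shared_ptr<V>()>& create) {
        Shard& s = _shards[std::hash<std::string>{}(key) % kShards];
        std::lock_guard<std::mutex> l(s.mu);
        auto it = s.map.find(key);
        if (it != s.map.end()) return it->second; // footer already parsed
        auto v = create();                        // parse the thrift footer once
        s.map.emplace(key, v);
        return v;
    }

private:
    struct Shard {
        std::mutex mu;
        std::map<std::string, std::shared_ptr<V>> map;
    };
    Shard _shards[kShards];
};
```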
2023-03-25 23:22:57 +08:00
855852d582 [enhancement](timeout) fix set timeout failure and simplify timeout logic (#17837) 2023-03-25 21:56:06 +08:00
b244c41371 [Bug](regression-test) Fix grace stop be coredump in pipeline (#18076) 2023-03-24 17:44:06 +08:00
7fd0ec7d17 [Bug](float) fix wrong value when enable fold constant by BE (#17901) 2023-03-22 09:51:03 +08:00
cb79e42e5c [refactor](file-system)(step-1) refactor file system on BE and remove storage_backend (#17586)
See #17764 for details
I have tested:
- Unit test for local/s3/hdfs/broker file system: be/test/io/fs/file_system_test.cpp
- Outfile to local/s3/hdfs/broker.
- Load from local/s3/hdfs/broker.
- Query file on local/s3/hdfs/broker file system, with table value function and catalog.
- Backup/Restore with local/s3/hdfs/broker file system

Not tested:
- cold & hot data separation case.
2023-03-21 21:08:38 +08:00
9021d2ae11 [fix](fragment mgr) fix the bug that g_fragmentmgr_prepare_latency updates incorrectly #17909 2023-03-21 08:50:38 +08:00
bd8e3e6405 [refactor](date) unify DateTimeValue and VecDateTimeValue (#17670) 2023-03-20 16:27:08 +08:00
dd53bc1c8d [unify type system](remove unused type desc) remove some code (#17921)
There are many type definitions in BE. We should unify the type system and simplify development.

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-19 14:05:02 +08:00
e359e412e1 [vectorized](udaf) fix java udaf std::bad_alloc errors (#17848)
Currently, if the user code of a Java UDAF throws an exception, nothing in the C++ aggregate function code can handle it, so it may surface as a std::bad_alloc error.
2023-03-19 11:52:15 +08:00
85080ee3c3 [vectorized](function) support array_map function (#17581) 2023-03-15 10:51:29 +08:00
Pxl
16fc3a0e22 [Chore](compile) remove some unused `static` qualifiers on inline functions to reduce compile time (#17603)
Remove some unused `static` qualifiers on inline functions to reduce compile time
2023-03-13 11:11:59 +08:00
39b5682d59 [Pipeline](shared_scan_opt) Support shared scan opt in pipeline exec engine 2023-03-13 10:33:57 +08:00
edb2d90852 [fix](routine load) fix ROUTINE LOAD bug where the Kafka offset commit is one short (#17282) (#17291)
Co-authored-by: hugoluo <hugoluo@tencent.com>
2023-03-13 10:20:59 +08:00
fcd25b53bf [Optimize](Random distribution) Improve the performance of tablet sin… (#17389)
The current distribution model for Doris is as follows:

OlapTableSink separates the original Block into several sub-blocks, one per node (BE), according to the tablet distribution, and sends the sub-blocks to the storage engines of the backends; the storage engine then separates each sub-block across multiple tablet channels, and each delta writer handles part of the block.

This model causes blocks to be split by tablet, and the splitting can be a relatively heavy operation. After splitting, the blocks are distributed via RPC to TabletChannels and then to different DeltaWriters (memtables); the distribution inside TabletChannels is also relatively heavy. If the table's distribution property is RANDOM, we have the opportunity to distribute each block whole. The advantage is reduced memory copying and better write locality, similar to appending the entire block to the memtable.

This optimization can save 10% ~ 20% of the CPU cost of loading into a RANDOM distribution table when load_to_single_tablet is enabled. A conceptual sketch follows.
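Conceptually, for a RANDOM distribution table the sink can route each block whole (for example round-robin across tablets) instead of splitting it row by row. A toy sketch, not OlapTableSink's real interface:

```cpp
#include <cstddef>

struct Block {};  // stand-in for vectorized::Block

// Conceptual sketch: RANDOM distribution lets us route the whole block to a
// single tablet instead of splitting it row by row across tablets.
class RandomBlockRouter {
public:
    explicit RandomBlockRouter(size_t num_tablets) : _num_tablets(num_tablets) {}

    // Returns the tablet index for this block; all rows go there together,
    // which keeps the copy into the memtable sequential.
    size_t route(const Block&) { return _next++ % _num_tablets; }

private:
    size_t _num_tablets;
    size_t _next = 0;
};
```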
2023-03-10 10:52:40 +08:00
f9baf9c556 [improvement](scan) Support pushdown execute expr ctx (#15917)
In the past, only simple predicates (slot = const; `and`, `like`, `or` only with a bitmap index) could be pushed down to the storage layer. Scan process:

Read part of the columns first, and compute the row ids with the simple pushed-down predicates.
Use the row ids to read the remaining columns and pass them to the scanner, which filters by the remaining predicates.
This PR also pushes the remaining predicates (functions, nested predicates, ...) from the scanner down to the storage layer for filtering. New scan process:

Read part of the columns first, and compute the row ids with the simple pushed-down predicates (same as above).
Use the row ids to read the columns needed by the remaining predicates, and use those pushed-down predicates to shrink the row id set again.
Use the row ids to read the remaining columns and pass them to the scanner.
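The new scan order can be sketched as passes over a shrinking row-id set; the middle pass is the new part, cutting the row ids before the expensive final read. All names here are illustrative, not the actual iterator API:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative sketch of the scan order described above; names hypothetical.
using RowIds = std::vector<uint32_t>;
using Pred = std::function<bool(uint32_t /*row id*/)>;

RowIds filter(const RowIds& in, const Pred& pred) {
    RowIds out;
    for (uint32_t row : in) {
        if (pred(row)) out.push_back(row);  // keep rows the predicate accepts
    }
    return out;
}

RowIds scan(uint32_t num_rows, const Pred& simple_pred, const Pred& pushed_expr) {
    RowIds all(num_rows);
    for (uint32_t i = 0; i < num_rows; ++i) all[i] = i;
    // 1. read predicate columns, evaluate the simple pushed-down predicates
    RowIds ids = filter(all, simple_pred);
    // 2. read only the columns the remaining exprs need, shrink the row ids
    ids = filter(ids, pushed_expr);
    // 3. read the remaining columns for the surviving row ids only
    return ids;
}
```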
2023-03-10 08:35:32 +08:00
Pxl
e2ac06d6d6 [Chore](execution) change PipelineTaskState to enum class && remove some row-based code (#17300)
1. change PipelineTaskState to enum class
2. remove some row-based code on FoldConstantExecutor::_get_result
3. reduce memcpy in the minmax runtime filter function (now we can guarantee that the input data is aligned)
4. add the -Wunused-template check, remove some unused functions, and change some static functions to inline functions.
2023-03-08 12:41:15 +08:00
4692d6764c [refactor](remove string val) remove the StringVal structure, as it is the same as StringRef (#17461)
Remove StringVal, DecimalV2Val, and BigIntVal.
2023-03-08 10:42:20 +08:00
48c2d806d7 [enhancement](jdbc catalog) Use Druid instead of HikariCP in JdbcClient (#17395)
This PR does three things:
1. Use Druid instead of HikariCP in JdbcClient.
2. When downloading a UDF jar, append the jar package name to the local file name.
3. Refactor some JdbcResource code.
2023-03-07 08:51:10 +08:00
9477c48ef8 [refactor](functioncontext) remove duplicate type definition in function context (#17421)
Remove duplicate type definitions in function context.
Remove unused methods in function context.
Do not keep stale state in the VExpr context: VExpr is stateless, the function context saves the state, and contexts are cloned.
Remove the useless slot_size from all tuple and slot descriptors.
Remove the useless doris_udf namespace.
Remove some unused macro definitions.
Initialize v_conjuncts in VScanner instead of writing the same code in every scanner.
Use unique_ptr to manage the function context, since it can only belong to a single expr context.

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-06 16:07:09 +08:00
e7cba11680 [fix](array)(parquet) fix BE core dump when loading from a parquet file containing array types (#17298) 2023-03-06 15:18:42 +08:00
b9b028099d [enhancement](stream load pipe) use query id or load id to identify the stream load pipe instead of the fragment instance id (#17362)
* [enhancement](stream load pipe) use query id or load id to identify the stream load pipe instead of the fragment instance id

NewLoadStreamMgr already holds the pipe and other info, so there is no need to save the pipe into fragment state, and FragmentState becomes clearer.

But this PR changes the behaviour of BE. I will pick the PR to Doris 1.2.3 and add load id support to FE, so users can upgrade from 1.2.3 to 2.x.
Co-authored-by: yiguolei <yiguolei@gmail.com>
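The idea can be pictured as a registry keyed by the load id rather than the fragment instance id (a sketch with assumed names; NewLoadStreamMgr's real interface differs):

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <tuple>

struct StreamLoadPipe {};  // stand-in for the real pipe

struct UniqueId {          // query id / load id
    int64_t hi = 0, lo = 0;
    bool operator<(const UniqueId& o) const {
        return std::tie(hi, lo) < std::tie(o.hi, o.lo);
    }
};

// Sketch: pipes are registered and looked up by load id, not by the
// fragment instance id, so the pipe is independent of fragment state.
class LoadStreamRegistry {
public:
    void put(const UniqueId& load_id, std::shared_ptr<StreamLoadPipe> pipe) {
        std::lock_guard<std::mutex> l(_mu);
        _pipes[load_id] = std::move(pipe);
    }
    std::shared_ptr<StreamLoadPipe> get(const UniqueId& load_id) {
        std::lock_guard<std::mutex> l(_mu);
        auto it = _pipes.find(load_id);
        return it == _pipes.end() ? nullptr : it->second;
    }

private:
    std::mutex _mu;
    std::map<UniqueId, std::shared_ptr<StreamLoadPipe>> _pipes;
};
```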
2023-03-04 16:19:36 +08:00
c96571c236 [Bug](decimalv2) decimal value is filtered by mistake (#17353)
Reason: column_name[k5], decimal value is not valid for definition, value=123.123, precision=9, scale=3, min=-999999.999, max=-999999.999; . src line [];

#17273
2023-03-03 10:40:19 +08:00
f5232e5c01 [vectorized](bug) fix some failing cases when enable_fold_constant_by_be is on (#17240) 2023-03-03 10:30:20 +08:00
17f4990bd3 [enhancement](functioncontext) function context should use shared_ptr; simplify function context (#17311)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-02 16:23:54 +08:00
b7677beab7 [enhancement](memtracker) Add special counter for memtracker and fix thread create/destroy tracking #17301
Add a special counter for the memtracker: faster, but with relaxed ordering, so not accurate in real time.
Track memory for thread creation and destruction, which was previously removed due to performance loss and is now added back.
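The trade-off described above maps naturally onto a relaxed atomic counter: updates skip the cost of stronger memory ordering, and readers may observe a slightly stale total. A sketch of the idea, not the MemTracker code:

```cpp
#include <atomic>
#include <cstdint>

// Sketch: a fast counter using relaxed ordering. Concurrent updates never
// lose increments, but a reader may see a momentarily stale total.
class RelaxedCounter {
public:
    void add(int64_t v) { _v.fetch_add(v, std::memory_order_relaxed); }
    void sub(int64_t v) { _v.fetch_sub(v, std::memory_order_relaxed); }
    int64_t value() const { return _v.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> _v{0};
};
```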
2023-03-02 08:55:00 +08:00
34c5e84e9f [fix](insert) make txn error reason clear (#16997)
Signed-off-by: nextdreamblue <zxw520blue1@163.com>
2023-03-01 20:28:41 +08:00
3871e989ac [fix](memory) Avoid repeating meaningless memory gc #17258 2023-03-01 19:23:33 +08:00
e22a9ecc3b [enhancement](execute model) use a thread pool to execute report or join tasks instead of starting too many threads (#17212)
* [enhancement](execute model) use a thread pool to execute report or join tasks instead of starting too many threads

Doris starts a report thread and a join thread during fragment execution. Creating and destroying threads very frequently causes many problems; jemalloc may not behave well and may even crash.

jemalloc/jemalloc#1405

It is better to use a thread pool for these tasks, roughly as sketched below.
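The pattern, generically: submit report/join work to a fixed pool rather than spawning a short-lived thread per task, so thread creation and destruction stop stressing the allocator. A generic sketch, not Doris's ThreadPool class:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Generic fixed-size pool: tasks are queued instead of each spawning
// (and soon destroying) a fresh thread.
class SimpleThreadPool {
public:
    explicit SimpleThreadPool(size_t n) {
        for (size_t i = 0; i < n; ++i) {
            _workers.emplace_back([this] {
                while (true) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> l(_mu);
                        _cv.wait(l, [this] { return _stop || !_tasks.empty(); });
                        if (_stop && _tasks.empty()) return;
                        task = std::move(_tasks.front());
                        _tasks.pop();
                    }
                    task();
                }
            });
        }
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> l(_mu);
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
    }

    ~SimpleThreadPool() {
        {
            std::lock_guard<std::mutex> l(_mu);
            _stop = true;
        }
        _cv.notify_all();
        for (auto& w : _workers) w.join();
    }

private:
    std::mutex _mu;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _tasks;
    std::vector<std::thread> _workers;
    bool _stop = false;
};
```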

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-01 08:35:27 +08:00
7f6209ede4 [fix](routine load) fix BE core dump while using routine load (#17222) 2023-02-28 21:01:38 +08:00
b51ce415e7 [Feature](load) Add submitter and comments to load job (#16878)
* [Feature](load) Add submitter and comments to load job
2023-02-28 09:06:19 +08:00
84413f33b8 [enhancement](merge-on-write) add skip_delete_bitmap session variable for debug purpose (#17127) 2023-02-27 23:31:28 +08:00