Commit Graph

472 Commits

Author SHA1 Message Date
2b014a0464 [Improve](doris::Status performance) fix a performance issue caused by copying std::string (#17411) 2023-03-04 15:08:59 +08:00
823d968452 [fix](expr) avoid crashes caused by overly deep expression trees (#17314) 2023-03-02 16:55:53 +08:00
30df268c1f [fix](hdfs)(catalog) fix BE crash when hdfs-site.xml does not exist in be/conf and fix compute node logic (#17244)
We set the LIBHDFS3_CONF env in start_be.sh, so libhdfs3 will try to read this hdfs-site.xml;
if the file does not exist, it will throw an error. But Doris did not handle this error, causing the BE to crash.
This CL mainly changes:

Modify start_be.sh to only set LIBHDFS3_CONF if hdfs-site.xml exists.
Refactor the HDFSCommonBuilder so that it can return errors correctly.
Add the BE IP to the status, so that we can get the IP from error messages like:
ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]failed to init reader for file  000.snappy.orc, err: 
[INTERNAL_ERROR][172.21.0.101]failed to init HDFSCommonBuilder, please check check be/conf/hdfs-site.xml
The "prefer compute node" logic was wrong, causing external table queries to be assigned at most 3 backends.
This CL refactors that logic and also changes some FE configs (see the sketch after the config descriptions):

prefer_compute_node_for_external_table

If set to true, queries on external tables will prefer to be assigned to compute nodes,
and the max number of compute nodes used is controlled by min_backend_num_for_external_table.
If set to false, queries on external tables can be assigned to any node.

min_backend_num_for_external_table

Only takes effect when prefer_compute_node_for_external_table is true.
If the number of compute nodes is less than this value, queries on external tables will try to get some mix nodes
assigned as well, so that the total number of nodes reaches this value.
If the number of compute nodes is larger than this value, queries on external tables will be assigned to compute nodes only.
2023-03-02 11:09:55 +08:00
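A minimal sketch of the corrected assignment policy described above (hypothetical names; the real logic lives in the FE planner):

```cpp
#include <vector>

struct Backend {
    bool is_compute_node;
};

// Prefer compute nodes; only backfill with mix nodes when there are fewer
// compute nodes than min_backend_num_for_external_table.
std::vector<Backend> pick_backends(const std::vector<Backend>& all,
                                   bool prefer_compute_node,
                                   size_t min_backend_num) {
    if (!prefer_compute_node) return all;  // any node may serve the query
    std::vector<Backend> picked;
    for (const auto& be : all)
        if (be.is_compute_node) picked.push_back(be);
    // Backfill with mix nodes until the minimum node count is reached.
    for (const auto& be : all) {
        if (picked.size() >= min_backend_num) break;
        if (!be.is_compute_node) picked.push_back(be);
    }
    return picked;
}
```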
b7677beab7 [enhancement](memtracker) Add special counter for memtracker and fix thread create and destroy track #17301
Add a special counter for memtracker: faster, but with relaxed ordering and therefore not accurate in real time (see the sketch below)
Track thread create and destroy memory again; this tracking was previously removed due to performance loss and is now added back
2023-03-02 08:55:00 +08:00
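A minimal sketch of such a relaxed counter (hypothetical names, not Doris's actual MemTracker API):

```cpp
#include <atomic>
#include <cstdint>

// A consumption counter updated with relaxed memory ordering. Cheap on the
// hot allocation path, but readers may observe a slightly stale value,
// matching the "not accurate in real time" trade-off above.
class MemCounter {
public:
    void add(int64_t bytes) { _value.fetch_add(bytes, std::memory_order_relaxed); }
    void sub(int64_t bytes) { _value.fetch_sub(bytes, std::memory_order_relaxed); }
    int64_t value() const { return _value.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> _value{0};
};
```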
a1e3b908d7 [fix](memory) split mem usage thread and gc thread to different threads (#17213)
Ensure that the memory status is refreshed in time
Avoid frequent GC
2023-03-01 19:19:05 +08:00
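A minimal sketch of the thread split described above, with the actual work left as assumed helpers:

```cpp
#include <chrono>
#include <thread>

// Refresh memory status and run GC on independent threads, so a slow GC
// pass cannot delay the status refresh, and a busy refresh loop does not
// trigger GC more often than intended.
void start_memory_daemons() {
    std::thread refresher([] {
        while (true) {
            // refresh_process_memory_usage();  // assumed helper
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    });
    std::thread gc([] {
        while (true) {
            // maybe_trigger_gc();  // assumed helper, runs far less often
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });
    refresher.detach();
    gc.detach();
}
```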
6eeba204f9 [Enhancement] path scan causes disk io to skyrocket (#16968) 2023-02-25 09:15:15 +08:00
e5f884a6fc [enhancement](cache) make segment cache prune more effectively (#17011)
BloomFilters in a MoW (merge-on-write) table may consume lots of memory, and their life cycle is the same as the segment's. This patch tries to improve the efficiency of recycling the segment cache, to release the memory in time.
2023-02-23 18:24:18 +08:00
8eeb435963 [improvement](meta) Enhance Doris's fault tolerance to disk error (#16472)
Sense IO errors.
Retry the query when an IO error occurs.
Greylist: when one disk is found to be completely broken, or the diff between the tablet numbers in BE and FE meta is too large, reduce the query priority of the BE.
2023-02-23 08:40:45 +08:00
a1c0054b4c [fix](memory) fix memory GC details and join probe catch bad_alloc (#16989)
Fix: on Red Hat 4.x, /proc/meminfo has no MemAvailable; disable using MemAvailable to control memory there.
Record vm_rss_str and mem_available_str when GC is triggered, to avoid memory changes during GC causing inaccurate logs.
Catch bad_alloc in the join probe, which may allocate 64G of memory at a time, to avoid OOM (see the sketch below).
Fix the names doris_be_all_segments_num and doris_be_all_rowsets_num in the documentation.
2023-02-23 08:33:30 +08:00
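A minimal sketch of the join-probe bad_alloc handling described above (hypothetical function; the real code works with Doris's Status and allocators):

```cpp
#include <cstddef>
#include <new>

// A join probe may try to allocate a very large buffer in one shot;
// catching std::bad_alloc turns a process OOM into a cancellable error.
bool try_expand_probe_buffer(char*& buf, size_t bytes) {
    try {
        buf = new char[bytes];  // may request tens of GB at once
        return true;
    } catch (const std::bad_alloc&) {
        buf = nullptr;          // caller cancels the query instead of OOM
        return false;
    }
}
```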
7b0fc17c04 [enhancement](inverted index) Support fulltext index evaluate equal query and list query (#16994)
A fulltext index is an inverted index with a specified tokenizer. Before this PR, a fulltext index could only evaluate match predicates; this PR adds support for evaluating equal predicates and list (IN) predicates.
2023-02-22 20:18:10 +08:00
0e3be4eff5 [Improvement](brpc) Using a thread pool for RPC service avoiding std::mutex block brpc::bthread (#16639)
This mainly includes:
- The brpc service adds two types of thread pools, "light" and "heavy", with different thread counts.
BE interfaces are classified accordingly: those related to data transmission are heavy and the others are light (see the sketch below).
- Add some monitoring to the thread pools, including the queue size and the number of active threads, and use these
indicators to guide the configuration of the number of threads.
2023-02-22 14:15:47 +08:00
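A minimal sketch of the light/heavy dispatch (a toy pool for illustration only; the real code uses Doris/brpc thread pool utilities and config-driven sizes):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy fixed-size pool, just enough to show the dispatch.
class Pool {
public:
    explicit Pool(size_t n) {
        for (size_t i = 0; i < n; ++i) _workers.emplace_back([this] { run(); });
    }
    ~Pool() {
        { std::lock_guard<std::mutex> l(_mu); _stop = true; }
        _cv.notify_all();
        for (auto& t : _workers) t.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> l(_mu); _tasks.push(std::move(task)); }
        _cv.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> l(_mu);
                _cv.wait(l, [this] { return _stop || !_tasks.empty(); });
                if (_stop && _tasks.empty()) return;
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();
        }
    }
    std::vector<std::thread> _workers;
    std::queue<std::function<void()>> _tasks;
    std::mutex _mu;
    std::condition_variable _cv;
    bool _stop = false;
};

// Two pools with different sizes (hypothetical numbers): heavy for data
// transmission RPCs, light for everything else.
Pool heavy_pool(32);
Pool light_pool(8);

void on_rpc(bool is_data_transmission, std::function<void()> handler) {
    (is_data_transmission ? heavy_pool : light_pool).submit(std::move(handler));
}
```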
c0bb2e33a8 [improvement](scan) separate scanner into local and remote scanner pool (#16891)
There are two kinds of scanner thread pools: local and remote.
Local is for local file reads, especially the olap scanner.
Remote is for other external data sources, such as the file scanner and the jdbc scanner.

This PR mainly changes:

For the olap scanner, use cold or hot rowsets to decide whether to use the local or the remote pool.
For other scanners, use the remote pool by default.
Add a new BE config doris_max_remote_scanner_thread_pool_thread_num (default 512),
which indicates the max thread number of the remote scanner thread pool.

This will alleviate the problem of olap queries interfering with load jobs and external queries (see the sketch below).
2023-02-21 14:13:09 +08:00
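A minimal sketch of the pool choice described above (hypothetical names; the remote pool size is capped by doris_max_remote_scanner_thread_pool_thread_num):

```cpp
enum class ScanPool { LOCAL, REMOTE };

struct ScannerCtx {
    bool is_olap_scanner;
    bool reads_cold_rowset;  // cold rowsets may live on remote storage
};

// Olap scanners go to the local pool unless they read cold rowsets;
// all other scanners (file, jdbc, ...) use the remote pool.
ScanPool choose_pool(const ScannerCtx& ctx) {
    if (!ctx.is_olap_scanner) return ScanPool::REMOTE;
    return ctx.reads_cold_rowset ? ScanPool::REMOTE : ScanPool::LOCAL;
}
```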
113023fb86 (Enhancement)[load-json] support simdjson in new json reader (#16903)
be config:
enable_simdjson_reader=true

related PR #11665
2023-02-21 11:31:00 +08:00
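For context, a minimal sketch of reading a document with simdjson's on-demand API, which the new reader builds on (field names are made up):

```cpp
#include <iostream>
#include <string_view>
#include "simdjson.h"

int main() {
    using namespace simdjson;
    ondemand::parser parser;
    // Hypothetical row; the reader parses one JSON document per row.
    padded_string json = R"({"id": 1, "name": "doris"})"_padded;
    ondemand::document doc = parser.iterate(json);
    int64_t id = doc["id"];               // throws simdjson_error on mismatch
    std::string_view name = doc["name"];
    std::cout << id << " " << name << "\n";
}
```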
30dafd6a44 [improve](inverted index) Add element count limit for inverted index searcher cache (#16758)
The elements in InvertedIndexSearcherCache are inverted index searchers, each of which holds a file descriptor of an inverted index file, so InvertedIndexSearcherCache effectively caches file descriptors of inverted index files.

If the open file descriptor limit of the Linux system is set too small and the config inverted_index_searcher_cache_limit is too big, a high-pressure load may cause "Too many open files".

So, when inserting an inverted index searcher into InvertedIndexSearcherCache, we also need to check whether the file descriptor limit for inverted index files has been reached (see the sketch below).
2023-02-17 11:53:07 +08:00
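A minimal sketch of an element-count-bounded LRU cache in the spirit described above (hypothetical types; the real cache stores searcher objects rather than raw fds):

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

class SearcherCache {
public:
    explicit SearcherCache(size_t max_elements) : _max(max_elements) {}

    void insert(const std::string& index_file, int searcher_fd) {
        if (_map.count(index_file) > 0) return;  // already cached (sketch skips refresh)
        // Evict least-recently-used entries so the fd budget is not exceeded.
        while (!_lru.empty() && _lru.size() >= _max) {
            _map.erase(_lru.back());
            _lru.pop_back();  // the real cache would also close the evicted fd
        }
        _lru.push_front(index_file);
        _map[index_file] = Entry{searcher_fd, _lru.begin()};
    }

    bool lookup(const std::string& index_file, int* fd) {
        auto it = _map.find(index_file);
        if (it == _map.end()) return false;
        _lru.splice(_lru.begin(), _lru, it->second.pos);  // mark as recently used
        *fd = it->second.fd;
        return true;
    }

private:
    struct Entry {
        int fd;
        std::list<std::string>::iterator pos;
    };
    size_t _max;
    std::list<std::string> _lru;
    std::unordered_map<std::string, Entry> _map;
};
```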
2426d8e6e8 [chore](be-config) set disable_storage_row_cache to true by default, disabling the row cache by default (#16827) 2023-02-17 10:21:28 +08:00
d013d529c8 [Feature](ipv6)Support IPV6 (#14063)
Support IPv6 in Apache Doris. The main changes are:
1. enable binding to an IPv6 address if the network priority in the config file contains an IPv6 CIDR string (see the sketch below)
2. BRPC and HTTP support binding to IPv6 addresses
3. BRPC and HTTP support visiting IPv6 services
2023-02-14 21:43:10 +08:00
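As an illustration of the low-level step behind change 1, a minimal sketch of binding a listening socket to an IPv6 address (generic POSIX code, not Doris's actual implementation):

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int listen_v6(uint16_t port) {
    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in6 addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin6_family = AF_INET6;
    addr.sin6_addr = in6addr_any;  // or a specific address from the CIDR match
    addr.sin6_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 128) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```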
f1b9185830 [feature](cooldown) Implement cold data compaction (#16681) 2023-02-14 15:21:54 +08:00
5014ad03e7 [feature](cooldown) Auto delete unused remote files (#16588) 2023-02-13 23:59:39 +08:00
f41a2055d3 [feature](Load)Remove user/password in properties for mysql load to avoid double auth. (#16073)
Use the FE cluster token to auth stream load.
This auth is only open for BE; FE auth still only supports HTTP basic auth.

I will use this auth for mysql load to build a no-auth stream load from FE to BE,
which avoids double auth in mysql load.
See the design doc for more information.
2023-02-13 10:00:08 +08:00
1de4e312cc [fix](metric) Fix be core when set enable_system_metrics to false in be (#16646)
when enable_system_metrics is false, we should not use system_metrics any more

Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2023-02-12 23:01:41 +08:00
aba843bb2b [Improvement](inverted index) inverted index query match bitmap cache (#16578)
Add a cache for inverted index query match bitmaps to accelerate common query keywords, especially keywords matching many rows (see the sketch below).

Tests result:
- large result: matching 99% out of 247 million rows shows 8x speed up.
- small result: matching 0.1% out of 247 million rows shows 2x speed up.
2023-02-11 13:38:58 +08:00
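A minimal sketch of such a match-bitmap cache (hypothetical types; Doris uses a roaring bitmap and an LRU cache rather than this plain map):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using Bitmap = std::vector<uint32_t>;  // stand-in for a roaring bitmap of row ids

// Cache the row-id bitmap produced by an inverted index lookup, keyed by
// (index file, query term), so repeated keyword queries skip the index
// evaluation entirely.
class MatchBitmapCache {
public:
    std::shared_ptr<Bitmap> get(const std::string& index_file,
                                const std::string& term) {
        auto it = _cache.find(index_file + '\0' + term);
        return it == _cache.end() ? nullptr : it->second;
    }
    void put(const std::string& index_file, const std::string& term,
             std::shared_ptr<Bitmap> rows) {
        _cache[index_file + '\0' + term] = std::move(rows);
    }

private:
    std::unordered_map<std::string, std::shared_ptr<Bitmap>> _cache;
};
```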
37d1519316 [WIP](dynamic-table) support dynamic schema table (#16335)
Issue Number: close #16351

A dynamic schema table is a special type of table whose schema changes with the loading procedure. We implemented this feature mainly for semi-structured data such as JSON: since JSON is schema-self-describing, we can extract schema info from the original documents and infer the final type information. This special table type reduces manual schema change operations, makes it easy to import semi-structured data, and extends the schema automatically.
2023-02-11 13:37:50 +08:00
43eca4f209 [Feature-WIP](inverted index) Implementation for alter inverted index. (#16371)
implementation for add/drop inverted index.
2023-02-10 17:56:17 +08:00
32188855ef [improve](topn) separate multiget rpc to ThreadPool (#16598)
multiget_data runs in a bthread and may block the whole worker pthread of the BRPC framework, affecting other bthreads, so I moved the work into a separate task pool (see the sketch below).
2023-02-10 17:39:31 +08:00
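A minimal sketch of the offloading pattern, assuming a pool type with a submit(std::function&lt;void()&gt;) method (hypothetical; the real code uses Doris's ThreadPool and brpc done callbacks):

```cpp
#include <functional>
#include <future>
#include <memory>

// Hand the heavy multiget work to a dedicated pool so the brpc worker
// pthread running the bthread is not blocked by it.
template <typename Pool>
std::future<void> offload_multiget(Pool& pool, std::function<void()> work) {
    auto task = std::make_shared<std::packaged_task<void()>>(std::move(work));
    std::future<void> fut = task->get_future();
    pool.submit([task] { (*task)(); });
    return fut;
}
```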
a038fdaec6 [Bug](pipeline) Fix bug in non-local exchange on pipeline engine (#16463)
Currently, for broadcast shuffle, we serialize a block once and then send it by RPC through multiple channels. After this, we serialize the next block into the same memory for the sake of memory reuse. However, since the RPC is asynchronous, the next block's serialization may happen before the previous block has been sent.

So, in this PR, I use a ref count to identify whether the serialized block can be reused in broadcast shuffle (see the sketch below).
2023-02-09 19:22:40 +08:00
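A minimal sketch of the ref-count idea (hypothetical names; the real code ties the release to the RPC completion callback):

```cpp
#include <atomic>
#include <string>

// A serialized block sent on N channels. The buffer may only be reused
// for the next block once every in-flight RPC has released its reference.
struct BroadcastBlock {
    std::string serialized;         // reused serialization buffer
    std::atomic<int> in_flight{0};  // one reference per unfinished RPC

    bool reusable() const { return in_flight.load() == 0; }
};

void send_broadcast(BroadcastBlock& blk, int num_channels) {
    blk.in_flight.store(num_channels);
    for (int i = 0; i < num_channels; ++i) {
        // async_rpc_send(blk.serialized, /*on_done=*/[&blk] {
        //     blk.in_flight.fetch_sub(1);  // release when the RPC finishes
        // });
    }
}
// The producer serializes the next block into blk.serialized only when
// blk.reusable() is true; otherwise it waits or picks another buffer.
```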
539fd684e9 [improvement](filecache) use dynamic segment size to cache remote file block (#16485)
`CachedRemoteFileReader` used a fixed segment size (file_cache_max_file_segment_size = 4M) to cache remote file blocks. However, the column size in a rowgroup/stripe may be smaller than 10K if a parquet/orc file has many columns, resulting in particularly serious read amplification. For example:
Q1 in clickbench: select count(*) from hits
```
-  FileCache:  0ns
  -  IOHitCacheNum:  552
  -  IOTotalNum:  835
  -  ReadFromFileCacheBytes:  19.98  MB
  -  ReadFromWriteCacheBytes:  0.00  
  -  ReadTotalBytes:  29.52  MB
  -  SkipCacheBytes:  0.00  
  -  WriteInFileCacheBytes:  915.77  MB
  -  WriteInFileCacheNum:  283 
```
Only 30MB of data is needed, but 900MB+ of data is read from hdfs. The query time of Q1 (single scan thread) increased from **5.17s** to **24.45s** with the file cache enabled.

Therefore, this PR introduces a dynamic segment size based on the `read_size` of the data. In order to prevent too-small or too-large IO, the segment size is limited to [4096, file_cache_max_file_segment_size] (see the sketch below).

Q1 in clickbench takes **5.66s** with the file cache enabled. The performance is almost the same as with the cache disabled, and the data size read from hdfs is reduced to 45MB.
```
-  FileCache:  0ns
    -  IOHitCacheNum:  297
    -  IOTotalNum:  835
    -  ReadFromFileCacheBytes:  8.73  MB
    -  ReadFromWriteCacheBytes:  0.00  
    -  ReadTotalBytes:  29.52  MB
    -  SkipCacheBytes:  0.00  
    -  WriteInFileCacheBytes:  45.66  MB
    -  WriteInFileCacheNum:  544
```
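A minimal sketch of the sizing rule described above (the config name follows the PR; the function name is made up):

```cpp
#include <algorithm>
#include <cstddef>

// Derive the cached segment size from the actual read size, clamped to
// [4096, file_cache_max_file_segment_size].
size_t pick_segment_size(size_t read_size, size_t max_segment_size) {
    constexpr size_t kMinSegmentSize = 4096;
    return std::clamp(read_size, kMinSegmentSize, max_segment_size);
}
```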
## Remaining Problems
Small queries may result in a large number of small files (4KB at least), and the `BE` then saves too much meta information for cached segments.

## Fix bug
`FileCachePolicy` in `FileReaderOptions` is a constant reference, but the argument passed in `FileFactory::create_file_reader` is a temporary variable, resulting in a segmentation fault.
2023-02-09 16:39:10 +08:00
0142ef8b95 [improvement](scanner) Supports bthread scanner (#16031) 2023-02-09 10:24:56 +08:00
cf18de14b5 [fix](writer) add _is_closed state to DeltaWriter and avoid write/close core after close (#16453) 2023-02-07 22:40:26 +08:00
f2fd47f238 [Improve](row-store) support row cache (#16263) 2023-02-06 11:16:39 +08:00
63d57b83f3 [fix](memory) Fix request jemalloc metrics wait lock je_malloc_mutex_lock_slow #16381
MetricRegistry::trigger_all_hooks holds the metrics lock and can get stuck in get_je_metrics, while to_prometheus waits for MetricRegistry::trigger_all_hooks to release the lock; so get_je_metrics is no longer called in MetricRegistry::trigger_all_hooks.
2023-02-04 22:49:22 +08:00
Pxl
5e4bb98900 [Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290)
enable -Wpedantic and update lowest gcc version to 11.1
2023-02-03 11:28:48 +08:00
1d8265c5a3 [refactor](row-store) make row store column a hidden column in meta (#16251)
This simplifies the storage engine logic, makes the code more readable, and lets us
analyze the hidden `__DORIS_ROW_STORE_COL__` column (its length, etc.).
2023-02-02 20:56:13 +08:00
6470ae58ea [enhancement](config) remove config load_process_max_memory_limit_bytes (#15686) 2023-01-31 21:36:34 +08:00
17885acd09 [improvement](metrics) Metrics add all rowset nums and segment nums (#16208) 2023-01-30 09:55:32 +08:00
5eaa995704 [refactor](some mempool) do not memset 0 in default value iterator (#16194)
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-29 22:50:39 +08:00
e9afd3210c [improvement](memory) Optimize the log of process memory insufficient and support regular GC cache (#16084)
1. When the process memory is insufficient, print the process memory statistics in a more timely and detailed manner.
2. Support regular GC cache, currently only page cache and chunk allocator are included, because many people reported that the memory does not drop after the query ends.
3. Reduce the system available memory warning water mark to reduce memory waste.
4. Optimize soft mem limit logging.
2023-01-29 10:02:04 +08:00
e49766483e [refactor](remove unused code) remove many xxxVal structure (#16143)
remove many xxxVal structure
remove BetaRowsetWriter::_add_row
remove anyval_util.cpp
remove non-vectorized geo functions
remove non-vectorized like predicate
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-28 14:17:43 +08:00
0148b39de0 [fix](metric) fix be down when enable_system_metrics is false (#16140)
If we set enable_system_metrics to false, the BE goes down with the following message: "enable metric calculator failed,
maybe you set enable_system_metrics to false", so fix it.
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2023-01-28 00:10:39 +08:00
adb758dcac [refactor](remove non vec code) remove json functions, string functions, match functions and some code (#16141)
remove json functions code
remove string functions code
remove math functions code
move MatchPredicate to olap since it is only used in storage predicate process
remove some code in tuple; the Tuple structure should be removed in the future.
remove a lot of code in the collection value structure; it is unused
2023-01-26 16:21:12 +08:00
615a5e7b51 [refactor](remove non vec code) remove non vec functions and AggregateInfo (#16138)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-25 12:53:05 +08:00
6e8eedc521 [refactor](remove unused code) remove storage buffer and orc reader (#16137)
remove olap storage byte buffer
remove orc reader
remove time operator
remove read_write_util
remove aggregate funcs
remove compress.h and cpp
remove bhp_lib

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-24 22:29:32 +08:00
79ad74637d [refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136) 2023-01-24 10:45:35 +08:00
116e17428b [Enhancement](point query optimize) improve performance of point query on primary keys (#15491)
1. support row format using codec of jsonb
2. short path optimize for point query
3. support prepared statement for point query
4. support mysql binary format
2023-01-20 13:33:01 +08:00
6485221ffb [Feature-WIP](inverted index)(bkd) Support try query before query bkd to improve query efficiency (#16075) 2023-01-20 11:19:36 +08:00
05f0f63718 [fix](daemon) should use GetMonoTimeMicros() (#16070) 2023-01-19 10:44:06 +08:00
3894de49d2 [Enhancement](topn) support two phase read for topn query (#15642)
This PR optimizes topn queries like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`.

TopN is composed of a SortNode and a ScanNode. When the user table is wide (100+ columns), the ORDER BY clause usually involves only a few columns, but the ScanNode still needs to scan all data from the storage engine even if the limit is very small, which may lead to lots of read amplification. So in this PR I divide the TopN query into two phases:
1. In the first phase we only read `columnA`'s data from the storage engine, along with an extra RowId column called `__DORIS_ROWID_COL__`. The other columns are pruned from the ScanNode.
2. The second phase happens in the ExchangeNode, because it is the central node for the topn nodes in the cluster. The ExchangeNode spawns RPCs to other nodes using the RowIds (sorted and limited by the SortNode) read in the first phase, and reads the rows from the storage engine row by row.

After the second-phase read, the Block contains all the data needed for the query (see the sketch below).
2023-01-19 10:01:33 +08:00
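A minimal sketch of the two-phase idea on plain vectors (hypothetical types; the real implementation works on Blocks and RPCs):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Phase 1 sorts only the key column plus row ids; phase 2 then fetches the
// full rows (all columns) for the N winners by row id.
struct KeyedRow {
    int64_t key;      // the ORDER BY column value
    uint64_t row_id;  // __DORIS_ROWID_COL__
};

std::vector<uint64_t> phase1_topn(std::vector<KeyedRow> keys, size_t n) {
    size_t limit = std::min(n, keys.size());
    std::partial_sort(keys.begin(), keys.begin() + limit, keys.end(),
                      [](const KeyedRow& a, const KeyedRow& b) { return a.key < b.key; });
    std::vector<uint64_t> winners;
    for (size_t i = 0; i < limit; ++i) winners.push_back(keys[i].row_id);
    return winners;  // phase 2 fetches these rows from the storage engine
}
```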
e579530c99 [Feature-WIP](inverted index) support use inverted index searcher cache (#16003)
use inverted index searcher cache to improve query performance

dependency pr: #14211 #15807 #15823
2023-01-18 09:30:55 +08:00
b1caa68706 [Feature-WIP](inverted index) inverted index reader's implementation, and add mysql_fulltext regression case to test fulltext query (#15823)
Issue Number: Step2 of DSIP-023: Add inverted index for full text search
implementation of inverted index reader

dependency pr: #14211 #15807 #15821
2023-01-17 09:13:56 +08:00
0206e0bc57 [Feature](inverted index) implementation of inverted index writer for numeric types, using bkd index (#15918)
Step3 of DSIP-023: Add inverted index for full text search
implementation of inverted index writer for numeric types, using bkd index
dependency pr: #14207 #15807 #15821
2023-01-14 21:06:51 +08:00
98c74f9ab8 [improvement](signal) add tid during core dump; the tid is equal to the tid in be.INFO (#15893)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-14 18:40:02 +08:00