doris

Author	SHA1	Message	Date
Gabriel	95c91fab2e	[refactor](vec) delete non-vec runtime filter (#16016 ) * [refactor](vec) delete non-vec runtime filter * update	2023-01-18 17:49:20 +08:00
yiguolei	42b5d17fa1	[refactor](remove non vec) remove column block and column view (#16022 ) * [refactor](remove non vec) remove column block and column view and column vectorized batch Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-18 12:40:53 +08:00
YueW	e579530c99	[Feature-WIP](inverted index) support use inverted index searcher cache (#16003 ) use inverted index searcher cache to improve query performance dependency pr: #14211 #15807 #15823	2023-01-18 09:30:55 +08:00
HappenLee	d5a3e8df3a	[Exec](opt) Opt the vexplode_split function performance (#15945 )	2023-01-17 19:02:57 +08:00
Gabriel	d062ca2944	[refactor](vectorized) remove unnecessary vectorization check (#15984 )	2023-01-17 12:21:46 +08:00
Xinyi Zou	97fcad76f8	[enhancement](memtracker) Improve readability (#15716 )	2023-01-16 16:30:35 +08:00
Pxl	b727033906	[Chore](build) enable -Wextra and remove some -Wno (#15760 ) enable -Wextra and remove some -Wno	2023-01-15 10:40:35 +08:00
pengxiangyu	58c520dbfd	[Feature](remote) Cooldown cold data to object storage only one replica (#15832 )	2023-01-14 23:58:00 +08:00
Tiewei Fang	1489e3cfbf	[Fix](file system) Make the constructor of `XxxFileSystem` a private method (#15889 ) Since Filesystem inherited std::enable_shared_from_this , it is dangerous to create native point of FileSystem. To avoid this behavior, making the constructor of XxxFileSystem a private method and using the static method create(...) to get a new FileSystem object.	2023-01-13 15:32:16 +08:00
luozenglin	b1fb1277dd	[fix](bitmap) fix bitmap iterator comparison error (#15779 ) Fix the bug that bitmap.begin() == bitmap.end() is always true when the bitmap contains a single value.	2023-01-13 11:37:07 +08:00
yiguolei	16862d9b43	[refactor](remove unused code) remove buffer pool and disk io mgr (#15853 ) * [refactor](remove buffer pool and disk io mgr) remove unused code Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-13 09:42:58 +08:00
Gabriel	0fbdf8e3e1	[Refactor](table function) Decouple vectorized table functions from non-vectorized ones (#15772 )	2023-01-12 15:08:21 +08:00
Mingyu Chen	3fec5ff0f5	[refactor](scan-pool) move scan pool from env to scanner scheduler (#15604 ) The origin scan pools are in exec_env. But after enable new_load_scan_node by default, the scan pool in exec_env is no longer used. All scan task will be submitted to the scan pool in scanner_scheduler. BTW, reorganize the scan pool into 3 kinds: local scan pool For olap scan node remote scan pool For file scan node limited scan pool For query which set cpu resource limit or with small limit clause TODO: Use bthread to unify all IO task. Some trivial issues: fix bug that the memtable flush size printed in log is not right Add RuntimeProfile param in VScanner	2023-01-11 09:38:42 +08:00
yiguolei	d857b4af1b	[refactor](remove row batch) remove impala rowbatch structure (#15767 ) * [refactor](remove row batch) remove impala rowbatch structure Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-11 09:37:35 +08:00
TengJianPing	8f31a36429	[feature] support spill to disk for sort node (#15624 )	2023-01-11 08:40:58 +08:00
Tiewei Fang	f17d69e450	[feature](file cache)Import `file cache` for remote file reader (#15622 ) The main purpose of this pr is to import `fileCache` for lakehouse reading remote files. Use the local disk as the cache for reading remote file, so the next time this file is read, the data can be obtained directly from the local disk. In addition, this pr includes a few other minor changes Import File Cache: 1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy. 2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`. Other changes: 1. Add a new interface `fs()` for `FileReader`. 2. `IOContext` adds some statistical information to count the situation of `FileCache` Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>	2023-01-10 12:23:56 +08:00
plat1ko	ab186a60ce	[enhancement](compaction) Optimize judging delete rowset and picking candidate rowsets for compaction #15631 Tablet::version_for_delete_predicate should travel all rowset metas in tablet meta which complex is O(N), however we can directly judge whether this rowset is a delete rowset by RowsetMeta::has_delete_predicate which complex is O(1). As we won't call Tablet::version_for_delete_predicate when pick input rowsets for compaction, we can reduce the critical area of Tablet::_meta_lock.	2023-01-10 08:32:15 +08:00
ElvinWei	76ad599fd7	[enhancement](histogram) optimise aggregate function histogram (#15317 ) This pr mainly to optimize the histogram(👉🏻 https://github.com/apache/doris/pull/14910) aggregation function. Including the following: 1. Support input parameters `sample_rate` and `max_bucket_num` 2. Add UT and regression test 3. Add documentation 4. Optimize function implementation logic Parameter description： - `sample_rate`：Optional. The proportion of sample data used to generate the histogram. The default is 0.2. - `max_bucket_num`：Optional. Limit the number of histogram buckets. The default value is 128. --- Example： ``` MySQL [test]> SELECT histogram(c_float) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_float`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_string`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ ``` Query result description： ``` { "sample_rate": 0.2, "max_bucket_num": 128, "bucket_num": 3, "buckets": [ { "lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2 }, { "lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2 }, { "lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1 } ] } ``` Field description： - sample_rate：Rate of sampling - max_bucket_num：Limit the maximum number of buckets - bucket_num：The actual number of buckets - buckets：All buckets - lower：Upper bound of the bucket - upper：Lower bound of the bucket - count：The number of elements contained in the bucket - pre_sum：The total number of elements in the front bucket - ndv：The number of different values in the bucket > Total number of histogram elements = number of elements in the last bucket(count) + total number of elements in the previous bucket(pre_sum).	2023-01-07 00:50:32 +08:00
HappenLee	f24659c003	[Refactor](pipeline) refactor the code of channel buffer limit and change the default value (#15650 )	2023-01-06 14:52:43 +08:00
Adonis Ling	95f2f43c02	[fix](macOS) Failed to run BE UT due to syscall to map cache into shared region failed (#15641 ) According to the post https://developer.apple.com/forums/thread/676684, the executable whose size is bigger than 2G may fail to start. The size of the executable `doris_be_test` generated by run-be-ut.sh is 2.1G (> 2G) now and we can't run it on macOS (arm64). We can separate the debug info from the executable `doris_be_test` to reduce the size. After that, we can run `doris_be_test` successfully.	2023-01-06 01:23:37 +08:00
TengJianPing	77fda4f749	[SpillToDisk](block reader and writer)Support spill to disk: implement interfaces for spill block and read block (#15399 )	2023-01-03 12:42:45 +08:00
yiguolei	14eaf41029	[refactor](remove rowblockv2) remove rowblock v2 structure (#15540 ) * [refactor](remove rowblockv2) remove rowblock v2 structure * fix bugs Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-03 09:21:57 +08:00
xueweizhang	40c53931e5	[fix](vec) VMergeIterator add key same label for agg table (#14722 )	2023-01-02 22:54:21 +08:00
yixiutt	365c3eec16	[enhancement](compaction) vertical compaction support unique-key mow (#15353 )	2023-01-02 22:53:04 +08:00
Xin Liao	cc7a9d92ad	[refactor](non-vec) remove non vec code for indexed column reader (#15409 )	2022-12-30 23:01:54 +08:00
plat1ko	ad68764977	[enhancement](tablet) Unify redundant `create_rowset_writer` methods (#15519 ) * Remove redundant create_rowset_writer methods * Set resource id when setting FS in rowset meta * fix * fix ut	2022-12-30 22:57:12 +08:00
yiguolei	b23d068281	[refactor](remove-non-vec) Remove non vec load from memtable and delta writer (#15517 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-30 21:22:58 +08:00
Xin Liao	2f572ccc43	[fix](index) fix that the last element of each batch will be read repeatedly for binary prefix page (#15481 )	2022-12-30 15:36:55 +08:00
Gabriel	edb9a3b58d	[Bug](timediff) Fix wrong result for function `timediff` (#15312 )	2022-12-30 00:28:51 +08:00
Mingyu Chen	29492f0d6c	[refactor](file-cache) refactor the file cache interface (#15398 ) Refactor the usage of file cache ### Motivation There may be many kinds of file cache for different scenarios. So the logic of the file cache should be hidden inside the file reader, so that for the upper-layer caller, the change of the file cache does not need to modify the upper-layer calling logic. ### Details 1. Add `FileReaderOptions` param for `fs->open_file()`, and in `FileReaderOptions` 1. `CachePathPolicy` Determine the cache file path for a given file path. We can implement different `CachePathPolicy` for different file cache. 2. `FileCacheType` Specified file cache type: SUB_FILE_CACHE, WHOLE_FILE_CACHE, FILE_BLOCK_SIZE, etc. 2. Hide the cache logic inside the file reader. The `RemoteFileSystem` will handle the `CacheOptions` and determine whether to return a `CachedFileReader` or a `RemoteFileReader`. And the file cache is managed by `CachedFileReader`	2022-12-29 12:15:46 +08:00
AlexYue	ffef81a6ab	[feature](BE)pad missed version with empty rowset (#15030 ) If all replicas of one tablet are broken, user can use this http api to pad the missed version with empty rowset.	2022-12-29 11:20:44 +08:00
Jet He	75aa00d3d0	[Feature](NGram BloomFilter Index) add new ngram bloom filter index to speed up like query (#11579 ) This PR implement the new bloom filter index: NGram bloom filter index, which was proposed in #10733. The new index can improve the like query performance greatly, from our some test case , can get order of magnitude improve. For how to use it you can check the docs in this PR, and the index based on the ```enable_function_pushdown```, you need set it to ```true```, to make the index work for like query.	2022-12-28 18:01:50 +08:00
Zhengguo Yang	aa0f38f864	[chore](gutil) remove some gutil files and use c++ stl instead (#15357 ) * [chore](gutil) remove some gutil files and use c++ stl instead * fix * fix	2022-12-26 21:25:09 +08:00
Tiewei Fang	ec055e1acb	[feature](new file reader) Integrate new file reader (#15175 )	2022-12-26 08:55:52 +08:00
yiguolei	a807978882	[refactor](non-vec) Remove rowbatch code from delta writer and some rowbatch related code (#15349 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-26 08:54:51 +08:00
yiguolei	e640f49b6d	[refactor](non-vec) remove non vectorized predicate and row_block (#15348 ) remove non vectorized predicate and row_block	2022-12-25 21:45:00 +08:00
Ashin Gau	5cefd05869	[fix](multi-catalog) fix and optimize iceberg v2 reader (#15274 ) Fix three bugs when read iceberg v2 tables: 1. The `delete position` in `delete file` represents the position of delete row in the entire file, but the `read range` in `RowGroupReader` represents the position in current row group. Therefore, we need to subtract the position of first row of current row group from `delete position`. 2. When only reading the partition columns, `RowGroupReader` skips processing the `delete position`. 3. If the `delete position` has delete all rows in a row group, the `read range` is empty, but we read the whole row group in such case. Optimize four performance issues: 1. We change `delete position` to `delete range`, and then merge `delete range` and `read range` into the final read ranges. This process is too tedious and time-consuming. . we can merge `delete position` and `read range` directly. 2. `delete position` is ordered in a `delete file`, so we can use merge-sort, instead of ordered-set. 3. Initialize `RowGroupReader` when reading, instead of initialize all row groups when opening a `ParquetReader`, to save memory usage, and the same as `IcebergReader`. 4. Change the recursive call of `_do_lazy_read` to loop logic.	2022-12-24 16:02:07 +08:00
Zhengguo Yang	a98636a970	[bugfix](from_unixtime) fix timezone not work for from_unixtime (#15298 ) * [bugfix](from_unixtime) fix timezone not work for from_unixtime	2022-12-23 19:05:09 +08:00
yiguolei	06d0035c02	[refactor](non-vec)remove schema change related non-vec code (#15313 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-23 18:33:04 +08:00
Gabriel	b085ff49f0	[refactor](non-vec) delete non-vec data sink (#15283 ) * [refactor](non-vec) delete non-vec data sink Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-12-23 14:10:47 +08:00
yiguolei	83a99a0f8b	[refactor](non-vec) Remove non vec code from be (#15278 ) * [refactor](removecode) remove some non-vectorization Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-22 23:28:30 +08:00
HaveAnOrangeCat	df5969ab58	[Feature] Support function roundBankers (#15154 )	2022-12-22 22:53:09 +08:00
Xinyi Zou	77c15729d4	[fix](memory) Fix too many repeat cause OOM (#15217 )	2022-12-22 17:16:18 +08:00
ElvinWei	754fceafaf	[feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910 ) Histogram statistics Currently doris collects statistics, but no histogram data, and by default the optimizer assumes that the different values of the columns are evenly distributed. This calculation can be problematic when the data distribution is skewed. So this pr implements the collection of histogram statistics. For columns containing data skew columns (columns with unevenly distributed data in the column), histogram statistics enable the optimizer to generate more accurate estimates of cardinality for filtering or join predicates involving these columns, resulting in a more precise execution plan. The optimization of the execution plan by histogram is mainly in two aspects: the selection of where condition and the selection of join order. The selection principle of the where condition is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with higher selection rate is preferred. The selection of join order is based on the estimation of the number of rows in the join result. In the case of uneven data distribution in the join condition columns, histogram can greatly improve the accuracy of the prediction of the number of rows in the join result. At the same time, if the number of rows of a bucket in one of the columns is 0, you can mark it and directly skip the bucket in the subsequent join process to improve efficiency. --- Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows: Syntax ```SQL histogram(expr) ``` > The histogram function is used to describe the distribution of the data. It uses an "equal height" bucking strategy, and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. It is mainly used by the optimizer to estimate the range query. example ``` MySQL [test]> select histogram(login_time) from dev_table; +------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`login_time`) \| +------------------------------------------------------------------------------------------------------------------------------+ \| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}\| +------------------------------------------------------------------------------------------------------------------------------+ ``` description ```JSON { "bucket_size": 5, "buckets": [ { "lower": "2022-09-21 17:30:29", "upper": "2022-09-21 22:30:29", "count": 9, "pre_sum": 0, "ndv": 1 }, { "lower": "2022-09-22 17:30:29", "upper": "2022-09-22 22:30:29", "count": 10, "pre_sum": 9, "ndv": 1 }, { "lower": "2022-09-23 17:30:29", "upper": "2022-09-23 22:30:29", "count": 9, "pre_sum": 19, "ndv": 1 }, { "lower": "2022-09-24 17:30:29", "upper": "2022-09-24 22:30:29", "count": 9, "pre_sum": 28, "ndv": 1 }, { "lower": "2022-09-25 17:30:29", "upper": "2022-09-25 22:30:29", "count": 9, "pre_sum": 37, "ndv": 1 } ] } ``` TODO: - histogram func supports parameter and sample statistics (It's got another pr) - use histogram statistics - add p0 regression	2022-12-22 16:42:17 +08:00
Gabriel	e9a201e0ec	[refactor](non-vec) delete some non-vec exec node (#15239 ) * [refactor](non-vec) delete some non-vec exec node	2022-12-22 14:05:51 +08:00
AlexYue	821c12a456	[chore](BE) remove all useless segment group related code #15193 The segment group is useless in current codebase, remove all the related code inside Doris. As for the related protobuf code, use reserved flag to prevent any future user from using that field.	2022-12-20 17:11:47 +08:00
zhengyu	a7180c5ad8	[fix](segcompaction) fix segcompaction failed for newly created segment (#15022 ) (#15023 ) Currently, newly created segment could be chosen to be compaction candidate, which is prone to bugs and segment file open failures. We should skip last (maybe active) segment while doing segcompaction.	2022-12-19 14:17:58 +08:00
Pxl	219489ca0e	[Bug](s2geo) avoid some core dump on s2geo && enable ut of s2geo #15068	2022-12-16 10:56:02 +08:00
AlexYue	f17b138cbd	[BugFix](regression) don't use sf1DataPath when stream load (#15060 ) don't use sf1DataPath when stream load	2022-12-14 12:39:56 +08:00
Gabriel	1200b22fd2	[function](round) compute accurate round value by decimal (#14946 )	2022-12-13 09:53:43 +08:00

1 2 3 4 5 ...

934 Commits