doris

Author	SHA1	Message	Date
Tiewei Fang	ec055e1acb	[feature](new file reader) Integrate new file reader (#15175 )	2022-12-26 08:55:52 +08:00
yiguolei	0e651365ca	[profile](scanner) add per scanner running time profile (#15321 ) * [profile](scanner) add per scanner running time profile Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-26 08:55:07 +08:00
yiguolei	a807978882	[refactor](non-vec) Remove rowbatch code from delta writer and some rowbatch related code (#15349 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-26 08:54:51 +08:00
Yulei-Yang	b7768a928d	[Improvement](S3) support access s3 via temporary security credentials (#15340 )	2022-12-26 00:31:55 +08:00
yiguolei	e640f49b6d	[refactor](non-vec) remove non vectorized predicate and row_block (#15348 )	2022-12-25 21:45:00 +08:00
Ashin Gau	5cefd05869	[fix](multi-catalog) fix and optimize iceberg v2 reader (#15274 ) Fix three bugs when reading iceberg v2 tables: 1. The `delete position` in a `delete file` is the position of the deleted row in the entire file, but the `read range` in `RowGroupReader` is the position within the current row group. Therefore, we need to subtract the position of the first row of the current row group from `delete position`. 2. When only reading the partition columns, `RowGroupReader` skips processing the `delete position`. 3. If the `delete position` entries delete all rows in a row group, the `read range` is empty, but we read the whole row group in such a case. Optimize four performance issues: 1. We change `delete position` to `delete range`, and then merge `delete range` and `read range` into the final read ranges. This process is too tedious and time-consuming; we can merge `delete position` and `read range` directly. 2. `delete position` is ordered in a `delete file`, so we can use merge-sort instead of an ordered set. 3. Initialize `RowGroupReader` when reading, instead of initializing all row groups when opening a `ParquetReader`, to save memory; the same applies to `IcebergReader`. 4. Change the recursive call of `_do_lazy_read` to a loop.	2022-12-24 16:02:07 +08:00
Xin Liao	e72404c537	[fix](scan) fix that be may core dump when the predicates are all false (#15332 )	2022-12-24 15:27:43 +08:00
Gabriel	06f71f2bca	[pipeline](fix) Fix bugs to pass all regression cases (#15306 ) * [pipeline](fix) Fix bugs to pass all regression cases * update * update	2022-12-23 22:17:50 +08:00
Zhengguo Yang	a98636a970	[bugfix](from_unixtime) fix timezone not work for from_unixtime (#15298 ) * [bugfix](from_unixtime) fix timezone not work for from_unixtime	2022-12-23 19:05:09 +08:00
yiguolei	06d0035c02	[refactor](non-vec)remove schema change related non-vec code (#15313 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-23 18:33:04 +08:00
Jibing-Li	e336178ef8	[Fix](multi catalog)Fix VFileScanner file not found status bug. #15226 The if condition to check NOT FOUND status for VFileScanner is incorrect, fix it.	2022-12-23 16:45:54 +08:00
luozenglin	8a810cd554	[fix](bitmapfilter) fix core dump caused by bitmap filter (#15296 ) Do not push down the bitmap filter to a non-integer column	2022-12-23 16:42:45 +08:00
luozenglin	8515a03ef9	[fix](compile) fix compile error caused by `mysql_scan_node.cpp` not being found when enabling `WITH_MYSQL` (#15277 )	2022-12-23 16:25:28 +08:00
AlexYue	fe562bc3e7	[Bug](Agg) fix crash when encountering not supported agg function like last_value(bitmap) (#15257 ) The former logic inside aggregate_function_window.cpp would shut down the BE on encountering an agg function with a complex type like BITMAP. This PR prevents the crash and returns a more concrete error message that tells the user which function signature is unsupported.	2022-12-23 14:23:21 +08:00
Gabriel	b085ff49f0	[refactor](non-vec) delete non-vec data sink (#15283 ) * [refactor](non-vec) delete non-vec data sink Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-12-23 14:10:47 +08:00
luozenglin	38530100d8	[fix](localgc) check gc only cache directory (#15238 )	2022-12-23 10:40:55 +08:00
Pxl	6b3721af23	[Bug](function) fix core dump on reverse() when big string input	2022-12-23 10:14:09 +08:00
yiguolei	83a99a0f8b	[refactor](non-vec) Remove non vec code from be (#15278 ) * [refactor](removecode) remove some non-vectorization Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-22 23:28:30 +08:00
HaveAnOrangeCat	df5969ab58	[Feature] Support function roundBankers (#15154 )	2022-12-22 22:53:09 +08:00
HappenLee	388df291af	[pipeline](schedule) Add profile for except node and fix steal task problem (#15282 )	2022-12-22 22:42:37 +08:00
Kang	e331e0420b	[improvement](topn)add per scanner limit check for new scanner (#15231 ) Optimize key topn queries like `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100`, where (ss_sold_date_sk, ss_sold_time_sk) is a prefix of the table sort key. Check the limit per scanner and set eof to true to reduce the data that needs to be read.	2022-12-22 22:39:31 +08:00
Gabriel	d38461616c	[Pipeline](error msg) format error message (#15247 )	2022-12-22 20:55:06 +08:00
Xinyi Zou	77c15729d4	[fix](memory) Fix too many repeat cause OOM (#15217 )	2022-12-22 17:16:18 +08:00
zhengyu	6fb61b5bbc	[enhancement] (streamload) allow table in url when do two-phase commit (#15246 ) (#15248 ) Make it work even if the user provides (unnecessary) table info in the url, i.e. `curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc` still works. Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>	2022-12-22 17:00:51 +08:00
ElvinWei	754fceafaf	[feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910 ) Histogram statistics Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This estimate can be problematic when the data distribution is skewed, so this PR implements the collection of histogram statistics. For skewed columns (columns with unevenly distributed data), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan. The histogram improves the execution plan mainly in two aspects: the selection of where conditions and the selection of join order. The selection principle for where conditions is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with the higher selection rate is preferred. The selection of join order is based on the estimated number of rows in the join result. When the data in the join condition columns is unevenly distributed, the histogram can greatly improve the accuracy of the predicted number of rows in the join result. At the same time, if the number of rows of a bucket in one of the columns is 0, it can be marked and the bucket skipped directly in the subsequent join process to improve efficiency. --- Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows: Syntax ```SQL histogram(expr) ``` > The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy, and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. It is mainly used by the optimizer to estimate range queries. example ``` MySQL [test]> select histogram(login_time) from dev_table; +------------------------------------------------------------------------------------------------------------------------------+ | histogram(`login_time`) | +------------------------------------------------------------------------------------------------------------------------------+ | {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}| +------------------------------------------------------------------------------------------------------------------------------+ ``` description ```JSON { "bucket_size": 5, "buckets": [ { "lower": "2022-09-21 17:30:29", "upper": "2022-09-21 22:30:29", "count": 9, "pre_sum": 0, "ndv": 1 }, { "lower": "2022-09-22 17:30:29", "upper": "2022-09-22 22:30:29", "count": 10, "pre_sum": 9, "ndv": 1 }, { "lower": "2022-09-23 17:30:29", "upper": "2022-09-23 22:30:29", "count": 9, "pre_sum": 19, "ndv": 1 }, { "lower": "2022-09-24 17:30:29", "upper": "2022-09-24 22:30:29", "count": 9, "pre_sum": 28, "ndv": 1 }, { "lower": "2022-09-25 17:30:29", "upper": "2022-09-25 22:30:29", "count": 9, "pre_sum": 37, "ndv": 1 } ] } ``` TODO: - histogram func supports parameter and sample statistics (there is another PR for this) - use histogram statistics - add p0 regression	2022-12-22 16:42:17 +08:00
Gabriel	e9a201e0ec	[refactor](non-vec) delete some non-vec exec node (#15239 ) * [refactor](non-vec) delete some non-vec exec node	2022-12-22 14:05:51 +08:00
yixiutt	1cc79510c9	[enhancement](compaction) add delete_sign_index check before filter delete (#15190 )	2022-12-22 09:26:37 +08:00
HappenLee	8ecf69b09b	[pipeline](regression) nested loop join test get error result in pipeline engine and refactor the code for need more input data (#15208 )	2022-12-21 19:03:51 +08:00
Gabriel	af54299b26	[Pipeline](projection) Support projection on pipeline engine (#15220 )	2022-12-21 15:47:29 +08:00
TengJianPing	a447121fc3	[fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan (#15199 ) * [fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan * fix	2022-12-21 15:44:47 +08:00
Gabriel	2445ac9520	[Bug](runtimefilter) Fix BE crash due to init failure (#15228 )	2022-12-21 15:36:22 +08:00
Zhengguo Yang	5aefb793f9	[Bugfix](round) fix round function may coredump (#15203 ) * [Bugfix](round) fix round function may coredump	2022-12-21 14:36:10 +08:00
Xin Liao	efdc73777a	[enhancement](load) verify the number of rows between different replicas when load data to avoid data inconsistency (#15101 ) It is very difficult to investigate the data inconsistency of multiple replicas. When loading data, the number of rows between replicas is checked to avoid some data inconsistency problems.	2022-12-21 09:50:13 +08:00
Gabriel	732417258c	[Bug](pipeline) Fix bugs to pass TPCDS cases (#15194 )	2022-12-20 22:29:55 +08:00
Gabriel	2501198800	[Bug](compile) Fix compiling error (#15207 )	2022-12-20 20:05:49 +08:00
AlexYue	821c12a456	[chore](BE) remove all useless segment group related code #15193 The segment group is useless in current codebase, remove all the related code inside Doris. As for the related protobuf code, use reserved flag to prevent any future user from using that field.	2022-12-20 17:11:47 +08:00
morrySnow	5cf21fa7d1	[feature](planner) mark join to support subquery in disjunction (#14579 ) Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>	2022-12-20 15:22:43 +08:00
zbtzbtzbt	9d48154cdc	[minor](non-vec) delete unused interface in RowBatch (#15186 )	2022-12-20 13:06:34 +08:00
yiguolei	a2d56af7d9	[profile](datasender) add more detail profile in data stream sender (#15176 ) * [profile](datasender) add more detail profile in data stream sender Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-20 12:07:34 +08:00
Lijia Liu	938f4f33d6	[Pipeline] Add MLFQ when schedule (#15124 )	2022-12-20 11:49:15 +08:00
luozenglin	0c2911efb1	[enhancement](gc) sub_file_cache checks the directory files when gc (#15114 ) * [enhancement](gc) sub_file_cache checks the directory files when gc * update	2022-12-20 10:50:11 +08:00
Zhengguo Yang	98cdeed6e0	[chore](routine load) remove deprecated property of librdkafka reconnect.backoff.jitter.ms #15172	2022-12-20 10:13:56 +08:00
HappenLee	40141a9c9c	[opt](vectorized) opt the null map _has_null logic (#15181 )	2022-12-20 10:01:54 +08:00
zhangstar333	494eb895d3	[vectorized](pipeline) support union node operator (#15031 )	2022-12-19 22:01:56 +08:00
HappenLee	7c67fa8651	[Bug](pipeline) fix bug of right anti join error result in pipeline (#15165 )	2022-12-19 19:28:44 +08:00
Gabriel	0732f31e5d	[Bug](pipeline) Fix bugs for scan node and join node (#15164 ) * [Bug](pipeline) Fix bugs for scan node and join node * update	2022-12-19 15:59:29 +08:00
TengJianPing	445ec9d02c	[fix](counter) fix coredump caused by updating destroyed counter (#15160 )	2022-12-19 14:35:03 +08:00
xueweizhang	1597afcd67	[fix](multi-catalog) fix get many same name db/table when show where (#15076 ) When running show databases/tables/table status where xxx, the statement is rewritten into a selectStmt that selects the result from information_schema. It needs catalog info to scan the schema table; otherwise it may return the same database or table info from multiple catalogs. For example: mysql> show databases where schema_name='test'; +----------+ | Database | +----------+ | test | | test | +----------+ MySQL [internal.test]> show tables from test where table_name='test_dc'; +----------------+ | Tables_in_test | +----------------+ | test_dc | | test_dc | +----------------+	2022-12-19 14:27:48 +08:00
Ashin Gau	7730a88d11	[fix](multi-catalog) add support for orc binary type (#15141 ) Fix three bugs: 1. DataTypeFactory::create_data_type is missing the conversion of binary type, so OrcReader fails. 2. ScalarType#createType is missing the conversion of binary type, so ExternalFileTableValuedFunction fails. 3. fmt::format can't generate the right format string, and fails.	2022-12-19 14:24:12 +08:00
Xin Liao	03ea2866b7	[fix](load) add to error tablets when delta writer failed to close (#15118 ) The load should fail when the delta writers of all tablets fail to close on a single node, but the result returned to the client is success. The reason is that the committed tablets and error tablets are both empty, so publish succeeds. We should add the tablet to error tablets when its delta writer fails to close, so that the transaction fails.	2022-12-19 14:22:25 +08:00
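
The iceberg v2 reader fix in 5cefd05869 describes two steps: translating file-level `delete position` values into row-group-relative positions, and merging the ordered positions directly into the read ranges. A minimal sketch of that idea follows. It is not the code from the PR; `RowRange` and `build_read_ranges` are hypothetical names chosen for illustration, and the ranges are assumed to be half-open.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Half-open range [first, second) of row indexes relative to the row group.
// Hypothetical type for illustration only.
using RowRange = std::pair<int64_t, int64_t>;

// Build the ranges of rows to read in one row group.
// group_first_row:  file-level index of the first row of this row group.
// group_num_rows:   number of rows in this row group.
// delete_positions: file-level positions of deleted rows, sorted ascending.
std::vector<RowRange> build_read_ranges(int64_t group_first_row, int64_t group_num_rows,
                                        const std::vector<int64_t>& delete_positions) {
    std::vector<RowRange> ranges;
    int64_t next = 0; // first row of the group not yet emitted or deleted
    for (int64_t pos : delete_positions) {
        // Subtract the first row of the group to get a group-relative position.
        const int64_t local = pos - group_first_row;
        if (local < next) continue;          // before this group, or a duplicate
        if (local >= group_num_rows) break;  // sorted input: nothing more applies
        if (local > next) ranges.emplace_back(next, local);
        next = local + 1;                    // skip the deleted row
    }
    if (next < group_num_rows) ranges.emplace_back(next, group_num_rows);
    return ranges;
}
```

Because the delete positions are already ordered, one forward pass is enough and no ordered set is needed. If every row of the group is deleted, the returned vector is empty, which a caller can use to skip the row group entirely.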

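The "equal height" bucketing described in 754fceafaf can be sketched as follows. This is an illustration under stated assumptions, not the aggregate function added by the PR: values are plain `int64_t` rather than datetimes, the field names mirror the JSON output shown in the commit message, and `Bucket` and `equal_height_histogram` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One histogram bucket; field names follow the JSON in the commit message.
struct Bucket {
    int64_t lower;   // smallest value in the bucket
    int64_t upper;   // largest value in the bucket
    int64_t count;   // number of values in the bucket
    int64_t pre_sum; // number of values in all preceding buckets
    int64_t ndv;     // number of distinct values in the bucket
};

// Split the sorted values into at most num_buckets buckets of near-equal count.
std::vector<Bucket> equal_height_histogram(std::vector<int64_t> values, size_t num_buckets) {
    std::vector<Bucket> buckets;
    if (values.empty() || num_buckets == 0) return buckets;
    std::sort(values.begin(), values.end());
    const size_t n = values.size();
    num_buckets = std::min(num_buckets, n); // never create an empty bucket
    size_t begin = 0;
    for (size_t b = 0; b < num_buckets; ++b) {
        const size_t end = (b + 1) * n / num_buckets; // exclusive end index
        int64_t ndv = 1;
        for (size_t i = begin + 1; i < end; ++i) {
            if (values[i] != values[i - 1]) ++ndv;
        }
        buckets.push_back({values[begin], values[end - 1],
                           static_cast<int64_t>(end - begin),
                           static_cast<int64_t>(begin), ndv});
        begin = end;
    }
    return buckets;
}
```

This simple split can place equal values in two adjacent buckets; a production implementation may handle such boundaries differently. The `pre_sum` field makes a range estimate cheap, since the number of rows below a bucket is available without summing the earlier counts.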
3327 Commits