doris

Author	SHA1	Message	Date
TengJianPing	8c0e13ab51	[improvement](profile) add detail memory counter for exec nodes (#14806 ) * [improvement](profile) improve accuracy of memory usage and add detail memory counter * fix	2022-12-05 11:51:52 +08:00
wxy	e141664339	[fix](statistics) fix missing scanBytes and scanRows in query statist… (#14750 ) * [fix](statistics) fix missing scanBytes and scanRows in query statistics when enable_vectorized_engine=true. Co-authored-by: wangxiangyu@360shuke.com <wangxiangyu@360shuke.com>	2022-12-05 09:17:51 +08:00
HappenLee	12304bc0ee	[Pipeline](exec) Support pipeline exec engine (#14736 ) Co-authored-by: Lijia Liu <liutang123@yeah.net> Co-authored-by: HappenLee <happenlee@hotmail.com> Co-authored-by: Jerry Hu <mrhhsg@gmail.com> Co-authored-by: Pxl <952130278@qq.com> Co-authored-by: shee <13843187+qzsee@users.noreply.github.com> Co-authored-by: Gabriel <gabrielleebuaa@gmail.com> ## Problem Summary: ### 1. Design DSIP: https://cwiki.apache.org/confluence/display/DORIS/DSIP-027%3A+Support+Pipeline+Exec+Engine ### 2. How to use: Set the session variable: `set enable_pipeline_engine = true;`	2022-12-02 17:11:34 +08:00
Gabriel	9dd1d989e8	[test](decimalv3) add regression test cases for decimalv3 (#14672 )	2022-12-01 15:18:40 +08:00
Xinyi Zou	176f519fa1	[enhancement](memtracker) Optimize exec node memory tracking (#14711 )	2022-12-01 14:52:21 +08:00
Jerry Hu	b4d32a0c44	[fix](join) runtime filter shared from other instance wasn't published (#14717 )	2022-12-01 14:17:23 +08:00
Pxl	bba77fa9dd	[Enhancement](profile) enhance column predicates display on profile (#14664 )	2022-12-01 13:07:12 +08:00
luozenglin	7873bc95a6	[Enhancement](bitmapfilter) Support bitmap filter to apply zone_map index to filter pages (#14635 )	2022-12-01 10:41:09 +08:00
luozenglin	6c70d794f6	[fix](bitmapfilter) fix core dump caused by bitmap filter (#14702 )	2022-12-01 09:56:22 +08:00
Tiewei Fang	9272680d00	[feature](multi-catalog) support Jdbc catalog (#14527 ) Issue Number: close #xxx This PR adds a JDBC catalog to the Doris multi-catalog feature. Currently, the JDBC catalog only supports MySQL. TODO: support PostgreSQL and other databases. Problem summary: A JDBC catalog can be created like this: CREATE CATALOG jdbc4 PROPERTIES ( "type"="jdbc", "jdbc.user"="root", "jdbc.password"="123456", "jdbc.jdbc_url" = "jdbc:mysql://127.0.0.1:13396/demo?yearIsDateType=false", "jdbc.driver_url" = "file:/mnt/disk2/ftw/tools/jar/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar", "jdbc.driver_class" = "com.mysql.jdbc.Driver" ); Note: yearIsDateType is a JDBC connection parameter. If yearIsDateType is set to false, the returned object type is java.sql.Short. If set to true (the default), the returned object is of type java.sql.Date with the date set to January 1st, at midnight. For compatibility with MySQL, FE forces yearIsDateType=false: if the user sets yearIsDateType=true, Doris FE changes it to yearIsDateType=false.	2022-11-30 11:28:08 +08:00
Gabriel	3e8b3658c7	[feature-wip](decimalv3) Support basic agg and arithmetic operations for decimal v3 (#14513 )	2022-11-29 15:12:41 +08:00
lsy3993	f7a827c06b	[fix](new-scan) fix some bugs about new scan node and readers (#14504 ) 1. Fix the json reader DCHECK failure caused by a missing TYPE_STRING. 2. Fix the bug that the tvf throws an NPE if no file is found. 3. The predicate conjuncts can not be pushed down to the parquet reader if this is a load task, because the predicate should be applied on columns of the dest table, not on columns of the source file. 4. Add a temp property "use_new_load_scan_node" of broker load to make regression test happy, so that we can use the new load scan node for a certain job and avoid setting a global FE config.	2022-11-29 10:21:41 +08:00
Gabriel	7513c82431	[NLJoin](conjuncts) separate join conjuncts and general conjuncts (#14608 )	2022-11-29 08:55:54 +08:00
starocean999	78adecac1b	[enhancement](be) optimize mem usage in join and set node (#14602 )	2022-11-27 13:38:49 +08:00
Tiewei Fang	36419fae48	[fix](JdbcExecutor) fix that JdbcExecutor did not load the class jar (#14598 ) JdbcExecutor did not load jdbc driver jar, so add classloader to load jdbc jar.	2022-11-26 23:53:05 +08:00
Mingyu Chen	064b8d2aa6	[fix](multi-catalog) fix coredump when querying partitioned hive table with text format (#14604 ) BE will crash when querying partitioned hive table with text format and put partition column at first of select items. 1. FE should use file slots to set the column mapping index of csv file. 2. BE should use `get_by_name` of block to get right column in a block in csv reader.	2022-11-26 11:42:40 +08:00
luozenglin	4728e75079	[feature](bitmap) Support in bitmap syntax and bitmap runtime filter (#14340 ) 1.Support in bitmap syntax, like 'where k1 in (select bitmap_column from tbl)'; 2.Support bitmap runtime filter. Generate a bitmap filter using the right table bitmap and push it down to the left table storage layer for filtering.	2022-11-25 15:22:44 +08:00
Ashin Gau	25de068a05	[fix](parquet-reader) the value of null map will overflow when LazyRead merges too many empty batches (#14558 ) The run length of null map is saved as `uint16_t`. Previously, the run length of null map was limited by `batch_size` in the `ParquetReader`, by setting `batch_size = std::min(batch_size, (size_t)USHRT_MAX)`. It works well when the batch size is less than `USHRT_MAX`. However, [Lazy read](https://github.com/apache/doris/pull/13917) will merge empty batches until reading a non-empty batch or reaching the EOF of a row group, so the `batch_size` may be greater than `USHRT_MAX` in non-predicate columns. In addition, even if the `batch_size` does not exceed `USHRT_MAX`, the adjacent batches may also make the run length exceed the `USHRT_MAX` in `ColumnSelectVector::get_next_run`.	2022-11-25 12:22:18 +08:00
Jerry Hu	9103ded1dd	[improvement](join)optimize sharing hash table for broadcast join (#14371 ) This PR is to make sharing hash table for broadcast more robust: Add a session variable to enable/disable this function. Do not block the hash join node's close function. Use shared pointer to share hash table and runtime filter in broadcast join nodes. The Hash join node that doesn't need to build the hash table will close the right child without reading any data(the child will close the corresponding sender).	2022-11-24 21:06:44 +08:00
TengJianPing	6c7f758ef7	[improvement](hashjoin) support partitioned hash table in hash join (#14480 )	2022-11-24 14:16:47 +08:00
Gabriel	d14e1d25ff	[Bug](vectorized) Fix wrong column type (#14387 )	2022-11-23 18:07:33 +08:00
starocean999	1520e5c88a	[enhancement](agg)use new method to serialize keys in batch if the key is too large (#14484 ) * [enhancement](agg)use new method to serialize keys in batch if the key is too large * fix compile error	2022-11-23 17:35:39 +08:00
luozenglin	30e1818724	[fix](tracing) fix tracing in the new scan node does not meet expectations (#14155 ) Issue Number: close #14149 - Remove unexpected tracing, like 'vscanner::scan' - Merge span vscannode::get_next	2022-11-22 16:44:02 +08:00
Gabriel	1ec7f45fb6	[Bug](avg) Fix `avg` for bigint (#14433 )	2022-11-22 10:29:59 +08:00
Xin Liao	fea9966728	[fix](parquet-orc) fix BE core dump when some specified columns are not in the parquet or orc file (#14440 ) When some specified columns are not in the parquet or orc file in broker load, _batch->num_columns() will be less than _num_of_columns_from_file, which leads to a BE core dump. To prevent the core dump, just return an error in this case.	2022-11-22 09:10:38 +08:00
Pxl	bcd641877f	[Enhancement](scan) disable build key range and filters when push down agg work (#14248 ) disable build key range and filters when push down agg work	2022-11-21 12:47:57 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
Mingyu Chen	512b787559	[fix](parquet-reader) fix stack-use-after-return error (#14411 )	2022-11-19 10:52:50 +08:00
starocean999	1f326fc0d6	[enhancement](be)limit mem cost to 16m when pre serialize keys in agg node (#14321 ) * [enhancement](be)limit mem cost to 16m when pre serialize keys in agg node * use only one chunk memory when serializing keys in agg node	2022-11-18 12:31:52 +08:00
spaces-x	1a035e2073	[fix](profile)(AggNode) fix the GetResultsTime is always zero (#14366 ) add scoped_timer in _serialize_with_serialized_key_result	2022-11-17 22:30:21 +08:00
Gabriel	50bfd99b59	[feature](join) support nested loop semi/anti join (#14227 )	2022-11-17 22:20:08 +08:00
HappenLee	d5af4f6558	[Nereids](Profile) Add projection timer for Nereids (#14286 )	2022-11-17 22:17:55 +08:00
slothever	6da2948283	[feature-wip](multi-catalog) support iceberg v2(step 1) (#13867 ) Support position delete(part of).	2022-11-17 17:56:48 +08:00
Mingyu Chen	7182f14645	[improvement][fix](multi-catalog) speed up list partition prune (#14268 ) In the previous implementation, list partition pruning had to generate `rangeToId` on every prune. But `rangeToId` is actually static data that should be created once and used everywhere. So for hive partitions, I create `rangeToId` and all other data structures needed for partition pruning in the partition cache, so that they can be used directly. In my test, the cost of partition pruning for 10000 partitions dropped from 8s to 0.2s. Also add "partition" info to the explain string for hive tables. ``` \| 0:VEXTERNAL_FILE_SCAN_NODE \| \| predicates: `nation` = '0024c95b' \| \| inputSplitNum=1, totalFileSize=4750, scanRanges=1 \| \| partition=1/10000 \| \| numNodes=1 \| \| limit: 10 \| ``` Bug fix: 1. Fix bug that es scan node can not filter data 2. Fix bug that querying es with a predicate like `where substring(test2,2) = "ext2";` fails at the planner phase. `Unexpected exception: org.apache.doris.analysis.FunctionCallExpr cannot be cast to org.apache.doris.analysis.SlotRef` TODO: 1. A problem when querying es version 8: ` Unexpected exception: Index: 0, Size: 0`, will be fixed later.	2022-11-17 08:30:03 +08:00
Ashin Gau	20634ab7e3	[feature-wip](multi-catalog) support partition&missing columns in parquet lazy read (#14264 ) PR https://github.com/apache/doris/pull/13917 has supported lazy read for non-predicate columns in ParquetReader, but can't trigger lazy read when predicate columns are partition or missing columns. This PR supports such cases, and fills partition and missing columns in `FileReader`.	2022-11-16 08:43:11 +08:00
huangzhaowei	5badd70db2	[fix](csv-reader) Fix core dump when load text into doris with special delimiter (#14196 )	2022-11-15 16:06:59 +08:00
starocean999	6d2e6d85d3	[enhancement](be)release memory in Node's close() method (#14258 ) * [enhancement](be)release memory in Node's close() method * format code	2022-11-15 15:59:23 +08:00
Gabriel	215a4c6e02	[Bug](BHJ) Fix wrong result when use broadcast hash join for naaj (#14253 )	2022-11-15 09:40:00 +08:00
Ashin Gau	fc70179acb	[multi-catalog](fix) the eof of lazy read columns may be not equal to the eof of predicate columns (#14212 ) Fix three bugs: 1. The EOF of lazy read columns may be not equal to the EOF of predicate columns. (for example: If the predicate column has 3 pages, with 400 rows for each, but the last page is filtered by page index. When batch_size=992, the EOF of predicate column is true. However, we should set batch_size=800 for lazy read column, so the EOF of lazy read column may be false.) 2. The array column does not count the number of nulls 3. Generate wrong NullMap for array column	2022-11-14 14:37:21 +08:00
Adonis Ling	7bb3792d51	[chore](build) Split the compilation units to build them in parallel (#14232 )	2022-11-14 10:57:10 +08:00
pengxiangyu	d55faa7f6a	[feature](remote) Only query can use local cache when reading remote files. (#13865 ) When calling select on remote files, download cache files to local disk. When calling alter table on remote files, read files directly from remote storage. So if a tablet is too large, it will not take up too much local disk space when creating local cache files.	2022-11-14 10:30:15 +08:00
starocean999	139c4a77f1	[enhancement](be)close ExecNode ASAP to release resource earlier (#14203 )	2022-11-14 09:41:35 +08:00
Xinyi Zou	dd11d5c0a5	[enhancement](memory) Support try catch bad alloc (#14135 )	2022-11-13 11:22:56 +08:00
luozenglin	376b4fda9f	[fix](scankey) fix extended scan key errors. (#14200 ) Issue Number: close #14199	2022-11-12 20:44:09 +08:00
xy720	035657c5a1	[typo](comment) Fix a lot of spell errors in be comments (#14208 ) fix typos.	2022-11-12 16:06:15 +08:00
Gabriel	fe2944d56d	[Bug](nljoin) Keep compatibility for nljoin (#14182 )	2022-11-11 15:54:55 +08:00
Adonis Ling	118a7dff07	[chore](build) Optimize the compilation time (#14170 ) Currently, it takes too much time to build BE from source in workflow environments (P0/P1) which affects the efficiency of daily development. We can measure the time by executing the following command. time EXTRA_CXX_FLAGS='-O3' BUILD_TYPE=ASAN ./build.sh --be --fe --clean -j "$(nproc)" This PR optimizes the compilation time by exploiting the following methods. Reduce the codegen by removing some useless std::visit. Disable the optimization for some template functions which are instantiated by std::visit conditionally (except for the RELEASE build).	2022-11-11 12:09:54 +08:00
Zhengguo Yang	12652ebb0e	[UDF](java udf) using config to enable java udf instead of macro at compile time (#14062 ) * [UDF](java udf) using config to enable java udf instead of macro at compile time	2022-11-11 09:03:52 +08:00
Gabriel	1ef85ae1f2	[Improvement](join) Support nested loop outer join (#13965 )	2022-11-10 19:50:46 +08:00
Ashin Gau	6bd5378f66	[feature-wip](multi-catalog) lazy read for ParquetReader (#13917 ) Read predicate columns firstly, and use VExprContext(push-down predicates) to generate the select vector, which is then applied to read the non-predicate columns. The data in non-predicate columns may be skipped by select vector, so the value-decode-time can be reduced. If a whole page can be skipped, the decompress-time can also be reduced.	2022-11-10 16:56:14 +08:00
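
The lazy-read idea described in commit 6bd5378f66 can be illustrated with a simplified sketch. This is not the ParquetReader code: the function `lazy_read`, its parameters, and the modeling of pages as plain Python lists are all hypothetical, chosen only to show how a select vector built from the predicate column lets the reader skip values, or whole pages, in non-predicate columns.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def lazy_read(
    predicate_column: Sequence[int],
    lazy_pages: Dict[str, List[List[str]]],
    predicate: Callable[[int], bool],
) -> Tuple[List[bool], Dict[str, List[str]], int]:
    """Hypothetical model of lazy read.

    predicate_column: decoded values of the predicate column.
    lazy_pages: non-predicate columns, each a list of pages (lists of values).
    predicate: the pushed-down filter, applied per value.
    """
    # Step 1: read the predicate column first and build the select vector.
    select_vector = [predicate(v) for v in predicate_column]

    result: Dict[str, List[str]] = {}
    pages_skipped = 0
    for name, pages in lazy_pages.items():
        values: List[str] = []
        row = 0
        for page in pages:
            page_mask = select_vector[row:row + len(page)]
            row += len(page)
            # Step 2a: no row of this page is selected, so the page is
            # skipped entirely (no decompression, no decoding).
            if not any(page_mask):
                pages_skipped += 1
                continue
            # Step 2b: otherwise keep only the selected rows.
            values.extend(v for v, keep in zip(page, page_mask) if keep)
        result[name] = values
    return select_vector, result, pages_skipped
```

In this model, a predicate that rejects every row of a page means the matching page of each non-predicate column is never touched, which corresponds to the decompress-time saving mentioned in the commit message.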

401 Commits