doris

Author	SHA1	Message	Date
Qi Chen	9ee7fa45d1	[Refactor](multi-catalog) Refactor to process splitted conjuncts for dict filter. (#21459 ) Conjuncts are currently split, so refactor source code to handle split conjuncts for dict filters.	2023-07-07 09:19:08 +08:00
Gabriel	4d17400244	[profile](join) add collisions into profile (#21510 )	2023-07-06 14:30:10 +08:00
airborne12	009b300abd	[Fix](ScannerScheduler) fix dead lock when shutdown group_local_scan_thread_pool (#21553 )	2023-07-06 13:09:37 +08:00
Mingyu Chen	242a35fa80	[fix](s3) fix s3 fs benchmark tool (#21401 ) 1. Fix a concurrency bug in the s3 fs benchmark tool to avoid crashes when running with multiple threads. 2. Add a `prefetch_read` operation to test the prefetch reader. 3. Add the `AWS_EC2_METADATA_DISABLED` env in `start_be.sh` to avoid calling EC2 metadata when creating the s3 client. 4. Add the `AWS_MAX_ATTEMPTS` env in `start_be.sh` to avoid warning logs from the s3 sdk.	2023-07-05 16:20:58 +08:00
Ashin Gau	9adbca685a	[opt](hudi) use spark bundle to read hudi data (#21260 ) Use spark-bundle instead of hive-bundle to read hudi data. Advantages of using spark-bundle to read hudi data: 1. The performance of spark-bundle is more than twice that of hive-bundle. 2. spark-bundle uses `UnsafeRow`, which can reduce data copying and JVM GC time. 3. spark-bundle supports `Time Travel`, `Incremental Read`, and `Schema Change`; these functions can be quickly ported to Doris. Disadvantages of using spark-bundle to read hudi data: 1. More dependencies make hudi-dependency.jar very cumbersome (from 138M -> 300M). 2. spark-bundle only provides an `RDD` interface and cannot be used directly.	2023-07-04 17:04:49 +08:00
morrySnow	90dd8716ed	[refactor](multicast) change the way multicast do filter, project and shuffle (#21412 ) Co-authored-by: Jerry Hu <mrhhsg@gmail.com> 1. Filtering is done at the sending end rather than the receiving end 2. Projection is done at the sending end rather than the receiving end 3. Each sender can use different shuffle policies to send data	2023-07-04 16:51:07 +08:00
Jerry Hu	b5da3f74f5	[improvement](join) avoid unnecessary copying in _build_output_block (#21360 ) If the source columns are mutually exclusive within a temporary block, there is no need to duplicate the data.	2023-07-04 12:13:49 +08:00
Xinyi Zou	b86dd11a7d	[fix](pipeline) refactor olap table sink close (#20771 ) For pipeline, olap table sink close is divided into three stages: try_close() --> pending_finish() --> close(). pending_finish() returns false only after all node channels are done or canceled, and only then does close() start. This avoids blocking the pipeline on close(). In close(), the index channel's intolerable failure status is checked after each node channel failure; if an intolerable failure has occurred, close() terminates early and all node channels are canceled to avoid meaningless blocking.	2023-07-04 11:27:51 +08:00
Jerry Hu	ca0953ea51	[improvement](join) Serialize build keys in a vectorized (columnar) way (#21361 ) There is a significant performance improvement in serializing keys in the aggregate node through vectorization. Now, applying it to the join node also brings performance improvement.	2023-07-03 09:29:10 +08:00
yongjinhou	df23ab3f29	[Enhancement](tvf) Add authentication for workload group tvf (#21323 )	2023-06-30 12:56:23 +08:00
Jerry Hu	7f0e37069f	[improvement](olap) filter the whole segment by dictionary (#21239 )	2023-06-29 10:34:29 +08:00
DongLiang-0	a6b51ec19a	[Feature](avro) Support Apache Avro file format (#19990 ) Support reading Avro files via hdfs() or s3(). ```sql select * from s3( "uri" = "http://127.0.0.1:9312/test2/person.avro", "ACCESS_KEY" = "ak", "SECRET_KEY" = "sk", "FORMAT" = "avro"); +--------+--------------+-------------+-----------------+ \| name \| boolean_type \| double_type \| long_type \| +--------+--------------+-------------+-----------------+ \| Alyssa \| 1 \| 10.0012 \| 100000000221133 \| \| Ben \| 0 \| 5555.999 \| 4009990000 \| \| lisi \| 0 \| 5992225.999 \| 9099933330 \| +--------+--------------+-------------+-----------------+ select * from hdfs( "uri" = "hdfs://127.0.0.1:9000/input/person2.avro", "fs.defaultFS" = "hdfs://127.0.0.1:9000", "hadoop.username" = "doris", "format" = "avro"); +--------+--------------+-------------+-----------+ \| name \| boolean_type \| double_type \| long_type \| +--------+--------------+-------------+-----------+ \| Alyssa \| 1 \| 8888.99999 \| 89898989 \| +--------+--------------+-------------+-----------+ ``` The current Avro reader supports only common data types; complex data types will be supported later.	2023-06-28 21:15:35 +08:00
Xinyi Zou	0396f78590	[fix](memory) Remove ChunkAllocator & fix Allocator no use mmap (#21259 )	2023-06-28 16:10:24 +08:00
Gabriel	e348b9464e	[scan](freeblocks) use ConcurrentQueue to replace vector for free blocks (#21241 )	2023-06-28 15:10:07 +08:00
caiconghui	db50face41	[fix](time_zone) be compatible with doris old version for CST time_zone when load orc file in broker load (#21263 ) Fix the error in broker load with an orc file when time_zone is CST; the error message is "Failed to create orc row reader. reason = Can't open /usr/share/zoneinfo/CST" Co-authored-by: caiconghui1 <caiconghui1@jd.com>	2023-06-28 09:44:42 +08:00
lihangyu	50c1d55769	[Improve](dynamic schema) support filtering invalid data (#21160 ) 1. Support dynamic schema to filter illegal data. 2. Expand the regular expression for ColumnName to support more column names. 3. Be compatible with PropertyAnalyzer and support legacy tables. 4. Disable parsing of multi-dimensional arrays by default, since some bugs remain unresolved.	2023-06-26 19:32:43 +08:00
Lijia Liu	76bdcf1d26	[improvement](pipeline) task group scan entity (#19924 )	2023-06-25 14:43:35 +08:00
zhangstar333	a33521b2ce	[enhancement](exchange) add filter for exchange node in BE (#21087 )	2023-06-22 01:04:47 +08:00
TsukiokaKogane	3dfeee3946	[fix](typesystem) fix wrong return type argument cause type check fail (#21082 )	2023-06-22 00:04:46 +08:00
Xinyi Zou	2c9bdd64fa	[fix](memory) arena support memory reuse after clear() (#21033 )	2023-06-21 23:27:21 +08:00
Gabriel	2ce8cfbebd	[profile](sort) add some metrics in profile (#21056 )	2023-06-21 22:57:46 +08:00
Qi Chen	bad22dd4e2	[Fix](orc-reader) Fix orc dict filter null value issue in `_convert_dict_cols_to_string_cols` which caused incorrect result. (#21047 ) Query results should not have empty values. ``` use regresssion.multi_catalog; select commit_id from github_events_orc WHERE (event_type = 'CommitCommentEvent') AND commit_id != "" limit 10; ``` ``` +------------------------------------------+ \| commit_id \| +------------------------------------------+ \| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 \| \| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 \| \| 4e3ab2ff2d2474f5d51334b9b0fdf17e9845a166 \| \| \| \| \| \| \| \| \| \| \| \| \| \| 7191c20cb49da07a7fc16aa32dc0de4faff528b2 \| +------------------------------------------+ 10 rows in set (0.54 sec) ```	2023-06-21 14:54:01 +08:00
Gabriel	81abdeffbc	[Improvement](pipeline) Improve shared scan performance (#20785 )	2023-06-21 14:36:05 +08:00
Ashin Gau	ef17289925	[feature](jni) add jni metrics and attach to BE profile automatically (#21004 ) Add JNI metrics, for example: ``` - HudiJniScanner: 0ns - FillBlockTime: 31.29ms - GetRecordReaderTime: 1m5s - JavaScanTime: 35s991ms - OpenScannerTime: 1m6s ``` Add three common performance metrics for JNI scanner: 1. `OpenScannerTime`: Time to init and open JNI scanner 2. `JavaScanTime`: Time to scan data and insert into vector table in java side 3. `FillBlockTime`: Time to convert java vector table to c++ block And support user defined metrics in java side, for example: `OpenScannerTime` is a long time for the open process, we want to determine which sub-process takes too much time, so we add `GetRecordReaderTime` in java side. The user defined metrics in java side can be attached to BE profile automatically.	2023-06-21 11:19:02 +08:00
Xinyi Zou	6d579d924d	[fix](profile) delete useless profile add_child #20989	2023-06-20 23:21:52 +08:00
Kang	2c11ce0a02	[bugfix](topn) fix key topn merge block conflict with index predicate result columns (#20820 )	2023-06-20 21:23:00 +08:00
Qi Chen	c85271d2ae	[Fix](orc-reader) Fix filter size mismatch in orc reader. (#20998 ) Fix filter size mismatch in orc reader introduced by #20806	2023-06-20 12:27:16 +08:00
Ashin Gau	923f7edad0	[opt](hudi) using native reader to read the base file with no log file (#20988 ) Two optimizations: 1. Insert string bytes directly to remove decoding&encoding process. 2. Use native reader to read the hudi base file if it has no log file. Use `explain` to show how many splits are read natively.	2023-06-20 11:20:21 +08:00
yongjinhou	26cca5e00a	[Enhancement](tvf) Add frontends table-valued-function (#20857 )	2023-06-19 13:57:40 +08:00
TengJianPing	fb9fcf460a	[fix](leftjoin) fix bug of left and full join with other conjuncts (#20946 ) Fix bug of left and full outer join with other conjuncts. When the equal-matched row count of a probe row exceeds batch_size, the _join_node->_is_any_probe_match_row_output flag is sometimes not set correctly, which results in outputting extra rows for the probe row.	2023-06-19 12:27:06 +08:00
YueW	d6b7640cf0	[fix](inverted index) fix check failed for block erase temp column (#20924 )	2023-06-18 19:27:48 +08:00
xzj7019	ab32299ba4	[feature](nereids) Support multi target rf #20714 Support multi target runtime filter, mainly for set operation, such as union/intersect/except.	2023-06-16 20:26:00 +08:00
Qi Chen	b7a50a09fe	[Opt](orc-reader) Optimize orc reader by dict filtering. (#20806 ) Optimize orc reader by dict filtering. It is similar with #17594. Test result ssb-flat-100: (3 nodes) \| Query \| before opt \| after opt \| \| ------------- \|:-------------:\| ---------:\| Q1.1 \| 1.239 \| 1.145 Q1.2 \| 1.254 \| 1.128 Q1.3 \| 1.931 \| 1.644 Q2.1 \| 1.359 \| 1.006 Q2.2 \| 1.229 \| 0.674 Q2.3 \| 0.934 \| 0.427 Q3.1 \| 2.226 \| 1.712 Q3.2 \| 2.042 \| 1.562 Q3.3 \| 1.631 \| 1.021 Q3.4 \| 1.618 \| 0.732 Q4.1 \| 2.294 \| 1.858 Q4.2 \| 2.511 \| 1.961 Q4.3 \| 1.736 \| 1.446 total \| 22.004 \| 16.316	2023-06-16 13:11:37 +08:00
Pxl	17a395f5e3	[Bug](runtime-filter) fix runtime filter not register on vdata_gen_scan_node (#20787 ) fix runtime filter not register on vdata_gen_scan_node	2023-06-15 14:06:14 +08:00
Mryange	460399f214	[fix](profile) remove same profile in join node (#20734 )	2023-06-15 08:08:39 +08:00
zy-kkk	09d187ec77	[improvement](ck jdbc) Optimized reading of datetime and ip types of the ClickHouse JDBC Catalog (#20804 )	2023-06-14 23:28:08 +08:00
slothever	bb617ee2cc	[fix](parquet-reader)fix page v2 header offset (#20814 ) fix page v2 header offset. get correct offset when read next page in file.	2023-06-14 23:27:31 +08:00
yiguolei	31a4f96f01	[refactor](exprcontext) move close to expr context's dector method (#20747 ) The close method does nothing, but I am not sure we can remove it. So I moved it into the destructor and removed the many explicit calls.	2023-06-14 18:01:07 +08:00
lihangyu	0f470fec0e	[Bug](topn opt) Fix Two-Phase read when some rowset swept (#20732 ) In a Two-Phase read query, we need to delay the release of rowsets via row->update_delayed_expired_timestamp() to extend their lifespan. This is necessary to avoid data loss during the second-phase read, where some stale rowsets may otherwise be swept, resulting in missing data.	2023-06-14 15:46:29 +08:00
Pxl	9244cb6553	[Chore](runtime-filter) do not make query fail when rf publish failed (#20742 ) do not make query fail when rf publish failed	2023-06-13 18:23:46 +08:00
lihangyu	2dddab03a1	[compatibility](schema cache) ensure schema version when using schema cache (#20729 ) When FE is an old version and BE is a new version, issuing a schema change (add column) and then querying can read a stale schema from the schema cache, because the old version of FE queries without a schema version.	2023-06-13 15:19:26 +08:00
Mingyu Chen	4b15185e25	[improvement](hdfs) add parquet footer cache and hdfs file handle cache (#20544 ) 1. Add hdfs file handle cache for hdfs file reader. Copied from Impala, `https://github.com/apache/impala/blob/master/be/src/util/lru-multi-cache.h` (thanks to the Impala team). This is an LRU cache that can store multiple entries with the same key. The key is built from {file name + modification time}. The value is the hdfsFile pointer that points to a certain hdfs file. This cache avoids reopening the same hdfs file multiple times, which can save query time. Add a BE config `max_hdfs_file_handle_cache_num` to limit the max number of cached file handles; the default is 20000. 2. Add file meta cache. The file meta cache is an LRU cache. The key is {file name + modification time}, and the value is the parsed file meta info of that file, which saves the time of re-parsing file meta every time. Currently, it is only used for caching parquet file footers. Tests show that when the cache is hit, `FileOpenTime` and `ParseFooterTime` are reduced to almost 0 in the query profile, which can save time when there are lots of files to read.	2023-06-13 15:13:57 +08:00
Pxl	e010fa8d4f	[Chore](runtime filter) remove runtime filter ready_for_publish/publish_finally (#20593 )	2023-06-13 11:20:49 +08:00
lexluo09	57656b2459	[Enhancement](java-udf) java-udf module split to sub modules (#20185 ) The java-udf module has become increasingly large and difficult to manage, making it inconvenient to package and use as needed. It needs to be split into multiple sub-modules, such as: java-common, java-udf, jdbc-scanner, hudi-scanner, paimon-scanner. Co-authored-by: lexluo <lexluo@tencent.com>	2023-06-13 09:41:22 +08:00
HappenLee	51bbf17786	[Refactor](Profile) Add and refactor the join profile (#20693 )	2023-06-13 09:06:51 +08:00
Qi Chen	73ad885e19	[Feature][Fix](multi-catalog) Implements transactional hive full acid tables. (#20679 ) Following support for insert-only transactional hive tables in #19518 and #19419, this PR supports transactional hive full acid tables. Hive3 transactional full acid tables are supported. Hive2 transactional full acid tables need to run major compactions.	2023-06-13 08:55:16 +08:00
HappenLee	ea264ce9de	[Opt](join) short circuit probe for join node (#20585 ) Support the _short_circuit_for_probe for join node	2023-06-12 16:01:09 +08:00
Xujian Duan	0b228b3414	[fix](load)Support load json data with default value (#20624 ) * support json default value --------- Co-authored-by: duanxujian <duanxujian@jd.com>	2023-06-12 14:51:31 +08:00
GoGoWen	4c340f2851	[Feature] (Multi-Catalog) support query hll column in doris jdbc table - part 1 (#19413 ) Issue Number: close #17895	2023-06-12 11:16:19 +08:00
yiguolei	a6f625676b	[profile](remove child) child is for node, should not be used to organize counters (#20676 ) Currently, many profiles use add child profile to organize the profile into blocks. But this is wrong: a child profile will have a total time counter. What we should use is just a label. - MemoryUsage: - HashTable: 23.98 KB - SerializeKeyArena: 446.75 KB Add a new macro ADD_LABEL_COUNTER to add just a label in the profile. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-06-12 10:00:35 +08:00

835 Commits