doris

Author	SHA1	Message	Date
zhannngchen	4a5095f00d	[cleanup](config) remove unused config push_write_mbytes_per_sec (#13290 )	2022-10-12 15:58:04 +08:00
Kang	1bd14f1d82	[feature-wip](jsonb) jsonb parse function and load (#13129 ) add function to parse json string to jsonb format and use it to support stream load.	2022-10-12 13:56:37 +08:00
zhangstar333	16999ef02d	[Vectorized][Function] support date_trunc and countequal function (#13039 )	2022-10-12 10:01:09 +08:00
赵立伟	334708dc8c	[fix](memory): avoid coredump when list pointer is null (#12919 )	2022-10-11 16:00:23 +08:00
Pxl	bdcb600f3d	[Bug](load) fix core dump on big block load (#13014 )	2022-10-10 12:38:32 +08:00
yiguolei	dc2d33298b	[chore](be config) remove config use_mmap_allocate_chunk #13196 This config is never used online and there exist bugs if enable this config. So that I remove this config and related tests. Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-10-09 16:19:59 +08:00
Pxl	245490d6b7	[Enhancement](runtime filter) optimize for runtime filter (#12856 ) optimize for runtime filter	2022-10-09 14:11:03 +08:00
Mingyu Chen	d286aa7bf7	[fix](spark-load) no need to filter row group when doing spark load (#13116 ) 1. Fix issue #13115 2. Modify the method of `get_next_block` or `GenericReader`, to return "read_rows" explicitly. Some columns in block may not be filled in reader, if the first column is not filled, use `block->rows()` can not return real row numbers. 3. Add more checks for broker load test cases.	2022-10-05 23:00:56 +08:00
Ashin Gau	026ffaf10d	[feature-wip](parquet-reader) add detail profile for parquet reader (#13095 ) Add more detail profile for ParquetReader: ParquetColumnReadTime: the total time of reading parquet columns ParquetDecodeDictTime: time to parse dictionary page ParquetDecodeHeaderTime: time to parse page header ParquetDecodeLevelTime: time to parse page's definition/repetition level ParquetDecodeValueTime: time to decode page data into doris column ParquetDecompressCount: counter of decompressing page data ParquetDecompressTime: time to decompress page data ParquetParseMetaTime: time to parse parquet meta data	2022-10-02 15:11:48 +08:00
Adonis Ling	e7f18e998a	[chore](be-ut) Remove useless lines which cause compilation errors (#13053 )	2022-09-30 11:26:25 +08:00
slothever	820ec435ce	[feature-wip](parquet-reader) refactor parquet_predicate (#12896 ) This change serves the following purposes: 1. use ScanPredicate instead of TCondition for external table, it can reuse old code branch. 2. simplify and delete some useless old code 3. use ColumnValueRange to save predicate	2022-09-28 21:27:13 +08:00
Mingyu Chen	d80b7b9689	[feature-wip](new-scan) support more load situation (#12953 )	2022-09-27 21:48:32 +08:00
Pxl	9607f60845	[Feature](serialize) move block_data_version to fe heart beat (#12667 ) Move block_data_version from be config to fe heart beat	2022-09-27 18:25:54 +08:00
TengJianPing	1bb42a7bc0	[function](hash) add support of murmur_hash3_64 (#12923 )	2022-09-26 14:23:37 +08:00
Ashin Gau	692176ec07	[feature-wip](parquet-reader) pre read page data in advance to avoid frequent seek (#12898 ) 1. Fix the bug of file position in `HdfsFileReader` 2. Reserve enough buffer for `ColumnColumnReader` to read large continuous memory	2022-09-25 21:21:06 +08:00
Shane	59699a4321	[feature](JSON datatype)Support JSON datatype (#10322 ) Add `JSON` datatype, following features are implemented by this PR: 1. `CREATE` tables with `JSON` type columns 2. `INSERT` values containing `JSON` type value stored in `String`, which is represented as binary format(AKA `JSONB`) at BE 3. `SELECT` JSON columns Detail design refers [DSIP-016: Support JSON type](https://cwiki.apache.org/confluence/display/DORIS/DSIP-016%3A+Support+JSON+type) * add JSONB data storage format type * fix JsonLiteral resolve bug * add DataTypeJson case in data_type_factory * add JSON syntax check in FE * add operators for jsonb_document, currently not support comparison between any JSON type value * add ColumnJson and DataTypeJson * add JsonField to store JsonValue * add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral * add push_json for MysqlResultWriter * JSON column need no zone_map_index * Revert "JSON column need no zone_map_index" This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79. * add JSON writer and reader, ignore zone-map for JSON column * add json_to_string for DataTypeJson * add olap_data_convertor for JSON type * add some enum * add OLAP_FIELD_TYPE_JSON type, FieldTypeTraits for it and corresponding cases or functions * fix column_json offsets overflow bug, format code * remove useless TODOs, add CmpType cases for JSON type * add license header * format license * format be codes * resolve rebase master conflicts * fix bugs for CREATE and meta related code * refactor JsonValue constructors, add fe JSON cases and fix some bugs, reformat codes * modification be codes along code review advice * fix rebase conflicts with master * add unit test for json_value and column_json * fix rebase error * rename json to jsonb * fix some data convert bugs, set Mysql type to JSON	2022-09-25 14:06:49 +08:00
Ashin Gau	5bfdfac387	[feature-wip](parquet-reader) add parquet reader profile (#12797 ) Add profile for parquet reader. New counters: - ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics - ParquetReadGroups: The number of row groups to read - ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics - ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics - ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics - ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole ## Result ``` ┌──────────────────────────────────────────────────────┐ │[0: VFILE_SCAN_NODE] │ │(Active: 1s29ms, non-child: 96.42) │ │ - Counters: │ │ - BytesRead: 0.00 │ │ - FileReadCalls: 1.826K (1826) │ │ - FileReadTime: 510.627ms │ │ - FileRemoteReadBytes: 65.23 MB │ │ - FileRemoteReadCalls: 1.146K (1146) │ │ - FileRemoteReadRate: 128.29331970214844 MB/sec │ │ - FileRemoteReadTime: 508.469ms │ │ - NumDiskAccess: 0 │ │ - NumScanners: 1 │ │ - ParquetFilteredBytes: 0.00 │ │ - ParquetFilteredGroups: 0 │ │ - ParquetFilteredRowsByGroup: 0 │ │ - ParquetFilteredRowsByPage: 6.600003M (6600003)│ │ - ParquetReadBytes: 2.13 GB │ │ - ParquetReadGroups: 20 │ │ - PeakMemoryUsage: 0.00 │ │ - PredicateFilteredRows: 3.399797M (3399797) │ │ - PredicateFilteredTime: 133.302ms │ │ - RowsRead: 3.399997M (3399997) │ │ - RowsReturned: 200 │ │ - RowsReturnedRate: 194 │ │ - TotalRawReadTime(*): 726.566ms │ │ - TotalReadThroughput: 0.0 /sec │ │ - WaitScannerTime: 1s27ms │ └──────────────────────────────────────────────────────┘ ```	2022-09-23 18:42:14 +08:00
HappenLee	f7e3ca29b5	[Opt](Vectorized) Support push down no grouping agg (#12803 ) Support push down no grouping agg	2022-09-23 18:29:54 +08:00
slothever	1ca6d559e4	[feature-wip](parquet-reader) refactor some arguments for parquet reader (#12771 ) refactor some arguments for parquet reader 1. Add new parquet context to wrap reader arguments 2. Reduced some arguments for function call Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-09-22 09:34:01 +08:00
Xinyi Zou	b41eaa5ac0	[fix](memtracker) Introduce orphan mem tracker to verify memory tracking accuracy (#12794 ) The mem hook consumes the orphan tracker by default. If the thread does not attach other trackers, by default all consumption will be passed to the process tracker through the orphan tracker. In real time, consumption of all other trackers + orphan tracker consumption = process tracker consumption. Ideally, all threads are expected to attach to the specified tracker, so that "all memory has its own ownership", and the consumption of the orphan mem tracker is close to 0, but greater than 0.	2022-09-21 15:47:10 +08:00
Xinyi Zou	bd4bfa8f00	[fix](memtracker) Fix thread mem tracker try consume accuracy #12782	2022-09-21 09:20:41 +08:00
slothever	d435f0de41	[feature-wip](parquet-reader) add page index row range (#12652 ) Add some utils and provide the candidate row range (generated with skipped row range of each column) to read for page index filter this version support binary operator filter todo: - use context instead of structures in close() - process complex type filter - use this instead of row group minmax filter - refactor _eval_binary() for row group filter and page index filter	2022-09-20 10:36:19 +08:00
yinzhijian	fb9e48a34a	[fix](vstream load) Fix bug when load json with jsonpath (#12660 )	2022-09-19 10:13:18 +08:00
Lightman	e01986b8b9	[feature](light-schema-change) fix light-schema-change and add more cases (#12160 ) Fix _delete_sign_idx and _seq_col_idx when append_column or build_schema when load. Tablet schema cache support recycle when schema sptr use count equals 1. Add a http interface for flink-connector to sync ddl. Improve tablet->tablet_schema() by max_version_schema.	2022-09-17 11:29:36 +08:00
yinzhijian	a97f63141e	[fix](cast) Add validity check for date conversion for non-vectorization (#12608 ) actual result select cast("0.0000031417" as date); +------------------------------+ \| CAST('0.0000031417' AS DATE) \| +------------------------------+ \| 2000-00-00 \| +------------------------------+ expect result select cast("0.0000031417" as date); +------------------------------+ \| CAST('0.0000031417' AS DATE) \| +------------------------------+ \| NULL \| +------------------------------+	2022-09-16 09:08:53 +08:00
yixiutt	b136d80e1a	[enhancement](compress) reuse compression ctx and buffer (#12573 ) Reuse compression ctx and buffer. Use a global instance for every compression algorithm, and use a thread saft buffer pool to reuse compression buffer, pool size is equal to max parallel thread num in compression, and this will not be too large. Test shows this feature increase 5% of data import and compaction. Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-15 10:59:46 +08:00
Mingyu Chen	c5ad989065	[refactor](reader) refactor the interface of file reader (#12574 ) Currently, Doris has a variety of readers for different file formats, such as parquet reader, orc reader, csv reader, json reader and so on. The interfaces of these readers are not unified, which makes it impossible to call them through a unified method. In this PR, I added a `GenericReader` interface class, and other Readers will implement this interface class to use the `get_next_block()` method. This PR currently only modifies `arrow_reader` and `parquet reader`. Other readers will be modified one by one in subsequent PRs.	2022-09-14 22:31:11 +08:00
Pxl	0ead048b93	[Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456 ) 1. remove ColumnString terminating zero 2. add a data_version for pblock 3. change EncryptionMode to enum class	2022-09-14 21:25:22 +08:00
Mingyu Chen	e33f4f90ae	[fix](exec) Avoid query thread block on wait_for_start (#12411 ) When FE send cancel rpc to BE, it does not notify the wait_for_start() thread, so that the fragment will be blocked and occupy the execution thread. Add a max wait time for wait_for_start() thread. So that it will not block forever.	2022-09-13 08:57:37 +08:00
TaoZex	c8e9a32bb2	[Function](cbrt)Add cbrt function for doris (#12523 ) Add cbrt function for doris	2022-09-12 19:58:45 +08:00
Yongqiang YANG	c3af60eff8	[fix](threadpool) threadpool schedules does not work right on concurr… (#12370 ) * [fix](threadpool) threadpool schedules does not work right on concurrent token Assuming there is a concurrent thread token whose concurrency is 2, and the 1st submit on the token is submitted to threadpool while the 2nd is not submitted due to busy. The token's active_threads is 1, then thread pool does not schedule the token. The patch fixes the problem.	2022-09-08 14:54:46 +08:00
camby	26cf2d3742	[enhancement](array-type) avoid abuse of Offset and Offset64 #12378 We already separate Array Offset64 and String Offset(32bit) in PR: #12341 Now we limit: Offset inside IColumn, Offset64 only inside ColumnArray, to avoid abuse of them. If we use the wrong one, it will compile failed.	2022-09-08 14:53:07 +08:00
yongjinhou	09b45f2b71	[Function](ELT)Add elt function (#12321 )	2022-09-07 15:21:08 +08:00
camby	cf5d194fe1	[enhancement](array-type) Split Array Offsets and String Offsets (#12341 ) In old Doris version string offsets are 32bit, but it is not enough for Array type. If we change string offsets from 32bit to 64bit, there will be problem if we upgrade BE one by one. Because at the same time 32bit Offsets and 64 bit Offsets String will exist at the same time. As a result, we separate the Codes for Array Offsets. Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-09-06 11:18:27 +08:00
Ashin Gau	1cc9eeeb1a	[feature-wip](parquet-reader) read and generate array column (#12166 ) Read and generate parquet array column. When D=1, R=0, representing an empty array. Empty array is not a null value, so the NullMap for this row is false, the offset for this row is [offset_start, offset_end) whose `offset_start == offset_end`, and offset_end is the start offset of the next row, so there is no value in the nested primitive column. When D=0, R=0, representing a null array, and the NullMap for this row is true.	2022-08-31 17:08:12 +08:00
HappenLee	573e5476dd	[Opt](load) Speed up the vectorized load (#12146 ) * [Opt](load) Speed up the vectorized load	2022-08-31 16:23:36 +08:00
yiguolei	2f192019d3	[bugfix](delete hanlder) delete predicate is merged and could not find schema cause core dump (#12161 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-08-30 09:18:21 +08:00
Gabriel	5f7d6e8f2b	[Refactor](predicate) Unify Conditions and ColumnPredicate (#11985 )	2022-08-29 12:11:22 +08:00
plat1ko	db07e51cd3	[refactor](status) Refactor status handling in agent task (#11940 ) Refactor TaggableLogger Refactor status handling in agent task: Unify log format in TaskWorkerPool Pass Status to the top caller, and replace some OLAPInternalError with more detailed error message Status Premature return with the opposite condition to reduce indention	2022-08-29 12:06:01 +08:00
Ashin Gau	dec576a991	[feature-wip](parquet-reader) generate null values and NullMap for parquet column (#12115 ) Generate null values and NullMap for the nullable column by analyzing the definition levels.	2022-08-29 09:30:32 +08:00
Gabriel	6e6269c682	[Improvement](load) accelerate streamload and compaction (#12119 ) * [Improvement](load) accelerate streamload and compaction	2022-08-28 23:10:47 +08:00
Ashin Gau	0b5bb565a7	[feature-wip](parquet-reader) parquet dictionary decoder (#11981 ) Parse parquet data with dictionary encoding. Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files. refer: https://github.com/apache/parquet-format/blob/master/Encodings.md	2022-08-26 19:24:37 +08:00
yiguolei	ccff3f5711	[bugfix](light weight schema change) support delete condition in schema change (#11869 ) * [bugfix](light weight schema change) support delete condition in schema change Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-08-26 11:45:55 +08:00
slothever	0c16740f5c	[feature-wip](parquet-reader) parquert scanner can read data (#11970 ) Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-08-26 09:43:46 +08:00
Pxl	620d33a763	[Enchancement](optimize) set result_size_hint to filter_block (#11972 )	2022-08-25 11:42:52 +08:00
Jerry Hu	f875684345	[fix](agg) Crashing caused by serialization in streaming aggregation (#12027 )	2022-08-24 14:38:25 +08:00
Xinyi Zou	1fc5515a78	[enhancement](memory) Remove unused reservation tracker (#11969 )	2022-08-24 08:49:34 +08:00
yixiutt	60fddd56e7	[feature-wip](unique-key-merge-on-write) opt lock and only save valid delete_bitmap (#11953 ) 1. use rlock in most logic instead of wrlock 2. filter stale rowset's delete bitmap in save meta 3. add a delete_bitmap lock to handle compaction and publish_txn confict Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-08-23 14:43:40 +08:00
Jerry Hu	c22d097b59	[improvement](compress) Support compress/decompress block with lz4 (#11955 )	2022-08-22 17:35:43 +08:00
Ashin Gau	6d925054de	[feature-wip](parquet-reader) decode parquet time & datetime & decimal (#11845 ) 1. Spark can set the timestamp precision by the following configuration: spark.sql.parquet.outputTimestampType = INT96(NANOS), TIMESTAMP_MICROS, TIMESTAMP_MILLIS DATETIME V1 only keeps the second precision, DATETIME V2 keeps the microsecond precision. 2. If using DECIMAL V2, the BE saves the value as decimal128, and keeps the precision of decimal as (precision=27, scale=9). DECIMAL V3 can maintain the right precision of decimal	2022-08-22 10:15:35 +08:00

1 2 3 4 5 ...

820 Commits