doris

Author	SHA1	Message	Date
Xin Liao	8001d65811	[fix](insert) fix memory leak for insert transaction (#17530 )	2023-03-08 14:10:59 +08:00
zxealous	a0782a1855	[fix](file reader) fix be core in broker file reader (#17039 ) A const reference class member was bound to a temporary object, which can no longer be accessed once the temporary is destroyed. This caused a BE core dump when debug-level logging was enabled, because _broker_addr had already been destroyed in BrokerFileReader.	2023-02-26 12:35:31 +08:00
Tiewei Fang	f6ce072297	[Enhancement](csv-reader) Optimize csv_reader `_split_value` and fix json_reader case sensitive (#17093 ) 1. Enhancement: For a single-character column separator, csv_reader uses another method to split values. 2. BugFix: Make `json` file format loading case sensitive.	2023-02-26 09:03:04 +08:00
Mingyu Chen	491d269412	[fix](tvf) fix bug that failed to get schema of tvf when file is empty (#16928 ) In previous implementation, when querying tvf, FE will get schema from BE. And BE will try to open the first file to get its schema info, but for orc or parquet format, if the file is empty, it will return error. But even for an empty file, we can still get schema info from file's footer. So we should handle the empty file to get schema info correctly. Also modify the catalog doc to add some FAQ.	2023-02-21 14:14:32 +08:00
Qi Chen	a46941c684	[Fix](multi-catalog) Fix switch-case fall-through issue in multi-catalog module. (#16931 )	2023-02-20 21:35:41 +08:00
Ashin Gau	9618427020	[improvement](multi-catalog) increase default batch_size to 4064 (#16326 ) The performance of ClickBench Q30 is affected by batch_size: with batch_size = 1024, 4096 and 20480, the Q30 query time is 2.27, 1.08 and 0.62 respectively. The aggregation operator creates a new result block for each batch block, which is time-consuming because Q30 has 90 columns. A larger batch_size decreases the number of aggregation blocks and therefore improves performance. Doris internal reader will read at least 4064 rows even if batch_size < 4064, so this PR keeps the process of reading an external table the same as for an internal table.	2023-02-02 11:51:09 +08:00
huangzhaowei	b878a7e61e	[feature](Load)Support skip specific lines number for csv stream load (#16055 ) Support setting the number of lines to skip when loading a csv file with stream load. Usage `-H skip_lines:number`: ``` curl --location-trusted -u root: -T test.csv -H skip_lines:5 -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load ``` Skipping lines can also be used in mysql load as below: ```sql LOAD DATA LOCAL INFILE '${mysql_load_skip_lines}' INTO TABLE ${tableName} COLUMNS TERMINATED BY ',' IGNORE 2 LINES PROPERTIES ("auth" = "root:"); ```	2023-02-01 20:42:43 +08:00
Tiewei Fang	f17d69e450	[feature](file cache)Import `file cache` for remote file reader (#15622 ) The main purpose of this pr is to introduce `fileCache` for lakehouse reads of remote files. The local disk is used as a cache for remote files, so the next time a file is read, the data can be obtained directly from the local disk. In addition, this pr includes a few other minor changes. Import File Cache: 1. The imported `fileCache` is called `block_file_cache`, which uses an LRU replacement policy. 2. Implement a new FileReader, `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`. Other changes: 1. Add a new interface `fs()` for `FileReader`. 2. `IOContext` adds some statistics to track `FileCache` usage. Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>	2023-01-10 12:23:56 +08:00
Mingyu Chen	4075e3aec6	[fix](csv-reader) fix new csv reader's performance issue (#15581 )	2023-01-04 18:25:08 +08:00
luozenglin	f8bb8c7829	[fix](broker) fix be core dump caused by broker load (#15390 )	2022-12-28 10:57:41 +08:00
Xin Liao	bf71943605	[feature](load) stream load trim double quotes for csv (#15241 )	2022-12-26 11:45:54 +08:00
Tiewei Fang	ec055e1acb	[feature](new file reader) Integrate new file reader (#15175 )	2022-12-26 08:55:52 +08:00
Ashin Gau	5cefd05869	[fix](multi-catalog) fix and optimize iceberg v2 reader (#15274 ) Fix three bugs when reading iceberg v2 tables: 1. The `delete position` in a `delete file` represents the position of the deleted row in the entire file, but the `read range` in `RowGroupReader` represents the position in the current row group. Therefore, we need to subtract the position of the first row of the current row group from `delete position`. 2. When only reading the partition columns, `RowGroupReader` skips processing the `delete position`. 3. If the `delete position` entries delete all rows in a row group, the `read range` is empty, but we read the whole row group in such a case. Optimize four performance issues: 1. We change `delete position` to `delete range`, and then merge `delete range` and `read range` into the final read ranges. This process is too tedious and time-consuming; we can merge `delete position` and `read range` directly. 2. `delete position` is ordered in a `delete file`, so we can use merge-sort instead of an ordered-set. 3. Initialize `RowGroupReader` when reading, instead of initializing all row groups when opening a `ParquetReader`, to save memory; the same applies to `IcebergReader`. 4. Change the recursive call of `_do_lazy_read` to loop logic.	2022-12-24 16:02:07 +08:00
Mingyu Chen	064b8d2aa6	[fix](multi-catalog) fix coredump when querying partitioned hive table with text format (#14604 ) BE will crash when querying a partitioned hive table with text format and putting the partition column first in the select items. 1. FE should use file slots to set the column mapping index of csv file. 2. BE should use `get_by_name` of block to get the right column in a block in csv reader.	2022-11-26 11:42:40 +08:00
huangzhaowei	5badd70db2	[fix](csv-reader) Fix core dump when load text into doris with special delimiter (#14196 )	2022-11-15 16:06:59 +08:00
xy720	035657c5a1	[typo](comment) Fix a lot of spell errors in be comments (#14208 ) fix typos.	2022-11-12 16:06:15 +08:00
Tiewei Fang	27549564a7	[feature](table-valued-function) Support S3 tvf (#13959 ) This pr does three things: 1. Modifies the framework of table-valued-function (tvf). 2. BE supports the `fetch_table_schema` rpc. 3. Implements the `S3(path, AK, SK, format)` table-valued-function.	2022-11-06 11:04:26 +08:00
Mingyu Chen	7b4c2cabb4	[feature](new-scan) support transactional insert in new scan framework (#13858 ) Support running transactional insert operations with the new scan framework, e.g.: admin set frontend config("enable_new_load_scan_node" = "true"); begin; insert into tbl1 values(1,2); insert into tbl1 values(3,4); insert into tbl1 values(5,6); commit; Add some limitations to transactional insert: non-literal values in insert stmt are not supported. Fix some issues about array type: 1. Forbid casting other non-array types to NESTED array type, which may cause BE crash. 2. Add getStringValueForArray() method for Expr, to get a valid string-formatted array type value. Add useLocalSessionState=true to the regression-test jdbc url: without this config, the jdbc driver sends some init cmds each time it connects to the server, such as select @@session.tx_read_only. But with transactional insert, after the begin command, Doris does not support any type of stmt other than insert, commit or rollback. So this config is added to make jdbc NOT send cmds when connecting.	2022-11-03 08:36:07 +08:00
Mingyu Chen	3a3def447d	[fix](csv-reader) fix bug that csv reader can not read text format hms table (#13515 ) 1. Missing field and line delimiter. 2. When querying an external table with text(csv) format, we should pass the column position map to BE, otherwise the column order is wrong. TODO: 1. For now, if we query a csv file with a non-existent column, it will return null. But it should return null or the default value of that column. 2. Add regression test after hive docker is ready.	2022-10-22 22:40:03 +08:00
Mingyu Chen	32b1456b28	[feature-wip](array) remove array config and check array nested depth (#13428 ) 1. Remove FE config `enable_array_type`. 2. Limit the nested depth of array on FE side. 3. Fix bug that when loading array from parquet, the decimal type is treated as bigint. 4. Fix loading array from csv(vec-engine), handle null and "null". 5. Change the csv array loading behavior: if the array string format is invalid in csv, it will be converted to null. 6. Remove `check_array_format()`, because its logic is wrong and meaningless. 7. Add stream load csv test cases and more parquet broker load tests.	2022-10-20 15:52:31 +08:00
Mingyu Chen	dbf71ed3be	[feature-wip](new-scan) Support stream load with csv in new scan framework (#13354 ) 1. Refactor the file reader creation in FileFactory, for simplicity. Previously, FileFactory had too many `create_file_reader` interfaces. Now unified into two categories: the interface used by the previous BrokerScanNode, and the interface used by the new FileScanNode. And separate the creation methods of readers that read `StreamLoadPipe` and other readers that read files. 2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode 3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside. 4. Add some test cases for csv stream load, the behavior is same as the old broker scanner.	2022-10-17 23:33:41 +08:00
Tiewei Fang	b7621e1615	[feature-wip](new-scan) support csv reader (#13282 ) Issue Number: close #12574 This pr adds CsvReader, which implements the GenericReader interface to support reading csv format files.	2022-10-12 16:22:13 +08:00

22 Commits