This pr does three things:
1. Modified the framework of table-valued-function(tvf).
2. be support `fetch_table_schema` rpc.
3. Implemented `S3(path, AK, SK, format)` table-valued-function.
Support running transactional insert operation with new scan framework. eg:
admin set frontend config("enable_new_load_scan_node" = "true");
begin;
insert into tbl1 values(1,2);
insert into tbl1 values(3,4);
insert into tbl1 values(5,6);
commit;
Add some limitation to transactional insert
Do not support non-literal value in insert stmt
Fix some issue about array type:
Forbid cast other non-array type to NESTED array type, it may cause BE crash.
Add getStringValueForArray() method for Expr, to get valid string-formatted array type value.
Add useLocalSessionState=true in regression-test jdbc url
without this config, the jdbc driver will send some init cmd each time it connect to server, such as
select @@session.tx_read_only.
But when we use transactional insert, after begin command, Doris do not support any other type of
stmt except for insert, commit or rollback.
So adding this config to let the jdbc NOT send cmd when connecting.
1.
In the previous implementation, the max thread num of olap scanner was set relatively small, such as 3.
which would slow down some of queries.
In this PR, I changed the max thread num to a quarter of the scaner thread pool(default is 12),
which is less than the old scan node's max thread num, but larger than the previous implementation.
The upper limit of the max thread num of the old scan node is too high, which is not reasonable.
2.
Lower down the number of pre allocated free blocks.
PR(https://github.com/apache/doris/pull/13404) introduced that ParquetReader
will break up batch insertion when encountering null values, which leads to the bad performance
compared to OrcReader.
So this PR has pushed null map into decode function, reduce the time of virtual function call
when encountering null values.
Further more, reuse hdfsFS among file readers to reduce the time of building connection to hdfs.
Issue Number: close#12574
This pr adds `NewJsonReader` which implements GenericReader interface to support read json format file.
TODO:
1. modify `_scann_eof` later.
2. Rename `NewJsonReader` to `JsonReader` when `JsonReader` is deleted.
The index for external table columns from path is incorrect in new scanner. This is a fix for it.
e.g. In the next query, nation and city columns are from path
```
mysql> select nation, city, count(*) from parquet_two_part group by nation, city;
+--------+------------+----------+
| nation | city | count(*) |
+--------+------------+----------+
| cn | beijing | 1199969 |
| cn | shanghai | 1199771 |
| jp | tokyo | 599715 |
| rus | moscow | 600659 |
| us | chicago | 1199805 |
| us | washington | 1201296 |
+--------+------------+----------+
6 rows in set (0.39 sec)
```
1. Missing field and line delimiter
2. When query external table with text(csv) format, we should pass the column position map to BE,
otherwise the column order is wrong.
TODO:
1. For now, if we query csv file with non-exist column, it will return null.
But it should return null or default value of that column.
2. Add regression test after hive docker is ready.
1. remove FE config `enable_array_type`
2. limit the nested depth of array in FE side.
3. Fix bug that when loading array from parquet, the decimal type is treated as bigint
4. Fix loading array from csv(vec-engine), handle null and "null"
5. Change the csv array loading behavior, if the array string format is invalid in csv, it will be converted to null.
6. Remove `check_array_format()`, because it's logic is wrong and meaningless
7. Add stream load csv test cases and more parquet broker load tests
# Proposed changes
Implement predicate pushdown in `OrcReader` by converting doris `ColumnValueRange` to orc `SearchArgument`.
## Remaining problems
1. Orc support `not in`, which may have effect on bloom filter. However, doris `ScanNode` has not push down `not in` to file scanner.
2. Orc support `is null`, and row range has `hasNull` identifier. However, `_contain_null` in `ColumnValueRange` is ambiguous. `_contain_null = true` only means that the value can be nullable, not equal to null.
3. `DateTimeV2` has lost microsecond precision in `ColumnValueRange`, which may cause filtering error when a min-max value equals to the predicate value.
4. `DateTimeV1` is not accurate enough, and only saved to seconds.
5. Orc support the predicate pushdown of `float&double` type, but doris has not push down `float&double` type for precision reason.
# Proposed changes
This PR fixed lots of issues when building from source on macOS with Apple M1 chip.
## ATTENTION
The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime:
1. Some errors with memory tracker occur when BE (RELEASE) starts.
2. Some UT cases fail.
...
Temporarily, the following changes are made on macOS to start BE successfully.
1. Disable memory tracker.
2. Use tcmalloc instead of jemalloc.
This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues.
## Use case
```shell
./build.sh -j 8 --be --clean
cd output/be/bin
ulimit -n 60000
./start_be.sh --daemon
```
## Something else
It takes around _**10+**_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.
1. Refactor the file reader creation in FileFactory, for simplicity.
Previously, FileFactory had too many `create_file_reader` interfaces.
Now unified into two categories: the interface used by the previous BrokerScanNode,
and the interface used by the new FileScanNode.
And separate the creation methods of readers that read `StreamLoadPipe` and other readers that read files.
2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode
3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside.
4. Add some test cases for csv stream load, the behavior is same as the old broker scanner.
1. Refactor arguments list of parquet min max filter, pass parquet type for min max value parsing
2. Fix the filter of string min max
Co-authored-by: jinzhe <jinzhe@selectdb.com>
For an vectorized expression which meets the condition vexpr->is_constant(), a const column is expected to return.
But now we still don't cover all predicates for const expression.
For example, for query SELECT col FROM tbl WHERE 'PROMOTION' LIKE 'AAA%', predicate like will return a ColumnVector which contains a single value.
This PR want to cover all const predicates in scan node whether it returns a constcolumn or not