Issue Number: close #xxx
I add jdbc catalog for doris multi-catalog feature.
Currently, the jdbc catalog only supports MYSQL DBMS.
TODO:
support for postgre DB
Support for other databases.
Problem summary
For jdbc catalog, we can create catalog like:
CREATE CATALOG jdbc4 PROPERTIES (
"type"="jdbc",
"jdbc.user"="root",
"jdbc.password"="123456",
"jdbc.jdbc_url" = "jdbc:mysql://127.0.0.1:13396/demo?yearIsDateType=false",
"jdbc.driver_url" = "file:/mnt/disk2/ftw/tools/jar/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar",
"jdbc.driver_class" = "com.mysql.jdbc.Driver"
);
Note:
yearIsDateType is a param of jdbc:
If yearIsDateType configuration property is set to false, then the returned object type is java.sql.Short. If set to true (the default), then the returned object is of type java.sql.Date with the date set to January 1st, at midnight.
To compat with mysql, we force the use of yearIsDateType=false in FE. if user sets yearIsDateType=true, doris FE will force to change yearIsDateType=false.
json reader DCHECK fail because of missing TYPE_STRING
fix bug that if no file is found, the tvf will throw NPE.
The predicate conjuncts can not be pushed down to parquet reader if this is a load task.
Because the predicate should be applied on column of dest table, not on column of source file.
Add a temp property "use_new_load_scan_node" of broker load to make regression test happy.
So that we can use new load scan node for a certain job and avoid setting global FE config.
BE will crash when querying partitioned hive table with text format
and put partition column at first of select items.
1. FE should use file slots to set the column mapping index of csv file.
2. BE should use `get_by_name` of block to get right column in a block in csv reader.
1.Support in bitmap syntax, like 'where k1 in (select bitmap_column from tbl)';
2.Support bitmap runtime filter. Generate a bitmap filter using the right table bitmap and push it down to the left table storage layer for filtering.
The run length of null map is saved as `uint16_t`. Previously, the run length of null map was
limited by `batch_size` in the `ParquetReader`, by setting `batch_size = std::min(batch_size, (size_t)USHRT_MAX)`.
It works well when the batch size is less than `USHRT_MAX`.
However, [Lazy read](https://github.com/apache/doris/pull/13917) will merge empty batches until reading
a non-empty batch or reaching the EOF of a row group, so the `batch_size` may be greater than `USHRT_MAX`
in non-predicate columns.
In addition, even if the `batch_size` does not exceed `USHRT_MAX`, the adjacent batches may also make
the run length exceed the `USHRT_MAX` in `ColumnSelectVector::get_next_run`.
This PR is to make sharing hash table for broadcast more robust:
Add a session variable to enable/disable this function.
Do not block the hash join node's close function.
Use shared pointer to share hash table and runtime filter in broadcast join nodes.
The Hash join node that doesn't need to build the hash table will close the right child without reading any data(the child will close the corresponding sender).
When some columns specified are not in the parquet or orc file in broker load, _batch->num_columns() will less than _num_of_columns_from_file. It will lead to be core dump.
To prevent be core dump, just return an error in this case.
In previous implementation, when doing list partition prune, we need to generation `rangeToId`
every time we doing prune.
But `rangeToId` is actually a static data that should be create-once-use-every-where.
So for hive partition, I created the `rangeToId` and all other necessary data structures for partition prunning
in partition cache, so that we can use it directly.
In my test, the cost of partition prune for 10000 partitions reduce from 8s -> 0.2s.
Aslo add "partition" info in explain string for hive table.
```
| 0:VEXTERNAL_FILE_SCAN_NODE |
| predicates: `nation` = '0024c95b' |
| inputSplitNum=1, totalFileSize=4750, scanRanges=1 |
| partition=1/10000 |
| numNodes=1 |
| limit: 10 |
```
Bug fix:
1. Fix bug that es scan node can not filter data
2. Fix bug that query es with predicate like `where substring(test2,2) = "ext2";` will fail at planner phase.
`Unexpected exception: org.apache.doris.analysis.FunctionCallExpr cannot be cast to org.apache.doris.analysis.SlotRef`
TODO:
1. Some problem when quering es version 8: ` Unexpected exception: Index: 0, Size: 0`, will be fixed later.
PR https://github.com/apache/doris/pull/13917 has supported lazy read for non-predicate columns in ParquetReader,
but can't trigger lazy read when predicate columns are partition or missing columns.
This PR support such case, and fill partition and missing columns in `FileReader`.
Fix three bugs:
1. The EOF of lazy read columns may be not equal to the EOF of predicate columns.
(for example: If the predicate column has 3 pages, with 400 rows for each, but the last page
is filtered by page index. When batch_size=992, the EOF of predicate column is true.
However, we should set batch_size=800 for lazy read column, so the EOF of lazy read column may be false.)
2. The array column does not count the number of nulls
3. Generate wrong NullMap for array column
When calling select on remote files, download cache files to local disk.
When calling alter table on remote files, read files directly from remote storage. So if tablet is too large, it will not take up too many local disk when creating local cache file.
Currently, it takes too much time to build BE from source in workflow environments (P0/P1) which affects the efficiency of daily development.
We can measure the time by executing the following command.
time EXTRA_CXX_FLAGS='-O3' BUILD_TYPE=ASAN ./build.sh --be --fe --clean -j "$(nproc)"
This PR optimizes the compilation time by exploiting the following methods.
Reduce the codegen by removing some useless std::visit.
Disable the optimization for some template functions which are instantiated by std::visit conditionally (except for the RELEASE build).
Read predicate columns firstly, and use VExprContext(push-down predicates)
to generate the select vector, which is then applied to read the non-predicate columns.
The data in non-predicate columns may be skipped by select vector, so the value-decode-time can be reduced.
If a whole page can be skipped, the decompress-time can also be reduced.