The current column statistics cache loader loads data from the column_statistics OLAP table.
This PR changes the cache loader logic to first load from the column_statistics OLAP table and, if no data was loaded, fall back to table metadata. This is mainly to support fetching statistics for external catalogs through the HMS or Iceberg API.
This is the first PR; the next one will implement the fetch logic for the different external catalogs.
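A minimal sketch of the fallback flow, using hypothetical class and method names rather than the actual Doris API:

```
import java.util.Optional;

// All names below are hypothetical; the real Doris classes differ.
record StatisticsCacheKey(long catalogId, long tableId, String colName) {}
record ColumnStatistic(double ndv, double nullCount, double rowCount) {}

class ColumnStatisticCacheLoader {
    ColumnStatistic load(StatisticsCacheKey key) {
        // Step 1: try the internal column_statistics OLAP table first.
        return loadFromStatisticsTable(key)
                // Step 2: nothing stored there, so fall back to table metadata,
                // e.g. statistics fetched through the HMS or Iceberg API.
                .orElseGet(() -> loadFromTableMetadata(key));
    }

    Optional<ColumnStatistic> loadFromStatisticsTable(StatisticsCacheKey key) {
        return Optional.empty(); // stub: query the column_statistics table here
    }

    ColumnStatistic loadFromTableMetadata(StatisticsCacheKey key) {
        return new ColumnStatistic(0, 0, 0); // stub: catalog-specific fetcher goes here
    }
}
```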
Two optimizations:
1. Insert string bytes directly to remove the decoding/encoding round trip.
2. Use the native reader to read a Hudi base file when it has no log file (sketched below). Use `explain` to show how many splits are read natively.
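A minimal sketch of the per-split decision in optimization 2; the names are hypothetical, not the actual planner code:

```
import java.util.List;

// Hypothetical sketch of the per-split decision; the real Doris planner differs.
record HudiSplit(String baseFilePath, List<String> logFilePaths) {
    // A base file with no delta log files is plain Parquet/ORC, so the
    // native reader can scan it directly; otherwise fall back to the
    // Java-based Hudi reader, which can merge the log files.
    boolean readableByNativeReader() {
        return logFilePaths.isEmpty();
    }
}
```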
When we use a label to load data, that label cannot be used twice. After executing the SQL `CLEAN LABEL [label] FROM db;`, the same label should become usable again.
However, the SQL above does not work. This PR fixes that problem.
If we do type coercion on a subquery, it returns the data type after type coercion, which leads to the following error:
```
Both side of binary arithmetic is not numeric. left type is DECIMALV3(2, 1) and right type is DECIMAL(27, 9)')
```
In hash join conditions, some equality conjuncts are trustable and some are not.
An equality is trustable if one side is almost unique, like a primary key; for such a condition we can estimate more accurately.
The problem is that in the rewritten q20 there are 2 equality conditions, one trustable and one not, but we treated both of them as trustable.
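One possible shape of such a heuristic, sketched with hypothetical names and a made-up damping factor; it illustrates the idea, not the exact formula this PR implements:

```
import java.util.List;

// Hypothetical sketch of the idea, not the formula this PR implements.
final class JoinEstimate {
    record EqCondition(double leftNdv, double rightNdv, double rightRowCount) {
        // Trustable: one side is almost unique, e.g. a primary key whose
        // NDV is close to its row count.
        boolean trustable() {
            return rightNdv >= 0.9 * rightRowCount;
        }
    }

    static double estimateRows(double leftRows, double rightRows, List<EqCondition> conds) {
        double selectivity = 1.0;
        for (EqCondition c : conds) {
            double s = 1.0 / Math.max(c.leftNdv(), c.rightNdv());
            // Only a trustable equality takes full effect; treating an
            // untrusted one the same way (as before this fix) drives the
            // estimate far too low.
            selectivity *= c.trustable() ? s : Math.sqrt(s);
        }
        return leftRows * rightRows * selectivity;
    }
}
```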
Test result:
- On tpch100, q20 goes from 2.2 sec to 0.44 sec.
- No impact on other tpch queries.
- No performance impact on tpcds queries.
1. Don't write to table_statistics when creating a sync analyze job anymore, since it's meaningless.
2. Catch exceptions when creating each system analyze job, so that a single job creation failure doesn't fail the creation of all automatic collection jobs (see the sketch after this list).
3. Mark the job type of auto-triggered periodic jobs as system.
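A minimal sketch of item 2, with hypothetical names: catching failures per job inside the loop keeps one bad job from aborting the whole batch.

```
import java.util.List;
import java.util.logging.Logger;

// Hypothetical names; the real Doris job manager differs.
class AutoAnalyzeScheduler {
    private static final Logger LOG = Logger.getLogger("AutoAnalyzeScheduler");

    void createSystemJobs(List<TableRef> tables) {
        for (TableRef table : tables) {
            try {
                createAnalyzeJob(table); // may throw for a single table
            } catch (Exception e) {
                // Log and continue: one failing table must not prevent
                // the remaining automatic collection jobs from being created.
                LOG.warning("failed to create analyze job for " + table + ": " + e);
            }
        }
    }

    void createAnalyzeJob(TableRef table) throws Exception { /* stub */ }
}

record TableRef(String db, String name) {}
```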
Sometimes we encounter a Hive table with the parameter "transactional" = "true" whose format is Parquet, which is not supported.
So we need to check the input format of transactional tables.
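A minimal sketch of such a check, assuming hypothetical names; full-ACID transactional Hive tables are ORC-based, so a non-ORC input format is rejected:

```
import java.util.Map;

// Hypothetical sketch; the real check lives in the Doris Hive scan planning.
final class TransactionalHiveCheck {
    static void checkInputFormat(Map<String, String> tableParams, String inputFormat) {
        boolean transactional = "true".equalsIgnoreCase(tableParams.get("transactional"));
        // Full-ACID transactional Hive tables are ORC-based; a transactional
        // table backed by Parquet (or any non-ORC format) is not supported.
        if (transactional && !inputFormat.toLowerCase().contains("orc")) {
            throw new UnsupportedOperationException(
                    "transactional hive table with input format " + inputFormat + " is not supported");
        }
    }
}
```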
Converting query results into the MySQL format requires transforming columnar storage into row-based storage. This raises the question of choosing between sequential reading (iterating column by column) and sequential writing (iterating row by row); in practice, sequential writing is the better choice for performance.
In a test with 9M rows and more than 20 columns, this patch reduces the conversion time from 20s to 11s.
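A generic illustration of the loop-order choice (not the Doris implementation): the row-major loop makes writes to the destination sequential, which keeps the hot output buffer in cache.

```
// Generic sketch of column-to-row conversion; not the Doris code.
final class ColumnToRow {
    // Column-major loop: reads each column sequentially, but writes
    // jump between row buffers on every element.
    static void sequentialRead(long[][] columns, long[][] rows) {
        for (int c = 0; c < columns.length; c++) {
            for (int r = 0; r < columns[c].length; r++) {
                rows[r][c] = columns[c][r];
            }
        }
    }

    // Row-major loop: writes each output row sequentially; this is the
    // faster layout here because the hot destination stays in cache.
    static void sequentialWrite(long[][] columns, long[][] rows) {
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < columns.length; c++) {
                rows[r][c] = columns[c][r];
            }
        }
    }
}
```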
Fix a bug in left and full outer join with other conjuncts. When the number of equal-matched rows for a probe row exceeds batch_size, the _join_node->_is_any_probe_match_row_output flag is sometimes not set correctly, which results in outputting extra rows for that probe row.
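A generic sketch of the invariant behind the fix, with hypothetical types (the real code is in the BE): the flag must persist across output batches for the same probe row.

```
import java.util.List;

// Generic sketch of the invariant behind the fix; not the Doris BE code.
final class OuterJoinProbe {
    // For a left/full outer join with other conjuncts, the matches of one
    // probe row can span multiple output batches when they exceed batch_size.
    // The "any match output" flag must survive across those batches; resetting
    // it per batch emits an extra null-padded row for the probe row.
    static void outputProbeRow(Row probe, List<List<Row>> matchBatches, Emitter out) {
        boolean anyMatchOutput = false; // persists across all batches
        for (List<Row> batch : matchBatches) {
            for (Row match : batch) {
                if (out.otherConjunctsPass(probe, match)) {
                    out.emitJoined(probe, match);
                    anyMatchOutput = true;
                }
            }
        }
        if (!anyMatchOutput) {
            out.emitNullPadded(probe); // build side padded with nulls
        }
    }
}

record Row(Object[] values) {}

interface Emitter {
    boolean otherConjunctsPass(Row probe, Row match);
    void emitJoined(Row probe, Row match);
    void emitNullPadded(Row probe);
}
```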
1. Get the inverted index size before the segment writer's column writers are cleared, then add the size to the total data size and total index size.
2. Do the same in vertical compaction.
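A hypothetical sketch of the ordering (the real code is C++ in the BE segment writer): the index size must be read before the column writers are cleared.

```
import java.util.List;

// Hypothetical sketch; the real code is in the BE segment writer (C++).
final class SegmentFinalizer {
    long totalDataSize;
    long totalIndexSize;

    void finalizeColumns(List<ColumnWriter> columnWriters) {
        long invertedIndexSize = 0;
        // Read the inverted index size BEFORE the column writers are
        // cleared; after clear() the size information is unreachable.
        for (ColumnWriter w : columnWriters) {
            invertedIndexSize += w.invertedIndexFileSize();
        }
        columnWriters.clear();
        // Count the index bytes in both totals, as this PR does.
        totalDataSize += invertedIndexSize;
        totalIndexSize += invertedIndexSize;
    }
}

interface ColumnWriter {
    long invertedIndexFileSize();
}
```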
The metadata indexes in MaterializedIndexMeta were introduced in the 2.0-beta version, and writing inverted indexes relies on them. When upgrading Doris from an older version to 2.0-beta, the metadata indexes in MaterializedIndexMeta are empty, so no inverted index file will be written.
This PR adds compatibility logic: if the indexes in MaterializedIndexMeta are empty and the MaterializedIndexMeta's indexId equals the table's baseIndexId, use the metadata indexes from the table.
This PR also fixes missing metadata indexes during schema change.
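A minimal sketch of the compatibility fallback, with hypothetical type names standing in for the real FE classes:

```
import java.util.List;

// Hypothetical sketch of the fallback; the real Doris FE classes differ.
final class IndexCompatibility {
    static List<Index> resolveIndexes(MaterializedIndexMetaLike meta, OlapTableLike table) {
        List<Index> indexes = meta.getIndexes();
        // Metadata written before 2.0-beta has no per-index list. Fall back
        // to the table-level indexes, but only for the base index, which is
        // what the table-level list describes.
        if (indexes.isEmpty() && meta.getIndexId() == table.getBaseIndexId()) {
            return table.getIndexes();
        }
        return indexes;
    }
}

record Index(String name, List<String> columns) {}

interface MaterializedIndexMetaLike {
    long getIndexId();
    List<Index> getIndexes();
}

interface OlapTableLike {
    long getBaseIndexId();
    List<Index> getIndexes();
}
```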