1. Expand the semantics of the variable `strict_mode` to control the behavior of stream load: if `strict_mode` is true, the stream load can only update existing rows; if `strict_mode` is false, the stream load can insert new rows when the key is not present in the table.
2. When a non-strict-mode stream load inserts a new row, the unmentioned columns must have default values or be nullable, as illustrated by the sketch below.
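For illustration, a minimal sketch of a table that satisfies rule 2 (the table and column names are hypothetical): every non-key column is either nullable or has a default, so a non-strict-mode stream load that carries only part of the columns can still insert new rows.

```sql
-- Hypothetical unique-key table: every column a load may leave unmentioned
-- is either nullable or has a default value.
CREATE TABLE user_profile (
    user_id BIGINT NOT NULL,
    name    VARCHAR(64) NULL,
    age     INT NULL,
    status  VARCHAR(16) DEFAULT 'active'
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES ('replication_num' = '1');
```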
Cost estimation can be more accurate if partition statistics are available. But when running against big data sets (e.g. 1 TB), we cannot realistically import the data just to collect statistics.
So now we want to extend this by injecting partition statistics manually.
Syntax:
```sql
ALTER TABLE table_name MODIFY COLUMN column_name SET STATS ('stat_name' = 'stat_value', ...)
[ PARTITION (partition_name) ];
```
Explanation:
- `table_name`: The table whose statistics are to be modified. It can be in `db_name.table_name` form.
- `column_name`: The target column. It must be a column that exists in `table_name`. Statistics can only be modified for one column at a time.
- `stat_name` and `stat_value`: The name of the statistic and its value. Multiple stats are comma-separated. Statistics that can be modified include `row_count`, `ndv`, `num_nulls`, `min_value`, `max_value`, and `data_size`.
- `partition_name`: The target partition. It must be a partition that exists in `table_name`. Multiple partitions are separated by commas (see the example below).
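Example (a hedged sketch based on the syntax above; the database, table, column, and partition names and the stat values are all hypothetical):

```sql
-- Inject statistics for column price of partition p202301, so the CBO can
-- estimate costs without importing the full data set.
ALTER TABLE db1.sales MODIFY COLUMN price SET STATS (
    'row_count' = '1000000',
    'ndv' = '50000',
    'num_nulls' = '100',
    'min_value' = '0.01',
    'max_value' = '9999.99',
    'data_size' = '8000000'
) PARTITION (p202301);
```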
Fix the issue that Hadoop short-circuit reading cannot be enabled in some environments.
- Revert #21430 because it causes a performance degradation issue.
- Add `$HADOOP_CONF_DIR` to `$CLASSPATH`.
- Remove the empty `hdfs-site.xml`, because in some environments it prevents Hadoop short-circuit reading from being enabled.
- Copy the Hadoop common native libs (taken from https://github.com/apache/doris-thirdparty/pull/98) and add them to `LD_LIBRARY_PATH`, because in some environments `LD_LIBRARY_PATH` does not contain the Hadoop common native libs, which prevents Hadoop short-circuit reading from being enabled.
Add a session variable and FE config `enable_strong_consistency_read` to solve the problem that load results may be briefly invisible to followers, meeting user requirements in strong-consistency read scenarios.
When enabled, the follower will sync the max journal id from the master and wait for replay to catch up.
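A minimal usage sketch of the new variable (the statement form follows standard session-variable syntax):

```sql
-- Enable strong consistency read for the current session: a follower FE
-- will sync the max journal id from the master and wait until replay
-- catches up before answering queries.
SET enable_strong_consistency_read = true;

-- Or enable it globally for all new sessions:
SET GLOBAL enable_strong_consistency_read = true;
```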
1. Fix a concurrency bug in the S3 fs benchmark tool to avoid crashes under multiple threads.
2. Add a `prefetch_read` operation to test the prefetch reader.
3. Add the `AWS_EC2_METADATA_DISABLED` env variable in `start_be.sh` to avoid calling the EC2 metadata service when creating the S3 client.
4. Add the `AWS_MAX_ATTEMPTS` env variable in `start_be.sh` to avoid warning logs from the S3 SDK.
- Exclude the old Netty version
- Upgrade spring-boot to 2.7.13
- Replace ojdbc6 with ojdbc8
- Upgrade Jackson to 2.15.2
- Upgrade fabric8 to 6.7.2
Fix the `Wrong data type for column` error that occurs when the column order in a Hive table is not the same as in the ORC file schema.
The root cause lies in the handling of the following case:
Tables in ORC format from Hive 1.x may contain system column names such as `_col0`, `_col1`, `_col2`... in the underlying ORC file schema, so the column names in the Hive table need to be used for the mapping.
### Solution
Currently, this issue is fixed by handling the above case only when the Hive version is specified as 1.x.x in the Hive catalog configuration:
```sql
-- 'type' and 'hive.metastore.uris' are required to actually create the
-- catalog; the metastore address below is illustrative.
CREATE CATALOG hive PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083',
    'hive.version' = '1.x.x'
);
```
This PR adds the function for collecting Hive statistics. When the CBO fetches Hive table statistics, the statistics cache first loads from the internal stats OLAP table. If not found there, it uses this PR's function to fetch the statistics from the remote Hive metastore.
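As a hedged illustration (assuming the `SWITCH` and `SHOW COLUMN STATS` statements, with hypothetical catalog, database, and table names), inspecting column statistics of a Hive table exercises this lookup order: a miss on the internal stats OLAP table falls through to the remote Hive metastore fetch added by this PR.

```sql
-- Switch to a Hive catalog and inspect column statistics; when the internal
-- stats OLAP table has no entry, the values are fetched from the remote HMS.
SWITCH hive;
SHOW COLUMN STATS db1.tbl1;
```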