When a schema change and a compaction execute simultaneously, both
nullable and non-nullable data can be read for the same column. We need to
reset `_nullmap` for each Block when converting Block data; otherwise the
resulting Column data will be wrong.
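A minimal sketch of the per-block reset, assuming a simplified Block/Column layout; `convert_block`, `Column`, and the fields below are illustrative stand-ins, not the actual Doris classes:
```
// Illustrative only: simplified stand-in for Doris Block/Column conversion.
#include <cstdint>
#include <vector>

struct Column {
    std::vector<int64_t> data;
    std::vector<uint8_t> null_map;  // 1 = null, 0 = not null
    bool nullable = false;
};

// Compaction may read blocks written both before and after the schema
// change, so the same column can arrive nullable in one block and
// non-nullable in the next. Rebuild the null map for every block.
void convert_block(const Column& src, Column& dst) {
    dst.null_map.assign(src.data.size(), 0);  // reset _nullmap for each Block
    if (src.nullable) {
        dst.null_map = src.null_map;          // keep the real null flags
    }
    dst.data = src.data;
}
```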
This PR mainly optimizes statistics tasks. It includes the following:
1. No longer generate statistics tasks for empty tables, and move the logic for skipping empty partitions into task generation.
2. Adjust the default statistics-related configuration to improve the efficiency of statistics collection; the parameters are `cbo_concurrency_statistics_task_num`, `statistic_job_scheduler_execution_interval_ms`, and `statistic_task_scheduler_execution_interval_ms` (see the sketch after this list).
3. Optimize the display of statistics tasks.
4. In addition, change some `org.apache.parquet.Strings` imports to `com.google.common.base.Strings` to avoid a "Strings cannot be found" exception in local debugging, among other minor changes.
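A sketch of how these knobs would be tuned in `fe.conf`; the values below are placeholders for illustration, not the new defaults set by this PR:
```
# Placeholder values only; this PR changes the defaults, but the exact
# numbers are not reproduced here.
cbo_concurrency_statistics_task_num = 10
statistic_job_scheduler_execution_interval_ms = 60000
statistic_task_scheduler_execution_interval_ms = 10000
```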
This config is never used online, and enabling it triggers bugs, so I removed the config and its related tests.
Co-authored-by: yiguolei <yiguolei@gmail.com>
For the string/varchar/text types, the length field in `ColumnMetaPB` is fixed to 2GB.
We do not actually have to allocate 2GB for every string value, because
`WrapperField::from_string()` reallocates exactly the amount of memory the
string needs:
```
Status from_string(const std::string& value_string, const int precision = 0,
                   const int scale = 0) {
    if (_is_string_type) {
        if (value_string.size() > _var_length) {
            Slice* slice = reinterpret_cast<Slice*>(cell_ptr());
            slice->size = value_string.size();
            _var_length = slice->size;
            _string_content.reset(new char[slice->size]);
            slice->data = _string_content.get();
        }
    }
    return _rep->from_string(_field_buf + 1, value_string, precision, scale);
}
```
Convert Parquet columns into Doris columns via a batch method.
In the previous implementation, only numeric types could be converted in batches;
other types could only be inserted one by one, which generates repeated virtual
function calls and container expansion.
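A rough sketch of the difference, with `ColumnVector`, `append_one`, and `append_batch` as simplified stand-ins for the real reader interfaces:
```
// Illustrative only: not the actual Doris/Parquet reader API.
#include <cstddef>
#include <cstring>
#include <vector>

struct ColumnVector {
    std::vector<int32_t> values;

    // One virtual call (and a possible reallocation) per row.
    virtual void append_one(int32_t v) { values.push_back(v); }

    // One call for the whole batch: grow once, copy contiguously.
    virtual void append_batch(const int32_t* src, size_t n) {
        size_t old = values.size();
        values.resize(old + n);
        std::memcpy(values.data() + old, src, n * sizeof(int32_t));
    }
    virtual ~ColumnVector() = default;
};

// Before: every value goes through a virtual call, and the container
// may expand repeatedly while the batch is appended.
void convert_row_by_row(ColumnVector& col, const int32_t* src, size_t n) {
    for (size_t i = 0; i < n; ++i) col.append_one(src[i]);
}

// After: the whole decoded batch is appended with a single call.
void convert_in_batch(ColumnVector& col, const int32_t* src, size_t n) {
    col.append_batch(src, n);
}
```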
Fix some logic in broker load when using the new file scanner with the Parquet format:
1. If columns are specified in the load stmt but none of them exist in the Parquet file,
an error like `err: No columns found in file` is thrown. See `parquet_s3_case4`.
2. If the first column of the table is not present in the file, the resulting number of
rows is wrong. See `parquet_s3_case8`.
3. If a column specified in `columns` in the load stmt exists in neither the file nor the
table, an error like `failed to find default value expr for slot: x1` is thrown. See `parquet_s3_case2`.
Currently, every prepare registers a runtime filter controller, so it takes the
mutex lock on the controller map. Initializing a bloom filter spends some time
in allocation and memset. If we run p1 tests with -parallel=20
-suiteParallel=20 -actionParallel=20, we get error messages like
'send fragment timeout 5s'.
The patch fixes the problem in the following two ways (see the sketch after this list):
1. Split the single mutex into 128 shards.
2. If a plan fragment does not have a runtime filter, it does not need to take
the locks at all.
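A minimal sketch of the sharded-lock idea, assuming a simplified controller map; `ShardedControllerMap`, `FilterController`, and `kNumShards` are illustrative names, not the actual Doris classes:
```
// Illustrative lock-striping sketch; not the real runtime filter controller code.
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct FilterController {};  // stand-in for the per-query runtime filter controller

class ShardedControllerMap {
public:
    static constexpr int kNumShards = 128;

    // Callers that have no runtime filters simply never call this,
    // so fragments without filters take no lock at all.
    std::shared_ptr<FilterController> get_or_create(int64_t query_id) {
        Shard& shard = _shards[static_cast<uint64_t>(query_id) % kNumShards];
        std::lock_guard<std::mutex> guard(shard.mtx);  // contention limited to one shard
        auto& ctrl = shard.map[query_id];
        if (!ctrl) ctrl = std::make_shared<FilterController>();
        return ctrl;
    }

private:
    struct Shard {
        std::mutex mtx;
        std::unordered_map<int64_t, std::shared_ptr<FilterController>> map;
    };
    Shard _shards[kNumShards];
};
```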