Check the column type of complex types to prevent a core dump in BE. ColumnReader will throw a segmentation fault in the following case:
Change complex types in Hive:
```sql
hive> create table struct_test(
    id int,
    sf struct<f1: int, f2: map<string, string>>) stored as parquet;
hive> insert into struct_test values
    (1, named_struct('f1', 1, 'f2', str_to_map('1:s2,2:s2'))),
    (2, named_struct('f1', 2, 'f2', str_to_map('k1:s3,k2:s4'))),
    (3, named_struct('f1', 3, 'f2', str_to_map('k1:s5,k2:s6')));
hive> alter table struct_test change sf sf struct<f1:int, f2: string>;
```
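A minimal sketch of the kind of defensive check this implies, using hypothetical types and names rather than the actual Doris API: before building a ColumnReader for a complex column, verify that the table column type still matches the type stored in the parquet file, and fail gracefully instead of crashing.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical, simplified types -- not the real Doris classes.
struct TypeDescriptor { std::string name; /* children omitted */ };
struct ColumnReader {};

// Refuse to build a reader when the table schema and the schema stored in
// the parquet file disagree on a complex column.
std::unique_ptr<ColumnReader> create_complex_column_reader(
        const TypeDescriptor& table_type,
        const TypeDescriptor& file_type) {
    if (table_type.name != file_type.name) {
        // e.g. the table now says struct<f1:int, f2:string>, but the old
        // parquet file still stores struct<f1:int, f2:map<string,string>>.
        throw std::runtime_error(
                "column type mismatch between table schema and parquet file");
    }
    return std::make_unique<ColumnReader>();
}
```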
The UTF-8 format on Windows systems has a BOM. We add a new user property to `Outfile`/`Export`, so that when exporting Doris data, users can choose whether to write a BOM at the beginning of the CSV file.
**Usage:**
```sql
-- outfile:
select * from demo.student
into outfile "file:///xxx/export/exp_"
format as csv
properties(
"column_separator" = ",",
"with_bom" = "true"
);
-- Export:
EXPORT TABLE student TO "file:///xx/tmpdata/export/exp_"
PROPERTIES(
"format" = "csv",
"with_bom" = "true"
);
```
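For reference, writing the BOM amounts to emitting the three bytes `0xEF 0xBB 0xBF` before the first CSV row. A hedged sketch (the function and option names here are illustrative, not the actual Doris writer code):

```cpp
#include <ostream>

// Illustrative CSV writer fragment: when the (hypothetical) with_bom option
// is enabled, emit the UTF-8 BOM before any data.
void write_csv_prefix(std::ostream& out, bool with_bom) {
    if (with_bom) {
        static const char kUtf8Bom[] = {'\xEF', '\xBB', '\xBF'};
        out.write(kUtf8Bom, sizeof(kUtf8Bom));
    }
    // ...column data follows...
}
```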
Issue Number: close #29406
1. Increase the supported lzop version to 0x1040. This is set to 0x1040 only to allow decompressing lzo files compressed by newer versions of lzop; the decompression logic is unchanged. Strictly speaking, 0x1040 should also bring the `F_H_FILTER` feature, but that is mainly for audio and image data, so we do not support it (see the header-check sketch after this list).
2. Use `orc::lzoDecompress()` instead of `lzo1x_decompress_safe()` to decompress lzo data.
3. Use `crc32c::Extend()` instead of `lzo_crc32()`.
4. Use `olap_adler32()` instead of `lzo_adler32()`.
5. As a result, remove the dependency on Markus F.X.J. Oberhumer's lzo library.
6. Remove `DORIS_WITH_LZO`, so lzo files are supported by stream load and broker load by default.
7. Add some regression tests.
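A minimal sketch of the header check described in item 1; the constant names and the flag value are assumptions for illustration, not necessarily the ones used in the Doris lzo reader:

```cpp
#include <cstdint>
#include <stdexcept>

// Assumed constants for illustration.
constexpr uint16_t kMaxSupportedLzopVersion = 0x1040;
constexpr uint32_t F_H_FILTER = 0x00000800;  // lzop filter feature flag (assumed value)

// Reject files we cannot handle: anything requiring a version newer than
// 0x1040, or files that actually use the unsupported F_H_FILTER feature.
void check_lzop_header(uint16_t version_needed_to_extract, uint32_t flags) {
    if (version_needed_to_extract > kMaxSupportedLzopVersion) {
        throw std::runtime_error("unsupported lzop version");
    }
    if (flags & F_H_FILTER) {
        throw std::runtime_error("lzop F_H_FILTER feature is not supported");
    }
}
```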
This PR introduces 4 `bugprone-*` linter rules to `.clang-tidy`; these checks found some bugs in #28965. This PR also adds some comments to mute false-positive reports.
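As an illustration of how such false positives are typically muted (the check name and code below are only an example, not taken from this PR), a single-line suppression names the clang-tidy check:

```cpp
#include <cstring>

// clang-tidy warnings can be suppressed on one line with a NOLINT comment
// naming the check, when the reported pattern is actually intentional.
void copy_prefix(char* dst, const char* src) {
    // NOLINTNEXTLINE(bugprone-not-null-terminated-result)
    std::memcpy(dst, src, std::strlen(src));  // caller appends the terminator
}
```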
1. Skip parquet files that have only a 4-byte length, i.e. nothing but the `PAR1` magic (a small sketch follows this list).
2. Refactor the schema init method of iceberg/hudi/hive tables in the HMS catalog:
   1. Remove some redundant `getIcebergTable` methods.
   2. Fix the issue described in #23771.
3. Support `HoodieParquetInputFormatBase`, treating it as the normal hive table format.
4. When listing files, skip all hidden dirs and files.
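A hedged sketch of the 4-byte check from item 1 (function and constant names are illustrative):

```cpp
#include <cstdint>

// A parquet file both starts and ends with the 4-byte magic "PAR1". A file
// that is only 4 bytes long therefore contains no footer and no row groups,
// so it can be skipped instead of being parsed and failing later.
bool should_skip_parquet_file(uint64_t file_size) {
    constexpr uint64_t kParquetMagicLen = 4;  // strlen("PAR1")
    return file_size <= kParquetMagicLen;
}
```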
Before this PR, Paimon created the schema of `VectorTable` by accessing meta information. However, once the schema of `VectorTable` on the Java side differed from the `Block` on the C++ side, BE would crash, and there was no good way to troubleshoot the error.
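To illustrate the failure mode, here is a hedged sketch of the kind of defensive validation such a mismatch calls for; all names are hypothetical and this is not the actual Doris/Paimon JNI code: compare the schema coming from the Java side with the C++ `Block` before filling it, and report a readable error instead of crashing.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified column descriptors for illustration only.
struct JavaColumnDesc { std::string name; std::string type; };
struct BlockColumnDesc { std::string name; std::string type; };

// Fail with a descriptive message instead of letting a mismatched column
// layout crash the BE later during deserialization.
void check_schema_compatible(const std::vector<JavaColumnDesc>& java_schema,
                             const std::vector<BlockColumnDesc>& block_schema) {
    if (java_schema.size() != block_schema.size()) {
        throw std::runtime_error("column count mismatch between VectorTable and Block");
    }
    for (size_t i = 0; i < java_schema.size(); ++i) {
        if (java_schema[i].type != block_schema[i].type) {
            throw std::runtime_error("type mismatch on column " + java_schema[i].name);
        }
    }
}
```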
`ScannerContext` would schedule scanners even after it was stopped, and it confused `_is_finished` with `_should_stop`. This PR only fixes the concurrency bugs that occur when a scanner is stopped or finished, as reported in https://github.com/apache/doris/pull/28384.
VScanNode::get_next checks whether the ScanNode has reached its limit condition and sends eos to the TaskScheduler, which then tries to close the ScanNode. However, the ScanNode must wait for all running scanners to finish, so even if it has reached the limit condition it cannot be closed immediately. This PR tries to interrupt the running readers so that the ScanNode can end as soon as possible.
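A minimal sketch of the interruption idea, assuming a hypothetical shared stop flag rather than the actual Doris member names: the ScanNode sets the flag when the limit is reached, and each running reader checks it between batches so it can exit early.

```cpp
#include <atomic>

// Hypothetical scanner read loop (not the actual Doris code): check a shared
// stop flag between batches so a running reader can exit early once the
// ScanNode has reached its limit and no longer needs more data.
void scanner_read_loop(std::atomic<bool>& should_stop) {
    bool eof = false;
    while (!eof && !should_stop.load(std::memory_order_acquire)) {
        // eof = read_next_batch(...);   // produce one batch of rows
        eof = true;                       // placeholder so the sketch terminates
    }
}

// Called by the ScanNode once the limit condition is reached.
void interrupt_running_readers(std::atomic<bool>& should_stop) {
    should_stop.store(true, std::memory_order_release);
}
```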
1. Max Compute partition prune: previously we only supported filtering MaxCompute partitions by `=`, which can select just one partition. To support multiple partition filters and range operators (`>`, `<`, `>=`, ...), partition pruning needs to be supported.
2. Add a MaxCompute row count cache and a partitionValues cache.
3. Add MaxCompute regression cases.
Fix a null map issue in the parquet reader that caused incorrect results for functions such as `min()` and `max()`.
To avoid copying, the null map is shared between the parquet-converted src column and the dst column. The tricky part is that this calls the mutable accessor `doris_nullable_column->get_null_map_column_ptr()`, which sets `_need_update_has_null = true`, while some operations such as aggregation call `has_null()`, which sets `_need_update_has_null = false`.
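A simplified sketch of the flag interaction described above, using a toy nullable column rather than the real `ColumnNullable`: handing out the mutable null-map pointer marks `has_null` as stale, but a later `has_null()` call clears that mark, so modifications made afterwards through the shared pointer are no longer reflected.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Toy model of a nullable column to illustrate the _need_update_has_null
// interaction; this is not the real Doris ColumnNullable.
struct ToyNullableColumn {
    std::shared_ptr<std::vector<uint8_t>> null_map =
            std::make_shared<std::vector<uint8_t>>();
    mutable bool _has_null = false;
    mutable bool _need_update_has_null = true;

    // Mutable accessor: the caller may modify the null map through the
    // returned pointer, so mark has_null as needing recomputation.
    std::shared_ptr<std::vector<uint8_t>>& get_null_map_column_ptr() {
        _need_update_has_null = true;
        return null_map;
    }

    // Recomputes lazily, then clears the flag. If the shared null map is
    // mutated after this call, _has_null silently goes stale -- the kind of
    // inconsistency that produced wrong min()/max() results.
    bool has_null() const {
        if (_need_update_has_null) {
            _has_null = false;
            for (uint8_t v : *null_map) {
                if (v) { _has_null = true; break; }
            }
            _need_update_has_null = false;
        }
        return _has_null;
    }
};
```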