This patchset applies the following changes:
using vertical compaction machanism to do segcompaction
basic (WIP) refraction to separate segcompaction logic from BetaRowsetWriter
add segcompaction specific ut and regression tests
Daemon threads in doris_main.cpp will upload tablet metrics periodically, which will use StorageEngine::instance(). However loading file cache is a process in main thread, when it takes a lot of time to load file cache, StorageEngine::instance() will be a null pointer in daemon threads.
* [enhancement](execute model) using thread pool to execute report or join task instead of staring too many thread
Doris will start report thread and join thread during fragment execution. There are many problems if create and destroy thread very frequently. Jemalloc may not behave very well, it may crashed.
jemalloc/jemalloc#1405
It is better to using thread pool to do these tasks.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Make database, table, column and other names support unicode by changing LABEL_REGEX COMMON_NAME_REGIEX COMMON_TABLE_NAME_REGEX COLUMN_NAME_REGEX regular expressions in class FeNameFormat.
P.S. @SharpRay has transfered PR #13467 to me, and I‘m responsible for the task now. There will be some modifications during the review period, so I create a new PR and the original #13467 could be closed. Thanks.
bug: some chinese word not sort by pinyin in GBK coding
CREATE TABLE `test_convert` (
`a` varchar(100) NULL
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY HASH(`a`) BUCKETS 3
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
insert into test_convert values("b"), ("a"), ("c"), ("睿"), ("多"), ("丝");
Query OK, 6 rows affected (0.03 sec)
{'label':'insert_ca73a6acc2194d5b_888218a3949355a6', 'status':'VISIBLE', 'txnId':'18068'}
mysql [test]>select * from test_convert;
+------+
| a |
+------+
| a |
| c |
| 丝 |
| b |
| 多 |
| 睿 |
+------+
6 rows in set (0.01 sec)
mysql [test]>select * from test_convert order by convert(a using gbk);
+------+
| a |
+------+
| a |
| b |
| c |
| 多 |
| 丝 |
| 睿 |
+------+
6 rows in set (0.01 sec)
1. Make sure all sub types which STRUCT supported work correctly;
2. remove unused variable `_need_validate_data`;
3. lazy init min or max decimal to support nested DecimalV2 column validate;
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
For Unique Key MoW table, if there are duplicate keys in one single load job and there's multiple segments, we need to calculate delete bitmap to mark these duplicate keys deleted.
Add a check here to detect any bugs that might cause duplicate keys.
Different version numbers are used to calculate the delete bitmap between segments and rowsets, resulting in the failure of the last update of the delete bitmap.
Enhance aggregate function `collect_set` and `collect_list` to support optional `max_size` param,
which enables to limit the number of elements in result array.
decode method is only used for big int and other decode method is only used in unit test.
I remove the useless method and we can remove mempool parameter from decode method.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
The logic of topn and full sort is wrong when there are both offsets and limits, the offset is not considered when doing the max heap optimization, which will lead to wrong result.
* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint.
`_simdjson_set_column_value` could become a hot spot while parsing json in simdjson mode,
introduce `_prev_positions` to cache results for previous row (keyed as index in JSON object) due to the json name field order,
should be quite the same between each lines
* fix case
function pushdown: #10355
NGram BloomFilter Index apply like pushdown: #11579
Enabled by default, make sure it stays active.
If NGram BloomFilter Index is not used, this like pushdown can be replaced by #15917, which can push down all expressions including like.
use StringRef instead of string_view operator == for vectorized impl for array_contains function.
- test data: 10,000,000 rows with a ARRAY<STRING> column. There are 10 elements, average length 11 chars, in the array column in each row.
- test SQL: `select count() from test_like_array where array_contains(s_arr, 'xxxxxxxx');`
- test result: 0.76 sec vs. 0.52 sec, 30% time reduced
A const reference member variables as class member stores a temporary object, which cannot be got after the temporary object being destroyed, cause be core dump while enable debug level log
_broker_addr has been destroyed in BrokerFileReader
add a defer in fold constant to close.
add more type when call _get_result function in fold constant.3.
fix in can't handle null. eg:select 1 in (2, NULL, 1);
in java udf jni_ctx will be nullptr, so call close will be core dump.
Describe your changes.
1. Enhencement:
For single-charset column separator,csv_reader use another method of `split value`.
2. BugFix
Set `json` file format loading to be sensitive.
Support parsing map&struct type in parquet&orc reader.
## Remaining Problems
1. Doris use array type to build the key and value column of a `map`, but doesn't fill the offsets in value column, so the offsets in value column is wasted.
2. Parquet support reading only key or value column in `map`, this PR hasn't supported yet.
3. Parquet support reading partial columns in `struct`, this PR hasn't supported yet.
Orc doesn't fill null values in new batch, but the former batch has been release.
Other types like int/long/timestamp... are flat types without pointer in them,
so other types do not need to be handled separately like string.
This pr implements the list default partition referred in related #15507.
It's similar as GreenPlum's default's partition which would store all data not satisfying prior partition key's
constraints and optimizer wouldn't filter default partition which means default partition would be scanned
each time you try to select data from one table with default partition.
User could either create one table with default partition or alter add one default partition.
```sql
PARTITION LIST(key) {
PARTITION p1 values in (xx,xx),
PARTITION DEFAULT
}
ALTER TABLE XXX ADD PARTITION DEFAULT
```
We don't support automatically migrate data inside default partition which meets newly added partition key's
constraint to newly add partition when alter add new partition. User should select default partition using new
constraints as predicate and insert them to new partition.
```sql
insert into tbl select * from tbl partition default where partition_key=xx;
```