Commit Graph

4865 Commits

Author SHA1 Message Date
Pxl
45f1909bc3 [Bug](lateral-view) make lateral view function's nullable mode work (#21242)
make lateral view function's nullable mode work
2023-06-29 10:50:07 +08:00
7f0e37069f [improvement](olap) filter the whole segment by dictionary (#21239) 2023-06-29 10:34:29 +08:00
3f99b91ddf [fix](gc_binlog) Fix tablet gc_binlogs nullptr (#21158) 2023-06-29 10:10:33 +08:00
Pxl
f8cfe5e579 [Bug](pipeline) add DCHECK for _instance_to_sending_by_pipeline = false on _send_rpc (#21169)
add DCHECK for _instance_to_sending_by_pipeline = false on _send_rpc
2023-06-29 10:03:57 +08:00
86af533e83 [Enhancement](heartbeat) make heartbeat ok when config repeated host-ip pairs (#21228) 2023-06-28 23:12:06 +08:00
a6b51ec19a [Feature](avro) Support Apache Avro file format (#19990)
Support reading Avro files with hdfs() or s3().
```sql
select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+
```

The current Avro reader supports only common data types; complex data types will be supported later.
2023-06-28 21:15:35 +08:00
d2c42ec638 [fix](memory) Purge Jemalloc arena dirty pages when memory insufficient (#21237)
Jemalloc dirty pages are only marked with madvise MADV_FREE, so the memory is not released back to the system and RSS does not drop promptly.

So when the process memory exceeds its limit or the system's available memory is insufficient,
manually transfer dirty pages to muzzy pages, which calls MADV_DONTNEED to release the physical memory back to the system.

https://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms
2023-06-28 16:49:45 +08:00
0396f78590 [fix](memory) Remove ChunkAllocator & fix Allocator no use mmap (#21259) 2023-06-28 16:10:24 +08:00
3304af848e [Fix](storage)read page cache when seek #21272
Currently, when a column iterator is used for a seek, the page cache is not set;
when the same iterator is later used to read data, the page cache cannot be used.
2023-06-28 15:53:40 +08:00
e348b9464e [scan](freeblocks) use ConcurrentQueue to replace vector for free blocks (#21241) 2023-06-28 15:10:07 +08:00
a4fdf7324a [Bug](javaudf) fix BE crash if javaudf is push down (#21139) 2023-06-28 15:01:24 +08:00
Pxl
1fc1e76fc7 [Bug](alter table) return error status to avoid core dump on schema change meet invalid input (#21273)
return error status to avoid core dump on schema change meet invalid input
2023-06-28 14:20:16 +08:00
21b30820fd [fix](partial-update) fix a coredump in commit_phase_update_delete_bitmap (#21254) 2023-06-28 11:47:07 +08:00
de9172e476 [enhancement](merge-on-write) replace map with vector for segment handle caches (#21162) 2023-06-28 11:33:02 +08:00
5d1fb33f2d [enhancement](merge-on-write) increasing the max_write_buffer_number parameter to improve save meta performance (#21243) 2023-06-28 11:32:11 +08:00
b1e973b721 [Improve](func)support array to window-func first-last-value arg type (#21201)
* support array to window-func first-last-value arg type

* add regress test for first-last-value of array type

* update

* format be:
2023-06-28 10:02:00 +08:00
db50face41 [fix](time_zone) be compatible with doris old version for CST time_zone when load orc file in broker load (#21263)
Fix the error in broker load with ORC files when time_zone is CST. The error message is "Failed to create orc row reader. reason = Can't open /usr/share/zoneinfo/CST".
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2023-06-28 09:44:42 +08:00
92882ebd91 [fix](inverted index) update output rowset index meta with input rowset when drop inverted index (#21248) 2023-06-27 23:54:35 +08:00
d545e00bc7 [improve](error) include detailed messages in rowset reader init error (#21229) 2023-06-27 20:45:14 +08:00
4061783674 [Fix](invert index)fix s3 failed to check the directory (#21232) 2023-06-27 20:01:46 +08:00
7c569fd9db [fix](s3_writer) init member's value to avoid undefined behavior (#21233) 2023-06-27 20:01:20 +08:00
29b3d39561 [enhancement](memory) print stacktrace for large allocation (#21069) 2023-06-27 19:39:51 +08:00
609410d82b [opt](hashmap) memset the hashmap memory to improve performance (#21225) 2023-06-27 19:30:57 +08:00
c470bf56a5 [chore](build) Fix compilation errors reported by GCC-13 (#21215)
Add missing headers to fix the compilation errors reported by GCC-13.
2023-06-27 17:04:44 +08:00
ec0e398c50 [enhancement](merge-on-write) record precise primary key index size (#21196) 2023-06-27 16:50:09 +08:00
Pxl
70ddf64126 [Chore](agg-state) add documentation about agg_state, add group_concat agg_state test case (#21147)
add documentation about agg_state, add group_concat agg_state test case
2023-06-27 11:28:19 +08:00
e0b20f0437 [feature](function) add ip function ipv4numtostring (alias inet_ntoa) (#20936) 2023-06-27 10:17:40 +08:00
b2dc4a8cb9 [Fix](inverted index) check inverted index file existence befor data compaction (#21173) 2023-06-26 19:55:55 +08:00
50c1d55769 [Improve](dynamic schema) support filtering invalid data (#21160)
* [Improve](dynamic schema) support filtering invalid data

1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing of multi-dimensional arrays by default, since some bugs remain unresolved.
2023-06-26 19:32:43 +08:00
5fdd9b9254 [Bug](RuntimeFiter) Fix bf error change the murmurhash to crc32 in regression test p2 (#21167) 2023-06-26 16:39:45 +08:00
960e04b0ed [fix](inverted index) fix build inverted index failed but not return immediately (#21165) 2023-06-26 14:05:12 +08:00
66005570c9 [fix](regression) fix p1 test_backup_restore fail caused by http download 401 invalid token error #21107 2023-06-26 12:56:46 +08:00
1dec592e91 [improvement](fs_bench) optimize the usage of fs benchmark tool for hdfs (#21154)
Optimize the usage of fs benchmark tool:

1. Remove the `Open` benchmark; it is useless.
2. Remove the `Delete` benchmark; it is dangerous.
3. Add the `SingleRead` benchmark; the user can specify an existing file to test the read operation:

    `sh bin/run-fs-benchmark.sh --conf=conf/hdfs_read.conf --fs_type=hdfs --operation=single_read`

4. Modify `run-fs-benchmark.sh`: remove the `OPTS` section and use the options in `fs_benchmark_tool` directly.
5. Add some custom counters to the benchmark result, e.g.:

```
--------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              6864 ms         2385 ms            1 ReadRate=200.936M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              3919 ms         1828 ms            1 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1              3839 ms         1819 ms            1 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_mean         4874 ms         2011 ms            3 ReadRate=304.054M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_median       3919 ms         1828 ms            3 ReadRate=351.96M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_stddev       1724 ms          324 ms            3 ReadRate=89.3768M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_cv          35.37 %         16.11 %             3 ReadRate=29.40%
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_max          6864 ms         2385 ms            3 ReadRate=359.265M/s
HdfsReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_min          3839 ms         1819 ms            3 ReadRate=200.936M/s
```

- For `open_read` and `single_read`, add `ReadRate` as `bytes per second`.
- For `create_write`, add `WriteRate` as `bytes per second`.
- For `exists` and `rename`, add `ExistsCost` and `RenameCost` as `time cost per one operation`.
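
The aggregate rows in the benchmark output above (`_mean`, `_median`, `_stddev`, `_cv`) can be cross-checked from the three per-repetition times. The sketch below is only an illustration and is not part of the commit. It assumes the `_stddev` row is a sample standard deviation (n - 1 denominator) and that `_cv` is `stddev / mean`; the variable names are hypothetical.

```python
import statistics

# Wall-clock times (ms) of the three repetitions, copied from the "Time" column above.
times_ms = [6864, 3919, 3839]

mean = statistics.mean(times_ms)
median = statistics.median(times_ms)
# Assumption: the benchmark reports the sample standard deviation (n - 1 denominator).
stddev = statistics.stdev(times_ms)
# Assumption: the coefficient of variation is stddev divided by mean.
cv = stddev / mean

print(f"mean={mean:.0f} ms  median={median} ms  stddev={stddev:.0f} ms  cv={cv:.2%}")
```

Under these assumptions the computed values agree with the `4874 ms`, `3919 ms`, `1724 ms` and `35.37 %` rows shown in the table. The large spread comes from the first repetition, which is much slower than the other two.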
2023-06-26 11:37:14 +08:00
2e6d91aa99 [chore](block) temporarily disable DCHECK for column name equality in MutableBlock (#21116)
* temporarily disable DCHECK for column name equality in MutableBlock::add_rows

* num columns EQ to LE
2023-06-26 10:49:27 +08:00
28abeef72b [performace](colddata) opt cold data read performance (#21141)
In #10370, we tried to optimize string evaluation performance by rewriting the predicate using dict values. That requires checking whether the string column is fully dict-encoded, so logic was added to read the last page of the string column to check it.

However, this performs poorly on cold data because it has to load the column's ordinal index and zone map index. For example, in `select * from table where pk_col=1` the query condition is on the primary key, so the result may be only a few rows but can span 100 columns, and loading these indices takes a long time. Much of that time shows up in block_init_time.

In my test on a table with 50 string columns, queried by primary key, the first read time dropped from 220ms to 40ms.
2023-06-26 10:39:20 +08:00
6f7759b08d [fix](memory) fix mem tracker grace exit (#21136) 2023-06-26 10:28:24 +08:00
1ac8cdec7e [Fix](inverted index) fix inverted query cache for chinese tokenizer (#21106)
1. The query cache for the Chinese tokenizer is confusing when just converting w_char to char.
2. Separate query_type from inverted_index_reader to clean up the code.
2023-06-25 22:04:02 +08:00
76bdcf1d26 [improvement](pipeline) task group scan entity (#19924) 2023-06-25 14:43:35 +08:00
d49c412c59 [Feature](multi-catalog) Add hdfs benchmark tools. (#21074) 2023-06-25 09:35:27 +08:00
601120db04 [Bug](pipeline) access map may cause coredump in sink buffer (#21108) 2023-06-24 23:03:59 +08:00
691a988c97 [enhancement](merge-on-write) add async publish task when version is discontinuous for merge on write table when clone (#21025)
Version discontinuity may occur during clone. To handle this case, add an async publish task when the version is discontinuous.
2023-06-22 21:50:14 +08:00
a33521b2ce [enhancement](exchange) add filter for exchange node in BE (#21087) 2023-06-22 01:04:47 +08:00
49bbe88327 [fix](log) fix the too large warning log of BE (#21027) 2023-06-22 00:39:04 +08:00
3dfeee3946 [fix](typesystem) fix wrong return type argument cause type check fail (#21082) 2023-06-22 00:04:46 +08:00
2c9bdd64fa [fix](memory) arena support memory reuse after clear() (#21033) 2023-06-21 23:27:21 +08:00
2ce8cfbebd [profile](sort) add some metrics in profile (#21056) 2023-06-21 22:57:46 +08:00
661e1ae7c5 [fix](memory) no switch bthread context in UBSAN compile (#21064)
When compiled with UBSAN, all memory is tracked to the orphan (unknown) mem tracker, and the bthread context and mem tracker are no longer switched.

The supplementary fixes are as follows: #20999
2023-06-21 21:14:07 +08:00
b2c4e51be1 [fix](load) delete lazy open DCheck when unkown load id (#21083) 2023-06-21 20:42:31 +08:00
18a0824eb3 [fix](compaction)Modify time series compaction policy default config (#21079) 2023-06-21 20:29:58 +08:00
442a734ef5 [improvement](config) update be config max_runnings_transactions_per_txn_map default value (#21060) 2023-06-21 20:29:13 +08:00