Commit Graph

4090 Commits

Author SHA1 Message Date
3d28de6e54 [Enhancement](like) fall back to re2 if hyperscan fails (#18350) 2023-04-09 09:18:13 +08:00
60c0bbe272 [fix](profile) fix show load query profile (#18487)
Sometimes, `show load profile` only shows part of an insert operation's profile.
This is because we assumed that for all load operations (including insert) there is only one fragment in the plan.
But actually, a plan may contain more than one fragment, e.g.:

`insert into tbl1 select * from tbl1 limit 1` will have 2 fragments.

This PR mainly changes:

1. Modify the path syntax of `show load profile` (see the sketch after this list):
   Before: `show load profile "/queryid/taskid/instanceid";`
   After:  `show load profile "/queryid/taskid/fragmentid/instanceid";`

2. Modify the display of `ReadColumns` in OlapScanNode.
    For a wide table, the `ReadColumns` line may be too long to display in the profile,
    so it is now wrapped and each line contains at most 10 column names.

3. Fix tvf not working with the pipeline engine, a follow-up to #18376.
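
A hedged usage sketch of the new profile path described in item 1, assuming the profile tree can be drilled into level by level; the IDs below are hypothetical placeholders, the real ones come from the output of the level above:

```sql
-- hypothetical ids, for illustration only
show load profile "/";                                      -- list load jobs / query ids
show load profile "/c84ba8f6";                              -- tasks of one query
show load profile "/c84ba8f6/c84ba8f6_0";                   -- fragments of one task
show load profile "/c84ba8f6/c84ba8f6_0/1";                 -- instances of one fragment
show load profile "/c84ba8f6/c84ba8f6_0/1/c84ba8f6_0_3";    -- one instance's profile
```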
2023-04-09 08:41:18 +08:00
fb50626075 [optimize](string) optimize concat function by SIMD memcpy (#18458)
Optimize the concat function (about 29% faster) by using memcpy_small_allow_read_write_overflow15.
Optimized string functions: concat, convert_to, mask, initcap, lower, upper.

The concat function is about 29% faster.
2023-04-08 17:05:34 +08:00
58bbd46c65 [Optimization](string) optimize constant empty string compare ( column='', column!='') (#18321)
Optimize constant empty string compare:
(1) When comparing with the constant empty string '' (size 0), we can compare the offsets directly with SIMD.

q10: SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
q11: SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
q12: SELECT SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
q13: SELECT SearchPhrase, COUNT(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
q14: SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
2023-04-08 16:04:10 +08:00
0517616242 [vectorized](function) support array_repeat function to be compatible with hive syntax (#18028)
---------

Co-authored-by: zhangyu209 <zhangyu209@meituan.com>
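
A minimal hedged usage sketch, assuming the Hive-compatible signature array_repeat(element, count):

```sql
-- expected to return ["a", "a", "a"]
select array_repeat('a', 3);
```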
2023-04-08 15:50:28 +08:00
0b8bc51b72 [fix](inverted index) Fix key column match query failed (#18436)
* [fix](inverted index) Fix key column match query failed

* [chore](regression case) add regression case

* [fix] fix regression case no order by
2023-04-08 15:45:08 +08:00
161678380c [bug](GC) fix the issue of incorrect disk usage (#18397) 2023-04-08 09:32:36 +08:00
d881d71cd1 [Bug](cast) Fix bug for cast function between datetimev2 and string (#18442)
Fix bug for cast function between datetimev2 and string
2023-04-07 22:02:15 +08:00
30f2abe5d3 [FIX](Map)fix calculate map offset in olap convertor (#18295)
Fix BE core dump when loading large kv data in one row for map.
2023-04-07 17:04:08 +08:00
e3ff2e3d21 [fix](file cache) Fix be core while use block/whole/sub file cache (#18440)
BE will core dump while using the whole/sub file cache.
The calls to CachedRemoteFileReader/WholeFileCache/SubFileCache::read_at_impl() did not pass an IOContext when reading the segment footer.
2023-04-07 16:39:59 +08:00
f6f4dac1d0 [Improvement](DECIMAL) Improve decimal operation (#18437) 2023-04-07 15:58:28 +08:00
308ff9a16f [enhancement](memory) tracking lru cache memory and page memory not in cache (#18361)
Track LRU cache memory in metrics.
Track page memory that is not in the cache in the mem tracker.
2023-04-07 14:22:44 +08:00
d36e9bd523 [chore](scan) Disable low cardinality optimization for compaction (#18424) 2023-04-07 14:19:11 +08:00
c32adba1cf [Refactor](Pipeline) Refactor pipeline code to improve coverage (#18376)
Refactor pipeline code to improve coverage
2023-04-07 13:09:44 +08:00
2b662ac26b [Fix](segment iterator) fix filter block size and filter size mismatch problem (#18395)
Add the result column id to _column_filter in _output_index_result_column.
2023-04-07 09:43:33 +08:00
4e1cdb9ce7 [fix](agg_sort) fix bug of agg sort group concat with order by (#18447) 2023-04-07 08:42:36 +08:00
759f1da32e [Enhancement](Backends) add HostName field in backends table and delete backends table in information_schema (#18156)
1. Add a `HostName` field for the `show backends` statement and the `backends()` tvf (see the sketch after this list).
2. Delete the `backends` table in the `information_schema` database.
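
A minimal hedged sketch of the two places where the new field shows up, assuming `backends()` is called with no arguments:

```sql
show backends;
select * from backends();  -- the result now includes a HostName column
```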
2023-04-07 08:30:42 +08:00
e848e456be [config] modify tablet_shard to 4 and add some log (#18416)
Modify the default value of the BE config tablet_map_shard_size to 4 to reduce lock contention.
Add a log when writing the disk test file fails, for debugging.
2023-04-06 17:18:16 +08:00
82248ab392 [FIX](complex-type) get_default to return real nested default value (#18413)
Make get_default return the real nested default value for nested types inside complex types.
2023-04-06 15:24:32 +08:00
591f76a6a4 [fix](alter inverted index) Temporary deal with add or drop inverted index by directly schema change (#18378)
In the current implementation of dynamically adding and dropping inverted indexes, the inverted index information of historical data becomes out of date after compaction on the base tablet.

In the future, I will submit PRs to solve this problem. For now, temporarily add or drop inverted indexes through the direct schema change logic.
2023-04-06 15:07:37 +08:00
550c8aa648 [Bug](DECIMALV3) fix wrong decimal scale returned by function round (#18375) 2023-04-06 14:44:21 +08:00
Pxl
76d76f672c [Chore](build) enhancement for backend build time usage (#18344) 2023-04-06 11:13:21 +08:00
4ca0c0face [fix](join) fix wrong result of right join (#18365)
When processing data in the hash table for right join and full outer join, if the output rows of one hash bucket exceed the batch size, the logic for continuing to process that bucket is wrong; it should differentiate between the different join types.
2023-04-06 10:55:58 +08:00
a01d824256 [Improvement](bloom filter) inline function call (#18396) 2023-04-06 10:21:48 +08:00
f28c75bd80 [fix](file_reader) bad_typeid when reading csv&json files (#18400)
When resolving the conflict with PR #18301, PR #18340 changed which file_reader is created, resulting in an [E-123] std::bad_typeid exception.
2023-04-06 10:00:29 +08:00
66a0c090b8 [fix](column) Add unimplemented replicate function in ColumnStruct (#18368) 2023-04-06 09:50:27 +08:00
47aa8a6d8a [fix](file_cache) turn on file cache by FE session variable (#18340)
Fix two bugs:
1. Enabling file caching requires both the `FE session` variable and the `BE` configuration (enable_file_cache=true) to be enabled (see the sketch after this list).
2. `ParquetReader` did not use `IOContext` previously, but `CachedRemoteFileReader::read_at` requires an `IOContext` after PR #17586.
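
A minimal sketch of enabling the cache on both sides, assuming the FE session variable carries the same name as the BE config mentioned above:

```sql
-- FE side: enable the session variable
set enable_file_cache = true;
-- BE side: also set enable_file_cache = true in be.conf and restart the BE
```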
2023-04-05 15:51:47 +08:00
7f8d92656e [fix](streamload) fix stream load failed when enable profile (#18364)
#18015 enabled the stream load profile log; however, BE will encounter an RPC failure when loading TPC-H data (see #18291). This is because when `is_report_success` is true, BE calls reportExecStatus to FE, but FE cannot find the QueryInfo in `coordinatorMap` and thus returns an error to BE.
2023-04-05 01:01:46 +08:00
e29fc3b46b [fix](chore) fix compile failed in JdbcExecutor and revert #18306 since be crash randomly (#18371)
Fix 2 problems:
1. PR #18187 used the resizeColumn API in JNINativeMethod, which was removed by #17960.
2. Revert PR #18306 to fix a pipeline core dump when loading.
2023-04-04 20:04:28 +08:00
66bfd18601 [opt](file_reader) add prefetch buffer to read csv&json file (#18301)
Co-authored-by: ByteYue <yj976240184@gmail.com>
This PR is an optimization for https://github.com/apache/doris/pull/17478:
1. Change the buffer size of `LineReader` to 4MB to align with the size of prefetch buffer.
2. Lazily prefetch data in the first read to prevent wasted reading.
3. S3 block size is 32MB only, which is too small for a file split. Set 128MB as default file split size.
4. Add `_end_offset` for prefetch buffer to prevent wasted reading.

The query performance of reading data on object storage is improved by more than 3x.
2023-04-04 19:05:22 +08:00
175e5d405c [improvement](merge-on-write) remove CHECK if lookup_row_key return unexpected status (#18326) 2023-04-04 12:42:07 +08:00
0cada3f81d [Enhancement](compaction) return error instead of core when ctx not valid (#18363) 2023-04-04 12:27:13 +08:00
54dbb4af67 [vectorized](jdbc) refactor jdbc table read array type (#18187)
When JDBC reads an array type, the result from Doris is a string, from PG it is java.sql.Array, and from CK it is java.lang.Object.
This makes the code difficult to maintain and read,
so change every database's array result to a string, then add a cast from string to the Doris array type.
2023-04-04 11:57:04 +08:00
418ea0a24e [fix](merge-on-write) fix that failed to capture_consistent_rowsets when full clone (#18346)
When full clone, if the max version of the local table is less than or equal to the max version of the clone table, there is no need to calculate the delete bitmap again.
2023-04-04 10:39:28 +08:00
50e6c4216a [vectorized](function) support date_trunc function truncate week mode (#18334)
Support date_trunc truncating to week, e.g.:
select date_trunc('2023-4-3 19:28:30', 'week');
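
An additional hedged example for a date that is not already the start of a week, assuming week truncation snaps to the preceding Monday (2023-04-03 was a Monday):

```sql
-- 2023-04-06 is a Thursday; expected result: 2023-04-03 00:00:00
select date_trunc('2023-04-06 19:28:30', 'week');
```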
2023-04-04 10:24:26 +08:00
a724443eb9 [Improvement](predicate) optimize short-circuit predicates (#18278)
For a scan node with no vectorized predicate, the input column for the first short-circuit predicate is dense, so we don't need to access the selector column.

This PR improves performance by ~30% on TPC-H Q3.
2023-04-04 10:21:41 +08:00
af80e65094 [Improve](FileCache) Support the file cache profile in olap scan node and update the profile (#17710)
We want to use the file cache for caching cold data in S3.
When reading that data, we want to know where it comes from and how long it takes to read,
so we add these metrics to the olap scan node.
To make the information clearer, the profile fields for these metrics are also updated.
2023-04-04 10:18:30 +08:00
8b85c55117 [vectorized](function) Support array_shuffle and shuffle function. (#18116)
---------

Co-authored-by: zhangyu209 <zhangyu209@meituan.com>
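
A minimal hedged usage sketch, assuming `shuffle` is an alias for `array_shuffle` and both return the input array's elements in a random order:

```sql
select array_shuffle([1, 2, 3, 4, 5]);   -- result order is random, e.g. [3, 1, 5, 2, 4]
select shuffle(['a', 'b', 'c']);
```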
2023-04-04 08:53:13 +08:00
eb0fd0017e [Fix](orc-reader) Fix incorrect scale of decimal columns when querying orc tables. (#18324)
The scale of a decimal column was incorrect when querying orc tables.
2023-04-04 08:50:47 +08:00
fc407f4afe [improvement](executor) Reduce ScannerCtx scheduling times (#18306)
* remove scheduling in scan operator
2023-04-03 22:54:34 +08:00
1e51af0784 [fix](scan) Avoid using incorrect cache code in ComparisonPredicate (#18332)
* [fix](scan) Avoid using incorrect cache code in ComparisonPredicate

* recover the regression test
2023-04-03 20:37:35 +08:00
dd78001cc1 [fix](memory) Fix memtable flush mem tracker #18330 2023-04-03 20:37:14 +08:00
b627088e8c [Optimization](String) Optimize q20 q21 q22 q23 LIKE_SUBSTRING (like '%xxx%') (#18309)
Optimize q20, q21, q22, q23 LIKE_SUBSTRING (like '%xxxx%'). The idea comes from the ClickHouse StringSearcher:

StringSearcher is about 10%~20% faster than the Volnitsky algorithm when the needle size is less than 10, by using the first two chars for the initial SIMD search.
StringSearcher is faster than the Volnitsky algorithm when the needle size is less than 21.
The changes are as follows:

Use the first two chars of the needle for the initial search. We can compare the two needle chars against haystack chars [n:n+17) with SIMD in one loop, so filter efficiency is higher.
When the environment supports SIMD, we use StringSearcher.
Test results in ClickBench:

q20 is about 15% faster.
q20: SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
q21 and q22 are about 1%~5% faster.
q21: SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
q22: SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT(*) AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
q23 is about 30%~40% faster, though not stable.
q23: SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
2023-04-03 18:09:15 +08:00
d4688620e9 [opt](array) optimize array_sortby using qsort instead of bubble sort #18311 2023-04-03 17:10:51 +08:00
368a2f7ace [Bug](decimal) Fix string to decimal (#18282) 2023-04-03 15:30:48 +08:00
6677841b7e [fix](merge-on-write) fix that failed to capture_consistent_rowsets when revise tablet meta (#18283)
_timestamped_version_tracker should be modified first, before capture_consistent_rowsets, when updating the delete bitmap in revise_tablet_meta.
2023-04-03 13:02:34 +08:00
961f5d1bb7 [feature](function)Add St_Angle/St_Azimuth function (#18293)
Add the St_Angle/St_Azimuth functions:
St_Angle:
Takes three points, which represent two intersecting lines, and returns the angle between these lines. Point 2 and point 1 represent the first line, and point 2 and point 3 represent the second line. The angle is in radians, in the range [0, 2pi), measured clockwise from the first line to the second line.

```
mysql> SELECT ST_Angle(ST_Point(1, 0),ST_Point(0, 0),ST_Point(0, 1));
+----------------------------------------------------------------------+
| st_angle(st_point(1.0, 0.0), st_point(0.0, 0.0), st_point(0.0, 1.0)) |
+----------------------------------------------------------------------+
| 4.71238898038469 |
+----------------------------------------------------------------------+
1 row in set (0.04 sec)
```

St_Azimuth:
Takes two points and returns the azimuth of the line segment formed by points 1 and 2. The azimuth is the angle in radians measured between the line from point 1 facing true North and the line segment from point 1 to point 2.
```
mysql> SELECT st_azimuth(ST_Point(0, 0),ST_Point(1, 0));
+----------------------------------------------------+
| st_azimuth(st_point(0.0, 0.0), st_point(1.0, 0.0)) |
+----------------------------------------------------+
| 1.5707963267948966 |
+----------------------------------------------------+
1 row in set (0.04 sec)
```
2023-04-03 13:01:59 +08:00
Pxl
e77833bfa1 [Bug](materialized-view) fix where clause persistence replay incorrect (#18228)
Fix incorrect replay of the persisted where clause.
2023-04-03 12:49:01 +08:00
94e3472050 [bug](function) fix count equal function return incorrect value (#18200)
Fix the count equal function returning an incorrect value.
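
A hedged sketch, assuming the commit refers to the array function `countequal(arr, value)`, which counts how many elements of `arr` equal `value`:

```sql
-- expected to return 2 (the value 2 appears twice in the array)
select countequal([1, 2, 3, 2], 2);
```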
2023-04-03 11:20:36 +08:00
7cd8f7c9ba [fix](grouping) fix coredump of grouping function for outer join (#18292)
The result of the functions grouping and grouping_id is always non-nullable, but an outer join will convert the result column to nullable when necessary, which causes a mismatch between the column type and the column object when executing the functions grouping and grouping_id.
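
A hedged sketch of the kind of query that hits this path, using hypothetical tables t1 and t2 where the outer join can make the grouped column nullable:

```sql
select t2.k, grouping(t2.k), count(*)
from t1 left outer join t2 on t1.k = t2.k
group by grouping sets ((t2.k), ());
```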
2023-04-03 09:35:31 +08:00