Commit Graph

205 Commits

Author SHA1 Message Date
614a76beea [Doris on ES] Support compound_and predicate push down to Elasticsearch (#3277)
Relate Issue: https://github.com/apache/incubator-doris/issues/3248


SQL:

```
select * from test where (k2 = 6 and k3 = 1) or (k2 = 2 and k3 =3 and k4 = 'beijing');
```

Output filter:

```
((#k2:[6 TO 6] #k3:[1 TO 1]) (#(#k2:[2 TO 2] #k3:[3 TO 3]) #k4:beijing))~1
```

SQL:

```
select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and k3 =3 and (k4 = 'beijing' or k4 = 'zhaochun'));
```
Output filter:

```
(k2:[6 TO 6] k3:[7 TO 7] (#(#k2:[2 TO 2] #k3:[3 TO 3]) #((k4:beijing k4:zhaochun)~1)))~1
```

SQL:

```
select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and abs(k3) =3 and (k4 = 'beijing' or k4 = 'zhaochun'));
```

Output filter (`abs` can not be pushed down to es, so doris on es would not process this scenario ):

```
match_all
```
2020-04-08 21:09:39 +08:00
2ed184e06a Add config: tablet writer open rpc timeout (#3258) 2020-04-03 16:43:56 +08:00
63cee94c5c Fix output results may incorrect when using intersect and except statements (#3228)
output results may  incorrect  when using intersect and except statements
2020-04-01 20:58:43 +08:00
6a9a62901f Fix bug of memory limit when group by varchar columns. (#3242)
select date_format(k10, '%Y%m%d') as myk10 from baseall group by myk10;
The date_format function in query above will be stored in MemPool during
the query execution. If the query handles millions of rows, it will
consume much memory. Should clear the MemPool at interval.
2020-04-01 18:48:18 +08:00
5f9359d618 Use SleepFor() instead of usleep() (#3211) 2020-03-29 14:18:19 +08:00
e20d905d70 Remove unused KUDU codes (#3175)
KUDU table is no longer supported long time ago. Remove code related to it.
2020-03-24 13:54:05 +08:00
d4c1938b5c Open datetime min value limit (#3158)
the min_value in olap/type.h of datetime is 0000-01-01 00:00:00, so we don't need restrict datetime min in tablet_sink
2020-03-24 10:52:57 +08:00
dff3c0d57e Revert "Remove deep copy when doing hash table EvalRow (#3171)" (#3173) 2020-03-23 15:29:46 +08:00
wyb
dd8d748c55 Remove deep copy when doing hash table EvalRow (#3171)
remove varchar column deep copy in partitioned hash table EvalRow function
2020-03-21 09:52:49 +08:00
d29ed84b6a [Bug] Fix bug that right semi/anti join is not right (#3167)
This bug is introduced by PR: #3148.
right semi/anti join can not use `insert_unique` in build phase of join.
2020-03-20 20:58:55 +08:00
2dc995df7b [CodeStyle] Rename new_partition_aggregation_node and new_partitioned_hash_table (#3166) 2020-03-20 19:59:01 +08:00
5a8fcd263f [CodeStyle] Delete obsolete code of partition_aggregation_node and partitioned_hash_table (#3162) 2020-03-20 16:25:29 +08:00
c08d6e4708 [tablet meta] Do some refactor on TabletMeta (#3136)
remove some functions' return value which always return OLAP_SUCCESS
optimize some loops
2020-03-20 15:03:22 +08:00
2d3dbc2c42 Revert "[CodeStyle] Del obsolete code of partition_aggregation_node (#3154)" (#3160)
This reverts commit dae013d797c1c2c9e54246d5ace4bdd90b297d43.
2020-03-20 14:47:25 +08:00
5f004cb009 Revert "[CodeStyle] Remove unused PartitionedHashTable (#3156)" (#3159)
This reverts commit d3fd44f0a2fe076d2c62851babc162fcebe4d63b.
2020-03-20 14:42:40 +08:00
d3fd44f0a2 [CodeStyle] Remove unused PartitionedHashTable (#3156) 2020-03-20 12:19:08 +08:00
dae013d797 [CodeStyle] Del obsolete code of partition_aggregation_node (#3154) 2020-03-20 11:33:55 +08:00
f0db9272dd [Performance] Improve performence of hash join in some case (#3148)
improve performent of hash join  when build table has to many duplicated rows, this will cause hash table collisions and slow down the probe performence.
In this pr when join type is  semi join or anti join, we will build a hash table without duplicated rows.
benchmark:
dataset: tpcds dataset  `store_sales` and `catalog_sales`
```
mysql> select count(*) from catalog_sales;
+----------+
| count(*) |
+----------+
| 14401261 |
+----------+
1 row in set (0.44 sec)

mysql> select count(distinct cs_bill_cdemo_sk) from catalog_sales;
+------------------------------------+
| count(DISTINCT `cs_bill_cdemo_sk`) |
+------------------------------------+
|                            1085080 |
+------------------------------------+
1 row in set (2.46 sec)

mysql> select count(*) from store_sales;
+----------+
| count(*) |
+----------+
| 28800991 |
+----------+
1 row in set (0.84 sec)

mysql> select count(distinct ss_addr_sk) from store_sales;
+------------------------------+
| count(DISTINCT `ss_addr_sk`) |
+------------------------------+
|                       249978 |
+------------------------------+
1 row in set (2.57 sec)
```

test querys:
query1: `select count(*) from (select store_sales.ss_addr_sk  from store_sales left semi join catalog_sales  on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;`

query2: `select count(*) from (select catalog_sales.cs_bill_cdemo_sk from catalog_sales left semi join store_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;`

benchmark result:


||query1|query2|
|:--:|:--:|:--:|
|before|14.76 sec|3 min 16.52 sec|
|after|12.64 sec|10.34 sec|
2020-03-20 10:31:14 +08:00
b286f4271b Remove unused PreAggregtionNode (#3151) 2020-03-20 09:19:47 +08:00
d01b58bff6 Support 64 bit timestamp in from_unixtime (#3069)
Support 64 bit timestamp in from_unixtime
2020-03-17 17:30:42 +08:00
0959abc1dc [ExceptNode] Implement except node (#3056)
implement except node,
support  statement like:

``` 
select a from t1 except select b from t2
```
2020-03-17 10:54:40 +08:00
a80e9bf229 Fix broker scan node mem limit check (#3123) 2020-03-16 20:36:46 +08:00
dc07182bd4 [Intersect] Implements intersect node (#3034)
imlement of the intersect node
now can support statement like `select a from t intersect select b from t1 intersect select 1;`
2020-03-09 10:52:55 +08:00
1d296e907d Fix orc load timestamp bug (#3047)
The timestamp value load from orc file is error, the value has an offset with hive and spark.
Becuase the time zone of orc's timestamp is stored inside orc's stripe information, so the timestamp obtained here is an offset timestamp, so parse timestamp with UTC is actual datetime literal.
2020-03-06 18:03:27 +08:00
3b5a0b6060 [TPCDS] Implement the planner for set operation (#2957)
Implement intersect and except planner.
This CL does not implement intersect and except node in execution level.
2020-02-27 16:03:31 +08:00
e23d735bac Fix decimal bug in orc load (#2984) 2020-02-26 10:58:18 +08:00
a340bc7a00 Remove unused LLVM related codes of directory:be/src/runtime (#2910) (#2985)
Remove unused LLVM related codes of directory (step 4):be/src/runtime (#2910)

there are many LLVM related codes in code base, but these codes are not really used.
The higher version of GCC is not compatible with the LLVM 3.4.2 version currently used by Doris.
The PR delete all LLVM related code of directory: be/src/runtime
2020-02-25 13:47:20 +08:00
8eb413fa69 [Bug][RoutineLoad] Fix bug that routine Load encounter "label already used" exception (#2959)
This CL modify 2 things:

1. When a routine load task submit failed, it will not be put back to the task queue.
2. The rpc timeout when executing a routine load task in BE is set to `query_timeout` of the task plan.

ISSUE: #2964
2020-02-22 22:01:14 +08:00
35b09ecd66 [JDK] Support OpenJDK (#2804)
Support compile and running Frontend process and Broker process with OpenJDK.
OpenJDK 13 is tested.
2020-02-20 23:47:02 +08:00
839ec45197 Remove llvm relative code from be/src/exec (#2955)
Remove unused LLVM related codes of directory:be/src/exec (#2910)

there are many LLVM related codes in code base, but these codes are not really used.
The higher version of GCC is not compatible with the LLVM 3.4.2 version currently used by Doris.
The PR delete all LLVM related code of directory: be/src/exec.
2020-02-20 20:43:26 +08:00
1cf0fb9117 Use ThreadPool to refactor MemTableFlushExecutor (#2931)
1. MemTableFlushExecutor maintain a ThreadPool to receive FlushTask.
2. FlushToken is used to seperate different tasks from different tablets.
   Every DeltaWriter of tablet constructs a FlushToken,
   task in FlushToken are handle serially, task between FlushToken are
   handle concurrently.
3. I have remove thread limit on data_dir, because of I/O is not the main
   timer consumer of Flush thread. Much of time is consumed in CPU decoding
   and compress.
2020-02-18 18:39:04 +08:00
43583e7bd2 Fix orc load bug (#2912) 2020-02-16 19:14:42 +08:00
6c33f80544 Add disable_storage_page_cache config (#2890)
1. when read column data page:
    for compaction, schema_change, check_sum: we don't use page cache
    for query and config::disable_storage_page_cache is false, we use page cache
2. when read column index page
    if config::disable_storage_page_cache is false, we use page cache
2020-02-16 19:13:30 +08:00
fd492e3b6f [Doris on ES] Support escape character (#2865) 2020-02-13 11:32:48 +08:00
3c539aac54 [Refactor] Some tiny refactor on streaming-load related code (#2891)
Mainly contains the following modifications:
1. Use `std::unique_ptr` to replace some naked pointers
2. Modify some methods from member-method to local-static-function
3. Modify some methods do not need to be public to private
4. Some formatting changes: such as wrapping lines that are too long
5. Remove some useless variables
6. Add or modify some comments for easier understanding

No functional changes in this patch.
2020-02-13 10:42:52 +08:00
3e160aeb66 [GroupingSet] fix a bug when using grouping set without all column in a grouping set item (#2877)
fix a bug when using grouping sets without all column in a grouping set item will produce wrong value.
fix grouping function check will not work in group by clause
2020-02-12 21:50:12 +08:00
502fa2eb50 [GroupingSet] Fix core when using grouping sets in large data (#2858)
dst_tuples memory size to Allocate is wrong
2020-02-07 21:40:29 +08:00
b35e8153c0 [Doris on Es] Fix lte and gte error expression (#2851)
LE should LTE
GE should GTE
2020-02-06 20:52:14 +08:00
99ad56d1bf Support bitmap index for more type (#2630)
For #2589

1. date(uint24_t)/datetime(int64_t)/largeint(int128_t) use frame of reference code as dict.
2. decimal(decimal12_t) also uses frame of reference code as dict.
3. float/double use bitshuffle code as dict.
2020-01-31 21:09:29 +08:00
64e99f29e6 Fix parquet arrow read batch bug (#2812)
Fix parquet arrow read batch bug
#2811

The original code was to determine the number of rows in the batch based on the number of rows in the parquet RowGroup.But now it's a batch take 65535 lines. So when parquet row greater than 65535,the number of batch don't match the number of rowgroup. The code using the field "_current_line_of_group" as a position of array can cause the data to be out of array cause be crash
2020-01-21 10:57:56 +08:00
fc55423032 [SQL] Support Grouping Sets, Rollup and Cube to extend group by statement
Support Grouping Sets, Rollup and Cube to extend group by statement
support GROUPING SETS syntax 
```
SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a), (b), ( ) );
```
cube  or rollup like 
```
SELECT a, b,c, SUM( d ) FROM tab1 GROUP BY ROLLUP|CUBE(a,b,c)
```

[ADD] support grouping functions in expr like grouping(a) + grouping(b) (#2039)
[FIX] fix analyzer error in window function(#2039)
2020-01-17 16:24:02 +08:00
a36193dfab Support decimal and timestamp type in orc load (#2759) 2020-01-15 07:40:30 +08:00
1c9cfa7e0f Fix invalid to_bitmap input lead to BE core (#2706) 2020-01-08 22:14:37 +08:00
852046de29 Fix incompatibility with arm architecture in olap #2645 (#2682) 2020-01-07 19:16:10 +08:00
2326b478b6 Support load orc format in Apache Doris (#2554)
Support load orc format in Apache Doris
2020-01-07 14:22:43 +08:00
87a50070c4 Fix bug: parquet scanner don't seek (#2661) 2020-01-06 13:55:40 +08:00
1648226927 Adapt arrow 0.15 API (#2657)
This CL supports arrow's zero copy read interface, which can make code
comply with arrow 0.15.
And the schema change unit test has some problem, I disable it in run-ut.sh
2020-01-04 15:54:29 +08:00
d05768ffd4 Fix core when es_scanner_node exit (#2634) 2020-01-02 16:30:11 +08:00
4c5b0b6dc9 Remove VersionHash used to comparison in BE (#2622) 2019-12-31 19:38:45 +08:00
e41aef54f2 [Doirs-On-Elasticsearch] Accerlate first scroll search (#2575)
Add terminate_after for the first scroll to avoid decompress all postings list
2019-12-27 14:05:21 +08:00