153 Commits

Author SHA1 Message Date
6815979ba5 Fix BE core dump caused by invalid to_bitmap input (#2510) 2019-12-19 21:28:00 +08:00
49b8097495 Fix the core dump of get_next in exchange node (#2505)
The _input_batch is not initialized in the exchange node.
As a result, the BE may try to read the capacity of _input_batch before initializing it, which is undefined behavior.
The issue is #2504
2019-12-19 16:40:33 +08:00
d31f774852 Add block split bloom filter (#2471)
[STORAGE][SEGMENTV2]

    use block split bloom filter
    build bloom filter against data page
    add distinct value to bloom filter
    add ordinal index to bloom filter index
2019-12-18 12:57:44 +08:00
5a3f71dd6b Push limit to Elasticsearch external table (#2400) 2019-12-07 21:13:44 +08:00
a3b7cf484b Set the load channel's timeout to be the same as the load job's timeout (#2405)
[Load] 

When performing a long-running load job, the following error may occur and cause the load to fail:

load channel manager add batch with unknown load id: xxx

One cause of this error is that Doris opens an unrelated channel during the load
process. This channel receives no data during the entire load process, so it is
released after a fixed timeout.

After the entire load job completes, Doris tries to close all open channels. When it tries to
close this channel, it finds that the channel no longer exists and reports an error.

This CL passes the load job's timeout to the load channel, so that the load channel's timeout
is the same as the load job's.
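
The failure mode and the fix can be illustrated with a minimal sketch. The `LoadChannel` and `LoadChannelManager` classes below are hypothetical stand-ins written for this example, not the BE's actual classes; time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class LoadChannel:
    """Hypothetical channel record; timeout_s is inherited from the load job."""
    load_id: str
    timeout_s: float
    last_updated: float

    def is_expired(self, now: float) -> bool:
        return now - self.last_updated > self.timeout_s


class LoadChannelManager:
    """Hypothetical manager that releases idle channels and closes them on demand."""

    def __init__(self):
        self._channels = {}

    def open(self, load_id: str, job_timeout_s: float, now: float) -> None:
        # The channel uses the load job's timeout instead of a fixed one.
        self._channels[load_id] = LoadChannel(load_id, job_timeout_s, now)

    def sweep(self, now: float) -> list:
        """Release channels that have been idle longer than their timeout."""
        expired = [k for k, ch in self._channels.items() if ch.is_expired(now)]
        for key in expired:
            del self._channels[key]
        return expired

    def close(self, load_id: str) -> None:
        if load_id not in self._channels:
            raise KeyError(f"unknown load id: {load_id}")
        del self._channels[load_id]
```

With a short fixed timeout, an idle channel is swept before the job ends and the final `close` fails; with the job's own timeout, the channel is still present when the job closes it.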
2019-12-06 21:51:00 +08:00
a46bf1ada3 [Authorization] Modify the authorization checking logic (#2372)
**Authorization checking logic**

There are some problems with the current password and permission checking logic. For example:
First, we create a user by:
`create user cmy@"%" identified by "12345";`

Then 'cmy' can log in with password '12345' from any host.

Second, we create another user by:
`create user cmy@"192.168.%" identified by "abcde";`

"192.168.%" has a higher priority in the permission table than "%". So when "cmy" tries
to log in with password "12345" from host "192.168.1.1", the login should match the second permission
entry and be rejected because the password is invalid.
But in the current implementation, Doris continues to check the password against the first entry and then lets the login pass. So we should change it.

**Permission checking logic**

After a user logs in, the user should have a unique identity, which is obtained from the permission table. For example,
when "cmy" logs in from host "192.168.1.1", its identity should be `cmy@"192.168.%"`. Doris
should use this identity to check other permissions, rather than the user's real identity, which is
`cmy@"192.168.1.1"`.

**Black list**
Functionally speaking, Doris only supports a WHITE LIST, which allows a user to log in from
the hosts in the white list. But in some cases, we do need a BLACK LIST function.
Fortunately, by changing the logic described above, we can simulate the effect of a BLACK LIST.

For example, first we add a user by:
`create user cmy@'%' identified by '12345';`

Now user 'cmy' can log in from any host. If we don't want 'cmy' to log in from host A, we
can add a new user by:
`create user cmy@'A' identified by 'other_passwd';`

"A" has a higher priority in the permission table than "%", so if 'cmy' tries to log in from A using password '12345', the login will be rejected.
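
The matching rule described in this message can be sketched as follows. This is only an illustration: the table layout, the function names, and the "more literal characters means higher priority" ordering are assumptions made for the example, not the FE's actual implementation.

```python
from fnmatch import fnmatchcase


def host_matches(pattern, host):
    """Match a host against a pattern where '%' is a wildcard."""
    return fnmatchcase(host, pattern.replace("%", "*"))


def specificity(pattern):
    # Assumed priority rule for this sketch: more literal characters wins.
    return len(pattern.replace("%", ""))


def resolve_entry(table, user, host):
    """Pick the single highest-priority entry that matches user and host."""
    candidates = [e for e in table if e[0] == user and host_matches(e[1], host)]
    return max(candidates, key=lambda e: specificity(e[1]), default=None)


def login(table, user, host, password):
    """Return the resolved identity on success, or None if login is rejected.

    Only the highest-priority entry is consulted; lower-priority entries
    are never used as a fallback for the password check.
    """
    entry = resolve_entry(table, user, host)
    if entry is None or entry[2] != password:
        return None
    return f'{user}@"{entry[1]}"'
```

With entries `("cmy", "%", "12345")` and `("cmy", "192.168.%", "abcde")`, a login from "192.168.1.1" with "12345" is rejected, and a login with "abcde" resolves to `cmy@"192.168.%"`. Adding `("cmy", "A", "other_passwd")` blocks the "12345" password from host A, which is the black list effect.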
2019-12-06 17:45:56 +08:00
27d6794b81 Support subquery with non-scalar result in Binary predicate and Between-and predicate (#2360)
This commit adds a new plan node named AssertNumRowsNode,
which is used to determine whether the number of rows exceeds a limit.
An AssertNumRowsNode is added to each subquery in a Binary predicate or Between-and predicate
to determine whether the subquery returns more than 1 row.
If the subquery returns more than 1 row, the query is cancelled.

For example:
There are 4 rows in table t1.
Query: select c1 from t1 where c1=(select c2 from t1);
Result: ERROR 1064 (HY000): Expected no more than 1 to be returned by expression select c2 from t1
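
As a rough sketch of the idea, the node can be modeled as a wrapper around the child's row stream that fails once the row count exceeds the limit. The function and exception names are hypothetical and chosen for this example; only the error text mirrors the result shown above.

```python
class TooManyRowsError(Exception):
    """Raised when a child returns more rows than the asserted limit."""


def assert_num_rows(rows, limit, subquery_text):
    """Pass rows through unchanged, failing once more than `limit` rows are seen."""
    count = 0
    for row in rows:
        count += 1
        if count > limit:
            raise TooManyRowsError(
                f"Expected no more than {limit} to be returned "
                f"by expression {subquery_text}"
            )
        yield row
```

A subquery that yields a single row passes through untouched, while one that yields four rows fails as soon as the second row arrives.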

ISSUE-2270
TPC-DS 6,54,58
2019-12-05 21:27:33 +08:00
1532282942 Support push down is null predicate for Doris-On-ES (#2378) 2019-12-04 22:56:22 +08:00
0f00febd21 Optimize Doris On Elasticsearch performance (#2237)
Pure DocValue optimization for doris-on-es

Future todo:
Currently, every tuple scan checks whether pure_docvalue is enabled, which is not reasonable. The check should be done once outside the whole scan. This will be addressed in a future change.
2019-12-04 12:57:45 +08:00
f828670245 Add Bitmap index reader (#2319)
[STORAGE] [INDEX]

For #2061 and #2062

Add bitmap index reader
SegmentIterator support bitmap index
Add some metrics
2019-12-03 23:01:40 +08:00
a2d7c42042 Add a variable to specifically limit the memory usage of the load part in the insert operation (#2305)
This variable is mainly for the INSERT operation, because an INSERT operation has both a query part and a load part.
Using only the exec_mem_limit variable does not distinguish well between the memory limits of the two parts.
2019-11-28 13:03:11 +08:00
569d0bb3af Replace all remaining boost::split() with strings::split() (#2302) 2019-11-26 22:22:14 +08:00
a465b38874 Enhance doris on es error message (#2297)
Enhance the Doris on ES error message and fix some field data transform errors.
For varchar/char types, Elasticsearch users sometimes post non-string values to an Elasticsearch index. Because values are read from _source, we cannot process every JSON type, so we just convert the value to its original string representation. This may be tricky, but it works around the issue.
2019-11-26 18:32:25 +08:00
b187c0881c Fix bug of null safe equal join (#2193) 2019-11-14 08:52:48 +08:00
35b2800542 Keep num_of_columns_from_file compatible with 0.10 protocol (#2187)
After checking, I found that broker load in 0.11 added the num_of_columns_from_file parameter in thrift. The BE does not handle compatibility for this parameter,
so broker load could cause the BE to crash during the upgrade.
2019-11-13 22:04:15 +08:00
42395d2455 Change Null-safe equal operator from cross join to hash join (#2156)
* Change Null-safe equal operator from cross join to hash join
ISSUE-2136

This commit changes the join method from cross join to hash join when the equal operator is the Null-safe '<=>'.
It improves the speed of queries that use the Null-safe equal operator.
The finds_nulls field records whether each equal operator is Null-safe:
finds_nulls[i] being true means that the i-th equal operator is Null-safe.
The equal function in the hash table returns true if both val and loc are NULL and finds_nulls[i] is true.
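
The comparison rule can be sketched as below. The `finds_nulls` name comes from the commit message; the `keys_equal` function and the use of `None` for NULL are assumptions made for this illustration.

```python
def keys_equal(probe_key, build_key, finds_nulls):
    """Compare two join keys column by column.

    finds_nulls[i] is True when the i-th equality is the null-safe '<=>'.
    With plain '=', a NULL on either side never matches.
    """
    for probe_val, build_val, null_safe in zip(probe_key, build_key, finds_nulls):
        if probe_val is None or build_val is None:
            if null_safe and probe_val is None and build_val is None:
                continue  # NULL <=> NULL is true
            return False
        if probe_val != build_val:
            return False
    return True
```

Because NULL keys can now compare equal inside the hash table, rows with NULL join keys can be matched by a hash lookup instead of a cross join followed by a filter.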
2019-11-08 12:43:48 +08:00
95a3b4ccfe Add object type (#1948)
Add a new type: Object. Currently, it's mainly for complex aggregate metrics (HLL, Bitmap).

The Object type has the following constraints:
1 Object type cannot be used as a key column type
2 Object type doesn't support any index (BloomFilter, short key, zone map, invert index)
3 Object type doesn't support filter and group by

In the implementation:

The Object type reuses StringValue and StringVal, because in the storage engine the Object type is binary: it has a pointer and a length.
2019-10-31 21:42:58 +08:00
9bc2325c6a Fix incorrect scan bytes in metrics (#2034) 2019-10-23 18:13:40 +08:00
0f94b685ab Add ES7.x compatibility for doris on es (#2033) 2019-10-22 17:23:33 +08:00
9c2d149c36 add profile for segment v2 (#2015) 2019-10-22 09:43:16 +08:00
3c12af4dcc Limit the memory consumption of broker scan node (#1996)
If memory usage exceeds the limit, no more row batches will be pushed to the batch queue
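
A minimal sketch of this kind of back-pressure is shown below. The `BatchQueue` class is hypothetical, and the choice to refuse a push when it would take usage past the limit is an assumption for this example; the commit message does not say exactly where the check is made.

```python
from collections import deque


class BatchQueue:
    """Hypothetical batch queue that tracks the bytes held by queued batches."""

    def __init__(self, mem_limit_bytes):
        self.mem_limit_bytes = mem_limit_bytes
        self.mem_used_bytes = 0
        self._batches = deque()

    def try_push(self, batch):
        """Queue a batch unless doing so would exceed the memory limit."""
        if self.mem_used_bytes + len(batch) > self.mem_limit_bytes:
            return False  # producer should wait until the consumer drains the queue
        self._batches.append(batch)
        self.mem_used_bytes += len(batch)
        return True

    def pop(self):
        """Remove the oldest batch and release its memory accounting."""
        batch = self._batches.popleft()
        self.mem_used_bytes -= len(batch)
        return batch
```

A scanner thread would retry `try_push` after the consumer calls `pop`, so queued data stays bounded by the limit.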
2019-10-17 14:40:16 +08:00
62acf5d098 Limit the memory usage of Loading process (#1954) 2019-10-15 09:26:20 +08:00
463b462b8d Add create_time to information_schema.tables 2019-10-12 21:45:14 +08:00
024348d74b Enable auto convert when check in (#1926)
Leverage gitattributes to automatically convert end-of-line to LF when
checking in. Convert existing CRLF to LF by removing all files and
checking them out with the new .gitattributes file. Except for .gitattributes, all
files are modified only in their line endings.
2019-10-09 22:31:27 +08:00
cbf6214762 Add a missing break (#1923) 2019-09-30 20:32:05 +08:00
69d0a34bfd Remove unused _request_columns_size from olap_scanner (#1916) 2019-09-30 15:25:10 +08:00
8f016d3ab2 Make HLL be able to handle invalid data (#1908)
In this change list:
1. Validate HLL columns when loading data. If the data is invalid, the row
is filtered.
2. Treat invalid types of HLL data as empty HLL when serializing. With
this change, all ingested data will be valid.
3. Treat nullptr or invalid types of HLL data as empty HLL when deserializing.
With this change, dirty data can be handled normally.
4. Rename function empty_hll to hll_empty.
5. Disable memtable_flush_execute_test because it sometimes
fails. When tearing down, some threads are not joined, and they
access destroyed resources, which is invalid.
2019-09-29 10:55:23 +08:00
1131f53420 Fix parquet_scanner_test in debug mode (#1900) 2019-09-28 01:15:33 +08:00
cafb9f1e62 Replace Arena with MemPool first step (#1899) 2019-09-28 01:12:22 +08:00
e67b398916 Fix bug that backup may create an empty file on remote storage. (#1869)
Sometimes the broker writer fails to close, but we do not handle this failure.
This may create an empty file on remote storage that is treated as normal.

Also improve usability:
1. Get the latest 2000 transactions instead of the earliest.
2. Show the backends on which download and upload tasks are being executed.
2019-09-28 00:11:43 +08:00
1c229fbd92 Fix es_scan_reader_test in debug mode (#1905) 2019-09-28 00:02:30 +08:00
2f0808137a Refactor FrontendHelper (#1888) 2019-09-27 13:21:14 +08:00
b246d93128 Avoid SerDe for aggregation query with object pool (#1854) 2019-09-26 13:51:13 +08:00
c2de62d6a1 Collect scanner's status when es_http_scan_node close (#1861) 2019-09-25 12:20:13 +08:00
93fe10a268 Reduce size of HyperLogLog struct (#1845)
Currently the HyperLogLog struct is so large that rowsets become
too small when ingesting data. In this CL, the registers in HyperLogLog are
created only when they are needed. When ingesting data, it is normal
that there are only a few values in one HyperLogLog.
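
One way to defer register allocation is sketched below. Everything in it is an assumption made for illustration: the `LazyHll` class, the register count, the threshold of 160 explicitly stored hashes, and the rank computation are not taken from the commit, which only states that registers are created when needed.

```python
class LazyHll:
    """Hypothetical HyperLogLog that allocates registers only when needed."""

    NUM_REGISTERS = 1 << 14   # assumed register count for this sketch
    EXPLICIT_LIMIT = 160      # assumed number of hashes kept before switching

    def __init__(self):
        self._explicit = set()   # small sets are stored as raw 64-bit hashes
        self._registers = None   # allocated lazily

    def registers_allocated(self):
        return self._registers is not None

    def add_hash(self, hash_value):
        """Add a pre-hashed 64-bit value."""
        hash_value &= (1 << 64) - 1
        if self._registers is None:
            self._explicit.add(hash_value)
            if len(self._explicit) <= self.EXPLICIT_LIMIT:
                return
            # Too many distinct values: allocate registers and migrate.
            self._registers = bytearray(self.NUM_REGISTERS)
            pending, self._explicit = self._explicit, set()
            for value in pending:
                self._update_register(value)
            return
        self._update_register(hash_value)

    def _update_register(self, hash_value):
        index = hash_value & (self.NUM_REGISTERS - 1)  # low 14 bits pick the register
        rest = hash_value >> 14                        # remaining 50 bits
        rank = 50 - rest.bit_length() + 1              # leading zeros in 50 bits, plus 1
        if rank > self._registers[index]:
            self._registers[index] = rank
```

In this sketch a HyperLogLog holding a handful of values costs a small set rather than a full register array, which is the memory saving the commit is after.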
2019-09-21 14:38:58 +08:00
9aa2045987 Refactor alter job (#1695) 2019-09-12 16:31:29 +08:00
cd5cfea5cc Encapsulate HLL logic (#1756) 2019-09-09 15:52:10 +08:00
3f22238012 Add check for to_bitmap function argument (#1747) 2019-09-05 18:11:38 +08:00
0dc0dadad1 Reduce unnecessary memory allocation and copy in OlapScanNode (#1742) 2019-09-04 21:05:12 +08:00
9f5e5717d4 Unify the msg of 'Memory exceed limit' (#1737)
The new message format when the limit is exceeded: "Memory exceed limit. %msg, Backend:%ip, fragment:%id Used:% , Limit:%. xxx".
This commit unifies the 'Memory exceed limit' message across check_query_state, RETURN_IF_LIMIT_EXCEEDED and LIMIT_EXCEEDED.
2019-09-03 10:42:16 +08:00
8034d83e20 Add scroll keepalive and http timeout configuration (#1731) 2019-09-02 19:04:30 +08:00
81ca3e3abf Free olap scanner out of lock (#1733)
Close the scanner outside of OlapScanner's batch lock.
Closing it while holding the lock makes all scanners wait for one scanner to finish.
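
The pattern is to detach the scanner from shared state while holding the lock, then do the potentially slow close after releasing it. The `ScannerPool` class below is a hypothetical illustration of that pattern, not the BE's code.

```python
import threading


class ScannerPool:
    """Hypothetical pool of scanners guarded by a single lock."""

    def __init__(self, scanners):
        self._lock = threading.Lock()
        self._scanners = list(scanners)

    def finish_one(self):
        """Remove one scanner under the lock, then close it outside the lock."""
        with self._lock:
            if not self._scanners:
                return None
            scanner = self._scanners.pop()
        # The lock is released here, so other threads are not blocked
        # while this scanner releases its resources.
        scanner.close()
        return scanner
```

Only the list manipulation needs the lock; the close call touches just the detached scanner, so it does not need protection.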
2019-09-02 16:49:28 +08:00
3a33f3d350 Make bitmap_union agg column support insert into and broker load (#1721) 2019-08-30 14:44:51 +08:00
c6dfe83b6d Add particular log info for doris on es (#1711) 2019-08-27 22:16:28 +08:00
dc2d49fe07 Make StringValue's memory layout same with Slice (#1712)
In our storage engine's code, we cast StringValue to Slice. Because
their memory layouts are different, this may cause the BE process to crash.

This patch makes their memory layouts the same to resolve this problem
temporarily. We should improve it some day.
2019-08-27 22:15:46 +08:00
a1b92768dd Add loaded rows to SHOW LOAD result (#1686)
Loaded rows are updated periodically by the query report, so that
users can see whether a load job is still running or is blocked.
2019-08-27 14:13:47 +08:00
1e4dd77d2a Add bitmap agg type and udaf (#1610) 2019-08-26 14:24:42 +08:00
4449316d85 Add error msg when memory limit exceeded (#1685) 2019-08-23 11:13:01 +08:00
0a27ef030b Reduce the number of partition info in BrokerScanNode param (#1675)
If the user has already set target partitions to load, we should include only those partitions' info
in the BrokerScanNode param, instead of adding all partitions' info.
Adding all of them makes the RPC packet too large.
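
In outline, the change amounts to filtering the partition list before building the scan parameters. The function name and the dictionary-based partition records below are assumptions for this sketch.

```python
def partitions_for_scan(all_partitions, target_partition_ids=None):
    """Return only the partition info that needs to be sent to the scan node.

    If the user named target partitions, send just those; otherwise send all.
    """
    if not target_partition_ids:
        return list(all_partitions)
    wanted = set(target_partition_ids)
    return [p for p in all_partitions if p["id"] in wanted]
```

For a table with thousands of partitions and a load that targets a few of them, this keeps the RPC payload proportional to the targets rather than to the whole table.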
2019-08-20 19:30:57 +08:00
cd2b8373c2 Fix Stream load double NumberTotalRows (#1664) 2019-08-19 12:23:43 +08:00