Commit Graph

18429 Commits

SHA1 Message Date
a622ffe263 Update gitignore file (#2585) 2019-12-27 10:53:56 +08:00
4ed87964fe Add zip util (#2348) (#2441)
Support extracting .zip files via minizip
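
For illustration, a minimal C++ sketch of walking and extracting archive entries with minizip's unzip API (the helper name and simplified error handling are assumptions, not the actual util in the patch):

```
#include <cstdio>
#include <string>
#include <vector>
#include <unzip.h>  // minizip's unzip API

// Extract every entry of a .zip archive into an in-memory buffer.
// Minimal sketch: directory entries and error paths are simplified.
bool extract_zip(const std::string& path) {
    unzFile zf = unzOpen(path.c_str());
    if (zf == nullptr) return false;

    for (int err = unzGoToFirstFile(zf); err == UNZ_OK;
         err = unzGoToNextFile(zf)) {
        char name[256];
        unz_file_info info;
        if (unzGetCurrentFileInfo(zf, &info, name, sizeof(name),
                                  nullptr, 0, nullptr, 0) != UNZ_OK) break;
        if (unzOpenCurrentFile(zf) != UNZ_OK) break;

        std::vector<char> buf(info.uncompressed_size);
        int n = unzReadCurrentFile(zf, buf.data(),
                                   static_cast<unsigned>(buf.size()));
        unzCloseCurrentFile(zf);
        if (n < 0) break;
        std::printf("extracted %s (%d bytes)\n", name, n);
    }
    unzClose(zf);
    return true;
}
```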
2019-12-27 10:10:21 +08:00
1421a9be41 [Compaction] Support compact only one rowset (#2558)
Support a compaction operation that compacts only one rowset.
After this change, the last rowset of the tablet will also
be compacted.

At the same time, we added a `segments_overlap_pb` field to
the rowset meta, which describes whether the segment data in
the rowset overlaps. This field is set by `rowset_writer` and
is initially UNKNOWN for compatibility with existing data
(see the sketch below).

In addition, the version hash of the rowset generated by
compaction is set directly to the version hash of the last
rowset participating in the compaction, to ensure that the
tablet's version hash remains unchanged after compaction.
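
For illustration, a minimal C++ sketch of how such an overlap flag can drive the read path (the enum and helper below are illustrative, not the exact code in the patch):

```
#include <cstddef>

// Illustrative mirror of the three states described above.
enum class SegmentsOverlap {
    UNKNOWN,        // legacy rowsets written before this field existed
    OVERLAPPING,    // segments may contain overlapping key ranges
    NONOVERLAPPING, // segments are sorted and disjoint
};

// A reader must merge-sort segments unless they are known disjoint;
// UNKNOWN is treated conservatively, as if overlapping.
bool needs_merge(SegmentsOverlap overlap, std::size_t num_segments) {
    if (num_segments <= 1) return false;
    return overlap != SegmentsOverlap::NONOVERLAPPING;
}
```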
2019-12-27 10:08:41 +08:00
043a9528f7 Support decompressing csv file with deflate format in hdfs broker load (#2583) 2019-12-27 08:06:22 +08:00
f7032b07f3 Support more schema change from VARCHAR type (#2501) 2019-12-26 22:38:53 +08:00
11f8d542db [Segment V2] Support lazy-materialization-read (#2547)
Current read path of SegmentIterator
----

1. apply short key index and various column indexes to get the row ranges (ordinals of rows) to scan
2. read all return columns according to the row ranges
3. evaluate column predicates on the RowBlockV2 to further prune rows

Problem
----

When the column predicates at step 3 filter out a large proportion of rows in RowBlockV2, most values of the non-predicate columns read at step 2 are thrown away, i.e. step 2 did lots of useless work and I/O.

Lazy materialization read
----
With lazy materialization, the read path changes to
1. apply short key index and various column indexes to get the row ranges (ordinals of rows) to scan (unchanged)
2. **read only predicate columns** according to the row ranges
3. evaluate column predicates on the RowBlockV2 to further prune rows; a selection vector is maintained to indicate the selected rows
4. **read the remaining columns** based on the *selection vector* of RowBlockV2

In this way, we could avoid reading values of non-predicate columns of all rows that can't pass the predicates.

Example
----
```
function: seek(ordinal), read(block_offset, count)

(step 1) row ranges: [0,2),[4,8),[10,11),[15,20)
(step 1) row ordinals: [0 1 4 5 6 7 10 15 16 17 18 19]
(step 2) read of predicate columns: seek(0),read(0,2),seek(4),read(2,4),seek(10),read(6,1),seek(15),read(7,5)
(step 3) selection vector: [3 4 5 6]
(step 3) selected ordinals: [5 6 7 10]
(step 4) read of remaining columns: seek(5),read(3,3),seek(10),read(6,1)
```
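
For illustration, a simplified C++ sketch of the two-phase read above (the toy `Column` type and function names are assumptions, not the real SegmentIterator interfaces):

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy column addressed by global row ordinal (illustrative only).
using Column = std::vector<int64_t>;

// Materialize the values at `ordinals` into a dense block column.
static std::vector<int64_t> gather(const Column& col,
                                   const std::vector<uint64_t>& ordinals) {
    std::vector<int64_t> out;
    out.reserve(ordinals.size());
    for (uint64_t ord : ordinals) out.push_back(col[ord]);
    return out;
}

// Lazy-materialization read: predicate columns first, remaining
// columns only for the rows that survive the predicate.
std::vector<int64_t> lazy_read(const Column& pred_col,
                               const Column& other_col,
                               const std::vector<uint64_t>& ordinals,
                               int64_t threshold) {
    // Step 2: read only the predicate column for all candidate rows.
    std::vector<int64_t> pred_vals = gather(pred_col, ordinals);

    // Step 3: evaluate the predicate; the selection vector holds the
    // block offsets of selected rows, translated back to ordinals here.
    std::vector<uint64_t> selected;
    for (std::size_t off = 0; off < pred_vals.size(); ++off) {
        if (pred_vals[off] > threshold) selected.push_back(ordinals[off]);
    }

    // Step 4: read the remaining column only for the selected rows.
    return gather(other_col, selected);
}
```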

Performance evaluation
----
Lazy materialization is particularly useful when column predicates could filter many rows and lots of big metrics (e.g., hll and bitmap type columns) are queried. In our internal test cases on bitmap columns, queries run 20%~120% faster when using lazy materialization.
2019-12-26 22:00:16 +08:00
726fa923c9 Support compiling LLVM in aarch64 (#2559) 2019-12-26 21:59:06 +08:00
3e3cdd8f2e Add log to indicate version upon scan failed (#2582) 2019-12-26 20:09:14 +08:00
c43d0e2a75 [Tablet report] Fix bug that tablet report throws NPE. (#2578)
When processing tablet reports, some tablets carry transaction information.
The FE uses this information to determine whether to publish or clear
these transactions.

During this process, Doris may try to obtain the commit information of
already-deleted partitions, resulting in a null pointer exception.
2019-12-26 15:31:36 +08:00
ee64ab55db Fix segment size (#2549) 2019-12-26 11:51:53 +08:00
6f3c50a95c [Document] Add example for using CTE in INSERT operation (#2572) 2019-12-26 10:00:34 +08:00
37f2dccc96 Support bitshuffle on aarch64 (#2574) 2019-12-25 22:21:46 +08:00
a76333a400 Support s2 on aarch64 (#2568) 2019-12-25 18:56:52 +08:00
35503cf8a3 Support glog on aarch64 (#2563) 2019-12-25 13:56:15 +08:00
4ff1299e0b Fix ORC build-thirdpart.sh (#2564) 2019-12-25 11:00:13 +08:00
c8173c689a Support Openssl on aarch64 platform (#2561) 2019-12-25 10:53:47 +08:00
6444187908 Fix bug: loading Parquet data during an upgrade may result in data errors (#2556) 2019-12-24 23:27:33 +08:00
7f48bd3c5a Support bloom filter index for large int type (#2550) 2019-12-24 19:04:03 +08:00
f9685372a1 Fix bloom filter bug #2526 (#2532) 2019-12-24 07:45:11 +08:00
a511042397 [Export] Forget to set timeout for export job (#2516) 2019-12-23 18:14:41 +08:00
e7be52fa58 Update basic-usage_EN.md (#2530) 2019-12-23 16:04:27 +08:00
20abfc5f6f Modify stream-load-manual_EN.md (#2528) 2019-12-23 15:34:19 +08:00
5ff5bf20c9 Fix core dump when using datetime in window function (#2482) 2019-12-23 09:38:37 +08:00
b4d935ab37 Fix compaction with delete rowset bug (#2523)
[STORAGE][SEGMENTV2]
when a base compaction involves a delete rowset with more than two
delete conditions, the stat `rows_del_filtered` is computed incorrectly
and the compaction fails because of the row-count check.
2019-12-21 12:13:46 +08:00
008e59476d Add curdate function doc (#2520) 2019-12-20 21:24:56 +08:00
5b9b0a84d5 Add curdate function (#2521) 2019-12-20 21:23:16 +08:00
11b78008cd Timezone variable supports single-digit hour offsets (#2513)
Support time zone variables like "-8:00", "+8:00" and "8:00".
A value like "-8:00" is illegal as a time-zone ID, so we must convert it to the standard format, as sketched below.
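
For illustration, a minimal C++ sketch of the normalization, assuming the target form is zero-padded ±HH:MM (the function name and handled cases are assumptions):

```
#include <string>

// Normalize offsets like "-8:00", "+8:00", or "8:00" to "-08:00"/"+08:00".
// Illustrative only; the actual patch may handle more cases.
std::string normalize_offset(const std::string& tz) {
    std::string s = tz;
    char sign = '+';
    if (!s.empty() && (s[0] == '+' || s[0] == '-')) {
        sign = s[0];
        s = s.substr(1);
    }
    std::size_t colon = s.find(':');
    if (colon == std::string::npos) return tz;  // not an offset; pass through
    std::string hour = s.substr(0, colon);
    std::string minute = s.substr(colon + 1);
    if (hour.size() == 1) hour = "0" + hour;    // pad a single-digit hour
    return std::string(1, sign) + hour + ":" + minute;
}
```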
2019-12-20 07:45:29 +08:00
6815979ba5 Fix invalid to_bitmap input lead to BE core (#2510) 2019-12-19 21:28:00 +08:00
5111f8cfe8 [Export] Fix bug that NPE may be thrown when executing "show export;" (#2509)
Some export jobs from old versions of Doris may not have the timeout
property, which will cause an NPE.

Two more changes:
1. Change the default BE config "max_runnings_transactions" to 2000.
2. Add a new metric to the FE to show the master ip:port.
2019-12-19 19:09:25 +08:00
49b8097495 Fix the core of get_next in exchange node (#2505)
`_input_batch` was not initialized in the exchange node. Due to this
undefined behavior, the BE could ask for the capacity of `_input_batch`
before the BE initializes it.
The issue is #2504
2019-12-19 16:40:33 +08:00
435fdd236e Fix npe in spark-doris-connector when query is complex (#2503) 2019-12-19 14:53:29 +08:00
45fa9c999e Add Apache ORC lib in Doris (#2479) 2019-12-19 11:09:49 +08:00
4220e3b3dc Merge pull request #2486 from EmmyMiao87/assert_node
Only specified functions can be supported in a correlated subquery
2019-12-19 10:21:06 +08:00
53132b4199 Change the name of the specified agg function 2019-12-18 19:35:49 +08:00
e1ff744a99 [Alter Job] Cancel the alter job after a task has failed 3 times (#2447)
To avoid waiting for the timeout when the alter job is invalid.
2019-12-18 19:17:34 +08:00
8342eb0b02 Only UDA functions could be supported in correlated subqueries
The queries in issues #2483 and #2493 cannot be supported. These queries are forbidden:
query1: select * from t1 where k1=(select k1 from t2 where t1.k2=t2.k2);
query2: select * from t1 where k1=(select distinct k1 from t2 where t1.k2=t2.k2);
Only the sum, max, min, avg and count functions may appear in the select clause of a correlated subquery. #2420
This query is legal:
query1: select * from t1 where k1=(select avg(k1) from t2 where t1.k2=t2.k2);
2019-12-18 18:56:48 +08:00
63ea05f9c7 Add convert tablet rowset type (#2294)
to solve issue #2246.

The scheme is as follows:

    add an optional preferred_rowset_type field in TabletMeta for the V2-format rollup index tablet
    add a boolean session variable use_v2_rollup; if set to true, the query will use the V2 storage format rollup index to process the query
    test queries will be sent to the online service to verify the correctness of segment-v2, by sending the same queries to the FE with and without use_v2_rollup set and checking whether the returned results are the same
2019-12-18 18:49:47 +08:00
48f559600f Fix bug when spark on doris run long time (#2485) 2019-12-18 13:08:21 +08:00
222f8390c7 [Compaction] Fix the bug that cumulative point grows unreasonably (#2490)
When there are too many segments in one rowset, more than the
BE config 'max_cumulative_compaction_num_singleton_deltas', the
cumulative compaction will not work and will just increase the
cumulative point, because only one rowset is being selected.

So when selecting rowsets for cumulative compaction, we should meet 2
requirements before finishing the selection logic (see the sketch below):

1. the compaction score is larger than 'max_cumulative_compaction_num_singleton_deltas'
2. at least 2 rowsets are selected.
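
For illustration, a rough C++ sketch of a selection loop that honors both requirements (names and structure are assumptions, not the actual BE code):

```
#include <cstdint>
#include <vector>

struct Rowset {
    int64_t score;  // e.g., number of singleton deltas in this rowset
};

// Keep selecting rowsets until BOTH requirements are met; otherwise
// select nothing, so the cumulative point is not advanced for free.
std::vector<Rowset> pick_for_cumulative(const std::vector<Rowset>& candidates,
                                        int64_t max_singleton_deltas) {
    std::vector<Rowset> picked;
    int64_t total = 0;
    for (const Rowset& rs : candidates) {
        picked.push_back(rs);
        total += rs.score;
        // Requirement 1: accumulated compaction score exceeds the config.
        // Requirement 2: at least 2 rowsets are selected.
        if (total > max_singleton_deltas && picked.size() >= 2) {
            return picked;
        }
    }
    return {};  // not enough to compact this round
}
```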
2019-12-18 12:59:17 +08:00
c81b1db406 Support convert VARCHAR type to DATE type (#2489) 2019-12-18 12:58:47 +08:00
d31f774852 Add block split bloom filter (#2471)
[STORAGE][SEGMENTV2]

    use a block split bloom filter (sketched below)
    build the bloom filter against each data page
    add distinct values to the bloom filter
    add an ordinal index to the bloom filter index
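
For illustration, a compact C++ sketch of a block split ("split-block") Bloom filter in the style of Parquet's filter; the salts are the commonly used constants, and the layout here is an assumption about the patch:

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Split-block Bloom filter: each key maps to one 256-bit block (eight
// 32-bit words) and sets exactly one bit per word, chosen by a salt.
class BlockSplitBloomFilter {
public:
    explicit BlockSplitBloomFilter(std::size_t num_blocks)
        : blocks_(num_blocks * kWordsPerBlock, 0) {}

    void insert(uint64_t hash) {
        uint32_t* block = block_for(hash);
        for (int i = 0; i < kWordsPerBlock; ++i) block[i] |= mask(hash, i);
    }

    bool may_contain(uint64_t hash) const {
        const uint32_t* block =
            const_cast<BlockSplitBloomFilter*>(this)->block_for(hash);
        for (int i = 0; i < kWordsPerBlock; ++i)
            if ((block[i] & mask(hash, i)) == 0) return false;
        return true;
    }

private:
    static constexpr int kWordsPerBlock = 8;
    // Odd salts commonly used by split-block Bloom filters (e.g., Parquet).
    static constexpr uint32_t kSalt[kWordsPerBlock] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

    // One bit per 32-bit word, derived from the low 32 bits of the hash.
    static uint32_t mask(uint64_t hash, int i) {
        return 1U << ((static_cast<uint32_t>(hash) * kSalt[i]) >> 27);
    }

    // The high hash bits choose the block; assumes num_blocks > 0.
    uint32_t* block_for(uint64_t hash) {
        std::size_t idx = (hash >> 32) % (blocks_.size() / kWordsPerBlock);
        return &blocks_[idx * kWordsPerBlock];
    }

    std::vector<uint32_t> blocks_;
};
```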
2019-12-18 12:57:44 +08:00
efd32f7a85 Remove unused import package (#2492) 2019-12-18 10:55:56 +08:00
89003b774b Support Convert Varchar to INT (#2481) 2019-12-17 22:02:28 +08:00
b1bac4d0cd Support to create materialized view (#2431)
This commit supports creating a materialized view.
The syntax of the statement is as follows:
CREATE MATERIALIZED VIEW [MV name] AS
  SELECT select_expr[, select_expr ...]
  FROM [Base table name]
  GROUP BY column_name[, column_name ...]
  ORDER BY column_name[, column_name ...]

In the first step, the CreateMaterializedViewClause checks the semantics of the statement.
For now, the WHERE, HAVING and LIMIT clauses are forbidden in CREATE MATERIALIZED VIEW,
and the aggregation functions are restricted to SUM/MIN/MAX.

The second step validates the statement against the metadata of the base table.
For example, the aggregate type of an MV column must be the same as the aggregate type of the base column in an aggregate table.

The last step prepares the index of the MV and adds the new mvJob to the Handler.
The handler will process this new mvJob asynchronously.
2019-12-17 21:12:24 +08:00
3e58e2d543 Forbid the distinct function of a subquery in a binary predicate 2019-12-17 19:38:15 +08:00
e1ba0efbc7 Optimize compaction strategy of tablet on BE (#2473)
The current compaction selection strategy and cumulative point update logic
can cause cumulative compaction to stop working, so that all compaction tasks
are completed only by base compaction. This can cause a large number of data
versions to pile up.

In the current cumulative point update logic, when a cumulative compaction
cannot select enough rowsets, it directly increases the cumulative point.
Therefore, when data versions are generated at the same speed as the
cumulative compaction polling, the cumulative point continuously increases
without triggering any cumulative compaction.

The new strategy mainly modifies the update logic of the cumulative point to
ensure that the above problems do not occur. At the same time, the new
strategy also handles the case where compaction cannot be performed because
the cumulative point stagnates for a long time: the cumulative point is
forced to increase through threshold settings to ensure that compaction has
a chance to execute (sketched below).

Also add a new HTTP API to view the compaction status of a specified tablet.
See `compaction-action.md` for details.
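
For illustration, a rough C++ sketch of one way such a forced increase could work; the time-based stagnation threshold below is an assumption, not necessarily the policy in the patch:

```
#include <cstdint>

// Decide whether to advance the cumulative point even though the
// candidate rowsets do not reach the normal compaction score.
// Assumption: stagnation is detected via a time threshold; the real
// patch may use different signals or config names.
bool should_force_advance(int64_t now_sec,
                          int64_t oldest_candidate_created_sec,
                          int64_t accumulated_score,
                          int64_t score_threshold,
                          int64_t stagnation_threshold_sec) {
    if (accumulated_score >= score_threshold) {
        return false;  // normal path: cumulative compaction will run
    }
    // Stagnation path: move the point on so base compaction can
    // eventually pick these rowsets up.
    return now_sec - oldest_candidate_created_sec > stagnation_threshold_sec;
}
```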
2019-12-17 10:30:43 +08:00
55cb1cd1f1 Update date_format.md (#2476) 2019-12-16 20:43:55 +08:00
b20a76163b Update from_unixtime.md (#2475) 2019-12-16 19:39:54 +08:00
9244db40f7 Update bitmap doc (#2467) 2019-12-16 18:56:53 +08:00
2c90915362 Support correlated non-scalar subquery (#2468)
The first item of a non-scalar subquery can be a non-aggregate expression, such as column k1.
This commit removes this prohibition.
2019-12-16 18:52:05 +08:00