Commit Graph

2020 Commits

Author SHA1 Message Date
f03abcdfb3 [Spark Load] Rollup Tree Builder (#3727)
1. A tree data structure to describe a Doris table's rollups
2. A builder to build that data structure (a sketch follows)
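A minimal C++ sketch of such a tree, with hypothetical names (the real builder is FE code): the root is the base index, and each rollup is attached under the first existing node whose columns cover it.

```
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// Hypothetical rollup tree node: the root is the base index; children are
// rollups whose columns are a subset of their parent's columns.
struct RollupNode {
    std::string index_name;
    std::vector<std::string> columns;
    std::vector<std::unique_ptr<RollupNode>> children;
};

// Builder sketch: attach each rollup under the first node that covers its
// columns, so deeper nodes describe narrower rollups.
class RollupTreeBuilder {
public:
    explicit RollupTreeBuilder(RollupNode base) : root_(std::move(base)) {}

    void add_rollup(RollupNode rollup) { insert(&root_, std::move(rollup)); }

    const RollupNode& root() const { return root_; }

private:
    static bool covers(const RollupNode& parent, const RollupNode& child) {
        // Every child column must appear among the parent's columns.
        for (const auto& c : child.columns) {
            if (std::find(parent.columns.begin(), parent.columns.end(), c) ==
                parent.columns.end()) {
                return false;
            }
        }
        return true;
    }

    static void insert(RollupNode* parent, RollupNode rollup) {
        for (auto& child : parent->children) {
            if (covers(*child, rollup)) {
                insert(child.get(), std::move(rollup));
                return;
            }
        }
        parent->children.push_back(
            std::make_unique<RollupNode>(std::move(rollup)));
    }

    RollupNode root_;
};
```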
2020-06-22 14:06:33 +08:00
56bb218148 [Bug] Can not use non-key column as partition column in duplicate table (#3916)
The following statement throws an error:
```
create table test.tbl2
(k1 int, k2 int, k3 float)
duplicate key(k1)
partition by range(k2)
(partition p1 values less than("10"))
distributed by hash(k3) buckets 1
properties('replication_num' = '1'); 
```
Error: `Only key column can be partition column`

But in a duplicate key table, columns should be allowed as partition or distribution
columns even if they are not duplicate keys.

This bug is introduced by #3812
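A sketch of the corrected check in C++ (hypothetical names; the real check is FE code): only aggregate/unique key tables require the partition column to be a key column.

```
#include <stdexcept>
#include <string>

// Hypothetical descriptors for illustration only.
struct Column {
    std::string name;
    bool is_key;
};

enum class KeysType { AGG_KEYS, UNIQUE_KEYS, DUP_KEYS };

// Corrected check: duplicate key tables may partition (or distribute) on any
// column; other key types still require a key column.
void check_partition_column(KeysType keys_type, const Column& col) {
    if (keys_type != KeysType::DUP_KEYS && !col.is_key) {
        throw std::runtime_error("Only key column can be partition column");
    }
}
```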
2020-06-22 09:24:21 +08:00
4c3ccfb906 [FE] Prohibit pointing helper to itself when starting FE (#3850)
When starting FE with the `start_fe.sh --helper xxx` command, do not allow the
helper to point to the FE itself, because this is meaningless and may cause some
confusing problems.
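A hedged sketch of that startup guard (hypothetical names; the real check lives in the FE startup path): compare the helper endpoint against the FE's own address and refuse to start on a match.

```
#include <iostream>
#include <string>

// Reject a --helper endpoint that points back at this FE itself.
bool validate_helper(const std::string& helper_host, int helper_port,
                     const std::string& self_host, int self_port) {
    if (helper_host == self_host && helper_port == self_port) {
        std::cerr << "--helper must not point to this FE itself" << std::endl;
        return false;
    }
    return true;
}

int main() {
    // Simulated startup: the helper equals the node's own edit-log endpoint.
    if (!validate_helper("fe1.example.com", 9010, "fe1.example.com", 9010)) {
        return 1; // refuse to start
    }
    return 0;
}
```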
2020-06-22 09:21:08 +08:00
66a8383ac0 [Running_Profile] Fix all counter in DataStreamRecv and change the image path in docs (#3858) 2020-06-22 09:20:22 +08:00
35d07d8012 [Doc] Fix audit-plugin doc error (#3922)
Fix audit-plugin doc error
Droris -> Doris
2020-06-22 09:09:56 +08:00
wyb
a63fa88294 [Spark load][Fe 6/6] Fe process etl and loading state job (#3717)
1. Fe checks the status of the etl job regularly (a sketch follows below)
1.1 If the status is RUNNING, update the etl job progress
1.2 If the status is CANCELLED, cancel the load job
1.3 If the status is FINISHED, get the etl output file paths, update the job state to LOADING, and log the job update info

2. Fe sends PushTasks to Be and commits the transaction after all push tasks execute successfully

#3433
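That polling step, sketched in C++ (hypothetical names; the real scheduler is FE code):

```
#include <string>
#include <vector>

// Hypothetical stand-ins for the etl report and load job.
enum class EtlStatus { RUNNING, CANCELLED, FINISHED };

struct EtlReport {
    EtlStatus status;
    int progress = 0;                       // meaningful while RUNNING
    std::vector<std::string> output_paths;  // meaningful when FINISHED
};

struct LoadJob {
    int progress = 0;
    std::vector<std::string> etl_output_paths;
    void cancel() { /* abort the transaction and mark the job CANCELLED */ }
    void to_loading_state() { /* set state to LOADING and log the update */ }
};

// One round of the regular status check described above.
void on_etl_report(LoadJob* job, const EtlReport& report) {
    switch (report.status) {
    case EtlStatus::RUNNING:
        job->progress = report.progress;             // 1.1 update progress
        break;
    case EtlStatus::CANCELLED:
        job->cancel();                               // 1.2 cancel the load job
        break;
    case EtlStatus::FINISHED:
        job->etl_output_paths = report.output_paths; // 1.3 record output paths,
        job->to_loading_state();                     //     move to LOADING + log
        break;
    }
}
```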
2020-06-21 22:17:03 +08:00
03fa1fefa9 [Doc] Fix doc-bug (#3914) 2020-06-21 16:39:27 +08:00
1e42c4adb7 [Bug] Fix bug that BE crash when doing some queries (#3918)
This bug is introduced by PR #3872

In that PR, I removed the obj_pool param of the RuntimeProfile constructor,
so the first param became a std::string.
But DataStreamRecv accidentally passed a nullptr to that std::string; this compiles
fine but causes a runtime error.

Fix #3917
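A minimal C++ reproduction of that failure mode: a null `const char*` converts implicitly to `std::string`, so the code compiles, but `std::string(nullptr)` is undefined behavior and typically crashes.

```
#include <string>

// Mirrors the changed API: the first parameter is now a std::string.
struct Profile {
    explicit Profile(std::string name) : name_(std::move(name)) {}
    std::string name_;
};

int main() {
    const char* name = nullptr; // e.g. an optional name that was never set
    Profile p(name);            // compiles fine via implicit conversion to
                                // std::string, but constructing std::string
                                // from nullptr is undefined behavior (crash).
    return 0;
}
```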
2020-06-21 15:25:15 +08:00
8cd36f1c5d [Spark Load] Support java version hyperloglog (#3320)
Mainly used in the Spark Load process to calculate approximate deduplication values and then serialize them to a parquet file.
It tries to keep the same calculation semantics as the BE's C++ version. (A sketch of those semantics follows.)
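The core HyperLogLog semantics the Java and C++ versions need to agree on, sketched in C++ (register count, hash handling, and bias correction here are illustrative, not Doris's actual parameters):

```
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative HyperLogLog with 2^p registers.
class HyperLogLog {
public:
    explicit HyperLogLog(int p) : p_(p), registers_(1u << p, 0) {}

    void update(uint64_t hash) {
        // Top p bits pick a register.
        uint32_t idx = static_cast<uint32_t>(hash >> (64 - p_));
        // Guard bit keeps clz() defined and bounds the rank at 64 - p + 1.
        uint64_t rest = (hash << p_) | (1ull << (p_ - 1));
        uint8_t rank = static_cast<uint8_t>(__builtin_clzll(rest) + 1);
        if (rank > registers_[idx]) registers_[idx] = rank;
    }

    double estimate() const {
        double m = static_cast<double>(registers_.size());
        double sum = 0.0;
        for (uint8_t r : registers_) sum += std::ldexp(1.0, -r); // 2^-r
        double alpha = 0.7213 / (1.0 + 1.079 / m);
        return alpha * m * m / sum; // raw estimate; no small-range correction
    }

private:
    int p_;
    std::vector<uint8_t> registers_;
};
```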
2020-06-21 09:37:05 +08:00
fdd65c50c4 [Bug] fix mem_tracker use-after-free & add UT for it (#3899) 2020-06-20 19:08:53 +08:00
8e895958d6 [Bug] Checkpoint thread is not running (#3913)
This bug is introduced by PR #3784 
In #3784, I remove the `Catalog.getInstance()`, and use `Catalog.getCurrentCatalog()` instead.

But actually, there are some places that should use the serving catalog explicitly.

Mainly changed:

1. Add a new method `getServingCatalog()` to explicitly return the real catalog instance (sketched below).
2. Fix a compile bug of broker introduced by #3881
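A sketch of the distinction, rendered in C++ (the real singletons are FE Java code; everything beyond the two getter names is hypothetical): getCurrentCatalog() may return the checkpoint image, while getServingCatalog() always returns the live instance.

```
// During checkpointing, the checkpoint thread replays the image into a
// private catalog; code that must see the live state (such as starting the
// checkpoint thread itself) has to ask for the serving catalog explicitly.
class Catalog {
public:
    static Catalog* getServingCatalog() { return &serving_; }

    static Catalog* getCurrentCatalog() {
        // Returns the checkpoint image while one is being written.
        return checkpoint_ != nullptr ? checkpoint_ : &serving_;
    }

    static void setCheckpointCatalog(Catalog* c) { checkpoint_ = c; }

private:
    static Catalog serving_;
    static Catalog* checkpoint_;
};

Catalog Catalog::serving_;
Catalog* Catalog::checkpoint_ = nullptr;
```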
2020-06-20 09:32:14 +08:00
wyb
532d15d381 [Spark Load]Fe submit spark etl job (#3716)
After a user creates a spark load job whose status is PENDING, Fe schedules and submits the spark etl job (a sketch follows the list below).
1. Begin transaction
2. Create a SparkLoadPendingTask for submitting etl job
2.1 Create etl job configuration according to https://github.com/apache/incubator-doris/issues/3010#issuecomment-635174675
2.2 Upload the configuration file and job jar to HDFS with broker
2.3 Submit etl job to spark cluster
2.4 Wait for etl job submission result
3. Update job state to ETL and log job update info if etl job is submitted successfully

#3433
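The same flow as a hedged C++ sketch (every name below is a hypothetical stand-in for the FE's Java code):

```
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the steps above.
struct EtlJobConfig {
    std::string to_json() const { return "{}"; } // config per issue #3010
};

long begin_transaction() { return 1001; }                              // stub
void upload_with_broker(const std::string&, const std::string&) {}    // stub
std::string submit_to_spark(const std::string&, const std::string&) { // stub
    return "application_1592000000000_0001";
}

void process_pending_job() {
    long txn_id = begin_transaction();                            // 1. begin txn
    EtlJobConfig config;                                          // 2.1 build config
    upload_with_broker("/remote/etl/job.json", config.to_json()); // 2.2 upload
    std::string app_id = submit_to_spark("/remote/etl/job.json",
                                         "/remote/etl/spark-dpp.jar"); // 2.3 submit
    if (app_id.empty()) {                                  // 2.4 check submission result
        throw std::runtime_error("etl job submission failed");
    }
    (void)txn_id; // 3. update the job state to ETL and log the update here
}
```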
2020-06-19 17:44:47 +08:00
5d40218ae6 [Config] Support max_stream_load_timeout_second config in fe (#3902)
This configuration is specifically used to limit the timeout setting for stream load.
It prevents failed stream load transactions from staying uncancelable for a long
time because of an overly large user timeout setting.
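In effect, the per-job timeout is clamped by this config; a minimal sketch (constant values are placeholders, not the shipped defaults):

```
#include <algorithm>

// Placeholder values standing in for fe.conf entries.
static const int kMaxStreamLoadTimeoutSecond = 259200;  // the new upper bound
static const int kDefaultStreamLoadTimeoutSecond = 600; // used when unset

// A user-supplied timeout can never exceed the configured maximum.
int effective_stream_load_timeout(int user_timeout_second) {
    if (user_timeout_second <= 0) return kDefaultStreamLoadTimeoutSecond;
    return std::min(user_timeout_second, kMaxStreamLoadTimeoutSecond);
}
```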
2020-06-19 17:09:27 +08:00
51367abce7 [Bug] Fix bug that BE crash when doing Insert Operation (#3872)
Mainly change:
1. Fix the bug in `update_status(status)` of `PlanFragmentExecutor`.
2. When the FE Coordinator executes `execRemoteFragmentAsync()` and finds an RPC error, return a Future with an error code instead of throwing an exception.
3. Protect the `_status` in RuntimeState with a lock.
4. Move the `_runtime_profile` of RuntimeState before the `_obj_pool`, so that the profile is destructed after the object pool (see the example below).
5. Remove the unused `ObjectPool` param from the RuntimeProfile constructor. Without this,
RuntimeProfile would depend on the `_obj_pool` in RuntimeState.
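Point 4 leans on a C++ rule worth spelling out: members are destroyed in reverse declaration order, so declaring the profile before the pool guarantees the pool is torn down first. A self-contained illustration:

```
#include <iostream>

struct Tracer {
    const char* name;
    explicit Tracer(const char* n) : name(n) {}
    ~Tracer() { std::cout << "destroying " << name << "\n"; }
};

// Members are constructed in declaration order and destroyed in reverse,
// so _runtime_profile (declared first) outlives _obj_pool.
struct RuntimeStateLike {
    Tracer _runtime_profile{"_runtime_profile"};
    Tracer _obj_pool{"_obj_pool"};
};

int main() {
    RuntimeStateLike state;
    return 0;
    // Output: "destroying _obj_pool" then "destroying _runtime_profile".
}
```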
2020-06-19 17:09:04 +08:00
355df127b7 [Doris On ES] Support fetch _id field from ES (#3900)
More information can be found: https://github.com/apache/incubator-doris/issues/3901

The created ES external table must contain an `_id` column if you want to fetch the Elasticsearch document `_id`.
```
CREATE EXTERNAL TABLE `doe_id2` (
  `_id` varchar COMMENT "",
   `city`  varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://10.74.167.16:8200",
"user" = "root",
"password" = "root",
"index" = "doe",
"type" = "doc",
"version" = "6.5.3",
"enable_docvalue_scan" = "true",
"transport" = "http"
);
```

Query:

```
mysql> select * from doe_id2 limit 10;
+----------------------+------+
| _id                  | city |
+----------------------+------+
| iRHNc3IB8XwmcbhB7lEB | gz   |
| jBHNc3IB8XwmcbhB71Ef | gz   |
| jRHNc3IB8XwmcbhB71GI | gz   |
| jhHNc3IB8XwmcbhB71Hx | gz   |
| ThHNc3IB8XwmcbhBkFHB | sh   |
| TxHNc3IB8XwmcbhBkFH9 | sh   |
| URHNc3IB8XwmcbhBklFA | sh   |
| ahHNc3IB8XwmcbhBxlFq | gz   |
| axHNc3IB8XwmcbhBxlHw | gz   |
| bxHNc3IB8XwmcbhByVFO | gz   |
+----------------------+------+
```

NOTICE:
This changes the column name format to support column names starting with "_".
2020-06-19 17:07:07 +08:00
e0461cc7f4 [bug] Make compaction metrics value is right (#3903)
Currently, _input_rowsets is cleared when gc_used_rowsets() is called.
After that, any metrics calculated from it are wrong. (A sketch of the fix pattern follows.)
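A sketch of the fix pattern (hypothetical names): snapshot the input stats before gc_used_rowsets() clears _input_rowsets, and report metrics from the snapshot.

```
#include <cstdint>
#include <vector>

// Hypothetical rowset stand-in.
struct Rowset {
    int64_t num_rows;
    int64_t data_size;
};

struct Compaction {
    std::vector<Rowset> _input_rowsets;
    int64_t _input_row_num = 0;
    int64_t _input_data_size = 0;

    // Capture the stats while the rowsets are still present, so metrics
    // remain correct after _input_rowsets is cleared.
    void capture_input_stats() {
        for (const Rowset& rs : _input_rowsets) {
            _input_row_num += rs.num_rows;
            _input_data_size += rs.data_size;
        }
    }

    // Hands rowsets to GC; anything computed from _input_rowsets afterwards
    // would see an empty vector.
    void gc_used_rowsets() { _input_rowsets.clear(); }
};
```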
2020-06-19 11:22:06 +08:00
1d9fa5071d [BUG][Broker] Fix broker read buffer size from input stream (#3881)
This commit fixes a bug where the broker cannot read the full buffer length when the buffer size is set larger than 128K.

This bug caused the data size returned by a pread request to always be less than 128K. (The usual fix pattern is sketched below.)
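The usual fix for this class of bug is a read-until-full loop, since a single read may legally return fewer bytes than requested. A generic C++ sketch (the real broker is Java):

```
#include <unistd.h>  // read(), ssize_t
#include <cstddef>

// Keep reading until `length` bytes arrive, EOF, or an error: one read()
// call may return less than requested even when more data will follow.
ssize_t read_fully(int fd, char* buf, size_t length) {
    size_t total = 0;
    while (total < length) {
        ssize_t n = read(fd, buf + total, length - total);
        if (n < 0) return -1;  // error
        if (n == 0) break;     // EOF: return the short count
        total += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(total);
}
```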
2020-06-19 09:33:09 +08:00
5a253bc2c6 [BE][Tool] Add segment v2 footer meta viewer (#3822)
Add segment v2 footer meta viewer tool
2020-06-19 09:32:11 +08:00
ca96ea3056 [Memory Engine] MemTablet creation and compatibility handling in BE (#3762) 2020-06-18 09:56:07 +08:00
2f99f632e8 Modify docs format (#3896) 2020-06-18 09:43:28 +08:00
a62cebfccf Forbidden float column in short key (#3812)
* Forbidden float column in short key

When the user does not specify short key columns, doris will automatically supplement them.
However, doris does not support float or double as short key columns, so when supplementing short key columns, doris should avoid choosing those columns as key columns.
The short key is limited to 3 columns and 36 bytes.

CreateMaterializedView, AddRollup, and CreateDuplicateTable all need to forbid float columns in the short key.
If a float column is encountered during the supplement process, it and all subsequent columns become value columns.

Also, float and double cannot be short key columns, and Doris must have at least one short key column,
so the type of the first column cannot be float or double.
If a varchar column is a short key column, it can only be the last short key column.

Fixed #3811

For a duplicate table without order by columns, the order by columns are the same as the short key columns.
If the order by columns have been designated, the count of short key columns must be <= the count of order by columns. (A sketch of the supplement rules follows.)
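A sketch of those supplement rules (hypothetical names; the limits follow the text above):

```
#include <string>
#include <vector>

// Hypothetical column descriptor for illustration.
struct Column {
    std::string name;
    std::string type;  // "INT", "FLOAT", "DOUBLE", "VARCHAR", ...
    int byte_size;
};

// Pick short key columns left to right: stop before float/double, include at
// most one trailing varchar, and respect the column-count and byte limits.
std::vector<std::string> pick_short_keys(const std::vector<Column>& cols,
                                         int max_columns, int max_bytes) {
    std::vector<std::string> keys;
    int bytes = 0;
    for (const Column& col : cols) {
        if (col.type == "FLOAT" || col.type == "DOUBLE") break;
        if (static_cast<int>(keys.size()) >= max_columns) break;
        if (bytes + col.byte_size > max_bytes) break;
        keys.push_back(col.name);
        bytes += col.byte_size;
        if (col.type == "VARCHAR") break; // varchar must be the last short key
    }
    return keys;
}
```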
2020-06-17 14:16:48 +08:00
e9f7576b9d [Enhancement] make metrics api more clear (#3891) 2020-06-17 12:17:54 +08:00
c6f2b5ef0d [Doris On ES][Docs] refator documentation for doe (#3867) 2020-06-17 10:54:28 +08:00
d659167d6d [Planner] Set MysqlScanNode's cardinality to avoid unexpected shuffle join (#3886) 2020-06-17 10:53:36 +08:00
a2df29efe9 [Bug][RoutineLoad] Fix bug that exception thrown when txn of a routineload task become visible (#3890) 2020-06-17 10:52:51 +08:00
bfbe22526f Show create table result with bitmap column should not return default value (#3882) 2020-06-17 09:43:17 +08:00
ae7028bee4 [Enhancement] Replace N/A with NULL in ShowStmt result (#3851) 2020-06-17 09:41:51 +08:00
0224d49842 [Fix][Bug] Fix compile bug (#3888)
Co-authored-by: chenmingyu <chenmingyu@baidu.com>
2020-06-16 18:42:04 +08:00
6c4d7c60dd [Feature] Add QueryDetail to store query statistics. (#3744)
1. Store the query statistics in memory.
2. Supporting RESTFUL interface to get the statistics.
2020-06-15 18:16:54 +08:00
2211cb0ee0 [Metrics] Add metrics document and 2 new metrics of TCP (#3835) 2020-06-15 09:48:09 +08:00
3c09e1e1d8 [trace] Adapt trace util to compaction module (#3814)
Trace util is helpful for diagnosing compaction performance problems,
we can get trace log for base compaction like:
```
W0610 11:26:33.804431 56452 storage_engine.cpp:552] Trace:
0610 11:23:03.727535 (+     0us) storage_engine.cpp:554] start to perform base compaction
0610 11:23:03.728961 (+  1426us) storage_engine.cpp:560] found best tablet 546859
0610 11:23:03.728963 (+     2us) base_compaction.cpp:40] got base compaction lock
0610 11:23:03.729029 (+    66us) base_compaction.cpp:44] rowsets picked
0610 11:24:51.784439 (+108055410us) compaction.cpp:46] got concurrency lock and start to do compaction
0610 11:24:51.784818 (+   379us) compaction.cpp:74] prepare finished
0610 11:26:33.359265 (+101574447us) compaction.cpp:87] merge rowsets finished
0610 11:26:33.484481 (+125216us) compaction.cpp:102] output rowset built
0610 11:26:33.484482 (+     1us) compaction.cpp:106] check correctness finished
0610 11:26:33.513197 (+ 28715us) compaction.cpp:110] modify rowsets finished
0610 11:26:33.513300 (+   103us) base_compaction.cpp:49] compaction finished
0610 11:26:33.513441 (+   141us) base_compaction.cpp:56] unused rowsets have been moved to GC queue
Metrics: {"filtered_rows":0,"input_row_num":3346807,"input_rowsets_count":42,"input_rowsets_data_size":1256413170,"input_segments_num":44,"merge_rowsets_latency_us":101574444,"merged_rows":0,"output_row_num":3346807,"output_rowset_data_size":1228439659,"output_segments_num":6}
```
for cumulative compaction like:
```
W0610 11:14:18.714366 56468 storage_engine.cpp:518] Trace:
0610 11:14:08.068484 (+     0us) storage_engine.cpp:520] start to perform cumulative compaction
0610 11:14:08.069844 (+  1360us) storage_engine.cpp:526] found best tablet 547083
0610 11:14:08.069846 (+     2us) cumulative_compaction.cpp:42] got cumulative compaction lock
0610 11:14:08.069947 (+   101us) cumulative_compaction.cpp:46] calculated cumulative point
0610 11:14:08.070141 (+   194us) cumulative_compaction.cpp:50] rowsets picked
0610 11:14:08.070143 (+     2us) compaction.cpp:46] got concurrency lock and start to do compaction
0610 11:14:08.070518 (+   375us) compaction.cpp:74] prepare finished
0610 11:14:15.389893 (+7319375us) compaction.cpp:87] merge rowsets finished
0610 11:14:15.390916 (+  1023us) compaction.cpp:102] output rowset built
0610 11:14:15.390917 (+     1us) compaction.cpp:106] check correctness finished
0610 11:14:15.409460 (+ 18543us) compaction.cpp:110] modify rowsets finished
0610 11:14:15.409496 (+    36us) cumulative_compaction.cpp:55] compaction finished
0610 11:14:15.410138 (+   642us) cumulative_compaction.cpp:65] unused rowsets have been moved to GC queue
Metrics: {"filtered_rows":0,"input_row_num":136707,"input_rowsets_count":302,"input_rowsets_data_size":76617836,"input_segments_num":302,"merge_rowsets_latency_us":7319372,"merged_rows":0,"output_row_num":136707,"output_rowset_data_size":53893280,"output_segments_num":1}
```
2020-06-13 19:31:51 +08:00
b3811f910f [Spark load][Fe 4/6] Add hive external table and update hive table syntax in loadstmt (#3819)
* Add hive external table and update hive table syntax in loadstmt

* Move check hive table from SelectStmt to FromClause and update doc

* Update hive external table en sql reference
2020-06-13 16:28:24 +08:00
414a0a35e5 [Dynamic Partition] Use ZonedDateTime to support set timezone (#3799)
This CL mainly supports time zones in dynamic partition:
1. use the new Java Time API to replace Calendar.
2. support setting the time zone in dynamic partition parameters.
2020-06-13 16:27:09 +08:00
b8ee84a120 [Doc] Add docs to OLAP_SCAN_NODE query profile (#3808) 2020-06-13 16:25:40 +08:00
6928c72703 Optimize the logic for getting TabletMeta from TabletInvertedIndex to reduce frequency of getting read lock (#3815)
This PR optimizes the logic for getting tabletMeta from TabletInvertedIndex to reduce the frequency of acquiring the read lock.
2020-06-13 12:46:59 +08:00
61be7132a9 fix for be server crash which throwing syntax error when parse json … (#3846)
Fix a BE server crash that threw a syntax error when parsing JSON from a Kafka message.
2020-06-13 12:45:16 +08:00
38b6d291f1 [Bug] fix uninitialized member vars (#3848)
This fix is based on UBSAN unit tests. If we create & use a class object in a different way, we may hit runtime errors like `load of value XX, which is not a valid value for type 'YYY'` again.

Unit tests should be built in DEBUG or *SAN mode (at least DEBUG). RELEASE mode adds -DNDEBUG, which turns off dchecks/asserts/debug checks. (A minimal example of the failure mode follows.)
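The failure mode in miniature: reading an uninitialized member is exactly the invalid-value load UBSAN reports, and a default member initializer removes it regardless of how the object is created.

```
#include <iostream>

// Before: `initialized` is never set, so reading it is undefined behavior.
// Under UBSAN this is the "load of value XX, which is not a valid value for
// type 'bool'" report quoted above.
struct Before {
    bool initialized; // uninitialized
};

// After: a default member initializer guarantees a valid value no matter
// how the object is created.
struct After {
    bool initialized = false;
};

int main() {
    Before b;            // b.initialized holds stack garbage
    if (b.initialized) { // a sanitized DEBUG build flags this load
        std::cout << "?" << "\n";
    }
    After a;
    std::cout << a.initialized << "\n"; // always prints 0
    return 0;
}
```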
2020-06-13 12:44:49 +08:00
83d39ff9c9 Avoid pass NULL to memcmp() (#3844)
If we execute `StringVal(len=0, ptr="") == StringVal(len=0, ptr=NULL)`, a NULL ptr is passed to memcmp(). This should be avoided, as sketched below.
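A sketch of the guard: decide equality from the lengths first and only touch the pointers when there are bytes to compare, because passing NULL to memcmp() is undefined even with a length of 0.

```
#include <cstddef>
#include <cstring>

// Minimal StringVal-like type for illustration.
struct StringVal {
    const char* ptr;
    size_t len;
};

bool string_val_eq(const StringVal& a, const StringVal& b) {
    if (a.len != b.len) return false;
    if (a.len == 0) return true; // never hand memcmp() a NULL pointer
    return std::memcmp(a.ptr, b.ptr, a.len) == 0;
}
```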
2020-06-13 12:43:41 +08:00
dac156b6b1 [Spill To Disk] Analytic_Eval_Node Support Spill Disk and Del Some Unless Code (#3820)
* 1. Add enable_spilling to the query options and support spilling to disk in Analytic_Eval_Node. FE can turn it on with:

         set enable_spilling = true;

Now both the Sort node and Analytic_Eval_Node can spill to disk.
2. Delete the merge_sorter code we no longer use.
3. Replace buffered_tuple_stream with buffered_tuple_stream2 in Analytic_Eval_Node to support spilling to disk. Delete the useless buffered_block_mgr and buffered_tuple_stream code.
4. Add a DataStreamRecvr profile. Move the counters belonging to DataStreamRecvr from the fragment to the DataStreamRecvr profile to clarify the running profile.

* change some hints in the code

* replace disable_spill with enable_spill, which is more compatible with FE
2020-06-13 10:19:02 +08:00
wyb
44dbdf4986 Update hive external table en sql reference 2020-06-12 21:38:05 +08:00
88a5429165 [FE] Add db&tbl info in broker load log (#3837)
The stream load log in FE has db & tbl info; the broker load log should have it too.
2020-06-12 20:54:41 +08:00
7591527977 [Bug] Fix a bug that insert null bitmap crashes BE (#3830)
`INSERT INTO ... VALUES (to_bitmap('xx'))` may insert null into a bitmap column, which may cause dirty data to be written.
2020-06-12 18:03:02 +08:00
75f4df400e Stop travis building when an error occurred (#3838)
Co-authored-by: fariel huang <farielclaire@gmail.com>
2020-06-12 09:16:01 +08:00
wyb
7f7ee63723 Move check hive table from SelectStmt to FromClause and update doc 2020-06-11 16:53:41 +08:00
2ce2cf78ac Remove unused import (#3826)
Change-Id: Ic6ef5a0d372a9b17ffa21cffb9027d2d7e856474
2020-06-11 11:44:51 +08:00
8d11ad3a16 [Doc] Fix website doc error (#3823) 2020-06-11 10:01:54 +08:00
86d235a76a [Extension] Logstash Doris output plugin (#3800)
This plugin is used to output data from Logstash to Doris.
It uses the HTTP protocol to interact with the Doris FE HTTP interface
and loads data through Doris's stream load. (A sketch of that interaction follows.)
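The stream load interaction the plugin relies on, sketched with libcurl (host, credentials, and label are placeholders; the PUT to `/api/{db}/{table}/_stream_load` with basic auth follows Doris's public stream load API):

```
#include <curl/curl.h>
#include <string>

// PUT one batch of rows to the FE stream load endpoint. The FE answers with
// a redirect to a BE, so the client must follow redirects and keep auth.
bool stream_load(const std::string& data) {
    CURL* curl = curl_easy_init();
    if (curl == nullptr) return false;

    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Expect: 100-continue");
    headers = curl_slist_append(headers, "label: logstash_batch_001"); // placeholder

    curl_easy_setopt(curl, CURLOPT_URL,
        "http://127.0.0.1:8030/api/example_db/example_tbl/_stream_load");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "root:");            // placeholder
    curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, data.c_str());
    curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long)data.size());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);    // FE redirects to BE
    curl_easy_setopt(curl, CURLOPT_UNRESTRICTED_AUTH, 1L); // keep auth on redirect

    CURLcode rc = curl_easy_perform(curl);
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK;
}
```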
2020-06-11 08:54:51 +08:00
8caedadb67 use scoped_refptr to new HashIndex (#3818) 2020-06-10 23:47:10 +08:00
cd402a6827 [Restore] Fix error message not match of restore job when job is time out (#3798)
In the current code, if a restore job times out, it is reported as canceled by the user. This error message is very misleading.
2020-06-10 23:12:04 +08:00
ef94c25773 [Bug]fix the crash of checksum task #3735 (#3738)
1. The table includes key columns of double/float type.
2. When the checksum task runs, it uses all key columns to compare.
3. schema.column(idx) of double/float type is NULL.

#3735
2020-06-10 22:59:15 +08:00