Commit Graph

97 Commits

Author SHA1 Message Date
a6d3099a68 Fix bug: localtime is not thread-safe,then changed to localtime_r. (#1614) 2019-08-08 22:00:43 +08:00
f4ad2381e6 Fix error DCHECK for partition_columns (#1606) 2019-08-08 16:29:08 +08:00
dc4a5e6c10 Support Decimal Type when load Parquet File (#1595) 2019-08-07 19:52:23 +08:00
9402456f5b Fix parquet directory have empty file (#1593) 2019-08-07 15:08:22 +08:00
93a3577baa Support multi partition column when creating table (#1574)
When creating table with OLAP engine, use can specify multi parition columns.
eg:

PARTITION BY RANGE(`date`, `id`)
(
    PARTITION `p201701_1000` VALUES LESS THAN ("2017-02-01", "1000"),
    PARTITION `p201702_2000` VALUES LESS THAN ("2017-03-01", "2000"),
    PARTITION `p201703_all`  VALUES LESS THAN ("2017-04-01")
)

Notice that load by hadoop cluster does not support multi parition column table.
2019-08-05 16:16:43 +08:00
9128af6499 Broker load hang when rpc failed (#1567)
Broker load hang on broker reader when the thrift request between broker and be is failed.
2019-07-31 19:03:38 +08:00
c5edf9dae0 Unify Field and ColumnSchema in Storage (#1561)
Currently, we have Field and ColumnSchema to access column data in a
row. These two classes are mostly the same. So we should unify these to
one class. Now, Field has offset information, which is an row attribute,
so we remove offset in Field.

RowCursor now has some logic which belong to Schema, so in this patch I
add Schema attribute to RowCursor to make RowCursor simple. After this
change, only Schema will handle Field/ColumnSchema.

I extract some logic from RowCursor to be/src/olap/row.h, then we can
use same logic to handle different types of row. Each type of row has
same function that to get Cell of this row. A cell represent a column
content with a null indicator.
2019-07-30 14:01:57 +08:00
97718a35a2 Do not get file size in Broker openReader() method (#1560)
The file is already got when listing files.
Get file size in openReader() again is unnecessary and inefficient.
2019-07-29 23:05:01 +08:00
0694b6a6fa Fix bugs of Broker load (#1546)
Use same UUID as query ID and load ID of a load execution plan.
Each load execution plan has a load ID, and as a plan, there is also a query ID.
We can use same UUID as query ID and load ID, for tracing the load process more easily.

Change the load ID when retrying a load execution plan.
When a load execution plan retry, the load ID should be changed, otherwise BE can not
distinguish the old and new load requests.

Cancel the running loading task when cancelling the broker load.
When user cancel a broker load, the running loading task should also be cancelled, or
it may occupies the worker thread for a long time.

Remove the unnecessary query report when doing load execution plan.
Only the last query report is needed.

Add a new BE config tablet_writer_rpc_timeout_sec.
It is used for RPC of tablet sink. The default is 600 seconds. which is long enough for flushing
about 6GB data. The long timeout config will reduce the possibility of encountering fail to send batch error when loading.

Use streaming_load_max_mb instead of mini_load_max_mb in BE config.

Add more logs for tracing a broker load process easily.
2019-07-27 20:17:05 +08:00
a88b55e649 Add more logs and metrics to trace the broker load process (#1530)
The Operator wants to known when the job being scheduled as PENDING
and LOADING. And how long it takes to finish these sub states.

Also add 2 metrics on BE to monitor the memtable's flush time.
`memtable_flush_total` and `memtable_flush_duration_us`
2019-07-23 21:42:44 +08:00
69040572fb Use different ID instead of table ID for base index of an OLAP table (#1524) 2019-07-23 15:48:45 +08:00
6c1f95c3a0 Fix bug that BE may crash when closing OlapTableSink (#1507)
The `_profile` in OlapTableSink may not be initialized if `prepare()`
method is not called. So when close the OlapTableSink, we should
check if `_profile` is initialized.
2019-07-19 10:30:44 +08:00
4e043e66e2 Modify the result json format of mini load (#1487)
Mini load is now using stream load framework. But we should keep the
mini load return behavior and result json format be same as old.
So PUBLISH_TIMEOUT error should be treated as OK in mini load.

Also add 2 counters for OlapTableSink profile:
SerializeBatchTime: time of serializing all row batch.
WaitInFlightPacketTime: time of waiting last send packet
2019-07-16 19:15:41 +08:00
0d48a3961c Refactor Storage Engine (#1478)
NOTE: This patch would modify all Backend's data.
And this will cause a very long time to restart be.
So if you want to interferer your product environment,
you should upgrade backend one by one.

1. Refactoring be is to clarify the structure the codes.
2. Use unique id to indicate a rowset.
   Nameing rowset with tablet_id and version will lead to
   many conflicts among compaction, clone, restore.
3. Extract an rowset interface to encapsulate rowsets
   with different format.
2019-07-15 21:18:22 +08:00
ae6f2d99c5 Fix bug when use SELECT * FROM TABLE LIMIT 1 (#1469) 2019-07-13 23:57:14 +08:00
aff1559c4d FixBug: if columns of doris table less than parquet file columns , BE will be crash (#1464) 2019-07-12 15:23:13 +08:00
b9c79d4b1b Fix importing non-parquet format file causing be crash (#1454) 2019-07-11 16:04:36 +08:00
51c92a0bec Validate the UTF-8 encode of loading data (#1457)
Currently, Doris only support UTF-8 encoded data. All data will be
shown to user in UTF-8 format. So if data loaded in Doris does not
UTF-8 encoded, user will see garbled data when querying.

I introduce a fast UTF-8 validator from

    https://github.com/lemire/fastvalidate-utf-8

This validator is highly optimized that it only takes 0.7 CPU cycles
to validata a 64k string. And by testing 1GB data load to Doris, the
validator has no impact on performance.
2019-07-11 09:46:38 +08:00
615c979727 Fix bug that BE crashes when inserting null value to non-nullable columns (#1447) 2019-07-10 09:20:09 +08:00
7eab12a40e Support reading Parquet file when loading data (#1173) 2019-07-01 18:39:27 +08:00
1422414e43 Add varchar column name to stream load error msg (#1366) 2019-06-24 14:52:59 +08:00
687d57be66 Fix bug that query statistics in audit log are wrong (#1354) 2019-06-21 19:16:05 +08:00
ba44249f80 Remove unused code (#1320) 2019-06-15 20:41:48 +08:00
30028bc35b Deny specify partition for unpartitioned table (#1319) 2019-06-15 18:19:56 +08:00
9d03ba236b Uniform Status (#1317) 2019-06-14 23:38:31 +08:00
53062122ea Change strategy of incorrect data (#1255)
This change adds a load property named strict_mode which is used to prohibit the incorrect data.
When it is set to false, the incorrect data will be loaded by NULL just like before.
When it is set to true, the incorrect data which belongs to a column without expr will be filtered.
The strict_mode is supported in broker load v2 now. It will be supported in stream load later.
2019-06-10 20:39:45 +08:00
e4e04e8203 Make LZO support optional (#1263) 2019-06-07 22:26:54 +08:00
934ca2481a Make MySQL support optional (#1248) 2019-06-05 12:28:15 +08:00
9f5f44ec48 Reduce memory RowBlock needed (#1238)
Before RowBlock will reserve memory for all columns in schema, even if
it is not queried. Which will cause bad performance when quering wide
table.

In this patch, RowBlock will reserve memory for needed columns. In a
case, this reduce ConvertBatchTime from 10s to 60ms when quering a wide
table who has 178 columns.

 #1236
2019-06-04 12:58:41 +08:00
85b4619d54 Change insert into to streaming (#1191)
The non-streaming hint of insert into will use the streamin plan which is same as the plan of stream insert.
It will also record the load info and return the label of insert stmt.
The partition is supportted in insert into stmt. The result which meet the target partitions will be loaded.
The introduction of example has been changed especially non-streaming insert.
Also, the param of partition_names is added in sql syntax which is used to declare the target partition_names in target table.

Change META_VERSION to 50
2019-05-23 20:53:30 +08:00
02f36c23ed Set tablet as bad when loading index failed (#1146)
Bad tablet will be reported to FE and be handled

And add a config auto_recover_index_loading_failure to control the index loading failure processing
2019-05-13 10:22:04 +08:00
79ab7f4413 Change label of broker load txn (#1134)
* Change label of broker load txn

1. put broker load label into txn label
2. fix the bug of `label is already used`
3. fix partition error of new broker load

* Fix count error in mini load and broker load

There are three params (num_rows_load_total, num_rows_load_filtered, num_rows_load_unselected) which are used to count dpp.norm.ALL and dpp.abnorm.ALL.
num_rows_load_total is the number rows of source file.
num_rows_load_unselected is the not satisfied (where conjuncts) rows of num_rows_load_total
num_rows_load_filtered is the rows (quality not good enough) of (num_rows_load_total-num_rows_load_unselected)
2019-05-10 16:53:46 +08:00
afa3aa9069 Add some pre-calculated metrics (#1079)
1. max io util of disks
2. max network send/receive bytes rate of all network devices
3. base/cumulative compaction request counter and failure counter
2019-04-30 11:12:23 +08:00
310a375aec Fix bug that null value is not correctly handled when loading data (#1070)
When partition column's value is NULL, it should be loaded into
    the partition which include MIN VALUE
2019-04-29 13:55:28 +08:00
9c82d41981 Support Doris query ES by HTTP way (#925) 2019-04-28 17:14:44 +08:00
b7b66527ce Fix some load bugs (#961)
1. Use load job's timeout as its txn timeout
2. Add a new session variable 'forward_to_master' for SHOW PROC and ADMIN stmt
2019-04-28 10:33:50 +08:00
2b4d02b2fa Add error load log url for routine load job (#938) 2019-04-28 10:33:50 +08:00
9d08be3c5f Add metrics for routine load (#795)
* Add metrics for routine load
* limit the max number of routine load task in backend to 10
* Fix bug that some partitions will no be assigned
2019-04-28 10:33:50 +08:00
8474061d63 Add some logs (#711) 2019-04-28 10:33:50 +08:00
0820a29b8d Implement the routine load process of Kafka on Backend (#671) 2019-04-28 10:33:50 +08:00
da308da17c Fix bug that empty stream load return unexpected error msg (#1052) 2019-04-28 09:36:19 +08:00
c0fbc84381 Fix bug that ScanBytes is when collect executing query's infos (#869) 2019-04-03 18:27:50 +08:00
348c61c69f Fix doris on es bug (#826)
* Get in pred from hybridset

* ignore new_filter_in when push down

* Ignore cast case in to_ext_literal
2019-03-28 12:54:17 +08:00
f4a63b29d8 Fix doris on es bug (#791) 2019-03-22 19:03:27 +08:00
c34b306b4f Decimal optimize branch #695 (#727) 2019-03-22 17:22:16 +08:00
11307b23c8 Fix bug: stream load ignore last line with no-newline (#785)
#783
2019-03-21 19:18:22 +08:00
7965a7129a Add esquery function (#652) 2019-03-08 09:27:41 +08:00
397747af2c Fix bug that push down the predicates past AggregateNode (#658) 2019-02-26 10:55:14 +08:00
aba1b9e5d6 Reopen the thrift client when got exception (#610)
To avoid broken connection being reused.
2019-01-31 16:54:49 +08:00
af445b6cc2 Optimize something (#607)
1. Unify the thrift rpc timeout from BE to FE.
    Add a BE config 'thrift_rpc_timeout_ms', default is 5000
2. Add hostname in "show proc '/frontends';" stmt result.
3. Fix a lock order bug in Load.java
2019-01-31 13:30:45 +08:00