13073 Commits

Author SHA1 Message Date
834834dc44 [SparkLoad] Avoid reading the whole Hive table when a WHERE clause is added (#5047)
When Spark load reads from a Hive table, the function loadDataFromHiveTable
reads the whole Hive table and only then filters the data in process().
If the Hive table has many partitions and a lot of historical data, the load costs too much time and too many resources.
So we apply the filter in loadDataFromHiveTable while reading from the Hive table.
Co-authored-by: 杜安明 <anming.du@mihoyo.com>
2020-12-15 09:26:42 +08:00
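The idea behind 834834dc44 (filter while reading instead of after reading) can be illustrated with a small Python sketch. This is not the Spark load code: `load_with_pushdown`, the dictionary-of-partitions table layout, and the two filter callbacks are hypothetical names chosen for illustration. The returned counter shows how many rows were actually read.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def load_with_pushdown(
    partitions: Dict[str, List[Row]],
    partition_filter: Callable[[str], bool],
    row_filter: Callable[[Row], bool],
) -> Tuple[List[Row], int]:
    """Read only the partitions that can match, filtering rows as they are read.

    Returns the selected rows and the number of rows that were read.
    """
    rows_read = 0
    selected: List[Row] = []
    for key, rows in partitions.items():
        if not partition_filter(key):
            continue  # this partition is never read at all
        for row in rows:
            rows_read += 1
            if row_filter(row):
                selected.append(row)
    return selected, rows_read
```

With a filter on the partition key, partitions holding historical data are skipped entirely, so the cost depends on the matching partitions rather than on the size of the whole table.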
ff4bd1223f [Profile] Add cpu time cost in query audit (#5051) 2020-12-13 22:22:15 +08:00
f847e22eeb [AuditLog] Send queryId to master FE (#5064)
To fix #4977, #4978 made the master FE return the queryId when a query finishes, so that the non-master FE can audit it.
But when the query fails (for example, on timeout), the client may not receive the right queryId for auditing.
In this PR:
The non-master FE sends the queryId to the master FE when forwarding a query;
Add more logs.
2020-12-13 22:05:35 +08:00
115d4332aa [ODBC] Support ODBC Sink for inserting data into ODBC external tables (#5033)
issue: #5031

1. Support ODBC Sink for inserting data into ODBC external tables.
2. Support transactions in the ODBC sink to make sure that inserting data is atomic.
3. Update the documentation about the ODBC sink.
2020-12-13 21:53:27 +08:00
1267d6bf66 [Bug][MultiLoad] Fix multiload missing userinfo and rebase error (#5058) 2020-12-11 12:01:32 +08:00
e47fb502b2 [Compatibility] Support embedded quotes in string literals (#5045)
```
mysql> select 'I''m a student';
+-----------------+
| 'I'm a student' |
+-----------------+
| I'm a student   |
+-----------------+

mysql> select "I""m a student";
+-----------------+
| 'I"m a student' |
+-----------------+
| I"m a student   |
+-----------------+

mysql> select 'I""m a student';
+------------------+
| 'I""m a student' |
+------------------+
| I""m a student   |
+------------------+

mysql> select "I''m a student";
+------------------+
| 'I''m a student' |
+------------------+
| I''m a student   |
+------------------+
```
2020-12-10 21:34:06 +08:00
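The quoting rule shown in e47fb502b2 can be sketched in Python: a doubled quote character inside a literal stands for one quote, but only when it matches the delimiter of the literal. `unquote_sql_literal` is a hypothetical helper, not the Doris parser, and it deliberately ignores backslash escapes.

```python
def unquote_sql_literal(literal: str) -> str:
    """Return the string value of a quoted SQL literal.

    Assumption: only doubled-delimiter escapes are handled; backslash
    escapes are out of scope for this sketch.
    """
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError("not a quoted string literal")
    quote = literal[0]
    body = literal[1:-1]
    # A doubled delimiter collapses to one; the other quote kind is kept as-is.
    return body.replace(quote * 2, quote)
```

Applied to the four statements in the commit message, the helper yields `I'm a student`, `I"m a student`, `I""m a student`, and `I''m a student` respectively.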
bc063ebce2 fix typo in docs (#5046) 2020-12-10 15:10:22 +08:00
e278e0b3db [Load] Support full StreamLoad feature in multiload (#4717) 2020-12-10 09:37:18 +08:00
ca9e5c4785 [Bug] Add a flag to prevent repeated close operation of OlapTabletSink (#5034)
The close method of OlapTabletSink may be called twice.
In the open_internal() method of plan_fragment_executor, close is called once.
If an error occurs in this call, it will be called again in fragment_mgr.
So here we use a flag to prevent repeated close operations.

Co-authored-by: morningman <chenmingyu@baidu.com>
2020-12-09 09:30:09 +08:00
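The guard described in ca9e5c4785 is the usual idempotent-close pattern. A minimal Python sketch follows; the class `TabletSink` and its `release_count` attribute are hypothetical stand-ins, not the actual C++ sink.

```python
class TabletSink:
    """Illustrative sink whose close() is safe to call more than once."""

    def __init__(self) -> None:
        self._closed = False
        self.release_count = 0  # how many times resources were really released

    def close(self) -> None:
        if self._closed:
            return  # a second call (for example from an error path) is a no-op
        self._closed = True
        self.release_count += 1  # stands in for the real cleanup work
```

Setting the flag before doing the cleanup means that a failure during cleanup does not cause the cleanup to be attempted again by a later caller.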
56fd82ffa1 [Doc] Fix enable_strict_storage_medium_check description (#5023) 2020-12-09 09:29:37 +08:00
f2d69a51d4 [Docs]Remove some unused variables and update BE config documents (#4987)
Remove some unused variables and update BE config documents about compaction.
2020-12-09 09:28:56 +08:00
49f7eb69bf [Refactor] Refactor DeleteHandler and Cond module (2nd) (#5030)
* [Refactor] Refactor DeleteHandler and Cond module (#4925)

This patch mainly does the following refactoring:
- Use int64_t instead of int32_t for 'version' in DeleteHandler
- Move some comments from .cpp to .h file, add some new comments in .h files, and also remove some meaningless comments
- Use switch...case... instead of multiple if..else.. for DeleteConditionHandler::is_condition_value_valid
- Use range loop to simplify code
- Reduce some compare operations in Cond::del_eval
- Improve some branch predictions in Reader
- Fix and improve some unit tests
2020-12-08 10:01:18 +08:00
2dbcb726ac [Bug] Fix bug that failed to write meta image of load job (#5029)
In #4863, we added userInfo to the load job, but the userInfo must be analyzed
so that it can be written to the image.
2020-12-08 10:00:42 +08:00
eb0cb04a70 Fix a core dump introduced by pr #5022 (#5032)
* fix a core dump caused by pr #5022
2020-12-08 10:00:07 +08:00
3bd56bd441 fix Get FE log file doc typo (#4985) 2020-12-08 07:07:13 +08:00
b9dabc3b5b [Enhance] Push down predicate on value column of unique table to base rowset (#5022) 2020-12-06 08:50:37 +08:00
6021d6fc7f [Performance Optimization] Remove push down conjuncts in olap scan node (#4999)
Push conjuncts down to the storage engine as much as possible.

The OLAP scan node does not need to filter data with the pushed-down conjuncts again.

fix #4986
2020-12-06 08:50:08 +08:00
b954dfd82d [Bug] Fix the bug that JSON load of Largeint and Decimal failed. (#4983)
Use the JSON load parameter "num_as_string" to enable the flag kParseNumbersAsStringsFlag when parsing JSON data.
2020-12-06 08:49:30 +08:00
b1b99ae884 [Function] Support Decimal to calculate variance and standard deviation (#4959) 2020-12-06 08:49:01 +08:00
42dd821021 [Refactor] Private constructor for singleton (#4956) 2020-12-06 08:47:29 +08:00
c440aa07d1 Revert "[Refactor] Refactor DeleteHandler and Cond module (#4925)" (#5028)
This reverts commit 9c9992e0aa28ee85364eebf86a6675f1073e08fb.

Co-authored-by: morningman <chenmingyu@baidu.com>
2020-12-05 21:39:49 +08:00
c5f780305e [Repair] Add an option whether to allow the partition column to be NULL (#5013) 2020-12-05 14:58:32 +08:00
9c9992e0aa [Refactor] Refactor DeleteHandler and Cond module (#4925)
This patch mainly does the following refactoring:
- Use int64_t instead of int32_t for 'version' in DeleteHandler
- Move some comments from .cpp to .h file, add some new comments in .h files, and also remove some meaningless comments
- Use switch...case... instead of multiple if..else.. for DeleteConditionHandler::is_condition_value_valid
- Use range loop to simplify code
- Reduce some compare operations in Cond::del_eval
- Improve some branch predictions in Reader
- Fix and improve some unit tests
2020-12-04 12:13:30 +08:00
1f236a5339 [BUG] Fix core when schema change (#5018) 2020-12-04 09:53:19 +08:00
8823f2d928 [Bug] Fix incorrect name of TaskWorkerPool (#5015)
'_task_worker_type' is not properly initialized when it is used to initialize '_name',
so '_name' is always 'TaskWorkerPool.CREATE_TABLE'. This patch fixes
this bug.
2020-12-04 09:30:23 +08:00
1ae6de7117 [Enhance] Add "statistics" meta table and fix some mysql compatibility problem (#4991)
1. Add metadata table 'statistics' to store index information;
2. In the header information returned by mysql, the data type length is returned according to the actual type.
2020-12-03 09:38:18 +08:00
bd558f1895 [Doris][Doris On ES] support prefix @ symbol for column name (#5006)
Support column names with a leading `@`, such as:

```
CREATE EXTERNAL TABLE `es_10` (
  `@k3` bigint(20) NULL COMMENT "",
  `@k1` boolean NULL COMMENT "",
  `@k2` varchar(20) NULL COMMENT ""
) ENGINE=ELASTICSEARCH
COMMENT "ELASTICSEARCH"
PROPERTIES (
"hosts" = "ip:port",
"user" = "root",
"password" = "",
"index" = "data_type_test",
"type" = "doc",
"transport" = "http"
); 
```
2020-12-03 09:33:49 +08:00
5215727b45 [Function] Let "str_to_date" return correct type (#5004)
The return type of str_to_date depends on whether the time part is included in the format.
If included, it is DATETIME, otherwise it is DATE.
If the format parameter is not constant, the return type will be DATETIME.
The above judgment has been completed in the FE query planning stage,
so here we directly set the value type to the return type set in the query plan.

For example:
A table has one varchar column k1 and 2 rows:
    "%Y-%m-%d"
    "%Y-%m-%d %H:%i:%s"
Query:
    SELECT str_to_date("2020-09-01", k1) from tbl;
Result will be:
    2020-09-01 00:00:00
    2020-09-01 00:00:00

Query:
     SELECT str_to_date("2020-09-01", "%Y-%m-%d");
Return type is DATE

Query:
     SELECT str_to_date("2020-09-01", "%Y-%m-%d %H:%i:%s");
Return type is DATETIME
2020-12-03 09:33:26 +08:00
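The type rule stated in 5215727b45 can be written down as a small Python sketch. `str_to_date_return_type` is a hypothetical helper, and the list of time specifiers is an assumption based on the MySQL date format specifiers; the set that Doris actually checks may differ. A non-constant format is modeled as `None`.

```python
from typing import Optional

# Assumed set of format specifiers that denote a time part.
TIME_SPECIFIERS = ("%H", "%h", "%I", "%i", "%k", "%l", "%p", "%r", "%S", "%s", "%T", "%f")


def str_to_date_return_type(fmt: Optional[str]) -> str:
    """Return "DATE" or "DATETIME" for a str_to_date call.

    fmt is None when the format argument is not a constant (for example,
    a column reference); in that case the result is always DATETIME.
    """
    if fmt is None:
        return "DATETIME"
    has_time_part = any(spec in fmt for spec in TIME_SPECIFIERS)
    return "DATETIME" if has_time_part else "DATE"
```

This matches the examples in the commit message: `"%Y-%m-%d"` gives DATE, `"%Y-%m-%d %H:%i:%s"` gives DATETIME, and the column `k1` gives DATETIME for every row.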
204c15119f [Bug] ConcurrentModificationException when finish transaction (#5003) 2020-12-03 09:33:04 +08:00
92db00bd86 [Bug] Fix concurrent access of _tablets_under_clone in TabletManager (#5000)
_tablets_under_clone in TabletManager is not sharded, but the lock
used to prevent concurrent access is sharded, so when the number of shards
is not 1, concurrent access causes a core dump.
This patch fixes this bug and also refactors the shard
locks to make them more convenient to use.
2020-12-03 09:32:44 +08:00
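The problem in 92db00bd86 is that a shared container was guarded by per-shard locks, so two threads holding different shard locks could modify it at the same time. One way to avoid this, sketched below in Python, is to keep each piece of state inside the shard whose lock guards it. This is an illustration only: `Shard`, `ShardedCloneTracker`, and the modulo sharding rule are hypothetical, and the commit message does not say which approach the patch takes.

```python
import threading
from typing import List, Set


class Shard:
    """One shard: a lock plus the state that this lock protects."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.tablets_under_clone: Set[int] = set()


class ShardedCloneTracker:
    def __init__(self, shard_count: int) -> None:
        self._shards: List[Shard] = [Shard() for _ in range(shard_count)]

    def _shard_for(self, tablet_id: int) -> Shard:
        # Assumed sharding rule: the tablet id picks both the lock and the set.
        return self._shards[tablet_id % len(self._shards)]

    def register(self, tablet_id: int) -> None:
        shard = self._shard_for(tablet_id)
        with shard.lock:
            shard.tablets_under_clone.add(tablet_id)

    def unregister(self, tablet_id: int) -> None:
        shard = self._shard_for(tablet_id)
        with shard.lock:
            shard.tablets_under_clone.discard(tablet_id)

    def is_under_clone(self, tablet_id: int) -> bool:
        shard = self._shard_for(tablet_id)
        with shard.lock:
            return tablet_id in shard.tablets_under_clone
```

The alternative is to keep a single unsharded set and protect it with one dedicated lock; either way, every container must be covered by exactly one lock.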
4fa47bc3f5 [Docs] Add instructions for converting between dynamic and manual partition tables (#4994)
There are no clear instructions on manually modifying partitions when the dynamic partition feature is enabled.
The user is informed only after trying to modify the partition on the command line.
This PR adds instructions for converting dynamic and manual partition tables to each other.
2020-12-03 09:32:30 +08:00
b4c1eabe3f [Bug] Fix finished load jobs costing too much heap (#4993)
Since the plan is retained in the task, if the task is not cleaned up, memory usage grows too large and causes a memory leak or OOM.
When a load job finishes, there is no need to hold the tasks, which are the biggest memory consumers.
Fixed #4992
2020-12-02 17:11:27 +08:00
af06adb57f [Doris On ES][Bug-fix] fix boolean predicate pushdown manner (#4990)
Correctly handle `boolean` field predicates by setting the predicate value to `true`, `false`, or `empty set` for DOE.
2020-12-02 10:13:13 +08:00
df1f06e60b Optimize the read performance of tables with multiple versions (#4958)
* Optimize the read performance of tables with multiple versions:
change the merge method of the unique table so that
the cumulative version data is merged first and then merged with the base version.
Data with only one base version is read directly without merging.
2020-12-01 12:25:11 +08:00
99404df8b2 [Bug][Compaction] Fix bug that output rowset is not deleted after compaction failure (#4964)
This CL fixes 2 bugs:

1. When the compaction fails, we must explicitly delete the output rowset,
otherwise the GC logic cannot process these rows.

2. Base compaction failed if the compaction process included some delete versions in SegmentV2,
because the number of filtered rows was wrong.
2020-11-30 22:02:03 +08:00
ec7e1c6b1b [Refactor] Execute 'pick rowsets' before applying for permits for a compaction task (#4891)
In the current compaction mechanism, a producer thread keeps producing compaction tasks,
and each selected tablet must apply for `permits`.
When a tablet holds `permits`, the compaction task for this tablet is submitted to the thread pool.
We take the compaction score as `permits`, which is used to limit memory consumption.
However, `pick_rowset_to_compaction()` is executed before the file merge in the compaction thread,
and the number of segment files that actually take part in the merge is smaller than the compaction score.
In addition, a compaction task may exit directly because the tablet doesn't meet
the requirements of compaction.

This patch optimizes and refactors the compaction code so that we execute 'pick rowsets'
before applying for permits for a compaction task, calculate the number of segment files that actually
participate in the merge operation, and take this number as `permits`.
2020-11-30 11:41:14 +08:00
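The ordering introduced by ec7e1c6b1b (pick rowsets first, then request permits equal to the number of segments that will really be merged) can be sketched in Python. All names here (`Rowset`, `PermitPool`, `submit_compaction`, the `pick` callback) are hypothetical, and the picking policy itself is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rowset:
    version: int
    num_segments: int


class PermitPool:
    """Illustrative permit counter that bounds concurrent merge work."""

    def __init__(self, total: int) -> None:
        self.available = total

    def try_acquire(self, permits: int) -> bool:
        if permits > self.available:
            return False
        self.available -= permits
        return True

    def release(self, permits: int) -> None:
        self.available += permits


def submit_compaction(
    candidates: List[Rowset],
    pick: Callable[[List[Rowset]], List[Rowset]],
    pool: PermitPool,
) -> Optional[Tuple[List[Rowset], int]]:
    """Pick rowsets first, then request exactly as many permits as needed."""
    picked = pick(candidates)
    if not picked:
        return None  # nothing to compact, so no permits are requested
    permits = sum(rowset.num_segments for rowset in picked)
    if not pool.try_acquire(permits):
        return None  # not enough permits right now; the caller may retry later
    return picked, permits
```

Because the permit count is derived from the picked rowsets, a tablet that turns out to have nothing to compact never occupies permits, and a tablet with a high score but few mergeable segments occupies only what it uses.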
27ef5b4d2c [Bug] Use the right queryId to audit master only query in non master (#4978)
Add queryId in TMasterOpResult.
Audit it in non master FE.
2020-11-29 11:14:17 +08:00
bb36de52a6 [Bug] Fix locate bug when start_pos larger than str len (#4975)
```
select locate('', 'abc', 10); 
```
Returns 0, not 10.
2020-11-29 10:38:30 +08:00
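The behavior fixed in bb36de52a6 can be modeled with a short Python function. This is a sketch of the semantics stated in the commit message (a start position larger than the string length yields 0), not the BE implementation; treating non-positive start positions as 0 is an additional assumption.

```python
def locate(substr: str, s: str, start_pos: int = 1) -> int:
    """Return the 1-based position of substr in s at or after start_pos, or 0.

    Assumption: any start_pos outside 1..len(s) yields 0, including for an
    empty substr.
    """
    if start_pos <= 0 or start_pos > len(s):
        return 0
    index = s.find(substr, start_pos - 1)  # str.find returns -1 when absent
    return index + 1
```

With this rule, `locate('', 'abc', 10)` is 0, while an empty `substr` with a valid start position still returns that start position.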
d7225d61ef [CodeFormat] Add clang-format script (#4934)
run build-support/check-format.sh to check cpp styles;
run build-support/clang-format.sh to fix cpp style issues;
2020-11-28 18:40:06 +08:00
6fedf5881b [CodeFormat] Clang-format cpp sources (#4965)
Clang-format all c++ source files.
2020-11-28 18:36:49 +08:00
f944bf4d44 [Compile][Bug] Fix FE compilation bug (#4979)
[Bug] Fix the compile failure "cannot find symbol" for variable scanRangeLength, introduced by #4914, #4912
2020-11-28 16:19:54 +08:00
4c63dc0027 [Metric] Add metrics for compaction permits and log for compaction merge (#4893)
1. Add metrics for `used permits` and `waitting permits` of compaction.
They are useful for monitoring the `permits` held by all executing compaction tasks and by waiting compaction tasks.

2. Add a log for rowset merging, which can be enabled by config.
It helps track the process of rowset merging for a compaction task that lasts a long time.
2020-11-28 10:00:08 +08:00
f1248cb10e [BUG] Fix colocate balance bug when there is decommissioned be (#4955)
We should ignore decommissioned BEs when selecting BEs to balance the group bucketSeq.
2020-11-28 09:59:25 +08:00
2e9c8dda04 [Doris On ES][Bug-Fix] fix problem for selecting random be (#4972)
1.  Random().nextInt() may return a negative value, which would result in `java.lang.ArrayIndexOutOfBoundsException`;
passing a positive bound avoids this problem.

```
int seed = new Random().nextInt(Short.MAX_VALUE) % nodesInfo.size()
```

2.  `EsNodeInfo[] nodeInfos = (EsNodeInfo[]) nodesInfo.values().toArray()` may lead to `java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.doris.external.elasticsearch.EsNodeInfo` in some JDK versions; passing an array of the original class type resolves this.

```
EsNodeInfo[] nodeInfos = nodesInfo.values().toArray(new EsNodeInfo[0]);
```
2020-11-28 09:57:44 +08:00
2331ce10f1 [Bug]Parquet map/list/struct structure recognize (#4968)
When a parquet file contains a `Map/List/Struct` structure, Doris cannot recognize the columns correctly
and throws the exception 'Invalid column: xxxx', which means Doris cannot find the column.
The `Map` structure is recognized as two columns: `key` and `value`.
The following is the schema of a parquet file recognized by Doris. This patch tries to solve this problem.
2020-11-28 09:56:29 +08:00
cb749ce51d [Improvement] Add parquet file name to the error message (#4954)
When a user tries to load parquet files into Doris from a path like `hdfs://hadoop/user/data/date=20201024/*`,
but the path actually contains some non-parquet files, this error is thrown:
`Couldn't deserialize thrift: No more data to read.\\nDeserializing page header failed.`.
If the error message includes the file name, we can quickly locate the error.
Therefore, this patch adds the file name to the error message.
2020-11-28 09:54:18 +08:00
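The improvement in cb749ce51d amounts to attaching the file name to a low-level read error before it propagates. A minimal Python sketch of that pattern follows; `FileReadError`, `read_all`, and the `read_one` callback are hypothetical names, and the real change lives in the parquet reader of the BE.

```python
from typing import Callable, Iterable, List


class FileReadError(Exception):
    """Read failure annotated with the name of the offending file."""


def read_all(paths: Iterable[str], read_one: Callable[[str], object]) -> List[object]:
    """Read every file, re-raising failures with the file name attached."""
    results = []
    for path in paths:
        try:
            results.append(read_one(path))
        except Exception as exc:
            # Keep the original message and add the context that was missing.
            raise FileReadError(f"{exc} (file: {path})") from exc
    return results
```

When a wildcard path matches many files, the annotated message points directly at the one file that is not valid parquet.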
c6bc30e375 [Bug] Fix httpv2 append extra useless information in get_small_file api (#4953) 2020-11-28 09:52:52 +08:00
55ce88da34 [Schema change] Support More column type in schema change (#4938)
1. Support modifying column type CHAR to TINYINT/SMALLINT/INT/BIGINT/LARGEINT/FLOAT/DOUBLE/DATE,
and converting TINYINT/SMALLINT/INT/BIGINT/LARGEINT/FLOAT/DOUBLE to a wider range of numeric types (#4937)

2. Use templates to refactor the code of types.h and schema_change.cpp and delete redundant code.
2020-11-28 09:52:28 +08:00
3b56b601fb Show fe commit hash on proc (#4943)
Show FE's commit hash in SHOW PROC "/frontends" result.
2020-11-28 09:50:48 +08:00
0493eb172f [Optimize] optimize host selection strategy (#4914)
When a tablet selects which replica's host executes the scan operation,
it uses a `round-robin` strategy for load balancing. `minAssignedBytes` is the current load of one host.
If a backend is momentarily not alive, one of the other replicas is chosen at random,
but the dead backend's `minAssignedBytes` is not decreased and the new choice's `minAssignedBytes`
is not increased either. That makes the recorded load of the backends incorrect.
2020-11-28 09:48:13 +08:00
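The accounting problem described in 0493eb172f suggests moving the assigned bytes from the dead backend to the replacement when a fallback happens. The Python sketch below illustrates that bookkeeping under stated assumptions: `reassign_scan` is a hypothetical function, the preferred host is assumed to have already been charged `scan_bytes`, and the way the preferred host is chosen is outside the sketch.

```python
import random
from typing import Dict, List, Optional, Set


def reassign_scan(
    preferred: str,
    replicas: List[str],
    alive: Set[str],
    assigned_bytes: Dict[str, int],
    scan_bytes: int,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the host that will run the scan, keeping the load map accurate.

    Assumption: assigned_bytes[preferred] already includes scan_bytes.
    """
    if preferred in alive:
        return preferred
    candidates = [host for host in replicas if host in alive and host != preferred]
    if not candidates:
        raise RuntimeError("no alive replica for this scan range")
    chosen = (rng or random).choice(candidates)
    # Move the load from the dead backend to the one that really does the work.
    assigned_bytes[preferred] = assigned_bytes.get(preferred, 0) - scan_bytes
    assigned_bytes[chosen] = assigned_bytes.get(chosen, 0) + scan_bytes
    return chosen
```

Without the two bookkeeping lines, later decisions would keep treating the dead backend as loaded and the replacement as idle.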