Fix of PR #23582
Some FE code was deleted by [Improvement](pipeline) Cancel outdated query if original fe restarts #23582 and needs to be added back.
Fix macOS build failure caused by a wrong Thrift declaration order.
The reason is that the SQL cache only uses partitionKey, latestVersion and latestTime to decide whether the cached result can be returned. If we delete some partition(s) that are not the latest updated partition, none of these values change, so the cache still hits.
Add a field to record the partition count of each of these tables, sum the partition counts and send the sum to BE. There are two situations that involve delete-partition ops:
- Only partitions are deleted, so the summed partition count becomes lower than before.
- Partition deletion coexists with partition addition, so the latest time or latest version becomes higher than before.
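As a rough illustration, here is a minimal sketch of the stale-cache scenario (table and partition names are hypothetical, and it assumes the `enable_sql_cache` session variable):
```
SET enable_sql_cache = true;
SELECT count(*) FROM sales;             -- result is cached along with partitionKey/latestVersion/latestTime
ALTER TABLE sales DROP PARTITION p_old; -- p_old is not the latest-updated partition
SELECT count(*) FROM sales;             -- before: stale cached count; now the summed partition num changes and the cache misses
```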
New structure for delete sub predicate.
The delete sub predicate currently uses a string `condition_str` for temporary storage, and fields are extracted from it with std::regex, which may cause a stack overflow when matching an extremely large string (a libc bug).
We now use a new PB structure to hold the delete sub predicate, to avoid that problem.
message DeleteSubPredicatePB {
optional int32 column_unique_id = 1;
optional string column_name = 2;
optional string op = 3;
optional string cond_value = 4;
}
Currently, both versions of the sub predicate are filled. Queries use v2, while compaction still uses v1. Old rowset metas whose delete predicates contain v1 sub predicates will be converted to v2 when read from PB, and efforts will also be made to rewrite these metas with the new delete sub predicate.
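For illustration, a single delete condition like the one below (table and column names are hypothetical) would be stored roughly as the string `k1='abc'` in v1 and parsed back with std::regex, while v2 keeps it as a structured `DeleteSubPredicatePB` with `column_name = "k1"`, `op = "="`, `cond_value = "abc"` and the column unique id:
```
DELETE FROM tbl PARTITION p1 WHERE k1 = 'abc';
```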
Make preparations to use the column unique id to identify a column globally.
Using the column unique id rather than the column name to identify a column is vital for flexible schema change. The rewritten delete predicate will carry the column unique id.
1. Do not split compressed data files
Some data files in Hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.
2. Support lz4 block codec
For the Hive scan node, use the lz4 block codec instead of the lz4 frame codec.
3. Support snappy block codec
For Hadoop snappy.
4. Optimize the `count(*)` query over CSV files
For queries like `select count(*) from tbl`, only the lines need to be split, not the columns.
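A minimal sketch of the `count(*)` fast path (catalog and table names are hypothetical): only line delimiters are scanned and columns are not split; gzip/deflate compressed files are additionally read as a single split because they cannot be divided:
```
SELECT count(*) FROM hive_ctl.db1.csv_tbl;
```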
Needs to be picked to branch-2.0 after this PR: #22304
Sometimes the partitions of a Hive table may be on different storage; e.g., some are on HDFS and others are on object storage (COS, etc.).
This PR mainly changes:
1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in TFileRangeDesc
This is because, when accessing a file, the BE gets an HDFS client from the HDFS client cache, and different files in one query request may have different fs names, e.g., some are `hdfs://` and some are `cosn://`. So we need to specify the fs name for each file; otherwise, it may return an error like:
`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://172.xxxx:4007`
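A hypothetical layout that triggers this (locations and names are made up): partitions of one table live on two file systems, so a single query scans both `hdfs://` and `cosn://` files and each `TFileRangeDesc` now carries its own `fs_name`:
```
-- PARTITION (dt='2023-08-01') LOCATION 'hdfs://namenode:4007/warehouse/tbl1/dt=2023-08-01'
-- PARTITION (dt='2023-08-02') LOCATION 'cosn://my-bucket/warehouse/tbl1/dt=2023-08-02'
SELECT count(*) FROM hive_ctl.db1.tbl1 WHERE dt >= '2023-08-01';
```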
* support int128 in jsonb
* fix jsonb int128 write
* fix jsonb to json int128
* fix json functions for int128
* add nereids function jsonb_extract_largeint
* add testcase for json int128
* change docs for json int128
* add nereids function jsonb_extract_largeint
* clang format
* fix check style
* using int128_t = __int128_t for all int128
* use fmt::format_to instead of snprintf digit by digit for int128
* clang format
* delete useless check
* add warn log
* clang format
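A minimal usage sketch of the new large-int support, assuming `jsonb_extract_largeint` follows the same `(json_doc, json_path)` argument pattern as the existing `jsonb_extract_*` functions:
```
-- 2^127 - 1, the largest value an int128 can hold
SELECT jsonb_extract_largeint('{"id": 170141183460469231731687303715884105727}', '$.id');
```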
Iceberg has its own metadata, which includes count statistics for table data. If the table does not contain equality deletes, we can get the row count of the current table directly from the count statistics.
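A sketch of the intended effect (catalog and table names are hypothetical): when the Iceberg table has no equality deletes, a query like the following can be answered from the count statistics in the metadata without scanning data files:
```
SELECT count(*) FROM iceberg_ctl.db1.tbl1;
```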
## Proposed changes
Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385
## Further comments
2023-08-09:
It's a pity, but experiments show that the original way of parsing plain CSV is faster. Therefore, the refactor is only applied to the enclose-related code; the plain CSV parser keeps the original logic.
Some performance fallback is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the column-writing behavior, as shown by the flame graph.
Trimming the escape character will be enabled after fix #22411 is merged.
Cases that should be discussed:
1. When an incomplete enclose appears at the beginning of large-scale data, the line delimiter will be unreachable until EOF; will the buffer become extremely large?
2. What if an infinite line occurs in this case? Essentially, `1.` is equivalent to this.
Only stream load is supported as a trial in this PR, to avoid too many unrelated changes. Docs will be added when `enclose` and `escape` are available for all kinds of load.
Assume that there is a Hive catalog named hive_ctl, a Hive db named db1 and a table named tbl1. If we connect to a slave FE and execute the following commands:
1. `switch hive_ctl`
2. `show partitions from db1.tbl1`
Then we will get an error like this:
```
MySQL [(none)]> show partitions from db1.tbl1;
ERROR 1049 (42000): errCode = 2, detailMessage = Unknown database 'default_cluster:db1'
```
The reason is that the slave FE forwards the `ShowPartitionStmt` to the master FE, but we do not sync the default catalog information, so the parser cannot find the db and throws this exception. This is just one case; some other similar cases will fail too.
Truncate char or varchar columns if their size is smaller than that of the file columns, or if they are not found in the file column schema, controlled by the session variable `truncate_char_or_varchar_columns`.
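A usage sketch (catalog and table names are hypothetical): with the variable enabled, char/varchar values read from external files are cut down to the length declared for the Doris column:
```
SET truncate_char_or_varchar_columns = true;
SELECT * FROM hive_ctl.db1.tbl1;
```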
This PR was originally #16940, but it had not been updated for a long time by the original author @Cai-Yao. For now, we will merge part of the code into master first.
thanks @Cai-Yao @yiguolei
Optimization "select count(*) from table" stmtement , push down "count" type to BE.
support file type : parquet ,orc in hive .
1. 4k files, 600 million (60kw) rows
before: 1 min 37.70 sec
after: 50.18 sec
2. 50 files, 600 million (60kw) rows
before: 1.12 sec
after: 0.82 sec
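A sketch of the optimized path (catalog and table names are hypothetical): with the "count" type pushed down, BE answers the query from the Parquet/ORC row-count metadata instead of materializing any column:
```
SELECT count(*) FROM hive_ctl.db1.parquet_tbl;
```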
In this PR, we introduce the TOKENIZE function for inverted index. It is used as follows:
```
SELECT TOKENIZE('I love my country', 'english');
```
It takes two arguments: the first is the text to be tokenized, and the second is the parser type, which can be **english**, **chinese** or **unicode**.
It can also be used with an existing table, like this:
```
mysql> SELECT TOKENIZE(c,"chinese") FROM chinese_analyzer_test;
+---------------------------------------+
| tokenize(`c`, 'chinese') |
+---------------------------------------+
| ["来到", "北京", "清华大学"] |
| ["我爱你", "中国"] |
| ["人民", "得到", "更", "实惠"] |
+---------------------------------------+
```