doris

Author	SHA1	Message	Date
Qi Chen	73ad885e19	[Feature][Fix](multi-catalog) Implements transactional hive full acid tables. (#20679 ) Following the support for insert-only transactional hive tables in #19518 and #19419, this PR supports transactional hive full acid tables. Hive3 transactional hive full acid tables are supported. Hive2 transactional hive full acid tables need to run major compactions.	2023-06-13 08:55:16 +08:00
jiawei liang	99c0592157	[Feature](array-function) Support array_pushback function #17417 (#19988 ) Implement array_pushback. mysql> select array_pushback([1, 2], 3); +--------------------------------+ \| array_pushback(ARRAY(1, 2), 3) \| +--------------------------------+ \| [1, 2, 3] \| +--------------------------------+ 1 row in set (0.01 sec)	2023-06-12 16:51:12 +08:00
yiguolei	a6f625676b	[profile](remove child) child is for node, should not be used to organize counters (#20676 ) Currently, many profiles use add child profile to organize the profile into blocks, but this is wrong: a child profile has a total time counter. What we should actually use is just a label. - MemoryUsage: - HashTable: 23.98 KB - SerializeKeyArena: 446.75 KB Add a new macro ADD_LABEL_COUNTER to add just a label in the profile. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-06-12 10:00:35 +08:00
Ashin Gau	9a83d78dfe	[Enhancement](hudi) support hudi mor table, step2 follow #19909 (#20570 ) PR(https://github.com/apache/doris/pull/19909) has implemented the framework of the hudi reader for MOR tables. This PR completes all functions for reading MOR tables and enables end-to-end queries. Key Implementations: 1. Use hudi meta information to generate the table schema, not the hive client. 2. Use the hive client to list hudi partitions, so it strongly depends on the sync-tools(https://hudi.apache.org/docs/syncing_metastore/) which sync the partitions of hudi into the hive metastore. However, we may get the hudi partitions directly from the .hoodie directory. 3. Remove `HudiHMSExternalCatalog`, because other catalogs like glue are compatible with the hive catalog. 4. Read the COW table originally from c++. 5. Hudi RecordReader will use ProcessBuilder to start a hotspot debugger process, which may get stuck when attaching to the original JNI process, so I use a tricky method to kill this useless process.	2023-06-10 12:25:53 +08:00
lihangyu	fa785f3b24	[chore](proto) make some `required` fields `optional` for compability (#20609 )	2023-06-09 08:51:01 +08:00
Jack Drogon	03cb69c0ee	[feature](backup-restore) Add local backup/restore not upload/download by broker (#20492 )	2023-06-07 21:35:15 +08:00
zhengyu	09344eaab5	[feature](load) introduce single-stream-multi-table load (#20006 ) For routine load (kafka load), users can produce all data for different tables into a single topic, and doris will dispatch the data to the corresponding tables. Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>	2023-06-07 17:55:25 +08:00
yuxuan-luo	fe63a0a3bb	[Feature](multi-catalog)support paimon catalog (#19681 ) CREATE CATALOG paimon_n2 PROPERTIES ( "dfs.ha.namenodes.HDFS1006531" = "nn2,nn1", "dfs.namenode.rpc-address.HDFS1006531.nn2" = "172.16.65.xx:4007", "dfs.namenode.rpc-address.HDFS1006531.nn1" = "172.16.65.xx:4007", "hive.metastore.uris" = "thrift://172.16.65.xx:7004", "type" = "paimon", "dfs.nameservices" = "HDFS1006531", "hadoop.username" = "hadoop", "paimon.catalog.type" = "hms", "warehouse" = "hdfs://HDFS1006531/data/paimon1", "dfs.client.failover.proxy.provider.HDFS1006531" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" );	2023-06-06 15:08:30 +08:00
TengJianPing	1b94b6368f	[fix](load) in strict mode, return error for insert if datatype convert fails (#20378 ) * [fix](load) in strict mode, return error for load and insert if datatype convert fails Revert "[fix](MySQL) the way Doris handles boolean type is consistent with MySQL (#19416)" This reverts commit 68eb420cabe5b26b09d6d4a2724ae12699bdee87. Since it changed other behaviours, e.g. in strict mode, insert into t_int values ("a") results in 0 being inserted into the table, but it should return an error instead. * fix be ut * fix regression tests	2023-06-06 12:04:03 +08:00
wangbo	65100d8083	[improvement](profile)add max/min rpc time (#20339 )	2023-06-06 12:03:01 +08:00
Yang, Xu	d02737a293	[feature](struct-type) support struct_element function (#19045 ) This commit adds a function that returns a field column from a named struct column. Since the function can return any type, this commit also supports ANY_STRUCT_TYPE and ANY_ELEMENT_TYPE.	2023-06-06 10:44:08 +08:00
slothever	b7fc17da68	[feature-wip](multi-catalog)(step2)support read max compute data by JNI (#19819 ) Issue Number: #19679	2023-06-05 22:10:08 +08:00
luozenglin	1cb4c7bc51	[enhancement](function) Compatible with python 2.6 and keep the code style consistent (#20429 )	2023-06-05 15:33:38 +08:00
amory	59a0f80233	[Improve](array-function)Improve array function intersect (#20085 ) Currently the array intersect function supports only 2 arrays, but the intersect operator can support more than 2 arrays.	2023-06-05 10:38:48 +08:00
Pxl	8e39f0cf6b	[Enchancement](Agg State) storage function name and result is nullable in agg state type (#20298 ) Store the function name and whether the result is nullable in the agg state type.	2023-06-04 22:44:48 +08:00
Kang	ffadaa4935	[improvement](inverted index) skip write index on load and generate index on compaction (#20325 )	2023-06-03 16:03:21 +08:00
zy-kkk	9c9f5fec0f	[chore](function) Refactor FunctionSet Initialization for Better Maintainability and Compilation Success (#20285 ) In this PR, I have refactored the initialization of the FunctionSet. Previously, all the functions were in one large method which led to the generation of Java code that was too long. This posed a problem for the compiler, as the length of the method exceeded the limit imposed by the Java compiler. To resolve this issue and improve the readability and manageability of our code, I have categorized these functions by type, and created dedicated initialization methods for each type. As such, our code is now not only more readable and understandable, but also each method is of a length that is acceptable to the compiler and can be compiled successfully. Moreover, this change makes it easier for us to add new functions as we can directly locate the right category and add new functions there. This is a significant change aimed at enhancing the maintainability and scalability of our code, while ensuring that our code can be successfully compiled.	2023-06-02 17:50:47 +08:00
amory	d68f3f3b3d	[Feature](array-functions)improve array functions for array_last_index (#20294 ) Currently only array_first_index is supported for lambda input; there is no array_last_index.	2023-06-02 13:54:03 +08:00
lihangyu	f0513a861d	[Improve](Scan) add a session variable to make scan run serial (#20220 ) Parallel scanning can result in some read amplification, for example, select * from xx where limit 1 actually requires only one row of data. However, due to parallel scanning of multiple tablets, read amplification occurs, leading to performance bottlenecks in high-concurrency scenarios. This PR adds a SessionVariable to enforce serial scanning, which can help mitigate this issue.	2023-06-01 15:06:35 +08:00
Gabriel	4387f47fb5	[pipeline](load) support pipeline load (#20217 )	2023-06-01 11:42:43 +08:00
Lijia Liu	f9dfcb923d	[Enhancement] Change Create Resource Group Grammar (#20249 )	2023-05-31 15:23:24 +08:00
zy-kkk	56fa38de1d	[Enhencement](JDBC Catalog) refactor jdbc catalog insert logic (#19950 ) This PR refactors the old way of writing data to the JDBC External Table & JDBC Catalog, mainly including the following tasks: 1. Continuing the work of @BePPPower's PR #18594, changing the logic from splicing INSERT SQL to operating on off-heap memory and writing data with preparedStatement.set. 2. Adding write support for the largeint type, mainly by adapting to java.math.BigInteger, which uses binary operations. 3. Deleting the SQL-splicing logic in the write-related code of the JDBC External Table & JDBC Catalog. ToDo: binary types, like bit, binary, blob... Finally, special thanks to @BePPPower and @AshinGau for their work. Co-authored-by: Tiewei Fang <43782773+BePPPower@users.noreply.github.com>	2023-05-30 22:03:39 +08:00
Chenyang Sun	accaff1026	[Feature](compaction) wip: single replica compaction (#19237 ) Currently, compaction is executed separately for each backend, and the reconstruction of the index during compaction leads to high CPU usage. To address this, we are introducing single replica compaction, where a specific primary replica is selected to perform compaction, and the remaining replicas fetch the compaction results from the primary replica. The Backend (BE) requests replica information for all peers corresponding to a tablet from the Frontend (FE). This information includes the host where the replica is located and the replica_id. By calculating hash(replica_id), the replica with the smallest hash value is responsible for executing compaction, while the remaining replicas are responsible for fetching the compaction results from this replica. The compaction task producer thread, before submitting a compaction task, checks whether the local replica should fetch from its peer. If it should, the task is then submitted to the single replica compaction thread pool. When performing single replica compaction, the process begins by requesting rowset versions from the target replica. These rowset_versions are then compared with the local rowset versions. The first version that can be fetched is selected.	2023-05-30 21:12:48 +08:00
YueW	de08c4a57b	[enhance](match) Support match query without inverted index (#19936 )	2023-05-30 15:02:57 +08:00
Pxl	e9917612f0	[Chore](gensrc) remove gen_vector_functions.py #20150	2023-05-29 18:16:31 +08:00
lihangyu	ab8125d56f	[Improve](performance) introduce SchemaCache to cache TabletSchame & Schema (#20037 ) * [Improve](performance) introduce SchemaCache to cache TabletSchame & Schema 1. When the system is under high-concurrency load with wide table point queries, the frequent memory allocation and deallocation of Schema become evident system bottlenecks. Additionally, the initialization of TabletSchema and Schema also becomes a CPU hotspot. Therefore, a SchemaCache is introduced to cache these resources for reuse. 2. Wrap some variables in std::unique_ptr. Performance: \| Status \| QPS \| Average response time (avg) \| P99 response time \| \|------------------\|-----\|------------------\|-------------\| \| SchemaCache enabled \| 501 \| 20ms \| 34ms \| \| SchemaCache disabled \| 321 \| 31ms \| 61ms \| * handle schema change with schema version * remove useless header * rebase	2023-05-29 17:34:53 +08:00
Jerry Hu	9f8de89659	[refactor](exec) replace the single pointer with an array of 'conjuncts' in ExecNode (#19758 ) Refactoring the filtering conditions in the current ExecNode from an expression tree to an array can simplify the process of adding runtime filters. It eliminates the need for complex merge operations and removes the requirement for the frontend to combine expressions into a single entity. By representing the filtering conditions as an array, each condition can be treated individually, making it easier to add runtime filters without the need for complex merging logic. The array can store the individual conditions, and the runtime filter logic can iterate through the array to apply the filters as needed. This refactoring simplifies the codebase, improves readability, and reduces the complexity associated with handling filtering conditions and adding runtime filters. It separates the conditions into discrete entities, enabling more straightforward manipulation and management within the execution node.	2023-05-29 11:47:31 +08:00
Yongqiang YANG	e0d9f7f955	[enhancement](load) add some profile items for load (#20141 )	2023-05-29 09:54:03 +08:00
YueW	ae352997b4	[Enhancement](alter inverted index) Improve alter inverted index performance with light weight add or drop inverted index (#19063 )	2023-05-28 11:23:07 +08:00
Jack Drogon	93933308e6	[Feature-WIP](CCR): Add ccr doris interface (WIP) (#17881 )	2023-05-26 23:40:49 +08:00
lihangyu	317338913c	[Bug](topn) Fix topn fetch set real default value (#20074 ) 1. Before this PR, if a rowset did not contain a column that should be read for the related SlotDescriptor, `insert_default` was called on the column, but that is not the real default value. The real default value information should be provided by the frontend side. 2. Support fetch when light schema change is not enabled, but disable it for the AGG or UNIQUE MOR model	2023-05-26 16:06:55 +08:00
lexluo09	3f971889b7	[Enhancement](multi catalog) Support hudi mor only java side ,be side not support (#19909 ) Support reading Hudi MOR tables by using the jni connector. Note: the FE part in the current PR is not fully complete, and the BE part will be supplemented in the next PR.	2023-05-25 20:37:01 +08:00
zhangstar333	e04b9cb47e	[vectorized](function) fix array_map funtion return type maybe get wrong (#19320 )	2023-05-25 11:30:28 +08:00
zhangstar333	53ae24912f	[vectorized](feature) support partition sort node (#19708 )	2023-05-25 11:22:02 +08:00
Gabriel	5547bbbaef	[decimalv3](function) support function width_bucket (#19806 )	2023-05-19 20:28:59 +08:00
Kang	294599ee45	[feature](jsonb) rename JSONB type name and function name to JSON (#19774 ) To be more compatible with MySQL, rename the JSONB type name and function names to JSON. The old JSONB type name and jsonb_xx functions can still be used for backward compatibility. One function, jsonb_extract, remains because json_extract is used by the json string function and more work is needed to change it. It will be changed later.	2023-05-18 16:16:52 +08:00
Pxl	a2c9ed7be8	[Chore](build) fix some undefined behavior about incomplete type vector #19753	2023-05-18 15:13:45 +08:00
airborne12	303bee6fa3	[Fix](single replica load) add inverted index copy for single replica load (#19663 ) * [Fix](single replica load) add inverted index copy for single replica load	2023-05-18 14:13:41 +08:00
yixiutt	943e5fb7e5	[improvement](MOW) use seperated cache for mow pk cache (#19686 ) In mow, the primary key cache has a big impact on load performance, so we add a new cache type to separate it from the page cache and make it more flexible in some cases	2023-05-18 13:27:09 +08:00
Kang	88ca4f3e6b	[feature](like) make like regexp used as a sql function (#19755 )	2023-05-18 10:03:12 +08:00
Ziyu Wang	325a1d4b28	[vectorized](function) support array_count function (#18557 ) support array_count function. array_count: returns the number of non-zero and non-null elements in the given array.	2023-05-16 17:00:01 +08:00
lihangyu	e22f5891d2	[WIP](row store) two phase opt read row store (#18654 )	2023-05-16 13:21:58 +08:00
slothever	3f2d1ae9a4	[feature-wip](multi-catalog)(step1)support connect to max compute (#19606 ) Issue Number: #19679 support connect to max compute metadata by odps sdk	2023-05-16 11:30:27 +08:00
Zhengguo Yang	6748ae4a57	[Feature] Collect the information statistics of the query hit (#18805 ) 1. Show the query hit statistics for `baseall` ```sql MySQL [test_query_db]> show query stats from baseall; +-------+------------+-------------+ \| Field \| QueryCount \| FilterCount \| +-------+------------+-------------+ \| k0 \| 0 \| 0 \| \| k1 \| 0 \| 0 \| \| k2 \| 0 \| 0 \| \| k3 \| 0 \| 0 \| \| k4 \| 0 \| 0 \| \| k5 \| 0 \| 0 \| \| k6 \| 0 \| 0 \| \| k10 \| 0 \| 0 \| \| k11 \| 0 \| 0 \| \| k7 \| 0 \| 0 \| \| k8 \| 0 \| 0 \| \| k9 \| 0 \| 0 \| \| k12 \| 0 \| 0 \| \| k13 \| 0 \| 0 \| +-------+------------+-------------+ 14 rows in set (0.002 sec) MySQL [test_query_db]> select k0, k1,k2, sum(k3) from baseall where k9 > 1 group by k0,k1,k2; +------+------+--------+-------------+ \| k0 \| k1 \| k2 \| sum(`k3`) \| +------+------+--------+-------------+ \| 0 \| 6 \| 32767 \| 3021 \| \| 1 \| 12 \| 32767 \| -2147483647 \| \| 0 \| 3 \| 1989 \| 1002 \| \| 0 \| 7 \| -32767 \| 1002 \| \| 1 \| 8 \| 255 \| 2147483647 \| \| 1 \| 9 \| 1991 \| -2147483647 \| \| 1 \| 11 \| 1989 \| 25699 \| \| 1 \| 13 \| -32767 \| 2147483647 \| \| 1 \| 14 \| 255 \| 103 \| \| 0 \| 1 \| 1989 \| 1001 \| \| 0 \| 2 \| 1986 \| 1001 \| \| 1 \| 15 \| 1992 \| 3021 \| +------+------+--------+-------------+ 12 rows in set (0.050 sec) MySQL [test_query_db]> show query stats from baseall; +-------+------------+-------------+ \| Field \| QueryCount \| FilterCount \| +-------+------------+-------------+ \| k0 \| 1 \| 0 \| \| k1 \| 1 \| 0 \| \| k2 \| 1 \| 0 \| \| k3 \| 1 \| 0 \| \| k4 \| 0 \| 0 \| \| k5 \| 0 \| 0 \| \| k6 \| 0 \| 0 \| \| k10 \| 0 \| 0 \| \| k11 \| 0 \| 0 \| \| k7 \| 0 \| 0 \| \| k8 \| 0 \| 0 \| \| k9 \| 1 \| 1 \| \| k12 \| 0 \| 0 \| \| k13 \| 0 \| 0 \| +-------+------------+-------------+ 14 rows in set (0.001 sec) ``` 2. Show the query hit statistics summary for all the mv in a table ```sql MySQL [test_query_db]> show query stats from baseall all; +-----------+------------+ \| IndexName \| QueryCount \| +-----------+------------+ \| baseall \| 1 \| +-----------+------------+ 1 row in set (0.005 sec) ``` 3. Show the query hit statistics detail info for all the mv in a table ```sql MySQL [test_query_db]> show query stats from baseall all verbose; +-----------+-------+------------+-------------+ \| IndexName \| Field \| QueryCount \| FilterCount \| +-----------+-------+------------+-------------+ \| baseall \| k0 \| 1 \| 0 \| \| \| k1 \| 1 \| 0 \| \| \| k2 \| 1 \| 0 \| \| \| k3 \| 1 \| 0 \| \| \| k4 \| 0 \| 0 \| \| \| k5 \| 0 \| 0 \| \| \| k6 \| 0 \| 0 \| \| \| k10 \| 0 \| 0 \| \| \| k11 \| 0 \| 0 \| \| \| k7 \| 0 \| 0 \| \| \| k8 \| 0 \| 0 \| \| \| k9 \| 1 \| 1 \| \| \| k12 \| 0 \| 0 \| \| \| k13 \| 0 \| 0 \| +-----------+-------+------------+-------------+ 14 rows in set (0.017 sec) ``` 4. Show the query hit for a database ```sql MySQL [test_query_db]> show query stats for test_query_db; +----------------------------+------------+ \| TableName \| QueryCount \| +----------------------------+------------+ \| compaction_tbl \| 0 \| \| bigtable \| 0 \| \| empty \| 0 \| \| tempbaseall \| 0 \| \| test \| 0 \| \| test_data_type \| 0 \| \| test_string_function_field \| 0 \| \| baseall \| 1 \| \| nullable \| 0 \| +----------------------------+------------+ 9 rows in set (0.005 sec) ``` 5. Show query hit statistics for all the databases ```sql MySQL [(none)]> show query stats; +-----------------+------------+ \| Database \| QueryCount \| +-----------------+------------+ \| test_query_db \| 1 \| +-----------------+------------+ 1 rows in set (0.005 sec) ```	2023-05-15 10:56:34 +08:00
HHoflittlefish777	f8ef25bb10	[enhancement](load) lazy-open necessary partitions when load (#18874 )	2023-05-14 16:09:55 +08:00
Tiewei Fang	91cdb79d89	[Bugfix](Outfile) fix that export data to parquet and orc file format (#19436 ) 1. support export `LARGEINT` data type to parquet/orc file format. 2. Export the DORIS `DATE/DATETIME` type to the `Date/Timestamp` logic type of parquet file format. 3. Fix that the data is not correct when the DATE type data is exported to ORC.	2023-05-13 22:39:24 +08:00
Xiangyu Wang	589dd8a9b3	[Fix](multi-catalog) Fix query hms tbl with compressed data files. (#19387 ) If a submitted query contains hms tbls whose data files are compressed (bz2, lzo, lz4 ...), an error occurs like this: ```[INTERNAL_ERROR]Only support csv data in utf8 codec``` . This is because `org.apache.doris.planner.external.HiveScanNode` sets `fileFormatType` to `TFileFormatType.FORMAT_CSV_PLAIN` regardless of the real compression algorithm of the data files. This PR tries to fix this problem.	2023-05-11 14:53:58 +08:00
herry2038	834bf2eab7	[feature](array) Add array_last lambda function (#18388 ) Add array_last lambda function	2023-05-11 13:15:54 +08:00
Ashin Gau	3ba3b6c66f	[opt](FileCache) use modification time to determine whether the file is changed (#18906 ) Get the last modification time from file status, and use the combination of path and modification time to generate cache identifier. When a file is changed, the modification time will be changed, so the former cache path will be invalid.	2023-05-11 07:50:39 +08:00
zhangdong	b129c9901b	[improvement](FQDN)Change the implementation of fqdn (#19123 ) Main changes: 1. If fqdn is enabled in the configuration file, when fe starts, localAddr will obtain the fqdn instead of the IP, and priority_networks will no longer take effect 2. The IP and host names of Backend and Frontend are combined into one field, host. When fqdn is enabled, it represents the host name, and when not enabled, it represents the IP address 3. Communication within the cluster directly uses fqdn, and the various connection pools add a verification mechanism so that connections between nodes do not fail when the IP address behind a domain name changes 4. Polling to verify whether the IP has changed is no longer required, so fqdnManager is deleted 5. Change the method of verifying the legitimacy of nodes between FEs from obtaining the client IP to having the sending node state its own identity in the HTTP request header or the message body of the throttle 6. When processing the heartbeat, if BE finds that the host stored by itself is inconsistent with the host stored by the master, after verifying the legitimacy of the host, it will change its own host instead of directly reporting an error 7. Simplify the generation logic of the fe name Scope of influence: 1. Establishing communication connections within the cluster 2. Determining whether two nodes are the same through attributes such as IP 3. Log printing 4. Information display 5. Address splicing 6. k8s deployment 7. Upgrade compatibility Test plan: 1. Change the IP addresses of fe and be while keeping the fqdn unchanged, and verify whether the cluster can read and write data normally 2. Use the master code to generate metadata, and use that previous metadata on the current pr to verify whether it is compatible with the old version (upgrading is no longer supported if fqdn has been enabled before) 3. Deploy fe and be clusters using k8s to verify whether the cluster can read and write data normally 4. Upgrade old clusters according to https://doris.apache.org/zh-CN/docs/dev/admin-manual/cluster-management/fqdn?_highlight=fqdn#%E6%97%A7%E9%9B%86%E7%BE%A4%E5%90%AF%E7%94%A8fqdn 5. Use streamload to specify the fqdn of fe and be to import data separately 6. Use different users to start transactions and write data using insert statements	2023-05-11 00:44:48 +08:00
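
The replica selection described in commit accaff1026 (single replica compaction) can be sketched as follows. This is a minimal illustration in Python, not the Doris implementation: the commit message only says that the replica with the smallest hash(replica_id) executes compaction and the others fetch from it. The function names, the use of SHA-256 as the hash, and the tie-break on replica_id are all assumptions made for the example.

```python
import hashlib


def replica_hash(replica_id: int) -> int:
    """Hypothetical stand-in for hash(replica_id); the real hash is not stated in the commit."""
    digest = hashlib.sha256(str(replica_id).encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer so hashes are comparable.
    return int.from_bytes(digest[:8], "big")


def choose_compaction_replica(replica_ids):
    """Return the replica id with the smallest hash; it executes compaction."""
    if not replica_ids:
        raise ValueError("a tablet needs at least one replica")
    # Tie-break on the id itself (an assumption) so the choice is deterministic.
    return min(replica_ids, key=lambda rid: (replica_hash(rid), rid))


def should_fetch_from_peer(local_replica_id, replica_ids):
    """True if the local replica should fetch compaction results from a peer."""
    return choose_compaction_replica(replica_ids) != local_replica_id


if __name__ == "__main__":
    peers = [10001, 10002, 10003]  # made-up replica ids
    primary = choose_compaction_replica(peers)
    for rid in peers:
        role = "fetches from peer" if should_fetch_from_peer(rid, peers) else "compacts"
        print(rid, role, "(primary is", primary, ")")
```

Because every backend evaluates the same function over the same replica list, all replicas agree on the primary without extra coordination, which is the property the commit message relies on.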


830 Commits