doris

Author	SHA1	Message	Date
LingBin	3c539aac54	[Refactor] Some tiny refactor on streaming-load related code (#2891 ) Mainly contains the following modifications: 1. Use `std::unique_ptr` to replace some naked pointers 2. Modify some methods from member-method to local-static-function 3. Modify some methods do not need to be public to private 4. Some formatting changes: such as wrapping lines that are too long 5. Remove some useless variables 6. Add or modify some comments for easier understanding No functional changes in this patch.	2020-02-13 10:42:52 +08:00
LingBin	4e151b1551	Remove boost exception when parse store path (#2861 )	2020-02-10 17:50:52 +08:00
kangkaisen	e7817053cc	[Uitls] ParseUtil::parse_mem_spec support K and T suffix (#2854 )	2020-02-07 09:31:35 +08:00
Yunfeng,Wu	b35e8153c0	[Doris on Es] Fix lte and gte error expression (#2851 ) LE should LTE GE should GTE	2020-02-06 20:52:14 +08:00
kangpinghuang	a27e89065b	Add file cache for v2 (#2782 ) Add file descriptor cache for segment v2 to solve too many open file problems	2020-02-04 00:16:01 +08:00
Lijia Liu	99ad56d1bf	Support bitmap index for more type (#2630 ) For #2589 1. date(uint24_t)/datetime(int64_t)/largeint(int128_t) use frame of reference code as dict. 2. decimal(decimal12_t) also uses frame of reference code as dict. 3. float/double use bitshuffle code as dict.	2020-01-31 21:09:29 +08:00
Lishi	89c7234c1c	Support starts_with (str, prefix) function (#2813 ) Support starts_with function	2020-01-21 14:09:08 +08:00
yangzhg	634928e4d0	Fix typo and remove tmp file in ut (#2789 )	2020-01-19 21:33:48 +08:00
LingBin	7c4149cf27	Improve comparison and printing of Version (#2796 ) * Improve comparison and printing of Version There are two members in `Version`:` first` and `second`. There are many places where we need to print one `Version` object and compare two `Version` objects, but in the current code, these two members are accessed directly, which makes the code very tedious. This patch mainly do: 1. Adds overloaded methods for `operator<<()` for `Version`, so we can directly print a Version object; 2. Adds the `cantains()` method to determine whether it is an containment relationship; 3. Uses `operator==()` to determine if two `Version` objects are equal. Because there are too many places need to be modified, there are still some naked codes left, which will be modified later. This patch also removes some necessary header file references. No functional changes in this patch.	2020-01-19 18:04:28 +08:00
LingBin	c71eefa2ac	Add path util (#2747 ) Note that the methods in path_util are only related to path processing, and do not involve any file and IO operations The upcoming patch will use these util methods, used to extract operations such as concatenation of directory strings from processing logic.	2020-01-18 00:05:00 +08:00
Dayue Gao	3b24287251	Support 64 bits integers for BITMAP type (#2772 ) Fixes #2771 Main changes in this CL * RoaringBitmap is renamed to BitmapValue and moved into bitmap_value.h * leveraging Roaring64Map to support unsigned BIGINT for BITMAP type * introduces two new format (SINGLE64 and BITMAP64) for BITMAP type So far we have three storage format for BITMAP type ``` EMPTY := TypeCode(0x00) SINGLE32 := TypeCode(0x01), UInt32LittleEndian BITMAP32 := TypeCode(0x02), RoaringBitmap(defined by https://github.com/RoaringBitmap/RoaringFormatSpec/) ``` In order to support BIGINT element and keep backward compatibility, introduce two new format ``` SINGLE64 := TypeCode(0x03), UInt64LittleEndian BITMAP64 := TypeCode(0x04), CustomRoaringBitmap64 ``` Please note that SINGLE64/BITMAP64 doesn't replace SINGLE32/BITMAP32. Doris will choose the smaller (in terms of space) type automatically during serializing. For example, BITMAP32 is preferred over BITMAP64 when the maximum element is <= UINT32_MAX. This will also make BE rollback possible as long as user didn't write element larger than UINT32_MAX into bitmap column. Another important design decision is that we fork and maintain our own version of Roaring64Map instead of using the one in "roaring/roaring64map.hh". The reasons are 1. RoaringBitmap doesn't define a standard for the binary format of 64-bits bitmap. As a result, different implementations of Roaring64Map use different format. For example the [C++ version](https://github.com/RoaringBitmap/CRoaring/blob/v0.2.60/cpp/roaring64map.hh#L545) is different from the [Java version](`35104c564e/src/main/java/org/roaringbitmap/longlong/Roaring64NavigableMap.java (L1097)`). Even for CRoaring, the format may change in future releases. However Doris require the serialized format to be stable across versions. Fork is a safe way to achieve this. 2. We may want to make some code changes to Roaring64Map according to our needs. For example, in order to use the BITMAP32 format when the maximum element can be represented in 32 bits, we may want to access the private member of Roaring64Map. Another example is we want to further customize and optimize the format for BITMAP64 case, such as using vint64 instead of uint64 for map size.	2020-01-17 14:13:38 +08:00
HangyuanLiu	0ddca59d36	Add timestampadd/timestampdiff function (#2725 )	2020-01-15 21:47:07 +08:00
DanyBin	7768629f08	Add bitmap_contains and bitmap_has_any functions (#2752 )	2020-01-15 14:31:44 +08:00
HangyuanLiu	a36193dfab	Support decimal and timestamp type in orc load (#2759 )	2020-01-15 07:40:30 +08:00
frwrdt	f071d5a307	Support ends_with function (#2746 )	2020-01-14 22:37:20 +08:00
ZHAO Chun	a99a49a444	Add bitamp_to_string function (#2731 ) This CL changes: 1. add function bitmap_to_string and bitmap_from_string, which will convert a bitmap to/from string which contains all bit in bitmap 2. add function murmur_hash3_32, which will compute murmur hash for input strings 3. make the function cast float to string the same with user result logic	2020-01-13 12:31:37 +08:00
yangzhg	4b8f7f9c32	Use cgroups memory limit and cpu cores in container (#2710 )	2020-01-10 00:45:50 +08:00
kangkaisen	1c9cfa7e0f	Fix invalid to_bitmap input lead to BE core (#2706 )	2020-01-08 22:14:37 +08:00
DanyBin	a028c52edd	Add BE function bitmap_or and bitmap_and (#2707 )	2020-01-08 19:59:44 +08:00
Mingyu Chen	13e5fdd512	[AlphaRowset] set num_segments field in rowset meta if missing (#2658 ) the num segments should be read from rowset meta pb. But the previous code error caused this value not to be set in some cases. So when init the rowset meta and find that the num_segments is 0(not set), we will try to calculate the num segments from AlphaRowsetExtraMetaPB, and then set the num_segments field. This should only happen in some rowsets converted from old version. and for all newly created rowsets, the num_segments field must be set.	2020-01-07 21:46:02 +08:00
HangyuanLiu	2326b478b6	Support load orc format in Apache Doris (#2554 ) Support load orc format in Apache Doris	2020-01-07 14:22:43 +08:00
yangzhg	de4d1778c6	Fix incompatibility with arm architecture in util and gutil (#2650 ) 1. upgrade gutil code from imapla to new verison， include `cpuinfo`, `spinlock` and `linux_syscall_support ` 2. impliments arm version utf8 check code 3. remove incompatible code from stopwatch	2020-01-06 18:39:31 +08:00
WingC	220ed8436c	[Unit Test]Fix Schema Change Test Case (#2659 )	2020-01-05 20:08:23 +08:00
ZHAO Chun	1648226927	Adapt arrow 0.15 API (#2657 ) This CL supports arrow's zero copy read interface, which can make code comply with arrow 0.15. And the schema change unit test has some problem, I disable it in run-ut.sh	2020-01-04 15:54:29 +08:00
yangzhg	c098178f7a	[Index] Implements create drop show index syntax for bitmap index [#2487 ] (#2573 ) ### create table with index ``` CREATE TABLE table1 ( siteid INT DEFAULT '10', citycode SMALLINT, username VARCHAR(32) DEFAULT '', pv BIGINT SUM DEFAULT '0', INDEX index_name [USING BITMAP] (siteid, citycode) COMMENT 'balabala' ) AGGREGATE KEY(siteid, citycode, username) DISTRIBUTED BY HASH(siteid) BUCKETS 10 PROPERTIES("replication_num" = "1"); ``` ### create index ``` CREATE INDEX index_name ON table1 (siteid, citycod) [USING BITMAP] COMMENT 'balabala'; or ALTER TABLE table1 ADD INDEX index_name [USING BITMAP] (siteid, citycod) COMMENT 'balabala'; ``` ### drop index ``` DROP INDEX index_name ON table1; or ALTER TABLE table1 DROP INDEX index_name ``` ### show index ``` SHOW INDEX[ES] FROM table1 ``` output ``` +---------+-------------+-----------------+------------+---------+ \| Table \| Index_name \| Column_name \| Index_type \| Comment \| +---------+-------------+-----------------+------------+---------+ \| table1 \| index_name \| siteid,citycode \| BITMAMP \| balabala\| +---------+-------------+-----------------+------------+---------+ ```	2020-01-03 17:41:26 +08:00
Mingyu Chen	cc924c9e6a	[Rowset Reader] Improve the merge read efficiency of alpha rowsets (#2632 ) When merge reads from one rowset with multi overlapping segments, I introduce a priority queue(A Minimum heap data structure) for multipath merge sort, to replace the old N*M time complexity algorithm. This can significantly improve the read efficiency when merging large number of overlapping data. In mytest: 1. Compaction with 187 segments reduce time from 75 seconds to 42 seconds 2. Compaction with 3574 segments cost 43 seconds, and with old version, I kill the process after waiting more than 10 minutes... This CL only change the reads of alpha rowset. Beta rowset will be changed in another CL. ISSUE: #2631	2020-01-02 14:10:05 +08:00
Youngwb	feda66f99f	Spark return error to users when spark on doris query failed (#2531 )	2019-12-30 21:58:13 +08:00
Dayue Gao	da8c9b4429	[Segment V2] refactor SegmentReaderWriterTest and add UT for lazy materialization (#2614 )	2019-12-30 21:07:58 +08:00
LingBin	ffea3f8825	[env] Add CREATE_OR_OPEN and rename existing open modes (#2604 ) The upcoming patch will use CREATE_OR_OPEN mode This patch also remove virtual dtors to cpp file. * Move the dtors back to env.h Generally, placing the dtor in an `.h` file(inline) or in a `cpp` file depends on the trade-off between code expansion and function call overhead. The code expansion rate is closely related to the number of class members and the inheritance level. For the several classes here: `Env`, `ReadableFile`, and `WritableFile` have no members and are the top level of the inheritance hierarchy, But for now I have no obvious evidence to prove that make their dtors inline will cause serious code expansion and more instruction cache-misses, even if there are thousands of `ReadableFile` objects kept being created and released during running.	2019-12-30 13:51:38 +08:00
kangkaisen	5fd7133e69	Fix bitmap, hll, segment v2 DefaultValue bug (#2570 ) 1. Change the bitmap and HLL default value to empty bitmap and empty bitmap HLL 2. Fix DefaultValueColumnIterator bug 3. Fix uint24.h ostream bug	2019-12-27 14:01:45 +08:00
Seaven	4ed87964fe	Add zip util(#2348 ) (#2441 ) Support .zip file extract by minizip	2019-12-27 10:10:21 +08:00
WingC	f7032b07f3	Support more schema change from VARCHAR type (#2501 )	2019-12-26 22:38:53 +08:00
kangpinghuang	7f48bd3c5a	Support bloom filter index for large int type (#2550 )	2019-12-24 19:04:03 +08:00
kangpinghuang	f9685372a1	Fix bloom filter bug #2526 (#2532 )	2019-12-24 07:45:11 +08:00
WingC	c81b1db406	Support convert VARCHAR type to DATE type (#2489 )	2019-12-18 12:58:47 +08:00
kangpinghuang	d31f774852	Add block split bloom filter (#2471 ) [STORAGE][SEGMENTV2] use block split bloom filter build bloom filter against data page add distinct value to bloom filter add ordinal index to bloom filter index	2019-12-18 12:57:44 +08:00
WingC	89003b774b	Support Convert Varchar to INT (#2481 )	2019-12-17 22:02:28 +08:00
kangkaisen	cf6d705df9	Add intersect_count UDAF (#2418 ) 1 Because we don't support array type currently, so I use variable arguments instead. 2 intersect_count directly return final count, not bitmap like bitmap_union, because intersect_count return bitmap is more complex and need more serialize. If we really need bitmap format from intersect_count, we could do that in another PR and which won't have compatibility problems.	2019-12-13 16:12:05 +08:00
lichaoyong	2b45434c0e	Fix RunLengthInteger unit test bug (#2454 )	2019-12-13 14:16:16 +08:00
lichaoyong	14293b39f3	Fix RLE encoding/decoding bug upon large negative number. (#2448 ) Doris have use RLE to encoding/decoding integer. Four types are comprised of the RLE encoding/decoding algorithm. Short Repeat : used for short repeating integer sequences. Direct : used for integer sequences whose values have a relatively constant bit width. Patched Base : used for integer sequences whose bit widths varies a lot. Delta : used for monotonically increasing or decreasing sequences. This bug occurs in Patched Base Type for large negative number. In patched base, base value is stored 1 to 8 bytes and encoding to 0 ~ 7. If the base value is 8 byte, the encoding value for base width should be 7. But now will encoding to 8, this is problem. It will result in inconsistent data with loaded data because wrong encoding procedure. In extreme case, the BE process will be cored dump because illegal address.	2019-12-13 08:51:05 +08:00
kangpinghuang	c07f37d78c	[Segment V2] Add a control framework between FE and BE through heartbeat #2247 (#2364 ) The control framework is implemented through heartbeat message. Use uint64_t as flags to control different functions. Now add a flag to set the default rowset type to beta.	2019-12-12 12:18:32 +08:00
Dayue Gao	83b5455be5	[Load] Fix several races in stream load that could cause BE crash (#2414 ) This CL fixes the following problems 1. check whether TabletsChannel has been closed/cancelled in `reduce_mem_usage` to avoid using a closed DeltaWriter 2. make `FlushHandle.wait` wait for all submitted tasks to finish so that memtable is deallocated before its delta writer 3. make `~MemTracker()` release its consumption bytes to accommodate situations in aggregate_func.h that bitmap and hll call `MemTracker::consume` without corresponding `MemTracker::release`, which cause the consumption of root tracker never drops to zero	2019-12-10 21:59:05 +08:00
WingC	af3d901a06	Convert INT type to DATE type (#2393 )	2019-12-07 21:56:52 +08:00
LingBin	f635552a20	Port latest faststring (#2403 )	2019-12-06 20:39:56 +08:00
LingBin	177fec8917	Improve SkipList memory usage tracking (#2359 ) The problem with the current implementation is that all data to be inserted will be counted in memory, but for the aggregation model or some other special cases, not all data will be inserted into `MemTable`, and these data should not be counted in memory. This change makes the `SkipList` use the exclusive `MemPool`, and only the data will be inserted into the `SkipList` can use this `MemPool`. In other words, those discarded rows will not be counted by the `MemPool` of` SkipList`. In order to avoid duplicate checking whether a row already exists in `SkipList`, this change also modifies the `SkipList` interface(A `Hint` will be fetched when `Find()`, and then use it in `InsertUseHint()`), and made `SkipList` no longer aware of the aggregation logic. At present, because of the data row(`Tuple`) generated by the upper layer is different from the data row(`Row`) internally represented by the engine, when inserting `MemTable`, the data row must be copied. If the row needs to be inserted into SkipList, we need copy it again to `MemPool` of `SkipList`. And, at present, the aggregation function only supports `MemPool` when copying, so even if the data will not be inserted into` SkipList`, `MemPool` is still used (in the future, it can be replaced with an ordinary` Buffer`). However, we reuse the allocated memory in MemPool, that is, we do not reallocate new memory every time. Note: Due to the characteristics of `MemPool` (once inserted, it cannot be partially cleared), the following scenarios may still cause multiple flushes. For example, the aggregation model of a string column is `MAX`, and the data inserted at the same time is in ascending order, then for each data row, it must apply for memory from `MemPool` in `SkipList`, that is, although the old rows in SkipList` will be discarded, the memory occupied will still be counted. I did a test on my development machine using `STREAM LOAD`: a table with only one tablet and all columns are keys, the original data was 1.1G (9318799 rows), and there were 377745 rows after removing duplicates. It can be found that both the number of files and the query efficiency are greatly improved, the price paid is only a slight increase in load time. before: ``` $ ll storage/data/0/10019/1075020655/ total 4540 -rw------- 1 dev dev 393152 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_0_0.dat -rw------- 1 dev dev 1135 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_0_0.idx -rw------- 1 dev dev 421660 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_10_0.dat -rw------- 1 dev dev 1185 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_10_0.idx -rw------- 1 dev dev 184214 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_1_0.dat -rw------- 1 dev dev 610 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_1_0.idx -rw------- 1 dev dev 329181 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_11_0.dat -rw------- 1 dev dev 935 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_11_0.idx -rw------- 1 dev dev 343813 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_12_0.dat -rw------- 1 dev dev 985 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_12_0.idx -rw------- 1 dev dev 315364 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_2_0.dat -rw------- 1 dev dev 885 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_2_0.idx -rw------- 1 dev dev 423806 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_3_0.dat -rw------- 1 dev dev 1185 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_3_0.idx -rw------- 1 dev dev 294811 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_4_0.dat -rw------- 1 dev dev 835 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_4_0.idx -rw------- 1 dev dev 403241 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_5_0.dat -rw------- 1 dev dev 1135 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_5_0.idx -rw------- 1 dev dev 350753 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_6_0.dat -rw------- 1 dev dev 860 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_6_0.idx -rw------- 1 dev dev 266966 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_7_0.dat -rw------- 1 dev dev 735 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_7_0.idx -rw------- 1 dev dev 451191 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_8_0.dat -rw------- 1 dev dev 1235 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_8_0.idx -rw------- 1 dev dev 398439 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_9_0.dat -rw------- 1 dev dev 1110 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_9_0.idx { "TxnId": 16, "Label": "cd9f8392-dfa0-4626-8034-22f7cb97044c", "Status": "Success", "Message": "OK", "NumberTotalRows": 9318799, "NumberLoadedRows": 9318799, "NumberFilteredRows": 0, "NumberUnselectedRows": 0, "LoadBytes": 1079581477, "LoadTimeMs": 46907 } mysql> select count() from xxx_before; +----------+ \| count() \| +----------+ \| 377745 \| +----------+ 1 row in set (0.91 sec) ``` aftr: ``` $ ll storage/data/0/10013/1075020655/ total 3612 -rw------- 1 dev dev 3328992 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_0_0.dat -rw------- 1 dev dev 8460 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_0_0.idx -rw------- 1 dev dev 350576 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_1_0.dat -rw------- 1 dev dev 985 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_1_0.idx { "TxnId": 12, "Label": "88f606d5-8095-4f15-b61d-49b7080c16b8", "Status": "Success", "Message": "OK", "NumberTotalRows": 9318799, "NumberLoadedRows": 9318799, "NumberFilteredRows": 0, "NumberUnselectedRows": 0, "LoadBytes": 1079581477, "LoadTimeMs": 48771 } mysql> select count() from xxx_after; +----------+ \| count() \| +----------+ \| 377745 \| +----------+ 1 row in set (0.38 sec) ```	2019-12-06 17:31:18 +08:00
WingC	102a845131	Support convert date to datetime through alter table (#2385 )	2019-12-05 07:37:45 +08:00
Yunfeng,Wu	1532282942	Support push down is null predicate for Doris-On-ES (#2378 )	2019-12-04 22:56:22 +08:00
lichaoyong	fbee3c7722	Remove VersionHash used to comparison in BE (#2358 )	2019-12-04 20:09:03 +08:00
Yunfeng,Wu	0f00febd21	Optimize Doris On Elasticsearch performance (#2237 ) Pure DocValue optimization for doris-on-es Future todo: Today, for every tuple scan we check if pure_docvalue is enabled, this is not reasonable, should check pure_docvalue enabled for one whole scan outside, I will add this todo in future	2019-12-04 12:57:45 +08:00
kangkaisen	f828670245	Add Bitmap index reader (#2319 ) [STORAGE] [INDEX] For #2061 and #2062 Add bitmap index reader SegmentIterator support bitmap index Add some metrics	2019-12-03 23:01:40 +08:00

1 2 3 4 5

230 Commits