doris

Author	SHA1	Message	Date
ZHAO Chun	9ea14b83bb	Remove failed UT (#2165 )	2019-11-08 16:48:32 +08:00
ZHAO Chun	89dc461f91	Fix UT and remove unused code (#2160 )	2019-11-08 08:47:48 +08:00
Yunfeng,Wu	188d97c215	Add null bit verification for row_batch transformation (#2139 )	2019-11-07 14:05:23 +08:00
kangpinghuang	78a4270457	Fix in predicate bug (#2132 )	2019-11-05 20:27:22 +08:00
shengyunyao	ccc1b9d98c	Optimize percentile_approx through radix sort (#2102 ) (#2107 )	2019-11-05 09:25:47 +08:00
wangbo	e1a8f9d30f	Segment v2 stream load core dump(#2037 ) (#2075 ) [STORAGE] 1 fix mem fix mem leak when calling string builder.get_dictionary_page; 2 fix delete invalid mem addr in bitshuffleBuilder when no array grow happends when bitshuffleBuilder didn't grow array, the data page which not use new to allocate will be returned to ColumnWriter. When ColumnWriter destructs, the data page will be deleted,this causes core dump	2019-11-01 22:52:58 +08:00
Yunfeng,Wu	f53f188c5d	Add arrow IPC serialization for Doris-Spark-Connector (#2013 )	2019-10-31 10:32:06 +08:00
kangpinghuang	6b4ef34162	fix AlphaRowsetTest by remove StorageEngine #2078 (#2091 )	2019-10-30 19:39:41 +08:00
kangpinghuang	9408ad67e9	Fix predicate error when reading BetaRowset (#2067 )	2019-10-27 12:12:41 +08:00
Seaven	189e08faa5	Replace NewStatus with Status (#2046 )	2019-10-24 22:48:59 +08:00
Dayue Gao	d25f0ba69a	Make ColumnReader load lazily (#2026 ) [Storage][SegmentV2] Currently `segment_v2::Segment::open` will eagerly initialize all column readers, regardless of whether the column is queried or not. Initializing `segment_v2::ColumnReader` incurs additional I/O cost to read ordinal index and zonemap index and should be delayed to the time it's needed.	2019-10-23 10:25:28 +08:00
kangpinghuang	9c2d149c36	add profile for segment v2 (#2015 )	2019-10-22 09:43:16 +08:00
Dayue Gao	8aa2cbe12d	Load Rowset only once in a thread-safe manner (#2022 ) [Storage] This PR implements thread-safe `Rowset::load()` for both AlphaRowset and BetaRowset. The main changes are 1. Introduce `DorisCallOnce<ReturnType>` to be the replacement for `DorisInitOnce` . It works for both Status and OLAPStatus. 2. `segment_v2::ColumnReader::init()` is now implemented by DorisCallOnce. 3. `segment_v2::Segment` is now created by a factory open() method. This guarantees all Segment instances are in opened state. 4. `segment_v2::Segment::_load_index()` is now implemented by DorisCallOnce. 5. Implement thread-safe load() for AlphaRowset and BetaRowset	2019-10-21 16:05:12 +08:00
ZHAO Chun	05643dc403	Replace Arena with MemPool (#2012 ) After replacing Arena with MemPool, we can achieve one copy for string value read from segment v2. We can exchange MemPool's chunk between RowBlockV2 and RowBlock. This change only replace Arena, this work will be done in other change list.	2019-10-19 15:53:24 +08:00
kangpinghuang	4f7cc7e033	add predicate filter(#1652 ) (#1775 )	2019-10-17 19:20:00 +08:00
kangpinghuang	3bca253fb3	Fix beta rowset read slow (#1994 ) [Bug][BetaRowset] fix beta rowset read slowly with limit beta rowset do not update raw_rows_read in statistics and will read all data in tablet when query with limit, which lead to long query time.	2019-10-17 19:19:46 +08:00
Mingyu Chen	41e55cfca9	Modify fixed partition feature (#1989 ) 1. Not support MAVALUE in multi partition column. 2. Fix the incorrect show create table stmt.	2019-10-16 16:03:46 +08:00
kangpinghuang	2fcb79e3ef	Fix wrong group by result bug (#1987 )	2019-10-16 07:19:53 +08:00
Mingyu Chen	62acf5d098	Limit the memory usage of Loading process (#1954 )	2019-10-15 09:26:20 +08:00
ZHAO Chun	f130bd3e7b	Use Env function to operate directory (#1980 ) Now Env has unify all environment operation, such as file operation. However some of our old functions don't leverage it. This change unify FileUtils::scan_dir to use Env's function.	2019-10-15 09:25:12 +08:00
Lijia Liu	d68b1b287c	Support segment-level zone map (#1931 )	2019-10-13 22:06:09 +08:00
wangbo	80e9b21fb0	Make Segment v2 use string's real length(#1943 ) (#1944 )	2019-10-13 13:23:43 +08:00
ZHAO Chun	d5493fb20a	Replace std::regex with RE2 (#1930 ) In Storage Engine GC, TabletManger use std::regex to extract tablet id and schema hash from path. But it will construct regex pattern for every path to check, this is a huge waste. This change list make this pattern a global static pattern, and replace it with RE2, which has better performance.	2019-10-11 15:57:53 +08:00
yiguolei	e4f3e8fda7	Remove redundant method in rowset meta manager (#1949 )	2019-10-10 19:29:59 +08:00
ZHAO Chun	024348d74b	Enable auto convert when check in (#1926 ) Leverage gitattributes to enable auto convert end-of-line to LF when checking in. Convert already exist CRLF to LF by removing all files and checking out with new .gitattributes file. Except .gitattributes, all files are only modified at the end of line.	2019-10-09 22:31:27 +08:00
wangbo	8aa8e08f27	v2 segment support string encode(#1766 ) (#1816 ) major change change data format of binary dict page, appending (dict page data) and (dict page offset) to binary dict page; add new decoding method for new binary dict page format add ut for segment test set the elements of initial array to 0 ,when calling arena.AllocateNewBlock hard code way to choose dict coding for string 0919 commit major change change dict file format:when saving binary dict page, separate dict page from dict page,one dict page may have multi data pages;when reading a binary dict page,one ColumnReader keeps one dict page loading dict when calling column_reader._read_page 3.rollback BinaryDictPage no longer using memset(0) to inital column_zonemap.max_value 0926 17 commit major change init column_zone_map min value column_zone_map slice's data array; set char/varchar column_zone_map'max value size to 0 add ut for char column zone map query hit/miss 0929 10 commit major change allocate mem for column_zone_map 's max and min value direct copy content to column_zone_map's max and min value	2019-09-30 16:25:31 +08:00
kangpinghuang	2cecf5901f	Fix segment v2 bug (#1904 )	2019-09-30 13:50:39 +08:00
kangkaisen	262c7f4834	Make All BE UT pass in debug mode (#1913 ) Fix OrdinalPageIndexTest Fix ColumnReaderWriterTest Fix binary_dict_page_test Fix routine_load_task_executor_test	2019-09-29 19:37:51 +08:00
yiguolei	f852f50acb	Improve unique id performance (#1911 ) Remove the default constructor for UniqueID Add a gen_uid method in UniqueId. If need to generate a new uid, users should call this api explicitly. Reuse boost random generator not generate a new one every time.	2019-09-29 18:20:02 +08:00
ZHAO Chun	8f016d3ab2	Make HLL be able to handle invalid data (#1908 ) In this change list 1. validate HLL column when loading data, if data is invalid, this row will be filtered. 2. seems as empty HLL when serializing invalid type of HLL data, with this change, all ingested data will be valid. 3. seems as empty HLL when deserializing nullptr or invalid type of HLL data. With this change, dirty data can be handled normally. 4. rename function empty_hll to hll_empty. 5. disable memtable_flush_execute_test because this will fails sometimes. When tearing down, some thread is not joined, and they will visit destroyed resource, which is invalid.	2019-09-29 10:55:23 +08:00
kangkaisen	d3a445ee09	Fix memory_scratch_sink_test in debug mode (#1906 )	2019-09-28 10:33:24 +08:00
kangkaisen	1131f53420	Fix parquet_scanner_test in debug mode (#1900 )	2019-09-28 01:15:33 +08:00
kangkaisen	cafb9f1e62	Replace Arena with MemPool first step (#1899 )	2019-09-28 01:12:22 +08:00
kangkaisen	0c22d8fa08	Add frame_of_reference page (#1818 )	2019-09-28 01:10:29 +08:00
kangkaisen	1c229fbd92	Fix es_scan_reader_test in debug mode (#1905 )	2019-09-28 00:02:30 +08:00
kangkaisen	b246d93128	Avoid SerDe for aggregation query with object pool (#1854 )	2019-09-26 13:51:13 +08:00
wubiao	eb840ecca8	Support boolean/date/datetime/decimal types in segment V2 (#1863 )	2019-09-25 13:53:00 +08:00
Mingyu Chen	c643cbd30c	Optimize the load performance for large file (#1798 ) The current load process is: Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk In the path of Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed: Insert tuple into different memtables according to tablet ID When the memtable size reaches the threshold, it is written to disk. The above operations are equivalent to single thread execution for a single load task. In fact, the insertion of memtable and the flush of memtable can be executed synchronously. Perform these operation in single thread prevents the insertion of memtable from being delayed due to slow disk writing. In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads. By default, each data directory uses two worker threads for flush, which can be modified by the parameter flush_thread_num_per_store of BE. DeltaWriter will push the full memtable to MemTableFlushExecutor for flush operation and generate a new memtable for receiving new data. This design can improve the performance of load large files. In single host testing, the time to load a 1GB text file is reduced from 48 seconds to 29 seconds.	2019-09-25 13:49:32 +08:00
shgxwxl	c2de62d6a1	Collect scanner's status when es_http_scan_node close (#1861 )	2019-09-25 12:20:13 +08:00
kangpinghuang	8d0fee7e64	Add default value column iterator #1834 (#1835 )	2019-09-24 14:39:10 +08:00
kangpinghuang	fe27969978	add delete predicate filter(#1636 ) (#1745 ) Delete predicate can be used to prune data by zone map.	2019-09-24 14:38:19 +08:00
ZHAO Chun	93fe10a268	Reduce size of HyperLogLog struct (#1845 ) Now size of HyperLogLog struct is so large that it lead the rowset is too small when ingesting data. In this CL, registers in HyperLogLog are only created when it is needed. When ingesting data, it's normal case that there are only few values in one HyperLogLog.	2019-09-21 14:38:58 +08:00
lichaoyong	720808fda5	Remove config::max_file_descriptor_number (#1833 )	2019-09-20 07:50:57 +08:00
lichaoyong	315f762523	Seek block when starts a ScanKey (#1828 ) In Doris, one block has 1024 rows. 1. If the previous ScanKey scan rows multiple blocks, and also the final block has 1024 rows just right. 2. The current ScanKey scan rows with number less than one block. Under the two conditions, if not seek block, the position of prefix shortkey columns is wrong.	2019-09-19 20:08:03 +08:00
ZHAO Chun	17e52a4bac	Improve LRUCache to get better performance (#1826 ) In this CL, I move the entry's deleter out of LRUCache's mutex block, which can let others access this cache without waiting free cache entry.	2019-09-19 17:37:02 +08:00
ZHAO Chun	11eafe524f	Add ChunkAllocator to accelerate chunk allocation (#1792 ) I add ChunkAllocator in this CL to put unused memory chunk to a chunk pool other than return it to system allocator. Now we only change MemPool's chunk allocation and free to this. And two configuration are introduduced too. 'chunk_reserved_bytes_limit' is the limit of how many bytes this chunk pool can reserve in total and its default value is 2147483648(2GB). 'use_mmap_allocate_chunk': if chunk is allocated via mmap and default value is false. And in my test case with default configuration a simple like "select * from table limit 10", this can improve throughput from 280 QPS to to 650 QPS. And when I config 'chunk_reserved_bytes_limit' to 0, which means this is disabled, the throughput is the same with origin's.	2019-09-13 08:27:24 +08:00
Mingyu Chen	9aa2045987	Refactor alter job (#1695 )	2019-09-12 16:31:29 +08:00
wubiao	dad4def708	Support estimate size for v2 segment writer (#1787 )	2019-09-12 15:15:39 +08:00
Dayue Gao	5653822298	Writer magic number in footer instead of header (#1771 )	2019-09-10 09:54:13 +08:00
kangkaisen	cd5cfea5cc	Encapsulate HLL logic (#1756 )	2019-09-09 15:52:10 +08:00

1 2 3 4

169 Commits