[DOCS](refactor) refine en docs (#15244)
* Update basic-summary.md
* Update README.md
README.md
@ -35,7 +35,7 @@ under the License.
|
||||
|
||||
Apache Doris is an easy-to-use, high-performance and real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It only requires a sub-second response time to return query results under massive data and can support not only high-concurrent point query scenarios but also high-throughput complex analysis scenarios.
|
||||
|
||||
Based on this, Apache Doris can better meet the scenarios of report analysis, ad-hoc query, unified data warehouse, Data Lake Query Acceleration, etc. Users can build user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, order analysis, and other applications on top of this.
|
||||
All this makes Apache Doris an ideal tool for scenarios including report analysis, ad-hoc query, unified data warehouse, and data lake query acceleration. On Apache Doris, users can build various applications, such as user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, and order analysis.
|
||||
|
||||
|
||||
🎉 Version 1.2.0 is released now! It is a fully evolved release, and all users are encouraged to upgrade to it. Check out the 🔗[Release Notes](https://doris.apache.org/docs/releasenotes/release-1.2.0) here.
|
||||
@ -52,15 +52,15 @@ Apache Doris is widely used in the following scenarios:
|
||||
|
||||
- Reporting Analysis
|
||||
|
||||
- Real-time Dashboards
|
||||
- Real-time dashboards
|
||||
- Reports for in-house analysts and managers
|
||||
- Highly concurrent user-oriented or customer-oriented report analysis. For example, in the scenarios of site analysis for website owners and advertising reports for advertisers, the concurrency usually requires thousands of QPS and the query latency requires sub-second response. The famous e-commerce company JD.com uses Doris in advertising reports, writing 10 billion rows of data per day, with tens of thousands of concurrent query QPS and 150ms query latency for the 99th percentile.
|
||||
- Highly concurrent user-oriented or customer-oriented report analysis, such as website analysis and ad reporting, which usually require thousands of QPS and response times measured in milliseconds. For example, the Chinese e-commerce giant JD.com uses Doris for ad reporting, where it ingests 10 billion rows of data per day, handles over 10,000 QPS, and delivers a 99th-percentile query latency of 150 ms.
|
||||
|
||||
- Ad-Hoc Query. Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. XiaoMi has built a growth analytics platform (Growth Analytics, GA) based on Doris, using user behavior data for business growth analysis, with an average query latency of 10 seconds and a 95th percentile query latency of 30 seconds or less, and tens of thousands of SQL queries per day.
|
||||
|
||||
- Unified data warehouse construction. A platform to meet the needs of unified data warehouse construction and simplify the complicated data software stack. HaiDiLao's Doris-based unified data warehouse replaces the old architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix, and greatly simplifies the architecture.
|
||||
- Unified Data Warehouse Construction. Apache Doris allows users to build a unified data warehouse via one single platform and save the trouble of handling complicated software stacks. Chinese hot pot chain Haidilao has built a unified data warehouse with Doris to replace its old complex architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix.
|
||||
|
||||
- Data Lake Query. By federating the data located in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, the query performance is greatly improved while avoiding data copying.
|
||||
- Data Lake Query. Apache Doris avoids data copying by federating the data in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, and thus achieves outstanding query performance.
|
||||
|
||||
## 🖥️ Core Concepts
|
||||
|
||||
@ -68,54 +68,54 @@ Apache Doris is widely used in the following scenarios:
|
||||
|
||||
The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes.
|
||||
|
||||
- Frontend(FE): It is mainly responsible for user request access, query parsing and planning, management of metadata, and node management-related work.
|
||||
- Frontend (FE): user request access, query parsing and planning, metadata management, node management, etc.
|
||||
|
||||
- Backend(BE): It is mainly responsible for data storage and query plan execution.
|
||||
- Backend (BE): data storage and query plan execution
|
||||
|
||||
Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. And these two types of processes guarantee high availability of services and high reliability of data through consistency protocols. This highly integrated architecture design greatly reduces the operation and maintenance cost of a distributed system.
|
||||
|
||||

|
||||
|
||||
Apache Doris adopts MySQL protocol, highly compatible with MySQL dialect, and supports standard SQL. Users can access Doris through various client tools and support seamless connection with BI tools.
|
||||
In terms of interfaces, Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with the MySQL dialect. Users can access Doris through various client tools, and Doris connects seamlessly with BI tools.
|
||||
|
||||
### 💾 Storage Engine
|
||||
|
||||
In terms of the storage engine, Apache Doris uses columnar storage to encode and compress and read data by column, enabling a very high compression ratio while reducing a large number of scans of non-relevant data, thus making more efficient use of IO and CPU resources. Doris also supports a relatively rich index structure to reduce data scans:
|
||||
Doris uses a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high compression ratio and greatly reduces scans of irrelevant data, thus making more efficient use of IO and CPU resources. Doris supports various index structures to minimize data scans:
|
||||
|
||||
- Support sorted compound key index: Up to three columns can be specified to form a compound sort key. With this index, data can be effectively pruned to better support high concurrent reporting scenarios.
|
||||
- Z-order index: Using Z-order indexing, you can efficiently run range queries on any combination of fields in your schema.
|
||||
- MIN/MAX index: Effective filtering of equivalence and range queries for numeric types
|
||||
- Bloom Filter index: very effective for equivalence filtering and pruning of high cardinality columns
|
||||
- Invert index: It enables the fast search of any field.
|
||||
- Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios.
|
||||
- Z-order Index: This allows users to efficiently run range queries on any combination of fields in their schema.
|
||||
- MIN/MAX Indexing: This enables effective filtering of equivalence and range queries for numeric types.
|
||||
- Bloom Filter: very effective in equivalence filtering and pruning of high cardinality columns
|
||||
- Inverted Index: This enables fast search on any field.
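To make the MIN/MAX index item above more concrete, here is a minimal Python sketch of how per-block minimum and maximum values let a range query skip blocks without reading them. The `Block` class and `scan_range` function are hypothetical names invented for this illustration; they are not part of Doris, and the real storage engine is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block holding the values of one numeric column."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # The min/max metadata is computed once, when the block is written.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values v with low <= v <= high and the number of blocks read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip the block if its [min, max] range cannot overlap the query range.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


blocks = [Block([1, 5, 9]), Block([10, 14, 19]), Block([20, 25, 31])]
matches, blocks_read = scan_range(blocks, 12, 22)
print(matches, blocks_read)
```

In this example the first block is pruned from metadata alone, so only two of the three blocks are read.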
|
||||
|
||||
|
||||
### 💿 Storage Models
|
||||
|
||||
In terms of storage models, Apache Doris supports a variety of storage models, with specific optimizations for different scenarios:
|
||||
Doris supports a variety of storage models and has optimized them for different scenarios:
|
||||
|
||||
- Aggregate Key model: Merge the value columns with the same keys, by aggregating in advance to significantly improve performance.
|
||||
- Aggregate Key Model: merges the value columns of rows with the same keys in advance, which significantly improves performance
|
||||
|
||||
- Unique Key model: The key is unique. Data with the same key will be overwritten to achieve row-level data updates.
|
||||
- Unique Key Model: Keys are unique in this model and data with the same key will be overwritten to achieve row-level data updates.
|
||||
|
||||
- Duplicate Key model: The detailed data model can satisfy the detailed storage of fact tables.
|
||||
- Duplicate Key Model: This model keeps every row as written, which suits the detailed storage of fact tables.
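The difference between the three storage models listed above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and `SUM` is assumed as the aggregation for the Aggregate Key case (the text above only says that value columns are merged).

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (key, value)


def load_aggregate(rows: List[Row]) -> Dict[str, int]:
    """Aggregate Key sketch: values of rows with the same key are merged (SUM assumed)."""
    table: Dict[str, int] = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value
    return table


def load_unique(rows: List[Row]) -> Dict[str, int]:
    """Unique Key sketch: a later row with the same key overwrites the earlier one."""
    table: Dict[str, int] = {}
    for key, value in rows:
        table[key] = value
    return table


def load_duplicate(rows: List[Row]) -> List[Row]:
    """Duplicate Key sketch: every row is kept, including rows with repeated keys."""
    return list(rows)


rows = [("2023-01-01", 10), ("2023-01-02", 5), ("2023-01-01", 7)]
aggregated = load_aggregate(rows)
unique = load_unique(rows)
duplicated = load_duplicate(rows)
print(aggregated, unique, duplicated)
```

The same three input rows end up as two merged rows, two overwritten rows, or three detailed rows, depending on the model.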
|
||||
|
||||
Apache Doris also supports strong consistent materialized views, where updates and selections of materialized views are made automatically within the system and do not require manual selection by the user, thus significantly reducing the cost of materialized view maintenance.
|
||||
Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users.
|
||||
|
||||
### 🔍 Query Engine
|
||||
|
||||
In terms of query engine, Apache Doris adopts the MPP model, with parallel execution between and within nodes, and also supports distributed shuffle join for multiple large tables, which can better cope with complex queries.
|
||||
Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries.
|
||||
|
||||

|
||||
|
||||
The Doris query engine is vectorized, and all memory structures can be laid out in a columnar format to achieve significant reductions in virtual function calls, improved Cache hit rates, and efficient use of SIMD instructions. Performance in wide table aggregation scenarios is 5–10 times higher than in non-vectorized engines.
|
||||
The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This greatly reduces virtual function calls, improves cache hit rates, and makes efficient use of SIMD instructions. In wide table aggregation scenarios, Doris delivers 5–10 times higher performance than non-vectorized engines.
|
||||
|
||||

|
||||
|
||||
Apache Doris uses Adaptive Query Execution technology, which can dynamically adjust the execution plan based on runtime statistics, such as runtime filter technology to generate filters to push to the probe side at runtime and to automatically penetrate the filters to the probe side which drastically reduces the amount of data in the probe and speeds up join performance. Doris' runtime filter supports In/Min/Max/Bloom filter.
|
||||
Apache Doris uses Adaptive Query Execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter, push it to the probe side, and automatically push it down to the Scan node at the bottom, which drastically reduces the amount of data in the probe and speeds up joins. The runtime filter in Doris supports In, Min/Max, and Bloom filters.
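The runtime filter idea described above can be sketched as a toy hash join in Python. The function name and row shapes are assumptions made for this illustration, not Doris internals; the sketch combines an In filter with a Min/Max filter and applies both before probe rows reach the join.

```python
from typing import Dict, List, Tuple

BuildRow = Tuple[int, str]  # (join key, payload)
ProbeRow = Tuple[int, str]


def hash_join_with_runtime_filter(
    build_rows: List[BuildRow], probe_rows: List[ProbeRow]
) -> Tuple[List[Tuple[int, str, str]], int]:
    """Join on the key; return (joined rows, probe rows that passed the filter)."""
    # Build phase: hash the (usually smaller) build side.
    hash_table: Dict[int, List[str]] = {}
    for key, payload in build_rows:
        hash_table.setdefault(key, []).append(payload)
    if not hash_table:
        return [], 0

    # Runtime filter derived from the build side: an In set plus Min/Max bounds.
    in_filter = set(hash_table)
    low, high = min(in_filter), max(in_filter)

    # "Scan" phase: the filter is applied while reading the probe side,
    # so non-matching rows never reach the join operator.
    scanned = [(k, p) for k, p in probe_rows if low <= k <= high and k in in_filter]

    # Probe phase: only the surviving rows are looked up in the hash table.
    joined = [(k, b, p) for k, p in scanned for b in hash_table[k]]
    return joined, len(scanned)


build = [(1, "a"), (3, "c")]
probe = [(1, "x"), (2, "y"), (3, "z"), (4, "w"), (1, "v")]
joined, passed = hash_join_with_runtime_filter(build, probe)
print(joined, passed)
```

Only three of the five probe rows survive the filter here, which is the data reduction the paragraph above refers to.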
|
||||
|
||||
### 🚅 Query Optimizer
|
||||
|
||||
In terms of the optimizer, Doris uses a combination of CBO and RBO, with RBO supporting constant folding, subquery rewriting, predicate pushdown, etc., and CBO supporting Join Reorder. CBO is still under continuous optimization, mainly focusing on more accurate statistical information collection and derivation, more accurate cost model prediction, etc.
|
||||
In terms of optimizers, Doris uses a combination of a cost-based optimizer (CBO) and a rule-based optimizer (RBO). The RBO supports constant folding, subquery rewriting, and predicate pushdown, while the CBO supports Join Reorder. The Doris CBO is under continuous optimization for more accurate statistics collection and derivation and more accurate cost model prediction.
|
||||
|
||||
|
||||
**Technical Overview**: 🔗[Introduction to Apache Doris](https://doris.apache.org/docs/summary/basic-summary)
|
||||
|
||||
@ -23,11 +23,11 @@ under the License.
|
||||
|
||||
# Introduction to Apache Doris
|
||||
|
||||
Apache Doris is a high-performance, real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It only requires a sub-second response time to return query results under massive data and can support not only high-concurrent point query scenarios but also high-throughput complex analysis scenarios. Based on this, Apache Doris can better meet the scenarios of report analysis, ad-hoc query, unified data warehouse, Data Lake Query Acceleration, etc. Users can build user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, order analysis, and other applications on top of this.
|
||||
Apache Doris is a high-performance, real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It only requires a sub-second response time to return query results under massive data and can support not only high-concurrent point query scenarios but also high-throughput complex analysis scenarios. All this makes Apache Doris an ideal tool for scenarios including report analysis, ad-hoc query, unified data warehouse, and data lake query acceleration. On Apache Doris, users can build various applications, such as user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, and order analysis.
|
||||
|
||||
Apache Doris was first born as Palo project for Baidu's ad reporting business, officially open-sourced in 2017, donated by Baidu to the Apache Foundation for incubation in July 2018, and then incubated and operated by members of the incubator project management committee under the guidance of Apache mentors. Currently, the Apache Doris community has gathered more than 400 contributors from nearly 100 companies in different industries, and the number of active contributors is close to 100 per month. Apache Doris has graduated from Apache incubator successfully and become a Top-Level Project in June 2022.
|
||||
Apache Doris, formerly known as Palo, was initially created to support Baidu's ad reporting business. It was officially open-sourced in 2017 and donated by Baidu to the Apache Software Foundation for incubation in July 2018, where it was operated by members of the incubator project management committee under the guidance of Apache mentors. Currently, the Apache Doris community has gathered more than 400 contributors from nearly 100 companies in different industries, and the number of active contributors is close to 100 per month. In June 2022, Apache Doris graduated from the Apache Incubator as a Top-Level Project.
|
||||
|
||||
Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in over 1000 companies worldwide. More than 80% of the top 50 Internet companies in China in terms of market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Weibo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.
|
||||
Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in over 1000 companies worldwide. Of the top 50 Chinese Internet companies by market capitalization (or valuation), more than 80% are long-term users of Apache Doris, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Weibo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.
|
||||
|
||||
# Usage Scenarios
|
||||
|
||||
@ -38,54 +38,56 @@ Apache Doris is widely used in the following scenarios:
|
||||
|
||||
- Reporting Analysis
|
||||
|
||||
- Real-time Dashboards
|
||||
- Real-time dashboards
|
||||
- Reports for in-house analysts and managers
|
||||
- Highly concurrent user-oriented or customer-oriented report analysis: For example, in the scenarios of site analysis for website owners and advertising reports for advertisers, the concurrency usually requires thousands of QPS and the query latency requires sub-seconds response. The famous e-commerce company JD.com uses Doris in advertising reports, writing 10 billion rows of data per day, with tens of thousands of concurrent query QPS and 150ms query latency for the 99th percentile.
|
||||
- Highly concurrent user-oriented or customer-oriented report analysis, such as website analysis and ad reporting, which usually require thousands of QPS and response times measured in milliseconds. For example, the Chinese e-commerce giant JD.com uses Doris for ad reporting, where it ingests 10 billion rows of data per day, handles over 10,000 QPS, and delivers a 99th-percentile query latency of 150 ms.
|
||||
|
||||
- Ad-Hoc Query. Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. XiaoMi has built a growth analytics platform (Growth Analytics, GA) based on Doris, using user behavior data for business growth analysis, with an average query latency of 10 seconds and a 95th percentile query latency of 30 seconds or less, and tens of thousands of SQL queries per day.
|
||||
|
||||
- Unified data warehouse construction. A platform to meet the needs of unified data warehouse construction and simplify the complicated data software stack. HaiDiLao's Doris-based unified data warehouse replaces the old architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix, and greatly simplifies the architecture.
|
||||
- Unified Data Warehouse Construction. Apache Doris allows users to build a unified data warehouse via one single platform and save the trouble of handling complicated software stacks. Chinese hot pot chain Haidilao has built a unified data warehouse with Doris to replace its old complex architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix.
|
||||
|
||||
- Data Lake Query. By federating the data located in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, the query performance is greatly improved while avoiding data copying.
|
||||
- Data Lake Query. Apache Doris avoids data copying by federating the data in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, and thus achieves outstanding query performance.
|
||||
|
||||
# Technical Overview
|
||||
|
||||
The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes.
|
||||
As shown in the figure below, the Apache Doris architecture is simple and neat, with only two types of processes.
|
||||
|
||||
- Frontend(FE): It is mainly responsible for user request access, query parsing and planning, management of metadata, and node management-related work.
|
||||
- Backend(BE): It is mainly responsible for data storage and query plan execution.
|
||||
- Frontend (FE): user request access, query parsing and planning, metadata management, node management, etc.
|
||||
- Backend (BE): data storage and query plan execution
|
||||
|
||||
Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. And these two types of processes guarantee high availability of services and high reliability of data through consistency protocols. This highly integrated architecture design greatly reduces the operation and maintenance cost of a distributed system.
|
||||
|
||||

|
||||
|
||||
Apache Doris adopts MySQL protocol, highly compatible with MySQL dialect, and supports standard SQL. Users can access Doris through various client tools and support seamless connection with BI tools.
|
||||
In terms of **interfaces**, Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with the MySQL dialect. Users can access Doris through various client tools, and Doris connects seamlessly with BI tools.
|
||||
|
||||
In terms of the storage engine, Doris uses columnar storage to encode and compress and read data by column, enabling a very high compression ratio while reducing a large number of scans of non-relevant data, thus making more efficient use of IO and CPU resources.
|
||||
Doris also supports a relatively rich index structure to reduce data scans:
|
||||
Doris uses a **columnar storage engine**, which encodes, compresses, and reads data by column. This enables a very high compression ratio and greatly reduces scans of irrelevant data, thus making more efficient use of IO and CPU resources.
|
||||
|
||||
- Support sorted compound key index: Up to three columns can be specified to form a compound sort key. With this index, data can be effectively pruned to better support high concurrent reporting scenarios.
|
||||
- Z-order index: Using Z-order indexing, you can efficiently run range queries on any combination of fields in your schema.
|
||||
- MIN/MAX indexing: Effective filtering of equivalence and range queries for numeric types
|
||||
- Bloom Filter: very effective for equivalence filtering and pruning of high cardinality columns
|
||||
- Invert Index: It enables the fast search of any field
|
||||
Doris supports various **index** structures to minimize data scans:
|
||||
|
||||
In terms of storage models, Doris supports a variety of storage models, with specific optimizations for different scenarios:
|
||||
- Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios.
|
||||
- Z-order Index: This allows users to efficiently run range queries on any combination of fields in their schema.
|
||||
- MIN/MAX Indexing: This enables effective filtering of equivalence and range queries for numeric types.
|
||||
- Bloom Filter: very effective in equivalence filtering and pruning of high cardinality columns
|
||||
- Inverted Index: This enables fast search on any field.
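As an illustration of the Bloom Filter item in the list above, the following Python sketch shows why such a filter suits equivalence filtering on high cardinality columns: a small bit array can answer "definitely not present" without touching the data. The `BloomFilter` class, its sizes, and the SHA-256 based hashing are assumptions chosen for readability, not a description of how Doris implements it.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack the bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means the item was never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


user_ids = ["u-1001", "u-2002", "u-3003"]
bloom = BloomFilter()
for user_id in user_ids:
    bloom.add(user_id)

print([bloom.might_contain(u) for u in user_ids])
```

A block whose filter returns `False` for the queried value can be skipped entirely; a `True` still requires reading the block to confirm.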
|
||||
|
||||
- Aggregate Key Model: Merge the value columns with the same keys, by aggregating in advance to significantly improve performance.
|
||||
- Unique Key model: The key is unique. Data with the same key will be overwritten to achieve row-level data updates.
|
||||
- Duplicate Key model: The detailed data model can satisfy the detailed storage of fact tables.
|
||||
Doris supports a variety of **storage models** and has optimized them for different scenarios:
|
||||
|
||||
Doris also supports strong consistent materialized views, where updates and selections of materialized views are made automatically within the system and do not require manual selection by the user, thus significantly reducing the cost of materialized view maintenance.
|
||||
- Aggregate Key Model: merges the value columns of rows with the same keys in advance, which significantly improves performance
|
||||
- Unique Key Model: Keys are unique in this model and data with the same key will be overwritten to achieve row-level data updates.
|
||||
- Duplicate Key Model: This model keeps every row as written, which suits the detailed storage of fact tables.
|
||||
|
||||
In terms of query engine, Doris adopts the MPP model, with parallel execution between and within nodes, and also supports distributed shuffle join for multiple large tables, which can better cope with complex queries.
|
||||
Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users.
|
||||
|
||||
Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries.
|
||||
|
||||

|
||||
|
||||
The Doris query engine is vectorized, and all memory structures can be laid out in a columnar format to achieve significant reductions in virtual function calls, improved Cache hit rates, and efficient use of SIMD instructions. Performance in wide table aggregation scenarios is 5–10 times higher than in non-vectorized engines.
|
||||
The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This greatly reduces virtual function calls, improves cache hit rates, and makes efficient use of SIMD instructions. In wide table aggregation scenarios, Doris delivers 5–10 times higher performance than non-vectorized engines.
|
||||
|
||||

|
||||
|
||||
Apache Doris uses Adaptive Query Execution technology, which can dynamically adjust the execution plan based on runtime statistics, such as runtime filter technology to generate filters to push to the probe side at runtime and to automatically penetrate the filters to the probe side which drastically reduces the amount of data in the probe and speeds up join performance. Doris' runtime filter supports In/Min/Max/Bloom filter.
|
||||
Apache Doris uses **Adaptive Query Execution technology** to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter, push it to the probe side, and automatically push it down to the Scan node at the bottom, which drastically reduces the amount of data in the probe and speeds up joins. The runtime filter in Doris supports In, Min/Max, and Bloom filters.
|
||||
|
||||
In terms of the optimizer, Doris uses a combination of CBO and RBO, with RBO supporting constant folding, subquery rewriting, predicate pushdown, etc., and CBO supporting Join Reorder. CBO is still under continuous optimization, mainly focusing on more accurate statistical information collection and derivation, more accurate cost model prediction, etc.
|
||||
In terms of **optimizers**, Doris uses a combination of a cost-based optimizer (CBO) and a rule-based optimizer (RBO). The RBO supports constant folding, subquery rewriting, and predicate pushdown, while the CBO supports Join Reorder. The Doris CBO is under continuous optimization for more accurate statistics collection and derivation and more accurate cost model prediction.
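Constant folding, one of the rule-based rewrites named above, can be sketched on a tiny expression tree. The `Const`, `Column`, and `BinOp` classes and the `fold` function are hypothetical and exist only for this example; they do not mirror the Doris optimizer's actual data structures.

```python
import operator
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, BinOp]

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(expr: Expr) -> Expr:
    """Rewrite rule: replace any sub-expression made only of constants by its value."""
    if isinstance(expr, BinOp):
        left, right = fold(expr.left), fold(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(_OPS[expr.op](left.value, right.value))
        return BinOp(expr.op, left, right)
    # Constants and column references are already in their simplest form.
    return expr


# price * (60 * 60 * 24): the constant part is evaluated once at planning time.
expr = BinOp("*", Column("price"), BinOp("*", BinOp("*", Const(60), Const(60)), Const(24)))
folded = fold(expr)
print(folded)
```

The rule fires regardless of data statistics, which is what distinguishes such RBO rewrites from cost-based decisions like Join Reorder.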
|
||||
|
||||
|
||||