Apache Doris just ‘graduated’: Why care about this SQL data warehouse

In case you are wondering who “she” is and what school she went to, Doris is an open source, SQL-based massively parallel processing (MPP) analytical data warehouse that was under development at the Apache Incubator.

Last week, Doris achieved the status of top-level project, which according to the Apache Software Foundation (ASF) means that “it has proven its ability to be properly self-governed.” 

The data warehouse was recently released in version 1.0, its eighth release while undergoing development at the incubator (along with six Connector releases). It has been built to support online analytical processing (OLAP) workloads, often used in data science scenarios.

Doris, originally known as Palo, was born inside Chinese internet search giant Baidu as a data warehousing system for its advertisement business before being open sourced in 2017 and entering the Apache Incubator in 2018.

Doris has roots in Apache Impala and Google Mesa

Doris, according to the Apache Software Foundation, is based on the integration of Google Mesa and Apache Impala, an open source MPP SQL query engine, developed in 2012 and based on the underpinnings of Google F1.

Mesa, which was designed around 2014 as a highly scalable analytic data warehousing system, was used to store critical measurement data related to Google’s internet advertising business.

According to its developers, both at Baidu and at the Apache Incubator, Doris offers simple design architecture while providing high availability, reliability, fault tolerance, and scalability.

“The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris,” the Apache Software Foundation said in a statement, adding that the data warehouse supports multidimensional reporting, user portraits, ad-hoc queries, and real-time dashboards.

Other features of Doris include columnar storage, parallel execution, vectorization technology, query optimization, ANSI SQL, and integration with big data ecosystems via connectors for Apache Flink, Apache Hive, Apache Hudi, Apache Iceberg, Apache Spark, and Elasticsearch, among other systems.

Uptake of open source databases forecast to grow

Uptake of enterprise-grade, open source databases has been expected to grow. In Gartner’s State of the Open-Source DBMS Market 2019 report, the consulting firm predicted that more than 70% of new in-house applications will be developed on an Open Source Database Management System (OSDBMS) or an OSDBMS-based Database Platform-as-a-Service (dbPaaS) by the end of 2022.

In addition, as data proliferates and businesses’ need for real-time analytics grows, a simple yet massively parallel processing database that is also open source seems to be the need of the hour.

“As data volumes have grown, MPP databases became the only realistic way to process data quickly enough or cheaply enough to meet organizations’ demands,” said David Menninger, research director at Ventana Research.

Cloud architecture fuels interest in MPP databases

Another trend fueling MPP databases is the availability of relatively inexpensive cloud-based server instances, which can be used as part of the MPP configuration, thus eliminating the need to procure and install the physical hardware these systems use, Menninger said.

Making a case for Doris, Menninger said that while there are many MPP database options, some of which are open sourced, there isn’t really an open source, MPP MySQL alternative.

“MySQL itself and MariaDB have been extended to support larger analytical workloads, but they were initially designed for transaction processing,” Menninger said, adding that Greenplum, an open source database based on PostgreSQL, and hyperscaler services such as Google BigQuery, Amazon Redshift, and Microsoft Azure Synapse could be considered rivals to Doris.

In addition, ClickHouse, Apache Druid, and Apache Pinot could also be considered rivals, said Sanjeev Mohan, former research vice president for big data and analytics at Gartner.

According to the Apache Software Foundation, using Doris could have multiple advantages, such as architectural simplicity and faster query times.

One of the reasons behind Doris’ simplicity is that it does not depend on multiple components for tasks such as cluster management, synchronization, and communication. Its fast query times can be attributed to vectorization, a technique that allows a program or an algorithm to operate on a set of multiple values at one time rather than on a single value.
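To make the idea of vectorization concrete, here is a minimal sketch in Python. It is not Doris code, and the column names, data, and batch size are hypothetical. It contrasts a row-at-a-time loop with a batched version that applies each step (filter, multiply, aggregate) to a whole slice of column values at once. Pure Python gains no hardware speedup from this; the sketch only shows the shape of the processing that a vectorized engine then maps onto tight, CPU-friendly loops.

```python
from array import array

# Hypothetical columnar table: each column is stored contiguously.
prices = array("d", [9.5, 20.0, 3.25, 14.0, 7.75, 30.0])
quantities = array("l", [2, 1, 4, 3, 0, 5])

BATCH_SIZE = 4  # illustrative only; assumed, not a Doris setting


def revenue_row_at_a_time(prices, quantities, min_price):
    """Handle one value per step: filter, multiply, accumulate."""
    total = 0.0
    for i in range(len(prices)):
        if prices[i] >= min_price:
            total += prices[i] * quantities[i]
    return total


def revenue_batched(prices, quantities, min_price, batch_size=BATCH_SIZE):
    """Apply each operator to a whole batch of column values at once."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = quantities[start:start + batch_size]
        # Operator 1: build a selection mask for the whole batch.
        mask = [value >= min_price for value in p]
        # Operator 2: multiply the two columns for the selected rows.
        products = [pv * qv for pv, qv, keep in zip(p, q, mask) if keep]
        # Operator 3: aggregate the batch into the running total.
        total += sum(products)
    return total


if __name__ == "__main__":
    print(revenue_row_at_a_time(prices, quantities, 9.0))
    print(revenue_batched(prices, quantities, 9.0))
```

Both functions return the same result; the difference is that the batched version does the same simple operation over many values in a row, which is the access pattern that columnar storage and vectorized execution are designed to exploit.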

Another benefit of the data warehouse, according to the developers at the Apache Software Foundation, is Doris’ ultra-high concurrency support, meaning it can handle requests from tens of thousands of users who are processing data and drawing insights from the database at the same time.

The need for high concurrency has increased because most organizations now give employees across the business access to data to generate insights, rather than limiting analytics to C-suite executives.

Copyright © 2022 IDG Communications, Inc.