Apache Druid: Interactive Analytics at Scale
Apache Druid is a high-performance real-time analytics database. It is an open-source, distributed, column-oriented data store that lets users ingest large amounts of data and query it quickly. We used Apache Druid on a recent engagement. In this blog post, we’ll tell you about Druid itself and our experience with it.
Today’s technologies make it possible to process ever larger volumes of data, but many companies still struggle to analyze it. To address this, they turn to Big Data analytics solutions, including those that support real-time data analysis.
Let’s see how Druid can serve as such a solution.
What is Druid?
Apache Druid is a real-time analytics database designed for fast ad-hoc analytics (OLAP queries) on large data sets. Druid is most often used as a database where real-time ingest, fast query performance and high uptime are important.
Druid works best in scenarios where incoming data is represented as a series of events. It combines the capabilities of different storage classes like:
- Time-Series database
- Data Warehouse
- Search systems
In other words, Druid sits at the intersection of these three areas.
When to use Apache Druid
Druid is a distributed store that successfully serves data in the following scenarios:
- Insert rates are very high, but updates are less common.
- Most of the queries are aggregation and reporting queries (such as “group by” queries), although search and scan queries are also possible.
- Expected query latency is in a range of 100ms to a few seconds.
- Data has a time component. Druid includes optimizations and design choices specifically related to time.
- The schema may contain many tables, but most of the queries hit just one big distributed table. Queries may also hit one or more smaller “lookup” tables.
- The data has high-cardinality columns (e.g., URLs, user IDs), and you need fast counting and ranking over them.
- Data is ingested from Kafka, HDFS, flat files, or object storage such as Amazon S3.
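To make the typical workload concrete, here is a minimal sketch of the kind of aggregation query described above, written as a request body for Druid's SQL endpoint (POST /druid/v2/sql). The datasource name page_events and the columns url and user_id are hypothetical and do not come from our project; the sketch only builds the payload and does not send it.

```python
import json


def build_top_urls_query(datasource: str, limit: int = 10) -> dict:
    """Build a JSON body for Druid's SQL endpoint (POST /druid/v2/sql).

    The column names `url` and `user_id` are illustrative assumptions.
    """
    sql = (
        "SELECT url, COUNT(*) AS hits, "
        "APPROX_COUNT_DISTINCT(user_id) AS unique_users "
        f'FROM "{datasource}" '
        # __time is Druid's built-in timestamp column; filtering on it
        # lets Druid skip segments outside the requested interval.
        "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
        f"GROUP BY url ORDER BY hits DESC LIMIT {int(limit)}"
    )
    return {"query": sql}


# Hypothetical datasource name, used only for illustration.
payload = build_top_urls_query("page_events")
print(json.dumps(payload, indent=2))
```

The query combines a time filter, a “group by” over a high-cardinality column, and an approximate distinct count, which are the three things the list above says Druid handles well.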
When NOT to use Apache Druid
Despite Druid’s ability to process large amounts of data, there are situations where you would NOT want to use it:
- Low-latency updates of existing records using a primary key. Druid supports streaming inserts, but not streaming updates (i.e., updates are done using background batch jobs).
- Storing all the historical data in a raw format, without any aggregation or transformation.
- Building an offline reporting system where query latency is not very important.
- Running queries with “big” joins (joining one big fact table to another big fact table) that you are comfortable letting run for a long time.
Our Druid experience
Our team worked with Druid for two years. It was an interesting and fruitful journey during which we accomplished the following:
- Added real-time analytics
- Moved Druid from on-prem to Kubernetes
- Built a reporting engine on top of Druid
- Optimized the data structure inside Druid, resulting in 30% cost savings
- Optimized Druid Hadoop jobs for data ingestion
- Extended Druid data (metrics/dimensions) for better data analysis
Druid served tens of terabytes of data that hundreds of users could query quickly to build their analytics. Despite the high query rate, the responses were fast. Most of the queries completed in 1-3 seconds.
Our team started working with Druid when it was deployed manually on Amazon EC2 instances. The Druid cluster consisted of 40+ instances, plus a MiddleManager cluster of 20+ worker nodes for segment computation. Updating and upgrading the cluster was a challenge.
After one year, we were running the Druid cluster in Kubernetes, which made cluster support easier and let us upgrade Druid with minimal effort. The introduction of real-time analytics to Druid gave users a clear, current picture of the product and operations.
Let’s describe a couple of these achievements in more detail.
How we implemented real-time ingestion
Originally, we had a batch job that successfully processed 500 GB of compressed events in an hour. It produced the required reports and dashboards for managers and gave them an understanding of business productivity. However, there was one inconvenience: the reports lagged by about six hours, so they were already somewhat outdated by the time they became available.
Our client required real-time analytics to better respond to changes in advertising campaigns. To support this requirement, we built a real-time processing pipeline based on the druid-kafka-indexing-service core Apache Druid extension.
This extension reads data from Kafka, processes it, and creates Druid segments from it. To start using this feature, you compose an ingestion spec of type “kafka” (a supervisor spec) and submit it to the Overlord. The Overlord then assigns the resulting ingestion tasks to available MiddleManagers for processing. The ingestion spec accepts many parameters. The most interesting are:
- topic: the Kafka topic to read from
- inputFormat and parser: specify the incoming data format. inputFormat is the newer, preferred option and supports the following values: csv, delimited, and json. parser is the legacy option, which supports avro_stream, protobuf, and thrift
- consumerProperties: a map of properties to be passed to the Kafka consumer
- taskDuration: the length of time before tasks stop reading and begin publishing their segment
To give users a finer-grained view of the data, we set segmentGranularity to MINUTE. As a result, users could not only see incoming events in real time, but also run analytics queries over this data.
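The parameters above can be sketched as a supervisor spec. This is an illustration under stated assumptions, not the spec we ran in production: the datasource, topic, broker address, and dimension names are hypothetical, and the nesting follows the layout used in recent Druid documentation (older releases arrange the top-level fields differently).

```python
import json


def build_kafka_supervisor_spec(datasource: str, topic: str,
                                bootstrap_servers: str) -> dict:
    """Assemble a Kafka supervisor spec as a plain dict.

    Field values such as the timestamp column and dimensions are
    illustrative assumptions, not taken from the original project.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["campaign_id", "country"]},
                "granularitySpec": {
                    "type": "uniform",
                    # One segment per minute of event time, as in the post.
                    "segmentGranularity": "MINUTE",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                # ISO 8601 period: tasks read for one hour, then publish.
                "taskDuration": "PT1H",
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Hypothetical names; the resulting JSON would be POSTed to the
# Overlord's supervisor endpoint (/druid/indexer/v1/supervisor).
spec = build_kafka_supervisor_spec("ad_events", "ad_events", "kafka:9092")
print(json.dumps(spec, indent=2))
```

Building the spec in code rather than by hand makes it easy to keep one template per environment and vary only the topic and broker address.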
How we run Druid in Kubernetes
Druid clusters consist of many different services (e.g., Coordinator, Overlord, Broker, Historical, MiddleManager). In our case, the maintenance of these services was not easy because it was done manually. No DevOps tools like Ansible, Chef or Puppet were used.
In addition, we had to make sure a new instance was synchronized with the rest of the Druid cluster. We often faced an out-of-sync problem when we had to update or upgrade our cluster. The problem was that we needed to repeat the same set of provisioning operations on every node of the 50+ node cluster.
We had to verify that the configuration of the updated Druid component matched the components currently running in the cluster. In addition, we had to verify that it was compatible with data in other environments, including production. The same problem arose when we scaled the cluster up (i.e., added more Historical or Broker nodes). This sort of manual maintenance was a nightmare.
After successfully evaluating Druid on Kubernetes in staging, we decided to run it in production. It was mostly a DevOps routine. The first step was to prepare Druid assets to run on Kubernetes. The typical assets were:
- Druid configuration files
- Binary files to run in a Docker container or in a Hadoop Cluster
- Definitions of all Kubernetes elements, such as Services, Ingress, StatefulSets, Mounts, etc.
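As an illustration of the last item, the sketch below generates a StatefulSet definition for Historical nodes as JSON, which kubectl accepts alongside YAML. Everything specific here is an assumption made for the example: the resource names, the image tag, the mount path, the storage size, and the use of the service name as the container argument. Our actual definitions are not reproduced in this post.

```python
import json


def historical_statefulset(replicas: int, image: str) -> dict:
    """Build a Kubernetes StatefulSet manifest for Druid Historical nodes.

    All names, paths, and sizes are hypothetical placeholders.
    """
    labels = {"app": "druid-historical"}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "druid-historical", "labels": labels},
        "spec": {
            "serviceName": "druid-historical",
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "historical",
                        "image": image,
                        # Assumes the image entrypoint takes the service name.
                        "args": ["historical"],
                        # 8083 is the default Historical port.
                        "ports": [{"containerPort": 8083}],
                        "volumeMounts": [{
                            "name": "segment-cache",
                            "mountPath": "/opt/druid/var",
                        }],
                    }],
                },
            },
            # Each replica gets its own persistent volume for cached segments.
            "volumeClaimTemplates": [{
                "metadata": {"name": "segment-cache"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "100Gi"}},
                },
            }],
        },
    }


# Hypothetical image tag, used only for illustration.
manifest = historical_statefulset(replicas=3, image="apache/druid:example-tag")
print(json.dumps(manifest, indent=2))
```

A StatefulSet suits Historical nodes because each pod keeps a stable identity and its own volume, so cached segments survive a pod restart. Scaling then comes down to changing the replicas value.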
The second step was to resolve network problems. Because of security restrictions, some parts of the Druid cluster were not visible to Kubernetes.
When the assets were successfully tested and network problems were resolved, we could run the Druid cluster under Kubernetes orchestration in production. This gave us the ability to easily manage and monitor our cluster.
We could scale Historical and Broker nodes and upgrade Druid services directly on the running production cluster. Upgrading the cluster “on the fly” is not a problem for Druid because it supports rolling updates. All of this now took minutes instead of hours.
For our team, working with Druid was a rewarding and useful experience. It’s a great solution for running analytics over large amounts of event-based data. It is mature and evolving quickly, and it satisfies the requirements of many growth-stage companies that need a Big Data analytics tool.
Get in touch
We hope you found this post useful. If you’re interested in learning more about the services we provide, feel free to email us at email@example.com. You can find additional ways to reach us on our Contact page.