
Exasol on a cluster of 24 nodes and with 400 TB of data: the Badoo experience



As many of our readers already know, Exasol is one of the fastest analytical databases in the world and a performance leader in TPC-H benchmarks. In its 20 years of existence, the database has reached more than 500 installations worldwide and already has several large clients in Russia (Citimobile, Badoo, Adidas). Today we describe how Exasol is used at Badoo, a large international social dating network.


About Badoo's BI department

Badoo is a large social dating network, with over 420 million users and over 60 million monthly active users. The company's central office is in London, but since the owner of the company is from Russia, Badoo also has a large office in Moscow, where part of the BI team is located.

So, Badoo's BI department is:

a.  40 employees who collect, research and analyze user data in order to understand user behavior

b.  7 teams (including data engineering, ETL, product analytics, data science, machine learning)

c. 2.5 PB of data (in a Hadoop data lake)

d. 1.5 million user-behavior events per second coming from the applications

e. 10 TB/day processed in ETL systems

Video about Exasol in Badoo in Russian

For those who prefer not to read at length, we have put together a 6-minute video about:

a. how Exasol is used in Badoo;

b. on which cluster it is deployed;

c. what product (and not only) analytics tasks it is used for, including A/B testing;

d. what the BI director and data engineers at Badoo think about Exasol. 


Exasol in Badoo: how it all started (2015)

In 2015, Vectorwise, Badoo's main analytics database at the time, could no longer cope with the accumulated data volumes because it is a single-node solution.

After several months of testing, Badoo was satisfied with Exasol's performance (relative to the complexity of the SQL queries and the hardware used) and deployed Exasol on a two-node cluster.

November 2015: growth of the Exasol installation to a cluster of 8 nodes

The Exasol cluster was expanded to 8 servers with the following specifications:

a. Server hardware:

- 16 to 20 CPU cores;

- 768 GB RAM;

- 16×1 TB HDD (RAID 1, 8 TB);

- 10 Gbit network.

b. Total storage available to Exasol: 5.6 TB

c. Total uncompressed data capacity: 85 TB

d. Size of large tables: 500 million to 50 billion rows

e. One analytical query: about 4.5 billion rows on average

Exasol performance on real queries for November 2015 (all queries):

These are real Badoo stats from real queries from live users:

Exasol performance on real queries for November 2015 (queries lasting 1 second or more):

Separate statistics are collected for queries lasting 1 second or more. This reduces the influence of very fast in-memory queries on the result and gives the most realistic picture of the cases in which Exasol still has to read something from disk.

Even in the most complex cases, Exasol completes the calculation in seconds or minutes; it is never a matter of hours or days.

How is the number of objects in a user query counted in these statistics? Query complexity is estimated simply by searching for the words FROM and JOIN in the SQL text. This gives the approximate number of objects used: the more objects a query uses, the more complex it is considered.
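The keyword-counting heuristic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Badoo's actual tooling: the function name and the sample query are hypothetical, and a plain text search like this will also count FROM or JOIN inside string literals, comments, or expressions such as EXTRACT(... FROM ...).

```python
import re

# Matches FROM or JOIN as whole words, ignoring case.
_OBJECT_KEYWORDS = re.compile(r"\b(?:FROM|JOIN)\b", re.IGNORECASE)


def estimate_query_complexity(sql: str) -> int:
    """Approximate the number of objects used by counting FROM and JOIN keywords."""
    return len(_OBJECT_KEYWORDS.findall(sql))


# Hypothetical query used only to demonstrate the count.
example_sql = """
SELECT u.country, COUNT(*)
FROM events e
JOIN users u ON u.user_id = e.user_id
LEFT JOIN payments p ON p.user_id = u.user_id
GROUP BY u.country
"""

print(estimate_query_complexity(example_sql))  # 3 (one FROM, two JOINs)
```

The estimate is deliberately rough: it is cheap to compute over a large query log and is good enough for grouping queries into complexity buckets.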

Exasol in Badoo data warehouse infrastructure

Historically, Badoo does not use the cloud, only its own data center. The data lake runs on Hadoop and holds 2.5 petabytes of raw data, deployed on a "modest" cluster of 90 nodes.

"Exasol is the heart of our analytics ecosystem. We started a fairly large-scale pilot with Exasol 5 years ago because we had reached a 'ceiling' in our previous data warehouse. After a few months we deployed the solution on 2 nodes, and after a few more iterations we arrived at a cluster of 24 nodes plus 2 backup nodes. We have about 16 TB of memory and 400 TB of data in Exasol. This is hot data that is used by analysts, machine learning teams, data science and others. We supply data to Exasol from the Hadoop data lake," says Artem Ivanov, Director of Business Intelligence, Badoo.

Briefly about the Badoo data warehouse:

Exasol as an analytical database: the view of Badoo data platform engineers (and more)


Vladimir Kazanov, Lead Data Platform Engineer, Badoo:

"Exasol is both a columnar and a distributed database. This lets us keep 400 TB of data in an Exasol cluster and still run really efficient queries on it. I'm talking about minutes here, where comparable systems don't come even close to that speed. For example, in Hive, my recent query on the same data for the same 3 months took at least two hours. Exasol has amazing speed. The analysts love the fact that in Exasol they can simply join tables with events and query them. This is great because you don't have to teach anyone a new language, unlike other systems we also use, such as Yandex ClickHouse, Hive or Presto. All in all, Exasol is really convenient."

Artem Ivanov, Director of Business Intelligence, Badoo:

"Exasol also helps us maintain our A/B testing framework. Right now we have over 100 tests running in parallel on user behavior, pricing, payments, localization and more. Some of them are successful, some are not, and to understand what worked, we put all the calculations into Exasol. With our user volumes, you can imagine the resources it takes to calculate the overlap between hundreds of different tests by dimension and by country, and to be sure the conclusions are correct. With Exasol, we can do this kind of analytics really quickly."

PyExasol: Open-source Python driver for Exasol, developed at Badoo

Badoo also invests in open-source solutions for Exasol. The company has developed its own Python driver for Exasol, PyExasol. PyExasol is based on WebSockets, integrates easily with pandas (Python Data Analysis Library) by transferring data over HTTP, and removes the limitation of a single CPU core.

Repository on GitHub: https://github.com/badoo/pyexasol
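To give a feel for the pandas integration mentioned above, here is a minimal sketch of pulling a query result into a DataFrame with PyExasol. It relies on `pyexasol.connect` and `ExaConnection.export_to_pandas` from the driver; the helper name, host, credentials and table name are hypothetical placeholders, and the snippet assumes the driver was installed with pandas support.

```python
def export_query_to_pandas(dsn, user, password, query):
    """Run a query in Exasol and return the result as a pandas DataFrame."""
    # Imported lazily so this module still loads where PyExasol is not installed.
    import pyexasol  # third-party: pip install "pyexasol[pandas]"

    # compression=True asks the driver to compress data sent over the network.
    conn = pyexasol.connect(dsn=dsn, user=user, password=password, compression=True)
    try:
        # The result set is streamed from Exasol and parsed straight into pandas.
        return conn.export_to_pandas(query)
    finally:
        conn.close()


# Example call (all values are placeholders, not a real cluster):
#
#   df = export_query_to_pandas(
#       dsn="exasol-host:8563",
#       user="analyst",
#       password="secret",
#       query="SELECT country, COUNT(*) AS events FROM events GROUP BY country",
#   )
#   print(df.head())
```

Wrapping the connection in `try`/`finally` keeps the sketch safe to reuse in notebooks, where a failed query would otherwise leave the session open.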

P.S. You can test Exasol in several ways:

a. The free version, Exasol Community Edition. This is a full-featured version of the analytical database, limited to 200 GB of raw data. It is well suited to testing and can be upgraded to a paid version of Exasol as your data grows.

b. A 30-day trial version of Exasol in the AWS cloud (with $200 per account for testing).

c. A demo version of Exasol in the company's own cloud, ExaCloud.

If you have additional questions about Exasol or would like to conduct a pilot project, please contact us. ATK Consulting Group is an official Exasol partner in Russia, with a staff of certified Exasol deployment and support specialists.

Please contact: consult@atkcg.ru or +7 (495) 937 16 50


Vitaly Markov, Data Platform Engineer, Badoo:

"Overall, Exasol has been a real eye-opener for us. You just upload the data and you can analyze it immediately at high speed. It just works. There is no fiddling with indexes, views, or any manual optimizations. You can join data from completely different sources, not just the ones that were designed for this purpose beforehand. Any kind of join is at your service.
In fact, you are limited only by your ability to formulate your task as an SQL query. If SQL is not enough for some special cases, Exasol lets you create custom functions in Python, Lua, Java or R. All the advantages of columnar storage, general parallelism of all operations and efficient memory usage are preserved."
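As an illustration of the custom functions mentioned in the quote, below is a sketch of what the body of a scalar Python function for Exasol might look like. Exasol calls a function named `run` with a context object whose attributes are the input columns; the schema name, script name, column name and the email-domain logic are all hypothetical, and the exact DDL keywords depend on the Exasol version.

```python
from types import SimpleNamespace

# Hypothetical Exasol DDL wrapper, shown as a comment only:
#
#   CREATE OR REPLACE PYTHON3 SCALAR SCRIPT analytics.email_domain(email VARCHAR(2000))
#   RETURNS VARCHAR(2000) AS
#   <the run function below>
#   /


def run(ctx):
    """Called once per input row; returns the lower-cased domain of an email."""
    if ctx.email is None or "@" not in ctx.email:
        return None
    return ctx.email.rsplit("@", 1)[1].lower()


# Local stand-in for the context object, for trying the logic outside the database.
print(run(SimpleNamespace(email="User@Example.COM")))  # example.com
```

Once such a script is created, it would be called from SQL like any built-in function, for example `SELECT analytics.email_domain(email) FROM users` (again with hypothetical names).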
