
Distributed SQL Engines

Databases and relational database management systems (RDBMSs) allow companies to leverage their data and efficiently create, read, update and delete (CRUD) it. A study by McKinsey in 2022 showed that among financial-services leaders, only 13 percent had half or more of their IT footprint in the cloud. As companies use more and more data, the processes that allow that data to be used efficiently will no doubt need to be optimised. This optimisation may come in the form of distributed SQL engines, which handle the processing and retrieval of data at scale.

Distributed SQL engines were derived from the concept of parallelisation, a method of high-performance computing whereby a program or system breaks a problem down into smaller pieces to be solved independently and simultaneously by discrete computing resources. Distributed SQL engines increase compute power by linking multiple database servers under the hood of one RDBMS. This allows companies to prioritise the scalability, reliability and usability of the orchestrating ecosystem while maintaining the robust, ACID-compliant transactions of a traditional RDBMS. Under this hood are a) virtualisation and b) an abstraction layer. The abstraction layer allows users and developers to interact with virtualised resources without needing to understand the intricacies of the underlying hardware; crucially, in this context, it gives data scientists and analysts access across disparate data sources. This means you can query relational and non-relational data together in a scale-out fashion for better query performance. As such, "distributed" doesn't just refer to the query itself but also to storage and compute. Companies wanting to carry out analytics on terabytes of data will opt for technologies built on distributed query engines to optimise performance. These engines are primarily used for intensive OLAP queries and avoid the fragility and inconsistent performance of non-distributed query engines.
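The parallelisation idea described above can be illustrated with a minimal single-machine sketch: a large summation is split into chunks, each chunk is handed to a separate worker process, and the partial results are combined at the end. This is a toy example, not how a distributed SQL engine is implemented, but the decompose-solve-combine shape is the same one those engines apply to query plans across servers.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker solves its piece of the problem independently.
    return sum(chunk)

def parallel_sum(values, workers=4):
    """Break the input into chunks, sum each in parallel, then combine."""
    if not values:
        return 0
    # Split the problem into roughly equal pieces, one per worker.
    chunk_size = (len(values) + workers - 1) // workers
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # Combine the independently computed partial results.
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # 499999500000, same as sum(range(1_000_000))
```

In a distributed SQL engine the "workers" are separate servers rather than local processes, and the combine step may itself be distributed, but the principle of independent, simultaneous sub-computations is identical.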
Early, well-known technologies such as Hadoop use parallel processing engines to query and analyse data stored on the Hadoop Distributed File System (HDFS). Many subsequent distributed query engines are based on Hadoop and are used for batch-style data processing. Each distributed query engine varies: some break SQL queries into multiple stages and store intermediate results on disk, while others take advantage of in-memory processing and caching. Viewed holistically, however, these technologies are based on MapReduce, a framework for processing parallelisable problems across large datasets using a large number of computers. Collectively, these computers are referred to as a cluster (so long as all computers/nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured), such as HDFS, or in a database (structured). MapReduce can take advantage of data locality, processing data near where it is stored in order to minimise communication overhead.

Some of the most well-known technology companies leverage significant compute power to provide products and services, taking advantage of the cloud and distributed architectures, including the separation of compute and storage. For example, companies such as Netflix are known for having microservices that use different kinds of databases based on each database's capabilities. Some of these microservices rely on data sources such as Hadoop, AWS S3, or data spread across multiple stores. Crucially, a distributed SQL query engine allows data to be queried from a variety of these sources within a single query. Example query engines include Presto, Apache Drill and Apache Spark. Companies such as Netflix and Uber use these to drive analysis across disparate datasets.
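The MapReduce pattern mentioned above can be sketched in a few lines of plain Python using the canonical word-count example. This is a deliberately simplified, single-machine illustration: a real engine runs the map tasks on the nodes holding each data split, shuffles the intermediate pairs across the network, and runs the reduce tasks on separate nodes, but the three phases are the same.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group the intermediate values by key, as the framework
    # would do across the network between mappers and reducers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values into a final result.
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))

counts = word_count(["to be or not to be", "to query or not"])
# counts == {"to": 3, "be": 2, "or": 2, "not": 2, "query": 1}
```

Because the map calls share no state, they can run on as many machines as there are input splits; this independence is what lets Hadoop-style engines scale out simply by adding nodes.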
The number of companies creating new and innovative services and products to suit business needs continues to increase, and this will no doubt make it easier for companies of any size, in any industry, to leverage their data.