Data lakes differ from data warehouses in how data is stored: a data lake keeps data in its raw form, without requiring any transformation up front, whereas a warehouse requires data to be cleaned and structured before it can enter its environment. Moreover, data lakes can easily collect and store data of any type.
The table below outlines some of the key differences between data lakes and data warehouses. This article will explain in more detail some of the important features that allow data lakes to handle very large volumes of data with high velocity and variety. However, because of their size and looser organization, managing the veracity of data in a data lake can be more challenging than in a traditional data warehouse.
Aside from the elements in the above table, there is another fundamental differentiator between a data lake and a data warehouse, rooted in schema management. While a warehouse operates on a schema-on-write model, through approaches like standard ETL (Extract, Transform, Load) techniques, a lake operates on a schema-on-read model, notably in the Hadoop framework. It is this subtle but important distinction that gives data scientists and analysts the opportunity to explore patterns in the raw data with a data lake architecture.
In a data warehouse, if the data going into a database doesn’t fit the format of the schema built for a particular table, the data is simply rejected. For example, in SQL it is not possible to add data to a table without first creating that table, and the table cannot be created unless its schema is defined before the dataset is actually loaded. The implication is that if the data changes, by adding fields or modifying a data type (e.g. from integer to text), the table has to be dropped and reloaded for the new data to fit in the database.
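As a minimal sketch of that constraint in standard SQL, using a hypothetical customers table:

    -- The schema must exist before any data can be loaded.
    CREATE TABLE customers (
        customer_id INTEGER,
        signup_date DATE,
        region      VARCHAR(50)
    );

    -- A row that matches the schema is accepted.
    INSERT INTO customers VALUES (1001, '2021-03-15', 'EMEA');

    -- A row that no longer fits the schema (an extra field, and text where
    -- an integer is expected) is rejected by the database.
    INSERT INTO customers VALUES ('premium', '2021-03-16', 'APAC', 42);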
This classic schema-on-write approach is manageable for relatively small- to medium-sized datasets, or when a change doesn’t involve foreign keys affecting a large number of other tables. However, when foreign keys and datasets of several terabytes are involved, dropping and reloading tables can take days rather than minutes, while generating significant computation costs in the process. Fortunately, modern cloud-based solutions like BigQuery and Amazon Redshift have radically improved the computational power of data warehousing. Nonetheless, for use cases requiring ACID (atomicity, consistency, isolation, durability) transactions with response times in the milliseconds, even BigQuery’s powerful query engine is not enough. Furthermore, BigQuery still requires a schema-on-write treatment of your tables.
With the schema-on-read approach, there are no prerequisites to loading files of any type into a data lake. In fact, in the Hadoop framework, which today is mostly managed in serverless cloud object stores like Google Cloud Storage or S3, the process starts with loading a dataset without making any alterations. For example, a single HDFS (Hadoop Distributed File System) command can load data into the equivalent of a standard table in a Hadoop environment.
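A representative version of such a command, with a placeholder destination directory, might look like this:

    # Copy every local text file beginning with "mktgfile" from /temp into HDFS.
    # (/user/data/mktg_raw is a placeholder directory assumed to already exist.)
    hdfs dfs -put /temp/mktgfile*.txt /user/data/mktg_raw/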
The above command line pulls all text files that begin with mktgfile within the temp folder into HDFS. This initiates an entire mapping and distribution process in the background to optimize storage according to the Hadoop framework. From there, no schema is required to go ahead and directly query this data. Here is what that query might look like in Python.
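The sketch below takes the form of a Hadoop Streaming mapper; the comma-delimited layout and the campaign id field are assumptions made purely for illustration.

    #!/usr/bin/env python3
    # mapper.py -- a schema-on-read query over the raw mktgfile records,
    # meant to be submitted through Hadoop Streaming with a summing reducer.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if not fields or not fields[0]:
            continue
        # The mapper alone decides what the data looks like: take the first
        # field as a campaign id and emit a count of one per record. Extra or
        # missing trailing fields are simply ignored.
        print(f"{fields[0]}\t1")

A companion reducer would then sum the counts per key; note that no table definition or schema is declared anywhere in the process.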
To manage Hadoop at scale, YARN (Yet Another Resource Negotiator) was created as a cluster resource management layer that performs resource allocation and job scheduling. YARN was introduced in Hadoop 2.0 and serves as the middle layer between HDFS and a processing framework called MapReduce, which we will describe in a moment.
In essence, what is important to remember is that in this NoSQL universe, the data structure is only interpreted at the moment it is read, which is the core idea behind schema-on-read. Therefore, if a particular analytics file doesn’t respect the structure of a predefined schema, for instance because a field was added or removed, the mapper function in the above query adjusts for it on the fly. This is because the data schema in Hadoop is whatever the mapper decides it is.
This mapper function works through the MapReduce paradigm, described as “a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.” It essentially involves mapping a given function over different inputs, namely a dataset or file, then reducing the different outputs to a single one. Mathematically, a MapReduce operation can be expressed in the notation of a matrix multiplication:
C = AB = A ⊕.⊗ B
In the above notation, ⊗ roughly corresponds to the map function applied to each pair of inputs, and ⊕ to the reduce function applied to the intermediate outputs. When computations are performed on database analysis systems, a table reader function is generated to carry out the mapping operations, followed by a table writer function that reduces the operation to a single output. This two-step technique is at the heart of all the processing magic of a data lake system built on this paradigm. The upshot is the broad flexibility it provides to a file storage system, essentially freeing it from any real need for a schema, unlike its traditional relational database counterpart.
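As a toy illustration of those two roles, here is a purely local Python sketch (no cluster involved), where the records and field layout are invented for the example: the function mapped over the inputs plays the part of ⊗, and the function that collapses the intermediate results into a single output plays the part of ⊕.

    from functools import reduce

    # A toy stand-in for a raw, comma-delimited file of regional spend records.
    records = ["EMEA,120", "APAC,75", "EMEA,30"]

    # "Map" step (the role of ⊗): apply a function to every input record.
    mapped = [int(rec.split(",")[1]) for rec in records]

    # "Reduce" step (the role of ⊕): combine the intermediate outputs into one.
    total = reduce(lambda a, b: a + b, mapped)

    print(total)  # 225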
However, due to the difficulty of programming and maintaining such parallel clusters of nodes, many IT teams have moved away from using MapReduce directly for big data processing. For most data engineers and DevOps teams, the new champion in town is Apache Spark, described as “a unified analytics engine for large-scale data processing.” Spark can run as a standalone application or on top of Hadoop YARN, where it can read data directly from HDFS. This open source solution now has one of the largest developer communities in the world contributing to the project. Most big data cloud services offer Spark as a managed solution, like Databricks, Google Cloud’s Dataproc or Amazon’s EMR Step API.
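For a sense of how this looks in practice, here is a minimal PySpark sketch of the same kind of schema-on-read query as above; the application name and HDFS path are placeholders, and it assumes a Spark installation that can reach the cluster (for example through YARN).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mktg-exploration").getOrCreate()

    # Read the raw files straight from HDFS; no schema is declared up front.
    lines = spark.read.text("hdfs:///user/data/mktg_raw/")

    # The structure is interpreted only at read time: split each line and
    # count records per first field (treated here as a campaign id).
    counts = (lines.rdd
              .map(lambda row: row.value.split(",")[0])
              .map(lambda key: (key, 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()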
When dealing with a big data environment, the flexibility of these frameworks in processing large and often unstructured or polymorphic files is what makes the data lake the more suitable solution. By allowing data to be stored without schema constraints, the same dataset can accommodate a wide range of analyses and use cases, which is precisely what data exploration requires when searching for new patterns in analytics.
Moreover, data lakes and data warehouses are not mutually exclusive; they are largely complementary. An ETL process can be architected in the post-staging phase of the data lake in order to feed a data warehouse, which remains important for controlled reporting and analytics, notably for descriptive or diagnostic dashboarding. As such, the two environments are worth developing jointly in today’s data management and business intelligence strategy. So don’t plan on choosing one over the other. Instead, plan on connecting the two to yield the most value for your business. Of course, you should prioritize which project to start with based on business needs and IT capabilities. In the meantime, if you’d like to learn more about what a data lake is in more detail, please read our deep dive article on the subject.