It starts with a stream...
Why do we write down important things? Is it because we want future generations to know about our lunch appointments or Strava records? Or is it because we want to remember what we need to do, or if we are improving our sports performance? Organisations have much the same questions, but added to this is the need for other people and systems to access these records. And thus, databases are a thing, and now you wonder if you need a data lake.
Data Lakes are central storage systems that can hold massive amounts of related, unrelated and unstructured data - everything from nice tweets and phone recordings to IoT sensor readings. They differ from Data Warehouses, which are databases for storing and analysing large amounts of relational data, typically from business applications. Because they hold relational data, Data Warehouses enforce stricter rules (a schema) about what data can be stored and how.
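The contrast can be sketched in a few lines of Python. This is an illustrative toy, not a real warehouse or lake: SQLite stands in for the warehouse (schema enforced up front) and a plain directory stands in for the lake bucket (anything goes); the table, file names and contents are all invented for the example.

```python
import sqlite3
import json
import os
import tempfile

# Warehouse-style storage: a schema is declared up front and enforced.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", (1, 9.99))  # fits the schema

try:
    # A tweet does not fit a two-column sales table; the schema rejects it.
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (2, 4.50, "nice tweet"))
except sqlite3.Error as err:
    print("warehouse rejected it:", err)

# Lake-style storage: just a bucket; any object is dropped in as-is.
lake = tempfile.mkdtemp()
with open(os.path.join(lake, "tweet-001.json"), "w") as f:
    json.dump({"text": "nice tweet", "likes": 42}, f)
with open(os.path.join(lake, "sensor-readings.csv"), "w") as f:
    f.write("timestamp,temp_c\n2024-01-01T00:00:00,21.5\n")

print("lake contents:", sorted(os.listdir(lake)))
```

The warehouse refuses data that does not match its declared columns, while the lake directory happily accepts a JSON tweet and a CSV of sensor readings side by side.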
In the beginning, there was big data. People and companies were starting to realise or rediscover the value of connecting ‘unrelated’ data sets to try and draw out new insights. Much of this was driven by big tech companies that were amassing vast quantities of user data and wanted to leverage it to uncover more insights into customer behaviour and gain a competitive advantage.
As enabling factors came into play - significant reductions in storage costs, widespread availability of broadband and the relative ease of hosting it all on cloud services - the amount of data being generated and retained trended ever upwards.
It quickly became apparent that storing all of this data in traditional relational databases was not ideal. There were no apparent ‘relationships’ to use to categorise and neatly store this data. The idea in Big Data is to try and discover these relationships. Thus, a ‘database for unrelated data’ was needed, which quickly became known as a ‘Data Lake’, a much easier concept to communicate and sell.
At its core, a data lake is simply a bucket into which data can be placed. This data could be documents, spreadsheets, photos, video, and even sensor readings and tweets. In this respect, it can be thought of as a directory into which many files are added. Those of us who like to keep things organised will grimace at this thought, and this is a key way in which data lakes differ from databases and data warehouses. A traditional database has structure; for example, all the cat photos are stored in one location with tags for breed and cuteness. All the employee records are kept separately, and so on. This requires up-front planning, with forethought about all the types of data that will be housed.
Data Lakes do away with this particular planning and just keep everything together. However, in order for this to be useful, the data needs to be catalogued. This process adds accompanying metadata to all the content in the data lake, which means that relevant data can later be retrieved by a person or program based on search criteria. In this manner, the structure of the data lake is imposed by the searcher when they make their search, not by the administrator of the data lake. This process of cataloguing can be accomplished in different ways, such as during Extract, Transform and Load (ETL); links explaining this are provided under ‘further reading’ below.
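The cataloguing idea can be sketched as follows. This is a minimal toy, assuming an in-memory dict in place of a real metadata store (a production lake would use something like a Hive metastore or a cloud provider's catalogue service); the object keys and metadata fields are invented for illustration.

```python
# A toy data-lake catalogue: object key -> metadata added at ingestion time.
catalog = {}

def ingest(key, metadata):
    """Register an object in the lake along with its descriptive metadata."""
    catalog[key] = metadata

def search(**criteria):
    """Return keys of objects whose metadata matches every criterion.

    The searcher, not the administrator, decides what structure matters -
    this is the 'schema on read' idea in miniature.
    """
    return [key for key, meta in catalog.items()
            if all(meta.get(k) == v for k, v in criteria.items())]

# Heterogeneous objects, one bucket, no shared schema between them.
ingest("s3://lake/photos/cat-001.jpg",
       {"type": "photo", "subject": "cat", "breed": "tabby"})
ingest("s3://lake/tweets/2024-05-01.json",
       {"type": "tweet", "sentiment": "nice"})
ingest("s3://lake/iot/temp-sensor-17.csv",
       {"type": "sensor", "unit": "celsius"})

print(search(type="photo", subject="cat"))
# → ['s3://lake/photos/cat-001.jpg']
```

Nothing about the cat photo and the sensor readings needed to be reconciled when they were stored; structure only appears at query time, when someone asks a specific question of the metadata.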
If you really do need some form of Data Lake, you probably already know it. You will be aware of the scale of the different data sets you are working with and be familiar with the challenges that prevent you from extracting full value from the data you have.
If you are reading this out of curiosity, then a Data Lake may not be the best route to value for you right now. In the same way that a tunnel boring machine is not the best way to dig a trench to lay a cable to the garage, Data Lakes are a specialist solution to a complex problem.
There are other solutions that facilitate the analysis of structured and unstructured data. These can range from DIY tools such as Power BI to turnkey solutions that can ingest, process and store large quantities of data for later analysis, both in the solution and in external software via APIs. If you go down this route, you should favour solutions with open standards that don’t lock you into a particular ecosystem. It’s your data, no matter where you decide to store it.
These platform-based solutions are often better for organisations (typically those with fewer than 10,000 people) whose analytics needs range from simple to advanced and that do not yet need to pursue multi-million dollar big data projects.
If you do have the necessary experience, whether in house or through consultants you trust, the major cloud providers all offer robust Data Lake solutions.