It starts with a stream...
Why do we write down important things? Is it because we want future generations to know about our lunch appointments or Strava records? Or is it because we want to remember what we need to do, or if we are improving our sports performance? Organisations have much the same questions, but added to this is the need for other people and systems to access these records. And thus, databases are a thing, and now you wonder if you need a data lake.
Data Lakes are central storage systems that can hold massive amounts of related, unrelated and unstructured data - everything from nice tweets and phone recordings to IoT sensor readings. They differ from Data Warehouses, which are databases for storing and analysing large amounts of relational data, typically from business applications. Because they hold relational data, Data Warehouses enforce stricter rules (a schema) about what data can be stored and how.
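The contrast can be sketched in a few lines of Python. This is an illustrative toy, not a real warehouse or lake: SQLite stands in for the warehouse (schema enforced up front) and a plain directory stands in for the lake bucket (anything goes); the table, file names and contents are all invented for the example.

```python
import sqlite3
import json
import os
import tempfile

# Warehouse-style storage: a schema is declared up front and enforced.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", (1, 9.99))  # fits the schema

try:
    # A tweet does not fit a two-column sales table; the schema rejects it.
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (2, 4.50, "nice tweet"))
except sqlite3.Error as err:
    print("warehouse rejected it:", err)

# Lake-style storage: just a bucket; any object is dropped in as-is.
lake = tempfile.mkdtemp()
with open(os.path.join(lake, "tweet-001.json"), "w") as f:
    json.dump({"text": "nice tweet", "likes": 42}, f)
with open(os.path.join(lake, "sensor-readings.csv"), "w") as f:
    f.write("timestamp,temp_c\n2024-01-01T00:00:00,21.5\n")

print("lake contents:", sorted(os.listdir(lake)))
```

The warehouse refuses data that does not match its declared columns, while the lake directory happily accepts a JSON tweet and a CSV of sensor readings side by side.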
In the beginning, there was big data. People and companies were starting to realise or rediscover the value of connecting ‘unrelated’ data sets to try and draw out new insights. Much of this was driven by big tech companies that were amassing vast quantities of user data and wanted to leverage it to uncover more insights into customer behaviour and gain a competitive advantage.
As enabling factors came into play - significant reductions in storage costs, widespread availability of broadband and the relative ease of hosting it all on cloud services - the amount of data being generated and retained trended ever upwards.
It quickly became apparent that storing all of this data in traditional relational databases was not ideal. There were no apparent ‘relationships’ to use to categorise and neatly store this data. The idea in Big Data is to try and discover these relationships. Thus, a ‘database for unrelated data’ was needed, which quickly became known as a ‘Data Lake’, a much easier concept to communicate and sell.
At its core, a data lake is simply a bucket into which data can be placed. This data could be documents, spreadsheets, photos, video, and even sensor readings and tweets. In this respect, it can be thought of as a directory into which many files are added. Those of us who like to keep things organised will grimace at this thought, and this is a key way in which data lakes differ from databases and data warehouses. A traditional database has structure; for example, all the cat photos are stored in one location with tags for breed and cuteness. All the employee records are kept separately, and so on. This requires up-front planning, with forethought about all the types of data that will be housed.
Data Lakes do away with this particular planning and just keep everything together. However, in order for this to be useful, the data needs to be catalogued. This process adds accompanying metadata to all the content in the data lake, which means that relevant data can later be retrieved by a person or program based on search criteria. In this manner, the structure of the data lake is imposed by the searcher when they make their search, not by the administrator of the data lake. This process of cataloguing can be accomplished in different ways, such as during Extract, Transform and Load (ETL); links explaining this are provided under ‘further reading’ below.
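The cataloguing idea can be sketched as follows. This is a minimal toy, assuming an in-memory dict in place of a real metadata store (a production lake would use something like a Hive metastore or a cloud provider's catalogue service); the object keys and metadata fields are invented for illustration.

```python
# A toy data-lake catalogue: object key -> metadata added at ingestion time.
catalog = {}

def ingest(key, metadata):
    """Register an object in the lake along with its descriptive metadata."""
    catalog[key] = metadata

def search(**criteria):
    """Return keys of objects whose metadata matches every criterion.

    The searcher, not the administrator, decides what structure matters -
    this is the 'schema on read' idea in miniature.
    """
    return [key for key, meta in catalog.items()
            if all(meta.get(k) == v for k, v in criteria.items())]

# Heterogeneous objects, one bucket, no shared schema between them.
ingest("s3://lake/photos/cat-001.jpg",
       {"type": "photo", "subject": "cat", "breed": "tabby"})
ingest("s3://lake/tweets/2024-05-01.json",
       {"type": "tweet", "sentiment": "nice"})
ingest("s3://lake/iot/temp-sensor-17.csv",
       {"type": "sensor", "unit": "celsius"})

print(search(type="photo", subject="cat"))
# → ['s3://lake/photos/cat-001.jpg']
```

Nothing about the cat photo and the sensor readings needed to be reconciled when they were stored; structure only appears at query time, when someone asks a specific question of the metadata.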
If you really do need some form of Data Lake, you probably already know it. You will be aware of the scale of the different data sets you are working with and be familiar with the challenges that prevent you from extracting full value from the data you have.
If you are reading this out of curiosity, then a Data Lake may not be the best route to value for you right now. In the same way that a tunnel boring machine is not the best way to dig a trench to lay a cable to the garage, Data Lakes are a specialist solution to a complex problem.
There are other solutions that facilitate the analysis of structured and unstructured data. These can range from DIY tools such as Power BI to turnkey solutions that can ingest, process and store large quantities of data for later analysis, both in the solution and in external software via APIs. If you go down this route, you should favour solutions with open standards that don’t lock you into a particular ecosystem. It’s your data, no matter where you decide to store it.
These platform-based solutions are often better for organisations (typically those with fewer than 10,000 people) whose analytics needs range from simple to advanced and that do not yet need to pursue multi-million dollar big data projects.
If you do have the necessary experience, whether in house or through consultants you trust, the major cloud providers all offer robust Data Lake solutions.