A Cloud Data Lake MVP

Companies are looking to the cloud to create, or expand, their centralized data repository: a data lake. Low storage costs have attracted big companies and enabled medium and small ones to organize their mission-critical data. Building a data lake is just the first step toward data-driven decision making and toward using artificial intelligence to give companies a competitive edge.

The goal here is to describe what an architecture for a minimum viable data lake would look like on three clouds: AWS, Google Cloud, and Azure. All three have BI, analytics, data science, and machine learning tools that can be used once the data is in the cloud.

A minimum viable product (MVP) is a product built with minimum resources that is capable of testing, and hopefully proving, its business purpose. For a data lake in the cloud, an MVP would try to minimize cloud costs.

Getting Data into the Cloud

The first step is to get the data in. All three clouds have ETL-like tools that help load data into the cloud and that are charged by use.

As for storage, all three again have low-cost (pennies per gigabyte per month), high-availability storage services. They can be used not only as a landing area for the data coming from your systems but also to store the results of processing that data.
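To make "pennies per gigabyte per month" concrete, here is a minimal sketch of a storage cost estimate. The price and the data volumes are placeholder assumptions, not quotes from any provider; check the current price list of the cloud you choose.

```python
def monthly_storage_cost(gigabytes: float, price_per_gb_month: float) -> float:
    """Return the estimated monthly storage bill in dollars."""
    if gigabytes < 0 or price_per_gb_month < 0:
        raise ValueError("inputs must be non-negative")
    return gigabytes * price_per_gb_month


# Assumed placeholder price in dollars per GB per month (not a real quote).
ASSUMED_PRICE = 0.02

# Hypothetical volumes: raw data in the landing area plus processed results.
landing = monthly_storage_cost(500, ASSUMED_PRICE)
processed = monthly_storage_cost(200, ASSUMED_PRICE)

print(f"landing: ${landing:.2f}, processed: ${processed:.2f}, "
      f"total: ${landing + processed:.2f}")
# prints: landing: $10.00, processed: $4.00, total: $14.00
```

Even with both the raw and the processed copies kept, storage is unlikely to be the dominant line in an MVP budget at these volumes.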

Processing Data

All three clouds have comparable pay-as-you-go services for batch, high-volume processing.

The first differences appear when the data needs to be accessed more interactively, as in dashboards, reports, or more generic data exploration use cases.

AWS and Google each have a pay-per-use SQL engine capable of interactively querying the data in their respective storage services. At the time of writing, Google's engine has one limitation: it does not support queries on files stored as Parquet or ORC. This is a downside, because these columnar formats let an engine read less data per query, so doing without them hurts performance and increases cost.
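The cost impact of columnar formats can be sketched with a little arithmetic. The example below assumes an engine that bills by the amount of data scanned, a table whose columns are all the same size, and no compression; the price per terabyte is a placeholder assumption. Real numbers will differ, but the proportion is the point.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary)


def estimate_bytes_scanned(table_bytes: float, total_columns: int,
                           columns_read: int, columnar: bool) -> float:
    """Estimate the bytes a query has to read.

    Simplifying assumption: every column has the same size. A row-oriented
    file must be read in full; a columnar file (such as Parquet or ORC)
    only needs the columns the query references.
    """
    if not 0 < columns_read <= total_columns:
        raise ValueError("columns_read must be between 1 and total_columns")
    if columnar:
        return table_bytes * columns_read / total_columns
    return table_bytes


def scan_cost(bytes_scanned: float, price_per_tb: float) -> float:
    """Cost in dollars for an engine that bills per terabyte scanned."""
    return bytes_scanned / TB * price_per_tb


ASSUMED_PRICE_PER_TB = 5.0  # placeholder, not a real quote

# Hypothetical query: 2 of 20 columns from a 1 TB table.
row_cost = scan_cost(estimate_bytes_scanned(TB, 20, 2, columnar=False),
                     ASSUMED_PRICE_PER_TB)
col_cost = scan_cost(estimate_bytes_scanned(TB, 20, 2, columnar=True),
                     ASSUMED_PRICE_PER_TB)

print(f"row-oriented: ${row_cost:.2f}, columnar: ${col_cost:.2f}")
# prints: row-oriented: $5.00, columnar: $0.50
```

Under these assumptions, the same query costs ten times less on a columnar file, and it reads ten times less data, which is also where the performance gain comes from.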

At the time of writing, Azure does not have a pay-as-you-go SQL engine. The service that queries its storage has to be up and running, and therefore billing, all the time.
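How much an always-on service matters for an MVP depends on query volume. A rough break-even estimate, using placeholder prices and a hypothetical workload (none of these figures come from a provider), could look like this:

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_monthly_cost(hourly_rate: float) -> float:
    """Monthly cost of a service that is billed for every hour it is up."""
    return hourly_rate * HOURS_PER_MONTH


def cost_per_query(tb_scanned_per_query: float, price_per_tb: float) -> float:
    """Cost of one query on an engine that bills per terabyte scanned."""
    return tb_scanned_per_query * price_per_tb


def break_even_queries(hourly_rate: float, tb_scanned_per_query: float,
                       price_per_tb: float) -> float:
    """Queries per month at which both pricing models cost the same."""
    per_query = cost_per_query(tb_scanned_per_query, price_per_tb)
    if per_query <= 0:
        raise ValueError("cost per query must be positive")
    return always_on_monthly_cost(hourly_rate) / per_query


# Assumed placeholder figures, not real quotes.
ASSUMED_HOURLY_RATE = 1.50   # dollars per hour for the always-on service
ASSUMED_PRICE_PER_TB = 5.0   # dollars per TB scanned, pay-per-query engine
ASSUMED_TB_PER_QUERY = 0.1   # average data scanned by one query

queries = break_even_queries(ASSUMED_HOURLY_RATE, ASSUMED_TB_PER_QUERY,
                             ASSUMED_PRICE_PER_TB)
print(f"always-on: ${always_on_monthly_cost(ASSUMED_HOURLY_RATE):.2f}/month, "
      f"break-even at {queries:.0f} queries/month")
# prints: always-on: $1095.00/month, break-even at 2190 queries/month
```

Below the break-even volume, which is where most MVPs start, paying per query is the cheaper model; above it, the always-on service wins.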

Data Lake MVP Conclusion

In spite of the small differences pointed out here, all three clouds are suited to host your minimum viable data lake. With a budget proportional to the amount of data and the processing effort intended, you can build a cloud data lake and prove its purpose and value for your company.

You will also be building on a solid foundation, as all three data lake solutions leverage Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data.

There are other, equally important aspects that were not discussed here, such as data security, incremental data changes, near-real-time data, and event-based data processing.

Finally, here is what it looks like with all the pieces put together on all three clouds.
