Welcome! This is the first of many blog posts that I plan to write, one for each AWS service or solution that I have recently built a demo or prototype with. For this post I’ll follow a format where I detail the problem that needed to be solved, compare the solution's features and capabilities against it to determine whether the proposed solution is fit for purpose, go through the implementation steps and finally discuss possible extensions I can plug in. Happy reading!
The problem statement! What if you had multiple source systems generating large volumes of logging data, and other source systems generating data in formats ranging from JSON to unstructured text? Some of this data was then summarized and loaded into relational data marts, and some of it was stored in Elasticsearch and visualized through Kibana dashboards. Some of the data resided inside transactional relational databases. Multiple consumers wanting to perform data analysis had to pull down the subsets of data they wanted from each source, consolidate them and then perform their analysis using their analysis tools of choice. All of this was mostly ad hoc and time-consuming, and my challenge was to build a data consolidation solution that addressed this pain point. I needed to have something I could demonstrate to “data consumers” very quickly.
The light bulb moment! Reading through this problem statement, it occurred to me that the Serverless Data Lake solution from AWS was a very good fit for this particular use case.
This solution has been built using some key features of the AWS platform that make it a truly differentiated solution. Here are some of those features …
Decouples compute from storage when storing vast amounts of enterprise data.
If you think about it, in the traditional data warehouse model you do not have this freedom. The acquisition and operating costs of storing what is effectively data at rest, and of running on-demand, compute-intensive analysis tasks on a subset of that data, are quite significant. Not only that, more often than not the data was transformed or even summarized prior to ingestion to suit the use cases required at a given point in time. This meant the ability to gain other insights from the raw data was permanently lost, because the costs associated with retaining large volumes of raw data were prohibitive. The moment you decouple storage from compute, retaining the raw data becomes feasible, as you can use a low-cost, highly durable object storage platform like S3.
Has the capability to be the single datastore for structured and unstructured data.
Uses S3 encryption capability to implement compliance requirements.
Ability to use widely available compression formats for data storage to optimize storage as well as query costs, e.g. when querying from Redshift Spectrum or Athena.
Uses S3 lifecycle functionality to implement data retention policies.
Ability to check out data subsets on demand for analysis based on organizational data entitlement policies.
Once checked out, gives data consumers the freedom to use the most optimal and appropriate tool for a given computational task.
All of this really made me think this was a good fit for my use case.
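To make the retention feature above a little more concrete, here is a minimal sketch of what an S3 lifecycle rule could look like when built from Python. The prefix, the day counts and both function names are my own illustrative assumptions and are not part of the Data Lake solution. The rule moves objects under a prefix to Glacier after a while and expires them later.

```python
import json


def build_retention_rules(prefix, archive_after_days, expire_after_days):
    """Build an S3 lifecycle configuration for one key prefix.

    Hypothetical helper: objects under `prefix` move to Glacier after
    `archive_after_days` and are deleted after `expire_after_days`.
    """
    if archive_after_days >= expire_after_days:
        raise ValueError("objects must be archived before they expire")
    # Derive a readable rule ID from the prefix, e.g. "raw/logs/" -> "retain-raw-logs".
    rule_id = "retain-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_retention_rules(bucket, rules):
    """Apply the rules to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=rules
    )


if __name__ == "__main__":
    # Assumed example values: archive raw logs after 90 days, delete after a year.
    print(json.dumps(build_retention_rules("raw/logs/", 90, 365), indent=2))
```

Keeping the rule builder separate from the call that talks to AWS means the policy can be reviewed (or unit tested) before it is applied to a real bucket.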
On to it! The first step in the implementation was to type in a few self-explanatory stack input parameters …
… and about 30 minutes later “I’ve got mail”! The URL and administrator login details for the Data Lake console arrive in my inbox. Yes, you do get access to a web console, which you can then extend should you want to. There is a command line interface (CLI) too. For the demo I’m going to use the web console. Here’s a screen grab of the landing page on the web console once you get past the login dialog.
Something to note is that there are absolutely no servers to manage on this solution stack. Each building block of this solution uses AWS managed services. To name a few: Lambda, API Gateway, Cognito for data entitlement management, S3 to host the website for the console, and DynamoDB and Elasticsearch for data cataloging. You get all of this with the base build, including access to the Kibana dashboard for the catalog.
The next step is to create a data package. I’m not going to detail each step I went through in this blog post, but suffice to say within a few minutes I had loaded and cataloged the publicly available IRS demo data set. Here’s a screen grab of my example data set …
… instantly indexed and available for search. If you are with me so far, the key to successfully implementing this solution is the tagging of the data that is being ingested. The tags in blue are example global tags I set up as the Data Lake Administrator and the rest are dataset-specific tags. That is why my example below intentionally uses a timestamp and a “somevalue” in the search string.
So far I have ingested some data, searched for it and got a hit. I’m now going to check out the dataset that matched my search criteria by generating signed S3 URLs that are valid for a limited time. A point to note is that in this example I’m the Data Lake Administrator and I have not set up any entitlements, so I have the ability to search and check out any dataset for my analysis.
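The console does the checkout for me, but the underlying idea of a time-limited signed URL can be sketched in a few lines of Python. This is only an illustration of the general S3 mechanism and not the solution's own code. The function name, bucket and keys are hypothetical, and the client is assumed to be a boto3 S3 client (its generate_presigned_url method is what produces the URL).

```python
def checkout_urls(s3_client, bucket, keys, expires_in=900):
    """Return a {key: presigned GET URL} mapping for the given objects.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client,
    and `expires_in` is the URL lifetime in seconds (15 minutes by default).
    """
    if expires_in <= 0:
        raise ValueError("expires_in must be a positive number of seconds")
    return {
        key: s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,
        )
        for key in keys
    }


# Example usage (needs boto3 and AWS credentials; names are made up):
#
#   import boto3
#   urls = checkout_urls(
#       boto3.client("s3"),
#       "my-data-lake-bucket",
#       ["packages/irs-demo/part-0001.csv"],
#   )
#   for key, url in urls.items():
#       print(key, url)
```

Anyone holding one of these URLs can download that object until the URL expires, so a short lifetime is the safer default.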
This makes me think! … what I find most interesting are the opportunities for extending this baseline solution. I could use the Database Migration Service (DMS) to migrate some of the on-premises transactional data to S3. I wouldn’t need to pull across everything in one go; I could just start with the data sets that do not currently make up the working data set. I could also set up a small Redshift cluster that uses Redshift Spectrum to query this data directly from S3. Current or new BI workloads could then be driven from that Redshift cluster. Lambda and API Gateway could be used to build secure APIs to share and publish some data sets as required. And last but not least, these data sets could be used for training and evaluating machine learning models as well as generating machine learning predictions. There’s really a lot that can be done depending on the evolving requirements of the data consumers.
What I’m planning on next! … is to experiment with the bundled command line interface and the Database Migration Service. I want to see how that can be used to integrate with the source data systems to create data ingestion pipelines.