I recently wrapped up a unique project for a client and thought I'd share it with all of you.
The main reason I think it's interesting is that I helped a company that has no choice but to work with exported CSV and TXT files move from static files to cloud-based reporting.
In this post I'm going to cover the exact process I went through with the client and how you can easily implement the same process if you're in a similar position.
The challenge
So before I dive into the solution, I'll quickly run through the main challenges the client was facing.
The client's entire reporting process involved a junior analyst exporting files from a system, manipulating them in Excel and, finally, uploading them to a server so colleagues could access them.
I broke down the main pain points as follows:
- Extremely manual process = a lot of wasted time.
- Difficult to build a historical record of the data over time.
- Limited reporting capabilities = not leveraging their data to its full potential.
The solution - Amazon S3 & Stitch to the rescue
The solution to all these pain points was implementing the BI stack outlined below.
The stack includes Amazon S3 (file hosting / storage), Stitch (ETL solution), BigQuery (data warehouse), and Google Data Studio (reporting).
The junior analyst still manually exports the files each day, but from that point all they have to do is upload the relevant file to the relevant Amazon S3 bucket.
From there, Stitch picks up the CSV or TXT file and moves the data from the bucket into BigQuery.
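If you ever want to take even that manual upload out of the equation (or just script it for consistency), the upload itself is a one-liner with boto3. Here's a minimal sketch; the bucket and file names are made up and I'm assuming your AWS credentials are already configured the usual way.

```python
import boto3
from pathlib import Path

# Hypothetical names -- one bucket per report, exports named with the date.
BUCKET = "client-sales-report"
EXPORT_FILE = Path("sales_2019-06-01.csv")

# Credentials are picked up from the standard AWS config / environment.
s3 = boto3.client("s3")

# Keep the original file name as the object key so the date embedded in it
# carries through to BigQuery (see the pointers further down).
s3.upload_file(str(EXPORT_FILE), BUCKET, EXPORT_FILE.name)
print(f"Uploaded {EXPORT_FILE.name} to s3://{BUCKET}/")
```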
I purposefully created S3 buckets and Stitch connections for each report so we had complete control of each data set throughout the entire pipeline. This approach also makes it much easier to troubleshoot any data issues that may arise.
BigQuery makes it really easy to create views so it isn't an issue that each data set lives separately in the data warehouse.
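For example, here's roughly how you could join two of the separately loaded tables back together with a view, using the google-cloud-bigquery client. It's only a sketch: the project, dataset, table and column names are all hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default Google Cloud credentials are set up

# Hypothetical names -- each Stitch connection loads its report into its own
# table, and a view joins them back together for reporting.
view = bigquery.Table("my-project.reporting.sales_with_stores")
view.view_query = """
    SELECT s.order_date, s.store_id, s.revenue, st.store_name, st.region
    FROM `my-project.stitch_s3.sales_report` AS s
    JOIN `my-project.stitch_s3.store_report` AS st
      ON s.store_id = st.store_id
"""
client.create_table(view)
```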
Once the data sets are being loaded into S3 and the pipelines are functioning as planned, your data warehouse will start to grow. At this point you can leverage SQL and tools like Google Data Studio or Tableau to do significantly more with your data than before.
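To give you a taste, here's a quick aggregation over the accumulated history. Again, just a sketch with made-up table and column names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table -- monthly revenue, now that history builds up over time.
query = """
    SELECT DATE_TRUNC(order_date, MONTH) AS month, SUM(revenue) AS revenue
    FROM `my-project.reporting.daily_sales`
    GROUP BY month
    ORDER BY month
"""
for row in client.query(query).result():
    print(row.month, row.revenue)
```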
Useful pointers if you want to implement the same stack
There were a bunch of nuances I ran into while working on this project which I wish I'd known beforehand. Below are the main lessons I learnt and some pointers to help you get through the process of setting up this stack.
- As long as the files you're exporting aren't massive (upwards of 1,000,000 rows), you should be able to build this stack on Stitch's free plan.
- Since Stitch allows you to create an unlimited number of connections under a single connection type, you can create as many S3 connections as you need.
- The S3-Stitch-BigQuery pipeline will only work for CSV or TXT files.
- Stitch adds a few of its own columns to the final tables in BigQuery. For the most part these aren't very useful.
- The name of the bucket and the name of the file are themselves fields in the tables Stitch creates in BigQuery. This is really helpful if you need to add a date or some other information to your data via the file name.
- I ended up creating views for each of the data sets to remove unnecessary fields and rename columns to make things more "analyst-friendly" (see the sketch after this list). You'll most likely need to do the same.
- If you create a pipeline and it's not working it might be because the wildcard you entered in Stitch when creating the connection is incorrect. I made this mistake and it took me a while to realize it.
- Be smart with the naming conventions you use when creating the buckets and Stitch connections. You want to make sure there is no room for misunderstanding. Remember, this pipeline might become the responsibility of someone else in the future so make sure you're taking your time and doing things correctly.
- Once the data was flowing into the data warehouse, I manually calculated some KPIs and asked the analyst to do the same using the old method. Thankfully our numbers matched for the most part. This is a good test as it helps confirm that the right data is ending up in the data warehouse and that you can quickly move from the old method to the new one.
- When setting up BigQuery (or a different database) as a destination in Stitch, make sure to set the loading mode to append. You want to append the data (add new rows each time) rather than replace existing data; that way you build up history as more data is loaded.
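To make a few of those pointers concrete, here's roughly what one of those "analyst-friendly" views looked like: it parses the report date out of the file name Stitch records and renames the raw columns. All the names are hypothetical; in particular, check what the file-name column is actually called in your own Stitch-created tables.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names throughout -- check your own table schema for the column
# Stitch uses to record the source file name.
view = bigquery.Table("my-project.reporting.sales_clean")
view.view_query = r"""
    SELECT
      PARSE_DATE('%Y-%m-%d',
                 REGEXP_EXTRACT(_sdc_source_file, r'(\d{4}-\d{2}-\d{2})')) AS report_date,
      store AS store_id,
      rev   AS revenue
    FROM `my-project.stitch_s3.sales_report`
"""
client.create_table(view)
```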