This post is targeted at anyone who understands the concepts of Hadoop, HDFS and the Spark engine, and has a basic knowledge of Python and the Linux terminal. For a general audience-friendly version, click here
In this post, I will explain the code that helped me leverage the power of distributed processing, bringing processing times down from 8 hours to under 8 minutes!
Want to see the code? Click here for the GitHub repository
This post is split up into 5 parts: