Big data, code explanation


This post is targeted at anyone who understands the concepts of Hadoop, HDFS and the Spark engine, and has a basic knowledge of Python and the Linux terminal. For a general audience-friendly version, click here

In this post, I will explain the code that helped me leverage the power of distributed processing, bringing processing times down from 8 hours to under 8 minutes!

Want to see the code? Click here for GitHub repository

This post is split up into 5 parts:

  1. Brief technical point on the difference between Spark and PySpark, to emphasise…
  2. Quick overview of the original (non-Spark) project, the goal of…

Big data, general audience

Network of nodes

This post is targeted at general audiences. If you’re interested in the code and in-depth explanations, click here


  • Spark is an application for distributing data processing amongst many connected computers
  • By turning my original program into a Spark program, I was able to make it run 65x quicker, taking under 8 minutes instead of 8 hours
  • I intend to leverage Spark whenever processing times are significantly long
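The idea behind that speed-up can be sketched in miniature with Python's standard library: split the data into partitions, process each partition in parallel, then combine the partial results. This is a local analogy to what Spark does, not actual Spark code — Spark scales the same pattern out across the many machines of a cluster, and the function and numbers here are purely illustrative.

```python
# Local analogy to Spark's partition-and-process model (not Spark itself):
# split data into partitions, process them in parallel, combine the results.
from concurrent.futures import ProcessPoolExecutor


def process_partition(partition):
    # Stand-in for an expensive per-record transformation.
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, n_partitions=4):
    # Split the data into roughly equal partitions (round-robin).
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    # Each partition is processed by a separate worker process;
    # on a Spark cluster, these workers would be executors on other machines.
    with ProcessPoolExecutor() as pool:
        partials = pool.map(process_partition, partitions)
    # Combine the partial results into the final answer.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(range(1, 11)))  # 385
```

The same shape appears in PySpark as `parallelize(...).map(...).reduce(...)`; the win comes from the per-partition work dominating the cost of splitting and combining.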

Want the code? Click here for the GitHub repository

Big data is characterised by its volume, pace of production and variety. And with it come hefty amounts of processing and leaving your laptop…


A data engineer with Kubrick. I will share some of my projects and knowledge here.
