Big data, code explanation

A simple introduction to Spark, User-Defined Functions (UDFs) and Joins.


In this post, I will explain the code that helped me leverage the power of distributed processing, bringing processing times down from 8 hours to under 8 minutes!

Want to see the code? The full project is available in the GitHub repository.

This post is split up into 5 parts:

  1. Brief technical point on the difference between Spark and PySpark, to emphasise
  2. Quick overview of the original (non-Spark) project, the goal of…

Big data, general audience

Utilising distributed processing to bring those numbers right down.

Network of nodes


  • Spark is a framework for distributing data processing across many connected computers
  • By turning my original program into a Spark program, I made it run 65x faster, finishing in under 8 minutes instead of 8 hours
  • I intend to use Spark whenever processing times become prohibitively long

Big data is characterised by its volume, pace of production and variety. And with it come hefty amounts of processing, leaving your laptop…


I'm a data engineer with Kubrick. I will share some of my projects and knowledge here.
