Big data, code explanation

A simple introduction to Spark, User-Defined Functions (UDFs) and Joins.

  1. Brief technical point on the difference between Spark and PySpark, to emphasise
  2. Quick overview of the original (non-Spark) project, the goal of…

Big data, general audience

Utilising distributed processing to bring those numbers right down.

Network of nodes


  • Spark is an engine for distributing data processing across many connected computers
  • By converting my original program into a Spark program, I made it run roughly 65x quicker, taking about 8 minutes instead of about 8 hours
  • I intend to use Spark whenever processing times become impractically long
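
The idea of splitting work across many workers and combining their results can be sketched in a few lines of plain Python. This is only a single-machine analogy using threads, not Spark's API and not the original project's code; the function names and the sum-of-squares workload are hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records into num_partitions roughly equal chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """Hypothetical per-partition work: square each value and sum."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(records, num_workers=4):
    """Process each partition on its own worker, then combine the partial results."""
    partitions = split_into_partitions(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    return sum(partial_results)


total = distributed_sum_of_squares(list(range(1000)))
```

A cluster engine applies the same split, process and combine pattern, but the workers are separate machines rather than threads in one process, which is where the large reductions in run time come from.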


A data engineer with Kubrick. I will share some of my projects and knowledge here.
