Big data, code explanation

A simple introduction to Spark, User-Defined Functions (UDFs) and Joins.

  1. Brief technical point on the difference between Spark and PySpark, to emphasise
  2. Quick overview of the original (non-Spark) project, the goal of…

Big data, general audience

Utilising distributed processing to bring processing times right down.

Network of nodes
  • Spark is an engine for distributing data processing across many connected computers (a cluster)
  • By turning my original program into a Spark program, I made it run roughly 65x quicker, taking about 8 minutes instead of about 8 hours
  • I intend to use Spark whenever processing times become prohibitively long
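
The idea of splitting work across a network of nodes can be sketched on a single machine. The example below is not Spark and not the original project's code: it is a minimal analogy, assuming a made-up task (summing squares) and using threads from Python's standard library to stand in for cluster nodes. The function names `split_into_partitions`, `process_partition` and `distributed_sum_of_squares` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into roughly equal chunks, one per worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """Hypothetical per-partition work: sum the squares of the records."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(records, num_workers=4):
    """Process each partition independently, then combine the partial results."""
    partitions = split_into_partitions(records, num_workers)
    # Each worker handles one partition on its own, as a node in a cluster would.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # Combining the partial results gives the same answer as a single pass.
    return sum(partial_results)


result = distributed_sum_of_squares(list(range(1000)))
print(result)
```

The pattern is partition, process independently, combine. Threads here only illustrate the shape of the computation; the kind of speed-up described above would come from Spark running the partitions on separate machines.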


A data engineer with Kubrick. I will share some of my projects and knowledge here.

