Turning 8 hours into 8 minutes — a big data success story
Utilising distributed processing to bring those numbers right down.
This post is targeted at general audiences. If you’re interested in the code and in-depth explanations, click here
- Spark is an engine for distributing data processing across many connected computers
- By turning my original program into a Spark program, I was able to make it run 65x quicker, taking only 8 minutes instead of 8 hours
- I intend to leverage Spark whenever processing times are significantly long
Want the code? Click here for the GitHub repository
Big data is characterised by its volume, pace of production and variety. And with it come hefty amounts of processing: leaving your laptop running overnight, praying there are no errors when you arise from your restless sleep.
But all of that is in the past now. Enter distributed computing: a way to share your workload with other computers and cut processing times substantially.
In this post, I will introduce fundamental concepts such as clusters and parallelisation, then describe how I turned a previously computationally-intensive project that runs for many hours into something that is finished quicker than I can boil the kettle and brew a tea.
If you’re working on something on your computer, it’s the only device in the universe working on it — all computational power comes from this single device.
So when you need to run millions of rows of data through an algorithm, you hit a speed barrier. Your computer can only do so much at once; anything else you want to do must wait in line.
Wouldn’t it be great if you could have multiple computers working together on the same task at the same time?
This is where a cluster comes in: a group of interconnected computers, called nodes, working together on the same task.
None of this would be possible without a distributed file system, where data and files are split up into blocks and shared amongst all nodes.
Blocks are usually replicated across several nodes, so the task can carry on even if one or two nodes crash. This is called fault tolerance.
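As a loose illustration (plain Python, with made-up node names rather than a real file system), block replication can be sketched like this:

```python
import itertools

def place_blocks(n_blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin style."""
    node_cycle = itertools.cycle(nodes)
    return {block: [next(node_cycle) for _ in range(replicas)]
            for block in range(n_blocks)}

placement = place_blocks(4, ["node1", "node2", "node3", "node4"])
# Each block now lives on 3 of the 4 nodes, so losing any two nodes
# still leaves at least one surviving copy of every block.
```

Real systems such as HDFS place replicas far more carefully (spreading them across racks, for instance), but the principle is the same: copies on enough different machines that no single failure loses data.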
But one use of this system is particularly fantastic.
Previously, a single computer had to apply any analysis to your data row by row. This is especially slow if what you're doing to each row is complex.
With distributed processing, all nodes work on the same task at the same time by sharing the workload. Each node works on its own share of the data, stored locally, which speeds things up even further by minimising how far the data must travel.
Remember fault tolerance from above? If a node unfortunately crashes, the nodes holding replicas of its data seamlessly pick up the slack. No errors, no worries.
You’re effectively combining the computational power of many computers. This is called scaling out. Need more power? Just add another node to your network.
…all nodes [in a cluster] can work on the same task at the same time by sharing the workload…
Parallelisation is this idea of many different workers contributing to the same job at the same time. So if there are a million rows of data to process and 100 nodes, each node can potentially work on 10,000 rows at exactly the same time. In reality, the distribution of work is much smarter than this.
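As a rough sketch of that idea in plain Python (threads standing in for nodes, and toy numeric "rows" in place of real data), splitting a million rows among workers looks something like this:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real per-row work: here we just square each value
    return [row * row for row in chunk]

def split_into_chunks(rows, n_workers):
    """Divide the rows as evenly as possible among n_workers workers."""
    size, extra = divmod(len(rows), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

rows = list(range(1_000_000))
chunks = split_into_chunks(rows, 100)  # 100 workers -> 10,000 rows each
with ThreadPoolExecutor(max_workers=8) as pool:
    results = [r for part in pool.map(process_chunk, chunks) for r in part]
```

Spark handles all of this for you, and far more cleverly: it decides how to partition the data, where each partition should live, and which node should process it.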
Apache, Spark, Hadoop and all that jazz
The Apache Software Foundation (ASF) is a nonprofit that supports many software projects. If you do anything to do with programming, you’ll hear this name a lot.
Hadoop, one of the projects Apache looks after, is the most famous framework for harnessing the power of distributed computing. It is a collection of different software rather than a single standalone program. Some of its components can work on their own, but together they bring the dream of super-fast, efficient processing and management of big data into reality.
Hadoop (and its collection of software) is open source, meaning its code is freely available for anyone to use and modify.
HDFS, which stands for Hadoop Distributed File System, is one of those components. It is one of the most widely used systems for distributed data storage.
…data and files are split up into blocks and shared amongst all nodes…
Spark is another of those components. In fact, it’s an engine: that just means it does all the heavy lifting, and all you have to do is tell it what you want done. It is Hadoop’s best weapon for crushing processing times, and one of the fastest ways to process distributed data: as long as your data exists in HDFS, Spark can use it.
Spark’s selling point is that it comfortably beats standard processing methods and even older distributed frameworks such as Hadoop’s original MapReduce, largely because it keeps intermediate data in memory rather than writing it to disk between steps.
Top tip: if you need to speed up processing times, Spark is likely your solution
So clearly, this is exactly what I needed to optimise my project.
To utilise this awesome power, you don’t even need to know how the hardware works. By letting a big company’s cloud service, like Amazon’s EMR or Microsoft’s HDInsight, maintain and host all of this technology, all you need to worry about is the code.
Recommendation: ALWAYS use a cloud-hosted service for distributed computing. You don’t need the hassle of maintaining your own hardware.
Need more power? Don’t worry about it; just run your program. Your cloud host will expand your cluster seamlessly, however they see fit, without you even knowing how. You just get the end result: more computing power. Pretty neat.
Want to get started right now, for free? Look into Databricks Community Edition, a web-based platform for coding and deploying projects using Hadoop and Spark straight away with minimal configuration. Did I mention it’s free? Click here to sign up right now.
In a project we nicknamed EDGAR, my team and I were asked to analyse a specific type of financial report, the 10-K, for all S&P 100 companies. We extracted sentiment (i.e. how positive or negative the wording was) and tried to use it as a predictor of short-term stock price changes.
We successfully developed a program to perform the required steps in a standard environment, and we were very pleased with the results.
But there was one thing wrong with our approach.
The crucial part was taking a set of raw document files, cleaning them up and then comparing their wording to the Loughran-McDonald sentiment word list.
But this process took around 8 hours, because we had to process 1200 documents, a total of 7.5 million words.
This was a hard pill to swallow: our program produced accurate results, but it was neither efficient (we dared not re-run it) nor scalable to other companies, types of report and data sources.
So I decided to bring it into the world of Spark.
To utilise distributed processing, I had to refactor our code to use constructs Spark understands, so that it can distribute the work efficiently.
…[Spark] is one of the fastest ways to process distributed data…
The first part of the original program boiled down to a function (a set of encapsulated logic, like a formula) that turns our raw report text into clean, usable text by removing useless tags and unusual characters, among other cleaning steps.
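Our actual cleaning logic isn’t reproduced here, but a simplified sketch of that kind of function (assuming tag-like markup and punctuation are the "useless" parts) might look like:

```python
import re

def clean_report_text(raw):
    """Simplified cleaning: strip tag-like markup, drop unusual
    characters, and normalise whitespace and case."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML/XML-style tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # keep letters only
    return re.sub(r"\s+", " ", text).strip().lower()

clean_report_text("<p>Profits ROSE 12% in Q3!</p>")
# -> "profits rose in q"
```

Spark can then apply a function like this to every document, with each node cleaning its own share of the data.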
Thankfully, Spark has a way to ‘register’ a pre-defined function so that every node in the cluster knows how to apply it and can coordinate the shared effort properly.
The second part of the original program was inefficiently written, so I redesigned the logic around joins: taking two tables of data and combining their columns based on matching values in specific columns.
Joining millions of words from the reports to thousands of words in the sentiment word list is surprisingly light work for Spark. So what would have been one of the most inefficient methods in standard programming turned out to be the best solution in distributed processing.
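To make the idea concrete, here is a toy version in plain Python (the words below are invented stand-ins, not the actual Loughran-McDonald entries):

```python
from collections import Counter

# Toy stand-ins: the real inputs are millions of report words and a
# sentiment word list with thousands of entries
report_words = ["growth", "loss", "strong", "loss", "uncertain", "profit"]
sentiment = {"loss": "negative", "uncertain": "negative", "strong": "positive"}

# The 'join': match each report word against the sentiment table
matched = [(word, sentiment[word]) for word in report_words if word in sentiment]
tally = Counter(label for _, label in matched)
# tally counts 3 negative matches and 1 positive match
```

In Spark this becomes a join between two distributed tables, which the engine spreads across the cluster so every node matches its own slice of the words.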
I had access to a 6-node cluster hosted in the cloud by Amazon. So I was expecting times around 6x shorter, just under 2 hours. How pleasantly surprised I was with the final result.
I was gobsmacked to find that my program ran 65x faster than the original. What was originally 8 hours of processing was crushed down to just 8 minutes.
Furthermore, the actual results of the program were a perfect replica of the original results, so accuracy was not affected at all. And how sweet it was to receive them so swiftly.
By leveraging Spark, it took my program 8 minutes to process 1200 documents, a total of 7.5 million words
And distributed processing required nothing more than writing a few extra lines of code, changing a few original ones, and running the program not on my laptop but on a cloud server, managed and maintained by a third party who also provided an easy interface. Then the architecture of Hadoop and Spark works its magic.
The trick is to have big data methodologies in mind from the very beginning.
- Add an initial stage that downloads all of the reports from the web for you. You currently have to supply the raw files beforehand.
- Generalise to look at any company and any type of financial report.
- Add a user interface — you choose what you want, and let the program do the heavy lifting.
Distributed computing is big now, but it is going to be huge in the future. And I am glad and honoured to have the skills needed to utilise such a powerful tool. I will always have it in mind when developing data pipelines — it’s better to be prepared now and be ready for the future.
Kubrick’s training program has definitely prepared me well for a fruitful career in data engineering, and I look forward to sharing more projects with everyone.
Let me know if you have any questions, suggestions or even if you want to collaborate on projects.
Interested in learning all this in-depth, and more? Consider applying at Kubrick. Send me a message on LinkedIn for more info.