The Hitchhiker's Guide to Handle Big Data Using Spark

Rahul Agarwal recently wrote an article for Towards Data Science that provides a great overview of the history of MapReduce and Spark, and of using Python / PySpark for data engineering. I recently worked on a project with two other people, and we were shocked at how computationally efficient Spark was for our use case: we were able to load, read, manipulate, and explore over 30 GB of data in a GPU-enabled Google Colab environment without hitting Google's RAM limit. If you've used Colab, you know how quickly you can eat through 12 GB of GPU RAM. I highly recommend this article if you're unfamiliar with Spark and PySpark for Python.
