This semester I am enrolled in a really great class at BU called CS506: Computational Tools in Data Science. We have gone over a lot of great foundational concepts through various mini projects, from clustering restaurant districts in Vegas from simple JSON data, to using machine learning algorithms to predict income of individuals from US Census data.
This past week we had an assignment to scrape Redfin houses in Massachusetts, which required using Chromedriver's browser automation to scrape the site slowly so as not to trigger Redfin's heavy anti-scraping security features. In total I would have needed to process over 16,000 individual house listings in MA, which, by my calculations and with my somewhat naive code, would have taken over 20 hours. However, a classmate of mine made a great post about a tool that gets around these issues and provides incredible parallelization on AWS Lambda machines in the cloud.
It is called Pywren, a library written by some UC Berkeley postdocs (GitHub source posted below).
The library is great, and offers a really easy way to parallelize massive jobs like mine, which is critical for big data mining projects. It takes advantage of AWS Lambda machines' ability to quickly perform many small tasks in parallel, and simply requires you to pass in a Python function along with your job split into groups (in my case I passed my scraping function with the URLs bunched into 355 groups of 46).
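To give a rough idea of the pattern, here is a minimal sketch of fanning a job out with Pywren. The `scrape_listings` function is a hypothetical stand-in for my actual scraping code; the Pywren calls shown (`default_executor`, `map`, `get_all_results`) are the library's basic entry points, and running this for real requires AWS credentials and Pywren's setup to have been done first.

```python
def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def scrape_listings(urls):
    # Placeholder for the real scraping logic that runs inside each Lambda.
    return [u for u in urls]

if __name__ == "__main__":
    import pywren  # pip install pywren; assumes AWS credentials are configured

    # Hypothetical listing URLs; 16,330 of them split into 355 groups of 46.
    urls = ["https://www.redfin.com/MA/listing-{}".format(i) for i in range(16330)]
    groups = chunk(urls, 46)

    pwex = pywren.default_executor()
    futures = pwex.map(scrape_listings, groups)  # one Lambda invocation per group
    results = pywren.get_all_results(futures)    # block until all groups finish
```

Each Lambda handles one small group, so the 20-hour sequential job collapses into however long the slowest single group takes.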
For anyone doing big data jobs, or anyone who just wants to speed up their code and learn some cloud computing, it's a great step. I'll also post the link to AWS Lambda below.