Great Tools to get your Feet Wet in Big Data

For the past month I have been immersed at Optum as a data engineering intern. For the better part of June, my team and I were introduced to the tools used for our project, which focuses on encoding specifications in Scala for EMR standardization, and we have just recently started writing code from scratch to implement those specifications. Given how ‘hot’ this type of big data work is, I felt it was worth jumping back into my blog to outline some great tools to learn for anyone who uses big data in their job or research, and skills my managers say are in high demand in industries such as pharma and IT:

Hadoop:

Although my experience is limited (more blog posts to come soon about some personal projects and side work), Hadoop is an incredible tool that many teams I know are working to migrate their systems over to. Essentially, Hadoop runs on clusters of commodity computers (i.e., nodes): HDFS, the Hadoop Distributed File System, spreads data across those nodes, and the framework parallelizes large-scale computation jobs over them. The design aspect I found most impressive was the ‘fail-by-design’ philosophy: data blocks are replicated across multiple nodes, so hardware failure is expected and recoverable, which in turn allows the hardware to be cheap. This makes the system far easier to scale, and it is one that many groups in industry are moving, or have already moved, towards. Again, my experience with Hadoop is limited, so I welcome any resources people have to help me further my knowledge.
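To make the replication idea concrete, here is a minimal Scala sketch against the standard Hadoop FileSystem API that lists the files under a hypothetical /data directory and prints each file's replication factor. The path and cluster configuration are placeholders I made up for the example, not anything from my project:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListing {
  def main(args: Array[String]): Unit = {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath,
    // so fs.defaultFS should point at your cluster's NameNode
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // List whatever sits under a hypothetical /data directory; each file in HDFS
    // is split into blocks that are replicated across nodes (3 copies by default)
    fs.listStatus(new Path("/data")).foreach { status =>
      println(s"${status.getPath}\t${status.getLen} bytes\treplication=${status.getReplication}")
    }
  }
}
```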

Spark:

Along with many of the other Apache tools, Spark is one we have learned a lot about in my internship at Optum. Spark speeds up processing jobs on Hadoop systems and coordinates much of the parallelization. It performs computations in RAM wherever possible to avoid the slow performance of disk-based computation, while still allowing calculations on datasets that can’t fit in memory. Spark is written in Scala and also ships with APIs for Python and Java, so you can work in any of these popular languages.
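As a rough illustration of what that looks like in practice, here is a small word-count sketch using Spark's Scala RDD API; the input path and the local master setting are placeholders I made up for the example, not anything from our project:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs on all local cores; on a real cluster this would point at YARN or a master URL
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Lines are read lazily and partitioned across the workers; the path is a placeholder
    val lines = sc.textFile("hdfs:///data/notes.txt")

    // cache() keeps the intermediate result in RAM so repeated actions avoid re-reading from disk
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```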

Scala:

Scala pairs well with Spark and is a functional programming language (i.e., you can pass functions to functions, return functions from functions, etc.). Once you actually start working with Hadoop and Spark, this paradigm makes much more sense: a lot of the code you write is translated at runtime into the actual computations on your data in abstracted ways, which can give writing the code a very ‘black-box’ feel. That same style of thinking shows up in many Scala concepts such as pattern matching, tail recursion and other interesting techniques I had never been exposed to in any other language. Given that most of my past work has been in Python and C, these new concepts have added a lot to my skill set as a programmer, and I urge people to learn more about this language for their big data projects.
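To give a flavor of those two concepts, here is a small, self-contained Scala sketch of pattern matching and tail recursion; these are toy functions of my own, not anything from our project:

```scala
import scala.annotation.tailrec

object ScalaBasics {
  // Pattern matching: deconstruct a list into its cases instead of chaining if/else
  def describe(xs: List[Int]): String = xs match {
    case Nil          => "empty"
    case head :: Nil  => s"only $head"
    case head :: tail => s"starts with $head, ${tail.length} more"
  }

  // Tail recursion: the recursive call is the last operation, so the compiler
  // rewrites it into a loop and the stack never grows (@tailrec enforces this)
  @tailrec
  def sum(xs: List[Int], acc: Int = 0): Int = xs match {
    case Nil          => acc
    case head :: tail => sum(tail, acc + head)
  }

  def main(args: Array[String]): Unit = {
    println(describe(List(1, 2, 3))) // starts with 1, 2 more
    println(sum(List(1, 2, 3, 4)))   // 10
  }
}
```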

Pandas and SciKit Learn:

These additional Python libraries provide data analysis and ML functionality within your programs. Using them will give you great insight into DataFrames, how data can be stored and manipulated, and how ML algorithms work at a very high level. Since many people my age already know and love Python, these easy-to-learn libraries provide a good background for understanding the usefulness of the tools I have mentioned so far, and a good interface for analytical projects. Given that Spark has a Python API, it is also useful to know these tools depending on your project goals.

SQL:

Last but not least, SQL allows for various queries and manipulations of data tables, along with joins. Although many of these operations are available in Python’s Pandas library, it is important to be able to use SQL because Spark lets you run queries directly, provided you have the proper SQLContext imported so the statement is interpreted correctly (note: Apache Hive provides SQL-like querying on Hadoop, and older versions of Spark require HiveContext imports to use it, so be sure to know your version). Since data transformation is such a critical step before any analysis, this intermediate tool is great to know and a good addition to the skill set.
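For a rough idea of how that looks in practice, here is a small Scala sketch in the older SQLContext style mentioned above, running a plain SQL query over a made-up table; newer Spark versions wrap this up in a SparkSession instead, and the table and column names are purely hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlOnSpark {
  def main(args: Array[String]): Unit = {
    val sc         = new SparkContext(new SparkConf().setAppName("SqlOnSpark").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A made-up patients table; in a real job this would come from HDFS, Hive, etc.
    val patients = Seq((1, "A", 34), (2, "B", 29)).toDF("id", "site", "age")
    patients.registerTempTable("patients")

    // With the SQLContext in scope, plain SQL strings run as distributed Spark jobs
    sqlContext.sql("SELECT site, AVG(age) AS avg_age FROM patients GROUP BY site").show()

    sc.stop()
  }
}
```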
