Thursday, October 15, 2015

R, Scala, Python or Groovy? Which to keep, which to learn?

If you are looking to add a new programming language to your tool-chest, you may be investigating one of the languages popular in today's life-science informatics circles.

If you have been involved with any kind of analytics, statistics or machine learning, you probably already know R and use some of its extensive libraries. R has been the granddaddy of the free and open source tools that most data scientists have 'cut their teeth' on, but alternative tools are emerging, as is support for numerical analysis and graphing in general-purpose programming languages such as Python, Groovy and Scala.

Recently, I started exploring Scala, and in the process I came across a blog post from Bruce Eckel. Bruce has been one of my favorite authors since my early days of learning Java. His book 'Thinking in Java' is one that I have most appreciated for its clarity and instructive value.

Now Bruce has published a new book, 'Atomic Scala', and you can review the first 100 pages or so for free. Atomic Scala is again a foundational book, one that programmers with little or no Scala experience can use to get started before delving into more involved Scala topics.

What is most interesting to me, however, are some observations that Bruce makes in his August 15th, 2015 blog post concerning Scala's complexity. After being immersed in Scala for a couple of years while writing the Atomic Scala book, Bruce says:

"One of the issues I had with Scala is the constant feeling of being unable to detect whether I just wasn’t “getting” a particular aspect, or if the language was too complex. This is a hard thing to know until you have a deep understanding of a language." 

He goes on to compare the complexity of Scala with the 'elegant simplicity' of Python and declares that when he wants to be productive he always reaches for Python.

The popularity and longevity of Python are not simply 'a programming fashion' statement. Data scientists are building practical and robust solutions to many of today's life-science informatics problems with Python, a tremendous improvement over the tools many of us wrote in Perl a decade ago in the 'heyday' of sequence analysis.

Now, Scala is the native programming language for Apache Spark, the popular cluster analytics framework for big data. Many (but not its creators) suspect that Spark will replace Hadoop in the field of big data processing. As a result, a new generation of data scientists will get to learn Scala so that they can use Spark, and Scala will become more popular among bioinformaticians despite its perceived or real complexity. However, the Spark community is also developing additional bindings for the framework in Java, Python and R. Given the vast number of people familiar with these languages, it will be interesting to see how many will indeed undertake the task of learning Scala. It seems that Spark could well be used productively with just Python and R.

As for me, the next language I learn will be Python. In recent years I have been very productive using Groovy (a dynamic, object-oriented and functional JVM language, and now an Apache Incubator project) and less so using R (although I'm quite functional with it). It will be interesting to see how Python stacks up against both of these languages, since it can be used for general programming tasks (like Groovy) as well as statistics, numerical computing, data mining and machine learning (like R).
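To make that dual role a bit more concrete, here is a minimal sketch that mixes a general programming task (processing DNA strings) with a touch of descriptive statistics, using only Python's standard library. The function name `gc_content` and the sequences are made up for illustration and are not taken from any real dataset or library.

```python
from statistics import mean, stdev


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Hypothetical example sequences, invented for this sketch.
sequences = {
    "seq1": "ATGCGC",
    "seq2": "ATATAT",
    "seq3": "GGCCGG",
    "seq4": "ATGCAT",
}

# General programming: build a name -> GC fraction mapping.
gc_by_name = {name: gc_content(seq) for name, seq in sequences.items()}

# Statistics: summarize the GC fractions across all sequences.
values = list(gc_by_name.values())
summary = {"mean": mean(values), "stdev": stdev(values)}

for name, gc in gc_by_name.items():
    print(f"{name}: GC fraction = {gc:.2f}")
print(f"mean = {summary['mean']:.2f}, stdev = {summary['stdev']:.2f}")
```

For anything beyond a toy like this, one would probably reach for third-party packages such as NumPy or pandas, but the sketch suggests how little ceremony the language itself requires.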

A recent publication by Stergios Papadimitriou, an academic researcher, describes the development of two frameworks (ScalaLab and GroovyLab) for numerical analysis in a MATLAB-like environment. Papadimitriou's group concludes that:
"Both languages can elegantly integrate well-known Java numerical analysis libraries for basic tasks and can challenge the performance of the traditional C/C++/Fortran scientific code with an easier to use and more productive programming environment."

Finally, recent data suggest that R usage has been increasing over the past few years. So this venerable queen of statistics, graphical visualization and data mining will not be going away any time soon. The good news for life-science data scientists is that they will have a variety of powerful tools to analyze the ever-growing and increasingly complex data from the lab and the clinic.