Wednesday, November 4, 2015

High Content Data: Get to know them before you do analytics.

In recent weeks I have been refining and troubleshooting the analysis of a set of high-dimensional data from high content screening (HCS). As a note to self, and to others, I will mention a few important aspects of working with this type of data that lead to more robust downstream analytics. A typical use case for such data is machine learning, pattern analysis, and clustering to determine which of a set of chemical compounds are biologically active, so these findings should apply in general to other types of multi-parametric or high-dimensional data.

A gentle introduction to HCS

For those of you who may not have heard of HCS before, all you need to know is that it's a modern technology for screening for promising new medicines. Cells are grown in multi-well plates (up to 1536 wells per plate) and each well is treated with one of thousands of chemical compounds whose biological response we will later examine. At the end of the experiment, cells are fixed and stained with one or more fluorescent tags that allow us to track the location and expression levels of various cellular components (such as proteins, DNA, and organelles).

Image Credit: CellProfiler Team, Broad Institute

A computerized microscope takes high-resolution pictures of each well, and the corresponding images are later analyzed using specialized image analysis software. A popular open-source package that I also use for this purpose is CellProfiler from the Broad Institute. A typical image analysis includes segmenting the important biological objects in the image (nuclei, cells, etc.) and extracting critical morphological characteristics (such as object intensity, shape, and granularity) into a set of numerical features that we can later analyze with statistical and machine learning algorithms.

Which image/object features to extract?

One of the first decisions that you need to make is which image/object features you'll collect from the cellular images. This decision is driven mainly by the expected biological response of the cells under treatment, but frequently it is a difficult one to make before one has had a chance to review the data. In addition, it is helpful to know the response of cells to known positive and negative controls so that you can collect features that robustly distinguish the cellular responses under these two extreme conditions.

CellProfiler modules for measuring Intensity and Texture generate a large number of numeric features
Image analysis software such as CellProfiler provides a number of modules that allow us to measure a multitude of image characteristics. Sometimes one opts to collect all possible measurements, so that an unbiased decision can be made later as to which ones are really informative.

Which image/object features to use?

Even in cases where several features are extracted and measured, it is still difficult to decide which ones to use as key metrics for identifying assay hits (i.e., the metrics that best indicate that an unknown compound has the desired biological effect). In a recent publication, Singh and co-workers from the Broad Institute [1] found that:
"a majority of high-content screens published so far (60−80%) made use of only one or two image-based features measured from each sample and disregarded the distribution of those features among each cell population"
As a result, the information content of the typical HCS experiment is much lower than its potential.
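To make this more concrete, here is a minimal R sketch of how one could summarize the per-cell distribution of a feature within each well, rather than keeping only a single mean value. The data frame and column names (cells, well, nuc_int) are hypothetical stand-ins for the per-cell output of the image analysis.

```r
# Hypothetical per-cell data: one row per cell, with a well identifier
# and one measured feature (names are illustrative, not CellProfiler output)
set.seed(42)
cells <- data.frame(
  well    = rep(c("A01", "A02", "B01"), each = 200),
  nuc_int = c(rnorm(200, 5), rnorm(200, 7), rnorm(200, 5.5))
)

# Summarize the per-well distribution, not just its mean
well_summary <- do.call(rbind, lapply(split(cells$nuc_int, cells$well), function(x) {
  data.frame(mean   = mean(x),
             median = median(x),
             sd     = sd(x),
             q25    = quantile(x, 0.25, names = FALSE),
             q75    = quantile(x, 0.75, names = FALSE))
}))
well_summary
```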

Variability in the quality of the HCS data

We create hundreds of these types of plots, one for each feature
In practice, HCS data for a large screening campaign are accumulated over a period of days, if not weeks, during which different batches of cells, reagents and experimental conditions may introduce variability in the quality of detection and measurement of certain cellular features. Some of these QC issues may get lost in a casual review of the separate batches of data, or may simply be hidden in the large volume of accumulated data.

A good set of visualization tools that permits a global review of all the data set features is a necessity. For example, a simple but effective way to review the performance of the assay controls is to create box plots of each feature across all assay plates. Other approaches, such as kernel density plots, are also very effective.
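As an illustration, here is a minimal sketch in R (using ggplot2) of the kind of per-plate control box plot I'm describing. The well_data data frame and its columns are made up for the example; in practice you would loop or facet over every extracted feature.

```r
library(ggplot2)

# Hypothetical per-well data: plate ID, control type, and one feature value
set.seed(1)
well_data <- data.frame(
  plate     = rep(sprintf("Plate_%02d", 1:12), each = 32),
  well_type = rep(c("neg_ctrl", "pos_ctrl"), times = 12 * 16),
  cell_area = rnorm(12 * 32, mean = rep(c(300, 450), times = 12 * 16), sd = 40)
)

# One box plot per plate lets drifting or failed plates stand out at a glance
ggplot(well_data, aes(x = plate, y = cell_area, fill = well_type)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Control performance per plate", y = "Cell area (a.u.)")
```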

See what you are missing

Missing data and other abnormalities (such as infinite values) introduced by some analysis software are another issue that one must pay close attention to, as these will later create havoc with most machine learning algorithms. It may only be a few values among thousands of good data points, but they are sure to cause you pain if you don't handle them appropriately. I usually generate summary statistics for each of the features using R to quickly check for missing (NA), not-a-number (NaN) and infinite (Inf) values in each of the feature measurements.
Creating group statistics in R for each feature and presenting them as a table is a quick way to detect data quality issues (such as missing data)
In the example above, I can quickly identify features that have missing values and handle them appropriately before starting any data analysis.
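For example, a minimal R sketch along these lines might look like the following; the features data frame is a made-up stand-in for an exported feature table. Note that in R, is.na() is also TRUE for NaN, so the two are separated explicitly.

```r
# Hypothetical feature table: one row per object/well, one column per feature
features <- data.frame(
  intensity   = c(1.2, 3.4, NA, 2.2, Inf),
  granularity = c(0.5, NaN, 0.7, 0.6, 0.8),
  area        = c(210, 340, 280, 295, 310)
)

# Count problematic values in every feature column
qc_table <- data.frame(
  n_na  = sapply(features, function(x) sum(is.na(x) & !is.nan(x))),  # true NA only
  n_nan = sapply(features, function(x) sum(is.nan(x))),
  n_inf = sapply(features, function(x) sum(is.infinite(x)))
)
qc_table
```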

Don't forget the biology

While we have a tendency to concentrate on our many powerful computational tools, not only should we not forget the biologists, but we should actively involve them in the data QC and review process. It is only then that statistically interesting findings are confronted with the true biology. In some cases, such findings are only 'red herrings' and should be discarded.

Measurements represented as assay plate heat maps facilitate data comprehension in the context of the experiment
To productively involve the biologists, you also have to provide visualizations that make sense in the experimental space. For example, since these screens are performed in assay plates, you should present the numbers in an assay plate format.

We frequently do this by displaying the feature measurements in a heat map format that resembles the assay plates. Biologists can immediately recognize this format and determine whether the findings make sense given the control layouts on these plates. For instance, the example here shows distinct patterns where the assay controls are arrayed and where the various concentrations of compounds (dose response) are located.
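A minimal R/ggplot2 sketch of such a plate heat map is shown below; the 384-well layout, the control columns and the signal values are all simulated for illustration.

```r
library(ggplot2)

# Hypothetical 384-well plate: 16 rows (A-P) x 24 columns
set.seed(7)
plate <- expand.grid(row = LETTERS[1:16], col = 1:24)
plate$signal <- rnorm(nrow(plate), mean = 1000, sd = 150)
plate$signal[plate$col %in% c(1, 2)]   <- rnorm(32, 300, 50)    # simulated negative controls
plate$signal[plate$col %in% c(23, 24)] <- rnorm(32, 2500, 200)  # simulated positive controls

# Heat map laid out like the physical plate, so control columns and
# dose-response gradients are immediately recognizable to the biologist
ggplot(plate, aes(x = col, y = row, fill = signal)) +
  geom_tile(color = "grey60") +
  scale_y_discrete(limits = rev(LETTERS[1:16])) +
  scale_x_continuous(breaks = 1:24) +
  labs(title = "Feature value per well", x = "Column", y = "Row")
```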

We have numerous practical examples where, after features are selected based on their potential predictive value in an assay, we get even more robust results once a biologist reviews those features and removes the ones that are not considered reflective of the true biology.

References


  1. Singh, Shantanu, Anne E. Carpenter, and Auguste Genovesio. "Increasing the Content of High-Content Screening: An Overview." Journal of Biomolecular Screening 19.5 (2014): 640-650.

Thursday, October 15, 2015

R, Scala, Python or Groovy? Which to keep, which to learn?

If you are looking to add a new programming language to your tool chest, you may be investigating one of the languages popular in today's life-science informatics circles.

If you have been involved with any kind of analytics, statistics or machine learning, you probably already know R and use some of its extensive libraries. R has been the granddaddy of the open source and free tools that most data scientists have 'cut their teeth' on, but alternative tools, and support for numerical analysis and graphing in general programming languages such as Python, Groovy and Scala, are emerging.

Recently, I've started exploring Scala, and in the process I noticed a blog post from Bruce Eckel. Bruce has been one of my favorite authors since my early days of learning Java. His book 'Thinking in Java' is one that I have most appreciated for its clarity and instructive value.

Now Bruce has published a new book, 'Atomic Scala', and you can review the first 100 pages or so for free. Atomic Scala is again a foundational book that programmers with little or no Scala experience can use to get started before delving into more involved Scala topics.

What is most interesting to me, however, are some observations Bruce makes in his August 15, 2015 blog post concerning Scala's complexity. After being immersed in Scala for a couple of years while writing Atomic Scala, Bruce says:

"One of the issues I had with Scala is the constant feeling of being unable to detect whether I just wasn’t “getting” a particular aspect, or if the language was too complex. This is a hard thing to know until you have a deep understanding of a language." 

He goes on to compare the complexity of Scala with the 'elegant simplicity' of Python and declares that when he wants to be productive he always reaches for Python.

The popularity and longevity of Python are not simply a 'programming fashion' statement. Data scientists are building practical and robust solutions to many of today's life-science informatics problems with Python, a tremendous improvement over the tools many of us wrote in Perl a decade ago in the 'heyday' of sequence analysis.

Now, Scala is the native programming language for Apache Spark, the popular cluster analytics framework for big data. Many (but not its creators) suspect that Spark will replace Hadoop in the field of big data processing. As a result, a new generation of data scientists will get to learn Scala so that they can use Spark, and Scala will become more popular among bioinformaticians despite its perceived or real complexity. However, the Spark community is also developing additional framework bindings in Java, Python and R. Given the vast number of people familiar with these languages, it will be interesting to see how many will indeed undertake the task of learning Scala. It seems that Spark could well be used productively with just Python and R.

As for me, my next language to learn will be Python. In recent years I have been very productive using Groovy (a dynamic, object-oriented and functional JVM language, and now an Apache Incubator project) and less so using R (although I'm quite functional with it). It will be interesting to see how Python stacks up against both of these languages, since it can be used to perform general programming tasks (like Groovy) as well as statistics, numerical computing, data mining and machine learning (like R).

A recent publication by Stergios Papadimitriou, an academic researcher, describes the development of two frameworks (ScalaLab and GroovyLab) for numerical analysis in a MATLAB-like environment. Papadimitriou's group concludes that:
"Both languages can elegantly integrate well-known Java numerical analysis libraries for basic tasks and can challenge the performance of the traditional C/C++/Fortran scientific code with an easier to use and more productive programming environment."

Finally, recent data suggest that R usage continues to increase. So this venerable queen of statistics, graphical visualization and data mining will not be going away any time soon. The good news for life-science data scientists is that they will have a variety of powerful tools to analyze the ever-increasing and complex data from the lab and the clinic.