In recent weeks I have been refining and troubleshooting the analysis of a set of high-dimensional data from high content screening (HCS). As a note to self, and to others, I will mention a few important aspects of working with this type of data that lead to more robust downstream analytics. A typical use case for this data is machine learning, pattern analysis, and clustering to determine which of a set of chemical compounds are biologically active. These findings should also apply in general to other types of multi-parametric or high-dimensional data.
A gentle introduction to HCS
For those of you who may not have heard of HCS before, all you need to know is that it's a modern technology for screening for promising new medicines. Cells are grown in multi-well plates (up to 1536 wells per plate) and each well is treated with one of thousands of chemical compounds whose biological response we will later examine. At the end of the experiment, cells are fixed and stained with one or more fluorescent tags that allow us to track the location and expression levels of various cellular components (such as proteins, DNA, organelles, etc.).
[Image credit: CellProfiler Team, Broad Institute]
A computerized microscope takes high-resolution pictures of each of the wells, and the corresponding images are later analyzed using specialized image analysis software. A popular open-source package, which I also use for this purpose, is CellProfiler from the Broad Institute. A typical image analysis includes segmenting important biological objects in the image (nuclei, cells, etc.) and extracting critical morphological characteristics (such as object intensity, shape, granularity, etc.) into a set of numerical features that we can later analyze with statistical and machine learning algorithms.
Which image/object features to extract?
One of the first decisions that you need to make is which image/object features to collect from the cellular images. This decision is driven mainly by the expected biological response of the cells under treatment, but it is frequently a difficult one to make before you have had a chance to review the data. In addition, it is helpful to know the response of cells to known positive and negative controls so that you can collect features that robustly distinguish the cellular response between these two extreme conditions.
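One way to make "robustly distinguish" concrete is to score each candidate feature by how well it separates the control wells. The sketch below uses the Z'-factor, a commonly used assay-quality statistic, as a stand-in; the post does not say which statistic was actually used. The feature names and control values are made up for illustration.

```python
from statistics import mean, stdev


def z_prime(pos, neg):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(pos) - mean(neg))
    if separation == 0:
        # Identical control means: the feature cannot separate the controls.
        return float("-inf")
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / separation


def rank_features(controls):
    """Rank features by Z'-factor, best first.

    controls maps a feature name to (positive_values, negative_values).
    """
    scores = {name: z_prime(pos, neg) for name, (pos, neg) in controls.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature names and made-up control-well measurements.
controls = {
    "Intensity_Mean_DNA": ([10.0, 11.0, 9.0, 10.0], [50.0, 52.0, 48.0, 50.0]),
    "Texture_Entropy": ([1.0, 2.0, 3.0, 2.0], [2.0, 3.0, 1.0, 2.5]),
}
ranked = rank_features(controls)
for name, score in ranked:
    print(f"{name}: Z' = {score:.2f}")
```

Features whose score is close to 1 separate the two control populations cleanly, while scores near or below 0 indicate overlapping controls. Any cutoff you apply is a judgment call for your own assay.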
[Figure: CellProfiler modules for measuring Intensity and Texture generate a large number of numeric features]
Image analysis software such as CellProfiler provides a number of modules that allow us to measure a multitude of image characteristics. Sometimes one opts to collect all possible measurements, so that an unbiased decision can be made later as to which ones are really informative.
Which image/object features to use?
Even in cases where many features are extracted and measured, it is still difficult to decide which ones to use as key metrics for identifying assay hits (i.e. the metrics that best identify an unknown compound as having the desired biological effect). In a recent publication, Singh and co-workers from the Broad Institute [1] found that:
"a majority of high-content screens published so far (60−80%) made use of only one or two image-based features measured from each sample and disregarded the distribution of those features among each cell population"
As a result, the information content of a typical HCS experiment is much lower than its potential.
Variability in the quality of the HCS data
[Figure: We create hundreds of these types of plots, one for each feature]
In practical use, HCS data for a large screening campaign are accumulated over a period of days, if not weeks, in which different batches of cells, reagents and experimental conditions may introduce variability in the quality of detection and measurement of certain cellular features. Some of these QC issues may get lost in the casual review of the separate batches of data and may be simply hidden in the large volume of accumulated data.
A good set of visualization tools that permits a global review of all the features in the data set is a necessity. For example, a simple but effective way to review the performance of assay controls is to create box plots of each feature across all assay plates. Other approaches, such as kernel density plots, are also very effective.
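The numbers behind such a box plot can be sketched with the standard library alone. This is only an illustration, not the plotting code used for the figures: the record layout (one dict per control well with a `plate` key) and the feature name are assumptions.

```python
from collections import defaultdict
from statistics import quantiles


def boxplot_stats_by_plate(records, feature):
    """Return the five-number summary of one feature for each plate."""
    by_plate = defaultdict(list)
    for rec in records:
        by_plate[rec["plate"]].append(rec[feature])

    summary = {}
    for plate, values in sorted(by_plate.items()):
        # The "inclusive" method treats the values as the full population.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[plate] = {
            "min": min(values),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(values),
        }
    return summary


# Made-up control wells from two plates; plate P2 has drifted upwards.
records = (
    [{"plate": "P1", "Intensity_Mean_DNA": v} for v in (1.0, 2.0, 3.0, 4.0, 5.0)]
    + [{"plate": "P2", "Intensity_Mean_DNA": v} for v in (11.0, 12.0, 13.0, 14.0, 15.0)]
)
summary = boxplot_stats_by_plate(records, "Intensity_Mean_DNA")
for plate, stats in summary.items():
    print(plate, stats)
```

Looping the same function over every feature gives one table per feature, and a plate whose median or spread stands apart from the rest is a candidate for closer QC review.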
See what you are missing
Missing data and other abnormalities (such as infinite values) introduced by some analysis software are another issue that one must pay close attention to, as these will later wreak havoc with most machine learning algorithms. It may only be a few values among thousands of good data points, but they are sure to cause you pain if you don't handle them appropriately. I usually generate summary statistics for each feature using R to quickly check for missing (NA), not-a-number (NaN) and infinite (Inf) values in each of the feature measurements.
[Figure: Creating group statistics in R for each feature and presenting them as a table is a quick way to detect data quality issues (such as missing data)]
In the example above I can quickly identify features that have some missing values and handle them appropriately before starting any data analysis.
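The same kind of per-feature check can be sketched in a few lines. The summaries in this post were produced in R; the version below is a Python illustration only, and the row layout (one dict per well or object) and feature names are hypothetical.

```python
import math


def quality_summary(rows, features):
    """Count missing, NaN and infinite values for each feature."""
    summary = {}
    for feat in features:
        counts = {"n": 0, "missing": 0, "nan": 0, "inf": 0}
        for row in rows:
            counts["n"] += 1
            value = row.get(feat)  # an absent key counts as missing
            if value is None:
                counts["missing"] += 1
            elif math.isnan(value):
                counts["nan"] += 1
            elif math.isinf(value):
                counts["inf"] += 1
        summary[feat] = counts
    return summary


def flagged_features(summary):
    """Return the names of features with at least one problematic value."""
    return [
        feat
        for feat, c in summary.items()
        if c["missing"] or c["nan"] or c["inf"]
    ]


# Made-up measurements with a few deliberately bad values.
rows = [
    {"Area": 1.0, "Intensity": float("nan"), "Eccentricity": 0.2},
    {"Area": None, "Intensity": 2.0, "Eccentricity": 0.4},
    {"Area": float("inf"), "Intensity": 3.0, "Eccentricity": 0.3},
    {"Intensity": 4.0, "Eccentricity": 0.5},
]
summary = quality_summary(rows, ["Area", "Intensity", "Eccentricity"])
flagged = flagged_features(summary)
print(flagged)
```

How to handle a flagged feature (dropping it, dropping the affected wells, or imputing values) depends on the assay and is left open here.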
Don't forget the biology
While we tend to concentrate on our many powerful computational tools, we should not only remember the biologists but also actively involve them in the data QC and review process. It is only then that statistically interesting findings are confronted with the true biology. In some cases, such findings are only 'red herrings' and should be discarded.
[Figure: Measurements represented as assay plate heat maps facilitate data comprehension in the context of the experiment]
To productively involve the biologist, you also have to provide visualizations that make sense in the experimental space. For example, since these screens are performed in assay plates, you must provide visualizations that present the numbers in an assay plate format.
We frequently generate these by displaying the feature measurements in a heat map format that resembles the assay plates. The biologists can immediately recognize this format and determine whether the findings make sense given the control layouts on these plates. The example here shows distinct patterns where the assay controls are arrayed and where various concentrations of compounds (dose response) are located.
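The step underneath such a heat map is mapping well IDs onto plate rows and columns. The sketch below assumes a 96-well plate (rows A to H, columns 1 to 12) and well IDs such as `B3`. Denser formats such as 1536-well plates use two-letter row labels, which this simplified parser does not handle.

```python
def plate_grid(well_values, n_rows=8, n_cols=12):
    """Arrange {well_id: value} into a row-major grid shaped like the plate."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for well, value in well_values.items():
        row = ord(well[0].upper()) - ord("A")  # 'A' -> 0, 'B' -> 1, ...
        col = int(well[1:]) - 1                # '1' -> 0, '12' -> 11
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            raise ValueError(f"well {well!r} is outside a {n_rows}x{n_cols} plate")
        grid[row][col] = value
    return grid


def format_plate(grid):
    """Render the grid as text, with '.' marking wells without a value."""
    n_cols = len(grid[0])
    lines = ["    " + " ".join(f"{c:>5d}" for c in range(1, n_cols + 1))]
    for i, row in enumerate(grid):
        cells = " ".join("    ." if v is None else f"{v:5.2f}" for v in row)
        lines.append(f"{chr(ord('A') + i):<3} {cells}")
    return "\n".join(lines)


# Made-up values for three wells of one feature.
grid = plate_grid({"A1": 0.1, "B3": 0.5, "H12": 0.9})
print(format_plate(grid))
```

The resulting grid can be handed to whatever heat map function your plotting library provides, with one grid per feature and plate.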
We have numerous practical examples where, after features are selected based on their potential for good predictive value in an assay, we get even more robust results once a biologist reviews these features and removes those that are not considered reflective of the true biology.
References
- [1] Singh, Shantanu, Anne E. Carpenter, and Auguste Genovesio. "Increasing the Content of High-Content Screening: An Overview." Journal of Biomolecular Screening 19.5 (2014): 640-650.