Friday, February 8, 2019

The 'snake pit' of Anaconda and R-reticulate

Background

The RStudio team is making an important contribution with the 'reticulate' package for reusing Python modules in R. The reticulate package makes it possible to embed a Python session within an R process, allowing you to import Python modules and call their functions directly from R. If you are an R developer who uses Python for some of your work, or a member of a data science team that uses both languages, reticulate can dramatically streamline your workflow.[1]

Into the 'snake pit'

The last few days I have been trying to do some data analysis using the 'umap' dimensionality reduction and visualization algorithm [3]. My preferred data analysis environment is R as it is easy to integrate R-scripts into our data management and analytics workflow using Jenkins-LSCI.[4]

It all started innocently enough, as I decided to use the flipDimensionReduction package[2] (from Displayr), a package that I had used previously to create t-SNE visualizations of multi-dimensional data. The Displayr dimension reduction function also supports using the UMAP algorithm as an alternative to t-SNE.

However, the UMAP option requires the external Python module 'umap-learn'. And this is where my adventure began.

A quick test (code shown below) from within RStudio on my own machine (a Windows 10 laptop, R v3.5.2, Anaconda distribution of Python 3.6) worked flawlessly, and I was very encouraged!

library(flipDimensionReduction)
library(reticulate)
## Read in the data set file and apply the dimension reduction algorithm
redData <- read.csv('http://localhost:8080/job/UTIL_CSVDATA_QUERY/39/artifact/customQuery_1548177063100.csv')
tSNE_p30 <- flipDimensionReduction::DimensionReductionScatterplot("t-SNE", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)
## To use the UMAP algorithm simply replace 't-SNE' with 'UMAP'
umap_p30 <- flipDimensionReduction::DimensionReductionScatterplot("UMAP", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)

Goal: Run flipDimensionReduction with UMAP from an R-Script

Now, let me describe the environment in which I actually need to run this:
  1. Windows 2008 server
  2. R v3.5.2
  3. Anaconda distribution of Python 2.7.15
  4. The R script has to run from the command line and not from within RStudio

[Failed] Update Anaconda Distribution from Python 2.7 to Python 3.x

Given that the Anaconda Navigator and Python installations on the Windows server were both older versions, I decided to update them using the recommended conda command [5,6]:

conda install anaconda


However, this did not work as expected (I was expecting an update to the latest version of Python). It seems that all I got was an updated version of the Anaconda Python 2.7.15 distribution.

[Success] Creating a 'conda' Python 3.7 environment

I then decided to create a new conda environment, 'py37' with the latest version of Python 3.7. This was done from the user interface of the Anaconda Navigator and completed successfully.

[Failed] Anaconda Navigator: Add the 'conda-forge' channel

This is the repository (channel) that hosts the 'umap-learn' module. I added it from the Anaconda Navigator user interface and then clicked 'Update Index...'. This operation hung the Anaconda Navigator ... forever! Now, every time the Navigator starts, it hangs while displaying 'Adding featured channels...' and I have to kill the python process to exit!

Apparently this is a known (and as of now, 2/7/2019, unresolved) issue.

[Success] Command line installation of 'umap-learn'

Thankfully, the Anaconda 'py37' environment console is still functional, so I continued the installation from the command line:

conda install -c conda-forge umap-learn
conda list

The 'list' command displays all the 'py37' installed modules, and it displays the 'umap-learn' package as correctly installed.
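Since the end goal is an R script, the same check can be sketched in R. The helper below is a minimal, hypothetical example (conda_list_has is not part of any package) that scans the text output of 'conda list' for a package name. The sample lines only imitate the general layout of that output, and the version and build columns are placeholders.

```r
# Hypothetical helper: TRUE if `package` appears in the first column
# of the text printed by `conda list`.
conda_list_has <- function(lines, package) {
  # Drop comment/header lines (starting with '#') and blank lines.
  rows <- lines[!grepl("^\\s*#", lines) & nzchar(trimws(lines))]
  # The package name is the first whitespace-separated field of each row.
  names_col <- vapply(strsplit(trimws(rows), "\\s+"), `[`, character(1), 1)
  package %in% names_col
}

# Illustrative input only: layout imitates `conda list`, values are placeholders.
sample_lines <- c(
  "# packages in environment at <path to py37>:",
  "#",
  "# Name                    Version                   Build  Channel",
  "numpy                     <version>                 <build>",
  "umap-learn                <version>                 <build>  conda-forge"
)

print(conda_list_has(sample_lines, "umap-learn"))

# Against a real installation (assumes 'conda' is on the PATH):
# conda <- Sys.which("conda")
# if (nzchar(conda)) {
#   conda_list_has(system2(conda, c("list", "-n", "py37"), stdout = TRUE), "umap-learn")
# }
```

Note that the check is against the install name ('umap-learn'), which is what 'conda list' reports.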

Is it time to rejoice that we've made it out of the 'snake pit' of module updates and installation? The joy is being able to use the UMAP algorithm from R. Let's give it a try.

[Failed]: Attempt 1: Using 'umap-learn' from R

Now that the 'umap' package was installed it was time to try it from R.

Given that I had two environments for Python (the 'base' Python 2.7.15 without the 'umap-learn' package, and 'py37' with the 'umap-learn' module), I followed the instructions for configuring the Python 'py37' environment for use by R[7].

The instructions suggest that 'reticulate' 'can bind to any of these Python versions' in one of several ways.

I decided to use the use_python("PATH_TO_PYTHON") 'reticulate' function.

> library(reticulate)
> use_python("D:/DEVTOOLS/Anaconda2/envs/py37")

> py_config()
python:         D:\DEVTOOLS\ANACON~1\python.exe
libpython:      D:/DEVTOOLS/ANACON~1/python27.dll
pythonhome:     D:\DEVTOOLS\ANACON~1
version:        2.7.15 |Anaconda, Inc.| (default, Dec 10 2018, 21:57:18) [MSC v.1500 64 bit (AMD64)]
Architecture:   64bit
numpy:          D:\DEVTOOLS\ANACON~1\lib\site-packages\numpy
numpy_version:  1.15.4

python versions found: 
 D:/DEVTOOLS/Anaconda2/envs/py37/python.exe
 D:\DEVTOOLS\ANACON~1\python.exe
 D:\DEVTOOLS\Anaconda2\python.exe
 D:\DEVTOOLS\Anaconda2\envs\py37\python.exe
> 

But wait, py_config() reports that the Python version is 2.7.15, not what I expected. What's happening here?

It seems that despite the use_python() directive, 'reticulate' still binds to the older version of Python (which I don't want, since it does not include the 'umap-learn' module).
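One detail worth knowing here: use_python() is documented as taking the path to the Python binary, and by default it treats that path only as a hint unless required = TRUE is passed. The sketch below shows a stricter call under those assumptions. The path is the one from this server, bind_python is a hypothetical wrapper, and whether this alone avoids the problem above on this setup is not confirmed.

```r
# Path from this post's server: adjust to the python.exe of your own environment.
py37_python <- "D:/DEVTOOLS/Anaconda2/envs/py37/python.exe"

# Hypothetical wrapper: request a strict binding, or skip politely.
# Returns TRUE if the binding was requested, FALSE if it was skipped.
bind_python <- function(python_path) {
  if (!requireNamespace("reticulate", quietly = TRUE) || !file.exists(python_path)) {
    message("reticulate or ", python_path, " is not available; skipping")
    return(FALSE)
  }
  # required = TRUE turns the path from a hint into a hard requirement.
  # This must happen before Python is initialized in the session.
  reticulate::use_python(python_path, required = TRUE)
  TRUE
}

bound <- bind_python(py37_python)

# In a fresh session, inspect which interpreter was actually bound:
# if (bound) print(reticulate::py_config())
```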

After none of the suggested 'reticulate' functions that dynamically define a Python environment to use worked, I decided to specify the Python location from a system environment variable.

[Failed] Attempt 2: Using 'umap-learn' from R

Add a new Windows system environment variable:
RETICULATE_PYTHON="D:\DEVTOOLS\Anaconda2\envs\py37\python.exe"

Start a new R session. Apparently a session can bind to only a single Python environment, and the binding cannot be reset, so always test by starting a new R console session.
> library(reticulate)
> py_discover_config()
python:         D:\DEVTOOLS\Anaconda2\envs\py37\python.exe
libpython:      D:/DEVTOOLS/Anaconda2/envs/py37/python37.dll
pythonhome:     D:\DEVTOOLS\ANACON~1\envs\py37
version:        3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Architecture:   64bit
numpy:           [NOT FOUND]

NOTE: Python version was forced by RETICULATE_PYTHON
> import("umap-learn")
Error in py_module_import(module, convert = convert) : 
  ModuleNotFoundError: No module named 'umap-learn'
> 

So, now I forced 'reticulate' to use the version of Python I wanted, but it still can't find the module!
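For a script that runs from the command line, the same variable can also be set at the top of the script instead of system-wide, as long as that happens before 'reticulate' initializes Python. A minimal sketch, assuming the 'py37' path used on this server:

```r
# Path from this post's server: point this at the python.exe you want.
py37_python <- "D:/DEVTOOLS/Anaconda2/envs/py37/python.exe"

# Set the variable for this R session only, before reticulate starts Python.
Sys.setenv(RETICULATE_PYTHON = py37_python)

# Confirm the value that the session will hand to reticulate.
reticulate_python <- Sys.getenv("RETICULATE_PYTHON")
print(reticulate_python)

# Then, in the same session:
# library(reticulate)
# py_discover_config()
```

Setting it per script keeps the choice of interpreter next to the code that depends on it, at the cost of a hard-coded path.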

[Success] What is your real name 'umap-learn'?

More investigation revealed that although the module appears in 'conda list' under the name 'umap-learn', it is really named 'umap' (in the envs\py37\Lib\site-packages folder), so the import fails!

Let's retry importing with the 'umap' name.

> import("umap")
Module(umap)
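A likely explanation is that the name used to install a package (its conda or pip name) is independent of the module name it provides for import, and a hyphenated name such as 'umap-learn' could not be used in a Python import statement anyway. The sketch below keeps a small lookup table for this. The table is illustrative rather than exhaustive, and import_name_for is a hypothetical helper.

```r
# Install name (conda/pip) -> import name (Python module).
# Illustrative examples only, not an exhaustive list.
import_names <- c(
  "umap-learn"     = "umap",
  "scikit-learn"   = "sklearn",
  "pyyaml"         = "yaml",
  "beautifulsoup4" = "bs4"
)

# Hypothetical helper: look up the import name, falling back to the
# install name when no mapping is known.
import_name_for <- function(package) {
  if (package %in% names(import_names)) import_names[[package]] else package
}

print(import_name_for("umap-learn"))  # prints [1] "umap"
print(import_name_for("numpy"))       # prints [1] "numpy"

# With reticulate loaded, the result is what goes into import():
# umap <- reticulate::import(import_name_for("umap-learn"))
```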

[Failed] Attempt 3: Using 'umap-learn' from R (NumPy version?)

So at this point I'm finally able to import the desired module. Let's try to use it.

> redData=read.csv('http://localhost:8080/job/UTIL_CSVDATA_QUERY/39/artifact/customQuery_1548177063100.csv')
> library(flipDimensionReduction)
> tSNE_p30=flipDimensionReduction::DimensionReductionScatterplot("t-SNE", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)
> umap_p30=flipDimensionReduction::DimensionReductionScatterplot("UMAP", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  Evaluation error: Required version of NumPy not available: installation of Numpy >= 1.6 not found.

As of this writing NumPy is at version 1.15.4, and I have the most recent version. What is this?
Well, after some more investigation and assistance from StackOverflow, it turns out that this is another unresolved issue with 'reticulate' [8].

The suggestion there was to downgrade the version of NumPy, but this did not work for me. What worked was the suggestion from GitHub: https://github.com/rstudio/reticulate/issues/367

The suggested solution is to add the environment's `Library\bin` directory to the PATH prior to initializing Python.

[Success] Attempt 4: Initializing Python with path to libraries


# Now some environment hacks to get around known issues with 'reticulate',
# see https://github.com/rstudio/reticulate/issues/367
> Sys.setenv(PATH = paste("D:/DEVTOOLS/Anaconda2/envs/py37/Library/bin", Sys.getenv("PATH"), sep = ";"))
> library(reticulate)
> import("umap")
Module(umap)
> py_config()
python:         D:\DEVTOOLS\Anaconda2\envs\py37\python.exe
libpython:      D:/DEVTOOLS/Anaconda2/envs/py37/python37.dll
pythonhome:     D:\DEVTOOLS\ANACON~1\envs\py37
version:        3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Architecture:   64bit
numpy:          D:\DEVTOOLS\ANACON~1\envs\py37\lib\site-packages\numpy
numpy_version:  1.15.1
umap:           D:\DEVTOOLS\ANACON~1\envs\py37\lib\site-packages\umap\__init__.p

NOTE: Python version was forced by RETICULATE_PYTHON

> library(flipDimensionReduction)

> redData= read.csv('http://localhost:8080/job/UTIL_CSVDATA_QUERY/39/artifact/customQuery_1548177063100.csv')   
> tSNE_p30=flipDimensionReduction::DimensionReductionScatterplot("t-SNE", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)  

> ##To use the UMAP algorithm simply replace 't-SNE' with 'UMAP'  
> umap_p30=flipDimensionReduction::DimensionReductionScatterplot("UMAP", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)  
> umap_p30

This last command uses the print function of the umap_p30 object to create an interactive UMAP web page.
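For a script run with Rscript, the PATH change from Attempt 4 can be wrapped in a small helper so that repeated runs do not keep stacking the same directory onto the PATH. This is a minimal sketch: prepend_to_path is a hypothetical helper, and the directory is the one used on this server.

```r
# Hypothetical helper: put `dir` in front of a PATH-style string
# unless it is already listed.
prepend_to_path <- function(dir, path = Sys.getenv("PATH"), sep = .Platform$path.sep) {
  entries <- strsplit(path, sep, fixed = TRUE)[[1]]
  # Compare case-insensitively and with forward slashes, since Windows
  # paths are written both ways.
  norm <- function(p) tolower(gsub("\\\\", "/", p))
  if (norm(dir) %in% norm(entries)) {
    return(path)
  }
  paste(c(dir, entries), collapse = sep)
}

# Directory from this post's server: adjust to your own environment.
library_bin <- "D:/DEVTOOLS/Anaconda2/envs/py37/Library/bin"

# Do this before library(reticulate) initializes Python.
Sys.setenv(PATH = prepend_to_path(library_bin))
```

The default separator comes from .Platform$path.sep, which is ";" on Windows, so the result matches the hand-written paste() call used in Attempt 4.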

UMAP Interactive

'Out of the snake pit' What have I learned?

So, finally I have the reduction algorithm running in R with the UMAP option. And yes, it uses the 'reticulate' package to import the latest version of Python and its installed modules.

But the experience of getting this to work was anything but smooth!
  • The 'reticulate' R package issues seem to have been known for several months, but still remain unresolved.
  • The 'reticulate' functions for binding to specific Python versions and environments, and for resolving library paths, did not work as expected. These seem fundamental to the usage of the package. If they had worked, the process would have been a lot smoother.
  • Some of the issues seemed total red herrings, and the error messages were not helpful (see the NumPy version issue).
  • The issue of the actual 'umap' module name vs. the name used to install it is still unclear to me, but perhaps someone with more experience in Python can explain why.
  • Finally, the Anaconda Navigator hung and never recovered. I think I will stick with the Anaconda command prompt.

In the end, I've put these notes together in the hope that they'll help someone else to set up and troubleshoot this important functionality of 'reticulate' in less time than it took me.

References

  1. RStudio 1.2 Preview: Reticulated Python: https://blog.rstudio.com/2018/10/09/rstudio-1-2-preview-reticulated-python/
  2. flipDimensionReduction by Displayr: https://rdrr.io/github/Displayr/flipDimensionReduction/
  3. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction: https://arxiv.org/abs/1802.03426
  4. Continuous Integration of life-science data workflows with Jenkins: https://github.com/Novartis/Jenkins-LSCI
  5. Updating Anaconda to Python 3.6: https://support.anaconda.com/customer/en/portal/articles/2797011-updating-anaconda-to-python-3-6
  6. Updating Anaconda: https://stackoverflow.com/questions/45197777/how-do-i-update-anaconda
  7. Python Version Configuration: https://rstudio.github.io/reticulate/articles/versions.html
  8. Using Python in R with reticulate package - Numpy not found: https://stackoverflow.com/questions/54069769/using-python-in-r-with-reticulate-package-numpy-not-found