Friday, February 8, 2019

The 'snake pit' of Anaconda and R-reticulate

Background

The RStudio team is making an important contribution with the 'reticulate' package for reusing Python modules in R. The reticulate package makes it possible to embed a Python session within an R process, allowing you to import Python modules and call their functions directly from R. If you are an R developer who uses Python for some of your work or a member of a data science team that uses both languages, reticulate can dramatically streamline your workflow.[1]

Into the 'snake pit'

For the last few days I have been trying to do some data analysis using the 'umap' dimensionality reduction and visualization algorithm [3]. My preferred data analysis environment is R, as it is easy to integrate R scripts into our data management and analytics workflow using Jenkins-LSCI [4].

It all started innocently enough, as I decided to use the flipDimensionReduction package [2] (from Displayr), a package that I had used previously to create t-SNE visualizations of multi-dimensional data. The Displayr dimension reduction function also supports using the UMAP algorithm as an alternative to t-SNE.

However, the use of the UMAP option requires the external Python module 'umap-learn'. And this is where my adventure began.

A quick test (code shown below) from within R-Studio on my desktop (a Win-10 laptop, R v3.5.2, Anaconda distribution of Python 3.6) worked flawlessly and I was very encouraged!

 library(flipDimensionReduction)
 library(reticulate)
 ## Read in the data set file and apply the dimension reduction algorithm
 redData <- read.csv('http://localhost:8080/job/UTIL_CSVDATA_QUERY/39/artifact/customQuery_1548177063100.csv')
 tSNE_p30 <- flipDimensionReduction::DimensionReductionScatterplot("t-SNE", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)
 ## To use the UMAP algorithm, simply replace 't-SNE' with 'UMAP'
 umap_p30 <- flipDimensionReduction::DimensionReductionScatterplot("UMAP", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)

Goal: Run flipDimensionReduction with UMAP from an R-Script

Now, let's describe the environment in which I would really like to run this:
  1. Windows 2008 server
  2. R v 3.5.2
  3. Anaconda distribution of Python 2.7.15
  4. The R-script has to run from the command line and not from within R-Studio

[Failed] Update Anaconda Distribution from Python 2.7 to Python 3.x

Given that the Anaconda Navigator and Python installations on the Windows server were both older versions, I decided to update them using the recommended conda command [5,6]:

conda install anaconda


However, this did not work as expected (I was expecting an update to the latest version of Python). It seems that all I got was an updated version of the Anaconda Python 2.7.15 distribution.

[Success] Creating a 'conda' Python 3.7 environment

I then decided to create a new conda environment, 'py37' with the latest version of Python 3.7. This was done from the user interface of the Anaconda Navigator and completed successfully.

[Failed] Anaconda Navigator: Add the 'conda-forge' channel

This is the repository (channel) that hosts the 'umap-learn' module. I added it from the Anaconda Navigator user interface and then clicked 'Update Index...'. This operation hung the Anaconda Navigator ... forever! Now, every time the Anaconda Navigator starts, it hangs while displaying 'Adding featured channels...', and I have to kill the python process to exit!

Apparently this is a known (and as of now, 2/7/2019, unresolved) issue.

[Success] Command line installation of 'umap-learn'

Thankfully, the Anaconda 'py37' environment console is still functional, so I continued the installation from the command line:

conda install -c conda-forge umap-learn
conda list

The 'list' command displays all the 'py37' installed modules, and it displays the 'umap-learn' package as correctly installed.

Is it time to rejoice that we've made it out of the 'snake pit' of module updates and installation? The joy is being able to use the UMAP algorithm from R. Let's give it a try.

[Failed] Attempt 1: Using 'umap-learn' from R

Now that the 'umap' package was installed it was time to try it from R.

Given that I had two Python environments (the 'base' Python 2.7.15 without the 'umap-learn' package, and 'py37' with the 'umap-learn' module), I followed the instructions for configuring the 'py37' Python environment for use by R [7].

The instructions suggest that 'reticulate' 'can bind to any of these Python versions' by one of several ways.

I decided to use the use_python("PATH_TO_PYTHON") 'reticulate' function.

> library(reticulate)
> use_python("D:/DEVTOOLS/Anaconda2/envs/py37")

> py_config()
python:         D:\DEVTOOLS\ANACON~1\python.exe
libpython:      D:/DEVTOOLS/ANACON~1/python27.dll
pythonhome:     D:\DEVTOOLS\ANACON~1
version:        2.7.15 |Anaconda, Inc.| (default, Dec 10 2018, 21:57:18) [MSC v.1500 64 bit (AMD64)]
Architecture:   64bit
numpy:          D:\DEVTOOLS\ANACON~1\lib\site-packages\numpy
numpy_version:  1.15.4

python versions found: 
 D:/DEVTOOLS/Anaconda2/envs/py37/python.exe
 D:\DEVTOOLS\ANACON~1\python.exe
 D:\DEVTOOLS\Anaconda2\python.exe
 D:\DEVTOOLS\Anaconda2\envs\py37\python.exe
> 

But wait, py_config reports Python version is 2.7.15, not what I expected. What's happening here?

It seems that despite the use_python() directive, 'reticulate' still binds to the older version of Python (which I don't want, since it does not include the 'umap-learn' module).

After none of the suggested 'reticulate' functions that dynamically define a Python environment to use worked, I decided to specify the Python location from a system environment variable.
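One detail that may matter here: the 'reticulate' documentation describes the argument of use_python() as the path to the Python binary, and says that the request is treated only as a hint unless required = TRUE is passed. The sketch below shows that variant. It was not part of the troubleshooting described in this post and is untested on that server; the python.exe location is taken from the 'python versions found' listing above.

```r
# Path to the interpreter itself, not to the environment folder.
# (Taken from the 'python versions found' listing in this post.)
py37_python <- "D:/DEVTOOLS/Anaconda2/envs/py37/python.exe"

# Guarded so that the sketch does nothing on a machine without this setup.
if (requireNamespace("reticulate", quietly = TRUE) && file.exists(py37_python)) {
  # Call this before anything else initializes Python in the session.
  # required = TRUE asks reticulate to fail with an error instead of
  # falling back to another Python installation.
  reticulate::use_python(py37_python, required = TRUE)
  print(reticulate::py_config())
}
```

If this binds correctly, py_config() should report the 3.7 interpreter; a fresh R session is still needed for each attempt.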

[Failed] Attempt 2: Using 'umap-learn' from R

Add a new Windows system environment variable:
RETICULATE_PYTHON="D:\DEVTOOLS\Anaconda2\envs\py37\python.exe"

Start a new R session. Apparently a session can bind to only one Python environment, and the binding cannot be reset, so always test by starting a new R console session.

> library(reticulate)
> py_discover_config()
python:         D:\DEVTOOLS\Anaconda2\envs\py37\python.exe
libpython:      D:/DEVTOOLS/Anaconda2/envs/py37/python37.dll
pythonhome:     D:\DEVTOOLS\ANACON~1\envs\py37
version:        3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Architecture:   64bit
numpy:           [NOT FOUND]

NOTE: Python version was forced by RETICULATE_PYTHON
> import("umap-learn")
Error in py_module_import(module, convert = convert) : 
  ModuleNotFoundError: No module named 'umap-learn'
> 

So, now I forced 'reticulate' to use the version of Python I wanted, but it still can't find the module!

[Success] What is your real name, 'umap-learn'?

More investigation revealed that although the module appears in 'conda list' under the name 'umap-learn', it is really named 'umap' (in the envs\py37\Lib\site-packages folder), so the import fails!

Let's retry importing with the 'umap' name.

> import("umap")
Module(umap)
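
The mismatch has a Python-side explanation: the name a package is installed under (here 'umap-learn') is independent of the name of the module it provides, and a hyphen cannot appear in an importable module name at all. The sketch below illustrates this with a rough, ASCII-only check. is_importable_name is a hypothetical helper written for this illustration, not a 'reticulate' function, and the guarded call at the end assumes RETICULATE_PYTHON is set as in Attempt 2.

```r
# Rough, ASCII-only test of whether a string could be a Python module name
# (dot-separated identifiers). Hypothetical helper, for illustration only.
is_importable_name <- function(name) {
  grepl("^[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)*$", name)
}

is_importable_name("umap-learn")  # FALSE: the hyphen rules it out
is_importable_name("umap")        # TRUE

# With reticulate installed and RETICULATE_PYTHON set, Python itself can be
# asked whether the module is available before calling import().
if (requireNamespace("reticulate", quietly = TRUE) &&
    nzchar(Sys.getenv("RETICULATE_PYTHON"))) {
  print(reticulate::py_module_available("umap"))
}
```

So 'conda list' shows the installed package name, while import() needs the module name found in site-packages.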

[Failed] Attempt 3: Using 'umap-learn' from R (NumPy version?)

So at this point I'm finally able to import the desired module. Let's try to use it.

> redData=read.csv('http://localhost:8080/job/UTIL_CSVDATA_QUERY/39/artifact/customQuery_1548177063100.csv')
> library(flipDimensionReduction)
> tSNE_p30=flipDimensionReduction::DimensionReductionScatterplot("t-SNE", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)
> umap_p30=flipDimensionReduction::DimensionReductionScatterplot("UMAP", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  Evaluation error: Required version of NumPy not available: installation of Numpy >= 1.6 not found.

As of this writing, NumPy is at version 1.15.4 and I have the most recent version. What is this?
Well, after some more investigation and assistance from StackOverflow, it turns out that this is another unresolved issue with 'reticulate' [8].

The suggestion was to downgrade NumPy, but this did not work for me. What worked was the suggestion from GitHub: https://github.com/rstudio/reticulate/issues/367

The suggested solution is to add the Anaconda `libraries\bin` directory to the path prior to initializing Python.

[Success] Attempt 4: Initializing Python with path to libraries


#now some environment hacks to get around known issues with 'reticulate'
# see https://github.com/rstudio/reticulate/issues/367
> Sys.setenv(PATH= paste("D:/DEVTOOLS/Anaconda2/envs/py37/Library/bin",Sys.getenv()["PATH"],sep=";"))
> library(reticulate)
> import("umap")
Module(umap)
> py_config()
python:         D:\DEVTOOLS\Anaconda2\envs\py37\python.exe
libpython:      D:/DEVTOOLS/Anaconda2/envs/py37/python37.dll
pythonhome:     D:\DEVTOOLS\ANACON~1\envs\py37
version:        3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Architecture:   64bit
numpy:          D:\DEVTOOLS\ANACON~1\envs\py37\lib\site-packages\numpy
numpy_version:  1.15.1
umap:           D:\DEVTOOLS\ANACON~1\envs\py37\lib\site-packages\umap\__init__.p

NOTE: Python version was forced by RETICULATE_PYTHON

> library(flipDimensionReduction)

> redData= read.csv('http://localhost:8080/job/UTIL_CSVDATA_QUERY/39/artifact/customQuery_1548177063100.csv')   
> tSNE_p30=flipDimensionReduction::DimensionReductionScatterplot("t-SNE", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)  

> ##To use the UMAP algorithm simply replace 't-SNE' with 'UMAP'  
> umap_p30=flipDimensionReduction::DimensionReductionScatterplot("UMAP", redData, data.groups = redData$LABEL, perplexity = 30, seed = 3456)  
> umap_p30

This last command uses the print function of the umap_p30 object to create an interactive UMAP web page.

UMAP Interactive
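
Since the goal was to run this from the command line rather than from within RStudio, the two settings that made the difference above can also be made at the top of the script itself. This is a minimal sketch, assuming the paths used in this post. Setting RETICULATE_PYTHON with Sys.setenv() affects only the current R session, and both variables have to be set before Python is initialized.

```r
# Top of a script intended to be run with Rscript.
# Paths are the ones used in this post; adjust for another installation.
env_root <- "D:/DEVTOOLS/Anaconda2/envs/py37"

# Put the environment's Library/bin folder first on PATH
# (workaround from https://github.com/rstudio/reticulate/issues/367).
lib_bin <- file.path(env_root, "Library", "bin")
Sys.setenv(PATH = paste(lib_bin, Sys.getenv("PATH"), sep = .Platform$path.sep))

# Point reticulate at the py37 interpreter for this session only.
Sys.setenv(RETICULATE_PYTHON = file.path(env_root, "python.exe"))

# Touch Python only after both variables are set. The guard keeps the
# sketch harmless on machines without this setup.
if (requireNamespace("reticulate", quietly = TRUE) &&
    file.exists(Sys.getenv("RETICULATE_PYTHON"))) {
  umap <- reticulate::import("umap")
  print(reticulate::py_config())
}
```

Keeping these settings in the script avoids depending on a system-wide environment variable, at the cost of hard-coding the environment path.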

'Out of the snake pit': What have I learned?

So, finally, I have the dimension reduction algorithm running in R with the UMAP option. And yes, it uses the 'reticulate' package to bind to the Python 3.7 environment and its installed modules.

But the experience of getting this to work was anything but smooth!
  • The 'reticulate' R package issues seem to have been known for several months but remain unresolved.
  • The 'reticulate' functions for binding to specific Python versions and environments, and for resolving library paths, did not work as expected. These seem fundamental to the usage of the package. Had they worked, the process would have been a lot smoother.
  • Some of the issues were complete red herrings, and the error messages were not helpful (see the NumPy version issue).
  • The issue of the actual 'umap' module name vs. the name used to install it is still unclear to me, but perhaps someone with more experience in Python can explain why.
  • Finally, the Anaconda Navigator hung and never recovered. I think I will stick with the Anaconda command prompt.

In the end, I've put these notes together in the hope that they'll help someone else set up and troubleshoot this important functionality of 'reticulate' in less time than it took me.

References

  1. RStudio 1.2 Preview: Reticulated Python: https://blog.rstudio.com/2018/10/09/rstudio-1-2-preview-reticulated-python/
  2. flipDimensionReduction by Displayr: https://rdrr.io/github/Displayr/flipDimensionReduction/
  3. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction: https://arxiv.org/abs/1802.03426
  4. Continuous Integration of life-science data workflows with Jenkins: https://github.com/Novartis/Jenkins-LSCI
  5. Updating Anaconda to Python 3.6: https://support.anaconda.com/customer/en/portal/articles/2797011-updating-anaconda-to-python-3-6
  6. Updating Anaconda: https://stackoverflow.com/questions/45197777/how-do-i-update-anaconda
  7. Python Version Configuration: https://rstudio.github.io/reticulate/articles/versions.html
  8. Using Python in R with reticulate package - Numpy not found: https://stackoverflow.com/questions/54069769/using-python-in-r-with-reticulate-package-numpy-not-found

Wednesday, January 2, 2019

External Libraries for Active Choices

Motivation

The question 'How can I use jar-X or library-X in my Active Choices Groovy script?' is frequently asked. Using external Java libraries in Groovy is one of the most useful features of the language, so we need to explore how to make external libraries easily accessible to the Groovy scripts used to generate Active Choices parameters.

A Note on Jenkins Security

Groovy script execution in Jenkins is increasingly coming under scrutiny by the Jenkins security team. Several things that were easy to do with Groovy in Jenkins are now restricted, or next to impossible, due to security restrictions. As a result, some of the recommendations below may or may not work in future Jenkins versions and with future upgrades to the various plugins.

In later versions of Jenkins (v2.361.x and perhaps others), the approaches described below for v2.222.x have been blocked by security and JDK 11 requirements. I will post any new information I find, but for the time being, keep this limitation in mind if you are trying to use external libraries with Active Choices in more recent versions of Jenkins.

External Libraries for Active Choices (Jenkins v2.222.x)


Options for Jenkins v2.222.x and earlier

There are at least three different ways to include external Java/Groovy libraries in the classpath of the Active Choices script.

  1. Configure an 'Additional Classpath' in the Active Choices Parameter Groovy script.
    • Place the required library in the classpath folder on the Jenkins server. You can configure the additional classpath using tokenized variables accessible to the Active Choices script.
    • I frequently place external libraries in a dedicated folder under the JENKINS_HOME/userContent folder. For example, a classpath to the H2 Java database jar can be configured as $JENKINS_HOME/userContent/lib/h2-1.3.176.jar
    • Note that additional classpaths seem to be discouraged in the latest Groovy Plugin. See https://issues.jenkins-ci.org/browse/JENKINS-43844
  2. Use Grape, the JAR dependency manager embedded in Groovy. The @Grab Groovy annotation dynamically fetches the required Java library.
  3. Place the required library in an external Java libraries folder. Java (and Groovy) include these folders in the classpath by default.
    • In the Jenkins Groovy Console, execute println System.getProperty("java.ext.dirs") to review which folders are used for external libraries.
    • The path to all external Java folders can also be discovered by examining the 'java.ext.dirs' property in the System Information link on the 'Manage Jenkins' page.
    • Placing the required jar in one of the available java.ext.dirs folders should work well for most cases and should be considered secure, since you'll need admin access to copy the jars to the appropriate location and restart the Jenkins server for these changes to take effect.

In Conclusion

As always, there are multiple ways to achieve this programming requirement. Hopefully, one of these works for you! I will be happy to hear of other alternatives that you may discover or have used. Please leave them in your comments and I can incorporate them in the blog entry.

References


  1. Jenkins Active Choices Plugin
  2. Grape dependency manager in Groovy
  3. BioUno: Jenkins and DevOps Tools for Life Sciences