Redux: How do I get Jython to use Python modules stored in Lib within its own jar file when running in Hadoop?
Asked 07 September, 2021

I'm attempting to use Jython for an implementation within Hadoop 1.2.1. I have seen strikingly little about Jython+Hadoop other than stale projects (like code.google.com/p/happy), and a stale implementation in $HADOOP_HOME/src/examples/python/WordCount.py, so perhaps I'm barking up the wrong tree to begin with... but this seems reasonable and possible. I am also very aware of Hadoop Streaming, with which I can use Python in Hadoop without using Jython, but that's not what I'm trying to do here.

Basically, when I invoke the embedded/standalone Jython jar file using java -jar /full/path/to/myjythonjar.jar, the /full/path/to/myjythonjar.jar/Lib is in my Python sys.path, but when I invoke using bin/hadoop jar /full/path/to/myjythonjar.jar input output the ...jar/Lib is not in my path, and the script can't find the Python modules I'm referencing.

Here's what I'm doing...

I'm using the standalone version of the Jython jar, and using the JarRunner interface, roughly as described on SO here and other places; essentially as follows:

 cp jython-standalone-2.7-b1.jar jythonsalib_test.jar  
 jar ufe jythonsalib_test.jar org.python.util.JarRunner __run__.py  

That is, take a copy of the standalone jar, add my script with name __run__.py, and change the Manifest to execute JarRunner -- many thanks to @Frank Wierzbicki for that gem.

This all works fine when I'm running directly as, e.g.,

java -jar jythonsalib_test.jar

My sys.path reports that it includes '/full/path/to/jar/file/jythonsalib_test.jar/Lib', which is exactly what I expect, and that is the path the Python modules are loaded from (empirically tested: setting sys.path to an empty list fails, and setting it to ONLY that path works).
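For anyone who wants to repeat that empirical check, here is a minimal sketch of it. The helper name importable_with_path is made up for illustration, and it only tests modules found through sys.path (built-in modules import regardless of what sys.path contains).

```python
import sys

def importable_with_path(module_name, entries):
    """Return True if module_name imports when sys.path is exactly `entries`.

    Hypothetical helper: sys.path and sys.modules are restored afterwards.
    """
    saved_path = list(sys.path)
    # Forget any cached copy so the import really goes through sys.path.
    saved_module = sys.modules.pop(module_name, None)
    sys.path[:] = list(entries)
    try:
        try:
            __import__(module_name)
            return True
        except ImportError:
            return False
    finally:
        sys.path[:] = saved_path
        # Discard whatever the probe imported, then restore the original.
        sys.modules.pop(module_name, None)
        if saved_module is not None:
            sys.modules[module_name] = saved_module
```

Calling it once with an empty list and once with only the .../jythonsalib_test.jar/Lib entry, for a pure-Python module that lives in Lib, reproduces the fails/works result described above.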

When I run this same jar in Hadoop, e.g., as

bin/hadoop jar /full/path/to/jar/file/jythonsalib_test.jar input output

sys.path only includes

['__classpath__', '__pyclasspath__']

I've also tried the Jython standalone jar versions 2.5.4-rc1 (which shows the same behavior described above) and 2.5.3 (which doesn't work for me for unrelated reasons).

As pointed out in other SO answers, the workaround I'm currently using is to add the Lib directory of my jar to sys.path directly, inside the Jython script, like

import sys
sys.path.append('/full/path/to/jar/file/jythonsalib_test.jar/Lib')
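A variation on this workaround that avoids the hard-coded string, sketched under assumptions: take a location reported at runtime (a plain path, a file: URL, or a jar:file:...!/... URL) and turn it into the matching <jar>/Lib entry. The helper names lib_entry_from_location and add_jar_lib are hypothetical. Under Jython the location might come from the standard Java call getProtectionDomain().getCodeSource().getLocation() on a class loaded from the jar, but whether that yields a usable path under bin/hadoop jar is untested here.

```python
import sys

def lib_entry_from_location(location):
    """Turn a jar location into a '<jar>/Lib' sys.path entry.

    Hypothetical helper. Accepts plain paths, 'file:' URLs, and
    'jar:file:...!/...' URLs. URL-encoded characters are not decoded.
    """
    loc = str(location)
    if loc.startswith('jar:'):
        loc = loc[len('jar:'):]
    # Drop the jar-internal part after the '!' separator, if any.
    loc = loc.split('!', 1)[0]
    if loc.startswith('file:'):
        loc = loc[len('file:'):]
    return loc.rstrip('/') + '/Lib'

def add_jar_lib(location, path=None):
    """Append the jar's Lib entry to `path` (default sys.path) once."""
    if path is None:
        path = sys.path
    entry = lib_entry_from_location(location)
    if entry not in path:
        path.append(entry)
    return entry
```

For example, add_jar_lib('/full/path/to/jar/file/jythonsalib_test.jar') appends the same entry as the hard-coded line above. The open question is still where a trustworthy location comes from on each compute node.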

And this basically works -- but this is meant to be a distributed application! There is no fixed path on the compute nodes that I can reference this way. Other SO articles suggest various mechanisms, but they all basically add to the library path (no links, because I have <10 rep): from Python as above, from Java, or through the Jython installation or the Jython "registry" (startup/rc) files. Sure, I could use HDFS, bootstrapping, or some other mechanism to distribute the jar or Jython or whatever to the compute nodes, but the code is already in the jar! So I shouldn't need to distribute it again, separately...

So, in sum: it looks like I need to be on a filesystem that can directly, and separately, reference the jar file containing the Python modules (akin to the old java -jar jythonjar.jar -jar jythonjar.jar). How do I convince an embedded, standalone Jython jar to always use the Python modules within the Lib subdirectory of the jar file, without separately pointing to (potentially the same) jar file?

Or: how do I add a relative path link to the current jar file...? Or am I missing something more insidious and fundamental about Hadoop or Jython or Java or...?

I had a boatload more links, but SO tells me that I can only have TWO links because I'm new here. I hope some day to get enough rep to be able to truly contribute to this fantastic site! :)

Anyway. LTWFTW -- long time watcher, first time writer -- many thanks!

2 Answers