Tuesday, September 20, 2011

Shutdown MiniDFSCluster and MiniMRCluster Takes Forever

Writing unit tests for Hadoop applications does not need to be more complicated than writing tests for any other Java application.
Usually, I use the following procedure for testing my Hadoop code:
  1. Testing the classes as real units.
  2. Each Mapper and each Reducer deserves to be tested in isolation. Spock, with its awesome support for mocking and stubbing, is a great tool for testing units in isolation. With a few third-party dependencies it is even possible to mock final classes, bypass their existing constructors, and inject the mocks into private fields of the units under test => isolation at its best (more on this in another post).
    Furthermore, there might be other units like custom Writables, InputFormat and OutputFormat implementations and classes containing the business logic to convert data, and so on. I try to write a Specification (the Spock counterpart of a TestCase) for each non-trivial class.
    This should reveal most of the nasty little bugs contained in the business logic of the application. Small side note: I know there is the MRUnit test framework provided by Cloudera. Personally, I found Spock to be more powerful and flexible, but as always, this is merely a matter of taste.
  3. Next step is to execute complete job roundtrips using the local Hadoop mode, again with Spock. If the application features a command line interface, I set up a simple Specification which executes the main() method of the main driver class and compares some expected output files against the actual output files.
  4. Being able to execute jobs in local mode is great, because it is a fast way to run black-box tests against the MapReduce framework. However, since the local mode is somewhat limited, it might be necessary to use a real cluster. For example, in local mode it is not possible to run more than one reducer, which limits the testing capabilities of custom partitioning and grouping code.
  5. To execute m/r code on a real cluster, I often use the hadoop-test project (include it with testCompile 'org.apache.hadoop:hadoop-test:0.20.2' in the build.gradle file). This project features implementations of both a distributed filesystem (org.apache.hadoop.hdfs.MiniDFSCluster) and an m/r cluster (org.apache.hadoop.mapred.MiniMRCluster). The most annoying thing about these clusters is the very long startup time, but hey, they're distributed. The remainder of this post is about using these clusters.
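As a sketch of the isolation idea from step 2 above: extracting the business logic into a plain class with an interface collaborator makes it trivially testable with a Spock mock. The class and interface names below are made up for illustration and are not part of any real project:

```groovy
import spock.lang.Specification

// Hypothetical collaborator, standing in for Hadoop's output collector.
interface OutputCollector {
    void collect(String key, int value)
}

// Hypothetical unit under test: word-count logic extracted from a Mapper.
class WordCountLogic {
    void process(String line, OutputCollector collector) {
        line.split(/\s+/).each { word -> collector.collect(word, 1) }
    }
}

class WordCountLogicSpec extends Specification {

    def "emits one count per word occurrence"() {
        given:
        def collector = Mock(OutputCollector)
        def logic = new WordCountLogic()

        when:
        logic.process('to be or not to be', collector)

        then:
        2 * collector.collect('to', 1)
        2 * collector.collect('be', 1)
        1 * collector.collect('or', 1)
        1 * collector.collect('not', 1)
    }
}
```

The interaction counts in the then: block are where Spock's mocking support shines: the expected calls double as the assertions.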
Because of the long startup time, if there are several different Hadoop jobs to test (or a single job with several different configurations), one should consider putting the cluster management code into the static setup methods of the test framework.
static MiniMRCluster mrCluster
static MiniDFSCluster dfsCluster

def setupSpec() {
	def conf = new JobConf()
	if (System.getProperty("hadoop.log.dir") == null) {
		System.setProperty("hadoop.log.dir", "/tmp");
	}
		
	dfsCluster = new MiniDFSCluster(conf, 2, true, null)
	mrCluster = new MiniMRCluster(2, dfsCluster.getFileSystem().getUri().toString(), 1)
		
	def hdfs = dfsCluster.getFileSystem()
	hdfs.delete(new Path('/main-testdata'), true)
	hdfs.delete(new Path('/user'), true)
	FileUtil.copy(inputData, hdfs, new Path('main-testdata'), false, conf)
}
Notes:
  • setupSpec() is the Spock equivalent of JUnit 4's @BeforeClass
  • If the hadoop.log.dir system property is not set, the m/r cluster will not start up
  • The clusters are configured to use 2 slave nodes each
  • After successful filesystem startup, it is possible to perform the usual filesystem operations with it
Teardown and cleanup is similarly performed in the static context:
def cleanupSpec() {
	mrCluster?.shutdown()
	dfsCluster?.getFileSystem().delete(new Path('/main-testdata'), true)
	dfsCluster?.getFileSystem().delete(new Path('/user'), true)
	dfsCluster?.shutdown()
}
However, there is a really annoying problem with this code: it takes forever. I don't know why (presumably I'm doing something wrong), but the shutdown procedure is blocked by several data integrity checks which themselves take a long time. Since the generated data is garbage and of zero relevance once the tests have completed, I'd like to get rid of these checks, but I really cannot figure out how. The logs get s-l-o-w-l-y filled with lines like
11/09/19 23:47:06 INFO datanode.DataBlockScanner: Verification succeeded for blk_3329401068442722923_1001
11/09/19 23:47:12 INFO datanode.DataBlockScanner: Verification succeeded for blk_654270692326292497_1008
11/09/19 23:47:39 INFO datanode.DataBlockScanner: Verification succeeded for blk_-3673127094860948561_1006
and the single test, and thus the whole test suite, takes several minutes to complete, which is fatal for any continuous integration system. However, there is a workaround which is neither obvious nor very nice, but it works: wrapping the shutdown procedure in another thread and joining it with a timeout. Simply modify the code from above into
def cleanupSpec() {
	Thread.start { 
		mrCluster?.shutdown()
		dfsCluster?.getFileSystem().delete(new Path('/main-testdata'), true)
		dfsCluster?.getFileSystem().delete(new Path('/user'), true)
		dfsCluster?.shutdown()
	}.join(5000)
}
With this modification, I could no longer spot any blocking behavior. If someone has a better idea for avoiding the blocking shutdown, please feel free to comment.

Thursday, September 15, 2011

Hosting a Maven Repository on Github for Gradle Builds

Gradle is a great tool for organizing the build logic of software projects. Gradle is able to handle all the dependency management necessary even for very small projects. Under the hood, it relies on Ivy and can deal with Maven repositories.
The question is where to upload my own artifacts so that they are accessible as dependencies for other projects. As always, there are several alternatives to choose from. A small excerpt:
  1. Use one of the free Maven repository hosting services as provided by Sonatype for example
  2. Set up a server of one's own with Artifactory, Nexus, or something similar (although plain HTTP, FTP, SSH, ... would suffice as well)
  3. Create a new Git project and publish it to Github
One benefit of the Sonatype solution is the automated repository synchronization with the Maven Central Repository. However, since the projects have to fulfill a certain set of requirements and I do not need to find my pet projects at Maven Central, this is not my preferred solution.
The company I work for uses Artifactory set up on a dedicated server which is running 24/7 and plays nicely with the continuous integration system. Again, this seems a bit overkill for my private projects.
Since I'm using Git already and have a Github account featuring a very small number of projects, the third alternative seems very appealing to me. This solution requires three basic steps:
  1. Set up and synchronize Git repositories
  2. Upload project artifacts to local Git repository and synchronize it with Github
  3. Configure the Github project as mavenRepo within Gradle build files for depending projects

Step 1: Create a Git Project and Synchronize with Github

For Github project creation I follow the Github Repository Creation Guide. At first I create a repository on Github using the web interface.

Afterwards, I set up a local repository and synchronize it with Github using the following steps:
$ mkdir maven-repository
$ cd maven-repository
$ git init
Initialized empty Git repository in /home/thevis/GIT/maven-repository/.git/
$ touch README.md
$ git add README.md
$ git commit -m 'first commit'
[master (root-commit) eeee5a8] first commit
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 README.md
$ git remote add origin git@github.com:tthevis/maven-repository.git
$ git push -u origin master
Counting objects: 3, done.
Writing objects: 100% (3/3), 209 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To git@github.com:tthevis/maven-repository.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.
$ 
A quick check at Github shows that README.md has actually been pushed and that repository synchronization works fine.

Step 2: Upload Project Artifacts to Local Git Repository

My intention is to establish the following release workflow for my projects: make and upload a release build using gradle clean build uploadArchives, manually check the results in the local Git directory, add them to the Git repository, and finally push them to Github. Well, I know that software releases should not involve manual steps and should be performed on dedicated machines running build server software, and so on and so on...
However, since I'm the only contributor for my projects and the release cycles are somewhat unsettled, this procedure is fairly sufficient for my needs.
Therefore, I change to the project to be released and add the following lines to its build.gradle file:
apply plugin: 'maven'

uploadArchives {
	repositories.mavenDeployer {
		repository(url: "file:///home/thevis/GIT/maven-repository/")
	}
}
Executing the release procedure described above yields the following result:
$ gradle clean build uploadArchives
[...suppressed a few output lines here...]
:build
:uploadArchives
Uploading: net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar to repository remote at file:///home/thevis/GIT/maven-repository/
Transferring 5239K from remote
Uploaded 5239K

BUILD SUCCESSFUL

Total time: 24.443 secs
Just to be sure, I check the result by hand:
$ find /home/thevis/GIT/maven-repository/ -name "*groovy-hadoop*"
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.md5
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.sha1
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.sha1
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.md5
Looks good, so I add the files to Git and push them to Github:
$ cd /home/thevis/GIT/maven-repository/
$ git add .
$ git commit -m "released groovy-hadoop-0.2.0"
[master 5026f70] released groovy-hadoop-0.2.0
 9 files changed, 40 insertions(+), 0 deletions(-)
 create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar
 create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.md5
 create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.sha1
 create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom
 create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.md5
 create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.sha1
 create mode 100644 net/thevis/hadoop/groovy-hadoop/maven-metadata.xml
 create mode 100644 net/thevis/hadoop/groovy-hadoop/maven-metadata.xml.md5
 create mode 100644 net/thevis/hadoop/groovy-hadoop/maven-metadata.xml.sha1
$ git push origin master
Counting objects: 17, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (16/16), 5.12 MiB | 98 KiB/s, done.
Total 16 (delta 0), reused 0 (delta 0)
To git@github.com:tthevis/maven-repository.git
   eeee5a8..5026f70  master -> master

Step 3: Use Custom Repository for Depending Projects

There is only one potential pitfall here: determining the URL scheme for downloading files from Github. The page for the file groovy-hadoop-0.2.0.jar reveals the real artifact URL if one hovers the mouse pointer over the raw link:
https://github.com/tthevis/maven-repository/raw/master/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar. Thus, the Maven repository URL to configure is not https://github.com/tthevis/maven-repository but rather https://github.com/tthevis/maven-repository/raw/master/.
With this finding in mind, the rest is pretty straightforward. For testing purposes I set up the following build.gradle:
apply plugin: 'java'
apply plugin: 'eclipse'

version = '0.1.0'
group = 'net.thevis.hadoop'
sourceCompatibility = '1.6'

repositories { 
	mavenRepo urls: 'https://github.com/tthevis/maven-repository/raw/master/'
} 

dependencies {
	compile 'net.thevis.hadoop:groovy-hadoop:0.2.0'
}
And guess what? It works like a charm.

Thursday, September 8, 2011

Spock in a Gradle-Powered Groovy Project

Spock. Is. Great.
One can do wonderful things with Spock, at least when it comes to testing software. One of the fun things is that Spock can be used for Groovy, for Java, and obviously for mixed projects as well. I'll write about use cases and examples in another post. This one is about setting up Spock for a Gradle-powered Groovy project.

Basic Build

Spock relies heavily on Groovy itself, so the desired Spock version has to match the Groovy dependency for the project.
I tried it with Groovy-1.8.1 and Spock-0.5-groovy-1.8.
Excerpt from the build.gradle file:
apply plugin: 'groovy'
apply plugin: 'eclipse'

repositories { 
  mavenCentral()
} 

dependencies {
  groovy 'org.codehaus.groovy:groovy-all:1.8.1'
  testCompile 'org.spockframework:spock-core:0.5-groovy-1.8'
}
However, gradle eclipse fails with an unresolved dependency:
:eclipseClasspath
:: problems summary ::
:::: WARNINGS
		module not found: org.codehaus.groovy#groovy-all;1.8.0-beta-3-SNAPSHOT
[...]
FAILURE: Build failed with an exception.

* Where:
Build file '/home/thevis/spock-test/build.gradle'

* What went wrong:
Execution failed for task ':eclipseClasspath'.
Cause: Could not resolve all dependencies for configuration 'detachedConfiguration1':
    - unresolved dependency: org.codehaus.groovy#groovy-all;1.8.0-beta-3-SNAPSHOT: not found


* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 1.431 secs
What does this mean? gradle dependencies does not show any peculiarities, so I really don't know.
Fortunately, although kind of annoying, this is not really a problem for the project. Since the project is a Groovy project already, it is possible to exclude all the transitive Groovy dependencies introduced by Spock. One possibility is to change the Spock dependency in the build file to:
  testCompile ('org.spockframework:spock-core:0.5-groovy-1.8') {
    transitive = false
  }
Alternatively, one could exclude just the single missing dependency explicitly (Gradle expects group and module coordinates here):
  testCompile ('org.spockframework:spock-core:0.5-groovy-1.8') {
    exclude group: 'org.codehaus.groovy', module: 'groovy-all'
  }
Either way gradle eclipse will succeed.

Adding Optional Features

If you want to make use of Spock's mocking and stubbing support (and I'm sure you do), the basic configuration from above is kind of limited, since it only allows mocking of interfaces. Spock also lets you mock and stub classes and even bypass the standard object construction. For these purposes, Spock depends on both cglib-nodep and objenesis, but declares these dependencies as optional. Thus, we have to declare them ourselves:
  testCompile 'cglib:cglib-nodep:2.2'
  testCompile 'org.objenesis:objenesis:1.2'
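To see why these two dependencies matter, here is a minimal sketch of mocking a concrete class; the PriceService class is made up for illustration. Without cglib-nodep and objenesis on the test classpath, Mock(PriceService) would fail, because PriceService is a class rather than an interface:

```groovy
import spock.lang.Specification

// A concrete (non-interface) collaborator, made up for illustration.
// Mocking it requires cglib-nodep and objenesis on the test classpath.
class PriceService {
    BigDecimal lookup(String article) {
        throw new UnsupportedOperationException('real service not available in tests')
    }
}

class PriceServiceSpec extends Specification {

    def "stubs a concrete class without invoking its real behavior"() {
        given:
        // cglib generates the subclass proxy, objenesis instantiates it
        // without calling a constructor
        def service = Mock(PriceService)
        service.lookup('apple') >> 42.0G

        expect:
        service.lookup('apple') == 42.0G
    }
}
```

Note that the stubbed call never reaches the real lookup() method, which would otherwise throw.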

Complete Build File

Finally, here is the build.gradle in its full glory, providing full mock and stub support with Spock:
apply plugin: 'groovy'
apply plugin: 'eclipse'

version = '0.1.0-SNAPSHOT'
sourceCompatibility = '1.6'

repositories { 
  mavenCentral()
} 

dependencies {
  groovy 'org.codehaus.groovy:groovy-all:1.8.1'

  testCompile ('org.spockframework:spock-core:0.5-groovy-1.8') {
    transitive = false
  }
  testCompile 'cglib:cglib-nodep:2.2'
  testCompile 'org.objenesis:objenesis:1.2'
  testCompile 'junit:junit:4.7'
}
Happy specifying!