Friday, November 11, 2011

Gradle and Update of SNAPSHOT Dependencies

We use Gradle as the main build tool for our Java projects. Builds are scheduled by our Hudson CI server and artifacts are published with Gradle to our Artifactory server. Artifactory is also used for dependency resolution and referenced as a Maven repository within our Gradle setup.

The most annoying problem in this particular setup is that SNAPSHOT dependencies are not updated properly for dependent projects. For example, if project B depends on project A and project A is currently in a SNAPSHOT state (i.e., the version entry in its build.gradle looks something like '2.1.3-SNAPSHOT'), then project B does not always get the latest artifact version of project A. Entries on the Gradle user mailing list, numerous blog posts, and some JIRA tickets show that other people are facing similar problems, but there seems to be no common solution: each proposed fix depends on a very specific build environment and cannot simply be adopted in other setups.

It is not clear to me whether this is (in our setup) a Gradle problem, a general Ivy bug, a problem with the Ivy Maven adapter, or a misconfiguration of our Artifactory server. Recently, I tried several different approaches and found one in a comment on GRADLE-629 that actually works for us: when configuring the dependencies of a project, explicitly setting the changing property for all dependencies in SNAPSHOT state resolves our update problems. For example, project B from above would configure the project A dependency like this:

dependencies {
  compile('my.fancy:project-a:2.1.3-SNAPSHOT') { changing = true }
}
Whenever a new version of project A is available in Artifactory, this latest version is downloaded when Gradle tasks are executed for project B.
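
A more general variant of this workaround is to flag every external SNAPSHOT dependency as changing in one place instead of per dependency. The following snippet is only a minimal sketch written against the Gradle dependency API, not something we have battle-tested in our build; the explicit per-dependency flag above is what we actually verified:

configurations.all {
  // Sketch: mark every external '-SNAPSHOT' dependency of this configuration as changing,
  // so Gradle re-checks the repository instead of sticking to a cached artifact.
  // Project dependencies have no 'changing' flag and are skipped.
  dependencies.all { dep ->
    if (dep instanceof ExternalModuleDependency && dep.version?.endsWith('-SNAPSHOT')) {
      dep.changing = true
    }
  }
}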

Although this approach works, it still feels more like a workaround than a solution. If someone knows the real cause of the problem and a better solution, please feel free to comment.

Thursday, November 10, 2011

MultipleOutputFormat and File Handle Limitations

Recently, we used Hadoop for a heavy batch processing job. The job itself was nothing special; in fact, the very same job runs on a daily basis to process some sort of data incrementally, and its instances had run fine for several months. Now we wanted to process several months' worth of data at once, and all of a sudden the processing job died with nasty (and somewhat misleading) exceptions. The reduce task logs were filled with lots of stack traces like
java.io.EOFException
 at java.io.DataInputStream.readByte(DataInputStream.java:250)
 at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
 at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
 at org.apache.hadoop.io.Text.readString(Text.java:400)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2901)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
This stack trace alone did not provide a lot of information about the real cause. Together with the corresponding name node logs, however, it became clear that the reduce processes could not open new files for writing their output. After realizing that the job uses org.apache.hadoop.mapred.lib.MultipleOutputFormat for output writing, the reason for the failing jobs became clear: file handle limitations. The only question was: which ones? Both the Linux OS and Hadoop's HDFS impose such limits. To make a long story short, we had to increase both of them.
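
For illustration, this is roughly the usage pattern that gets you into this situation (a simplified, hypothetical example, not our actual job code): an output format that derives the output file name from the key. With many distinct keys, every reduce task keeps one open HDFS output stream per generated file name.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Hypothetical example: write the records for each key into a separate file.
// Each distinct key makes the reduce task open another HDFS output stream,
// so the number of open file handles grows with the number of keys.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // 'name' is the default leaf name (e.g. part-00000); prefix it with the key
    // so records for different keys end up in different files.
    return key.toString() + "/" + name;
  }
}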

Linux and Open File Limits

Linux limits the number of files that may be open in parallel on a per-process basis. A given user may only start a certain number of processes in parallel (type ulimit -u to see your own limit), and each of these processes may only have a certain number of files open in parallel (ulimit -n). The default value for open files is 1024 (at least in Debian/Ubuntu-flavored distributions). To get around our problem from above, we increased this limit for the user running our Hadoop cluster by editing /etc/security/limits.conf. To make the machine recognize the new limits, it is necessary to log out and back in again. In our case, however, we do not log in as the hadoop user directly, but switch to it via the su command. Thus, no new login shell is started and the changed limits would not be picked up. In his blog post, Armin describes how to edit /etc/pam.d/su for this scenario.
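
For reference, the relevant entries in /etc/security/limits.conf look roughly like the following (the user name hadoop and the value 32768 are examples; choose a value that fits your cluster):

hadoop  soft  nofile  32768
hadoop  hard  nofile  32768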

Hadoop and Open File Limits

Hadoop (we use version 0.20.2) has configuration parameters for almost everything. The one that limits the number of files a datanode can serve in parallel is named dfs.datanode.max.xcievers (the misspelling is part of the property name). Unfortunately, if not set otherwise, datanodes start up with only 256 of these handles (see the org.apache.hadoop.hdfs.server.datanode.DataXceiverServer class for details). The HBase documentation on the xcievers parameter recommends a value of 4096. As stated before, this parameter is only evaluated at datanode startup time. It is therefore necessary to set it in the conf/hdfs-site.xml file on each datanode and restart the cluster afterwards.
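
The corresponding entry in conf/hdfs-site.xml looks like this (with the value of 4096 recommended by the HBase documentation):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>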

Summary

We had to increase both the OS-specific limits and the limits in the Hadoop configuration; neither of them alone was sufficient. To accomplish the configuration changes, we had to update the settings on each datanode machine and restart the cluster. Afterwards, the exceptions shown above only appeared on cluster nodes that had not been updated properly.