tag:blogger.com,1999:blog-80295077916381763472024-03-13T00:54:23.106+01:00Downright AmazedThings I discovered, learned, and try to remember.Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.comBlogger8125tag:blogger.com,1999:blog-8029507791638176347.post-61217989159965637212012-02-07T22:56:00.001+01:002012-02-07T22:56:46.199+01:00Configure Oozie's Launcher Job<p>
We use Oozie as the management application for some of our data processing pipelines. Although the Oozie developers have written a lot of documentation, several features and use cases are covered only minimally. How to configure the launcher job, for instance, is something I was only able to learn from the mailing lists.
</p>
<p>
The launcher job is used by Oozie to supervise some of its actions, e.g. java or mapreduce actions. The launcher job is executed as a Hadoop job with a single map task and zero reduce tasks.
In most cases we do not care much about the launcher. However, there are situations in which we would like to influence the execution of the launcher job. For example, we wanted to run a complete data processing pipeline with priority <code>VERY_HIGH</code>. Java and mapreduce Oozie actions provide a <code>configuration</code> element which can be populated with arbitrary Hadoop properties (with the exception of the namenode and jobtracker settings). However, Oozie applies these properties only to the <i>real</i> actions and not to the launcher job. To configure the launcher, one has to add an <code>oozie.launcher.</code> prefix to the corresponding Hadoop properties.
</p>
<p>
To prioritize the data processing pipeline via configuration parameters, we added the following XML blocks to the <code>configuration</code> elements of <b>all our</b> mapreduce and java actions:
<pre class="brush:xml">
<property>
    <name>oozie.launcher.mapred.job.priority</name>
    <value>${priority}</value>
</property>
<property>
    <name>mapred.job.priority</name>
    <value>${priority}</value>
</property>
</pre>
When starting the Oozie workflows, we provide properties files which contain a <code>priority</code> key set to the desired priority.
</p>
<p>
Other useful applications of the <code>oozie.launcher.</code> configuration prefix include
<ul>
<li>running the launcher job in a different queue than the workflow jobs themselves (<code>oozie.launcher.mapred.job.queue.name</code>, see <a href="https://issues.apache.org/jira/browse/OOZIE-9">OOZIE-9</a>) or</li>
<li>using special Java options, such as increased heap space settings, for java actions (<code>oozie.launcher.mapred.child.java.opts</code>)</li>
</ul>
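To put the prefix in context, here is a hedged sketch of a complete mapreduce action carrying such a configuration block (the element structure follows the Oozie workflow schema; the action and transition names such as <code>next-action</code> are made-up placeholders):

```xml
<action name="my-mapreduce-action">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- applied to the launcher job only -->
            <property>
                <name>oozie.launcher.mapred.job.priority</name>
                <value>${priority}</value>
            </property>
            <!-- applied to the actual map/reduce job -->
            <property>
                <name>mapred.job.priority</name>
                <value>${priority}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="next-action"/>
    <error to="fail"/>
</action>
```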
</p>Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com1tag:blogger.com,1999:blog-8029507791638176347.post-58890959394454026982012-02-02T00:05:00.001+01:002012-02-02T00:21:54.277+01:00Base64 Decoding with Eclipse<p>There are few things in software development as annoying as localization topics, especially dealing with dates in different time zones and, my all-time favorite, encoding issues.</p>
<p>
We have a lot of data and use a bunch of different technologies, languages, and platforms to process it. With regard to encoding, this does not help much either.
Someone in the company decided it would be a good idea to encode critical data, especially strings that are not under our direct control, with <i>Base64</i>. In this way, data exchange between different platforms and languages can be restricted to (relatively simple) ASCII data.</p>
<p>
And thus, we now have to deal a lot with Base64 encoded data. During creation of unit tests, debugging, or manual validation of production data, there is a frequent need to decode Base64 literals. Most of the time I used one of the many free online tools for this purpose. Although these tools do what they promise, the associated workflow is rather messy: step through a unit test in Eclipse, copy some Base64 string to the clipboard, switch to the browser, find and open one of these conversion tools (if not already open), convert the string, copy the result, and take it back into Eclipse.</p>
<p>However, after a little preparation, leaving Eclipse is completely unnecessary. There is an Eclipse feature called <i>External Tool Configurations</i> which allows executing arbitrary commands directly from Eclipse. On the other hand, there is Groovy with its famous <code>-e</code> option to execute code inline.
Combining these two, it is possible to execute some Groovy helper code directly from Eclipse. With the help of metaprogramming, Groovy extends Java's String class with several features, one of them a built-in Base64 decoding method. The remainder of this post describes how to configure a simple Base64 decoding tool in Eclipse.</p>
<p>
<ol>
<li>Open the <i>External Tool Configuration</i> dialog: <div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-9tTrfWHXpuo/Tym-s7KVioI/AAAAAAAAAB0/OHUBs1UllL8/s1600/Auswahl_033.png" imageanchor="1" style=""><img border="0" height="127" width="320" src="http://4.bp.blogspot.com/-9tTrfWHXpuo/Tym-s7KVioI/AAAAAAAAAB0/OHUBs1UllL8/s320/Auswahl_033.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-sYsgCLXKWgE/Tym-tae0_WI/AAAAAAAAACA/UunVqk3TADk/s1600/Men%25C3%25BC_032.png" imageanchor="1" style=""><img border="0" height="139" width="245" src="http://1.bp.blogspot.com/-sYsgCLXKWgE/Tym-tae0_WI/AAAAAAAAACA/UunVqk3TADk/s320/Men%25C3%25BC_032.png" /></a></div>
</li>
<li>Create a new configuration, give it a name, specify the path to the Groovy executable, and finally insert the code.<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-DqP3j3zMibM/Tym_rdNu6oI/AAAAAAAAACM/YIP81ymhYIQ/s1600/External%2BTools%2BConfigurations%2B_034.png" imageanchor="1" style=""><img border="0" height="295" width="320" src="http://3.bp.blogspot.com/-DqP3j3zMibM/Tym_rdNu6oI/AAAAAAAAACM/YIP81ymhYIQ/s320/External%2BTools%2BConfigurations%2B_034.png" /></a></div>
The <i>Arguments:</i> text area contains the following code:
<pre class="brush:groovy">
-e "def input = '${string_prompt:Base64 decoding}';
println new String(input.decodeBase64())"
</pre>
The Eclipse variable <code>${string_prompt}</code> makes a popup dialog appear which prompts for an input value.
</li>
<li>Save the configuration</li>
</ol>
</p>
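A quick sanity check of the new tool: paste a known literal and compare the result with the expected plain text. For example, <code>SGVsbG8=</code> decodes to <code>Hello</code>; the same check can be reproduced outside Eclipse with the GNU coreutils <code>base64</code> command:

```shell
# decode a known Base64 literal as a cross-check (GNU coreutils)
echo 'SGVsbG8=' | base64 -d
# prints: Hello
```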
<p>
Base64 decoding can now be performed as follows:
<ol>
<li>Select the newly created Tool<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-tFBv0Mesul8/TynCC8ioOII/AAAAAAAAACY/bAf0WSTGrXU/s1600/Auswahl_033.png" imageanchor="1" style=""><img border="0" height="127" width="320" src="http://3.bp.blogspot.com/-tFBv0Mesul8/TynCC8ioOII/AAAAAAAAACY/bAf0WSTGrXU/s320/Auswahl_033.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-z-7Ze5DSCEY/TynCC-rRX6I/AAAAAAAAACk/iy8I6Wam2VU/s1600/Men%25C3%25BC_035.png" imageanchor="1" style=""><img border="0" height="139" width="245" src="http://4.bp.blogspot.com/-z-7Ze5DSCEY/TynCC-rRX6I/AAAAAAAAACk/iy8I6Wam2VU/s320/Men%25C3%25BC_035.png" /></a></div>
</li>
<li>Insert the string to convert and start conversion
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-CeTWdMsEgBU/TynCWBfX1GI/AAAAAAAAACw/BJb0FyUnzTA/s1600/Variable%2Binput%2B_037.png" imageanchor="1" style=""><img border="0" height="126" width="320" src="http://3.bp.blogspot.com/-CeTWdMsEgBU/TynCWBfX1GI/AAAAAAAAACw/BJb0FyUnzTA/s320/Variable%2Binput%2B_037.png" /></a></div>
</li>
<li>Read the result from the <i>Console</i> view
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-TmjD6bzgtA4/TynCveCH_hI/AAAAAAAAAC8/GAZGivYla88/s1600/Auswahl_039.png" imageanchor="1" style=""><img border="0" height="119" width="320" src="http://4.bp.blogspot.com/-TmjD6bzgtA4/TynCveCH_hI/AAAAAAAAAC8/GAZGivYla88/s320/Auswahl_039.png" /></a></div>
</li>
</ol>
</p>Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com0tag:blogger.com,1999:blog-8029507791638176347.post-27062828646370970972011-11-11T23:25:00.001+01:002011-11-12T08:46:22.568+01:00Gradle and Update of SNAPSHOT Dependencies<p>We use Gradle as the main build tool for our Java projects.
Builds are scheduled by our Hudson CI server and artifacts are published with Gradle
to our Artifactory server. Artifactory is also used for dependency resolution and
referenced as a Maven repository within our Gradle setup.</p>
<p>
The most annoying problem in this special constellation is that SNAPSHOT dependencies are not properly
updated for depending projects.
For example, if project B depends on project A and project A is currently in a SNAPSHOT state
(i.e. the version entry in the <code>build.gradle</code> looks something like
<code>'2.1.3-SNAPSHOT'</code>), then project B would not always get the latest
artifact version of project A.
Entries on the Gradle user mailing list, numerous blog posts, and some JIRA tickets
show that other people are facing similar problems, but there seems to be no
common solution: each fix depends on a very specific build environment and cannot simply be adopted in other constellations.</p>
<p>
It is not clear to me whether this (in our constellation) is a Gradle problem, a general Ivy bug, a problem with the
Ivy Maven adapter, or a misconfiguration of our Artifactory server.
Recently, I tried several different solution approaches and found one in a <a href="http://issues.gradle.org/browse/GRADLE-629?focusedCommentId=12974&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12974">
comment of GRADLE-629</a> which is actually working for us.
When configuring a project's dependencies, explicitly setting the <code>changing</code> property for all
dependencies in SNAPSHOT state resolves
our update problems.
For example, project B from above would configure the project A dependency like
<pre class="brush:groovy">
dependencies {
    compile('my.fancy:project-a:2.1.3-SNAPSHOT') { changing = true }
}
</pre>
Whenever a new version of project A is available in Artifactory, this latest version is downloaded
when Gradle tasks are executed for project B.</p>
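For what it's worth, later Gradle versions expose the caching of changing modules directly through the resolution strategy; a sketch of that alternative, applicable only if your Gradle version already supports <code>resolutionStrategy</code>:

```groovy
// build.gradle: ask Gradle to re-check changing (SNAPSHOT) modules on every resolution
configurations.all {
    resolutionStrategy.cacheChangingModulesFor 0, 'seconds'
}
```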
<p>
Although this approach works, it still feels more like a workaround than a solution.
If someone knows the real problem cause and a better solution,
please feel free to comment.
</p>Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com0tag:blogger.com,1999:blog-8029507791638176347.post-78829314915411509752011-11-10T22:34:00.000+01:002011-11-11T09:21:13.040+01:00MultipleOutputFormat and File Handle LimitationsRecently, we used Hadoop for a heavy batch processing job. The job itself was not very special; in fact,
the very same job is run on a daily basis to process some sort of data incrementally. The job instances
had run fine for several months. Now we wanted to process the data of several months at once, and
all of a sudden the processing job died with nasty (and somewhat misleading) exceptions.
The reduce task logs were filled with lots of stacktraces like
<pre>
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.io.Text.readString(Text.java:400)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2901)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
</pre>
This stacktrace alone did not provide much information about the real cause of the exception.
However, together with the corresponding namenode logs, it became clear that the affected reduce
processes could not open new files for output writing.
After realizing that the job uses <code>org.apache.hadoop.mapred.lib.MultipleOutputFormat</code>
for output writing, the reason for the failing jobs became clear: file handle limitations.
The only question was: which ones? Both the Linux OS and Hadoop's HDFS impose such limits.
To make a long story short, we had to increase both of them.
<h2>Linux and Open File Limits</h2>
Linux limits the number of files open in parallel on a per-process basis. A given user may only start a certain
number of processes in parallel (type <code>ulimit -u</code> to see your own limit), and each of these processes
is only allowed to have a certain number of files open in parallel (<code>ulimit -n</code>). The default value for
open files is 1024 (at least in Debian/Ubuntu-flavored distributions).
To get around the problem from above, we increased this limit for the user running our Hadoop cluster by editing
<code>/etc/security/limits.conf</code>.
To make the machine recognize the new limits, it is necessary to log out and back in again. However,
in our case we do not log in as the user <code>hadoop</code> directly,
but use the <code>su</code> command. Thus, no new login shell is started and the configuration change would not be recognized. In
<a href="http://blog.alarmschaben.de/2011/05/04/how-to-make-su-honour-settings-in-etcsecuritylimits-conf">his blog post</a>,
Armin describes how to edit <code>/etc/pam.d/su</code> in this scenario.
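For illustration, the relevant <code>/etc/security/limits.conf</code> entries might look as follows (the user name and the limit of 32768 are assumptions; pick values matching your workload):

```
# /etc/security/limits.conf
# <domain>  <type>  <item>   <value>
hadoop      soft    nofile   32768
hadoop      hard    nofile   32768
```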
<h2>Hadoop and Open File Limits</h2>
Hadoop (we use version 0.20.2) has configuration parameters for almost everything. The one specifying the number of
files open in parallel per datanode is named <code>dfs.datanode.max.xcievers</code> (note the misspelling in the parameter name). Unfortunately, if not set otherwise,
datanodes start up with only 256 parallel file handles
(see <code>org.apache.hadoop.hdfs.server.datanode.DataXceiverServer</code> class for details).
The <a href="http://hbase.apache.org/book.html#dfs.datanode.max.xcievers">HBase documentation on the xcievers parameter</a>
recommends a value of 4096.
As stated before, this parameter is evaluated during datanode startup time. Therefore, it is necessary to
configure this parameter in the <code>conf/hdfs-site.xml</code> file on each datanode and restart the cluster afterwards.
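The corresponding <code>conf/hdfs-site.xml</code> entry, using the value of 4096 recommended by the HBase documentation, looks like this:

```xml
<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>
```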
<h2>Summary</h2>
We had to increase both the OS specific limits and the limits in the Hadoop configuration.
None of them alone was sufficient. To apply the configuration changes, we had to update the settings on each
datanode machine and restart the cluster. Afterwards, the exceptions from above only appeared on
cluster nodes which had not been properly updated.Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com0tag:blogger.com,1999:blog-8029507791638176347.post-9064064725663632332011-09-20T00:13:00.000+02:002011-09-20T00:13:38.234+02:00Shutdown MiniDFSCluster and MiniMRCluster Takes ForeverWriting unit tests for Hadoop applications does not need to be more complicated than writing tests for any other Java application.<br />
Usually, I use the following procedure for testing my Hadoop code:
<ol>
<li>Testing the classes as <i>real</i> units. Each Mapper and each Reducer deserves to be tested in isolation. Spock, with its awesome support for mocking and stubbing, is a great tool for testing units in isolation. With few 3rd party dependencies it is even possible to mock final classes, bypass the existing constructors of these classes, and inject them into private fields of the units under test: isolation at its best (more on this in another post).<br /> Furthermore, there might be other units like custom Writables, InputFormat and OutputFormat implementations, classes containing the business logic to convert data, and so on. I try to write a Specification (the Spock counterpart of a TestCase) for each non-trivial class.<br />
This should reveal most of the nasty little bugs in the business logic of the application. Small side note: I know there is the MRUnit test framework provided by Cloudera. Personally, I found Spock to be more powerful and flexible, but as always, this is merely a matter of taste.</li>
<li>The next step is to execute complete job roundtrips using the local Hadoop mode, again with Spock. If the application features a command line interface, I set up a simple Specification which executes the main() method of the main driver class and compares some expected output files against the actual output files. Being able to execute jobs in local mode is great, because it is a fast way to run blackbox tests against the map reduce framework. However, since the local mode is somewhat limited, it might be necessary to use a real cluster. For example, in local mode it is not possible to run more than one reducer, which limits the testing capabilities of custom partitioning and grouping code.</li>
<li>To execute m/r code on a real cluster, I often use the hadoop-test project (include it with <code>testCompile 'org.apache.hadoop:hadoop-test:0.20.2'</code>
in the <code>build.gradle</code> file). This project features implementations of both a distributed filesystem (<code>org.apache.hadoop.hdfs.MiniDFSCluster</code>) and a m/r cluster (<code>org.apache.hadoop.mapred.MiniMRCluster</code>). The most annoying thing about these clusters is the very long startup time, but hey, they're distributed. The remainder of this post is about usage of these clusters.</li>
</ol>
Because of the long cluster startup time, if there are several different Hadoop jobs to test, or a single job with different configurations, one should consider putting the cluster management code into the static setup methods of the test framework.
<pre class="brush:groovy">
static MiniMRCluster mrCluster
static MiniDFSCluster dfsCluster

def setupSpec() {
    def conf = new JobConf()
    if (System.getProperty("hadoop.log.dir") == null) {
        System.setProperty("hadoop.log.dir", "/tmp")
    }
    dfsCluster = new MiniDFSCluster(conf, 2, true, null)
    mrCluster = new MiniMRCluster(2, dfsCluster.getFileSystem().getUri().toString(), 1)
    def hdfs = dfsCluster.getFileSystem()
    hdfs.delete(new Path('/main-testdata'), true)
    hdfs.delete(new Path('/user'), true)
    FileUtil.copy(inputData, hdfs, new Path('main-testdata'), false, conf)
}
</pre>
Notes:
<ul>
<li><code>setupSpec()</code> is the Spock equivalent of JUnit 4's <code>@BeforeClass</code></li>
<li>If the <code>hadoop.log.dir</code> system property is not set, the m/r cluster will not start up</li>
<li>The clusters are configured to use 2 slave nodes each</li>
<li>After successful filesystem startup, it is possible to perform the usual filesystem operations with it</li>
</ul>
Teardown and cleanup is similarly performed in the static context:
<pre class="brush:groovy">
def cleanupSpec() {
    mrCluster?.shutdown()
    dfsCluster?.getFileSystem().delete(new Path('/main-testdata'), true)
    dfsCluster?.getFileSystem().delete(new Path('/user'), true)
    dfsCluster?.shutdown()
}
</pre>
However, there is a really annoying problem with this code: it takes forever. I don't know why (presumably I'm doing something wrong), but the shutdown procedure is blocked by several data integrity checks which themselves take a long time. Since the generated data is garbage and of zero relevance after the tests have completed, I'd like to get rid of these checks, but I really cannot figure out how.
The logs get s-l-o-w-l-y filled with lines like
<pre class="brush:bash">
11/09/19 23:47:06 INFO datanode.DataBlockScanner: Verification succeeded for blk_3329401068442722923_1001
11/09/19 23:47:12 INFO datanode.DataBlockScanner: Verification succeeded for blk_654270692326292497_1008
11/09/19 23:47:39 INFO datanode.DataBlockScanner: Verification succeeded for blk_-3673127094860948561_1006
</pre>
and thus a single test, and with it the whole test suite, takes several minutes to complete, which is fatal for any continuous integration system.
However, there is a workaround which is neither obvious nor very nice, but it works: wrapping the shutdown procedure in another thread. After modifying the code from above into
<pre class="brush:groovy">
def cleanupSpec() {
    Thread.start {
        mrCluster?.shutdown()
        dfsCluster?.getFileSystem().delete(new Path('/main-testdata'), true)
        dfsCluster?.getFileSystem().delete(new Path('/user'), true)
        dfsCluster?.shutdown()
    }.join(5000)
}
</pre>
I could no longer observe any blocking behavior. Note that <code>join(5000)</code> waits at most five seconds for the shutdown thread and then simply stops waiting for it.
If someone has a better idea to avoid blocking on shutdown, please feel free to comment.
Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com0tag:blogger.com,1999:blog-8029507791638176347.post-70850356153345062212011-09-15T00:09:00.000+02:002011-09-15T00:09:08.778+02:00Hosting a Maven Repository on Github for Gradle BuildsGradle is a great tool for organizing the build logic of software projects. Gradle is able to handle all the dependency management necessary even for very small projects. Under the hood, it relies on Ivy and can deal with Maven repositories.<br />
The question is where to upload my own artifacts which should be accessible as dependencies for other projects. As always, there are several alternatives to choose from. A small excerpt:
<ol>
<li>Use one of the free Maven repository hosting services as provided by Sonatype for example</li>
<li>Set up one's own server with Artifactory, Nexus, or something similar (although plain http, ftp, ssh, ... would suffice as well)</li>
<li>Create a new Git project and publish it to Github</li>
</ol>
One benefit of the Sonatype solution is the automated repository synchronization with the Maven Central Repository. However, since the projects have to fulfill a certain set of requirements and I do not need to find my pet projects at Maven Central, this is not my preferred solution.<br />
The company I work for uses Artifactory set up on a dedicated server which is running 24/7 and plays nicely with the continuous integration system. Again, this seems a bit overkill for my private projects.<br />
Since I'm using Git already and have a Github account featuring a very small number of projects, the third alternative seems very appealing to me. This solution requires three basic steps:
<ol>
<li>Set up and synchronize Git repositories</li>
<li>Upload project artifacts to local Git repository and synchronize it with Github</li>
<li>Configure the Github project as <code>mavenRepo</code> within Gradle build files for depending projects</li>
</ol>
<h3>Step 1: Create a Git Project and Synchronize with Github</h3>
For Github project creation I follow the <a href="http://help.github.com/create-a-repo/">Github Repository Creation Guide</a>. At first I create a repository on Github using the web interface.
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-9mAs7g88xds/TnD9BL4TKZI/AAAAAAAAABs/23HEXfF03rU/s1600/create-repo.png" imageanchor="1" style="clear:left; float:none;margin-right:1em; margin-bottom:1em"><img border="0" height="264" width="384" src="http://2.bp.blogspot.com/-9mAs7g88xds/TnD9BL4TKZI/AAAAAAAAABs/23HEXfF03rU/s320/create-repo.png" /></a></div>
<br />
Afterwards, I set up a local repository and synchronize it with Github using the following steps:
<pre class="brush:bash">
$ mkdir maven-repository
$ cd maven-repository
$ git init
Initialized empty Git repository in /home/thevis/GIT/maven-repository/.git/
$ touch README.md
$ git add README.md
$ git commit -m 'first commit'
[master (root-commit) eeee5a8] first commit
0 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 README.md
$ git remote add origin git@github.com:tthevis/maven-repository.git
$ git push -u origin master
Counting objects: 3, done.
Writing objects: 100% (3/3), 209 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To git@github.com:tthevis/maven-repository.git
* [new branch] master -> master
Branch master set up to track remote branch master from origin.
$
</pre>
A quick check at Github shows that <code>README.md</code> has actually been pushed and that repository synchronization works fine.
<h3>Step 2: Upload Project Artifacts to Local Git Repository</h3>
My intention is to establish the following release workflow for my projects: make and upload a release build using <code>gradle clean build uploadArchives</code>, manually check the results in the local Git directory, add them to the Git repository, and finally push them to Github.
Well, I know that software releases should not involve manual steps and should be performed on dedicated machines running build server software and so on and so on...<br />
However, since I'm the only contributor to my projects and the release cycles are somewhat irregular, this procedure is sufficient for my needs.
<br />
Therefore, I change to a project to be released and add the following lines to the <code>build.gradle</code> file:
<pre class="brush:groovy">
apply plugin: 'maven'

uploadArchives {
    repositories.mavenDeployer {
        repository(url: "file:///home/thevis/GIT/maven-repository/")
    }
}
</pre>
Executing the release procedure described above yields the following result:
<pre class="brush:bash">
$ gradle clean build uploadArchives
[...suppressed a few output lines here...]
:build
:uploadArchives
Uploading: net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar to repository remote at file:///home/thevis/GIT/maven-repository/
Transferring 5239K from remote
Uploaded 5239K
BUILD SUCCESSFUL
Total time: 24.443 secs
</pre>
Just to be sure, I check the result by hand:
<pre class="brush:bash">
$ find /home/thevis/GIT/maven-repository/ -name "*groovy-hadoop*"
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.md5
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.sha1
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.sha1
/home/thevis/GIT/maven-repository/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.md5
</pre>
Looks good, so I add the files to Git and push them to Github:
<pre class="brush:bash">
$ cd /home/thevis/GIT/maven-repository/
$ git add .
$ git commit -m "released groovy-hadoop-0.2.0"
[master 5026f70] released groovy-hadoop-0.2.0
9 files changed, 40 insertions(+), 0 deletions(-)
create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar
create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.md5
create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar.sha1
create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom
create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.md5
create mode 100644 net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.pom.sha1
create mode 100644 net/thevis/hadoop/groovy-hadoop/maven-metadata.xml
create mode 100644 net/thevis/hadoop/groovy-hadoop/maven-metadata.xml.md5
create mode 100644 net/thevis/hadoop/groovy-hadoop/maven-metadata.xml.sha1
$ git push origin master
Counting objects: 17, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (16/16), 5.12 MiB | 98 KiB/s, done.
Total 16 (delta 0), reused 0 (delta 0)
To git@github.com:tthevis/maven-repository.git
eeee5a8..5026f70 master -> master
</pre>
<h3>Step 3: Use Custom Repository for Depending Projects</h3>
There is only one potential pitfall here: determining the URL scheme for downloading files from Github.
The file page for <a href="https://github.com/tthevis/maven-repository/blob/master/net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar">groovy-hadoop-0.2.0.jar</a> reveals the <i>real</i> artifact URL if one hovers the mouse pointer over the <code>raw</code> link:<br />
<code>https://github.com/tthevis/maven-repository/<b>raw/master/</b>net/thevis/hadoop/groovy-hadoop/0.2.0/groovy-hadoop-0.2.0.jar</code>. Thus, the Maven repository URL to configure is not <code>https://github.com/tthevis/maven-repository</code> but rather <code>https://github.com/tthevis/maven-repository/<b>raw/master/</b></code>. <br />
With this finding in mind, the rest is pretty straightforward. For testing purposes I set up the following <code>build.gradle</code>:
<pre class="brush:groovy">
apply plugin: 'java'
apply plugin: 'eclipse'

version = '0.1.0'
group = 'net.thevis.hadoop'
sourceCompatibility = '1.6'

repositories {
    mavenRepo urls: 'https://github.com/tthevis/maven-repository/raw/master/'
}

dependencies {
    compile 'net.thevis.hadoop:groovy-hadoop:0.2.0'
}
</pre>
And guess what? It works like a charm.
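As a side note: <code>mavenRepo urls:</code> is the Gradle syntax of that time; in later Gradle versions the equivalent repository declaration reads as follows (a sketch, not tested against this exact setup):

```groovy
repositories {
    maven {
        url 'https://github.com/tthevis/maven-repository/raw/master/'
    }
}
```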
Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com12tag:blogger.com,1999:blog-8029507791638176347.post-85251591009270850132011-09-08T00:10:00.000+02:002011-09-08T00:11:01.791+02:00Spock in a Gradle-Powered Groovy Project<a href="http://code.google.com/p/spock/">Spock</a>. Is. Great.<br />
One can do wonderful things with Spock, at least when it comes to testing software. One of the fun things is that one can use Spock for Groovy, for Java, and obviously for mixed projects as well. I'll write about usecases and examples in another post. This one is about setting up Spock for a <a href="http://www.gradle.org/">Gradle</a>-powered Groovy project.<br />
<h1>Basic Build</h1>
Spock relies heavily on Groovy itself, so the desired Spock version has to match the Groovy dependency for the project.<br />
I tried it with Groovy-1.8.1 and Spock-0.5-groovy-1.8.<br />
Excerpt from the <code>build.gradle</code> file:
<pre class="brush:groovy">
apply plugin: 'groovy'
apply plugin: 'eclipse'

repositories {
    mavenCentral()
}

dependencies {
    groovy 'org.codehaus.groovy:groovy-all:1.8.1'
    testCompile 'org.spockframework:spock-core:0.5-groovy-1.8'
}
</pre>
However, <code>gradle eclipse</code> fails with an unresolved dependency:
<pre class="brush:bash">
:eclipseClasspath
:: problems summary ::
:::: WARNINGS
module not found: org.codehaus.groovy#groovy-all;1.8.0-beta-3-SNAPSHOT
[...]
FAILURE: Build failed with an exception.
* Where:
Build file '/home/thevis/spock-test/build.gradle'
* What went wrong:
Execution failed for task ':eclipseClasspath'.
Cause: Could not resolve all dependencies for configuration 'detachedConfiguration1':
- unresolved dependency: org.codehaus.groovy#groovy-all;1.8.0-beta-3-SNAPSHOT: not found
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.
BUILD FAILED
Total time: 1.431 secs
</pre>
What does this mean? <code>gradle dependencies</code> does not show any peculiarities, so I really don't know.<br />
Fortunately, although kind of annoying, this is not really a problem for the project. Since the project is a Groovy project already, it is possible to exclude the transitive Groovy dependencies introduced by Spock.
One possibility is to change the Spock dependency in the build file to:
<pre class="brush:groovy">
testCompile('org.spockframework:spock-core:0.5-groovy-1.8') {
    transitive = false
}
</pre>
Alternatively, one could exclude just the single missing dependency explicitly:
<pre class="brush:groovy">
testCompile('org.spockframework:spock-core:0.5-groovy-1.8') {
    exclude group: 'org.codehaus.groovy', module: 'groovy-all'
}
</pre>
Either way <code>gradle eclipse</code> will succeed.
<h1>Adding Optional Features</h1>
If you want to make use of Spock's mocking and stubbing support (and I'm sure you do), the basic configuration from above is somewhat limited, since it only allows mocking of interfaces. Spock also lets you mock and stub classes, and even bypass standard object construction. For these purposes, Spock depends on both <code>cglib-nodep</code> and <code>objenesis</code>, but declares these dependencies as optional. Thus, we have to declare them ourselves:
<pre class="brush:groovy">
testCompile 'cglib:cglib-nodep:2.2'
testCompile 'org.objenesis:objenesis:1.2'
</pre>
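With these two dependencies in place, a concrete class can be mocked just like an interface. Here is a minimal sketch; the <code>CoffeeMachine</code> class and its <code>brew()</code> method are made up for illustration:
<pre class="brush:groovy">
import spock.lang.Specification

// A concrete class (not an interface) -- invented for this example.
class CoffeeMachine {
    String brew() { "espresso" }
}

class CoffeeMachineSpec extends Specification {
    def "a mocked class can be stubbed like an interface"() {
        given: "a class mock, which requires cglib-nodep and objenesis"
        CoffeeMachine machine = Mock()

        when:
        def drink = machine.brew()

        then: "the interaction is both counted and stubbed"
        1 * machine.brew() >> "latte"
        drink == "latte"
    }
}
</pre>
Without <code>cglib-nodep</code> on the test classpath, <code>Mock()</code> on a class (rather than an interface) fails at runtime, so this spec doubles as a quick check that the optional dependencies are wired up correctly.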
<h1>Complete Build File</h1>
Finally, here is the <code>build.gradle</code> in its full glory providing full mock and stub support with Spock:
<pre class="brush:groovy">
apply plugin: 'groovy'
apply plugin: 'eclipse'

version = '0.1.0-SNAPSHOT'
sourceCompatibility = '1.6'

repositories {
    mavenCentral()
}

dependencies {
    groovy 'org.codehaus.groovy:groovy-all:1.8.1'
    testCompile ('org.spockframework:spock-core:0.5-groovy-1.8') {
        transitive = false
    }
    testCompile 'cglib:cglib-nodep:2.2'
    testCompile 'org.objenesis:objenesis:1.2'
    testCompile 'junit:junit:4.7'
}
</pre>
Happy specifying!
Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com0tag:blogger.com,1999:blog-8029507791638176347.post-54698791377649815842011-08-30T21:08:00.014+02:002011-08-31T00:01:55.150+02:00Java + Clean Code = GroovySome months ago, I read Robert C. Martin's <i>Clean Code</i> and discussed it every once in a while with several professional Java developers working for different companies. In retrospect, these discussions astonished me in two ways.
<ol>
<li>A good number of these developers seemed to know <i>Clean Code</i>, or at least some catchy claims from it. No one disagreed with Martin.</li>
<li>Most of them had heard about Groovy, few of them knew much about Groovy, and almost none of them actually used it.</li>
</ol>
Maybe I got it wrong, but in my understanding the bottom line of <i>Clean Code</i> is more or less: <i>write code which is short, concise, self-documenting, and leaves no room for interpretation to the reader</i>.
You may have guessed it from the title, but my understanding of Groovy is not too far from that.<br/>
Recently, I had to provide a comma-separated list of IDs as a command line argument to an application. The IDs were a consecutive sequence of integers. Too lazy to copy and paste the numbers, I thought about the effort it would take to generate the list.
How would you do it in Java? In fact, you wouldn't, right? The time it takes to create a Java class with a main method, iterate over integers, concatenate them with a StringBuilder, deal with a redundant comma at the beginning or the end, and finally compile this class for a single execution won't pay off. Alternatives: copy and paste, or learn bash (awk, python, perl, ... insert your scripting solution of choice).<br/>
Enter Groovy.
<pre class="brush: bash;">$ groovy -e 'println ((12345 .. 12355).join(","))'
12345,12346,12347,12348,12349,12350,12351,12352,12353,12354,12355</pre>
My assumption is that even somebody with no idea what Groovy is all about will, just by looking at the command, guess its outcome correctly.
But what if the task were a bit more involved than just listing consecutive integers? What about filtering and transforming result entries? Suppose, out of curiosity, you want to list all the numbers between 1 and 1000 that are divisible by 17 and contain a 9 as a digit. The numbers should be listed line by line with proper line numbers (a rather academic example, I know).<br/>
What about Java? The proper way to deal with collection filtering and decoration would be to use <i>commons-collections</i>, <i>guava</i>, or something similar, or to provide one's own <code>AbstractCollection</code> implementations with lots of anonymous inner classes, even for this simple task. The not so clean Java solution could look like the following code.
<pre class="brush:java">class Dummy {
    public static void main(String[] args) {
        String result = "";
        int counter = 0;
        for (int i = 17; i <= 1000; i += 17) {
            if (("" + i).contains("9")) {
                result += (++counter) + ": " + i + "\n";
            }
        }
        System.out.print(result);
    }
}</pre>
It works, but it is rather ugly. Moreover, one has to create a file (insert your editor of choice), compile it (javac), and run it (java) just for a single execution (and delete it afterwards). Many different tools and commands for a trivial task.
In contrast, the Groovy inline script solution is very short, concise, and self-documenting (and does not produce trash in the file system):
<pre class="brush: bash;">$ groovy -e 'counter = 0
> println ((1 .. 1000)                              /* iteration */
> .grep{it % 17 == 0 && "${it}".contains("9")}      /* filter */
> .collect{"${++counter}: ${it}"}                   /* decoration */
> .join("\n"))'                                     /* concatenation */
1: 119
2: 289
[...]
14: 969
15: 986</pre>
My biggest problem when working with Groovy is becoming aware of the many hours wasted developing clean-code Java solutions for features Groovy provides out of the box in a very concise manner. Don't get me wrong: I'm a big fan of the Java language, and I like solutions with lovely design patterns. However, for some problems the full-featured Java design-pattern sledgehammer seems like overkill, considering the opportunities Groovy might add to your default toolkit. Since integrating Groovy is merely a matter of adding jars to the classpath and tweaking the IDE appropriately, I am astonished how few people actually make use of Groovy's opportunities.
<br/>
Bad marketing, new-technology anxiety, or other limitations I'm not aware of? I simply don't get it.
Thomas Thevishttp://www.blogger.com/profile/09812984572021974460noreply@blogger.com0