Wow. I can’t believe it has been a month since I last posted. On December 1, I started a new chapter in my life, working full time as a Data Scientist at the Rubicon Project. Needless to say, that has been keeping me occupied, as has thinking about working on my dissertation. For the time being, I am getting settled in here.
When I accepted this position, one of my hopes/expectations was to become professionally competent and confident in C, Java, Python, Hadoop, and the software development process, rather than relying on hobby and academic knowledge. That is something a degree cannot help with. It has been a great experience, although a very frustrating one at times, but that is to be expected when jumping into development professionally.
I am writing this post to chronicle what I have learned about using Hadoop in production and how greatly it differs from the way I used it in my research and personal analysis.
To start, I was asked to check out a huge stack of code from a Subversion repository. But then what?
But you’re a Computer Scientist! This should be easy!
The first part is true, but there is a stark difference between a garden-variety computer scientist and one who converts from another field. Homegrown developers typically have an undergraduate degree in Computer Science (mine is close, but not purely CS). They have a strict and challenging curriculum of coursework, and their lives are peppered with summer internships. Those internships season them with professional work experience they were not expected to have beforehand. Once they graduate, they have the skills necessary to work as software engineers.
“Converts” have a love for Computer Science, but typically hold undergraduate degrees in related fields. Mine were Mathematics of Computation and Statistics. The Statistics B.S. was totally irrelevant. The Math of Computation B.S. was relevant in discovering my true interest; however, my life was consumed with writing proofs and solving puzzles. I only programmed for fun and to solve problems that I faced in daily life. Shortly after starting a Ph.D. in Statistics, I discovered that I wanted a career change to Computer Science and software engineering. Although I was subjected to the same curriculum as the undergraduates, it was rushed and did not provide the full experience the college CS majors received. Although I am book-smart in these fields, I come to the development world either expected to know the ins and outs of engineering, or expected to pick it up quickly without the pampering. Learning engineering this way is fun and less mundane, but more frustrating.
But I thought you already knew Java and Hadoop?
Well, that is all relative! The Java code I have written was not for Hadoop, and the code I write for my own research and hobby projects is much different from the code I am expected to write in production. In my research I always used Python (except in a few instances where I used Java) and the Hadoop Streaming package. Although Streaming is a great package, I feel it lacks the customization that a standard Hadoop job written in Java has. It also lacks the “meat” that the vanilla Hadoop distribution enjoys in the professional world. There are also performance gains from using vanilla Hadoop, as both the code and the framework are written in the same language and there is no interface between languages to deal with. Beyond tooling, production imposes demands that my own research never did:
- My code must execute successfully on tens or hundreds of thousands of files, not just some manageable subset that I have created.
- I did not write the code that created these files, so there are bugs and intricacies that I must defensively program against.
- These files are created in real time, on the fly, as real events are occurring. Stuff happens. Sometimes it’s not good stuff, and code must be able to work with (or toss out) that data. Data for my own research has been massaged and curated by moi before running in Hadoop. In production, this is not realistic.
- The code must be efficient. These jobs must not take forever to run, because time can equal money: either wasted CPU cycles that others could use, or time in the cloud.
- Development time must be used to integrate existing code rather than reinventing the wheel. My coworkers have already written a lot of code. I have been learning how to integrate this code into my own.
- The developer must not go rogue with code and must code carefully. Crashing the cluster or maxing out the hard drive can have dire consequences.
Tip 0: Nobody’s Perfect
When using a JAR file containing an archive of a package, the first line of each source file starts with package. This package declaration is essentially an “address” that lets you refer to the class by its fully qualified name, and it is a good way to index the contents of the JAR file.
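For example (the package and class names here are invented for illustration), a source file might begin like this:

// src/com/example/logs/LogCleaner.java (hypothetical file and package name)
package com.example.logs;

public class LogCleaner {
    public static String clean(String line) {
        return line.trim();
    }
}

Elsewhere, the class is referenced by its fully qualified name, com.example.logs.LogCleaner, which matches the path com/example/logs/LogCleaner.class inside the JAR.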
I have found vim and ant valuable for the time being. Adding an IDE would just potentially add another layer of misery to the learning process. The only IDE I ever recommend is Visual C++, and I am not even a Microsoft guy.
svn co http://svn.mydomain.com/svn/project/trunk ryans_first_project
Tip 1: Partition your Data
Suppose we have data keyed on some attribute, say a day or a customer ID, and we want every record sharing that attribute to land at the same reducer (and therefore in the same output file). A custom Partitioner controls which reducer each key/value pair emitted by the mappers is sent to, and job.setNumReduceTasks(n) controls how many reducers there are.
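Here is a minimal sketch of that idea, assuming Text keys and IntWritable values (the class name and the tab-delimited key layout are my own inventions); note the mask against Integer.MAX_VALUE, which a commenter explains at the end of this post:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record to a reducer based on the first field of its key.
public class SegmentPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        String segment = key.toString().split("\t")[0];
        // hashCode() can be negative, so mask off the sign bit before taking the modulus.
        return (segment.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

In the job setup, job.setPartitionerClass(SegmentPartitioner.class) and job.setNumReduceTasks(n) wire this in.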
Tip 2: The Key in the Mapper is Useless
If you use TextInputFormat, know that lines of text come into the mapper in key/value pairs and also leave the mapper in key/value pairs. The key coming INTO the mapper is of the weird type LongWritable, and it is useless: it is just the byte offset of the line in the file. What we really want to parse is the value coming into the mapper, and from it we emit our own key and value to the next phase.
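As a sketch (the tab-delimited field layout is an assumption on my part), a mapper over TextInputFormat might look like this; the LongWritable byte offset is accepted and then simply ignored:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EventCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // "offset" is just the byte offset of this line within the file; we never touch it.
        String[] fields = line.toString().split("\t");
        // Emit our own key, parsed out of the value, along with a count of 1.
        context.write(new Text(fields[0]), new IntWritable(1));
    }
}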
Tip 3: Use ant (or maven) to Configure and Build your Jobs
As I said, passing CLASSPATHs around in .bashrc and on the command line is a clusterf*ck. In my case, the code stack I am working with has a build.xml file. When I write source code, I don’t need to do anything special; ant knows to compile the file because build.xml contains the instructions to compile it.
Also, any and all libraries I need I just dump into a lib directory, and by simply adding the name of the library file to build.xml (it is obvious where to put it), it is automatically added to the CLASSPATH at compile time. To build the project, all I type is
ant some_target
and it spits out a cute little JAR file in the target directory, ready for me to use with Hadoop. Of course, this process actually builds the entire project, not just my code, but it only takes 10 seconds or so to build.
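For illustration only (the directory layout, target name, and library names below are assumptions, not the actual project’s), a stripped-down build.xml following that pattern might look like:

<project name="ryans_first_project" default="some_target">
    <path id="project.classpath">
        <fileset dir="lib">
            <!-- Each library file named here lands on the CLASSPATH at compile time. -->
            <include name="hadoop-core.jar"/>
            <include name="commons-logging.jar"/>
        </fileset>
    </path>
    <target name="compile">
        <mkdir dir="build/classes"/>
        <javac srcdir="src" destdir="build/classes"
               classpathref="project.classpath" includeantruntime="false"/>
    </target>
    <target name="some_target" depends="compile">
        <mkdir dir="target"/>
        <jar destfile="target/somejarfile.jar" basedir="build/classes"/>
    </target>
</project>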
Tip 4: input_dir and output_dir are just Parameters
The Hadoop command given in tutorials usually has the following form
hadoop jar somejarfile.jar a.class.name input output
input and output are not “set” parameters; they are just plain old command-line arguments that you can access in your Java program through the args array (the String[] handed to main or to Tool’s run method). If you want to pass in 10 directories, you can do that!
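A sketch of that flexibility (the driver class name is invented): treat the last argument as the output directory and every argument before it as an input directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiInputDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "multi-input job");
        job.setJarByClass(MultiInputDriver.class);
        // Mapper, reducer, and output types would be configured here as usual.
        for (int i = 0; i < args.length - 1; i++) {
            // Every argument except the last is an input directory.
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        // The last argument is the output directory.
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MultiInputDriver(), args));
    }
}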
Tip 5: The JAR file is the Key to Success
Once you have a JAR file built by hand or by ant, all you need to do is move that baby around to wherever you want to run the Hadoop job. Of course, this assumes that Hadoop is installed on the machines you want to use. Then, with one file, running the job is as simple as:
hadoop jar myjarfile.jar com.bytemining.my.package.name input_dir output_dir
Tip 6: When it Gets to be Stressful, It’s Nothing a Little Ping Pong Can’t Fix
I’ve only been at this for 2 weeks… there will undoubtedly be a part 2.
Re: Tip 1. job.setNumReduceTasks(n) in your task setup (usually a class that extends Configured and implements Tool) controls the number of reducers spawned. You’ll note that your Partitioner implementation takes the number of reducers as an argument, so you’re often doing something like (segmentOfKeyIWanted.hashCode() & Integer.MAX_VALUE) % numReducers. From experience, a common mistake is to forget the mask against MAX_VALUE, not realizing that hashCode() can return negative values.
Re: Tip 2. Using the “new” Hadoop lib’s mapper (mapreduce.Mapper, which is a class, not an interface), I usually specialize the mapper as Mapper<Object, Text, ...> so that the input key can change to anything and I will never care.
Other useful things: create the mapper’s output key and value objects once and then use the appropriate set() call on each. Calling new Text() a couple of billion times can actually make a difference (not much of one, but it can be noticeable).
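For anyone who wants that reuse pattern spelled out, a small sketch (the names are illustrative, not from the post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Allocated once per task, then reused on every call to map() via set().
    private final Text outKey = new Text();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        outKey.set(line.toString().split("\t")[0]);
        outValue.set(1);
        context.write(outKey, outValue);
    }
}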