It has been a while since I have been to Silicon Valley, but Hadoop Summit gave me the opportunity to go. To make the most of the long trip, I also decided to check out BigDataCamp, held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for a deluge of rain at the end of June. The weather is one of the things keeping me from moving up to Silicon Valley.
The food/drinks/networking portion must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We began with a series of lightning talks from some familiar names and some unfamiliar ones.
Chris Wensel, the developer of Cascading, is also the founder of Concurrent, Inc. Cascading is an alternative API to map-reduce, written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow, and Cascading's built-in planner manages the jobs. Cascading usually assumes Hadoop underneath, but it can run on other platforms, including EMC Greenplum and the new MapR distribution. Razorfish and Best Buy use Cascading for behavioral targeting. FlightCaster uses a domain-specific language (DSL) written in Clojure on top of Cascading for large data-processing jobs, and Etsy uses a DSL written in JRuby as a layer on top of Cascading. Of course, the big player is BackType, whose Cascalog combines Cascading with the Datalog language to provide a declarative way of working with data and map-reduce. Wensel noted that one disadvantage of Pig and Hive that Cascading addresses is the lack of a physical planner. Workflow managers such as Oozie and Azkaban can run Cascading jobs as part of a workflow. Version 2.0 of Cascading removes Hadoop as a dependency and will allow users to run Cascading jobs on data in RAM rather than on disk.
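To give a sense of what Cascading saves you from, here is roughly the boilerplate you write to chain two jobs by hand with the plain Hadoop API. This is just a minimal sketch, not anything Wensel showed: the paths are made up, the second job is only an identity pass to show the hand-off, and the mapper/reducer classes are the stock ones that ship with Hadoop.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("/tmp/raw");            // made-up paths
    Path intermediate = new Path("/tmp/counts");  // hand-off directory between the jobs
    Path output = new Path("/tmp/final");

    // Job 1: word count using mapper/reducer classes bundled with Hadoop.
    Job first = new Job(conf, "count");
    first.setJarByClass(ChainedJobs.class);
    first.setMapperClass(TokenCounterMapper.class);
    first.setCombinerClass(IntSumReducer.class);
    first.setReducerClass(IntSumReducer.class);
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    if (!first.waitForCompletion(true)) {
      System.exit(1);                             // manual failure handling between stages
    }

    // Job 2: an identity pass over job 1's output, just to show the chaining.
    Job second = new Job(conf, "pass-through");
    second.setJarByClass(ChainedJobs.class);
    second.setOutputKeyClass(LongWritable.class); // TextInputFormat keys are byte offsets
    second.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}
```

Every extra stage means another block like this, plus the intermediate directories and failure handling; Cascading's planner is what takes that off your hands.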
James Falgout from Pervasive DataRush presented the second lightning talk. Pervasive’s products are built around a “dataflow” paradigm that attempts to fill in features missing from map-reduce. The basic description compared dataflow to a Unix shell pipeline with message passing between operators. James showed an example dataflow that a user could configure visually, and Pervasive is working on integrating dataflow with Hive.
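If I understood the analogy, the idea is that independent operators run concurrently and push records downstream, the way `cat | grep | wc -l` does in a shell. A toy sketch of that idea in plain Java, just to illustrate the paradigm and not Pervasive's actual API:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TinyDataflow {
  private static final String EOF = "\u0000EOF";   // sentinel marking end of stream

  public static void main(String[] args) throws Exception {
    BlockingQueue<String> raw = new ArrayBlockingQueue<>(1024);
    BlockingQueue<String> filtered = new ArrayBlockingQueue<>(1024);

    // Operator 1: source ("cat") -- emits records downstream.
    Thread source = new Thread(() -> {
      try {
        for (String line : List.of("GET /a", "POST /b", "GET /c")) raw.put(line);
        raw.put(EOF);
      } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    });

    // Operator 2: filter ("grep GET") -- consumes and forwards matching records.
    Thread filter = new Thread(() -> {
      try {
        for (String rec = raw.take(); !rec.equals(EOF); rec = raw.take())
          if (rec.startsWith("GET")) filtered.put(rec);
        filtered.put(EOF);
      } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    });

    source.start();
    filter.start();

    // Operator 3: sink ("wc -l") -- counts whatever reaches the end of the pipe.
    long count = 0;
    for (String rec = filtered.take(); !rec.equals(EOF); rec = filtered.take()) count++;
    source.join();
    filter.join();
    System.out.println("GET requests: " + count);
  }
}
```

The appeal is that each operator is simple and the concurrency comes from wiring operators together, which is presumably what the visual configuration tool is drawing.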
Guy Harrison from Quest Software introduced their Toad for Cloud Databases system. Toad attempts to merge data from several different data sources, such as Hive, MongoDB, and Cassandra, for analysis. Unfortunately, Guy’s thick Australian accent made his humorous talk unintelligible to me (hearing loss?).
Steve Wooledge from AsterData (now part of Teradata) discussed the company’s goal of taking a standard relational database system and integrating map-reduce on top of it. Such a system is flexible, allowing both SQL-like and programmatic access to the data. This hybrid row-oriented and column-oriented datastore can be used for path and pattern matching, text processing, and graph traversal, in addition to the usual tasks. nPath is a product that adds transactional data analytics (click analysis, sessionization).
Andrew Yu from EMC presented some of EMC’s data analytics products. I wrote about EMC in an earlier blog post, so I will spare the details. EMC offers a data warehouse product as well as a hybrid, pre-configured system that combines its Greenplum warehouse with built-in map-reduce.
Ben Lee from Foursquare discussed how big data is used at Foursquare and gave some statistics about its service. This was by far the most interesting talk to me. Foursquare offers realtime suggestions of places to visit based on the user’s history and the user’s friends’ histories, taking into account the day of week and time of day. Foursquare has 10 million users, 50 million venues, and 750 million check-ins, with over 3 million check-ins per day, and 10,000 developers use Foursquare’s API. MongoDB is the main datastore and Scala is used for the front end. Back-end data processing uses Hadoop (both vanilla and Streaming) as well as Flume, Elastic MapReduce, and S3. Ben displayed an awesome visualization of check-in data: researchers took check-ins from New York City and performed sentiment analysis on the text attached to each check-in. The visualization suggested that people were the “happiest” in Manhattan.
Paul Baclace introduced some software called Phatvis that allows developers to visualize map-reduce jobs. His hope is that the visualization can be used to fine-tune Hadoop parameters based on evidence from prior jobs. The source can be found here.
Of course, the fun in every “unconference” is the circus known as scheduling the sessions. Some of the proposed sessions:
- Big Data 101 / Intro to Hadoop
- Extracting MapReduce Data into a High-Performance Relational Database
- “ETL was Yesterday” What’s next?
- Operations of Hadoop Clusters
- SQL / NoSQL Why not Both? (Aster)
- Geodata
- Big Data Retention / Compression
- Business Intelligence and Hadoop
- Data Management Lifecycle
- Distributions of Hadoop
- Hadoop for Bioinformatics and Healthcare
The topics did not seem exciting this time and overlapped quite a bit with the presentations at Hadoop Summit, but I found two (we could only attend two) that stood out.
Session 1: Operating a Hadoop Cluster
Thank goodness managing a Hadoop cluster is not in my job description (I only manage the small clusters I use for research). Charles Wimmer, the lead of the Operations track for Hadoop Summit, led this discussion, and much of it grew out of incidents that occurred at Yahoo. A popular topic was backup. We agreed that there is no such thing as “backing up” a Hadoop cluster. Any data that is important should be replicated, preferably 3 times, or transmitted in parallel over a pipe to multiple data centers. One strict limitation of replication is that if some new release of Hadoop, or some new Hadoop distribution, contains a bug that corrupts data, all replicas may be corrupted as well.
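For what it's worth, bumping the replication factor on data you care about is a short job against the HDFS API. A minimal sketch (the path is hypothetical; note that dfs.replication only affects files written after the change, so existing files need an explicit call):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BumpReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");        // default for anything this client writes from now on

    FileSystem fs = FileSystem.get(conf);
    Path critical = new Path("/data/critical/events.log");  // hypothetical file
    fs.setReplication(critical, (short) 3);  // asks HDFS to re-replicate the existing blocks
    fs.close();
  }
}
```

The shell equivalent is `hadoop fs -setrep`, and distcp between clusters is the usual tool for getting a second copy into another data center. Of course, none of this helps with the corrupted-replica scenario above.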
Discussion then turned to hardware. Yahoo uses high-density storage nodes with six drives of 2-3TB each. Charles mentioned that a common problem with Hadoop is that it is difficult to keep the CPUs busy, especially in a server with 8 Nehalem processors (8 CPUs or 8 cores?). The major reason is that the main bottleneck in map-reduce jobs is the network I/O required in the shuffle phase as data comes out of the mappers; the map phase is the most CPU-bound phase. Wimmer, and several others, made one thing clear: use SATA, not SAS. Apparently SATA and SAS drives have similar read performance for practical purposes (unless I misheard). The original Google map-reduce was built on commodity hardware, where quantity is more important than quality (within reason), and SATA provides a lot more space for your data: the same capacity costs an order of magnitude more with SAS drives.
The next topic of discussion was the NameNode as a single point of failure. Apparently the MapR system does not use HDFS, and recovering from a lost NameNode is not as severe as it is for stock Hadoop. An upcoming Hadoop release (0.23) also supposedly introduces a form of sharding, called NameNode federation, in which the namespace is divided over several NameNodes.
Hadoop has some issues with certain types of scalability, particularly around the JobTracker. When a large job with many mappers and reducers finishes quickly, the TaskTrackers send a flood of messages to the JobTracker and overwhelm it. To prevent users from thrashing a cluster, use the capacity scheduler to put hard caps on queues. There was also some high-level discussion of QoS-like functionality among users and of more sophisticated job monitoring. Map-Reduce NextGen improves scalability by allocating a per-job tracker to each individual job whose sole purpose is to monitor resource allocation. The biggest feature Charles would like to see is high-availability NameNodes.
Yahoo boasts an impressive 22 clusters, each containing between 400 and 4,200 nodes. A fellow from AOL indicated that AOL has a cluster of close to 1,000 nodes. Is AOL coming back from the dead?
Thank goodness managing a Hadoop cluster is not in my job description…
Session 2: Geodata
I do not get the opportunity to work with geographical data often, so I was curious to see what these folks had to say. The discussion was led by a fellow named Brian from OSGeo. The largest point I took away from this talk was that not much thought has been dedicated to Big Geodata, particularly how to store and process it. PostgreSQL with PostGIS can store and analyze manageable amounts of geodata, but not large data. MongoDB is one solution but has its issues; a fellow from Foursquare mentioned that MongoDB cannot shard on geographic data, though I could not hear precisely what he said to that effect. The biggest challenge seems to be the lack of an indexer capable of handling large amounts of geospatial data beyond the standard R-tree implementations. I believe that geodata, along with streaming data and multimedia, is one of the biggest unsolved problems in Big Data.
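As I understand it, MongoDB's 2d index takes a geohash-style approach rather than an R-tree: interleave the bits of longitude and latitude so that nearby points tend to share a key prefix, which turns 2-D proximity into ordinary prefix matching on a sortable key. A toy encoder to illustrate the idea (an illustrative sketch, not anyone's production code):

```java
public class Geohash {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Standard geohash encoding: repeatedly bisect the longitude and latitude
  // ranges, taking one bit from each in alternation, 5 bits per character.
  public static String encode(double lat, double lon, int precision) {
    double[] latRange = {-90.0, 90.0};
    double[] lonRange = {-180.0, 180.0};
    StringBuilder hash = new StringBuilder();
    boolean evenBit = true;   // alternate: longitude bit, then latitude bit
    int bit = 0, ch = 0;

    while (hash.length() < precision) {
      double[] range = evenBit ? lonRange : latRange;
      double value = evenBit ? lon : lat;
      double mid = (range[0] + range[1]) / 2;
      ch <<= 1;
      if (value >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
      evenBit = !evenBit;
      if (++bit == 5) {                 // 5 bits per base32 character
        hash.append(BASE32.charAt(ch));
        bit = 0;
        ch = 0;
      }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // Two Manhattan landmarks vs. downtown San Jose; compare the prefixes.
    System.out.println(encode(40.7580, -73.9855, 9));  // Times Square
    System.out.println(encode(40.7484, -73.9857, 9));  // Empire State Building
    System.out.println(encode(37.3382, -121.8863, 9)); // San Jose
  }
}
```

The caveat is that points near a cell boundary can hash to very different prefixes, which is part of why indexing big geodata still feels unsolved.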
Anyways, on to Hadoop Summit!
Great stuff, Ryan. Keep posting, please!
Regarding sharding of geo data in Mongo, see http://www.mongodb.org/display/DOCS/Geospatial+Indexing#GeospatialIndexing-ShardedEnvironments. It sort of works with caveats.
I’d love to get more into MongoDB and PostgreSQL for spatial stuff. Time (and myself 😉 ) is my biggest enemy right now.