This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.
Yahoo’s S4: Distributed Stream Computing Platform
First off, it must be said: S4 is NOT real-time map-reduce! That is the meme that has been floating around the Internets lately.
S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. As a matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.
Pieces of data, apparently called events, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things: emit another event that will be consumed by another PE, or publish some result.
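To make the two PE behaviors above a bit more concrete, here is a toy sketch in Python. To be clear, this is not S4's actual API (S4 itself is a Java platform); the names Event, SplitterPE, CounterPE and run are all made up for illustration, and everything runs in a single process with a plain queue standing in for the distributed plumbing. One PE emits new events for another PE to consume, and the other publishes a result.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A piece of data flowing through the system (hypothetical shape)."""
    stream: str  # which kind of PE should consume this event
    key: str     # the payload: a line of text or a single word


class SplitterPE:
    """Consumes 'line' events and emits one 'word' event per word."""

    def process(self, event, emit, publish):
        for word in event.key.split():
            emit(Event(stream="word", key=word.lower()))


class CounterPE:
    """Consumes 'word' events and publishes the running count for that word."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, event, emit, publish):
        self.counts[event.key] += 1
        publish((event.key, self.counts[event.key]))


def run(lines):
    """Toy dispatcher: route each event to the PE registered for its stream."""
    pes = {"line": SplitterPE(), "word": CounterPE()}
    published = []
    queue = deque(Event(stream="line", key=line) for line in lines)
    while queue:
        event = queue.popleft()
        # Each PE may emit further events (back onto the queue) or publish.
        pes[event.stream].process(event, queue.append, published.append)
    return published
```

Calling run(["S4 is not", "not Hadoop"]) pushes two line events through the splitter, which emits five word events, and the counter publishes a running count for each one as it arrives. Nothing ever waits for the input to end, which is the point.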
Streaming data is different from non-streaming data in that the user does not know how much data will […]