First off, I want to say this is awesome! It has been great to see all the
YARN offerings being released lately. I noticed Hadoop 2.x was
recently voted beta, which is very exciting!
Like many, we use Storm for near-real-time processing of our Kafka-based
streams. In addition, we send this data to Hadoop for offline analysis.
Consolidating these three environments into one is a win by itself. I also
really like the fault tolerance and security features. Are you guys using
Samza in production yet at LinkedIn, or is it still in development?
The local state approach is very interesting. Are you guys using Databus
for the feed of changes from the external stores? Is something like
Voldemort integrated locally for the key/value store? Can you maintain
multiple tables locally for stream processing?
Since we are using Storm, do any latency comparisons exist? Since Samza
makes the fault tolerance/durability tradeoff to persist to disk on every
hop between StreamTasks, it would seem to take a hit here. That said, we
use Trident a good bit, so many of our topologies are already slowed by
remote calls to Cassandra.
I know it is fairly new, but were any comparisons against Spark Streaming
considered? They take a similar tack of maintaining state locally as
opposed to external stores, but I believe they are limited on what can fit
Finally, where did the catchy name, Samza, come from?
On Fri, Aug 23, 2013 at 9:39 AM, Jay Kreps wrote:
This may be relevant to people on this list. A few of us at LinkedIn have
been working on Samza, a stream processing framework built on YARN. We just
added this as an Apache Incubator project. We would love to get people's
feedback (and help!). Here are the docs: http://samza.incubator.apache.org
If anyone has any questions I'm happy to discuss what we are up to. Our
mailing list is here: http://samza.incubator.apache.org/community/mailing-lists.html