
[Kafka-users] Samza -- A YARN stream processing framework for Kafka

Jay Kreps
Aug 23, 2013 at 3:40 pm
Hey guys,

This may be relevant to people on this list. A few of us at LinkedIn have
been working on Samza, a stream processing framework built on YARN. We just
added this as an Apache Incubator project. We would love to get people's
feedback (and help!). Here are the docs:

http://samza.incubator.apache.org

If anyone has any questions I'm happy to discuss what we are up to. Our
mailing list is here:

http://samza.incubator.apache.org/community/mailing-lists.html

-Jay

2 responses

  • Jonathan Hodges at Aug 27, 2013 at 1:51 pm
    First off, I want to say this is awesome! It has been great to see all the
    YARN offerings being released lately. I noticed Hadoop 2.x was recently
    voted beta, which is very exciting!

    Like many, we use Storm for near-real-time processing of our Kafka-based
    streams. In addition, we send this data to Hadoop for offline analysis.
    Consolidating these three environments into one is a win by itself. I also
    really like the fault tolerance and security features. Are you guys using
    Samza in production yet at LinkedIn, or is it still in development?

    The local state approach is very interesting. Are you guys using Databus
    for the feed of changes from the external stores? Is something like
    Voldemort integrated locally for the key/value store? Can you maintain
    multiple tables locally for stream processing?

    Since we are using Storm, do any latency comparisons exist? Since Samza
    makes the fault tolerance/durability tradeoff to persist to disk on every
    hop between StreamTasks, it would seem to take a latency hit here. That said, we
    use Trident a good bit, so many of our topologies are already slowed by
    remote calls to Cassandra.

    I know it is fairly new, but were any comparisons against Spark Streaming
    considered? They take a similar tack of maintaining state locally as
    opposed to in external stores, but I believe they are limited to what can
    fit in memory.

    Finally, where did the catchy name, Samza, come from?

    Thanks!
    Jonathan


  • Xavier Stevens at Aug 27, 2013 at 4:05 pm
    I can't answer the rest, but the catchy name comes from Gregor Samsa, a
    character in Kafka's novella The Metamorphosis.

    https://en.wikipedia.org/wiki/Gregor_Samsa#Gregor_Samsa


    -Xavier

