One key point I wanted to mention for Hadoop developers (but then check out
the announcement).

I implemented a version of sysstat (iostat, vmstat, etc) in Peregrine and
would be more than happy to move it out and put it in another dedicated
project.

http://peregrine_mapreduce.bitbucket.org/xref/peregrine/sysstat/package-summary.html

I run this before and after major MR phases, which makes it very easy to
understand the system throughput/performance for that iteration.
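The before/after pattern is just snapshot-and-diff on cumulative counters: record the counters when a phase starts, record them again when it ends, and divide the delta by elapsed time. A minimal sketch of the idea (class and method names are illustrative, not the actual sysstat API):

```java
public class PhaseStat {
    /**
     * Average rate between two snapshots of a cumulative counter
     * (e.g. bytes read from /proc/diskstats before and after an MR phase).
     */
    public static double ratePerSecond(long before, long after, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        return (after - before) / elapsedSeconds;
    }
}
```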

...

I'm pleased to announce Peregrine 0.5.0 - a new map reduce framework
optimized for iterative and pipelined map reduce jobs.

http://peregrine_mapreduce.bitbucket.org/

This originally started off with some internal work at Spinn3r to build a
fast and efficient Pagerank implementation. We realized that what we wanted
was an MR runtime optimized for this type of work, which differs radically
from the traditional Hadoop design.

Peregrine implements a partitioned distributed filesystem where key/value
pairs are routed to defined partitions. This enables work to be joined
against previous iterations or different units of work by the same key on
the same local system.
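The routing here can be thought of as a pure function of the key, so each iteration's output for a given key lands on the same partition as every previous iteration's output for that key. A sketch of the idea (the hash function and signatures are assumptions for illustration, not Peregrine's actual implementation):

```java
import java.nio.charset.StandardCharsets;

public class Partitioner {
    // Deterministic routing: the same key always maps to the same
    // partition, so successive iterations can be joined locally.
    public static int partitionFor(byte[] key, int nrPartitions) {
        int h = 0;
        for (byte b : key) {
            h = 31 * h + (b & 0xff);   // simple polynomial hash (assumption)
        }
        return Math.abs(h % nrPartitions);
    }

    public static int partitionFor(String key, int nrPartitions) {
        return partitionFor(key.getBytes(StandardCharsets.UTF_8), nrPartitions);
    }
}
```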

Peregrine is optimized for ETL jobs where the primary data storage system
is an external database such as Cassandra, HBase, MySQL, etc. Jobs are then
run as Extract, Transform, and Load stages, with intermediate data being
stored in the Peregrine FS.

We enable features such as Map/Reduce/Merge as well as some additional
functionality like ExtractMap and ReduceLoad (in ETL parlance).

A key innovation here is a partitioning layout algorithm that can support
fast many-to-many recovery, similar to HDFS, but still support partitioned
operation with deterministic key placement.
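One common way to get both properties is rotated replica placement: a partition's primary host is fixed by its index, and its replica is offset by the partition's "row", so the partitions of any one host have their replicas spread across many distinct peers and a failed host can be re-replicated many-to-many, in parallel. A sketch under those assumptions (this is not Peregrine's actual layout algorithm):

```java
public class Layout {
    /** Primary host for a partition: fixed, deterministic placement. */
    public static int primaryHost(int partition, int hosts) {
        return partition % hosts;
    }

    /**
     * Secondary host: offset from the primary by the partition's "row"
     * (partition / hosts), so one host's partitions have their replicas
     * spread across many distinct peers rather than a single neighbor.
     */
    public static int secondaryHost(int partition, int hosts) {
        int offset = 1 + (partition / hosts) % (hosts - 1);
        return (partition % hosts + offset) % hosts;
    }
}
```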

We've also tried to optimize for single-instance performance and use modern
IO primitives as much as possible. This includes NOT shying away from
operating-system-specific features such as mlock, fadvise, fallocate, etc.
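For the mmap side of this, Java exposes memory mapping through `FileChannel.map`, and `MappedByteBuffer.load()` asks the OS to fault the region into memory (a best-effort pre-touch; true mlock/fadvise/fallocate need native calls). A minimal sketch, not Peregrine's actual IO layer:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    // Map a file and read it without copying it through the Java heap.
    public static byte firstByte(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load();   // hint: fault the pages in now (best effort)
            return buf.get(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap", ".bin");
        Files.write(tmp, new byte[] { 42, 7 });
        System.out.println(firstByte(tmp));
        Files.delete(tmp);
    }
}
```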

There is still a bit more work I want to do before I am ready to benchmark
it against Hadoop. Instead of implementing a synthetic benchmark, we wanted
to get a production-ready version first, which would allow people to port
existing applications and see what the before/after performance numbers
look like in the real world.

For more information please see:

http://peregrine_mapreduce.bitbucket.org/

As well as our design documentation:

http://peregrine_mapreduce.bitbucket.org/design/



--

Founder/CEO Spinn3r.com <http://spinn3r.com/>

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*


  • Arun C Murthy at Dec 27, 2011 at 8:07 am

    On Dec 26, 2011, at 10:30 PM, Kevin Burton wrote:

    One key point I wanted to mention for Hadoop developers (but then check out the announcement).

    I implemented a version of sysstat (iostat, vmstat, etc) in Peregrine and would be more than happy to move it out and put it in another dedicated project.

    http://peregrine_mapreduce.bitbucket.org/xref/peregrine/sysstat/package-summary.html

    I run this before and after major MR phases which makes it very easy to understand the system throughput/performance for that iteration.
    Thanks for sharing. I'd love to play with it; do you have a README/user-guide for sysstat?
    ...

    I'm pleased to announce Peregrine 0.5.0 - a new map reduce framework optimized
    for iterative and pipelined map reduce jobs.

    http://peregrine_mapreduce.bitbucket.org/
    Sounds interesting. I briefly skimmed through the site.

    Couple of questions:
    # How does peregrine deal with the case that you might not have available resources to start reduces while the maps are running? Is the map-output buffered to disk before the reduces start?
    # How does peregrine deal with failure of in-flight reduces (potentially after they have received X% of the maps' outputs)?
    # How much does peregrine depend on PFS? One idea worth exploring might be to run peregrine within YARN (MR2) as an application. Would you be interested in trying that?

    Thanks again for sharing.

    Arun
  • Kevin Burton at Dec 27, 2011 at 11:13 am

    Thanks for sharing. I'd love to play with it, do you have a
    README/user-guide for sysstat?
    Not a ton but I could write some up...

    Basically I modeled it after vmstat/iostat on Linux.

    http://sebastien.godard.pagesperso-orange.fr/documentation.html

    The theory is that most platforms have similar facilities, so drivers could
    be written per platform; at runtime the platform is detected, or an
    'unsupported' null object is returned which doesn't do anything.

    The output is IO, CPU, and network throughput per second for the entire
    run... so if the job took 5 minutes to execute, it would basically be a
    per-second average over those 5 minutes (see below for an example).
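    The driver selection described above is the classic null-object pattern: probe the platform at runtime, return a real driver if one exists, otherwise a no-op implementation so callers never have to branch on platform. A sketch of the pattern (interface and class names are illustrative, not the actual sysstat API):

```java
public class Sysstat {
    /** Minimal platform-driver interface (illustrative, not the real API). */
    public interface StatDriver {
        String describe();   // e.g. formatted iostat/vmstat-style output
    }

    /** Null object: returned on unsupported platforms, does nothing. */
    static final class UnsupportedDriver implements StatDriver {
        public String describe() { return ""; }
    }

    /** Linux driver stand-in; the real one would read /proc counters. */
    static final class LinuxDriver implements StatDriver {
        public String describe() { return "linux: reading /proc counters"; }
    }

    /** Pick a driver at runtime; callers never need a platform check. */
    public static StatDriver driverFor(String osName) {
        if (osName.toLowerCase().contains("linux")) {
            return new LinuxDriver();
        }
        return new UnsupportedDriver();
    }
}
```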


    Couple of questions:
    # How does peregrine deal with the case that you might not have available
    resources to start reduces while the maps are running?
    We can't start a reduce phase until the maps have completed.

    Right now I don't have speculative execution turned on in the builds but
    the support is there.

    One *could* do the initial segment sort of the reduce, but doing the full
    merge isn't really helpful, as even if ONE key changes you have to re-merge
    all the IO.

    The issue of running a ReduceMap where the output of one reduce is the
    input of another map does require some coordination.

    Right now my plan is to split the workload in half... so that the buffers
    are just 50% of their original values, since they both have to be in play.

    I'm not using the Java heap's memory but instead mmap(), so I can get away
    with some more fun tasks like shrinking various buffers and so forth.

    Is the map-output buffered to disk before the reduces start?
    No... Right now I don't have combiners implemented... We do direct
    shuffling, where IO is written directly to the reducer nodes instead of
    being written to disk first.

    I believe strongly *but need more evidence* that under practical loads,
    direct shuffling will be far superior to the indirect shuffling mechanism
    that Hadoop uses.

    There ARE some situations, I think, where indirect shuffling could solve
    some pathological cases, but in practice these won't arise (and certainly
    not in the pagerank impl and with our data).

    We're going to buffer the IO so that about 250MB or so is put through a
    combiner before being sent through snappy for compression, and then the
    result is directly shuffled.
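    The buffer-combine-compress-ship pipeline can be sketched as below; java.util.zip's Deflater stands in for snappy, and the combiner here just sums values per key (both are assumptions for illustration, not Peregrine's actual code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.Deflater;

public class DirectShuffle {
    /** Combiner stand-in: collapse duplicate keys by summing their values. */
    public static Map<String, Long> combine(List<Map.Entry<String, Long>> emits) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> e : emits) {
            out.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return out;
    }

    /** Compress a combined chunk before shipping it straight to the reducer. */
    public static byte[] compress(byte[] chunk) {
        Deflater deflater = new Deflater();
        deflater.setInput(chunk);
        deflater.finish();
        byte[] buf = new byte[chunk.length * 2 + 64];  // worst-case headroom
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }
}
```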

    # How does peregrine deal with failure of in-flight reduces (potentially
    after they have received X% of maps' outputs).
    The reduces are our major checkpoint mode right now....

    There are two solutions I'm thinking about... (and perhaps both will be
    implemented in production and you can choose which strategy to use).

    1. One replica of a partition starts a reduce, none of the blocks are
    replicated, if it fails the whole reduce has to start again.

    2. All blocks are replicated, but if a reduce fails it can just resume on
    another node.

    ... I think #1 in practice will be the best strategy, though. A physical
    machine hosts about 10 partitions, so even if a crash DOES happen and you
    have to resume a reduce, you're only redoing 1/10th of the data...

    And since the remaining 9 partitions are split across 9 different hosts,
    the reduces there can be done in parallel while recovery is happening.

    # How much does peregrine depend on PFS? One idea worth exploring might be
    to run peregrine within YARN (MR2) as an application. Would you be
    interested in trying that?

    It depends heavily upon PFS... the block allocation is all done via the
    PFS layer, and it needs to be deterministic or the partitioning
    functionality will not work.

    Also, all the IO is done through async IO... because at 10k machines you
    can't do threaded IO, as it would require too much memory.

    I was thinking the other day (and talking with my staff) that right now,
    if you view the distributed systems space, there is a LOT of activity in
    Hadoop because it's one of the most widely deployed platforms out there...

    But if you look at the *history* of computer science, we have NEVER settled
    on a single OS, single programming language, single editor, etc. There is
    always a set of choices out there because some tools are better suited to
    the task than others.

    MySQL vs Postgres, Cassandra vs HBase, etc... and in a lot of cases it's
    not even 'versus', as some tools are simply better for the job than others.

    I think this might be the Peregrine/Hadoop situation.

    Peregrine would be VERY bad for some tasks right now, for example... If you
    have log files to process and just want to grok them, query them, etc...
    then a Hadoop / Pig / Hive setup would be WAY easier to run and far more
    reliable.

    My thinking is that Peregrine should just focus on the area where I think
    Hadoop could use some improvement. Specifically iterative jobs and more
    efficient pipelined IO...

    I also think that there are a lot of ergonomic areas where collaboration
    should/could happen across a number of runtimes... for example, the sysstat
    package.

    For our part we're going to use the Hadoop CRC32 encoder for storing blocks
    into PFS...




    Processor %util
    --------- -----
    cpu 2.00
    cpu0 6.00
    cpu1 2.00
    cpu2 2.00
    cpu3 1.00
    cpu4 3.00
    cpu5 1.00
    cpu6 1.00
    cpu7 1.00
    cpu8 5.00
    cpu9 2.00
    cpu10 1.00
    cpu11 1.00
    cpu12 4.00
    cpu13 1.00
    cpu14 1.00
    cpu15 1.00

    Disk    reads    writes    bytes read    bytes written    Avg req size    %util
    ----    -----    ------    ----------    -------------    ------------    -----
    sda     82,013   40,933    15,377,601    18,155,568       272.75          100.00

    Interface bits rx bits tx
    --------- ------- -------
    lo 125 125
    eth0 122,918 233,160
    eth1 26 10
    sit0 0 0




Discussion Overview
group: mapreduce-user @ hadoop
posted: Dec 27, '11 at 6:31a
active: Dec 27, '11 at 11:13a
posts: 3
users: 2 (Kevin Burton: 2 posts, Arun C Murthy: 1 post)