Hi all,

I have a defbufferop that transforms input tuples into a SQL insert
statement (one big string) that will be used elsewhere. One group of tuples,
however, contains 1.6 million records, so the resulting insert statement
takes up about 125 MB on disk. When we try to run this, the reduce step
fails with a Java heap space OutOfMemoryError. There's an issue about it here:

https://github.com/MapofLife/fossa/issues/24

This is obviously an extreme case, but I wonder if anyone has a rule of
thumb for how large an output tuple can be. I've experimented with the
simple case below and was able to push out a 9 MB tuple (10 million "a"s),
but not a 45 MB tuple (50 million).

Whatever the limit, we'll probably end up having to process groups above it
separately, and it would be nice to keep the number of those groups to a
minimum (i.e. we want to get as close to the limit as possible!).

-Robin

(let [n   10000000
      src [[(apply str (repeat n "a"))]]]
  (?<- (hfs-seqfile "/tmp/yo" :sinkmode :replace)
       [?a]
       (src ?a)))

  • Sam Ritchie at Nov 15, 2012 at 9:18 pm
    Robin, the defbufferop can emit multiple tuples -- why not chunk up some of
    those huge statements into smaller insert statements?
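
    Something like this is what I have in mind -- an untested sketch, where
    `make-insert-stmt` is just a placeholder for however you turn a batch of
    tuples into one INSERT, the chunk size is arbitrary, and `records` stands
    for your source of species/x/y tuples:

    (defbufferop chunked-inserts
      [tuples]
      ;; emit one smaller INSERT per 10k-row batch instead of one giant statement
      (for [chunk (partition-all 10000 tuples)]
        [(make-insert-stmt chunk)]))

    (?<- (hfs-textline "/tmp/inserts" :sinkmode :replace)
         [?species ?insert-stmt]
         (records ?species ?x ?y)
         (chunked-inserts ?x ?y :> ?insert-stmt))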

    Better yet, you could use the JDBC tap in Maple and just sink directly into
    the SQL db, vs doing this intermediate business:

    https://github.com/Cascading/maple/blob/develop/src/jvm/com/twitter/maple/jdbc/JDBCTap.java

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie

    (Too brief? Here's why! http://emailcharter.org)
  • Robin Kraft at Nov 15, 2012 at 9:53 pm
    Hey Sam,

    In this case we don't have JDBC access, and the insert statements are just
    big by nature: we're creating location geometries a la "MULTIPOINT((3
    4, -180 90, 0 0))" for 375 million species records. Fortunately the most
    common species has only 1.6 million records, so that's 1.6 million x/y
    pairs plus metadata.

    That said, your message made me realize that we could add the geometry
    independently of the metadata fields, which would bring the size down
    dramatically. We'd still be looking at some large strings though, so some
    idea of a max tuple size would be helpful.
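
    For example, an untested sketch of what I'm thinking (the op and field
    names are made up) would build just the geometry string per species, with
    the metadata handled in a separate, much smaller query:

    (defbufferop build-multipoint
      [xy-tuples]
      ;; emit a single geometry-only WKT string per group
      [[(str "MULTIPOINT(("
             (apply str (interpose ", " (map (fn [[x y]] (str x " " y)) xy-tuples)))
             "))")]])

    (?<- (hfs-textline "/tmp/geoms" :sinkmode :replace)
         [?species ?wkt]
         (records ?species ?x ?y)
         (build-multipoint ?x ?y :> ?wkt))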

    I'll report back here what we find, but I'm still hopeful about hints from
    others!

    -Robin

  • Nick Dimiduk at Nov 20, 2012 at 7:29 pm
    Hi Robin,

    I assume your eventual target is PostGIS? You may be able to save yourself
    some encoding size if you represent your geometries as HEXEWKB [0] instead.
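
    A rough, untested sketch of what that could look like with JTS on the
    classpath (the SRID and function name here are just for illustration):

    (import '(com.vividsolutions.jts.geom GeometryFactory Coordinate)
            '(com.vividsolutions.jts.io WKBWriter))

    (defn xys->hexewkb
      "Hex-encode an EWKB MULTIPOINT built from a seq of [x y] pairs."
      [xy-pairs]
      (let [factory (GeometryFactory.)
            coords  (into-array Coordinate
                                (map (fn [[x y]] (Coordinate. (double x) (double y)))
                                     xy-pairs))
            geom    (doto (.createMultiPoint factory coords)
                      (.setSRID 4326))
            writer  (WKBWriter. 2 true)]  ; 2D output, include SRID -> EWKB
        (WKBWriter/toHex (.write writer geom))))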

    Which process is crashing with the OOM -- is it the reduce task or the
    tasktracker? Remember, these are separate processes and their memory is
    configured independently. Something like this reference [1] might be useful.
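
    If it's the reduce task itself, one knob to try (assuming Hadoop 1.x-style
    property names; the heap value is just an example) is bumping the child
    JVM heap for the query:

    (with-job-conf {"mapred.child.java.opts" "-Xmx2048m"}
      (let [n   50000000
            src [[(apply str (repeat n "a"))]]]
        (?<- (hfs-seqfile "/tmp/yo" :sinkmode :replace)
             [?a]
             (src ?a))))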

    Good luck,
    Nick

    [0] http://postgis.refractions.net/documentation/manual-1.5/ch04.html#EWKB_EWKT
    [1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault.html
  • Robin Kraft at Nov 26, 2012 at 5:09 am
    Hey Nick,

    Yeah, it's PostGIS via CartoDB. We ended up splitting the insert statement
    into multiple update statements (thanks Sam!), and that's working fine with
    a very large test dataset. We'll run the full job tomorrow, and if we're
    still seeing problems, HEXEWKB could be the ticket. Thanks for the heads-up
    about it!

    -Robin

