Robin Kraft at Mar 19, 2013 at 3:21 pm
We're still evaluating how to handle vertical partitioning, updates, etc.
with pail and template taps, but in the meantime I wanted to run something
by the group. Based on some testing this morning, it seems like aggressive
vertical partitioning combined with using lots of mappers could cause
problems if the result is many thousands of files.

For example, I was trying to explode 60 million 300-period time series
records into individual periods and sink them by date to a pail - that
makes 18 billion records. I was impatient, so I used a cluster with 150
mappers, and the job quickly bogged down due to HDFS replication errors
<https://gist.github.com/robinkraft/5200606>.
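
For concreteness, here is a minimal Cascalog sketch of the kind of
explode-and-sink query described above. The tap names and record shape
(series-tap, out-tap, [id periods]) are hypothetical placeholders, not the
actual job; in the real job the output tap is the date-partitioned pail
that produced all the small files.

    (ns example.explode
      (:use cascalog.api))

    (defmapcatop explode-periods [periods]
      ;; emit one [date value] tuple per period of a multi-period record
      (for [[date value] periods]
        [date value]))

    (defn explode-query [series-tap out-tap]
      ;; 60M records x 300 periods = 18B output tuples, partitioned by
      ;; ?date at the sink (pail / template tap) in the real job
      (?- out-tap
          (<- [?id ?date ?value]
              (series-tap ?id ?periods)
              (explode-periods ?periods :> ?date ?value))))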

Could this be another form of the infamous small files problem
<http://blog.cloudera.com/blog/2009/02/the-small-files-problem/>?
One hundred fifty mappers each trying to create 300 files (one per period)
adds 45k files to HDFS almost simultaneously, I think. These are fairly
small files too, 4 MB max at the end of the job.

Does anyone have a rule of thumb for what a large number of files actually
is? I can imagine that for a truly massive data set, even more aggressive
partitioning (e.g. by day) using a larger cluster could be desirable. But
would it be feasible given potential replication errors? Or is there
something else going on here?

-Robin

p.s. FYI it seems that the size of a single file in a pail is limited to 1 GB
<https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/pail/PailOutputFormat.java#L23>.
I'm not sure where I got the idea that there's only 1 record per file prior
to pail consolidation (mentioned yesterday
<https://groups.google.com/forum/?fromgroups=#!topic/cascalog-user/mOpDwxIcRMM>).
My bad.
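
In case it helps, a hedged sketch of kicking off that consolidation from
Clojure via Java interop with dfs-datastores; the path is a placeholder,
and this assumes the Pail class's no-arg consolidate() merges the many
small part files up to its default target size:

    (import '[com.backtype.hadoop.pail Pail])

    (defn consolidate-pail! [path]
      ;; merge a pail's many small part files into fewer, larger ones;
      ;; assumes Pail's no-arg consolidate() targets its default max size
      (doto (Pail. path)
        (.consolidate)))

    ;; e.g. (consolidate-pail! "/data/exploded-pail")  ; hypothetical path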



  • Sam Ritchie at Mar 20, 2013 at 6:44 pm
    It's usually much, much cheaper to just store everything together with
    minimal vertical partitioning (maybe by dataset alone), read everything
    in, and filter out the items you don't want. This all happens in the
    mappers and shouldn't add much overhead. It'll probably save you time
    in the long run, since you'll need one mapper per tiny, tiny file if
    you try to split everything up.

    The rule of thumb for files is that you want a file to be roughly the
    size of a block on HDFS - 64 MB by default on older Hadoop releases,
    128 MB on newer ones.
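
    A minimal sketch of that read-everything-and-filter approach, assuming
    the same kind of hypothetical taps as in the original post; the
    predicate is an ordinary Clojure function, so the discard happens
    map-side with no extra partitioning:

        (use 'cascalog.api)

        (defn wanted-date? [date]
          ;; hypothetical predicate: keep only the dates this job cares about
          (= date "2013-03-19"))

        (defn filtered-query [series-tap out-tap]
          ;; a plain Clojure function acts as a filter in the query, and
          ;; the unwanted tuples are dropped in the mappers
          (?- out-tap
              (<- [?id ?date ?value]
                  (series-tap ?id ?date ?value)
                  (wanted-date? ?date))))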
    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie

