It's usually much, much cheaper to just store everything together with
minimal vertical partitioning (maybe by dataset alone), read everything
in, and filter out the items you don't want. That filtering all happens
in the mappers and shouldn't add much overhead. It'll probably save you
time in the long run, since you'll need one mapper per tiny, tiny file
if you try to split everything up.
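
A minimal sketch of that pattern in Cascalog, assuming three-field
records, made-up paths and dataset names, and the 1.x-era deffilterop
API:

(ns example.coarse-partition
  (:use cascalog.api))

;; Filter predicate; Cascalog applies it map-side, so dropping
;; unwanted records costs little.
(deffilterop keep-dataset? [dataset]
  (= dataset "time-series"))

(defn filtered-query []
  ;; One coarse, dataset-level source instead of thousands of tiny splits.
  (let [src (hfs-seqfile "/data/master")]
    (?<- (hfs-seqfile "/tmp/filtered" :sinkmode :replace)
         [?dataset ?date ?val]
         (src ?dataset ?date ?val)
         (keep-dataset? ?dataset))))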

The rule of thumb for files is that you want a file to be roughly the
size of a block on HDFS. By default this is 128MB, I think.
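
And if a pail does end up full of small files, dfs-datastores can merge
them back up toward block-sized files; a sketch via Clojure interop,
assuming the plain Pail#consolidate() entry point and a placeholder path:

(ns example.consolidate
  (:import [com.backtype.hadoop.pail Pail]))

;; Open an existing pail and merge its many small files into larger ones.
(defn consolidate-pail [path]
  (.consolidate (Pail. path)))

(comment
  (consolidate-pail "/data/master/time-series"))
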
Robin Kraft March 19, 2013 3:21 PM
We're still evaluating how to handle vertical partitioning, updates,
etc. with pail and template taps, but in the meantime I wanted to run
something by the group. Based on some testing this morning, it seems
like aggressive vertical partitioning combined with using lots of
mappers could cause problems if the result is many thousands of files.

For example, I was trying to explode 60 million 300-period time-series
records into individual periods and sink them by date to a pail
- that makes 18 billion records. I was impatient, so I used a cluster
with 150 mappers, and the job quickly bogged down due to HDFS
replication errors <https://gist.github.com/robinkraft/5200606>.
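
For reference, the explode step looks roughly like this in Cascalog;
the record shape (an id plus a seq of [date value] pairs) is a
simplified stand-in for the real schema:

(ns example.explode
  (:use cascalog.api))

;; Emit one tuple per period from a single 300-period series record.
(defmapcatop explode-series [periods]
  (for [[date value] periods]
    [date value]))

(defn exploded [src]
  (<- [?id ?date ?value]
      (src ?id ?periods)
      (explode-series ?periods :> ?date ?value)))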

Could this be another form of the infamous small files problem
<http://blog.cloudera.com/blog/2009/02/the-small-files-problem/>? One
hundred fifty mappers each trying to create 300 files (one per period)
adds 45k files to HDFS almost simultaneously, I think. These are fairly
small files too, 4MB max at the end of the job.
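
Back-of-the-envelope, using the numbers above plus the roughly 150
bytes per file/block object that the linked Cloudera post cites for
NameNode memory:

(let [files        (* 150 300)          ; 45,000 files
      total-mb     (* files 4)          ; <= ~176 GB if every file hits 4MB
      block-files  (quot total-mb 128)  ; ~1,400 files at 128MB each
      nn-bytes     (* files 150)]       ; ~6.75 MB of NameNode heap for this job alone
  [files total-mb block-files nn-bytes])

So the same data would fit in roughly 1,400 block-sized files instead
of 45,000 tiny ones.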

Does anyone have a rule of thumb for what a large number of files
actually is? I can imagine that for a truly massive data set, even
more aggressive partitioning (e.g. by day) using a larger cluster
could be desirable. But would it be feasible given potential
replication errors? Or is there something else going on here?

-Robin

p.s. FYI it seems that the size of a single file in a pail is limited
to 1GB
<https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/pail/PailOutputFormat.java#L23>.
I'm not sure where I got the idea that there's only 1 record per file
prior to pail consolidation (mentioned yesterday
<https://groups.google.com/forum/?fromgroups=#%21topic/cascalog-user/mOpDwxIcRMM>).
My bad.
--
Sam Ritchie, Twitter Inc
703.662.1337
@sritchie
