with pail and template taps, but in the meantime I wanted to run something
by the group. Based on some testing this morning, it seems like aggressive
vertical partitioning combined with using lots of mappers could cause
problems if the result is many thousands of files.
For example, I was trying to explode 60 million 300-period time-series
records into individual periods and sink them by date to a pail - that
makes 18 billion records. I was impatient, so I used a cluster with 150
mappers, and the job quickly bogged down due to HDFS replication errors <https://gist.github.com/robinkraft/5200606>.
Could this be another form of the infamous small files problem<http://blog.cloudera.com/blog/2009/02/the-small-files-problem/>?
One hundred fifty mappers each trying to create 300 files (one per period)
adds 45k files to HDFS almost simultaneously, I think. These are fairly
small files too, 4 MB max at the end of the job.
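To make the arithmetic explicit (these are just the numbers from the job described above, not anything measured on the cluster):

```python
# Back-of-envelope numbers for the job described above.
records = 60_000_000          # input time-series records
periods = 300                 # periods per record
exploded = records * periods  # individual period records after the explode
print(f"{exploded:,}")        # 18,000,000,000

mappers = 150
files_per_mapper = periods    # each mapper opens one output file per period
files = mappers * files_per_mapper
print(f"{files:,}")           # 45,000 files created near-simultaneously
```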
Does anyone have a rule of thumb for what a large number of files actually
is? I can imagine that for a truly massive data set, even more aggressive
partitioning (e.g. by day) using a larger cluster could be desirable. But
would it be feasible given potential replication errors? Or is there
something else going on here?
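One rough yardstick, taken from the Cloudera post linked above (so treat the ~150-bytes-per-namespace-object figure as their estimate, not mine): each file and each block costs on the order of 150 bytes of NameNode heap. A quick sketch of what 45k small files costs by that measure:

```python
# Rough NameNode heap cost, assuming ~150 bytes per namespace object
# (file or block), per the Cloudera small-files post.
BYTES_PER_OBJECT = 150
files = 45_000

# Each small file (under one block) contributes one file object plus one
# block object; replicas live on DataNodes and don't add namespace objects.
objects = files * 2
heap_mb = objects * BYTES_PER_OBJECT / 1e6
print(f"{heap_mb} MB")  # 13.5 MB
```

By that measure 45k files is trivial for the NameNode's memory, which makes me suspect the bog-down here is the burst of simultaneous creates and replication pipelines rather than namespace size - but I'd welcome correction.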
p.s. FYI it seems that the size of a single file in a pail is limited to 1gb<https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/pail/PailOutputFormat.java#L23>.
I'm not sure where I got the idea that there's only 1 record per file prior
to pail consolidation (mentioned yesterday<https://groups.google.com/forum/?fromgroups=#!topic/cascalog-user/mOpDwxIcRMM>).