You know the story:
You have data files that are created every 5 minutes.
You have hundreds of servers.
You want to put those files in Hadoop.
You get lots of files and blocks.
Your namenode and secondary namenode need more memory (and JVMs have
issues at large Xmx values).
Your MapReduce jobs start launching too many tasks.
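The namenode memory pressure above can be put in rough numbers. This is a
back-of-the-envelope sketch, assuming the commonly cited figure of roughly
150 bytes of namenode heap per namespace object and one file plus one block
object per small file; the server and file counts are illustrative, not from
the text:

```python
def namenode_heap_bytes(files_per_server_per_day, servers, days,
                        bytes_per_object=150, objects_per_file=2):
    """Rough namenode heap estimate for accumulated small files.

    Assumptions (illustrative, not exact): ~150 bytes of heap per
    namespace object, and each small file costs one file object plus
    one block object since it fits in a single dfs block.
    """
    total_files = files_per_server_per_day * servers * days
    return total_files * objects_per_file * bytes_per_object

# A file every 5 minutes is 288 files per server per day; with a few
# hundred servers the object count climbs quickly.
print(namenode_heap_bytes(288, 300, 30) / (1024 ** 2), "MiB after 30 days")
```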
Hadoop File Crusher
How does it work?
Hadoop File Crusher uses MapReduce to combine multiple smaller files into a
single larger one.
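The core idea can be sketched in a few lines. This is a local-filesystem
stand-in for what the real tool does as a MapReduce job over HDFS (the
`crush` helper name is hypothetical, and unlike the real tool this version
just appends raw bytes rather than preserving record formats):

```python
import os

def crush(input_dir, output_path):
    """Concatenate every file in input_dir into one larger output file.

    Illustrative only: the real File Crusher runs this as a MapReduce
    job against HDFS and is format-aware; this sketch shows the basic
    many-small-files -> one-large-file transformation.
    """
    with open(output_path, "wb") as out:
        for name in sorted(os.listdir(input_dir)):
            with open(os.path.join(input_dir, name), "rb") as f:
                out.write(f.read())
```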
What was the deal with v1?
V1 was great. It happily crushed files, although some datasets presented
challenges. For example, consider the case where one partition of a Hive
table was very large and the others were smaller. V1 would allocate one
reducer per folder, so the job would run as long as the biggest folder took.
Also, v1 ALWAYS created one file per directory, which is not optimal if a
directory already had, say, two largish files and crushing was not necessary.
How does v2 deal with this better?
V2 is more intelligent in its job planning. It has tunable parameters which
define which files are too small to crush, such as:
Percent threshold relative to the dfs block size under which a file
becomes eligible for crushing. Must be in the range (0, 1]. Default is
0.75, which means files smaller than or equal to 75% of a dfs block will
be eligible for crushing. Files greater than 75% of a dfs block will be
left as they are.
The maximum number of dfs blocks per output file. Must be a positive
integer. Small input files are associated with an output file under
the assumption that input and output compression codecs have similar
efficiency. Also, a directory containing a lot of data in many small
files will be converted into a directory containing a smaller number
of large files rather than one super-massive file. With the default
value of 8, 80 small files, each being 1/10th of a dfs block, will be
grouped into a single output file since 80 * 1/10 = 8 dfs blocks. If
there are 81 small files, each being 1/10th of a dfs block, two output
files will be created. One output file will contain the combined
contents of 41 files and the second will contain the combined contents
of the other 40. A directory of many small files will be converted
into a smaller number of larger files where each output file is
roughly the same size.
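The two parameters above can be combined into a small planning sketch. This
is a hypothetical `plan_crush` helper, not the tool's real API: it filters
files by the eligibility threshold, works out how many output files are
needed so none exceeds the block cap, and splits the eligible files evenly
(matching the 80-file and 81-file examples above):

```python
import math

def plan_crush(file_sizes, block_size, threshold=0.75, max_file_blocks=8):
    """Sketch of v2-style job planning (illustrative names, not the real API).

    Returns (eligible, groups): the files small enough to crush, and those
    files partitioned into groups whose combined size stays at or under
    max_file_blocks dfs blocks each, split as evenly as possible.
    """
    # Files at or below threshold * block_size are eligible for crushing.
    eligible = [s for s in file_sizes if s <= threshold * block_size]
    if not eligible:
        return eligible, []
    total_blocks = sum(eligible) / block_size
    # Enough output files so each holds at most max_file_blocks of data.
    n_out = math.ceil(total_blocks / max_file_blocks)
    # Distribute eligible files evenly across the output files.
    groups = [eligible[i::n_out] for i in range(n_out)]
    return eligible, groups
```

With a block size of 10 and 80 files of size 1 (each 1/10th of a block),
this yields a single group of 80; with 81 such files it yields two groups
of 41 and 40, as described above.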
Why is file crushing worth doing?
You cannot always control how many files are generated by upstream processes.
The namenode has file and block count constraints.
Jobs have less overhead with fewer files and run MUCH faster.
Usage documentation is found here: