Grokbase Groups Pig user July 2011
Merging: it doesn't actually speed things up all that much; it reduces
load on the NameNode and speeds up job initialization somewhat. You
don't have to do it on the NameNode itself, nor do you have to do the
copying there. In fact, don't run anything but the NameNode process on
that machine.

Pig can transparently combine small input files into larger splits, so
you won't be stuck with 11K mappers.
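A minimal sketch of how that combining can be tuned, assuming Pig 0.8+
(the 128 MB target and the load path are illustrative, not from this
thread):

```pig
-- Sketch, assuming Pig 0.8+: combine many small input files into fewer
-- map tasks instead of pre-merging them on disk.
set pig.splitCombination true;           -- on by default in recent releases
set pig.maxCombinedSplitSize 134217728;  -- target ~128 MB per split (illustrative)
raw = LOAD '/data/app1/' USING PigStorage(',');  -- hypothetical input directory
```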

Don't copy to local and then run the SQL loader. Use Sqoop export, and
load directly from Hadoop.
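A sketch of what that Sqoop export could look like; the connection
string, credentials, table, and directory are all hypothetical
placeholders, not details from this thread:

```shell
# Sketch: export the Pig output directory straight into the database,
# replacing the copy-to-local + SQL-loader steps. All names are placeholders.
sqoop export \
  --connect jdbc:mysql://dbhost/warehouse \
  --username etl -P \
  --table agg_results \
  --export-dir /user/etl/pig-output \
  -m 4    # number of parallel export map tasks
```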

You cannot append to a file that already exists in the cluster; this
will be available in one of the coming Hadoop releases. You can
certainly create a new file in a directory, and load the whole
directory.

On Sat, Jul 16, 2011 at 1:17 AM, jagaran das wrote:


Due to requirements in our current production CDH3 cluster, we need to copy around 11,520 small files (12 GB total) to the cluster for one application.
We have 20 applications like this that would run in parallel.

So one set would have 11,520 files with a total size of 12 GB.
We would have 15 such sets in parallel.

The total SLA for the pipeline, from copy through Pig aggregation to copy-to-local and SQL load, is 15 minutes.

What we do:

1. Merge files, so that we get rid of small files. - A huge time hit; do we have any other option?
2. Copy to the cluster
3. Execute the Pig job
4. Copy to local
5. SQL loader
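The merge step (1) doesn't need Hadoop at all and can run on any edge
machine before the upload. A minimal POSIX-shell sketch, assuming the
small files sit in one local directory; the function name, paths, and
the 128 MB chunk size are all illustrative:

```shell
# merge_chunks SRC OUT CHUNK_BYTES: concatenate the files in SRC into
# chunk files of roughly CHUNK_BYTES each, ready for one hadoop fs -put.
merge_chunks() {
  src=$1 out=$2 chunk=$3
  mkdir -p "$out"
  i=0 size=0
  for f in "$src"/*; do
    [ -f "$f" ] || continue                  # skip subdirectories
    if [ "$size" -ge "$chunk" ]; then        # current chunk is full:
      i=$((i + 1)); size=0                   # start the next one
    fi
    cat "$f" >> "$out/chunk-$i"
    size=$((size + $(wc -c < "$f")))
  done
}

# Usage (paths hypothetical):
#   merge_chunks ./incoming ./merged $((128 * 1024 * 1024))
#   hadoop fs -put ./merged/chunk-* /data/app1/
```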

Can we perform the merge and the copy to the cluster from a host other than the NameNode?
We want an out-of-cluster machine running a Java process that would:
1. Run periodically
2. Merge files
3. Copy to the cluster

Can we append to an existing file in the cluster?

Please share your thoughts, as maintaining the SLA is becoming tough.


Discussion Overview
group: user
categories: pig, hadoop
posted: Jul 15, '11 at 7:41p
active: Jul 17, '11 at 4:25a
