Grokbase Groups Pig user January 2013
Hello users,

I have an input file (1.2 MB) that contains a list of words/phrases, one per
line. I read each phrase and pass it to a UDF that checks and corrects it.
The UDF (a simple class extending EvalFunc) reads a 6 MB dictionary file
for each input phrase.

Since the input dataset is very small, Pig launches only one mapper (out
of 150 slots) to process it, so no parallelism is gained.

I would like some suggestions on how this kind of scenario can be
implemented efficiently in Pig.
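
One direction that might apply here, as a rough sketch rather than a confirmed fix: ask for smaller input splits so that the 1.2 MB file is spread over several mappers. The property names below are assumptions tied to Hadoop 1.x-era setups (newer releases name the split-size property mapreduce.input.fileinputformat.split.maxsize), and the 64 KB value is purely illustrative.

```pig
-- Sketch only: property names assume a Hadoop 1.x-era cluster.
-- Stop Pig from merging small splits back into one combined split.
set pig.splitCombination false;
-- Cap each input split at 64 KB (illustrative value), so a 1.2 MB file
-- would be divided into roughly 20 splits, one map task per split.
set mapred.max.split.size 65536;
```

These statements would go at the top of the script, before the load. Each mapper would still need the 6 MB dictionary, so it would likely help if the UDF loaded it once per task instead of once per phrase.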

=====code snip====

register 'Dudfs.jar';
define CorrectPhrases CorrectPhrases('/user/home/big.txt');
input_term = load '/user/home/input.txt' using PigStorage('\n') as
checked_term = foreach input_term generate phrase, CorrectPhrases(phrase)
as correctedTerms;
store checked_term into '/user/home/corrected_phrases' using


Forgive me if I am heading in the wrong direction; feel free to correct me
and suggest other approaches.

Thanks in advance!

Dipesh Kr. Singh

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jan 13, '13 at 12:48p
active: Jan 15, '13 at 6:23p


