Well, if you set the split size to 1, you should get one split per line.
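
A minimal sketch of how that might look at the top of the script. It assumes a Hadoop 1.x-era cluster, where the maximum input split size is set through the `mapred.max.split.size` property (newer releases name it `mapreduce.input.fileinputformat.split.maxsize`). It also assumes that Pig's default combining of small splits has to be switched off for the smaller split size to take effect. The values are illustrative and have not been tested against this job:

```pig
-- Assumption: Hadoop 1.x property name; the value is in bytes.
-- With a 1-byte maximum, each line should end up in its own split,
-- since a record is read whole by the split in which it starts.
set mapred.max.split.size 1;

-- Assumption: Pig combines small splits by default, which would
-- otherwise merge the tiny splits back into a single mapper.
set pig.splitCombination false;
```

A 1-byte split size would create about as many splits as there are bytes in the file, and most of them would probably be empty. A value closer to the input size divided by the number of mappers wanted is likely a better trade-off; for a 1.2 MB file and 150 slots that is roughly 8192 bytes.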

2013/1/13 Dipesh Kumar Singh <>
Hello users,

I have an input file (1.2 MB) that contains a list of words/phrases, one
per line. I read each phrase and pass it to a UDF that checks/corrects
that phrase.
The UDF (it simply extends EvalFunc) reads a 6 MB dictionary file for
each input phrase.

Since the input dataset is very small, Pig launches only one mapper (out
of 150 slots) to process the input, so no parallelism is gained here.

I would like some input/suggestions on how this kind of scenario can be
implemented efficiently in Pig.

=====code snip====

register 'Dudfs.jar';
define CorrectPhrases CorrectPhrases('/user/home/big.txt');
input_term = load '/user/home/input.txt' using PigStorage('\n') as
checked_term = foreach input_term generate phrase, CorrectPhrases(phrase)
as correctedTerms;
store checked_term into '/user/home/corrected_phrases' using


Forgive me if I am heading in the wrong direction; feel free to correct me
and suggest your own approaches.

Thanks in advance!

Dipesh Kr. Singh

Best regards,
Vitalii Tymchyshyn

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jan 13, '13 at 12:48p
active: Jan 15, '13 at 6:23p


