Grokbase Groups Pig user January 2013
Hello users,

I have an input file (1.2 MB) that contains one word or phrase per line.
I read each line and pass the phrase to a UDF that checks and corrects it.
The UDF (a simple class extending EvalFunc) reads a 6 MB dictionary file
for each input phrase.

Since the input dataset is very small, Pig launches only one mapper (out
of 150 slots) to process it, so no parallelism is gained.

I would like some suggestions on how this kind of scenario is implemented
efficiently in Pig.

=====code snip====

-- Register the UDF jar and bind the dictionary path to the UDF constructor.
register 'Dudfs.jar';
define CorrectPhrases CorrectPhrases('/user/home/big.txt');
-- Load one phrase per line.
input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
-- Run the correction UDF on every phrase.
checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;
store checked_term into '/user/home/corrected_phrases' using PigStorage(',');

===================================

Forgive me if I am heading in the wrong direction; feel free to correct me
and suggest other approaches.

Thanks in advance!


Regards,
Dipesh
--
Dipesh Kr. Singh
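
The single mapper described above is consistent with how Hadoop's file-based input formats size their splits. The sketch below is not Pig or Hadoop code. It only reproduces the arithmetic, on the assumption that the split size is max(min size, min(max size, block size)) and that the last split may run up to 10% over. The 64 MB block size and the 64 KB cap are illustrative assumptions, and the class and method names are hypothetical.

```java
public class SplitEstimate {
    // Slack factor: the final split may be up to 10% larger than the nominal
    // split size (assumed here to mirror Hadoop's FileInputFormat behaviour).
    static final double SPLIT_SLOP = 1.1;

    // Nominal split size: the block size, clamped by the configured min and max.
    public static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and so map tasks, before any split combination)
    // produced for a single file of the given size.
    public static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 1_258_291L;           // about 1.2 MB, as in the post
        long blockSize = 64L * 1024 * 1024;   // assumed HDFS block size

        long defaultSize = splitSize(blockSize, 1L, Long.MAX_VALUE);
        long cappedSize = splitSize(blockSize, 1L, 64L * 1024); // assumed 64 KB cap

        System.out.println("default settings: " + countSplits(fileSize, defaultSize));
        System.out.println("64 KB max split:  " + countSplits(fileSize, cappedSize));
    }
}
```

Under these assumptions the 1.2 MB file yields one split with the default settings and 20 splits with a 64 KB cap. Pig can also combine small splits back together (see the pig.splitCombination and pig.maxCombinedSplitSize properties), so the mapper count observed on a cluster may differ.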


  • Dmitriy Ryaboy at Jan 13, 2013 at 10:54 pm
    "The udf (simple extends eval func) refers and reads a dictionary file of 6
    MB for each input phrase."

    Any reason to keep re-reading the dictionary instead of just reading it
    once?

    D

  • Vitalii Tymchyshyn at Jan 14, 2013 at 10:23 am
    Well, if you set the split size to 1, you should get one split per line.



    --
    Best regards,
    Vitalii Tymchyshyn
  • Dipesh Kumar Singh at Jan 15, 2013 at 6:23 pm
    Thanks, Dmitriy and Vitalii!

    I am able to control the number of mappers by setting the split size. And
    yes, there is no reason to re-read the dictionary, except that I was
    porting existing code. I will re-implement the UDF to read it once and
    check the performance.

    Regards,
    Dipesh

    --
    Dipesh Kr. Singh
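
The "read it once" change discussed in this thread can be sketched as lazy initialization: load the dictionary on the first call and reuse the in-memory set afterwards. The example is plain Java with no Pig dependency so that it runs on its own. PhraseChecker, isKnown and getLoadCount are hypothetical names, and the dictionary is assumed to be a local file with one entry per line. In a real UDF the same pattern would presumably live inside the exec method of the EvalFunc subclass, with the file read from HDFS instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class PhraseChecker {
    private final Path dictionaryPath;
    private Set<String> dictionary; // stays null until the first lookup
    private int loadCount = 0;      // counts how often the file was actually read

    public PhraseChecker(String dictionaryPath) {
        this.dictionaryPath = Paths.get(dictionaryPath);
    }

    // Reads the dictionary file only if it has not been read yet.
    private void ensureLoaded() throws IOException {
        if (dictionary != null) {
            return;
        }
        Set<String> words = new HashSet<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(dictionaryPath, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        dictionary = words;
        loadCount++;
    }

    // Called once per input phrase; only the first call touches the file.
    public boolean isKnown(String phrase) throws IOException {
        ensureLoaded();
        return phrase != null
            && dictionary.contains(phrase.trim().toLowerCase(Locale.ROOT));
    }

    public int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) throws IOException {
        // Tiny stand-in dictionary written to a temporary file.
        Path tmp = Files.createTempFile("dictionary", ".txt");
        Files.write(tmp, Arrays.asList("hello", "world"), StandardCharsets.UTF_8);

        PhraseChecker checker = new PhraseChecker(tmp.toString());
        for (String phrase : new String[] {"Hello", "wrold", "world"}) {
            System.out.println(phrase + " -> " + checker.isKnown(phrase));
        }
        System.out.println("dictionary loads: " + checker.getLoadCount());
        Files.delete(tmp);
    }
}
```

With this structure the cost of reading the 6 MB file is paid once per UDF instance rather than once per phrase, and each lookup becomes a hash-set probe.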

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jan 13, '13 at 12:48p
active: Jan 15, '13 at 6:23p
posts: 4
users: 3
website: pig.apache.org
