Grokbase Groups Pig user May 2013
Thought I understood how to output to a single file, but it doesn't seem to be working. Anything I'm missing here?


-- Dedupe and store

rows = LOAD '$input';
unique = DISTINCT rows PARALLEL 1;

STORE unique INTO '$output';


  • Mike Sukmanowsky at May 1, 2013 at 5:17 pm
    How many output files are you getting? You can add SET DEFAULT_PARALLEL 1;
    at the top of your script so you don't have to specify parallelism on each
    reduce phase.

    In general, though, I wouldn't recommend forcing your output into one file
    (parallelism is good). Just write a shell/python/ruby/perl script that
    concatenates the part files after the full job executes. (Sketches of both
    approaches follow after the thread.)

    --
    Mike Sukmanowsky

    Product Lead, http://parse.ly
    989 Avenue of the Americas, 3rd Floor
    New York, NY 10018
    p: +1 (416) 953-4248
    e: mike@parsely.com
  • Mark at May 1, 2013 at 5:21 pm
    What I'm doing: at the end of each day I dedupe all of my log files and store them in LZO format in an archive directory. I thought that since LZO is splittable and Hadoop likes larger files, this would be best. Is this not the case?

    And to answer your question, there seem to be two files around 800 MB in size.
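A minimal sketch of the first suggestion above: set a script-wide default instead of a per-statement PARALLEL clause. The '$input'/'$output' placeholders and the schema-less LOAD are carried over from Mark's script; the rest is an untested sketch, not a verified configuration.

-- Force one reducer for every reduce phase in the script, so the DISTINCT
-- (a reduce-side operation) writes a single part file.
SET default_parallel 1;

rows   = LOAD '$input';
unique = DISTINCT rows;          -- no per-statement PARALLEL clause needed
STORE unique INTO '$output';     -- expect a single part file under $output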

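And a sketch of the second suggestion: leave the parallelism alone and concatenate the part files once the job has finished. These are Pig's Grunt fs commands (thin wrappers around hadoop fs); both paths are placeholders, and getmerge stages the concatenated file on local disk before copyFromLocal pushes it back to HDFS.

-- Run these after the dedupe script has completed (e.g. from the Grunt
-- shell or a separate invocation), since Pig may execute fs commands
-- before the STORE statements around them in the same batch script.
fs -getmerge /logs/deduped /tmp/deduped_merged
fs -copyFromLocal /tmp/deduped_merged /archive/deduped_merged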
Discussion Overview

group: user
categories: pig, hadoop
posted: May 1, 2013 at 4:52 PM
active: May 1, 2013 at 5:21 PM
posts: 3
users: 2 (Mark: 2 posts, Mike Sukmanowsky: 1 post)
website: pig.apache.org
