Grokbase Groups Pig user August 2010
FAQ
What loader should I use for CSV files with quoted strings that contain
embedded commas? (That is, embedded commas should not be treated as separators.)

And when LOADing large files in local mode, does Pig just store it all
in memory? Or does it have memory management akin to the buffer managers
in DBMSs?


  • Jeff Zhang at Aug 19, 2010 at 8:50 am
    I'm afraid you will need to write your own LoadFunc to interpret the text.
    As of Pig 0.7, local mode uses Hadoop's standalone local mode, so it
    won't store all the data in memory; the data is read in a streaming
    fashion. However, this mode needs more memory overall, because each task
    is executed in a separate JVM.


    --
    Best Regards

    Jeff Zhang
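    For illustration, such a custom loader would be written in Java by
    extending org.apache.pig.LoadFunc and splitting each line only on commas
    that fall outside double quotes; it is then registered and used like any
    other loader. A minimal sketch, assuming the hypothetical names
    myloaders.jar and com.example.QuotedCsvLoader:

        -- Register the jar that contains the (hypothetical) custom loader.
        REGISTER myloaders.jar;

        -- Load the CSV; the custom LoadFunc keeps quoted, comma-containing
        -- fields intact instead of splitting on every comma.
        data = LOAD 'input/data.csv' USING com.example.QuotedCsvLoader()
               AS (id:int, name:chararray, note:chararray);

        DUMP data;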
  • Defenestrator at Aug 20, 2010 at 6:42 am
    Thanks, Jeff.

    A quick follow-up question on loading/storing data: what is the best
    practice when dealing with multiple relations with many tuples? Do people
    typically STORE intermediate relations to minimize memory usage and then
    re-LOAD the intermediate data for use later in the same script? I ask
    because I noticed that tuples written out via TupleFormat are text wrapped
    in parentheses, so a subsequent PigStorage LOAD would pick up the extra
    parenthesis characters, right?
  • Jeff Zhang at Aug 20, 2010 at 7:06 am
    What do you mean by "multiple relations with many tuples"? Do you mean
    joining multiple data sets?
    Pig uses BinStorage for storing intermediate data.


    --
    Best Regards

    Jeff Zhang
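    If you do want to materialize an intermediate relation yourself and read
    it back later in the same script, one option is to STORE it and re-LOAD it
    with BinStorage, which round-trips tuples in Pig's binary format and
    avoids the extra-parenthesis problem of re-parsing text output. A minimal
    sketch (paths, relation names, and fields are illustrative assumptions):

        raw      = LOAD 'input/data' USING PigStorage(',') AS (id:int, amount:double);
        filtered = FILTER raw BY amount > 0.0;

        -- Write the intermediate relation in Pig's binary tuple format.
        STORE filtered INTO 'tmp/filtered' USING BinStorage();

        -- Read it back later; BinStorage deserializes directly to tuples,
        -- so no CSV/text parsing is involved.
        reloaded = LOAD 'tmp/filtered' USING BinStorage() AS (id:int, amount:double);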
  • Defenestrator at Aug 20, 2010 at 7:36 am
    Right, in cases where you have to load multiple large relations and then do
    some processing on each relation (filtering, aggregation) before joining
    them together. One wouldn't want to have all of the relations and
    intermediate state in memory before the join.

    So does BinStorage just store the tuples in an internal binary format that
    is easily converted back to a Tuple when loaded (i.e. no CSV parsing
    necessary)?

    Thanks.
  • Jeff Zhang at Aug 20, 2010 at 7:40 am
    Actually, the intermediate data won't be stored in memory. It will be
    stored in a tmp directory on HDFS, and Pig will clean up the
    intermediate data for you when the job is finished.

    Yes, BinStorage is a binary format for storing intermediate data, and it
    knows how to deserialize the data back into tuples.

    --
    Best Regards

    Jeff Zhang
  • Thejas M Nair at Aug 20, 2010 at 2:26 pm
    To clarify what Jeff said, in your case the intermediate data before the
    join will be stored to disk only if the operations before the join require
    a separate map-reduce job.
    If the operations between the load and the join are non-blocking, such as
    a filter or foreach, then the data will be streamed through them and won't
    need to be stored on disk.
    -Thejas
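    For example, in a script shaped like the one below (paths and fields are
    illustrative assumptions), the FILTER and FOREACH are non-blocking, so
    rows stream from the two LOADs straight into the JOIN within a single
    map-reduce job; only a blocking operation before the join, such as a GROUP
    plus an aggregate, would force a separate job whose output is first
    written to a temporary directory on HDFS:

        a = LOAD 'input/left'  USING PigStorage(',') AS (id:int, x:chararray);
        b = LOAD 'input/right' USING PigStorage(',') AS (id:int, y:chararray);

        -- Non-blocking operators: rows are filtered/projected on the fly and
        -- are not materialized before the join.
        a_clean = FILTER a BY id > 0;
        b_clean = FOREACH b GENERATE id, y;

        joined = JOIN a_clean BY id, b_clean BY id;
        STORE joined INTO 'output/joined' USING PigStorage(',');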



Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 19, '10 at 7:49a
active: Aug 20, '10 at 2:26p
posts: 7
users: 3
website: pig.apache.org
