Grokbase Groups Pig user June 2011
Hi Guys, Can anyone please tell me how to read the explain plan in Pig? When I
do an explain plan for any of my Pig queries it gives me a really good flow
diagram, but it uses some Pig internals, so I didn't really understand what is going
on and what it means. Please let me know if there is any documentation for this.

- Jagaran


  • Dmitriy Ryaboy at Jun 10, 2011 at 8:00 pm
    Alan has a section on the explain plan in his upcoming book:

    http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html
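    In the meantime, a quick way to get oriented is to run EXPLAIN on a small
    script and read the output bottom-up. A minimal sketch (the load path and
    schema below are made up for illustration):

    ```pig
    -- Hypothetical input: one (user, bytes) record per line.
    raw   = LOAD 'input/logs' AS (user:chararray, bytes:long);
    grpd  = GROUP raw BY user;
    total = FOREACH grpd GENERATE group, SUM(raw.bytes);
    EXPLAIN total;  -- prints the plans for the alias 'total'
    ```

    EXPLAIN prints three plans in order: the logical plan (your operators after
    optimization), the physical plan (how each logical operator will execute),
    and the MapReduce plan (how the physical operators are packed into map and
    reduce phases). The MapReduce plan is usually the one to read when you want
    to know how many MR jobs a script will launch.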
  • Pradipta Kumar Dutta at Jun 14, 2011 at 6:43 pm
    Hi All,

    We have a requirement where we have to process the same set of data (in a Hadoop cluster) by running multiple Pig jobs simultaneously.

    Any idea whether this is possible in Pig?

    Thanks,
    Pradipta
  • Bill Graham at Jun 14, 2011 at 8:20 pm
    Yes, this is possible. Data in HDFS is immutable and MapReduce tasks are
    spawned in their own JVMs, so multiple concurrent jobs reading the same
    input data are fine.
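    Concretely, nothing special is needed on the Pig side; each script just
    LOADs the shared input and STOREs to its own output directory. A minimal
    sketch (paths and schema are hypothetical):

    ```pig
    -- jobA.pig: reads the shared input, writes to its own output directory.
    -- jobB.pig would LOAD the same '/data/events' path concurrently.
    events = LOAD '/data/events' AS (id:long, msg:chararray);
    -- ... job-specific processing here ...
    STORE events INTO '/output/jobA';
    ```

    You can launch them in parallel from a shell, e.g.
    `pig -f jobA.pig & pig -f jobB.pig & wait`. The one caveat is that each job
    must write to a distinct output directory, since Hadoop fails a job whose
    output path already exists.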
  • 勇胡 at Jun 15, 2011 at 10:59 am
    How should I understand "immutable"? I mean, does HDFS implement a lock
    mechanism to guarantee immutable data access when concurrent tasks process
    the same set of data, or does it use some other strategy?

    Thanks

    Yong

  • Nathan Bijnens at Jun 15, 2011 at 11:11 am
    Immutable means that after creation it cannot be modified.

    HDFS applications need a write-once-read-many access model for files. A file
    once created, written, and closed need not be changed. This assumption
    simplifies data coherency issues and enables high throughput data access. A
    MapReduce application or a web crawler application fits perfectly with this
    model. There is a plan to support appending-writes to files in the future.
    http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Simple+Coherency+Model

    Best regards,
    Nathan
    ---
    nathan@nathan.gs : http://nathan.gs : http://twitter.com/nathan_gs
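    In Pig terms, that write-once model means you never modify a dataset in
    place: you LOAD it, transform it, and STORE a new dataset, then point
    readers at the new location. A sketch (paths, schema, and the LOWER fix
    are all hypothetical):

    ```pig
    -- Instead of updating /data/users in place, write a corrected copy
    -- to a new directory and switch consumers over to it.
    users = LOAD '/data/users' AS (id:long, name:chararray, email:chararray);
    fixed = FOREACH users GENERATE id, name, LOWER(email) AS email;
    STORE fixed INTO '/data/users_v2';
    ```

    Because the original files are never touched, any jobs still reading
    `/data/users` while this runs see a consistent view.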

  • 勇胡 at Jun 15, 2011 at 12:26 pm
    I read the link, and it seems that HDFS is designed for read-heavy
    operations, not write-heavy ones ("A file once created, written, and
    closed need not be changed").

    From your description (immutable means that after creation it cannot be
    modified), if I understand correctly, you mean that HDFS cannot implement
    "update" semantics as in the database world? A write operation cannot be
    applied directly to a specific tuple or record? The result of a write
    operation can only be appended at the end of the file?

    Regards

    Yong

  • Jonathan Coveney at Jun 15, 2011 at 1:36 pm
    Yong,

    Currently, HDFS does not support appending to a file. So once a file is
    created, it literally cannot be changed (although it can be deleted, I
    suppose). This lets you avoid issues where I do a SELECT * on the entire
    database and the DBA can't update a row, or other things like that. There
    are some append patches in the works, but I am not sure how they handle the
    concurrency implications.

    Make sense?
    Jon

  • 勇胡 at Jun 15, 2011 at 1:56 pm
    Jon,

    If I want to modify data (insert or delete) in HDFS, how can I do it?
    From the description, I cannot directly modify the data in place (update
    it), and I cannot append new data to the file! How does HDFS implement
    data modification? I'm a little bit confused.

    Yong
  • Dmitriy Ryaboy at Jun 15, 2011 at 4:57 pm
    Yong,

    You can't. Hence, immutable. It's not a database. It's a write-once file system.

    Approaches to solve updates include:
    1) rewrite everything
    2) write a separate set of "deltas" into other files and join them in
    at read time
    3) do 2, and occasionally run a "compaction" which does a complete
    rewrite based on existing deltas
    4) write to something like HBase that handles all of this under the covers

    D
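    Strategy 2 can be sketched in Pig Latin: keep a base dataset plus delta
    files, and at read time keep only the newest record per key. The paths,
    schema, and the idea of a `ts` version column are all illustrative
    assumptions, not a fixed API:

    ```pig
    -- Base data plus later "delta" writes, each record carrying a timestamp.
    base   = LOAD '/data/base'   AS (id:long, val:chararray, ts:long);
    deltas = LOAD '/data/deltas' AS (id:long, val:chararray, ts:long);
    merged = UNION base, deltas;
    grpd   = GROUP merged BY id;
    -- For each id, keep only the record with the newest timestamp.
    latest = FOREACH grpd {
        sorted = ORDER merged BY ts DESC;
        newest = LIMIT sorted 1;
        GENERATE FLATTEN(newest);
    };
    ```

    Strategy 3 ("compaction") is then just STOREing `latest` to a new
    directory and making it the next base, after which the old base and delta
    files can be deleted.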

  • Jagaran das at Jun 15, 2011 at 6:25 pm
    Hi,

    Can't we append in hadoop-0.20.203.0?

    Regards,
    Jagaran



    From: Dmitriy Ryaboy <dvryaboy@gmail.com>
    To: user@pig.apache.org
    Sent: Wed, 15 June, 2011 9:57:20 AM
    Subject: Re: Running multiple Pig jobs simultaneously on same data


Discussion Overview
group: user@pig.apache.org
categories: pig, hadoop
posted: Jun 10, '11 at 6:58p
active: Jun 15, '11 at 6:25p
posts: 11
users: 7
website: pig.apache.org