Grokbase Groups Pig user June 2011
Hi,

In our production Cassandra systems, we are observing that the time taken by
the same Pig script keeps increasing every day. The script reads one day of
data at a time from a Cassandra column family. The number of rows the script
is expected to return is almost the same every day, but the number of rows we
store in Cassandra grows daily. We haven't changed the multiquery setting; it
is enabled by default.

Could this increase in Pig script execution time be related to the growing
number of rows in Cassandra?

Related to this, I am trying to understand the behavior of the LOAD statement.
Does LOAD read all the data from Cassandra and then apply the required filter
conditions? If so, the increase in execution time could be attributed to the
extra time required to read the ever-increasing data in Cassandra.

We are also working on a suitable archival mechanism for our data so that the
total number of stored rows stays at an optimum count. This should also help
us keep the Pig script execution time almost constant from day to day.

Please advise.

Thanks,

Badri


  • Jeremy Hanna at Jun 17, 2011 at 7:09 pm
    The way Cassandra currently does MapReduce is that it iterates over all the rows of the column family. So yes, performance would be related to the growing number of rows. You can use the Pig FILTER operator to filter them down, but you are still iterating over all of the rows in that column family.

    There is a ticket - CASSANDRA-1600 (https://issues.apache.org/jira/browse/CASSANDRA-1600) that addresses this and allows for subsets of rows to be specified. It will also enable mapreducing over secondary indexes in a column family. We had hoped 1600 would be resolved by now but there was a complication with a dependent issue. I have been told that it will definitely be in the next major release of Cassandra - 1.0, due out in the beginning of October. From what I understand, these updates will then enable both pig and hive to more easily push down selects of subsets of data.

    Until then, what we've done is set up a separate column family with data that we want to analyze that only has a subset of the data. Then when 1.0 comes out, we'll shift over to use that.

    Jeremy
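
The scan behavior and the workaround described in the reply above can be illustrated with a small model. This is a sketch in plain Python, not Cassandra or Pig code: the `Row` type, its `day` field, and the function names are hypothetical, and a column family is modeled as an in-memory list. It only shows why a filter applied after a full iteration costs time proportional to the total row count, while a separate column family holding just the subset does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str
    day: str  # hypothetical per-row date value used by the filter


def scan_then_filter(column_family, day):
    """Visit every row, keep those for `day`; return (matches, rows_read)."""
    rows_read = 0
    matches = []
    for row in column_family:
        rows_read += 1  # every row is read, whether or not it matches
        if row.day == day:
            matches.append(row)
    return matches, rows_read


def build_column_family(days, rows_per_day):
    """Create a toy column family with `rows_per_day` rows for each day."""
    return [Row(f"{day}-{i}", day) for day in days for i in range(rows_per_day)]


# Main column family: grows by 1000 rows per day.
main = build_column_family(["2011-06-15", "2011-06-16", "2011-06-17"], 1000)

# Separate "analysis" column family that holds only the day of interest.
analysis = [row for row in main if row.day == "2011-06-17"]

full_matches, full_read = scan_then_filter(main, "2011-06-17")
subset_matches, subset_read = scan_then_filter(analysis, "2011-06-17")

print(f"full scan:   read {full_read} rows to return {len(full_matches)}")
print(f"subset scan: read {subset_read} rows to return {len(subset_matches)}")
# full scan:   read 3000 rows to return 1000
# subset scan: read 1000 rows to return 1000
```

Both scans return the same rows, but the full scan reads three days of data to do it, and that number keeps growing as rows accumulate. How the subset column family is populated and kept current is left out here; the thread does not describe it.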

  • Badrinarayanan S at Jun 18, 2011 at 2:07 am
    Hi Jeremy,

    Thanks. Until 1.0 is available, we will also adopt a separate CF for
    analysis purposes.

    Regards,
    badri


  • Jeremy Hanna at Jun 18, 2011 at 3:20 am
    Oh np Badri.

    Also, fwiw, the open-source Brisk project - https://github.com/riptano/brisk/ - does a good job integrating Cassandra with Hadoop, and today they released beta 2, which includes Pig support. Might be worth looking at too. It simplifies operations a lot from what I understand. An explanation of what it does is at http://www.datastax.com/brisk

    Anyway, so just something else to consider.

    Also I started a project called pygmalion to help out with pig + cassandra specifically. You might find it useful and/or want to contribute code/examples to it :). Anyway, that's here: https://github.com/jeromatron/pygmalion/

    Jeremy
