Grokbase Groups Pig user April 2011
pig query on Cassandra
Hi, All.

When I do a Pig query on Cassandra and Cassandra is updated by an
application at the same time, what will happen? I may get inconsistent
results, right?

--
Bing

Graduate Student
Computer Science Department, UCSB :)


  • Jeremy Hanna at Apr 21, 2011 at 12:36 am
    The answer is that it depends on which consistency level you are reading and writing at. You can make sure you are always reading consistent data by using quorum for reads and quorum for writes.

    For more information on consistency level, see:
    http://www.datastax.com/docs/0.7/consistency/index

    With Pig, you can specify the consistency level that you want to read at with the following property in your hadoop configuration:
    cassandra.consistencylevel.read

    So you can read at whatever consistency level you wish for each row. The peculiarity with Pig for reading and writing at the same time is that Pig is by nature a batch job: it's going to go over a set of columns for every row in the column family. So when you say you're writing at the same time, which row do you mean? For example, if you are reading a particular row with consistency level "quorum" and you're writing with consistency level "quorum" to that row, you will see consistent results.
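The quorum guarantee described above follows from the overlap rule R + W > N: any read quorum must intersect any write quorum, so a quorum read always touches at least one replica that saw the quorum write. A minimal sketch of that arithmetic (illustrative only, not Cassandra client code; the function names are made up):

```python
# Sketch of why QUORUM reads + QUORUM writes are consistent (R + W > N).
# Illustrative only; not Cassandra client code.

def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1

def overlap_guaranteed(n, r, w):
    """A read of r replicas must intersect a write of w replicas iff r + w > n."""
    return r + w > n

for n in (3, 5, 7):
    r = w = quorum(n)
    assert overlap_guaranteed(n, r, w)       # quorum/quorum always overlaps
    assert not overlap_guaranteed(n, 1, 1)   # ONE/ONE gives no such guarantee
```

In a Pig job the read level itself would be set through the `cassandra.consistencylevel.read` Hadoop property mentioned above, e.g. passed as `-Dcassandra.consistencylevel.read=QUORUM` on the pig command line (the exact invocation style is an assumption; check your CassandraStorage version).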
  • Bing Wei at Apr 21, 2011 at 4:07 am
    Thanks Jeremy. The link is of great help. The Pig query only cares about
    rows with certain key patterns. For example, it only cares about rows whose
    key values begin with "aaa". For each row, the query only cares about one
    column.
    For writes, a new row with that column can be inserted, or a row with that
    column can be deleted.
  • Mridul Muralidharan at Apr 21, 2011 at 8:20 am
    In general (on Hadoop-based systems), if the input is not immutable,
    you can end up with issues during task re-execution, etc.
    This happens not just for Cassandra but for HBase and others too, where
    you modify data in place.



    Regards,
    Mridul
  • Jeremy Hanna at Apr 21, 2011 at 1:12 pm

    So do you mean that between the time of the first execution and the time of the re-execution, input data can change? Yes that's possible. However, unless you are reading stale data the second time, it's not a consistency issue, is it? I mean, if I am guaranteed to read the most recent data on the first execution and the second execution, that's consistent. If I am reading updated data the second time, that's consistent and may or may not be a problem.

    Just trying to make sure I understand.

  • Mridul Muralidharan at Apr 21, 2011 at 2:25 pm

    To clarify, I am referring to re-execution of a task, not job.

    From a (single) Hadoop job's point of view (and that of everything
    which consumes its output), it is a consistency issue: the re-execution of
    a task can generate a set of key/values different from the one produced by
    the initial invocation (which might already have been used by some reducers).
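The hazard can be shown with a toy simulation (hypothetical data and function names, not Hadoop code): a "task" re-executed over a mutated input split emits different key/values than its first attempt.

```python
# Toy illustration of the task re-execution hazard over mutable input.
# Hypothetical data; not Hadoop code.

def map_task(rows):
    """A stand-in for a map task: emit (key, value) pairs from its input split."""
    return [(k, v) for k, v in sorted(rows.items())]

store = {"aaa1": 10, "aaa2": 20}   # mutable input split

first_run = map_task(store)        # some reducers may already consume this

store["aaa2"] = 99                 # concurrent in-place update...
del store["aaa1"]                  # ...and a deletion

second_run = map_task(store)       # speculative / failed-task re-execution

# The re-executed task emits different key/values than the first attempt,
# so downstream reducers can see a mix of old and new data.
assert first_run != second_run
```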


    Regards,
    Mridul

  • Jeremy Hanna at Apr 21, 2011 at 4:04 pm

    Good point about inputs that are not immutable. Currently Cassandra doesn't have a way to snapshot the data to serve as immutable input. Created a ticket to address that: https://issues.apache.org/jira/browse/CASSANDRA-2527

    I guess I was more focused on Cassandra's architecture with respect to consistency, since it's often misunderstood, and on how to use consistency levels with MapReduce/Pig.
  • Dmitriy Ryaboy at Apr 21, 2011 at 11:51 am
    We don't have that functionality in the HBase loader yet, but technically one can get around this inconsistency by specifying a max timestamp on the HBase scan. As long as the number of versions HBase is configured to keep is larger than the number of updates to a single row during your scan, you'd get a consistent snapshot of the data. There is a JIRA open requesting that we add timestamp support.
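The snapshot idea can be sketched with a toy multi-version store (illustrative only; in HBase itself this corresponds to constraining the scan's time range, and the store/function names here are made up):

```python
# Toy multi-version store illustrating snapshot reads via a max timestamp.
# Illustrative only; in HBase this corresponds to limiting the Scan time range.

versions = {}  # row -> list of (timestamp, value)

def put(row, ts, value):
    versions.setdefault(row, []).append((ts, value))

def get_at(row, max_ts):
    """Newest value with timestamp <= max_ts, as a snapshot read would see it."""
    candidates = [(ts, v) for ts, v in versions.get(row, []) if ts <= max_ts]
    return max(candidates)[1] if candidates else None

put("aaa1", ts=100, value="old")
snapshot_ts = 150                    # scan pinned to this timestamp
put("aaa1", ts=200, value="new")     # concurrent update during the scan

# The pinned scan still sees the pre-update value; an unpinned read sees the new one.
assert get_at("aaa1", snapshot_ts) == "old"
assert get_at("aaa1", 250) == "new"
```

This toy store keeps every version; the caveat in the message above is exactly that real HBase only keeps a bounded number of versions per cell, so the pinned version can be evicted if too many updates land during the scan.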

    -----Original Message-----
    From: "Mridul Muralidharan" <mridulm@yahoo-inc.com>
    To: "user@pig.apache.org" <user@pig.apache.org>
    Cc: "Bing Wei" <blackice.wei@gmail.com>
    Sent: 4/21/2011 1:19 AM
    Subject: Re: pig query on Cassandra


  • Mridul Muralidharan at Apr 21, 2011 at 2:29 pm
    Agree.
    It becomes a function of the number of records updated per second (per key)
    and the max number of versions kept around for a column...

    Of course, solving this in general is not easy anyway :-)



    Regards,
    Mridul


Discussion Overview
group: user@pig.apache.org
categories: pig, hadoop
posted: Apr 20, '11 at 11:00p
active: Apr 21, '11 at 4:04p
posts: 9
users: 4
website: pig.apache.org
