Hi all,
I'm implementing a Datanucleus plugin for Cassandra. I'm finished
with the basic functionality, and everything seems to work pretty well.
Now my issue is performing secondary indexing on fields within my data.
I have outlined some of the issues I'm facing in this post.

http://www.datanucleus.org/servlet/forum/viewthread_thread,6087_lastpage,yes#32610

Essentially, for each operand the user specifies, I will need to make a
trip to Cassandra, load the key columns, then perform an intersection
with the result from my previous read. Eventually at the end of all the
intersections, I will have a list of keys I will then load. This
obviously requires several trips to Cassandra, whereas, from my
understanding of secondary indexing, I would only need to make one trip
for multiple operands over a column family. I've read over this
issue.

http://issues.apache.org/jira/browse/CASSANDRA-32610

And it seems to solve a lot of my woes. Is it possible/recommended to
patch the current code base of 0.6.2 to perform this functionality?

Thanks,
Todd

  • Jonathan Ellis at Jun 16, 2010 at 1:03 am
    What issue were you trying to link? :)

    --
    Jonathan Ellis
    Project Chair, Apache Cassandra
    co-founder of Riptano, the source for professional Cassandra support
    http://riptano.com
  • Todd Nine at Jun 16, 2010 at 3:35 am
    Let's try that again...

    This is the intended issue.

    https://issues.apache.org/jira/browse/CASSANDRA-749

    thanks,
    Todd

  • Jonathan Ellis at Jun 16, 2010 at 5:06 am
    No chance that 749 can be backported to 0.6, sorry.

  • Todd Nine at Jun 16, 2010 at 4:57 am
    No problem,
    I didn't want to implement my own solution if an existing one could
    easily be applied. Since I'll be creating CFs that represent secondary
    indexes, I'll need to perform range scans over the keys of those
    secondary index CFs. The column names within those CFs are the row keys
    of the primary table. Is there a way I can get the intersection of all
    of the column names from multiple range scans over different column
    families in one result set? Otherwise I'll need to make multiple trips
    and create the intersection myself in my plugin. Here is an example of
    what I'm trying to do.

    CF: Person

    key1: {
    firstName: John
    lastName: Smith
    email: smiths@foo.com
    }

    key2: {
    firstName: Jane
    lastName: Smith
    email: smiths@foo.com
    }

    key3: {
    firstName: Jane
    lastName: Doe
    email: smiths@foo.com
    }


    My secondary index tables would be the following

    CF: Person_LastName

    Smith:{
    key1: 0x00
    key2: 0x00
    }

    Doe: {
    key3:0x00
    }

    CF: Person_Email
    smiths@foo.com:{
    key1:0x00
    key2:0x00
    key3:0x00
    }

    If my input is something similar to lastName == 'Smith' && email ==
    "smiths@foo.com", I would read all columns from key "Smith" in CF
    Person_LastName, and all columns from key "smiths@foo.com" in CF
    Person_Email. The intersection of the two sets is key1 and key2, and
    I would like Cassandra to return only those rows.

    Thanks,
    Todd
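
A minimal sketch of the client-side intersection described in the post above, written in Python against in-memory dictionaries rather than a real Cassandra client. The names `read_index_row` and `query_and` are hypothetical and only illustrate the flow: one read per clause against the matching index CF, an intersection in the plugin, then a final read of the surviving rows from the primary CF.

```python
# In-memory stand-ins for the column families in the example above.
# Each CF maps row key -> {column name: value}.
PERSON = {
    "key1": {"firstName": "John", "lastName": "Smith", "email": "smiths@foo.com"},
    "key2": {"firstName": "Jane", "lastName": "Smith", "email": "smiths@foo.com"},
    "key3": {"firstName": "Jane", "lastName": "Doe", "email": "smiths@foo.com"},
}
PERSON_LASTNAME = {
    "Smith": {"key1": b"\x00", "key2": b"\x00"},
    "Doe": {"key3": b"\x00"},
}
PERSON_EMAIL = {
    "smiths@foo.com": {"key1": b"\x00", "key2": b"\x00", "key3": b"\x00"},
}


def read_index_row(index_cf, value):
    """Hypothetical helper: one round trip that returns the column names
    (primary row keys) stored under `value` in an index CF."""
    return set(index_cf.get(value, {}))


def query_and(clauses, primary_cf):
    """Evaluate an AND of equality clauses given as (index_cf, value) pairs.

    One read per clause, a client-side intersection, then one final
    read of the surviving primary rows.
    """
    matching = None
    for index_cf, value in clauses:
        keys = read_index_row(index_cf, value)
        matching = keys if matching is None else matching & keys
        if not matching:
            return {}  # no candidate left, skip the remaining reads
    return {key: primary_cf[key] for key in sorted(matching or ())}


rows = query_and(
    [(PERSON_LASTNAME, "Smith"), (PERSON_EMAIL, "smiths@foo.com")],
    PERSON,
)
print(sorted(rows))  # ['key1', 'key2']
```

The number of round trips grows with the number of clauses plus one for the final load, which is the cost the post is hoping to avoid.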




  • Aaron morton at Jun 16, 2010 at 12:38 pm
    I've not read up on the secondary indexes, but I am doing something similar. I got some inspiration from the Lucandra project. You will probably need to make multiple calls to Cassandra, one for each clause of your query.

    The design I used had two CFs. The rough idea was: in the TermDocIndex, the key is the term (e.g. lastName=Smith) and the column names are the keys of the objects/documents the term is from (e.g. key1). The DocTermIndex uses the object/doc id as the key and has a column for each term the document contains (e.g. "lastName=Smith"). I also maintained some stats on how many objects/documents had each term (using redis; I will perhaps move to Cassandra counters in 0.7).

    The query process then becomes:
    1. Determine the most selective term in the query using the stats.
    2. Do a get_slice to get the first X (1000 perhaps) columns from the TermDocIndex using the term key.
    3. Use the keys from step 2 in a multiget_slice against the DocTermIndex, passing the list of keys from 2 and listing the remaining terms as the column names you want to get back.
    4. From the result of 3, filter out all keys that returned fewer columns than we asked for.
    5. Repeat from 3 if needed.

    I was hoping the limit in step 2 would bound the queries into the cluster, and the multiget in step 3 would be better at distributing most of the work around the cluster. E.g. rather than reading 1000 columns from, say, 3 keys, it reads 3 columns from 1000 keys.

    Aaron
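
The query process in the post above can be sketched roughly as follows, again in Python over in-memory dictionaries. The functions `get_slice` and `multiget_slice` here are simplified local stand-ins, not the Thrift calls themselves, and `TERM_COUNTS` stands in for the stats the post keeps in redis. Paging the driving term in `page_size` chunks is one interpretation of "repeat if needed", so treat it as an assumption.

```python
# In-memory stand-ins for the two column families and the term statistics.
TERM_DOC_INDEX = {  # term -> {doc key: empty value}
    "lastName=Smith": {"key1": b"", "key2": b""},
    "lastName=Doe": {"key3": b""},
    "email=smiths@foo.com": {"key1": b"", "key2": b"", "key3": b""},
}
DOC_TERM_INDEX = {  # doc key -> {term: empty value}
    "key1": {"lastName=Smith": b"", "email=smiths@foo.com": b""},
    "key2": {"lastName=Smith": b"", "email=smiths@foo.com": b""},
    "key3": {"lastName=Doe": b"", "email=smiths@foo.com": b""},
}
TERM_COUNTS = {term: len(cols) for term, cols in TERM_DOC_INDEX.items()}


def get_slice(cf, key, start, count):
    """Stand-in for a slice read: up to `count` column names after `start`."""
    names = sorted(cf.get(key, {}))
    if start is not None:
        names = [n for n in names if n > start]
    return names[:count]


def multiget_slice(cf, keys, column_names):
    """Stand-in for a multiget: the requested columns present in each row."""
    return {k: [c for c in column_names if c in cf.get(k, {})] for k in keys}


def query_and(terms, page_size=1000):
    """Return the doc keys that contain every term in `terms`."""
    if not terms:
        return []
    # Step 1: choose the most selective term using the stats.
    terms = sorted(terms, key=lambda t: TERM_COUNTS.get(t, 0))
    driver, rest = terms[0], terms[1:]
    matches, start = [], None
    while True:
        # Step 2: read one bounded page of candidate keys.
        candidates = get_slice(TERM_DOC_INDEX, driver, start, page_size)
        if not candidates:
            break
        # Step 3: ask each candidate row for the remaining terms.
        found = multiget_slice(DOC_TERM_INDEX, candidates, rest)
        # Step 4: keep only keys that returned every requested column.
        matches += [k for k in candidates if len(found[k]) == len(rest)]
        # Step 5: fetch the next page if this one was full (an assumption
        # about what "repeat if needed" means).
        if len(candidates) < page_size:
            break
        start = candidates[-1]
    return matches


print(query_and(["email=smiths@foo.com", "lastName=Smith"], page_size=1))
# ['key1', 'key2']
```

Driving the query from the rarest term keeps the candidate list short, so the multiget touches many rows but asks each for only a few columns.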

Discussion Overview
group: dev @
categories: cassandra
posted: Jun 15, '10 at 11:56p
active: Jun 16, '10 at 12:38p
posts: 6
users: 3
website: cassandra.apache.org
irc: #cassandra
