Which Hadoop product is more appropriate for a quick query on a large data set?
Hi there:

I am researching Hadoop to see which of its products suits our need for
quick queries against large data sets (billions of records per set).

The queries will be performed against chip sequencing data. Each record is
one line in a file. To be clear, a sample record from the data set looks
like this:

1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C

The highlighted field is called "position of match", and the query we are
interested in is the number of sequences whose "position of match" falls in
a certain range. For instance, the range can be "position of match" > 200
and "position of match" + 36 < 200,000.

Any suggestions on which Hadoop product I should start with to accomplish
the task? HBase, Pig, Hive, or ...?

Thanks!

Xueling


  • Todd Lipcon at Dec 12, 2009 at 3:35 am
    Hi Xueling,

    One important question that can really change the answer:

    How often does the dataset change? Can the changes be merged in bulk
    every once in a while, or do you need to apply random updates very often?

    Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or 10ms?

    Thanks
    -Todd
  • Xueling Shu at Dec 12, 2009 at 6:38 pm
    Hi Todd:

    Thank you for your reply.

    The datasets won't be updated often, but queries against a data set are
    frequent, and the quicker the query, the better. For example, we have done
    testing on a MySQL database (5 billion records randomly scattered into 24
    tables), and the slowest query against the biggest table (400,000,000
    records) takes around 12 minutes. So if any Hadoop product can speed up
    the search, that product is what we are looking for.

    Cheers,
    Xueling
  • Todd Lipcon at Dec 12, 2009 at 9:02 pm
    Hi Xueling,

    In that case, I would recommend the following:

    1) Put all of your data on HDFS
    2) Write a MapReduce job that sorts the data by position of match
    3) As a second output of this job, you can write a "sparse index" -
    basically a set of entries like this:

    <position of match> <offset into file> <number of entries following>

    where you're basically giving an offset into the sorted file every 10K
    records or so. If you index every 10K records, then 5 billion records
    total will mean 500,000 index entries. Each index entry shouldn't be more
    than about 20 bytes, so 500,000 entries will be roughly 10MB. This is
    easy to fit into memory. (You could probably index every 100th record
    instead and end up with around 1GB, which may still fit in memory on a
    machine with enough RAM.)

    Then to satisfy your count-range query, you can simply scan your
    in-memory sparse index. Some of the indexed blocks will be completely
    included in the range, in which case you just add up the "number of
    entries following" column. The start and finish block will be
    partially covered, so you can use the file offset info to load that
    file off HDFS, start reading at that offset, and finish the count.

    Total time per query should be <100ms no problem.

    -Todd
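
    As an illustration only (not code from this thread), the lookup side of
    that scheme could look roughly like the sketch below. It assumes the
    sorted records and the sparse index from steps 2 and 3 are plain text
    files on HDFS, each index line holding "<position> <byte offset> <count>",
    and that the sorted file has the position as its first field; the file
    layout, class names and boundary handling are all assumptions.

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SparseIndexQuery {

            static class Entry {
                final long position, offset, count;
                Entry(long position, long offset, long count) {
                    this.position = position;
                    this.offset = offset;
                    this.count = count;
                }
            }

            private final List<Entry> index = new ArrayList<Entry>();
            private final FileSystem fs;
            private final Path sortedFile;

            SparseIndexQuery(Configuration conf, Path indexFile, Path sortedFile)
                    throws IOException {
                this.fs = FileSystem.get(conf);
                this.sortedFile = sortedFile;
                // Load the whole sparse index into memory; it is only a few MB.
                BufferedReader in =
                        new BufferedReader(new InputStreamReader(fs.open(indexFile)));
                for (String line; (line = in.readLine()) != null; ) {
                    String[] f = line.split("\\s+");
                    index.add(new Entry(Long.parseLong(f[0]),
                            Long.parseLong(f[1]), Long.parseLong(f[2])));
                }
                in.close();
            }

            /** Counts records whose position of match lies in [lo, hi]. */
            long count(long lo, long hi) throws IOException {
                long total = 0;
                for (int i = 0; i < index.size(); i++) {
                    long blockStart = index.get(i).position;
                    long blockEnd = (i + 1 < index.size())
                            ? index.get(i + 1).position : Long.MAX_VALUE;
                    if (blockEnd <= lo || blockStart > hi) {
                        continue;                    // block entirely outside the range
                    }
                    if (blockStart >= lo && blockEnd - 1 <= hi) {
                        total += index.get(i).count; // fully covered: just add its count
                    } else {
                        total += scanBlock(index.get(i).offset, lo, hi, blockEnd);
                    }
                }
                return total;
            }

            // Partially covered block: seek into the sorted file on HDFS and count
            // matching records until we leave the block or pass the upper bound.
            private long scanBlock(long offset, long lo, long hi, long blockEnd)
                    throws IOException {
                FSDataInputStream in = fs.open(sortedFile);
                in.seek(offset);                     // offsets point at line starts
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                long n = 0;
                for (String line; (line = reader.readLine()) != null; ) {
                    long position = Long.parseLong(line.split("\\s+")[0]);
                    if (position >= blockEnd || position > hi) {
                        break;
                    }
                    if (position >= lo) {
                        n++;
                    }
                }
                reader.close();
                return n;
            }
        }

    With a few hundred thousand index entries, even this linear scan of the
    in-memory index is cheap; a binary search for the first overlapping block
    would make it essentially free.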
  • Stack at Dec 12, 2009 at 10:51 pm
    You might also consider HBase, particularly if you find that your data is
    being updated with some regularity, and especially if the updates are
    randomly distributed over the data set. See
    http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
    for how to do a fast bulk load of your billions of rows of data.

    Yours,
    St.Ack
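
    If the data did end up in HBase, keyed by a fixed-width, zero-padded
    "position of match" (plus a unique suffix such as the record id, since
    many records can share a position), the range query itself is just a
    client-side scan. A rough sketch follows; the table name is invented and
    client API details vary between HBase versions:

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseRangeCount {

            // Fixed-width row key prefix so byte-wise order matches numeric order.
            static String padKey(long position) {
                return String.format("%010d", position);
            }

            public static long countRange(long lo, long hi) throws IOException {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "reads");
                // Start row is inclusive, stop row is exclusive.
                Scan scan = new Scan(Bytes.toBytes(padKey(lo)),
                        Bytes.toBytes(padKey(hi + 1)));
                scan.setCaching(10000);          // fetch rows in large batches
                ResultScanner scanner = table.getScanner(scan);
                long count = 0;
                try {
                    for (Result row : scanner) {
                        count++;
                    }
                } finally {
                    scanner.close();
                    table.close();
                }
                return count;
            }
        }

    Counting this way still streams every row in the range back to the client,
    so for pure count queries the pre-aggregated sparse index described above
    may remain the cheaper option.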

  • Xueling Shu at Dec 12, 2009 at 10:56 pm
    Great information! Thank you for your help, Todd.

    Xueling
  • Fred Zappert at Dec 12, 2009 at 11:31 pm
    +1 for HBase
  • Xueling Shu at Jan 6, 2010 at 1:24 am
    Hi Todd:

    After finishing some tasks, I have finally gotten back to HDFS testing.

    One question for your last reply to this thread: Are there any code examples
    close to your second and third recommendations? Or what APIs I should start
    with for my testing?

    Thanks.
    Xueling
  • Xueling Shu at Jan 6, 2010 at 1:26 am
    To rephrase the sentence "Or what APIs I should start with for my
    testing?": I mean "What HDFS APIs should I start to look into for my
    testing?"

    Thanks,
    Xueling
  • Todd Lipcon at Jan 6, 2010 at 7:33 pm
    Hi Xueling,

    Here's a general outline:

    My guess is that your "position of match" field is bounded (perhaps by the
    number of base pairs in the human genome?). Given this, you can probably
    write a very simple Partitioner implementation that divides this field into
    ranges, each with an approximately equal number of records.

    Next, write a simple MR job which takes in a line of data, and outputs the
    same line, but with the position-of-match as the key. This will get
    partitioned by the above function, so you end up with each reducer receiving
    all of the records in a given range.

    In the reducer, simply output every 1000th position into your "sparse"
    output file (along with the non-sparse output file offset), and every
    position into the non-sparse output file.

    In your realtime query server (not part of Hadoop), load the "sparse" file
    into RAM and perform a binary search to find the "bins" that the range
    endpoints land in, then open the non-sparse output on HDFS to finish the
    count.

    Hope that helps.

    Thanks
    -Todd
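
    To make that concrete, here is a rough sketch of such a job (again, not
    code from this thread): a mapper that re-keys each line by its position of
    match, a simple fixed-width-range Partitioner, and a reducer that just
    re-emits the sorted records. A real job would also write the sparse index
    (for example via MultipleOutputs), and for partitions with approximately
    equal record counts you would sample the input and use Hadoop's
    TotalOrderPartitioner instead of fixed ranges. Class names, the field
    index and the position bound are assumptions.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Partitioner;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class SortByPosition {

            // Assumed upper bound on "position of match" (roughly a genome length).
            static final long MAX_POSITION = 3200000000L;

            // Re-keys each line by its position of match (assumed to be the eighth
            // whitespace-separated field, with the highlighting asterisks stripped).
            public static class PositionMapper
                    extends Mapper<LongWritable, Text, LongWritable, Text> {
                @Override
                protected void map(LongWritable offset, Text line, Context context)
                        throws IOException, InterruptedException {
                    String[] fields = line.toString().split("\\s+");
                    long position = Long.parseLong(fields[7].replace("*", ""));
                    context.write(new LongWritable(position), line);
                }
            }

            // Sends each key to one of numPartitions fixed-width ranges, so the
            // reducer outputs are globally ordered by position.
            public static class RangePartitioner extends Partitioner<LongWritable, Text> {
                @Override
                public int getPartition(LongWritable key, Text value, int numPartitions) {
                    long width = MAX_POSITION / numPartitions + 1;
                    return (int) Math.min(key.get() / width, numPartitions - 1);
                }
            }

            // Re-emits records in sorted order; a real job would also emit a sparse
            // index entry every N records here.
            public static class PositionReducer
                    extends Reducer<LongWritable, Text, LongWritable, Text> {
                @Override
                protected void reduce(LongWritable position, Iterable<Text> records,
                        Context context) throws IOException, InterruptedException {
                    for (Text record : records) {
                        context.write(position, record);
                    }
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf, "sort by position of match");
                job.setJarByClass(SortByPosition.class);
                job.setMapperClass(PositionMapper.class);
                job.setPartitionerClass(RangePartitioner.class);
                job.setReducerClass(PositionReducer.class);
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }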
  • Xueling Shu at Jan 6, 2010 at 7:41 pm
    Thanks for the information! I will give it a try.

    Xueling
  • Gibbon, Robert, VF-Group at Jan 6, 2010 at 8:04 pm
    Isn't this what Hadoop HBase is supposed to do? The partitioning M/R
    implementation ("sharding", in street terms) is the sideways scaling that
    HBase is designed to excel at! Also, the indexed HBase flavour could allow
    very fast ad-hoc queries for Xueling using the new yet familiar HBQL SQL
    dialect.

    Sorry, my 10 pence worth!


Discussion Overview
group: general@hadoop.apache.org
categories: hadoop
posted: Dec 12, '09 at 3:19a
active: Jan 6, '10 at 8:04p
posts: 12
users: 5
website: hadoop.apache.org
irc: #hadoop
