Hi all,


I am using Hadoop 0.20.2 and I want to sort a huge amount of data.
I've read about Terasort [from the examples], but it uses 10-byte
char keys. Changing the keys from char to integer wasn't a good solution,
as Terasort builds a trie to create total-order partitions. I got stuck
when I tried to change the char trie to one suitable for numeric keys.

Then I gave Sort [also from the examples] a try, and it did work for
integer keys, but without total-order partitioning. At the end of the
day, the final result cannot be produced simply by concatenating all the
reducers' outputs: each reducer sorts only a subset of the data, and no
merging occurs between two reducers.

Can anyone please advise me on what to use, and how, to sort a huge
amount of real numbers?
Looking forward to your replies.


Thank you.
Best,
Teodor


  • Alex Kozlov at Aug 1, 2010 at 10:14 pm
    Hi Teodor,

    I am not clear on what you mean by 'real numbers'. Terasort does work on
    bytes (a 10-byte key and a 90-byte payload). The actual 'meaning' of the
    bytes does not matter, as Hadoop uses binary comparators on the raw values.

    Total-order partitioning should also work with any WritableComparable key
    (if it doesn't, it's a bug).

    My guess is that your problem is converting the char trie to a
    WritableComparable. Can you provide more background? Are the strings of
    fixed length?

    Alex K
    On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas wrote:

  • Teodor Macicas at Aug 2, 2010 at 9:26 am
    Hi Alex,

    Thank you for your quick reply, and sorry for not being clear.
    The job I want to do is simple: sort data that has numbers [doubles] as
    keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
    it for my particular job?
    Do I need to change Terasort?

    [0] example workload:
    123.45 payload1
    -34.56 payload2
    752.10 payload3
    10.25 payload4
    ....

    Does this make sense now?

    Regards,
    Teodor
    On 08/02/2010 12:14 AM, Alex Kozlov wrote:
  • Alex Kozlov at Aug 2, 2010 at 5:42 pm
    Hi Teodor,

    I see the problem now: there is no simple binary comparator for
    DoubleWritable
    <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
    So you can do one of two things:

    1. Convert your doubles to ints (or bytes): say, if the precision is
    always 2 decimal places, represent each number as 100 x the double. The
    problem is then reduced to sorting integers.

    2. Use DoubleWritable as the key and the payload as the value. You can
    use the generic TotalOrderPartitioner
    <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
    which does not use tries. You can also just use a generic MR job with
    DoubleWritable keys: MR will sort the keys for you with an identity
    mapper and an identity reducer.

    Option 2 is slightly less efficient, since the code will need to call
    Double.longBitsToDouble each time, but I don't see an easy way to avoid
    this with the IEEE 754 encoding.

    Alex K
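The point that "Hadoop uses binary comparators on the raw value" suggests a well-known trick for doubles: re-encode each IEEE 754 value as bytes whose unsigned lexicographic order matches numeric order, so byte-oriented keys (and partitioners) can carry doubles directly. A minimal sketch; the class and method names are illustrative, not from the thread:

```java
// Sketch: encode a double as 8 bytes whose unsigned lexicographic order
// matches numeric order (standard IEEE 754 bit-flip trick; names here
// are hypothetical, not from the thread).
public class SortableDouble {
    public static byte[] encode(double d) {
        long bits = Double.doubleToLongBits(d);
        // Non-negative doubles: flip only the sign bit, so they land above
        // all negatives. Negative doubles: flip every bit, which both clears
        // the sign bit and reverses their magnitude ordering.
        bits ^= (bits < 0) ? 0xFFFFFFFFFFFFFFFFL : 0x8000000000000000L;
        byte[] key = new byte[8];
        for (int i = 0; i < 8; i++) {
            key[i] = (byte) (bits >>> (56 - 8 * i));  // big-endian bytes
        }
        return key;
    }
}
```

With this encoding, a plain byte-wise comparison sorts -34.56 before 10.25 before 123.45; decoding would reverse the XOR and call Double.longBitsToDouble, which is the per-record cost mentioned above.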
    On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas wrote:


  • Teodor Macicas at Aug 2, 2010 at 9:43 pm
    Hi Alex,

    Thank you again.
    Yes, I'm also considering your first suggestion, but that would only
    'reduce' the problem from floating-point numbers to integers. I still do
    not know how to use Terasort for integer keys!

    I've tried to use the generic TotalOrderPartitioner instead of the one
    nested in the Terasort class, but I received a lot of errors [0]. I
    tried to modify TeraInputFormat, TeraOutputFormat (and all the nested
    classes) and kept getting errors.

    It's not clear to me what I have to change in order to make your second
    solution work. Moreover, I was unable to find a generic MR job in my
    Hadoop 0.20.2 version.
    I'd prefer the first solution, so can you please give me some tips on
    how to use Terasort for integers?

    p.s.: I've used a trick with fixed-length char keys, and the program
    worked for that kind of workload [1]. I think using integer keys instead
    of this trick would be faster.

    [0] java.io.IOException: wrong key class:
    org.apache.hadoop.io.DoubleWritable is not class
    org.apache.hadoop.io.Text

    [1] it worked for this:
    0000123.45 payload1
    0005120.55 payload2
    0000003.77 payload3
    ...

    Best,
    Teodor
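Note that the fixed-length trick in [1] only covers non-negative keys: a leading '-' does not sort correctly against digit characters, and larger negative magnitudes must sort lower. The "100 x the double" suggestion can be combined with a bias so that one fixed-width text key handles negatives too. A small sketch; the class name, bias, and range bound are assumptions for illustration:

```java
// Sketch (hypothetical helper, not from the thread): turn a decimal key
// with two fractional digits into a fixed-width string whose
// lexicographic order matches numeric order.
public class FixedWidthKey {
    // Assumption: |key| < 1e8, so scaled values fit well under the bias.
    private static final long BIAS = 10_000_000_000L;

    public static String encode(double d) {
        long cents = Math.round(d * 100);        // 123.45 -> 12345
        // Adding the bias shifts negatives into the positive range, so
        // zero-padded decimal text sorts the same way the numbers do.
        return String.format("%020d", cents + BIAS);
    }
}
```

Keys built this way can be wrapped in a fixed-width Text, which matches the fixed-length char keys Terasort's trie partitioner expects.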
    On 08/02/2010 07:41 PM, Alex Kozlov wrote:


  • Alex Kozlov at Aug 2, 2010 at 10:22 pm
    Hi Teodor,

    Certainly, org.apache.hadoop.io.DoubleWritable and
    org.apache.hadoop.io.Text are different classes. For approach (1), you
    just need to construct a byte[10] array from the integer, create a new
    Text(byte[]), and write it together with the value to a sequence file.

    Since TeraSort was created specifically for benchmarking purposes, I
    think it might make sense for you to start with approach (2). Just
    create a SequenceFile<DoubleWritable,Text> with your <key,value> data
    and run a simple MR job with an identity mapper and an identity reducer.
    I can send you an example of MR code, but there are plenty out there
    <http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>.
    One of them is TeraSort.java:run() itself, but you may want to use the
    new mapreduce API
    <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>.
    Once you are comfortable with the MR framework, you can optimize
    further.

    Another good source of information is Tom White's 'Hadoop: The
    Definitive Guide', particularly the section on the
    TotalOrderPartitioner.

    Let me know if you have any further questions.

    Alex K
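The wiring described for approach (2) (SequenceFile input, identity map/reduce, the generic TotalOrderPartitioner) can be sketched against the old 0.20 "mapred" API roughly as follows. This is an untested illustration that assumes a Hadoop 0.20.x classpath and a running cluster; the driver class name, paths, and sampler parameters are made up for the example:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

// Sketch only: a driver for sorting <DoubleWritable, Text> records with
// total-order partitioning. args[0] = input dir of SequenceFiles,
// args[1] = output dir.
public class DoubleSortDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(DoubleSortDriver.class);
    conf.setJobName("double-total-order-sort");

    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(DoubleWritable.class);
    conf.setOutputValueClass(Text.class);

    // Identity map/reduce: the framework's shuffle does the sorting.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Sample the input to build partition boundaries, so reducer i's
    // output file follows reducer i-1's in global key order.
    conf.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(conf,
        new Path(args[0], "_partitions"));
    InputSampler.writePartitionFile(conf,
        new InputSampler.RandomSampler<DoubleWritable, Text>(0.01, 1000));

    JobClient.runJob(conf);
  }
}
```

Because the partition file name starts with an underscore, the default input filter skips it when the job later reads args[0] as input.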
    On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas wrote:




  • Teodor Macicas at Aug 2, 2010 at 10:41 pm
    Hi Alex,

    Why are you suggesting using SequenceFiles? That implies changing the
    TeraInputFormat class, right?

    Your second approach is similar to the Sort example from Hadoop. The
    disadvantage of using it is that I don't get total-order partitioning,
    and thus more operations are necessary to create the final result.

    Regards,
    Teodor
    On 08/03/2010 12:21 AM, Alex Kozlov wrote:

  • Alex Kozlov at Aug 2, 2010 at 10:51 pm

    On Mon, Aug 2, 2010 at 3:41 PM, Teodor Macicas wrote:

    Hi Alex,

    Why are you suggesting using SequenceFiles ? That implies changing the
    TeraInputFormat class, right ?
    Because a text input file will not work for arbitrary bytes, which can
    contain newline bytes, for example. Yes, the old TeraInputFormat will
    not work.

    Your second approach is similar with Sort example from hadoop. The
    disadvantage of using it is that I don't have a total order partitioning and
    thus more operations are neccessary for creating the final result.
    There is a generic total-order partitioner: I provided the links. See
    the HTDG book as well.






Discussion Overview
group: common-user
categories: hadoop
posted: Aug 1, '10 at 9:24p
active: Aug 2, '10 at 10:51p
posts: 8
users: 2 (Alex Kozlov: 4 posts, Teodor Macicas: 4 posts)
website: hadoop.apache.org...
irc: #hadoop
