Using Sorted Files For Filtering Input (File Index)
Let's say I have a simple data file with <key, value> pairs and the entire
file is sorted in ascending order by 'value'. What I want to be able to do is
filter the data so that the map function is only invoked with <key, value>
pairs where 'value' is greater than some input value.

Does such a feature already exist or would I need to implement my own
RecordReader to do this filter? Is this the right place to do this in
Hadoop's input pipeline?

What I essentially want is a cheap index. By sorting the values ahead of time,
you could just do a binary search on the InputSplit until you found the
first value that satisfies the predicate. The RecordReader would then
start at this point in the file, read in all the remaining lines, and pass the
records to map().
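For concreteness, here is a rough plain-Java sketch of that binary search (this is not an existing Hadoop feature; the class and method names are made up). It assumes the file holds `key<TAB>value` lines sorted ascending by a numeric value, and returns the byte offset of the first line whose value exceeds the threshold:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: the seek a custom RecordReader could perform once,
// up front, over a value-sorted "key<TAB>value" text file. It returns the
// byte offset of the first line whose value exceeds the threshold; the
// reader would then stream lines from there and hand each one to map().
public class SortedFileFilter {

    // parse the numeric value field from a "key\tvalue" line
    static long valueOf(String line) {
        return Long.parseLong(line.substring(line.indexOf('\t') + 1));
    }

    // binary search over byte offsets; each probe seeks to the midpoint,
    // discards the partial line, and inspects the next full line's value
    static long seekPastThreshold(RandomAccessFile raf, long threshold)
            throws IOException {
        long lo = 0, hi = raf.length();
        long best = raf.length();          // offset of first qualifying line
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            raf.seek(mid);
            if (mid > 0) raf.readLine();   // align to the next line start
            long start = raf.getFilePointer();
            String line = raf.readLine();
            if (line != null && valueOf(line) <= threshold) {
                lo = raf.getFilePointer(); // qualifying lines start later
            } else {
                if (line != null) best = start;
                hi = mid;                  // look for an earlier boundary
            }
        }
        return best;
    }
}
```

Each probe touches only a couple of lines, so the whole search costs O(log n) seeks before the sequential read begins.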

Any thoughts?
--
Andy Pavlo
pavlo@cs.brown.edu


  • Joydeep Sen Sarma at Mar 5, 2008 at 4:31 pm
    MapFiles provide a cheap index. There was a discussion about this on the
    list within the last week, I think. You would still have to write your own
    record reader, I would think, so that it does not process splits that fall
    below the starting location (where the starting location is determined by
    the index lookup).

    Writing your own splitter would be even better, because with the
    record-reader option all the map tasks still get scheduled (although many
    may terminate immediately). With the splitter, you can construct splits
    that lie only in the interesting interval.

    If you don't have an index, you can still do it relatively cheaply: in the
    record reader, read the first value from the next split, and decide based
    on that whether this map task is interesting or not.
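    A rough plain-Java sketch of that last trick (the names are illustrative, not Hadoop API): peek at the first record of the next split, and if it is still at or below the threshold, the sorted order guarantees the current split holds nothing of interest.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the no-index shortcut described above: because the
// file is value-sorted, a split can be skipped whenever the first record of
// the *next* split is still at or below the threshold. splitEnd is the byte
// offset at which the current split ends.
public class SplitProbe {

    // parse the numeric value field from a "key\tvalue" line
    static long valueOf(String line) {
        return Long.parseLong(line.substring(line.indexOf('\t') + 1));
    }

    // true if this split might contain records with value > threshold;
    // false means the map task can exit immediately. The probe reads at most
    // one partial line plus one full line. If splitEnd falls exactly on a
    // line boundary, the probe lands one line late, which at worst scans one
    // extra split -- it never skips a split that has matches.
    static boolean splitMayContainMatches(RandomAccessFile raf, long splitEnd,
                                          long threshold) throws IOException {
        if (splitEnd >= raf.length()) return true;  // last split: always scan
        raf.seek(splitEnd);
        raf.readLine();                             // skip the partial line
        String first = raf.readLine();
        return first == null || valueOf(first) > threshold;
    }
}
```

    A record reader would call this once in its setup and report end-of-input immediately when it returns false.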


    -----Original Message-----
    From: Andy Pavlo
    Sent: Tue 3/4/2008 9:11 PM
    To: core-user@hadoop.apache.org
    Subject: Using Sorted Files For Filtering Input (File Index)

  • Ted Dunning at Mar 5, 2008 at 6:36 pm
    You can definitely use the approach that you suggest, and you should have
    good results if you are looking for only a small fraction of the file.
    Basically, have the record reader check whether any interesting records
    exist in the current split; if so, read them, and if not, just exit.

    You should be careful, however, about doing too many random accesses,
    since they will kill your disk performance. It would be ideal if the
    record reader could determine whether there are interesting records
    without reading any data from the input split, possibly by reference to
    an external index of file offsets. Whenever you do find an interesting
    split, I would recommend that you read the whole split rather than worry
    about reading only the interesting records. That will help avoid random
    read patterns.

    Another issue is that if only a small number of records are interesting,
    then you may have very limited parallelism unless you invoke a very large
    number of readers.



    An alternative approach would be to have a side file with file offsets and
    build an input format that would treat a byte range as if it were a normal
    input. Essentially what you would have would be FileByteRangeSplit in place
    of a FileSplit. That would avoid all of the problems with limited
    parallelism and checking offsets and so on.
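    A rough sketch of what that side file and split planner might look like (IndexEntry, startOffset, and planSplits are made-up names, not Hadoop classes): a sparse index of (value, byte offset) checkpoints locates a safe starting offset, and the tail of the file is then carved into byte-range splits so parallelism is preserved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the side-file idea: a sparse index of
// (value, byteOffset) checkpoints sampled while the sorted file was written.
// A custom input format could consult it when computing splits and emit
// byte-range splits only for the part of the file that can satisfy the
// predicate.
public class ByteRangePlanner {

    static final class IndexEntry {
        final long value;   // value of the record that starts at 'offset'
        final long offset;  // byte offset of that record in the data file
        IndexEntry(long value, long offset) {
            this.value = value;
            this.offset = offset;
        }
    }

    // the last checkpoint at or below the threshold is a safe start:
    // every record with value > threshold lies at or after its offset
    static long startOffset(List<IndexEntry> index, long threshold) {
        long start = 0;
        for (IndexEntry e : index) {
            if (e.value <= threshold) start = e.offset;
            else break;        // index is value-sorted; nothing later helps
        }
        return start;
    }

    // carve [start, fileLength) into byte ranges, one per map task, so the
    // job keeps its parallelism even when the interesting region is small
    static List<long[]> planSplits(long start, long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long off = start; off < fileLength; off += splitSize) {
            splits.add(new long[] { off, Math.min(off + splitSize, fileLength) });
        }
        return splits;
    }
}
```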


Discussion Overview
group: common-user
categories: hadoop
posted: Mar 5, '08 at 5:11a
active: Mar 5, '08 at 6:36p
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop
