All,

The examples in the HBase distribution, and on the Hadoop wiki, all reference the deprecated interfaces of the mapred package. Are there any examples of how to use HBase as the input for a MapReduce job that uses the mapreduce package instead? I'm looking to set up a job which will read from an HBase table based on a row value passed into the job, and which starts the map with the row values as the keys and the column names (or values) as the map values.

James Kilbride


  • Harsh J at Jul 6, 2010 at 4:11 pm
    I believe this article will help you understand the new (well, not so new
    anymore) API with HBase MR:
    http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html
    [Look at the second example, which uses the Put object]
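
    If it helps, here is roughly what the read side looks like with the new
    API (a sketch only; the table, family, and output path names are made up):

    // Sketch: a map-only job that reads an HBase table using the new
    // org.apache.hadoop.hbase.mapreduce API. Table, family, and output
    // path names are hypothetical.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReadTableExample {

      // Emits (row key, 1) for every row the scan hands us.
      static class RowMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable row, Result value,
            Context context) throws IOException, InterruptedException {
          context.write(new Text(Bytes.toString(row.get())), ONE);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new HBaseConfiguration();
        Job job = new Job(conf, "read-mytable");
        job.setJarByClass(ReadTableExample.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("myfamily")); // limit to one family

        // Wires up TableInputFormat and serializes the Scan into the job
        // configuration.
        TableMapReduceUtil.initTableMapperJob("mytable", scan,
            RowMapper.class, Text.class, IntWritable.class, job);

        job.setNumReduceTasks(0); // map-only, write straight to files
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/mytable-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }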

    --
    Harsh J
    www.harshj.com
  • Kilbride, James P. at Jul 6, 2010 at 4:41 pm
    This is an interesting start, but I'm really interested in the opposite direction, where HBase is the input to my MapReduce job; I'm then going to push some data into reducers, and ultimately I'm okay with them just writing it to a file.

    I get the impression that I need to set up a TableInputFormat type of object. But since Job only allows you to call setInputFormatClass, I'm not sure how to dynamically configure the input format class to accept parameters that limit its scan of the table to only specific rows. Here's the general thrust of what I'm trying to do with MapReduce and HBase.

    I have a table called People, which has rows of people (names, IDs, whatever is used for identifying a person in the system). That table also has a column family called relatives, where the column qualifiers are the names of relatives of the person. I want to pass into the InputFormat the names of the people I want it to look up, and the mapper should get the person's name as the key and the relatives column family as the value (that's the result of the scan limitations I'm putting into place).

    I then retrieve the relatives (in the map function), look at the relationships between them, and push onto the context the relative's name (keyOut) and a floating-point value (valueOut). The reducer will combine all these floating-point values for each relative and output (a file is fine) the relative's name and cumulative score.
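
    Roughly, the mapper I have in mind would look something like this (a
    sketch only; the scoring is obviously a placeholder):

    // Sketch of the mapper described above; the scoring is a placeholder.
    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;

    public class RelativesMapper extends TableMapper<Text, FloatWritable> {

      private static final byte[] RELATIVES = Bytes.toBytes("relatives");

      @Override
      protected void map(ImmutableBytesWritable row, Result value,
          Context context) throws IOException, InterruptedException {
        // Each qualifier in the 'relatives' family is a relative's name.
        Map<byte[], byte[]> relatives = value.getFamilyMap(RELATIVES);
        if (relatives == null) {
          return; // person has no relatives recorded
        }
        for (byte[] relativeName : relatives.keySet()) {
          float score = 1.0f; // placeholder for the real relationship score
          context.write(new Text(Bytes.toString(relativeName)),
              new FloatWritable(score));
        }
      }
    }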

    But I can't seem to figure out how to set up a job that uses the TableInputFormat I want, and which also allows me to set its parameters so that it will only give me the people I ask for when I run the program, not the entire table.

    Does this make any sense?

    James Kilbride

  • Jean-Daniel Cryans at Jul 6, 2010 at 4:54 pm
    Does this make any sense?
    Not in a MapReduce context. What you want to do is a LIKE with a bunch of
    values, right? Since a mapper will always read all the input it's given
    (minus some filters, like you can do with HBase), whatever you do will
    always end up being a full table scan. You "could" solve your problem by
    configuring your Scan object with a RowFilter that knows about the names
    you are looking for, but that still ends up being a full scan on the
    region server side, so it will be slow and will generate a lot of IO.
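
    For the record, by configuring the Scan with a RowFilter I mean something
    along these lines (a sketch; names made up). You would then hand the
    resulting Scan to TableMapReduceUtil when setting up the job:

    // Sketch: restrict a Scan to an explicit set of row keys by OR-ing
    // RowFilters together. The region servers still walk the whole table.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanForNames {
      public static Scan scanFor(String... names) {
        List<Filter> perName = new ArrayList<Filter>();
        for (String name : names) {
          perName.add(new RowFilter(CompareOp.EQUAL,
              new BinaryComparator(Bytes.toBytes(name))));
        }
        Scan scan = new Scan();
        // MUST_PASS_ONE ORs the filters: keep a row if any name matches.
        scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ONE,
            perName));
        return scan;
      }
    }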

    WRT examples, HBase ships with a couple of utility classes that can also be
    used as examples. The Export class has the Scan configuration stuff:
    http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java

    J-D
  • Kilbride, James P. at Jul 6, 2010 at 5:02 pm
    So, if that's the case (and your argument makes sense given how Scan versus Get works), I'd have to write a custom InputFormat class that looks like the TableInputFormat class, but uses a Get (or series of Gets) rather than a Scan object, as the current table mapper does?

    James Kilbride

  • Jean-Daniel Cryans at Jul 6, 2010 at 5:12 pm
    That won't be very efficient either... Are you trying to do this for a
    real-time user request? If so, it really isn't the way you want to go.

    If you are in a batch processing situation, I'd say it depends on how many
    rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only
    to find 10 rows really doesn't make sense. How do you determine which users
    you need to process? How big is your dataset? I understand that you wish to
    use the MR-provided functionality of grouping and such, but simply issuing
    a bunch of Gets in parallel may just be easier to write and maintain.
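
    By a bunch of Gets in parallel I mean nothing fancier than this (a sketch;
    the table name and threading details are up to you):

    // Sketch: fetch a known set of rows with parallel Gets, no MR at all.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelGets {
      public static List<Result> fetch(final String tableName,
          List<String> rowKeys, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Result>> futures = new ArrayList<Future<Result>>();
        for (final String key : rowKeys) {
          futures.add(pool.submit(new Callable<Result>() {
            public Result call() throws Exception {
              // HTable is not thread-safe, so use one per task.
              HTable table = new HTable(new HBaseConfiguration(), tableName);
              try {
                return table.get(new Get(Bytes.toBytes(key)));
              } finally {
                table.close();
              }
            }
          }));
        }
        List<Result> results = new ArrayList<Result>();
        for (Future<Result> f : futures) {
          results.add(f.get()); // rethrows any fetch failure
        }
        pool.shutdown();
        return results;
      }
    }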

    J-D
  • Kilbride, James P. at Jul 6, 2010 at 5:53 pm
    I'm assuming the rows being pulled back are a small subset of the full row set of the entire database, say the 10-out-of-2B case. But each row has a column family whose 'columns' are actually rowIds in the database (basically my one-to-many relationship mapping). I'm not trying to use MR for the initial get of 10 rows, but rather because each of those 10 initial rows generates potentially hundreds or thousands of other calls.

    I am trying to do this for a real-time user request, but I expect the total processing to take some time, so it's more of a user-initiated call. There may also be dozens of users making the request at any given time, so I want to farm this out into the MR world so that multiple instances of the job can be running (with completely different starting rows) at any given time.

    I could do this using a serialized local process, but I explicitly want some of my processing, which could take some time, to happen out in the MapReduce world to take advantage of spare cycles elsewhere, as well as potential data locality; and the fact that it is a parallelizable problem seems to imply that M/R would be a logical way to do it.

    James Kilbride

  • Jean-Daniel Cryans at Jul 6, 2010 at 9:39 pm
    (moving the thread to the HBase user mailing list; on reply, please remove
    the general@ since this is not a general question)

    It is indeed a parallelizable problem that could use a job management
    system, but in your case I don't think MR is the right solution. You will
    have to do all sorts of weird tweaks, and in the end you won't get much
    out of it, since you basically want to process a tiny portion of the whole
    dataset. You also talk about possible data locality, but I don't see that
    being a particularly strong argument in what you describe. Yes, you could
    start one mapper per region that contains some of the rows you are looking
    for, but the cost of starting and managing those JVMs is high compared to
    just starting one process that does the work (since it can be done easily
    in a single process that can be multi-threaded).
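
    To make that concrete, a single multi-threaded process for the job you
    describe could be as simple as this (a rough sketch; the table and family
    names and the scoring are placeholders):

    // Sketch: the whole computation in one multi-threaded client process.
    // Table/family names and the scoring are placeholders.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RelativesScorer {
      private static final byte[] RELATIVES = Bytes.toBytes("relatives");

      public static Map<String, Float> score(String[] people, int threads)
          throws InterruptedException {
        final ConcurrentHashMap<String, Float> scores =
            new ConcurrentHashMap<String, Float>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String person : people) {
          pool.submit(new Runnable() {
            public void run() {
              try {
                HTable table = new HTable(new HBaseConfiguration(), "People");
                try {
                  Result row = table.get(new Get(Bytes.toBytes(person)));
                  Map<byte[], byte[]> relatives = row.getFamilyMap(RELATIVES);
                  if (relatives == null) {
                    return;
                  }
                  for (byte[] name : relatives.keySet()) {
                    String relative = Bytes.toString(name);
                    float delta = 1.0f; // placeholder relationship score
                    Float old = scores.get(relative);
                    // naive read-modify-write; a real version would merge
                    // per-thread maps or use atomic accumulation
                    scores.put(relative, old == null ? delta : old + delta);
                  }
                } finally {
                  table.close();
                }
              } catch (Exception e) {
                throw new RuntimeException(e);
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return scores;
      }
    }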

    To sum up, using MR on a small dataset basically gives you all of the
    disadvantages for almost none of the advantages.

    Instead, you could look into running Gearman (or similar) on those
    machines; that would give you exactly what you need, IMHO.

    J-D