FAQ

[HBase-user] HBase Bulk Load script

Marc Limotte
Dec 23, 2010 at 9:12 pm
Hi,

I'm using the HBase Bulk Loader
<http://archive.cloudera.com/cdh/3/hbase/bulk-loads.html> with 0.89. Very
easy to use. I have a few questions:

1) It seems importtsv will only accept one family at a time. It shows some
sort of security access error if I give it a column list with columns from
different families. Is this a limitation of the bulk loader, or is this a
consequence of some security configuration somewhere?

2) Does the bulk load process respect the hbase family's compression
setting? If not, is there a way to trigger the compression after the fact
(major compaction, for example)?

3) Am I correct in thinking that the importtsv step can run on a separate
cluster from the hbase cluster (assuming you have an hbase client config and
libraries)? And if so, for the completebulkload step, will I need to
manually copy the output of importtsv to the hbase cluster's HDFS? Or can I
provide a remote hdfs path, or even an S3 path for the completebulkload
program?
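
For reference, the two-step workflow being discussed looks roughly like the
following. Table name, paths, and the column spec are placeholders, and the
flags are taken from the 0.89-era importtsv/completebulkload tools, so check
them against your build:

# Step 1: run importtsv as a MapReduce job that writes HFiles instead of
# inserting into the live table (importtsv.bulk.output switches it to HFile output)
hadoop jar hbase-VERSION.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=/user/marc/hfiles \
  mytable /user/marc/input.tsv

# Step 2: hand the generated HFiles to HBase, which moves them into the table
hadoop jar hbase-VERSION.jar completebulkload /user/marc/hfiles mytable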

Thanks for providing this tool.

Marc

6 responses

  • Lars George at Dec 23, 2010 at 9:52 pm
    Hi Marc,
    1) It seems importtsv will only accept one family at a time. It shows some
    sort of security access error if I give it a column list with columns from
    different families.  Is this a limitation of the bulk loader, or is this a
    consequence of some security configuration somewhere?
    That is how it was implemented until recently; see
    https://issues.apache.org/jira/browse/HBASE-1861 for details.
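    For illustration, once the multi-family support from HBASE-1861 is in your
    build, a column spec mixing families should look something like the line
    below (family and column names here are made up); until then, each family
    needs its own importtsv run:

    -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf1:age,cf2:address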
    2)  Does the bulk load process respect the hbase family's compression
    setting?  If not, is there a way to trigger the compression after the fact
    (major compaction, for example)?
    You can specify the compression, I believe, as a configuration option
    handled by the HFOF (HFileOutputFormat). Otherwise yes, switch it on (if
    not already done) and run a major compaction to get all files compressed.
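    A rough sketch of the after-the-fact route from the HBase shell, assuming
    a table 'mytable' with family 'cf' (older releases require disabling the
    table before altering it):

    disable 'mytable'
    alter 'mytable', {NAME => 'cf', COMPRESSION => 'LZO'}
    enable 'mytable'
    major_compact 'mytable'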
    3) Am I correct in thinking that the importtsv step can run on a separate
    cluster from the hbase cluster (assuming you have an hbase client config and
    libraries)?  And if so, for the completebulkload step, will I need to
    manually copy the output of importtsv to the hbase cluster's HDFS?  Or can I
    provide a remote hdfs path, or even an S3 path for the completebulkload
    program?
    Not sure if that would work, since the files are placed next to the
    live ones and then moved into place from their temp location. Not sure
    what happens if the local cluster has no /hbase etc.

    Todd, could you help here?
    Thanks for providing this tool.

    Marc
    Lars
  • Todd Lipcon at Dec 23, 2010 at 10:35 pm
    You beat me to it, Lars! Was writing a response when some family arrived for
    the holidays, and when I came back, you had written just what I had started
    :)
    On Thu, Dec 23, 2010 at 1:51 PM, Lars George wrote:

    live ones and then moved into place from their temp location. Not sure
    what happens if the local cluster has no /hbase etc.

    Todd, could you help here?
    Yep, there is a code path where if the HFiles are on a different filesystem,
    it will copy them to the HBase filesystem first. It's not very efficient,
    though, so it's probably better to distcp them to the local cluster first.
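    A minimal sketch of that distcp route (namenode addresses, paths, and the
    jar name are placeholders):

    # copy the HFiles produced by importtsv to the HBase cluster's HDFS
    hadoop distcp hdfs://mr-namenode:8020/user/marc/hfiles \
        hdfs://hbase-namenode:8020/user/marc/hfiles

    # then run completebulkload on the HBase cluster against the local copy
    hadoop jar hbase-VERSION.jar completebulkload /user/marc/hfiles mytable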

    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Marc Limotte at Dec 28, 2010 at 1:07 am
    Lars, Todd,

    Thanks for the info. If I understand correctly, the importtsv command line
    tool will not compress by default and there is no command line switch for
    it, but I can modify the source at
    hbase-0.89.20100924+28/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
    to call FileOutputFormat.setCompressOutput()/setOutputCompressorClass() on
    the Job in order to turn on compression.

    Does that sound right?

    Marc

    On Thu, Dec 23, 2010 at 2:34 PM, Todd Lipcon wrote:

    You beat me to it, Lars! Was writing a response when some family arrived
    for the holidays, and when I came back, you had written just what I had
    started :)
    On Thu, Dec 23, 2010 at 1:51 PM, Lars George wrote:

    live ones and then moved into place from their temp location. Not sure
    what happens if the local cluster has no /hbase etc.

    Todd, could you help here?
    Yep, there is a code path where if the HFiles are on a different filesystem,
    it will copy them to the HBase filesystem first. It's not very efficient,
    though, so it's probably better to distcp them to the local cluster first.

    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Stack at Dec 28, 2010 at 5:09 am
    Sounds right to me, if that's of any consolation, Marc.
    St.Ack
    On Mon, Dec 27, 2010 at 5:07 PM, Marc Limotte wrote:
    Lars, Todd,

    Thanks for the info.  If I understand correctly, the importtsv command line
    tool will not compress by default and there is no command line switch for
    it, but I can modify the source at
    hbase-0.89.20100924+28/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
    to call FileOutputFormat.setCompressOutput/setOutputCompressorClass() on the
    Job; in order to turn on compression.

    Does that sound right?

    Marc

    On Thu, Dec 23, 2010 at 2:34 PM, Todd Lipcon wrote:

    You beat me to it, Lars! Was writing a response when some family arrived
    for the holidays, and when I came back, you had written just what I had
    started :)

    On Thu, Dec 23, 2010 at 1:51 PM, Lars George <lars.george@gmail.com>
    wrote:
    live ones and then moved into place from their temp location. Not sure
    what happens if the local cluster has no /hbase etc.

    Todd, could you help here?
    Yep, there is a code path where if the HFiles are on a different filesystem,
    it will copy them to the HBase filesystem first. It's not very efficient,
    though, so it's probably better to distcp them to the local cluster first.

    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Lars George at Dec 28, 2010 at 9:30 am
    Hi Marc,

    Actually, HFileOutputFormat is what you need to target; the below is
    for other file formats and their compression. HFOF has support for
    compressing the data as it is written, so either add this to your
    configuration

    conf.set("hfile.compression", "lzo");

    or add this to the job startup command

    -Dhfile.compression=lzo

    (or with another compression codec obviously).
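    Put together with the importtsv flags mentioned earlier in the thread, the
    job startup would look roughly like this (everything other than
    hfile.compression is a placeholder or an assumption from the 0.89-era tool):

    hadoop jar hbase-VERSION.jar importtsv \
      -Dhfile.compression=lzo \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
      -Dimporttsv.bulk.output=/user/marc/hfiles \
      mytable /user/marc/input.tsv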

    Lars

    On Tue, Dec 28, 2010 at 2:07 AM, Marc Limotte wrote:
    Lars, Todd,

    Thanks for the info.  If I understand correctly, the importtsv command line
    tool will not compress by default and there is no command line switch for
    it, but I can modify the source at
    hbase-0.89.20100924+28/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
    to call FileOutputFormat.setCompressOutput/setOutputCompressorClass() on the
    Job; in order to turn on compression.

    Does that sound right?

    Marc

    On Thu, Dec 23, 2010 at 2:34 PM, Todd Lipcon wrote:

    You beat me to it, Lars! Was writing a response when some family arrived
    for the holidays, and when I came back, you had written just what I had
    started :)

    On Thu, Dec 23, 2010 at 1:51 PM, Lars George <lars.george@gmail.com>
    wrote:
    live ones and then moved into place from their temp location. Not sure
    what happens if the local cluster has no /hbase etc.

    Todd, could you help here?
    Yep, there is a code path where if the HFiles are on a different filesystem,
    it will copy them to the HBase filesystem first. It's not very efficient,
    though, so it's probably better to distcp them to the local cluster first.

    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Todd Lipcon at Dec 29, 2010 at 7:52 pm
    Also, docs patches welcome :)
    On Tue, Dec 28, 2010 at 1:29 AM, Lars George wrote:

    Hi Marc,

    Actually, HFileOutputFormat is what you need to target, the below is
    for other file formats and their compression. HFOF has support for
    compressing the data as it is written, so either add this to your
    configuration

    conf.set("hfile.compression", "lzo");

    or add this to the job startup command

    -Dhfile.compression=lzo

    (or with another compression codec obviously).

    Lars

    On Tue, Dec 28, 2010 at 2:07 AM, Marc Limotte wrote:
    Lars, Todd,

    Thanks for the info. If I understand correctly, the importtsv command line
    tool will not compress by default and there is no command line switch for
    it, but I can modify the source at
    hbase-0.89.20100924+28/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
    to call FileOutputFormat.setCompressOutput/setOutputCompressorClass() on the
    Job; in order to turn on compression.

    Does that sound right?

    Marc

    On Thu, Dec 23, 2010 at 2:34 PM, Todd Lipcon wrote:

    You beat me to it, Lars! Was writing a response when some family arrived
    for the holidays, and when I came back, you had written just what I had
    started :)

    On Thu, Dec 23, 2010 at 1:51 PM, Lars George <lars.george@gmail.com>
    wrote:
    live ones and then moved into place from their temp location. Not sure
    what happens if the local cluster has no /hbase etc.

    Todd, could you help here?
    Yep, there is a code path where if the HFiles are on a different filesystem,
    it will copy them to the HBase filesystem first. It's not very efficient,
    though, so it's probably better to distcp them to the local cluster first.
    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera


    --
    Todd Lipcon
    Software Engineer, Cloudera
