Hi all,

I am a newbie to HBase and Hadoop. I have set up a cluster of 4 machines
and am trying to insert data. 3 of the machines are tasktrackers, with 4
map tasks each.

My data consists of about 1.3 billion rows with 4 columns each (a 100GB
txt file). The column structure is "rowID, word1, word2, word3". My DFS
replication in Hadoop and HBase is set to 3 each. I use a single
column family with 3 qualifiers, one per word field (word*).

I am using the SampleUploader present in the HBase distribution. It has
taken around 21 hrs to complete 40% of the insertion, and it is still
running, with 12 map tasks. Is the insertion time taken here in the
expected range? When I used Lucene, I was able to insert the entire data
set in about 8 hours.

Also, there seems to be a huge explosion of data size here. With a
replication factor of 3 for HBase, I was expecting the inserted table
size to be around 350-400GB for my 100GB txt file: 300GB for replicating
the data 3 times and 50+ GB for additional storage information. But even
at 40% completion of the data insertion, the space occupied is around
550GB (it looks like it might take around 1.2TB for a 100GB file). I have
used a String rowID instead of a Long. Would that account for such a
rapid increase in data storage?

Regards,
Kranthi


  • Yuzhihong at Dec 4, 2011 at 7:51 pm
    May I ask whether you pre-split your table before loading?


  • Kranthi reddy at Dec 5, 2011 at 5:23 am
    No, I split the table on the fly. I did this because converting my
    table into the HBase format (rowID, family, qualifier, value) up front
    would result in the input file being around 300GB. Hence, I decided to
    do the splitting and generate this format on the fly.

    Will this affect the performance so heavily?
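    For context, "pre-splitting" in the question above usually means creating the table with region split keys up front, so that a bulk load spreads across all regionservers instead of piling into one region. A minimal sketch of computing split keys for numeric-string row IDs like the ones in this thread (the 100-region count and padding width are assumptions; the zero-padding matters because HBase orders row keys as bytes, so lexicographic order must match numeric order):

    ```python
    def split_keys(max_id, num_regions, width=10):
        """Evenly spaced, zero-padded split keys for numeric-string row IDs."""
        step = max_id // num_regions
        # Zero-pad so lexicographic byte order matches numeric order.
        return [str(i * step).zfill(width).encode("utf-8")
                for i in range(1, num_regions)]

    keys = split_keys(1_300_000_000, 100)  # 99 boundaries -> 100 regions
    ```

    In the Java client, keys like these would then be passed to `HBaseAdmin.createTable(descriptor, splitKeys)` when the table is created.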


    --
    Kranthi Reddy. B

    http://www.setusoftware.com/setu/index.htm
  • Ulrich Staudinger at Dec 5, 2011 at 7:57 am
    Hi there,

    While I cannot give you any concrete advice on your particular storage
    problem, I can share some experience with you regarding performance.

    I also bulk import data regularly: around 4GB every day, in about 150
    files with between 10,000 and 30,000 lines each.

    My first approach was to read every line and put it separately, which
    resulted in a load time of about an hour. My next approach was to read
    an entire file, add each individual put to a list, and then store the
    entire list at once. This works fast in the beginning, but after about
    20 files the server ran into compactions and couldn't cope with the
    load; finally the master crashed, leaving the regionserver and
    ZooKeeper running. In HBase's defense, I have to say that I did this on
    a standalone installation without Hadoop underneath, so the test may
    not be entirely fair.

    Next, I switched to a proper Hadoop layer with HBase on top. I now put
    around 100-1000 lines (or puts) at once in a bulk commit, and have
    insert times of around 0.5ms per row, which is very decent. My entire
    import now takes only 7 minutes.

    I think you must find a balance between the performance of your
    servers, how quickly they handle compactions, and the amount of data
    you put at once. I have definitely found single puts to result in low
    performance.

    Best regards,
    Ulrich




  • Kranthi reddy at Dec 5, 2011 at 9:10 am
    Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
    do the bulk insert? I was of the opinion that HBase would flush all the
    puts to disk when its memstore is filled, the property being defined in
    hbase-default.xml. Is my understanding wrong here?


  • Ulrich Staudinger at Dec 5, 2011 at 3:14 pm
    The point I am referring to is not so much when HBase's server side
    flushes, but when the client side flushes. If you put every value
    immediately, every put results in an RPC call. If you collect the data
    on the client side and flush manually, one RPC call carries hundreds
    or thousands of small puts, instead of hundreds or thousands of
    individual put RPC calls.

    Another issue: I am not sure what happens if you collect hundreds of
    thousands of small puts, which might well be bigger than the memstore,
    and flush then. I guess the HBase client will hang.
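    The idea can be sketched generically (illustrative Python, not the HBase Java client; there the equivalent is roughly `HTable.setAutoFlush(false)` together with a write buffer that is flushed in batches):

    ```python
    class BufferedWriter:
        """Buffer puts client-side and send one batch per RPC."""

        def __init__(self, send_batch, batch_size=1000):
            self.send_batch = send_batch   # stand-in for one multi-put RPC
            self.batch_size = batch_size
            self.buffer = []

        def put(self, value):
            self.buffer.append(value)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if self.buffer:
                self.send_batch(self.buffer)  # one RPC carrying many puts
                self.buffer = []

    rpcs = []
    writer = BufferedWriter(rpcs.append, batch_size=3)
    for row in range(7):
        writer.put(row)
    writer.flush()
    # 3 RPCs instead of 7 single-put calls
    ```

    The batch size is the balance Ulrich describes: large enough to amortize the round trip, small enough not to overwhelm the client or the server's memstore.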



  • Kranthi reddy at Dec 5, 2011 at 4:33 pm
    OK. But can someone explain why the data size is exploding the way I
    mentioned earlier?

    I have tried inserting a sample of around 12GB of data. The space
    occupied by the HBase table is around 130GB. All my columns, including
    the rowID, are strings. I have even tried converting my rowID to long,
    but that seems to occupy more space, around 150GB.

    Sample rows

    0-<>-f-<>-c-<>-Anarchism
    0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
    0-<>-f-<>-e2-<>-anarchy
    1-<>-f-<>-c-<>-Anarchism
    1-<>-f-<>-e1-<>-anarchy
    1-<>-f-<>-e2-<>-state (polity)
    2-<>-f-<>-c-<>-Anarchism
    2-<>-f-<>-e1-<>-anarchy
    2-<>-f-<>-e2-<>-political philosophy
    3-<>-f-<>-c-<>-Anarchism
    3-<>-f-<>-e1-<>-The Globe and Mail
    3-<>-f-<>-e2-<>-anarchy
    4-<>-f-<>-c-<>-Anarchism
    4-<>-f-<>-e1-<>-anarchy
    4-<>-f-<>-e2-<>-stateless society

    Is there a way I can find out the number of bytes occupied by each
    key:value for each cell?
  • Kranthi reddy at Dec 5, 2011 at 5:26 pm
    1) Does a dfs.replication factor of "3" in general result in a table
    data size of 3x + y (where x is the size of the file in the local file
    system and y is some additional space for meta information)?

    2) Does HBase pre-allocate space for all the cell versions when a cell
    is created for the first time?

    Unfortunately, I am just unable to wrap my head around such an
    explosive increase in data size. Except for this case happening (which
    I doubt), I just don't see how such growth of the table data is
    possible.

    3) Or is it a case of my KEY being larger than my VALUE, and hence
    resulting in such a large size increase?

    *Similar to the sample rows in my previous mail, I have around 300
    million entries, and the rowID increases linearly.*
  • Doug Meil at Dec 5, 2011 at 5:43 pm
    Hi there-

    Have you looked at this?

    http://hbase.apache.org/book.html#keyvalue
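    That section explains why the blow-up is expected: HBase stores every cell as a full KeyValue, repeating the row key, column family, qualifier, timestamp and type alongside the value. A back-of-the-envelope sketch, assuming the fixed-width fields of that layout and cell contents shaped like the sample rows earlier in the thread (the value length is a guess):

    ```python
    # Fixed KeyValue fields: key length (4) + value length (4) + row length (2)
    # + family length (1) + timestamp (8) + key type (1) = 20 bytes per cell.
    FIXED = 4 + 4 + 2 + 1 + 8 + 1

    def cell_bytes(row, family, qualifier, value):
        return FIXED + len(row) + len(family) + len(qualifier) + len(value)

    # One input line "123456789, word1, word2, word3" becomes three cells,
    # each repeating the row key and the family:
    per_row = sum(cell_bytes(b"123456789", b"f", q, b"average-value..")
                  for q in (b"c", b"e1", b"e2"))
    ```

    With ~15-byte values this comes to roughly 140 bytes per logical row, versus ~40 bytes for the raw text line. Multiply by HDFS replication of 3 and a 12GB input landing around 130GB is roughly in line, before compression and before compactions clean up any redundant copies left by flushes.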




    On 12/5/11 11:33 AM, "kranthi reddy" wrote:

    Ok. But can some1 explain why the data size is exploding the way I have
    mentioned earlier.

    I have tried to insert sample data of arnd 12GB. The data occupied by
    Hbase
    table is arnd 130GB. All my columns i.e. including the ROWID are strings.
    I
    have even tried converting by ROWID to long, but that seems to occupy more
    space i.e. arnd 150GB.

    Sample rows

    0-<>-f-<>-c-<>-Anarchism
    0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
    0-<>-f-<>-e2-<>-anarchy
    1-<>-f-<>-c-<>-Anarchism
    1-<>-f-<>-e1-<>-anarchy
    1-<>-f-<>-e2-<>-state (polity)
    2-<>-f-<>-c-<>-Anarchism
    2-<>-f-<>-e1-<>-anarchy
    2-<>-f-<>-e2-<>-political philosophy
    3-<>-f-<>-c-<>-Anarchism
    3-<>-f-<>-e1-<>-The Globe and Mail
    3-<>-f-<>-e2-<>-anarchy
    4-<>-f-<>-c-<>-Anarchism
    4-<>-f-<>-e1-<>-anarchy
    4-<>-f-<>-e2-<>-stateless society

    Is there a way I can find the number of bytes occupied by each
    key:value in each cell?
    On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger wrote:

    The point I refer to is not so much when HBase's server side flushes,
    but when the client side flushes. If you put every value immediately,
    every put results in an RPC call. If you collect the data on the
    client side and flush manually, it results in one RPC call carrying
    hundreds or thousands of small puts, instead of hundreds or thousands
    of individual put RPC calls.

    Another issue: I am not so sure what happens if you collect hundreds
    of thousands of small puts, which might possibly be bigger than the
    memstore, and flush then. I guess the HBase client will hang.
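    The client-side buffering described here can be sketched generically
    (Python purely to illustrate the pattern; the real HBase client is Java,
    and the class and method names below are illustrative, not an HBase API):

```python
class BufferedPutter:
    """Collect puts client-side and send them in RPC-sized batches,
    mimicking a client write buffer instead of one RPC per put."""

    def __init__(self, send_batch, batch_size=1000):
        self.send_batch = send_batch  # callable that ships a list of puts in one RPC
        self.batch_size = batch_size
        self.buffer = []
        self.rpc_calls = 0

    def put(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Ship whatever is buffered as a single batch, then start a new buffer.
        if self.buffer:
            self.send_batch(self.buffer)
            self.rpc_calls += 1
            self.buffer = []

batches = []
putter = BufferedPutter(batches.append, batch_size=100)
for i in range(250):
    putter.put(i)
putter.flush()
print(putter.rpc_calls)  # 3 batched calls instead of 250 individual ones
```

    The same trade-off Ulrich mentions applies: the larger the batch, the
    fewer RPCs, but an unbounded buffer risks outrunning the server side.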




    On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com> wrote:

    Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
    handle the bulk insert? I was of the opinion that HBase would flush
    all the puts to disk when its memstore is filled, a property defined
    in hbase-default.xml. Is my understanding wrong here?



    On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <ustaudinger@activequant.org> wrote:

    Hi there,

    While I cannot give you any concrete advice on your particular
    storage problem, I can share some experiences regarding performance.
    I also bulk import data regularly: around 4GB every day, in about
    150 files with between 10'000 and 30'000 lines each.

    My first approach was to read every line and put it separately,
    which resulted in a load time of about an hour. My next approach was
    to read an entire file, add each individual put to a list, and then
    store the entire list at once. This works fast in the beginning, but
    after about 20 files the server ran into compactions and couldn't
    cope with the load; finally the master crashed, leaving the
    regionserver and zookeeper running. In HBase's defense, I did this on
    a standalone installation without Hadoop underneath, so the test may
    not be entirely fair.

    Next, I switched to a proper Hadoop layer with HBase on top. I now
    also put around 100-1000 lines (or puts) at once, in a bulk commit,
    and see insert times of around 0.5ms per row - which is very decent.
    My entire import now takes only 7 minutes.

    I think you must find a balance between how quickly your servers
    handle compactions and the amount of data you put at once. I have
    definitely found single puts to result in low performance.

    Best regards,
    Ulrich





    On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com> wrote:

    No, I split the table on the fly. I did this because converting my
    table into HBase format (rowID, family, qualifier, value) would
    result in the input file being around 300GB, so I decided to do the
    splitting and generate this format on the fly.

    Will this affect the performance so heavily?
    On Mon, Dec 5, 2011 at 1:21 AM, wrote:

    May I ask whether you pre-split your table before loading ?
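    For context, pre-splitting means creating the table with region
    boundaries up front, so a bulk load spreads across regionservers instead
    of hammering the single initial region. A minimal sketch of deriving
    evenly spaced split keys for numeric string row IDs (illustrative only;
    in the Java API such splits are passed when creating the table, and note
    that unpadded numeric strings sort lexicographically, so zero-padding
    the row IDs matters):

```python
def split_keys(num_rows, num_regions):
    """Evenly spaced split points for zero-padded numeric string row IDs.

    Sketch under the assumption that row IDs are zero-padded to a fixed
    width so lexicographic order matches numeric order.
    """
    width = len(str(num_rows - 1))          # digits needed for the largest ID
    step = num_rows // num_regions          # rows per region, roughly
    return [str(i * step).zfill(width).encode() for i in range(1, num_regions)]

# 1.3 billion rows spread over 12 regions -> 11 split points
print(split_keys(1_300_000_000, 12))
```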



  • Kranthi reddy at Dec 19, 2011 at 5:55 am
    Hi all,

    I now understand why my storage is occupying such huge space.

    I still have an issue with insertion time. I currently have 0.1
    billion records (in HBase format; in future it will run into a few
    billions) and am inserting them using 12 map tasks running on a
    4-machine Hadoop cluster.

    The time taken is approximately 3 hours, which works out to around
    770 row insertions per map task per second. Is this good, or can it
    be improved?

    0.1 billion -> 100,000,000 / (180 min * 60 sec * 12 map tasks) ≈ 770.
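    The rate arithmetic can be checked directly (it comes out slightly above
    770, not 750):

```python
rows = 100_000_000            # 0.1 billion records
minutes, map_tasks = 180, 12  # 3 hours, 12 concurrent map tasks
rate = rows / (minutes * 60 * map_tasks)  # rows per map task per second
print(round(rate))  # 772
```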

    I have tried using the batch() function, but there is no improvement
    in insertion time.

    I have attached the code I am using to insert. Can someone please
    check whether what I am doing is the fastest way to insert data?

    Regards,
    Kranthi



    --
    Kranthi Reddy. B

    http://www.setusoftware.com/setu/index.htm

Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: Dec 4, '11 at 2:20p
active: Dec 19, '11 at 5:55a
posts: 10
users: 4
website: hbase.apache.org
