Hi there,
while I cannot give you any concrete advice on your particular storage
problem, I can share some experiences with you regarding performance.
I also bulk import data regularly, around 4GB every day spread over about
150 files with between 10'000 and 30'000 lines each.
My first approach was to read every line and put it separately, which
resulted in a load time of about an hour. My next approach was to read an
entire file, add each individual put to a list and then store the entire
list at once. This worked fast in the beginning, but after about 20 files
the server ran into compactions, couldn't cope with the load and finally
the master crashed, leaving the region server and ZooKeeper running. In
HBase's defense, I have to say that I did this on a standalone installation
without Hadoop underneath, so the test may not be entirely fair.
Next, I switched to a proper Hadoop layer with HBase on top. I now also
commit around 100-1000 rows (puts) at once in a bulk commit, and I see
insert times of around 0.5ms per row, which is very decent. My entire
import now takes only 7 minutes.
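For reference, a minimal sketch of that kind of batched commit with the
HBase client API might look like the following (the table name, column
family, qualifier and batch size of 500 are made up for illustration):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "import_table");   // hypothetical table name

        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 30000; i++) {                  // stands in for one file's lines
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
            batch.add(put);

            if (batch.size() == 500) {    // commit a few hundred puts per round trip
                table.put(batch);         // one batched call instead of 500 single puts
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            table.put(batch);             // flush the remainder
        }
        table.close();
    }
}

The point is simply that each table.put(batch) is one round trip for a few
hundred rows instead of one round trip per row, which is where most of the
speedup comes from.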
I think you have to find a balance between how quickly your servers can
keep up with compactions and how much data you put at once. Single puts
have definitely given me poor performance.
Best regards,
Ulrich
On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy wrote:
No, I split the table on the fly. I have done this because converting my
table into HBase format (rowID, family, qualifier, value) would result in
the input file being around 300GB. Hence, I decided to do the splitting
and generate this format on the fly.
Will this affect the performance so heavily?
On Mon, Dec 5, 2011 at 1:21 AM, wrote:
May I ask whether you pre-split your table before loading?
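(For readers unfamiliar with the term: pre-splitting means creating the
table with explicit split points up front, so writes are spread across the
region servers from the start instead of piling onto a single region until
it splits on its own. A minimal sketch with the HBase admin API might look
like this; the table name, family and split keys are hypothetical.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("import_table"); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));

        // Explicit split keys so the initial regions (and the write load)
        // are distributed over the region servers from the very first put.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("250000000"),
            Bytes.toBytes("500000000"),
            Bytes.toBytes("750000000")
        };
        admin.createTable(desc, splitKeys);
    }
}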
On Dec 4, 2011, at 6:19 AM, kranthi reddy wrote:
Hi all,
I am a newbie to HBase and Hadoop. I have set up a cluster of 4 machines
and am trying to insert data. 3 of the machines are tasktrackers, with 4
map tasks each.
My data consists of about 1.3 billion rows with 4 columns each (a 100GB
txt file). The column structure is "rowID, word1, word2, word3". My DFS
replication in Hadoop and HBase is set to 3 each. I have only one column
family, with 3 qualifiers, one for each field (word*).
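(For clarity, one input line "rowID, word1, word2, word3" then becomes a
single HBase row with three cells under that one family. A sketch of the
mapping, with a hypothetical family name "f":)

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowMapping {
    // One input line "rowID, word1, word2, word3" becomes one HBase row
    // with three cells in a single family. The family name "f" is a placeholder.
    static Put toPut(String rowId, String word1, String word2, String word3) {
        Put put = new Put(Bytes.toBytes(rowId));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("word1"), Bytes.toBytes(word1));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("word2"), Bytes.toBytes(word2));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("word3"), Bytes.toBytes(word3));
        return put;
    }
}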
I am using the SampleUploader present in the HBase distribution. It has
taken around 21 hrs to complete 40% of the insertion, and it is still
running, with 12 map tasks. I would like to know whether the insertion
time taken here is in line with expectations, because when I used Lucene,
I was able to insert the entire data set in about 8 hours.
Also, there seems to be a huge explosion of data size here. With a
replication factor of 3 for HBase, I was expecting the inserted table size
to be around 350-400GB (300GB for replicating my 100GB txt file 3 times,
plus 50+ GB for additional storage information). But even at 40% completion
of the data insertion, the space occupied is around 550GB (it looks like it
might take around 1.2TB for a 100GB file). I have used a String for the
rowID instead of a Long. Would that account for such a rapid increase in
data storage?
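(As a rough illustration of why the key encoding matters: the snippet below
only compares serialized key lengths and ignores HBase's per-cell overhead,
but it shows the direction of the difference.)

import org.apache.hadoop.hbase.util.Bytes;

public class KeySize {
    public static void main(String[] args) {
        long id = 1234567890L;

        // A long row key always serializes to 8 bytes...
        System.out.println(Bytes.toBytes(id).length);                 // 8

        // ...while the same value as a String costs one byte per digit,
        // and the row key is repeated in every stored cell of the row.
        System.out.println(Bytes.toBytes(Long.toString(id)).length);  // 10
    }
}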
Regards,
Kranthi
--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm