Mar 29, 2011 at 1:39 pm
On Tue, Mar 29, 2011 at 03:20:38PM +0200, Eric wrote:
> I'm interested in hearing how you get data into and out of HDFS. Are you
> using tools like Flume? Are you using fuse_dfs? Are you putting files on
> HDFS with "hadoop dfs -put ..."?
> And how does your method scale? Can you move terabytes of data per day? Or
> are we talking gigabytes?
I'm currently migrating our ~600TB datastore to HDFS. To transfer the data,
we iterate through the raw files stored on our legacy data servers and write
them to HDFS using `hadoop fs -put`. So far, I've limited the number of servers
participating in the migration, so we've only had on the order of 20 parallel
writers. This week, I plan to increase that by at least an order of magnitude.
I expect to be able to scale the migration horizontally without impacting our
current production system. Then, when the transfers are complete, we can cut our
protocol endpoints over without significant downtime. At least, that's the plan.
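The per-server transfer loop described above could look something like the sketch below. This is only an illustration of the approach, not the actual migration script; SRC_DIR and DEST_DIR are hypothetical placeholder paths.

```shell
#!/bin/sh
# Sketch of one server's transfer loop: walk the raw files on a legacy
# data server and copy each into HDFS with `hadoop fs -put`.
# SRC_DIR and DEST_DIR are illustrative placeholders.
SRC_DIR=/data/legacy
DEST_DIR=/ingest

find "$SRC_DIR" -type f | while IFS= read -r f; do
    # Preserve the source directory layout under the HDFS destination
    # by stripping the SRC_DIR prefix from each path.
    rel="${f#"$SRC_DIR"/}"
    hadoop fs -put "$f" "$DEST_DIR/$rel"
done
```

Running one such loop on each participating legacy server gives you the horizontal scaling described above: ~20 parallel writers today, more as additional servers join the migration.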