Thanks so much for your response, Alex. You have provided a lot of information
that helped me understand Hadoop better. Thanks again.
Basically, I cannot clearly understand where Hadoop fits and how it
supersedes the technology we already have.
Use case: we have a web app where users perform actions; we have to
track these actions and various parameters related to the action initiator,
and we currently store this information in a database. Our manager has
suggested evaluating Hadoop for this scenario. However, I am not clear on
why every time I run a job in Hadoop I have to create a new output
directory, or how I can track that directory later to read the data analyzed
by Hadoop. And even if I drop the user-action information into Hadoop, I
still have to load the analyzed results back into our database so that the
application knows the trends and responds to various requests accordingly.
Thank You,
Shravan Kumar. M
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
-----------------------------
This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system
administrator -
netopshelpdesk@catalytic.com
_____
From: Alex Loddengaard
Sent: Monday, July 06, 2009 10:52 PM
To: common-user@hadoop.apache.org; shravan.mahankali@catalytic.com
Subject: Re: how to use hadoop in real life?
Answers inline. Hope this is helpful.
Alex
On Mon, Jul 6, 2009 at 5:25 AM, Shravan Mahankali
wrote:
Hi Group,
Finally I have written a sample Mapred program, submitted this job to Hadoop
and got the expected results. Thanks to all of you!
Now I don't have an idea of how to use Hadoop in real life (sorry if I am
asking the wrong question at the wrong time ;-)):
1) If I re-submit my job, Hadoop responds with an error message saying:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
hdfs://localhost:9000/user/root/impressions_output already exists
If the output directory specified to a job already exists, Hadoop won't run
the job. This makes a lot of sense, because it helps you avoid overwriting
your data. Subsequent job runs will have to write to different output
directories.
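Since Hadoop refuses to reuse an existing output directory, one common workaround is to derive a fresh output path for each run, e.g. by appending a timestamp before submitting the job. A minimal sketch in plain Java (the base path impressions_output comes from the error message above; the naming scheme itself is just an assumption):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class OutputPath {
    // Build a unique output directory name by appending a run timestamp,
    // so that each job run writes to a fresh HDFS path and never collides
    // with the output of a previous run.
    static String uniqueOutputDir(String base, Date runTime) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd-HHmmss");
        return base + "-" + fmt.format(runTime);
    }

    public static void main(String[] args) {
        // e.g. /user/root/impressions_output-20090706-225200
        System.out.println(uniqueOutputDir("/user/root/impressions_output", new Date()));
    }
}
```

You would then pass the generated path to the job as its output directory, and record it (in a log, or in your database) so you can find the results later.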
2) How do I execute Hadoop jobs automatically? Let's say I set up a cron job
which runs various Hadoop jobs at specified times. Is this the way it's done
in the Hadoop world?
Cron is a very good tool for running jobs at various times :). That said,
cron does not provide a workflow management system that can tie jobs
together. For workflow management, the community has hamake
(<http://code.google.com/p/hamake/>), Oozie
(<http://issues.apache.org/jira/browse/HADOOP-5303>), and Cascading
(<http://www.cascading.org/>).
3) Can I submit jobs to Hadoop from a different machine/ network/ domain?
A different machine is easy. Just install Hadoop as you did on your other
machines, and in hadoop-site.xml (assuming you're not using 0.20), configure
hadoop.job.ugi (optional, actually), fs.default.name, and
mapred.job.tracker. As for a different network / domain, take a look at
this:
<http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/>
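For reference, the hadoop-site.xml on the submitting machine might look like the fragment below; the hostnames and ports are placeholders, not values from this thread:

```xml
<configuration>
  <!-- HDFS namenode the client should talk to -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
  <!-- JobTracker that jobs should be submitted to -->
  <property>
    <name>mapred.job.tracker</name>
    <value>namenode.example.com:9001</value>
  </property>
</configuration>
```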
4) I would like to generate reports from the data collected in the Hadoop.
How can I do that?
What do you mean by "data collected in the Hadoop?" Log files? Output
data?
5) I am thinking of replacing my database with Hadoop and querying Hadoop
for various information. Is this the right approach?
This probably isn't correct. HDFS does not have good latency, so it
shouldn't be queried in a real-time/interactive environment. Similarly,
once a file is written, it can only be appended to, which makes HDFS a bad
fit for a database. Lastly, HDFS doesn't really give you any sort
of transactional behavior that a database would, which may or may not be
what you're looking for. You may want to take a look at HBase
(<http://hadoop.apache.org/hbase/>), as it is much more like a database than
HDFS.
6) How can I access the analyzed data in Hadoop from the outside world,
e.g. from an external program?
Depending on the type of program accessing HDFS, you could just read files
from HDFS. You could use DBOutputFormat
(<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html>)
to write to a SQL database somewhere, and similarly you could output SQL
syntax (or even data that can be imported with something like mysqlimport,
<http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html>) that you then run
on a database somewhere. Again, it's important to realize that HDFS is by
no means meant to serve interactive/real-time data as a local file system
might. HDFS is meant for throughput for purposes of analyzing lots of data,
and for storing lots of data. HBase, however, is being used by several
people for real-time/interactive queries.
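As a small illustration of the "output SQL syntax" route, the sketch below turns one line of typical MapReduce text output (key, tab, count) into an INSERT statement that can be replayed against a database; the table and column names are made-up assumptions, not anything from this thread:

```java
public class SqlExport {
    // Convert one line of MapReduce text output ("action<TAB>count")
    // into a SQL INSERT statement. Single quotes in the key are doubled
    // so the generated SQL stays valid.
    static String toInsert(String outputLine) {
        String[] parts = outputLine.split("\t");
        return "INSERT INTO user_actions (action, hits) VALUES ('"
                + parts[0].replace("'", "''") + "', "
                + Long.parseLong(parts[1]) + ");";
    }

    public static void main(String[] args) {
        System.out.println(toInsert("login\t42"));
        // -> INSERT INTO user_actions (action, hits) VALUES ('login', 42);
    }
}
```

You could run a reader like this over the part-* files in the job's output directory and pipe the statements into your database, keeping HDFS for the heavy analysis and the database for serving requests.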
NOTE: I would like to use Java for any of above implementations.
Thanks in advance,
Shravan Kumar. M
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]