Answers inline. Hope this is helpful.

On Mon, Jul 6, 2009 at 5:25 AM, Shravan Mahankali wrote:

Hi Group,

Finally I have written a sample Mapred program, submitted this job to
Hadoop, and got the expected results. Thanks to all of you!

Now I don't have an idea of how to use Hadoop in real life (am sorry if am
asking wrong question at wrong time.! (So, am right ;-))) :

1) If I re-submit my job, Hadoop responds with an error message saying:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
hdfs://localhost:9000/user/root/impressions_output already exists

If the output directory specified for a job already exists, Hadoop won't run
the job. This makes a lot of sense, because it helps you avoid overwriting
your data. Subsequent job runs will have to write to different output
directories.

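One way to give every run its own output directory is to append a timestamp to a base path. This is only a minimal sketch: the class name OutputDirs, the method forRun, and the timestamp layout are made up for illustration, and only the base path comes from the error message above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class OutputDirs {
    // Illustrative timestamp layout (an assumption); it avoids ':' so the
    // result is safe to use as a path component.
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss").withZone(ZoneOffset.UTC);

    /** Returns base + "_" + the UTC start time of the run. */
    public static String forRun(String base, Instant startedAt) {
        return base + "_" + STAMP.format(startedAt);
    }

    public static void main(String[] args) {
        // Each invocation yields a directory name that has not been used before,
        // as long as two runs do not start within the same second.
        System.out.println(forRun("/user/root/impressions_output", Instant.now()));
    }
}
```

The resulting string would then be wrapped in a Path and handed to the job as its output path (for example via FileOutputFormat.setOutputPath) in place of a fixed directory name.
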
2) How to automatically execute Hadoop jobs? Let's say I have set up a cron
job which runs various Hadoop jobs at specified times. Is this the way we do
it in the Hadoop world?

Cron is a very good tool for running jobs at various times :). That said,
cron does not provide a workflow management system that can tie jobs
together. For workflow management, the community has hamake
(<http://code.google.com/p/hamake/>), Oozie
(<http://issues.apache.org/jira/browse/HADOOP-5303>), and Cascading (<

3) Can I submit jobs to Hadoop from a different machine/ network/ domain?

A different machine is easy. Install Hadoop as you did on your other
machines, and in hadoop-site.xml (assuming you're not using 0.20) configure
hadoop.job.ugi (optional, actually), fs.default.name, and
mapred.job.tracker. As for a different network / domain, take a look at
this: <

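For the different-machine case, the three properties named above might look roughly like this in the client's hadoop-site.xml. This is a sketch only: the host names, the JobTracker port, and the user/group values are placeholders and need to be replaced with the values from your own cluster.

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml on the submitting machine (pre-0.20 layout). -->
<!-- All values below are placeholders, not taken from a real cluster. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
  <!-- Optional: user name followed by group(s), comma-separated. -->
  <property>
    <name>hadoop.job.ugi</name>
    <value>root,supergroup</value>
  </property>
</configuration>
```
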
4) I would like to generate reports from the data collected in the Hadoop.
How can I do that?

What do you mean by "data collected in the Hadoop"? Log files? Output

5) Am thinking of replacing data in my database with Hadoop and query
for various information. Is this correct?

This probably isn't correct. HDFS has high latency, so it shouldn't be
queried in a real-time/interactive environment. Similarly, once a file is
written, it cannot be modified in place; at most it can be appended to,
which makes HDFS a poor substitute for a database. Lastly, HDFS doesn't give
you the sort of transactional behavior that a database would, which may or
may not be what you're looking for. You may want to take a look at HBase
(<http://hadoop.apache.org/hbase/>), as it is much more like a database than

6) How can I access analyzed data in Hadoop from external world, external

Depending on the type of program accessing HDFS, you could just read files
from HDFS. You could use DBOutputFormat (<
to write to a SQL database somewhere. Similarly, you could output SQL
statements (or even data that can be imported with something like mysqlimport
<http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html>) that you then run
against a database somewhere. Again, it's important to realize that HDFS is
by no means meant to serve interactive/real-time data as a local file system
might. HDFS is meant for throughput when analyzing lots of data, and for
storing lots of data. HBase, however, is being used by several people for
real-time/interactive queries.
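
As a rough illustration of the "data that can be imported with mysqlimport" option, a job could emit one tab-separated line per record, which is mysqlimport's default text format. The class name ImportLines, the method names, and the (key, count) record shape are assumptions made for this sketch, not anything from the job discussed above.

```java
public class ImportLines {
    /**
     * Escapes backslash, tab, and newline, the characters that would otherwise
     * be misread in the default tab-separated, newline-terminated format.
     */
    static String escape(String field) {
        return field.replace("\\", "\\\\")   // backslash first, so later escapes are not doubled
                    .replace("\t", "\\t")
                    .replace("\n", "\\n");
    }

    /** Formats one record as "key<TAB>count<NEWLINE>". */
    public static String toRow(String key, long count) {
        return escape(key) + "\t" + count + "\n";
    }

    public static void main(String[] args) {
        // Hypothetical record: a key and how often it was seen.
        System.out.print(toRow("campaign-42", 1337L));
    }
}
```

mysqlimport takes the table name from the file name, so lines like these, copied out of HDFS into a file named after a (hypothetical) impressions table, could be loaded into that table.
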

NOTE: I would like to use Java for any of the above implementations.

Thanks in advance,

Shravan Kumar. M

Catalytic Software Ltd. [SEI-CMMI Level 5 Company]


This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system
administrator - netopshelpdesk@catalytic.com

Discussion Overview
group: common-user @
posted: Jul 6, '09 at 12:26p
active: Jul 10, '09 at 5:46a


