how to use hadoop in real life?
Hi Group,



Finally I have written a sample Mapred program, submitted this job to Hadoop
and got the expected results. Thanks to all of you!



Now I don't have an idea of how to use Hadoop in real life (I'm sorry if I'm
asking the wrong question at the wrong time! (So, am I right? ;-))):



1) If I re-submit my job, Hadoop responds with an error message saying:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
hdfs://localhost:9000/user/root/impressions_output already exists

2) How do I automatically execute Hadoop jobs? Let's say I have set up a cron
job which runs various Hadoop jobs at specified times. Is this the way we do it
in the Hadoop world?

3) Can I submit jobs to Hadoop from a different machine/ network/ domain?

4) I would like to generate reports from the data collected in Hadoop. How can
I do that?

5) I am thinking of replacing the data in my database with Hadoop and querying
Hadoop for various information. Is this the correct approach?

6) How can I access analyzed data in Hadoop from the external world, e.g. an
external program?



NOTE: I would like to use Java for any of the above implementations.



Thanks in advance,

Shravan Kumar. M

Catalytic Software Ltd. [SEI-CMMI Level 5 Company]

-----------------------------

This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system
administrator -
netopshelpdesk@catalytic.com


  • Alex Loddengaard at Jul 6, 2009 at 5:22 pm
    Answers inline. Hope this is helpful.

    Alex
    On Mon, Jul 6, 2009 at 5:25 AM, Shravan Mahankali wrote:

    1) If I re-submit my job, Hadoop responds with an error message saying:
    org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
    hdfs://localhost:9000/user/root/impressions_output already exists
    If the output directory specified to a job already exists, Hadoop won't run
    the job. This makes a lot of sense, because it helps you avoid overwriting
    your data. Subsequent job runs will have to write to different output
    directories.
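
    If you really do want to re-run over the same output location, you can
    clear the old directory programmatically before submitting. A minimal,
    untested sketch against the old (0.18-era) API; the path comes from your
    error message, and you should only do this if the old results are
    disposable:

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.JobConf;

        public class CleanOutputDir {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CleanOutputDir.class);
            Path output = new Path("/user/root/impressions_output");
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(output)) {
              fs.delete(output, true); // recursive; destroys the old results
            }
            // ... configure and submit the job as usual ...
          }
        }
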
    2) How do I automatically execute Hadoop jobs? Let's say I have set up a
    cron job which runs various Hadoop jobs at specified times. Is this the way
    we do it in the Hadoop world?
    Cron is a very good tool for running jobs at various times :). That said,
    cron does not provide a workflow management system that can tie jobs
    together. For workflow management, the community has hamake (<
    http://code.google.com/p/hamake/>), Oozie (<
    http://issues.apache.org/jira/browse/HADOOP-5303>), and Cascading (<
    http://www.cascading.org/>).

    3) Can I submit jobs to Hadoop from a different machine/ network/ domain?
    A different machine is easy. Just install Hadoop as you did on your other
    machines, and in hadoop-site.xml (assuming you're not using 0.20),
    configure hadoop.job.ugi (optional, actually), fs.default.name, and
    mapred.job.tracker. As for a different network / domain, take a look at
    this: <
    http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
    >
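
    A minimal sketch of a remote submission client (untested; the hostnames
    and ports are placeholders for your cluster):

        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class RemoteSubmit {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(RemoteSubmit.class);
            // Placeholder hostnames: point these at your real cluster.
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
            conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
            // conf.set("hadoop.job.ugi", "root,root"); // optional, pre-0.20
            // ... set mapper, reducer, input and output paths here ...
            JobClient.runJob(conf);
          }
        }
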
    4) I would like to generate reports from the data collected in Hadoop.
    How can I do that?
    What do you mean by "data collected in Hadoop"? Log files? Output data?

    5) I am thinking of replacing the data in my database with Hadoop and
    querying Hadoop for various information. Is this the correct approach?
    This probably isn't correct. HDFS does not have good latency, so it
    shouldn't be queried in a real-time/interactive environment. Similarly,
    once a file is written, it can only be appended to, which makes HDFS a bad
    fit for a database. Lastly, HDFS doesn't really give you any sort of
    transactional behavior that a database would, which may or may not be what
    you're looking for. You may want to take a look at HBase (<
    http://hadoop.apache.org/hbase/>), as it is much more like a database than
    HDFS.

    6) How can I access analyzed data in Hadoop from the external world, e.g.
    an external program?
    Depending on the type of program accessing HDFS, you could just read files
    from HDFS. You could use DBOutputFormat (<
    http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html>)
    to write to a SQL database somewhere, and similarly you could output SQL
    syntax (or even data that can be imported with something like mysqlimport <
    http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html>) that you then run
    on a database somewhere. Again, it's important to realize that HDFS is by
    no means meant to serve interactive/real-time data as a local file system
    might; HDFS is designed for throughput, for analyzing and storing lots of
    data. HBase, however, is being used by several people for
    real-time/interactive queries.
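
    As a rough sketch of the "just read files" option (untested; the hostname
    is a placeholder, and part-00000 is the usual name of the first reducer's
    output file):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReadResults {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
            FileSystem fs = FileSystem.get(conf);
            Path part = new Path("/user/root/impressions_output/part-00000");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(part)));
            for (String line; (line = in.readLine()) != null; ) {
              System.out.println(line); // or insert into a reporting database
            }
            in.close();
            fs.close();
          }
        }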


  • Shravan Mahankali at Jul 8, 2009 at 6:36 am
    Thanks so much for your response, Alex. You have provided a lot of
    information that helped me understand more about Hadoop. Thanks again.



    Basically, I cannot clearly understand where Hadoop fits and how it
    supersedes existing technology.



    Use Case: We have a web app where users perform some actions. We have to
    track these actions and various parameters related to the action initiator,
    and we currently store this information in a database. Our manager has
    suggested evaluating Hadoop for this scenario. However, I am not clear on
    whether I have to create a new directory every time I run a job in Hadoop,
    and how I can track it later to read the data analyzed by Hadoop. Also,
    even though I drop the user action information into Hadoop, I still have to
    put this information into our database so that it knows the trend and
    responds to various requests accordingly.



    Thank You,

    Shravan Kumar. M

    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]

  • Ted Dunning at Jul 8, 2009 at 5:18 pm
    In general, Hadoop is simpler than you might imagine.

    Yes, you need to create directories to store data. This is much lighter
    weight than creating a table in SQL.
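
    For example, creating a new "table" is a single call (untested sketch;
    the directory layout is arbitrary):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class MakeDataDir {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // e.g. one directory per day or per job run
            fs.mkdirs(new Path("/user/root/impressions/2009-07-08"));
          }
        }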

    But the key question is volume. Hadoop makes some things easier, and Pig
    queries are generally easier to write than SQL (for programmers ... not for
    those raised on SQL), but, overall, map-reduce programs really are more
    work to write than SQL queries until you get to really large-scale
    problems.

    If your database has less than 10 million rows or so, I would recommend that
    you consider doing all analysis in SQL augmented by procedural languages.
    Only as your data goes beyond 100 million to a billion rows do the clear
    advantages of map-reduce formulation become apparent.
  • Shravan Mahankali at Jul 9, 2009 at 5:05 am
    Thanks for the information Ted.

    Regards,
    Shravan Kumar. M
    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
  • Shravan Mahankali at Jul 9, 2009 at 9:27 am
    Hi Group,

    I have data to be analyzed, and I would like to dump this data to Hadoop
    from machine.X, whereas Hadoop is running on machine.Y. After dumping this
    data I would like to initiate a job, get the data analyzed, and get the
    output information back to machine.X.

    I would like to do all this programmatically, and am going through the
    Hadoop API for this purpose. I remember Alex suggesting the other day that
    I install Hadoop on machine.X, but I was not sure why I should do that.

    If I simply write a Java program that includes the Hadoop core jar, I was
    planning to use "FsUrlStreamHandlerFactory" to connect to Hadoop on
    machine.Y, and then use "org.apache.hadoop.fs.shell" to copy data to the
    Hadoop machine, initiate the job, and get the results.

    Please advise.
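
    Something like the following is what I had in mind (an untested sketch;
    the hostname/port and the output path are placeholders):

        import java.io.InputStream;
        import java.net.URL;
        import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
        import org.apache.hadoop.io.IOUtils;

        public class ReadFromMachineY {
          static {
            // May only be called once per JVM
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
          }
          public static void main(String[] args) throws Exception {
            InputStream in = null;
            try {
              in = new URL("hdfs://machine.Y:9000/user/root/"
                  + "impressions_output/part-00000").openStream();
              IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
              IOUtils.closeStream(in);
            }
          }
        }
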

    Thank You,
    Shravan Kumar. M
    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
  • Shravan Mahankali at Jul 9, 2009 at 9:29 am
    Sorry!!! I forgot to mention that I am using Hadoop 0.18.3 (the stable
    version of Hadoop).

    Thank You,
    Shravan Kumar. M
    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
  • Alex Loddengaard at Jul 9, 2009 at 5:49 pm
    Writing a Java program that uses the API is basically equivalent to
    installing a Hadoop client and writing a Python script to manipulate HDFS
    and fire off a MR job. It's up to you to decide how much you like Java :).

    Alex
  • Shravan Mahankali at Jul 10, 2009 at 5:12 am
    Hi Alex/ Group,



    Thanks for your response. Is there something called a "Hadoop client"?
    Google does not suggest one!



    Should this Hadoop client be installed and configured as we did with
    Hadoop on a server? That is, will this Hadoop client occupy memory/disk
    space for running data/name nodes and slaves?



    Thank You,

    Shravan Kumar. M

    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]

  • Harish Mallipeddi at Jul 10, 2009 at 5:32 am
    Hi Shravan,
    By Hadoop client, I think he means the "hadoop" command-line program
    available under $HADOOP_HOME/bin. You can either write a custom Java program
    which directly uses the Hadoop APIs or just write a bash/python script which
    will invoke this command-line app and delegate work to it.
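
    If you stay in Java, you can also drive the same command-line
    functionality programmatically, since FsShell implements the Tool
    interface. A rough, untested sketch (the paths are placeholders):

        import org.apache.hadoop.fs.FsShell;
        import org.apache.hadoop.util.ToolRunner;

        public class PutViaShell {
          public static void main(String[] args) throws Exception {
            // Equivalent of: bin/hadoop fs -put actions.log /user/root/input/
            int res = ToolRunner.run(new FsShell(), new String[] {
                "-put", "actions.log", "/user/root/input/actions.log" });
            System.exit(res);
          }
        }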

    - Harish
    --
    Harish Mallipeddi
    http://blog.poundbang.in
  • Shravan Mahankali at Jul 10, 2009 at 5:46 am
    Okay. Thanks Harish for your response.

    Thank You,
    Shravan Kumar. M
    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
  • Andy at Jul 7, 2009 at 12:21 pm
    Hi, before you resubmit your program, please make sure your output path is
    "valid". The valid condition depends on your output format. In the case of
    FileOutputFormat, this means your output dir should not exist, so try to
    delete it first.

    You have various methods to initiate your Hadoop program. Try to refer to
    some "classic" Hadoop programs, such as Nutch, a search engine built
    entirely on Hadoop, and see how many ways you can deploy and run a Hadoop
    program.

    To my knowledge, you can submit your Hadoop program from anywhere, as long
    as you can access the "master" machine through the network.

    What kind of report?

    It is not a good idea to replace your database with Hadoop storage. The
    design of Hadoop aims at high I/O throughput when dealing with data at a
    large scale, so it is not appropriate to treat Hadoop as a distributed
    database. BTW, the HBase project may be useful for you.

    This one is simple: with the HDFS APIs you can do any file operation on
    HDFS remotely.
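
    For example, a remote copy into HDFS is a couple of lines (untested
    sketch; the hostname and paths are placeholders):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class RemoteCopy {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://master:9000"); // "master" node
            FileSystem fs = FileSystem.get(conf);
            // any file operation works remotely; here, a local-to-HDFS copy
            fs.copyFromLocalFile(new Path("/tmp/actions.log"),
                                 new Path("/user/root/input/actions.log"));
            fs.close();
          }
        }
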
    Java is a powerful language for most Hadoop programmers. Again, try to get
    familiar with some Hadoop projects written in Java, and you will find it
    convenient to do your things.

    Best wishes
    Song
  • Shravan Mahankali at Jul 8, 2009 at 6:37 am
    Thanks for your response Andy!



    Regards,

    Shravan Kumar. M

    Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
