FAQ
Hi,

I have been studying MapReduce and Hadoop for the past few weeks and
found them to be very new concepts. While I have a grasp of the
MapReduce process and can follow some of the example code, I still
feel at a loss when it comes to creating my own exercise "project",
and would appreciate any input and help on that.

The project I have in mind is to fetch several hundred HTML files
from a website and use Hadoop to index the words of each page so they
can be searched later. However, in all the examples I have seen so
far, the data are already loaded into HDFS and split before the job
is executed.

Here is the set of questions I have:

1. Are CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need
for this project?

2. If so, is there any detailed documentation or example code for these classes?

3. If not, could you please let me know conceptually how you would go
about doing this?

4. If the data must be split beforehand, must I manually retrieve all
the webpages and load them into HDFS, or can I list the URLs of the
webpages in a text file and split that file instead? (See the sketch
after this message.)

As you can see, I am very confused at this point and would greatly
appreciate all the help I could get. Thanks!

-- Jim
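
Regarding question 4, one common pattern is to keep only the list of URLs in
HDFS and let the mappers do the fetching. Below is a minimal, hypothetical
sketch of such a fetch step, written against the old org.apache.hadoop.mapred
API that was current in 2007. The class name FetchPageMapper and the
input/output layout are assumptions for illustration, not something that
ships with Hadoop.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper for a fetch step: the job's input is a plain text
// file in HDFS with one URL per line.  TextInputFormat hands each line to
// map(), which downloads the page and emits (url, raw HTML).  The output
// can then feed a separate indexing job.
public class FetchPageMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String url = value.toString().trim();      // one URL per input line
    if (url.length() == 0) {
      return;                                  // skip blank lines
    }

    // Fetch the page over HTTP; an error here fails the task, so a real
    // job would catch exceptions and count them via the reporter instead.
    StringBuilder page = new StringBuilder();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        page.append(line).append('\n');
      }
    } finally {
      in.close();
    }

    // key = URL, value = page content
    output.collect(new Text(url), new Text(page.toString()));
  }
}

A driver would typically run this as a map-only job (zero reduces) and write
the pairs out, for example as a SequenceFile, for the indexing job to read;
that wiring is omitted here.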


  • Ted Dunning at Oct 21, 2007 at 1:00 am
    Look for the slide show on Nutch and Hadoop.

    http://wiki.apache.org/lucene-hadoop/HadoopPresentations

    Open the one called "Scalable Computing with Hadoop (Doug Cutting, May
    2006)".

  • Jim the Standing Bear at Oct 21, 2007 at 8:11 pm
    Thanks, Ted.

    While the slides indeed give me valuable insight into the project I
    have in mind, I would still like to see some detailed examples and
    documentation for the different mappers and reducers that come with
    Hadoop. Do you happen to know where I can find such material?
    Thanks.

    -- Jim


    --
    --------------------------------------
    Standing Bear Has Spoken
    --------------------------------------
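
Since the thread ends without pointing at concrete mapper/reducer examples,
here is a minimal, hypothetical sketch of what a word-indexing pair could
look like, again written against the old org.apache.hadoop.mapred API of the
era. The class names and the assumption that the input is (url, page text)
pairs, such as a SequenceFile written by a prior fetch step, are
illustrative, not part of Hadoop itself.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical inverted-index pass: the mapper emits (word, url) for every
// word on a page; the reducer joins all URLs for a word into one posting line.
public class InvertedIndex {

  public static class IndexMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    public void map(Text url, Text page,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Naive whitespace tokenization; a real indexer would strip HTML
      // tags and punctuation first.
      StringTokenizer words = new StringTokenizer(page.toString());
      while (words.hasMoreTokens()) {
        String word = words.nextToken().toLowerCase();
        output.collect(new Text(word), url);
      }
    }
  }

  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text word, Iterator<Text> urls,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Concatenate every URL that contains this word into one line.
      StringBuilder postings = new StringBuilder();
      while (urls.hasNext()) {
        if (postings.length() > 0) {
          postings.append(' ');
        }
        postings.append(urls.next().toString());
      }
      output.collect(word, new Text(postings.toString()));
    }
  }
}

The WordCount example that ships with Hadoop has the same overall shape and
is probably the closest built-in reference for this kind of job.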
