I'm just starting out with Hadoop. I've looked through the Java examples
and have an idea of what's going on, but I don't really get it.

I'd like to write a program that takes a directory of files. Each file
contains the URL of a website on the first line and the TEXT from that
website on the second line.

The mapper should map each word in the text to that URL, so every word
found on the website would map to the URL.

The reducer would then collect all of the URLs that are mapped to a given
word.

Each Word->URL is then written to a file.

So, it's "simple" as a program designed to run on a single system, but I
want to be able to distribute the computation and whatnot using Hadoop.

I'm extremely new to Hadoop, and I'm not even sure how to ask all of the
questions I'd like answered. I have zero experience with MapReduce and
limited experience with functional programming. Any programming tips, or
corrections if I have my "Mapper" or "Reducer" defined incorrectly, would be
greatly appreciated.

Questions:
How do I read (and write) files from HDFS?
Once I've read them, how do I distribute the files to be mapped?
I know I need one class to implement the mapper and one to implement the
reducer, but how does the class have a return type to output the map?

Thanks a lot for your help.
--
View this message in context: http://old.nabble.com/Reverse-Indexing-Programming-Help-tp31292449p31292449.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


  • Ted Dunning at Apr 1, 2011 at 4:55 am
    It would help to get a good book. There are several.

    For your program, there are several things that will trip you up:

    a) Lots of little files are going to be slow. You want input files larger
    than 100 MB each if you want speed.

    b) That file format is a bit cheesy since it is hard to tell URLs from text
    if you concatenate lots of files. Better to use a format like protobufs or
    Avro or even sequence files to separate the key and the data unambiguously.

    c) I suspect that what you are asking for is to run a mapper so that each
    invocation of map gets the URL as key and the text as data. That map
    invocation can then tokenize the data and emit records with the URL as key
    and each word as data. That isn't much use since the reducer will get the
    URL and all the words that were emitted for that URL. If each URL appears
    exactly once, then the input already had that. Perhaps you mean to emit the
    word as key and URL as data. Then the reducer will see the word as key and
    an iterator over all the URLs that mentioned the word.
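
The word-as-key approach in (c) can be sketched in plain Java with no Hadoop dependency. This is only an illustration under stated assumptions: the class name `InvertedIndexSketch`, the methods `map`, `reduce` and `build`, and the example URLs are all hypothetical, and `build` stands in for the grouping by key that the framework would do between the two phases. In a real Hadoop job the map and reduce methods do not return their output; they emit key/value pairs through a context or collector object that the framework passes in.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, Hadoop-free sketch of an inverted index built in map/reduce style.
final class InvertedIndexSketch {

    // "Map" step: one (url, text) record in, one (word, url) pair out per token.
    static List<Map.Entry<String, String>> map(String url, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, url)); // word is the key, URL is the value
            }
        }
        return out;
    }

    // "Reduce" step: one word plus every URL emitted for it; duplicates collapse.
    static SortedSet<String> reduce(String word, Iterable<String> urls) {
        SortedSet<String> unique = new TreeSet<>();
        for (String url : urls) {
            unique.add(url);
        }
        return unique;
    }

    // Stand-in for the framework: map every record, group pairs by key, then reduce.
    static SortedMap<String, SortedSet<String>> build(Map<String, String> pages) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (Map.Entry<String, String> pair : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            index.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new TreeMap<>();
        pages.put("http://a.example", "Hadoop runs MapReduce jobs");
        pages.put("http://b.example", "MapReduce jobs scale out");
        // One line per word: the word, a tab, then the set of URLs containing it.
        build(pages).forEach((word, urls) -> System.out.println(word + "\t" + urls));
    }
}
```

Emitting the URL as key instead would leave each reducer holding one URL and its own words, which is what the input already contained; swapping key and value is what turns the job into an index.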
    (Reply to DoomUs's message of Thu, Mar 31, 2011 at 9:48 PM, shown in full above.)
