FAQ
I have a corpus of 300,000 raw HTML files that I want to read in and
parse using Hadoop. What is the best input file format to use in this
case? I want to have access to each page's raw HTML in the mapper, so
I can parse from there.

I was thinking of preprocessing all the files, removing the new
lines, and putting them in a big <key, value> file:

url1, html with stripped new lines
url2, ....
url3, ....
...
urlN, ....

I'd rather not do all this preprocessing, just to wrangle the text
into Hadoop. Any other suggestions? What if I just stored the path to
each HTML file as a <key, value> pair instead:

url1, path_to_file1
url2, path_to_file2
...
urlN, path_to_fileN

Then in the mapper, I could read each file in from the DFS on the
fly. Anyone have any other good ideas? I feel like there's some key
function that I'm just stupidly overlooking...

Thanks!
David Balatero
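
A minimal sketch of the "path per line" idea, assuming the classic
org.apache.hadoop.mapred API and a tab-separated "url <TAB> path" listing
(both the listing format and the class name are illustrative, not from the
post): each map call gets one line of the listing, opens that path on the
DFS, and emits <url, raw html>.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Reads "url <TAB> dfs_path" lines and emits <url, raw html>.
    public class FetchHtmlMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private FileSystem fs;

      public void configure(JobConf job) {
        try {
          fs = FileSystem.get(job);                // handle to the DFS
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] parts = line.toString().split("\t", 2);  // url <TAB> path
        Path page = new Path(parts[1]);
        byte[] buf = new byte[(int) fs.getFileStatus(page).getLen()];
        FSDataInputStream in = fs.open(page);
        in.readFully(buf);                         // slurp the whole page
        in.close();
        // Text assumes the page is UTF-8; BytesWritable would be safer
        // for arbitrary encodings.
        out.collect(new Text(parts[0]), new Text(buf));
      }
    }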


  • Ted Dunning at Oct 25, 2007 at 12:29 am
    File open time is an issue if you have lots and lots of little files.

    If you are doing this analysis once or a few times, then it isn't worth
    reformatting into a few larger files.

    If you are likely to do this analysis dozens of times, then opening larger
    files will probably give you a significant benefit in terms of runtime.

    If the runtime isn't terribly important, then the filename per line approach
    will work fine.

    Note that the filename per line approach is a great way to do the
    pre-processing into a few large files which will then be analyzed faster.
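
    For the case where the corpus will be reprocessed many times, one way to
    do this pre-processing pass is a map-only job that runs a fetching mapper
    (like the illustrative FetchHtmlMapper sketched after the original
    question) over the path listing and writes a few large SequenceFiles of
    <url, html>. This is an untested sketch against the classic mapred API:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.SequenceFileOutputFormat;
        import org.apache.hadoop.mapred.TextInputFormat;

        public class PackPagesJob {
          public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(PackPagesJob.class);
            job.setJobName("pack-html-pages");

            // Input: text file(s) of "url <TAB> path" lines.
            job.setInputFormat(TextInputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));

            // Map-only: just fetch and repack, no reduce needed.
            job.setMapperClass(FetchHtmlMapper.class);
            job.setNumReduceTasks(0);

            // Output: a few large SequenceFiles of <url, html>.
            job.setOutputFormat(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            JobClient.runJob(job);
          }
        }

    Later analysis jobs can then read the packed output with
    SequenceFileInputFormat instead of opening 300,000 small files.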
  • David Balatero at Oct 25, 2007 at 12:43 am
    I like your style regarding pre-processing into a few large files
    with Hadoop. I think I may go that route, unless anyone else has any
    brilliant ideas.

    - David
  • Enis Soztutar at Oct 25, 2007 at 6:41 am
    I can think of two ways to do this:
    1. Use MultiFileInputFormat for this job, and split the input so that
    each mapper gets many files to read.
    2. First pack the files into one large SequenceFile of <url, html> pairs,
    then use SequenceFileInputFormat (see the sketch below).
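
    A rough sketch of option 2, assuming a small local driver that reads a
    "url <TAB> local_html_path" listing (an illustrative format) and appends
    each page to a single SequenceFile on the DFS:

        import java.io.BufferedReader;
        import java.io.DataInputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.FileReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class PackSequenceFile {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // args[0]: local listing of "url <TAB> local_html_path" lines
            // args[1]: DFS path of the SequenceFile to create
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), Text.class, Text.class);

            BufferedReader listing = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = listing.readLine()) != null) {
              String[] parts = line.split("\t", 2);
              File page = new File(parts[1]);
              byte[] html = new byte[(int) page.length()];
              DataInputStream in = new DataInputStream(new FileInputStream(page));
              in.readFully(html);                  // read the local page bytes
              in.close();
              writer.append(new Text(parts[0]), new Text(html));  // <url, raw html>
            }
            listing.close();
            writer.close();
          }
        }

    The analysis job then sets SequenceFileInputFormat as its input format,
    and each map call receives a <url, html> pair directly, with no per-page
    file-open cost.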

