I have a corpus of 300,000 raw HTML files that I want to read in and
parse using Hadoop. What is the best input file format to use in this
case? I want access to each page's raw HTML in the mapper so that I
can do the parsing there.
I was thinking of preprocessing all the files, stripping out the
newlines, and putting them into one big <key, value> file, one record
per line:

    url1, html with newlines stripped
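
Concretely, here's a rough sketch of the preprocessing I'm picturing,
using a SequenceFile of (url, html) records as the big <key, value>
file. The input directory, output path, and the URL-from-filename
mapping are all placeholders I made up:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class HtmlToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("html-corpus.seq"); // placeholder output

            // One (url, html) record per document.
            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
            try (DirectoryStream<java.nio.file.Path> dir =
                         Files.newDirectoryStream(Paths.get("html-corpus"))) {
                for (java.nio.file.Path p : dir) {
                    String html = new String(Files.readAllBytes(p), "UTF-8");
                    // stand-in: derive the real URL from the filename somehow
                    String url = p.getFileName().toString();
                    writer.append(new Text(url), new Text(html));
                }
            } finally {
                writer.close();
            }
        }
    }

A nice side effect would be that SequenceFile records are
length-delimited rather than line-delimited, so the newlines wouldn't
actually need stripping.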
I'd rather not do all this preprocessing just to wrangle the text
into Hadoop. Any other suggestions? What if I instead stored just the
path to each HTML file as the value in a <key, value> pair?
Then, in the mapper, I could read each file in from the DFS on the
fly (rough sketch of what I mean below). Anyone have any other good
ideas? I feel like there's some key function that I'm just stupidly
overlooking...
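
For what it's worth, this is roughly the mapper I'm picturing for the
read-on-the-fly variant, assuming the job's input is a plain text
file with one HDFS path per line (the class name and the choice of
output key are just mine):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HtmlPathMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each input line is an HDFS path to one raw HTML file.
            Path htmlPath = new Path(line.toString().trim());
            FileSystem fs = htmlPath.getFileSystem(context.getConfiguration());

            // Pull the whole file into memory; individual pages are small.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            FSDataInputStream in = fs.open(htmlPath);
            try {
                IOUtils.copyBytes(in, buf, 4096, false);
            } finally {
                in.close();
            }
            String html = buf.toString("UTF-8");

            // Parsing would happen here; for now, emit (path, html) untouched.
            context.write(new Text(htmlPath.toString()), new Text(html));
        }
    }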