Hi,

I am using Hadoop 0.20.1 and I have a problem which is similar to the
one below:

Setup
-----
I have a web server log that looks like:
serial number, domain, datetime, httpstatus
The web server outputs to a single log file for 1,000 domains. I would
like an output in the following format:
previous serial number, domain, previous datetime, previous httpstatus, current serial number, current datetime, current httpstatus

For example, an input of:
1728 ...
1729, hadoop.apache.org, 2009/10/18 08:23:15, 200
1730, ...
...
1735, ...
1736, hadoop.apache.org, 2009/10/18 08:23:19, 404
1737, ...
...
1741, ...
1742, hadoop.apache.org, 2009/10/18 08:23:24, 500
1743, ...
...
1750, ...
1751, hadoop.apache.org, 2009/10/18 08:23:30, 200
1752 ...
would output:
1729, hadoop.apache.org, 2009/10/18 08:23:15, 200, 1736, 2009/10/18 08:23:19, 404
1736, hadoop.apache.org, 2009/10/18 08:23:19, 404, 1742, 2009/10/18 08:23:24, 500
1742, hadoop.apache.org, 2009/10/18 08:23:24, 500, 1751, 2009/10/18 08:23:30, 200
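To make the target output concrete, here is a small plain-Java sketch of the transformation I want (no Hadoop; the class and method names here are made up for illustration): for each domain, pair every record with the next record for the same domain.

```java
import java.util.*;

// Plain-Java reference for the desired transformation (no Hadoop).
// Input lines: "serial, domain, datetime, httpstatus".
// Output pairs each record with the next record of the same domain.
public class TransitionPairs {
    public static List<String> pair(List<String> lines) {
        // Remember the most recent record seen for each domain.
        Map<String, String[]> prev = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",\\s*");
            String domain = f[1];
            String[] p = prev.get(domain);
            if (p != null) {
                // previous serial, domain, previous datetime, previous status,
                // current serial, current datetime, current status
                out.add(String.join(", ", p[0], domain, p[2], p[3], f[0], f[2], f[3]));
            }
            prev.put(domain, f);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
            "1729, hadoop.apache.org, 2009/10/18 08:23:15, 200",
            "1736, hadoop.apache.org, 2009/10/18 08:23:19, 404",
            "1742, hadoop.apache.org, 2009/10/18 08:23:24, 500");
        pair(in).forEach(System.out::println);
    }
}
```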


My thoughts
-----------
I have separated the problem into 2 jobs:

1) Do a secondary sort on the log file to produce a file sorted primarily
by <domain> and secondarily by <serial number>
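As a sanity check on the order job 1 should produce, the comparison logic can be sketched outside Hadoop as a plain Comparator (in the job itself this would live in a composite key's compareTo() plus the grouping/partitioning classes; the class name below is made up):

```java
import java.util.*;

// Plain-Java sketch of the (domain, serial) ordering that job 1's
// secondary sort should produce: primary key domain, secondary key serial.
public class DomainSerialOrder {
    public static final Comparator<String> ORDER = (a, b) -> {
        String[] fa = a.split(",\\s*"), fb = b.split(",\\s*");
        int byDomain = fa[1].compareTo(fb[1]);          // primary: domain
        if (byDomain != 0) return byDomain;
        return Long.compare(Long.parseLong(fa[0]),      // secondary: serial
                            Long.parseLong(fb[0]));
    };

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList(
            "1736, hadoop.apache.org, 2009/10/18 08:23:19, 404",
            "1730, example.org, 2009/10/18 08:23:16, 200",
            "1729, hadoop.apache.org, 2009/10/18 08:23:15, 200"));
        lines.sort(ORDER);
        lines.forEach(System.out::println);
    }
}
```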

2a) Implement a custom TwoLineRecordReader<LongWritable, Text> that
takes the output of job 1 as its input. The custom RecordReader:
   i) During initialize(InputSplit, TaskAttemptContext), reads the
first line.
   ii) During nextKeyValue(), reads the next line and sets <value> to
firstLine + "|" + secondLine, then sets firstLine to secondLine.

2b) The mapper and reducer generate the output.
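The sliding-window behaviour I intend for TwoLineRecordReader in step 2a can be simulated outside Hadoop with a BufferedReader; the real reader would do the same work in initialize() and nextKeyValue(). This is only a sketch of the intended logic (class and method names made up), not my actual reader:

```java
import java.io.*;
import java.util.*;

// Plain-Java simulation of the intended TwoLineRecordReader behaviour:
// hold the first line, then on each step emit firstLine + "|" + nextLine
// and slide the window forward by one line.
public class TwoLineWindow {
    public static List<String> slide(BufferedReader in) throws IOException {
        List<String> values = new ArrayList<>();
        String firstLine = in.readLine();   // what initialize() would read
        String secondLine;
        while (firstLine != null && (secondLine = in.readLine()) != null) {
            values.add(firstLine + "|" + secondLine); // nextKeyValue()'s <value>
            firstLine = secondLine;         // slide the window
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("a\nb\nc"));
        slide(in).forEach(System.out::println); // a|b then b|c
    }
}
```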

I have been successful at job 1.

Problems
--------
It seems as though the job is not using TwoLineRecordReader, even though
I've specified it through a custom InputFormat: a println on <value> in
the Mapper shows each input line unchanged, rather than two lines joined
by "|". The relevant driver lines are:

TwoLineInputFormat.addInputPath(job, new Path("output/sorted/part-r-00000"));
TextOutputFormat.setOutputPath(job, new Path("output/transitions"));

Call to action
--------------
1) Perhaps I'm not thinking of the problem the right way. Would you
suggest another way to solve it?
2) Am I implementing the custom RecordReader in the right way?


Thank you!

Regards,
Shahfik Amasha
Undergraduate
School of Information Systems
Singapore Management University

Posted to common-dev @ hadoop, Oct 18, 2009.