Hi!

Let's assume an example use case where Apache's mod_usertrack
generates randomly selected user IDs that are stored in a cookie and
written to the log file.

I want to keep track of the number of daily, weekly, monthly,
quarterly, and yearly users, as well as the total number of users
since the application was launched.

I can do this rather fast on the Linux command line by creating a file
for each day listing that day's UIDs, one per line, then running sort -u
to get the list of unique users. To get a weekly count, I can create
a list of weekly users, then merge in each day's users with "sort -u
-m", which works well if both input files are already sorted.

Now, this of course only works up to a rather small amount of data
before the runtime becomes a problem, so I want to do this using
Hadoop and MapReduce instead. I'm sure this problem has been solved
before, and I'm looking for hints from experienced people.

Is there perhaps already freely available Java code that can do the
sorting/merging for me?
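
(To make the question concrete, the naive job I have in mind is roughly
the untested sketch below, written against the org.apache.hadoop.mapreduce
API. All class names are my own; the idea is simply to let the shuffle do
the sort/uniq step by emitting each UID as a key:)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UniqueUsers {

      // Input files list one UID per line; emit each UID as a key.
      // The value carries no information.
      public static class UidMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(line, NullWritable.get());
        }
      }

      // All occurrences of a UID meet in one reduce call; write it once.
      public static class UidReducer
          extends Reducer<Text, NullWritable, Text, NullWritable> {
        protected void reduce(Text uid, Iterable<NullWritable> ignored,
            Context ctx) throws IOException, InterruptedException {
          ctx.write(uid, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "unique-users");
        job.setJarByClass(UniqueUsers.class);
        job.setMapperClass(UidMapper.class);
        job.setCombinerClass(UidReducer.class); // dedup map-side, too
        job.setReducerClass(UidReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }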

Should I use some trick to take advantage of the fact that the
weekly/monthly/etc files are already sorted?
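
(The trick I mean is the linear one-pass merge that sort -u -m does
internally: when both inputs are pre-sorted, no re-sorting is needed at
all. As a toy illustration, my current non-Hadoop step is essentially
this, in Java terms; the code is my own sketch:)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Toy equivalent of "sort -u -m file1 file2": one linear pass over
    // two already-sorted UID lists, emitting each UID exactly once.
    public class MergeUnique {
      public static void main(String[] args) throws IOException {
        BufferedReader a = new BufferedReader(new FileReader(args[0]));
        BufferedReader b = new BufferedReader(new FileReader(args[1]));
        String left = a.readLine(), right = b.readLine(), last = null;
        while (left != null || right != null) {
          String next;
          // Take the smaller head; fall back to whichever list remains.
          if (right == null || (left != null && left.compareTo(right) <= 0)) {
            next = left;
            left = a.readLine();
          } else {
            next = right;
            right = b.readLine();
          }
          if (!next.equals(last)) { // the -u part: skip duplicates
            System.out.println(next);
            last = next;
          }
        }
        a.close();
        b.close();
      }
    }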

Should I store the weekly/monthly/etc files in some Hadoop-ish format
for better performance instead of keeping TextOutputFormat?
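
(Here I'm guessing at something like the sketch below, swapping the
default text output for a block-compressed SequenceFile; the class and
helper names are mine:)

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class OutputSetup {
      // Untested guess: have a job write a block-compressed SequenceFile
      // instead of plain text, so later jobs can re-read it more cheaply.
      static void useSequenceFileOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(
            job, SequenceFile.CompressionType.BLOCK);
      }
    }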

Thanks,
\EF
--
Erik Forsberg <forsberg@opera.com>
Developer, Opera Software, http://www.opera.com/


  • Sean Owen, Sep 1, 2009 at 9:44 am
    (This isn't what you're asking for, but if the input is already
    sorted, the 'uniq' command would likely be faster than sort. I am not
    sure how much of a difference it makes.)

    I've recently joined the list, so I'll indulge myself and put forth
    a longer opinion about this case -- hope I am not spamming.

    How much data are you talking about? My hunch is that bringing
    MapReduce into the picture only makes sense when your input is so
    massive as to be difficult to get onto one machine -- a couple
    hundred gigabytes or so. I say this because the *computation* here is
    trivial -- you're not distributing in order to use lots more CPUs. The
    constraint here is surely I/O -- think about how much time it takes to
    ship the data out to remote workers and all that. If you can just run
    this on a machine that already has the logs, rather than move them
    anywhere, that's huge.

    MapReduce begins to make sense in conjunction with HDFS at bigger
    scales. If you can deal with pushing the data into HDFS, then HDFS can
    help manage getting data to workers efficiently and all that; it
    starts to add a lot of value. But you're still talking about pushing
    data into HDFS, which is a lot of overhead. You may have to go there
    if your data is terabytes or more. But it's not going to be faster --
    simply necessary, I believe.


    Anyway that is my hunch based on dealing with a lot of stuff like this
    at Google. I'd be most interested to hear other perspectives from this
    list.


    Sean


