Hello list,

sorry, I sent this email to the wrong list, I think (the MapReduce list
had *no* activity for the whole last day?).

As a newbie, I have a tricky use case in mind that I want to implement
with Hadoop to train my skills. There is no real scenario behind it,
so I can extend or shrink the problem as much as I like. My problems
come from a conceptual point of view and from my lack of experience
with Hadoop itself.

Here is what I want to do:

I generate random records of person IDs and places, each with a time value.

The result of my map-reduce operations should be something like this:
the key is a place, and the value is a list of places that persons
visited after visiting the key place.
Additionally, the value list should be sorted by some time/count-biased
metric, so that it reflects which place was, for example, the most
popular second station on a tour.

I think this is a complex, almost real-world scenario.

In pseudo-code it would be something like this:

for every place p
    for every person m that visited p
        select the list l of all places that m visited after p
        write a key-value pair p => l to disk, with l in visit order

for every key k in the list of key-value pairs
    get the value list of places v for k
    for every place p in v, create another key-value pair pv
        where the key is the place and the value is its index in v

for every key k
    get all pv
    aggregate the key-value pairs by key, summing up the index i
        for every place p, so that it becomes the key-value pair opv
    sort opv in ascending order by its value
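To check the logic, the three phases above can be sketched locally in plain Java (no Hadoop involved); the class and method names (TourAggregator, successorRanking) are made up for illustration, and the "metric" is exactly the summed position index from the pseudo-code:

```java
import java.util.*;
import java.util.stream.*;

public class TourAggregator {
    // One input record: person visited place at the given time.
    record Visit(String person, long time, String place) {}

    // For each place p, return the places visited after p, sorted
    // ascending by the summed position-index metric from the pseudo-code.
    static Map<String, List<String>> successorRanking(List<Visit> visits) {
        // Phase 1: per-person visit sequence ordered by time
        // (in Hadoop this is the secondary sort on (person, time)).
        Map<String, List<Visit>> byPerson = visits.stream()
                .collect(Collectors.groupingBy(Visit::person));
        byPerson.values().forEach(seq ->
                seq.sort(Comparator.comparingLong(Visit::time)));

        // Phases 2 and 3: emit (p, successor q, index) and sum the
        // indices per (p, q) pair.
        Map<String, Map<String, Integer>> sums = new HashMap<>();
        for (List<Visit> seq : byPerson.values()) {
            for (int i = 0; i < seq.size(); i++) {
                String p = seq.get(i).place();
                for (int j = i + 1; j < seq.size(); j++) {
                    sums.computeIfAbsent(p, k -> new HashMap<>())
                        .merge(seq.get(j).place(), j - i, Integer::sum);
                }
            }
        }

        // Sort every successor list ascending by its summed index.
        Map<String, List<String>> result = new HashMap<>();
        sums.forEach((p, m) -> result.put(p, m.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList())));
        return result;
    }
}
```

This is only a single-machine mock-up of the data flow, of course; the point of the exercise is to split exactly these phases across MR jobs.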

The result would be what I wanted, no?

It looks like I need multiple MR phases, but I do not even know how
to start.

My first guess is: create an MR job that inverts my list, so that I get
a place as the key and, as the value, all persons that visited it.
The next phase needs to iterate over the persons in the value and join
with the original data to find out when each person visited the place
and which places came next.
And now the problems arise:
- First: what happens with places so popular that the number of persons
who visited them is too large to pass the whole KV pair to a single
node for iteration?
- Second: I need to re-join the original data. Without a database this
would be extremely slow, wouldn't it?

I hope that you guys can give me some ideas and input so I can take my
first serious steps in Hadoop-land.

Regards,
Em


  • Prashant at Jul 25, 2011 at 6:10 am

    On 07/19/2011 08:49 PM, Em wrote:
    As a newbie, I have a tricky use case in mind that I want to implement
    with Hadoop to train my skills. There is no real scenario behind it,
    so I can extend or shrink the problem as much as I like. My problems
    come from a conceptual point of view and from my lack of experience
    with Hadoop itself.
    First, you need your own implementations of the key and value
    classes: the key class implements WritableComparable and overrides
    the comparison, because:

    "the value should be sorted in a way where I use some
    time/count-biased metric."

    Your key should be a composite key comprising the person ID and the
    timestamp, so that in the first MR phase you can do a secondary sort
    on person ID and timestamp.

    So when dumping to disk in the reduce phase you have to take special
    care, as you have to write it in this format for the next phase:
    PersonId<space/tab><values> (the places visited).

    Sample data up to this point would look like this.

    E.g. input (personId, time, place):
    pe1 t1 P1
    pe1 t2 P2
    pe2 t3 P1
    pe1 t4 P4
    pe2 t5 P3
    and suppose the times are ordered t2 < t1 < t3 < t5 < t4.

    map(<composite key of personId and time>, <place as value>)
    reduce(<personId>, <value as tab-separated time and place>)

    So after the reduce, your output would be:
    pe1 t2 P2
    pe1 t1 P1
    pe1 t4 P4
    pe2 t3 P1
    pe2 t5 P3


    As you see, this is nothing but a secondary sort on person ID and
    time. Also note that no aggregation of data is done up to this point.

    You can also use a grouping comparator; I recommend you go through
    the SecondarySort example in the Hadoop repository, where grouping is
    done on part of the key (here, the person ID).
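    The composite key and the two orderings involved can be sketched in plain Java (Hadoop-free; in a real job the key would implement org.apache.hadoop.io.WritableComparable with write/readFields, and the grouping comparator would be registered via Job.setGroupingComparatorClass). The names PersonTimeKey and GROUPING are made up for illustration:

```java
import java.util.*;

public class SecondarySortDemo {
    // Composite key: in a real Hadoop job this would implement
    // org.apache.hadoop.io.WritableComparable; plain Comparable is
    // enough here to show the sort order.
    record PersonTimeKey(String personId, long time)
            implements Comparable<PersonTimeKey> {
        // Sort comparator: by personId first, then by time, so each
        // person's visits reach the reducer already in time order.
        public int compareTo(PersonTimeKey o) {
            int c = personId.compareTo(o.personId);
            return c != 0 ? c : Long.compare(time, o.time);
        }
    }

    // Grouping comparator: compares personId only, so all records of
    // one person end up in a single reduce() call despite having
    // different timestamps.
    static final Comparator<PersonTimeKey> GROUPING =
            Comparator.comparing(PersonTimeKey::personId);
}
```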

    Now you don't have to worry about the popular places either, as the
    MapReduce framework takes care of that; obviously, if the data is
    large, it will take more time. The join is also done by the
    framework. If you use more than one reducer, you get more than one
    output file, and these can easily be concatenated, as they are
    sorted and in the same format.

    I have tried to understand your problem as well as I could; it maps
    to a secondary sort, and this is the simplest solution I can propose.

    --Prashant Sharma
    http://getfunctional.wordpress.com/

Discussion Overview
Group: common-user @ hadoop.apache.org (IRC: #hadoop)
Posted: Jul 19, 2011 at 3:20p; active: Jul 25, 2011 at 6:10a
Posts: 2; users: 2 (Em: 1 post, Prashant: 1 post)
