Grokbase Groups Pig user October 2010
FAQ
Hey guys -

I have a script that loads a list of ~800,000 category hierarchies,
filters them a bit and streams them through a PHP script for some
quick procedural work. The file contains one column and a snippet
looks like this:

Arts
Arts/Animation
Arts/Animation/Anime
Arts/Animation/Anime/Characters
Arts/Animation/Anime/Clubs_and_Organizations
Arts/Animation/Anime/Collectibles
Arts/Animation/Anime/Collectibles/Cels
Arts/Animation/Anime/Collectibles/Models_and_Figures
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
Arts/Animation/Anime/Collectibles/Shitajiki
Arts/Animation/Anime/Creators
Arts/Animation/Anime/Creators/Anno,_Hideaki
Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
Arts/Animation/Anime/Creators/Miyazaki,_Hayao
Arts/Animation/Anime/Creators/Studios
Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
Arts/Animation/Anime/Distribution
Arts/Animation/Anime/Distribution/Companies

Now I need to take it one step further. I need to get a count of how
many items are in "Arts", how many are in "Arts/Animation", etc. I
know a grouping and count is involved, but I can't wrap my mind around
how to get there since the category path depth is entirely variable
and I need these numbers relative to the "whole" (i.e. I need to know
how many times Arts/Animation/Anime appears rather than how many times
Anime appears at any level).

Any guidance would be much appreciated.

Rob Wilkerson

The information transmitted in this
email is intended only for the
person(s) or entity to which it is
addressed and may contain
confidential and/or privileged
material. Any review,
retransmission, dissemination
or other use of, or taking of any
action in reliance upon, this
information by persons or entities
other than the intended recipient
is prohibited. If you received this
email in error, please contact the
sender and permanently delete the
email from any computer.

Search Discussions

  • David Vrensk at Oct 1, 2010 at 11:45 am
    I would just preprocess the file with Perl or Ruby:

    perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile

    That would give you

    Arts/Animation/Anime<TAB>Clubs_and_Organizations

    i.e. two columns for every line (lines without slashes will be skipped).

    Come to think of it, if your entire file is just 800k lines, I'd do the
    entire thing with Perl.

    HTH,

    /David
    On Fri, Oct 1, 2010 at 13:32, Rob Wilkerson wrote:

    Hey guys -

    I have a script that loads a list of ~800,000 category hierarchies,
    filters them a bit and streams them through a PHP script for some
    quick procedural work. The file contains one column and a snippet
    looks like this:

    Arts
    Arts/Animation
    Arts/Animation/Anime
    Arts/Animation/Anime/Characters
    Arts/Animation/Anime/Clubs_and_Organizations
    Arts/Animation/Anime/Collectibles
    Arts/Animation/Anime/Collectibles/Cels
    Arts/Animation/Anime/Collectibles/Models_and_Figures
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
    Arts/Animation/Anime/Collectibles/Shitajiki
    Arts/Animation/Anime/Creators
    Arts/Animation/Anime/Creators/Anno,_Hideaki
    Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
    Arts/Animation/Anime/Creators/Miyazaki,_Hayao
    Arts/Animation/Anime/Creators/Studios
    Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
    Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
    Arts/Animation/Anime/Distribution
    Arts/Animation/Anime/Distribution/Companies

    Now I need to take it one step further. I need to get a count of how
    many items are in "Arts", how many are in "Arts/Animation", etc. I
    know a grouping and count is involved, but I can't wrap my mind around
    how to get there since the category path depth is entirely variable
    and I need these numbers relative to the "whole" (i.e. I need to know
    how many times Arts/Animation/Anime appears rather than how many times
    Anime appears at any level).

    Any guidance would be much appreciated.

    Rob Wilkerson

    The information transmitted in this
    email is intended only for the
    person(s) or entity to which it is
    addressed and may contain
    confidential and/or privileged
    material. Any review,
    retransmission, dissemination
    or other use of, or taking of any
    action in reliance upon, this
    information by persons or entities
    other than the intended recipient
    is prohibited. If you received this
    email in error, please contact the
    sender and permanently delete the
    email from any computer.

    --
    David Vrensk
    Systems developer, ICE House AB
    Mobile: +46 703 74 69 00
  • Rob Wilkerson at Oct 1, 2010 at 12:01 pm

    On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk wrote:
    I would just preprocess the file with Perl or Ruby:

    perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile
    What is the "#" representing? I have a semi-educated guess, but I
    can't find that particular symbol in any examples.

    Also, as far as I can tell, this regex also misses the top level path
    because it has not children. For example, the "Arts" path. It catches
    "Arts/Anime" and below nicely, of course.
    Come to think of it, if your entire file is just 800k lines, I'd do the
    entire thing with Perl.
    I thought about that when PHP couldn't handle it, but my Perl skills
    are light and it was a chance to learn something entirely new.

    Thanks for your help.

    --
    +rw

    The information transmitted in this
    email is intended only for the
    person(s) or entity to which it is
    addressed and may contain
    confidential and/or privileged
    material. Any review,
    retransmission, dissemination
    or other use of, or taking of any
    action in reliance upon, this
    information by persons or entities
    other than the intended recipient
    is prohibited. If you received this
    email in error, please contact the
    sender and permanently delete the
    email from any computer.
  • Andrea Leistra at Oct 1, 2010 at 12:10 pm
    The '#' is being used as a delimiter here, to avoid the need to escape all the slashes.

    -----Original Message-----
    From: Rob Wilkerson
    Sent: Friday, October 01, 2010 8:00 AM
    To: pig-user@hadoop.apache.org
    Subject: Re: Grouping & Counting
    On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk wrote:
    I would just preprocess the file with Perl or Ruby:

    perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile >
    outfile
    What is the "#" representing? I have a semi-educated guess, but I can't find that particular symbol in any examples.

    Also, as far as I can tell, this regex also misses the top level path because it has not children. For example, the "Arts" path. It catches "Arts/Anime" and below nicely, of course.
    Come to think of it, if your entire file is just 800k lines, I'd do
    the entire thing with Perl.
    I thought about that when PHP couldn't handle it, but my Perl skills are light and it was a chance to learn something entirely new.

    Thanks for your help.

    --
    +rw

    The information transmitted in this
    email is intended only for the
    person(s) or entity to which it is
    addressed and may contain
    confidential and/or privileged
    material. Any review,
    retransmission, dissemination
    or other use of, or taking of any
    action in reliance upon, this
    information by persons or entities
    other than the intended recipient
    is prohibited. If you received this
    email in error, please contact the
    sender and permanently delete the
    email from any computer.
  • David Vrensk at Oct 1, 2010 at 12:24 pm

    On Fri, Oct 1, 2010 at 14:00, Rob Wilkerson wrote:
    On Fri, Oct 1, 2010 at 7:44 AM, David Vrensk wrote:
    I would just preprocess the file with Perl or Ruby:

    perl -ne 'next unless m#/#; s#(.*)/(.*)#\1\t\2#; print;' infile > outfile
    What is the "#" representing? I have a semi-educated guess, but I
    can't find that particular symbol in any examples.
    They are alternative regexp delimiters. Usually, we write regexen delimited
    by slashes, but I wanted to avoid that since the only important character in
    the regexp is a slash.

    Also, as far as I can tell, this regex also misses the top level path
    because it has not children. For example, the "Arts" path. It catches
    "Arts/Anime" and below nicely, of course.
    Yup. But the point is to count children, so if there is no child on the
    row, there is nothing to count.

    BTW, you didn't say if you wanted to count children or descendants (i.e.
    children and children's children). From your follow-up, I gather it's about
    descendants. Try this:

    --------8<----------------
    #! /usr/bin/env ruby

    counts = Hash.new(0)

    while (line = ARGF.gets)
    line.chomp!
    segments = line.split '/'
    key = []
    segments.each do |s|
    key << s
    counts[key.join('/')] += 1
    end
    end

    counts.keys.sort.each do |key|
    puts("%5d %s" % [counts[key], key])
    end
    --------8<----------------

    which with your data returns

    23 Arts
    22 Arts/Animation
    21 Arts/Animation/Anime
    1 Arts/Animation/Anime/Characters
    1 Arts/Animation/Anime/Clubs_and_Organizations
    9 Arts/Animation/Anime/Collectibles
    1 Arts/Animation/Anime/Collectibles/Cels
    6 Arts/Animation/Anime/Collectibles/Models_and_Figures
    3 Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
    1
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
    1
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
    2 Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
    1 Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
    1 Arts/Animation/Anime/Collectibles/Shitajiki
    7 Arts/Animation/Anime/Creators
    1 Arts/Animation/Anime/Creators/Anno,_Hideaki
    1 Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
    1 Arts/Animation/Anime/Creators/Miyazaki,_Hayao
    3 Arts/Animation/Anime/Creators/Studios
    2 Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
    1 Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
    2 Arts/Animation/Anime/Distribution
    1 Arts/Animation/Anime/Distribution/Companies


    HTH,

    /David
    --
    David Vrensk
    Systems developer, ICE House AB
    Mobile: +46 703 74 69 00
  • Rob Wilkerson at Oct 1, 2010 at 12:44 pm

    On Fri, Oct 1, 2010 at 8:23 AM, David Vrensk wrote:
    Yup.  But the point is to count children, so if there is no child on the
    row, there is nothing to count.
    I suppose the point is to count descendents + the base, but I should
    have been more clear.
    BTW, you didn't say if you wanted to count children or descendants (i.e.
    children and children's children).  From your follow-up, I gather it's about
    descendants
    This looks very much like what I'm after. I'll verify against the
    data, of course, but thank you very much.


    --
    +rw

    The information transmitted in this
    email is intended only for the
    person(s) or entity to which it is
    addressed and may contain
    confidential and/or privileged
    material. Any review,
    retransmission, dissemination
    or other use of, or taking of any
    action in reliance upon, this
    information by persons or entities
    other than the intended recipient
    is prohibited. If you received this
    email in error, please contact the
    sender and permanently delete the
    email from any computer.
  • Dmitriy Ryaboy at Oct 1, 2010 at 4:57 pm
    So, if you had to do this for a "really large" file, and wanted to do this
    "the proper way" you would write the same thing as a Pig UDF that returns a
    bag of children, and then flatten the bag to get a row per partial
    hierarchy. Then group by said partial hierarchy, and count.

    In pig 8 you will be able to write this udf in Python, if you like.

    -D
    On Fri, Oct 1, 2010 at 5:43 AM, Rob Wilkerson wrote:
    On Fri, Oct 1, 2010 at 8:23 AM, David Vrensk wrote:
    Yup. But the point is to count children, so if there is no child on the
    row, there is nothing to count.
    I suppose the point is to count descendents + the base, but I should
    have been more clear.
    BTW, you didn't say if you wanted to count children or descendants (i.e.
    children and children's children). From your follow-up, I gather it's about
    descendants
    This looks very much like what I'm after. I'll verify against the
    data, of course, but thank you very much.


    --
    +rw

    The information transmitted in this
    email is intended only for the
    person(s) or entity to which it is
    addressed and may contain
    confidential and/or privileged
    material. Any review,
    retransmission, dissemination
    or other use of, or taking of any
    action in reliance upon, this
    information by persons or entities
    other than the intended recipient
    is prohibited. If you received this
    email in error, please contact the
    sender and permanently delete the
    email from any computer.
  • Rob Wilkerson at Oct 1, 2010 at 12:11 pm
    A more concise way of stating this would be to say that:

    For each line item, I need to count all of the line items (including
    this on) that _start_ with the same value plus an optional "/".

    For example, using pseudo-regex syntax:

    - A count of all of the lines that start with "Arts$" or "Arts/"
    - A count of all of the lines that start with "Arts/Animation$" or
    "Arts/Animation/", etc.

    I should have led with that, but I'm still feeling out the problem a little bit.
    On Fri, Oct 1, 2010 at 7:32 AM, Rob Wilkerson wrote:
    I have a script that loads a list of ~800,000 category hierarchies,
    filters them a bit and streams them through a PHP script for some
    quick procedural work. The file contains one column and a snippet
    looks like this:

    Arts
    Arts/Animation
    Arts/Animation/Anime
    Arts/Animation/Anime/Characters
    Arts/Animation/Anime/Clubs_and_Organizations
    Arts/Animation/Anime/Collectibles
    Arts/Animation/Anime/Collectibles/Cels
    Arts/Animation/Anime/Collectibles/Models_and_Figures
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
    Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
    Arts/Animation/Anime/Collectibles/Shitajiki
    Arts/Animation/Anime/Creators
    Arts/Animation/Anime/Creators/Anno,_Hideaki
    Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
    Arts/Animation/Anime/Creators/Miyazaki,_Hayao
    Arts/Animation/Anime/Creators/Studios
    Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
    Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
    Arts/Animation/Anime/Distribution
    Arts/Animation/Anime/Distribution/Companies

    Now I need to take it one step further. I need to get a count of how
    many items are in "Arts", how many are in "Arts/Animation", etc. I
    know a grouping and count is involved, but I can't wrap my mind around
    how to get there since the category path depth is entirely variable
    and I need these numbers relative to the "whole" (i.e. I need to know
    how many times Arts/Animation/Anime appears rather than how many times
    Anime appears at any level).

    Any guidance would be much appreciated.
    The information transmitted in this
    email is intended only for the
    person(s) or entity to which it is
    addressed and may contain
    confidential and/or privileged
    material. Any review,
    retransmission, dissemination
    or other use of, or taking of any
    action in reliance upon, this
    information by persons or entities
    other than the intended recipient
    is prohibited. If you received this
    email in error, please contact the
    sender and permanently delete the
    email from any computer.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedOct 1, '10 at 11:33a
activeOct 1, '10 at 4:57p
posts8
users4
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase