FAQ
Is there a way to get the top N elements within each group of a query?
That is, I want to group my query and then, within each group, find the
tuples with the largest values in a given field.

I'm not sure that makes sense, so here's an example. Given this collection
of tuples:

     (def word-counts
       ;; person, word, total
       [["david" "apple" 3]
        ["david" "banana" 5]
        ["david" "cherry" 4]
        ["bob" "apple" 100]
        ["bob" "bulgaria" 10]
        ["bob" "cambodia" 23]
        ["bob" "dominica" 12]
        ["george" "apple" 20]
        ["george" "france" 7]])

how do we find the 2 words with the greatest total for each person? The
desired result is:

     ((["david" "banana" 5] ["david" "cherry" 4])
      (["bob" "apple" 100] ["bob" "cambodia" 23])
      (["george" "apple" 20] ["george" "france" 7]))
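
For reference, the intended result can be pinned down with a plain in-memory Clojure sketch that does not involve Cascalog at all. The function name top-n-per-group is hypothetical, and the sketch assumes the whole collection fits in memory, so it only describes the semantics being asked for:

```clojure
(def word-counts
  ;; person, word, total
  [["david" "apple" 3]
   ["david" "banana" 5]
   ["david" "cherry" 4]
   ["bob" "apple" 100]
   ["bob" "bulgaria" 10]
   ["bob" "cambodia" 23]
   ["bob" "dominica" 12]
   ["george" "apple" 20]
   ["george" "france" 7]])

(defn top-n-per-group
  "Hypothetical helper: groups tuples by (key-fn tuple), sorts each group
  by (score-fn tuple) in descending order, and keeps the first n tuples of
  each group. Returns a map from group key to a vector of tuples."
  [n key-fn score-fn tuples]
  (into {}
        (for [[k group] (group-by key-fn tuples)]
          [k (vec (take n (sort-by score-fn > group)))])))

;; Group by person (first field), rank by total (third field).
(def top-2 (top-n-per-group 2 first #(nth % 2) word-counts))
```

Looking up (get top-2 "bob") in this sketch gives the two "bob" tuples with totals 100 and 23.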

Cheers,
Chris Dean

  • Paul Lam at Jun 20, 2012 at 10:54 am
    I've been doing it as shown below. Any suggestions on optimising
    first-n-tuples, or on a more canonical way of doing it?

    (defbufferop [first-n-tuples [n]]
       [tuples] (take n tuples))

    (<- [?person ?word ?total]
         (src ?person ?word-all ?total-all)
         (:sort ?total-all) (:reverse true)
         (first-n-tuples [2] ?word-all ?total-all :> ?word ?total))


    Paul
  • Chris Dean at Jun 20, 2012 at 6:15 pm

    Paul Lam writes:
        I've been doing this below. Any suggestion on optimising
        first-n-tuples or a more canonical way of doing it?

        (defbufferop [first-n-tuples [n]]
          [tuples] (take n tuples))

    That's great, thanks! Do you know whether defbufferop holds all of a
    group's tuples in RAM when it is called? In a typical dataset the
    tuples argument would have 5 million entries, but I only care about
    the first 1000.

    Cheers,
    Chris Dean
  • Sam Ritchie at Jun 20, 2012 at 6:20 pm
    Hey guys,

    The right way to do this without buffering every tuple for a group in
    memory is with the cascalog.ops/limit operator:

    ;; c is assumed to be an alias for cascalog.ops
    (<- [?person ?word ?total]
        (src ?person ?word-all ?total-all)
        (:sort ?total-all) (:reverse true)
        (c/limit [2] ?word-all ?total-all :> ?word ?total))

    first-n uses limit under the hood to do its magic.
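
To illustrate the general idea of keeping only n tuples at a time instead of buffering a whole group, here is a plain Clojure sketch. It is an analogy only: it does not use Cascalog and makes no claim about how cascalog.ops/limit is implemented. The names bounded-top-n and bob-words are hypothetical.

```clojure
(defn bounded-top-n
  "Hypothetical helper: returns the n tuples with the largest
  (score-fn tuple), in descending order. While consuming `tuples` it
  never holds more than n + 1 candidate tuples at once."
  [n score-fn tuples]
  (reduce (fn [best tuple]
            ;; Add the new tuple to the small buffer, re-sort it,
            ;; and drop everything past the first n.
            (vec (take n (sort-by score-fn > (conj best tuple)))))
          []
          tuples))

;; bob's (word, total) pairs from the example data
(def bob-words
  [["apple" 100] ["bulgaria" 10] ["cambodia" 23] ["dominica" 12]])

(def bob-top-2 (bounded-top-n 2 second bob-words))
```

Re-sorting the buffer on every tuple is fine for a small n; for something like n = 1000 a priority queue would likely be a better fit.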

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

  • Chris Dean at Jun 20, 2012 at 6:31 pm

    Sam Ritchie writes:
        The right way to do this without buffering every tuple for a group in
        memory is with the cascalog.ops/limit operator:

    Perfect.

    I had tried something like that before but got an error. Of course, I
    can't reproduce it now.

    Cheers,
    Chris Dean
  • Mason at Nov 7, 2013 at 1:33 am
    It looks like this needs a slight tweak to work in Cascalog 2.0, because
    limit's interface changed. Here it is for posterity:

    ;; load the namespaces used below
    (require '[cascalog.api :refer [<-]]
             'cascalog.logic.ops)

    (def word-counts
      ;; person, word, total
      [["david" "apple" 3]
       ["david" "banana" 5]
       ["david" "cherry" 4]
       ["bob" "apple" 100]
       ["bob" "bulgaria" 10]
       ["bob" "cambodia" 23]
       ["bob" "dominica" 12]
       ["george" "apple" 20]
       ["george" "france" 7]])

    (def query
      (<- [?person ?word ?total]
          (word-counts ?person ?word-all ?total-all)
          ;; sort each person's tuples by total, largest first
          (:sort ?total-all) (:reverse true)
          ;; Cascalog 2.0 style: (limit 2) rather than (limit [2] ...)
          ((cascalog.logic.ops/limit 2) :< ?word-all ?total-all :> ?word ?total)))

    ;; run the query and return the result tuples
    (cascalog.api/??- query)

    --
    You received this message because you are subscribed to the Google Groups "cascalog-user" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Sam Ritchie at Nov 7, 2013 at 1:35 am
    Nice, thanks for this! We might need some wiki updates as well.

  • Alexander Kehayias at Jan 6, 2014 at 5:19 pm
    A quick note about limiting each group that was not spelled out in the
    example solution: the field you want to group by should not be among the
    inputs to cascalog.logic.ops/limit.

    Note the input and output fields used to get the top n tuples per person:
    - final out fields of the query: [?person ?word ?total]
    - in fields to limit: ?word-all ?total-all
    - out fields to limit: ?word ?total

    This is with Cascalog 2.0.
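
As a rough plain-Clojure analogy of the point above (not Cascalog code, and not a statement about Cascalog's internals), the difference between leaving the grouping field out of the limited tuples and folding it into them might be sketched like this. The names limit-tuples, per-person, and overall are hypothetical.

```clojure
(def word-counts
  ;; person, word, total
  [["david" "apple" 3]
   ["david" "banana" 5]
   ["david" "cherry" 4]
   ["bob" "apple" 100]
   ["bob" "bulgaria" 10]
   ["bob" "cambodia" 23]
   ["bob" "dominica" 12]
   ["george" "apple" 20]
   ["george" "france" 7]])

(defn limit-tuples
  "Hypothetical helper: sorts vector tuples by their last field in
  descending order and keeps the first n."
  [n tuples]
  (vec (take n (sort-by peek > tuples))))

;; Grouping field kept out of the limited tuples: the data is grouped by
;; person first, and only [word total] pairs are limited within each group.
(def per-person
  (into {}
        (for [[person tuples] (group-by first word-counts)]
          [person (limit-tuples 2 (mapv #(subvec % 1) tuples))])))

;; Grouping field folded into the limited tuples: nothing is left to group
;; by, so only 2 tuples survive across all people.
(def overall (limit-tuples 2 word-counts))
```

In this sketch per-person has one entry per person with two [word total] pairs each, while overall contains only bob's two largest tuples.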

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Jun 20, '12 at 9:26a
active: Jan 6, '14 at 5:19p
posts: 8
users: 5
website: clojure.org
irc: #clojure
