Hey all,

While preparing for next week's Cascalog class, I've been forced to confront the fact that most of the Cascalog tutorials out there are designed more to show off the logical power of the API than its ability to crack open huge, interesting datasets. (Scalding is in the lead here with Edwin Chen's movie recommendation tutorial (https://github.com/echen/scaldingale).)

To fill the gap, I've been thinking about putting together a series of more detailed tutorials on Cascalog and Scalding that work with real datasets. The most fruitful and available source is probably Amazon's public dataset repository, located here (http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1). Which of these would you all find the most interesting?

I'll start. I'd like to see a document clustering analysis of these Enron emails (http://aws.amazon.com/datasets/917205). The legal document analysis system is currently VERY broken. (If any of you work on such systems, I'm happy to be convinced otherwise :) Some of the current email indexing services charge hundreds of dollars per search. A tutorial would be a nice first step toward demolishing this industry.

Respond here, or post ideas on this Reddit thread: http://www.reddit.com/r/Cascading/comments/qnodc/cascading_tutorial_ideas/

Looking forward to hearing your thoughts!
--
Sam Ritchie
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


  • Paul Lam at Mar 9, 2012 at 1:55 pm
    I've just thought of this, although it's probably not quick enough to do in
    a session. I'd love to investigate the threshold of collective action for
    major riots (e.g. Tunisia) using Twitter/social network feeds.

    1. When is the event horizon of information flow in a riot? At what point
    has the flow of information snowballed so far that action is practically
    guaranteed based on network effects? This could be modelled by considering
    information flow quantities at change-points of the third, fourth, or fifth
    derivatives against time.

    2. What is the average influence threshold until action on a personal level
    for the people involved? How many buddies does an average rioter need to
    know have been called to action before they will take action themselves?
    See https://modelthinkinginruby.wordpress.com/2012/02/27/granovetter/ for a
    sample modelling methodology.
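
To make the personal-threshold question in item 2 concrete, here is a minimal sketch of a Granovetter-style threshold cascade in Python. It is an illustration only, not the linked post's code: the function name `cascade_size` and the use of absolute head counts (rather than fractions of a person's own network) are assumptions made for the example, and it ignores network structure entirely.

```python
from typing import List


def cascade_size(thresholds: List[int]) -> int:
    """Return how many people end up acting in a threshold cascade.

    thresholds[i] is the number of people person i must see acting
    before joining in (hypothetical input format); 0 marks an
    instigator who acts unprompted.
    """
    active = 0
    while True:
        # Everyone whose threshold is met by the current crowd takes part.
        joined = sum(1 for t in thresholds if t <= active)
        if joined == active:
            # Fixed point: nobody new joins, so the cascade has stopped.
            return active
        active = joined


if __name__ == "__main__":
    # Thresholds 0, 1, 2, ..., 99: each person tips the next one over.
    print(cascade_size(list(range(100))))
    # Same crowd, but the person with threshold 1 now needs 2.
    print(cascade_size([0, 2] + list(range(2, 100))))
```

The two runs differ by one person's threshold, which is why averages alone may not tell you much here: the shape of the whole threshold distribution matters.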

    Once we can model one, we might be able to model many, identify
    patterns across the spectrum, and maybe build up to predicting
    social unrest. That has never been done before, it is massively
    important, and we're starting to have sufficient data. Minority Report,
    anyone?


  • Ted Dunning at Mar 9, 2012 at 7:43 pm

    On Fri, Mar 9, 2012 at 5:55 AM, Paul Lam wrote:

    > I've just thought of this although it's probably not quick enough to do in
    > a session. I'd love to investigate the threshold of collective action for
    > major riots (e.g. Tunisia) using Twitter/social network feeds.
    >
    > 1. When is the event-horizon of information flow in a riot. At what point
    > is the flow of information snowballed until action is practically
    > guaranteed based on network effects. This could be modelled by considering
    > information flow quantities at change-point of third, fourth, or fifth
    > derivatives against time.

    Great problem. Bad approach.

    Events that are inherently counts are not good candidates for derivatives.

    It is quite reasonable to do change-point detection using a hierarchical
    Poisson model, however. The idea is that you have a Poisson process with a
    rate that is a sample from a random walk of some kind. The details of the
    walk aren't real important in that there are lots of good alternatives to
    pick from. Some alternatives can be a step that occurs at random intervals
    with a weakly constrained long-tailed distribution, or a t-distributed random
    walk that takes steps very often.

    The underlying rate of the Poisson process is the hidden variable that you
    are looking for. Large steps in that rate are the events that you seek.
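
As a rough illustration of treating the counts as Poisson draws with a hidden rate, the sketch below finds a single step in that rate by maximum likelihood. This is a much simpler stand-in for the hierarchical random-walk model described above, not an implementation of it: it assumes exactly one step and a constant rate on each side. The function names and the sample counts are made up for the example.

```python
import math
from typing import List, Tuple


def poisson_loglik(counts: List[int]) -> float:
    """Log-likelihood of counts under one constant rate set to their mean.

    The log(k!) terms are dropped because they do not depend on the rate.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    rate = total / len(counts)
    return total * math.log(rate) - rate * len(counts)


def best_change_point(counts: List[int]) -> Tuple[int, float]:
    """Return (index, gain) for the best single split of the series.

    The split is counts[:index] versus counts[index:]; gain is the
    log-likelihood improvement over fitting one rate to everything.
    """
    if len(counts) < 2:
        raise ValueError("need at least two observations")
    base = poisson_loglik(counts)
    best_index, best_gain = 1, float("-inf")
    for i in range(1, len(counts)):
        gain = poisson_loglik(counts[:i]) + poisson_loglik(counts[i:]) - base
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


if __name__ == "__main__":
    # Hypothetical messages-per-interval counts with a jump in the rate.
    counts = [2, 3, 2, 3, 2, 3, 20, 22, 19, 21]
    print(best_change_point(counts))
```

A large gain suggests a real step in the underlying rate; a full treatment would put a prior on the rate's path and allow many steps, as the random-walk formulation does.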

    > 2. What is the average influence threshold until action on a personal level
    > for the people involved? How many buddies does an average rioter needed to
    > know has been called to action until they will take action themselves? See
    > https://modelthinkinginruby.wordpress.com/2012/02/27/granovetter/ for a
    > sample modelling methodology.

    I can't comment on this except that you need to model the external
    information flow as well. It isn't just buddies.


  • Sam Ritchie at Mar 9, 2012 at 9:28 pm
    This is exactly the sort of question we could crack open by having standard taps into large public datasets. I think that the logic is easy at this point; I personally find myself writing queries for data I don't yet have, or for very small, contrived datasets, rather than asking questions of the data itself.

    What (publicly available) datasets can we think of that would help us start approaching this problem? On the Twitter side, I'll start poking around to see if it's possible to get some of our data out there.

    --
    Sam Ritchie
    Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

    --
    You received this message because you are subscribed to the Google Groups "cascading-user" group.
    To post to this group, send email to cascading-user@googlegroups.com (mailto:cascading-user@googlegroups.com).
    To unsubscribe from this group, send email to cascading-user+unsubscribe@googlegroups.com (mailto:cascading-user+unsubscribe@googlegroups.com).
    For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.
  • Paul Lam at Mar 9, 2012 at 10:54 pm
    Nice... Thanks for the correction, Ted! I've completely forgotten about my
    stochastic processes.

    It's Friday night, and I've just caught up on this week's news and found out
    about Kony 2012. Analysing the Kony campaign would be interesting, as it's
    happening now and looks to be big.

    Will look for datasets this weekend.


Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Mar 8, '12 at 7:22p
active: Mar 9, '12 at 10:54p
posts: 5
users: 4
website: clojure.org
irc: #clojure
