FAQ
Let's say we have a datatype with attrib1, attrib2, and attrib3. This
represents a stable concept during a fixed period of time, say from Jan
2012 to July 2012. After July 2012, I realized the stable concept was
missing an attribute, and I subsequently added attrib4. Now I want to
run a query on the datatype between Jan 2012 and December 2012, where
all the data before July 2012 will be missing attrib4. How would I cope
with something like that in Cascalog? Of course, this is a simple example,
and in practice, new attributes could be added quite often.

As an example, let's say I have a set of data representing job adverts
(like the kind we search for when we need a new job on sites like Indeed),
where the job advert has the attributes of PositionName,
PositionDescription, and SalaryRange. I go live with this data, and for 6
months, all of my jobs have these characteristics. Now let's say I
introduce a new attribute to the job advert for JobLocation (the location
where the job is based). Now I have some job adverts where there is a
JobLocation, and some other adverts where I don't have a JobLocation. If I
wanted to construct a query indiscriminately on all of the job adverts
using JobLocation as a criterion, would that be possible in Cascalog? Would
I have to resort to treating them as two distinct data sets? Would I have to
migrate the old data set to the new data set, setting the non-existent
values to null? Would I have to introduce some versioning elsewhere? Or is
there a more elegant way to handle this in Cascalog?

  • Sam Ritchie at Aug 2, 2012 at 8:53 pm
    How are you storing your objects? This is completely fine to do with a data
    representation like Thrift -- just mark new fields as "optional" and make
    sure not to change the IDs of old fields. Then the current Cascalog job
    will load the code for the latest version of your structs while accepting
    the old ones with no issues.

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
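
To illustrate the idea in the reply above, here is a minimal sketch in plain Python. It is not Thrift or Cascalog code; it only mimics the behaviour described: fields carry stable numeric IDs, the new field is optional, and old records that lack it load with the value unset. The field IDs, the `JobAdvert` class, the helper names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field IDs, mimicking numbered fields in a Thrift-style schema.
# IDs 1-3 existed from the start and must never be renumbered;
# ID 4 was added later as an optional field.
FIELD_NAMES = {
    1: "position_name",
    2: "position_description",
    3: "salary_range",
    4: "job_location",
}


@dataclass
class JobAdvert:
    position_name: str
    position_description: str
    salary_range: str
    job_location: Optional[str] = None  # optional: absent in old records


def decode(raw):
    """Build a JobAdvert from a {field_id: value} mapping.

    Old records simply lack ID 4, so job_location stays None.
    """
    return JobAdvert(**{FIELD_NAMES[fid]: value for fid, value in raw.items()})


def adverts_in(location, adverts):
    """Return adverts whose location equals `location`.

    Adverts without a location never match, whatever `location` is.
    """
    return [
        a for a in adverts
        if a.job_location is not None and a.job_location == location
    ]


# One record written before the field existed, one written after.
old_record = {1: "Data Engineer", 2: "Build pipelines", 3: "50-60k"}
new_record = {1: "Clojure Developer", 2: "Write queries", 3: "60-70k", 4: "London"}

# A single query runs over both generations of data.
adverts = [decode(old_record), decode(new_record)]
london = adverts_in("London", adverts)
missing = [a.position_name for a in adverts if a.job_location is None]
```

With this shape there is no need to split the data into two sets or to migrate the old records: the query decides explicitly how to treat adverts whose location is unset.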

Discussion Overview
group: cascalog-user @
categories: clojure, hadoop
posted: Aug 2, '12 at 8:31p
active: Aug 2, '12 at 8:53p
posts: 2
users: 2
website: clojure.org
irc: #clojure

2 users in discussion

Indranil: 1 post Sam Ritchie: 1 post
