Hi,

Is it possible to store unique messages in a queue? I have a message
source that may provide duplicate messages but I do not want the
message to be processed more than once. I am trying to avoid using a
database or any other form of persistence in the client using the
queues. Any help or advice would be appreciated.

Thanks!

Regards,

--
Vidit Drolia

  • Darien Kindlund at Jul 31, 2009 at 3:56 pm
    Hi Vidit,

    To my knowledge, RabbitMQ doesn't provide this type of capability by
    default. However, you could easily write a "pre-filter" application,
    which takes in possibly-duplicated messages from your source on one
    queue and then outputs the non-duplicates on another queue. You don't
    need a database or other expensive I/O to check for duplicates.

    Presumably you have a way of identifying unique messages; I'll assume
    it's some sort of unique ID associated with each message.
    You can have your "pre-filter" application use a Bloom filter
    (http://en.wikipedia.org/wiki/Bloom_filters), which uses a constant
    amount of memory and has a configurable false-positive rate.
    The only major issue with this approach is that a Bloom filter holds a
    maximum number of entries (user configurable), so you'll have to deal
    with the corner case of how the application should behave once it has
    seen that many unique messages -- either create a fresh Bloom filter
    (and discard your previous history) or chain multiple Bloom filters
    together (slowly increasing your CPU costs). Then again, the maximum
    can be set exceedingly large (e.g., 10 million) to make this situation
    unlikely in practice.
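
    For what it's worth, here is a rough, untested sketch of that kind of
    filter in Python (the bit-array size and hash count below are arbitrary
    assumptions, not recommendations):

        import hashlib

        class BloomFilter:
            """Fixed-size Bloom filter: false positives are possible (a new
            message very occasionally treated as a duplicate), false
            negatives are not."""

            def __init__(self, size_bits=8000000, num_hashes=7):
                self.size = size_bits
                self.num_hashes = num_hashes
                self.bits = bytearray(size_bits // 8 + 1)

            def _positions(self, key):
                # Derive num_hashes bit positions from a single MD5 digest.
                digest = hashlib.md5(key.encode()).digest()
                h1 = int.from_bytes(digest[:8], 'big')
                h2 = int.from_bytes(digest[8:], 'big')
                for i in range(self.num_hashes):
                    yield (h1 + i * h2) % self.size

            def add(self, key):
                for pos in self._positions(key):
                    self.bits[pos // 8] |= 1 << (pos % 8)

            def __contains__(self, key):
                return all(self.bits[pos // 8] & (1 << (pos % 8))
                           for pos in self._positions(key))

        seen = BloomFilter()

        def is_duplicate(message_id):
            if message_id in seen:
                return True        # probably seen before
            seen.add(message_id)
            return False

    The pre-filter would call is_duplicate() on each incoming message's ID
    and republish only those for which it returns False.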

    -- Darien

  • Matthias Radestock at Jul 31, 2009 at 5:03 pm
    Vidit,

    Vidit Drolia wrote:
    Is it possible to store unique messages in a queue? I have a message
    source that may provide duplicate messages but I do not want the
    message to be processed more than once.

    As Darien pointed out, deduplicating messages at the client end isn't
    all that hard, and even easier than he described if, say, you can
    guarantee that message ids are monotonically increasing.

    The really, really, hard part is ensuring that a message only gets
    *processed* once.

    When can a message be considered to have been processed? Let's assume we
    have an app that pulls messages off a rabbit queue and calls a function
    process(msg) to process them. At what point then has the message been
    processed? At the exact point we call the function? At the exact point
    it returns? Somewhere in between? Whatever point you choose, you then
    still have to *record* the fact that the point has been reached, so that
    the message can be forgotten for good, or, alternatively, if the point
    hasn't been reached, replayed at a later point. That act of recording -
    whether it be by acknowledging the message in rabbit, or some other
    means - itself can fail, which will result in eventual resending and
    thus duplication.

    The problem here is that the processing of the message itself is not
    atomic, let alone the combination of it with the recording/acknowledging
    action.

    The only way to solve this is either to make everything (rabbit, your
    app, any apps it talks to, etc.) part of one gigantic transaction, thus
    ensuring atomicity, or, by far the easier and better option, to
    construct your apps in such a way that message processing is idempotent.
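
    As a toy illustration of that idempotency point (my own example, not
    part of the system under discussion): an operation like "set the status
    of order 42 to shipped" can safely be applied twice, whereas "increment
    the shipped counter" cannot. In hypothetical Python, with 'db' standing
    in for whatever store the app writes to:

        def handle_non_idempotent(db, msg):
            # Re-delivery of the same message changes the result: not safe.
            db["shipped_count"] = db.get("shipped_count", 0) + 1

        def handle_idempotent(db, msg):
            # Re-delivery of the same message leaves the state unchanged: safe.
            db["order_status:%s" % msg["order_id"]] = "shipped"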


    Regards,

    Matthias.
  • Vidit Drolia at Jul 31, 2009 at 5:24 pm
    Thanks Darien and Matthias.

    Each message received actually triggers an email. I can use a Bloom
    filter to make sure messages are not duplicated, but if the node
    running the filter application fails, I will lose that history.
    Running the filter on multiple nodes may be a solution.

    The primary problem is that, since the action triggered by the
    message is an email, I can't revert it. So I am trying to ensure
    that the application sending emails gets a given message only once.
    Is there another approach I can take to this problem?

    Thanks again!

    Vidit

  • Matthias Radestock at Jul 31, 2009 at 6:08 pm
    Vidit,

    Vidit Drolia wrote:
    The primary problem is that since the action being triggered by the
    message is an email, I can't revert the action. So I am trying to
    ensure that the application sending emails gets a message only once.
    Is there another approach I can take to this problem?

    If you replace "process(msg)" in my last email with "send_email(msg)",
    you will see that what you are asking for is impossible. The best one
    can do (in any system, involving rabbit or not) is to make it *very
    unlikely* that an email is sent more than once. As long as we can agree
    on that, let's proceed ...

    If your main concern is removing the duplicates the senders can produce,
    then I suggest inserting a filtering proxy, i.e. a process that consumes
    messages from one queue, de-dups them and publishes the non-dups to
    another exchange.

    This process does need to keep some state, so, as you say, if it crashes
    and the state is lost then you may get some dups. The process is very
    simple though, so the likelihood of it crashing should be low. Given
    that we have established that there can be no 100% no-dup guarantee, is
    it really worth worrying about that? If the answer is yes, then possible
    options include persisting that state or replicating it across several
    redundant nodes.
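
    To make the shape of that proxy concrete, here is a bare-bones, untested
    sketch using the Python "pika" client (the client choice, the queue and
    exchange names, and the in-memory set are all my own assumptions; the
    set grows without bound, which is where something like the Bloom filter
    discussed earlier would come in):

        import pika

        SEEN = set()   # in-memory only; lost if the proxy crashes

        def on_message(channel, method, properties, body):
            # Use the message id if the publisher set one, else the body itself.
            msg_id = properties.message_id or body
            if msg_id not in SEEN:
                SEEN.add(msg_id)
                channel.basic_publish(exchange='deduped',   # hypothetical downstream exchange
                                      routing_key='',
                                      body=body,
                                      properties=properties)
            # Ack either way, so duplicates are simply dropped.
            channel.basic_ack(delivery_tag=method.delivery_tag)

        connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
        channel = connection.channel()
        channel.exchange_declare(exchange='deduped', exchange_type='fanout')
        channel.queue_declare(queue='raw')   # hypothetical queue fed by the duplicating source
        channel.basic_consume(queue='raw', on_message_callback=on_message)
        channel.start_consuming()

    The consumer that actually sends the emails would then bind its own
    queue to the 'deduped' exchange instead of reading from 'raw'.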


    Regards,

    Matthias.
  • Vidit Drolia at Jul 31, 2009 at 6:41 pm
    Matthias,

    Minimizing the probability of sending out a duplicate message is the
    practical objective. So you are right in saying that the best we can
    do is to make it very unlikely that a duplicate mail is sent out.

    A filtering proxy would make the most sense, because I wanted to avoid
    expensive I/O for persistence in the first place, and, if needed,
    redundancy can be introduced later for fault tolerance.

    Thanks for all the help!

    Best,

    Vidit

  • Tony Garnock-Jones at Aug 4, 2009 at 4:39 pm
    Hi Vidit,

    You wrote, at the start of this thread, that you "have a message source
    that may provide duplicate messages". What kind of duplicates are we
    talking about here? One per minute for the next six years, or the
    occasional duplicate within a minute of the original, followed by no
    more duplicates ever?

    If it's the former, then a long-term memory is clearly required; if the
    latter (i.e. you're coping with the normal possibility of
    duplication-because-of-connection-failure-etc), then a simple memory of
    say an hour's worth of processed message IDs ought to be enough.
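
    Something as small as the following, untested sketch would cover that
    short-window case (the one-hour window is just the example figure above;
    at modest message rates the linear purge is not a concern):

        import time

        class RecentIds:
            """Remembers message ids seen within the last `window` seconds."""

            def __init__(self, window=3600):
                self.window = window
                self.seen = {}    # message id -> timestamp last seen

            def check_and_add(self, msg_id):
                """Return True if msg_id was already seen within the window."""
                now = time.time()
                for old_id, ts in list(self.seen.items()):
                    if now - ts > self.window:
                        del self.seen[old_id]    # forget ids older than the window
                duplicate = msg_id in self.seen
                self.seen[msg_id] = now
                return duplicate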

    Regards,
    Tony


    --
    [][][] Tony Garnock-Jones | Mob: +44 (0)7905 974 211
    [][] LShift Ltd | Tel: +44 (0)20 7729 7060
    [] [] http://www.lshift.net/ | Email: tonyg at lshift.net
  • Vidit Drolia at Aug 4, 2009 at 4:56 pm
    Hi Tony,

    There *may* be one or more duplicates per day. The message source is
    Amazon SQS, which does not guarantee that a message is deleted even
    after a delete command is issued, nor do I get an acknowledgement
    confirming that the message was deleted. Thus, I am trying to make my
    system immune to the constraints imposed by SQS. I am assuming that I
    will be able to delete a message within a day, but until I do so, my
    application needs to be sure that duplicates are not introduced into
    the system.

    Best,

    Vidit

