Short story: the subject says it all, so if you have an answer already,
fire away. Below is the long story of what I'm using it for, and why I
think it needs to be recursive. It may even be of more general
interest in terms of filtering the results of generators.


I'm playing with an anagram-generating module which works like this:


1) Generate the integer partitions of the length of the input string.


2) For each partition, replace each element with a list of all
dictionary words which are a) the same length as that partition
element and b) a sub-multiset of the input string.


eg: "cat in hat" -> ... , [3,5], ... -> [[act, ant, ...], [antic,
attic, ...]]


3) Pass each resulting list of wordlists to itertools.product, filtering
out any output which is not the same multiset of characters as the
input string.


This works but gets very slow for long strings. It spends most of its
time at the filtering stage because most of the results have too many
of some characters (and therefore not enough of others). I do some
optimising of the lists prior to running product, but this only shaves
off smallish percentages.
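
A minimal sketch of the multiset check used at the filtering stage in step
3, using collections.Counter (the function name is illustrative, not taken
from the program):

from collections import Counter

def is_full_anagram(words, input_str):
    # The candidate phrase passes only if its combined letters form exactly
    # the same multiset as the input, ignoring spaces.
    return Counter(''.join(words)) == Counter(input_str.replace(' ', ''))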


I got a big speedup (factor of five for medium-length inputs, much more
for longer strings) by replacing itertools.product with this recursive
generator:


def cartesian_product(args, input_str):
     if not args:
         yield (), input_str
         return
     for words, check_str in cartesian_product(args[:-1], input_str):
         for word in args[-1]:
             #this bit prevents bothering with anything useless
             new_str = check_str
             for letter in word:
                 if letter in new_str:
                     new_str = new_str.replace(letter, '', 1)
                 else:
                     break
             else:
                 yield words + (word,), new_str


Despite having far more per-step overhead than itertools.product, it can
"prune" branches of the recursion as soon as they accumulate too many of
any character. The net result is that it runs much faster and produces
correct results without further filtering.
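
A phrase is a complete anagram when the leftover string comes back empty; in
the anagram program the word lengths already sum to the length of the input,
so every surviving tuple qualifies, but the explicit check makes a
stand-alone sketch like this one (with made-up wordlists) self-contained:

wordlists = [['act', 'ant', 'tat'], ['antic', 'attic', 'chain']]
for words, leftover in cartesian_product(wordlists, 'catinhat'):
    if not leftover:            # every input letter has been used
        print(' '.join(words))  # with these illustrative lists: "tat chain"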


But there is another problem. The initial partitions contain repeated
elements, so the corresponding wordlists are also repeated. This means
that any cartesian product will contain many redundant results - the
same words in a different order. For a medium-sized string, this is
most of them.


A solution for the repeated sections of a partition's wordlists is to
use r-combinations (where r is the number of repeats). In this
scenario, though, some words may be usable more than once, and the
right number of copies of those words must be added to the list to
allow for this. This means I need the combinations of a multiset (so I
can't use itertools.combinations).


I found a verbal description of such an algorithm and came up with
this:


def multicombs(it, r):
     result = it[:r]
     yield result
     while 1:
         for i in range(-1, -r - 1, -1):
             rep = result[i]
             if rep < it[i]:
                 break
         else:
             break
         for j, n in enumerate(it):
             if n > rep:
                 break
         result = result[:i] + it[j:j - i]
         yield result
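
It expects a sorted iterable; in this program it is called on a sorted tuple
of candidate words, for example:

list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
('in', 'the', 'the')]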


I added a call to this generator in a branch to the main generator
above to deal with repeated elements. This eliminates redundant
results, but with a substantial slowdown. The program now spends most
of its time inside multicombs.


I'm hoping that if I could find a recursive multiset combination
generator, I could speed it up using the same pruning approach.


Any suggestions?


--


John


  • Oscar Benjamin at Nov 21, 2013 at 11:42 am

    On 21 November 2013 06:46, John O'Hagan wrote:
    I found a verbal description of such an algorithm and came up with
    this:

    def multicombs(it, r):
        result = it[:r]
        yield result
        while 1:
            for i in range(-1, -r - 1, -1):
                rep = result[i]
                if rep < it[i]:
                    break
            else:
                break
            for j, n in enumerate(it):
                if n > rep:
                    break
            result = result[:i] + it[j:j - i]
            yield result

    I'm not really sure what it is you're asking for. I thought if I ran
    the code I'd understand but that just confused me more. Is the output
    below correct? If not what should it be?


    multicombs("abracadabra", 0)
    ['']
    multicombs("abracadabra", 1)
    ['a']
    multicombs("abracadabra", 2)
    ['ab', 'br', 'ra']
    multicombs("abracadabra", 3)
    ['abr', 'ara', 'bra']
    multicombs("abracadabra", 4)
    ['abra']
    multicombs("abracadabra", 5)
    ['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad',
    'racbr', 'racra']




    Oscar
  • John O'Hagan at Nov 21, 2013 at 1:01 pm

    On Thu, 21 Nov 2013 11:42:49 +0000 Oscar Benjamin wrote:

    On 21 November 2013 06:46, John O'Hagan wrote:

    I found a verbal description of such an algorithm and came up with
    this:

    def multicombs(it, r):
        result = it[:r]
        yield result
        while 1:
            for i in range(-1, -r - 1, -1):
                rep = result[i]
                if rep < it[i]:
                    break
            else:
                break
            for j, n in enumerate(it):
                if n > rep:
                    break
            result = result[:i] + it[j:j - i]
            yield result
    I'm not really sure what it is you're asking for. I thought if I ran
    the code I'd understand but that just confused me more. Is the output
    below correct? If not what should it be?

    multicombs("abracadabra", 0)
    ['']
    multicombs("abracadabra", 1)
    ['a']
    multicombs("abracadabra", 2)
    ['ab', 'br', 'ra']
    multicombs("abracadabra", 3)
    ['abr', 'ara', 'bra']
    multicombs("abracadabra", 4)
    ['abra']
    multicombs("abracadabra", 5)
    ['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad',
    'racbr', 'racra']



    I neglected to mention that multicombs takes a sorted iterable;
    it doesn't work right otherwise. I'd forgotten that because my
    wordlists are guaranteed sorted by the way they're built. Sorry about
    that.


    In my use-case the first argument to multicombs is a tuple of words
    which may contain duplicates, and it produces all unique combinations
    of a certain length of those words, eg:


    list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))


    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
    ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]


    Contrast this with:


    list(itertools.combinations(('cat', 'hat', 'in', 'the', 'the'), 3))


    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'hat', 'the'),
    ('cat', 'in', 'the'), ('cat', 'in', 'the'), ('cat', 'the', 'the'),
    ('hat', 'in', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]


    which produces results which are redundant for my purposes.


    What I'm looking for is a recursive algorithm which does what
    multicombs does (order unimportant) so that I can apply a pruning
    shortcut like the one I used in the recursive cartesian product
    algorithm in my original post.


    Multiset combination algorithms seem pretty thin on the ground out
    there - as I said, I could only find a description of the procedure
    above, no actual code. The ones I did find are non-recursive. I'm
    hoping some combinatorics and/or recursion experts can offer advice.


    Regards,


    --


    John
  • Oscar Benjamin at Nov 25, 2013 at 12:15 pm

    On 21 November 2013 13:01, John O'Hagan wrote:
    In my use-case the first argument to multicombs is a tuple of words
    which may contain duplicates, and it produces all unique combinations
    of a certain length of those words, eg:

    list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
    ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]

    I still don't understand what you're actually doing well enough to
    know whether there is a better general approach to the problem. For
    the specific thing you requested, here is a recursive multiset
    combinations generator. Does it do what you wanted?


    #!/usr/bin/env python


    def multicombs(it, r):
         words = []
         last = None
         for N, word in enumerate(it):
             if word == last:
                 words[-1][1] += 1
             else:
                 words.append([word, 1])
                 last = word
         cumulative = 0
         for n in range(len(words)-1, -1, -1):
             words[n].append(cumulative)
             cumulative += words[n][1]
         return _multicombs((), words, r)


    def _multicombs(prepend, words, r):
         if r == 0:
             yield prepend
             return
         (word, count, rem), *remaining = words
         for k in range(max(r-rem, 0), min(count, r) + 1):
             yield from _multicombs(prepend + (word,) * k, remaining, r-k)


    expected = [
            ('cat', 'hat', 'in'),
            ('cat', 'hat', 'the'),
            ('cat', 'in', 'the'),
            ('cat', 'the', 'the'),
            ('hat', 'in', 'the'),
            ('hat', 'the', 'the'),
            ('in', 'the', 'the'),
         ]


    output = list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))


    assert len(expected) == len(output)
    assert set(expected) == set(output) # The order is different
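
    For the test input above, the preprocessing in multicombs produces

        words == [['cat', 1, 4], ['hat', 1, 3], ['in', 1, 2], ['the', 2, 0]]

    where the second field is each word's multiplicity and the third is how
    many word instances remain after it; that third field is what lets the
    range bound max(r - rem, 0) force enough copies of the later words to be
    taken to fill the remaining r slots.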




    Oscar
  • John O'Hagan at Nov 26, 2013 at 6:18 am

    On Mon, 25 Nov 2013 12:15:15 +0000 Oscar Benjamin wrote:

    On 21 November 2013 13:01, John O'Hagan wrote:
    In my use-case the first argument to multicombs is a tuple of words
    which may contain duplicates, and it produces all unique
    combinations of a certain length of those words, eg:

    list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
    ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]
    I still don't understand what you're actually doing well enough to
    know whether there is a better general approach to the problem. For
    the specific thing you requested, here is a recursive multiset
    combinations generator. Does it do what you wanted?

    #!/usr/bin/env python

    def multicombs(it, r):
        words = []
        last = None
        for N, word in enumerate(it):
            if word == last:
                words[-1][1] += 1
            else:
                words.append([word, 1])
                last = word
        cumulative = 0
        for n in range(len(words)-1, -1, -1):
            words[n].append(cumulative)
            cumulative += words[n][1]
        return _multicombs((), words, r)

    def _multicombs(prepend, words, r):
        if r == 0:
            yield prepend
            return
        (word, count, rem), *remaining = words
        for k in range(max(r-rem, 0), min(count, r) + 1):
            yield from _multicombs(prepend + (word,) * k, remaining, r-k)

    Thanks, that is exactly the type of thing I was after. Although it is
    actually slower as is than a hack like


    def imulticombs(it, n):
         last = ()
         for i in itertools.combinations(it,n):
             if i > last:
                 yield i
                 last = i


    it can be modified to be a lot faster for my use-case like this:


    def _multicombs(prepend, words, r, chkstr):
         """chkstr is the string of remaining availalable characters"""
         if r == 0:
             yield prepend, chkstr
             return
         (word, count, rem), *remaining = words
         newstr = chkstr
         for k in range(max(r-rem, 0), min(count, r) + 1):
             for letter in word * k:
                 if letter in newstr:
                     newstr = newstr.replace(letter, '' , 1)
                 else:
                     return
             yield from _multicombs(prepend + (word,) * k, remaining, r-k,
             newstr)


    By progressively checking against the available characters, it avoids
    fruitless recursion branches. I'm not 100% sure I'm doing this right yet
    but so far it's promising. This is the quickest non-redundant approach
    I've used so far (although strangely, the original product-only version
    which produces a lot of redundant results is still by far the quickest.)


    This is how I'm calling it:


    def cartesian_product(args, instr):
         if not args:
             yield (),instr
             return
         for items, chkstr in cartesian_product(args[:-1], instr):
             #this prevents bothering with anything useless
             words, repeats = args[-1]
             if repeats == 1:
                 for word in words:
                     newstr = chkstr
                     for letter in word:
                         if letter in newstr:
                             newstr = newstr.replace(letter, '' , 1)
                         else:
                             break
                     else:
                         yield items + (word,), newstr
             else:
                 for item, newstr in multicombs(words, repeats, chkstr):
                     yield items + item, newstr


    I'll try to explain better what I'm actually doing. It's quite long,
    but you seem to be wondering, or maybe someone else is!


    It's an old anagram program I wrote for practice a few years ago which
    produces the phrases of dictionary words which together contains the
    same letter-multiset as the input phrase. I would be interested to know
    if you think there is a better general approach. My approach is:


    - make a dictionary of candidate words composed of the same
       sub-alphabet as the input, using their lengths as keys, then


    - for each integer partition of the length of the input, replace each
       partition element with the corresponding value from the dictionary,
       then


    - Take the cartesian product of the resulting list of lists of words,
       rejecting any output which exceeds any of the letter frequencies of
       the input


    I have omitted describing some complex optimisations of the lists, which
    gave only fractional speedups.


    However, I recently discovered that I could get an orders-of-magnitude
    speedup by using my own recursive cartesian product algorithm which
    checks letter frequencies on the fly, rather than using
    itertools.product and testing each output phrase. (This algorithm was
    in my original post).


    For example, take the phrase "break beat". I make a wordlength
    dictionary of sub-alphabet words, using the system dictionary file,
    like this:


      {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
      'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
      'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4: ['abet',
      'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate', 'beak',
      'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake', 'rate',
      'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5: ['abate',
      'baker', 'beret', 'brake', 'break', 'eater', 'karat', 'kebab',
      'taker'], 6: ['aerate', 'beaker', 'beater', 'berate', 'betake',
      'karate', 'rebate', 'retake']}
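
    A hedged sketch of how such a dictionary might be built (the word-file
    path and the function name are assumptions, not taken from the program):

    from collections import Counter

    def build_wordlength_dict(phrase, wordfile='/usr/share/dict/words'):
        # Keep only words whose letters form a sub-multiset of the input
        # phrase, grouped by word length.
        available = Counter(phrase.replace(' ', ''))
        by_length = {}
        with open(wordfile) as f:
            for word in (line.strip().lower() for line in f):
                if word and not (Counter(word) - available):
                    by_length.setdefault(len(word), []).append(word)
        return by_length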




    The input has nine letters, and the partitions of
    nine are:


    9
    1, 8
    2, 7
    1, 1, 7
    3, 6
    ... etc


    and for each partition, I grab the relevant lists from the dictionary,
    eg. the first one for which dictionary entries are available is (3, 6)
    which gives:


      [(['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet',
    'bra', 'ear', 'eat', 'ebb', 'eke', 'era', 'ere', 'eta', 'rat', 'tab',
    'tar', 'tea', 'tee'], 1), (['aerate', 'beaker', 'beater', 'berate',
    'betake', 'karate', 'rebate', 'retake'], 1)]


    The '1' in each tuple above is the number of repeats of the partition
    element.
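
    For completeness, a minimal integer-partition generator of the kind the
    partition step needs (an illustrative sketch; the program's own partition
    code isn't shown in this thread). It yields each partition largest part
    first, e.g. (7, 1, 1) rather than 1, 1, 7:

    def partitions(n, largest=None):
        # Yield the integer partitions of n as non-increasing tuples,
        # e.g. 9 -> (9,), (8, 1), (7, 2), (7, 1, 1), ...
        if largest is None or largest > n:
            largest = n
        if n == 0:
            yield ()
            return
        for first in range(largest, 0, -1):
            for rest in partitions(n - first, first):
                yield (first,) + rest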


    Then I run cartesian_product on the lists, and so on for all the
    partitions. The full result is like:


    bar betake
    bat beaker
    betake bra
    [... many more ...]
    bra eke tab
    a ark be bet
    ark at be be




    The caveat, and subject of my original post, is that if a partition
    element is repeated, we run into complications. In this example, for the
    partition (3, 3, 3), if we take the product of three copies of this
    list:


    ['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet', 'bra',
    'ear', 'eat', 'ebb', 'eke', 'era', 'ere', 'eta', 'rat', 'tab', 'tar',
    'tea', 'tee']


    we get a lot of repetition:


    ['ark', 'ate', 'ebb']
    ['ark', 'ate', 'ebb']
    ['ark', 'ate', 'ebb']
    ['ark', 'ate', 'ebb']
    ['ark', 'ate', 'ebb']
    ['ark', 'ate', 'ebb']
    [... many more ...]
    ['bra', 'eke', 'tab']
    ['bra', 'eke', 'tab']
    ['bra', 'eke', 'tab']
    ['bra', 'eke', 'tab']
    ['bra', 'eke', 'tab']
    ['bra', 'eke', 'tab']


    I sorted this output to demonstrate the problem. For longer phrases,
    the bulk of the output is redundant.


    To fix this we could use itertools.combinations(list,
    number_of_repeats), but this creates another issue: if there are enough
    letters in the input for particular output words to be repeated,
    itertools.combinations will miss the repetitions. In this example, the
    word 'be' can occur twice in an output phrase. To allow for this
    repetition using combinations, we have to insert another copy of it
    into the list of two-letter words. But now we have redundancy again, as
    every combination using 'be' will also be repeated!


    The solution is to use multiset combinations, but doing so to date has
    cost much of the big speedup I got by using a self-pruning recursive
    version of cartesian product. I'm hoping that some more tweaking of an
    algorithm like yours will allow the same great speed improvement I got
    for the cartesian product version.


    For now, although to a lesser extent, I have to choose between
    redundancy and slowness!


    Regards,


    --


    John
  • Oscar Benjamin at Nov 26, 2013 at 10:33 am

    On 26 November 2013 06:18, John O'Hagan wrote:
    Thanks, that is exactly the type of thing I was after. Although it is
    actually slower as is than a hack like

    def imulticombs(it, n):
        last = ()
        for i in itertools.combinations(it,n):
            if i > last:
                yield i
                last = i

    it can be modified to be a lot faster for my use-case like this:

    def _multicombs(prepend, words, r, chkstr):
        """chkstr is the string of remaining available characters"""
        if r == 0:
            yield prepend, chkstr
            return
        (word, count, rem), *remaining = words
        newstr = chkstr
        for k in range(max(r-rem, 0), min(count, r) + 1):
            for letter in word * k:
                if letter in newstr:
                    newstr = newstr.replace(letter, '' , 1)
                else:
                    return
            yield from _multicombs(prepend + (word,) * k, remaining, r-k,
                                   newstr)

    Further micro-optimisations are possible. One is to inline the lower
    recursion levels. For example if the termination condition is checked
    immediately before recursion you can eliminate a redundant generator
    function call.
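
    Applied to the _multicombs variant quoted above, that inlining might look
    something like this (a sketch of the idea, not code from the thread):

    def _multicombs(prepend, words, r, chkstr):
        """chkstr is the string of remaining available characters"""
        if r == 0:                  # only reachable from a top-level call now
            yield prepend, chkstr
            return
        (word, count, rem), *remaining = words
        newstr = chkstr
        for k in range(max(r - rem, 0), min(count, r) + 1):
            for letter in word * k:
                if letter in newstr:
                    newstr = newstr.replace(letter, '', 1)
                else:
                    return
            if k == r:
                # termination checked before recursing, saving one generator
                # call per completed combination
                yield prepend + (word,) * k, newstr
            else:
                yield from _multicombs(prepend + (word,) * k, remaining,
                                       r - k, newstr)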

    By progressively checking against the available characters, it avoids
    fruitless recursion branches. I'm not 100% sure I'm doing this right yet
    but so far it's promising. This is the quickest non-redundant approach
    I've used so far (although strangely, the original product-only version
    which produces a lot of redundant results is still by far the quickest.)

    Bear in mind that the interpreter does a lot of "redundant" things so
    what you might think of as non-redundant code can actually be highly
    redundant when interpreter overhead is accounted for.


    Recently I've been trying out PyPy in my own work and it is *really
    fast*. In fact it is so much faster that I've given up on the idea of
    speed-optimising code for CPython: just use PyPy instead - although it
    is worth optimising highly intensive code for PyPy.

    I'll try to explain better what I'm actually doing. It's quite long,
    but you seem to be wondering, or maybe someone else is!

    [snip]


    Okay I understand now. I was confused because I think of anagrams as
    being words of the same length. I didn't understand how your multiword
    version works but I think I do now. In my own words: You want to
    generate all sequences of English words having the same letters as
    some input string, ignoring the spaces in both the input and output
    strings (so that the number or length of the individual words need not
    be the same).


    [snip]
    For example, take the phrase "break beat". I make a wordlength
    dictionary of sub-alphabet words, using the system dictionary file,
    like this:

    {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
    'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
    'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4: ['abet',
    'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate', 'beak',
    'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake', 'rate',
    'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5: ['abate',
    'baker', 'beret', 'brake', 'break', 'eater', 'karat', 'kebab',
    'taker'], 6: ['aerate', 'beaker', 'beater', 'berate', 'betake',
    'karate', 'rebate', 'retake']}

    Okay, how about the following (pseudoish-code)?


    def word_sequences(prepend, target, subwords):
         # subwords is a list rather than a dict
         for n in range(1, len(subwords)):
             for word in subwords[n]:
                 if equal_in_a_multiset_sense(word, target):
                     yield prepend + ' ' + target
                 elif is_a_submultiset(word, target):
                     recurse_target = multiset_subtract(target, word)
                     subsubwords = list_of_subwords_by_length[:len(recurse_target)]
                     if any(subsubwords):
                         yield from word_sequences(prepend + ' ' + word,
    recurse_target, subsubwords)


    Your general approach seems reasonable to me now that I understand the
    problem. The word_sequences generator above is the solution I would
    probably write at first if faced with the problem. I suspect that your
    current approach using partition sizes is better though.




    Oscar
  • John O'Hagan at Nov 27, 2013 at 10:25 am

    On Tue, 26 Nov 2013 10:33:06 +0000 Oscar Benjamin wrote:

    On 26 November 2013 06:18, John O'Hagan wrote:
    [...]
    def _multicombs(prepend, words, r, chkstr):
        """chkstr is the string of remaining available characters"""
        if r == 0:
            yield prepend, chkstr
            return
        (word, count, rem), *remaining = words
        newstr = chkstr
        for k in range(max(r-rem, 0), min(count, r) + 1):
            for letter in word * k:
                if letter in newstr:
                    newstr = newstr.replace(letter, '' , 1)
                else:
                    return
            yield from _multicombs(prepend + (word,) * k, remaining,
                                   r-k, newstr)
    Further micro-optimisations are possible. One is to inline the lower
    recursion levels. For example if the termination condition is checked
    immediately before recursion you can eliminate a redundant generator
    function call.
    By progressively checking against the available characters, it
    avoids fruitless recursion branches. I'm not 100% sure I'm doing
    this right yet but so far it's promising. This is the quickest
    non-redundant approach I've used so far (although strangely, the
    original product-only version which produces a lot of redundant
    results is still by far the quickest.)
    Bear in mind that the interpreter does a lot of "redundant" things so
    what you might think of as non-redundant code can actually be highly
    redundant when interpreter overhead is accounted for.

    I don't mind redundancy as long as I don't have to see it in the output!

    Recently I've been trying out PyPy in my own work and it is *really
    fast*. In fact it is so much faster that I've given up on the idea of
    speed-optimising code for CPython: just use PyPy instead - although it
    is worth optimising highly intensive code for PyPy.

    I'll have a look at it.

    I'll try to explain better what I'm actually doing. It's quite long,
    but you seem to be wondering, or maybe someone else is!
    [snip]

    Okay I understand now. I was confused because I think of anagrams as
    being words of the same length. I didn't understand how your multiword
    version works but I think I do now. In my own words: You want to
    generate all sequences of English words having the same letters as
    some input string, ignoring the spaces in both the input and output
    strings (so that the number or length of the individual words need not
    be the same).

    [snip]
    For example, take the phrase "break beat". I make a wordlength
    dictionary of sub-alphabet words, using the system dictionary file,
    like this:

    {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
    'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb',
    'eke', 'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4:
    ['abet', 'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate',
    'beak', 'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake',
    'rate', 'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5:
    ['abate', 'baker', 'beret', 'brake', 'break', 'eater', 'karat',
    'kebab', 'taker'], 6: ['aerate', 'beaker', 'beater', 'berate',
    'betake', 'karate', 'rebate', 'retake']}
    Okay, how about the following (pseudoish-code)?

    def word_sequences(prepend, target, subwords):
        # subwords is a list rather than a dict
        for n in range(1, len(subwords)):
            for word in subwords[n]:
                if equal_in_a_multiset_sense(word, target):
                    yield prepend + ' ' + target
                elif is_a_submultiset(word, target):
                    recurse_target = multiset_subtract(target, word)
                    subsubwords = list_of_subwords_by_length[:len(recurse_target)]
                    if any(subsubwords):
                        yield from word_sequences(prepend + ' ' + word,
                                                  recurse_target, subsubwords)

    [...]


    That is very clever. Direct, economical and does effectively the
    same pruning job as my approach without all the combinatorial
    brainstrain. I got it working by changing "for n in range..." to "for
    words in subwords" (otherwise it missed the one-letter words) and the
    first yield needed to be "prepend + ' ' + word".


    I simplified it a bit more to this:


    def word_sequences(prepend, target, subwords):
         """subwords is a list of lists of subwords grouped by length,
             in order of length"""
         for words in subwords:
             for word in words:
                 recurse_target = subset_subtract(target, word)
                 if recurse_target:
                     yield from word_sequences(prepend + ' ' + word,
                             recurse_target, subwords[:len(recurse_target)])
                 elif recurse_target == '':
                     yield prepend + ' ' + word


    with just one function to do the subset testing:


    def subset_subtract(target, word):
         for i in word:
             if i in target:
                 target = target.replace(i, '' ,1)
             else:
                 return
         return target


    Speed-wise it is somewhat faster than any of my non-duplicate-producing
    attempts, but still way slower than the current champion, the redundant
    cartesian product-only version.


    However, because it iterates through all the remaining words on each
    recursion, it seems to produce n! of each unique result, where n is the
    number of words in the result, so this is the new champion as far as
    redundancy is concerned. I'll keep working on it, the totally
    different approach is interesting.


    --


    John
  • Oscar Benjamin at Nov 27, 2013 at 10:50 pm

    On 27 November 2013 10:25, John O'Hagan wrote:
    On Tue, 26 Nov 2013 10:33:06 +0000
    Oscar Benjamin wrote:

    I simplified it a bit more to this:

    def word_sequences(prepend, target, subwords):
        """subwords is a list of lists of subwords grouped by length,
            in order of length"""
        for words in subwords:
            for word in words:
                recurse_target = subset_subtract(target, word)
                if recurse_target:
                    yield from word_sequences(prepend + ' ' + word,
                            recurse_target, subwords[:len(recurse_target)])
                elif recurse_target == '':
                    yield prepend + ' ' + word

    with just one function to do the subset testing:

    def subset_subtract(target, word):
        for i in word:
            if i in target:
                target = target.replace(i, '' ,1)
            else:
                return
        return target

    Speed-wise it is somewhat faster than any of my non-duplicate-producing
    attempts, but still way slower than the current champion, the redundant
    cartesian product-only version.

    However, because it iterates through all the remaining words on each
    recursion, it seems to produce n! of each unique result, where n is the
    number of words in the result, so this is the new champion as far as
    redundancy is concerned. I'll keep working on it, the totally
    different approach is interesting.

    Whoops, I guess this is what happens when you send untested
    (pseudo-)code out. It needs an outer helper function that can do
    something like:


    def word_sequences_top(target, subwords):
         for word in copy(subwords):
             recurse_target = multiset_subtract(target, word)
             yield from word_sequences(word, recurse_target, subwords)
             remove_word_from_list(word, subwords)


    This way we yield all matches involving the word once and then go on
    to all matches that don't include the word.


    Also the partition length logic from your original version can be used
    in word_sequences to prune recursion branches.
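
    For what it's worth, one way to carry that take-it-or-drop-it idea through
    every recursion level is to work from a flat, sorted word list and let each
    level see only the current word and the ones after it (a sketch only, using
    the subset_subtract helper from the previous message; it gives up the
    grouping-by-length pruning):

    def word_sequences(prepend, target, words):
        # words is a flat sorted list of candidate words.  Because each level
        # recurses only on words[i:], every unordered phrase is produced
        # exactly once; starting the slice at i rather than i + 1 still lets
        # a word repeat while enough letters remain.
        for i, word in enumerate(words):
            remainder = subset_subtract(target, word)
            if remainder == '':
                yield prepend + ' ' + word
            elif remainder:
                yield from word_sequences(prepend + ' ' + word, remainder,
                                          words[i:])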




    Oscar
  • James at Nov 22, 2013 at 2:14 am

    On Thursday, November 21, 2013 5:01:15 AM UTC-8, John O'Hagan wrote:
    On Thu, 21 Nov 2013 11:42:49 +0000 Oscar Benjamin wrote:

    On 21 November 2013 06:46, John O'Hagan wrote:
    I found a verbal description of such an algorithm and came up with
    this:

    def multicombs(it, r):
        result = it[:r]
        yield result
        while 1:
            for i in range(-1, -r - 1, -1):
                rep = result[i]
                if rep < it[i]:
                    break
            else:
                break
            for j, n in enumerate(it):
                if n > rep:
                    break
            result = result[:i] + it[j:j - i]
            yield result

    I'm not really sure what it is you're asking for. I thought if I ran
    the code I'd understand but that just confused me more. Is the output
    below correct? If not what should it be?

    multicombs("abracadabra", 0)
    ['']
    multicombs("abracadabra", 1)
    ['a']
    multicombs("abracadabra", 2)
    ['ab', 'br', 'ra']
    multicombs("abracadabra", 3)
    ['abr', 'ara', 'bra']
    multicombs("abracadabra", 4)
    ['abra']
    multicombs("abracadabra", 5)
    ['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad',
    'racbr', 'racra']

    I neglected to mention that multicombs takes a sorted iterable;
    it doesn't work right otherwise. I'd forgotten that because my
    wordlists are guaranteed sorted by the way they're built. Sorry about
    that.

    In my use-case the first argument to multicombs is a tuple of words
    which may contain duplicates, and it produces all unique combinations
    of a certain length of those words, eg:

    list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
    ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]

    Contrast this with:

    list(itertools.combinations(('cat', 'hat', 'in', 'the', 'the'), 3))

    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'hat', 'the'),
    ('cat', 'in', 'the'), ('cat', 'in', 'the'), ('cat', 'the', 'the'),
    ('hat', 'in', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]

    which produces results which are redundant for my purposes.

    What I'm looking for is a recursive algorithm which does what
    multicombs does (order unimportant) so that I can apply a pruning
    shortcut like the one I used in the recursive cartesian product
    algorithm in my original post.

    Multiset combination algorithms seem pretty thin on the ground out
    there - as I said, I could only find a description of the procedure
    above, no actual code. The ones I did find are non-recursive. I'm
    hoping some combinatorics and/or recursion experts can offer advice.

    Regards,

    --

    John

    Perhaps you could convert the following Perl script to Python?


    use Data::Dump qw(dump);
    dump combo([@ARGV], 3);


    sub combo {
        my ($t, $k) = @_;
        my @T = @$t;
        my @R = ();
        my %g = ();
        if ($k == 1) {
            for (@T) {
                push @R, $_ unless $g{$_}++;
            }
        } else {
            while (my $x = shift @T) {
                $p = combo([@T], $k-1);
                for (@{$p}) {
                    my $q = $x.",".$_;
                    push @R, $q unless $g{$q}++;
                }
            }
        }
        [@R];
    }


    $ prog.pl cat hat in the the
    [
       "cat,hat,in",
       "cat,hat,the",
       "cat,in,the",
       "cat,the,the",
       "hat,in,the",
       "hat,the,the",
       "in,the,the",
    ]


    James
  • John O'Hagan at Nov 23, 2013 at 1:07 am

    On Thu, 21 Nov 2013 18:14:41 -0800 (PST) James wrote:


    On Thursday, November 21, 2013 5:01:15 AM UTC-8, John O'Hagan wrote:

    [...]

    On 21 November 2013 06:46, John O'Hagan
    wrote:

    [...]

    def multicombs(it, r):
        result = it[:r]
        yield result
        while 1:
            for i in range(-1, -r - 1, -1):
                rep = result[i]
                if rep < it[i]:
                    break
            else:
                break
            for j, n in enumerate(it):
                if n > rep:
                    break
            result = result[:i] + it[j:j - i]
            yield result

    [...]

    I neglected to mention that multicombs takes a sorted iterable;
    it doesn't work right otherwise. I'd forgotten that because my
    wordlists are guaranteed sorted by the way they're built. Sorry about
    that.



    In my use-case the first argument to multicombs is a tuple of words
    which may contain duplicates, and it produces all unique combinations
    of a certain length of those words, eg:

    list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
    ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
    ('in', 'the', 'the')]

    [...]

    What I'm looking for is a recursive algorithm which does what
    multicombs does (order unimportant) so that I can apply a pruning
    shortcut like the one I used in the recursive cartesian product
    algorithm in my original post.

    Multiset combination algorithms seem pretty thin on the ground out
    there - as I said, I could only find a description of the procedure
    above, no actual code. The ones I did find are non-recursive. I'm
    hoping some combinatorics and/or recursion experts can offer advice.

    [...]

    John
    Perhaps you could convert the following Perl script to Python?

    use Data::Dump qw(dump);
    dump combo([@ARGV], 3);

    sub combo {
        my ($t, $k) = @_;
        my @T = @$t;
        my @R = ();
        my %g = ();
        if ($k == 1) {
            for (@T) {
                push @R, $_ unless $g{$_}++;
            }
        } else {
            while (my $x = shift @T) {
                $p = combo([@T], $k-1);
                for (@{$p}) {
                    my $q = $x.",".$_;
                    push @R, $q unless $g{$q}++;
                }
            }
        }
        [@R];
    }

    $ prog.pl cat hat in the the
    [
    "cat,hat,in",
    "cat,hat,the",
    "cat,in,the",
    "cat,the,the",
    "hat,in,the",
    "hat,the,the",
    "in,the,the",
    ]

    James



    Thanks. Now I just have to learn Perl to understand what that
    does! :)


    Regards,


    --


    John
  • Dan Stromberg at Nov 21, 2013 at 8:59 pm
    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan wrote:

    Short story: the subject says it all, so if you have an answer already,
    fire away. Below is the long story of what I'm using it for, and why I
    think it needs to be recursive. It may even be of more general
    interest in terms of filtering the results of generators.

    I think you probably need permutations rather than combinations.


    Also, I think you'll need to form a word (partitioned off by spaces), and
    then check it against a set containing /usr/share/dict/words before
    recursing for the remainder of the sentence - this should speed things up a
    LOT.
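
    For reference, the lookup set described here can be built directly from
    that file (a two-line sketch; the path is the one named above):

    with open('/usr/share/dict/words') as f:
        WORDS = {line.strip().lower() for line in f}
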
  • John O'Hagan at Nov 23, 2013 at 12:58 am

    On Thu, 21 Nov 2013 12:59:26 -0800 Dan Stromberg wrote:


    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
    wrote:
    Short story: the subject says it all, so if you have an answer
    already, fire away. Below is the long story of what I'm using it
    for, and why I think it needs to be recursive. It may even be of
    more general interest in terms of filtering the results of
    generators.
    I think you probably need permutations rather than combinations.

    Also, I think you'll need to form a word (partitioned off by spaces),
    and then check it against a set containing /usr/share/dict/words
    before recursing for the remainder of the sentence - this should
    speed things up a LOT.

    Thanks for the reply. If I understand you correctly, you are suggesting
    permuting the input _characters_ to form words and then seeing if
    they exist, as opposed to my approach of combining known words and
    seeing if they are anagrams. (Permutations of words would not help find
    anagrams as they merely change the word order). Here is an attempt at
    that:


    def anagrams(partition, input_string):
         """Find anagrams which fit given partition of input string length"""
         if not partition:
             yield (), input_string
             return
         for words, checkstring in anagrams(partition[:-1], input_string):
             for word in itertools.permutations(checkstring, partition[-1]):
                 word = ''.join(word)
                 if word in WORDS: #WORDS is collection of dictionary words
                     newstring = checkstring
                     for l in word:
                         newstring = newstring.replace(l, '' , 1)
                     yield words + (word,), newstring


    There are two problems with this. If there are repeated characters in
    the input, redundant results are produced; a multiset-permutation
    algorithm would fix this. But the main problem is it is incredibly
    slow: on my run-of-the-mill laptop, it chokes on anything longer than
    about 10 characters, spending most of its time rejecting non-words.


    Or have I misunderstood your suggestion?


    Regards,


    --


    John
  • MRAB at Nov 23, 2013 at 4:23 am

    On 23/11/2013 00:58, John O'Hagan wrote:
    On Thu, 21 Nov 2013 12:59:26 -0800
    Dan Stromberg wrote:
    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
    wrote:
    Short story: the subject says it all, so if you have an answer
    already, fire away. Below is the long story of what I'm using it
    for, and why I think it needs to be recursive. It may even be of
    more general interest in terms of filtering the results of
    generators.
    I think you probably need permutations rather than combinations.

    Also, I think you'll need to form a word (partitioned off by spaces),
    and then check it against a set containing /usr/share/dict/words
    before recursing for the remainder of the sentence - this should
    speed things up a LOT.
    Thanks for the reply. If I understand you correctly, you are suggesting
    permuting the input _characters_ to form words and then seeing if
    they exist, as opposed to my approach of combining known words and
    seeing if they are anagrams. (Permutations of words would not help find
    anagrams as they merely change the word order). Here is an attempt at
    that:

    def anagrams(partition, input_string):
        """Find anagrams which fit given partition of input string length"""
        if not partition:
            yield (), input_string
            return
        for words, checkstring in anagrams(partition[:-1], input_string):
            for word in itertools.permutations(checkstring, partition[-1]):
                word = ''.join(word)
                if word in WORDS: #WORDS is collection of dictionary words
                    newstring = checkstring
                    for l in word:
                        newstring = newstring.replace(l, '' , 1)
                    yield words + (word,), newstring

    There are two problems with this. If there are repeated characters in
    the input, redundant results are produced; a multiset-permutation
    algorithm would fix this. But the main problem is it is incredibly
    slow: on my run-of-the-mill laptop, it chokes on anything longer than
    about 10 characters, spending most of its time rejecting non-words.

    Or have I misunderstood your suggestion?
    If you want to know how to get unique permutations, have a look here:


    http://mail.python.org/pipermail/python-ideas/2013-October/023610.html
  • John O'Hagan at Nov 24, 2013 at 2:28 am

    On Sat, 23 Nov 2013 04:23:42 +0000 MRAB wrote:

    On 23/11/2013 00:58, John O'Hagan wrote:
    On Thu, 21 Nov 2013 12:59:26 -0800
    Dan Stromberg wrote:
    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
    wrote:
    Short story: the subject says it all, so if you have an answer
    already, fire away. Below is the long story of what I'm using it
    for, and why I think it needs to be recursive. It may even be of
    more general interest in terms of filtering the results of
    generators.
    I think you probably need permutations rather than combinations.

    Also, I think you'll need to form a word (partitioned off by
    spaces), and then check it against a set
    containing /usr/share/dict/words before recursing for the
    remainder of the sentence - this should speed things up a LOT.
    Thanks for the reply. If I understand you correctly, you are
    suggesting permuting the input _characters_ to form words and then
    seeing if they exist, as opposed to my approach of combining known
    words and seeing if they are anagrams. (Permutations of words would
    not help find anagrams as they merely change the word order). Here
    is an attempt at that:

    def anagrams(partition, input_string):
        """Find anagrams which fit given partition of input string length"""
        if not partition:
            yield (), input_string
            return
        for words, checkstring in anagrams(partition[:-1], input_string):
            for word in itertools.permutations(checkstring, partition[-1]):
                word = ''.join(word)
                if word in WORDS: #WORDS is collection of dictionary words
                    newstring = checkstring
                    for l in word:
                        newstring = newstring.replace(l, '' , 1)
                    yield words + (word,), newstring

    There are two problems with this. If there are repeated characters
    in the input, redundant results are produced; a multiset-permutation
    algorithm would fix this. But the main problem is it is incredibly
    slow: on my run-of-the-mill laptop, it chokes on anything longer
    than about 10 characters, spending most of its time rejecting
    non-words.

    Or have I misunderstood your suggestion?
    If you want to know how to get unique permutations, have a look here:

    http://mail.python.org/pipermail/python-ideas/2013-October/023610.html

    For this particular problem I don't need multiset permutations but
    multiset combinations (see original post). But that thread was a good
    read.


    This is a little OT, but I did need multiset permutations a couple of
    years back for a project involving the generation of musical
    structures. There was zero interest here at the time (fair enough!) and
    I ended up implementing the C++ function "next_permutation" in Python.


    So it was good to read in that thread that there seems to be some
    interest in incorporating multiset combinatorics into itertools
    (including your excellent contribution). IMHO the scepticism there about
    non-exotic use-cases was unjustified. Leaving aside my probably atypical
    uses, it crops up in many ordinary situations dealing with arrangements of
    multiple items of several types where each instance of a type is
    equivalent. Take stock control: when stacking a warehouse shelf it
    doesn't matter which particular box goes where, only how many of each
    size. Or timetabling: if Mrs Smith teaches the same students maths on
    Tuesdays and Thursdays, swapping the classes does nothing.


    The same goes for cyclic and cyclic-multiset ("necklaces")
    combinatorics, where the beginning and end of an arrangement is not
    significant, eg. 24-hour rostering, laying tiles, etc. And musical
    scales.


    Disclaimer: I am far from expert on combinatorics but seem to end up
    using it a lot.


    --


    John
  • Dan Stromberg at Nov 23, 2013 at 6:33 am

    On Fri, Nov 22, 2013 at 4:58 PM, John O'Hagan wrote:


    On Thu, 21 Nov 2013 12:59:26 -0800
    Dan Stromberg wrote:
    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
    wrote:
    Short story: the subject says it all, so if you have an answer
    already, fire away. Below is the long story of what I'm using it
    for, and why I think it needs to be recursive. It may even be of
    more general interest in terms of filtering the results of
    generators.
    I think you probably need permutations rather than combinations.

    Also, I think you'll need to form a word (partitioned off by spaces),
    and then check it against a set containing /usr/share/dict/words
    before recursing for the remainder of the sentence - this should
    speed things up a LOT.
    Thanks for the reply. If I understand you correctly, you are suggesting
    permuting the input _characters_ to form words and then seeing if
    they exist, as opposed to my approach of combining known words and
    seeing if they are anagrams. (Permutations of words would not help find
    anagrams as they merely change the word order). Here is an attempt at
    that:



    You've interpreted me correctly.


    However, I was thinking about this in the back of my mind, and decided it
    would probably be best to inhale /usr/share/dict/words (if on Linux), and
    pull out words of the correct lengths (as separated by the blanks) over
    the correct (possible) alphabet, and permute those, afterward checking if
    they form good anagrams of the original sentence. This would probably be
    much faster, since English isn't that dense of a space.
  • John O'Hagan at Nov 24, 2013 at 2:57 am

    On Fri, 22 Nov 2013 22:33:29 -0800 Dan Stromberg wrote:


    On Fri, Nov 22, 2013 at 4:58 PM, John O'Hagan
    wrote:
    On Thu, 21 Nov 2013 12:59:26 -0800
    Dan Stromberg wrote:
    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
    wrote:
    Short story: the subject says it all, so if you have an answer
    already, fire away. Below is the long story of what I'm using it
    for, and why I think it needs to be recursive. It may even be of
    more general interest in terms of filtering the results of
    generators.
    I think you probably need permutations rather than combinations.

    Also, I think you'll need to form a word (partitioned off by
    spaces), and then check it against a set
    containing /usr/share/dict/words before recursing for the
    remainder of the sentence - this should speed things up a LOT.
    Thanks for the reply. If I understand you correctly, you are
    suggesting permuting the input _characters_ to form words and then
    seeing if they exist, as opposed to my approach of combining known
    words and seeing if they are anagrams. (Permutations of words would
    not help find anagrams as they merely change the word order). Here
    is an attempt at that:

    You've interpreted me correctly.

    However, I was thinking about this in the back of my mind, and
    decided it would probably be best to inhale /usr/share/dict/words (if
    on Linux), and pull out words of the correct lengths (as separated
    by the blanks) over the correct (possible) alphabet, and permute
    those, afterward checking if they form good anagrams of the original
    sentence. This would probably be much faster, since English isn't
    that dense of a space.

    If you look back at my original question, you'll see that's pretty much
    what I've done. I didn't spell out all the details, but I made a
    dictionary of wordlength keys containing lists of all dictionary words
    of that length made of the correct sub-alphabet.


    But to recap the problem: to produce non-redundant anagram phrases, I
    need the cartesian product (not permutations) of these lists if each is
    unique, but with a subroutine producing multiset combinations if a list
    is repeated, i.e., if a particular length is called for more than once.
    The version I have so far is correct but substantially slower than the
    product-only one, which just goes ahead and produces all the redundant
    results. This seems counter-intuitive, and my theory is that this is
    because I am unable to "prune" the non-recursive combination algorithm
    I currently have.


    Regards,


    --


    John
  • Dan Stromberg at Dec 6, 2013 at 4:05 am
    On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan wrote:

    Short story: the subject says it all, so if you have an answer already,
    fire away. Below is the long story of what I'm using it for, and why I
    think it needs to be recursive. It may even be of more general
    interest in terms of filtering the results of generators.


    Any suggestions?
    I've updated my code at
    http://stromberg.dnsalias.org/svn/anagrams/trunk/; it's multiword now.
    It can blast through the word "punishment" in 4 seconds, but for "The
    public art galleries" it ate about 7 gigabytes of RAM and ran for more
    than a day before I killed it.


    I believe it's an exponential problem. Parallelization might help, but
    it'd probably take a lot of RAM that way. Maybe the RAM use would be
    better with CPython, but it's much faster with PyPy; I did most of my
    testing with PyPy 2.2.
