FAQ

From: Andrew Dalke [mailto:adalke at mindspring.com]

> Python 2.3 offers at least two new ways to do this. The first is
> with the new 'Set' class
>
> # Let 'data' be a list or iterable object
> import sets
> subset = list(sets.Set(data))
> subset.sort()
> # Use 'subset' as needed

Using sets is definitely the Right Way (TM) to do it. This is one of the
primary use cases for sets (*everyone* wants to do this).

> (The 'list()' is needed because that's the only way to get elements
> out from a Set. It provides an __iter__ but no 'tolist()' method.)

And this is the canonical way to transform any iterable to a list. Why
should every class that you want to transform to a list have to supply a
`tolist` method? Why not a `totuple` method?

> The other is with the new 'fromkeys' class, which constructs

Actually, dictionary class (static?) method.

> # Let 'data' be a list or iterable object
> subset = dict.fromkeys(data).keys()
> subset.sort()
> # Use 'subset' as needed

This, whilst slightly shorter (due to no import - which in future versions
will be going away anyway), is definitely *not* the Right Way (TM) to do it.
It is likely to confuse people.

> For a real-life example, suppose you want to get unique lines
> from the stdin input stream, sort them, and dump the results
> to stdout. Here's how to do it in Python 2.3
>
> import sys
> unique_lines = dict.fromkeys(sys.stdin).keys()
> unique_lines.sort()
> sys.stdout.writelines(unique_lines)

Nope - this is better done as:

import sets
import sys

unique_lines = list(sets.Set(sys.stdin))
unique_lines.sort()
sys.stdout.writelines(unique_lines)

It says explicitly what you are doing - creating a set of unique *values*
(since that is the definition of a set), then sorting the result.

Tim Delaney
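
For readers on later Python versions, here is a minimal sketch of the same "unique, then sorted" idiom. It assumes Python 3, where `set` is a built-in type and `sorted()` exists, so it is not the 2.3 code discussed in this post. The helper name `unique_sorted` and the `io.StringIO` stand-in for stdin are illustrative assumptions.

```python
import io

def unique_sorted(iterable):
    """Return the distinct items of `iterable` as a sorted list."""
    # set() drops duplicates; sorted() returns a new, ordered list.
    return sorted(set(iterable))

# io.StringIO stands in for sys.stdin so the example is self-contained.
fake_stdin = io.StringIO("pear\napple\npear\nfig\napple\n")
unique_lines = unique_sorted(fake_stdin)

# Each line keeps its trailing newline, so joining reproduces the lines.
print("".join(unique_lines), end="")
```

With a real stream, `unique_sorted(sys.stdin)` followed by `sys.stdout.writelines(...)` would play the same role as the snippet quoted above.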


  • Just at Jan 6, 2003 at 8:41 am
    In article <mailman.1041832210.3104.python-list at python.org>,
    "Delaney, Timothy" wrote:
    >> The other is with the new 'fromkeys' class, which constructs
    >
    > Actually, dictionary class (static?) method.

    It's a class method.

    >> # Let 'data' be a list or iterable object
    >> subset = dict.fromkeys(data).keys()
    >> subset.sort()
    >> # Use 'subset' as needed
    >
    > This, whilst slightly shorter (due to no import - which in future versions
    > will be going away anyway), is definitely *not* the Right Way (TM) to do it.
    > It is likely to confuse people.

    I don't know. It's currently (apparently) a lot faster than using the
    sets module. With the fromkeys() addition dicts are quite comfortable as
    poor-man's sets.

    Just
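
The two points in this reply (that `fromkeys` is a class method, and that a dict can stand in for a set) can be sketched as follows. This is an illustration under Python 3 rather than code from the thread; the subclass name `Tally` is hypothetical.

```python
data = [5, 3, 1, 2, 5, 4, 3]

# fromkeys is called on the dict class itself, not on an instance;
# every key maps to None unless a second argument is given.
seen = dict.fromkeys(data)

# Membership tests and sizes work as they would for a set.
print(3 in seen)     # True
print(42 in seen)    # False
print(len(seen))     # 5 distinct keys

# A subclass gets an instance of itself back, which is what makes
# fromkeys a class method rather than a static method.
class Tally(dict):
    pass

print(type(Tally.fromkeys(data)) is Tally)   # True
```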
  • Andrew Dalke at Jan 6, 2003 at 10:23 am

    Delaney, Timothy wrote:
    > Using sets is definitely the Right Way (TM) to do it. This is one of the
    > primary use cases for sets (*everyone* wants to do this).

    - the performance of Sets is slower than that of a simple dict
    (because, after all, Sets are built on top of a dict but with
    extra overhead). I just tested it -- fromkeys is about 20%
    faster than using Set

    >>> import time, sets, random
    >>> data = [random.randrange(1000000) for i in range(2000000)]
    >>> def do_set():
    ...     return len(sets.Set(data))
    ...
    >>> def do_dict():
    ...     return len(dict.fromkeys(data).keys())
    ...
    >>> t1=time.clock();do_set();t2=time.clock()
    865149
    >>> t2-t1
    2.9100000000000001
    >>> t1=time.clock();do_dict();t2=time.clock()
    865149
    >>> t2-t1
    2.3299999999999983
    >>> 2.33/2.9
    0.80344827586206902
    >>>


    - there's the extra import, which is a bit tedious if you don't
    need the power of a Set

    - using dicts is a basic part of using Python, so the step to using
    a different way to construct a dict is easier than thinking
    about using a different class
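
The timing comparison in the first point can be repeated on a current interpreter with something like the sketch below. It assumes Python 3, so it uses the built-in `set` in place of `sets.Set` and the `timeit` module in place of `time.clock()` (which was removed in Python 3.8). The data size is reduced to keep the run short, and no particular ratio should be expected to match the 20% measured above.

```python
import random
import timeit

# Smaller than the 2,000,000 items in the session above, to keep the run short.
data = [random.randrange(100000) for _ in range(200000)]

def count_with_set():
    return len(set(data))

def count_with_dict():
    return len(dict.fromkeys(data))

# Both approaches must agree on the number of distinct values.
assert count_with_set() == count_with_dict()

# timeit runs each callable several times; keep the best of three repeats.
set_time = min(timeit.repeat(count_with_set, number=5, repeat=3))
dict_time = min(timeit.repeat(count_with_dict, number=5, repeat=3))
print("set:  %.4f s" % set_time)
print("dict: %.4f s" % dict_time)
```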

    >> (The 'list()' is needed because that's the only way to get elements
    >> out from a Set. It provides an __iter__ but no 'tolist()' method.)
    >
    > And this is the canonical way to transform any iterable to a list. Why
    > should every class that you want to transform to a list have to supply a
    > `tolist` method? Why not a `totuple` method?

    I put that there as a reminder for fogies like me who even now have
    spent more time on pre-2.x versions of Python than post-2.x versions.
    When I started back in the 1.3 days, there were modules like 'array',
    which *did* have a 'tolist' method, and that was the proper way to
    do it.

    >>> import array
    >>> x=array.array("c", "AndreW")
    >>> x
    array('c', 'AndreW')
    >>> x.tolist()
    ['A', 'n', 'd', 'r', 'e', 'W']
    >>>

    The implication that there should be one was not my intention, though
    my wording in that regard was unfortunate.

    This is also a case where it isn't obvious how to get data from a
    container. Every other container spells it through [] or through
    a method name which *doesn't* start with a "_". So people just
    starting with a Set might not know what to look for.

    It would be nice if the example code showed iterating data from
    a Set...
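
As a rough illustration of what such an example might look like, the sketch below iterates over a set directly and shows that `list()` and `tuple()` work on anything that provides `__iter__`. It assumes Python 3 and the built-in `set`; the `Countdown` class is a hypothetical container invented for the example.

```python
class Countdown:
    """Hypothetical container that only defines __iter__."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Yield start, start - 1, ..., 1
        return iter(range(self.start, 0, -1))

# list() and tuple() accept any iterable, so no tolist()/totuple() is needed.
print(list(Countdown(3)))    # [3, 2, 1]
print(tuple(Countdown(3)))   # (3, 2, 1)

# A set is consumed the same way: loop over it directly.
colours = set(["red", "green", "red", "blue"])
for colour in sorted(colours):   # sorted() only to make the order predictable
    print(colour)
```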

    >> The other is with the new 'fromkeys' class, which constructs
    >
    > Actually, dictionary class (static?) method.

    Yep. Meant to say "class method". Just didn't get through my
    fingers.

    > This, whilst slightly shorter (due to no import - which in future versions
    > will be going away anyway), is definitely *not* the Right Way (TM) to do it.
    > It is likely to confuse people.

    It will? Given how much pre-2.3 code uses the "build a dict then
    get the keys" to get the unique values in a data set, it's an idiom
    that any intermediate Python programmer should understand and expect
    to understand.

    As for beginning Python programmers, I can't put myself into their
    shoes.

    My feeling for now is that I'll use "Set" when I want to do set
    manipulations, like

    set1 = { identifiers matching query 1}
    set2 = { identifiers matching query 2}
    total = set1 + set2

    and not use it for getting unique values.
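
The set manipulations sketched in pseudocode above might look like this in runnable form. This assumes Python 3 and its built-in `set`, which spells union as `|` rather than `+`; the identifier strings are made up for the example.

```python
# Hypothetical identifiers returned by two queries.
set1 = set(["P12345", "Q67890", "A11111"])
set2 = set(["Q67890", "B22222"])

total = set1 | set2        # union: identifiers matching either query
both = set1 & set2         # intersection: identifiers matching both queries
only_first = set1 - set2   # difference: matching query 1 but not query 2

print(sorted(total))
print(sorted(both))
print(sorted(only_first))
```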


    Andrew
    dalke at dalkescientific.com
  • Peter Abel at Jan 9, 2003 at 9:34 pm
    Andrew Dalke <adalke at mindspring.com> wrote in message news:<avbltt$kb4$1 at slb0.atl.mindspring.net>...

    Hi Andrew,
    I'm using Python 2.2.2, and without the 2.3 features the
    following seems to work fine:

    >>> input=[5, 3, 1, 2, 5, 4, 3, 4, 1, 1, 5, 4, 5, 1, 4, 3, 2, 2, 4, 1]
    >>> output=dict( zip(input ,range(len(input)) ) ).keys()
    >>> output
    [1, 2, 3, 4, 5]

    What I'm wondering about is the fact that in several examples,
    even without the statement:
    output.sort()
    I always got a sorted output.

    Cheers Peter
  • Skip Montanaro at Jan 9, 2003 at 9:45 pm

    >>> input=[5, 3, 1, 2, 5, 4, 3, 4, 1, 1, 5, 4, 5, 1, 4, 3, 2, 2, 4, 1]
    >>> output=dict( zip(input ,range(len(input)) ) ).keys()
    >>> output
    [1, 2, 3, 4, 5]

    Peter> I always got a sorted output.

    You just got lucky. Perturb your input list a tad:

    >>> input=[5, -1, 1, 2, 5, 4, 3, 4, 1, 1, 5, 4, 5, 1, 4, 3, 2, 2, 4, 1]
    >>> output=dict( zip(input ,range(len(input)) ) ).keys()
    >>> output
    [1, 2, 3, 4, 5, -1]

    Skip
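
The same caution applies on current interpreters, although the symptom differs. A small sketch, assuming Python 3.7 or later (where dicts preserve insertion order) and using `data` instead of `input` to avoid shadowing the built-in:

```python
data = [5, -1, 1, 2, 5, 4, 3, 4, 1, 1, 5, 4, 5, 1, 4, 3, 2, 2, 4, 1]

# On Python 3.7+ a dict remembers insertion order, so the keys come back
# in first-seen order, which is still not sorted order.
first_seen = list(dict.fromkeys(data))
print(first_seen)            # [5, -1, 1, 2, 4, 3]

# Sorting explicitly is the only way to get a guaranteed ordering.
print(sorted(first_seen))    # [-1, 1, 2, 3, 4, 5]
```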
  • Andrew Dalke at Jan 9, 2003 at 10:07 pm

    Peter Abel wrote:
    > I'm using Python 2.2.2, and without the 2.3 features the
    > following seems to work fine:
    >
    > >>> input=[5, 3, 1, 2, 5, 4, 3, 4, 1, 1, 5, 4, 5, 1, 4, 3, 2, 2, 4, 1]
    > >>> output=dict( zip(input ,range(len(input)) ) ).keys()
    > >>> output
    > [1, 2, 3, 4, 5]
    >
    > What I'm wondering about is the fact that in several examples
    > even without the [sort ...] I always got a sorted output.

    My guess is it's because the hash function used for integers is
    the integer itself, so the hash values are 1, 2, 3, 4, 5. These then
    get put into slots 1, 2, 3, 4, and 5, which are then visited in that
    order to get the values.

    Put a -1 in the list. You'll get

    [1, 2, 3, 4, 5, -1]

    I think the initial hash size is about 8 elements

    >>> for i in range(10000):
    ...     assert dict(zip(range(i, i+5), range(i, i+5))).keys() == range(i, i+5)
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in ?
    AssertionError
    >>> i
    4
    >>> for i in range(10000):
    ...     assert dict(zip(range(i, i+2), range(i, i+2))).keys() == range(i, i+2)
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in ?
    AssertionError
    >>> i
    7
    >>>

    that is,

    >>> dict(zip((7, 9), (0, 0))).keys()
    [9, 7]
    >>> dict(zip((7, 8), (0, 0))).keys()
    [8, 7]
    >>> dict(zip((6, 7), (0, 0))).keys()
    [6, 7]
    >>>
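
The guess above (integers hash to themselves and land in one of about 8 slots) can be checked against a toy model. The sketch below is not CPython's dict implementation: the function name `toy_key_order`, the linear probing, and the fixed table size are simplifying assumptions. It also relies on CPython's behaviour that `hash(-1)` is `-2`, and it models the old unordered dicts, not the insertion-ordered dicts of Python 3.7+.

```python
def toy_key_order(keys, table_size=8):
    """Return keys in the order a simple open-addressing table would visit them.

    Simplified model: slot = hash(key) & (table_size - 1), with linear probing
    on collisions. Real CPython uses a different probe sequence and resizes.
    """
    if len(set(keys)) > table_size:
        raise ValueError("toy table does not resize; too many distinct keys")
    slots = [None] * table_size
    used = [False] * table_size
    mask = table_size - 1
    for key in keys:
        i = hash(key) & mask
        # Walk forward until we find the key itself or a free slot.
        while used[i] and slots[i] != key:
            i = (i + 1) & mask
        slots[i] = key
        used[i] = True
    # Iteration visits slots 0, 1, 2, ... and skips the empty ones.
    return [slots[i] for i in range(table_size) if used[i]]

print(toy_key_order((7, 9)))               # [9, 7]
print(toy_key_order((7, 8)))               # [8, 7]
print(toy_key_order((6, 7)))               # [6, 7]
print(toy_key_order([5, -1, 1, 2, 4, 3]))  # [1, 2, 3, 4, 5, -1]
```

Under these assumptions the model reproduces the orderings shown in the sessions above: 8 and 9 wrap around to slots 0 and 1, and -1 lands in slot 6, after 5.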


    BTW, this would have been a bit faster and used less memory
    output=dict( zip(input , (0,)* len(input)) ).keys()
           |-----------------------------------|
           same as dict.fromkeys(input, 0) in Python 2.3

    Andrew
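
The equivalence mentioned at the end of this post can be checked directly. A short sketch, assuming Python 3 and using `data` in place of `input`:

```python
data = [5, 3, 1, 2, 5, 4, 3]

# Three ways to build the same "keys only" dict.
via_range = dict(zip(data, range(len(data))))   # values are throwaway indexes
via_zeros = dict(zip(data, (0,) * len(data)))   # values are all 0
via_fromkeys = dict.fromkeys(data, 0)           # no temporary tuple or zip needed

# The values differ between the first and the others, but the keys do not.
print(sorted(via_range) == sorted(via_zeros) == sorted(via_fromkeys))  # True
print(via_zeros == via_fromkeys)                                        # True
```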



Discussion Overview
group: python-list @
category: python
posted: Jan 6, '03 at 5:49a
active: Jan 9, '03 at 10:07p
posts: 6
users: 5
website: python.org
