Ok, I reached out to Theo de Raadt to talk to him about what he was suggesting
without Guido having to play messenger and forward fragments of the email
conversation. I'm starting a new thread because this email is rather long, and
I'm hoping to divorce it a bit from the back and forth about a proposal that
wasn't exactly what Theo was suggesting that is being discussed in the other
thread.


Essentially, there are three basic types of uses of random (the concept, not
the module). Those are:


1. People/usecases who absolutely need deterministic output given a seed and
   for whom security properties don't matter.
2. People/usecases who absolutely need a cryptographically random output and
   for whom having a deterministic output is a downside.
3. People/usecases that fall somewhere in between where it may or may not be
   security sensitive or it may not be known if it's security sensitive.


The people in group #1 are currently, in the Python standard library, best
served by the MT (Mersenne Twister) random source, as it provides exactly the
kind of determinism they need. The people in group #2 are currently, in the
Python standard library, best served by os.urandom (either directly or via
random.SystemRandom).
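
As a rough illustration of groups #1 and #2 using only what the standard
library already offers (a sketch, not part of Theo's suggestion): a seeded
random.Random replays the same sequence, whereas random.SystemRandom draws
from os.urandom and cannot be replayed.

```python
import os
import random

# Group #1: a seeded Mersenne Twister is fully reproducible.
a = random.Random(12345)
b = random.Random(12345)
seq_a = [a.randint(0, 99) for _ in range(5)]
seq_b = [b.randint(0, 99) for _ in range(5)]
print(seq_a == seq_b)  # True: same seed, same sequence

# Group #2: SystemRandom reads from the OS (os.urandom); it cannot be
# seeded or replayed.
sysrand = random.SystemRandom()
token_bits = sysrand.getrandbits(128)
raw = os.urandom(16)  # the same source, used directly
print(len(raw))
```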


However, the third case is the one that Theo's suggestion is attempting to
solve. In the current landscape, the security minded folks will tell these
people to use os.urandom/random.SystemRandom, and the performance minded or
otherwise less security minded folks will likely tell them to just use
random.py, leaving these people with a random that is not cryptographically
safe.


The question then is, does it matter if #3 are using a cryptographically safe
source of randomness? The answer is obviously that we don't know, and it's
possible that the user doesn't know. In these cases it's typically best if we
default to the more secure option and expect people to opt in to insecurity.


In the case of randomness, a lot of languages (Python included) don't do that
and instead opt to pick the more performant option first, often with the
argument (as seen in the other thread) that if people need a cryptographically
secure source of random, they'll know how to look for it, and if they don't
know how to look for it, then it's likely they'll have some other security
problem. I think (and I believe Theo thinks) this sort of thinking is short
sighted. Take the example of a web application: it's going to need session
identifiers to put into a cookie, you'll want these to be random, and it's not
obvious on the tin for a non-expert that you can't just use the module level
functions in the random module to do this. Other examples are generating API
keys or a password.
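
To make the session identifier case concrete, here is a minimal sketch that
goes straight to os.urandom. The function name new_session_id and the 32-byte
default are illustrative assumptions, not something from the thread.

```python
import binascii
import os

def new_session_id(nbytes=32):
    """Return a hex session identifier built from OS-provided randomness."""
    # os.urandom never touches the Mersenne Twister state in random.py.
    return binascii.hexlify(os.urandom(nbytes)).decode('ascii')

print(new_session_id())
```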


Looking on google, the first result for "python random password" is
StackOverflow which suggests:


    ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))


However, it was later edited to, after that, include:


    ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(N))


So it wasn't obvious to the person who answered that question that the random
module's module scoped functions were not appropriate for this use. It appears
that the original answer lasted for roughly 4 years before it was corrected,
so who knows how many people used that in those 4 years.


The second result has someone asking if there is a better way to generate a
random password in Python than:


    import os, random, string

    length = 13
    chars = string.ascii_letters + string.digits + '!@#$%^&*()'
    random.seed = (os.urandom(1024))

    print ''.join(random.choice(chars) for i in range(length))


This person obviously knew that os.urandom existed and that he should use it,
but failed to correctly identify that the random module's module scoped
functions were not what he wanted to use here.
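
For reference, a sketch of what that second snippet was presumably reaching
for. Note that `random.seed = (os.urandom(1024))` only rebinds the name
random.seed to a bytes object; it never calls random.seed(...), so the
module-level generator is not reseeded at all. The function name
make_password is an illustrative assumption.

```python
import random
import string

def make_password(length=13):
    """Build a password from the OS CSPRNG via random.SystemRandom."""
    chars = string.ascii_letters + string.digits + '!@#$%^&*()'
    # SystemRandom is backed by os.urandom, so no seeding step is needed.
    rng = random.SystemRandom()
    return ''.join(rng.choice(chars) for _ in range(length))

print(make_password())
```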


The third result has this code:


    import string
    import random

    def randompassword():
        chars=string.ascii_uppercase + string.ascii_lowercase + string.digits
        size=8
        return ''.join(random.choice(chars) for x in range(size,12))
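
A side observation on this third snippet, separate from the choice of
generator: range(size, 12) with size=8 iterates over 8, 9, 10 and 11, so the
result is always four characters long. Assuming the intent was a length
between 8 and 12 (a guess; the snippet doesn't say), a sketch using
SystemRandom might look like this, with min_len and max_len as illustrative
parameter names:

```python
import random
import string

def randompassword(min_len=8, max_len=12):
    chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
    rng = random.SystemRandom()
    size = rng.randint(min_len, max_len)  # inclusive on both ends
    return ''.join(rng.choice(chars) for _ in range(size))

print(len(list(range(8, 12))))  # 4: what the original loop iterates over
print(randompassword())
```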


I'm not going to keep pasting snippets, but going through the results it is
clear that in the bulk of cases, this search turns up code snippets that
suggest there is likely to be a lot of code out there that is unknowingly using
the random module in a very insecure way. I think this is a failing of the
random.py module to provide an API that guides users to be safe, which was
attempted to be papered over by adding a warning to the documentation. However,
as has been said before, you can't solve a UX problem with documentation.


Then we come to why we might not want to provide a safe random by default for
the folks in the #3 group. As we've seen in the other thread, this basically
boils down to the fact that a lot of users don't care about the security
properties and just want a fast random-esque value. This particular case is
made stronger by the fact that there is a lot of code out there using Python's
random module in a completely safe way that would regress in a meaningful way
if the random module slowed down.


The fact that speed is the primary reason not to give people in #3 a
cryptographically secure source of random by default is where we come back to
the meat of Theo's suggestion. His claim is that invoking os.urandom through
any of the interfaces imposes a performance penalty because it has to round
trip through the kernel crypto subsystem for every request. His suggestion is
essentially that we provide an interface to a modern, good, userland
cryptographically secure source of random that is running within the same
process as Python itself. One such example of this is the arc4random function
(which doesn't actually provide ARC4 on OpenBSD, it provides ChaCha; it's not
tied to one specific algorithm), which comes from libc on many platforms.
According to Theo, modern userland CSPRNGs can create random bytes faster than
memcpy, which eliminates the argument of speed for why a CSPRNG shouldn't be
the "default" source of randomness.
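
The per-call cost being discussed can be measured roughly as follows. This is
only a sketch: it compares the MT generator against SystemRandom (the
os.urandom path), since as far as I know the standard library does not expose
a userland CSPRNG such as arc4random, and the numbers will vary by platform
and Python version. The helper name time_calls is illustrative.

```python
import random
import timeit

def time_calls(func, number=20000):
    """Return the seconds taken for `number` calls of func()."""
    return timeit.timeit(func, number=number)

mt = random.Random()
sysrand = random.SystemRandom()

mt_secs = time_calls(mt.random)
sys_secs = time_calls(sysrand.random)
print("MT random():           %.4f s" % mt_secs)
print("SystemRandom random(): %.4f s" % sys_secs)
```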


Thus the proposal is essentially:


* Provide an API to access a modern userland CSPRNG.
* Provide an implementation of random.SomeKindOfRandom that utilizes this.
* Move the MT based implementation of the random module to
  random.DeterministicRandom.
* Deprecate the module scoped functions, instructing people to use the new
  random.SomeKindOfRandom unless they need deterministic random, in which case
  use random.DeterministicRandom.
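
From the caller's side, the proposal above could look roughly like this.
DeterministicRandom and SomeKindOfRandom are the placeholder names from the
list and do not exist in the random module; SystemRandom stands in for the
userland CSPRNG purely so the sketch runs today.

```python
import random

# Hypothetical names from the proposal; neither exists in the random module.
DeterministicRandom = random.Random  # the MT-based generator, renamed

class SomeKindOfRandom(random.SystemRandom):
    """Stand-in only: the proposal would back this with a userland CSPRNG
    (e.g. arc4random); SystemRandom is used here because it is what the
    standard library has today."""

# Callers who need reproducibility say so explicitly...
same = DeterministicRandom(42).random() == DeterministicRandom(42).random()
print(same)  # True

# ...and everyone else gets the safe generator.
rng = SomeKindOfRandom()
print(rng.choice('abcdef'))
```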


This can of course be tweaked one way or the other, but that's the general idea
translated into something actionable for Python. I'm not sure exactly how I
feel about it, but I certainly do think that the current situation is confusing
to end users and leaving them in an insecure state, and that at a minimum we
should move MT to something like random.DeterministicRandom and deprecate the
module scoped functions, because it seems obvious to me that the idea of a
"default" random function that isn't safe is a footgun for users.


As an additional consideration, there are security experts who believe that
userland CSPRNGs should not be used at all. One of those is Thomas Ptacek who
wrote a blog post [1] on the subject. In this, Thomas makes the case that a
userland CSPRNG pretty much always depends on the cryptographic security of
the system random, but that it itself may be broken which means you're adding
a second, single point of failure where a mistake can cause you to get
non-random data out of the system. I had asked Theo about this, and he stated
that he disagreed with Thomas about never using a userland CSPRNG and in his
opinion that blog post was mostly warning people away from using something like
MT in the userland and away from /dev/random (which is often the cause of
people reaching for MT because /dev/random blocks which makes programs even
slower).


It seems to boil down to a few questions. Do we want to try to protect users by
default, or at least make it more obvious in the API which one they want to use
(I think yes)? If so, do we think that /dev/urandom is "fast enough" for most
people in group #3? If not, do we agree with Theo that a modern userland CSPRNG
is safe enough to use, or do we agree with Thomas that it's not? And if we
think that it is, do we use arc4random, and what do we do on systems that don't
have a modern userland CSPRNG in their libc?


[1] http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


  • Andrew Barnert at Sep 10, 2015 at 12:19 am
    Deprecating the module-level functions has one problem for backward compatibility: if you're using random across multiple modules, changing them all from this:


         import random


    ... to this:


         from random import DeterministicRandom
         random = DeterministicRandom()


    ... gives a separate MT for each module. You can work around that by, e.g., providing your own myrandom.py that does that and then using "from myrandom import random" everywhere, or by stashing a random_inst inside the random module or builtins or something and only creating it if it doesn't exist, etc., but all of these are things that people will rightly complain about.
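
A small sketch of the "separate MT for each module" effect described above.
DeterministicRandom is the hypothetical rename of random.Random, and the two
variables stand in for instances created in two different modules.

```python
import random

# Hypothetical rename from the proposal.
DeterministicRandom = random.Random

# Each module doing "random = DeterministicRandom()" gets its own generator.
module_a_rng = DeterministicRandom()
module_b_rng = DeterministicRandom()

# Seeding one does not affect the other, unlike random.seed() today, which
# seeds the single hidden instance shared by every importer.
before = module_b_rng.getstate()
module_a_rng.seed(1)
print(module_b_rng.getstate() == before)  # True: b's state is untouched
```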


    One possible solution is to make DeterministicRandom a module instead of a class, and move all the module-level functions there, so people can just change their import to "from random import DeterministicRandom as random". (Or, alternatively, give it classmethods that create a singleton just like the module global.)


    For people who decide they want to switch to SystemRandom, I don't think it's as much of a problem, as they probably won't care that they have a separate instance in each module. (And I don't think there's any security problem with using multiple instances, but I haven't thought it through...) So, the change is probably only needed in DeterministicRandom.


    There are hopefully better solutions than that. But I think some solution is needed. People who have existing code (or textbooks, etc.) that do things the "wrong" way and get a DeprecationWarning should be able to easily figure out how to make their code correct.


  • Random832 at Sep 10, 2015 at 1:25 am
    Andrew Barnert via Python-ideas
    <python-ideas@python.org> writes:

    You can work around that by,
    e.g., providing your own myrandom.py that does that and then using
    "from myrandom import random" everywhere, or by stashing a random_inst
    inside the random module or builtins or something and only creating it
    if it doesn't exist, etc., but all of these are things that people
    will rightly complain about.

    Of course, this brings to mind the fact that there's *already* an
    instance stashed inside the random module.


    At that point, you might as well just keep the module-level functions,
    and rewrite them to be able to pick up on it if you replace _inst
    (perhaps suitably renamed as it would be a public variable) with an
    instance of a different class.


    Proof-of-concept implementation:


    # Meant to live inside random.py, where Random is already defined.
    class _method:
        def __init__(self, name):
            self.__name__ = name
        def __call__(self, *args, **kwargs):
            # Look up _inst at call time, so a replaced instance is picked up.
            return getattr(_inst, self.__name__)(*args, **kwargs)
        def __repr__(self):
            return "<random method wrapper " + repr(self.__name__) + ">"


    _inst = Random()
    seed = _method('seed')
    random = _method('random')
    ...etc...
  • Andrew Barnert at Sep 10, 2015 at 1:50 am

    On Sep 9, 2015, at 18:25, Random832 wrote:
    Of course, this brings to mind the fact that there's *already* an
    instance stashed inside the random module.

    At that point, you might as well just keep the module-level functions,
    and rewrite them to be able to pick up on it if you replace _inst
    (perhaps suitably renamed as it would be a public variable) with an
    instance of a different class.

    The whole point is to make people using the top-level functions see a DeprecationWarning that leads them to make a choice between SystemRandom and DeterministicRandom. Just making inst public (and dynamically switchable) doesn't do that, so it doesn't solve anything.


    However, it seems like there's a way to extend it to do that:


    First, rename Random to DeterministicRandom. Then, add a subclass called Random that issues a DeprecationWarning whenever its methods are called. Then preinitialize inst to Random(), just as we already do. Existing code will work, but with a warning. And the text of that warning or the help it leads to or the obvious google result or whatever can just suggest "add random.inst = random.DeterministicRandom() or random.inst = random.SystemRandom() at the start of your program". That has most of the benefit of deprecating the top-level functions, without the cost of the solution being non-obvious (and the most obvious solution being wrong for some use cases).
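
A minimal sketch of the warning subclass idea, assuming the hypothetical
DeterministicRandom rename. The class is called WarningRandom here only to
avoid clashing with the real random.Random, and it wraps a single method for
brevity; a full version would have to cover every public method.

```python
import random
import warnings

DeterministicRandom = random.Random  # hypothetical rename from the proposal

class WarningRandom(DeterministicRandom):
    """Hypothetical default class: behaves like MT but warns on use."""

    def random(self):
        warnings.warn(
            "choose random.DeterministicRandom() or random.SystemRandom() "
            "explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return super().random()

# DeprecationWarning is often hidden by default, so record it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = WarningRandom().random()

print(value, caught[0].category.__name__)
```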


    Of course it adds the cost of making the module slower, and also more complex. Maybe a better solution would be to add a random.set_default_instance function that replaced all of the top-level functions with bound methods of the instance (just like what's already done at startup in random.py)? That's simple, and doesn't slow down anything, and it seems like it makes it more clear what you're doing than setting random.inst.
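
One way the set_default_instance idea could be sketched. This is not the
patch from the thread: the helper takes the target module as an argument and
is demonstrated on a throwaway module object, so the real random module is
left untouched, and the tuple of names is only a subset chosen for
illustration.

```python
import random
import types

# Subset of names for illustration; random.py binds more than these.
_NAMES = ('seed', 'random', 'randint', 'choice', 'shuffle', 'sample')

def set_default_instance(module, inst):
    """Rebind the module-level functions to bound methods of inst."""
    for name in _NAMES:
        setattr(module, name, getattr(inst, name))

# Throwaway stand-in for the random module.
demo = types.ModuleType('demo_random')
rng = random.SystemRandom()
set_default_instance(demo, rng)

print(demo.choice.__self__ is rng)  # True: a bound method of the instance
print(demo.randint(1, 6))
```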
  • Chris Angelico at Sep 10, 2015 at 6:08 am

    On Thu, Sep 10, 2015 at 11:50 AM, Andrew Barnert via Python-ideas wrote:
    Of course it adds the cost of making the module slower, and also more complex. Maybe a better solution would be to add a random.set_default_instance function that replaced all of the top-level functions with bound methods of the instance (just like what's already done at startup in random.py)? That's simple, and doesn't slow down anything, and it seems like it makes it more clear what you're doing than setting random.inst.

    +1. A single function call that replaces all the methods adds a
    minuscule constant to code size, run time, etc, and it's no less
    readable than assignment to a module attribute. (If anything, it makes
    it more clearly a supported operation - I've seen novices not realize
    that "module.xyz = foo" is valid, but nobody would misunderstand the
    validity of a function call.)


    ChrisA
  • Andrew Barnert at Sep 10, 2015 at 8:17 am

    On Sep 9, 2015, at 23:08, Chris Angelico wrote:
    +1. A single function call that replaces all the methods adds a
    minuscule constant to code size, run time, etc, and it's no less
    readable than assignment to a module attribute. (If anything, it makes
    it more clearly a supported operation - I've seen novices not realize
    that "module.xyz = foo" is valid, but nobody would misunderstand the
    validity of a function call.)

    I was only half-serious about this, but now I think I like it: it provides exactly the fix people are hoping for by deprecating the top-level functions, but with less risk, less user code churn, a smaller patch, and a much easier fix for novice users. (And it's much better than my earlier suggestion, too.)


    See https://gist.github.com/abarnert/e0fced7569e7d77f7464 for the patch, and a patched copy of random.py. The source comments in the patch should be enough to understand everything that's changed.


    A couple things:


    I'm not sure the normal deprecation path makes sense here. For a couple versions, everything continues to work (because most novices, the people we're trying to help, don't see DeprecationWarnings), and then suddenly their code breaks. Maybe making it a UserWarning makes more sense here?


    I made Random a synonym for UnsafeRandom (the class that warns and then passes through to DeterministicRandom). But is that really necessary? Someone who's explicitly using an instance of class Random rather than the top-level functions probably isn't someone who needs this warning, right?


    Also, if this is the way we'd want to go, the docs change would be a lot more substantial than the code change. I think the docs should be organized around choosing a random generator and using its methods, and only then mention set_default_instance as being useful for porting old code (and for making it easy for multiple modules to share a single generator, but that shouldn't be a common need for novices).
  • Serhiy Storchaka at Sep 10, 2015 at 8:32 am

    On 10.09.15 11:17, Andrew Barnert via Python-ideas wrote:
    See https://gist.github.com/abarnert/e0fced7569e7d77f7464 for the patch, and a patched copy of random.py. The source comments in the patch should be enough to understand everything that's changed.

    This doesn't work with the idiom "from random import random".
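
The reason is that "from random import random" copies the binding that exists
at import time, so rebinding the module attribute later is not seen by the
importer. A sketch using a stand-in module (fake_random is a made-up name, so
the real random module is left alone):

```python
import types

# Stand-in for the random module.
fake = types.ModuleType('fake_random')
fake.random = lambda: 'original generator'

# What "from fake_random import random" does: copy the current binding.
imported_random = fake.random

# What a set_default_instance-style call does later: rebind the module name.
fake.random = lambda: 'replacement generator'

print(fake.random())      # replacement generator
print(imported_random())  # original generator
```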
  • Andrew Barnert at Sep 10, 2015 at 10:33 am

    On Sep 10, 2015, at 01:32, Serhiy Storchaka wrote:
    This doesn't work with the idiom "from random import random".

    Well, the goal of the deprecation idea was to eventually get people to explicitly use instances, so the fact that it doesn't work out of the box is a good thing, not a problem.


    But for people just trying to retrofit existing code, all they have to do is call random.set_default_instance at the top of the main module, and all their other modules can just import what they need this way. Which is why it's better than straightforward deprecation.
  • Steven D'Aprano at Sep 11, 2015 at 1:49 pm

    On Thu, Sep 10, 2015 at 04:08:09PM +1000, Chris Angelico wrote:
    +1. A single function call that replaces all the methods adds a
    minuscule constant to code size, run time, etc, and it's no less
    readable than assignment to a module attribute.

    Making monkey-patching the official, recommended way to choose a PRNG is
    a risky solution, to put it mildly. That means that at any time, some
    other module that is directly or indirectly imported might change the
    random number generators you are using without your knowledge. You want
    a crypto PRNG, but some module replaces it with MT. Or vice versa.


    Technically, it is true that (this being Python) they can do this now,
    just by assigning to the random module:


         random.random = lambda: 9


    but that is clearly abusive, and if you write code to do that, you're
    asking for whatever trouble you get. There's no official API to screw
    over other callers of the random module behind their back. You're
    suggesting that we add one.



    (If anything, it makes
    it more clearly a supported operation

    Which is exactly why this is a terrible idea. You're making monkey-
    patching not only officially supported, but encouraged. That will not
    end well.






    --
    Steve
  • Andrew Barnert at Sep 11, 2015 at 8:27 pm

    On Sep 11, 2015, at 06:49, Steven D'Aprano wrote:
    On Thu, Sep 10, 2015 at 04:08:09PM +1000, Chris Angelico wrote:
    On Thu, Sep 10, 2015 at 11:50 AM, Andrew Barnert via Python-ideas
    wrote:
    Of course it adds the cost of making the module slower, and also
    more complex. Maybe a better solution would be to add a
    random.set_default_instance function that replaced all of the
    top-level functions with bound methods of the instance (just like
    what's already done at startup in random.py)? That's simple, and
    doesn't slow down anything, and it seems like it makes it more clear
    what you're doing than setting random.inst.
    +1. A single function call that replaces all the methods adds a
    minuscule constant to code size, run time, etc, and it's no less
    readable than assignment to a module attribute.
    Making monkey-patching the official, recommended way to choose a PRNG is
    a risky solution, to put it mildly.

    But that's not the proposal. The proposal is to make explicitly passing around an instance the official, recommended way to choose a PRNG; monkey-patching is only the official, recommended way to quickly get legacy code working: once you see the warning about the potential problem and decide that the problem doesn't affect you, you write one standard line of code at the top of your main script instead of rewriting all of your modules and patching or updating every third-party module you use.


    As I said later, I think my later suggestion of just having a singleton DeterministicRandom instance (or even a submodule with the same interface) that you can explicitly import in place of random serves the same needs well enough, and is even simpler, and is more flexible (in particular, it can also be used for novices' "my first game" programs), so I'm no longer suggesting this. But that doesn't mean there's any benefit to mischaracterizing the suggestion (especially if Chris or anyone else still supports it even though I don't).
  • Donald Stufft at Sep 10, 2015 at 1:30 am

    On September 9, 2015 at 8:01:17 PM, Donald Stufft (donald at stufft.io) wrote:
    It seems to boil down to, do we want to try to protect users by default or at
    least make it more obvious in the API which one they want to use (I think yes),
    and if so do we think that /dev/urandom is "fast enough" for most people in
    group #3 and if not, do we agree with Theo that a modern userland CSPRNG is
    safe enough to use, or do we agree with Thomas that it's not and if we think
    that it is, do we use arc4random and what do we do on systems that don't have
    a modern userland CSPRNG in their libc.

    Ok, I've talked to an honest to god cryptographer as well as some other smart
    folks!


    Here's the general gist:


    Using a userland CSPRNG like arc4random is not advisable for things that you
    absolutely need cryptographic security for (this is group #2 from my original
    email). These people should use os.urandom or random.SystemRandom as they
    should be doing now. In addition os.urandom or random.SystemRandom is
    probably fast enough for most use cases of the random.py module, however it is
    true that using os.urandom/random.SystemRandom would be slower than MT. It is
    reasonable to use a userland CSPRNG as a "default" source of randomness or in
    cases where people care about speed but maybe not about security and don't
    need determinism.
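    The speed difference mentioned above can be measured locally. The following is a rough sketch using only the standard library; the helper name time_calls and the call count are arbitrary choices made here, and the numbers depend entirely on the platform, so no particular result is claimed.

```python
import random
import timeit

mt = random.Random()              # Mersenne Twister, runs in userland
sysrand = random.SystemRandom()   # reads from os.urandom on every call


def time_calls(rng, number=20_000):
    """Return the seconds taken by `number` calls to rng.random()."""
    return timeit.timeit(rng.random, number=number)


mt_seconds = time_calls(mt)
sys_seconds = time_calls(sysrand)
print(f"MT: {mt_seconds:.4f}s  SystemRandom: {sys_seconds:.4f}s")
```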


    However, they've said that the primary benefit in using a userland CSPRNG for
    a faster cryptographically secure source of randomness is if we can make it the
    default source of randomness for a "probably safe depending on your app" safety
    net for people who didn't read or understand the documentation. This would make
    most uses of random.random and friends secure but not deterministic.


    If we're unwilling to change the default, but we are willing to deprecate the
    module scoped functions and force users to make a choice between
    random.SystemRandom and random.DeterministicRandom then there is unlikely to
    be much benefit to also adding a userland CSPRNG into the mix since there's no
    class of people who are using an ambiguous "random" that we don't know if they
    need it to be secure or deterministic/fast.


    So I guess my suggestion would be, let's deprecate the module scope functions
    and rename random.Random to random.DeterministicRandom. This absolves us of
    needing to change the behavior of people's existing code (besides deprecating
    it) and we don't need to decide if a userland CSPRNG is safe or not while still
    moving us to a situation that is far more likely to have users doing the right
    thing.
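    To make the suggestion concrete, here is one way the deprecation could look. This is only a sketch: random.DeterministicRandom does not exist in the standard library, so it is defined locally as a subclass, and the helper _deprecated and the warning text are invented for illustration.

```python
import random
import warnings


class DeterministicRandom(random.Random):
    """Hypothetical new name for the seedable Mersenne Twister generator."""


# Hidden shared instance, like the one random.py creates at import time.
_inst = DeterministicRandom()


def _deprecated(name):
    """Wrap a method of the shared instance so that calling it warns."""
    method = getattr(_inst, name)

    def wrapper(*args, **kwargs):
        warnings.warn(
            f"random.{name}() is deprecated; create a DeterministicRandom() "
            "or SystemRandom() instance explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return method(*args, **kwargs)

    wrapper.__name__ = name
    return wrapper


# Module-scope functions keep working but now announce the deprecation.
choice = _deprecated("choice")
randint = _deprecated("randint")
```

    Existing code keeps its behavior; the only visible change is the warning, which points the user at the explicit choice.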


    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Petr Viktorin at Sep 10, 2015 at 7:35 am
    On Thu, Sep 10, 2015 at 3:30 AM, Donald Stufft wrote:
    [...]
    So I guess my suggestion would be, let's deprecate the module scope functions
    and rename random.Random to random.DeterministicRandom. This absolves us of
    needing to change the behavior of people's existing code (besides deprecating
    it) and we don't need to decide if a userland CSPRNG is safe or not while still
    moving us to a situation that is far more likely to have users doing the right
    thing.

    There is one use case that would be hit by that: the kid writing their
    first rock-paper-scissors game.
    A beginner who just learned the `if` statement isn't ready for a
    discussion of cryptography vs. reproducible results, and
    random.SystemRandom.random() would just become a magic incantation to
    learn. It would feel like requiring sys.stdout.write() instead of
    print().


    Functions like paretovariate(), getstate(), or seed(), which require
    some understanding of (pseudo)randomness, can be moved to a specific
    class, but I don't think deprecating random(), randint(), randrange(),
    choice(), and shuffle() would be a good idea. Switching them to a
    cryptographically safe RNG is OK from this perspective, though.
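    For illustration, this is roughly the program in question, with both spellings side by side. The names MOVES, BEATS and judge are made up for this sketch; the two ways of picking the computer's move are the point.

```python
import random

MOVES = ["rock", "paper", "scissors"]
# Each key beats the move it maps to.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def judge(player, computer):
    """Return the outcome from the player's point of view."""
    if player == computer:
        return "draw"
    return "win" if BEATS[player] == computer else "lose"


# What a beginner writes today: one import, one obvious call.
computer = random.choice(MOVES)

# What they would have to write if the module-level functions went away.
computer_explicit = random.SystemRandom().choice(MOVES)

print("You played rock, computer played", computer, "->", judge("rock", computer))
```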
  • Andrew Barnert at Sep 10, 2015 at 8:20 am

    On Sep 10, 2015, at 00:35, Petr Viktorin wrote:
    On Thu, Sep 10, 2015 at 3:30 AM, Donald Stufft wrote:
    [...]

    So I guess my suggestion would be, let's deprecate the module scope functions
    and rename random.Random to random.DeterministicRandom. This absolves us of
    needing to change the behavior of people's existing code (besides deprecating
    it) and we don't need to decide if a userland CSPRNG is safe or not while still
    moving us to a situation that is far more likely to have users doing the right
    thing.
    There is one use case that would be hit by that: the kid writing their
    first rock-paper-scissors game.
    A beginner who just learned the `if` statement isn't ready for a
    discussion of cryptography vs. reproducible results, and
    random.SystemRandom.random() would just become a magic incantation to
    learn. It would feel like requiring sys.stdout.write() instead of
    print().

    Functions like paretovariate(), getstate(), or seed(), which require
    some understanding of (pseudo)randomness, can be moved to a specific
    class, but I don't think deprecating random(), randint(), randrange(),
    choice(), and shuffle() would be a good idea. Switching them to a
    cryptographically safe RNG is OK from this perspective, though.

    Silently switching them could break a lot of code.


    I don't think there's any way around making them warn the user that they need to do something. I think the patch I just sent is a good way of doing that: the minimum thing they need to do is a one-liner, which is explained in the warning, and it also gives them enough information to check the docs or google the message and get some understanding of the choice if they're at all inclined to do so. (And if they aren't, well, either one works for the use case you're talking about, so let them flip a coin, or call random.choice.;))
  • Alexander Walters at Sep 10, 2015 at 9:20 am
    Can I just ask what is the actual problem we are trying to solve here?


    Python has third party cryptography modules, that bring their own
    sources of randomness (or cryptography libraries that do the same).


    Python has a good random library for everything other than cryptography.


    Why in the heck are we trying to make the random module do something
    that it is already documented as being a poor choice, where there are
    already third party modules that do just this?


    Who needs cryptographic randomness in the standard library anyways (even
    though one line of code gives you access to it)? Have we identified even
    ONE person who does cryptography in python who is kicking themselves
    that they can't use the random module as implemented?


    Is this just indulging a paranoid developer?
  • Donald Stufft at Sep 10, 2015 at 11:40 am

    On September 10, 2015 at 5:21:29 AM, Alexander Walters (tritium-list at sdamon.com) wrote:
    Why in the heck are we trying to make the random module do something
    that it is already documented as being a poor choice, where there are
    already third party modules that do just this?

    Who needs cryptographic randomness in the standard library anyways (even
    though one line of code gives you access to it)? Have we identified even
    ONE person who does cryptography in python who is kicking themselves
    that they can't use the random module as implemented?

    Because there are situations where you need securely generated randomness
    where you are *NOT* "doing cryptography". Blaming people for the fact that the
    random module has a bad UX that naturally leads them to use it when it isn't
    appropriate is a shitty thing to do.


    What harm is there in making people explicitly choose between deterministic
    randomness and secure randomness? Is your use case so much better than theirs
    that you think you deserve to type a few characters less to the detriment of
    people who don't know any better?


    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Alexander Walters at Sep 10, 2015 at 10:47 pm

    On 9/10/2015 07:40, Donald Stufft wrote:


    What harm is there in making people explicitly choose between deterministic
    randomness and secure randomness? Is your use case so much better than theirs
    that you think you deserve to type a few characters less to the detriment of
    people who don't know any better?
    API Breakage. This is not worth the break in backwards compatibility.
    My use case is using the API that has been available for... 20 years?
    And for what benefit? None, and it can be argued that it would do the
    opposite of what is intended (false sense of security and all).
  • Steven D'Aprano at Sep 10, 2015 at 3:46 am
    On Wed, Sep 09, 2015 at 08:01:16PM -0400, Donald Stufft wrote:
    [...]
    Looking on google, the first result for "python random password" is
    StackOverflow which suggests:

        ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

    However, it was later edited to, after that, include:

        ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(N))

    You're worried about attacks on the random number generator that
    produces the characters in the password? I think I'm going to have to
    see an attack before I believe that this is meaningful.


    Excluding PRNGs that are hopelessly biased ("nine, nine, nine, nine...")
    or predictable, how does knowing the PRNG help in an attack? Here's a
    password I just generated using your "corrected" version using
    SystemRandom:


         06XW0X0X


    (Honest, that's exactly what I got on my first try.)


    Here's one I generated using the "bad" code snippet:


         V6CFKCF2


    How can you tell them apart, or attack one but not the other based on
    the PRNG?



    So it wasn't obvious to the person who answered that question that the random
    module's module scoped functions were not appropriate for this use. It appears
    that the original answer lasted for roughly 4 years before it was corrected,

    Shouldn't it be using a single instance of SystemRandom rather than a
    new instance for each call?
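    A sketch of the single-instance variant of the quoted snippet. The names _sysrand, ALPHABET and make_password are chosen here for illustration; the alphabet is the one used in the StackOverflow answer.

```python
import random
import string

# Create the OS-backed generator once and reuse it for every character,
# instead of constructing a new SystemRandom() inside the loop.
_sysrand = random.SystemRandom()

ALPHABET = string.ascii_uppercase + string.digits


def make_password(length=8):
    """Return `length` characters drawn from ALPHABET via os.urandom."""
    return "".join(_sysrand.choice(ALPHABET) for _ in range(length))


print(make_password())
```

    SystemRandom keeps no generator state of its own, so reusing one instance does not change the output distribution; it only avoids creating an object per character.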




    [...]
    According to Theo, modern userland CSPRNGs can create random bytes faster than
    memcpy

    That is an astonishing claim, and I'd want to see evidence for it before
    accepting it.






    --
    Steve
  • Random832 at Sep 10, 2015 at 3:59 am

    Steven D'Aprano <steve@pearwood.info> writes:


    On Wed, Sep 09, 2015 at 08:01:16PM -0400, Donald Stufft wrote:
    [...]

    You're worried about attacks on the random number generator that
    produces the characters in the password? I think I'm going to have to
    see an attack before I believe that this is meaningful.

    Isn't the only difference between generating a password and generating a
    key the length (and base) of the string? Where is the line?

    That is an astonishing claim, and I'd want to see evidence for it before
    accepting it.

    I assume it's comparing a CSPRNG all of whose state is in cache (or
    registers, if a large block of random bytes is requested from the CSPRNG
    in one go), with memcpy of data which must be retrieved from main
    memory.
  • Paul Moore at Sep 10, 2015 at 8:41 am

    On 10 September 2015 at 01:01, Donald Stufft wrote:
    Essentially, there are three basic types of uses of random (the concept, not
    the module). Those are:

    1. People/usecases who absolutely need deterministic output given a seed and
    for whom security properties don't matter.
    2. People/usecases who absolutely need a cryptographically random output and
    for whom having a deterministic output is a downside.
    3. People/usecases that fall somewhere in between where it may or may not be
    security sensitive or it may not be known if it's security sensitive.

    Wrong.


    There is a fourth basic type. People (like me!) whose code absolutely
    doesn't have any security issues, but want a simple, convenient, fast
    RNG. Determinism is not an absolute requirement, but is very useful
    (for writing tests, maybe, or for offering a deterministic rerun
    option to the program). Simulation-style games often provide a way to
    find the "map seed", which allows users to share interesting maps -
    this is non-essential but a big quality-of-life benefit in such games.
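    The "map seed" pattern can be sketched as follows. The tile set, the function generate_map and its parameters are invented for this example; the point is only that the same seed fed to random.Random reproduces the same map.

```python
import random

TILES = ".~^#"  # plain, water, hill, wall (arbitrary symbols)


def generate_map(seed, width=16, height=4):
    """Build a small tile map that depends only on `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return ["".join(rng.choice(TILES) for _ in range(width))
            for _ in range(height)]


# Pick a fresh seed, show it to the player, and build the map from it.
map_seed = random.SystemRandom().getrandbits(32)
original = generate_map(map_seed)

# Another player who types in the same seed gets the same map.
shared = generate_map(map_seed)

print("map seed:", map_seed)
print("\n".join(original))
```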


    IMO, the current module perfectly serves this fourth group.


    While I accept your point that far too many people are using insecure
    RNGs in "generate a random password" scripts, they are *not* the core
    target audience of the default module-level functions in the random
    module (did you find any examples of insecure use that *weren't*
    password generators?). We should educate people that this is bad
    practice, not change the module. Also, while it may be imperfect, it's
    still better than what many people *actually* do, which is to use
    "password" as a password on sensitive systems :-(


    Maybe what Python *actually* needs is a good-quality "random password
    generator" module in the stdlib? (Semi-serious suggestion...)


    Paul
  • Donald Stufft at Sep 10, 2015 at 11:26 am

    On September 10, 2015 at 4:41:56 AM, Paul Moore (p.f.moore at gmail.com) wrote:
    On 10 September 2015 at 01:01, Donald Stufft wrote:
    Essentially, there are three basic types of uses of random (the concept, not
    the module). Those are:

    1. People/usecases who absolutely need deterministic output given a seed and
    for whom security properties don't matter.
    2. People/usecases who absolutely need a cryptographically random output and
    for whom having a deterministic output is a downside.
    3. People/usecases that fall somewhere in between where it may or may not be
    security sensitive or it may not be known if it's security sensitive.
    Wrong.

    There is a fourth basic type. People (like me!) whose code absolutely
    doesn't have any security issues, but want a simple, convenient, fast
    RNG. Determinism is not an absolute requirement, but is very useful
    (for writing tests, maybe, or for offering a deterministic rerun
    option to the program). Simulation-style games often provide a way to
    find the "map seed", which allows users to share interesting maps -
    this is non-essential but a big quality-of-life benefit in such games.

    This group is the same as #3 except for the map seed thing which is
    group #1. In particular, it wouldn't hurt you if the random you were
    using was cryptographically secure as long as it was fast and if you
    needed determinism, it would hurt you to say so. Which is the point
    that Theo was making.

    IMO, the current module perfectly serves this fourth group.

    Making the user pick between Deterministic and Secure random would serve
    this purpose too, especially in a language where "In the face of ambiguity,
    refuse the temptation to guess" is one of the core tenets of the language. The
    largest downside would be typing a few extra characters, which Python is not
    a language that attempts to do things in the fewest number of characters.

    While I accept your point that far too many people are using insecure
    RNGs in "generate a random password" scripts, they are *not* the core
    target audience of the default module-level functions in the random
    module (did you find any examples of insecure use that *weren't*
    password generators?). We should educate people that this is bad
    practice, not change the module. Also, while it may be imperfect, it's
    still better than what many people *actually* do, which is to use
    "password" as a password on sensitive systems :-(

    You cannot document your way out of a UX problem.


    The problem isn't people doing this once on the command line to generate
    a password, the problem is people doing it in applications where they
    generate an API key, a session identifier, a random password which they
    then give to their users. If you give a way to get the output of the MT
    base random enough times, it can be used to determine what every random
    it generated was and will be.
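    The following sketch shows the forward-prediction half of that claim for the simplest case: an observer who sees 624 consecutive raw 32-bit outputs (here obtained with getrandbits(32)) can rebuild the generator state and predict everything that follows. The constants are the published MT19937 tempering parameters; the helper names are invented here. Recovering state from outputs that leak fewer bits per call, such as characters picked with choice(), takes more work and is not shown.

```python
import random


def _undo_right(y, shift):
    """Invert y = x ^ (x >> shift) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x


def _undo_left(y, shift, mask):
    """Invert y = x ^ ((x << shift) & mask) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x


def untemper(y):
    """Map one 32-bit output back to the internal state word behind it."""
    y = _undo_right(y, 18)
    y = _undo_left(y, 15, 0xEFC60000)
    y = _undo_left(y, 7, 0x9D2C5680)
    return _undo_right(y, 11)


def clone_from_outputs(outputs):
    """Build a Random whose future output matches the observed generator."""
    assert len(outputs) == 624
    state = tuple(untemper(y) for y in outputs)
    clone = random.Random()
    # Index 624 tells the generator to derive the next block of state first.
    clone.setstate((3, state + (624,), None))
    return clone


victim = random.Random()           # seeded from the OS, unknown to the observer
for _ in range(1000):              # it has already been used for a while
    victim.getrandbits(32)

observed = [victim.getrandbits(32) for _ in range(624)]
clone = clone_from_outputs(observed)
print(clone.getrandbits(32) == victim.getrandbits(32))
```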


    Here's a game a friend of mine created where the purpose of the game is
    to essentially unrandomize some random data, which is only possible
    because it's (purposely) using MT to make it possible
    https://github.com/reaperhulk/dsa-ctf. This is not an ivory tower paranoia
    case, it's a real concern that will absolutely fix some insecure software
    out there instead of telling them "welp typing a little bit extra once
    an import is too much of a burden for me and really it's your own fault
    anyways".

    Maybe what Python *actually* needs is a good-quality "random password
    generator" module in the stdlib? (Semi-serious suggestion...)

    Paul

    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Paul Moore at Sep 10, 2015 at 12:29 pm

    On 10 September 2015 at 12:26, Donald Stufft wrote:
    There is a fourth basic type. People (like me!) whose code absolutely
    doesn't have any security issues, but want a simple, convenient, fast
    RNG. Determinism is not an absolute requirement, but is very useful
    (for writing tests, maybe, or for offering a deterministic rerun
    option to the program). Simulation-style games often provide a way to
    find the "map seed", which allows users to share interesting maps -
    this is non-essential but a big quality-of-life benefit in such games.
    This group is the same as #3 except for the map seed thing which is
    group #1. In particular, it wouldn't hurt you if the random you were
    using was cryptographically secure as long as it was fast and if you
    needed determinism, it would hurt you to say so. Which is the point
    that Theo was making.

    I don't understand the phrase "if you needed determinism, it would
    hurt you to say so". Could you clarify?

    IMO, the current module perfectly serves this fourth group.
    Making the user pick between Deterministic and Secure random would serve
    this purpose too, especially in a language where "In the face of ambiguity,
    refuse the temptation to guess" is one of the core tenets of the language. The
    largest downside would be typing a few extra characters, which Python is not
    a language that attempts to do things in the fewest number of characters.

    And yet I know that I would routinely, and (this is the problem)
    without thinking, choose Deterministic, because I know that my use
    cases all get a (small) benefit from being able to capture the seed,
    but I also know I'm not doing security-related stuff.


    No amount of making me choose is going to help me spot security
    implications that I've missed.


    And also, calling the non-crypto choice "Deterministic" is unhelpful,
    because I *don't* want something deterministic, I want something
    random (I understand PRNGs aren't truly random, but "good enough for
    my purposes" is what I want, and "deterministic" reads to me as saying
    it's *not* good enough...)

    While I accept your point that far too many people are using insecure
    RNGs in "generate a random password" scripts, they are *not* the core
    target audience of the default module-level functions in the random
    module (did you find any examples of insecure use that *weren't*
    password generators?). We should educate people that this is bad
    practice, not change the module. Also, while it may be imperfect, it's
    still better than what many people *actually* do, which is to use
    "password" as a password on sensitive systems :-(
    You cannot document your way out of a UX problem.

    What I'm trying to say is that this is an education problem more than
    a UX problem.


    Personally, I think I know enough about security for my (not a
    security specialist) purposes. To that extent, if I'm working on
    something with security implications, I'm looking for things that say
    "Crypto" in the name. The rest of the time, I just use non-specialist
    stuff. It's a similar situation to that of the "statistics" module. If
    I'm doing "proper" maths, I'd go for numpy/scipy. If I just want some
    averages and I'm not bothered about numerical stability, rounding
    behaviour, etc, I'd go for the stdlib statistics package.

    The problem isn't people doing this once on the command line to generate
    a password, the problem is people doing it in applications where they
    generate an API key, a session identifier, a random password which they
    then give to their users. If you give a way to get the output of the MT
    base random enough times, it can be used to determine what every random
    it generated was and will be.

    To me, that's crypto and I'd look to the cryptography module, or to
    something in the stdlib that explicitly said it was suitable for
    crypto.


    Saying people write bad code isn't enough - how does the current
    module *encourage* them to write bad code? How much API change must we
    allow to cater for people who won't read the statement in the docs (in
    a big red box) "Warning: The pseudo-random generators of this module
    should not be used for security purposes." (Specifically people
    writing security related code who won't read the docs).

    Here's a game a friend of mine created where the purpose of the game is
    to essentially unrandomize some random data, which is only possible
    because it's (purposely) using MT to make it possible
    https://github.com/reaperhulk/dsa-ctf. This is not an ivory tower paranoia
    case, it's a real concern that will absolutely fix some insecure software
    out there instead of telling them "welp typing a little bit extra once
    an import is too much of a burden for me and really it's your own fault
    anyways".

    I don't understand how that game (which is an interesting way of
    showing people how attacks on crypto work, sure, but that's just
    education, which you dismissed above) relates to the issue here.


    And I hope you don't really think that your quote is even remotely
    what I'm trying to say (I'm not that selfish) - my point is that not
    everything is security related. Not every application people write,
    and not every API in the stdlib. You're claiming that the random
    module is security related. I'm claiming it's not, it's documented as
    not being, and that's clear to the people who use it for its intended
    purpose. Telling those people that you want to make a module designed
    for their use harder to use because people for whom it's not intended
    can't read the documentation which explicitly states that it's not
    suitable for them, is doing a disservice to those people who are
    already using the module correctly for its stated purpose.


    By the same argument, we should remove the statistics module because
    it can be used by people with numerically unstable problems. (I doubt
    you'll find StackOverflow questions along these lines yet, but that's
    only because (a) the module's pretty new, and (b) it actually works
    pretty hard to handle the hard corner cases, but I bet they'll start
    turning up in due course, if only from the people who don't understand
    floating point...)


    Paul
  • Donald Stufft at Sep 10, 2015 at 1:10 pm

    On September 10, 2015 at 8:29:16 AM, Paul Moore (p.f.moore at gmail.com) wrote:
    On 10 September 2015 at 12:26, Donald Stufft wrote:
    There is a fourth basic type. People (like me!) whose code absolutely
    doesn't have any security issues, but want a simple, convenient, fast
    RNG. Determinism is not an absolute requirement, but is very useful
    (for writing tests, maybe, or for offering a deterministic rerun
    option to the program). Simulation-style games often provide a way to
    find the "map seed", which allows users to share interesting maps -
    this is non-essential but a big quality-of-life benefit in such games.
    This group is the same as #3 except for the map seed thing which is
    group #1. In particular, it wouldn't hurt you if the random you were
    using was cryptographically secure as long as it was fast and if you
    needed determinism, it would hurt you to say so. Which is the point
    that Theo was making.
    I don't understand the phrase "if you needed determinism, it would
    hurt you to say so". Could you clarify?

    I transposed some words, fixed:


    "If you needed determinism, would it hurt you to say so?"


    Essentially, other than typing a little bit more, why is:


        import random
        print(random.choice(['a', 'b', 'c']))


    better than


        import random
        print(random.DeterministicRandom().choice(['a', 'b', 'c']))


    As far as I can tell, you've made your code and what properties it has much
    clearer to someone reading it at the cost of 22 characters. If you're going to
    reuse the DeterministicRandom class you can assign it to a variable and
    actually end up saving characters if the variable you save it to can be
    accessed at less than 6 characters.

    IMO, the current module perfectly serves this fourth group.
    Making the user pick between Deterministic and Secure random would serve
    this purpose too, especially in a language where "In the face of ambiguity,
    refuse the temptation to guess" is one of the core tenets of the language. The
    largest downside would be typing a few extra characters, which Python is not
    a language that attempts to do things in the fewest number of characters.
    And yet I know that I would routinely, and (this is the problem)
    without thinking, choose Deterministic, because I know that my use
    cases all get a (small) benefit from being able to capture the seed,
    but I also know I'm not doing security-related stuff.

    No amount of making me choose is going to help me spot security
    implications that I've missed.

    You're allowed to pick DeterministicRandom, you're even allowed to do it
    without thinking. This isn't about making it impossible to ever insecurely use
    random numbers, that's obviously a boil the ocean level of problem, this is
    about trying to make it more likely that someone won't be hit by a fairly easy
    to hit footgun if it does matter for them, even if they don't know it. It's
    also about making code that is easier to understand on the surface, for example
    without using the prior knowledge that it's using MT, tell me how you'd know
    if this was safe or not:


        import random
        import string
        password = "".join(random.choice(string.ascii_letters) for _ in range(9))
        print("Your random password is", password)



    And also, calling the non-crypto choice "Deterministic" is unhelpful,
    because I *don't* want something deterministic, I want something
    random (I understand PRNGs aren't truly random, but "good enough for
    my purposes" is what I want, and "deterministic" reads to me as saying
    it's *not* good enough...)

    But you *DO* want something deterministic, the *ONLY* way you can get this
    small benefit of capturing the seed is if you can put that seed back into the
    system and get a deterministic result. If the seed didn't exactly determine the
    output of the randomness then you wouldn't be able to do that. If you don't
    need to be able to capture the seed and essentially "replay" the PRNG in a
    deterministic way then there is exactly zero downsides to using a CSPRNG other
    than speed, which is why Theo suggested using a very fast, modern CSPRNG to
    solve the speed issues.
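    A short sketch of the distinction using only classes that exist in the random module today. The seed value and variable names are arbitrary.

```python
import random

seed = 20150910

# Two Mersenne Twister generators given the same seed replay the same
# sequence: this is the deterministic property that seed capture relies on.
first = random.Random(seed)
replay = random.Random(seed)
run_a = [first.randint(1, 6) for _ in range(5)]
run_b = [replay.randint(1, 6) for _ in range(5)]
print(run_a == run_b)

# SystemRandom accepts seed() for interface compatibility but ignores it,
# and it has no state to capture or restore.
secure = random.SystemRandom()
secure.seed(seed)
try:
    secure.getstate()
    has_state = True
except NotImplementedError:
    has_state = False
print("SystemRandom has capturable state:", has_state)
```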


    Can you point out one use case where cryptographically safe random numbers,
    assuming we could generate them as quickly as you asked for them, would hurt
    you unless you needed/wanted to be able to save the seed and thus require or
    want deterministic results?

    While I accept your point that far too many people are using insecure
    RNGs in "generate a random password" scripts, they are *not* the core
    target audience of the default module-level functions in the random
    module (did you find any examples of insecure use that *weren't*
    password generators?). We should educate people that this is bad
    practice, not change the module. Also, while it may be imperfect, it's
    still better than what many people *actually* do, which is to use
    "password" as a password on sensitive systems :-(
    You cannot document your way out of a UX problem.
    What I'm trying to say is that this is an education problem more than
    a UX problem.

    Personally, I think I know enough about security for my (not a
    security specialist) purposes. To that extent, if I'm working on
    something with security implications, I'm looking for things that say
    "Crypto" in the name. The rest of the time, I just use non-specialist
    stuff. It's a similar situation to that of the "statistics" module. If
    I'm doing "proper" maths, I'd go for numpy/scipy. If I just want some
    averages and I'm not bothered about numerical stability, rounding
    behaviour, etc, I'd go for the stdlib statistics package.
    The problem isn't people doing this once on the command line to generate
    a password, the problem is people doing it in applications where they
    generate an API key, a session identifier, or a random password which they
    then give to their users. If an attacker has a way to get the output of the
    MT-based random enough times, it can be used to determine what every random
    value it generated was and will be.
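
    A rough sketch of the "will be" half of that claim, assuming the attacker can
    observe 624 consecutive full 32-bit outputs. The helper names are made up for
    this example, and the (3, state, None) tuple layout passed to setstate() is a
    CPython implementation detail rather than a documented format. Recovering
    earlier outputs is also possible but is not shown here.

```python
import random


def _undo_right_xor(z, shift):
    """Invert y ^= y >> shift for a 32-bit value."""
    y = z
    for _ in range(32 // shift + 1):
        y = z ^ (y >> shift)
    return y


def _undo_left_xor(z, shift, mask):
    """Invert y ^= (y << shift) & mask for a 32-bit value."""
    y = z
    for _ in range(32 // shift + 1):
        y = z ^ ((y << shift) & mask)
    return y


def untemper(value):
    """Recover the raw MT19937 state word behind one 32-bit output."""
    value = _undo_right_xor(value, 18)
    value = _undo_left_xor(value, 15, 0xEFC60000)
    value = _undo_left_xor(value, 7, 0x9D2C5680)
    value = _undo_right_xor(value, 11)
    return value


# The "victim" is seeded from the OS; the attacker never sees the seed.
victim = random.Random()
observed = [victim.getrandbits(32) for _ in range(624)]

# Rebuild the internal state from the observed outputs and load it into a clone.
state_words = tuple(untemper(v) for v in observed)
clone = random.Random()
clone.setstate((3, state_words + (624,), None))

predicted = [clone.getrandbits(32) for _ in range(1000)]
actual = [victim.getrandbits(32) for _ in range(1000)]
print(predicted == actual)
```
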
    To me, that's crypto and I'd look to the cryptography module, or to
    something in the stdlib that explicitly said it was suitable for
    crypto.

    Saying people write bad code isn't enough - how does the current
    module *encourage* them to write bad code? How much API change must we
    allow to cater for people who won't read the statement in the docs (in
    a big red box) "Warning: The pseudo-random generators of this module
    should not be used for security purposes." (Specifically people
    writing security related code who won't read the docs).

    Reminder that this warning does not show up (in any color, much less red)
    if you're using ``help(random)`` or ``dir(random)`` to explore the random
    module. It also does not show up in code review when you see someone doing
    random.random.


    It encourages you to write bad code, because it has a baked in assumption that
    there is a sane default for a random number generator and expects people to
    understand a fairly difficult concept, which is that not all "random" is equal.


    For instance, you've already made the mistake of saying you wanted "random" not
    deterministic, but the two are not mutually exclusive and deterministic is a
    property that a source of random can have, and one that you need for one of the
    features you say you like.

    Here's a game a friend of mine created where the purpose of the game is
    to essentially unrandomize some random data, which is only possible
    because it's (purposely) using MT to make it possible:
    https://github.com/reaperhulk/dsa-ctf. This is not an ivory tower paranoia
    case, it's a real concern that will absolutely fix some insecure software
    out there instead of telling them "welp typing a little bit extra once
    an import is too much of a burden for me and really it's your own fault
    anyways".
    I don't understand how that game (which is an interesting way of
    showing people how attacks on crypto work, sure, but that's just
    education, which you dismissed above) relates to the issue here.

    And I hope you don't really think that your quote is even remotely
    what I'm trying to say (I'm not that selfish) - my point is that not
    everything is security related. Not every application people write,
    and not every API in the stdlib. You're claiming that the random
    module is security related. I'm claiming it's not, it's documented as
    not being, and that's clear to the people who use it for its intended
    purpose. Telling those people that you want to make a module designed
    for their use harder to use because people for whom it's not intended
    can't read the documentation which explicitly states that it's not
    suitable for them, is doing a disservice to those people who are
    already using the module correctly for its stated purpose.

    I'm claiming that the term random is ambiguously both security related and
    not security related and we should either get rid of the default and expect
    people to pick whether or not their use case is security related, or we should
    assume that it is unless otherwise instructed. I don't particularly care what
    the exact spelling of this looks like, random.(System|Secure)Random and
    random.DeterministicRandom is just one option. Another option is to look at
    something closer to what Go did and deprecate the "random" module and move the
    MT based thing to ``math.random`` and the CSPRNG can be moved to something like
    crypto.random.

    By the same argument, we should remove the statistics module because
    it can be used by people with numerically unstable problems. (I doubt
    you'll find StackOverflow questions along these lines yet, but that's
    only because (a) the module's pretty new, and (b) it actually works
    pretty hard to handle the hard corner cases, but I bet they'll start
    turning up in due course, if only from the people who don't understand
    floating point...)

    No, by this argument we shouldn't have a function called statistics in the
    statistics module because there is no globally "right" answer for what the
    default should be. Should it be mean? mode? median? Why is *your* use case the
    "right" use case for the default option, particularly in a situation where
    picking the wrong option can be disastrous.


    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Paul Moore at Sep 10, 2015 at 1:44 pm

    On 10 September 2015 at 14:10, Donald Stufft wrote:
    I don't understand the phrase "if you needed determinism, it would
    hurt you to say so". Could you clarify?
    I transposed some words, fixed:

    "If you needed determinism, would it hurt you to say so?"

    Thanks.


    In one sense, no it wouldn't. Nor would it matter to me if "the
    default random number generator" was fast and cryptographically
    secure. What matters is just that I get a load of random (enough)
    numbers.


    What hurts somewhat (not enormously, I'll admit) is up front having to
    think about whether I need to be able to capture a seed and replay it.
    That's nearly always something I'd think of way down the line, as a
    "wouldn't it be nice if I could get the user to send me a reproducible
    test case" or something like that. And of course it's just a matter of
    switching the underlying RNG at that point.


    None of this is hard. But once again, I'm currently using the module
    correctly, as documented.


    I've omitted most of the rest of your response largely because we're
    probably just going to have to agree to differ. I'm probably too worn
    out being annoyed at the way that everything ends up needing to be
    security related, and the needs of people who won't read the docs
    determines API design, to respond clearly and rationally :-(


    Paul
  • Ian Cordasco at Sep 10, 2015 at 1:56 pm

    On Thu, Sep 10, 2015 at 8:44 AM, Paul Moore wrote:
    On 10 September 2015 at 14:10, Donald Stufft wrote:
    I don't understand the phrase "if you needed determinism, it would
    hurt you to say so". Could you clarify?
    I transposed some words, fixed:

    "If you needed determinism, would it hurt you to say so?"
    Thanks.

    In one sense, no it wouldn't. Nor would it matter to me if "the
    default random number generator" was fast and cryptographically
    secure. What matters is just that I get a load of random (enough)
    numbers.

    What hurts somewhat (not enormously, I'll admit) is up front having to
    think about whether I need to be able to capture a seed and replay it.
    That's nearly always something I'd think of way down the line, as a
    "wouldn't it be nice if I could get the user to send me a reproducible
    test case" or something like that. And of course it's just a matter of
    switching the underlying RNG at that point.

    None of this is hard. But once again, I'm currently using the module
    correctly, as documented.

    No one in this thread is accusing everyone of using the module
    incorrectly. The fact that you do use it correctly is a testament to
    the fact that you read the docs carefully and have some level of
    experience with the module to know that you're using it correctly.

    I've omitted most of the rest of your response largely because we're
    probably just going to have to agree to differ. I'm probably too worn
    out being annoyed at the way that everything ends up needing to be
    security related, and the needs of people who won't read the docs
    determines API design, to respond clearly and rationally :-(

    I think the people Theo, Donald, and others (including myself) are
    worried about are the people who have used some book or online
    tutorial to write games in Python and have seen random.random() or
    random.choice() used. Later on they start working on something else
    (including but not limited to the examples of what Donald has
    otherwise pointed out). They also have enough experience with the
    random module to know it produced randomness (what kind, they don't
    know... in fact they probably don't know there are different kinds
    yet) and they use what they know because Python has batteries included
    and they're awesome and easy to use. The reality is that past
    experiences bias current decisions. If that person went and read the
    docs, they probably won't know if what they're doing warrants using a
    CSPRNG instead of the default Python one. If they're not willing to
    learn, or read enough (and I stress enough) (or just really don't have
    the time because this is a side project) about the topic before making
    a decision, they'll say "Well the module level functions seemed random
    enough to me, so I'll just use those". That could end up being rather
    awful for them.


    The reality is that your past experiences (and other people's past
    experiences, especially those who refuse to do some research and are
    demanding others prove that these are insecure with examples) are
    biasing this discussion because they fail to empathize with new users
    whose past experiences are coloring their decisions.


    People choose Python for a variety of reasons, and one of those
    reasons is that in their past experience it was "fast enough" to be an
    acceptable choice. This is how most people behave. Being angry at
    people for not reading a two sentence long warning in the middle of the
    docs isn't helping anyone or arguing the validity of this discussion.
  • Donald Stufft at Sep 10, 2015 at 2:21 pm

    On September 10, 2015 at 9:44:13 AM, Paul Moore (p.f.moore at gmail.com) wrote:
    On 10 September 2015 at 14:10, Donald Stufft wrote:
    I don't understand the phrase "if you needed determinism, it would
    hurt you to say so". Could you clarify?
    I transposed some words, fixed:

    "If you needed determinism, would it hurt you to say so?"
    Thanks.

    In one sense, no it wouldn't. Nor would it matter to me if "the
    default random number generator" was fast and cryptographically
    secure. What matters is just that I get a load of random (enough)
    numbers.

    What hurts somewhat (not enormously, I'll admit) is up front having to
    think about whether I need to be able to capture a seed and replay it.
    That's nearly always something I'd think of way down the line, as a
    "wouldn't it be nice if I could get the user to send me a reproducible
    test case" or something like that. And of course it's just a matter of
    switching the underlying RNG at that point.

    None of this is hard. But once again, I'm currently using the module
    correctly, as documented.

    This is actually exactly why Theo suggested using a modern, userland CSPRNG
    because it can generate random numbers faster than /dev/urandom can and, unless
    you need deterministic results, there's little downside to doing so.


    There are really two possible ideas here, depending on what sort of balance
    we'd want to strike. We can make a default "I don't want to think about it"
    implementation of random that is both *generally* secure and fast, however it
    won't be deterministic and you won't be able to explicitly seed it. This would
    be a backwards compatible change [1] for people who are simply calling these
    functions [2]:


        random.getrandbits
        random.randrange
        random.randint
        random.choice
        random.shuffle
        random.sample
        random.random
        random.uniform
        random.triangular
        random.betavariate
        random.expovariate
        random.gammavariate
        random.gauss
        random.lognormvariate
        random.normalvariate
        random.vonmisesvariate
        random.paretovariate
        random.weibullvariate


    If this were all that the top level functions in random.py provided we could
    simply replace the default and people wouldn't notice, they'd just
    automatically get safer randomness whether that's actually useful for their
    use case or not.


    However, random.py also has these functions:


        random.seed
        random.getstate
        random.setstate
        random.jumpahead


    and these functions are where the problem comes. These functions only really
    make sense for deterministic sources of random which are not "safe" for use
    in security sensitive applications. So pretending for a moment that we've
    already decided to do "something" about this, the question boils down to what
    do we do about these 4 functions. Either we can change the default to a secure
    CSPRNG and break these functions (and the people using them), which is however
    easily fixed by changing ``import random`` to
    ``import random; random = random.DeterministicRandom()``, or we can deprecate
    the top level functions and try to guide people to choose up front what kind
    of random they need. Either of these solutions will end up with people being
    safer and, if we pretend we've agreed to do "something", it comes down to
    whether we'd prefer breaking compatibility for some people while keeping a
    default random generator that is probably good enough for most people, or if
    we'd prefer to not break compatibility and try to push people to always
    decide what kind of random they want.


    Of course, we still haven't decided that we should do "something". I think that
    we should because I think that secure by default (or at least, not insecure by
    default) is a good situation to be in. Over the history of computing it's been
    shown time and time again that trying to document or educate users is error
    prone and doesn't scale, but you can design APIs to make the "right" thing the
    obvious default and require opting in to specialist [3] cases which require
    some particular property.




    [1] Assuming Theo's claim of the speed of the ChaCha based arc4random function
        is accurate, which I haven't tested but I assume he's smart enough to know
        what he's talking about WRT the speed of it.


    [2] I believe anyways, I don't think that any of these rely on the properties
        of MT or a deterministic source of random, just a source of random.


    [3] In this case, there are two specialist use cases, those that require
        deterministic results and those that require specific security properties
        that are not satisfied by a userland CSPRNG because a userland CSPRNG is
        not as secure as /dev/urandom but is able to be much faster.

    I've omitted most of the rest of your response largely because we're
    probably just going to have to agree to differ. I'm probably too worn
    out being annoyed at the way that everything ends up needing to be
    security related, and the needs of people who won't read the docs
    determines API design, to respond clearly and rationally :-(

    Paul

    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Paul Moore at Sep 10, 2015 at 3:02 pm

    On 10 September 2015 at 15:21, Donald Stufft wrote:
    which is however
    easily fixed by changing ``import random`` to
    ``import random; random = random.DeterministicRandom()`` or we can deprecate

    Switching (somewhat hypocritically :-)) from an "I'm a naive user"
    stance, to talking about deeper issues as if I knew what I was talking
    about, this change results in each module getting a separate instance
    of the generator. That has implications on the risks of correlated
    results. It's unlikely to cause issues in real life, conceded.
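
    A small sketch of the per-module instance concern, using today's random.Random
    in place of the proposed DeterministicRandom (which does not exist). The seed
    2015 and the variable names are arbitrary and only for illustration.

```python
import random

# Two "modules" that each create a private generator with the same seed.
module_a_rng = random.Random(2015)
module_b_rng = random.Random(2015)

stream_a = [module_a_rng.random() for _ in range(5)]
stream_b = [module_b_rng.random() for _ in range(5)]
print(stream_a == stream_b)  # identical seeds give identical, fully correlated streams

# The module-level functions are bound methods of one hidden shared instance,
# so every caller advances the same stream instead of holding its own copy.
random.seed(2015)
shared = [random.random() for _ in range(5)]
print(shared == stream_a)
```

    Private instances only line up like this when they are seeded identically; left
    to seed themselves from the OS they start from unrelated states.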


    Paul
  • Brett Cannon at Sep 10, 2015 at 4:05 pm

    On Thu, 10 Sep 2015 at 07:22 Donald Stufft wrote:


    On September 10, 2015 at 9:44:13 AM, Paul Moore (p.f.moore at gmail.com)
    wrote:
    On 10 September 2015 at 14:10, Donald Stufft wrote:
    I don't understand the phrase "if you needed determinism, it would
    hurt you to say so". Could you clarify?
    I transposed some words, fixed:

    "If you needed determinism, would it hurt you to say so?"
    Thanks.

    In one sense, no it wouldn't. Nor would it matter to me if "the
    default random number generator" was fast and cryptographically
    secure. What matters is just that I get a load of random (enough)
    numbers.

    What hurts somewhat (not enormously, I'll admit) is up front having to
    think about whether I need to be able to capture a seed and replay it.
    That's nearly always something I'd think of way down the line, as a
    "wouldn't it be nice if I could get the user to send me a reproducible
    test case" or something like that. And of course it's just a matter of
    switching the underlying RNG at that point.

    None of this is hard. But once again, I'm currently using the module
    correctly, as documented.
    This is actually exactly why Theo suggested using a modern, userland CSPRNG
    because it can generate random numbers faster than /dev/urandom can and,
    unless you need deterministic results, there's little downside to doing so.

    There are really two possible ideas here, depending on what sort of balance
    we'd want to strike. We can make a default "I don't want to think about it"
    implementation of random that is both *generally* secure and fast, however
    it won't be deterministic and you won't be able to explicitly seed it. This
    would be a backwards compatible change [1] for people who are simply calling
    these functions [2]:

    random.getrandbits
    random.randrange
    random.randint
    random.choice
    random.shuffle
    random.sample
    random.random
    random.uniform
    random.triangular
    random.betavariate
    random.expovariate
    random.gammavariate
    random.gauss
    random.lognormvariate
    random.normalvariate
    random.vonmisesvariate
    random.paretovariate
    random.weibullvariate

    If this were all that the top level functions in random.py provided we
    could simply replace the default and people wouldn't notice, they'd just
    automatically get safer randomness whether that's actually useful for their
    use case or not.

    However, random.py also has these functions:

    random.seed
    random.getstate
    random.setstate
    random.jumpahead

    and these functions are where the problem comes. These functions only
    really make sense for deterministic sources of random which are not "safe"
    for use in security sensitive applications. So pretending for a moment that
    we've already decided to do "something" about this, the question boils down
    to what do we do about these 4 functions. Either we can change the default
    to a secure CSPRNG and break these functions (and the people using them),
    which is however easily fixed by changing ``import random`` to
    ``import random; random = random.DeterministicRandom()``, or we can
    deprecate the top level functions and try to guide people to choose up
    front what kind of random they need. Either of these solutions will end up
    with people being safer and, if we pretend we've agreed to do "something",
    it comes down to whether we'd prefer breaking compatibility for some people
    while keeping a default random generator that is probably good enough for
    most people, or if we'd prefer to not break compatibility and try to push
    people to always decide what kind of random they want.

    +1 for deprecating module-level functions and putting everything into
    classes to force a choice
    +0 for deprecating the seed-related functions and saying "the stdlib uses
    what it uses as a RNG and you have to live with it if you don't make your
    own choice" and switching to a crypto-secure RNG.
    -0 leaving it as-is


    -Brett




  • Nick Coghlan at Sep 10, 2015 at 5:00 pm

    On 11 September 2015 at 02:05, Brett Cannon wrote:
    +1 for deprecating module-level functions and putting everything into
    classes to force a choice

    -1000, as this would be a *huge* regression in Python's usability for
    educational use cases. (Think 7-8 year olds that are still learning to
    read, not teenagers or adults with more fully developed vocabularies)


    A reasonable "Hello world!" equivalent for introducing randomness to
    students is rolling a 6-sided die, as that relates to a real world
    object they'll often be familiar with. At the moment that reads as
    follows:

    >>> from random import randint
    >>> randint(1, 6)
    6
    >>> randint(1, 6)
    3
    >>> randint(1, 6)
    1
    >>> randint(1, 6)
    4


    Another popular educational exercise is the "Guess a number" game,
    where the program chooses a random number from 1-100, and the person
    playing the game has to guess what it is. Again, randint() works fine
    here.


    Shuffling decks of cards, flipping coins, these are all things used to
    introduce learners to modelling random events in the real world in
    software, and we absolutely do *not* want to invalidate the extensive
    body of educational material that assumes the current module level API
    for the random module.
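
    As an illustration that these classroom exercises never touch seed(), here is
    a sketch of the same activities run against random.SystemRandom, used as a
    stand-in for the proposed CSPRNG default. The deck encoding and the variable
    names are invented for the example.

```python
import random

rng = random.SystemRandom()  # os.urandom()-backed; no seed to manage

die_roll = rng.randint(1, 6)           # roll a 6-sided die
secret_number = rng.randint(1, 100)    # "Guess a number" target
coin = rng.choice(["heads", "tails"])  # flip a coin

# Build a 52-card deck as rank+suit strings, then shuffle it in place.
deck = [rank + suit for suit in "SHDC" for rank in "A23456789TJQK"]
rng.shuffle(deck)

print(die_roll, secret_number, coin, deck[:5])
```

    From the learner's side the calls are unchanged; only the source of randomness
    behind them differs.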

    +0 for deprecating the seed-related functions and saying "the stdlib uses
    what it uses as a RNG and you have to live with it if you don't make your own
    choice" and switching to a crypto-secure RNG.

    However, this I'm +1 on. People *do* use the module level APIs
    inappropriately, and we can get them to a much safer place, while
    nudging folks that genuinely need deterministic randomness towards an
    alternative API.


    The key for me is that folks that actually *need* deterministic
    randomness *will* be calling the stateful module level APIs. This
    means we can put the deprecation warnings on *those* methods, and
    leave them out for the others.


    In terms of practical suggestions, rather than DeterministicRandom and
    NonDeterministicRandom, I'd actually go with the simpler terms
    SeededRandom and SeedlessRandom (there's a case to be made that those
    are misnomers, but I'll go into that more below):


    SeededRandom: Mersenne Twister
    SeedlessRandom: new CSPRNG
    SystemRandom: os.urandom()


    Phase one of transition:


    * add SeedlessRandom
    * rename Random to SeededRandom
    * Random becomes a subclass of SeededRandom that deprecates all
      methods not shared with SeedlessRandom
    * this will also effectively deprecate the corresponding module level functions
    * any SystemRandom methods that are no-ops (like seed()) are deprecated


    Phase two of transition:


    * Random becomes an alias for SeedlessRandom
    * deprecated methods are removed from SystemRandom
    * deprecated module level functions are removed


    As far as the proposed Seeded/Seedless naming goes, that deliberately
    glosses over the fact that "seed" gets used to refer to two different
    things - seeding a PRNG with entropy, and seeding a deterministic PRNG
    with a particular seed value. The key is that "SeedlessRandom" won't
    have a "seed()" *method*, and that's the single most salient fact
    about it from a user experience perspective: you can't get the same
    output by providing the same seed value, because we wouldn't let you
    provide a seed value at all.
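
    One possible shape for the proposed classes, as a sketch only: SeededRandom and
    SeedlessRandom are the names suggested above and do not exist in the stdlib.
    random.SystemRandom is used as a stand-in backend because no userland CSPRNG is
    available, and the set of forwarded methods is an arbitrary small subset.

```python
import random


class SeededRandom(random.Random):
    """Mersenne Twister: seed(), getstate() and setstate() all work."""


class SeedlessRandom:
    """Stand-in for the proposed CSPRNG-backed class.

    It wraps a generator rather than subclassing random.Random, so there is
    no seed(), getstate() or setstate() attribute at all.
    """

    def __init__(self):
        self._rng = random.SystemRandom()

    def random(self):
        return self._rng.random()

    def randint(self, a, b):
        return self._rng.randint(a, b)

    def choice(self, seq):
        return self._rng.choice(seq)

    def shuffle(self, items):
        self._rng.shuffle(items)


replayable = SeededRandom(42).random() == SeededRandom(42).random()
seedless = SeedlessRandom()
print(replayable, hasattr(seedless, "seed"), seedless.randint(1, 6))
```

    Composition is used here because random.Random calls self.seed() from its
    constructor, so a subclass that simply made seed() raise could not be created.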


    Regards,
    Nick.


    --
    Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
  • Chris Angelico at Sep 10, 2015 at 5:11 pm

    On Fri, Sep 11, 2015 at 3:00 AM, Nick Coghlan wrote:
    As far as the proposed Seeded/Seedless naming goes, that deliberately
    glosses over the fact that "seed" gets used to refer to two different
    things - seeding a PRNG with entropy, and seeding a deterministic PRNG
    with a particular seed value. The key is that "SeedlessRandom" won't
    have a "seed()" *method*, and that's the single most salient fact
    about it from a user experience perspective: you can't get the same
    output by providing the same seed value, because we wouldn't let you
    provide a seed value at all.

    Aside from sounding like varieties of grapes in a grocery, those names
    seem just fine. From the POV of someone with a bit of comprehension of
    crypto (as in, "use /dev/urandom rather than a PRNG", but not enough
    knowledge to actually build or verify these things), the distinction
    is precise: with SeededRandom, I can give it a seed and get back a
    predictable sequence of numbers, but with SeedlessRandom, I can't. I'm
    not sure what the difference is between "seeding a PRNG with entropy"
    and "seeding a deterministic PRNG with a particular seed value",
    though; aside from the fact that one of them uses a known value and
    the other doesn't, of course. Back in my BASIC programming days, we
    used to use "RANDOMIZE TIMER" to seed the RNG with time-of-day, or
    "RANDOMIZE 12345" (or other value) to seed with a particular value;
    they're the same operation, but one's considered random and the
    other's considered predictable. (Of course, bytes from /dev/urandom
    will be a lot more entropic than "number of centiseconds since
    midnight", but for a single-player game that wants to provide a
    different starting layout every time you play, the latter is
    sufficient.)


    ChrisA
  • Nick Coghlan at Sep 10, 2015 at 5:27 pm

    On 11 September 2015 at 03:11, Chris Angelico wrote:
    On Fri, Sep 11, 2015 at 3:00 AM, Nick Coghlan wrote:
    As far as the proposed Seeded/Seedless naming goes, that deliberately
    glosses over the fact that "seed" gets used to refer to two different
    things - seeding a PRNG with entropy, and seeding a deterministic PRNG
    with a particular seed value. The key is that "SeedlessRandom" won't
    have a "seed()" *method*, and that's the single most salient fact
    about it from a user experience perspective: you can't get the same
    output by providing the same seed value, because we wouldn't let you
    provide a seed value at all.
    Aside from sounding like varieties of grapes in a grocery, those names
    seem just fine. From the POV of someone with a bit of comprehension of
    crypto (as in, "use /dev/urandom rather than a PRNG", but not enough
    knowledge to actually build or verify these things), the distinction
    is precise: with SeededRandom, I can give it a seed and get back a
    predictable sequence of numbers, but with SeedlessRandom, I can't. I'm
    not sure what the difference is between "seeding a PRNG with entropy"
    and "seeding a deterministic PRNG with a particular seed value",
    though; aside from the fact that one of them uses a known value and
    the other doesn't, of course.

    Actually, that was just a mistake on my part - they're really the same
    thing, and the only distinction is the one you mention: setting the
    seed to a known value. Thus the main seed-related difference between
    something like arc4random and other random APIs is the same one I'm
    proposing to make here: it's seedless at the API level because it
    takes care of collecting its own initial entropy from the operating
    system's random number API.


    Regards,
    Nick.


    --
    Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
  • Greg Ewing at Sep 11, 2015 at 6:19 am

    Chris Angelico wrote:
    I'm
    not sure what the difference is between "seeding a PRNG with entropy"
    and "seeding a deterministic PRNG with a particular seed value",
    though; aside from the fact that one of them uses a known value and
    the other doesn't, of course. Back in my BASIC programming days, we
    used to use "RANDOMIZE TIMER" to seed the RNG with time-of-day, or
    "RANDOMIZE 12345" (or other value) to seed with a particular value;

    I think the only other difference is that the Linux kernel
    is continually re-seeding its generator whenever more
    unpredictable bits become available. It's not something
    you need to explicitly do yourself, as in your BASIC
    example.


    --
    Greg
  • Donald Stufft at Sep 10, 2015 at 5:02 pm

    On September 10, 2015 at 10:21:11 AM, Donald Stufft (donald at stufft.io) wrote:
    Assuming Theo's claim of the speed of the ChaCha based arc4random function
    is accurate, which I haven't tested but I assume he's smart enough to know
    what he's talking about WRT the speed of it.

    I wanted to try and test this. These are not super scientific since I just ran
    them on a single computer once (but 10 million iterations each) but I think it
    can probably give us an indication of the differences?


    I put the code up at https://github.com/dstufft/randtest but it's a pretty
    simple module. I'm not sure if (double)arc4random() / UINT_MAX is a reasonable
    way to get a double out of arc4random (which returns a uint) that is between
    0.0 and 1.0, but I assume it's fine at least for this test.
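
    For comparison, a sketch of two ways to turn 32-bit words into a float. The
    arc4random() function below is a hypothetical stand-in built on
    random.SystemRandom().getrandbits(32), not a binding to the real C function.
    The two-word version follows the 53-bit construction used by Python's own MT
    wrapper.

```python
import random

_source = random.SystemRandom()
UINT_MAX = 2**32 - 1


def arc4random():
    """Hypothetical stand-in: a uniformly random 32-bit unsigned integer."""
    return _source.getrandbits(32)


def double_from_one_word():
    # Mirrors (double)arc4random() / UINT_MAX: at most 2**32 distinct values,
    # and the result is exactly 1.0 when the word equals UINT_MAX.
    return arc4random() / UINT_MAX


def double_from_two_words():
    # 27 bits from one word and 26 from another give 53 random bits,
    # scaled into the half-open interval [0.0, 1.0).
    a = arc4random() >> 5
    b = arc4random() >> 6
    return (a * 67108864 + b) / 9007199254740992


print(UINT_MAX / UINT_MAX)  # the one-word scheme can reach 1.0
print(double_from_one_word(), double_from_two_words())
```

    For a speed test the one-word version is likely fine, but it has a closed upper
    bound and far less resolution than a double can hold.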


    Here's the results from running the test on my personal computer which is
    running the OSX El Capitan public Beta:


        $ python test.py
        Number of Calls: 10000000
        +---------------+--------------------+
        | method        | usecs per call     |
        +---------------+--------------------+
        | deterministic | 0.0586802460020408 |
        | system        | 1.6681434757076203 |
        | userland      | 0.1534261149005033 |
        +---------------+--------------------+




    I'll try it against OpenBSD later to see if their implementation of arc4random
    is faster than OSX.
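
For readers without the randtest module, a rough stdlib-only sketch of a comparable measurement. It is not the randtest code: it times `random.random` (Mersenne Twister) against `random.SystemRandom().random` (backed by `os.urandom`), and it has no "userland" row because the standard library has no ChaCha-based generator. The helper name `usecs_per_call` and the iteration count are illustrative choices.

```python
import random
import timeit


def usecs_per_call(func, number=100_000):
    """Return the average time of one call to func, in microseconds."""
    seconds = timeit.timeit(func, number=number)
    return seconds / number * 1e6


results = {
    "deterministic": usecs_per_call(random.random),
    "system": usecs_per_call(random.SystemRandom().random),
}

for name, usecs in results.items():
    print(f"{name:<13} | {usecs:.4f}")
```

Absolute numbers will vary by machine and Python version; only the ratio between the two rows is of interest.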


    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Tim Peters at Sep 10, 2015 at 5:23 pm
    [Donald Stufft <donald@stufft.io>, on arc4random speed]
    I wanted to try and test this. These are not super scientific since I just ran
    them on a single computer once (but 10 million iterations each) but I think it
    can probably give us an indication of the differences?

    I put the code up at https://github.com/dstufft/randtest but it's a pretty
    simple module. I'm not sure if (double)arc4random() / UINT_MAX is a reasonable
    way to get a double out of arc4random (which returns a uint) that is between
    0.0 and 1.0, but I assume it's fine at least for this test.

    arc4random() specifically returns uint32_t, which is 21 bits shy of
    what's needed to generate a reasonable random double. Our MT wrapping
    internally generates two 32-bit uint32_t thingies, and pastes them
    together like so (Python's C code here):


    """
    /* random_random is the function named genrand_res53 in the original code;
      * generates a random number on [0,1) with 53-bit resolution; note that
      * 9007199254740992 == 2**53; I assume they're spelling "/2**53" as
      * multiply-by-reciprocal in the (likely vain) hope that the compiler will
      * optimize the division away at compile-time. 67108864 is 2**26. In
      * effect, a contains 27 random bits shifted left 26, and b fills in the
      * lower 26 bits of the 53-bit numerator.
      * The original code credited Isaku Wada for this algorithm, 2002/01/09.
      */
    static PyObject *
    random_random(RandomObject *self)
    {
         PY_UINT32_T a=genrand_int32(self)>>5, b=genrand_int32(self)>>6;
         return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));
    }
    """


    So now you know how to make it more directly comparable. The
    high-order bit is that it requires 2 calls to the 32-bit uint integer
    primitive to get a double, and that can indeed be significant.
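
The same construction in Python, as a sketch: `float53` mirrors the arithmetic of `random_random` above, and `urandom_uint32` is a hypothetical stand-in for any 32-bit primitive such as arc4random().

```python
import os


def urandom_uint32():
    """Hypothetical 32-bit primitive, standing in for arc4random()."""
    return int.from_bytes(os.urandom(4), "big")


def float53(a32, b32):
    """Combine two 32-bit integers into a float in [0.0, 1.0).

    The top 27 bits of a32 and the top 26 bits of b32 form a 53-bit
    numerator, which is then divided by 2**53.
    """
    a = a32 >> 5   # keep 27 bits
    b = b32 >> 6   # keep 26 bits
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)


x = float53(urandom_uint32(), urandom_uint32())
```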



    Here's the results from running the test on my personal computer which is
    running the OSX El Capitan public Beta:

    $ python test.py
    Number of Calls: 10000000
    +---------------+--------------------+
    | method        | usecs per call     |
    +---------------+--------------------+
    | deterministic | 0.0586802460020408 |
    | system        | 1.6681434757076203 |
    | userland      | 0.1534261149005033 |
    +---------------+--------------------+


    I'll try it against OpenBSD later to see if their implementation of arc4random
    is faster than OSX.

    Just noting that most people timing the OpenBSD version seem to
    comment out the "get stuff from the kernel periodically" part first,
    in order to time the algorithm instead of the kernel ;-) In real
    life, though, they both count, so I like what you're doing better.
  • Donald Stufft at Sep 10, 2015 at 6:50 pm

    On September 10, 2015 at 1:24:05 PM, Tim Peters (tim.peters at gmail.com) wrote:
    So now you know how to make it more directly comparable. The
    high-order bit is that it requires 2 calls to the 32-bit uint integer
    primitive to get a double, and that can indeed be significant.

    It didn't really change the results though:


    My OSX El Capitan machine:


    Number of Calls: 10000000
    +---------------+---------------------+
    | method        | usecs per call      |
    +---------------+---------------------+
    | deterministic | 0.05792283279588446 |
    | system        | 1.7192466521984897  |
    | userland      | 0.17901834140066059 |
    +---------------+---------------------+




    An OpenBSD 5.7 VM:


    Number of Calls: 10000000
    +---------------+---------------------+
    | method        | usecs per call      |
    +---------------+---------------------+
    | deterministic | 0.06555143180000868 |
    | system        | 0.8929547749999983  |
    | userland      | 0.16291017429998647 |
    +---------------+---------------------+






    -----------------
    Donald Stufft
    PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
  • Andrew Barnert at Sep 10, 2015 at 10:46 pm

    On Sep 10, 2015, at 07:21, Donald Stufft wrote:
    Either we can change the default to a secure
    CSPRNG and break these functions (and the people using them) which is however
    easily fixed by changing ``import random`` to
    ``import random; random = random.DeterministicRandom()``

    But that isn't a fix, unless all your code is in a single module. If I call random.seed in game.py and then call random.choice in aiplayer.py, I'll get different results after your fix than I did before.


    What I'd need to do instead is create a separate myrandom.py that does this and then exports all of the bound methods of random as top-level functions, and then make game.py, aiplayer.py, etc. all import myrandom as random. Which is, while not exactly hard, certainly harder, and much less obvious, than the incorrect fix that you've suggested, and it may not be immediately obvious that it's wrong until someone files a bug three versions later claiming that when he reloads a game the AI cheats and you have to track through the problem.


    That's why I suggested the set_default_instance function, which makes this problem trivial to solve in a correct way instead of in an incorrect way.
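
A sketch of the `myrandom.py` shim described above. `DeterministicRandom` is only proposed in this thread and does not exist, so the sketch uses `random.Random` as a stand-in; the selection of exported names is illustrative.

```python
# myrandom.py (hypothetical shared module)
import random as _random

# One process-wide instance, so that seed() in game.py affects choice() in
# aiplayer.py. random.Random stands in for the proposed DeterministicRandom.
_inst = _random.Random()

# Re-export bound methods as top-level functions, mirroring the random module.
seed = _inst.seed
random = _inst.random
randint = _inst.randint
choice = _inst.choice
shuffle = _inst.shuffle
getstate = _inst.getstate
setstate = _inst.setstate
```

Each module would then use `import myrandom as random`, so all of them share a single seeded state.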
  • Andrew Barnert at Sep 10, 2015 at 10:54 pm

    On Sep 10, 2015, at 15:46, Andrew Barnert via Python-ideas wrote:
    On Sep 10, 2015, at 07:21, Donald Stufft wrote:

    Either we can change the default to a secure
    CSPRNG and break these functions (and the people using them) which is however
    easily fixed by changing ``import random`` to
    ``import random; random = random.DeterministicRandom()``
    But that isn't a fix, unless all your code is in a single module. If I call random.seed in game.py and then call random.choice in aiplayer.py, I'll get different results after your fix than I did before.

    What I'd need to do instead is create a separate myrandom.py that does this and then exports all of the bound methods of random as top-level functions, and then make game.py, aiplayer.py, etc. all import myrandom as random. Which is, while not exactly hard, certainly harder, and much less obvious, than the incorrect fix that you've suggested, and it may not be immediately obvious that it's wrong until someone files a bug three versions later claiming that when he reloads a game the AI cheats and you have to track through the problem.

    That's why I suggested the set_default_instance function, which makes this problem trivial to solve in a correct way instead of in an incorrect way.

    Actually, I just thought of an even simpler solution:


    Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".
  • Nick Coghlan at Sep 11, 2015 at 2:48 am

    On 11 September 2015 at 08:54, Andrew Barnert via Python-ideas wrote:
    Actually, I just thought of an even simpler solution:

    Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".

    Change the spelling to "import random.seeded_random as random" and the
    user fix is even shorter.


    I do agree with the idea of continuing to provide a process global
    instance of the current PRNG for ease of migration - changing a single
    import is a good way to be able to address a deprecation, and looking
    for the use of seeded_random in a security sensitive context would
    still be fairly straightforward.


    Cheers,
    Nick.


    --
    Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
  • Andrew Barnert at Sep 11, 2015 at 3:18 am

    On Sep 10, 2015, at 19:48, Nick Coghlan wrote:
    On 11 September 2015 at 08:54, Andrew Barnert via Python-ideas
    wrote:
    Actually, I just thought of an even simpler solution:

    Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".
    Change the spelling to "import random.seeded_random as random" and the
    user fix is even shorter.

    OK, sure; I don't care much about the spelling. I think neither name will be unduly confusing to novices, and anyone who actually wants to understand what the choice means will use help or the docs or a Google search and find out in a few seconds.

    I do agree with the idea of continuing to provide a process global
    instance of the current PRNG for ease of migration - changing a single
    import is a good way to be able to address a deprecation, and looking
    for the use of seeded_random in a security sensitive context would
    still be fairly straightforward.

    Personally, I think we're done with that change. Deprecation of the names random.Random, random.random(), etc. is sufficient to prevent people from making mistakes without realizing it. Having a good workaround to prevent code churn for the thousands of affected apps means the cost doesn't outweigh the benefits. So, the problem Theo raised is solved.[1] Which means the more radical solution he offered is unnecessary. Unless we're seriously worried that some people who aren't sure if they need Seeded or System may incorrectly choose Seeded just because of performance, there's no need to add a Chacha choice alongside them. Put it on PyPI, maybe with a link from the SystemRandom docs, and see how things go from there.


    [1] Well, it's not quite solved, because someone has to figure out how to organize things in the docs, which obviously need to change. Do we tell people how to choose between creating a SeededRandom or SystemRandom instance, then describe their interface, and then include a brief note "... but for porting old code, or when you explicitly need a globally shared Seeded instance, use seeded_random"? Or do we present all three as equally valid choices, and try to explain why you might want the singleton seeded_random vs. constructing and managing an instance or instances?
  • Nick Coghlan at Sep 11, 2015 at 3:38 am

    On 11 September 2015 at 13:18, Andrew Barnert wrote:
    Personally, I think we're done with that change. Deprecation of the names random.Random, random.random(), etc. is sufficient to prevent people from making mistakes without realizing it.

    Implementing dice rolling or number guessing for a game as "from
    random import randint" is *not* a mistake, and I'm adamantly opposed
    to any proposal that makes it one - the cost imposed on educational
    use cases would be far too high.


    Regards,
    Nick.


    --
    Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
  • Stephen J. Turnbull at Sep 11, 2015 at 4:44 am
    Nick Coghlan writes:

    Implementing dice rolling or number guessing for a game as "from
    random import randint" is *not* a mistake,

    Turning the number guessing game into a text CAPTCHA might be one,
    though. That randint may as well be crypto strong, modulo the problem
    that people who use an explicit seed get punished for knowing what
    they're doing.


    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?


    Steve
  • Chris Angelico at Sep 11, 2015 at 4:54 am

    On Fri, Sep 11, 2015 at 2:44 PM, Stephen J. Turnbull wrote:
    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?

    Ooh. Actually, I rather like that idea. If you don't seed the RNG, its
    output will be unpredictable; it doesn't matter whether it's a PRNG
    seeded by an unknown number, a PRNG seeded by /dev/urandom, a CSRNG,
    or just reading from /dev/urandom every time. Until you explicitly
    request determinism, you don't have it. If Python changes its RNG
    algorithm and you haven't been seeding it, would you even know? Could
    it ever matter to you?


    It would require a bit of an internals change; is it possible that
    code depends on random.seed and random.randint being bound methods of
    the same object? To implement what you describe, they'd probably have
    to not be.


    ChrisA
  • Xavier Combelle at Sep 11, 2015 at 6:34 am

    2015-09-11 6:54 GMT+02:00 Chris Angelico <rosuav@gmail.com>:

    On Fri, Sep 11, 2015 at 2:44 PM, Stephen J. Turnbull wrote:
    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?
    Ooh. Actually, I rather like that idea. If you don't seed the RNG, its
    output will be unpredictable; it doesn't matter whether it's a PRNG
    seeded by an unknown number, a PRNG seeded by /dev/urandom, a CSRNG,
    or just reading from /dev/urandom every time. Until you explicitly
    request determinism, you don't have it. If Python changes its RNG
    algorithm and you haven't been seeding it, would you even know? Could
    it ever matter to you?

    It would require a bit of an internals change; is it possible that
    code depends on random.seed and random.randint being bound methods of
    the same object? To implement what you describe, they'd probably have
    to not be.

    ChrisA

    I have thought of this idea and was quite seduced by it. However in this
    case on a non seeded generator, getstate/setstate would be meaningless. I
    also wonder what pickling generators does.
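
For reference, a sketch of how the current classes behave, which is what a switchable default would have to reconcile: `random.Random` instances expose and pickle their state, while `random.SystemRandom` has no state to return and raises `NotImplementedError` from `getstate()`.

```python
import pickle
import random

# A seeded Mersenne Twister instance round-trips through pickle, and the
# copy continues the same sequence.
r = random.Random(2015)
r_copy = pickle.loads(pickle.dumps(r))
assert r.random() == r_copy.random()

# SystemRandom draws from os.urandom and keeps no state of its own.
try:
    random.SystemRandom().getstate()
    system_has_state = True
except NotImplementedError:
    system_has_state = False
```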
  • Petr Viktorin at Sep 11, 2015 at 8:08 am

    On Fri, Sep 11, 2015 at 6:54 AM, Chris Angelico wrote:
    On Fri, Sep 11, 2015 at 2:44 PM, Stephen J. Turnbull wrote:
    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?
    Ooh. Actually, I rather like that idea. If you don't seed the RNG, its
    output will be unpredictable; it doesn't matter whether it's a PRNG
    seeded by an unknown number, a PRNG seeded by /dev/urandom, a CSRNG,
    or just reading from /dev/urandom every time. Until you explicitly
    request determinism, you don't have it. If Python changes its RNG
    algorithm and you haven't been seeding it, would you even know? Could
    it ever matter to you?

    It would require a bit of an internals change; is it possible that
    code depends on random.seed and random.randint being bound methods of
    the same object? To implement what you describe, they'd probably have
    to not be.

    I've also thought about this idea. The problem with it is that seed()
    and friends affect a global instance of Random.
    If, after this change, there was a library that used random.random()
    for crypto, calling seed() in the main program (or any other library)
    would make it insecure. So we'd still be in a situation where nobody
    should use random() for crypto.
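
The hazard exists with today's module and can be shown directly. In this sketch `make_token` is a hypothetical library function that (wrongly) relies on the module-level generator; a `seed()` call elsewhere in the process makes its output repeatable.

```python
import random


def make_token():
    """Hypothetical library helper that relies on the global generator."""
    return "%032x" % random.getrandbits(128)


# Somewhere else in the program (or in another library):
random.seed(1234)
first = make_token()

random.seed(1234)
second = make_token()

# The "secret" token is fully determined by the seed.
tokens_repeat = first == second
```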
  • Chris Angelico at Sep 11, 2015 at 8:57 am

    On Fri, Sep 11, 2015 at 6:08 PM, Petr Viktorin wrote:
    I've also thought about this idea. The problem with it is that seed()
    and friends affect a global instance of Random.
    If, after this change, there was a library that used random.random()
    for crypto, calling seed() in the main program (or any other library)
    would make it insecure. So we'd still be in a situation where nobody
    should use random() for crypto.

    So library functions shouldn't use random.random() for anything they
    know needs security. If you write a function generate_password(), the
    responsibility is yours to ensure that it's entropic rather than
    deterministic. That's no different from the current situation (seeding
    the RNG makes it deterministic) except that the unseeded RNG is not
    just harder to predict, it's actually entropic.


    In some cases, having the 99% by default is a barrier to people who
    need the 100%. (Conflating UCS-2 with Unicode deceives people into
    thinking their program works just fine, and then it fails on astral
    characters.) But in this case, there's no perfect-by-default solution,
    so IMO the best two solutions are: Be great, but vulnerable to an
    external seed(), until someone chooses; or have no random number
    generation until someone chooses. We know that the latter is a
    terrible option for learning, so vulnerability to someone else calling
    random.seed() is a small price to pay.


    ChrisA
  • Random832 at Sep 11, 2015 at 12:42 pm

    On Fri, Sep 11, 2015, at 00:54, Chris Angelico wrote:
    It would require a bit of an internals change; is it possible that
    code depends on random.seed and random.randint being bound methods of
    the same object?

    That's a ridiculous thing to depend on.

    To implement what you describe, they'd probably have
    to not be.

    You could implement one class that calls either a SystemRandom instance
    or an instance of another class depending on which mode it is in.
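
A sketch of such a class, under assumptions: the name `SwitchingRandom` is hypothetical, only a few methods are forwarded, and `random.Random` plays the role of the traditional PRNG. Before `seed()` is called it delegates to `SystemRandom`; afterwards it delegates to a seeded instance.

```python
import random


class SwitchingRandom:
    """Hypothetical facade: OS entropy by default, seeded PRNG on request."""

    def __init__(self):
        self._impl = random.SystemRandom()

    def seed(self, value):
        # An explicit seed is a request for determinism, so switch to the
        # traditional generator from here on.
        self._impl = random.Random(value)

    def random(self):
        return self._impl.random()

    def randint(self, a, b):
        return self._impl.randint(a, b)

    def choice(self, seq):
        return self._impl.choice(seq)


rng = SwitchingRandom()
unseeded_roll = rng.randint(1, 6)      # backed by SystemRandom

rng.seed(42)
first_run = [rng.randint(1, 6) for _ in range(5)]
rng.seed(42)
second_run = [rng.randint(1, 6) for _ in range(5)]
```

Here `seed` and `randint` are methods of the same facade object, so module-level functions built on it could remain bound methods of one instance.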
  • Paul Moore at Sep 11, 2015 at 8:02 am

    On 11 September 2015 at 05:44, Stephen J. Turnbull wrote:
    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?

    One issue with that - often, programs simply use a RNG for their own
    purposes, but offer a means of getting the seed after the fact for
    reproducibility reasons (the "map seed" case, for example).


    Pseudo-code:


         if <user supplied a "seed">:
             state = <user-supplied value>
             random.setstate(state)
         else:
             state = random.getstate()
         ... do the program's main job, never calling seed/setstate
         if <user requests the "seed">:
             print state


    So getstate (and setstate) would also need to switch to a PRNG.
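
A runnable sketch of the pseudo-code, with assumptions: `generate` is a hypothetical stand-in for the program's main job, and a private `random.Random` instance is used instead of the module-level functions so the example does not disturb global state.

```python
import random


def generate(state=None):
    """Hypothetical main job: returns (state used, generated values)."""
    rng = random.Random()
    if state is not None:
        rng.setstate(state)          # user supplied a "seed"
    else:
        state = rng.getstate()       # remember where this run started
    values = [rng.randint(0, 99) for _ in range(10)]
    return state, values


saved_state, original = generate()
_, replayed = generate(saved_state)
```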


    There are actually very few cases I can think of where I'd need seed()
    (as opposed to setstate()). Maybe if I let the user *choose* a seed;
    some games do this.


    Paul
  • Nathaniel Smith at Sep 11, 2015 at 9:52 am

    On Fri, Sep 11, 2015 at 1:02 AM, Paul Moore wrote:
    On 11 September 2015 at 05:44, Stephen J. Turnbull wrote:
    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?
    One issue with that - often, programs simply use a RNG for their own
    purposes, but offer a means of getting the seed after the fact for
    reproducibility reasons (the "map seed" case, for example).

    Pseudo-code:

         if <user supplied a "seed">:
             state = <user-supplied value>
             random.setstate(state)
         else:
             state = random.getstate()
         ... do the program's main job, never calling seed/setstate
         if <user requests the "seed">:
             print state

    So getstate (and setstate) would also need to switch to a PRNG.

    There are actually very few cases I can think of where I'd need seed()
    (as opposed to setstate()). Maybe if I let the user *choose* a seed;
    some games do this.

    You don't really want to use the full 4992 byte state for a "map seed"
    application anyway (type 'random.getstate()' in a REPL and watch your
    terminal scroll down multiple pages...). No game actually uses map
    seeds that look anything like that. I'm 99% sure that real
    applications in this category are actually using logic like:


    if <user supplied a "seed">:
         seed = user_seed()
    else:
         # use some RNG that was seeded with real entropy
         seed = random_short_printable_string()
    r = random.Random(seed)
    # now use 'r' to generate the map
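
Filled in as a runnable sketch: `make_map`, the tile alphabet, and the seed format (8 hex characters from `secrets.token_hex`) are illustrative assumptions, not taken from any particular game.

```python
import random
import secrets


def make_map(user_seed=None, size=8):
    """Return (seed, map); the same seed always yields the same map."""
    if user_seed is not None:
        seed = user_seed
    else:
        # Short printable seed drawn from the OS entropy source.
        seed = secrets.token_hex(4)
    r = random.Random(seed)
    tiles = [[r.choice(".#~") for _ in range(size)] for _ in range(size)]
    return seed, tiles


seed, world = make_map()
_, same_world = make_map(seed)
```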


    -n


    --
    Nathaniel J. Smith -- http://vorpus.org
  • Paul Moore at Sep 11, 2015 at 10:03 am

    On 11 September 2015 at 10:52, Nathaniel Smith wrote:
    You don't really want to use the full 4992 byte state for a "map seed"
    application anyway (type 'random.getstate()' in a REPL and watch your
    terminal scroll down multiple pages...). No game actually uses map
    seeds that look anything like that. I'm 99% sure that real
    applications in this category are actually using logic like:

    if <user supplied a "seed">:
        seed = user_seed()
    else:
        # use some RNG that was seeded with real entropy
        seed = random_short_printable_string()
    r = random.Random(seed)
    # now use 'r' to generate the map

    Yeah, good point. As I say, I don't actually *use* this in the example
    program I'm thinking of, I just know it's a feature I need to add in
    due course. So when I do, I'll have to look into how to best implement
    it. (And I'll probably nick the approach you show above, thanks ;-))


    Paul
  • Andrew Barnert at Sep 11, 2015 at 10:07 am

    On Sep 11, 2015, at 02:52, Nathaniel Smith wrote:
    On Fri, Sep 11, 2015 at 1:02 AM, Paul Moore wrote:
    On 11 September 2015 at 05:44, Stephen J. Turnbull wrote:
    I suppose it would be too magic to have the seed method substitute the
    traditional PRNG for the default, while an implicitly seeded RNG
    defaults to a crypto strong algorithm?
    One issue with that - often, programs simply use a RNG for their own
    purposes, but offer a means of getting the seed after the fact for
    reproducibility reasons (the "map seed" case, for example).

    Pseudo-code:

         if <user supplied a "seed">:
             state = <user-supplied value>
             random.setstate(state)
         else:
             state = random.getstate()
         ... do the program's main job, never calling seed/setstate
         if <user requests the "seed">:
             print state

    So getstate (and setstate) would also need to switch to a PRNG.

    There are actually very few cases I can think of where I'd need seed()
    (as opposed to setstate()). Maybe if I let the user *choose* a seed;
    some games do this.
    You don't really want to use the full 4992 byte state for a "map seed"
    application anyway (type 'random.getstate()' in a REPL and watch your
    terminal scroll down multiple pages...). No game actually uses map
    seeds that look anything like that.

    But games do store the entire map state with saved games if they want repeatable saves (e.g., to prevent players from defeating the RNG by save scumming).
  • Paul Moore at Sep 11, 2015 at 10:10 am

    On 11 September 2015 at 11:07, Andrew Barnert wrote:
    But games do store the entire map state with saved games if they want repeatable saves (e.g., to prevent players from defeating the RNG by save scumming).

    So far off-topic it's not true, but a number of games I know of (e.g.,
    Factorio, Minecraft) include a means to get a map seed (a simple text
    string) which you can publish, that allows other users to (in effect)
    play on the same map as you. That's different from saves.


    Paul
  • Random832 at Sep 11, 2015 at 12:58 pm

    On Fri, Sep 11, 2015, at 06:10, Paul Moore wrote:
    On 11 September 2015 at 11:07, Andrew Barnert wrote:
    But games do store the entire map state with saved games if they want repeatable saves (e.g., to prevent players from defeating the RNG by save scumming).
    So far off-topic it's not true, but a number of games I know of (e.g.,
    Factorio, Minecraft) include a means to get a map seed (a simple text
    string) which you can publish, that allows other users to (in effect)
    play on the same map as you. That's different from saves.

    Of course, Minecraft doesn't actually use the seed in such a simple way
    as seeding a single-sequence random number generator. If it did, the map
    would depend on what order you visited regions in. (This is less of an
    issue for games with finite worlds)
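
One way to get that order independence, shown as a sketch and not as Minecraft's actual algorithm: derive a separate generator for each region from the world seed and the region's coordinates. The names `region_rng` and `region_height` and the SHA-256 derivation are assumptions made for the example.

```python
import hashlib
import random


def region_rng(world_seed, x, z):
    """Return a generator that depends only on the seed and coordinates."""
    digest = hashlib.sha256(f"{world_seed}:{x}:{z}".encode()).digest()
    return random.Random(int.from_bytes(digest, "big"))


def region_height(world_seed, x, z):
    """Hypothetical per-region terrain value."""
    return region_rng(world_seed, x, z).randint(0, 255)


# Visiting regions in a different order produces the same world.
forward = {(x, 0): region_height("my seed", x, 0) for x in range(4)}
backward = {(x, 0): region_height("my seed", x, 0) for x in reversed(range(4))}
```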
