Hello,

I've recently been busy improving a query that yields a fixed number of
random records matching certain conditions. I have tried all the usual
approaches, and although they do work, they're all limited in some way
and don't translate well to what you actually want. They're kludges, IMHO.

The methods I've tried are explained quite well on
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html
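
The canonical example - and what I started from - being:

SELECT *
FROM table
WHERE condition = true
ORDER BY random()
LIMIT 5;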

All these methods involve calculating a random number for every record
in the result set at some point in time, which is really not what I'm
trying to model. I think the database should provide some means to get
those records, so...

Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works,
but for other - non-sequential - algorithms.

I was thinking along the lines of:

SELECT *
FROM table
WHERE condition = true
RANDOM 5;

Which would return (up to) 5 random rows from the result set, just as
LIMIT 5 returns (up to) the first 5 records in the result set.


Or maybe even with a custom function, so that you could get non-linear
distributions:

SELECT *
FROM table
WHERE condition = true
LIMIT 5 USING my_func();

Where my_func() could be a user-definable function accepting a number
that should be (an estimate of?) the number of results being returned,
so that it can indicate which rows in the result set will be returned
from the query.

Examples:
* random(maxrows) would return random rows from the result set.
* median() would return the rows in the middle of the result set (this
would require ordering to be meaningful).

What do people think, is this feasible? Desirable? Necessary?

If I had time I'd volunteer to at least look into this, but I'm
already working on three projects simultaneously. Alas...

Regards,
Alban Hertroys.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede

// Integrate Your World //

  • Josh Tolley at Aug 31, 2007 at 1:33 pm

    On 8/31/07, Alban Hertroys wrote:
    Hello,

    I've recently been busy improving a query that yields a fixed number of
    random records matching certain conditions. I have tried all the usual
    approaches, and although they do work, they're all limited in some way
and don't translate well to what you actually want. They're kludges, IMHO.

    The methods I've tried are explained quite well on
    http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html

    All these methods involve calculating a random number for every record
    in the result set at some point in time, which is really not what I'm
    trying to model. I think the database should provide some means to get
    those records, so...

    Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works,
    but for other - non-sequential - algorithms.

    I was thinking along the lines of:

    SELECT *
    FROM table
    WHERE condition = true
    RANDOM 5;

Which would return (up to) 5 random rows from the result set, just as
LIMIT 5 returns (up to) the first 5 records in the result set.


    Or maybe even with a custom function, so that you could get non-linear
    distributions:

    SELECT *
    FROM table
    WHERE condition = true
    LIMIT 5 USING my_func();

Where my_func() could be a user-definable function accepting a number
that should be (an estimate of?) the number of results being returned,
so that it can indicate which rows in the result set will be returned
from the query.

    Examples:
* random(maxrows) would return random rows from the result set.
    * median() would return the rows in the middle of the result set (this
    would require ordering to be meaningful).

What do people think, is this feasible? Desirable? Necessary?

If I had time I'd volunteer to at least look into this, but I'm
already working on three projects simultaneously. Alas...

    Regards,
    Alban Hertroys.

    --
    Alban Hertroys
    alban@magproductions.nl

    magproductions b.v.

    T: ++31(0)534346874
    F: ++31(0)534346876
    M:
    I: www.magproductions.nl
    A: Postbus 416
    7500 AK Enschede

    // Integrate Your World //

    It seems to me that anything that wants to return a random set of rows
    will need to calculate a random number for all the rows it processes,
    unless you change how the database scans rows in indexes or tables,
    which if at all possible will probably make things *really* slow. If
    it's a given that the database will always sequentially scan whatever
    it is the query plan tells it to scan, you're pretty much stuck with
    the rows in your result set being in the same order unless you start
    picking random numbers.

One possible alternative, not mentioned on the site you linked to, is
as follows:

    select [whatever] from [table] where random() < [some number between 0
    and 1] limit [limit value]

    That doesn't require assigning a random number for *every* row in the
    table, nor does it require sorting everything. It does mean that
    numbers encountered earlier in the query processing have a higher
    likelihood of being returned, and it also means that there's some
    chance you won't actually get as many as [limit value] rows returned.

    jtolley=# create table a (i integer);
    CREATE TABLE
    jtolley=# insert into a (i) select * from generate_series(1, 100);
    INSERT 0 100
    jtolley=# select * from a where random() < .1 limit 3;
    i
    ----
    22
    23
    25
    (3 rows)

    Hope this helps...

    -Josh
  • Albe Laurenz at Aug 31, 2007 at 1:39 pm

    Alban Hertroys wrote:
    I've recently been busy improving a query that yields a fixed
    number of random records matching certain conditions.
    Dear Santa,

I'd like my database to have functionality analogous to how
    LIMIT works,
    but for other - non-sequential - algorithms.

    I was thinking along the lines of:

    SELECT *
    FROM table
    WHERE condition = true
    RANDOM 5;
    Ho, ho, ho.

    SELECT *
    FROM table
    WHERE condition = true
    ORDER BY hashfloat8(random())
    LIMIT 5;

    Yours,
    Laurenz Albe
  • Csaba Nagy at Aug 31, 2007 at 1:45 pm

    Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works,
    but for other - non-sequential - algorithms.
There was some discussion a while ago about reusing the sampling
algorithm ANALYZE uses, to sample a given percentage of a table's data
and expose that as some kind of "SELECT SAMPLE x%" style of
functionality. That would be the fastest way to get a sample big
enough to be statistically significant, but it would not be repeatable.
I'm not sure this is the same thing you were asking for, though; I
would like something like this for statistical work, not for randomly
selecting rows.
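
Syntax-wise I imagine something like this (purely hypothetical, just
to illustrate the idea):

SELECT *
FROM table
SAMPLE 5 PERCENT;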

    Cheers,
    Csaba.
  • Martijn van Oosterhout at Aug 31, 2007 at 1:54 pm

    On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:
    Examples:
    * random(maxrows) would return random rows from the resultset.
    * median() would return the rows in the middle of the result set (this
    would require ordering to be meaningful).
    It would be possible to write an aggregate that returns a single random
    value from a set. The algorithm is something like:

n = 1
v = null
for each row:
    if random() < 1/n:
        v = value of row
    n = n + 1

return v
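
For concreteness, a sketch of such an aggregate in PL/pgSQL (my
illustration, integers only, using the CREATE AGGREGATE syntax of
more recent versions):

----
-- state[1] = number of rows seen so far, state[2] = the value kept so far
create function random_pick_step(state integer[], val integer)
returns integer[] as $$
begin
    if state is null then
        return array[1, val];             -- the first row is always kept
    elsif random() < 1.0 / (state[1] + 1) then
        return array[state[1] + 1, val];  -- replace with probability 1/n
    else
        return array[state[1] + 1, state[2]];
    end if;
end;
$$ language plpgsql volatile;

create function random_pick_final(state integer[])
returns integer as $$
    select $1[2];
$$ language sql;

create aggregate random_pick(integer) (
    sfunc = random_pick_step,
    stype = integer[],
    finalfunc = random_pick_final
);

-- usage: select random_pick(i) from a;
----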

It does require a seqscan though. If you're asking for 5 random rows
you probably mean 5 random but distinct rows, which is different from
just running the above 5 times in parallel.

    I don't know if there's a similar method for median...

    Have a nice day,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    From each according to his ability. To each according to his ability to litigate.
  • Alban Hertroys at Sep 1, 2007 at 11:08 am

    On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote:
    On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:
    Examples:
    * random(maxrows) would return random rows from the resultset.
    * median() would return the rows in the middle of the result set
    (this
    would require ordering to be meaningful).
    It would be possible to write an aggregate that returns a single
    random
    value from a set. The algorithm is something like:

n = 1
v = null
for each row:
    if random() < 1/n:
        v = value of row
    n = n + 1

return v
    Doesn't this always return the first record, since random() is always
    less than 1/1?
    I don't think this method has a linear distribution, but then again I
    don't understand what 'value of row' refers to...
    It does require a seqscan though.
I doubt that a seqscan can be entirely avoided when fetching random
rows from a set, at least not until the last random result has been
returned, _unless_ the number of matching records were known before
sampling starts.
    If you're asking for 5 random rows
    you probably mean 5 random but distinct rows, which is different to
    just running the above set 5 times in parallel.
Indeed, that is one of the distinctions that needs some thought for my
original proposition. I left it out, as it's an implementation detail
(an important one, admittedly).
    I don't know if there's a similar method for median...
    I'm not entirely sure, but I think your method is the only one
    suggested that doesn't involve calculating random() a million times
    (for a million records) to return 5 (random) records.


    My suggestion involved a way to calculate random() only when
    retrieving records from the result set (only 5 times for a million
    records in this case).
    For a linearly distributed random set it does require knowing the
    number of records in the set though, an estimate would make it non-
    linear (although only a little bit if accurate enough).


    OTOH, I'm starting to think that the last sort step of an order by
    can be postponed to the result set fetching cycle under the
    conditions that:
    - the ordering expression is unrelated to the records involved, and
    - only a fraction of the total number of records will be returned.
    (Which is somewhat similar to the condition for an index being more
    efficient than a seqscan, btw)

    Comparing records with each other for something not related seems a
    waste of effort, while the result set has already been determined
    (just not ordered in any particular way), am I right?

With that change (postponing sorting) my original ORDER BY random()
LIMIT 5 would perform quite adequately, I think - it'd only involve
calculating random() about 5 times, rather than once per record in
the result set.


    Or is order by random() acting as some kind of shuffling method? Is
    that a requirement to get a linearly distributed set to randomly draw
    from?
    I can see how it wouldn't be linear if you'd start randomly comparing
    records from the beginning of the result set... (Which would be the
    logical method if you don't know the size of the set before hand)


I thought of another solution (with only a few calculations of
random()) that can be deployed in existing versions of PG, using a
set-returning function with a scrolling cursor that accepts the query
string as input, like this (in pseudoish-code):

    ----
create function random(text _query, integer _limit)
returns setof record
volatile
as $$
DECLARE
    _cur refcursor;
    _cnt bigint;
    _idx integer;
    _rowpos bigint;
    _rec record;
BEGIN
    open _cur for execute _query;

    -- position past the last row to learn the total nr of records
    move forward all in _cur;
    get diagnostics _cnt = ROW_COUNT;

    for _idx in 1.._limit loop
        _rowpos := floor(random() * _cnt) + 1;  -- a position in 1.._cnt
        fetch absolute _rowpos from _cur into _rec;
        return next _rec;
    end loop;

    return;
END;
$$
language 'plpgsql';
    ----
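
The cursor positioning itself can be tried directly at the SQL level
(a sketch, reusing the table a from Josh's example):

----
BEGIN;
DECLARE _cur SCROLL CURSOR FOR SELECT * FROM a WHERE i % 2 = 0;
-- the command tag of this ("MOVE 50") reveals the total nr of records
MOVE FORWARD ALL IN _cur;
-- fetch one row at a random position, say 13
FETCH ABSOLUTE 13 FROM _cur;
CLOSE _cur;
COMMIT;
----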

This method could return the same record twice though; I'll need to
build in some accounting for used-up row positions.
Would it be more efficient than the usual methods?

    Sorry for the brain dump, I tried to get everything into this single
    message. I hope it is at least comprehensible and useful, or
    interesting or at least mildly amusing if not.

    Regards,

    Alban Hertroys
    magproductions b.v.

  • Alban Hertroys at Sep 1, 2007 at 12:24 pm

    On Sep 1, 2007, at 12:44, Alban Hertroys wrote:

    It would be possible to write an aggregate that returns a single
    random
    value from a set. The algorithm is something like:

n = 1
v = null
for each row:
    if random() < 1/n:
        v = value of row
    n = n + 1

return v
    Doesn't this always return the first record, since random() is
    always less than 1/1?
    I don't think this method has a linear distribution, but then again
    I don't understand what 'value of row' refers to...
Oh, now I see... The first iteration guarantees that v has a value (as
random() < 1/1), and after that there is a decreasing chance that a
new row gets re-assigned to v. That means the last row has a chance
of 1/n, which would be its normal chance if the distribution were
linear, but doesn't the first row have a chance of 1/(n!) of being
returned?

    --
    Alban Hertroys
    alban@magproductions.nl

    magproductions b.v.

    T: ++31(0)534346874
    F: ++31(0)534346876
    M:
    I: www.magproductions.nl
    A: Postbus 416
    7500 AK Enschede

    // Integrate Your World //




  • Martijn van Oosterhout at Sep 1, 2007 at 12:44 pm

    On Sat, Sep 01, 2007 at 02:24:25PM +0200, Alban Hertroys wrote:
    Oh, now I see... The first time guarantees that v has a value (as
    random() < 1/1), and after that there is a decreasing chance that a
    new row gets re-assigned to v. That means the last row has a chance
of 1/n, which would be its normal chance if the distribution were
linear, but doesn't the first row have a chance of 1/(n!) of being
returned?
    No. Consider at the first row it has chance 1 of being selected. At the
    second row it has chance 1/2 of being *kept*. At the third row it has
    chance 2/3 of being kept. At row four it's 3/4. As you see, the
    numerators and denominators cancel, leaving 1/n at the end...
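
Written out for the first row, the chance of surviving all n rows
telescopes:

1 * (1/2) * (2/3) * (3/4) * ... * ((n-1)/n) = 1/n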

    Neat huh?
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    From each according to his ability. To each according to his ability to litigate.
  • Alban Hertroys at Sep 1, 2007 at 3:44 pm

    On Sep 1, 2007, at 14:44, Martijn van Oosterhout wrote:
    On Sat, Sep 01, 2007 at 02:24:25PM +0200, Alban Hertroys wrote:
    Oh, now I see... The first time guarantees that v has a value (as
    random() < 1/1), and after that there is a decreasing chance that a
    new row gets re-assigned to v. That means the last row has a chance
of 1/n, which would be its normal chance if the distribution were
linear, but doesn't the first row have a chance of 1/(n!) of being
returned?
    No. Consider at the first row it has chance 1 of being selected. At
    the
    second row it has chance 1/2 of being *kept*. At the third row it has
    chance 2/3 of being kept. At row four it's 3/4. As you see, the
    numerators and denominators cancel, leaving 1/n at the end...
    Ah, now I see where I went wrong. If the first row got through to any
    next iteration, of course there's no chance anymore that it didn't.
    Neat huh?
    Neat from an algorithmic point of view, yes. But it also means that
    it's calculating random() for every record just like the rest of the
    suggested solutions :(

    I'm still convinced doing that isn't the right approach to the problem.

    I think I'll do some experimenting with the set returning function on
    Monday to see how that performs comparative to ordering by random().



The problem with the approaches that use pre-calculated random values
is that I need my result to be truly random each time. If I returned a
number of records starting at a record that has a certain random
value, I'd end up returning the records directly after it in the
same order every time, because their order is pre-determined.

I realise the chances of that happening are slim provided there are
enough records to choose from, and chance dictates that it could
happen anyway, but you couldn't in good conscience sell that as an
equal chance... My boss thinks otherwise though; maybe I'll have to
settle for almost fair :P

    --
    Alban Hertroys
    alban@magproductions.nl

    magproductions b.v.

    T: ++31(0)534346874
    F: ++31(0)534346876
    M:
    I: www.magproductions.nl
    A: Postbus 416
    7500 AK Enschede

    // Integrate Your World //




  • Alban Hertroys at Sep 4, 2007 at 3:43 pm
To follow up on my own post, I came up with a workable solution based
on scrolling cursors. The SP approach didn't work out for me: I didn't
manage to declare a cursor in PL/pgSQL that could be positioned
absolutely (maybe that's because we're still using PG 8.1.something?).

    A solution to that would be appreciated.

    Anyway, I solved the problem in our application (PHP). I even got a
    workable solution to prevent returning the same record more than once.
    Here goes:

function randomSet($query, $limit, $uniqueColumn) {

    // queries; how you run these depends on your DB connector
    DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR $query;
    MOVE FORWARD ALL IN _cur;

    //GET DIAGNOSTICS _count := ROW_COUNT;
    $count = pg_affected_rows();

    $uniques = array();
    $resultSet = array();
    while ($limit > 0 && count($uniques) < $count) {
        $idx = mt_rand(1, $count);

        // query
        $record = FETCH ABSOLUTE $idx FROM _cur;

        // Skip records with a column value we want to be unique
        if (in_array($record[$uniqueColumn], $uniques))
            continue;

        $uniques[] = $record[$uniqueColumn];
        $resultSet[] = $record;
        $limit--;
    }

    // query
    CLOSE _cur;

    return $resultSet;
}
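
Called as, say, randomSet("SELECT * FROM table_x WHERE condition =
true", 5, 'id') - the column name is just an example - it returns up
to 5 rows with distinct values in that column.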

I hope this is useful to someone. It worked for us; it is definitely
    faster than order by random(), and more random than precalculated column
    values. Plus it translates directly to what we are requesting :)

    Alban Hertroys wrote:
    I thought of another solution (with only a few calculations of random())
    that can be deployed in existing versions of PG, using a set-returning
    function with a scrolling cursor that accepts the query string as input
    like this (in pseudoish-code):

    ----
create function random(text _query, integer _limit)
returns setof record
volatile
as $$
DECLARE
    _cur refcursor;
    _cnt bigint;
    _idx integer;
    _rowpos bigint;
    _rec record;
BEGIN
    open _cur for execute _query;

    -- position past the last row to learn the total nr of records
    move forward all in _cur;
    get diagnostics _cnt = ROW_COUNT;

    for _idx in 1.._limit loop
        _rowpos := floor(random() * _cnt) + 1;  -- a position in 1.._cnt
        fetch absolute _rowpos from _cur into _rec;
        return next _rec;
    end loop;

    return;
END;
$$
language 'plpgsql';
    ----
    --
    Alban Hertroys
    alban@magproductions.nl

    magproductions b.v.

    T: ++31(0)534346874
    F: ++31(0)534346876
    M:
    I: www.magproductions.nl
    A: Postbus 416
    7500 AK Enschede

    // Integrate Your World //
  • Alban Hertroys at Sep 4, 2007 at 3:45 pm

    Alban Hertroys wrote:
    To follow up on my own post, I came up with a workable solution based on
    scrolling cursors. The SP approach didn't work out for me, I didn't
    manage to declare a cursor in PL/pgSQL that could be positioned
    absolutely (maybe that's due to us still using PG 8.1.something?).
Doh! I meant that I couldn't use MOVE FORWARD ALL IN _cur for some
reason; it kept saying "syntax error".

    --
    Alban Hertroys
    alban@magproductions.nl

    magproductions b.v.

    T: ++31(0)534346874
    F: ++31(0)534346876
    M:
    I: www.magproductions.nl
    A: Postbus 416
    7500 AK Enschede

    // Integrate Your World //
  • Kaloyan Iliev at Aug 31, 2007 at 2:01 pm
    Hi,
    Why not generate a random number in your application and then:

    SELECT *
    FROM table_x
    WHERE condition = true
    OFFSET generated_random_number
    LIMIT xx

    Kaloyan Iliev

    Alban Hertroys wrote:
    Hello,

    I've recently been busy improving a query that yields a fixed number of
    random records matching certain conditions. I have tried all the usual
    approaches, and although they do work, they're all limited in some way
and don't translate well to what you actually want. They're kludges, IMHO.

    The methods I've tried are explained quite well on
    http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html

    All these methods involve calculating a random number for every record
    in the result set at some point in time, which is really not what I'm
    trying to model. I think the database should provide some means to get
    those records, so...

    Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works,
    but for other - non-sequential - algorithms.

    I was thinking along the lines of:

    SELECT *
    FROM table
    WHERE condition = true
    RANDOM 5;

Which would return (up to) 5 random rows from the result set, just as
LIMIT 5 returns (up to) the first 5 records in the result set.


    Or maybe even with a custom function, so that you could get non-linear
    distributions:

    SELECT *
    FROM table
    WHERE condition = true
    LIMIT 5 USING my_func();

Where my_func() could be a user-definable function accepting a number
that should be (an estimate of?) the number of results being returned,
so that it can indicate which rows in the result set will be returned
from the query.

    Examples:
* random(maxrows) would return random rows from the result set.
    * median() would return the rows in the middle of the result set (this
    would require ordering to be meaningful).

What do people think, is this feasible? Desirable? Necessary?

If I had time I'd volunteer to at least look into this, but I'm
already working on three projects simultaneously. Alas...

    Regards,
    Alban Hertroys.

  • Erik Jones at Aug 31, 2007 at 4:06 pm

    On Aug 31, 2007, at 8:34 AM, Kaloyan Iliev wrote:

    Alban Hertroys wrote:
Hello,

I've recently been busy improving a query that yields a fixed number of
random records matching certain conditions. I have tried all the usual
approaches, and although they do work, they're all limited in some way
and don't translate well to what you actually want. They're kludges, IMHO.

The methods I've tried are explained quite well on
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html

All these methods involve calculating a random number for every record
in the result set at some point in time, which is really not what I'm
trying to model. I think the database should provide some means to get
those records, so...

Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works,
but for other - non-sequential - algorithms.

I was thinking along the lines of:

SELECT *
FROM table
WHERE condition = true
RANDOM 5;

Which would return (up to) 5 random rows from the result set, just as
LIMIT 5 returns (up to) the first 5 records in the result set.

Or maybe even with a custom function, so that you could get non-linear
distributions:

SELECT *
FROM table
WHERE condition = true
LIMIT 5 USING my_func();

Where my_func() could be a user-definable function accepting a number
that should be (an estimate of?) the number of results being returned,
so that it can indicate which rows in the result set will be returned
from the query.

Examples:
* random(maxrows) would return random rows from the result set.
* median() would return the rows in the middle of the result set (this
would require ordering to be meaningful).

What do people think, is this feasible? Desirable? Necessary?

If I had time I'd volunteer to at least look into this, but I'm
already working on three projects simultaneously. Alas...

Regards,
Alban Hertroys.
    Hi,
    Why not generate a random number in your application and then:

    SELECT *
    FROM table_x
    WHERE condition = true
    OFFSET generated_random_number
    LIMIT xx

    Kaloyan Iliev
    That won't work without some kind of a priori knowledge of how many
    rows the query would return without the offset and limit.
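
One way around that - a sketch, assuming the matching rows don't
change between the two queries - is to fetch the count first and pick
the offset in the application:

SELECT count(*) FROM table_x WHERE condition = true;
-- choose generated_random_number between 0 and count - 1, then:
SELECT *
FROM table_x
WHERE condition = true
OFFSET generated_random_number
LIMIT 1;

Note that without an ORDER BY the row order isn't guaranteed to be
stable, and LIMIT xx with xx > 1 still returns consecutive rows, so
this gives one random row per round trip.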

    Erik Jones

    Software Developer | Emma®
    erik@myemma.com
    800.595.4401 or 615.292.5888
    615.292.0777 (fax)

    Emma helps organizations everywhere communicate & market in style.
    Visit us online at http://www.myemma.com
