FAQ
Hello,

If the csv file generated by the application that I am
importing from contains quotes around each field, must I
write a program to strip these "field-level" quotes before
sending the file to COPY?
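
(Later PostgreSQL releases taught COPY to read CSV directly, including
fields wrapped in double quotes, so no pre-stripping pass is needed there.
A minimal sketch, with a hypothetical table name and file path:

    -- COPY in CSV mode strips the surrounding field quotes itself
    COPY foo FROM '/tmp/foo.csv' WITH (FORMAT csv);
)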

As we know, COPY is a single transaction. Therefore, it
would be "unpleasant" if, say, the process that is doing the
importing dies 90% of the way through a 10,000,000 row table.
Is there a checkpoint mechanism that would do a COMMIT, for
example, every 10,000 rows? Then, if the process that is doing
the importing dies 90% of the way through that 10,000,000 row
table, when you restart the COPY, it skips over the already
inserted rows.
Here is an example from the RDBMS that I currently use:
$ bulkload -load -log -commit=10000 -tran=exclusive -db=test \
-table=foo ~/foo.csv
Then, if something happens after inserting 9,000,000 rows,
it can be restarted by:
$ bulkload -load -log -commit=10000 -skip=9000000 -db=test \
-tran=exclusive -table=foo ~/foo.csv
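
(A rough way to approximate that commit-every-N behaviour with stock tools
is to split the input beforehand and load each chunk with its own \copy;
psql runs each statement in its own transaction by default, so a crash only
loses the chunk in flight, and a restart just re-runs the chunks that never
loaded. A sketch of the psql side, assuming the file was split with
something like "split -l 10000" and a release whose COPY understands CSV;
file and table names are hypothetical:

    \copy foo FROM 'foo.part_aa' WITH CSV
    \copy foo FROM 'foo.part_ab' WITH CSV
    \copy foo FROM 'foo.part_ac' WITH CSV
    -- ...one line per remaining chunk; resume from the chunk that failed
)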

From what I've seen in the documentation and the mailing
list archives, the solution to both of these questions is
to roll my own bulk loader.

Ron
--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA USA       http://ronandheather.dhs.org:81 |
| "I have created a government of whirled peas..."        |
|    Maharishi Mahesh Yogi, 12-May-2002,                  |
|    CNN, Larry King Live                                 |
+---------------------------------------------------------+

  • Andrew McMillan at May 19, 2002 at 11:16 am

    On Sun, 2002-05-19 at 23:01, Ron Johnson wrote:

    If the csv file generated by the application that I am
    importing from contains quotes around each field, must I
    write a program to strip these "field-level" quotes before
    sending the file to COPY?

    As we know, COPY is a single transaction. Therefore, it
    would be "unpleasant" if, say, the process that is doing the
    importing dies 90% of the way through a 10,000,000 row table.
    Is there a checkpoint mechanism that would do a COMMIT, for
    example, every 10,000 rows? Then, if the process that is doing
    the importing dies 90% of the way through that 10,000,000 row
    table, when you restart the COPY, it skips over the already
    inserted rows.
    Here is an example from the RDBMS that I currently use:
    $ bulkload -load -log -commit=10000 -tran=exclusive -db=test \
    -table=foo ~/foo.csv
    Then, if something happens after inserting 9,000,000 rows,
    it can be restarted by:
    $ bulkload -load -log -commit=10000 -skip=9000000 -db=test \
    -tran=exclusive -table=foo ~/foo.csv
    From what I've seen in the documentation and the mailing
    list archives, the solution to both of these questions is
    to roll my own bulk loader.
    Yes, or to borrow one someone else has already done.

    I have a perl script I use for this sort of thing, and although it
    handles the full possibilities of quoting fields, it only loads the
    whole file as a single transaction, or as one transaction per line.

    You are welcome to it if you wish. It shouldn't be hard to extend it to
    allow groups of transactions to be checkpointed - I will probably even
    do it myself before the end of the year.

    Regards,
    Andrew.
    --
    --------------------------------------------------------------------
    Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
    WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
    DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267
    Are you enrolled at http://schoolreunions.co.nz/ yet?
  • April L at May 19, 2002 at 4:18 pm
    After reading about data types, create table, and other search results for
    "default" and "random" in the postgres idocs, I haven't found a way to do
    this:

    Instead of a sequentially auto-incrementing primary key, I would like a
    random 16 byte character value (assumedly hex) to be automatically created
    as the default value for each new record.

    I guess I could just use a large random number, created with the postgres
    random() function - however I don't see any way of seeding it, or know if
    that is necessary.

    The purpose of using a random rather than sequential number is to prevent
    people from being able to access others' records by guessing.

    Has anyone else encountered this challenge, and do you know of a way to
    generate a random default value for the primary key?

    Thank you,

    - April
  • Ron Johnson at May 19, 2002 at 8:10 pm

    On Sun, 2002-05-19 at 11:10, April L wrote:
    After reading about data types, create table, and other search results for
    "default" and "random" in the postgres idocs, I haven't found a way to do
    this:

    Instead of a sequentially auto-incrementing primary key, I would like a
    random 16 byte character value (assumedly hex) to be automatically created
    as the default value for each new record.

    I guess I could just use a large random number, created with the postgres
    random() function - however I don't see any way of seeding it, or know if
    that is necessary.

    The purpose of using a random rather than sequential number is to prevent
    people being able to access other's records by guessing.

    Has anyone else encountered this challenge, and do you know of a way to
    generate a random default value for the primary key?
    Have you considered appending (or prepending) a check digit
    to the sequence number before inserting into the database?

    That way, it is unlikely someone will get an account just by
    typing in a random number. Also, and more importantly, it will
    ensure that any fat-fingering by the clerical staff doesn't
    accidentally bring up the wrong account.

    --
    +---------------------------------------------------------+
    | Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
    | Jefferson, LA USA       http://ronandheather.dhs.org:81 |
    | "I have created a government of whirled peas..."        |
    |    Maharishi Mahesh Yogi, 12-May-2002,                  |
    |    CNN, Larry King Live                                 |
    +---------------------------------------------------------+
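
    Ron's check-digit suggestion above could be wired into a column default.
    The scheme below (a plain modulus digit appended to a sequence value) is
    only an illustration with made-up names and newer dollar-quoted syntax,
    not something prescribed in the thread:

        CREATE SEQUENCE account_seq;

        -- append a mod-7 check digit to each sequence value
        CREATE OR REPLACE FUNCTION next_account_no() RETURNS bigint AS $$
        DECLARE
            n bigint := nextval('account_seq');
        BEGIN
            RETURN n * 10 + (n % 7);   -- last digit is the check digit
        END;
        $$ LANGUAGE plpgsql;

        CREATE TABLE account (
            account_no bigint PRIMARY KEY DEFAULT next_account_no()
        );

    A lookup routine can then verify the digit before touching the table,
    which catches most typos up front.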
  • Josh Berkus at May 19, 2002 at 8:47 pm
    April,
    Instead of a sequentially auto-incrementing primary key, I would like a
    random 16 byte character value (assumedly hex) to be automatically created
    as the default value for each new record.

    I guess I could just use a large random number, created with the postgres
    random() function - however I don't see any way of seeding it, or know if
    that is necessary.
    First, let me point out that there is no reason for the "user key" you are
    trying to implement to be the same value as the primary key of the table.
    There are, in fact, a number of good arguments against it, the least of which
    is that a 16-byte string will take up 4x the sort memory of a 4-byte integer.
    I would suggest that you give the table an actual, hidden primary key based
    on a simple sequence, and a separate unique "user key" for lookups.

    This is actually easy to implement through custom functions, triggers, or
    rules. However, there are some performance implications if your table gets
    very large, as you would have to prescan for accidental duplicates (in a
    truly random distribution, this is nearly certain given enough records, even
    with a 16-byte value).
    The purpose of using a random rather than sequential number is to prevent
    people being able to access other's records by guessing.

    Has anyone else encountered this challenge, and do you know of a way to
    generate a random default value for the primary key?
    Generally, a pseudo-random number is more than adequate. For example, one of
    my applications generates a pseudo-random session key based on a calculation
    involving the user_id of the modifying user and the epoch timestamp on which
    the record was locked. This appears random to the casual eye, and is
    near-impossible to guess.

    --
    -Josh Berkus
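
    The two-key layout Josh describes could look roughly like the table
    below; the names are invented for the example, and the DEFAULT uses the
    same kind of pseudo-random value discussed later in the thread:

        CREATE TABLE client (
            id       serial PRIMARY KEY,                      -- internal key, used for joins
            user_key bigint NOT NULL UNIQUE
                     DEFAULT trunc(random() * 10^15)::bigint  -- opaque key handed out for lookups
        );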
  • April L at May 19, 2002 at 10:15 pm
    I made the primary key

    "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL

    Does that seem reasonable? bigint is 8 bytes. I came up with this prior to
    receiving your reply.

    Since I do have to use the authkey to find records, it seems I would still
    benefit by having an index for it even if I had a separate 4 byte primary
    key - so I don't understand how it would save resources or increase
    performance to avoid making this column the primary key? Admittedly, I
    don't understand indexes in depth yet, I just assumed that every additional
    index means additional housekeeping activities each time a record is
    changed or added.

    Thank you,

    - April
    At 01:49 PM 5/19/2002 -0700, Josh Berkus wrote:
    April,
    Instead of a sequentially auto-incrementing primary key, I would like a
    random 16 byte character value (assumedly hex) to be automatically created
    as the default value for each new record.

    I guess I could just use a large random number, created with the postgres
    random() function - however I don't see any way of seeding it, or know if
    that is necessary.
    First, let me point out that there is no reason for the "user key" you are
    trying to implement to be the same value as the primary key of the table.
    There are, in fact, a number of good arguments against it, the least of which
    is that a 16-byte string will take up 4x the sort memory of a 4-byte integer.
    I would suggest that you give the table an actual, hidden primary key based
    on a simple sequence, and a separate unique "user key" for lookups.

    This is actually easy to implement through custom functions, triggers, or
    rules. However, there are some performance implications if your table gets
    very large, as you would have to prescan for accidental duplicates (in a
    truly random distribution, this is nearly certain given enough records, even
    with a 16-byte value).
    The purpose of using a random rather than sequential number is to prevent
    people being able to access other's records by guessing.

    Has anyone else encountered this challenge, and do you know of a way to
    generate a random default value for the primary key?
    Generally, a pseudo-random number is more than adequate. For example, one of
    my applications generates a pseudo-random session key based on a calculation
    involving the user_id of the modifying user and the epoch timestamp on which
    the record was locked. This appears random to the casual eye, and is
    near-impossible to guess.

    --
    -Josh Berkus


    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?

    http://archives.postgresql.org
  • Joel Burton at May 19, 2002 at 10:29 pm

    I made the primary key

    "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL

    Does that seem reasonable? bigint is 8 bytes. I came up with this prior to
    receiving your reply.

    Since I do have to use the authkey to find records, it seems I would still
    benefit by having an index for it even if I had a separate 4 byte primary
    key - so I don't understand how it would save resources or increase
    performance to avoid making this column the primary key? Admittedly, I
    don't understand indexes in depth yet, I just assumed that every additional
    index means additional housekeeping activities each time a record is
    changed or added.

    Thank you,

    - April
    Be sure to build into your app some way of handling the error that will
    happen if a random primary key is already in the table -- resubmitting it
    should work fine (the odds of this happening twice become _extremely_
    unlikely, unless your table is very large).

    Anyone know off the top of their heads what the period for PG's random()
    function is?

    - J.
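
    One server-side way to do the resubmission Joel mentions is a PL/pgSQL
    helper that retries on a unique violation (exception handling of this
    kind arrived in later releases; the table and column names here are
    hypothetical):

        CREATE OR REPLACE FUNCTION insert_with_authkey(p_data text) RETURNS bigint AS $$
        DECLARE
            v_key bigint;
        BEGIN
            LOOP
                v_key := trunc(random() * 10^15)::bigint;
                BEGIN
                    INSERT INTO accounts (authkey, data) VALUES (v_key, p_data);
                    RETURN v_key;
                EXCEPTION WHEN unique_violation THEN
                    NULL;  -- collision: loop and try another key
                END;
            END LOOP;
        END;
        $$ LANGUAGE plpgsql;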
  • Ghaverla at May 19, 2002 at 10:41 pm

    On Sun, 19 May 2002, Joel Burton wrote:

    I made the primary key

    "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL
    Be sure to build into your app some way of handling the error that will
    happen if a random primary key is already in the table -- resubmitting it
    should work fine (the odds of this happening twice become _extremely_
    unlikely, unless your table is very large)
    If you know about how many rows you will be inserting, why not
    build a table of random ints drawn from a finite pool without
    replacement? Then, you just have to pick the next unused random
    int from the random pool table to use in your other table. If
    you get to the point where the table is exhausted (surely the
    number of rows in the pool is less than the period), you just
    add some more random ints.
    Anyone know off the top of their heads what the period for PG's random()
    function is?
    Sorry, not I.

    Gord

    Matter Realisations http://www.materialisations.com/
    Gordon Haverland, B.Sc. M.Eng. President
    101 9504 182 St. NW Edmonton, AB, CA T5T 3A7
    780/481-8019 ghaverla @ freenet.edmonton.ab.ca
    780/993-1274 (cell)
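
    A sketch of Gord's key-pool idea in more recent SQL, with made-up names
    and sizes (refilling the pool would need its own duplicate handling):

        CREATE TABLE key_pool (
            k    bigint PRIMARY KEY,
            used boolean NOT NULL DEFAULT false
        );

        -- pre-generate candidate keys; DISTINCT drops collisions within the batch
        INSERT INTO key_pool (k)
        SELECT DISTINCT trunc(random() * 10^15)::bigint
        FROM generate_series(1, 100000);

        -- claim the next unused key for a new row
        UPDATE key_pool
           SET used = true
         WHERE k = (SELECT k FROM key_pool WHERE NOT used LIMIT 1)
        RETURNING k;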
  • Josh Berkus at May 19, 2002 at 10:56 pm
    April,
    Since I do have to use the authkey to find records, it seems I would still
    benefit by having an index for it even if I had a separate 4 byte primary
    key - so I don't understand how it would save resources or increase
    performance to avoid making this column the primary key? Admittedly, I
    don't understand indexes in depth yet, I just assumed that every additional
    index means additional housekeeping activities each time a record is
    changed or added.
    Oh, yeah, you're right. I should have added the caveat that having the two
    keys makes sense if you are doing a lot of JOINing on the primary key of this
    table. If this table does not have a lot of key-related data in other
    tables, the dual keying does not make sense. Since you seemed to be talking
    about a central data table, I just assumed that it would have child tables.

    And Joel has a good point, which I made somewhat obliquely: you need to be
    prepared for key violation messages with your method. If you expect many
    thousands of records, you should do the key-checking up front:

    1. Create a PL/pgSQL trigger function called random_authkey();
    2. Have the function generate a random number, scan the table for duplicates, and
    if found generate a replacement number and return it as NEW.authkey;
    3. Create a BEFORE trigger which calls this function for each INSERT (a sketch
    follows below).

    This should automate the process for you, although it will slow down your
    inserts significantly due to the extra index scan required for each insert.
    Depending on your level of activity, though, you will probably not even
    notice.


    --
    -Josh Berkus
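
    A rough rendering of Josh's three steps, with hypothetical table and
    column names; the loop simply re-rolls the value until the duplicate
    scan comes back empty:

        -- steps 1 and 2: trigger function that picks a collision-free authkey
        CREATE OR REPLACE FUNCTION random_authkey() RETURNS trigger AS $$
        BEGIN
            LOOP
                NEW.authkey := trunc(random() * 10^15)::bigint;
                PERFORM 1 FROM accounts WHERE authkey = NEW.authkey;
                EXIT WHEN NOT FOUND;
            END LOOP;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        -- step 3: fire it before every INSERT
        CREATE TRIGGER accounts_random_authkey
            BEFORE INSERT ON accounts
            FOR EACH ROW EXECUTE PROCEDURE random_authkey();

    The PERFORM is the extra index scan Josh warns about; for modest insert
    rates it is usually not noticeable.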
  • Ron Johnson at May 20, 2002 at 12:18 am

    On Sun, 2002-05-19 at 17:19, April L wrote:
    I made the primary key

    "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL

    Does that seem reasonable? bigint is 8 bytes. I came up with this prior to
    receiving your reply.

    Since I do have to use the authkey to find records, it seems I would still
    benefit by having an index for it even if I had a separate 4 byte primary
    key - so I don't understand how it would save resources or increase
    performance to avoid making this column the primary key? Admittedly, I
    don't understand indexes in depth yet, I just assumed that every additional
    index means additional housekeeping activities each time a record is
    changed or added.
    You are right that more indexes means more work. Another
    benefit of pseudo-random numbers is that keys will be inserted
    into the tree in, well, pseudo-random order... So what? Keys
    that are inserted into the tree in ascending order all get
    inserted into the right side of the tree. Therefore, postgres
    must do extra work to keep the tree balanced. (That's the B in
    B-tree.) Random numbers get inserted all over the tree, thus
    minimizing the work needed to keep the tree balanced.

    If your transactions are SERIALIZABLE, then, since ascending
    order keys all get inserted into the right side of the tree,
    all users are trying to insert into the same nodes, thus causing
    rollbacks. Keys that go all over the tree will minimize this
    problem.

    --
    +---------------------------------------------------------+
    | Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
    | Jefferson, LA USA       http://ronandheather.dhs.org:81 |
    | "I have created a government of whirled peas..."        |
    |    Maharishi Mahesh Yogi, 12-May-2002,                  |
    |    CNN, Larry King Live                                 |
    +---------------------------------------------------------+

Discussion Overview
group: pgsql-novice
categories: postgresql
posted: May 19, '02 at 11:01a
active: May 20, '02 at 12:18a
posts: 10
users: 6
website: postgresql.org
irc: #postgresql
