Wow, I'm really confused. I'm trying to remove duplicate
lines from a marc21 text file. I have spent countless hours
searching for scripts etc.
I'm also very new to Perl and wrote a long and newbyish script that
does exactly what the Unix command "sort FILENAME | uniq" does just to
see how it can be done.

What I did was read the file's lines into an array and use the sort()
function to sort the lines. Then it's easy: do what Joe recommended
and just check whether the current line is equal to the previous line.
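
A minimal sketch of that sort-then-compare approach, assuming the whole file fits in memory. The sub name sorted_unique is made up for illustration. Note that sorting reorders the lines, so this reproduces "sort FILENAME | uniq" but would not keep the records of the MARC file intact.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort the lines, then keep a line only if it differs from the
# line kept just before it (the same idea as "sort FILE | uniq").
sub sorted_unique {
    my @sorted = sort @_;
    my @unique;
    for my $line (@sorted) {
        push @unique, $line unless @unique && $unique[-1] eq $line;
    }
    return @unique;
}

# Run as: perl sorted_unique.pl FILENAME   (script name is hypothetical)
print sorted_unique(<>) if @ARGV;
```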

HTH
What I find frustrating while trying to learn Perl is that
most solutions assume you know what to do. For example,
someone gives the code to find and replace, and that's it. In
other words, if the complete script were there, I think I
could learn much faster. I have no idea how to put the
code into a script.

I did manage to find a few Perl one-liners, but they removed the
blank lines between the records, which must be retained in
order to convert the file back to actual MARC format before
downloading it into the database.

They also removed non-adjacent lines if the same line appeared in
another record. Those must also be kept, as they are an
important part of the file.

Any help would be more than appreciated. Below is part of a
very large file. Approximately 100,000 records need to be processed.
For now, I just want to remove adjacent duplicate fields.

=LDR 01548cam 2200397La 45{92}0
=001 ocm42328427\
=003 OCoLC
=005 20010526091201.0
=006 m\\\\\\\\u\\\\\\\\
=007 cr\cn-
=008 831108s1984\\\\inua\\\\sb\\\\001\0\eng\d
=010 \\$z 83048636
=035 \\1234 (sirsi)
=035 \\1234 (sirsi)
=040 \\$aN{dollar}T$cN{dollar}T$dOCL
=020 \\$a0585000905 (electronic bk.)
=020 \\$z0253366062
=020 \\$z0253203252
=050 14$aNX180.F4$bL38 1984eb
=082 04$a700/.88042$219
=049 \\$aM7@A
=100 1\$aLauter, Estella,$d1940-
=245 10$aWomen as mythmakers$h[computer file] :$bpoetry and
visual art by twentieth-century women /$cEstella Lauter.
=260 \\$aBloomington :$bIndiana University Press,$cc1984.
=300 \\$axvii, 267 p. :$bill. ;$c24 cm.
=504 \\$aBibliography: p. 247-260.
=500 \\$aIncludes index.
=533 \\$aElectronic reproduction.$bBoulder, Colo.
:$cNetLibrary,$d1999.$nAvailable via the World Wide
Web.$nAvailable in multiple electronic file formats.$nAccess
may be limited to NetLibrary affiliated libraries.
=SUBJ \0$aFeminism and the arts.
=SUBJ \0$aWomen artists.
=SUBJ \0$aWomen poets.
=SUBJ \0$aArt and mythology.
=SUBJ \0$aArts, Modern$y20th century.
=655 \7$aElectronic books.$2local
=710 2\$aNetLibrary, Inc.
=776 1\$cOriginal$w(DLC) 83048636$w(OCoLC)10162146
=856 4\$3Bibliographic record
display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v
=1&bookid=652$zAn electronic book accessible through the
World Wide Web; click for information
=994 \\$a92$bM7@

=LDR 01470cam 2200349La 45{92}0
=001 ocm42328450\
=003 OCoLC
=005 20010526091202.0
=006 m\\\\\\\\u\\\\\\\\
=007 cr\cn-
=008 980609s1998\\\\couab\\\sbf\\\001\0\eng\d
=010 \\$z 98026266
=035 \\1234 (sirsi)
=035 \\1234 (sirsi)
=040 \\$aN{dollar}T$cN{dollar}T$dOCL
=020 \\$a0585001413 (electronic bk.)
=020 \\$z1555662307
=050 14$aQB581$b.L66 1998eb
=082 04$a523.3$221
=049 \\$aM7@A
=100 1\$aLong, Kim.
=245 14$aThe moon book$h[computer file] :$bfascinating facts
about the magnificent, mysterious moon /$cKim Long ; science
advisor, Larry Sessions.
=250 \\$aRev. and expanded.
=260 \\$aBoulder, Colo. :$bJohnson Books,$cc1998.
=300 \\$a149 p. :$bill., maps ;$c22 cm.
=500 \\$aIncludes 1 errata sheet.
=504 \\$aIncludes bibliographical references (p. 132-133) and index.
=533 \\$aElectronic reproduction.$bBoulder, Colo.
:$cNetLibrary,$d1999.$nAvailable via the World Wide
Web.$nAvailable in multiple electronic file formats.$nAccess
may be limited to NetLibrary affiliated libraries.
=651 \0$aMoon$vHandbooks, manuals, etc.
=655 \7$aElectronic books.$2local
=710 2\$aNetLibrary, Inc.
=776 1\$cOriginal$w(DLC) 98026266$w(OCoLC)39299241
=856 4\$3Bibliographic record
display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v
=1&bookid=140$zAn electronic book accessible through the
World Wide Web; click for information
=994 \\$a92$bM7@
=994 \\$a92$bM7@
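
For the adjacent-duplicate case described above, a complete script might look like the sketch below. It is an illustration under stated assumptions, not a tested MARC tool: it treats each physical line as a unit, compares it only with the line directly before it, and starts over at every blank line so the record separators stay in place. The sub name remove_adjacent_duplicates and the file names in the usage comment are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the given lines with adjacent exact duplicates dropped.
# Blank lines are always kept and reset the comparison, so the
# separators between records survive.
sub remove_adjacent_duplicates {
    my @kept;
    my $previous;
    for my $line (@_) {
        if ($line =~ /^\s*$/) {          # blank line between records
            push @kept, $line;
            undef $previous;
            next;
        }
        next if defined $previous && $line eq $previous;
        push @kept, $line;
        $previous = $line;
    }
    return @kept;
}

# Run as: perl dedup_adjacent.pl input.mrk > output.mrk
# (both file names are hypothetical)
print remove_adjacent_duplicates(<>) if @ARGV;
```

Because only neighbouring lines are compared, the second "=035" line in a record is dropped while the identical "=035" line in the next record is kept. A field that is wrapped over several lines is compared line by line, which may or may not be what is wanted for the real file.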

  • John W. Krahn at May 30, 2005 at 2:41 pm

    Tielman Koekemoer (TNE) wrote:
    Wow, I'm really confused. I'm trying to remove duplicate
    lines from a marc21 text file. I have spent countless hours
    searching for scripts etc.
    I'm also very new to Perl and wrote a long and newbyish script that
    does exactly what the Unix command "sort FILENAME | uniq" does just to
    see how it can be done.
    How long? Because you can do that on one line in perl. :-)

    perl -e'print sort grep !$seen{$_}++, <>' FILENAME



    John
    --
    use Perl;
    program
    fulfillment
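
John's one-liner above, spelled out as a script file for anyone unsure how to turn such a snippet into something runnable. This is only a sketch: the sub name unique_sorted is invented here, and like the one-liner it reads the whole input into memory and reorders the lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the first occurrence of each line, then sort what is left.
# $seen{$line}++ returns the old count, which is 0 (false) the
# first time a line is seen, so !$seen{$line}++ is true only once.
sub unique_sorted {
    my %seen;
    my @unique = grep { !$seen{$_}++ } @_;
    return sort @unique;
}

# Run as: perl unique_sorted.pl FILENAME   (script name is hypothetical)
print unique_sorted(<>) if @ARGV;
```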
  • Tony Marquis at May 30, 2005 at 5:09 pm
    Very simple question.

    I'm reading a file and I want to remove the <CR><LF> from each line.

    while (<FIC>) {

    $test = $_; # remove CRLF here
    ........... some code

    }

    How can I do that?
  • Elvis Cehajic at May 30, 2005 at 5:43 pm

    On Mon, 30 May 2005 13:09:23 -0400 Tony Marquis wrote:

    Very simple question.

    I'm reading a file and I want to remove the <CR><LF> from each line.

    while (<FIC>) {

    $test = $_; # remove CRLF here
    ........... some code

    }

    How can I do that?
    First of all: stop hijacking threads! In future, please start a new mail to the list instead of clicking the reply button and deleting the content of the mail. Thanks.

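    while (<FIC>) {
    s/\r\n$//; # remove the trailing CRLF
    print;
    # or, instead of the substitution (only safe if every line ends in CRLF):
    # chop; chop; # removes the last two characters, whatever they are
    }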

    and RTFM!

    Elvis
  • Binish A R at May 30, 2005 at 6:06 pm

    Tony Marquis wrote:

    Very simple question.

    I'm reading a file and I want to remove the <CR><LF> from each line.

    while (<FIC>) {

    $test = $_; # remove CRLF here
    ........... some code

    }

    How can I do that?
    Try

    $test =~ s/\r//g; # \r is the carriage return (\f would be a form feed)


    to remove newlines, use

    $test =~ s/\n//g;
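
Putting the answers together, a complete loop could look roughly like this. It is a sketch, not the only way: the sub name strip_eol and the script name are invented for the example, and it assumes the goal is to drop one line ending (CRLF or a bare LF) from the end of each line rather than every CR or LF anywhere in the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove one trailing line ending, either CRLF or a bare LF.
# Anything else in the string, including a stray \r in the middle,
# is left alone.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Run as: perl strip_crlf.pl FILENAME   (script name is hypothetical)
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $fic, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$fic>) {
        my $test = strip_eol($line);
        print "$test\n";    # "some code" from the question would go here
    }
    close($fic);
}
```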
  • Jay Savage at May 30, 2005 at 9:23 pm

    On 5/30/05, John W. Krahn wrote:
    Tielman Koekemoer (TNE) wrote:
    Wow, I'm really confused. I'm trying to remove duplicate
    lines from a marc21 text file. I have spent countless hours
    searching for scripts etc.
    I'm also very new to Perl and wrote a long and newbyish script that
    does exactly what the Unix command "sort FILENAME | uniq" does just to
    see how it can be done.
    How long? Because you can do that on one line in perl. :-)

    perl -e'print sort grep !$seen{$_}++, <>' FILENAME



    John
    You can also just do:

    $seen{$_}++ while <>;
    print sort keys %seen;

    Which will also let you know which items were repeated, and how many
    times. It all depends on what you ultimately want to do with the
    information. This is why we ask to see code you've tried, and where
    you're headed. There are probably close to 1,000 ways to handle this
    in Perl, each of them appropriate for a specific circumstance.

    In this case, doing a search for MARC and/or Z3950 on search.cpan.org
    will turn up some interesting results, too.

    HTH,

    -- jay
    --------------------
    daggerquill [at] gmail [dot] com
    http://www.engatiki.org
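
To illustrate the point about knowing which items were repeated and how many times, the %seen hash from Jay's snippet can be printed as a small report. This is a sketch only: the sub name count_lines and the output format are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many times each line occurs; returns a hash of line => count.
sub count_lines {
    my %seen;
    $seen{$_}++ for @_;
    return %seen;
}

# Run as: perl count_dups.pl FILENAME   (script name is hypothetical)
if (@ARGV) {
    my %seen = count_lines(<>);
    for my $line (sort keys %seen) {
        next unless $seen{$line} > 1;     # report repeated lines only
        chomp(my $text = $line);
        print "$seen{$line}x $text\n";
    }
}
```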

Discussion Overview
group: beginners @
categories: perl
posted: May 30, '05 at 12:03p
active: May 30, '05 at 9:23p
posts: 6
users: 6
website: perl.org
