Thanks to much help from the list, and hours of reading up on Unicode,
the Encode module, and many posts to perlmonks, I've come up with a
hideous solution for processing text files with different character
encodings.

Can someone please explain why this first block of code works when
decoding .txt files of different character encoding types:

#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

print "\nPlease specify the file path: ";
my $datapath = <STDIN>;
$datapath =~ s/^\s+//;
$datapath =~ s/\s+$//;
open (my $filehndl , "<", "$datapath") ||
    die ("Can't open .txt file $datapath. Exiting program.\n\n");
binmode($filehndl);
if (read($filehndl, my $filestrt, 500))
{
    my $enc = guess_encoding($filestrt);
    if (ref($enc))
    {
        my $enc_name = $enc->name;
        #my $encoding = find_encoding("$enc_name");
        open (my $filehdl2 , "<:encoding($enc_name)" , "$datapath");
        while (my $line = <$filehdl2>)
        {
            #my $line = $encoding->decode($string);
            #my $line = decode("$enc_name", $string);
            chomp $line;
            my @words = split / /, $line;
            my $nr_words = @words;
            print "\n$line\n";
            print "The line above has " . scalar @words . " occurrences of something.\n";
        }
        close ($filehdl2);
    }
}
close ($filehndl);

But this second block generates the error:
UTF16: Unrecognised BOM 6100 at /usr/lib/perl/5.10//Encode.pm line 162, <$filehndl> line 1.

#!/usr/bin/perl
use strict;
use warnings;

use Encode;
use Encode::Guess;

print "\nPlease specify the file path: ";
my $datapath = <STDIN>;
$datapath =~ s/^\s+//;
$datapath =~ s/\s+$//;
open (my $filehndl , "<", "$datapath") ||
    die ("Can't open .txt file $datapath. Exiting program.\n\n");
binmode($filehndl);
if (read($filehndl, my $filestrt, 500))
{
    my $enc = guess_encoding($filestrt);
    if (ref($enc))
    {
        my $enc_name = $enc->name;
        while (my $line = decode("$enc_name", <$filehndl>))
        {
            chomp $line;
            my @words = split / /, $line;
            my $nr_words = @words;
            print "\n$line\n";
            print "The line above has " . scalar @words . " occurrences of something.\n";
        }
    }
}
close ($filehndl);

Otherwise, can someone suggest a more elegant way of accomplishing
this? It doesn't seem like I should have to open the file twice, as
I'm doing in the first block. I can't figure out any way around that,
though.

Thanks for any help!

-Doug.

===
Douglas Cacialli, M.A. - Doctoral candidate
Clinical Psychology Training Program
University of Nebraska-Lincoln
Lincoln, Nebraska 68588-0308
===


  • Dr.Ruud at Apr 4, 2010 at 3:10 pm

    Doug Cacialli wrote:

    > $datapath =~ s/^\s+//;
    > $datapath =~ s/\s+$//;

    Alternative notation:

    s/\s+$//, s/^\s+// for $datapath;

    I have a "sub trim()" in my Toolbox, so I just call "trim($datapath);".
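
    The trim() helper itself is not shown in the thread. A minimal sketch of
    what such a helper could look like follows; the name and in-place
    behaviour are assumptions based only on the call "trim($datapath);".

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not from the thread): strips leading and trailing
    # whitespace in place, so it can be called without an assignment.
    sub trim {
        # @_ aliases the caller's variables, so the substitutions
        # modify them directly.
        s/^\s+//, s/\s+$// for @_;
        return;
    }

    my $datapath = "  /tmp/example.txt \n";   # made-up input
    trim($datapath);
    print "[$datapath]\n";                    # prints "[/tmp/example.txt]"
    ```

    Because it modifies its arguments, this version would die if handed a
    string literal; a variant that returns a trimmed copy avoids that.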

    > open (my $filehndl , "<", "$datapath") ||
    > die ("Can't open .txt file $datapath. Exiting program.\n\n");

    Why quote $datapath?
    (It isn't in a class with overloaded stringification.)

    open my $filehndl , "<", $datapath or die "'$datapath': $!\n";

    > while (my $line = <$filehdl2>)
    > {
    > chomp $line;
    > my @words = split / /, $line;
    > my $nr_words = @words;
    > print "\n$line\n";
    > print "The line above has " . scalar @words ...

    If you didn't chomp, you wouldn't have to add a "\n".
    But then you need to split on " ", which splits on any run of
    whitespace, so the trailing newline doesn't end up in a field.


    while (my $line = <$fh>) {
    my @words = split " ", $line;
    print $line, "\thas ", scalar(@words), " words.\n";
    }
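
    The difference between the two split forms can be checked with a small
    sketch. The sample string is made up, and the newline is deliberately
    left in place (no chomp).

    ```perl
    use strict;
    use warnings;

    my $line = "  two  words \n";   # made-up sample line, not chomped

    # A regex of one space splits at every single space, so repeated and
    # leading spaces produce empty fields and "\n" survives as a field.
    my @by_regex = split / /, $line;

    # The string " " is a special case: it splits on runs of any whitespace
    # and skips leading whitespace.
    my @by_space = split " ", $line;

    print scalar(@by_regex), "\n";   # prints 6
    print scalar(@by_space), "\n";   # prints 2
    ```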


    --
    Ruud
  • Doug Cacialli at Apr 5, 2010 at 9:57 pm
    I sincerely appreciate the tips on improving my code; I implement (or
    at least take strong note of) all the suggestions I receive. In the
    code I posted, however, I'm primarily interested in learning if
    there's a way to avoid opening the file to determine the character
    encoding, and then opening it again with the character encoding
    specified. In the second block of code that I originally posted, I
    only open the file once but I'm consistently encountering the "UTF16:
    Unrecognised BOM" error that I mentioned.

    Does anyone have any ideas how I can make the second block of code
    work? Or otherwise accomplish the task without opening the .txt file
    twice?

  • Jim Gibson at Apr 5, 2010 at 11:05 pm
    On 4/5/10 Mon Apr 5, 2010 2:56 PM, "Doug Cacialli"
    <doug.cacialli@gmail.com> scribbled:

    > I sincerely appreciate the tips on improving my code; I implement (or
    > at least take strong note of) all the suggestions I receive. In the
    > code I posted, however, I'm primarily interested in learning if
    > there's a way to avoid opening the file to determine the character
    > encoding, and then opening it again with the character encoding
    > specified. In the second block of code that I originally posted, I
    > only open the file once but I'm consistently encountering the "UTF16:
    > Unrecognised BOM" error that I mentioned.

    I don't work with UTF-encoded files in Perl, so I am not going to be able to
    answer your question definitively.

    In general, the operating system does not maintain the encoding of text
    files (there may be exceptions I don't know about). The information you seek
    is in the file itself. There is nothing wrong with opening a file twice, if
    that is what it takes. Data read from disk, including directory information,
    is usually cached, so the second open should occur very quickly, and may not
    require access to the drive itself.

    If you really want to know how a file is encoded without opening it, you can
    maintain the information outside the file itself. One suggestion would
    be to always use the same encoding. Another suggestion would be to use a
    file-naming convention: e.g., include '-utf16' in the file name if that is
    the encoding. You could also create an index file that specifies the
    encoding for each file.

    All of these involve extra work on your part, so avoiding opening the file
    twice may not be worth it.
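
    As an illustration of the file-naming idea, here is a minimal sketch.
    The tag-to-encoding table, the helper name encoding_from_name, and the
    sample file name are all assumptions made up for this example.

    ```perl
    use strict;
    use warnings;

    # Hypothetical convention: "notes-utf16.txt", "report-latin1.txt", etc.
    # This table is invented for the sketch, not taken from the thread.
    my %encoding_for = (
        utf16  => 'UTF-16',
        utf8   => 'UTF-8',
        latin1 => 'ISO-8859-1',
    );

    sub encoding_from_name {
        my ($path) = @_;
        # Look for "-<tag>" right before the extension, e.g. "-utf16.txt".
        my ($tag) = $path =~ /-([a-z0-9]+)\.[^.\/]+\z/i;
        return unless defined $tag;
        return $encoding_for{ lc $tag };   # undef if the tag is unknown
    }

    my $path = 'interview-utf16.txt';      # made-up file name
    my $enc  = encoding_from_name($path);
    print defined $enc ? "$path: $enc\n" : "$path: no encoding tag\n";
    ```

    The returned name could then go straight into the open mode, as in
    open my $fh, "<:encoding($enc)", $path.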

    > Does anyone have any ideas how I can make the second block of code
    > work? Or otherwise accomplish the task without opening the .txt file
    > twice?

    Even if I could solve your problem, I don't have your "second block of code"
    on this system. It is always a good idea to include a short program that
    demonstrates the problem you are having with each post.
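
    A short reproduction of that kind might look like the sketch below. It
    uses made-up sample text in an in-memory filehandle and mimics the
    failing program: read() consumes the start of the data, including the
    byte order mark (BOM), and the next <$fh> returns raw bytes that begin
    in the middle of the text. A likely explanation for the error is that
    decode("UTF-16", ...) looks for a BOM at the start of every string it is
    given. The bytes 61 00 (the letter "a" in little-endian UTF-16) match
    the "6100" in the reported message.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Made-up sample: UTF-16LE text preceded by a BOM (ff fe).
    my $bytes = "\xff\xfe" . encode('UTF-16LE', "head\nalso\n");

    open my $fh, '<', \$bytes or die "in-memory open failed: $!";
    binmode $fh;

    # Like the failing program: sniff the start of the file first...
    read($fh, my $start, 12);   # BOM (2 bytes) + "head\n" (5 x 2 bytes)

    # ...then read "lines" of raw bytes from wherever the handle now is.
    my $next = <$fh>;
    close $fh;

    printf "sniffed bytes begin with %s\n", unpack('H4', $start);   # fffe
    printf "next chunk begins with   %s\n", unpack('H4', $next);    # 6100
    ```

    Raw line reads also split at every 0a byte, which falls inside the
    two-byte newline of UTF-16, so decoding line by line stays fragile even
    with the BOM restored. As far as I know, newer Encode releases no longer
    die on a missing BOM but assume big-endian, which would turn the error
    into wrongly decoded text rather than fix it.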

    Maybe somebody smarter than me has better suggestions.

    Good luck.
  • Thomas Bätzler at Apr 6, 2010 at 8:43 am

    Doug Cacialli asked:

    > Does anyone have any ideas how I can make the second block of code
    > work? Or otherwise accomplish the task without opening the .txt file
    > twice?

    How large are your data files? If your available memory is much larger
    than your maximum file size, you might get away with slurping the file
    into a scalar and then converting its encoding if needed, possibly like
    this:

    #!/usr/bin/perl -w

    use strict;
    use Encode;

    my $file = 'test.txt';

    open( my $fh, '<', $file ) or die "Can't open '$file': $!";
    binmode( $fh );    # read raw bytes; avoids CRLF translation on Windows

    # Slurp the whole file into one scalar.
    my $data = do {
        local $/ = undef;
        <$fh>;
    };

    close( $fh );

    if( $data =~ m/^\xff\xfe/ || $data =~ m/^\xfe\xff/ ){
        print "input is UTF-16 w/ BOM\n";
        $data = decode('utf-16',$data);
    } elsif( $data =~ m/^[^\x00]\x00/ ){
        print "input is probably little-endian utf-16 w/o BOM\n";
        $data = "\xff\xfe" . $data;    # prepend a BOM so decode() knows the byte order
        $data = decode('utf-16',$data);
    } elsif( $data =~ m/^\x00[^\x00]/ ){
        print "input is probably big-endian utf-16 w/o BOM\n";
        $data = "\xfe\xff" . $data;
        $data = decode('utf-16',$data);
    }

    chomp( $data);

    my @words = split /\s+/, $data;

    print "input file has " . scalar( @words ) . " words\n";

    __END__

    HTH,
    Thomas
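
    For files too large to slurp, another possibility is to keep a single
    open, guess the encoding from the first bytes, rewind with seek, and
    then push an :encoding layer onto the same handle with binmode. The
    sketch below assumes a seekable file and that 500 bytes are enough for
    Encode::Guess (the figure used in the original post); open_guessed is a
    hypothetical helper name, and the sample file is created only for the
    demonstration.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);
    use Encode::Guess;
    use File::Temp qw(tempfile);

    # Hypothetical helper: opens $path once and returns a handle that
    # decodes while reading, using the encoding guessed from the start.
    sub open_guessed {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "Can't open '$path': $!\n";

        read($fh, my $start, 500) or die "'$path' is empty or unreadable\n";
        my $enc = guess_encoding($start);
        ref $enc or die "Can't guess encoding of '$path': $enc\n";

        # Rewind so the layer sees the file from the beginning (BOM included).
        seek($fh, 0, 0) or die "Can't rewind '$path': $!\n";
        binmode($fh, ':encoding(' . $enc->name . ')')
            or die "Can't set encoding layer on '$path': $!\n";
        return $fh;
    }

    # Demonstration with a made-up UTF-16LE file that starts with a BOM.
    my ($tmp, $tmpname) = tempfile();
    binmode $tmp;
    print {$tmp} "\xff\xfe", encode('UTF-16LE', "one two\nthree\n");
    close $tmp;

    my $fh = open_guessed($tmpname);
    while (my $line = <$fh>) {
        my @words = split ' ', $line;
        print scalar(@words), " words\n";
    }
    close $fh;
    unlink $tmpname;
    ```

    With the layer in place, lines are split after decoding, which is what
    the twice-opened first block in the original post relies on as well.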
  • Dr.Ruud at Apr 6, 2010 at 7:13 pm

    Thomas Bätzler wrote:

    > my $data = do {
    > local $/ = undef;
    > <$fh>;
    > };

    Especially for big files, that is better written as:

    my $data;
    { local $/;
      $data = <$fh>;
    }

    --
    Ruud
  • Uri Guttman at Apr 7, 2010 at 8:41 am
    "R" == Ruud <rvtol+usenet@isolution.nl> writes:
    R> Thomas Bätzler wrote:
    > my $data = do {
    > local $/ = undef;
    > <$fh>;
    > };
    R> Especially for big files, that is better written as:

    define big. most files are still text or similar and not big by today's
    ram sizes. slurping in a megabyte is nothing today. back in the day it
    would have caused major disk thrashing.

    R> my $data;
    R> { local $/;
    R> $data = <$fh>;
    R> }

    even better as:

    use File::Slurp ;

    my $data = read_file( $file ) ;

    faster, cleaner, no need for $/ and local.

    uri

    --
    Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Discussion Overview
group: beginners
category: perl
posted: Apr 3, '10 at 7:49p
active: Apr 7, '10 at 8:41a
posts: 7
users: 5
website: perl.org