Hello All,

A major part of my Perl scripting involves processing text files, and most of
the time I need huge text files (3 MB+) to perform benchmarking tests.

So I am planning to write a Perl script that will create a huge text file from
the sample file it receives as its first input parameter. I have the following
algorithm in mind:

1. Provide 2 input parameters to the Perl script - (i) the sample file, (ii)
the size of the new file.
E.g., to create a new file of size 3 MB:
perl Create_Huge_File.pl Sample.txt 3

2. Read the input file and store the contents into an array.

3. Create a new file.

4. Dump the contents of the above array into the new file.

5. Check the length of the new file. If it is less than the second input
parameter, repeat step 4; otherwise go to step 6.

6. Close the new file.
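
For reference, a minimal sketch of the six steps above might look like the
following (untested; the MB-to-bytes conversion, the @ARGV handling and the
output file name are assumptions, not part of the original post):

use strict;
use warnings;

# Step 1: two parameters - the sample file and the target size (assumed to be in MB).
my ($sample_file, $size_mb) = @ARGV;
die "Usage: perl Create_Huge_File.pl <sample_file> <size_in_MB>\n"
    unless defined $sample_file && defined $size_mb;
my $target_size = $size_mb * 1024 * 1024;

# Step 2: read the sample file (slurped into one string here rather than an array).
open my $in_fh, '<', $sample_file
    or die "Could not open $sample_file: $!";
my $sample = do { local $/; <$in_fh> };
close $in_fh;

# Step 3: create the new file (the output name is an assumption).
my $out_file = "huge_$sample_file";
open my $out_fh, '>', $out_file
    or die "Could not open $out_file: $!";

# Steps 4 and 5: keep appending the sample until the target size is reached.
my $written = 0;
while ($written < $target_size) {
    print {$out_fh} $sample;
    $written += length $sample;
}

# Step 6: close the new file.
close $out_fh;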

I have the following questions:

a.) What do I need to do to make sure that the length of the new file
increases every time step 4 is executed?

b.) Since a lot of I/O is involved, is this the most optimised solution? If
not, does anyone have a better design to meet my requirement?

c.) What are the likely bugs that may creep in with this algorithm?

Cheers,
Parag


  • Shlomi Fish at Jan 2, 2010 at 6:58 pm
    Hi Parag!
    On Saturday 02 Jan 2010 19:56:02 Parag Kalra wrote:
    Hello All,

    A major part of my Perl scripting involves processing text files, and most of
    the time I need huge text files (3 MB+) to perform benchmarking tests.

    So I am planning to write a Perl script that will create a huge text file from
    the sample file it receives as its first input parameter. I have the following
    algorithm in mind:

    1. Provide 2 input parameters to the Perl script - (i) the sample file, (ii)
    the size of the new file.
    E.g., to create a new file of size 3 MB:
    perl Create_Huge_File.pl Sample.txt 3

    2. Read the input file and store the contents into an array.
    Why an array? Storing it in a single string would be faster, use less memory,
    and be more efficient. See:

    http://www.perl.com/pub/a/2003/11/21/slurp.html
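    For example, slurping the whole sample file into one scalar could look like
    this (a sketch; $sample_filename is just a stand-in for the first input
    parameter):

    open my $in_fh, '<', $sample_filename
        or die "Could not open $sample_filename: $!";
    my $data_string = do {
        local $/;    # undefine the input record separator ...
        <$in_fh>;    # ... so the whole file is read in one go
    };
    close $in_fh;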
    3. Create a new file.

    4. Dump the contents of the above array into the new file.
    Again string.
    5. Check the length of the new file. If it is less than the second input
    parameter, repeat step 4; otherwise go to step 6.
    You can keep track of the length written so far in a variable, or use
    http://perldoc.perl.org/5.8.8/functions/tell.html .
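    For instance (a sketch; $out_fh, $data_string and $total_size are assumed to
    exist as in the algorithm above), tell() reports the current write position,
    which for a freshly written file is its size so far:

    # Keep appending until tell() reports that the target size has been reached.
    while (tell($out_fh) < $total_size) {
        print {$out_fh} $data_string;
    }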
    6. Close the new file. OK.
    I have the following questions:

    a.) What do I need to do to make sure that the length of the new file
    increases every time step 4 is executed?
    Nothing. Just print to the output filehandle; it will append to the file's
    contents and increase its size.
    b.) Since a lot of I/O is involved, is this the most optimised solution? If
    not, does anyone have a better design to meet my requirement?
    It should be good enough. Perl does I/O quickly.
    c.) What are the likely bugs that may creep in with this algorithm?
    Encoding problems, etc. Logistical problems.

    I should note that, in general, your algorithm will produce repetitive text
    with very little entropy:

    http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

    One option you may wish to take instead is to concatenate several different
    texts from sources of free online text, such as http://www.gutenberg.org/ or
    http://wikisource.org/ (and see also
    http://www.google.com/search?q=free%20online%20books ).

    Regards,

    Shlomi Fish

    --
    -----------------------------------------------------------------
    Shlomi Fish http://www.shlomifish.org/
    What Makes Software Apps High Quality - http://shlom.in/sw-quality

    Bzr is slower than Subversion in combination with Sourceforge.
    ( By: http://dazjorz.com/ )
  • Parag Kalra at Jan 3, 2010 at 5:13 am
    Thanks Shlomi for your expert comments and I must admit you have got a very
    strong vision. :)

    Anyway, coming to my first question:
    a.) What do I need to do to make sure that the length of the new file
    increases every time step 4 is executed?
    Although it may have nothing to do with this algorithm, I still thought of
    discussing it.

    Say I have scripts that dump some content to an output file handle inside a
    long loop. With such scripts I have noticed two types of behaviour:

    1. The size of the output file is zero while the script is executing inside
    the loop, and it increases only when the script finishes.

    2. Sometimes the size of the output file increases every time some data is
    written to it, i.e. the growth in size happens in real time.

    I want to control this behaviour. My guess is that it has something to do
    with the output buffer.

    Cheers,
    Parag
  • Shlomi Fish at Jan 3, 2010 at 8:38 am

    On Sunday 03 Jan 2010 07:12:32 Parag Kalra wrote:
    Thanks Shlomi for your expert comments and I must admit you have got a very
    strong vision. :)
    You're welcome.
    Anyway, coming to my first question:
    a.) What do I need to do to make sure that the length of the new file
    increases every time step 4 is executed?
    Although it may have nothing to do with this algorithm, I still thought of
    discussing it.

    Say I have scripts that dump some content to an output file handle inside a
    long loop. With such scripts I have noticed two types of behaviour:

    1. The size of the output file is zero while the script is executing inside
    the loop, and it increases only when the script finishes.

    2. Sometimes the size of the output file increases every time some data is
    written to it, i.e. the growth in size happens in real time.

    I want to control this behaviour. My guess is that it has something to do
    with the output buffer.
    Please read "Suffering from Buffering":

    http://perl.plover.com/FAQs/Buffering.html
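
    In short, the difference between the two behaviours is output buffering;
    turning on autoflush makes every print reach the file immediately (a sketch,
    with an assumed output file name):

    use IO::Handle;                 # gives file handles an autoflush() method

    open my $out_fh, '>', 'output.txt'
        or die "Could not open output.txt: $!";
    $out_fh->autoflush(1);          # flush after every print (your behaviour 2)
    # Alternatively, $| = 1; enables autoflush on the currently selected handle.

    print {$out_fh} "some data\n";  # now visible on disk right away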

    Regards,

    Shlomi Fish

    --
    -----------------------------------------------------------------
    Shlomi Fish http://www.shlomifish.org/
    Optimising Code for Speed - http://shlom.in/optimise

    Bzr is slower than Subversion in combination with Sourceforge.
    ( By: http://dazjorz.com/ )
  • Parag Kalra at Jan 3, 2010 at 2:25 pm
    I am curious to learn more about UTF and to understand related issues that
    may creep into my algorithm. Could someone please shed some light on it?

    Can I use the following:

    use Encode;

    while (<$sample_file_fh>) {
        # Encoding into utf data
        $utf_data = encode("utf8", $_);
        $data_string = $data_string . $utf_data;
    }

    # Checking the current length of the string
    while (length($data_string) < $total_size) {
        $data_string = $data_string . $data_string;
    }

    And finally print $data_string to the output file handle.

    I am just learning to handle UTF data so feel free to correct me. :)

    Cheers,
    Parag
  • Jeff Peng at Jan 3, 2010 at 2:50 pm

    Parag Kalra:
    I am curious to learn more about UTF and to understand related issues that
    may creep into my algorithm. Could someone please shed some light on it?

    Can I use the following:

    use Encode;

    while(<$sample_file_fh>){

    # Encoding into utf data
    $utf_data = encode("utf8", $_);

    Regarding the line above, I think it may not be right.
    What you get from <$sample_file_fh> may be a chunk in a different encoding,
    for example iso-8859-1, gb2312 or UTF-8.
    You want to translate it to Perl's internal utf8 format first, which consists
    of a utf8 flag and the data part. After translation, the utf8 flag should be
    on and the data part is the chunk in utf8 encoding. You do this with the
    decode() function from the Encode module:

    # given the data was originally in gb2312 encoding
    my $internal_utf8 = decode("gb2312", $_);

    After that, you can translate $internal_utf8 to any encoding you want, using
    the encode() function from the Encode module:

    my $output = encode("utf8", $internal_utf8); # output with UTF-8 encoding


    HTH.
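
    Putting the two calls together, a decode-then-encode round trip might look
    like the sketch below; the gb2312 source encoding and the file names are only
    assumed examples, and the PerlIO-layer variant in the comments achieves the
    same effect:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Assume the sample file is known to be gb2312-encoded.
    open my $sample_file_fh, '<', 'sample_gb2312.txt'
        or die "Could not open sample file: $!";

    my $data_string = '';
    while (my $line = <$sample_file_fh>) {
        my $characters = decode('gb2312', $line);       # bytes -> Perl characters
        $data_string  .= encode('utf8', $characters);   # characters -> UTF-8 bytes
    }
    close $sample_file_fh;

    # Equivalently, let PerlIO layers do the conversion:
    #   open my $in,  '<:encoding(gb2312)', 'sample_gb2312.txt' or die $!;
    #   open my $out, '>:encoding(UTF-8)',  'out_utf8.txt'      or die $!;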

  • Parag Kalra at Jan 3, 2010 at 3:03 pm

    What you get from <$sample_file_fh> may be a chunk in a different encoding,
    for example iso-8859-1, gb2312 or UTF-8.
    You want to translate it to Perl's internal utf8 format first, which consists
    of a utf8 flag and the data part. After translation, the utf8 flag should be
    on and the data part is the chunk in utf8 encoding. You do this with the
    decode() function from the Encode module:

    # given the data was originally in gb2312 encoding
    my $internal_utf8 = decode("gb2312", $_);

    And what if I don't know the encoding of the input chunk? Can't we use the
    decode() function in a generalised way, something like this:

    my $internal_utf8 = decode($_);

    i.e., decoding it to internal UTF-8 irrespective of the input encoding?

    Cheers,
    Parag
  • Shlomi Fish at Jan 3, 2010 at 3:43 pm

    On Sunday 03 Jan 2010 16:25:09 Parag Kalra wrote:
    I am curious to learn more about UTF and to understand related issues that
    may creep into my algorithm. Could someone please shed some light on it?

    Can I use the following:

    use Encode;
    Make sure you add "use strict;" and "use warnings;".
    while (<$sample_file_fh>) {
        # Encoding into utf data
        $utf_data = encode("utf8", $_);
        $data_string = $data_string . $utf_data;
    }

    # Checking the current length of the string
    while (length($data_string) < $total_size) {
        $data_string = $data_string . $data_string;
    }
    This snippet:

    1. Will double the length of $data_string on each iteration (exponential
    growth).

    2. Will create a very large buffer in memory.

    3. Can be written more idiomatically as "$data_string .= $data_string;"

    A better snippet would be (untested):

    <<<<<<<<<<<<
    {
        open my $out_fh, ">", $out_filename
            or die "Could not open $out_filename - $!";

        my $length_so_far = 0;

        while ($length_so_far < $total_size)
        {
            print {$out_fh} $data_string;

            $length_so_far += length($data_string);
        }

        close($out_fh);
    }
    >>>>>>>>>>>>

    Regards,

    Shlomi Fish

    --
    -----------------------------------------------------------------
    Shlomi Fish http://www.shlomifish.org/
    Funny Anti-Terrorism Story - http://shlom.in/enemy

    Bzr is slower than Subversion in combination with Sourceforge.
    ( By: http://dazjorz.com/ )
  • Parag Kalra at Jan 3, 2010 at 4:54 pm
    Thanks a bunch Shlomi.

    Using your snippet I am now able to create even a 1 GB file. Previously it
    was throwing an 'Out of Memory' message. :)

    OK, coming to the UTF discussion, will the following work:

    use Encode;
    my @all_encodings = Encode->encodings(":all");
    use Encode::Guess @all_encodings;

    while (<$sample_file_fh>) {
        # Encoding into utf data
        $utf_internal = decode("Guess", $_);
        $utf_data = encode("utf8", $utf_internal);
        $data_string = $data_string . $utf_data;
    }


    And then the snippet suggested by Shlomi.
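
    One caveat: "use" runs at compile time, so filling @all_encodings at run time
    will not reach "use Encode::Guess" as intended, and (as the Encode::Guess
    documentation notes) guessing against every known encoding is unlikely to
    work. A more conventional sketch limits the suspects to an explicit list (the
    suspect list and the already-open $sample_file_fh are assumptions):

    use strict;
    use warnings;
    use Encode qw(encode);
    use Encode::Guess qw(euc-jp shiftjis 7bit-jis);   # small, explicit suspect list

    my $data_string = '';
    while (my $line = <$sample_file_fh>) {
        my $decoder = Encode::Guess->guess($line);
        die "Could not guess encoding: $decoder\n" unless ref $decoder;
        my $characters = $decoder->decode($line);
        $data_string  .= encode('utf8', $characters);
    }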

    Cheers,
    Parag
  • Parag Kalra at Jan 3, 2010 at 5:29 pm
    Hmmm - http://search.cpan.org/~dankogai/Encode-2.39/lib/Encode/Guess.pm

    It says right at the bottom that this method of guessing the encoding won't
    work. :(

    Cheers,
    Parag
  • Jeff Peng at Jan 4, 2010 at 2:08 am

    Parag Kalra:
    Hmmm - http://search.cpan.org/~dankogai/Encode-2.39/lib/Encode/Guess.pm

    It says right at the bottom that this method of guessing the encoding won't
    work. :(
    Encode::Guess may work, but not very accurately.
    Because the code points of some encodings overlap (for example, gb2312 and
    gbk), you can't determine the encoding of a small string just by guessing.
    For a large text, however, it may work correctly.

    Here is another way of guessing (not by me) that you may refer to:

    use Encode;
    use LWP::Simple qw(get);
    use strict;

    my $str = get "http://www.sina.com.cn";

    eval {my $str2 = $str; Encode::decode("gbk", $str2, 1)};
    print "not gbk: $@\n" if $@;

    eval {my $str2 = $str; Encode::decode("utf8", $str2, 1)};
    print "not utf8: $@\n" if $@;

    eval {my $str2 = $str; Encode::decode("big5", $str2, 1)};
    print "not big5: $@\n" if $@;


    HTH.
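
    The same trial-and-error idea can be wrapped in a small helper that walks a
    list of candidate encodings and returns the first one that decodes without an
    error (a sketch; the candidate list and the sample bytes are assumptions):

    use strict;
    use warnings;
    use Encode ();

    # Try each candidate; the CHECK value 1 (Encode::FB_CROAK) makes decode() die
    # on malformed input, which eval{} turns into a simple pass/fail test.
    sub guess_by_trial {
        my ($bytes) = @_;
        # iso-8859-1 accepts any byte sequence, so it acts as a last resort.
        for my $candidate (qw(utf8 gbk big5 iso-8859-1)) {
            my $copy = $bytes;   # decode() with a CHECK value may modify its argument
            return $candidate if eval { Encode::decode($candidate, $copy, 1); 1 };
        }
        return;                  # nothing matched
    }

    my $some_bytes = "\x{e4}\x{bd}\x{a0}\x{e5}\x{a5}\x{bd}";  # UTF-8 bytes of two CJK characters
    my $guess = guess_by_trial($some_bytes);
    print defined $guess ? "looks like $guess\n" : "no candidate matched\n";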
  • Parag Kalra at Jan 4, 2010 at 6:58 pm
    Thanks Jeff.

    Cheers,
    Parag
  • Dr.Ruud at Jan 3, 2010 at 4:46 pm

    Parag Kalra wrote:

    I am curious to know more on UTF
    First read perlunitut.

    --
    Ruud
