
[P5P] Is Perl's newline mangling on Win32 necessary?

Christian Walde
Jun 28, 2011 at 6:26 pm
On Win32 Perl automatically adds the :crlf layer to a number of filehandles.

This causes a great many problems for a lot of Perl code, since constructs like $fh->binmode; and $s =~ s/\r//g; have to be added everywhere just in case that piece of perl would ever be executed on Windows. Usually this kind of thing only breaks tests, since they compare multiline text quite often; but every once in a while this breaks something important, like the CGI handler in Plack.
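[Editorial note: the defensive pattern described here might look like the following sketch; the choice of input file is purely illustrative.]

```perl
use strict;
use warnings;

# Hypothetical sketch of the workaround pattern described above: code
# that must behave identically on Win32 and *nix both binmodes the
# handle and strips any stray carriage returns from the data it reads.
open my $fh, '<', $0 or die "open: $!";   # read this script itself
binmode $fh;                  # pop the automatic :crlf layer on Win32
while ( my $line = <$fh> ) {
    $line =~ s/\r?\n\z//;     # tolerate both CRLF and LF line endings
    # ... process $line ...
}
close $fh;
```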

I've spent many a month now chasing these things and prodding CPAN authors into fixing them, with great success, increasing the number of CPAN distributions that install and work safely on Win32.

But i am dissatisfied and quite frankly sick of it. The reason is that as far as i can tell the perl modules are not wrong, not broken. Neither is Win32. In my perception i am chasing tails here, trying to fix a bug in perl core by having everyone BUT perl core implement workarounds for it.

I have been working with Perl and Win32 for a long time. However i have never seen a single instance where this :crlf auto-adding was useful. The only and single binary code i know about that actually cares about this thing is notepad.exe.

I am entirely confident that if i were to smoke the entirety of CPAN with the latest perl, and then resmoke it with a fork of that perl with :crlf auto-adding removed, the overall PASS count would increase a lot and many dists would automagically start testing/working right. Only a very minuscule number of dists would fail, specifically those that are win32-only and do multiline text comparisons with win32-generated data, which is easily fixed and much smaller in scope than the current situation.

However i might be overlooking something.

As such i have to ask: Are there even any reasons in this day and age to retain this malicious functionality?

--
With regards,
Christian Walde


58 responses

  • Eric Brine at Jun 28, 2011 at 6:39 pm

    On Tue, Jun 28, 2011 at 2:26 PM, Christian Walde wrote:

    On Win32 Perl automatically adds the :crlf layer to a number of
    filehandles.

    This causes a great many problems for a lot of Perl code, since constructs
    like $fh->binmode; and $s =~ s/\r//g; have to be added everywhere just in
    case that piece of perl would ever be executed on Windows.

    First,

    binmode also removes :encoding layers, so you want to use binmode on all
    systems, not just Windows.

    Second,

    If you have to do both $fh->binmode; and $s =~ s/\r//g;, it's not a Windows
    issue. If you have a text file, you wouldn't do either. If you don't have a
    text file, the presence of \r has nothing to do with Windows.
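[Editorial note: Eric's first point can be demonstrated directly; a minimal sketch, reading the script itself purely for illustration.]

```perl
use strict;
use warnings;

# binmode($fh) pushes :raw, which pops non-binary layers -- including
# :encoding -- so it is relevant on every platform, not only Windows.
open my $fh, '<:encoding(UTF-8)', $0 or die "open: $!";
my @before = PerlIO::get_layers($fh);
binmode $fh;
my @after = PerlIO::get_layers($fh);
print "before: @before\n";   # includes an encoding(...) layer
print "after:  @after\n";    # encoding layer is gone
close $fh;
```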
  • Christian Walde at Jun 28, 2011 at 6:47 pm

    On Tue, 28 Jun 2011 20:39:19 +0200, Eric Brine wrote:

    On Tue, Jun 28, 2011 at 2:26 PM, Christian Walde <
    walde.christian@googlemail.com> wrote:
    On Win32 Perl automatically adds the :crlf layer to a number of
    filehandles.

    This causes a great many problems for a lot of Perl code, since constructs
    like $fh->binmode; and $s =~ s/\r//g; have to be added everywhere just in
    case that piece of perl would ever be executed on Windows.

    First,

    binmode also removes :encoding layers, so you want to use binmode on all
    systems, not just Windows.
    The point was that in order to work right on windows you HAVE to add it in many situations where it would otherwise be completely unnecessary. Yes, it has other uses and i'm not saying that binmode is bad. The cargo-culting of binmode that :crlf-autoadding FORCES is bad.
    Second,

    If you have to do both $fh->binmode; and $s =~ s/\r//g;, it's not a Windows
    issue. If you have a text file, you wouldn't do either. If you don't have a
    text file, the presence of \r has nothing to do with Windows.
    Sometimes you have access to the file handle, sometimes you're dealing with another module written by someone else that doesn't have binmode in there. In that case you need to strip them manually while you wait (days, weeks, months) for the CPAN author to accept your patch and upload a new release.

    --
    With regards,
    Christian Walde
  • Eric Brine at Jun 28, 2011 at 7:09 pm

    On Tue, Jun 28, 2011 at 2:47 PM, Christian Walde wrote:

    On Tue, 28 Jun 2011 20:39:19 +0200, Eric Brine wrote:

    On Tue, Jun 28, 2011 at 2:26 PM, Christian Walde <
    walde.christian@googlemail.com> wrote:

    On Win32 Perl automatically adds the :crlf layer to a number of
    filehandles.

    This causes a great many problems for a lot of Perl code, since
    constructs
    like $fh->binmode; and $s =~ s/\r//g; have to be added everywhere just in
    case that piece of perl would ever be executed on Windows.

    First,

    binmode also removes :encoding layers, so you want to use binmode on all
    systems, not just Windows.
    The point was that in order to work right on windows you HAVE to add it
    in many situations where it would otherwise be completely unnecessary. Yes,
    it has other uses and i'm not saying that binmode is bad. The cargo-culting
    of binmode that :crlf-autoadding FORCES is bad.

    No, that's not true. Whenever you need to do it on Windows, you also need to
    do it outside of Windows. You seem to forget -C, open pragma and the PERLIO
    env var.

    Second,
    If you have to do both $fh->binmode; and $s =~ s/\r//g;, it's not a
    Windows
    issue. If you have a text file, you wouldn't do either. If you don't have
    a
    text file, the presence of \r has nothing to do with Windows.
    Sometimes you have access to the file handle, sometimes you're dealing with
    another module written by someone else that doesn't have binmode in there.

    Then it will break on Windows and non-Windows, not just Windows.
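[Editorial note: the open pragma Eric mentions sets default layers lexically; a small sketch, using an in-memory handle so it is self-contained.]

```perl
use strict;
use warnings;

# 'use open' installs default PerlIO layers for handles opened in its
# lexical scope -- one way to get predictable newline behavior without
# sprinkling binmode() calls everywhere.
use open IO => ':raw';

open my $fh, '>', \my $buf or die "open: $!";
print $fh "line\n";          # written verbatim: no CRLF translation
close $fh;
printf "%v02X\n", $buf;      # 6C.69.6E.65.0A
```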
  • Christian Walde at Jun 28, 2011 at 8:13 pm

    On Tue, 28 Jun 2011 21:09:00 +0200, Eric Brine wrote:

    On Tue, Jun 28, 2011 at 2:47 PM, Christian Walde <
    walde.christian@googlemail.com> wrote:
    On Tue, 28 Jun 2011 20:39:19 +0200, Eric Brine <ikegami@adaelis.com>
    wrote:

    On Tue, Jun 28, 2011 at 2:26 PM, Christian Walde <
    walde.christian@googlemail.com> wrote:

    On Win32 Perl automatically adds the :crlf layer to a number of
    filehandles.

    This causes a great many problems for a lot of Perl code, since
    constructs like $fh->binmode; and $s =~ s/\r//g; have to be added
    everywhere just in case that piece of perl would ever be executed
    on Windows.
    binmode also removes :encoding layers, so you want to use binmode on all
    systems, not just Windows.
    The point was that in order to work right on windows you HAVE to add it
    in many situations where it would otherwise be completely unnecessary. Yes,
    it has other uses and i'm not saying that binmode is bad. The cargo-culting
    of binmode that :crlf-autoadding FORCES is bad.
    No, that's not true. Whenever you need to do it on Windows, you also need to
    do it outside of Windows.
    There are endless permutations of this kind of thing, combining reading, writing, pre-testing, during testing, without binmode on various operations or with it on various operations. But I'll just show you one very simple and often seen example of this kind of behavior:

    http://dl.dropbox.com/u/10190786/newlines.zip

    On Linux all tests pass because it uses :raw by default. On Windows however only the binmode loader works correctly.

    So, yes, it does only need to be added for windows' benefit in order to fix misbehavior of perl core, with no appreciable newline effect on *nix.
    You seem to forget -C, open pragma and the PERLIO env var.
    Honestly, i wasn't aware of them and looked them up:

    -C seems to have nothing at all to do with this, as it deals with unicode. Please correct me if i overlooked anything. If i am right: This newline issue has nothing to do with unicode and unicode has no business being in this discussion at all.

    open.pm and PERLIO look neat. However, at the end of the day they are bandaids that are just slightly more convenient to use than binmode() and i'd still have to pester CPAN authors to cargocult them into their code to fix perl modules globally.
    If you don't have a text file, the presence of \r has nothing to do with Windows.
    I missed this earlier and i would just like to mention that this is not completely true, since STDOUT/ERR/IN are also affected by this mangling.
    If you have to do both $fh->binmode; and $s =~ s/\r//g;, it's not a
    Windows issue. If you have a text file, you wouldn't do either. If you
    don't have a text file, the presence of \r has nothing to do with Windows.
    Sometimes you have access to the file handle, sometimes you're dealing with
    another module written by someone else that doesn't have binmode in there.
    Then it will break on Windows and non-Windows, not just Windows.
    I disagree, but quite honestly, i don't have the patience to dig up or make examples of situations where this breaks. More importantly though, it distracts from the actual point of the discussion:

    Are there any reasons why :crlf auto-adding is still necessary nowadays?

    --
    With regards,
    Christian Walde
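[Editorial note: the mangling of STDIN/STDOUT/STDERR that Christian mentions can be inspected with PerlIO::get_layers; a sketch whose output is platform-dependent, so none is shown.]

```perl
use strict;
use warnings;

# Lists the PerlIO layers on the standard handles. On a typical Unix
# build this prints something like 'unix perlio'; on Win32 the stack
# additionally carries the implicit crlf layer under discussion here.
for my $pair ( [ STDIN => \*STDIN ], [ STDOUT => \*STDOUT ], [ STDERR => \*STDERR ] ) {
    my ( $name, $handle ) = @$pair;
    print "$name: ", join( ' ', PerlIO::get_layers($handle) ), "\n";
}
```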
  • Leon Timmermans at Jun 28, 2011 at 8:40 pm

    On Tue, Jun 28, 2011 at 10:13 PM, Christian Walde wrote:
    On Linux all tests pass because it uses :raw by default. On Windows however
    only the binmode loader works correctly.
    No, Perl on Unix does not use «:raw» by default, it uses
    «:unix:perlio». The fact that :raw is an identity operator on top of
    that stack doesn't mean :raw is involved in any way.
    -C seems to have nothing at all to do with this, as it deals with unicode.
    Please correct me if i overlooked anything. If i am right: This newline
    issue has nothing to do with unicode and unicode has no business being in
    this discussion at all.
    It's all about PerlIO stacks. -C can add :utf8 to the stack, and that
    needs to be popped off too when opening binary files. :crlf isn't the
    only non-binary layer ;-).
    Then it will break on Windows and non-Windows, not just Windows.
    I disagree, but quite honestly, i don't have the patience to dig up or make
    examples of situations where this breaks. More importantly though, it
    distracts from the actual point of the discussion:
    The issue may be bigger on Windows than on Unix, but that doesn't mean
    it isn't present on Unix.
    Are there any reasons why :crlf auto-adding is still necessary nowadays?
    Because it's the correct thing to do on DOSish platform?

    Leon
  • Christian Walde at Jun 28, 2011 at 9:09 pm

    On Tue, 28 Jun 2011 22:39:55 +0200, Leon Timmermans wrote:

    On Tue, Jun 28, 2011 at 10:13 PM, Christian Walde
    wrote:
    On Linux all tests pass because it uses :raw by default. On Windows however
    only the binmode loader works correctly.
    No, Perl on Unix does not use «:raw» by default, it uses
    «:unix:perlio». The fact that :raw is an identity operator on top of
    that stack doesn't mean :raw is involved in any way.
    Ok, i wasn't aware of what they do. Thanks for the correction.

    I'd like to still emphasize though that i meant: Neither on Linux nor on Unix does Perl do any newline mangling. Only on Windows. And i struggle to find any reason for why it still needs to be there.
    -C seems to have nothing at all to do with this, as it deals with unicode.
    Please correct me if i overlooked anything. If i am right: This newline
    issue has nothing to do with unicode and unicode has no business being in
    this discussion at all.
    It's all about PerlIO stacks. -C can add :utf8 to the stack, and that
    needs to be popped off too when opening binary files. :crlf isn't the
    only non-binary layer ;-).
    Yes, i understand that. But i am talking only and exclusively about newline transformation, since :utf8 isn't forced on anyone just because they have a specific os.
    Then it will break on Windows and non-Windows, not just Windows.
    I disagree, but quite honestly, i don't have the patience to dig up or make
    examples of situations where this breaks. More importantly though, it
    distracts from the actual point of the discussion:
    The issue may be bigger on Windows than on Unix, but that doesn't mean
    it isn't present on Unix.
    *nix has automatic newline transformation?
    Are there any reasons why :crlf auto-adding is still necessary nowadays?
    Because it's the correct thing to do on DOSish platform?
    As we discussed on IRC, this means: Because it's tradition. And you agreed there, tradition is not a valid reason to keep bad behavior. :)

    I have to say though: If anyone can explain why it became a tradition, that would be useful information in this discussion.

    --
    With regards,
    Christian Walde
  • Leon Timmermans at Jun 28, 2011 at 9:32 pm

    On Tue, Jun 28, 2011 at 11:08 PM, Christian Walde wrote:
    As we discussed on IRC, this means: Because it's tradition. And you agreed
    there, tradition is not a valid reason to keep bad behavior. :)

    I have to say though: If anyone can explain why it became a tradition, that
    would be useful information in this discussion.
    Terminals, the teleprinter kind. Carriage return would reset the head
    of the printer to the first column, line feed would scroll it down one
    notch. Together they had the newline effect. CP/M was apparently
    designed for such devices. DOS took over this convention from CP/M for
    reasons of backwards compatibility. Windows did the same with DOS.

    OK, I think I'm convinced in your favor now ;-).

    Leon
  • Christian Walde at Jun 28, 2011 at 9:35 pm

    On Tue, 28 Jun 2011 23:32:32 +0200, Leon Timmermans wrote:

    On Tue, Jun 28, 2011 at 11:08 PM, Christian Walde
    wrote:
    As we discussed on IRC, this means: Because it's tradition. And you agreed
    there, tradition is not a valid reason to keep bad behavior. :)

    I have to say though: If anyone can explain why it became a tradition, that
    would be useful information in this discussion.
    Terminals, the teleprinter kind. Carriage return would reset the head
    of the printer to the first column, line feed would scroll it down one
    notch. Together they had the newline effect. CP/M was apparently
    designed for such devices. DOS took over this convention from CP/M for
    reasons of backwards compatibility. Windows did the same with DOS.

    OK, I think I'm convinced in your favor now ;-).
    Haha, thanks a lot. I think that actually makes a very strong point towards how obsolete this behavior is. :)

    --
    With regards,
    Christian Walde
  • Craig A. Berry at Jun 28, 2011 at 10:53 pm

    On Tue, Jun 28, 2011 at 4:08 PM, Christian Walde wrote:

    I have to say though: If anyone can explain why it became a tradition, that
    would be useful information in this discussion.
    The beginnings of the code are here:

    <http://perl5.git.perl.org/perl.git/commit/66ecd56be076649bc9da523c12d89e06e353e801?f=perlio.c>

    Having a crlf layer was part of the original design of PerlIO's "line
    disciplines" as they were called then, discussed at some length here:

    <http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-11/msg00405.html>

    There was some question then of whether the crlf layer should be
    default for Windows. I don't know exactly when or where the decision
    to do that was made, but I suspect it was solving some actual problem.
  • Christian Walde at Jun 28, 2011 at 11:04 pm

    On Wed, 29 Jun 2011 00:53:25 +0200, Craig A. Berry wrote:

    On Tue, Jun 28, 2011 at 4:08 PM, Christian Walde
    wrote:
    I have to say though: If anyone can explain why it became a tradition, that
    would be useful information in this discussion.
    The beginnings of the code are here:

    <http://perl5.git.perl.org/perl.git/commit/66ecd56be076649bc9da523c12d89e06e353e801?f=perlio.c>

    Having a crlf layer was part of the original design of PerlIO's "line
    disciplines" as they were called then, discussed at some length here:

    <http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-11/msg00405.html>

    There was some question then of whether the crlf layer should be
    default for Windows. I don't know exactly when or where the decision
    to do that was made, but I suspect it was solving some actual problem.
    Thanks for digging!

    Looking at the mails, it was meant to be a stand-in for CRLF/LF auto-detection, which was intended to prevent mixed newlines from getting into files. I'd argue that that is not a valid reason to keep it around: it's terrible at that job, since the programmer has to detect or decide the line endings manually anyhow, AND the distinction is largely useless because only notepad cares about the CR. :)

    --
    With regards,
    Christian Walde
  • Sisyphus at Jun 28, 2011 at 11:24 pm
    ----- Original Message -----
    From: "Christian Walde"
    I'd like to still emphasize though that i meant: Neither on Linux nor on
    Unix does Perl do any newline mangling. Only on Windows. And i struggle to
    find any reason for why it still needs to be there.
    Me, too.

    Cheers,
    Rob
  • Jan Dubois at Jun 29, 2011 at 6:24 am

    On Tue, 28 Jun 2011, Sisyphus wrote: "Christian Walde" wrote:
    I'd like to still emphasize though that i meant: Neither on Linux
    nor on Unix does Perl do any newline mangling. Only on Windows. And
    i struggle to find any reason for why it still needs to be there.
    Me, too.
    I already gave the tl;dr explanation in my previous reply. Minimal sample:

    C:\>echo 123|perl -E "say <> =~ /^\d+$/ ? 'number' : 'not'"
    number

    C:\>echo 123|perl -E "binmode(STDIN); say <> =~ /^\d+$/ ? 'number' : 'not'"
    not

    Cheers,
    -Jan
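[Editorial note: the behavior in Jan's sample can be reproduced without a Windows console; the key is that /^\d+$/ tolerates a trailing \n but not the \r left behind once the :crlf layer is removed.]

```perl
use strict;
use warnings;

# What STDIN delivers on Win32, with and without the default :crlf layer.
my $cooked = "123\n";     # :crlf active: CRLF already translated to \n
my $raw    = "123\r\n";   # after binmode(STDIN): the \r survives
print $cooked =~ /^\d+$/ ? "number\n" : "not\n";   # number
print $raw    =~ /^\d+$/ ? "number\n" : "not\n";   # not ($ won't skip the \r)
```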
  • David Golden at Jun 29, 2011 at 10:00 am

    On Wed, Jun 29, 2011 at 2:24 AM, Jan Dubois wrote:
    I already gave the tl;dr explanation in my previous reply.  Minimal sample:

    C:\>echo 123|perl -E "say <> =~ /^\d+$/ ? 'number' : 'not'"
    number

    C:\>echo 123|perl -E "binmode(STDIN); say <> =~ /^\d+$/ ? 'number' : 'not'"
    not
    To me, that's an argument for STD* to have :crlf by default, since on
    Windows the console uses CRLF, but not an argument for "open" to apply
    :crlf by default to any arbitrary file.

    I sort of like the idea of removing :crlf by default *if* we provide a
    :text option that people can use to be explicit that they want a text
    mode transformation.

    E.g.
    use v5.16;
    open my $bin, "<", "binary.dat"; # :raw even on Win32
    open my $txt, "<:text", "textual.txt"; # :text

    What that really implies is that under v5.16, Perl no longer assumes
    that text files are the primary type of file being manipulated or that
    text files for use on the local machine are the primary type of file
    being manipulated.

    I might even go so far as to wonder whether under v5.16, open should
    warn unless a layer is specified (or the open pragma is in effect):

    open my $bin, "<:raw", "binary.dat"; # explicit

    I could see people finding that annoying (oh, no! more characters to
    type!) but it might be the way to encourage good practices about being
    explicit about expected encodings on input/output handles (e.g. utf8).
    I don't know. Maybe that's too nanny-ish for Perl.

    -- David
  • Jan Dubois at Jun 29, 2011 at 1:07 pm

    On Wed, 29 Jun 2011, David Golden wrote:
    On Wed, Jun 29, 2011 at 2:24 AM, Jan Dubois wrote:
    I already gave the tl;dr explanation in my previous reply. Minimal sample:

    C:\>echo 123|perl -E "say <> =~ /^\d+$/ ? 'number' : 'not'"
    number

    C:\>echo 123|perl -E "binmode(STDIN); say <> =~ /^\d+$/ ? 'number' : 'not'"
    not
    To me, that's an argument for STD* to have :crlf by default, since on
    Windows the console uses CRLF, but not an argument for "open" to apply
    :crlf by default to any arbitrary file.
    It has *nothing* to do with the console, but with the fact that *text* files
    on Windows have CRLF line endings. Download a text/plain file with your
    browser, or redirect the output of most any program to a file, and
    you end up with CRLF line endings.

    Most programs work correctly with just LF terminated lines because that
    is what you would end up with after removing the CR anyway. This wouldn't
    work on Mac OS (which we don't support anymore), so it is an accidental
    feature, but we could claim that it is an application of "be strict in
    what you emit, and liberal in what you accept".
    I sort of like the idea of removing :crlf by default *if* we provide a
    :text option that people can use to be explicit that they want a text
    mode transformation.
    I would like to have a :text layer; I did need it at least once before,
    but can't remember the details anymore. But:
    What that really implies is that under v5.16, Perl no longer assumes
    that text files are the primary type of file being manipulated or that
    text files for use on the local machine are the primary type of file
    being manipulated.
    Assuming that Perl is primarily used to manipulate *binary* files
    seems just wrong to me.

    I'm objecting to this proposal because its intention is to *hide*
    the portability problems of Perl scripts as long as they only operate
    on POSIX style text files. It does not encourage writing portable
    scripts that work with *both* native and foreign line-endings.

    If you want Cygwin functionality, you can find it over -----> there.
    The native port is supposed to work as best as possible with the
    native OS.
    I might even go so far as to wonder whether under v5.16, open should
    warn unless a layer is specified (or the open pragma is in effect):

    open my $bin, "<:raw", "binary.dat"; # explicit

    I could see people finding that annoying (oh, no! more characters to
    type!) but it might be the way to encourage good practices about being
    explicit about expected encodings on input/output handles (e.g. utf8).
    I don't know. Maybe that's too nanny-ish for Perl.
    I think this is better done by Perl::Critic.

    As for encodings, they are completely orthogonal to the question of
    line endings, and you could even argue that any character encoding
    should also imply :text. It would certainly help with this kind of
    abomination, which is currently necessary to open a platform-native
    text file in Unicode encoding on Windows:

    open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die;

    Cheers,
    -Jan
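[Editorial note: layer order explains why Jan's three-layer incantation works. Layers are pushed left to right, so on output the topmost :crlf turns \n into \r\n first, and the :encoding(UTF-16LE) beneath it then encodes both characters. A sketch using an in-memory handle:]

```perl
use strict;
use warnings;

# :raw pops any default layers, :encoding(UTF-16LE) encodes characters
# to bytes, and the :crlf pushed on top translates \n to \r\n *before*
# encoding -- so CR and LF each come out as a 2-byte UTF-16LE unit.
open my $fh, '>:raw:encoding(UTF-16LE):crlf', \my $buf
    or die "open: $!";
print $fh "ok\n";
close $fh;
printf "%v02X\n", $buf;   # 6F.00.6B.00.0D.00.0A.00
```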
  • Tom Christiansen at Jun 29, 2011 at 1:34 pm

    It has *nothing* to do with the console, but with the fact that *text* files
    on Windows have CRLF line endings. Download a text/plain file with your
    browser, or redirect the output of most any program to a file, and
    you end up with CRLF line endings.
    Most programs work correctly with just LF terminated lines because that
    is what you would end up after removing the CR anyways. This wouldn't
    work on Mac OS (which we don't support anymore), so it is an accidental
    feature, but we could claim that it is an application of "be strict in
    what you emit, and liberal in what you accept".
    Would you believe that some programs' "Save as text" under Darwin still
    produce "\r" for line separators? It's true. So you can get a
    webpage that you Save As text and it has no newlines in it at all!!
    Sometimes it's even in MacRoman. Yes, this is under Darwin.

    -tom
  • David Golden at Jun 29, 2011 at 1:39 pm

    On Wed, Jun 29, 2011 at 9:06 AM, Jan Dubois wrote:
    What that really implies is that under v5.16, Perl no longer assumes
    that text files are the primary type of file being manipulated or that
    text files for use on the local machine are the primary type of file
    being manipulated.
    Assuming that Perl is primarily used to manipulate *binary* files
    seems just wrong to me.
    I think the original sin was assuming that text mode should be the
    default. Assuming binary is just as arbitrary, but it's arbitrary in
    a way that does not change what is read or written, which I think is a
    less intrusive assumption. Obviously, I would only want that change
    under
    a "use v5.X" semantic rule.
    I'm objecting to this proposal because its intention is to *hide*
    the portability problems of Perl scripts as long as they only operate
    on POSIX style text files.  It does not encourage writing portable
    scripts that work with *both* native and foreign line-endings.
    I don't understand your argument. I think people are making
    assumptions either way. My thought is that if code is protected by a
    lexical version declaration, then we can expect that (with some
    learning curve) they understand what assumptions are in force.
    I might even go so far as to wonder whether under v5.16, open should
    warn unless a layer is specified (or the open pragma is in effect):

    open my $bin, "<:raw", "binary.dat";  # explicit

    I could see people finding that annoying (oh, no! more characters to
    type!) but it might be the way to encourage good practices about being
    explicit about expected encodings on input/output handles (e.g. utf8).
    I don't know.  Maybe that's too nanny-ish for Perl.
    I think this is better done by Perl::Critic.
    Very possibly. That's another way to address the issue.
    As for encodings, they are completely orthogonal to the question of
    line endings, and you could even argue that any character encoding
    should also imply :text.  It would certainly help with this kind of
    abomination, which is currently necessary to open a platform-native
    text file in Unicode encoding on Windows:

    open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die;
    Ick.

    My broader point was that now that (Unicode) encoded files are
    increasingly common, it's not unreasonable to start nudging
    programmers to be explicit about what transformations they want when
    reading from a filehandle. In that sense, the encoding isn't
    orthogonal -- both result in transformations of the octet-stream being
    read/written.

    -- David
  • Jan Dubois at Jun 29, 2011 at 1:59 pm

    On Wed, 29 Jun 2011, David Golden wrote:
    On Wed, Jun 29, 2011 at 9:06 AM, Jan Dubois wrote:
    I think the original sin was assuming that text mode should be the
    default. Assuming binary is just as arbitrary, [...]
    Assuming text mode is not arbitrary; it is optimizing for the
    common use case. Perl does a lot of that. :)
    I'm objecting to this proposal because its intention is to *hide*
    the portability problems of Perl scripts as long as they only operate
    on POSIX style text files. It does not encourage writing portable
    scripts that work with *both* native and foreign line-endings.
    I don't understand your argument. I think people are making
    assumptions either way. My thought is that if code is protected by a
    lexical version declaration, then we can expect that (with some
    learning curve) they understand what assumptions are in force.
    With text mode being the default, most code dealing with text is
    cross-platform by default too. With binary mode the default, programs
    dealing with text files are platform-dependent unless you make an
    extra effort to make them portable.

    I'm just arguing that most *users* of Perl are processing text files.
    The number of people writing network protocol layer code is comparatively
    small, so let's have defaults that work for the majority and not just
    for "us".
    My broader point was that now that (Unicode) encoded files are
    increasingly common, it's not unreasonable to start nudging
    programmers to be explicit about what transformations they want when
    reading from a filehandle. In that sense, the encoding isn't
    orthogonal -- both result in transformations of the octet-stream being
    read/written.
    Indeed. That is still no reason to break default text mode handling
    on Windows when no layers are specified at all though. Well, IMO of
    course.

    Cheers,
    -Jan
  • Tom Christiansen at Jun 29, 2011 at 2:10 pm

    "Jan Dubois" <jand@activestate.com> wrote on Wed, 29 Jun 2011 06:59:03 PDT:
    On Wed, 29 Jun 2011, David Golden wrote:
    On Wed, Jun 29, 2011 at 9:06 AM, Jan Dubois wrote:
    I think the original sin was assuming that text mode should be the
    default. Assuming binary is just as arbitrary, [...]
    Assuming text mode is not arbitrary; it is optimizing for the
    common use case. Perl does a lot of that. :)
    ++
  • Ed Avis at Jun 29, 2011 at 2:13 pm
    Jan Dubois <jand <at> activestate.com> writes:
    With text mode being the default, most code dealing with text is
    cross-platform by default too. With binary mode the default, programs
    dealing with text files are platform-dependent unless you make an
    extra effort to make them portable.
    Well, this depends on what you consider 'cross-platform'. From my point of
    view, without the magic CRLF translation, most code dealing with text is
    cross-platform and will produce the same results on Windows and Unix. But
    with the CRLF translation, the code starts behaving differently on Windows
    and becomes platform-specific.

    I can see where you're coming from: there is a way of seeing the world which
    considers a 'text file' to consist of lines of text and the exact binary
    encoding of that text (in this case the line separators) to be a platform-
    specific implementation detail. In that case programs should work with the
    high-level text and use the platform's native conventions for serializing it
    to disk. This line of thinking brought us 'text' mode in FTP clients, for
    example.

    But I don't believe the modern world works like that. Perhaps you use both
    Unix-like and Windows systems at the same time and share files between them
    using a network filesystem such as CIFS. Do you mark each file as text or
    binary and have the filesystem magically translate the line endings for text?
    That's just not how things are done today. You get the same file contents
    no matter which system is reading the file.

    Given such a mixed environment, the ideal of 'cross-platform' text handling by
    translating line endings falls flat. If running on Windows, you will need to
    handle Unix-style line endings anyway, if reading a file mounted from a Unix
    system. And it's a fairly safe bet that if you are writing output to a network
    drive shared with a Unix system you'll want to write \n endings too. If,
    due to misfortune, you work with software that still expects \r\n endings in
    text files (of which there is not much), then you will still need to handle it
    in your Perl code if you want that code to be portable to Unix systems.
    There is no magic fairy that strips out the \r line endings when copying the
    file to your Linux box, nor one that adds them when copying it to Windows.

    --
    Ed Avis <eda@waniasset.com>
  • Ed Avis at Jun 29, 2011 at 1:58 pm
    Jan Dubois <jand <at> activestate.com> writes:
    It has *nothing* to do with the console, but with the fact that *text* files
    on Windows have CRLF line endings.
    Twenty years ago that was obviously true, but now it's a shaky claim. Most
    Windows software doesn't work with text files - things are saved in a binary
    format or else as HTML and XML (where line ending whitespace is not significant).
    notepad.exe is the only major holdout, and that has been relegated to use by a
    relatively small number of greybeard power users. (Its place in the Accessories
    menu has been taken by Wordpad for ten years at least. Wordpad handles Unix
    line endings fine.)

    Manipulating text files or piping one command into another just aren't part of
    everyday computer usage any more. The people who do these things are using
    Windows ports of Unixish tools from GNU or elsewhere. Adding the extra CR
    characters might have helped interoperate with DOS text-manipulation programs
    in the past. Nowadays, it causes more problems than it solves.

    You are right, though, that text/plain saved from a web browser gets the CRLF
    line endings.

    --
    Ed Avis <eda@waniasset.com>
  • Tom Christiansen at Jun 29, 2011 at 2:09 pm

    Ed Avis wrote on Wed, 29 Jun 2011 13:58:30 -0000:

    You are right, though, that text/plain saved from a
    web browser gets the CRLF line endings.
    And on Darwin, it gets CR alone, which is incredibly annoying.

    --tom
  • Ed Avis at Jun 29, 2011 at 2:15 pm

    Tom Christiansen <tchrist <at> perl.com> writes:

    You are right, though, that text/plain saved from a
    web browser gets the CRLF line endings.
    And on Darwin, it gets CR alone, which is incredibly annoying.
    So, obviously, to be 'cross-platform' Perl on Darwin writes text output using
    CR alone for line endings? Right?

    --
    Ed Avis <eda@waniasset.com>
  • Tom Christiansen at Jun 29, 2011 at 2:26 pm
    Ed Avis wrote
    on Wed, 29 Jun 2011 14:14:14 -0000:
    You are right, though, that text/plain saved from a
    web browser gets the CRLF line endings.
    And on Darwin, it gets CR alone, which is incredibly annoying.
    So, obviously, to be 'cross-platform' Perl on Darwin writes text
    output using CR alone for line endings? Right?
    Don't be silly.

    --tom
  • Konovalov, Vadim (Vadim)** CTR ** at Jun 30, 2011 at 1:52 pm

    From: Jan Dubois

    It has *nothing* to do with the console, but with the fact
    that *text* files
    on Windows have CRLF line endings.
    most of my text files on windows do not have CRLF - they only have
    \x0a as line end
    Download a text/plain file with your
    browser, or redirect the output of most any program to a file, and
    you end up with CRLF line endings.
    not convincing.

    there are couple of programs that do not recognize \x0a properly
    but most programs are not that bogus.
    Most programs work correctly with just LF terminated lines
    because that
    is what you would end up after removing the CR anyways.
    indeed.

    Assuming that Perl is primarily used to manipulate *binary* files
    seems just wrong to me.
    my opinion is the opposite - if I get something from a file into a scalar and
    this scalar does not match its content - then something is very wrong.

    If I need to provide my program with a number of operators to override
    badly designed defaults - this isn't good.

    Regards,
    Vadim.
  • Aristotle Pagaltzis at Jul 1, 2011 at 7:09 pm

    * Konovalov, Vadim (Vadim)** CTR ** [2011-06-30 15:55]:
    From: Jan Dubois

    It has *nothing* to do with the console, but with the fact
    that *text* files on Windows have CRLF line endings.
    most of my text files on windows do not have CRLF - they only
    have \x0a as line end
    Download a text/plain file with your browser, or redirect the
    output of most any program to a file, and you end up with CRLF
    line endings.
    not convincing.

    there are couple of programs that do not recognize \x0a
    properly but most programs are not that bogus.
    At $work we have had huge headaches trying to keep CRLFs out of
    our Git repositories. Sources that come out of Git use LF and the
    editors used by Windows-based developers DTRT with those, but
    files newly created by these developers often get checked in with
    CRLF line endings.

    To me that means most Windows software must still be presuming
    CRLF even if it can handle LF-only properly.

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
  • Dagfinn Ilmari Mannsåker at Jul 1, 2011 at 10:19 pm

    Aristotle Pagaltzis writes:

    At $work we have had huge headaches trying to keep CRLFs out of
    our Git repositories. Sources that come out of Git use LF and the
    editors used by Windows-based developers DTRT with those, but
    files newly created by these developers often get checked in with
    CRLF line endings.
    Have a look at the core.autocrlf config setting (I think the value you
    want is "input").
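    Concretely (the setting name and semantics are from Git's own
    documentation; "input" converts CRLF to LF on commit and does nothing
    on checkout; the .gitattributes line is one illustrative alternative):

```shell
# Per-repository; run inside the working copy:
git config core.autocrlf input

# Or, enforced for every clone via a committed .gitattributes:
#   * text=auto
```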

    --
    ilmari
    "A disappointingly low fraction of the human race is,
    at any given time, on fire." - Stig Sandbeck Mathisen
  • Tom Christiansen at Jun 29, 2011 at 1:02 pm
    [Brian, skip to the bottom for one of Perl's Unicode deficiencies.]

    "Jan Dubois" <jand@activestate.com> wrote
    on Tue, 28 Jun 2011 23:24:01 PDT:
    On Tue, 28 Jun 2011, Sisyphus wrote:
    "Christian Walde" wrote:

    I'd like to still emphasize though that i meant: Neither on Linux
    nor on Unix does Perl do any newline mangling. Only on Windows. And
    (Not that Linux is different from Unix in any meaningful way here. Perhaps
    they meant MacOS 9?)
    i struggle to find any reason for why it still needs to be there.
    Me, too.
    I already gave the tl;dr explanation in my previous reply. Minimal sample:
    C:\>echo 123|perl -E "say <> =~ /^\d+$/ ? 'number' : 'not'"
    number
    C:\>echo 123|perl -E "binmode(STDIN); say <> =~ /^\d+$/ ? 'number' : 'not'"
    not
    I actually consider that a violation of tr18's RL1.6 on linebreak handling:

    http://unicode.org/reports/tr18/#Line_Boundaries

    RL1.6 Line Boundaries

    To meet this requirement, if an implementation provides for
    line-boundary testing, it shall recognize not only CRLF, LF,
    CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).

    Formfeed (U+000C) also normally indicates an end-of-line.


    For more information, see Chapter 3 of [Unicode].

    These characters should be uniformly handled in determining logical
    line numbers, start-of-line, end-of-line, and arbitrary-character
    implementations. Logical line number is useful for compiler error
    messages and the like. Regular expressions often allow for SOL and
    EOL patterns, which match certain boundaries. Often there is also a
    "non-line-separator" arbitrary character pattern that excludes line
    separator characters.

    The behavior of these characters may also differ depending on
    whether one is in a "multiline" mode or not. For more
    information, see Anchors and Other "Zero-Width Assertions" in
    Chapter 3 of [Friedl].

    A newline sequence is defined to be any of the following:

    \u000A | \u000B | \u000C | \u000D | \u0085 | \u2028 | \u2029 | \u000D\u000A

    [...]

    It is strongly recommended that there be a regular expression
    meta-character, such as "\R", for matching all line ending characters
    and sequences listed above (e.g. in #1). It would thus be shorthand
    for:

    ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )

    And indeed, Perl's \R works that way -- in a regex. So you can do this
    sort of thing: (all perl examples I give below are tested under v5.14)

    @lines = split /\R/, $buffer;
    @paras = split /\R+/, $buffer;

    And you *can* rewrite your

    say <> =~ /^\d+$/

    to read:

    say <> =~ /^\d+\R?\z/

    But the main point of RL1.6 is that you shouldn't have to do that.
    So I think this is a bug that needs fixing:

    % perl -E 'say "\r\n" =~ /^$/ ? "Yes" : "No"'
    No
    % perl -E 'say "\r\n" =~ /^\R\z/ ? "Yes" : "No"'
    Yes

    Also, and perhaps more germane to this thread, I don't know how to set $/
    to anything that will let me get a \R-terminated line back from readline(),
    nor which allows chomp() to then work appropriately on that return. I
    believe neither is currently possible.

    How to fix that? I/O layer? Extension of :crlf?

    RL1.6 also says:

    Note: For some implementations, there may be a performance impact in
    recognizing CRLF as a single entity, such as with an arbitrary
    pattern character ("."). To account for that, an implementation
    may also satisfy R1.6 if there is a mechanism available for
    converting the sequence CRLF to a single line boundary character
    before regex processing.

    I honestly don't know whether :crlf's so-called newline-munging,
    the topic of this thread, satisfies that condition.

    --tom

    PS: CRLF is also good place to point out the difference between legacy
    and extended grapheme clusters, since only the extended version
    counts that two-character sequence as a single grapheme cluster:

    Legacy Grapheme Cluster:
    % perl -E 'say "\r\n" =~ /^\p{Grapheme_Base}\p{Grapheme_Extend}*\z/ ? "Yes" : "No"'
    No
    Extended Grapheme Cluster:
    % perl -E 'say "\r\n" =~ /^\X\z/ ? "Yes" : "No"'
    Yes

    Which are both of them further distinguished from:

    Erroneous Grapheme Cluster:
    % perl -E 'say "\r\n" =~ /^\PM\pM*\z/ ? "Yes" : "No"'
    No

    But that's a somewhat longer story, which doesn't need recounting here.
  • Leon Timmermans at Jun 30, 2011 at 1:16 pm

    On Wed, Jun 29, 2011 at 2:58 PM, Tom Christiansen wrote:
    I actually consider that a violation of tr18's RL1.6 on linebreak handling:

    http://unicode.org/reports/tr18/#Line_Boundaries

    [...]

    Also, and perhaps more germane to this thread, I don't know how to set $/
    to anything that will let me get a \R-terminated line back from readline(),
    nor which allows chomp() to then work appropriately on that return.  I
    believe neither is currently possible.

    How to fix that?  I/O layer?  Extension of :crlf?
    A PerlIO layer doing normalization would be the most obvious approach,
    though it's not entirely clear how such a layer should behave on
    output.

    Leon
  • Tom Christiansen at Jun 30, 2011 at 2:17 pm

    Leon Timmermans wrote on Thu, 30 Jun 2011 15:16:39 +0200:

    A PerlIO layer doing normalization would be the most obvious
    approach, though it's not entirely clear how such a layer
    should behave on output.
    One of the purported blessings (I find it a curse) of the
    standard Java runtime library is that when you do what is
    essentially

    while read a line
    print a line

    You *do* get a line back no matter what sort of line-ending it had.

    The gotcha is that you cannot really distinguish what you
    started with. It autochomps but won't tell you what it did. So
    you can't tell whether you chomped 0, 1, or 2 characters, nor
    what, if any, those characters really were.

    http://download.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()

    That is, their readLine() method accepts any valid line-ending,
    but its corresponding println() method only emits lines with
    the "native" line terminator added to them. That must have
    sounded good to somebody once upon a time (and can even be argued
    to be following Postel's Law), but I find it a royal pain.
    It's for line-terminator normalization, not for clean copying.

    May we please not go there? :(

    I still want us to work with \R though. I'm sorry I expressed it as a
    pattern, because that has led to what sounds like a need to support
    arbitrary regex terminators. I certainly didn't mean that. You can
    certainly implement the *equivalent* of a \R terminator in a non-regex way
    without losing a great deal of efficiency.

    --tom
  • Ed Avis at Jun 30, 2011 at 4:25 pm
    Tom Christiansen <tchrist <at> perl.com> writes:
    One of the purported blessings (I find it a curse) of the
    standard Java runtime library
    It autochomps but won't tell you what it did.
    That is, their readLine() method accepts any valid line-ending,
    but its corresponding println() method only emits lines with
    the "native" line terminator added to them. That must have
    sounded good to somebody once upon a time (and can even be argued
    to be following Postel's Law), but I find it a royal pain.
    That's a vote against the possibility of having Win32 recognize both \r\n
    and \n on input but produce only \n on output, then. So the choice would come
    down to doing CRLF munging, as at present, or not.

    However, if you chomp() every line of your input file and then say() it, you
    are already 'normalizing' away the distinction between a file that ends in \n
    and one that doesn't.

    % echo -n a | wc -c
    1
    % echo -n a | perl -nE 'chomp; say $_' | wc -c
    2

    So perhaps it would be acceptable for Perl not to munge CRLF on input but for
    chomp to be made capable of chomping both \r\n and \n. If you chomp each line
    then you have already moved away from doing a byte-for-byte copy.

    --
    Ed Avis <eda@waniasset.com>
  • Ed Avis at Jun 29, 2011 at 1:50 pm
    Jan Dubois <jand <at> activestate.com> writes:
    C:\>echo 123|perl -E "say <> =~ /^\d+$/ ? 'number' : 'not'"
    number

    C:\>echo 123|perl -E "binmode(STDIN); say <> =~ /^\d+$/ ? 'number' : 'not'"
    not
    It's a pity that $ in regular expressions doesn't match \r\n\z as well as \n\z
    - on any ASCII-based platform, not just Windows. It might be a bit late to
    change it though!

    This is one reason why I never use $ but use \s*\z instead.

    --
    Ed Avis <eda@waniasset.com>
  • Tom Christiansen at Jun 29, 2011 at 2:07 pm
    [Brian: here's another Perl Unicode glitch or two for your bag. Whether
    it is a bug I don't know, but RL1.2a does disagree with our \s.]

    Ed Avis <eda@waniasset.com> wrote on Wed, 29 Jun 2011 13:46:57 -0000:
    Jan Dubois <jand <at> activestate.com> writes:
    C:\>echo 123|perl -E "say <> =~ /^\d+$/ ? 'number' : 'not'"
    number

    C:\>echo 123|perl -E "binmode(STDIN); say <> =~ /^\d+$/ ? 'number' : 'not'"
    not
    It's a pity that $ in regular expressions doesn't match \r\n\z as well
    as \n\z — on any ASCII-based platform, not just Windows. It might be
    a bit late to change it though!
    This is the RL1.6 bug. We have to fix it, somehow, although I don't
    know how best to do that given all this newline-mangling discussion.
    Different approaches are readily conceivable here, plus the spec is
    somewhat fuzzy.
    This is one reason why I never use $ but use \s*\z instead.
    But that's not the same. At all. This is what \R was *made* for:
    just use /\R?\z/ for /$/ and you will be fine (excepting the 5-for-1
    Huffman failure, I suppose).

    Also, I'm a bit leery of \s these days. Remember that \s is not the same
    as [\h\v] — which if you didn't know about already, you almost certainly
    don't want to. ;-{

    --tom
  • Ed Avis at Jun 29, 2011 at 2:20 pm

    Tom Christiansen <tchrist <at> perl.com> writes:

    It's a pity that $ in regular expressions doesn't match \r\n\z as well
    as \n\z — on any ASCII-based platform, not just Windows. It might be
    a bit late to change it though!
    This is the RL1.6 bug. We have to fix it, somehow, although I don't
    know how best to do that given all this newline-mangling discussion.
    If the definition of $ can still be fixed, that's good news.
    I had assumed too much code was dependent on its current semantics.
    This is one reason why I never use $ but use \s*\z instead.
    But that's not the same. At all.
    It isn't the same. I use it instead so I don't have to care about
    trailing whitespace. Alternatively, I will s/\s*\z// on each line of
    input first, and then just use \z to match end of line.
    Also, I'm a bit leery of \s these days. Remember that \s is not the same
    as [\h\v] — which if you didn't know about already, you almost certainly
    don't want to. ;-{
    I'm in the fortunate position of working with ASCII only, for the most part,
    but when I next need to write internationalized or general-purpose code I'll
    remember your warning.

    --
    Ed Avis <eda@waniasset.com>
  • Tom Christiansen at Jun 29, 2011 at 2:28 pm
    Ed Avis wrote
    on Wed, 29 Jun 2011 14:19:47 -0000:
    I'm in the fortunate position of working with ASCII only,
    That's a rather mixed fortune.
    for the most part, but when I next need to write internationalized or
    general-purpose code I'll remember your warning.
    I don't want to go down the locale rathole here. I just want RL1.6
    to work.

    --tom
  • Jan Dubois at Jun 29, 2011 at 6:45 pm

    On Wed, 29 Jun 2011, Ed Avis wrote:
    Tom Christiansen <tchrist <at> perl.com> writes:
    It's a pity that $ in regular expressions doesn't match \r\n\z as well
    as \n\z — on any ASCII-based platform, not just Windows. It might be
    a bit late to change it though!
    This is the RL1.6 bug. We have to fix it, somehow, although I don't
    know how best to do that given all this newline-mangling discussion.
    If the definition of $ can still be fixed, that's good news.
    I had assumed too much code was dependent on its current semantics.
    Yes, fixing $ at the end of regexen to match any Unicode line ending
    would be a good thing. It still doesn't solve the text mode issue for
    file handles though; there are plenty of other ways the trailing "\r"
    can break your code.

    For another example, what do you want to do about chomp? Change $/ to
    "\r\n" by default on Windows? Then it won't strip a "naked" "\n" from
    the string if the file happens to be in POSIX text format. Make $/ a
    regular expression? I believe that has been rejected before due to
    performance concerns. And anyways, you'll end up in a much messier place
    than the one we are already in right now.

    Normalizing line-endings at the I/O boundary seems like the least insane
    way to deal with it.

    Cheers,
    -Jan
  • Nicholas Clark at Jun 29, 2011 at 6:52 pm

    On Wed, Jun 29, 2011 at 11:45:29AM -0700, Jan Dubois wrote:

    the string if the file happens to be in POSIX text format. Make $/ a
    regular expression? I believe that has been rejected before due to
    performance concerns. And anyways, you'll end up in a much messier place
    I don't think it's *just* performance concerns. To work properly, $/ as a
    regular expression would require a regular expression engine that knows that
    the end of the string [1] isn't the end of the string [2]

    1: that it was presented with right now
    2: that exists in total

    ie, you'd have to completely re-write the regular expression engine (and the
    optimiser) to remove assumptions about "can't match, because we've exhausted
    the input", and replace those with "OK, grab some more input and try again".

    Note, that's not "can't be done". I'm not saying it's impossible. It's
    perfectly possible. Just rewrite the regular expression engine.

    Nicholas Clark
  • Ed Avis at Jun 30, 2011 at 10:30 am
    Nicholas Clark <nick <at> ccl4.org> writes:
    To work properly, $/ as a
    regular expression would require a regular expression engine that knows that
    the end of the string [1] isn't the end of the string [2]

    1: that it was presented with right now
    2: that exists in total
    I suppose an alternative is to allow only very restricted regular expressions in
    $/, containing literal character matches and ? only, say. Then they can be
    handled by a simple-minded bit of code and not by the full regexp engine. It
    would be forward-compatible with allowing unrestricted regular expressions in
    some future version.

    --
    Ed Avis <eda@waniasset.com>
  • Aaron Crane at Jun 30, 2011 at 12:12 pm

    Nicholas Clark wrote:
    To work properly, $/ as a
    regular expression would require a regular expression engine that knows that
    the end of the string [1] isn't the end of the string [2]

    1: that it was presented with right now
    2: that exists in total

    ie, you'd have to completely re-write the regular expression engine (and the
    optimiser) to remove assumptions about "can't match, because we've exhausted
    the input", and replace those with "OK, grab some more input and try again".
    In the general case, there are regular expressions which can't be
    correctly executed without having the whole of the target string
    available. For example, consider `$/ = qr/\n(?!.*cowbell)/s`. That
    would treat every newline character as a line break, unless the input
    contains the string "cowbell", in which case only those newlines after
    the last occurrence of that string are treated as breaks. (Not a
    particularly useful setting for $/, I suspect, but it serves as an
    illustration.) You then can't know whether a newline constitutes a
    line break without scanning to the end of the entire input for more
    cowbell.

    So maybe regex-$/, if it's desired, could instead be implemented by
    internally slurping the input and using split() (or its moral
    equivalent) to break it into lines. That sounds rather easier than
    rewriting the regex engine to know when to grab more input and try
    again.

    The biggest downside I can see is that currently people expect the <>
    operator not to need to read the entire input before yielding a line
    (except in slurp mode), so they assume they can use it on
    multi-gigabyte files. Perhaps breaking that assumption would be
    unreasonable.

    --
    Aaron Crane ** http://aaroncrane.co.uk/
  • Ed Avis at Jun 30, 2011 at 12:20 pm

    Aaron Crane <perl <at> aaroncrane.co.uk> writes:

    $/ as a regular expression
    In the general case, there are regular expressions which can't be
    correctly executed without having the whole of the target string
    available.
    Am I right in thinking that these are not strictly speaking 'regular'?
    If you disallow the extended features such as lookahead and lookbehind,
    restricting to true regular expressions, then you can do matching without
    reading the whole string every time. However, reading the whole string
    may still be necessary, for example to match /a*b/ against 'aaaaaaaab'.
    A stricter policy might guarantee a maximum length of match (by banning the
    + and * qualifiers) and so guarantee only a limited amount of extra slurping.

    --
    Ed Avis <eda@waniasset.com>
  • Aaron Crane at Jun 30, 2011 at 1:44 pm

    Ed Avis wrote:
    Aaron Crane <perl <at> aaroncrane.co.uk> writes:
    In the general case, there are regular expressions which can't be
    correctly executed without having the whole of the target string
    available.
    Am I right in thinking that these are not strictly speaking 'regular'?
    I don't think so. For example, the one I suggested,
    `qr/\n(?!.*cowbell)/s`, can be rewritten to a form which could be
    matched by a DFA. I can't quite face doing it manually for such a
    long string, but shorter examples look like this, where the right-hand
    pattern is a trivially-regular rewriting of the left-hand pattern:

    qr/a(?!.*bc)/s     qr/a [^b]* (?:b+ [^bc]* )*\z/x
    qr/a(?!.*bcd)/s    qr/a [^b]* (?:b+ [^bcd]* (?:c+ [^cd]* )* )*\z/x

    More generally, regular languages are closed under intersection,
    union, and complement; and since you can treat "X when not followed by
    Y" as the intersection of "X" with the complement of "X followed by
    Y", you can always play that sort of trick. (As long as you don't
    care about the values of @- and @+ afterwards, that is; but regular
    expressions in the mathematical sense only give you a single-bit
    yes/no answer, not information about the location and size of the
    match. This is one of the many reasons that we use non-regular
    regexes for programming, of course.)
    If you disallow the extended features such as lookahead and lookbehind,
    restricting to true regular expressions, then you can do matching without
    reading the whole string every time.  However, reading the whole string
    may still be necessary, for example to match /a*b/ against 'aaaaaaaab'.
    A stricter policy might guarantee a maximum length of match (by banning the
    + and * qualifiers) and so guarantee only a limited amount of extra slurping.
    Yeah, that might be possible without changing the regex engine
    significantly. You'd still need additional smarts: calculate the
    regex's maximum match length M, and don't consider a buffer position a
    potential line-break point unless you have at least M characters after
    it.

    The question is whether that would be good enough to handle the use
    cases for regex-$/. If not, the slurp/split approach is probably a
    better option.

    --
    Aaron Crane ** http://aaroncrane.co.uk/
  • The Sidhekin at Jun 30, 2011 at 3:46 pm

    On Thu, Jun 30, 2011 at 3:43 PM, Aaron Crane wrote:
    qr/a(?!.*bc)/s qr/a [^b]* (?:b+ [^bc]* )*\z/x
    qr/a(?!.*bcd)/s qr/a [^b]* (?:b+ [^bcd]* (?:c+ [^cd]* )* )*\z/x

    That's not quite right, is it? The left-hand-side expressions only ever
    match a single character of the string (and could match any number of
    times), whereas the right-hand-side expressions match up to and including
    the end of the string (and so could never match more than once).

    Then again, for all I know of regularity, the position and length of the
    match is not well defined ...

    For the purposes at hand though, we would want the record to include
    (terminate with) the (first, if any, remaining) part of the string (stream?)
    matching $/. If your $/ marker includes \z, slurping the entire file could
    be no bug.


    Eirik
  • Tom Christiansen at Jun 30, 2011 at 4:14 pm

    The Sidhekin wrote on Thu, 30 Jun 2011 17:46:12 +0200:

    For the purposes at hand though, we would want the record to include
    (terminate with) the (first, if any, remaining) part of the string (stream?)
    matching $/. If your $/ marker includes \z, slurping the entire file could
    be no bug.
    I'm sorry to have sent everyone haring off trying to solve the wrong problem.
    I'm not looking for $/ to be a regex. I'm just looking for a way to get
    readline/chomp into Unicode-newline mode, is all. That means stop at the
    first \R you see, but *that* doesn't require a regex.

    --tom
  • Abigail at Jul 29, 2011 at 3:52 pm

    On Thu, Jun 30, 2011 at 10:14:17AM -0600, Tom Christiansen wrote:
    The Sidhekin <sidhekin@gmail.com> wrote on Thu, 30 Jun 2011 17:46:12 +0200:
    For the purposes at hand though, we would want the record to include
    (terminate with) the (first, if any, remaining) part of the string (stream?)
    matching $/. If your $/ marker includes \z, slurping the entire file could
    be no bug.
    I'm sorry to have sent everyone haring off trying to solve the wrong problem.
    I'm not looking for $/ to be a regex. I'm just looking for a way to get
    readline/chomp into Unicode-newline mode, is all. That means stop at the
    first \R you see, but *that* doesn't require a regex.

    Yet, it can still have practical problems.

    Say your program is reading from a pipe, using a simple loop:

    while (<$pipe>) {
    ... do something as soon as a line comes in ...
    }

    Now it receives "important message\r".

    What should it do? Run the body of the loop, or block and wait for the
    next character just to see it is (or isn't) a newline?



    Abigail
  • Peter Martini at Jul 30, 2011 at 4:01 pm

    On Fri, Jul 29, 2011 at 11:52 AM, Abigail wrote:
    [...]

    Yet, it can still have practical problems.

    Say your program is reading from a pipe, using a simple loop:

    while (<$pipe>) {
    ... do something as soon as a line comes in ...
    }

    Now it receives "important message\r".

    What should it do? Run the body of the loop, or block and wait for the
    next character just to see it is (or isn't) a newline?



    Abigail
    A naive implementation would be to run the body of the loop as soon as
    the \r is detected, including the \n in the match if it's detected;
    after all, it already is a complete match. If chomp is \R-aware and
    it's being used inside the body, it won't matter whether the capture
    found half of a \r\n or a whole \r. The only change over that naive
    implementation necessary to make it DWIM (speaking for myself) is to
    make sure that if \R matches just \r, and the next line starts with
    \n, the match silently skips over it. If blocking behavior was
    desired, then $/ should have been set to \r\n in the first place.

    Or am I overlooking something obvious?

    Peter
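    [Editor's note: the scheme Peter describes can be sketched as a small
    chunk feeder. This is just an illustration of the idea, not anyone's
    actual patch; the `feed` helper and its state variables are made up.]

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Split incoming chunks into lines on \r\n, \r, or \n, emitting a line
    # as soon as a bare \r ends the current buffer, and silently skipping a
    # \n at the start of the next chunk that completes a split \r\n pair.
    my $buf     = '';
    my $skip_lf = 0;

    sub feed {
        my ($chunk) = @_;
        my @lines;
        $chunk =~ s/\A\n// if $skip_lf;   # second half of a split \r\n
        $skip_lf = 0;
        $buf .= $chunk;
        # Consume complete terminators; a \r at the very end of the buffer
        # is deliberately excluded so it can be handled below.
        while ($buf =~ s/\A(.*?)(?:\r\n|\r(?!\z)|\n)//s) {
            push @lines, $1;
        }
        if ($buf =~ s/\r\z//) {           # bare \r at buffer end:
            push @lines, $buf;            # run the "loop body" now,
            $buf     = '';
            $skip_lf = 1;                 # and drop a \n that may follow
        }
        return @lines;
    }

    my @got = (feed("important message\r"), feed("\nsecond line\n"));
    # @got is ("important message", "second line")
    ```

    The key point is that "important message" comes out as soon as the
    lone \r arrives, without blocking for the next character.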
  • Abigail at Jul 30, 2011 at 4:12 pm

    On Sat, Jul 30, 2011 at 12:01:09PM -0400, Peter Martini wrote:
    On Fri, Jul 29, 2011 at 11:52 AM, Abigail wrote:
    On Thu, Jun 30, 2011 at 10:14:17AM -0600, Tom Christiansen wrote:
    The Sidhekin <sidhekin@gmail.com> wrote on Thu, 30 Jun 2011 17:46:12 +0200:
    For the purposes at hand though, we would want the record to include
    (terminate with) the (first, if any, remaining) part of the string (stream?)
    matching $/.   If your $/ marker includes \z, slurping the entire file could
    be no bug.
    I'm sorry to have sent everyone haring off trying to solve the wrong problem.
    I'm not looking for $/ to be a regex.  I'm just looking for a way to get
    readline/chomp into Unicode-newline mode, is all.  That means stop at the
    first \R you see, but *that* doesn't require a regex.

    Yet, it can still have practical problems.

    Say your program is reading from a pipe, using a simple loop:

    while (<$pipe>) {
        ... do something as soon as a line comes in ...
    }

    Now it receives "important message\r".

    What should it do? Run the body of the loop, or block and wait for the
    next character just to see whether it is (or isn't) a newline?



    Abigail
    A naive implementation would be to run the body of the loop as soon as
    the \r is detected, including the \n in the match if it's detected;
    after all, it already is a complete match. If chomp is \R-aware and
    it's being used inside the body, it won't matter whether the capture
    found half of a \r\n or a whole \r. The only change over that naive
    implementation necessary to make it DWIM (speaking for myself) is to
    make sure that if \R matches just \r, and the next line starts with
    \n, the match silently skips over it. If blocking behavior was
    desired, then $/ should have been set to \r\n in the first place.

    Or am I overlooking something obvious?

    Well, yes. You've read some text, and read a \r. *Now* you have to make
    a decision. Run the body, or wait for the next character to arrive.
    You do not know when the next character arrives. That is, the "including
    the \n in the match if it's detected" may take a long time.

    You may not assume that every handle can be seeked forward whenever
    you want.


    Abigail
  • Peter Martini at Jul 30, 2011 at 4:26 pm

    On Sat, Jul 30, 2011 at 12:12 PM, Abigail wrote:
    On Sat, Jul 30, 2011 at 12:01:09PM -0400, Peter Martini wrote:
    On Fri, Jul 29, 2011 at 11:52 AM, Abigail wrote:
    On Thu, Jun 30, 2011 at 10:14:17AM -0600, Tom Christiansen wrote:
    The Sidhekin <sidhekin@gmail.com> wrote on Thu, 30 Jun 2011 17:46:12 +0200:
    For the purposes at hand though, we would want the record to include
    (terminate with) the (first, if any, remaining) part of the string (stream?)
    matching $/.   If your $/ marker includes \z, slurping the entire file could
    be no bug.
    I'm sorry to have sent everyone haring off trying to solve the wrong problem.
    I'm not looking for $/ to be a regex.  I'm just looking for a way to get
    readline/chomp into Unicode-newline mode, is all.  That means stop at the
    first \R you see, but *that* doesn't require a regex.

    Yet, it can still have practical problems.

    Say your program is reading from a pipe, using a simple loop:

    while (<$pipe>) {
        ... do something as soon as a line comes in ...
    }

    Now it receives "important message\r".

    What should it do? Run the body of the loop, or block and wait for the
    next character just to see whether it is (or isn't) a newline?



    Abigail
    A naive implementation would be to run the body of the loop as soon as
    the \r is detected, including the \n in the match if it's detected;
    after all, it already is a complete match.  If chomp is \R-aware and
    it's being used inside the body, it won't matter whether the capture
    found half of a \r\n or a whole \r.  The only change over that naive
    implementation necessary to make it DWIM (speaking for myself) is to
    make sure that if \R matches just \r, and the next line starts with
    \n, the match silently skips over it.  If blocking behavior was
    desired, then $/ should have been set to \r\n in the first place.

    Or am I overlooking something obvious?

    Well, yes. You've read some text, and read a \r. *Now* you have to make
    a decision. Run the body, or wait for the next character to arrive.
    You do not know when the next character arrives. That is, the "including
    the \n in the match if it's detected" may take a long time.

    You may not assume that every handle can be seeked forward whenever
    you want.


    Abigail
    Ah, sorry, what I meant is if it's detected *in the current buffer*.
    If it's a \r at the end of the current buffer, run the loop anyway,
    and flag that if the first character of the next buffer is \n, ignore
    it.

    Peter
  • David Nicol at Aug 2, 2011 at 4:28 pm

    On Sat, Jul 30, 2011 at 11:26 AM, Peter Martini wrote:
    On Sat, Jul 30, 2011 at 12:12 PM, Abigail wrote:
    You may not assume that every handle can be seeked forward whenever
    you want.


    Abigail
    Ah, sorry, what I meant is if it's detected *in the current buffer*.
    If it's a \r at the end of the current buffer, run the loop anyway,
    and flag that if the first character of the next buffer is \n, ignore
    it.

    Peter
    I understood it the first time; the decision-maker needs some
    visibility into the buffer, and a way to gracefully handle the situation
    where a multichar newline actually gets split. I think introducing a
    "partway through a multichar EOL" state makes sense, for the common
    CRLF situation.

    It shouldn't default to on, though, since the logical end-of-record is
    often set to a token, such as a repeated XML opener.

    I've done things like

    $/ = '</record>';
    while (<>) {
        # we now have a record element in $_
        ...
    }

    regularly; having that match because a packet split '</rhinocerous>'
    in just the right place would be bad.
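    [Editor's note: the token-as-record-separator idiom David uses can be
    exercised with an in-memory filehandle. A minimal sketch; the sample
    data and variable names are invented for illustration.]

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # With a multi-character $/, each readline returns everything up to
    # and including the next '</record>' token.
    my $xml = "<record>1</record><record>2</record>";
    open my $fh, '<', \$xml or die $!;   # in-memory filehandle
    local $/ = '</record>';
    my @records = <$fh>;
    close $fh;
    # @records is ("<record>1</record>", "<record>2</record>")
    ```

    Note that $/ here is a literal string match, not a regex, which is
    exactly why buffer-boundary handling has to be byte-exact.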
  • Eric Brine at Aug 2, 2011 at 6:04 pm

    On Sat, Jul 30, 2011 at 12:12 PM, Abigail wrote:

    Well, yes. You've read some text, and read a \r. *Now* you have to make
    a decision. Run the body, or wait for the next character to arrive.
    You do not know when the next character arrives. That is, the "including
    the \n in the match if its detected" may take a long time.
    There's another option: On receipt of \r, return \n. Drop any \n that
    immediately follows, even if it's only encountered on the next read.
  • Ed Avis at Jun 30, 2011 at 10:22 am
    Jan Dubois <jand <at> activestate.com> writes:
    what do you want to do about chomp? Change $/ to
    "\r\n" by default on Windows?
    If it were up to me, chomp would remove \r\n or \n, whichever is
    there, on both Unix and Windows. In my own code, running mostly on
    Linux, I do not use chomp because it doesn't handle \r\n-format text
    files, which although less common than plain \n-format, still turn up
    regularly.
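    [Editor's note: the workaround Ed alludes to is usually a substitution
    that strips either ending explicitly instead of calling chomp. A small
    sketch; the helper name is made up.]

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Strip either "\r\n" or "\n" from the end of a line, so the same code
    # handles both Unix- and DOS-format text files.
    sub portable_chomp {
        my ($line) = @_;
        $line =~ s/\r?\n\z//;
        return $line;
    }

    # Both endings reduce to the same text:
    my $unix = portable_chomp("hello\n");     # "hello"
    my $dos  = portable_chomp("hello\r\n");   # "hello"
    ```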
    Make $/ a regular expression? I believe that has been rejected before
    due to performance concerns.
    I see. Perhaps there would be room to make $/ a regular expression
    but have optimized handling for simple cases such as /\n/ or /\r?\n/
    so they don't need to drag in the whole regexp engine.
    Normalizing line-endings at the I/O boundary seems like the least insane
    way to deal with it.
    The trouble is that the current Windows behaviour is not really
    normalizing them but crufting them up - adding extra \r characters on
    output in order to comply with an ancient CP/M convention that really
    doesn't matter any more. If it normalized newlines by turning \r\n =>
    \n on input, and left \n alone on output, I could live with that.
    (For the small number of cases where you really do want \r\n line
    endings because you use DOS-style tools that can't cope, you can add
    the \r explicitly. You have to do that anyway, otherwise your program
    will produce the wrong result if run on Unix.)

    --
    Ed Avis <eda@waniasset.com>
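    [Editor's note: the input-only normalization Ed proposes is close to
    what pushing the :crlf layer on a read handle already does: \r\n
    becomes \n on input, while output written without the layer keeps
    plain \n. A sketch using an in-memory handle, so it behaves the same
    on any platform.]

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read DOS-format text through the :crlf layer: the \r\n pairs are
    # translated to \n on the way in, and nothing is added on output
    # unless you push the layer on the output handle too.
    my $dos_text = "line one\r\nline two\r\n";
    open my $in, '<:crlf', \$dos_text or die $!;
    my @lines = <$in>;
    close $in;
    # @lines is ("line one\n", "line two\n") -- the \r is gone on input
    ```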
  • Jan Dubois at Jun 29, 2011 at 5:44 am

    On Tue, 28 Jun 2011, Christian Walde wrote:
    On Tue, 28 Jun 2011 22:39:55 +0200, Leon Timmermans wrote:
    Because it's the correct thing to do on DOSish platforms?
    As we discussed on IRC, this means: Because it's tradition. And you
    agreed there, tradition is not a valid reason to keep bad behavior. :)
    It is not just "a tradition"; "\x0D\x0A" is the official line
    termination sequence for text files on Windows. Supporting both text and
    binary data streams is actually mandated by the C standard; it is just a
    property of POSIX systems that they behave identically. I can't find a
    free spec on the net right now, but you can read it paraphrased here:

    https://secure.wikimedia.org/wikipedia/en/wiki/Newline#In_programming_languages

    Just wishing that every computer is a VAX, or every OS is a version of
    Linux, or at least somewhat based on POSIX doesn't make it true.

    Maybe switching all file handles to binary mode by default will make a
    couple more CPAN modules pass their test suites on Windows (by hiding
    that they don't implement proper text mode semantics). But it really
    would be somewhat similar to enforcing that Perl can only be compiled
    when sizeof(int) == sizeof(char*) because we want to be able to store
    pointers in int variables. All computers are PDP 11s after all, or
    should at least look like one. :)

    But how does all of this help the poor Windows user who wants to
    use Perl to process, you know, regular text files on Windows that have
    *not* been written on a POSIX system, or by Perl 5.16? Most text files on
    Windows will continue to have CR+LF line endings. So if we read them in
    binary mode, they will still match /\r\n\z/. And neither chop, nor
    chomp, nor s/\n// will get rid of that pesky '\r'. How is this
    intuitive for a text processing tool? And how many existing scripts is
    that going to break?

    So running automated CPAN testing is going to mislead you about the
    impact this change is going to have.

    As you can probably guess by now, I don't think pretending that text
    mode is the same as binary mode is the right thing to do.

    Cheers,
    -Jan
