Hello all,
I am beating my head against the wall, any help would be appreciated.

I have a file:
/ / / / m / cvfbcbf/ A123/ / / /// ////
/ / / / m / cvfbcbf/ A234/ / / /// ////
/ / / / m / cvfbcbf/ B123/ / / /// ////

There are spaces at the beginning and the end of each line, and the lines are all very similar. I'm trying to count how many unique A#'s and B#'s there are, as well as the total number of A#'s and B#'s.

The problem for me is the line endings I think. When I open the file and read in one line, I get the whole file. I think the line endings are ^p (MS paragraph markers), but I can't open the file to view them. The files are huge, 150M or bigger. MS Word chokes on them.

Each line does end with 30 spaces.

Is there a way for me to search the entire 150M single line and get the metrics I'm looking for, or is it possible to open the file, search for the 30 spaces and replace with \n?

Thanks again,
Eric


  • Jim Gibson at Aug 17, 2011 at 10:25 pm
    On 8/17/11 Wed Aug 17, 2011 2:59 PM, "ERIC KRAUSE" <erickrause@bft1.org>
    scribbled:
    > Hello all,
    > I am beating my head against the wall, any help would be appreciated.
    >
    > I have a file:
    > / / / / m / cvfbcbf/ A123/ / / /// ////
    > / / / / m / cvfbcbf/ A234/ / / /// ////
    > / / / / m / cvfbcbf/ B123/ / / /// ////
    >
    > There are spaces at the beginning and the end of each line and each line is
    > very similar. I'm trying to count how many unique A#'s and B#'s as well as
    > total A#'s and B#'s.
    A hash would be suitable for that task.
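
A minimal sketch of the hash approach, assuming an ID is a single A or B followed by digits, as in the sample lines. The name count_ids and the pattern are illustrative and not from the thread. Because it scans the whole string with a global match, it does not depend on the line endings at all:

```perl
use strict;
use warnings;

# Count total and unique IDs such as A123 or B123 anywhere in a string.
# Assumption: an ID is one letter A or B followed by digits, as a whole word.
sub count_ids {
    my ($text) = @_;
    my (%seen, %total);
    while ($text =~ /\b([AB])(\d+)\b/g) {
        my ($prefix, $id) = ($1, "$1$2");
        $total{$prefix}++;        # every occurrence counts toward the total
        $seen{$prefix}{$id}++;    # one hash key per distinct ID
    }
    my %unique = map { $_ => scalar keys %{ $seen{$_} } } keys %seen;
    return (\%total, \%unique);
}

# Made-up sample built from the lines in the question, joined by 30 spaces.
my $sample = join '',
    map { "  / / / / m / cvfbcbf/ $_/ / / /// ////" . ' ' x 30 }
    qw(A123 A234 B123 A123);

my ($total, $unique) = count_ids($sample);
for my $prefix (sort keys %$total) {
    print "$prefix: total=$total->{$prefix} unique=$unique->{$prefix}\n";
}
```

The nested hash %seen keeps one key per distinct ID, so the unique count is just the number of keys under each prefix.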
    > The problem for me is the line endings I think. When I open the file and read
    > in one line, I get the whole file. I think the line endings are ^p (MS
    > paragraph markers), but I can't open the file to view them. The files are
    > huge, 150M or bigger. MS Word chokes on them.
    Try Wordpad or Notepad to open the file. It sounds like the file is not a
    regular text file with normal Windows (or Unix) line endings such as "\r\n",
    "\n", "\r", etc. Where did the file come from?
    > Each line does end with 30 spaces.
    >
    > Is there a way for me to search the entire 150M single line and get the
    > metrics I'm looking for, or is it possible to open the file, search for the 30
    > spaces and replace with \n?
    Yes:

    $file_contents =~ s/\s{30,}/\n/g;

    which will substitute any consecutive substring of 30 or more whitespace
    characters with a newline character.

    You can also split the file on the 30 spaces:

    my @lines = split(/\s{30,}/,$file_contents);

    If you can figure out how the paragraph markers are stored in the file, you
    can split on those, instead. The above statement will likely leave those
    markers at the beginning of each line, except possibly the first.
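
A small sketch of the split approach that also trims whatever is left at the edges of each record. It assumes the leftover marker bytes are whitespace or control characters. The sample uses \x01 as a purely hypothetical marker, since the thread never establishes what the real one is:

```perl
use strict;
use warnings;

# Split a blob into records on runs of 30 or more whitespace characters,
# then strip leftover bytes from both ends of each record.
# Assumption: leftovers are whitespace or control characters.
sub split_records {
    my ($blob) = @_;
    my @records;
    for my $rec (split /\s{30,}/, $blob) {
        $rec =~ s/^[\s[:cntrl:]]+//;    # leading whitespace / control chars
        $rec =~ s/[\s[:cntrl:]]+$//;    # trailing ones
        push @records, $rec if length $rec;
    }
    return @records;
}

my $pad  = ' ' x 30;
# \x01 stands in for an unknown separator byte (hypothetical).
my $blob = "  / m / A123/ ////$pad\x01  / m / B123/ ////$pad";
print "[$_]\n" for split_records($blob);
```

If the real marker turns out to be something printable, the two trimming substitutions would need to be adjusted to match it.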

    You can use substr to print parts of the file:

    print substr($file_contents,0,80), "\n";

    to see what you really have.
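
Printing with substr will not show invisible characters very well, so here is a sketch that makes them visible. The helper name show_bytes and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Return a copy of the string with every byte outside printable ASCII
# rewritten as \xNN, so separators such as CR or form feed become visible.
sub show_bytes {
    my ($str) = @_;
    (my $out = $str) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $out;
}

# Made-up sample: a record, some padding, then two control characters.
my $file_contents = "A123/ ////" . (' ' x 3) . "\r\x0C/ / B123";
print show_bytes(substr($file_contents, 0, 80)), "\n";
```

Running this on the first few hundred characters of the real data should reveal which bytes, if any, sit between the 30 spaces and the start of the next record.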
  • Brandon McCaig at Aug 17, 2011 at 11:22 pm

    On Wed, Aug 17, 2011 at 5:59 PM, ERIC KRAUSE wrote:
    > The problem for me is the line endings I think. When I open the
    > file and read in one line, I get the whole file. I think the
    > line endings are ^p (MS paragraph markers), but I can't open
    > the file to view them. The files are huge, 150M or bigger. MS
    > Word chokes on them. *snip*
    > Is there a way for me to search the entire 150M single line and
    > get the metrics I'm looking for, or is it possible to open the
    > file, search for the 30 spaces and replace with \n?
    150M single line? Do you mean a single line is 150 megabytes or
    did you mean something else?

    Assuming sensible line lengths you could start by opening the
    file as a binary file and reading a specific amount of data (a
    reasonable length, like a few kilobytes or megabytes). Write
    that to a new file and examine it, either with a text editor or
    hex editor (or whatever application of your choosing). Once you
    know the line/record separator character(s) you should be able to
    easily process the file line by line or record by record.
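
The two steps above could be sketched roughly as follows. The names copy_head and each_record are hypothetical, and the 30-space separator is only the guess from the original question. Setting $/ reads one record at a time, so the whole 150M file never has to be held in memory:

```perl
use strict;
use warnings;

# Copy the first $bytes bytes of $in to $out with no newline translation.
sub copy_head {
    my ($in, $out, $bytes) = @_;
    open my $src, '<:raw', $in  or die "Cannot open $in: $!";
    open my $dst, '>:raw', $out or die "Cannot open $out: $!";
    my $got = read($src, my $buf, $bytes);
    die "Read failed on $in: $!" unless defined $got;
    print {$dst} $buf;
    close $dst or die "Cannot close $out: $!";
    close $src;
    return $got;    # number of bytes actually copied
}

# Call $callback once per record, using $sep as the record separator.
sub each_record {
    my ($path, $sep, $callback) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = $sep;                # input record separator for this scope
    while (my $rec = <$fh>) {
        chomp $rec;                 # removes a trailing $sep, if present
        $callback->($rec);
    }
    close $fh;
    return;
}

# Usage: perl script.pl big_input sample_output
# A few kilobytes is usually enough to spot the separator.
if (@ARGV == 2) {
    my $n = copy_head($ARGV[0], $ARGV[1], 4096);
    print "Copied $n bytes\n";
}
```

Once the sample file shows what the separator really is, pass it to each_record in place of the 30 spaces.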


    --
    Brandon McCaig <http://www.bamccaig.com/> <bamccaig@gmail.com>
    V zrna gur orfg jvgu jung V fnl. Vg qbrfa'g nyjnlf fbhaq gung jnl.
    Castopulence Software <http://www.castopulence.org/> <bamccaig@castopulence.org>
  • ERIC KRAUSE at Aug 18, 2011 at 9:59 pm
    Brandon and Jim,
    Thank you for the replies. They were very helpful. I have gotten past my blockage.

    Eric

Discussion Overview
group: beginners @
categories: perl
posted: Aug 17, '11 at 9:59p
active: Aug 18, '11 at 9:59p
posts: 4
users: 3
website: perl.org
