  • Michael S. Robeson II at May 17, 2004 at 11:48 am
    Hi all,

    I am having trouble with combining data from several files, and I can't
    even figure out how to get started. So, I am NOT asking for any code
    (though pseudo-code is ok) as I would like to try figuring this problem
    out myself. So, if anyone can give me any references or hints that
    would be great.

    So, here is what I am trying to do:

    I have say 2 files (I'd like to do this to as many files as the user
    needs):

    ***FILE 1***
    cat
    atacta--gat--acgt-
    ac-ac-ggttta-ca--
    dog
    atgcgtatgc-atcgat-ac--ac-a-ac-a-cac
    mouse
    acagctagc-atgca--
    ----acgtatgctacg--atg-
    ***end file 1***


    ***FILE 2***
    mouse
    aatctgatcgc-atgca--
    ----acgtaaggctagg-
    cat
    atacta--gat--acgt-
    ac-acacagcta--ca--
    dog
    atgcgtatgc-atcgat
    -ac--ac-a-ac-a-cac
    ***end file 2***

    Basically, I would like to concatenate the sequence of each
    corresponding animal so that the various input files would be out put
    to a file like so:

    ***output***
    cat
    atacta--gat--acgt-ac-ac-ggttta-ca--atacta--gat--acgt-ac-acacagcta--ca--
    dog
    atgcgtatgc-atcgat-ac--ac-a-ac-a-cacatgcgtatgc-atcgat-ac--ac-a-ac-a-cac
    mouse
    acagctagc-atgca------acgtatgctacg--atg-aatctgatcgc-atgca------
    acgtaaggctagg-
    ***output end***

    Notice that in the two files the data are not in the same order. So, I
    am trying to figure out how to have the script figure out what the
    first organism is in FILE 1( say "cat" in this case) and find the
    corresponding "cat" in the other input files. Then take the sequence
    data (all the cat data) from FILE 2 and concatenate it to the cat
    sequence data in FILE 1 to an output file. Then it should go on to the
    next organism in FILE 1 and search for that next organism in the other
    files (in this case FILE 2). I do not care about the order of the data,
    only that the "like" data is concatenated together.

    Again, I do NOT want this solved for me (unless I am totally lost).
    Otherwise, I'll never learn. I would just like either hints /
    suggestions / pseudo code / even links to books or sites that discuss
    this particular topic. Meanwhile, I am eagerly awaiting my "PERL
    Cookbook" and I'll keep searching the web.

    -Thanks!
    -Mike
  • Ricardo SIGNES at May 17, 2004 at 11:54 am
    * "Michael S. Robeson II" [2004-05-17T07:47:57]
    I am having trouble with combining data from several files, and I can't
    even figure out how to get started. So, I am NOT asking for any code
    (though pseudo-code is ok) as I would like to try figuring this problem
    out myself. So, if anyone can give me any references or hints that
    would be great.
    One way to solve this problem is to create a hash, in which the keys are
    the animal names and the values are the sequences, possibly in an
    arrayref, possibly just cat'd together.

    So, something like:

    for each file
      open the file
      for every new animal found
        add all the non-blank lines to $sequences{animal}
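
    For readers following along, here is a minimal sketch of the arrayref
    variant of that pseudo-code. It is only an illustration, not the poster's
    own code: the helper name collect_sequences is made up, and it assumes
    that animal names sit on lines starting with '>' (as in the FASTA-style
    code later in the thread), which the sample files above do not show.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: add the lines of one "file" to a hash of array refs.
# Assumption: a line starting with '>' names an animal; every other
# non-blank line is sequence data for the most recent animal.
sub collect_sequences {
    my ($sequences, @lines) = @_;
    my $animal;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;            # skip blank lines
        if ($line =~ /^>\s*(\S+)/) {
            $animal = $1;                    # a new animal starts here
            next;
        }
        next unless defined $animal;         # ignore data before the first name
        push @{ $sequences->{$animal} }, $line;
    }
    return $sequences;
}

my %sequences;
collect_sequences(\%sequences, ">cat\n", "atacta--\n", "gat--acgt-\n", ">dog\n", "atgc\n");
collect_sequences(\%sequences, ">dog\n", "-ac--ac\n", ">cat\n", "ac-ac\n");

# Join the stored pieces only when printing.
for my $animal (sort keys %sequences) {
    print ">$animal\n", join('', @{ $sequences{$animal} }), "\n";
}
```

    Keeping the pieces in an array leaves the choice of separator (none,
    newline, ...) open until output time.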

    Is that clear-ish?

    --
    rjbs
  • Michael S. Robeson II at May 17, 2004 at 10:21 pm
    Well this is the best I could do thinking through what you said. This
    is actually my first time working with hashes. Also, I am still a PERL
    newbie. So, I guess a little helpful code would go a long way. I just
    can't figure out how to link the regular expressions to the hash when
    searching through the multiple files. to do as you say:

    ***Philipp wrote:***

    - open the first file
    - search for the beginning of an "organism" (say: ">cat"), read
    everything
    after this point
    - search in your hash if you already stored data of this organism
    - if yes, append your new sequence to the already existing data
    - if no, create a new key in the hash
    - repeat this until you run out of "organisms"
    - repeat the whole procedure until you run out of files

    ***end***

    #!/usr/bin/perl
    # This script will take separate FASTA files and combine the "like"
    # data into one FASTA file.
    #

    use warnings;
    use strict;

    my %organisms (
    "$orgID" => "$orgSeq",
    );

    print "Enter in a list of files to be processed:\n";

    # For example:
    # CytB.fasta
    # NADH1.fasta
    # ....

    chomp (my @infiles = <STDIN>);

    foreach $infile (@infiles) {
    open (FASTA, $infile)
    or die "Can't open INFILE: $!";

    $/='>'; #Set input operator

    while (FASTA) {
    chomp;

    # Some regular expression match here?
    # something that will set, say... ">cat"
    # as the key "$orgID", something similar
    # to below?
    # and then set the sequence as the value
    # "$orgSeq" like below?

    # Do not know if or where to put the following,
    # but something like:

    if (exists $organisms{$orgID}) {
    # somehow concatenate "like" data
    # from the different files
    }

    # print the final Hash to an outfile?

    }

    .... yeah, I'm lost. :-)

    -Mike
  • Traeder, Philipp at May 17, 2004 at 11:58 am

    Hi all,
    Hi Michael,
    I am having trouble with combining data from several files,
    and I can't
    even figure out how to get started. So, I am NOT asking for any code
    (though pseudo-code is ok) as I would like to try figuring
    this problem
    out myself. So, if anyone can give me any references or hints that
    would be great.
    That's a good approach :-)

    [..]
    Basically, I would like to concatenate the sequence of each
    corresponding animal so that the various input files would
    be out put
    to a file like so: [..]
    Notice that in the two files the data are not in the same
    order. So, I
    am trying to figure out how to have the script figure out what the
    first organism is in FILE 1( say "cat" in this case) and find the
    corresponding "cat" in the other input files. Then take the sequence
    data (all the cat data) from FILE 2 and concatenate it to the cat
    sequence data in FILE 1 to an output file. Then it should go
    on to the
    next organism in FILE 1 and search for that next organism in
    the other
    files (in this case FILE 2). I do not care about the order of
    the data,
    only that the "like" data is concatenated together.
    If memory is not a problem (i.e. the amount of data you're processing is
    rather small), I would read all files into a hash and concatenate them
    there. Something like:
    - open the first file
    - search for the beginning of an "organism" (say: ">cat"), read everything
    after this point
    - search in your hash if you already stored data of this organism
    - if yes, append your new sequence to the already existing data
    - if no, create a new key in the hash
    - repeat this until you run out of "organisms"
    - repeat the whole procedure until you run out of files

    I'd happily elaborate, but I don't want to spoil your approach of wanting to
    solve this by yourself.
    If you have any questions or need additional information, just post again.
    :-)
    Again, I do NOT want this solved for me (unless I am totally lost).
    Otherwise, I'll never learn. I would just like either hints /
    suggestions / pseudo code / even links to books or sites that
    discuss
    this particular topic. Meanwhile, I am eagerly awaiting my "PERL
    Cookbook" and I'll keep searching the web.
    Another good starting point might be the "camel book"
    (O'Reilly: Programming Perl)...

    HTH,

    Philipp
  • Johan Viklund at May 17, 2004 at 12:08 pm

    On Sun, 16 May 2004 19:50:57 -0400, Michael S. Robeson II wrote:

    Hi all,
    Hello and Welcome to the world of bioinformatics with perl!


    ...

    I think you should take a look at BioPerl since this is genome data. For
    this exercise it's not what you want, but it is worth knowing about if you
    want to do more biology with Perl (BLAST, interfacing with databases, easy
    format conversion, and so on). BioPerl can be found at http://www.bioperl.org/
    ***FILE 1***
    cat
    atacta--gat--acgt-
    ac-ac-ggttta-ca-- ...
    Again, I do NOT want this solved for me (unless I am totally lost).
    Otherwise, I'll never learn. I would just like either hints /
    suggestions / pseudo code / even links to books or sites that discuss
    this particular topic. Meanwhile, I am eagerly awaiting my "PERL
    Cookbook" and I'll keep searching the web.
    So this was more like a link ;)


    -Thanks!
    -Mike
    /Johan Viklund

    Ps.
    <off-topic>
    Next exercise (or really the one before) would be to calculate the GC-skew.
    </off-topic>
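
    As a rough illustration of that exercise: GC skew is commonly defined as
    (G - C) / (G + C). The sketch below assumes that definition, counts both
    upper and lower case, and ignores gap characters; the helper name gc_skew
    is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: GC skew of one sequence, (G - C) / (G + C).
# Returns undef when the sequence contains no G or C at all.
sub gc_skew {
    my ($seq) = @_;
    my $g = ($seq =~ tr/Gg//);   # tr with an empty replacement only counts
    my $c = ($seq =~ tr/Cc//);
    return undef if $g + $c == 0;
    return ($g - $c) / ($g + $c);
}

my $skew = gc_skew('atgcgtatgc-atcgat');
printf "GC skew: %.3f\n", $skew if defined $skew;
```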
  • Traeder, Philipp at May 18, 2004 at 9:27 am

    Well this is the best I could do thinking through what you said. This
    is actually my first time working with hashes. Also, I am
    still a PERL
    newbie. So, I guess a little helpful code would go a long way. I just
    can't figure out how to link the regular expressions to the hash when
    searching through the multiple files. to do as you say:
    That's quite good - you've got more or less all the relevant parts, we
    just need to put it together. I'll try to give you some hints without
    revealing the whole black magic ;-)
    #!/usr/bin/perl
    # This script will take separate FASTA files and combine the "like"
    # data into one FASTA file.
    #

    use warnings;
    use strict;

    my %organisms (
    "$orgID" => "$orgSeq",
    );
    I'd say there's no need to fill the hash with values at this point;
    I'd just declare it like this:

    my %organisms;

    We'll put values into it later.
    print "Enter in a list of files to be processed:\n";

    # For example:
    # CytB.fasta
    # NADH1.fasta
    # ....

    chomp (my @infiles = <STDIN>);
    This doesn't work for me - maybe I just don't know how to use it,
    but for the moment, I'd hardcode this and concentrate on the
    concatenation part...

    my @infiles = ('genetics.txt');
    foreach $infile (@infiles) {
    open (FASTA, $infile)
    or die "Can't open INFILE: $!";
    Some small things - having set "warnings" and "strict", perl is asking
    me to declare $infile and FASTA - so for me this looks like:

    my $FASTA = new FileHandle;
    open ($FASTA, $infile)
    or die "Can't open INFILE: $!";

    BTW: For this to work, you need to

    use FileHandle;

    on top of your code.
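
    As an aside, the FileHandle module is not strictly required: open() can
    create a lexical filehandle itself when given an undefined scalar. A
    small sketch, assuming a reasonably recent Perl (5.6 or later); the
    helper name read_lines and the throwaway file are only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the lines of a file with newlines removed.
sub read_lines {
    my ($path) = @_;
    # Three-argument open with a lexical filehandle; no FileHandle needed.
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    chomp(my @lines = <$fh>);
    close($fh) or die "Can't close $path: $!";
    return @lines;
}

# Demo: write a throwaway file, then read it back.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} ">cat\natacta--gat\n";
close($tmp_fh) or die "Can't close $tmp_name: $!";

print "line: $_\n" for read_lines($tmp_name);
```

    The explicit '<' mode also keeps a file name that happens to start with
    '>' or '|' from being interpreted as something other than a plain read.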
    $/='>'; #Set input operator
    This is an interesting approach - I've never worked with this input
    operator, but I think it might make life quite easy...
    I'll do it manually (which is at least good practice ;-) ), but we
    should keep this in mind.
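
    To show what setting $/ (the input record separator) to '>' would do,
    here is a hedged sketch that reads one whole record per iteration. It
    assumes FASTA-style text with '>' only at the start of a record, and
    uses an in-memory filehandle so it runs without any input file; the
    helper name read_records is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: split FASTA-like text into [name, sequence] pairs
# by reading one '>'-terminated record at a time.
sub read_records {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Can't open in-memory handle: $!";
    local $/ = '>';                    # a "line" now ends at each '>'
    my @records;
    while (my $record = <$fh>) {
        chomp $record;                 # chomp removes the current $/, i.e. '>'
        next unless $record =~ /\S/;   # the chunk before the first '>' is empty
        my ($name, @seq_lines) = split /\n/, $record;
        push @records, [ $name, join('', @seq_lines) ];
    }
    close($fh);
    return @records;
}

my @records = read_records(">cat\natacta--\ngat\n>dog\natgc\n");
print "$_->[0] => $_->[1]\n" for @records;
```

    Using local keeps the changed separator from leaking into other reads,
    such as a later <STDIN>.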
    while (FASTA) {
    small stuff again:

    while (defined($_ = <$FASTA>)) {
    chomp;

    # Some regular expression match here?
    # something that will set, say... ">cat"
    # as the key "$orgID", something similar
    # to below?
    # and then set the sequence as the value
    # "$orgSeq" like below?
    Yes, a regexp is a very good idea here.
    Generally, you just need to distinguish between the "start"-lines and
    the "regular" lines here, i.e. the ones that mark the beginning of an
    organism and the ones that carry the data.

    # We're searching for "start"-lines that look like this:
    #   >dog
    # so try to match something like
    #   \s*  zero-to-many characters of optional whitespace
    #   >    the greater-than sign
    #   \w+  one-to-many (word) characters
    # the parentheses around the \w+ mean that
    # we want to access this value later using $1
    if (/\s*>(\w+)/) {
        print "found a new organism called '$1'\n";
    }
    # or just some data belonging to the last
    # organism we found
    else {
        print "this is just some data : $_\n";
    }
    # or just some data belonging to the last
    # organism we found
    else {
    print "this is just some data : $_\n";
    }

    Don't worry if you don't understand this on the first look - regexes can
    be quite messy, but once you get used to them, they quickly become your
    best friend (for getting used to them, I can recommend very much chapter
    2 of the camel book...).
    Anyway - now you've got the name of the new organism in a special variable
    called $1 (if this is a new organism) or the data in $_ (if it's not).
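
    One detail worth a small sketch: without an anchor, the pattern also
    matches a '>' in the middle of a line. The variant below adds ^ and
    copies $1 right away; it is an illustration under the assumption that
    start lines always begin with optional whitespace and '>', and the
    helper name organism_name is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the organism name if the line is a start
# line, or undef if it is a data line.
sub organism_name {
    my ($line) = @_;
    return $line =~ /^\s*>(\w+)/ ? $1 : undef;
}

for my $line ('>dog', '  >cat extra text', 'atg-c', 'a>b') {
    my $name = organism_name($line);
    print defined $name ? "start line for '$name'\n" : "data line: $line\n";
}
```

    Here 'a>b' is treated as data, whereas the unanchored pattern would
    report an organism called 'b'.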
    # Do not know if or where to put the following,
    # but something like:

    if (exists $organisms{$orgID}) {
    # somehow concatenate "like" data
    # from the different files
    }
    This is completely right as well - this line of code lets you check if you
    already got data of this organism...let's think about what we want to do.
    We've got a hash which should look like:

    cat => funny-sequence-of-a-c-g-whatever,
    dog => even-funnier-sequence-of-characters

    With the code from above, we iterate over all files specified on the command
    line (or hardcoded into the script), and there are two kinds of line we can
    meet:
    - start lines
    - regular data lines
    We can separate start lines from regular lines with the regexp above.

    When we come across a start line, we don't have to process any data: a start
    line, after all, does not contain "real" data, but only the name identifying
    the organism to which the following data belongs.
    But in order to store the data for the "right" organism, we should keep
    track of the last start line - I would do this by storing the last ID in a
    variable (that needs to be outside the while-loop).

    So now we can do something like this:
    If the new line is a start line
    - store the ID
    If the new line is a regular line
    - append it to the current entry
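
    The two rules above can be sketched as a small loop. This is only one
    possible reading of them, not the finished script: the helper name
    add_records is invented, the input comes from in-memory strings instead
    of real files, and data lines that appear before any start line are
    simply skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: read one open filehandle and append each data line
# to the entry of the most recent start line.
sub add_records {
    my ($organisms, $fh) = @_;
    my $current_id;                          # last start line seen in this file
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;            # skip blank lines
        if ($line =~ /^\s*>(\w+)/) {
            $current_id = $1;                # start line: store the ID
        }
        elsif (defined $current_id) {
            $organisms->{$current_id} .= $line;   # data line: append it
        }
    }
    return $organisms;
}

# Demo with two in-memory "files" listing the organisms in different orders.
my %organisms;
my @texts = (">cat\nac-\ngt\n>dog\natg\n", ">dog\nccc\n>cat\ntt-\n");
for my $text (@texts) {
    open(my $fh, '<', \$text) or die "Can't open in-memory handle: $!";
    add_records(\%organisms, $fh);
    close($fh);
}
print ">$_\n$organisms{$_}\n" for sort keys %organisms;
```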

    Since you want to append the individual strings, you don't need to check
    explicitly if the hash entry exists already (as you sketched above).
    It would be different if you used a data structure that needs to be
    initialized, like array-refs for example. But that's another subject ;-)
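
    A quick sketch of both cases, for comparison. The .= operator creates
    the hash entry on first use without an "uninitialized" warning, and in
    practice push @{ ... } also creates the array reference on demand
    (autovivification), so neither form needs an exists check here. The
    variable names are only illustrative.

```perl
use strict;
use warnings;

my %as_string;
my %as_list;

for my $pair ([cat => 'ac-'], [dog => 'atg'], [cat => 'gt']) {
    my ($name, $chunk) = @$pair;
    $as_string{$name} .= $chunk;          # creates the entry on first use
    push @{ $as_list{$name} }, $chunk;    # autovivifies an array reference
}

print "cat as string: $as_string{cat}\n";
print "cat as list:   @{ $as_list{cat} }\n";
```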
    # print the final Hash to an outfile?

    }
    Afterwards, you can print the final hash, and everything's fine.

    Here's a sketch of what the code could look like, though I left you to fill
    in some interesting parts ;-)
    At the moment, it's creating new hash entries, but not appending to them.

    <perl code>
    #! /usr/bin/perl -w

    use strict;
    use FileHandle;

    my %organisms;

    print "Enter in a list of files to be processed:\n";

    # For example:
    # CytB.fasta
    # NADH1.fasta
    # ....

    #chomp (my @infiles = <STDIN>);
    # TODO we should make this nice later
    my @infiles = ('genetics.txt');

    foreach my $infile (@infiles) {
        my $FASTA = new FileHandle;
        open ($FASTA, $infile)
            or die "Can't open INFILE: $!";

        #$/='>'; #Set input operator

        # I moved this variable outside the while-loop
        # in order to be able to assign the "data" in
        # the next line to the organism it belongs to
        # (we're keeping track of the last start line
        # that we came across here)
        my $orgID;

        while (defined($_ = <$FASTA>)) {
            chomp;
            print "\nworking on >>$_<<\n";

            # see if this line is the start of an
            # organism; the thing we're searching for
            # looks like this:
            #   >dog
            # so try to match something like
            #   \s*  zero-to-many characters of optional whitespace
            #   >    the greater-than sign
            #   \w+  one-to-many (word) characters
            # the parentheses around the \w+ mean that
            # we want to access this value later using $1
            if (/\s*>(\w+)/) {
                $orgID = $1;
                print "found a new organism start line ('$orgID')\n";

            }
            # or just some data belonging to the last
            # organism we found
            else {
                print "this is just some data : $_\n";
                print "this data needs to be appended to the hash entry for $orgID\n";

                # let's check if we've got data for this entry
                if (exists ($organisms{$orgID})) {
                    # TODO append the data to the hash here

                }
                else {
                    # create a new hash entry for this data
                    $organisms{$orgID} = $_;
                }
            }
        }
        # do not forget to close the input file
        close ($FASTA)
            or die "could not close INFILE : $!";
    }

    # we've processed all input files...print the resulting hash
    print "\n****************************************\n";
    while (my ($orgID, $sequence) = each(%organisms)) {
        print "$orgID : $sequence\n";
    }
    </perl>

    HTH,

    Philipp
  • Michael Robeson at May 18, 2004 at 5:16 pm
    Ok great. Most of what you show does make sense. However, there are
    some bits of code that I need further clarification on. Some bits I
    am able to tell what they are doing but I do not quite know how or why
    they work the way they do. I'll state these areas in the code we've
    got together at this point.

    Hopefully, I have copied over the bits you wrote correctly. I find this
    is like learning Spanish. I can read and (roughly) get the gist of the
    code. But when it comes to writing the original code on my own is when
    I have trouble. I am sure this will go away when I practice more. :-)

    I didn't finish everything because I just need some code explained /
    clarified.
    Start PERL code<<<<<
    #!/usr/bin/perl -w

    use strict;
    use FileHandle;

    # I am unsure of what this module is. I've tried looking it up
    # in the Camel and Llama book to no avail, not enough description.
    # I guess I have to figure out the whole object thing?

    my %organisms;

    print "Enter in a list of files to be processed:\n";

    # For example:
    # Cytb.fasta
    # NADH1.fasta
    # ...

    # chomp (my @infiles = <STDIN>);
    # TODO we should make this nicer later
    my @infiles = ('genetics.txt');

    foreach my $infile(@infiles) {
    my $FASTA = new FileHandle;

    # Does the above statement tell PERL to create a new
    # filehandle for each file it finds? I guess I need to understand
    # what "new" and the module "FileHandle" are doing.

    open ($FASTA, $infile)
    or die "Can't open INFILE:$!";

    #$/='>' #Set input operator

    my $orgID;

    while (defined($_ = <$FASTA>)) {

    # Above I am unsure of why the "defined" function
    # helps us here? I know it has something to do with an
    # expression containing a valid string, but I am unsure
    # of its function here. This is something I would have
    # never thought to do. :-)

    chomp;
    print "\nworking on >>$_<<\n";

    if (/\s*>(\w+)/) {
    $orgID=$1;
    print "Found a new organism start line ('$orgID')\n";

    # The above regex makes complete sense. Actually, I was going to put
    # something similar to that in my original post but wasn't sure
    # if this was appropriate at the time. I guess it was!

    } else {
    print "This is just some data: $_\n";
    print "This data needs to be appended to the hash entry for $orgID\n";

    # okay, in the above you are taking the left over
    # sequence ($_) and linking it as a "value" to "$orgID" ?

    if (exists ($organisms{$orgID})) {
    #TODO append the data to the hash here

    # I guess I would put the following to append to
    # the already existing hash:
    # $organisms{$orgID} .= $_;

    } else {
    #create new hash entry for this data
    $organisms{$orgID} = $_;
    }
    }
    }

    # Do not forget to close the input file
    close ($FASTA)
    or die "Could not close INFILE: $!";
    }

    # We've processed all input files... print the resulting hash

    print "\n*****************************************************\n";

    while (my($orgID, $sequence) = each(%organisms)) {
    # since I want the output as:
    # >cat
    # actgac---cgatc-ag-cttag---acg
    # >dog
    # actatc---actat-at-accta---atc
    # I would change the print statement to:
    print "> . $orgID\n $sequence\n";
    }

    __END__
    end PERL code<<<
    Thanks for all your help so far! Most of this is starting to help my
    thinking. I will be doing a lot more of this multi-file parsing as most
    of my work entails manipulating data in several files or folders at
    once.

    -Mike
  • Johan Viklund at May 18, 2004 at 6:31 pm
    Hi,

    See code
    On Tue, 18 May 2004 13:16:37 -0400, Michael Robeson wrote:

    Ok great. Most of what you show does make sense. However, there are some
    bits of code that I need further clarification on. Some bits I am able
    to tell what they are doing but I do not quite know how or why they work
    the way they do. I'll state these areas in the code we've got together
    at this point.

    Hopefully, I have copied over the bits you wrote correctly. I find this
    is like learning Spanish. I can read and (roughly) get the gist of the
    code. But when it comes to writing the original code on my own is when I
    have trouble. I am sure this will go away when I practice more. :-)

    I didn't finish everything because I just need some code explained /
    clarified.
    Start PERL code<<<<<
    #!/usr/bin/perl -w

    use strict;
    use FileHandle;

    # I am unsure of what this module is. I've tried looking it up
    # in the Camel and Llama book to no avail, not enough description.
    # I guess I have to figure out the whole object thing?
    # Run 'perldoc FileHandle' on the command line to see
    # (you can do this with (hopefully) all new modules you come across).
    my %organisms;

    print "Enter in a list of files to be processed:\n";

    # For example:
    # Cytb.fasta
    # NADH1.fasta
    # ...

    # chomp (my @infiles = <STDIN>);
    # TODO we should make this nicer later
    my @infiles = ('genetics.txt');

    foreach my $infile(@infiles) {
    my $FASTA = new FileHandle;

    # Does the above statement tell PERL to create a new
    # filehandle for each file it finds? I guess I need to understand
    # what "new" and the module "FileHandle" are doing.
    Right on.
    open ($FASTA, $infile)
    or die "Can't open INFILE:$!";

    #$/='>' #Set input operator

    my $orgID;

    while (defined($_ = <$FASTA>)) {

    # Above I am unsure of why the "defined" function
    # helps us here? I know it has something to do with an
    # expression containing a valid string, but I am unsure
    # of its function here. This is something I would have
    # never thought to do. :-)
    It's what
    while (<$FASTA>)
    actually does.

    The defined function checks whether $_ got set or not.
    chomp;
    print "\nworking on >>$_<<\n";

    if (/\s*>(\w+)/) {
    $orgID=$1;
    print "Found a new organism start line ('$orgID')\n";

    # The above regex makes complete sense. Actually, I was going to put
    # something similar to that in my original post but wasn't sure
    # if this was appropriate at the time. I guess it was!

    } else {
    print "This is just some data: $_\n";
    print "This data needs to be appended to the hash entry for $orgID\n";

    # okay, in the above you are taking the left over
    # sequence ($_) and linking it as a "value" to "$orgID" ?
    This if-then-else statement should do what you want. I would do it like
    this instead:
    $organisms{$orgID} .= $_;

    No if and no else, just that single line. Perl will just make it work the
    way it's supposed to work: if the hash key doesn't exist, it gets created and
    the contents of $_ are inserted in it (as a string).
    if (exists ($organisms{$orgID})) {
    #TODO append the data to the hash here

    # I guess I would put the following to append to
    # the already existing hash:
    # $organisms{$orgID} .= $_;

    } else {
    #create new hash entry for this data
    $organisms{$orgID} = $_;
    }
    }
    }

    # Do not forget to close the input file
    close ($FASTA)
    or die "Could not close INFILE: $!";
    }

    # We've processed all input files... print the resulting hash

    print "\n*****************************************************\n";

    while (my($orgID, $sequence) = each(%organisms)) {
    # since I want the output as:
    # >cat
    # actgac---cgatc-ag-cttag---acg
    # >dog
    # actatc---actat-at-accta---atc
    # I would change the print statement to:
    print "> . $orgID\n $sequence\n";
    Hmm, you're trying to do string concatenation here but in that case it
    should be:
    print ">" . $orgID . "\n" . $sequence . "\n";
    but it's much easier to just do it like:
    print ">$orgID\n$sequence\n";
    }

    __END__
    end PERL code<<<
    Thanks for all your help so far! Most of this is starting to help my
    thinking. I will be doing a lot more of this multi-file parsing as most
    of my work entails manipulating data in several files or folders at once.

    -Mike

    /Johan
  • Ramprasad A Padmanabhan at May 18, 2004 at 6:59 pm
    Quite a unique case.
    If your data is not very large, I would suggest that you first read
    all the data into one big hash (key: the animal, value: the data)
    and then just print the hash out to a file,
    like this (writing pseudo code is easier if written in Perl :-) ):

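    use strict;
    use warnings;

    my @files = qw(file1 file2 file3);
    $/ = "\n>";    # read one record at a time; records are separated by "\n>"
    my %alldata;
    foreach my $f (@files) {
        open(IN, '<', $f) || die "Couldn't open $f: $!";
        while (<IN>) {
            chomp;          # remove the trailing "\n>" record separator
            s/^>//;         # the very first record still has its leading '>'
            # the first line is the animal name, the rest is sequence data
            my ($animal, $data) = /^(.*?)\n(.*)/s or next;
            $data =~ tr/\n//d;          # join the sequence lines
            $alldata{$animal} .= $data;
        }
        close IN;
    }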

    ###### %alldata has all the data






  • Michael Robeson at May 19, 2004 at 4:24 pm
    Thanks to those that helped. The code works great. Now I will practice
    on honing it down to the bare essentials. Below is the final code you
    all helped with.

    -Thanks a million!
    -Mike
    Begin PERL Code<<<
    #! /usr/bin/perl -w

    use strict;
    use FileHandle;

    my %organisms;

    print "Enter in a list of files to be processed:\n";

    # For example:
    # CytB.fasta
    # NADH1.fasta
    # ....

    chomp (my @infiles = <STDIN>);
    # TODO we should make this nice later
    #my @infiles = ('genetics.txt');

    print "Enter in the name of the OUTFILE:\n";

    chomp (my $outfile = <STDIN>);

    open(OUTFILE, ">$outfile")
    or die "Can't open OUTFILE: $!";

    foreach my $infile (@infiles) {
    my $FASTA = new FileHandle;
    open ($FASTA, $infile)
    or die "Can't open INFILE: $!";

    # I moved this variable outside the while-loop
    # in order to be able to assign the "data" in
    # the nextline to the organism it belongs to
    # (we're keeping track of the last start line
    # that we came across here)
    my $orgID;

    while (defined($_ = <$FASTA>)) {
    chomp;
    print "\nWorking on >>$_<<\n";

    # see if this line is the start of an
    # organism; the thing we're searching for
    # looks like this:
    # >dog
    # so try to match something like
    # \s* zero-to-many characters of
    # optional whitespace
    # > the bigger-than sign
    # \w+ one-to-many (word) characters
    # the parenthesis around the \w+ means that
    # we want to access this value later using $1
    if (/\s*>(\w+)/) {
    $orgID = $1;
    print "Found a new organism start line ('$orgID')\n";

    }
    # or just some data belonging to the last
    # organism we found
    else {
    print "Sequence data found: $_\n";
    print "Appending data to $orgID\n";

    # let's check if we've got data for this entry
    if (exists ($organisms{$orgID})) {
    # append the data to the existing hash entry
    $organisms{$orgID} .= $_;
    }
    else {
    # create a new hash entry for this data
    $organisms{$orgID} = $_;
    }
    }
    }
    # do not forget to close the input file
    close ($FASTA)
    or die "could not close INFILE : $!";
    }

    # we've processed all input files...print the resulting hash
    print "\n****************************************\n";
    while (my ($orgID, $sequence) = each(%organisms)) {
    print OUTFILE ">$orgID\n$sequence\n\n";
    }
    END PERL CODE<<<
  • Michael Robeson at May 19, 2004 at 9:25 pm
    Sorry, I meant to upload this script (see below). However, I have one
    last question. Why can't I use

    s/\n//g; # instead of

    tr/A-Za-z-//cd;



    in the script below? I thought it would be simpler to remove the
    newline characters from $_ which is all I really want to do. However,
    most of the time all I will see are "-" and letters which is why I set
    the tr function the way I did.

    I just couldn't figure out why the substitution function wouldn't work
    in this case. How am I setting it up wrong?

    -Thanks
    -Mike
    BEGIN PERL SCRIPT<<
    #! /usr/bin/perl -w

    use strict;
    use FileHandle;

    my %organisms;

    print "Enter in a list of files to be processed:\n";

    # For example:
    # CytB.fasta
    # NADH1.fasta
    # ....

    chomp (my @infiles = <STDIN>);

    print "Enter in the name of the OUTFILE:\n";

    chomp (my $outfile = <STDIN>);

    open(OUTFILE, ">$outfile")
        or die "Can't open OUTFILE: $!";

    foreach my $infile (@infiles) {
        my $FASTA = new FileHandle;
        open ($FASTA, $infile)
            or die "Can't open INFILE: $!";

        my $orgID;

        while (defined($_ = <$FASTA>)) {
            chomp;
            print "\n<< Processing >>$_<<\n";

            if (/\s*>(\w+)/) {

                $orgID = $1;
                print "Found a new organism start line ('$orgID')\n";

            }

            else {

                tr/A-Za-z-//cd; # originally tried s/\n//g;

                print "Sequence data found: $_\n";
                print "Appending data to $orgID\n";


                $organisms{$orgID} .= $_;

            }
        }
        # do not forget to close the input file
        close ($FASTA)
            or die "could not close INFILE : $!";
    }

    # we've processed all input files...print the resulting hash
    print "\n****************************************\n";
    while (my ($orgID, $sequence) = each(%organisms)) {
        print OUTFILE ">$orgID\n$sequence\n\n";
    }
    END PERL SCRIPT <<
  • Philipp traeder at May 20, 2004 at 8:52 pm

    On Wednesday 19 May 2004 05:25 pm, Michael Robeson wrote:
    Sorry, I meant to upload this script (see below). However, I have one
    last question. Why can't I use

    s/\n//g; # instead of

    tr/A-Za-z-//cd;



    in the script below? I thought it would be simpler to remove the
    newline characters from $_ which is all I really want to do. However,
    most of the time all I will see are "-" and letters which is why I set
    the tr function the way I did.
    Hi Mike,

    I'm not sure if I understand exactly what you want to do here, but if you want
    to remove trailing newlines only, I'd use
    chomp;
    I just couldn't figure out why the substitution function wouldn't work
    in this case. How am I setting it up wrong?
    Just guessing - could it be that you need to assign the return value of s///?
    Something like
    my $var_without_newlines = s/\n//g;
    ?
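
    A small sketch may help with this question. s/// changes its target in
    place and returns the number of substitutions, so assigning its result
    stores a count rather than the cleaned string. One possible explanation
    for what Mike saw (an assumption, since the thread does not say how the
    files were written) is that the lines end in "\r\n" or "\r": chomp
    removes the "\n", s/\n//g then finds nothing left to remove, and the
    stray "\r" survives, whereas tr/A-Za-z-//cd deletes it. The helper name
    strip_line_endings is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: strip every kind of line-ending character from a copy
# of the line and report how many substitutions were made.
sub strip_line_endings {
    my ($line) = @_;
    my $removed = ($line =~ s/[\r\n]+//g);   # s/// edits $line, returns a count
    return ($line, $removed || 0);
}

my ($clean, $count) = strip_line_endings("acgt-\r\n");
print "cleaned: '$clean' ($count substitution(s))\n";
```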

    HTH,

    Philipp

Discussion Overview
group: beginners @
categories: perl
posted: May 16, '04 at 11:51p
active: May 20, '04 at 8:52p
posts: 13
users: 7
website: perl.org
