Well this is the best I could do thinking through what you said. This
is actually my first time working with hashes. Also, I am
still a PERL
newbie. So, I guess a little helpful code would go a long way. I just
can't figure out how to link the regular expressions to the hash when
searching through the multiple files. to do as you say:
That's quite good - you've got more or less all the relevant parts, we
just need to put it together. I'll try to give you some hints without
revealing the whole black magic ;-)
#!/usr/bin/perl
# This script will take separate FASTA files and combine the "like"
# data into one FASTA file.
#
use warnings;
use strict;
my %organisms (
"$orgID" => "$orgSeq",
);
I'd say there´s no need to fill the hash with values at this point;
I'd just declare it like this:
my %organisms;
We'll put values into it later.
print "Enter in a list of files to be processed:\n";
# For example:
# CytB.fasta
# NADH1.fasta
# ....
chomp (my @infiles = <STDIN>);
This doesn't work for me - maybe I just don't know how to use it,
but for the moment, I'd hardcode this and concentrate on the
concatenation part...
my @infiles = ('genetics.txt');
foreach $infile (@infiles) {
open (FASTA, $infile)
or die "Can't open INFILE: $!";
Some small things - having set "warnings" and "strict", perl is asking
me to declare $infile and FASTA - so for me this looks like:
my $FASTA = new FileHandle;
open ($FASTA, $infile)
or die "Can't open INFILE: $!";
BTW: For this to work, you need to
use FileHandle;
on top of your code.
$/='>'; #Set input operator
This is an interesting approach - I've never worked with this input
operator, but I think it might make like quite easy...
I'll do it manually (which is at least good practice ;-) ), but we
should keep this in mind.
while (FASTA) {
small stuff again:
while (defined($_ = <$FASTA>)) {
chomp;
# Some regular expression match here?
# something that will set, say... ">cat"
# as the key "$orgID", something similar
# to below?
# and then set the sequence as the value
# "$orgSeq" like below?
Yes, a regexp is a very good idea here.
Generally, you just need to distinguish between the "start"-lines and
the "regular" lines here, i.e. the ones that mark the beginning of an
organism and the ones that carry the data.
# We're searching for "start"-lines that look like this:
# >dog
# so try to match something like
# \s* zero-to-many characters of
# optional whitespace
# > the bigger-than sign
# \w+ one-to-many (word) characters
# the parenthesis around the \w+ means that
# we want to access this value later using $1
if (/\s*>(\w+)/) {
print "found a new organism called '$1'\n";
}
# or just some data belonging to the last
# organism we found
else {
print "this is just some data : $_\n";
}
Don't worry if you don't understand this on the first look - regexes can
be quite messy, but once you get used to them, they quickly become your
best friend (for getting used to them, I can recommend very much chapter
2 of the camel book...).
Anyway - now you've got the name of the new organism in a special variable
called $1 (if this is a new organism) or the data in $_ (if it´s not).
# Do not know if or where to put the following,
# but something like:
if (exists $organisms{$orgID}) {
# somehow concatenate "like" data
# from the different files
}
This is completely right as well - this line of code lets you check if you
already got data of this organism...let´s think about what we want to do.
We've got an hash which should look like:
cat => funny-sequence-of-a-c-g-whatever,
dog => even-funnier-sequence-of-characters
With the code from above, we iterate over all files specified on the command
line (or hardcoded into the script), and there are two kinds of line we can
meet:
- start lines
- regular data lines
We can separate start lines from regular lines with the regexp above.
When we come across a start line, we don't have to process any data: a start
line, after all, does not contain "real" data, but only the name identifying
the organism to which the following data belongs.
But in order to store the data for the "right" organism, we should keep
track
of the last start line - I would do this by storing the last ID in a
variable
(that needs to be outside the while-loop).
So now we can do something like this:
If the new line is a start line
- store the ID
If the new line is a regular line
- append it to the current entry
Since you want to append the individual strings, you don't need to check
explicitly if the hash entry exists already (as you sketched above).
It would be different if you would use a data structure that needs to be
initialized, like array-refs for example. But that´s another subject ;-)
# print the final Hash to an outfile?
}
Afterwards, you can print the final hash, and everything´s fine.
Here´s a sketch of what the code could look like though I left you to fill
some interesting parts ;-)
At the moment, it´s creating new hash entries, but not appending to them.
<perl code>
#! /usr/bin/perl -w
use strict;
use FileHandle;
my %organisms;
print "Enter in a list of files to be processed:\n";
# For example:
# CytB.fasta
# NADH1.fasta
# ....
#chomp (my @infiles = <STDIN>);
# TODO we should make this nice later
my @infiles = ('genetics.txt');
foreach my $infile (@infiles) {
my $FASTA = new FileHandle;
open ($FASTA, $infile)
or die "Can't open INFILE: $!";
#$/='>'; #Set input operator
# I moved this variable outside the while-loop
# in order to be able to assign the "data" in
# the nextline to the organism it belongs to
# (we're keeping track of the last start line
# that we came across here)
my $orgID;
while (defined($_ = <$FASTA>)) {
chomp;
print "\nworking on >>$_<<\n";
# see if this line is the start of an
# organism; the thing we´re searching for
# looks like this:
# >dog
# so try to match something like
# \s* zero-to-many characters of
# optional whitespace
# > the bigger-than sign
# \w+ one-to-many (word) characters
# the parenthesis around the \w+ means that
# we want to access this value later using $1
if (/\s*>(\w+)/) {
$orgID = $1;
print "found a new organism start line ('$orgID')\n";
}
# or just some data belonging to the last
# organism we found
else {
print "this is just some data : $_\n";
print "this data needs to be appended to the hash
entry for $orgID\n";
# let´s check if we´ve got data for this entry
if (exists ($organisms{$orgID})) {
# TODO append the data to the hash here
}
else {
# create a new hash entry for this data
$organisms{$orgID} = $_;
}
}
}
# do not forget to close the input file
close ($FASTA)
or die "could not close INFILE : $!";
}
# we've processed all input files...print the resulting hash
print "\n****************************************\n";
while (my ($orgID, $sequence) = each(%organisms)) {
print "$orgID : $sequence\n";
}
</perl>
HTH,
Philipp