Hello,
I have completed a draft revision of a set of types intended to
represent protein and nucleic acid sequences, replacing the
initial implementation currently used in biogo.
I would be grateful if some people could have a look at what I have and
give me some feedback. I'm not completely happy with the design at this
stage, though it is a significant improvement on the initial version.
The motivation for the set of packages is to provide support for a
variety of sequence types that can be used in a type safe manner
(preferably at compile time but aggressively at runtime if that is not
possible - i.e. panicking if types mismatch) or can be used generically
via interfaces.
Types are: single sequences and column-based multiple sequences with or
without associated letter quality data (4 types for each of nucleic acid
(NA) and protein), row-based aligned and unaligned multiple sequences (2
types for each). Column-based multiple sequence types are for analysis
of alignments, while row-based are for alignment construction.
There is a fair amount of code [1] to look at, so I'll ask specific
questions/raise specific issues:
1. alphabet.Letter is just a byte. Making it a type adds typing and
conversion costs but was intended to make letters and bytes
distinct and allow letters to have methods (it seemed like a
good idea at the time).
2. NA sequences can be reverse complemented so there needs to be a
distinction between these and protein sequences. At the moment
this is done by types at runtime. This results in massive code
duplication, so I'd like to see if there is another way to
achieve this that does not have this cost, but preferably still
allows compile time type safety (the alternative seems to be to
panic if RevComp is called on a protein). NB: unlike other
manipulations, RevComp is a method for performance reasons - it
avoids calling At/Set on each position.
3. Because sequences can be single or multiple, indexing into
sequences is via the seq.Position type. This is wasteful for the
majority of use cases which are expected to be single sequences
but required for a uniform interface across all sequences. Is
this uniformity achievable without this?
4. At/Set methods return/take a quality-qualified letter
(alphabet.QLetter) for all sequence types, even if per-position
qualities are not available (a default quality is returned in
these cases). Again this is for interface uniformity.
5. I'm aware that some of my interface definitions are fairly big.
What is the view on this?
6. The types I'm least happy with are the two row-based
multiple aligned sequence types (.../*multi.Multi). Ideally
these should be able to be used in the place of other sequence
types, but if that is the case multiple sequences can be stored
in multiple sequences, which while not necessarily incorrect
leads to unwanted complexity.
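To make points 1 and 2 concrete, here is a minimal sketch of one
direction that might reduce the duplication: shared storage and At/Set
live in an embedded base type, and only the nucleic acid type adds
RevComp, so calling RevComp on a protein fails at compile time. This is
not what the code in [1] does. Apart from Letter, At, Set and RevComp,
every name here (base, Nucleic, Protein, Sequence, RevComper,
Complement, toLetters) is hypothetical, and Position and QLetter are
left out to keep it short.

```go
package main

import "fmt"

// Letter is a distinct named byte type, in the spirit of alphabet.Letter.
type Letter byte

// Complement is a hypothetical method: it returns the DNA complement of l,
// or l unchanged if l is not one of A, C, G, T in either case.
func (l Letter) Complement() Letter {
	switch l {
	case 'A':
		return 'T'
	case 'T':
		return 'A'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'a':
		return 't'
	case 't':
		return 'a'
	case 'c':
		return 'g'
	case 'g':
		return 'c'
	}
	return l
}

// base (hypothetical) holds the storage and methods common to all
// single sequence types, so they are written once.
type base struct {
	letters []Letter
}

func (b *base) Len() int            { return len(b.letters) }
func (b *base) At(i int) Letter     { return b.letters[i] }
func (b *base) Set(i int, l Letter) { b.letters[i] = l }

// String converts the letters back to a string for display.
func (b *base) String() string {
	bs := make([]byte, len(b.letters))
	for i, l := range b.letters {
		bs[i] = byte(l)
	}
	return string(bs)
}

// Protein embeds base and adds nothing, so it has no RevComp method.
type Protein struct{ base }

// Nucleic embeds base and adds RevComp.
type Nucleic struct{ base }

// RevComp reverse complements the sequence in place, working on the
// slice directly rather than going through At/Set.
func (n *Nucleic) RevComp() {
	s := n.letters
	for i, j := 0, len(s)-1; i < j; i, j = i+1, j-1 {
		s[i], s[j] = s[j].Complement(), s[i].Complement()
	}
	if len(s)%2 == 1 {
		mid := len(s) / 2
		s[mid] = s[mid].Complement()
	}
}

// Sequence is a (hypothetical) small generic interface.
type Sequence interface {
	Len() int
	At(int) Letter
	Set(int, Letter)
}

// RevComper is satisfied only by types that can be reverse complemented.
type RevComper interface {
	Sequence
	RevComp()
}

// Compile-time checks. Replacing *Nucleic with *Protein in the second
// line would be a compile error, because *Protein has no RevComp.
var (
	_ Sequence  = (*Protein)(nil)
	_ RevComper = (*Nucleic)(nil)
)

// toLetters converts a string to a slice of Letter.
func toLetters(s string) []Letter {
	ls := make([]Letter, len(s))
	for i := 0; i < len(s); i++ {
		ls[i] = Letter(s[i])
	}
	return ls
}

func main() {
	n := &Nucleic{base{toLetters("ACGTT")}}
	n.RevComp()
	fmt.Println(n) // AACGT
}
```

The cost of this approach is that generic code holding only a Sequence
still needs a type assertion to RevComper before it can reverse
complement, so the check moves back to runtime in that case.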
Generally: Documentation needs improvement and tests are barely/almost
adequate. Any other input is welcome.
Sorry for the length of this post.
Dan
[1] http://code.google.com/p/biogo/source/browse/exp?name=development