Implement partitioned FASTA #135

Open
recursion-ninja opened this issue May 18, 2019 · 0 comments

Details:

A partitioned FASTA file includes one or more # characters in each sequence. The sequence of every taxon in the file must contain exactly the same number of # characters.

Each # breaks the dynamic character into two separate characters, which will be aligned and optimized independently, so a sequence with n # characters yields n + 1 dynamic characters. Because the dynamic characters are all in the same file, they will default to being in the same block for network optimizations.

For example, the following FASTA file:

> Alpha
ACCT#GATT#CATTAG
> Bravo
CCT#GAT#CATAG
> Charlie
ACC#ATTT#CATTAG

Would return the following Map String [String]:

Map.fromList
  [ ("Alpha"  , [ "ACCT", "GATT", "CATTAG" ])
  , ("Bravo"  , [ "CCT" , "GAT" , "CATAG"  ])
  , ("Charlie", [ "ACC" , "ATTT", "CATTAG" ])
  ]

Which represents each taxon having 3 dynamic characters in a single block.
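
As a rough sketch of the splitting step, assuming the parser has already produced a list of (taxon name, raw sequence) pairs. The names splitPartitions and partitionFasta are hypothetical and not part of the existing parser:

```haskell
import qualified Data.Map as Map
import           Data.Map (Map)

-- | Split a raw sequence on every '#'.
-- Assumption: empty partitions are kept as empty strings rather than dropped.
splitPartitions :: String -> [String]
splitPartitions s = case break (== '#') s of
    (chunk, [])       -> [chunk]
    (chunk, _ : rest) -> chunk : splitPartitions rest

-- | Apply the split to every (taxon, sequence) pair.
-- The input shape is an assumption about what the FASTA parser yields.
partitionFasta :: [(String, String)] -> Map String [String]
partitionFasta = Map.fromList . fmap (fmap splitPartitions)
```

With this definition a sequence containing n # characters always yields n + 1 partitions, so a leading, trailing, or doubled # shows up as an empty string in the list instead of changing the partition count.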

We should also decide if we will allow empty partitions in a FASTA file.

For example, would the following file be allowed?

> Alpha
ACT#ATT#CAT
> Bravo
ACT##CAT
> Charlie
ACT#ATT#
> Delta
#ATT#CAT

In the above we can see that Bravo, Charlie, and Delta all have empty partitions.

Implementation:

The FASTA parser already exists in a very usable state and accepts # characters, though it does not currently interpret them in the special way described above.

We should perform a post-parsing pass over the FASTA data. If any sequence has one or more # characters present, we will enforce that all sequences have the same number of # characters or raise a parse error.

We should make the parse error as human readable as possible. For example, if all but one sequence had four # characters and the remaining sequence had a different number, the parse error should focus the user's attention on only the outlier sequence.
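
One possible shape for that post-parsing check, sketched under the assumption that the most common # count is treated as the expected one and every sequence that deviates from it is reported. The PartitionError type and the functions checkPartitions and renderError are hypothetical names, and the input is again assumed to be (taxon name, raw sequence) pairs:

```haskell
import           Data.List (sortOn)
import qualified Data.Map as Map

-- | Hypothetical error type: the expected '#' count and each outlier taxon
-- paired with the count actually found in its sequence.
data PartitionError = PartitionError
    { expectedCount :: Int
    , outliers      :: [(String, Int)]
    } deriving (Eq, Show)

countHashes :: String -> Int
countHashes = length . filter (== '#')

-- | Accept the input only if every sequence has the same number of '#'.
checkPartitions :: [(String, String)] -> Either PartitionError [(String, String)]
checkPartitions taxa
    | null bad  = Right taxa
    | otherwise = Left (PartitionError expected bad)
  where
    counts   = [ (name, countHashes sq) | (name, sq) <- taxa ]
    -- How many sequences have each '#' count.
    tally    = Map.fromListWith (+) [ (c, 1 :: Int) | (_, c) <- counts ]
    -- Assumption: the most frequent count is the intended one;
    -- ties are resolved toward the smaller count.
    expected = case sortOn (\(c, n) -> (negate n, c)) (Map.toList tally) of
                 []         -> 0
                 (c, _) : _ -> c
    bad      = [ (name, c) | (name, c) <- counts, c /= expected ]

-- | Render a message that names only the outlier sequences.
renderError :: PartitionError -> String
renderError (PartitionError e bad) = unlines $
    ("Expected " ++ show e ++ " '#' character(s) in every sequence, but:")
      : [ "  " ++ name ++ " has " ++ show c | (name, c) <- bad ]
```

Taking the mode as the expected count is only one heuristic. When the counts are evenly split there is no clear outlier, and a real implementation might prefer to list every distinct count instead.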
