Remove duplicated sequences from an alignment
The omit_duplicated app removes redundant sequences from a sequence collection (aligned or unaligned).
Let’s create sample data with duplicated sequences.
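A minimal sketch of such data, assuming five short aligned DNA sequences in which c and d are exact duplicates. The names a to e and the sequences are illustrative only. The sketch uses a plain dict rather than a cogent3 object, and the helper identical_sets is hypothetical, written here only to make the duplicates visible.

```python
from collections import defaultdict

# Hypothetical aligned DNA sequences; "c" and "d" are exact duplicates.
data = {
    "a": "ACGT",
    "b": "ACG-",  # same as "a" except for a trailing gap
    "c": "ACGG",  # duplicate of "d"
    "d": "ACGG",  # duplicate of "c"
    "e": "AGTC",  # unique
}


def identical_sets(seqs):
    """Return lists of names whose sequences are exactly identical."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq].append(name)
    # only groups with more than one member contain duplicates
    return [names for names in groups.values() if len(names) > 1]


print(identical_sets(data))  # [['c', 'd']]
```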
Creating the omit_duplicated app with the argument choose="longest" retains, from each set of duplicated sequences, the one with the fewest gaps and ambiguous characters. In the above example, only one of c and d will be retained.
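The selection rule can be sketched in plain Python. This is an illustration of the idea, not the cogent3 implementation: the function names, the sample dict, and the tie-breaking rule (first name wins) are all assumptions made for this example.

```python
from collections import defaultdict

CANONICAL = set("ACGT")  # anything else counts as a gap or ambiguity code


def informative_length(seq):
    """Number of positions holding an unambiguous base."""
    return sum(ch in CANONICAL for ch in seq.upper())


def keep_longest(seqs):
    """Keep one representative per set of identical sequences."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq].append(name)

    kept = {}
    for names in groups.values():
        # max() returns the first best name, so ties go to the earliest entry
        best = max(names, key=lambda n: informative_length(seqs[n]))
        kept[best] = seqs[best]
    return kept


# Hypothetical data in which "c" and "d" are duplicates
data = {"a": "ACGT", "b": "ACG-", "c": "ACGG", "d": "ACGG", "e": "AGTC"}
print(sorted(keep_longest(data)))  # ['a', 'b', 'c', 'e']
```

Because c and d are identical here, they tie on the criterion, and which of the two survives is an artifact of this sketch. With cogent3 itself, the corresponding app would typically be created with get_app("omit_duplicated", moltype="dna", choose="longest") and then called on the sequence collection.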
Creating the omit_duplicated app with the argument choose=None drops every member of a duplicated set, so only sequences that were unique to begin with are retained.
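The effect of dropping all duplicates can be sketched as follows. Again this is a plain-Python illustration under assumed data, not the library's code; keep_unique_only is a hypothetical helper.

```python
from collections import Counter


def keep_unique_only(seqs):
    """Drop every sequence that occurs more than once."""
    counts = Counter(seqs.values())
    return {name: seq for name, seq in seqs.items() if counts[seq] == 1}


# Hypothetical data in which "c" and "d" are duplicates
data = {"a": "ACGT", "b": "ACG-", "c": "ACGG", "d": "ACGG", "e": "AGTC"}
print(sorted(keep_unique_only(data)))  # ['a', 'b', 'e']
```

In contrast to choosing a representative, neither c nor d survives here.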
The mask_degen argument specifies how to treat matches between sequences with degenerate characters.
Let’s create sample data that has a DNA ambiguity code.
Since “Y” denotes a pyrimidine, meaning the site can be either “C” or “T”, s1 matches s2 and one of them will be removed.
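One way to picture such a match is sketched below, using assumed sample data in which s2 differs from s1 only by a “Y”. The sketch treats two sequences as matching when the bases allowed at every aligned position overlap. That is one plausible reading of "match" chosen for illustration; the library's own rule may differ (for instance, it may simply ignore positions holding a degenerate character). The helpers compatible and num_ambiguous are hypothetical.

```python
# IUPAC DNA codes mapped to the bases they can stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(seq1, seq2):
    """True if every aligned position shares at least one possible base.

    Gap characters are not handled in this sketch.
    """
    if len(seq1) != len(seq2):
        return False
    return all(set(IUPAC[x]) & set(IUPAC[y]) for x, y in zip(seq1, seq2))


def num_ambiguous(seq):
    """Count characters that are not one of the four canonical bases."""
    return sum(ch not in "ACGT" for ch in seq)


# Hypothetical data: "s2" matches "s1" once "Y" is read as C or T
data = {"s1": "ATCG", "s2": "ATYG", "s3": "GGTA"}

print(compatible(data["s1"], data["s2"]))  # True
print(compatible(data["s1"], data["s3"]))  # False

# Of the matching pair, prefer the sequence with fewer ambiguous characters
best = min(("s1", "s2"), key=lambda name: num_ambiguous(data[name]))
print(best)  # s1
```

With cogent3, the analogous app would typically be created with get_app("omit_duplicated", mask_degen=True, choose="longest", moltype="dna").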