Features

This guide provides instructions on creating, querying, and utilising features to manipulate biological sequence data.

How to create a custom Feature

Via add_feature

We can dynamically create a Feature via the add_feature() method, providing the key information about the feature.

The new feature will be added to the annotation_db attribute of the Sequence and/or Alignment. A Feature instance will be returned.

On a Sequence

We use add_feature to add a Feature to a Sequence, providing the biotype, name/id, and a list of start and stop indices.

A feature can also be added as a series of non-overlapping genomic segments:

On a Sequence within an Alignment

We use add_feature and provide a value for seqid to add a feature to a sequence within an Alignment or SequenceCollection.

Note the difference between the value provided to spans, and the value of map in the returned Feature… the resulting feature is in alignment coordinates!

The Feature specifies that seqid="from 'seq1'", indicating that it “belongs” to seq1.

On an Alignment

We use add_feature to add a Feature to an Alignment.

The Feature specifies that seqid=None, indicating that it belongs to the alignment

Via an AnnotationDb

We can use add_feature to add a feature directly into an AnnotationDb, and assign it to the annotation_db attribute of a Sequence or Alignment. For extensive documentation on handling features directly via an AnnotationDb see Annotation Databases.

How to load bulk Features from a File

Typically, we want to load bulk features from a genomic annotation file, such as a GFF or Genbank file. For the following examples, we will use Caenorhabditis elegans chromosome I.

Note

See the list of Data Files Used in the Documentation to download the data used in the following examples.

To load features from a genomic annotation file along with the corresponding sequence, we can use the load_seq function. The features are stored in a AnnotationDb and assigned to the annotation_db attribute of the sequence.

From a Genbank file

How to load features and sequence data

To load the sequence and all 40,578 features from C. elegans Chromosome 1, we use the load_seq function ⚡️

The features are stored in the annotation_db attribute.

Now that the Sequence is annotated, we can query it for specific features. For more details on querying, skip to Querying for Features.

From a GFF file

How to load features and sequence data

Given a FASTA file with sequence data and a GFF file with annotations, we can use load_seq to load both the sequence and its corresponding features.

Warning

total_records=0? 🤔 This is because load_seq assumes the sequence names match exactly between files! If the names are different, you need to provide function to the label_to_name argument.

Because the names above are different, for FASTA its "I dna:chromosome chromosome:WBcel235:I:1:15072434:1 REF" and for GFF its "I", we need a label_to_name argument. We provide a lambda function.

How to load features and associate them with an existing sequence

We can use the annotate_from_gff() method to associate the features from a GFF file with the existing Sequence.

If we know that the features and the sequence share the same coordinate space, then we only need to provide the path to the annotation file.

If the feature coordinates precede the sequence, for example, a sequence corresponds to a gene that starts 600 base pairs from the beginning of chromosome, but the annotation file is for the entire chromosome, we need to provide an offset to the annotate_from_gff() method.

How to load features and associate them with sequences in an existing alignment

To annotate one or more Sequence in an Alignment, call annotate_from_gff() on the Alignment instance, passing in the path to the GFF annotation file and a list of sequence names to annotate to the seq_ids argument.

For example, first we load an alignment of the brca1 gene in primates.

Next, we annotate with a GFF file that contains features specific to the human gene.

Note that the AnnotationDb is accessible via the Alignment (above) and Sequence (below) attribute.

Note

Alignment.annotate_from_gff() does not support setting an offset. If you need to set the offset for a sequence within an alignment, you can do so directly using the Sequence.annotation_offset attribute.

How to query a Sequence or Alignment for Features

The method get_features yields all features that match the given arguments. You can provide conditions for the name, biotype, and start/stop location of a feature.

Querying a Sequence for Features

Querying via Feature Name

We can search for a gene given its name (AKA its unique ID). For example we can search for a gene with name="WBGene00021661".

Querying via Feature Biotype

We can search for features with a certain biotype, for example, all coding sequences (CDS):

We can also provide combinations of argument to search, for example, all CDS with a given name:

Querying via region of interest

We can provide start and end arguments to get_features() and all features within the coordinates will be returned.

We can again provide a combination of conditions, for example, querying for all features with biotype="mRNA" within a certain range, and returning the first match.

Querying a Sequence (via an Alignment) for Features

To query for a particular Sequence within an Alignment or SequenceCollection, we can use get_features as shown above for a Sequence, but providing the seqid for the sequence of interest.

For example, given an alignment of primates, we can search for features that are just on the human sequence as follows:

Note that seqid="from'Human'" indicated this feature belongs to this particular sequence.

Querying an Alignment for Features

Querying for features on any Sequence in an Alignment

todo: on_alignment=False and dont provide seqid

Note there are features from both Rhesus, which we just added, and Human, which we annotated above

Querying for features on an Alignment

todo: on_alignment=True and dont provide seqid

Using add_feature we add a feature to the brca1 alignment we have been using above, by specifying on_alignment=True this feature will be on the Alignment.

To query for features on the alignment, we use get_features, again specifying on_alignment=True.

Note how even though we annotated the Human and Rhesus sequences in the above examples, only the pseudogene we added to Alignment is returned by this query.

Querying features that span gaps in alignments

If you query for a Feature from a Sequence (i.e. the feature is in sequence coordinates), its alignment coordinates may be discontinuous. This will lead to an omission of data from other sequences!

Note

In the above, the T in sequence Y opposite the gap is missing since this approach only returns positions directly corresponding to the feature.

To include the gaps, use the allow_gaps argument

Examples using the methods available on Features

A Feature has many methods to manipulate the sequence or alignment that they are bound to.

How to slice a Sequence or Alignment by its features

Given a Feature, we can directly slice its parent sequence to return its sequence information

Note

This only works for the Sequence that the Feature “belongs” to.

We can also achieve this via get_slice()

How to display the features of a Sequence

We can display all the features on a sequence using .get_drawable(), or a subset of biotypes. We do this for only the first 50,000 base pairs. The plotly figure returned, as displayed below, is interactive! 🤩 Zoom in on the dark vertical lines in the big gene and you will see small genes on the opposite strand. Hover your cursor over each block and the gene name is displayed.

Note

If a feature extends beyond the sequence region selected, its name includes the text “(incomplete)”.

How to find the coordinates of a feature

These are useful for doing custom things, e.g. if the introns are not annotated for a gene, we can generate the introns from the coordinates of the exons as follows:

We generate the intron coordinates from the second element of the first tuple, and the first element of the second tuple and so on:

We can then add the introns as a Feature to the sequence!

How to take the union of features

We can create a feature that is the union of all coding sequence.

How to get the shadow of a Feature

The “shadow” of a feature is a new feature containing all of the sequence except the feature!

How to use the shadow of a Feature to return the intergenic sequence

We first need to query our sequence for all genes. Using the union() method we combine all genes into a single feature.

Taking the “shadow” of all genes will return the intergenic region as a valid Feature

We can slice the sequence by this new Feature to return the complete intergenic sequence!

How to mask annotated regions

Masking annotated regions on a sequence

We can mask a certain annotation using with_masked_annotations()

The above sequence could then have positions filtered so no position with the ambiguous character ‘?’ was present.

Masking annotated regions on an Alignment

We can mask exons on an alignment.

After a reverse complement operation

these persist.

How to find the “children” of a Feature

To find the “children” of a feature, we can use the get_children() method. A “child” refers to a feature that is nested within or contained by another “parent” feature. For example, a child feature could be an exon contained within a gene or a CDS contained within a transcript.

This method returns a generator that yields all the child features of the specified feature.

For example, let’s find the children of the gene “WBGene00021661”:

How to find the “parent” of a Feature

To find the “parent” of a feature, we can use the get_parent() method, which achieves the inverse of the above method.

For example, we can use the first “child” we returned above, "transcript:Y74C9A.2a.3", to find the original parent gene!

How to copy features

We can copy features onto sequences with the same name. Note that the AnnotationDb instance bound to the alignment and its member sequences is the same.

Warning

Despite this, it is possible for the attributes to get out-of-sync. So, any copy annotations should be done using alignment.copy_annotations(), not alignment.get_seq("x").copy_annotations().

However, if the feature lies outside the sequence being copied to, you get a lost span

How to get the positions of a feature as one span

as_one_span unifies features with discontinuous alignment coordinates and returns positions spanned by a feature, including gaps.

Behaviour of annotations on nucleic acid sequences

Reverse complementing a sequence does not reverse features. Features are considered to have strand specific meaning (e.g. CDS, exons) and so they retain the reference to the frame for which they were defined.