An open reading frame (ORF) is a stretch of DNA that begins with a start codon and ends with a stop codon, with no interrupting stop codons in between. ORFs are the primary candidates for protein-coding genes, making ORF analysis one of the first steps in annotating a new sequence.
What Are Reading Frames?
DNA is read in triplets called codons. Because the reading can start at position 1, 2, or 3 of a sequence, there are three possible reading frames on each strand. Since DNA is double-stranded, there are six reading frames in total: three on the forward strand (+1, +2, +3) and three on the reverse complement strand (-1, -2, -3).
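As a rough illustration of the three forward frames, the sketch below splits a sequence into codons starting at offsets 0, 1, and 2. The function name `codons_in_frame` and the example sequence are made up for this illustration; trailing bases that do not fill a complete codon are dropped.

```python
def codons_in_frame(seq: str, offset: int) -> list[str]:
    """Split seq into complete codons, starting at a 0-based offset (0, 1, or 2)."""
    seq = seq.upper()
    # Stop at len(seq) - 2 so that only full triplets are returned.
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


seq = "ATGGCCTAAGT"  # hypothetical example sequence
for offset in range(3):
    print(f"frame +{offset + 1}:", codons_in_frame(seq, offset))
# frame +1: ['ATG', 'GCC', 'TAA']
# frame +2: ['TGG', 'CCT', 'AAG']
# frame +3: ['GGC', 'CTA', 'AGT']
```

The same function applied to the reverse complement would give the three reverse frames.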
A true gene will typically have a long ORF in one of these frames. Because 3 of the 64 possible codons are stops, random sequence with equal base frequencies contains a stop codon about once every 21 codons on average, so a long ORF is statistically unlikely to occur by chance.
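To put rough numbers on this, the sketch below assumes that all 64 codons are equally likely and independent. Real genomes deviate from this, particularly at extreme GC content, so treat the results as order-of-magnitude estimates. The names `STOP_FRACTION` and `prob_stop_free` are illustrative.

```python
# Assumption: uniform, independent codons, so 3 of 64 codons are stops.
STOP_FRACTION = 3 / 64


def prob_stop_free(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - STOP_FRACTION) ** n_codons


print(f"mean spacing between stops: {1 / STOP_FRACTION:.1f} codons")
for n in (30, 100, 300):
    print(f"P(no stop in {n} codons) = {prob_stop_free(n):.2e}")
```

Under this model, a stop-free run of 100 codons occurs by chance a little under 1% of the time, which is why length is a useful first filter.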
Start and Stop Codons
The most common start codon is ATG, which codes for methionine. It is the start signal for the vast majority of genes in most organisms. Some bacteria also use GTG or TTG as alternative start codons in certain contexts.
There are three stop codons:
- TAA (ochre)
- TAG (amber)
- TGA (opal or umber)
An ORF is defined here as the sequence from an ATG to the next in-frame stop codon. The length of the ORF is reported in base pairs, codons, or amino acids.
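The ATG-to-stop definition can be sketched as a single pass over one reading frame. This is a minimal illustration, not a full ORF finder: the function name `find_orfs_in_frame` is hypothetical, coordinates are 0-based with an exclusive end that includes the stop codon, and only the first ATG after each stop is reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs_in_frame(seq: str, offset: int = 0) -> list[tuple[int, int]]:
    """Return (start, end) pairs for ATG-to-stop ORFs in one frame.

    start is the index of the A in ATG; end is one past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    start = None  # position of the current open ATG, if any
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start, i + 3))
            start = None
    return orfs


print(find_orfs_in_frame("ATGAAATGAATGCCCTAA"))  # [(0, 9), (9, 18)]
```

An ATG with no downstream in-frame stop is not reported, since by the definition above it does not form a complete ORF.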
How to Identify Meaningful ORFs
Not every ATG-to-stop sequence is a real gene. To filter out short random ORFs, a minimum length threshold is applied. Common thresholds are:
- 100 codons (300 bp) for prokaryotic gene prediction
- 30 codons (90 bp) for initial screening in eukaryotes
Longer ORFs are more likely to represent real genes. In bacteria, most protein-coding genes are over 100 codons. In eukaryotes, genes are interrupted by introns, so genomic ORF analysis is less straightforward than in cDNA or mRNA sequences.
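A length filter is straightforward once ORF coordinates are available. The sketch below assumes ORFs are given as (start, end) pairs in base positions, with the end exclusive and including the stop codon, and it assumes the threshold counts coding codons only (excluding the stop). Both conventions vary between tools. The coordinates shown are invented for illustration.

```python
def orf_length_codons(start: int, end: int) -> int:
    """Number of coding codons in an ORF, not counting the stop codon."""
    return (end - start) // 3 - 1


def filter_orfs(orfs, min_codons=100):
    """Keep only ORFs with at least min_codons coding codons."""
    return [(s, e) for s, e in orfs if orf_length_codons(s, e) >= min_codons]


# Hypothetical candidates with 30, 100, and 400 coding codons.
candidates = [(0, 93), (120, 423), (500, 1703)]
print(filter_orfs(candidates, min_codons=100))  # [(120, 423), (500, 1703)]
print(filter_orfs(candidates, min_codons=30))   # all three pass
```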
Reading the Six Frames
To analyze all six reading frames, you need to:
- Take the original sequence and read it in frames +1, +2, and +3
- Generate the reverse complement of the sequence
- Read the reverse complement in frames +1, +2, and +3 (which correspond to -1, -2, -3 on the original)
In each frame, scan for ATG codons and then find the next in-frame stop codon. The region between them is an ORF candidate.
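The steps above can be sketched in a few lines. This is an illustrative outline under some assumptions: the sequence contains only A, C, G, and T; the function names are made up; and reverse-strand ORFs are mapped back to forward-strand coordinates (0-based, end exclusive). The label -1 here simply means frame 1 of the reverse complement; some tools number reverse frames differently.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def scan_frame(seq: str, offset: int) -> list[tuple[int, int]]:
    """ATG-to-stop ORFs in one frame; end is one past the stop codon."""
    orfs, start = [], None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start, i + 3))
            start = None
    return orfs


def six_frame_orfs(seq: str) -> list[tuple[str, int, int]]:
    """Return (frame_label, start, end) in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in scan_frame(s, offset):
                if strand == "-":
                    # Map reverse-complement coordinates back to the original.
                    start, end = n - end, n - start
                results.append((f"{strand}{offset + 1}", start, end))
    return results


print(six_frame_orfs("TTAGGGCAT"))  # [('-1', 0, 9)]
```

In the example, the forward strand has no ATG at all, but its reverse complement (ATGCCCTAA) contains a complete ORF.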
Translating an ORF
Once you have identified an ORF, you can translate it to a protein sequence using the standard genetic code. Each triplet codon maps to one of the 20 amino acids or a stop signal. For example, the ORF ATG GCT AAA TGA translates to M A K *.
The asterisk (*) represents the stop codon. The protein sequence starts at M and ends just before the stop.
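A compact way to sketch translation is to build the standard genetic code table from the conventional T, C, A, G ordering and look up each codon in turn. The names below are illustrative, and codons containing anything other than A, C, G, or T are rendered as X, which is an assumption of this sketch rather than a fixed rule.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str, to_stop: bool = False) -> str:
    """Translate a DNA ORF codon by codon; '*' marks a stop codon."""
    orf = orf.upper()
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")  # X for unrecognized codons
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTAAATGA"))                # MAK*
print(translate("ATGGCTAAATGA", to_stop=True))  # MAK
```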
What to Do with ORF Results
After identifying ORFs, the next steps typically include:
- BLAST searching the translated protein sequence against known databases
- Checking for conserved domains using tools like Pfam or InterPro
- Comparing ORF positions with known gene annotations if available
- Checking for a Kozak consensus sequence upstream of the ATG in eukaryotes
- Looking for a Shine-Dalgarno sequence upstream of the ATG in prokaryotes
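As a very rough illustration of the last check, the sketch below looks for an exact match to a Shine-Dalgarno-like core motif shortly upstream of a start codon. The motif GGAGG, the 20-base window, and the function name are all assumptions chosen for illustration; dedicated gene finders score partial matches and spacing rather than requiring an exact string.

```python
def has_sd_like_motif(seq: str, atg_pos: int,
                      motif: str = "GGAGG", window: int = 20) -> bool:
    """True if motif occurs within `window` bases upstream of atg_pos.

    Assumption: an exact substring match is a good enough first look.
    """
    upstream = seq[max(0, atg_pos - window):atg_pos].upper()
    return motif in upstream


# Hypothetical sequence with the ATG at index 14.
seq = "TTAGGAGGTTTTTTATGAAATAA"
print(has_sd_like_motif(seq, atg_pos=14))  # True
```

A positive hit adds weak support for a prokaryotic start site; the absence of a hit does not rule one out.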
Common Pitfalls
A few things to watch out for when interpreting ORF results:
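- Nested ORFs: A long ORF may contain shorter ORFs that begin at downstream in-frame ATGs and share its stop codon. The longest of these is usually the most relevant, although the true start site is not always the first ATG.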
- Overlapping ORFs: In some viruses and bacteria, two genes overlap in different reading frames. This is rare in eukaryotes.
- Pseudogenes: A sequence may resemble a known gene yet contain premature stop codons or frameshifts caused by mutations, which break the ORF into short fragments. Such sequences are typically non-functional pseudogenes.
- Non-ATG starts: Some genes use alternative start codons. If an expected gene is not detected, try allowing alternative start codons such as GTG or TTG.
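Two of these pitfalls, nested ORFs and non-ATG starts, can be explored with a small variation on the basic scan. The sketch below records every in-frame start codon that leads to each stop, and lets the caller widen the set of accepted start codons. The function name and return shape are illustrative choices, not a standard interface.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_with_nested(seq: str, offset: int = 0, starts=("ATG",)):
    """Map each ORF end (one past the stop) to all in-frame start positions.

    The first start in each list gives the longest ORF; later ones are nested.
    """
    seq = seq.upper()
    pending = []  # start positions seen since the last stop codon
    result = {}
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in starts:
            pending.append(i)
        elif codon in STOP_CODONS:
            if pending:
                result[i + 3] = pending
            pending = []
    return result


print(orfs_with_nested("ATGAAAATGCCCTAA"))  # {15: [0, 6]}
# Relaxing the start codon requirement picks up a GTG start:
print(orfs_with_nested("GTGAAAATGCCCTAA"))                                # {15: [6]}
print(orfs_with_nested("GTGAAAATGCCCTAA", starts=("ATG", "GTG", "TTG")))  # {15: [0, 6]}
```

Allowing GTG and TTG increases the number of candidate starts considerably, so it is usually paired with a length threshold or other supporting evidence.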
ORF analysis is a fast and effective first pass at gene prediction. Combined with homology searching and functional annotation, it gives you a solid starting point for understanding any new DNA sequence.
Find ORFs in Your Sequence
Use our free ORF finder to scan all 6 reading frames, translate to protein, and view results with codon-by-codon highlighting.