Counting DNA Nucleotides
This is the very first problem on the website in the “Bioinformatics Stronghold” category. As expected, we are starting fairly light.
Problem
We receive a DNA string consisting of the letters A, C, G and T and are asked to count how many times each letter occurs.
The sample dataset looks as follows:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
with the solution for it being 20 12 17 21 (the counts of A, C, G and T, in that order).
Solution
Since the dataset comes in the form of a file, we first need to read it
data = open("rosalind_dna.txt").read()
then remove any extraneous whitespace
data = data.strip()
and finally we can use a defaultdict as a simple structure to store character counts.
from collections import defaultdict
counts = defaultdict(int)
for ch in data:
    counts[ch] += 1
print(f"{counts['A']} {counts['C']} {counts['G']} {counts['T']}")
The defaultdict uses int as the default factory for missing keys, and since int() returns 0, we can use += on a key that has not been seen yet.
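To check the approach without a data file, here is a minimal sketch that wraps the same loop in a function and runs it on the sample dataset from the problem. The function name count_nucleotides is made up for this example and is not part of the original solution.

```python
from collections import defaultdict


def count_nucleotides(dna):
    """Return the counts of A, C, G and T (in that order) as a tuple.

    Hypothetical helper for illustration only.
    """
    counts = defaultdict(int)  # missing keys start at int() == 0
    for ch in dna:
        counts[ch] += 1
    return counts["A"], counts["C"], counts["G"], counts["T"]


sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(*count_nucleotides(sample))  # prints: 20 12 17 21
```

Because of the default value, a letter that never appears in the input is simply reported as 0 instead of raising a KeyError.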
Protip: Instead of a defaultdict, you can use Python's special Counter collection, which is made for, well, counting things.
The code could be simplified to:
from collections import Counter
data = open("rosalind_dna.txt").read()
data = data.strip()
cnt = Counter(data)
print(" ".join([str(cnt[k]) for k in sorted(cnt.keys())]))
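One thing to keep in mind: printing the sorted keys gives the expected order only because A, C, G and T happen to sort alphabetically, and it produces four numbers only if all four letters occur in the input. A slightly more defensive sketch looks each letter up explicitly; the helper name format_counts is an assumption made for this example.

```python
from collections import Counter


def format_counts(dna):
    """Return the A, C, G, T counts as a space-separated string.

    Hypothetical helper for illustration only.
    """
    cnt = Counter(dna)
    # A Counter returns 0 for keys it has never seen, so a missing
    # letter still shows up in the output.
    return " ".join(str(cnt[base]) for base in "ACGT")


sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(format_counts(sample))  # prints: 20 12 17 21
```

This version also ignores any unexpected characters, such as a stray newline inside the data, rather than printing a count for them.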