DNA Outbreak Investigation Using Machine Learning

Codersarts AI
Sep 18, 2020

You are given a data set consisting of DNA sequences (the file is available here), all of the same length. Each DNA sequence is a string of characters from the alphabet 'A', 'C', 'T', 'G', and it represents a particular viral strain sampled from an infected individual. Your goal is to write code that helps identify transmission clusters corresponding to outbreaks.

Treat each sequence as a feature vector and each character as a feature. The data set is stored as a FASTA file, which is essentially a text file of the following form:

>Name of Sequence1
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
>Name of Sequence2
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
>Name of Sequence3
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
…..

Each line starting with the '>' symbol contains the name of a sequence, and the sequence itself follows on the next line.
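As an illustration of the format described above, here is a minimal sketch of a FASTA reader in plain Python. The function names `parse_fasta` and `read_fasta` are hypothetical, chosen for this example only. The sketch assumes sequence names are unique (a repeated name would overwrite the earlier record) and tolerates blank lines and sequences wrapped over several lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    sequences: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # Header line: everything after '>' is the sequence name.
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            # Sequence line: append, in case the sequence spans several lines.
            sequences[name] += line.upper()
    return sequences


def read_fasta(path: str) -> Dict[str, str]:
    """Read a FASTA file from disk (the path is supplied by the caller)."""
    with open(path) as handle:
        return parse_fasta(handle)
```

Since Python dictionaries preserve insertion order, `list(records.values())` returns the sequences in file order, which is convenient when building a distance matrix later.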

You may proceed as follows:

1) Read the sequences from the file.
2) Calculate pairwise distances between the sequences. Use the Hamming distance, which is the number of positions at which two sequences differ (see https://en.wikipedia.org/wiki/Hamming_distance).
3) Project the sequences into 2-D space using Multidimensional Scaling (MDS) based on the Hamming distance matrix.
4) Plot the resulting 2-D data points. Estimate the number of clusters K by visual inspection.
5) Use the k-means algorithm to cluster the 2-D data points.
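Step 2 can be sketched as follows. This is a plain-Python illustration, not a required implementation. The names `hamming_distance` and `distance_matrix` are hypothetical, and the code assumes all sequences have equal length, as the task states.

```python
from typing import List, Sequence


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: Sequence[str]) -> List[List[int]]:
    """Build the symmetric matrix of pairwise Hamming distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so compute each pair once.
            d = hamming_distance(seqs[i], seqs[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

The double loop is quadratic in the number of sequences. For large data sets a vectorized approach may be preferable, but the simple version keeps the definition visible.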

You may use library functions to read the data from the file and to perform MDS. For multidimensional scaling in Python, see e.g. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
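For the MDS step, a minimal sketch using the scikit-learn class linked above might look like this. The helper name `project_2d` is hypothetical. The sketch assumes a scikit-learn version in which `MDS` accepts `dissimilarity="precomputed"`, since newer releases may emit a deprecation warning for this parameter. The `n_init` and `random_state` values are arbitrary choices, fixed here only to make runs repeatable.

```python
import numpy as np
from sklearn.manifold import MDS


def project_2d(dist, seed=0):
    """Embed a precomputed distance matrix into 2-D using MDS."""
    mds = MDS(
        n_components=2,
        dissimilarity="precomputed",  # the input is already a distance matrix
        n_init=4,
        random_state=seed,
    )
    # fit_transform returns an array of shape (n_sequences, 2).
    return mds.fit_transform(np.asarray(dist, dtype=float))
```

Passing the Hamming distance matrix as a precomputed dissimilarity avoids having to encode the characters numerically first.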

K-means clustering should be implemented from scratch. Your submission should contain:

The code of your script.
Visualization plots of the MDS projection, with each cluster highlighted in a different color.
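Since k-means has to be written from scratch, the following sketch shows one possible shape for it, using Lloyd's algorithm on 2-D points. It is an illustration under stated assumptions, not a reference solution. The function name `kmeans`, the random choice of initial centroids, and the `max_iter` and `seed` defaults are all choices made for this example.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100,
           seed: int = 0) -> Tuple[List[int], List[Point]]:
    """Cluster 2-D points with Lloyd's algorithm; return (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = [(float(p[0]), float(p[1])) for p in rng.sample(list(points), k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance; ties go to the lowest index).
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids
```

The returned `labels` list can be used directly as the color argument of a scatter plot of the MDS coordinates. Because the result depends on the initial centroids, running the function with a few different `seed` values and comparing the outcomes is a reasonable sanity check.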

Please do not hesitate to ask questions.

contact@codersarts.com
