
DNA Outbreak Investigation Using Machine Learning

You are given a data set of DNA sequences of equal length (the file is available here). Each DNA sequence is a string over the alphabet ‘A’, ‘C’, ‘T’, ‘G’ and represents a particular viral strain sampled from an infected individual. Your goal is to write code that helps identify transmission clusters corresponding to outbreaks.

Treat each sequence as a feature vector and each character as a feature. The data set is stored as a FASTA file, which is essentially a text file of the following form:

>Name of Sequence1


>Name of Sequence2


>Name of Sequence3



Each line starting with the ‘>’ symbol contains the name of a sequence; the sequence itself follows on the next line.
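
As an illustration of the format described above, here is a minimal sketch of a FASTA reader written in plain Python. The function name `read_fasta` and the toy records are hypothetical, and the sketch assumes that header lines start with ‘>’ and that a sequence may span one or more lines. A library reader is equally acceptable for this task.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence lines are joined, so multi-line sequences also work.
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Hypothetical in-memory example; a real file would be opened with open(path).
example = io.StringIO(">seq1\nACGT\n>seq2\nACGA\n")
records = read_fasta(example)
print(records)
```

Since all sequences are expected to have the same length, it may be worth checking that after reading and before computing distances.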

You may proceed as follows:

  • 1) Read sequences from the file.

  • 2) Calculate pairwise distances between sequences. Use the Hamming distance: the number of positions at which two sequences differ (see

  • 3) Project the sequences into 2-D space using Multidimensional Scaling (MDS) based on the Hamming distance matrix.

  • 4) Plot the obtained 2-D data points. Estimate the number of clusters K by visual inspection.

  • 5) Use the k-means algorithm to cluster the 2-D data points.
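
The pairwise distance computation in step 2 can be sketched as follows. The helper names `hamming` and `distance_matrix` and the three toy sequences are assumptions made for illustration only; they are not part of the assignment data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances (list of lists)."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(seqs[i], seqs[j])
            dist[i][j] = dist[j][i] = d  # the distance is symmetric
    return dist


# Toy sequences, made up for this example.
toy = ["ACGT", "ACGA", "TCGA"]
D = distance_matrix(toy)
print(D)  # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

Only the upper triangle is computed explicitly; the diagonal stays zero because every sequence is at distance 0 from itself.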

You may use library functions to read data from the file and perform MDS. For multidimensional scaling in Python, see e.g.
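
To show what the MDS step produces, here is a minimal sketch of classical (Torgerson) MDS using only NumPy. This is a stand-in chosen to keep the example self-contained, not necessarily the variant a library routine implements; a library implementation such as scikit-learn's `sklearn.manifold.MDS` would be the usual choice in practice. The function name `classical_mds` and the toy matrix are assumptions.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from a distance matrix."""
    D = np.asarray(dist, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to obtain a Gram-like matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can occur for non-Euclidean distances; clip them to zero.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy distance matrix, made up for this example (three points on a line).
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
coords = classical_mds(D)
print(coords.shape)  # (3, 2)
```

The rows of `coords` are the 2-D points to plot and cluster. Hamming distances are not guaranteed to embed exactly in a Euclidean plane, so the 2-D picture should be read as an approximation.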

K-means clustering should be implemented from scratch. Your submission should contain:

  • The code of your script

  • Visualization plots for MDS with different clusters highlighted in different colors.
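
For the from-scratch k-means requirement, a minimal sketch of Lloyd's algorithm in plain Python might look like this. The function names, the random initialization from the data points, the fixed seed, and the toy points are all assumptions made for illustration; other initialization schemes and stopping rules are equally valid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster a list of points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k points drawn at random from the data.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids


# Toy 2-D points, made up for this example: two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(pts, k=2)
print(labels)
```

The returned labels can be used directly to color the MDS scatter plot. Because the result depends on the initial centroids, running the algorithm with several seeds and keeping the tightest clustering is a common precaution.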

Please do not hesitate to ask questions.
