Computing distances

Phylogenetically related organisms have a similar oligonucleotide composition, also called a genomic signature. The same similarity is often observed among the different elements of a single prokaryotic cell: chromosomes, plasmids and phages. The results obtained by comparing the oligonucleotide composition of those elements can be retrieved on this site.

We evaluated five algorithms for computing distances, and our results are available here. According to our calculations with random sequences, the GC content of a sequence may affect the distance between the complete sequence and its subsequences when certain statistical methods are used. This effect was observed for Hamming distance, Global distance and Tetranucleotide Usage Deviation, so those statistics have been removed from our computations.

Consequently, this service uses only the following statistical methods to compute distances based on the oligonucleotide composition of sequences:

  • Euclidean distance for frequencies of oligonucleotides 2 to 6 bases long (PubMed), and
  • Pearson distance for z-scores of tetranucleotides (PubMed).

Both methods are used extensively on this website, although they are not both available in every tool or data set. The reasons for selecting these statistical methods are described here.
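As an illustration of the first method, the following is a minimal Python sketch of a Euclidean distance between oligonucleotide frequency vectors. It is not the code used by this service. The function names, the counting of overlapping words on a single strand, and the skipping of words that contain symbols other than A, C, G and T are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over A, C, G, T (hypothetical helper).

    Assumption: overlapping words are counted on one strand only, and words
    containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean_distance(seq_a, seq_b, k=4):
    """Euclidean distance between the k-mer frequency vectors of two sequences."""
    freq_a = kmer_frequencies(seq_a, k)
    freq_b = kmer_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(freq_a, freq_b)))


if __name__ == "__main__":
    # Word lengths from 2 to 6 bases, as listed above.
    for k in range(2, 7):
        print(k, euclidean_distance("ACGTTGCAACGTAGCTAGGCTA" * 20,
                                    "GGGCGCGCCGCGATATCGCGGC" * 20, k))
```

Because the vectors hold relative frequencies rather than raw counts, sequences of different lengths can be compared directly in this sketch.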

Assignment of sequences to genomes

We evaluated the suitability of both methods for assigning a DNA subsequence from a sequenced prokaryote to its source genome. The results of our computations are available here. Based on our data, we drew the following conclusions:

  • When randomly selecting a 5,000-base subsequence from sequenced prokaryotic genomes, Pearson distance for z-scores of tetranucleotides yielded the best results. Computing the z-scores of tetranucleotides uses data for dinucleotides, trinucleotides and tetranucleotides, so the z-score values can be considered to contain very "complete" information about the sequence. This is probably why this method performed best for longer sequences. Euclidean distance also performed well.
  • When shorter sequences were compared, Euclidean distance was the best choice. Z-scores are not computed correctly when short sequences are analyzed. This is why we have included both methods in our tools.
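
The assignment procedure described in the list above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation behind this service. The text only states that dinucleotide, trinucleotide and tetranucleotide data enter the z-score; the specific formula below (expected count from a maximal-order Markov model, with the matching variance approximation) is the one commonly used for tetranucleotide z-scores and is assumed here. Taking the Pearson distance as 1 minus the Pearson correlation is likewise an assumption, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_kmers(seq, k):
    """Counts of overlapping k-mers on one strand (assumption)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-scores of the 256 tetranucleotides, in lexicographic order.

    Assumed formula for a word n1n2n3n4:
      expected = N(n1n2n3) * N(n2n3n4) / N(n2n3)
      variance = expected * (N(n2n3) - N(n1n2n3)) * (N(n2n3) - N(n2n3n4)) / N(n2n3)^2
    Words whose z-score is undefined (zero denominator) are set to 0.0.
    """
    seq = seq.upper()
    c2, c3, c4 = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = []
    for p in product("ACGT", repeat=4):
        word = "".join(p)
        n_left, n_right, n_mid = c3[word[:3]], c3[word[1:]], c2[word[1:3]]
        if n_mid == 0:
            zscores.append(0.0)
            continue
        expected = n_left * n_right / n_mid
        variance = expected * (n_mid - n_left) * (n_mid - n_right) / n_mid ** 2
        zscores.append((c4[word] - expected) / sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson_distance(x, y):
    """1 minus the Pearson correlation of two equal-length vectors (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # correlation undefined; treated as uncorrelated
    return 1.0 - sxy / sqrt(sxx * syy)


def assign_to_genome(fragment, genomes):
    """Return the name of the genome (dict: name -> sequence) closest to fragment."""
    z_fragment = tetra_zscores(fragment)
    return min(genomes,
               key=lambda name: pearson_distance(z_fragment, tetra_zscores(genomes[name])))
```

In this sketch, a short fragment yields small dinucleotide and trinucleotide counts, so many variances are zero or poorly estimated. That is one plausible way to see why z-scores become unreliable for short sequences.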

Oligo-Skews

When generating oligo-skews, the profiles obtained by computing Pearson distances for z-scores of tetranucleotides were very flat. In contrast, Euclidean distances reveal peaks in the profiles. Although those peaks have not been investigated yet, we believe they could be related to horizontal transfer events. Oligo-Skews have been generated for all sequenced prokaryotes by using Euclidean distance.
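
One possible way to produce such a profile is sketched below in Python. The text does not define how oligo-skews are computed, so the whole procedure is an assumption: a window slides along the genome, and each window's tetranucleotide frequencies are compared with those of the complete sequence by Euclidean distance. The window size, step, word length and function names are hypothetical choices for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq):
    """Relative tetranucleotide frequencies, one strand, overlapping words."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[m] for m in TETRAMERS)
    return [counts[m] / total if total else 0.0 for m in TETRAMERS]


def oligo_skew(genome, window=5000, step=1000):
    """List of (window start, Euclidean distance to the whole genome).

    Assumption: a peak marks a region whose composition departs from the
    genome-wide signature.
    """
    reference = tetra_frequencies(genome)
    profile = []
    for start in range(0, len(genome) - window + 1, step):
        freqs = tetra_frequencies(genome[start:start + window])
        distance = sqrt(sum((a - b) ** 2 for a, b in zip(freqs, reference)))
        profile.append((start, distance))
    return profile


if __name__ == "__main__":
    # Toy sequence with an atypical stretch in the middle.
    toy = "ACGT" * 100 + "A" * 200 + "ACGT" * 100
    for start, distance in oligo_skew(toy, window=200, step=200):
        print(start, round(distance, 3))
```

In the toy example, the window covering the atypical stretch produces the largest distance, which is the kind of peak described above.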