Computing distances

We evaluated five methods for computing distances:

- Euclidean distance (PubMed)
- Hamming distance (PubMed) and average absolute dinucleotide relative abundance (d*; PubMed; PubMed). The d* distance is a special case of Hamming distance for dinucleotides.
- Global distance (PubMed)
- Pearson distance for z-scores of tetranucleotides (PubMed)
- Tetranucleotide Usage Deviation or TUD (PubMed).
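
All of the methods listed above compare oligonucleotide frequency profiles. As a minimal sketch of the simplest one, the snippet below counts overlapping oligonucleotides of length k and computes the Euclidean distance between two profiles. The function names `kmer_frequencies` and `euclidean_distance` are illustrative only, and the choice to skip windows containing symbols other than A, C, G and T is an assumption, not something stated on this page.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer, using overlapping windows."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: ignore windows with N or other symbols
            counts[word] += 1
            total += 1
    return {w: (c / total if total else 0.0) for w, c in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two profiles built over the same k-mer set."""
    return sqrt(sum((freq_a[w] - freq_b[w]) ** 2 for w in freq_a))
```

For example, `euclidean_distance(kmer_frequencies(a, 4), kmer_frequencies(b, 4))` would compare the tetranucleotide profiles of two sequences `a` and `b`.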

Our first calculations examined the effect of GC-content on the distances computed by the methods above. Our conclusions are given at the bottom of this page.

Effect of GC-content on distance:
Evaluation of Euclidean, Hamming and Global distances against random DNA sequences.

[Figure 1 plot. Axis labels: G+C content (%) and Euclidean distance]

Figure 1: Effect of GC-content on distance for dinucleotide frequencies computed between a random DNA sequence and a 20,000 bp subsequence of the same sequence. Each dot is the median of 10 measurements. Three methods were used to compute distances: Global distance (blue dots), Hamming distance (red dots) and Euclidean distance (black dots). Horizontal lines represent distances (0, 0.1 and 0.2 from top to bottom).

[Figure 2 plot. Axis labels: G+C content (%) and Euclidean distance]

Figure 2: Effect of GC-content on distance for tetranucleotide frequencies computed between a random DNA sequence and a 20,000 bp subsequence of the same sequence. Each dot is the median of 10 measurements. Three methods were used to compute distances: Global distance (blue dots), Hamming distance (red dots) and Euclidean distance (black dots). Horizontal lines represent distances (0, 0.1 and 0.2 from top to bottom).

[Figure 3 plot. Axis labels: G+C content (%) and Euclidean distance]

Figure 3: Effect of GC-content on distance for hexanucleotide frequencies computed between a random DNA sequence and a 20,000 bp subsequence of the same sequence. Each dot is the median of 10 measurements. Three methods were used to compute distances: Global distance (blue dots), Hamming distance (red dots) and Euclidean distance (black dots). Horizontal lines represent distances (0, 0.1 and 0.2 from top to bottom).
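
The measurement described in the captions of Figures 1 to 3 can be sketched roughly as follows: generate a random sequence with a chosen GC-content, cut a 20,000 bp subsequence from it, compute the distance between the two frequency profiles, and take the median of 10 repeats. Only Euclidean distance is shown. The length of the full random sequence is not given on this page, so `genome_len=200_000` is an assumption, as are all function names and the fixed random seed.

```python
import random
from collections import Counter
from math import sqrt
from statistics import median


def random_dna(length, gc, rng):
    """Random sequence whose expected G+C fraction is `gc` (0 to 1)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def kmer_freqs(seq, k):
    """Relative frequencies of the overlapping k-mers present in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def euclidean(fa, fb):
    """Euclidean distance; k-mers missing from a profile count as 0."""
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


def median_distance(gc, k, genome_len=200_000, sub_len=20_000,
                    repeats=10, seed=0):
    """Median distance between a random sequence and one of its subsequences."""
    rng = random.Random(seed)
    distances = []
    for _ in range(repeats):
        genome = random_dna(genome_len, gc, rng)
        start = rng.randrange(genome_len - sub_len + 1)
        sub = genome[start:start + sub_len]
        distances.append(euclidean(kmer_freqs(genome, k), kmer_freqs(sub, k)))
    return median(distances)
```

Calling `median_distance(gc, 2)` for a range of `gc` values would give one point per GC-content for dinucleotides; `k=4` and `k=6` correspond to the tetranucleotide and hexanucleotide cases.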

From the data shown in Figures 1, 2 and 3, we concluded that Hamming distance and Global distance are not suitable for measuring distances when the GC-content of the sequence is close to 50%. The effect is very sharp for Global distance and smaller for Hamming distance. For Hamming distance, the longer the oligonucleotide, the less suitable the method becomes, so we do not recommend using Hamming distance for oligonucleotides more than 2 bases long. In fact, Hamming distance for dinucleotides has been used in some papers, although there the statistic was named average absolute dinucleotide relative abundance (d*), as pointed out at the top of this page.

Effect of GC-content on distance:
Evaluation of Pearson distance for z-scores of tetranucleotides and Tetranucleotide Usage Deviation (TUD) against random DNA sequences.

These statistical methods involve computing the expected number of each tetranucleotide from the frequencies of di-, tri- and tetranucleotides. To test the effect of GC-content on these methods we first generated random DNA sequences, but the distances we obtained were useless. The reason is simple: in a random DNA sequence generated by mixing the nucleotides A, C, G and T, the observed tetranucleotide frequencies are the same as, or extremely similar to, their expected frequencies. These statistics therefore cannot be evaluated on this kind of sequence.
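
The exact formulas are not given on this page. As a hedged illustration, the sketch below uses one common formulation from the literature, a maximal-order Markov model in which the expected count of a tetranucleotide n1n2n3n4 is N(n1n2n3) x N(n2n3n4) / N(n2n3), with the usual variance approximation for the z-score. Taking the Pearson distance to be 1 minus the Pearson correlation of two z-score profiles is also an assumption, and all function names are illustrative.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_kmers(seq, k):
    """Counts of overlapping k-mers; absent k-mers read as 0."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """z-score of each tetranucleotide under the assumed Markov model."""
    n2, n3, n4 = (count_kmers(seq, k) for k in (2, 3, 4))
    z = {}
    for word in map("".join, product("ACGT", repeat=4)):
        left, right, mid = n3[word[:3]], n3[word[1:]], n2[word[1:3]]
        if mid == 0:
            z[word] = 0.0  # no information for this word
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        z[word] = (n4[word] - expected) / sqrt(variance) if variance > 0 else 0.0
    return z


def pearson_distance(za, zb):
    """1 minus the Pearson correlation between two z-score profiles."""
    words = sorted(za)
    xs = [za[w] for w in words]
    ys = [zb[w] for w in words]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant z-score profile")
    return 1 - cov / (sx * sy)
```

Under this formulation, a sequence built by drawing nucleotides independently has observed counts close to the expected ones, so its z-scores carry little more than sampling noise, which is consistent with the problem described above.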

Consequently, to evaluate these statistical methods we generated random DNA sequences with over- or under-representation of the tetranucleotides ATTA and CGGC. Results are shown below:

[Figure 4 plot. Axis labels: G+C content (%) and Pearson distance for z-scores]

Figure 4: Effect of GC-content on Pearson distance for z-scores of tetranucleotides computed between a random DNA sequence and a 20,000 bp subsequence of the same sequence. Blue dots correspond to individual measurements, and yellow ones are the median of 100 distances computed for each GC-content. Horizontal lines represent distances (0, 0.1 and 0.2 from top to bottom).

[Figure 5 plot. Axis labels: G+C content (%) and Tetranucleotide Usage Deviation]

Figure 5: Effect of GC-content on Tetranucleotide Usage Deviation or TUD computed between a random DNA sequence and a 20,000 bp subsequence of the same sequence. Blue dots correspond to individual measurements, and yellow ones are the median of 100 distances computed for each GC-content. Horizontal lines represent distances (0, 0.1 and 0.2 from top to bottom).

From these results, we concluded that Tetranucleotide Usage Deviation is not suitable for measuring distances. In contrast, Pearson distance for z-scores of tetranucleotides is a good statistical method.
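
How the biased random sequences for this evaluation were built is not described on this page. One possible way to obtain over- or under-representation of a tetranucleotide such as ATTA or CGGC is sketched below: over-representation by overwriting random positions with the motif, and under-representation by mutating one base in every occurrence. Both procedures and the names `plant_motif` and `deplete_motif` are assumptions made for illustration only.

```python
import random


def plant_motif(seq, motif, copies, rng):
    """Over-represent `motif` by overwriting random positions with it.

    Later copies may overwrite earlier ones, so the final count is
    approximate rather than exactly `copies` more than before.
    """
    bases = list(seq)
    for _ in range(copies):
        pos = rng.randrange(len(bases) - len(motif) + 1)
        bases[pos:pos + len(motif)] = motif
    return "".join(bases)


def deplete_motif(seq, motif, rng):
    """Under-represent `motif` by mutating one base in each occurrence."""
    buf = bytearray(seq, "ascii")
    target = motif.encode("ascii")
    pos = buf.find(target)
    while pos != -1:
        i = pos + rng.randrange(len(target))
        buf[i] = rng.choice([b for b in b"ACGT" if b != buf[i]])
        # A mutation can create a new occurrence just upstream, so step back.
        pos = buf.find(target, max(pos - len(target), 0))
    return buf.decode("ascii")
```

Both functions keep the sequence length unchanged, but they do shift the base composition slightly, so the GC-content of the result should be measured again rather than assumed.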

Conclusions

According to our calculations, the most suitable statistical methods for computing distances based on oligonucleotide frequencies are Euclidean distance and Pearson distance for z-scores of tetranucleotides. Both methods have been used extensively on this website, although they are not both available in every tool.