We evaluated five algorithms for computing
distances:
- Euclidean distance (PubMed)
- Hamming distance (PubMed) and Average absolute dinucleotide relative
  abundance (d*; PubMed; PubMed). The d* distance is a special case of
  Hamming distance for dinucleotides.
- Global distance (PubMed)
- Pearson distance for z-scores of tetranucleotides (PubMed)
- Tetranucleotide Usage Deviation or TUD (PubMed).
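As an illustration of two of the measures listed above, here is a minimal Python sketch. It assumes that Euclidean distance is taken between relative oligonucleotide frequency profiles, and that d* follows its commonly cited definition: the mean absolute difference, over the 16 dinucleotides, of the relative abundances f_xy / (f_x * f_y). The function names are hypothetical and this is not the code used by this website. For simplicity it works on a single strand, whereas published d* values are usually computed on the sequence combined with its reverse complement. The exact formulas this page uses for Hamming and Global distance are not given here, so they are not sketched.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer over the ACGT alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with other symbols are ignored
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def euclidean_distance(f, g):
    """Euclidean distance between two k-mer frequency profiles (same keys)."""
    return math.sqrt(sum((f[m] - g[m]) ** 2 for m in f))


def relative_abundance(seq):
    """Dinucleotide relative abundance rho_xy = f_xy / (f_x * f_y), single strand."""
    mono = kmer_frequencies(seq, 1)
    di = kmer_frequencies(seq, 2)
    rho = {}
    for xy, f_xy in di.items():
        denom = mono[xy[0]] * mono[xy[1]]
        rho[xy] = f_xy / denom if denom else 0.0
    return rho


def d_star(seq_a, seq_b):
    """Mean absolute difference of the 16 dinucleotide relative abundances."""
    ra, rb = relative_abundance(seq_a), relative_abundance(seq_b)
    return sum(abs(ra[xy] - rb[xy]) for xy in ra) / 16
```

Both functions return 0 when a sequence is compared with itself; the profiles passed to `euclidean_distance` must be built with the same value of `k`.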
Our first calculations examined the effect of GC-content on the
distances computed by the methods above. Our conclusions are given at
the bottom of this page.
Effect of GC-content on distance: evaluation of Euclidean, Hamming and
Global distances against random DNA sequences
Figure 1: Effect of GC-content on distance for dinucleotide
frequencies computed between a random DNA sequence and a 20,000 bp
subsequence of the same sequence. X-axis: G+C content (%). Each dot is
the median of 10 measurements. Three methods were used to compute
distances: Global distance (blue dots), Hamming distance (red dots) and
Euclidean distance (black dots). Horizontal lines represent distances
(0, 0.1 and 0.2 from top to bottom).
Figure 2: Effect of GC-content on distance for tetranucleotide
frequencies computed between a random DNA sequence and a 20,000 bp
subsequence of the same sequence. X-axis: G+C content (%). Each dot is
the median of 10 measurements. Three methods were used to compute
distances: Global distance (blue dots), Hamming distance (red dots) and
Euclidean distance (black dots). Horizontal lines represent distances
(0, 0.1 and 0.2 from top to bottom).
Figure 3: Effect of GC-content on distance for hexanucleotide
frequencies computed between a random DNA sequence and a 20,000 bp
subsequence of the same sequence. X-axis: G+C content (%). Each dot is
the median of 10 measurements. Three methods were used to compute
distances: Global distance (blue dots), Hamming distance (red dots) and
Euclidean distance (black dots). Horizontal lines represent distances
(0, 0.1 and 0.2 from top to bottom).
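The measurement behind Figures 1 to 3 can be approximated with a short Python sketch: generate a random sequence with a chosen GC-content, cut a 20,000 bp subsequence from it, compare the oligonucleotide frequency profiles, and take the median of 10 repeats. Only the Euclidean distance is shown. The length of the full random sequence is not stated on this page, so `genome_len` is an assumed value, and all function names are hypothetical.

```python
import math
import random
import statistics
from collections import Counter
from itertools import product


def random_dna(length, gc_fraction, rng):
    """Random sequence in which G and C each occur with probability gc_fraction / 2."""
    at = (1.0 - gc_fraction) / 2
    gc = gc_fraction / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k overlapping k-mers, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def euclidean_distance(f, g):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def median_distance(gc_fraction, k=2, genome_len=200_000, window=20_000,
                    repeats=10, seed=0):
    """Median distance between a random sequence and one of its subsequences.

    genome_len is an assumption; the source only specifies the 20,000 bp
    subsequence and the 10 measurements per point.
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(repeats):
        genome = random_dna(genome_len, gc_fraction, rng)
        start = rng.randrange(genome_len - window + 1)
        fragment = genome[start:start + window]
        distances.append(euclidean_distance(kmer_frequencies(genome, k),
                                            kmer_frequencies(fragment, k)))
    return statistics.median(distances)
```

Calling `median_distance(gc / 100, k=2)` for a range of `gc` values gives one point per GC-content, as in Figure 1; `k=4` and `k=6` correspond to the setups of Figures 2 and 3.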
From the data shown in Figures 1, 2 and 3, we concluded that Hamming
distance and Global distance are not suitable for computing distances
when the GC-content of the sequence is close to 50%. This effect is
very sharp for Global distance and smaller for Hamming distance. For
Hamming distance, the longer the oligonucleotide, the less suitable the
method is for computing distances. We do not recommend using Hamming
distance for oligonucleotides over 2 bases long. In fact, Hamming
distance for dinucleotides has been used in some papers, although the
statistic was named Average absolute dinucleotide relative abundance
(d*), as pointed out at the top of this page.
Effect of GC-content on distance: evaluation of Pearson distance for
z-scores of tetranucleotides and Tetranucleotide Usage Deviation (TUD)
against random DNA sequences
These statistical methods compute the expected number of each
tetranucleotide from the frequencies of di-, tri- and tetranucleotides.
To study the effect of GC-content on them we first generated random DNA
sequences, but the distances we obtained were useless. The reason is
simple: in a random DNA sequence generated by mixing nucleotides A, C,
G and T, the observed tetranucleotide frequencies are the same as, or
extremely similar to, their expected frequencies, so these statistics
cannot be evaluated on such sequences.
Consequently, to evaluate these statistical methods we generated
random DNA sequences with over- or under-representation of the
tetranucleotides ATTA and CGGC. Results are shown below:
Figure 4: Effect of GC-content on Pearson distance for z-scores of
tetranucleotides computed between a random DNA sequence and a 20,000 bp
subsequence of the same sequence. X-axis: G+C content (%). Blue dots
correspond to individual measurements, and yellow ones are the median
of 100 distances computed for each GC-content. Horizontal lines
represent distances (0, 0.1 and 0.2 from top to bottom).
Figure 5: Effect of GC-content on Tetranucleotide Usage Deviation
(TUD) computed between a random DNA sequence and a 20,000 bp
subsequence of the same sequence. X-axis: G+C content (%). Blue dots
correspond to individual measurements, and yellow ones are the median
of 100 distances computed for each GC-content. Horizontal lines
represent distances (0, 0.1 and 0.2 from top to bottom).
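The page does not say how ATTA and CGGC were over- or under-represented in the random sequences. One possible way to build such sequences is sketched below in Python, purely as an assumption: over-representation by overwriting random positions with the motif, and under-representation by mutating one base in each occurrence found. The function names and the number of planted copies are hypothetical.

```python
import random


def random_dna(length, gc_fraction, rng):
    """Random sequence in which G and C each occur with probability gc_fraction / 2."""
    at = (1.0 - gc_fraction) / 2
    gc = gc_fraction / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def count_overlapping(seq, motif):
    """Count occurrences of motif, including overlapping ones."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq.startswith(motif, i))


def plant_motif(seq, motif, copies, rng):
    """Over-represent motif by overwriting random positions with it."""
    bases = list(seq)
    for _ in range(copies):
        pos = rng.randrange(len(bases) - len(motif) + 1)
        bases[pos:pos + len(motif)] = motif  # same length, so seq length is kept
    return "".join(bases)


def deplete_motif(seq, motif, rng):
    """Under-represent motif by mutating one base in each occurrence found.

    A single left-to-right pass; a mutation can occasionally create a new
    occurrence, so the motif is reduced rather than guaranteed absent.
    """
    bases = list(seq)
    for i in range(len(bases) - len(motif) + 1):
        if "".join(bases[i:i + len(motif)]) == motif:
            j = i + rng.randrange(len(motif))
            bases[j] = rng.choice([b for b in "ACGT" if b != bases[j]])
    return "".join(bases)
```

Note that planting or removing a motif also shifts the overall GC-content slightly, which would need to be accounted for when plotting against G+C content.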
From these results, we concluded that Tetranucleotide Usage Deviation
is not suitable for measuring distances. In contrast, Pearson distance
for z-scores of tetranucleotides is a good statistical method.
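For readers unfamiliar with the approach, the following Python sketch shows one common way to obtain tetranucleotide z-scores and a Pearson-based distance. It assumes the maximal-order Markov formulation often used for tetranucleotide analysis, in which the expected count of n1n2n3n4 is N(n1n2n3) * N(n2n3n4) / N(n2n3), and it takes the distance to be 1 minus the Pearson correlation coefficient. Both choices are assumptions and may differ from the exact formulas used by this website; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

TETRAS = ["".join(p) for p in product("ACGT", repeat=4)]


def counts(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """z-score of each of the 256 tetranucleotides (assumed Markov model)."""
    n2, n3, n4 = counts(seq, 2), counts(seq, 3), counts(seq, 4)
    scores = []
    for t in TETRAS:
        mid = n2[t[1:3]]                   # N(n2n3)
        left, right = n3[t[:3]], n3[t[1:]]  # N(n1n2n3), N(n2n3n4)
        if mid == 0:
            scores.append(0.0)
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        # A zero variance means the model cannot score this tetranucleotide.
        scores.append((n4[t] - expected) / math.sqrt(variance)
                      if variance > 0 else 0.0)
    return scores


def pearson_distance(x, y):
    """1 - Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("Pearson correlation is undefined for a constant profile")
    return 1.0 - cov / (sx * sy)
```

With these definitions, `pearson_distance(tetra_zscores(a), tetra_zscores(b))` ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles).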
According to our calculations, the most suitable statistical methods
for computing distances based on oligonucleotide frequencies are
Euclidean distance and Pearson distance for z-scores of
tetranucleotides. Both methods have been used extensively on this
website, although they are not available simultaneously in all tools.