Looking for cyanobacterial phylogenetic trees? Visit cyanophylogeny.scienceontheweb.net/
INTRODUCTION
The 16S rRNA sequence alignment, maintained in ARB and containing over 4000 sequences, is an essential part of CyanoPhy for the analysis of cyanobacterial phylogeny. New sequences are downloaded from NCBI weekly to keep the database current. A surprising number have been identified as chimeric by the various detection methods available in CyanoPhy. An example is shown in the figure above, with the breakpoint indicated by the arrow.
Chimeric sequences, composed of parts of two or more true parental sequences, are most likely the result of PCR errors. Most are believed to arise following incomplete extension during a PCR cycle: the partially extended strand may bind to a template derived from a different sequence and act as a primer, which is then extended and amplified during succeeding cycles to produce the chimeric sequence. A more complete description of this process is given on the UCHIME home page.
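The idea of a two-parent chimera can be illustrated with a minimal Python sketch. It assumes two already aligned parental sequences of equal length and a single breakpoint; the function names and the toy sequences are hypothetical and only serve to show why a chimera resembles the assembled parents more closely than either full-length parent.

```python
def make_chimera(parent_5prime, parent_3prime, breakpoint):
    """Join the 5' part of one parent to the 3' part of another.

    Assumes both parents are aligned to the same length (illustrative only).
    """
    if len(parent_5prime) != len(parent_3prime):
        raise ValueError("parents must have the same aligned length")
    if not 0 <= breakpoint <= len(parent_5prime):
        raise ValueError("breakpoint lies outside the sequence")
    return parent_5prime[:breakpoint] + parent_3prime[breakpoint:]


def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Toy parents (made up): they differ at four of twelve positions.
parent_a = "ACGTACGTACGT"
parent_b = "ACCTACGAACTT"[:9] + "GGA"
chimera = make_chimera(parent_a, parent_b, 6)

print(chimera)
print(percent_identity(chimera, parent_a), percent_identity(chimera, parent_b))
```

The chimera is identical to parent_a up to the breakpoint and to parent_b after it, so its identity to either full-length parent is below 100 %.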
RESULTS AND CONCLUSIONS
Chimera formation is unlikely to be a problem if the organism used for sequencing is axenic, such as strains from the Pasteur Culture Collection of Cyanobacteria (PCC). However, chimeric sequences may be formed in PCR reactions following contamination of such strains in the laboratory, or if non-axenic cultures, as supplied by most culture collections with cyanobacterial holdings, are employed. Another source of chimeric sequences is a culture thought to be clonal (containing a single cyanobacterium) but in fact composed of several cyanobacterial representatives. Finally, as evidenced by the results presented below, environmental samples carry a high risk of chimera formation since they contain several to many different organisms.
Chimeric sequences are difficult to detect and to distinguish from real biological sequences, but it is essential to exclude them from studies of cyanobacterial phylogeny: their inclusion can distort the results, causing misplacement of other organisms or clades in the phylogenetic tree and giving the impression that genetic diversity is wider than it really is. In designing CyanoPhy I tested all chimera-detection programmes available, and continue to do so as new detection methods are published. A list of programmes is available here.
These tests revealed two major problems:
1) Over 60 % of the sites are identical in all sequences of the
cyanobacterial alignment and thus tend to mask recombination events;
cyanobacterial chimeric sequences therefore escape detection methods such
as Pintail, often used to verify the quality of sequences in the public
databases.
2) If the database used for sequence comparison contains few cyanobacterial
representatives, a chimeric sequence will not be detected because one or more
of the "parental" sequences are absent; this is the case for the
Bellerophon server, again frequently employed for the control of sequence
quality in the public databases.
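The first problem can be made concrete by measuring how many alignment columns carry no information at all. The following is a minimal Python sketch, assuming the alignment is available as a list of equal-length strings; the function name and the toy alignment are hypothetical, and gap characters are simply treated as another residue.

```python
def invariant_fraction(alignment):
    """Fraction of alignment columns in which every sequence has the same residue."""
    if not alignment or not alignment[0]:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*alignment) walks through the alignment column by column.
    invariant = sum(1 for column in zip(*alignment) if len(set(column)) == 1)
    return invariant / length


# Toy alignment (made up): columns 5 and 8 vary, the other eight are invariant.
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGAAC",
    "ACGTTCGTAC",
    "ACGTACGTAC",
]
print(invariant_fraction(toy_alignment))  # 0.8
```

Only the variable columns can discriminate between candidate parents, so the higher this fraction, the fewer sites remain to support a chimera model.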
Of the programmes that do not employ a sequence database, Mallard appears to give fairly reliable results, but should not be used with more than about 200 sequences per file. LARD (unfortunately discontinued) performs well in localizing a breakpoint, but must be supplied with the chimera and the two potential parental sequences. The Decipher web server has a good database with many cyanobacterial sequences, is rapid and performs reasonably well, but unfortunately depends on the RDP classifier, sadly out of date for Cyanobacteria. SplitsTree4 often gives an unambiguous result in the splits graphs.
I have retained LARD, Pintail, Mallard and SplitsTree4 in CyanoPhy, and added UCHIME. The latter uses a FASTA-formatted database to which I have added over 1000 cyanobacterial 16S rRNA sequences of good length and quality. The standalone version of Decipher has the same problem as the web server and has not been retained.
Some cyanobacterial chimeric sequences were initially identified by chance from the marked effect their inclusion has on tree topology. Thus, inclusion of AphaNH5 (Aphanizomenon strain NH5, accession AF425995) causes a major change in the position of the heterocystous Nodularia group, and Ukia101a (Umezakia natans strain TAC101, accession AF516748) may, depending on the choice of other taxa included, cause the entire heterocystous clade to move from the top to the root of the tree when Rhodopseudomonas is included in the outgroup. In contrast, a second sequence of strain NH5 (accession AY196086) and of strain TAC101 (accession AY897614), and the sequence of Umezakia natans strain TAC661 (accession AB608023), are not chimeric and behave normally in the tree.
The best detection method for chimeric sequences appears to be UCHIME. I have built a shell script around this programme that first lists and removes sets of identical sequences (if any) from the query infile, then runs UCHIME on the remainder, checks the sequences reported as chimeric against the list of identical sequences, reformats the statistical results, adds a list of the identical sequences in the original infile, and finally prints all results to file. The sequences UCHIME has identified as chimeric are included in the Table below; this list should be regarded as tentative, pending further study. A more complicated (and time-consuming) detection method is to use NCBI BLAST to verify potential chimeras: briefly, the sequence is cut into two or more fragments and each is subjected to BLAST analysis on NCBI. This has only been done for a few sequences, identified from their effects on tree topology. Note that the local CyanoPhy BLAST database can be queried in place of NCBI.
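The bookkeeping around identical sequences can be sketched in Python as follows. This is not the shell script itself, only a minimal illustration under stated assumptions: sequences are held in a dictionary of name to sequence, one representative per identical set is kept (here the alphabetically first name, which is an arbitrary choice), and the UCHIME run is represented simply by a list of names it reported. All function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def split_identical(records):
    """Separate unique sequences from sets of identical ones.

    records: dict mapping sequence name -> sequence string.
    Returns (unique, identical_sets), where identical_sets maps the retained
    representative to the names removed because they were identical to it.
    """
    names_by_sequence = defaultdict(list)
    for name, sequence in records.items():
        names_by_sequence[sequence.upper()].append(name)

    unique, identical_sets = {}, {}
    for sequence, names in names_by_sequence.items():
        keeper, *removed = sorted(names)  # assumption: keep the first name
        unique[keeper] = sequence
        if removed:
            identical_sets[keeper] = removed
    return unique, identical_sets


def annotate_hits(chimeric_names, identical_sets):
    """Attach to each reported chimera the names of its identical counterparts."""
    return {name: identical_sets.get(name, []) for name in chimeric_names}


# Toy input: the sequences are made up; only the names come from the text.
records = {
    "PlaH1128": "ACGTACGT",
    "PlaH417": "ACGTACGT",
    "PlaH1347": "acgtacgt",
    "PlaH1379": "ACGTACGT",
    "AphaNH5": "ACGTTCGT",
}
unique, identical_sets = split_identical(records)
report = annotate_hits(["PlaH1128", "AphaNH5"], identical_sets)
for name, twins in report.items():
    print(name, "[", " ".join(twins), "]" if twins else "]")
```

Checking the hits against the removed names matters because every sequence identical to a chimera is itself chimeric, even though it was never submitted to the detection step.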
All chimeric sequences found by the detection methods
available in CyanoPhy are given in the Table below, and should be excluded from
sequence sets used for building trees. They have been left in the ARB database
only for informational purposes.
The Table is followed by several examples of UCHIME output, automatically reformatted by the shell script. Note that the 5-prime and 3-prime "parental sequences" detected by UCHIME are not necessarily the true parents of the chimeric sequences, but their nearest relatives found in the UCHIME database.
Also shown are four examples of analysis with SplitsTree4. These demonstrate the unusual position of the chimeric sequence Ukia101a, which falls between the phyla Cyanobacteria and Proteobacteria, the anomalous position of the chimeric sequence DspN78 between the heterocystous and Pseudanabaena clades, the anomalous behaviour of the chimeric sequence AphaNH5 within the heterocystous clade, and that of PlaH1128 within the genus Planktothrix.
Chimeric sequences identified by UCHIME, Mallard, Decipher
These sequences are listed in the Table below. Since this is large, it is masked by default. It may be visualized by clicking the "view table" button, and hidden again via the "mask table" button.
Only cyanobacterial sequences from ARB >1249 nt in length and with <6 ambiguous sites were examined. Mallard results are given at support value p=0.05; breakpoints are from UCHIME output.
List updated 24 May 2014. Of the 4340 sequences in the CyanoPhy ARB database, 3919 were >1249 nt in length and had <6 ambiguous sites. Removal of identical sequences eliminated 731 of these, and the remaining 3188 unique sequences were checked for potential chimeric origin using the tools available in CyanoPhy. A total of 64 were found to be definite (indicated by "+" in the Table below) or possible ("(+)") chimeric sequences, corresponding to a frequency of 2.01 % among the 3188 unique cyanobacterial sequences. Of these, 36 (1.35 %) are among the 2657 unique sequences from cultured isolates; the frequency increases markedly to 5.27 % (28 chimeric sequences) among the 531 uncultured samples included.
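The frequencies quoted above follow directly from the counts, as this short Python check shows. The variable names are hypothetical; the numbers are those given in the text.

```python
# Counts taken from the text: (chimeric sequences, sequences examined).
counts = {
    "all unique sequences": (64, 3188),
    "cultured isolates": (36, 2657),
    "uncultured samples": (28, 531),
}


def percent(chimeric, total):
    """Chimera frequency as a percentage, rounded to two decimals."""
    return round(100.0 * chimeric / total, 2)


frequencies = {label: percent(*pair) for label, pair in counts.items()}
for label, value in frequencies.items():
    print(f"{label}: {value} %")

# The two subsets add up to the full set of unique sequences.
assert 36 + 28 == 64 and 2657 + 531 == 3188
# 3919 sequences passed the length/ambiguity filter; 731 were removed.
assert 3919 - 731 == 3188
```

On these figures the chimera frequency among uncultured samples is roughly four times that among cultured isolates.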
Three sequences described in the Table have one or more identical counterparts in the 726 sequence dataset:
PlaH1128 (Planktothrix HAB1128, FJ184439) is identical to PlaH417 (Planktothrix HAB417, FJ184441), PlaH1347 (Planktothrix HAB1347, FJ184442) and PlaH1379 (Planktothrix HAB1379, FJ184443);
LepCYN83 (Leptolyngbya CYN83, JF925321) is identical to LepCYN87 (Leptolyngbya CYN87, JF925322), and LepCYN95 (Leptolyngbya CYN95, JF925323);
PlaH1130 (Planktothrix HAB1130, FJ184437) is identical to PlaH662 (Planktothrix HAB662, FJ184438).
AphaNH5, ChrcCC4, DspN78 and Ukia101a
and suggested 6 additional strains, not found by the other methods
Mcys42 Microcystis NIES 42, U40335
Mcys43 Microcystis NIES 43, U40336
PluP302b Pleurocapsa VP3-02b, FR798929
PluVP302 Pleurocapsa VP3-02, FR798927
PluVP407 Pleurocapsa VP4-07, FR798930
ScocIR11 Synechococcus IR11, AF448079
Examples of UCHIME output for chimeric sequences (extracted from the full listing):
Chimeric (Query) sequences (55) found by UCHIME in /usr/local/uchime/cyano-0710.fasw and taxonomic information for their 5' (*****) and 3' (***) parents.
Statistics show: qAB, % identity of the query to the assembled parents, and to the full length 5'- (qAt) and 3'- (qBt) parents; pAB, % identity of the parents; total discriminatory sites left (tL) and right (tR) of the breakpoint and their sum (tT); number of discriminatory sites supporting the model (left, pL; right, pR; sum, pT); YES/yes, strong/weak evidence for the chimera.
Breakpoint region: residues conserved in chimera and parents may hinder precise localization.
Query: AphaNH5
***** NodB9427 Nodularia BCNOD9427 Section 4 AJ224447
*** Ana27S03 Anabaena 1LT27S03 Section 4 FM177478
Statistics: qAB: 99.9 qAt: 97.3 qBt: 97.0 pAB: 94.7
tL: 40 tR: 39 tT: 79 pL: 39 pR: 35 pT: 74 YES
Breakpoint region: 667 to 751 of chimera (1427 nt)
--------
Query: ChrcCC4
***** FiscN592 Fischerella major NIES 592 Section 5 AB093487
*** S000446569 Ochrobactrum grignonense (T); type strain:OgA9a; AJ242581 (Bacteria)
Statistics: qAB: 99.4 qAt: 94.3 qBt: 83.2 pAB: 78.0
tL: 243 tR: 72 tT: 315 pL: 233 pR: 71 pT: 304 YES
Breakpoint region: 1077 to 1087 of chimera (1448 nt)
--------
Query: DspC202
***** S000643548 Flavobacterium aquatile (T); type strain: DSM 1132; AM230485 (Bacteria)
*** DspC207 Dolichospermum crassum CENA207 Section 4 FJ830578
Statistics: qAB: 99.1 qAt: 79.7 qBt: 93.5 pAB: 74.3
tL: 88 tR: 273 tT: 361 pL: 79 pR: 269 pT: 348 YES
Breakpoint region: 277 to 285 of chimera (1436 nt)
--------
Query: DspN78
***** PsaBRG53 Pseudanabaena ABRG5-3 Section 3 AB527076
*** DspN80 Dolichospermum solitarium NIES 80 Section 4 AF247594
Statistics: qAB: 97.9 qAt: 92.7 qBt: 92.6 pAB: 86.7
tL: 118 tR: 71 tT: 189 pL: 89 pR: 69 pT: 158 YES
Breakpoint region: 658 to 717 of chimera (1369 nt)
--------
Query: PlaH1128
***** Plan7S08 Planktothrix agardhii 1LT27S08 Section 3 AJ635435
*** Pla34S02 Planktothrix pseudagardhii 2LT34S02 Section 3 FM177501
Statistics: qAB: 100.0 qAt: 97.8 qBt: 97.7 pAB: 95.4
tL: 32 tR: 31 tT: 63 pL: 32 pR: 30 pT: 62 yes
Breakpoint region: 757 to 890 of chimera (1382 nt)
--------
Query: PsaCA530
***** LimCC29 Limnothrix redekei CCAP 1459/29 Section 3 HE974998
*** S000129976 Stenotrophomonas rhizophila (T); e-p10; AJ293463 (Bacteria)
Statistics: qAB: 99.1 qAt: 91.3 qBt: 89.2 pAB: 82.1
tL: 145 tR: 119 tT: 264 pL: 140 pR: 109 pT: 249 YES
Breakpoint region: 700 to 708 of chimera (1400 nt)
--------
Query: Ukia101a
***** Aph27S04 Aphanizomenon ovalisporum 1LT27S04 Section 4 FM177485
*** S000498560 Rhodopseudomonas faecalis (T); gc; AF123085 (Bacteria)
Statistics: qAB: 98.5 qAt: 87.8 qBt: 87.7 pAB: 76.9
tL: 161 tR: 171 tT: 332 pL: 154 pR: 151 pT: 305 YES
Breakpoint region: 706 to 712 of chimera (1418 nt)
--------
Note: one or more sets of identical sequences were removed (see /usr/local/uchime/cyano-0710-711.deleted).
Those (if any) identical to the queries are shown below within square brackets:
LepCYN83 [ LepCYN87 LepCYN95 ]
PlaH1128 [ PlaH1347 PlaH1379 PlaH417 ]
PlaH1130 [ PlaH662 ]
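Records in this reformatted layout are easy to read back into a programme for checking. The following minimal Python sketch parses the statistics and breakpoint of one record and verifies that the totals are the sums of their left and right parts. The regular expressions and function names are hypothetical and were written only against the layout shown here; they tolerate line breaks within a record.

```python
import re

STAT_KEYS = ("qAB", "qAt", "qBt", "pAB", "tL", "tR", "tT", "pL", "pR", "pT")
STAT_RE = re.compile(r"\b(" + "|".join(STAT_KEYS) + r"):\s*([\d.]+)")
BREAK_RE = re.compile(
    r"Breakpoint region:\s*(\d+)\s+to\s+(\d+)\s+of chimera\s*\((\d+)\s*nt\)"
)


def parse_record(text):
    """Extract the statistics, breakpoint region and length from one record."""
    stats = {key: float(value) for key, value in STAT_RE.findall(text)}
    match = BREAK_RE.search(text)
    if match is None:
        raise ValueError("no breakpoint line found")
    start, end, length = (int(group) for group in match.groups())
    return {"stats": stats, "breakpoint": (start, end), "length": length}


def sums_are_consistent(stats):
    """tT and pT should equal the sums of their left and right components."""
    return (stats["tL"] + stats["tR"] == stats["tT"]
            and stats["pL"] + stats["pR"] == stats["pT"])


# The AphaNH5 record from the listing, deliberately wrapped across lines.
RECORD = """Query:
AphaNH5
***** NodB9427
Nodularia BCNOD9427 Section 4 AJ224447
*** Ana27S03
Anabaena 1LT27S03 Section 4 FM177478
Statistics:
qAB: 99.9 qAt: 97.3 qBt: 97.0
pAB: 94.7
tL: 40 tR: 39
tT: 79 pL: 39 pR: 35
pT: 74 YES
Breakpoint region: 667 to 751 of chimera
(1427 nt)
"""

parsed = parse_record(RECORD)
print(parsed["breakpoint"], parsed["length"], sums_are_consistent(parsed["stats"]))
```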
The above results show that all types of chimera formation occur: those involving parental sequences from different phyla (Cyanobacteria plus other bacterial phyla), from different clades within the Cyanobacteria (DspN78 involves parents representing heterocystous and non-heterocystous Cyanobacteria), from the same cyanobacterial clade (the parents of AphaNH5 both being heterocystous organisms), or from the same genus. Not surprisingly, the number of discriminatory sites within the chimeric sequence, given as tT or pT values above, decreases as the degree of sequence divergence between the parents decreases. The precision with which the breakpoint can be located also decreases, from 6 to 10 nucleotides for the inter-phylum examples (involving a cyanobacterial and a bacterial parental sequence) to 133 for the intra-generic example, because the increase in sequence identity makes precise location progressively more difficult.
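The trend can be read directly from the seven records listed above, as in this minimal Python sketch. The breakpoint coordinates and tT values are copied from the UCHIME output; the assignment of ChrcCC4, DspC202 and PsaCA530 to the inter-phylum type is an assumption based on one of their parents being labelled "(Bacteria)" in the listing, and the variable names are hypothetical.

```python
# (query, type of chimera, breakpoint start, breakpoint end, tT)
EXAMPLES = [
    ("ChrcCC4", "inter-phylum", 1077, 1087, 315),
    ("DspC202", "inter-phylum", 277, 285, 361),
    ("PsaCA530", "inter-phylum", 700, 708, 264),
    ("Ukia101a", "inter-phylum", 706, 712, 332),
    ("DspN78", "inter-clade", 658, 717, 189),
    ("AphaNH5", "intra-clade", 667, 751, 79),
    ("PlaH1128", "intra-generic", 757, 890, 63),
]

# Width of the breakpoint region: the smaller it is, the more precisely
# the breakpoint has been located.
widths = {query: end - start for query, _kind, start, end, _tT in EXAMPLES}

for query, kind, start, end, total_sites in EXAMPLES:
    print(f"{query:9s} {kind:14s} width {widths[query]:3d} nt  tT {total_sites}")

inter_phylum = [widths[q] for q, kind, *_ in EXAMPLES if kind == "inter-phylum"]
print(min(inter_phylum), max(inter_phylum))  # 6 10
```

As the parents become more closely related, tT falls from several hundred to 63 while the breakpoint region widens from under a dozen nucleotides to 133.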
Example splits graphs produced by SplitsTree4 NeighborNet analysis:
Inter-phylum chimeric sequence formation. Ukia101a (accession AF516748) is a chimeric sequence involving parents from both cyanobacterial and proteobacterial phyla, represented by a cluster of heterocystous Cyanobacteria and Rhodopseudomonas faecalis strain gc, respectively.
Inter-clade chimera. The chimeric sequence DspN78 (accession AF313627) of the heterocystous Dolichospermum strain NIES 78 is distinct from its non-chimeric version DspN78n (accession AY701551); the splits show a strong identity with sequences DspN80/DspN78n (Dolichospermum) and with non-heterocystous Pseudanabaena strains represented by sequences PsaBRG53 and Psa7429A. The parental sequences are from organisms that fall into distinct clades of the phylogenetic tree.
Intra-clade chimeric sequence. The chimeric sequence AphaNH5 (accession AF425995) of the heterocystous Aphanizomenon strain NH-5 is distinct from a second, non-chimeric, sequence (AphNH5, accession AY196086) of the same organism; the splits reveal identity with both Nodularia and Anabaena/Aphanizomenon strains (sequences NodB9427 and Ana27S03/AphNH5). The parental sequences are found within the same major clade of the phylogenetic tree.
Intra-generic chimera formation. Within the genus Planktothrix, the 16S rRNA sequence PlaH1128 (accession FJ184439) occupies an unusual position because it is a chimeric sequence, showing identity to both the P. agardhii strain cluster (e.g. Pla7821) and the P. pseudagardhii cluster (e.g. Pla34S02).