Monday, March 23, 2009

Tree Building for Y-DNA Surname Projects

A brief comparison of a few genetic genealogy tree creation software programs

If you can't see the forest for the trees, then you're probably close enough





One item genealogists check for is accuracy. They check their sources, double check, and try to verify inconsistent data all of the time.
They cite their sources in order to know what each document tells them, as well as to know where to find the document.


Genealogists do this because incorrect data can affect your interpretation of who descends from whom. Without proper source information, you can draw the wrong conclusions about lineage. In genealogy, errors happen all the time, and it is sometimes very annoying or frustrating to fix the errors created by bad data or wrong information. In the same way, how phylogenetic trees are evaluated can affect your perception of who is closely related.


I spent a good part of March looking at trees. Creating phylogenetic trees is another item that can get confusing. If you have never created your own Genetic Genealogy phylogram, then I suppose ignorance is bliss. If you have not converted your DNA data into ATGC format, then you may be wondering if there is a better way to analyze your data. At the moment, there are no genealogy courses, or even genetic genealogy books to explain this. So, short of taking a lot of time studying genetics courses at your local University, I thought a brief review of the subject might be helpful.

There are alternatives to building phylograms. Geneticists build trees all the time. The fact is, there are about 300 tree building programs, which of course makes you wonder how many Genetic Genealogists are building trees. I suspect few are building trees for their Projects at this time. There is no software that I know of that attempts to do this (other than "ft2dna" that I wrote).

I have heard via the FTDNA conference in 2009 that Family Tree DNA will be providing a new tree widget for folks, so there is hope for an easier way in the event that you don't have the time to build trees on your own. I understand that FTDNA is using the term "alpha," so I will be trying to gear this article along those lines.


In this article, I'd like to do a brief comparison of the Phylip software packages. I will compare the use of ATGC frequency data with my standard method of using Dean McGee's Utility (using Dean McGee's Genetic Distance as evaluated with the Kitsch program). By default, Dean McGee's Y-DNA Utility builds the TMRCA charts using the infinite alleles method. This article should be somewhat of a comparison of an infinite alleles model vs. an ATGC frequencies approach to creating phylogenetic trees.
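To make the "infinite alleles" idea concrete, here is a minimal sketch of two common ways to count Genetic Distance between two kits. The marker values are made up for illustration, and the function names are my own. This is not Dean McGee's code, only the general idea: under infinite alleles any difference at a marker counts as one step, while a stepwise count adds up the size of each difference.

```python
def infinite_alleles_distance(kit_a, kit_b):
    """Count the markers where the two kits differ at all (each counts as 1)."""
    return sum(1 for x, y in zip(kit_a, kit_b) if x != y)


def stepwise_distance(kit_a, kit_b):
    """Sum the absolute difference in repeat counts over all markers."""
    return sum(abs(x - y) for x, y in zip(kit_a, kit_b))


# Hypothetical repeat counts for six markers (not real project data).
kit_a = [13, 24, 14, 11, 11, 14]
kit_b = [13, 25, 14, 11, 13, 14]

print(infinite_alleles_distance(kit_a, kit_b))  # 2 markers differ
print(stepwise_distance(kit_a, kit_b))          # 1 step + 2 steps = 3
```

The two counts only disagree when a marker differs by more than one repeat, which is exactly where the choice of model starts to matter for the tree.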


Hopefully, you will get an idea of the differences between DNAPARS, DNAML, DNAMLK, PHYML, and TreePuzzle.

DNAPARS is a DNA Parsimony program.
DNAML is a DNA Maximum Likelihood program.
DNAMLK is a DNA Maximum Likelihood program with a molecular clock.
PHYML and TreePuzzle are separate packages, and are not part of the Phylip package.


For this example, I will use the data from the HAM DNA Group #1, plus an individual from Group #5 (kit 27814) as an outgroup anchor. So, the data is composed of Y-DNA output for 9 individuals, using 37 marker data. (Kit #21554 was not evaluated because this kit does not contain 37 markers, and most of these tree programs do not function well with data missing from a large number of markers.)


Genetic Distance and resulting trees are of interest to Group01 participants because most participants in this group have not yet identified their immigrant ancestor (through normal genealogical documentation). In fact, this immigration documentation may not even exist due to the destruction of records in Virginia. Kit #N54540 is a more recent immigrant, and has traced his line back to County Somerset, England. Kit #27814 has traced their line back to Elseheim, Germany (and is actually in HAM DNA Group #4, but is included here for the purpose of illustration). The remaining participants have not yet identified their immigrant ancestor. Therefore, understanding the Time to Most Recent Common Ancestor is significant for this group.


Putting it together:


The "control group" chart is from the standard method that is used from Dean McGee's Y-DNA Comparison Utility. It should be noted that this utility uses the infinite alleles model to create PHYLIP-compatible Genetic Distance input. The Genetic Distance input is then used with the Kitsch program to generate a phylogenetic tree.

The program "ft2dna" was used to convert the data into ATGC format. At this time, I know of no other software program that will convert Y-DNA Project data (DYS repeat counts) into ATGC format. I have tried to check the routines in "ft2dna" for accuracy, but the CAVEAT here is that if I have not done the conversion correctly, then it will affect the outcome of tree creation. Family Tree DNA has a funny way of calculating the Genetic Distance for DYS389i, DYS389ii, YCAIIa, and YCAIIb, and conversion to ATGC values is not published by FTDNA. Tree errors could easily be introduced at the time of conversion into ATGC format, so it is important to note that.
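For readers wondering what "converting repeat counts into ATGC format" might look like, here is a rough sketch of one possible approach. To be clear, this is NOT the routine used in "ft2dna"; it is a hypothetical illustration. It assumes every marker uses the same four-letter motif ("GATA" is just a placeholder, real markers have their own motifs) and pads shorter alleles with gap characters so that all kits end up with sequences of the same length.

```python
def repeats_to_sequence(count, motif="GATA", width=None):
    """Write `count` copies of the motif, padded with '-' up to `width`."""
    seq = motif * count
    if width is not None:
        seq = seq.ljust(width, "-")
    return seq


def haplotypes_to_alignment(kits, motif="GATA"):
    """Turn {kit name: [repeat counts]} into {kit name: aligned sequence}.

    Assumption: all kits list the same markers in the same order.
    """
    n_markers = len(next(iter(kits.values())))
    # Each marker column must be wide enough for its longest allele.
    widths = [
        max(counts[i] for counts in kits.values()) * len(motif)
        for i in range(n_markers)
    ]
    return {
        name: "".join(
            repeats_to_sequence(c, motif, w) for c, w in zip(counts, widths)
        )
        for name, counts in kits.items()
    }


# Hypothetical two-marker haplotypes, kept tiny so the output is readable.
kits = {"kitA": [2, 3], "kitB": [3, 3]}
aligned = haplotypes_to_alignment(kits)
for name, seq in aligned.items():
    print(name, seq)
```

How gaps are scored differs from program to program, so a padding scheme like this is one of the places where conversion choices can change the resulting tree.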


Next, this ATGC information was run through the PHYLIP program "DNADIST" in order to capture Genetic Distance in the form of ATGC frequencies.
That is, an effort was made to derive Genetic Distance independently from Dean McGee's Y-DNA Comparison Utility.
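As a rough idea of what a distance computed from ATGC data involves, the sketch below calculates the Jukes-Cantor distance between two aligned sequences. Jukes-Cantor is one of the distance models DNADIST offers, but this is a simplified stand-in written for illustration, not DNADIST's actual code. In particular, skipping any position with a gap or unknown base is my own simplifying assumption.

```python
import math


def p_distance(seq1, seq2):
    """Fraction of compared sites that differ (gapped sites are skipped)."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical sequences that differ at 1 site out of 10.
d = jukes_cantor("ACGTACGTAC", "ACGTACGTAT")
print(round(d, 6))
```

The corrected distance is always a little larger than the raw fraction of differing sites, because the model allows for changes that happened and were later hidden by another change at the same site.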


However, as Professor Felsenstein notes in his PHYLIP documentation on distance (distance.html), using frequency data may not be expected to be an independent evaluation (if the distance is computed from the original data by a method which does not correct for reversals and parallelisms in evolution). The example he gives is for (pure) genetic drift, where the program CONTML may be more appropriate. Felsenstein says that Fitch, Kitsch, and Neighbor may be appropriate for use with frequency data if additivity holds, a neutral mutation model can be assumed, and Nei's genetic distance is used.

So, if you are a Genetic Genealogist, you should be aware that different software programs may deliver different tree results due to the underlying assumptions behind the software. If you are going to use these, then I would suggest that you use a kit with a greater Genetic Distance as an outgroup. The DNADIST program was used with ATGC data for input, and delivered output in the form of frequencies. The frequency output from DNADIST was evaluated using the programs Kitsch, Fitch, and Neighbor. The resulting trees were rooted on the outgroup kit, 27814, usually by selecting option "O" from within each program. Kitsch does not have an "outgroup" option, so its tree was rooted on the outgroup (kit 27814) by the use of the MEGA software program.

The MEGA program was used to compute the "consensus" trees when a consensus was required, and was also used to set the "outgroup root" of the trees when required.


Only 37 markers were used in this study. Here's the normal Genetic Distance and TMRCA for Group01 using Dean McGee's Utility:






And here's the tree generated by the PHYLIP package "Kitsch":





















You might notice that Robert is depicted as descending from the William HAM of Grayson County, VA, when in fact Robert's ancestors were in England at the time of the most recent common ancestor. This should be telling us that 37 markers are not enough to discern a precise date for TMRCA. But this graph has been close enough for use.


For the remaining comparisons, I converted the data from FTDNA's repeat values into ATGC format. Then, I used the ATGC format as input to the PHYLIP program DNADIST, which produced an output table in the form of frequencies.

The output would look something like this (repeated for each DYS value):

    9
40777_WmVA  0.000000 0.000000 0.124014 0.000000 0.000000 0.101782 0.101782 0.101782 0.124014
68140_WmVA  0.000000 0.000000 0.124014 0.000000 0.000000 0.101782 0.101782 0.101782 0.124014
N54540_Rob  0.124014 0.124014 0.000000 0.124014 0.124014 0.103402 0.103402 0.103402 0.000000
58559_WmVA  0.000000 0.000000 0.124014 0.000000 0.000000 0.101782 0.101782 0.101782 0.124014
70450_WmVA  0.000000 0.000000 0.124014 0.000000 0.000000 0.101782 0.101782 0.101782 0.124014
42370_WmNC  0.101782 0.101782 0.103402 0.101782 0.101782 0.000000 0.000000 0.000000 0.103402
55330_WmNC  0.101782 0.101782 0.103402 0.101782 0.101782 0.000000 0.000000 0.000000 0.103402
46246_Geor  0.101782 0.101782 0.103402 0.101782 0.101782 0.000000 0.000000 0.000000 0.103402
27814_Valn  0.124014 0.124014 0.000000 0.124014 0.124014 0.103402 0.103402 0.103402 0.000000
... etc.
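If you want to inspect a table like the one above outside of PHYLIP, a small reader is enough. The sketch below is my own illustration, not part of PHYLIP. It assumes a square matrix with the kit count on the first line and kit names that contain no spaces (strict PHYLIP names are fixed-width and may contain spaces, which this reader does not handle). Because it works token by token, it also tolerates rows that wrap onto a continuation line. The sample uses a three-kit subset of the values shown above.

```python
def parse_phylip_distances(text):
    """Parse a square PHYLIP distance matrix into (names, rows)."""
    tokens = text.split()
    n = int(tokens[0])
    names, rows = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        row = [float(t) for t in tokens[pos + 1 : pos + 1 + n]]
        if len(row) != n:
            raise ValueError("row for %s is too short" % names[-1])
        rows.append(row)
        pos += 1 + n
    return names, rows


def is_symmetric(rows, tol=1e-9):
    """Check that distance(i, j) equals distance(j, i) for every pair."""
    n = len(rows)
    return all(abs(rows[i][j] - rows[j][i]) <= tol
               for i in range(n) for j in range(n))


# The second row is deliberately wrapped onto a second line.
sample = """
    3
40777_WmVA 0.000000 0.101782 0.124014
42370_WmNC 0.101782 0.000000
0.103402
27814_Valn 0.124014 0.103402 0.000000
"""

names, rows = parse_phylip_distances(sample)
print(names)
print(is_symmetric(rows))
```

A symmetry check is a cheap way to catch a copy-and-paste slip before the matrix goes into Fitch, Kitsch, or Neighbor.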

Next, frequency values computed under each of the DNADIST distance options (F84, Jukes-Cantor, Kimura, LogDet) were used as input to the Fitch and Kitsch programs. (The Fitch and Kitsch output on frequency data are not shown here.) The remaining programs used the regular "ATGC" type format as input data.

DNAPARS then produced the following graph for this ATGC data:


















Which clearly is not correctly rooted. We know this because Valentine has the greatest Genetic Distance for this data. (The numbers there are frequency statistics, and not branch lengths). So, the DNAPARS tree was re-rooted to select Valentine (kit #27814) as the outgroup, which delivered this graph:



















Which makes better sense, as kit 27814 has the greatest Genetic Distance for this data.
Note that DNAPARS is a parsimony program, and has re-arranged Robert (kit N54540) between the kits that descend from the William HAM of Grayson County, Virginia. This corresponds to the Genetic Distance (above) and has similarities to Dean McGee's Utility output. But, Robert doesn't really belong there, as it is known that Robert does not descend from this particular William HAM. And, DNAPARS does not clearly separate out the paths between WmVA and WmNC. So, there is a slightly noticeable error with the use of DNAPARS.

Before we get to DNAML, let me run through the tree produced by the software program "TreePuzzle." FTDNA has indicated they may be basing new TMRCA estimates upon "alpha," and TreePuzzle was used to obtain "alpha" for HAM DNA Group #1.


Options used from TreePuzzle in order to obtain "alpha:"

"o" to select your outgroup (the one with the greatest Genetic Distance in this data is kit #27814)

"w" (model of heterogeneity) - this returns the number of Gamma rate categories, and it will automatically calculate "alpha"


TreePuzzle delivered the following values:

--------------------------------------------------------------------------------------------
Expected transition/transversion ratio: 2.18

Expected pyrimidine transition/purine transition ratio: 0.02


RATE HETEROGENEITY


Model of rate heterogeneity: Gamma distributed rates

Gamma distribution parameter alpha (estimated from data set): 0.03 (S.E. 0.02)

Number of Gamma rate categories: 8


--------------------------------------------------------------------------------------------
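For anyone unfamiliar with the terms in that output: a "transition" is a change between the two purines (A and G) or between the two pyrimidines (C and T), and a "transversion" is a change from a purine to a pyrimidine or the reverse. The sketch below simply counts the two kinds of difference between a pair of aligned sequences. It is only meant to illustrate the vocabulary. TreePuzzle's "expected" ratio is estimated from its substitution model, not from a raw count like this one, and the example sequences are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical sites and gaps carry no information here
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# Hypothetical sequences: A/G and C/T are transitions, T/A is a transversion.
ts, tv = count_ts_tv("AACCT", "GATCA")
print(ts, tv)
```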

This is the tree obtained from TreePuzzle:
















Which, of course, needs to be re-rooted about the kit with the greatest Genetic Distance, kit #27814, and this is probably not a fair depiction of where Robert should land on the tree. (The re-root is not shown here.)


At any rate, having obtained values for "alpha" and "transition/transversion ratio," I am now able to run DNAML with these parameters.


Using alpha = 0.03 and "T" = 2.18, DNAML delivered this tree:
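One detail worth checking when carrying "alpha" over from TreePuzzle: the PHYLIP documentation for DNAML describes its gamma rate option in terms of the coefficient of variation (CV) of the rates rather than alpha itself, and notes that CV = 1/sqrt(alpha). The small sketch below just does that arithmetic. The function name is my own, and whether the value needs converting depends on which prompt your version of the program shows, so treat this as a convenience and check the prompt.

```python
import math


def gamma_cv_from_alpha(alpha):
    """Coefficient of variation of a gamma distribution with shape alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / math.sqrt(alpha)


cv = gamma_cv_from_alpha(0.03)
print(round(cv, 2))  # about 5.77 for alpha = 0.03
```

A very small alpha such as 0.03 corresponds to a large CV, meaning that the mutation rates are estimated to vary a great deal from site to site.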



















DNAML has managed to retain a semblance of the grouping for the separate William HAMs of VA and NC, but George should be depicted with a greater genetic distance than is indicated here.


The last package evaluated here from PHYLIP is the DNAMLK program, which delivered this tree:


















Which is to say, after rooting the tree about kit 27814, DNAMLK retains the separation of the NC and VA William HAMs (two different individuals), but George is probably not best placed on this tree. George has a greater Genetic Distance than the two Williams, and is not known to be a descendant of the William HAM of Franklin County, NC. So, there is a slight problem using the options chosen (alpha = 0.03 and "T" = 2.18) with DNAMLK.


Finally, I wanted to include a quick look at PHYML for this article. Here's the tree that I obtained from PHYML:

















Which is a fairly decent tree (after rooting on 27814), but Robert is misplaced again here, as N54540 is located within the descendants of the William HAM of Grayson County, VA. Robert is not known to descend from this William (documentation shows his ancestors were in England at the time). But, in all fairness, the Genetic Distance for Robert (according to FTDNA counting methods) does not depict his descent quite correctly either.



In summary, it should be said that I have to work hard to get as good a tree as the one obtained via Dean McGee's Y-DNA Utility, as run through the Kitsch program. To date, these DNA programs require much more effort and much more knowledge of the various options available to each individual program.

I will look forward to seeing the new tree widget from Family Tree DNA, as it will be some relief to see an easier method of generating trees.


If you have a favorite tree building software program not reviewed here, feel free to comment on this Blog.












