A case study of computational models for Genetic Genealogists
June 28, 2008
Why is this TMRCA a vexing problem?
Well, the theory proposed by Bruce Walsh dives head first into a perplexing trail of complicated mathematical equations, molecular genetic theory, infinite alleles models, Poisson distributions, Bayesian posterior distributions, Bessel functions, differential mutation rates, and the like. For most Y-DNA administrators, this is a little difficult to apply to their family project(s). Bruce Walsh describes the calculations as an "upper boundary" (page 898) on the time back to a common ancestor shared by two individuals. That is, it was not to be taken as an exact calculation.
The Holy Grail of Genetic Genealogy:
Why would people want to know the Time to Most Recent Common Ancestor?
When you begin to use DNA for genealogy, the first thing you are able to determine is which family lines you relate to, and which family lines you do not relate to. That puts a whole new perspective on genealogy research. You begin to understand which family researchers you should be communicating with, and this also leads to which geographical areas should be of interest in your family line. That's because using genetics for TMRCA holds the promise of discovering migration paths going back thousands of years. That is, tracing your own line back several hundred years no longer seems like such a huge problem.
Knowing "How To" calculate the Time to Most Recent Common Ancestor suddenly becomes an interesting problem for genealogists to solve. This calculation should tell you how long it has been since your own line split from another line, which is information that genealogists may not have had before. This becomes very interesting as you share resources with other genealogists around the globe.
How TMRCA calculations work
For geneticists, this usually means breaking down probability distributions into a computer program that can be applied to the data. For genealogists, this usually means breaking down those complicated equations into something easy to calculate. For most folks, that means applying a mutation rate. This should be a simple calculation. But most people find that the "simple" calculation soon breaks down into some type of quagmire when it comes to applying it to their own line. The calculations quickly turn into an endless stream of details.
The idea is to apply the mutation rate to the Genetic Distance between your project participants.
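As a rough illustration of that idea, here is a minimal Python sketch of the simplest possible estimate. It is not the calculation used by any of the tools discussed in this article. It assumes that every marker shares one mutation rate, that the Genetic Distance equals the number of mutations that occurred on the two lines, and that a generation is 25 years; the function name and the rate of 0.004 are illustrative assumptions only.

```python
def tmrca_point_estimate(genetic_distance, n_markers, mutation_rate,
                         years_per_generation=25.0):
    """Rough TMRCA point estimate (a sketch, not any tool's actual method).

    Assumes a single per-marker, per-generation mutation rate and that the
    genetic distance counts every mutation on both lines of descent.
    """
    if n_markers <= 0 or mutation_rate <= 0:
        raise ValueError("n_markers and mutation_rate must be positive")
    # Both lines accumulate mutations since the common ancestor, so the
    # expected distance after t generations is 2 * t * n_markers * rate.
    generations = genetic_distance / (2.0 * n_markers * mutation_rate)
    return generations, generations * years_per_generation


if __name__ == "__main__":
    # 37 markers and an assumed (illustrative) rate of 0.004.
    for gd in (0, 1, 2):
        gens, years = tmrca_point_estimate(gd, n_markers=37, mutation_rate=0.004)
        print(f"GD {gd}: about {gens:.1f} generations ({years:.0f} years)")
```

Note that a Genetic Distance of 0 gives an estimate of zero generations under this formula, which is one reason the probability-based models are used instead of a bare division.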
The problem becomes which mutation rate should I apply? Which probability distribution is the correct one? What's the correct mutation model (stepwise, infinite alleles, etc.)? Should I apply individual marker mutation rates to my study? Or, for that matter, how do I figure out the marker mutation rates, and how is that calculated for TMRCA? And even perhaps, how do I know that I have calculated the correct Genetic Distance?
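To make the Genetic Distance question concrete, the following Python sketch counts the distance between two haplotypes in two common ways. The marker values are made up for illustration, the function name is hypothetical, and the sketch ignores the special handling that multi-copy markers usually need.

```python
def genetic_distance(hap_a, hap_b, model="stepwise"):
    """Genetic distance between two haplotypes given as lists of repeat counts.

    model="stepwise": sum of absolute differences at each marker.
    model="infinite_alleles": number of markers that differ at all.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same markers")
    if model == "stepwise":
        return sum(abs(a - b) for a, b in zip(hap_a, hap_b))
    if model == "infinite_alleles":
        return sum(1 for a, b in zip(hap_a, hap_b) if a != b)
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    # Hypothetical four-marker haplotypes, not real project data.
    kit_a = [13, 24, 14, 11]
    kit_b = [13, 25, 14, 13]
    print(genetic_distance(kit_a, kit_b, "stepwise"))          # 3
    print(genetic_distance(kit_a, kit_b, "infinite_alleles"))  # 2
```

The same pair of kits gives a distance of 3 under one model and 2 under the other, so the choice of model changes the TMRCA before any mutation rate is applied.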
Far too many questions there. Why can't we just point and click on some computer program? I suppose the main reason is that no satisfactory "easy to use" computer program has been written for a genealogist to use in their DNA studies. At this point in time, there is one software program that has attempted something along these lines. That would be Dean McGee's Y-DNA Utility. It at least generates some TMRCA data for you. Dean McGee has made his program available for public use.
One of my favorite uses is to pass the output from Dean McGee's Utility through the PHYLIP software package to produce a phylogenetic graph of the HAM DNA project. The graph is obtained using only the DNA data that we have collected for the project.
While I enjoy the output from Dean McGee's Utility, there is a minor problem that I find when applying it to my data. I am observing that individuals with genetic distances of 0, 1, and 2 all descend from a common ancestor who is estimated to have been born in 1755 (or, TMRCA ~= 255 years ago). For these individuals (kits 40777, 68140, 58559, and 70450), Dean McGee's Utility calculates the TMRCA out (for the HAM DNA Project) as 150, 325, and 400 years ago. That is, for each of the differing genetic distances, there are corresponding differing TMRCAs. However, because these individuals share the same ancestor, the results should show the same TMRCA.
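One way to see why a single ancestor can produce several different genetic distances is to treat the number of mutations as a Poisson count. The Python sketch below does this under loudly stated assumptions: 37 markers, an illustrative mutation rate of 0.004 per marker per generation, about 10 generations back to the ancestor (255 years at an assumed 25 years per generation), and a Genetic Distance that equals the mutation count. None of these numbers come from the HAM DNA Project calculations.

```python
import math


def expected_mutations(generations, n_markers, mutation_rate):
    """Expected mutations separating two lines that split `generations` ago."""
    # Factor of 2: mutations accumulate independently down both lines.
    return 2.0 * generations * n_markers * mutation_rate


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    # Assumed inputs: 10 generations, 37 markers, rate 0.004 (illustrative).
    lam = expected_mutations(generations=10, n_markers=37, mutation_rate=0.004)
    for gd in range(6):
        print(f"P(GD = {gd}) = {poisson_pmf(gd, lam):.3f}")
```

Under these assumptions no single genetic distance dominates, so a program that converts each distance separately into a TMRCA will report different times for cousins who in fact share one ancestor.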
The calculations are close to the genealogy information, but not quite exact. How do I resolve that? Can I obtain a TMRCA that actually corresponds better to the actual data? Does this have anything to do with Walsh's "upper boundary" theory? Is there something that Dean McGee's Utility should be doing differently? Are there calculations that I could do on my own data to improve the figure? Or, is it simply due to the number of markers tested? Page 909 of Walsh's paper suggests that per-generation data will become accurate when about 580 markers have been tested between two individuals. To my knowledge, only about 417 Y-DNA markers have been discovered, and most testing companies only offer packages of 100 markers or fewer. Therefore, is the lack of accuracy due to the lack of the number of markers tested?
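To get a feel for how the number of markers affects an "upper boundary", here is a simplified Python sketch. It is not Walsh's set of equations: it assumes a flat prior over 1 to 500 generations, an infinite alleles likelihood in which one marker still matches after t generations with probability exp(-2 * rate * t), and an illustrative mutation rate of 0.004. The function names are hypothetical.

```python
import math


def posterior_over_generations(n_markers, n_mismatches, mutation_rate,
                               max_generations=500):
    """Posterior over t = 1..max_generations under a flat prior (assumption)."""
    weights = []
    for t in range(1, max_generations + 1):
        # Probability that a single marker still matches after t generations
        # down both lines, under the infinite alleles model.
        p_match = math.exp(-2.0 * mutation_rate * t)
        likelihood = (p_match ** (n_markers - n_mismatches)
                      * (1.0 - p_match) ** n_mismatches)
        weights.append(likelihood)
    total = sum(weights)
    return [w / total for w in weights]


def upper_bound(posterior, level=0.95):
    """Smallest t whose cumulative posterior probability reaches `level`."""
    cumulative = 0.0
    for t, p in enumerate(posterior, start=1):
        cumulative += p
        if cumulative >= level:
            return t
    return len(posterior)


if __name__ == "__main__":
    # Exact matches on panels of increasing size, assumed rate 0.004.
    for n in (12, 37, 67):
        post = posterior_over_generations(n, 0, mutation_rate=0.004)
        print(f"{n} markers, exact match: 95% upper bound = "
              f"{upper_bound(post)} generations")
```

In this sketch the upper bound shrinks as more markers are compared, which is consistent with the idea that resolution improves with the number of markers tested.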
As for individuals making their own calculations, Bruce Walsh mentions how to modify the TMRCA equations for an individual haplotype (page 910 of his paper).
If I recall correctly, I believe one of Walsh's papers mentioned that Family Tree DNA has at least once considered generating this for their individual projects. But to date, FTDNA has only published information obtained from their data in very general terms, as it applies to the data as a whole.
Therefore, the question follows that if we are able to generate the same type of information from our own project (or haplotype), then will that data result in a more accurate estimation of TMRCA as it applies to our area of interest?
As of this writing, there are at least two individuals that have published thoughts along those lines (at least, to my knowledge). One is Charles Kerchner, who is tracking data across a multitude of projects. Charles is studying mutation rates for individual projects.
The other individual is David Roper, who shows how he has applied calculations for his own project on a very small scale. Mr. Roper has included some discussion of how to apply probabilities to genetic distance for an individual project, and he has posted the results of a simple example of "How To" calculate this out.
Comparison of Calculations:
I reviewed a case study of the calculations when applied to a special case in the HAM DNA Project. The study compares several of the more convenient models available on the internet, namely L. David Roper's probability model, Dean McGee's Y-DNA Utility, and the Lamarc software program.
For a review of the computations and results, see the PDF file posted to:
Many of the genetic genealogists are looking for specific results as they apply to their own project or group. They are not always content with generalized information, and the question of TMRCA is a topic of great interest especially if they can apply an equation to their own project(s).
It should be noted that a significant percentage of mutations have been reported for father-son pairs. I have not examined the effect of a baseline in the case of father-son pairs (mainly because I do not have that data at hand).
I should also note that there are any number of genetic software programs available, usually requiring input in the form of ATGC format. Almost none of the genetic programs report output that the family genealogist can easily use to compute a reasonable TMRCA. That is, most programs do not take the input data as we have it from FTDNA, nor do most programs report TMRCA output expressed in terms such as generations or years. (See Bill Jackson's MRCA Probability calculator for a given number of markers and generations, or Ann Turner's Mutation calculator.)
It should also be noted that other quantized methods could be employed to calculate meaningful results (even without the use of a mutation rate). However to date, I have not yet examined other quantized computational models that could be employed by geneticists for TMRCA.
Nor have I derived the appropriate equations (for a small population) from Bruce Walsh's paper. (The mathematics appear to be beyond my abilities.)
Finally, it should be remembered that this study is based upon a special (and rather unusual) case of 37 marker computations. That means, due to the low number of markers examined, probabilities are a large factor in the results. Last I checked, there have been about 417 Y-DNA markers discovered, bringing to mind Hammer's comment to Walsh (in Walsh's paper) that about 580 Y-DNA markers would be required to resolve time frames down to each generation. When we finally have good analysis on a large number of markers, then this type of discussion will probably become a non-issue.
Since most participants have not gone through the process of detailed calculations for per marker mutation rates (for individual projects), it has yet to be determined if all of the work is useful. It is interesting that Roper has detailed the probability calculations for individual markers, which differed slightly from the "standard" calculations using a "standard" mutation rate. This type of research could be useful for individual projects.
How were the models checked independently for the HAM DNA Project?
An attempt was made to calculate TMRCA from various independent computational models. First off, I am presuming that my methodology and math are reasonably correct. There are a number of opportunities to introduce errors in any of the above computations. For example, using the Lamarc program could introduce errors when the data from FTDNA is translated into ATGC format. Or, more simply, my math in this report may not have been applied appropriately.
I had enough data for a Lamarc run on my project for two groups, Group #1 and Group #2. Lamarc calculates out "Theta" values for the Group as a whole, and also Theta values per individual marker. This appears to generate slightly more accurate TMRCA estimates than Dean McGee's Y-DNA Utility. (Roughly, a 13% improvement in TMRCA estimates.) However, the Lamarc program can run for several days, and takes a great deal more effort than does Dean McGee's Utility.
Finally, it was found that adding a baseline value for no mutations doubles the accuracy of the Lamarc data. Of course, a baseline could also be applied to Dean McGee's utility, or to Roper's model as well. Therefore, one can only wonder how accurate Bruce Walsh's equations might be if a simple baseline were added to his equations.
At the moment, Dean McGee's Y-DNA Utility appears to provide the best calculations with considerably less effort. Also, the Y-DNA Utility can be adjusted for model (infinite alleles or hybrid), for probability, and mutation rate. I would have to conclude that Dean McGee's Utility is the best program available for ease of use, parameters offered, and of course, price.
Time to Most Recent Common Ancestor (TMRCA) a PDF file by Bruce Walsh (2001) of the University of Arizona.
Dienekes' Anthropology Blog regarding Y-DNA mutation rates of father-son pairs (posted in 2006).
Dean McGee's Y-DNA Utility
David Roper's discussion of how to apply probabilities to genetic distance for an individual project, and his posted results
PHYLIP software package
Mutation Rate calculations by Rosche and Foster (2006)
Lamarc - Likelihood Analysis with Metropolis Algorithm using Random Coalescence
Y-DNA Computational Models - my review of convenient computational models used to determine TMRCA from the Y-DNA data.
An arguable example has been posted here:
HAM DNA Group #02 Lamarc compute model TMRCA contrast to Dean McGee's Y-DNA Comparison Utility (July, 2008).