Thursday, February 1, 2018

Autosomal DNA Half Life Equation



In the previous article, I talked about creating a phylogenetic tree for autosomal data.
 
Andrew Millard suggested that because the Genetic Distance value of 1/cMs is not linear, I might try an exponential equation of the form -ln(cm/7200).

I tried that formula, and I was not really happy with the results. The Genetic Distance did not correspond well to cousin level, and the upper limit for data greater than 1 cM appeared to be about 8 (generations), even for the results for Lauren Shutt (M804xxx), who was not even a match to the group.
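To see where that ceiling comes from, here is a minimal sketch of the suggested formula in Python. The function name and the sample segment sizes are my own illustrations; only the form -ln(cm/7200) comes from the suggestion above.

```python
import math

def millard_distance(cm):
    """Suggested exponential form: -ln(cm / 7200)."""
    return -math.log(cm / 7200.0)

# Hypothetical segment sizes in cMs, chosen only to show the range of values.
for cm in (3600.0, 281.5, 7.0, 1.0):
    print(f"{cm:8.1f} cM -> {millard_distance(cm):.2f}")
```

For a 1 cM segment the value is ln(7200), a little under 9, so no segment of 1 cM or more can score higher than that, whether or not the two kits are really related.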

Andrew had also suggested that I not use kitsch, but I have not yet found a different program that works with Genetic Distance.

So, here, I have used the half life decay equation, typically used for radioactive decay, under the presumption that we are looking at Half Identical Regions (HIRs). The results mostly returned the expected cousin level (or number of generations).



Individual Segments:


Hypothetical Genetic Distance derived from half life decay rate: 

Nt = N0*e^(-kt)

Solving for t, Hypothetical Genetic Distance for an individual segment is given as: 

t = -1*ln(cMs*.0035524)/0.693147

where 'ln' is the natural log function and 'cMs' is the size of the segment in cMs. (In the spreadsheet version below, this is cell 'F1234', the value in column "F" on line (row) 1234.)

The largest segment for a mother/child or parent/child comparison (i.e., the most obvious starting point for the half life calculation) is initially about 281.5 cMs at current GEDMatch parameters, which gives:

a = 1/281.5 cMs = 0.0035524

and for parent/child half life decay:


k = ln(N0/Nt) = ln(2) = 0.693147
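Putting the pieces together, the two constants drop out of the standard decay equation as follows. This is just my reading of the algebra, assuming the decay form with a negative exponent, N0 as the 281.5 cM parent/child segment, Nt as the observed segment size in cMs, and one halving per generation:

```latex
\begin{align*}
N_t &= N_0 \, e^{-kt} \\
\frac{N_t}{N_0} &= e^{-kt} \\
t &= -\frac{1}{k}\,\ln\!\left(\frac{N_t}{N_0}\right)
   = -\frac{\ln(a \cdot \mathrm{cMs})}{k},
   \qquad a = \frac{1}{N_0} = \frac{1}{281.5} \approx 0.0035524 \\
k &= \ln 2 \approx 0.693147
   \quad\Longrightarrow\quad
   t = \log_2\!\left(\frac{281.5}{\mathrm{cMs}}\right)
\end{align*}
```

So a 281.5 cM segment gives t = 0, and each halving of the segment size adds one generation.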


Excel spreadsheet:

For a spreadsheet, the equation for individual segments should be something like this:

=-LN(F1234*0.0035524)/0.693147


 - Where 'LN' is the natural log function 
 - F1234 is the size of the segment in cMs in column "F" on line (row) 1234.
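For anyone working outside of a spreadsheet, the same arithmetic can be sketched in Python. The function name and the list of segment sizes are hypothetical; the two constants are the ones given above.

```python
import math

A = 0.0035524   # 1 / 281.5 cMs, largest parent/child segment (from the text)
K = 0.693147    # ln(2), one halving per generation (from the text)

def segment_generations(cm):
    """Same arithmetic as =-LN(F1234*0.0035524)/0.693147."""
    return -math.log(cm * A) / K

# Hypothetical segment sizes in cMs, standing in for column "F".
column_f = [140.75, 70.0, 35.2, 7.0]
for cm in column_f:
    print(f"{cm:7.2f} cMs -> {segment_generations(cm):.2f} generations")
```

A segment half the size of the parent/child maximum comes out very close to 1, and each further halving adds about one generation.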

Comparison to 23AndMe data:

The default setting of 500 SNPs does not usually generate sufficient total size for kits from the vendor 23AndMe. At that setting, the results will not be compatible with the individual segment equation, and may be poor. This is a GEDMatch vendor conversion issue. If you need to compare kits from 23AndMe (kits Mxxxx at GEDMatch), I would suggest lowering the SNP limit to 250 instead of the default 500 SNPs.

Below is the link to the data and phylogenetic tree resulting from the half life decay rate calculations. It is basically an update to the previous article, applying the autosomal half life decay equation.

Article:  Autosomal Half Life Equation

 
Autosomal Half Life Equation Largest Segment Table

Autosomal Half Life Equation Phylogenetic Tree


See below for more information about the "Endogamy Correction Factor."

Total SUMS in cMs:

The Total Sum of segments for a mother/child or parent/child comparison (i.e., the most obvious starting point for the half life calculation) is initially about 3585 cMs at current GEDMatch thresholds*, which gives:

a = 1/3585 cMs = 0.00027894

and for parent/child half life decay:


k = ln(N0/Nt) = ln(2) = 0.693147


Total Sums Excel spreadsheet:

For a spreadsheet, the equation for total sum of all segments should be something like this:

=-LN(F1234*0.00027894)/0.693147


 - Where 'LN' is the natural log function 
 - F1234 is the total sum of the segments in cMs in column "F" on line (row) 1234.
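One way to check whether the two versions of the equation agree for a given match is to run both side by side. This is only a sketch: the function names and the example match are hypothetical, and the constants are the ones given in this article.

```python
import math

K = 0.693147            # ln(2), from the text
A_SEGMENT = 0.0035524   # 1 / 281.5 cMs, largest segment version
A_TOTAL = 0.00027894    # 1 / 3585 cMs, total sum version

def half_life_generations(cm, a):
    """t = -ln(cm * a) / k, the half life form used in this article."""
    return -math.log(cm * a) / K

def compare_estimates(largest_cm, total_cm):
    """Return (segment estimate, total-sum estimate, difference)."""
    t_segment = half_life_generations(largest_cm, A_SEGMENT)
    t_total = half_life_generations(total_cm, A_TOTAL)
    return t_segment, t_total, t_total - t_segment

# Hypothetical match: largest segment 35.2 cMs, total sum 448.0 cMs.
t_seg, t_tot, diff = compare_estimates(35.2, 448.0)
print(f"largest segment: {t_seg:.2f}  total sum: {t_tot:.2f}  difference: {diff:+.2f}")
```

If the reported total is lower than it should be, the total-sum estimate would come out larger than the largest-segment estimate, so a consistently positive difference is one thing to look for.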


* You will need to modify the SNP limit, and lower the cMs threshold to 1 cM, in order to use the "Total SUM" version of the equation. I am not getting total sums consistent with this equation, due to either:

a) GEDMatch under-reporting matching segments. GEDMatch has apparently attempted to remove some "Excess IBD" areas, which will affect the total sum of segments.

b) A GEDMatch vendor conversion problem with the vendor 23AndMe.

Therefore, I have not been able to adequately test the "Total SUM" version of the Half Life equation.

Two things that you should know when using Total SUMs:

 a) The default setting of 500 SNPs does not usually generate sufficient total size for kits from the vendor 23AndMe. At that setting, the results will not be compatible with the individual segment equation, and may be poor. This is a GEDMatch vendor conversion issue. If you need to compare kits from 23AndMe (kits Mxxxx at GEDMatch), I would suggest lowering the SNP limit to 250 instead of the default 500 SNPs.


 b) The "Total SUM" natural log equation is not delivering adequate results from the data, as given by GEDMatch. Comparing my data from 2015 to today's data, it appears that GEDMatch has made an effort to NOT report some of the "Excess IBD" areas with the results. That will affect the Half Life equation for Total SUMs, because the sums are now under-reported at GEDMatch.

ISOGG message from CeCe Moore on Thu, Jun 10, 2010:

"Hi All,
    I had a very fascinating interview with Bennett today and wanted to share something very important that I learned since I know it has been debated here quite a bit. I asked him about the reliability of using the combined smaller segments in "Total cMs" to predict relatedness. He stated that FTDNA only uses "Total cMs" for relationship predictions of 2nd cousin once removed and closer. From that point on, they only use the longest blocks to predict relationship. The "Total cMs" is only included in FF summaries because it was something that many people were interested in seeing.
    CeCe"




NOTES: 

The 23AndMe vendor does not generate sufficient results for a valid comparison in many instances. Currently, 23AndMe will only generate one small segment, and does not supply enough information from vendor conversion for sites like GEDMatch to make a good comparison. Try lowering the limit for SNPs to 250 for the vendor 23AndMe.

Removal of the Excess IBD regions has about the same effect on individual segments as that of using the "Endogamy Correction Factor." Either will produce some error for various reasons. However, if the Excess IBD regions have been removed, then this will affect how the Total SUM version of Half Life equation works.

If you want to use the  "Endogamy Correction Factor" on the excess IBD segments instead of removing them:

- Endogamy Correction Factor:     [(100*cMs)/SNPs] 

t = -1*ln[(cMs*.0035524*100*cMs)/SNPs]/0.693147 


- where cMs is the size of the segment in cMs and SNPs is the number of SNPs in the segment
- for a size in cMs equal to 0, where the log is undefined, set t to 11 as an arbitrary upper limit
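The corrected equation and its upper limit can be sketched in Python as below. The function name is hypothetical, and returning the upper limit when the SNP count is 0 is my own assumption to avoid dividing by zero; the article only specifies the limit for a size of 0 cMs.

```python
import math

A = 0.0035524        # 1 / 281.5 cMs (from the text)
K = 0.693147         # ln(2) (from the text)
UPPER_LIMIT = 11.0   # arbitrary upper limit used when the size is 0 cMs

def corrected_generations(cm, snps):
    """t = -ln[(cMs * A * 100 * cMs) / SNPs] / K, with the size-0 fallback."""
    if cm <= 0 or snps <= 0:   # snps <= 0 guard is an assumption, not from the text
        return UPPER_LIMIT
    correction = (100.0 * cm) / snps   # Endogamy Correction Factor
    return -math.log(cm * A * correction) / K

# Hypothetical segments as (size in cMs, number of SNPs).
for cm, snps in [(10.0, 1000), (10.0, 2000), (0.0, 0)]:
    print(f"{cm:5.1f} cMs, {snps:5d} SNPs -> {corrected_generations(cm, snps):.2f}")
```

When a segment has exactly 100 SNPs per cM the factor is 1 and the result matches the uncorrected equation; segments with more SNPs per cM are pushed toward larger distances. Results for very small segments are not capped here, since the article only sets the limit for a size of 0.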
 



Updated 02/26/2018 arbitrary upper limit changed from 14 to 11, in order to avoid exponential results at the upper limit of phylogenetic trees.
Updated 02/26/2018 to add the equation for total sums in cMs and link to reference table.
Updated 02/17/2018 to add spreadsheet version of the equation.
Updated 03/27/2018 to add SNP parameters for Total SUM calculation and a note about 23andMe problems.
Updated 03/29/2018 Correct the reference regarding MyHeritage to 23AndMe (vendor indicated at GedMatch starting with an "M"). Note that Total SUMs is not giving adequate results. Added a quote from a public post by CeCe Moore from the ISOGG email list.
Updated 04/03/2018 Corrected to report that GEDMatch does not report out "Total SUMs" properly, due to an apparent removal of Excess IBD regions. Included equation for "Endogamy Correction Factor."




References:

HAM Group #1 Information

HAM Y-DNA Project Phylogenetic Tree

HAM Group #1 Initial Tiny Autosomal Segment Triad Study

ISOGG Autosomal DNA statistics

GEDMatch

FamilyTreeDNA

HAM DNA Project Dean McGee's Utility output

HAM DNA Project Y-DNA Results at HAM Country

HAM DNA Project at FTDNA

How to Read HAM DNA Phylograms (video)