Friday, May 4, 2018

Y-DNA STR Genetic Distance And The Probability of Error




A Brief Review of HAM DNA Group #1



This topic concerns whether or not Y-DNA is an adequate predictor of relationships. That is, are DNA matches using Y-DNA a good indicator of close relationships at up to 111 markers?

“Genetic Distance” is a term used to show how well one person's DNA matches when compared to another person's. A perfect “match” at 111 Y-DNA STR markers is a Genetic Distance of zero. Genetic Genealogists generally want to associate Genetic Distance with how closely related the lineages may be. Genetic Distance combined with the concept of Time to Most Recent Common Ancestor (TMRCA) should deliver an indication of when the lines converge.


However, projects are finding that Genetic Distance for Y-DNA STRs is not a good indication of how closely related two individuals may be. In many cases, it does provide a fairly reliable account of surname history.

This article was written because it came as a surprise to me when a certain Facebook group, the "Y-DNA - Applied Genealogy & Paternal Origins" group, apparently censored my comments regarding the Genetic Distance for Y-DNA.


I wanted to address what Genetic Distance means in terms a beginner could understand, as we had a little bit of a conversation about it.

Another Facebook group had also censored my comments regarding the DNA analysis by Law Enforcement in the recent “Golden State Killer” case.

To review a few examples, my line has the following Genetic Distance figures.

This first group connects at circa 1755 from Grayson County, Virginia. The values below are the Genetic Distance to me:

Jimmy 5th cousin, once removed GD 0 - 111 markers
Julian 5th cousin GD 1 - 37 markers
Gene 5th cousin GD 4 - 37 markers
Bill 5th cousin GD 2 - 37 markers
Steven 5th cousin GD 0 - 12 markers
Brick 5th cousin GD 0 - 67 markers



Most listed above descend from John Ham (1780-1850) of Grayson County, Virginia:

Julian and Jimmy are 3rd cousins and have a Genetic Distance of 1 on 37 markers.
Steven and Bill are 3rd cousins, with a Genetic Distance of 1 on 12 markers

Gene is 4th cousin to Jimmy with a Genetic Distance of 4 on 37 markers
Gene is 4th cousin to Julian with a Genetic Distance of 4 on 37 markers
Gene is 4th cousin to Bill with a Genetic Distance of 5 on 37 markers
Gene is 4th cousin to Steven with a Genetic Distance of 0 on 12 markers


Brick is about 5th cousin from everybody else above, as he descends from Thomas HAM of Ashe County, NC (1795-1865)

Brick is 5th cousin to Jimmy with a Genetic Distance of 0 on 67 markers
Brick is 5th cousin to Julian with a Genetic Distance of 1 on 37 markers
Brick is 5th cousin to Gene with a Genetic Distance of 4 on 37 markers
Brick is 5th cousin to Bill with a Genetic Distance of 2 on 37 markers
Brick is 5th cousin to Steven with a Genetic Distance of 0 on 12 markers

Brick is 5th cousin to Dave, as previously mentioned.


At the Greater Than 5th cousin level (to me), the following three are from a line in another geographic area, Franklin County, North Carolina, and connect to the above prior to 1755:

Marvin Greater Than 5th cousin GD 5 - 111 markers
Leonard GT 5th cousin GD 4 - 37 markers
James GT 5th cousin GD 2 - 37 markers

That is, James from a completely different line has a Genetic Distance “as close” or closer than two of my actual 5th cousins. The three above are from the same Franklin County line.
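The comparison above can be checked mechanically. The short Python sketch below copies the 37-marker Genetic Distance values listed in this article and lists the known 5th cousins whose Genetic Distance to me is as large as, or larger than, that of James. The dictionary and function names are only illustrative.

```python
# Genetic Distance (GD) to me at 37 markers, copied from the lists above.
FIFTH_COUSINS_37 = {"Julian": 1, "Gene": 4, "Bill": 2}
FRANKLIN_COUNTY_37 = {"Leonard": 4, "James": 2}


def cousins_not_closer_than(gd, cousins):
    """Return the known cousins whose GD is as large as or larger than gd."""
    return [name for name, cousin_gd in cousins.items() if cousin_gd >= gd]


not_closer = cousins_not_closer_than(FRANKLIN_COUNTY_37["James"], FIFTH_COUSINS_37)
print(not_closer)
```

Only the 37-marker comparisons are included, so that like is compared with like.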

Above, between themselves, Marvin & Leonard have a Genetic Distance of 1 on 37 markers.
Between Marvin & James, they have a Genetic Distance of 1 on 37 markers.

Marvin and Leonard descend from Robert Solomon Ham, and appear to be about 2nd cousins. James descends from Francis (Frank) James Hamm, and appears to be about 3rd cousin to Marvin and Leonard.

Continuing on with the Genetic Distance to me:

Jon GT 6th cousin (Somerset, England) GD 5 - 111 markers
Tony GT 5th cousin (Somerset, England) GD 1 - 37 markers

[Tony and Jon have a Genetic Distance of 3 between the two of them.]

Tony has a Most Recent Common Ancestor in England and has a closer Genetic Distance to me than at least three of my 5th cousins, although we know Tony has to relate further back, as my line has been in this country since before 1783, and Tony’s genealogical information shows no connection (Tony’s line arrived in the U.S. circa 1850).

Michael Gene, a Greater Than 5th cousin (Patrick County, Virginia), and I have a GD of 2 on 111 markers.

That is, Michael Gene is from a completely different line from a different geographic area and has a closer Genetic Distance than two of my 5th cousins (at 111 markers), but we know that we must relate further back than 5th cousins from the genealogical information.

Occasionally, FTDNA has made changes that caused these values to jump around a bit. At one time, FTDNA had Michael Gene and me at a Genetic Distance of one on 111 markers.


The guidance given by Family Tree DNA on 111 markers says that at 50% confidence level, an exact match on 111 markers should be within 2 generations. That is, 2 generations or less. Obviously, if I am an exact match to my 5th cousin Jimmy at 111 markers, then we certainly do NOT want to use 50% confidence levels.

Fortunately, FTDNA provides other confidence levels for 111 markers: 90%, 95%, and 99%.

The 90% level also fails for the GD of zero between my 5th cousin Jimmy and myself. At 90% confidence level, the table says that Jimmy & I should be 4th cousins or less.

It is only when we reach the 95% or 99% confidence level that FTDNA returns a valid TMRCA for a Genetic Distance of 0 on 111 markers of at least 5 generations. Since we are 5th cousins, Jimmy and I would be the 6th generation, meaning only the 99% confidence level actually meets that criterion.

If you are using Dean McGee's Y-Utility, you will want to use the highest probability for general purpose use.

Anybody looking at Genetic Distance should be thinking in terms of “X” generations OR LESS. For example, I typically refer to an exact match at 37 markers as “Any time after 1600,” as Ron Blevins has reported seeing that in his project.
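To illustrate why “X generations OR LESS” is the safer way to read these tables, here is a deliberately simplified Python sketch. It is not FTDNA's calculation. It assumes an exact match (GD 0), a single average mutation rate for every marker (the default of 0.004 per marker per generation is only a placeholder assumption), and no prior preference for any particular number of generations. Under those assumptions, the probability that the common ancestor lived within g generations only ever accumulates as g grows, so any single figure is a cumulative bound, not a point estimate.

```python
def prob_mrca_within(generations, markers, mu=0.004):
    """Probability that the common ancestor lived within `generations`
    generations, given an exact match (GD 0) on `markers` markers.

    Simplified model (assumptions, not FTDNA's method):
      - every marker mutates at the same rate `mu` per generation
      - both lines contribute, so there are 2 * generations transmissions
      - all numbers of generations are equally likely beforehand
    """
    # Chance that no marker mutates during one generation on both lines.
    q = (1.0 - mu) ** (2 * markers)
    return 1.0 - q ** generations


for g in (2, 4, 6, 8):
    print(g, round(prob_mrca_within(g, markers=111), 3))
```

The printed values depend entirely on the assumed mutation rate, so they should not be compared directly with the FTDNA tables.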

Another genetic genealogist who has mentioned how unreliable Genetic Distance may be in determining relationships is Jim Owston, in his 2014 article “Is Genetic Distance an Adequate Predictor of Relationships?” (updated Jan 23, 2018).

Jim Owston mentions:

“Therefore, it is unlikely that two people with a GD=4 are close relatives; however, a GD=0 could represent numerous relationships from very close relatives to those who are very distant, as a genetic distance of zero is all over the road.”

Jim Owston has information back to 13th cousins, where 12th cousins or more are estimated.

We have few in the HAM DNA Project that can claim accurate documentation that far back. However, the Grayson County group does have a similar number of known 5th cousins who have tested with the Y-DNA.

In comparison, Jim Owston lists roughly eight 5th cousins, and I list roughly eleven 5th cousin relationships above, among 7 kits. Jim has roughly eleven 4th cousins listed, and I have five 4th cousin relationships listed above. Otherwise, Jim Owston has multiple dozens of relationships listed at 8th cousins or more.

Jim Owston now has 253 relationships at 43 markers and 153 relationships at 37 markers on record, compared to 59 kits in the HAM DNA Project, and 17 autosomal kits in the HAM DNA Group #1 study. I do not know offhand how many relationships that represents for HAM Group #1, but a reasonable guess would be roughly two dozen. That is tiny in comparison to Jim Owston's study.

In an effort to obtain a better TMRCA, Jim Owston is considering a study of the BigY results (the BigY-500 product provides over 500 STRs, and is largely based on SNPs).

For an improved TMRCA, I have been looking at autosomal results. There are 16 kits in Group #1 now participating in the autosomal study, with at least 7 kits from the Grayson County line. My initial autosomal DNA studies indicate that the autosomal DNA may deliver better TMRCA results than does up to 111 Y-DNA STR markers.

However, for the autosomal DNA, the immediate issues include the apparent removal of “Excess IBD” segments from GEDMatch reports, vendor conversion issues (such as 23andMe conversion issues), slight differences in starting locations when compared to the vendor, how to verify data that falls below the vendor’s lowest threshold, privacy issues, etc. It is not yet known if the autosomal DNA will hold any accuracy when taken to the 13th cousin level that Jim Owston has in his study. According to the Autosomal Half Life Equation, the thresholds would have to be taken down to about 0.01 cMs in order to deliver 14th cousin relationships. GEDMatch cannot be set lower than 1 cM (about 8th or 9th cousin level, according to the Half Life Equation). If concepts such as the “Endogamy Factor” could be considered to be a valid evaluation, then perhaps the lowest 1 cM threshold at GEDMatch may deliver results even further back than 9th cousin.
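The threshold figures above can be reproduced with a short Python sketch of the Autosomal Half Life Equation, using the 281.5 cMs largest-segment value from that article. The function names are my own, and the sketch follows the article in reading the result t as the cousin level (number of generations).

```python
import math

MAX_SEGMENT_CM = 281.5  # largest segment used in the Half Life Equation article


def generations_for_segment(cm):
    """Half Life Equation: generations implied by a segment of `cm` cMs."""
    return -math.log(cm / MAX_SEGMENT_CM) / math.log(2)


def segment_for_generations(t):
    """Inverse: segment size in cMs expected after `t` halvings."""
    return MAX_SEGMENT_CM * 2.0 ** (-t)


print(generations_for_segment(1.0))   # level reached at the 1 cM GEDMatch floor
print(segment_for_generations(14))    # segment size needed for the 14th level
```

The first value lands between 8 and 9, and the second between 0.01 and 0.02 cMs, in line with the estimates in the paragraph above.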

Related Topics:

Y-DNA Mutation Rates – A Case Study

Y-DNA Project Grouping with Genetic Distance

Tree Building for Y-DNA Surname Projects

HAM DNA Output From Dean McGee’s Y-DNA Utility

Is Genetic Distance an Adequate Predictor of Relationships?

Autosomal Small Segment Triangulation HAM DNA Group #1

Autosomal Small Segment Phylogenetic Tree

Autosomal DNA Half Life Equation

FTDNA's Interpreting Genetic Distance for 37 Markers

FTDNA's Interpreting Genetic Distance for 67 Markers

FTDNA's Interpreting Genetic Distance for 111 Markers

FTDNA BigY-500 product

GEDMatch






Thursday, February 1, 2018

Autosomal DNA Half Life Equation


Feb 1, 2018


In the previous article, I had talked about creating a phylogenetic tree for autosomal data.
 
Andrew Millard suggested that because the Genetic Distance value of 1/cMs is not linear, I might try an exponential equation of the form -ln(cm/7200).

I tried that formula, and I was not really happy with the results. The Genetic Distance did not correspond well to cousin level, and the upper limit for data greater than 1 cM appeared to be about 8 (generations), even for the results for Lauren Shutt (M804xxx), who was not even a match to the group.

Andrew had also suggested that I not use kitsch, but I have not yet found a different program that works with Genetic Distance.
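The ceiling on that formula is easy to see in a small Python sketch. This is my reading of the suggestion, taking cm as the segment size in cMs; the function name is only illustrative.

```python
import math


def millard_distance(cm):
    """Suggested exponential form: -ln(cm / 7200)."""
    return -math.log(cm / 7200.0)


# The value at the 1 cM floor is the largest the formula can return
# for any segment of 1 cM or more.
print(millard_distance(1.0))
print(millard_distance(7.0))
```

For any segment of 1 cM or larger the result stays below 9, which is why every comparison, matching or not, ends up under roughly the same upper limit.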

So, here, I have used the equation for the half life decay rate, typically used in radioactive decay under the presumption that we are looking at Half Identical Regions (HIRs). The results mostly returned the expected cousin level (or number of generations).



Individual Segments:


Hypothetical Genetic Distance derived from half life decay rate: 

Nt = N0*e^(-k*t)

Solving for t, the Hypothetical Genetic Distance for an individual segment is given as: 

t = -1*ln(cMs*.0035524)/0.693147

where 'ln' is the natural log function and 'cMs' is the size of the segment in centimorgans.

For individual segments, the largest segment (i.e., the most obvious case, a mother/child or other parent/child comparison) is initially at 281.5 cMs at current GEDMatch parameters:

a = 1/281.5 cMs = 0.0035524

and for parent/child half life decay:


k = ln(N0/Nt) = ln(2) = 0.693147   (one generation halves the segment, so N0/Nt = 2)
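The individual-segment calculation can be sketched in Python as follows. The constant and function names are mine; the values 281.5 cMs and ln(2) are the ones given above.

```python
import math

MAX_CM = 281.5      # largest segment at current GEDMatch parameters
A = 1.0 / MAX_CM    # scaling constant a (about 0.0035524)
K = math.log(2)     # decay constant k (about 0.693147)


def segment_generations(cm):
    """Hypothetical Genetic Distance t for one segment of `cm` cMs."""
    return -math.log(cm * A) / K


print(segment_generations(281.5))   # full-size segment: zero halvings
print(segment_generations(140.75))  # half of the maximum: one halving
print(segment_generations(24.2))    # a 24.2 cMs segment
```

A segment at the maximum size returns 0, and each halving of the segment size adds 1 to the result.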


Excel spreadsheet:

For a spreadsheet, the equation for individual segments should be something like this:

=-LN(F1234*0.0035524)/0.693147


 - Where 'LN' is the natural log function 
 - F1234 is the size of the segment in cMs in column "F" on line (row) 1234.

Comparison to 23AndMe data:

The default setting of 500 SNPs does not usually generate sufficient total size for the vendor 23AndMe. At that default, the results will not be compatible with the individual segment equation, and may generate poor results. This is a GEDMatch vendor conversion issue. If you need to compare kits from 23AndMe (kits Mxxxx at GEDMatch), I would suggest lowering the SNP limit to 250 instead of the default 500 SNPs.

Below is the link to the data and phylogenetic tree resulting from the use of the half life decay rate calculations. Basically, an update to the previous article by applying the autosomal half life decay equation.

Article:  Autosomal Half Life Equation

 
Autosomal Half Life Equation Largest Segment Table

Autosomal Half Life Equation Phylogenetic Tree


See below for more information about the "Endogamy Correction Factor."

Total SUMS in cMs:

The Total Sum of segments (i.e., the most obvious case, a mother/child or other parent/child comparison) is initially at about 3585 cMs at current GEDMatch thresholds*:

a = 1/3585 cMs = 0.00027894

and for parent/child half life decay:


k = ln(N0/Nt) = ln(2) = 0.693147   (one generation halves the total, so N0/Nt = 2)


Total Sums Excel spreadsheet:

For a spreadsheet, the equation for total sum of all segments should be something like this:

=-LN(F1234*0.00027894)/0.693147


 - Where 'LN' is the natural log function 
 - F1234 is the total sum of the segments in cMs in column "F" on line (row) 1234.
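For anyone not using a spreadsheet, the Total SUM version can be sketched in Python. The function name and the sample segment list are hypothetical; 3585 cMs is the parent/child total given above.

```python
import math

TOTAL_CM = 3585.0  # parent/child total at GEDMatch thresholds, per the text


def total_sum_generations(segments_cm):
    """Apply the Half Life equation to the sum of all shared segments."""
    total = sum(segments_cm)
    return -math.log(total / TOTAL_CM) / math.log(2)


# Made-up segment sizes that happen to add up to half of the total.
example_segments = [1000.0, 792.5]
print(total_sum_generations(example_segments))
```

The only difference from the individual segment equation is that the segments are summed first and the larger constant is used.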


* You will need to modify the SNP limit and lower the cMs threshold to 1 cM in order to use the "Total SUM" version of the equation. I am not getting total sums consistent with this equation due to either:

a) GEDMatch under-reports matching segments. GEDMatch has apparently attempted to remove some "Excess IBD" areas, which will affect the total sum of segments.

b) GEDMatch has a vendor conversion problem with vendor 23AndMe.

Therefore, I have not been able to adequately test the "Total SUM" version of the Half Life equation.

Two things that you should know when using Total SUMs:

 a) The default setting of 500 SNPs does not usually generate sufficient total size for the vendor 23AndMe. At that default, the results will not be compatible with the individual segment equation, and may generate poor results. This is a GEDMatch vendor conversion issue. If you need to compare kits from 23AndMe (kit Mxxxx at GEDMatch), I would suggest lowering the SNP limit to 250 instead of the default 500 SNPs.


 b) The "Total SUM" natural log equation is not delivering adequate results from the data, as given by GEDMatch.  By comparing  my data from 2015 to today's data, it appears that GEDMatch has made an effort to NOT report some of the "Excess IBD" areas with the results. That will affect the Half Life equation for Total SUMs, because the sums are now under reported at GEDMatch.

ISOGG message from CeCe Moore on Thu, Jun 10, 2010:

"Hi All,
    I had a very fascinating interview with Bennett today and wanted to share something very important that I learned since I know it has been debated here quite a bit. I asked him about the reliability of using the combined smaller segments in "Total cMs" to predict relatedness. He stated that FTDNA only uses "Total cMs" for relationship predictions of 2nd cousin once removed and closer. From that point on, they only use the longest blocks to predict relationship. The "Total cMs" is only included in FF summaries because it was something that many people were interested in seeing.
    CeCe"



Per Chromosome Maximum


The equation can be customized per chromosome by using the maximum value of centimorgans per chromosome. Ann Turner has explained that you can get this by comparison to yourself. This can be done programmatically. If you are not a whiz on a spreadsheet, you can create a column for these values for each chromosome, then refer the Half Life equation to the "max cMs" column, as such:

=-1*(LN(F5/J5))/LN(2)

where F5 is the segment value in centimorgans in column F on line 5
and J5 is the maximum cMs on that chromosome in column J on line 5


Chromosome   FTDNA [A]   GEDMatch [B]   23andMe [C]

  1          267.21      281.5          284
  2          253.06      263.7          269
  3          219.1       224.2          223
  4          206.75      214.4          214
  5          199.6       209.3          204
  6          189.14      194.1          192
  7          180.79      187.0          187
  8          161.76      169.2          168
  9          160.36      167.2          166
 10          176.25      174.1          181
 11          155.78      161.1          158
 12          167.39      176.0          175
 13          126.48      131.9          126
 14          111.66      125.2          119
 15          118.07      132.4          141
 16          131.90      133.8          134
 17          124.33      137.3          128
 18          119.39      129.5          117
 19           99.07      111.1          108
 20          104.20      114.8          108
 21           58.99       70.1           62.7
 22           53.03       79.1           72.7
 
Warning: Chromosomes 21 and 22 have fairly low maximum values, and may require a different treatment because sizes can get large quickly, as in an 'Excess IBD' region or a 'Recombination' area. The idea with using individual chromosome maximum cMs is to apply it to all, then take the average.
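The per-chromosome version, applied to every segment and then averaged, can be sketched in Python. The maximum values are the GEDMatch column from the table above; the function names and the sample segment list are hypothetical.

```python
import math

# GEDMatch maximum cMs per chromosome, copied from the table above.
GEDMATCH_MAX_CM = {
    1: 281.5, 2: 263.7, 3: 224.2, 4: 214.4, 5: 209.3, 6: 194.1,
    7: 187.0, 8: 169.2, 9: 167.2, 10: 174.1, 11: 161.1, 12: 176.0,
    13: 131.9, 14: 125.2, 15: 132.4, 16: 133.8, 17: 137.3, 18: 129.5,
    19: 111.1, 20: 114.8, 21: 70.1, 22: 79.1,
}


def chromosome_generations(chromosome, cm):
    """Half Life equation scaled to the maximum cMs of one chromosome."""
    return -math.log(cm / GEDMATCH_MAX_CM[chromosome]) / math.log(2)


def average_generations(segments):
    """Average the per-chromosome result over (chromosome, cMs) pairs."""
    values = [chromosome_generations(c, cm) for c, cm in segments]
    return sum(values) / len(values)


# Made-up segment list, only to show the call.
print(average_generations([(12, 24.2), (1, 10.0)]))
```

This mirrors the spreadsheet formula =-1*(LN(F5/J5))/LN(2), with the dictionary standing in for the "max cMs" column.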


NOTES: 

The 23AndMe vendor does not generate sufficient results for a valid comparison in many instances. Currently, 23AndMe will only generate one small segment, and does not supply enough information from vendor conversion for sites like GEDMatch to make a good comparison. Try lowering the limit for SNPs to 250 for the vendor 23AndMe.

Removal of the Excess IBD regions has about the same effect on individual segments as that of using the "Endogamy Correction Factor." Either will produce some error for various reasons. However, if the Excess IBD regions have been removed, then this will affect how the Total SUM version of Half Life equation works.

If you want to use the  "Endogamy Correction Factor" on the excess IBD segments instead of removing them:

- Endogamy Correction Factor:     [(100*cMs)/SNPs] 

t = -1*ln[(cMs*.0035524*100*cMs)/SNPs]/0.693147 


- for Size in cMs and number of SNPs
- for Size in cMs EQ 0: set to 11 for an arbitrary upper limit
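The corrected equation, including the arbitrary upper limit, can be sketched in Python as follows. The function name is my own, and the sketch assumes the SNP count is greater than zero.

```python
import math

A = 0.0035524     # 1 / 281.5 cMs
K = 0.693147      # ln(2)
UPPER_LIMIT = 11  # arbitrary upper limit used when the size is 0 cMs


def endogamy_corrected_generations(cm, snps):
    """Half Life equation with the Endogamy Correction Factor applied."""
    if cm == 0:
        return UPPER_LIMIT
    factor = (100.0 * cm) / snps  # Endogamy Correction Factor
    return -math.log(cm * A * factor) / K


print(endogamy_corrected_generations(10.0, 1000))  # factor is exactly 1 here
print(endogamy_corrected_generations(0, 500))      # falls back to the limit
```

When a segment has 100 SNPs per cM the factor is 1 and the result equals the uncorrected individual segment equation; denser segments (more SNPs per cM) are pushed further back, and sparser ones are pulled closer.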
 




Updated 10/20/2018 to include table of maximum cMs per chromosome.
Updated 02/26/2018 arbitrary upper limit changed from 14 to 11, in order to avoid exponential results at the upper limit of phylogenetic trees.
Updated 02/26/2018 to add the equation for total sums in cMs and link to reference table.
Updated 02/17/2018 to add spreadsheet version of the equation.
Updated 03/27/2018 to add SNP parameters for Total SUM calculation and a note about 23andMe problems.
Updated 03/29/2018 Correct the reference regarding MyHeritage to 23AndMe (vendor indicated at GedMatch starting with an "M"). Note that Total SUMs is not giving adequate results. Added a quote from a public post by CeCe Moore from the ISOGG email list.
Updated 04/03/2018 Corrected to report that GEDMatch does not report out "Total SUMs" properly, due to an apparent removal of Excess IBD regions. Included equation for "Endogamy Correction Factor."




References:


HAM Group #1 Information

HAM Y-DNA Project Phylogenetic Tree

HAM Group #1 Initial Tiny Autosomal Segment Triad Study 


ISOGG Autosomal DNA statistics


Maximum Values for Centimorgans

cM Values Per Chromosome  (table by Ann Turner)

GEDMatch


FamilyTreeDNA

HAM DNA Project Dean McGee's Utility output

HAM DNA Project Y-DNA Results at HAM Country

HAM DNA Project at FTDNA

How to Read HAM DNA Phylograms
    (video)





  
  
 

Thursday, December 28, 2017

Autosomal Small Segment Phylogenetic Tree


 

Small Segment Triangulation
HAM Y-DNA Group #1


Taking some inspiration from Dean McGee, I put together a phylogenetic tree of the HAM autosomal DNA, using tiny thresholds and the largest shared segments of these small segments. For this one, these are not triads; they are just the largest of the small shared segments.
 
Basically, the autosomal DNA testing companies set a threshold that is too high to see distant matches,
meaning they usually do not show much beyond 5th cousins (for the
autosomal DNA). As most of you know, the Y-DNA goes much further back.
For Family Tree DNA and GEDMatch the threshold is set at 7 cMs.
 
Folks in our HAM Y-DNA Group #1 upload their autosomal DNA to GEDMatch, and I have lowered the thresholds by using GEDMatch utilities. The results from the largest shared segments roughly follow the Y-DNA, except that the autosomal DNA has totally separated out the line of our William HAM, Sr. of Grayson County.
 
For this study, I was not using triads, but simply the largest shared autosomal segments. The kits are mostly from either FTDNA or Ancestry.
 
We have enough participants from Grayson County to almost make out his
three sons (John HAM, William HAM, Jr. and Thomas HAM).
 
If you hover the mouse over the tables (following the link below), it should show the largest shared chromosome segment and its location. For example, hovering over the row for A274xxx (Roxanne) at her largest segment for T133xxx (Mary Ann Talbott) shows the largest shared segment to be:

Chr     Start Location      End Location   Centimorgans (cM)

12        123,996,713        130,079,716         24.2

Moving the mouse to the right for A274xxx (Roxanne) and T074xxx (Wendell
Seaborne), it shows the largest shared segment to be:

Chr      Start Location      End Location   Centimorgans (cM)

12        123,996,713        128,587,277        18.1

Which is pretty much the same segment, meaning that Roxanne, Mary Ann,
and Wendell share the same largest tiny segment from the same ancestor.
The idea is to figure out which ancestor is at that location on that
chromosome. 
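The "pretty much the same segment" check can be done by comparing start and end locations. Below is a minimal Python sketch using the two chromosome 12 segments listed above; the function and variable names are only illustrative.

```python
def shared_overlap(seg_a, seg_b):
    """Return the (start, end) overlap of two segments on the same
    chromosome, or None if they do not overlap."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return (start, end) if start < end else None


# Chromosome 12 locations copied from the two tables above.
roxanne_mary_ann = (123_996_713, 130_079_716)
roxanne_wendell = (123_996_713, 128_587_277)

overlap = shared_overlap(roxanne_mary_ann, roxanne_wendell)
print(overlap)
```

Here the second segment falls entirely inside the first. The check only compares locations; it does not by itself confirm that Mary Ann and Wendell match each other on that stretch.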
 
We also see the LOVIN NPE appears to be out of the Amelia County, VA HAM line.
 
We have no Y-DNA from Amelia County, just autosomal DNA. My guess is that his ancestor died in war and he was adopted. His line is more recently from Wayne
County, NC (from about 1800), and he does not match the Y-DNA of Wayne
County HAM lines.

Also, it looks like Amelia Co. and Patrick County, VA HAM lines split off from the Somerset HAM line earlier, and the Ashe County HAM line split from the Somerset HAM line later.
 
 
Group #1 Largest Shared Matches to Small Autosomal DNA Segments with Phylogenetic Tree
 
 
 
HAM Group 1 Autosomal DNA Phylogenetic Tree
Update Jan 31, 2018:
 
The exponential Half Life decay equation for Genetic Distance in this article was updated to show the resulting Genetic Distance and phylogenetic tree. 


References:

Autosomal DNA Half Life Equation

HAM Group #1 Information

HAM Y-DNA Project Phylogenetic Tree

HAM Group #1 Initial Tiny Autosomal Segment Triad Study


GedMatch 

FamilyTreeDNA

HAM DNA Project Dean McGee's Utility output

HAM DNA Project Y-DNA Results at HAM Country

HAM DNA Project at FTDNA
How to Read HAM DNA Phylograms    (video)