Showing posts with label Autosomal DNA Cousin Calculator. Show all posts

Thursday, September 20, 2018

Ancient DNA Clovis Anzick and HAM DNA Group #1





Because of the destroyed or missing records in Virginia, I had been working on the autosomal Half Life equation in order to tie our group to Somerset by use of autosomal DNA. Previously, we had seen that two kits from Somerset, England are a match to HAM DNA Group #1 (I1-M253).

I had recently noticed that the Half Life equation was throwing off errors, that is, variation from what one would expect the equation to return. This was particularly troublesome when the SNP density ratio was between 1.0 and 1.5, where RATIO = SNPs/(100*cMs).

So, when I ran across an article on Ancient DNA by Roberta Estes, I became curious as to what the Half Life equation might look like when used on Ancient DNA.

"Analyzing the Native American Clovis Anzick Ancient Results" DNAeXplained – Genetic Genealogy


Roberta had been talking to Felix Chandrakumar about the Clovis results that had been uploaded to GEDMatch, and Roberta had noticed that the Clovis upload was matching living people. She had found 1466 matches to Clovis at GEDMatch above the 7 cM level.

This is a stunning result. For a little background on the Half Life equation, you can find its limit by plugging in "1 cM," which shows that the equation is built to display a result of about 8th cousin level (or 9 generations) for a shared segment of 1 cM in size.

   Half Life = - LN(1/281.5)/.693147


   Half Life = 8.1

For the 11th cousin level, it needs a segment size of 0.1 cMs.
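As a quick check on those two figures, the equation can also be run in reverse to ask what segment size corresponds to a given Half Life value. This is only a minimal sketch in Python; the function name is made up for illustration, and it assumes the 281.5 cM starting value used above.

```python
import math

MAX_SEGMENT_CMS = 281.5  # starting segment size used in the equation above


def segment_size_for_generations(t):
    """Invert the Half Life equation: segment size in cMs after t halvings."""
    return MAX_SEGMENT_CMS * math.exp(-math.log(2) * t)


for t in (8.1, 11):
    print(t, round(segment_size_for_generations(t), 2))
```

Run this way, a Half Life of 8.1 comes back as roughly 1 cM, and a Half Life of 11 comes back as a little over 0.1 cMs, in line with the figures above.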

Clearly, if the Clovis sample is 12,500 years old and is matching living people at 7 cMs and above, then the Half Life equation is useless in its current form.

Now, the usual argument might be that these Clovis samples are Identical by State (IBS), and not Identical by Descent (IBD). For example, the current ISOGG statistics show that the smallest shared segment equivalent to 5th cousins is 3.32 cMs.

See:

"Autosomal DNA statistics"   at ISOGG


Or see also:

"Cousin statistics"   at ISOGG


See the scientific paper for the expected (i.e., theoretical) number of cMs at the 5th cousin level:

"Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples"

   
 
Table 1. Expected extent of IBD and number of cousins for 1st–10th degrees of cousinship.

https://doi.org/10.1371/journal.pone.0034267.t002



Also, it is instructive to note that a good quality 10 cM segment was extracted from the Altai Neanderthal who lived 50,000 years ago in Siberia.


 - see "The complete genome sequence of a Neanderthal from the Altai Mountains"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4031459/



"To estimate the extent of their relatedness, we scanned the genome for 1Mb regions where most non-overlapping 50-kb-windows were devoid of heterozygous sites and merged adjacent regions (SI 10). The Neandertal genome has 20 such regions longer than 10cM whereas the Denisovan genome has one."

Finally, Roberta Estes wrote an article regarding possible sampling errors, due to the nature of conversion for upload to GEDMatch. The basic concern appears to be "No-Call" rates:

"Ancient DNA Matching – A Cautionary Tale"   DNAeXplained – Genetic Genealogy, by Roberta Estes


Roberta also explains that subsequent comparisons do not match previous comparisons. I have seen that with GEDMatch data, particularly when the version of the 'One to One' Utility changes (it is now at version 2.1.1(c)). I have had to re-work the data for Group #1 several times over the years, just to keep the data consistent with the current version of the GEDMatch 'One to One' Utility.



Also note that these Clovis segments for kit F999919 match to 12 cMs on living Native Americans.


With that, the first thing I wanted to do was take a look at the SNP density RATIO for the Clovis comparisons. How do ancient Clovis segments compare on SNP density RATIO, and do these ancient segments also throw off variations from the Half Life equation?


  RATIO = SNPs/(100*cMs)

I took a look at the matching segments for HAM DNA Group #1, plus our 'control group' kit, Arnold (22 kits). Taking the seven largest segments matching Clovis for each kit, I found that all but five had a RATIO of less than 2.0; those five are about 3.2% of the total.

To put that into perspective, 7 x 22 = 154 segments

    5 segments/154 segments = 3.2%

That means, for our sample (the Ancient Clovis DNA matching shared segments), 97% have an SNP density RATIO of less than 2.0

Among the Clovis shared segments for the group, there are about 8% that are in 'Excess IBD' regions.

Below is a summary table of the results.



 
Clovis Largest Shared Segment and HAM DNA Group01


  

The kit in HAM DNA Group #1 with the largest matching segment to Clovis:
   A404xxx at 5.9 cMs

- Kits with matching Clovis segments with the largest matching starting and/or ending locations:

   Clovis    Kit       Chr   Start Location   End Location   cMs   SNPs

   F999919   T074xxx     1        1,751,874      3,003,550   4.2    261
   F999919   T133xxx     1        1,751,874      3,003,550   4.2    258
   F999919   T630xxx     9       38,523,004     70,819,104   4.0    286
   F999919   T682xxx     9       38,694,680     70,536,108   3.4    148
   F999919   A561xxx    12       11,840,131     12,861,007   3.8    347
   F999919   A832xxx    12       11,840,131     12,861,007   3.8    350
   F999919   A438xxx    17       13,813,353     14,461,941   3.6    251
   F999919   T368xxx    17       13,785,798     14,618,990   4.7    354
   F999919   A984xxx    19        8,282,431      9,460,034   4.2    292
   F999919   T611xxx    19        8,214,446      9,934,324   5.5    401
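
For anyone who wants to check the SNP density RATIO on these rows, here is a minimal sketch in Python. The function name is made up for illustration; the cMs and SNP values are copied from the table above.

```python
def snp_density_ratio(snps, cms):
    """RATIO = SNPs / (100 * cMs), as defined earlier in this article."""
    return snps / (100 * cms)


# (kit, chromosome, cMs, SNPs) copied from the table above
segments = [
    ("T074xxx", 1, 4.2, 261),
    ("T133xxx", 1, 4.2, 258),
    ("T630xxx", 9, 4.0, 286),
    ("T682xxx", 9, 3.4, 148),
    ("A561xxx", 12, 3.8, 347),
    ("A832xxx", 12, 3.8, 350),
    ("A438xxx", 17, 3.6, 251),
    ("T368xxx", 17, 4.7, 354),
    ("A984xxx", 19, 4.2, 292),
    ("T611xxx", 19, 5.5, 401),
]

for kit, chrom, cms, snps in segments:
    print(kit, chrom, round(snp_density_ratio(snps, cms), 2))

# Count how many of the listed segments fall under the 2.0 cutoff
below_two = sum(snp_density_ratio(s, c) < 2.0 for _, _, c, s in segments)
print(below_two, "of", len(segments), "segments have a RATIO below 2.0")
```

Every row listed here falls under the 2.0 cutoff, which is consistent with the 97% figure given for the full set of 154 segments.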

I would think that these matching Clovis segments would imply a Native American ancestor for the above kits (Amelia County, Ashe County, and the Arkansas lines).

This also implies that if kit A171xxx (of Somerset) does not have a Native American ancestor, then the timeline to connection could be considerably further back in time and may have an impact upon how the Half Life equation should function.

For example, a person in Somerset with a documented line may have a very small chance of having a Native American ancestor. A Clovis match would push the age back at least 12,500 years, much more than the 9 generations from the current 1 cM limit of the Half Life equation.

Sample: Clovis-Anzick-1
Location: Montana, North America
GEDMatch kit: F999919
Sex: M
Y-DNA: Q-Z780
Mt-DNA: D4h3a
Approx. age by authors: 12,500 years
Felix Chandrakumar analysis or comments: Matches living people.

http://www.y-str.org/2014/09/clovis-anzick-dna.html



References:



"Analyzing the Native American Clovis Anzick Ancient Results" DNAeXplained – Genetic Genealogy, Roberta Estes

https://dna-explained.com/2014/09/23/analyzing-the-native-american-clovis-anzick-ancient-results/


"Ancient DNA Matching – A Cautionary Tale"   DNAeXplained – Genetic Genealogy, by Roberta Estes
 

https://dna-explained.com/2014/09/30/ancient-dna-matching-a-cautionary-tale/

"Matching DNA of Living Native Descendants to DNA of Native Ancestors"

https://nativeheritageproject.com/2014/09/25/matching-dna-of-living-native-descendants-to-dna-of-native-ancestors/

"Autosomal DNA statistics"
https://isogg.org/wiki/Autosomal_DNA_statistics

Or see also:

"Cousin statistics"
https://isogg.org/wiki/Cousin_statistics

The scientific paper for the expected (i.e., theoretical) number of cMs at the 5th cousin level:

"Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples"

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267 

 
GEDMatch   John Olson depends upon financial contributions.


Clovis-Anzick-1    Montana, North America    F999919    M    Q-Z780    D4h3a    12,500 years    Matches Living people.

http://www.y-str.org/2014/09/clovis-anzick-dna.html


Autosomal DNA Half Life Equation
http://hamcountry-blog.blogspot.com/2018/02/autosomal-dna-half-life-equation.html



Thursday, February 1, 2018

Autosomal DNA Half Life Equation


Feb 1, 2018


In the previous article, I had talked about creating a phylogenetic tree for autosomal data.
 
Andrew Millard suggested that because the Genetic Distance value of 1/cMs is not linear, I might try an exponential equation of the form -ln(cm/7200).

I tried that suggestion, and I was not very happy with the results. The Genetic Distance did not correspond well to cousin level, and the upper limit for data greater than 1 cM appeared to be about 8 (generations), even for the results for Lauren Shutt (M804xxx), who was not even a match to the group.

Andrew had also suggested that I not use kitsch, but I have not yet found a different program that works with Genetic Distance.

So, here, I have used the equation for the half life decay rate, typically used in radioactive decay under the presumption that we are looking at Half Identical Regions (HIRs). The results mostly returned the expected cousin level (or number of generations).



Individual Segments:


Hypothetical Genetic Distance derived from half life decay rate: 

Nt = N0*e^(-k*t)

Solving for t, with Nt/N0 = cMs/281.5 = cMs*.0035524, the Hypothetical Genetic Distance for an individual segment is given as:

t = -1*ln(cMs*.0035524)/0.693147

where 'ln' is the natural log function and 'cMs' is the size of the segment in cMs.

The starting value is the largest (i.e., most obvious) segment for a mother/child or other parent/child comparison, which is initially 281.5 cMs at current GEDMatch parameters:

a = 1/281.5 cMs = 0.0035524

and for parent/child half life decay, where the size halves in one generation (N0/Nt = 2):


k = ln(N0/Nt) = ln(2) = 0.693147


Excel spreadsheet:

For a spreadsheet, the equation for individual segments should be something like this:

=-LN(F1234*0.0035524)/0.693147


 - Where 'LN' is the natural log function 
 - F1234 is the size of the segment in cMs in column "F" on line (row) 1234.
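
If you prefer a program to a spreadsheet, the same calculation can be sketched in a few lines of Python. This is only an illustration; the function and constant names are my own, and the constants are the ones given above.

```python
import math

A = 0.0035524  # 1 / 281.5 cMs, the largest segment at current GEDMatch parameters
K = 0.693147   # ln(2), the half life decay constant


def half_life(cms):
    """Hypothetical Genetic Distance (in generations) for one shared segment."""
    return -math.log(cms * A) / K


for cms in (140.75, 7.0, 1.0):
    print(cms, round(half_life(cms), 1))
```

A segment half the size of the 281.5 cM maximum comes back as 1, and each further halving of the segment adds 1 to the result.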

Comparison to 23AndMe data:

The default setting of 500 SNPs does not usually generate a sufficient total size for kits from the vendor 23AndMe. With the default, the results will not be compatible with the individual segment equation, and may be poor. This is a GEDMatch vendor conversion issue. If you need to compare kits from 23AndMe (kits Mxxxx at GEDMatch), I would suggest lowering the SNP limit to 250 instead of the default 500 SNPs.

Below is the link to the data and phylogenetic tree resulting from the use of the half life decay rate calculations. Basically, it is an update to the previous article, applying the autosomal half life decay equation.

Article:  Autosomal Half Life Equation

 
Autosomal Half Life Equation Largest Segment Table

Autosomal Half Life Equation Phylogenetic Tree


See below for more information about the "Endogamy Correction Factor."

Total SUMS in cMs:

The starting value is the Total Sum of segments (i.e., the most obvious) for a mother/child or other parent/child comparison, which is initially about 3585 cMs at current GedMatch thresholds*:

a = 1/3585 cMs = 0.00027894

and for parent/child half life decay, where the size halves in one generation (N0/Nt = 2):


k = ln(N0/Nt) = ln(2) = 0.693147


Total Sums Excel spreadsheet:

For a spreadsheet, the equation for total sum of all segments should be something like this:

=-LN(F1234*0.00027894)/0.693147


 - Where 'LN' is the natural log function 
 - F1234 is the total sum of all segments in cMs in column "F" on line (row) 1234.


* You will need to modify the SNP limit and lower the cMs threshold to 1 cM in order to use the "Total SUM" version of the equation. I am not getting total sums consistent with this equation due to either:

a) GEDMatch under-reporting matching segments. GEDMatch has apparently attempted to remove some "Excess IBD" areas, which will affect the total sum of segments; or

b) a GEDMatch vendor conversion problem with vendor 23AndMe.

Therefore, I have not been able to adequately test the "Total SUM" version of the Half Life equation.

Two things that you should know when using Total SUMs:

 a) The default setting of 500 SNPs does not usually generate a sufficient total size for kits from the vendor 23AndMe. With the default, the results will not be compatible with the individual segment equation, and may be poor. This is a GEDMatch vendor conversion issue. If you need to compare kits from 23AndMe (kit Mxxxx at GEDMatch), I would suggest lowering the SNP limit to 250 instead of the default 500 SNPs.


 b) The "Total SUM" natural log equation is not delivering adequate results from the data, as given by GEDMatch. By comparing my data from 2015 to today's data, it appears that GEDMatch has made an effort to NOT report some of the "Excess IBD" areas with the results. That will affect the Half Life equation for Total SUMs, because the sums are now under-reported at GEDMatch.

ISOGG message from CeCe Moore on Thu, Jun 10, 2010:

"Hi All,
    I had a very fascinating interview with Bennett today and wanted to share something very important that I learned since I know it has been debated here quite a bit. I asked him about the reliability of using the combined smaller segments in "Total cMs" to predict relatedness. He stated that FTDNA only uses "Total cMs" for relationship predictions of 2nd cousin once removed and closer. From that point on, they only use the longest blocks to predict relationship. The "Total cMs" is only included in FF summaries because it was something that many people were interested in seeing.
    CeCe"



Per Chromosome Maximum


The equation can be customized per chromosome by using the maximum value of centimorgans per chromosome. Ann Turner has explained that you can get this by comparison to yourself. This can be done programmatically. If you are not a whiz on a spreadsheet, you can create a column for these values for each chromosome, then refer the Half Life equation to the "max cMs" column, as such:

=-1*(LN(F5/J5))/LN(2)

where F5 is the segment value in centimorgans in column F on line 5
and J5 is the maximum cMs on that chromosome in column J on line 5


Chromosome    FTDNA [A]    GEDMatch [B]    23andMe [C]

     1         267.21        281.5           284
     2         253.06        263.7           269
     3         219.1         224.2           223
     4         206.75        214.4           214
     5         199.6         209.3           204
     6         189.14        194.1           192
     7         180.79        187.0           187
     8         161.76        169.2           168
     9         160.36        167.2           166
    10         176.25        174.1           181
    11         155.78        161.1           158
    12         167.39        176.0           175
    13         126.48        131.9           126
    14         111.66        125.2           119
    15         118.07        132.4           141
    16         131.90        133.8           134
    17         124.33        137.3           128
    18         119.39        129.5           117
    19          99.07        111.1           108
    20         104.20        114.8           108
    21          58.99         70.1            62.7
    22          53.03         79.1            72.7
 
Warning: Chromosomes 21 and 22 have fairly low maximum values, and may require a different treatment because sizes can get large quickly, as in an 'Excess IBD' region or a 'Recombination' area. The idea with using individual chromosome maximum cMs is to apply it to all, then take the average.
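
The per chromosome idea above can also be done programmatically. Below is a minimal sketch in Python, using the GEDMatch column [B] from the table; the function names and the example segments are made up for illustration.

```python
import math

# GEDMatch column [B] from the table above: maximum cMs per chromosome
GEDMATCH_MAX_CMS = {
    1: 281.5, 2: 263.7, 3: 224.2, 4: 214.4, 5: 209.3, 6: 194.1,
    7: 187.0, 8: 169.2, 9: 167.2, 10: 174.1, 11: 161.1, 12: 176.0,
    13: 131.9, 14: 125.2, 15: 132.4, 16: 133.8, 17: 137.3, 18: 129.5,
    19: 111.1, 20: 114.8, 21: 70.1, 22: 79.1,
}


def half_life_for_chromosome(chrom, cms):
    """Same as =-1*(LN(F5/J5))/LN(2), with J5 looked up by chromosome."""
    return -math.log(cms / GEDMATCH_MAX_CMS[chrom]) / math.log(2)


def average_half_life(segments):
    """Apply the equation to every (chromosome, cMs) pair, then average."""
    values = [half_life_for_chromosome(c, cms) for c, cms in segments]
    return sum(values) / len(values)


# Hypothetical segments, for illustration only
example = [(1, 12.0), (9, 8.5), (21, 5.0)]
print(round(average_half_life(example), 2))
```

The dictionary could be swapped for the FTDNA or 23andMe column if your data comes from one of those vendors.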


NOTES: 

The 23AndMe vendor does not generate sufficient results for a valid comparison in many instances. Currently, 23AndMe will only generate one small segment, and does not supply enough information from vendor conversion for sites like GEDMatch to make a good comparison. Try lowering the limit for SNPs to 250 for the vendor 23AndMe.

Removal of the Excess IBD regions has about the same effect on individual segments as that of using the "Endogamy Correction Factor." Either will produce some error for various reasons. However, if the Excess IBD regions have been removed, then this will affect how the Total SUM version of Half Life equation works.

If you want to use the  "Endogamy Correction Factor" on the excess IBD segments instead of removing them:

- Endogamy Correction Factor:     [(100*cMs)/SNPs] 

t = -1*ln[(cMs*.0035524*100*cMs)/SNPs]/0.693147 


- where cMs is the size of the segment in cMs and SNPs is the number of SNPs in the segment
- if the size in cMs equals 0, set t to 11 as an arbitrary upper limit
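
For programming purposes, the corrected equation might look something like the following Python sketch. The names are my own, and the example values are for illustration only.

```python
import math

A = 0.0035524     # 1 / 281.5 cMs
K = 0.693147      # ln(2)
UPPER_LIMIT = 11  # arbitrary upper limit used when the size is 0 cMs


def corrected_half_life(cms, snps):
    """Half Life with the Endogamy Correction Factor (100*cMs)/SNPs applied."""
    if cms == 0:
        return UPPER_LIMIT
    correction = (100 * cms) / snps
    return -math.log(cms * A * correction) / K


# Example values, for illustration only
print(round(corrected_half_life(4.2, 261), 2))
```

Note that when a segment has exactly 100 SNPs per cM, the correction factor is 1 and the result is the same as the uncorrected equation. A segment with more SNPs per cM than that is pushed toward a larger Half Life value.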
 




Updated 10/20/2018 to include table of maximum cMs per chromosome.
Updated 02/26/2018 arbitrary upper limit changed from 14 to 11, in order to avoid exponential results at the upper limit of phylogenetic trees.
Updated 02/26/2018 to add the equation for total sums in cMs and link to reference table.
Updated 02/17/2018 to add spreadsheet version of the equation.
Updated 03/27/2018 to add SNP parameters for Total SUM calculation and a note about 23andMe problems.
Updated 03/29/2018 Correct the reference regarding MyHeritage to 23AndMe (vendor indicated at GedMatch starting with an "M"). Note that Total SUMs is not giving adequate results. Added a quote from a public post by CeCe Moore from the ISOGG email list.
Updated 04/03/2018 Corrected to report that GEDMatch does not report out "Total SUMs" properly, due to an apparent removal of Excess IBD regions. Included equation for "Endogamy Correction Factor."




References:


HAM Group #1 Information

HAM Y-DNA Project Phylogenetic Tree

HAM Group #1 Initial Tiny Autosomal Segment Triad Study 


ISOGG Autosomal DNA statistics


Maximum Values for Centimorgans

cM Values Per Chromosome  (table by Ann Turner)

GEDMatch


FamilyTreeDNA

HAM DNA Project Dean McGee's Utility output

HAM DNA Project Y-DNA Results at HAM Country

HAM DNA Project at FTDNA

How to Read HAM DNA Phylograms
    (video)





  
  
 

Wednesday, September 16, 2015

Autosomal Small Segment Triangulation HAM DNA Group #1

Small Segment Triangulation
HAM Y-DNA Group #1


The main purpose of the paper was to provide instructions that will permit viewing matching autosomal shared segments when FTDNA does not provide that information. Further, the intent is to help analyze a Y-DNA Group for matching shared autosomal segments by direct comparison between three or more people. This was written for those who had problems finding autosomal matches and who were also participants in the Y-DNA project. It is meant to help with a problem when the Y-DNA indicates that you should have a match, but the autosomal DNA indicates no match.


 https://drive.google.com/file/d/0B8IN3Go7mIx6clZYWTRjUlU2enM/view
  


Screenshot of Autosomal Small Segment Triangulation


See also:



"Table 5 shows a typical set of alleles... These alleles (AA and CC) may indicate a Mediterranean ethnicity. The probability of a one to one match on this segment being a false positive calculates to be 1 in 7 quadrillion."

"Many 7 cM matches are SNP poor and under certain conditions will calculate as a false positive. There are many triangulated matches at 2.5 cM that confirm a relationship. Unfortunately, that relationship may be in the 7 to 14 generation range, making it difficult to determine the common ancestor. Triangulated small segment matching is very valuable in our research."
Abstract
The process of genetic inheritance is often over simplified, leading consumers of genetic tests to believe that the amount of DNA from distant ancestors becomes negligible. In fact, segments of DNA pass down through the generations intact. Naturally occurring cleavage sites allow for small segments to exist at recurring chromosomal locations. These small segments can be used as familial markers in an autosomal haplotype.

Maximum-likelihood estimation of recent shared ancestry (ERSA)

Abstract
Accurate estimation of recent shared ancestry is important for genetics, evolution, medicine, conservation biology, and forensics. Established methods estimate kinship accurately for first-degree through third-degree relatives. We demonstrate that chromosomal segments shared by two individuals due to identity by descent (IBD) provide much additional information about shared ancestry


A Study Utilizing Small Segment Matching

"Now that we understand IBS, IBD, Phasing and how matching actually works on a case by case basis, let’s look at applying those same matching and IBS vs IBD guidelines to small data segments as well."

4 Generation Inheritance Study

"There is a lot more information available to us in our DNA results than is first apparent.  It takes a bit of digging and you need to understand how autosomal DNA works in order to ferret out those secrets.  Don’t discount or ignore evidence because it’s more difficult to use – meaning small segments.  The very piece or breadcrumb you need to solve a long-standing mystery may indeed be right there waiting for you.  Learn how to use your DNA information effectively and accurately – including those small segments."

Monday, March 24, 2014

Autosomal DNA Cousin Calculator



Can you calculate your genetic cousins? Mar 24, 2014

Update  Jan 31, 2018:

Data and phylogenetic tree resulting from the use of the half life decay rate calculations. Basically, an update to the previous article by applying the autosomal half life decay equation.

The thought occurred to me to respond to a query from Lucy Sinkular on the Rootsweb Genealogy-DNA email list regarding matching autosomal chromosome segments to an adopted person, along with her known cousins. I thought I would mention something that I had put into my autosomal DNA spreadsheet to estimate cousin relationships. I used my "CousinCalc" equation (from my spreadsheet) to inform her that I estimated her adopted genetic cousin to be on the order of a 10th cousin.

Back on September 28, 2011, Jared Roach, M.D., Ph.D., Senior Research Scientist at the Institute for Systems Biology, posted a note on the Genealogy-DNA email list on the logic behind the prediction of cousin relationships. The theory is that the number of segments, when combined with the size of the segments, can be used to estimate distant relationships.


"Maximum-likelihood Estimation of Recent Shared Ancestry (ERSA)," Genome Res., May 21, 2011. ( see http://genome.cshlp.org/content/21/5/768/F1.expansion.html)


Long autosomal segments are unlikely to be from distant relationships, and short segments can either be from close or distant relationships. The equation given in the paper is based on an equation given by Thomas in 1994:


   P(t) = e^(-d*t/100)



d = number of meioses
t = length of segment in cM
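
To get a feel for that equation, here is a minimal Python sketch. It assumes the exponent is negative, which is the usual form for the chance that a segment of t cM passes through d meioses without a crossover; the function name is made up for illustration.

```python
import math


def prob_segment_intact(d, t):
    """Chance that a segment of t cM passes through d meioses with no crossover."""
    return math.exp(-d * t / 100)


# A 10 cM segment over an increasing number of meioses
for d in (2, 6, 10):
    print(d, round(prob_segment_intact(d, 10), 3))
```

The value drops as either the number of meioses or the segment length goes up, which is the basis for the statement that long segments are unlikely to come from distant relationships.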



For these past 20 years or so, the "number of meiosis" has been taken to be the number of segments. Terms have been introduced to define valid segments (Identical By Descent, or IBD) and invalid segments (Identical By State, or IBS). Segments are considered to be IBS if, in general, they are small (less than 5 to 7 cM, or less than 500 SNPs).



The above equation does not always work well, so a large number of probability distribution functions and Monte Carlo simulations have been invented in order to help make some reasonable estimates of relatedness between two matching individuals. The topic is popular because predicting relatedness has a number of applications, from family history to medicine.


However, the thought that was nagging me was why were these scientists not using SNP's vs. cMs?? 


So, I thought I would try to find out. Upon investigating, I found that this equation worked for predicting my 4th and fifth cousins:


      CousinCalc = (1,000 x ToTal_cMs)/ToTal_SNPs
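
Written as a small Python function, the equation would look something like this. It is only a sketch; the function name and the example totals are made up for illustration.

```python
def cousin_calc(total_cms, total_snps):
    """CousinCalc = (1,000 x Total_cMs) / Total_SNPs."""
    return 1000 * total_cms / total_snps


# Hypothetical totals, for illustration only
print(cousin_calc(20.0, 5000))
```

Because the result is just 1,000 times the inverse of the SNP density, matches with fewer SNPs per cM return a larger cousin number.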


But that would change, as I attempted to incorporate closer relationships.



When I ignore the concept of IBD and IBS (and just use the sum of the figures as given by Family Tree DNA in their Family Finder product), this equation works for the distant cousins that I knew about thus far. An alternative method (such as an average size) would make use of segment counts. That type of calculation returns fairly dependable results. However, this 'CousinCalc' equation does not use segment counts. My thought here was to perform a study of the combination of SNP's and CentiMorgans.

But the questions that bothered me were, 'Is my sample too small?' and 'Would this be a statistically valid equation?'



I don't have enough data to answer that, so I asked around.

I looked at Tim Janzen's autosomal segment matches to his mother, which he has made publicly available. My "CousinCalc" came back with a cousin estimation of 6th cousin. Clearly, my "CousinCalc" equation (above) does not work for cousin levels less than 1 (that is, relationships closer than first cousin).

However, to be fair, Tim Janzen does not list his small segments for close relationships, and I believe this may be because companies such as Family Tree DNA do not report small segments for close relationships. So, the sum of any small segments is an unknown part of the equation. Yet, in use of my equation, I do find that the sum of all segments delivers nearly the same result as the sum of the largest segments.

Therefore, I can see that the expectation is that the "CousinCalc" equation will not hold for cousin levels less than 1.

Tim Janzen has provided his information in the past.

For more distant relationships, you should begin to see a departure from large segments reflecting the results of this equation. Therefore, for distant relationships, you would want to use sums in cMs and SNPs for the above.

Ann Turner had the most patience with this idea. However, she wasn't exactly warming up to the idea of using the above equation. She explained that 23AndMe uses segments vs. cMs, as in the article "Cryptic Distant Relatives are Common in both Isolated and Cosmopolitan Samples," and the chart (Fig. 3) is given on this page:

   http://blog.23andme.com/news/announcements/how-many-relatives-do-you-have/


Simulated data showing the relationship between shared identical segments of DNA (IBD-half) and # of shared segments for different degrees of relatedness in a population with European ancestry. http://blog.23andme.com/news/announcements/how-many-relatives-do-you-have/

In short, this is the basis for the "Relative Finder" tool available at 23AndMe, as described here:     ( https://www.23andme.com/ancestry/ )

I should note that the team at Huff Lab has made the ERSA software freely available for download from their web site:


The data chosen for the chart were IBD segments, which basically means that the chart includes matching segments that are larger than 10 cMs. In other words, the number of IBD segments over 10 cMs among the matches to your autosomal DNA is an indication of how related those matches might be to you. Because segments do not fit very well, there are some fairly heavy duty probability equations behind the above chart. 

Then I thought, well, let me draw up what SNP's vs. cMs might look like.

Here is what my fourth cousin's matching segments look like for the individual segments:


and here is what my fifth cousin's matching segments look like for individual segments:

Individual autosomal segments vs. Centimorgans (cMs)


 This appeared to be a direct (near linear) relationship between SNP's and cMs. However, it was perhaps not statistically valid (not enough sample data). So, I thought I would see if Family Tree DNA's Chromosome Browser matching segments might give enough data to support the same direct (linear) relationship between SNP's and cMs. That produced the following chart:


When matching autosomal DNA segments are summed, the sums produce a chart that shows a direct relationship between total SNP's and total cMs. 1076 individual segments where SNP's and cM's have been summed per matching person.




Elizabeth Harris wrote me to say that she did not use SNP's because they did not work for her. Basically, she tested with 23AndMe, and the calculation did not work for 4 of her cousins with matching segments on chromosome 15. The 'CousinCalc' equation came out to be 33rd cousins for that segment, and the math is rather tortuous if you want the equation to come out as fourth cousins for that particular segment on chromosome 15.


Ann Turner cited this chart from Rutgers' University:


http://web.archive.org/web/20070113005025/http://compgen.rutgers.edu/maps/compare.pdf


The basic point there is that the marker positions along chromosome 15 begin at about 20 MB, as do those along chromosomes 13 and 14, but this is not a common phenomenon among chromosome measurements.


However, I should point out that unfortunately, Elizabeth only gave the values for chromosome 15, and I did not get to see what 'CousinCalc' looks like using sums across all matching chromosomes for her 4th cousins. Elizabeth did not provide data for any other matching chromosome, so I did not get to see what the data looks like from 23AndMe.


And finally, Ann Turner also pointed out that chromosome 15 has some poor regions being reported out, and a number of other chromosomes have the same problem. She cited Table 3 of this article: "Relationship Estimation from Whole-Genome Sequence Data," Hong Li, et al. Jan 2014.


  http://www.plosgenetics.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pgen.1004144&representation=PDF


That Table 3 shows two segments on chromosome 15: one starting at location 20,967,673 and ending at 25,145,260 with a length of 10.46 cMs, and the other starting at 27,115,823 and ending at 30,295,750 with a length of 9.29 cMs. That's two to three times what you might expect to be the length in cMs. Other chromosomes showing this type of anomaly include chromosomes 1, 2, 6, 8, 9, 10, 16, 17, 21, and 22, for a total of some 14 regions.


Finally, the GedMatch.com web site has a utility that plugs relationship calculations into some of their reports. However, I think that Tim Janzen mentioned that GedMatch has not yet converted to Build 37. Family Tree DNA is now at Build 37, so the results may look slightly different at GedMatch than what you may see at your vendor.


     http://gedmatch.com/


That would be a brief overview regarding why SNP's are not usually compared vs. the use of  Centimorgans.

 An easy way to determine an estimate is to add all of your segment results and use the average. Take the total shared DNA and divide by the number of segments shared in order to get the average size in cMs.

Then go to the table on this site and look up the relationship:


 http://www.isogg.org/wiki/Autosomal_DNA_statistics

 Pretty handy for a quick reference.

In August, 2015, Robert James Liguori commented on this story (below) and lent the data that he had on hand. I then used Robert's data to derive a simple equation that would return large cousin values for small cM and SNP values, and at the same time deliver small cousin values for large cM values and large SNP values. Normally, a mathematician would want to derive that sort of equation using differential equations or a series, or a transformation, or probability distribution. Since I have limited data, I used no tools to derive the equation, and kept it simple for direct testing.

CCalc_V2 = SQRT(1/SQRT((F3*F3) / G3)*(0.75*SQRT(G3/F3))*(G3/(F3*600)))   
where:

   F3 is the cM segment values from column 'F' on line 3
   G3 is the SNP segment value from column 'G' on line 3 

  
I would imagine that some academics would prefer exponential or natural logs in the equation, but this one was fairly straightforward for me to test. It certainly can be improved. 

  
 In the event that you prefer the equation for programming purposes:


Alpha =  sqrt(1/sqrt((cMs * cMs) / SNPs) * (.75*sqrt(SNPs / cMs)) * (SNPs/(cMs * 600)))
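For anyone working outside of a spreadsheet, here is a minimal Python sketch of the same equation. The function names are my own. The second function is an algebraic simplification worked out from the equation as written (it reduces to SNPs divided by the square root of 800 times cMs to the power 1.25); it is offered as a cross-check and is not a separately published formula. The sample segment is hypothetical.

```python
import math

def cousin_calc_v2(cms, snps):
    """Literal transcription of the Alpha equation for one segment."""
    return math.sqrt(
        1 / math.sqrt((cms * cms) / snps)
        * (0.75 * math.sqrt(snps / cms))
        * (snps / (cms * 600))
    )

def cousin_calc_v2_simplified(cms, snps):
    """Algebraically equivalent form: SNPs / (sqrt(800) * cMs**1.25)."""
    return snps / (math.sqrt(800) * cms ** 1.25)

# Hypothetical segment: 10 cMs containing 2,500 SNPs.
cms, snps = 10.0, 2500
print(cousin_calc_v2(cms, snps))
print(cousin_calc_v2_simplified(cms, snps))  # should agree with the line above
```

The simplified form makes the behavior easier to see: the result grows in direct proportion to the SNP count and shrinks a little faster than in direct proportion to the segment size in cMs.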


A few things I know about the use of the equation at the moment:

 - This works better on individual segments than it does on sums. Largest segment is good.
 - It still does not work well for relationships closer than first cousin (cousin values less than 1), but this is an improvement.
 - It does not handle "2nd Removed," etc. very well.
 - It is mostly designed to work with values in excess of 3 cMs (use values greater than 3 cMs). Therefore, if you see the wrong value, please check the size in cMs first. 


 
 Here is a display of what the cousin calculator equation looks like for a few of my cousins:


Autosomal DNA 'CousinCalc' equation for fourth cousin using SNPs vs. cMs.


 Using the Cousin Calculator version 2 returns the expected value for the cousin relationship on the largest segment. The smaller segments suggest an older ancestral relationship.



Autosomal DNA 'CousinCalc' equation for fourth cousin, once removed using SNPs vs. cMs.

In the above, there are several segments that return (roughly) the expected value.

 
Autosomal DNA 'CousinCalc' equation for fourth cousin using SNPs vs. cMs.

 
Autosomal DNA 'CousinCalc' equation for fifth cousin using SNPs vs. cMs.



The next step should be to try to gather enough data regarding the results of this equation in order to determine if this is a valid calculation that I can use in my spreadsheet. If the statistics do not bear it out as valid, then the following step should be to determine if the problems mentioned above could be remedied (or avoided) by use of a program.

I use the cousin calculator in an examination of autosomal DNA segment triads here:

 http://hamcountry-blog.blogspot.com/2015/09/autosomal-small-segment-triangulation.html

  
I used the information from the cousin calculator to create a 'Dean McGee' style Time to Most Recent Common Ancestor (TMRCA) table within the above report. It was interesting to see a TMRCA table from females for a change. That information (in years) can be used to create a phylogenetic tree. For the years, I used the calculation:

   Years = 2 * (Cousin Calc V2 value) * (25 years per generation)
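The years calculation is simple enough to sketch in Python. The function name is my own, and the 25 years per generation comes from the calculation above; a different generation length can be passed in if preferred.

```python
YEARS_PER_GENERATION = 25  # generation length used in the calculation above

def tmrca_years(cousin_value, years_per_generation=YEARS_PER_GENERATION):
    """Years = 2 * (cousin calculator value) * (years per generation)."""
    return 2 * cousin_value * years_per_generation

# A cousin value of 4 (fourth cousin) works out to 2 * 4 * 25 years.
print(tmrca_years(4))
```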
 
Robert James Liguori subsequently posted his own calculation tool, but that site is no longer available (see comments).






Updated 03/27/2014 - fix for math error in table for Frank, made hyperlinks active.

---------------------------------------------------------------------------------------------------------------- 
Update 08/25/2015 - I posted an improved equation to the Genealogy-DNA email list:

Hello,

I have been playing around with my cousin calculator again, and have tried to have it work better at the extreme limits. That is, have it return a small result when the cMs are huge, and return a huge result when the cMs are small. I received some cMs, SNP counts, and relationships from Robert James Liguori, which includes some data on close relationships that I do not have. 

  
If anyone would care to humor me and plug it into your spreadsheet from FTDNA, my cousin calculator version 2 equation now looks like this: 

  
= SQRT(1/SQRT((F3*F3) / G3)*(0.75*SQRT(G3/F3))*(G3/(F3*600))) 

  
where:

   F3 is the cM segment values from column 'F' on line 3
   G3 is the SNP segment value from column 'G' on line 3 

  
I would imagine that some academics would prefer exponential or natural logs in the equation, but this one was fairly straightforward for me to test. It certainly can be improved. 

  
A few things I know at the moment:

 - This works better on segments than it does on sums.
 - It still does not work well for cousins less than 1, but this is an improvement.
 - It does not handle "2nd Removed," etc. very well.
 - It is mostly designed to work with values in excess of 3 cMs (use values greater than 3 cMs). Therefore, if you see the wrong value, please check the size in cMs first. 

  
   I designed it so that it could handle small values (large cousin values), just so I could see how large a cousin value it would return. The largest value that I have seen generated is around 100th cousin, which I would guess would be in the vicinity of 200 generations, or 5,000 years at 25 years per generation. 

  
Thanks for the data from Robert James Liguori.


Fun stuff.


---------------------------------------------------------------------------------------------------------------- 
Updated:  09/11/2015 It appears that Huff Labs no longer offers the web interface to their ERSA software. Updated to indicate that you can download their software.

---------------------------------------------------------------------------------------------------------------- 
Updated:  10/05/2015
Re-phrasing on the reference to Tim Janzen's data, updated to include the new calculation derived from the data of Robert James Liguori, and an update of the graphics in order to illustrate the use of sums vs using individual segments.