Monday, March 24, 2014

Autosomal DNA Cousin Calculator



Can you calculate your genetic cousins?






The thought occurred to me to respond to a query from Lucy Sinkular on the Rootsweb Genealogy-DNA email list regarding matching autosomal chromosome segments to an adopted person, along with her known cousins. I thought I would mention something that I had put into my autosomal DNA spreadsheet to estimate cousin relationships. I used my "CousinCalc" equation (from my spreadsheet) to inform her that I got an estimate of 10th cousin for her genetic cousin, who was adopted.


I used my "CousinCalc" to estimate that her adopted genetic cousin was on the order of 10th cousin.

Back on September 28, 2011, Jared Roach, M.D., Ph.D., Senior Research Scientist at the Institute for Systems Biology, posted a note on the Genealogy-DNA email list on the logic behind the prediction of cousin relationships. The theory is that the number of segments, when combined with the size of the segments, can be used to estimate distant relationships.


"Maximum-likelihood Estimation of Recent Shared Ancestry (ERSA)," Genome Res., May 21, 2011 (see http://genome.cshlp.org/content/21/5/768/F1.expansion.html )


Long autosomal segments are unlikely to be from distant relationships, while short segments can be from either close or distant relationships. The equation given in the paper is based on an equation given by Thomas in 1994:


   P(t) = e^(-d*t/100)

d = number of meioses
t = length of segment in cM
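To get a feel for how quickly this probability falls off, here is a minimal Python sketch. It assumes the decaying form P(t) = e^(-d*t/100), so that the result stays between 0 and 1, and reads it as the chance that a segment of t cM passes through d meioses unbroken. The function name and the sample values are my own illustration, not part of the ERSA software.

```python
import math


def segment_probability(d: int, t_cm: float) -> float:
    """Evaluate P(t) = exp(-d * t / 100).

    d    : number of meioses separating the two people (assumed >= 0)
    t_cm : segment length in centimorgans (assumed >= 0)
    """
    if d < 0 or t_cm < 0:
        raise ValueError("d and t_cm must be non-negative")
    return math.exp(-d * t_cm / 100.0)


if __name__ == "__main__":
    # Illustrative values only: a 10 cM segment across more and more meioses.
    for d in (2, 10, 20):
        print(d, round(segment_probability(d, 10.0), 3))
```

The longer the segment or the larger the number of meioses, the smaller the value, which matches the statement that long segments are unlikely to come from distant relationships.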



For these past 20 years or so, the "number of meiosis" has been taken to be the number of segments. Terms have been introduced to define valid segments (Identical By Descent, or IBD) and invalid segments (Identical By State, or IBS). Segments are considered to be IBS if, in general, they are small (less than 5 to 7 cM, or less than 500 SNPs).



The above equation does not always work well, so a large number of probability distribution functions and Monte Carlo simulations have been devised to help make reasonable estimates of relatedness between two matching individuals. The topic is popular because predicting relatedness has a number of applications, from family history to medicine.


However, the thought that was nagging me was: why were these scientists not using SNPs vs. cMs?


So, I thought I would try to find out. Upon investigating, I found that this equation worked for predicting my 4th and 5th cousins:


      CousinCalc = (1,000 x ToTal_cMs)/ToTal_SNPs


When I ignore the concept of IBD and IBS (and just use the figures as given by Family Tree DNA in their Family Finder product), this equation works for the distant cousins that I know about thus far. IBD and IBS at present are terms derived from the use of segment counts. The 'CousinCalc' equation does not use segment counts, so my thought is that the current definitions of IBD and IBS do not apply to the use of this equation.
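For anyone who would rather try this outside of a spreadsheet, the equation can be sketched in a few lines of Python. The function name and the example totals are hypothetical, chosen only to show the arithmetic. They are not taken from anyone's actual Family Finder results.

```python
def cousin_calc(total_cm: float, total_snps: float) -> float:
    """CousinCalc = (1,000 x total cMs) / total SNPs.

    total_cm   : sum of shared centimorgans across all matching segments
    total_snps : sum of matching SNPs across the same segments
    """
    if total_snps <= 0:
        raise ValueError("total_snps must be positive")
    return 1000.0 * total_cm / total_snps


if __name__ == "__main__":
    # Hypothetical totals, for illustration only: 40 cM shared over 10,000 SNPs.
    print(cousin_calc(40.0, 10_000))  # 4.0, read as roughly "4th cousin"
```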

But the question that bothered me was, 'is my sample too small?'
Would this be a statistically valid equation?



I don't have enough data to answer that, so I asked around.

I looked at Tim Janzen's autosomal segment matches to his mother, which he has made publicly available. My "CousinCalc" came back with a cousin estimation of 6th cousin. Clearly, my "CousinCalc" equation does not work for cousins less than 1 (that is, relationships closer than first cousin). But, to be fair, Tim Janzen does not list his IBS segments, so the sum of the IBS segments is an unknown part of the equation. Yet, in use of my equation, I do find that the sum of all segments delivers nearly the same result as the sum of IBD segments, so I presume that the expectation is valid that the "CousinCalc" equation will not hold for cousins less than 1.

For more distant relationships, you should begin to see a departure from IBD reflecting the results of this equation. So, be careful about using IBD instead of sums.

Ann Turner had the most patience with this idea. However, she wasn't exactly warming up to the idea of using this new equation. She explained that 23AndMe uses segments vs. cMs, as in the article "Cryptic Distant Relatives are Common in both Isolated and Cosmopolitan Samples," and the chart (Fig. 3) is given on this page:

   http://blog.23andme.com/news/announcements/how-many-relatives-do-you-have/


Simulated data showing the relationship between shared identical segments of DNA (IBD-half) and # of shared segments for different degrees of relatedness in a population with European ancestry. http://blog.23andme.com/news/announcements/how-many-relatives-do-you-have/
In short, this is the basis for the "Relative Finder" tool available at 23AndMe, as described here: ( https://www.23andme.com/ancestry/relfinder/ )

I should note that the Team at Huff Lab has a web site available that enables you to plug in your autosomal segment information, and they will do the calculations for you. The ERSA software is freely available for download from their web site:


The data chosen for the chart were IBD segments, which basically means that the chart includes matching segments that are larger than 10 cMs. In other words, the number of IBD segments over 10 cMs in the matches to your autosomal DNA is an indication of how closely related the matches might be to you. Because segment counts do not fit the relationships very well, there are some fairly heavy-duty probability equations behind the above chart.

Then I thought, well, let me draw up what SNPs vs. cMs might look like.

Here is what my fourth cousin's matching segments look like if the segments are not summed up:


and here is what my fifth cousin's matching segments look like if the segments are not summed up:

Individual autosomal segments vs. Centimorgans (cMs)


It looked like a direct (linear) relationship between SNPs and cMs. However, it was perhaps not statistically valid (not enough sample data). So, I thought I would see if Family Tree DNA's Chromosome Browser matching segments might give enough data to support the same direct (linear) relationship between SNPs and cMs. That produced the following chart:


When matching autosomal DNA segments are summed, the sums produce a chart that shows a direct relationship between total SNPs and total cMs. The chart covers 1076 individual segments, with SNPs and cMs summed per matching person.




Elizabeth Harris wrote me to say that she did not use SNPs because they did not work for her. Basically, she tested with 23AndMe, and the calculation did not work for 4 of her cousins with matching segments on chromosome 15. The 'CousinCalc' equation came out to be 33rd cousins for that segment, and the math is rather tortuous if you want the equation to come out as fourth cousins for that particular segment on chromosome 15.


Ann Turner cited this chart from Rutgers University:


http://web.archive.org/web/20070113005025/http://compgen.rutgers.edu/maps/compare.pdf


The basic point there is that the marker positions along chromosome 15 begin at about 20 Mb, as do those of chromosomes 13 and 14, but it is not a common phenomenon among chromosome measurements.


However, I should point out that, unfortunately, Elizabeth only gave the values for chromosome 15. She did not provide data for any other matching chromosome, so I did not get to see what 'CousinCalc' looks like for her 4th cousins using the sums across all matching chromosomes from 23AndMe.


And finally, Ann Turner also pointed out that chromosome 15 has some poor regions being reported out, and a number of other chromosomes have the same problem. She cited Table 3 of this article: "Relationship Estimation from Whole-Genome Sequence Data," Hong Li, et al., Jan 2014.


  http://www.plosgenetics.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pgen.1004144&representation=PDF


That Table 3 shows two segments on chromosome 15: one starting at 20,967,673 and ending at 25,145,260 with a length of 10.46 cMs, and the other starting at 27,115,823 and ending at 30,295,750 with a length of 9.29 cMs. That's two to three times what you might expect to be the length in cMs. Other chromosomes showing this type of anomaly include chromosomes 1, 2, 6, 8, 9, 10, 16, 17, 21, and 22. A total of some 14 regions.
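The "two to three times" figure can be checked with a little arithmetic. The sketch below assumes a baseline of about 1 cM per million base pairs (Mb), which is a common rule of thumb and my assumption here, not something stated in Table 3. The function name is also just for illustration.

```python
def cm_per_mb(start_bp: int, end_bp: int, length_cm: float) -> float:
    """Return the centimorgans per megabase for one region."""
    length_mb = (end_bp - start_bp) / 1_000_000
    if length_mb <= 0:
        raise ValueError("end_bp must be greater than start_bp")
    return length_cm / length_mb


# The two chromosome 15 regions as quoted above (start, end, cM).
regions = [
    (20_967_673, 25_145_260, 10.46),
    (27_115_823, 30_295_750, 9.29),
]

if __name__ == "__main__":
    for start, end, cm in regions:
        ratio = cm_per_mb(start, end, cm)
        # Against the assumed 1 cM/Mb baseline, the ratio itself is the
        # "times what you might expect" figure.
        print(f"{start:,}-{end:,}: {ratio:.2f} cM per Mb")
```

Both regions come out between 2 and 3 cM per Mb, so a segment falling in one of them would carry two to three times the cMs for the same physical stretch.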


Finally, the GedMatch.com web site has a utility that plugs relationship calculations into some of their reports. However, I think that Tim Janzen mentioned that GedMatch has not yet converted to Build 37. Family Tree DNA is now at Build 37, so the results may look slightly different at GedMatch than what you may see at your vendor.


     http://gedmatch.com/


Having given an overview regarding why SNPs are not used vs. Centimorgans, here is a display of what the cousin calculator equation looks like for a few of my cousins:


Autosomal DNA 'CousinCalc' equation for fourth cousin using SNPs vs. cMs.



Autosomal DNA 'CousinCalc' equation for fourth cousin, once removed using SNPs vs. cMs.

Autosomal DNA 'CousinCalc' equation for fourth cousin using SNPs vs. cMs.

Autosomal DNA 'CousinCalc' equation for fifth cousin using SNPs vs. cMs.


The next step should be to try to gather enough data regarding the results of this equation in order to determine if this is a valid calculation that I can use in my spreadsheet. If the statistics do not bear it out as valid, then the following step should be to determine if the problems mentioned above could be remedied (or avoided) by use of a program.


Updated 03/27/2014 - fix for math error in table for Frank, made hyperlinks active.

---------------------------------------------------------------------------------------------------------------- 
Update 08/25/2015 - I posted an improved equation to the Genealogy-DNA email list:

Hello,

I have been playing around with my cousin calculator again, and have tried to have it work better at the extreme limits. That is, have it return a small result when the cMs are huge, and return a huge result when the cMs are small. I received some cMs, SNP counts, and relationships from Robert James Liguori, which includes some data on close relationships that I do not have. 

  
If anyone would care to humor me and plug it into your spreadsheet from FTDNA, my cousin calculator version 2 equation now looks like this: 

  
= SQRT(1/SQRT((F3*F3) / G3)*(0.75*SQRT(G3/F3))*(G3/(F3*600))) 

  
where:

   F3 is the cM segment values from column 'F' on line 3
   G3 is the SNP segment value from column 'G' on line 3 

  
I would imagine that some academics would prefer exponential or natural logs in the equation, but this one was fairly straightforward for me to test. It certainly can be improved. 

  
A few things I know at the moment:

 - This works better on segments than it does on sums.
 - It still does not work well for cousins less than 1, but this is an improvement.
 - It does not handle "2nd Removed," etc. very well.
 - It is mostly designed to work with values in excess of 3 cMs (use values greater than 3 cMs). Therefore, if you see the wrong value, please check the size in cMs first. 
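For anyone who prefers to test outside of a spreadsheet, here is a minimal Python sketch of the version 2 equation, transcribed term by term from the spreadsheet formula. The function names and the sample values are assumptions for illustration. The second function is my own algebraic simplification of the same expression (it reduces to SNPs / (sqrt(800) x cMs^1.25)), included only as a cross-check and not as a new equation.

```python
import math


def cousin_calc_v2(cm: float, snps: float) -> float:
    """Literal transcription of the spreadsheet formula.

    cm   : the cM value (spreadsheet cell F3)
    snps : the SNP value (spreadsheet cell G3)
    """
    if cm <= 0 or snps <= 0:
        raise ValueError("cm and snps must be positive")
    return math.sqrt(
        1 / math.sqrt((cm * cm) / snps)
        * (0.75 * math.sqrt(snps / cm))
        * (snps / (cm * 600))
    )


def cousin_calc_v2_simplified(cm: float, snps: float) -> float:
    """Same expression after simplifying the nested square roots."""
    return snps / (math.sqrt(800) * cm ** 1.25)


if __name__ == "__main__":
    # Hypothetical segment, for illustration only: 10 cM over 2,500 SNPs.
    print(cousin_calc_v2(10.0, 2500.0))
    print(cousin_calc_v2_simplified(10.0, 2500.0))
```

The two functions should agree to within floating-point rounding. As noted in the list above, values under about 3 cMs are outside the range the equation was designed for.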

  
   I designed it so that it could handle small values (large cousin values), just so I could see how large a cousin value it would return. The largest value that I have seen generated is around 100th cousin, which I would guess would be in the vicinity of 200 generations, or 5,000 years at 25 years per generation. 

  
Thanks for the data from Robert James Liguori.


Fun stuff.


---------------------------------------------------------------------------------------------------------------- 









4 comments:

Robert James Liguori said...

Hi there. I enjoyed this blog post. I am looking to build an autosomal calculator myself. Have you learned anything more in the last year?

I have many instances of pedigree collapse in my tree, btw.

Autosomal Sharing Examples
http://robertjliguori.blogspot.com/2015/04/autosomal-sharing-examples.html

Thanks,
Robert

Odon said...

Hello Robert,

Thanks for the comment. And thanks for the link. I have an account on GedMatch, and am interested in looking up your SNP information there.

Let me see, yes, a year later, I still find the cousin calculator to be helpful for my studies.

What I have learned...

I have several in my Y-DNA (I1-M253) group that have tested for autosomal now, and we are finding that FTDNA is showing many reported as not a match. To see the shared segments, I have them upload their data to GedMatch, then do a one-to-one compare, using these settings:

GEDmatch.Com Autosomal Comparison

Minimum threshold size to be included in total = 500 SNPs
Mismatch-bunching Limit = 100 SNPs
Minimum segment cM to be included in total = 1.0 cM

I find that we have an autosomal match on the order of 3 to 4 cMs, apparently too small for FTDNA or GedMatch to call them a match. The 'cousin calculator' places them at about 4th to 8th cousins, with a maximum of 11th cousins (presumably the 11th cousin matches are for the female lines). To me, this means those of us who have tested are related as no more than 8th cousins, or about 16 generations. So, we should be looking at a time frame for the group from about 250 to 400 years ago (16 generations x 25 years per generation = 400 years).
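The conversion used here can be sketched as a small helper. It follows the rule of thumb used in this post (two generations per cousin degree, 25 years per generation), which is a rough convention of mine and not a standard genealogical definition. The names are hypothetical.

```python
YEARS_PER_GENERATION = 25  # assumption used throughout the post


def cousin_to_generations(cousin_degree: int) -> int:
    """Rule of thumb from the post: about two generations per cousin degree."""
    if cousin_degree < 1:
        raise ValueError("cousin_degree must be 1 or greater")
    return 2 * cousin_degree


def cousin_to_years(cousin_degree: int,
                    years_per_generation: int = YEARS_PER_GENERATION) -> int:
    """Approximate years spanned, given the rule of thumb above."""
    return cousin_to_generations(cousin_degree) * years_per_generation


if __name__ == "__main__":
    print(cousin_to_generations(8), cousin_to_years(8))      # 16 400
    print(cousin_to_generations(100), cousin_to_years(100))  # 200 5000
```

The second line reproduces the "100th cousin, 200 generations, 5,000 years" figure from the update in the post.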

We have one in this group from our estimated Country and County of origin (Somerset, England), so it was very disappointing to see that FTDNA does not consider our Y-DNA group to be an autosomal match, when we have a TMRCA on our Y-DNA match.

Regardless, we were able to use the information to devise a theory on why our ancestors migrated to America, based upon the general time frame (early 1700s). Basically, it would appear that they may have been part of the merchant class that participated in the Monmouth Rebellion, and hence were part of the Bloody Assizes.

That is supported by the ethnic groups that we see from the GedMatch 'Chromosome Painting,' where we find segments from the Baloch peoples as well as from southern India. (A merchant trader could have traveled to East India, or known traders from that area during the period. I have used the chromosome painting estimates for each shared segment.)

With this GedMatch one-to-one data, I have a hunch that SNPs may also help place a date on our relationship.

Since I am dealing with small segments, I looked up how they might date Neanderthal segments. I am currently looking at a formula for dating cMs, but it does not appear to work well:

See "Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans," by Gundula Povysil and Sepp Hochreiter (2014):

http://biorxiv.org/content/biorxiv/early/2014/07/15/003988.full.pdf

(Section 4.1.1) Exponentially Distributed IBD Lengths (pg 55)

The length of an IBD segment is exponentially distributed with a mean of

100/(2g) cM (centi-Morgans),

where g is the number of generations which separate the two haplotypes that share a segment from their common ancestor (4, 17, 38, 55, 56). Ulgen and Li (57) recommend to use a recombination rate, cM-to-Mbp ratio, of 1, however it varies from 0 to 9 along a chromosome (62).

Note the quote:

"We are not able to perform reliable age estimations of the IBD segments based on their length."

According to my own calculations (using this formula), it would appear that 250 years equates to 10 generations at 25 years per generation. The equation would give you 5 cM for 10 generations.
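That arithmetic can be sketched as follows, reading the mean segment length as 100/(2g) cM, which is the reading that gives 5 cM for 10 generations. The function names are my own, and since the paper itself cautions that ages estimated from segment length are not reliable, treat this as a back-of-the-envelope check only.

```python
def mean_ibd_length_cm(generations: float) -> float:
    """Mean IBD segment length in cM for g generations: 100 / (2g)."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return 100.0 / (2.0 * generations)


def generations_from_mean_length(mean_cm: float) -> float:
    """Invert the relation: g = 100 / (2 * mean length in cM)."""
    if mean_cm <= 0:
        raise ValueError("mean_cm must be positive")
    return 100.0 / (2.0 * mean_cm)


if __name__ == "__main__":
    years = 250
    g = years / 25                         # 25 years per generation, as in the post
    print(g, mean_ibd_length_cm(g))        # 10.0 5.0
    print(generations_from_mean_length(5.0))  # 10.0
```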

Yu is dated, and the assumption they made was based upon Ulgen and Li, with the cM-to-Mbp ratio, of 1, however it varies from 0 to 9 along a chromosome as reported by the Rutgers study:

http://web.archive.org/web/20070113005025/http://compgen.rutgers.edu/maps/compare.pdf

- Dave

Odon said...

Hi Robert,

One more note (hoping to preserve the format this time).

What I would also like to further study (or confirm) in the way of dating SNPs
goes something like this:

Triangulation of 6 individuals from HAM DNA Group #1 (I1-M253) in August, 2015:

SNP size**   Cousin Calculator   Years ago             Max cMs
             (ccalc maximum)                           (sum / # of segments??)
observed:
1150         4th cousin          250 years ago
750          5th cousin          250 years ago
triangulated:
500 ?        11th cousin         ~500 years ago        64.0 ??
250 ?        20th cousin         ~1,000 years ago      32.0 ??
extrapolating:
125 ?        40th cousin         ~2,000 years ago      16.0 ??
62 ?         80th cousin         ~4,000 years ago      8.0 ??
31 ?         160th cousin        ~8,000 years ago      4.0 ??
15 ?         320th cousin        ~16,000 years ago     2.0 ??
8 ?          640th cousin        ~32,000 years ago     1.0 ??
4 ?          1280th cousin       ~64,000 years ago     0.5 Neanderthal ??
2 ?          2560th cousin       ~128,000 years ago
1 ?          5120th cousin       ~256,000 years ago
0.5 ?        10240th cousin      ~512,000 years ago

** SNP size given as the sum of matching SNPs divided by the number of shared segments

- Dave

Odon said...

Hi Robert,

Try this one with your data:


= SQRT(1/SQRT((F3*F3) / G3)*(0.75*SQRT(G3/F3))*(G3/(F3*600)))

where:

F3 is the cM segment values from column 'F' on line 3
G3 is the SNP segment value from column 'G' on line 3

I would imagine that some academics would prefer exponential or natural logs in the equation, but this one was fairly straightforward for me to test. It certainly can be improved.

A few things I know at the moment:

- This works better on segments than it does on sums.
- It still does not work well for cousins less than 1, but this is an improvement.
- It does not handle "2nd Removed," etc. very well.
- It is mostly designed to work with values in excess of 3 cMs (use values greater than 3 cMs).
Therefore, if you see the wrong value, please check the size in cMs first.