non-model species, unphased genotype data from whole-genome sequencing #5
Vicencio,
A few thoughts:
- Running across populations is not ideal, as the LD structure could be influenced by population structure. LDhat assumes your data come from a homogeneous population.
- When you say "this occurs on chromosomes and populations that have few SNPs (e.g. 10,000 or so)", does this imply that you're running LDhat on whole chromosomes? If so, it may be that the algorithm is not converging, despite what you see in the likelihood. In other data, we have generally split the data into chunks of maybe ~1000-2000 SNPs, with, say, a 200 SNP overlap between windows. In doing so, the MCMC algorithm is better able to converge (a minimal splitting sketch follows this list).
- I would not recommend the hotspot model for your data, as that model was originally intended for human data with larger sample sizes.
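As a minimal sketch of the chunking step described above, assuming an LDhat-style locs file (a header line followed by positions in kb; the file name is a placeholder, and the corresponding columns of the sites file would need slicing in the same way):

```python
# Sketch: split an LDhat locs file into overlapping windows of SNPs.
# Assumes the first line is a header ("n_loci length L/C") and the
# remaining lines hold positions in kb -- check your own files, as
# layouts produced by different converters vary.

def read_positions(locs_path):
    """Read SNP positions (in kb) from an LDhat-style locs file."""
    with open(locs_path) as fh:
        next(fh)  # skip the header line
        return [float(tok) for line in fh for tok in line.split()]

def overlapping_windows(positions, size=2000, overlap=200):
    """Yield (start_index, end_index) chunks of `size` SNPs, each
    sharing `overlap` SNPs with the previous chunk."""
    step = size - overlap
    for start in range(0, len(positions), step):
        end = min(start + size, len(positions))
        yield start, end
        if end == len(positions):
            break

positions = read_positions("chr1.locs")
for i, (start, end) in enumerate(overlapping_windows(positions)):
    print(f"window {i}: SNPs {start}-{end - 1}, "
          f"{positions[start]:.3f}-{positions[end - 1]:.3f} kb")
```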
Good luck!
Adam
On 20 November 2017 at 07:40, vicenciooostra wrote:
Dear Adam, Gil, and other developers,
Thank you very much for developing and maintaining LDhat. I was hoping you could help me with a couple of questions, and I hope this is the right place for that.
I am using LDhat interval to estimate recombination rates along the genome in different fish populations. We are using SNP data from 30X whole-genome sequences for 4 to 5 individuals per population. I initially tried phasing with SHAPEIT and then using LDHelmet, but I abandoned phasing because I suspect it is not very accurate given our per-population sample sizes (although total N across all populations is 40), and we have no experimental crosses to estimate switch error rates (or any reference panel).
I am now using LDhat interval on unphased genotype data (converted from VCF to LDhat format using vcftools), running with 15 million iterations, a burn-in of 150,000, and a block penalty of 50 (after evaluating 0, 5, 10 and 20). I computed my own likelihood tables and used a single theta estimated for the whole metapopulation (as the point is to compare across populations, I wanted to treat them equally). My first question is: do these parameters make sense?
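For reference, a minimal sketch of the Watterson estimator commonly used for the per-site theta that goes into the likelihood-table step. The haploid sample size of 2n for unphased diploid genotypes and all numbers in the example are illustrative assumptions:

```python
# Sketch: Watterson's estimator of theta per site,
#   theta_W = S / (a_n * L),  with  a_n = sum_{i=1}^{n-1} 1/i,
# where n is the number of sequences (2 * individuals for unphased
# diploid genotype data), S the number of segregating sites, and
# L the number of sites surveyed.

def watterson_theta_per_site(n_sequences, n_segregating, n_sites):
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return n_segregating / (a_n * n_sites)

# Illustrative numbers only: 40 diploid individuals (n = 80 sequences),
# 1.5M segregating sites over a 500 Mb callable genome.
print(watterson_theta_per_site(80, 1_500_000, 500_000_000))
```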
Looking at the MCMC chains, the runs converge, in the sense that the likelihoods and the number of change points stabilise within a million iterations and then stay the same (although in one case there is a jump to a different value around the 10 millionth iteration). However, the map lengths vary more between iterations; in some cases there is substantial spread even in the final iterations. In some cases, but certainly not always, this occurs on chromosomes and populations that have few SNPs (e.g. 10,000 or so). Why would this be? I suspect this might give biased recombination estimates. Should I change my parameters, e.g. increase the burn-in or the number of runs?
(Here are some plots to illustrate, mainly from cases that failed to converge, showing the likelihoods, number of change points, and map lengths plotted against the iteration/run: https://www.dropbox.com/s/lona2w9egwtqcjt/ldhat_convergence_illustrative_plots.pdf?dl=0)
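As one way to quantify this rather than eyeballing the traces, a crude Geweke-style stationarity check can be run on the sampled likelihoods and map lengths. A sketch, assuming the samples have already been tabulated into a plain-text file with columns iteration, likelihood, map length (this layout is an assumption, not LDhat's native output):

```python
# Sketch: a rough stationarity check on MCMC traces.
# Assumes a whitespace-delimited file with columns:
# iteration, likelihood, map_length (layout assumed -- adapt
# to however you extracted the samples from interval).
import numpy as np

trace = np.loadtxt("interval_trace.txt")
for name, col in [("likelihood", 1), ("map length", 2)]:
    x = trace[:, col]
    first, last = x[: len(x) // 10], x[len(x) // 2 :]  # first 10% vs last 50%
    # Geweke-style z-score; ignores autocorrelation, so treat
    # |z| >> 2 only as a rough sign of non-stationarity.
    z = (first.mean() - last.mean()) / np.sqrt(
        first.var() / len(first) + last.var() / len(last))
    print(f"{name}: first-10% mean {first.mean():.3f}, "
          f"last-50% mean {last.mean():.3f}, z = {z:.2f}")
```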
After running stats to get the actual recombination rates across the genome, I am getting sensible results overall when comparing populations. However, when comparing in detail with LD calculated independently (using vcftools --geno-r2), I find some regions of supposedly very low recombination where LD is also low (whereas it should be high if recombination is low). Averaging both measures in windows of 100 kb, the correlation between them is around -0.30. Given that LDhat uses LD to estimate recombination rates, I would have expected a higher (absolute) correlation between these two measures. Why do you think it is so low? Calculating LD from genotype correlations (as vcftools does) with only 4 or 5 individuals per population is obviously not very precise.
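For concreteness, a sketch of one way to compute that windowed comparison, assuming a pre-made table of per-SNP rho (layout assumed) alongside the vcftools --geno-r2 output, with 100 kb windows and a minimum SNP count per window:

```python
# Sketch: correlate window-averaged rho with window-averaged r^2.
# Assumes two pre-made tables: one with columns (pos_bp, rho_per_kb)
# derived from LDhat's stats/rates output (layout assumed), and the
# vcftools --geno-r2 output (columns CHR, POS1, POS2, N_INDV, R^2).
import pandas as pd
from scipy.stats import spearmanr

WIN = 100_000  # 100 kb windows

rho = pd.read_table("chr1_rho.txt")   # columns: pos_bp, rho_per_kb (assumed)
ld = pd.read_table("chr1.geno.ld")    # vcftools --geno-r2 output

rho["win"] = rho["pos_bp"] // WIN
ld["win"] = ((ld["POS1"] + ld["POS2"]) // 2) // WIN  # assign pair to midpoint

rho_win = rho.groupby("win").agg(mean_rho=("rho_per_kb", "mean"),
                                 n_snps=("pos_bp", "size"))
rho_win = rho_win[rho_win["n_snps"] >= 20]   # keep windows with >= 20 SNPs
r2_win = ld.groupby("win")["R^2"].mean().rename("mean_r2")

both = rho_win.join(r2_win, how="inner").dropna()
print(spearmanr(both["mean_rho"], both["mean_r2"]))
```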
Finally, although I am not specifically looking for hotspots, do you think I should model them anyway? I am now summarising rho in 100 kb windows with at least 20 SNPs. I wonder how much my results will be biased if hotspots turn out to be very important, as I am not sure how to assess (lack of) fit of the non-hotspot model.
Thanks very much for any input you’d be able to provide.
Best regards,
Vicencio
Hi Adam, thank you so much for your helpful suggestions, including not using the hotspot model.

I am not running LDhat across populations; each of my runs is within a single population. After the LDhat runs I then compare recombination rates, and in particular the spatial distribution of high- and low-recombination regions along the chromosome, between populations.

Yes, I am currently running on whole chromosomes. Thank you for the excellent suggestion to split the chromosomes into chunks of 1000-2000 SNPs; I will try that and see whether convergence improves. Just to clarify, I am using interval, not rhomap, to assess variability in recombination rate along the chromosome.

One final question: I am running LDhat separately for each population, with the goal of comparing populations in a pairwise manner, i.e. pop A1 with pop A2, pop B1 with pop B2, etc. In many cases the two populations in a pair differ a lot in the number of SNPs. To make a fair comparison between pop A1 and pop A2, I have used only the SNPs shared between them: when preparing the input files for pop A1 I removed any SNPs not also found in pop A2, and vice versa. The reason is that pop A1 and A2 differ considerably in diversity as well as in recombination, and I want to be sure that the difference in recombination I measure between them is not simply due to the difference in diversity. Do you think diversity is indeed a potentially confounding factor, and is my solution reasonable?

All the best,
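A sketch of that shared-SNP filtering step, assuming plain-text VCFs and placeholder file names (for large compressed VCFs, a tool such as bcftools isec would be the more practical route):

```python
# Sketch: restrict two populations' VCFs to their shared SNP positions
# before converting to LDhat input. File names are placeholders.

def snp_positions(vcf_path):
    """Return the set of (chrom, pos) for all records in a plain-text VCF."""
    sites = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos = line.split("\t", 2)[:2]
            sites.add((chrom, int(pos)))
    return sites

shared = snp_positions("popA1.vcf") & snp_positions("popA2.vcf")

def filter_vcf(src, dst, keep):
    """Write only header lines and records whose site is in `keep`."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            chrom, pos = line.split("\t", 2)[:2]
            if (chrom, int(pos)) in keep:
                fout.write(line)

for pop in ("popA1", "popA2"):
    filter_vcf(f"{pop}.vcf", f"{pop}.shared.vcf", shared)
```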