-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some game initialisation parameters #37
Conversation
This is a first step to let user initialise the game with different parameters from the app directly. Here only for one parameter, the `rng_seed`.
Now it show an error message when corresponding parameters are invalid.
Great! |
This PR is rather big sorry. Especially the general documentation about the generative model ("Quantitative traits simulation (T1 and T2)"). Here I changed a bit the notation in regards of what is in the "setup report": Also I would have 2 questions First do you think the data generation for the box plot are correct ?
With the default parameters we have (for trait 1):
However, in the setup reports section "7.7 Simulate traits 1 and 2" it shows:
I know that the "observed" variance of the genetic effects will depend on the actual values of the simulated genotypes and may not be exactly the same as Also, to give an visual idea of the effect of the "pleiotropy parameters" I wanted to show a simulation of this plot we have in the setup report: I tried to find a way to calculate What I did (which is wrong) was:
I don't consider the But So: Since I don't know the So finally: But when I test with the values in the setup report I found a correlation < -1 Also the (co)variances of the You are probably quite busy and I may ask too much, but I would appreciate if I could get some directions, maybe what could help me would be a bit more information about how we can show that:
I could ask Jacques too. Thank you very much. |
Ok, I'll try to look at it |
"I found it more straightforward to understand, Y for Year." Q1: "First do you think the data generation for the box plot are correct ?" "I know that the "observed" variance of the genetic effects will depend on the actual values of the simulated genotypes and may not be exactly the same as but wouldn't this difference be too big ? (+80%)" I will answer your 2nd question later. |
I rather speak about "trait 1" and "trait 2" but indeed this could be confusing too.
My idea here is to be able to show how the phenotypes will look like (or what the user could expect) dynamically when the user "play" with the parameters, so I need something that can be executed quickly (~0.1s, <<0.5s). Therefore I generate the genetic values directly rather than simulating some genotypes, some SNP allelic effects and calculate the genetic values (it would be too long, especially the genotype simulation, it takes ~20s in the setup script).
I just noticed that from the the "game setup report" (that takes about 1min 30s to execute). I didn't write any code that could check that automatically. I can do it but it may not be a priority over other improvement of the game. I will at least check it manually few times. |
Ok, I understand better now. it's just to help visualize. So forget what I wrote, and what you did ("I generate the genetic values directly") is fine by me. About Q2, if again you just want to show some pedagogical plot, then the easiest would simply to draw from MASS::mvrnorm with a covariance matrix computed from cor.pleio. I think it's fine since it is only an example and not the real data. |
We just have to figure out how to compute this covariance matrix from |
I'm thinking out loud, here.
For a given individual Let's ignore the residual error, thus:
We then have the formula of a covariance for a linear combination and we also use the facts that only the effects of the We also assume that, for two different pleiotropic SNPs, there is no covar btw their allele effects, thus: Now we use the fact that the X's are considered fixed: We can compute Oh wait, we should also draw The only remaining difficulty here I think is to quickly generate the values for Does it help? p.s.: the formula in section 7.4.3 comes from Habier et al (2007); you can also look at the course by A. Legarra |
For that I had to refactor how the genetic values are generated. (cf. discussion on #37) I used a simulation approach where I simulate the genotypes and marker effects and then calculate the genetic values. This is not the best in my opinion since the simulation of the genotypes is a bit slow (~3s), fortunately the genotypes are simulated only once (when the "game initialisation" windows apprears) as this simulation do not depends an any parameter for now. However, the marker effects simulation and genetic values calculation are done every time `sigma_a2` or the pleiotropy parameters change. This calculation takes a bit of time and the plots update is not 100% smooth but it is acceptable (on my hardware).
For that I had to refactor how the genetic values are generated. (cf. discussion on #37) I used a simulation approach where I simulate the genotypes and marker effects and then calculate the genetic values. This is not the best in my opinion since the simulation of the genotypes is a bit slow (~3s), fortunately the genotypes are simulated only once (when the "game initialisation" windows apprears) as this simulation do not depends an any parameter for now. However, the marker effects simulation and genetic values calculation are done every time `sigma_a2` or the pleiotropy parameters change. This calculation takes a bit of time and the plots update is not 100% smooth but it is acceptable (on my hardware).
For that I had to refactor how the genetic values are generated. (cf. discussion on #37) I used a simulation approach where I simulate the genotypes and marker effects and then calculate the genetic values. This is not the best in my opinion since the simulation of the genotypes is a bit slow (~3s), fortunately the genotypes are simulated only once (when the "game initialisation" windows apprears) as this simulation do not depends an any parameter for now. However, the marker effects simulation and genetic values calculation are done every time `sigma_a2` or the pleiotropy parameters change. This calculation takes a bit of time and the plots update is not 100% smooth but it is acceptable (on my hardware).
For that I had to refactor how the genetic values are generated. (cf. discussion on #37) I used a simulation approach where I simulate the genotypes and marker effects and then calculate the genetic values. This is not the best in my opinion since the simulation of the genotypes is a bit slow (~3s), fortunately the genotypes are simulated only once (when the "game initialisation" windows apprears) as this simulation do not depends an any parameter for now. However, the marker effects simulation and genetic values calculation are done every time `sigma_a2` or the pleiotropy parameters change. This calculation takes a bit of time and the plots update is not 100% smooth but it is acceptable (on my hardware).
For that I had to refactor how the genetic values are generated. (cf. discussion on #37) I used a simulation approach where I simulate the genotypes and marker effects and then calculate the genetic values. This is not the best in my opinion since the simulation of the genotypes is a bit slow (~3s), fortunately the genotypes are simulated only once (when the "game initialisation" windows apprears) as this simulation do not depends an any parameter for now. However, the marker effects simulation and genetic values calculation are done every time `sigma_a2` or the pleiotropy parameters change. This calculation takes a bit of time and the plots update is not 100% smooth but it is acceptable (on my hardware).
Yes thank you very much !
For the population generated with by coalescence 1000 individuals and 36414 markers we have:
For the centered values we have this distribution: This is quite stable as we have a lot of individuals and markers. And for the initial population given to the breeders (1000 hapo-diploidisations from the above mentioned population) we have:
For the centered values we have mostly a similar distribution (visually). Also, for what I said:
And your reply
I tested a bit and found that for the population generated with coalescence we have indeed their genetic variance approximately equal the the theoretical values. But for the initial population given to breeders ("Coll"), which come from haplodiploidization of the "coalecence population", their genetic variance is approximately equal to 2 times the theoretical values. Code (click to expand)library(rutilstimflutre)
I <- 1000
ind.ids <- sprintf(fmt = paste0("ind%0", floor(log10(I)) + 1, "i"), 1:I)
nb.chrs <- 10
L <- 10^6 # chromosome length, in base pairs
mu <- 10^(-8) # neutral mutation rate in events / base / generation
(u <- mu * L) # neutral mutation rate in events / chrom / gen
c <- mu # recomb rate in events / base / gen
(r <- c * L) # recomb rate in events / chrom / gen
Nea <- 5 * 10^4 # ancestral
Neb <- 10 * 10^2 # bottleneck
Nec <- Ne0 <- 5 * 10^3 # current
(theta <- 4 * Ne0 * u) # scaled neutral mutation rate in events / chrom
(rho <- 4 * Ne0 * r) # scaled recomb rate in events / chrom
T1 <- 8000
T2 <- T1 + 2000
(alpha <- -(4 * Ne0) / T1 * log(Neb / Ne0)) # exp growth rate from Nec to Neb
(cmd <- paste0(
"-eG ", T1 / (4 * Ne0), " ", alpha,
" -eN ", T2 / (4 * Ne0), " ", Nea / Ne0
))
g0 <- simulCoalescent(
nb.inds = I,
ind.ids = ind.ids,
nb.reps = nb.chrs,
pop.mut.rate = theta,
pop.recomb.rate = rho,
chrom.len = L,
other = cmd,
nb.pops = 1,
get.trees = FALSE,
get.tmrca = FALSE,
permute.alleles = TRUE,
verbose = 1
)
afs <- estimSnpAf(X = g0$genos, allow.updating = TRUE)
thin.snps <- names(afs)
sigma.a2 <- c(100, 0.81)
(sigma.beta2 <- sigma.a2 / (2 * sum(afs[thin.snps] * (1 - afs[thin.snps]))))
# (constant <- 4 * sum(afs[thin.snps]^2))
(sigma.beta2 <- c(
trait1 = sigma.beta2[1],
trait2 = sigma.beta2[2]
))
prop_pleio <- 0.4
cor_pleio <- -0.7
snp.effects12 <- simulSnpEffectsTraits12(
snp.ids = thin.snps,
sigma.beta2 = sigma.beta2,
prop.pleio = prop_pleio, cor.pleio = cor_pleio
)
G0 <- g0$genos %*% snp.effects12$Beta
(cor(G0)[1,2])
(var(G0[,1]))
(var(G0[,2]))
coll <- list()
crosses <- data.frame(
parent1 = ind.ids,
parent2 = NA,
child = sprintf(
fmt = paste0("Coll%0", floor(log10(I)) + 1, "i"),
1:I
),
stringsAsFactors = FALSE
)
loc.crossovers <- drawLocCrossovers(crosses, sapply(g0$haplos, ncol))
system.time(
coll$haplos <- makeCrosses(
haplos = g0$haplos, crosses = crosses,
loc.crossovers = loc.crossovers,
howto.start.haplo = 0, nb.cores = 1
)
)
coll$genos <- segSites2allDoses(
seg.sites = coll$haplos, ind.ids = crosses$child,
snp.ids = colnames(g0$genos)
)
G_coll <- coll$genos %*% snp.effects12$Beta
(cor(G_coll)[1,2])
(var(G_coll[,1]))
(var(G_coll[,2])) From your comment, I have made an implementation in c3f3bec However, since we would still have to generate the genotypes for such calculation, I opted for an approach where I generate the genotypes (this is the long calculation), the marker effects (this is rather fast) and calculate the genetic values from that (rather fast). To quickly generate the genotype I first draw some alleles frequency (AF) with a beta distribution and then deduced the genotypes. But I could only do it for the "Coll" population. Since they are homozygous, (so their genotypes for each markers is only 0 or 2) given the number of individuals and the AF I can deduce the number of individuals with genotype 0 or 2. (When I write this message, I realise that maybe I could have done it for the previous population too by simply "create a big urn" with the 2 alleles in their correct proportion and let each individuals draw 2 of these alleles. I didn't thought of that when I was coding, but since the breeders will start with the "Coll" population I think it is ok to show how this "Coll" population will look like rather that their parent population). For the parametrisation of the beta distribution, I looked on the AF distribution in the setup report. For the "Coll population" the estimated beta distribution was almost with parameters 0.5/0.5 so I used this parametrisation: quick_afs_simul <- function(n_marker_sim = 15000, shape1 = 0.5, shape2 = 0.5, seed = 42) {
saved_seed <- .GlobalEnv$.Random.seed
set.seed(seed)
sim.afs <- rbeta(n_marker_sim, shape1, shape2)
.GlobalEnv$.Random.seed <- saved_seed
sim.afs
} Then I simulate the genotypes with: quick_geno_simul <- function(afs, n_inds = 1000, seed = 43) {
saved_seed <- .GlobalEnv$.Random.seed
set.seed(seed)
geno_centered <- sapply(afs, function(af) {
n <- round(n_inds * af)
sample(c(rep(2 - 2 * af, n), rep(-2 * af, n_inds - n)))
})
.GlobalEnv$.Random.seed <- saved_seed
geno_centered
} This function is still a bit long to run (~3s on my hardware). The main problem is that I loop (with the And the marker effects are simulated with: quick_g0_simul <- function(T1_sig_a2,
T2_sig_a2,
afs,
prop_pleio,
cor_pleio,
geno,
seed = 44) {
if (!is.null(valid_variance(T1_sig_a2))
|| !is.null(valid_variance(T2_sig_a2))) {
return(NULL)
}
saved_seed <- .GlobalEnv$.Random.seed
set.seed(seed)
sigma.beta2 <- c(T1_sig_a2, T2_sig_a2) / (2 * sum(afs * (1 - afs)))
n_snp <- length(afs)
n_pleio <- round(n_snp * prop_pleio)
n_non_pleio <- n_snp - n_pleio
Sigma.beta.nopleio <- matrix(c(sigma.beta2[1], 0,
0, sigma.beta2[2]),
nrow = 2, ncol = 2)
cov.pleio <- cor_pleio * sqrt(sigma.beta2[1] * sigma.beta2[2])
Sigma.beta.pleio <- matrix(c(sigma.beta2[1], cov.pleio,
cov.pleio, sigma.beta2[2]),
nrow = 2, ncol = 2)
# I don't use MASS::mvrnorm and use a "manual approach" here so that the
# generated values remain the same more or less the covariance structure
# whatever the provided parameters. This will help the visualisation as
# the cloud point will be deformed.
base_random_values <- matrix(rnorm(n_snp * 2), 2)
if (n_non_pleio != 0) {
R_nopleio <- chol(Sigma.beta.nopleio)
beta_nopleio <- t(R_nopleio) %*% base_random_values[, 1:n_non_pleio]
} else {
beta_nopleio <- c()
}
if (n_pleio != 0) {
R_pleio <- chol(Sigma.beta.pleio)
beta_pleio <- t(R_pleio) %*% base_random_values[, (n_non_pleio+1):n_snp]
} else {
beta_pleio <- c()
}
beta <- t(cbind(beta_nopleio, beta_pleio))
.GlobalEnv$.Random.seed <- saved_seed
geno %*% beta
} (I don't use With this, I visually do not see much differences from the code in the setup. Thank you for your help ! 🙇 |
"With this, I visually do not see much differences from the code in the setup." I don't have the time now to check all this, but I read through what you did and it looks fine. The most important is to use this shortcut only for visualization purposes. The actual SNP genotypes should still be generated with the sequential coalescent with recombination, as it also generates LD btw SNPs, which is important for genomic prediction. |
For that I had to refactor how the genetic values are generated. (cf. discussion on #37) I used a simulation approach where I simulate the genotypes and marker effects and then calculate the genetic values. This is not the best in my opinion since the simulation of the genotypes is a bit slow (~3s), fortunately the genotypes are simulated only once (when the "game initialisation" windows apprears) as this simulation do not depends an any parameter for now. However, the marker effects simulation and genetic values calculation are done every time `sigma_a2` or the pleiotropy parameters change. This calculation takes a bit of time and the plots update is not 100% smooth but it is acceptable (on my hardware).
This PR implements the ability to initialise the game with different parameters than the default ones.
The game initialisation script have a lot of parameters and not all of them have been implemented yet. This PR implements:
To have a rather concise UI, each group of parameters are encapsulated in collapsible sections that extend when clicked:
The documentation about the role of each parameters is done by providing some general guidelines at the beginning of each section (once expanded), and by clicking on the labels of each parameter, for example (for the RNG seed):
Moreover, some information related to the current set of parameters are shown on the right hand side in order to help the user in their choices (cf. the section showing the implementation of each group of parameters below).
In order to avoid crashes of the game initialisation script as much as possible, each parameter is validated in real time by the application. In case some parameters are invalid, they will appears in red along with a feedback message explaining what is wrong.
If any parameters is invalid, it will be impossible for the user to trigger the game initialisation (the button is automatically disabled)
Cost related parameters:
The right part show a summary of the provided cost in terms of "phenotyping plot" and in "Mendels" (the game's money).
Phenotyping simulation parameters:
The right part show the values of different variances of the generative model calculated with the current set of inputs.
Along with 2 box plots (one for each trait) that show what the phenotype of the initial population can look like. (similar to one of the plot in section "7.7 Simulate traits 1 and 2" of the setup report).
These values are quickly generated by the application following the current parameters but do not correspond to what will be actually generated by the initialisation script (as mentioned below the plots).
One important things to keep in mind is the game initialisation script is rather complex and it is possible to have a set of apparently "valid" parameters (eg. heritability is between 0 and 1 etc...) that leads to invalid "simulations parameters" (ie. other values calculated by the initialisation script and used for the simulation, for example the variances of the generative model in this case).
Here, I check that all those variances are acceptable (ie. stictly positive). Those that are not appears in red and the initialisation can not be triggered.