forked from tskit-dev/msprime-1.0-paper
-
Notifications
You must be signed in to change notification settings - Fork 0
/
paper.tex
1003 lines (896 loc) · 51.7 KB
/
paper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass{article}
\usepackage[round]{natbib}
\usepackage{listings}
\usepackage[english]{babel}%
\usepackage[T1]{fontenc}%
\usepackage[utf8]{inputenc}%
\usepackage{amsmath,amssymb,amsfonts}%
\usepackage{geometry}%
\usepackage{color}
\usepackage{dsfont}
\usepackage{verbatim}%
\usepackage{environ}%
\usepackage[right]{lineno}%
%\usepackage{showkeys}
% local definitions
\newcommand{\msprime}[0]{\texttt{msprime}}
\newcommand{\tskit}[0]{\texttt{tskit}}
\newcommand{\ms}[0]{\texttt{ms}}
\newcommand{\scrm}[0]{\texttt{scrm}}
\newcommand{\stdpopsim}[0]{\texttt{stdpopsim}}
\newcommand{\jkcomment}[1]{\textcolor{red}{#1}}
%% local definitions for section multiple merger coalescents
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\newcommand{\bd}{\begin{displaymath}}
\newcommand{\ed}{\end{displaymath}}
\newcommand{\IN}{\ensuremath{\mathds{N}}}%
\newcommand{\EE}[1]{\ensuremath{\mathds{E}\left[ #1 \right]}}%
\newcommand{\one}[1]{\ensuremath{\mathds{1}_{\left\{ #1 \right\}}}}%
\newcommand{\prb}[1]{\ensuremath{\mathds{P}\left( #1 \right) } }%
\NewEnviron{esplit}[1]{%
\begin{equation}
\label{#1}
\begin{split}
\BODY
\end{split}\end{equation}
}
\begin{document}
\title{msprime 1.0: an efficient and extensible coalescent simulation framework}
\author{Author list to be filled in
}
% \address{
% \section{Contact:} \href{[email protected]}{[email protected]}
\maketitle
% Abstract: ~ 150 words. Concise summary of the results.
% Key points in abstract.
% 1) Coalescent simulation is a vital tool
% 2) Coalescent with recombination is hard
% 3) Large sample sizes are crucial today
% 4) Msprime has lots of features and uses a community development model
\begin{abstract}
Coalescent simulation is a key tool in population genetics and
has been integral to coalescent theory since its beginnings.
Because of the ease at which the basic model can be simulated,
a large number of different simulators have been developed. However,
the coalescent with recombination is far more challenging to simulate,
and, until recently, it was not possible to simulate efficiently.
The \msprime\ software has revolutionised
coalescent simulation, making it possible to simulate millions
of whole genomes for the first time. We summarise the many features
of \msprime\ 1.0, which is built around a core model of efficiently
implementing recombination via the succinct tree sequence data
structure. We advocate a community oriented open-source development
process as a way to reduce duplication of effort and increase
the quality of simulation software.
\end{abstract}
% Keywords: For software, include the relevant models, algorithms and language;
\textbf{Keywords:} Coalescent simulation, Python
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Introduction: Limit of 500 words. The introduction should describe the
% significance of the software or resource presented, discuss novelty, provide an
% overview of the software or resource, and describe how researchers can access it.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Introduction}
The coalescent
process~\citep{kingman1982coalescent,hudson1983testing,tajima1983evolutionary}
models the ancestry of set of sampled genomes. Given some sampled
genomes, the coalescent provides a mathematical description of the
genealogical tree relating these to each other. Because of its
elegance and simplicity, the coalescent is now central to much of
population genetics and
genomics~\citep{hudson1990gene,hein2004gene,wakely2008coalescent}.
Simulation has been a vital
tool in coalescent theory since its earliest days~\citep{hudson1983testing},
and dozens of different simulation tools have been
developed over the decades~\citep{carvajal2008simulation,liu2008survey,
arenas2012simulation,yuan2012overview,hoban2012computer}.
Two developments of recent years, however, have made the majority
of these simulators of little relevance to modern datasets.
% Some citations here? Darwin tree of life maybe?
Firstly, whole genome sequencing technology---and in particular fourth-generation
sequencing technologies---have made chromosome-level assemblies
commonplace. Thus, regarding input data as a series of unlinked loci
(a previously reasonable assumption), is no longer defensible and
recombination must be fully accounted for. While the
single-locus coalescent for $n$ samples can be simulated in $O(n)$
time~\citep{hudson1990gene}, the coalescent with recombination is
far more challenging.
% Sentence is too long...
Programs such as Hudson's classical \ms~\citep{hudson2002generating}
can only simulate short segments under the influence of recombination,
and, until recently, approximate models of
recombination~\citep{mcvean2005approximating,staab2014scrm}
were necessary to simulate whole chromosomes.
The second development that has made the majority of
present day simulators inapplicable to modern data sets is the
exponential increase in sample size. Human datasets such
UK Biobank~\citep{bycroft2018genome} and
gnomAD~\citep{karczewski2019variation} now contain hundreds of
thousands of genomes, and there is every reason to believe that
researchers will be soon be routinely working with
datasets containing millions of samples~\citep{stephens2015big}.
Classical simulators such as \ms\ and even fast approximate simulators
such as \scrm\ simply cannot cope with this scale. Even if we
spend months of CPU time on running simulations with millions of samples,
the output would require terabytes of storage space to store and
days of CPU time to parse~\citep{kelleher2016efficient}.
The \msprime\ simulator~\citep{kelleher2016efficient,kelleher2020coalescent}
has greatly increased the scope of coalescent simulations.
Through the use of efficient data structures, it is
now straightforward to simulate millions of whole chromosomes.
The ``succinct tree sequence'' data
structure~\citep{kelleher2016efficient,kelleher2018efficient,kelleher2019inferring,
wohns2021unified},
introduced as part of \msprime\, makes it possible to store simulations
of millions of whole chromosomes in a few gigabytes, several orders
of magnitude smaller than the standard
Newick~\citep{felsenstein1989phylip} or
VCF~\citep{danecek2011variant} based formats formats.
This data structure has
also lead to major advances in forwards-time
simulation~\citep{kelleher2018efficient,haller2018tree},
ancestry inference~\citep{kelleher2019inferring,wohns2021unified}
and calculation of population genetic statistics~\citep{ralph2019efficiently}.
Through a rigorous open-source community development process,
\msprime\ has gained a large number of features since its introduction,
making a highly efficient and flexible platform for population
genetic simulation.
% This is a bit flat --- anything else to say at the end here?
This paper marks the release of \msprime\ 1.0.
% JK: Earlier notes on this para
% - Msprime 1.0 combines the features of many different simulators and is
% far more efficient than all of them. The combination of a Python API,
% an extremely efficient storage format, and a well-engineered and
% extensible codebase changes the game. Stdpopsim builds on this base
% to provide easy access to standard models of population genetics.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Methods: For software, the authors should describe key elements of the algorithms
% implemented, and are expected to provide both the source code as well as
% examples illustrating the use of the software.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Methods}
In this section we describe the main features of \msprime\ 1.0.
\subsection*{Recombination}
Recombination is implemented in \msprime\ using Hudson's algorithm, which
works backwards in time tracking the
effects of common ancestor and recombination
events~\citep{hudson1983properties,hudson1990gene,kelleher2016efficient}.
Common ancestor events combine the ancestral material of two lineages, and may
result in coalescences in the marginal trees. Recombination events
split the ancestral material for some lineage, creating two independent
lineages. Using the appropriate data structures~\citep{kelleher2016efficient},
this process is much more efficient to simulate than either
methods based on the Ancestral Recombination Graph
formulation~\citep{griffiths1991two,griffiths1997ancestral}
or the left-to-right approach~\citep{wiuf1999recombination,wiuf1999ancestry}.
In \msprime\ 1.0 recombination rates can vary along a chromosome, allowing
us to simulate recombination hotspots like \texttt{msHOT}~\citep{hellenthal2007mshot}
and empirical recombination maps similar to other
simulators~\citep[e.g.][]{shlyakhter2014cosi2}.
The implementation of recombination in \msprime\ is extensively validated
against analytical results~\citep{hudson1983properties,kaplan1985use}
and results from \ms.
The Sequentially Markovian Coalescent (SMC) is an approximation of the
coalescent with recombination
model~\citep{mcvean2005approximating,marjoram2006fast},
and was originally motivated to a significant degree by the need to
simulate longer genomes than was possible using tools like \ms.
The SMC is a good approximation to the
coalescent with recombination when we have four or fewer sampled
genomes~\citep{hobolth2014markovian,wilton2015smc}, but the
effects of the approximation are less well understood for larger
sample sizes and several approaches have been proposed to
allow simulations more closely approximate the coalescent
with recombination~\citep{chen2009fast,wang2014new,staab2014scrm}.
The SMC is an important analytical approximation
and has allowed the development of numerous inference
methods~\citep{
li2011inference,
harris2013inferring,
sheehan2013estimating,
schiffels2014inferring,
carmi2014renewal,
rasmussen2014genome,
zheng2014bayesian,
terhorst2017robust}.
\begin{figure}
\caption{\label{fig-recombination-perf}Approximate run time for \msprime\
simulations as a function of scaled recombination rate.}
\end{figure}
In human-like parameter regimes, \msprime's implementation of the
exact coalescent with recombination comprehensively outperforms
all other simulators, including those based on SMC
approximations~\citep{kelleher2016efficient}. However, it is important
to note that although the implementation of Hudson's algorithm is
very efficient, it is still quadratic in the scaled recombination rate
$\rho$, and therefore for sufficiently large $N_e$, recombination rate
or long genome, efficient SMC implementations will be faster
that \msprime.
\jkcomment{We want to round out this discussion here by giving some
sort of rough argument for why Hudson's algorithm is quadratic, and discussing
to Fig.~\ref{fig-recombination-perf} to show roughly what we can expect
to simulate in how long. See issue 28.}
The SMC~\citep{mcvean2005approximating} and
SMC'~\citep{marjoram2006fast} models are available
in \msprime\ 1.0. However, they are currently implemented using a
naive rejection sampling approach, and are somewhat slower
to simulate than the exact coalescent with recombination. These
models are therefore only appropriate for studying the SMC approximations
themselves; an important avenue for future work is to implement
efficient SMC-like models, perhaps based on the approaches used
in \texttt{cosi2}~\citep{shlyakhter2014cosi2}, to facilitate
simulation of organisms such as \textit{Drosophila melagaster}.
\subsection*{Gene conversion}
Gene conversion is \jkcomment{A couple of sentences describing gene
conversion, and what the model is. What other simulators simulate
gc? }
We have implemented this model of gene
conversion in \msprime\ 1.0, and validated the output against
\ms\ and analytical results~\citep{wiuf2000coalescent}.
\begin{figure}
\caption{\label{fig-gc-perf}Comparison of the running time
for E-coli simulations using \msprime\ and SimBac for varying
sample sizes. The parameters used are [] based on [references].}
\end{figure}
Gene conversion is particularly useful to model homologous recombination in
bacterial evolution.
We compare the performance of \msprime\ with gene conversion to
SimBac~\citep{brown2016simbac}, a simulator developed specifically to
simulate bacterial evolution. Although \msprime\ is missing some features
relevant to this case, it is far more efficient at simulating the basic
processes. An important avenue for future work
will be to add further features to model bacterial
evolution. In particular, we aim to include bacterial gene transfer and utilize
the succinct tree sequence structure to represent the resulting ancestral gene
transfer graph~\citep{baumdicker2014AGTG}.
\subsection*{Population structure}
We support all the stuff that ms does.
Demographic events where we change
the properties of the population.
We provide a debugger interface which allows the user to see the
various population sizes and growth rates over time. This includes
a way to compute coalescence rate trajectory, which provides a useful
ground truth when comparing with methods that estimate population
size over time~\citep{adrion2019community}.
The interface
is very flexible and general. We also reading in population tree
descriptions as produced by programs such as
StarBeast~\citep{heled2009bayesian}.
We support tracking lineages through time by associating a population
with each node, and having migration records. We also have
census events, which will record the location of every lineage
at a particular time.
\subsection*{Selection}
Selection was added to the coalescent ... Lots of simulators exist:
SelSim~\citep{spencer2004selsim}
mbs~\citep{teshima2009mbs},
msms~\citep{ewing2010msms},
% What about Cosi, did it simulate selection?
Cosi2~\citep{shlyakhter2014cosi2},
and discoal~\citep{kern2016discoal}.
We have implemented the basic infrastructure exists for the structured
coalescent, which makes adding support for, e.g.,
inversions straightforward~\citep{peischl2013sequential}
\subsection*{Discrete time Wright-Fisher}
The coalescent is a good approximation for ancient history, but
is bad for the recent past and in particular when we have large
sample
sizes~\citep{wakeley2012gene,bhaskar2014distortion,nelson2020accounting}.
Describe basic
approach [DOM, can you fill in please].
See ~\cite{nelson2020accounting} for more details.
We compare the performance of this simulation in Fig X
with ARGON~\citep{palamara2016argon}, a specialised DTWF simulator.
We support changing the simulation model, so that we can easily
run hybrid simulations like described by~\cite{bhaskar2014distortion}
This ia a powerful approach because it lets us combine any
number of different simulation models.
\subsection*{Finite sites mutations}
Version 0.x of \texttt{msprime} provided a very simple mutation model,
supporting only the infinite sites model.
With version 1.0, we now support a large number of mutation models,
including
JC69~\citep{jukes1969evolution},
F84~\citep{felsenstein1996hidden},
and GTR~\citep{tavare1986some} nucleotide models
and the BLOSUM62~\citep{henikoff1992amino}
and PAM~\citep{dayhoff1978} amino acid models.
The general model accepts an arbitrary transition matrix between
any number of alleles (which can be arbitrary unicode strings),
and so other models such as the Kimura two and three
parameter models~\citep{kimura1980simple,kimura1981estimation}
parameters models can be easily
and efficiently defined in user code.
Full support for the original infinite sites model is also maintained.
The results of mutation simulations across a range of models are
extensively validated
against both theoretical expectations and output from
Seq-Gen~\citep{rambaut1997seq} and
Pyvolve~\citep{spielman2015pyvolve}.
Comparing the performance of mutation generation methods is problematic.
Because there is no standard format for representing trees and
the accompanying mutations, all existing methods are focused on
writing out the resulting sequences. Methods are also not modular,
and so it is very difficult to disentangle the time required to
read in the tree file and output the sequences from the actual
time required to perform the simulation.
The mutation generation machinery in \msprime\ is extremely
efficient. For example, we simulated mutations under the JC69
model for a tree with $10^6$ leaves over $10^4$ sites (resulting
in $\sim 280000$ mutations) in less that 1.5 seconds, including
the time required for file input and output. While a systematic
comparison of run-times is not feasible here, this is orders
of magnitude faster than methods such as
Seq-Gen~\citep{rambaut1997seq} and
Pyvolve~\citep{spielman2015pyvolve}
and at least competitive with the recent method from
\cite{demaio2021phastsim}.
\textbf{TODO: would love to do a like-for-like comparison, but quite hard}
Another substantial advantage of our approach is that the simulated
mutations are stored directly in the output data format with
a defined API. Although
Pyvolve and Phastsim are implemented in Python, neither provide
programatic access to the simulated output, outputting instead
to a series of files which must subsequently be parsed.
See the Data Model section for more details and discussion.
\subsection*{Multiple merger coalescents}
Kingman's coalescent assumes that only two ancestral lineages can merge at
each merger event. Although this is generally a reasonable approximation there
are certain situations in which the underlying mathematical assumptions are
violated. For example in certain highly fecund organisms
\citep{hedgecock_94,B94,HP11,A04,irwin16}, where individuals have the ability
to produce numbers of offspring on the order of the population size and
therefore a few individuals may produce the bulk of the offspring in any given
generation~\citep{hedgecock_94}. These population dynamics violate basic
assumptions of the Kingman coalescent, and are better modelled by
`multiple-merger' coalescents \citep{DK99,P99,S99,S00,MS01}, in which more than
two lineages can merge in a given event. Multiple-merger coalescent processes
have also been shown to be relevant for modeling the effects of selection on
gene genealogies \citep{Gillespie909,DS04}.
Although multiple merger coalescents have been of significant theoretical
interest for around two decades, there has been little practical software
available to simulate these models.
\cite{kelleher2013coalescent,kelleher2014coalescent} developed packages to
simulate a related spatial continuum model~\citep{barton2010new},
\cite{zhu2015hybrid} simulate genealogies within a species tree
based on a multiple-merger model, and
\cite{becheler2020occupancy} provide a general method for simulating
multiple merger processes
as part of the Quetzal framework~\citep{becheler2019quetzal}.
The \texttt{Beta-Xi-Sim} simulator~\citep{koskela2018multi,koskela2019robust}
includes a number of extensions to the Beta-coalescent.
None of these methods work with large genomes, and very little work
has been performed on simulating multiple merger processes with recombination.
We have added two multiple merger coalescent models in \msprime\ 1.0, the
Beta-coalescent and ``dirac''-coalescent, allowing us to simulate
such models with recombination efficiently for the first time.
These simulation models have been extensively validated against
analytical results from the site frequency
spectrum~\citep{birkner2013statistical,blath2016site,hobolth2019phase}
as well as more general properties of coalescent processes.
Please see the Appendix for more details and model derivations.
\subsection*{Ancestral recombination graphs}
The \msprime~simulator generates tree sequences by simulating realisations of the
Ancestral Recombination Graph (ARG).
By default, only the marginal genealogy at each site is recorded in the tree sequence
output.
Marginal genealogies are sufficient for e.g.~simulating and storing sequence data,
and involve fewer edges than nodes than storing a realisation of the full ARG.
However, the mapping from ARG realisations to marginal tree sequences is not lossless.
For example, times of recombination events can be recovered from the former
but not the latter.
The same is true of the number of extant ancestors at each time in the past.
In addition to the default behaviour, we support the retention of the full
ARG realisation using the \texttt{record\_full\_arg} option.
This option augments the default tree sequence data structure with three
new kinds of nodes:
\begin{itemize}
\item Each recombination event is marked by two nodes with identical birth times
given by the time of the recombination event, and marked with the
\texttt{NODE\_IS\_RE\_EVENT} flag.
Each node represents an ancestor of the recombinant child node, and edges connect
the relevant segments of genome from the child node to these two ancestors.
\item We store common ancestor nodes even when the corresponding merger event
does not result in coalescence of ancestral material.
The resulting nodes are marked with the \texttt{NODE\_IS\_CA\_EVENT} flag.
\item When multiple populations are simulated, migration events are depicted by nodes
with the \texttt{NODE\_IS\_MIG\_EVENT} flag, and with birth times equal to the time of
the migration.
The population flag of the migration node identifies the population to which the lineage
belongs after the reverse time migration event.
\end{itemize}
The likelihood, i.e.~the joint sampling probability of the ancestry and mutations,
is an example of a function which is intractable when the ancestry is stored as a
marginal tree sequence, but tractable when the ancestry is given by a full
ARG realisation.
The likelihood is a central quantity in evaluating the plausibility of a putative
ancestry as an explanation of DNA sequence data, both directly through
e.g.~approaches based on maximum likelihood, and as an ingredient of
methods such as Metropolis-Hastings
\citep{kuhner2000maximum, nielsen2000estimation, wang2008bayesian}.
We provide two functions to facilitate likelihood-based inference.
Both are implemented only for the simplest case of the standard ARG with a
constant population size, and require tree sequences compatible with the
\texttt{record\_full\_arg} option as their arguments.
\begin{itemize}
\item \texttt{msprime.log\_arg\_likelihood(ts, r, N)} returns the natural logarithm of
the sampling probability of the tree sequence \texttt{ts} under the ARG with per-link,
per-generation recombination probability \texttt{r} and population size \texttt{N}.
Specifically, the function returns the logarithm of
\begin{equation*}
\Bigg( \frac{ 1 }{ 2 N } \Bigg)^{ q_c } \Bigg( \prod_{ i : \mathcal{R} } r g_i \Bigg)
\exp\Bigg( -\sum_{ i = 1 }^q \Big[\frac{ 1 }{ 2 N } \binom{ k_i }{ 2 }
+ r l_i \Big] t_i \Bigg),
\end{equation*}
where $t_i$ is the number of generations between the $(i - 1)$th and $i$th event,
$k_i$ is the number of extant ancestors in that interval, $l_i$ is the number of links
in that interval that would split ancestral material should they recombine,
$q$ is the total number of events in the tree sequence $\texttt{ts}$,
$q_c$ is the number of coalescences, $\mathcal{R}$ is the set of indices
of time intervals which end in a recombination, and $g_i$ is the corresponding
\emph{gap}: the length of material in which the recombination at the end of the
$i$th period could have taken place without affecting the resulting genealogy.
A recombination in ancestral material will always have $g_i = 1$, while a recombination
in trapped non-ancestral material can result in other values of $g_i$.
For a continuous model of the genome and a recombination in ancestral material,
we set $g_i = 1$ and interpret the result as a density.
\item \texttt{msprime.unnormalised\_log\_mutation\_likelihood(ts, m)} returns the
natural logarithm of the probability of the mutations recorded in the tree sequence
\texttt{ts} given the corresponding ancestry, assuming the infinite sites model, up to
a normalising constant which depends on the pattern of mutations,
but not on the tree sequence or the per-site, per-generation mutation
probability \texttt{m}.
Specifically, the function returns the logarithm of
\begin{equation*}
e^{ - T m / 2 } \frac{ ( T m / 2 )^M }{ M ! }
\prod_{ \gamma \in \mathcal{ M } } \frac{ h_{ \gamma } }{ T },
\end{equation*}
where $T$ and $\mathcal{M}$ are the total branch length and set of mutations
in \texttt{ts}, respectively, and for a mutation $\gamma$, $h_{ \gamma }$ is the
total branch length on which $\gamma$ could have arisen while appearing on all
of the leaves of \texttt{ts} it does, and on no others.
Unary nodes on marginal trees arising from the \texttt{record\_full\_arg} option
mean that, in general $h_{ \gamma }$ corresponds to the length of one or more
edges.
\end{itemize}
\begin{comment}
Previous draft of the ARG feature retained here as a comment.
The definition of ARGs is vague and confusing.
\cite{minichiello2006mapping} define it as the data structure
rather than the original stochastic
process~\citep{griffiths1991two,griffiths1997ancestral}.
\cite{rasmussen2014genome} have a slightly different approach.
Sometimes the ARG is useful.
The tree sequence output for \msprime\ can be turned into
an ARG using the \texttt{record\_full\_arg} option, where we store
recombination nodes.
We also provide an interface to compute the likelihood of a given ARG
under the~\cite{kuhner2000maximum} model.
\end{comment}
\subsection*{Instantaneous bottlenecks}
A common approach to modelling the effect of demographic history on genealogies
is to assume that effective population size ($N_e$) changes in discrete steps
which define a series of epochs~\citep{griffiths1994sampling, marth2004allele,
keightley2007joint,li2011inference}. In this setting of piecewise constant
$N_e$, capturing a population bottleneck requires three epochs: $N_e$ is
reduced by some fraction $b$ at the start of the bottleneck, $T_{start}$, and
recovers to its initial value at time $T_{end}$~\citep{marth2004allele}. If
bottlenecks are short both on the timescale of coalescence and mutations,
there may be little information about the duration of a bottleneck
($T_{end}-T_{start}$) in sequence data. Thus a simpler, alternative model is to
assume that bottlenecks are instantaneous ($T_{end}-T_{start} \rightarrow 0$)
and generate a sudden burst of coalescence events (a multiple merger event) in
the genealogy. The strength of the bottleneck $B$ can be thought of as an
(imaginary) time period during which coalescence events are collapsed,
i.e.\ there is no growth in genealogical branches during $B$ and the probability that a
single pair of lineages entering the bottleneck coalesce during the bottleneck
is $1-e^{-B}$. Although this simple two parameter model of bottlenecks is
attractive and both analytic results and empirical
inference~\citep{griffiths1994sampling, galtier2000detecting,
bunnefeld2015inferring} have been developed under this model, there has
been no software available to simulate data under instantaneous
bottleneck histories.
We have implemented instantaneous bottlenecks in \msprime~1.0
using a variant of Hudson's linear time single-locus coalescent
algorithm~\citep{hudson1990gene}. Instantaneous bottlenecks are specified
by adding events to the Demography object (see the Demography section)
and can be used in combination with any other demographic modelling
features. We have validated the results of these simulations by comparing
against analytical expectations for coalescence times and the
site frequency spectrum~\citep{bunnefeld2015inferring}.
\subsection*{Simulation interface}
The primary interface for \msprime\ is through a Python application
programming interface (API). This has many advantages. One is
that we can grow the interface over time. Another is that
downstream applications such as \texttt{stdpopsim}~\citep{adrion2019community}
can be developed.
\subsection*{Data interchange and interoperability}
The point of simulations is to generate data and the efficiency with
which simulated data can be incorporated into an overall application
is an important---if often overlooked---issue.
For example, Approximate Bayesian Computation
(ABC)~\citep{beaumont2002approximate,csillery2010approximate}
and more recent machine learning
methods~\citep{sheehan2016deep,schrider2018supervised,flagel2019unreasonable}
perform population genetic inference by using large numbers of
simulations as training data. Clearly, simulation efficiency is
crucial since the size and number of simulations that can be performed
largely determines the scope of what can be inferred. Equally important,
however, is the efficiency with which the simulation results can be
transformed into the specific input required by the inference method.
In the case of ABC, this is usually a set of summary statistics of the
sequence data. General approaches such as
ABCtoolbox~\citep{wegmann2010abctoolbox} connect to
simulation and summary statistic computation programs via text streams
and are quite flexible, but the overhead involved in processing
the text is often a computational bottleneck.
Other methods such as DIYABC~\citep{cornuet2008inferring}
or PopABC~\citep{lopes2009popabc} avoid this bottleneck by
developing their own specialised simulators, tightly integrating them
with the inference method. While this is certainly more efficient,
there is a great deal of duplication of effort and significant
room for error.
\jkcomment{Stuff to cover}
\begin{itemize}
\item The core problem is that the field lacks an efficient mechanism for
interchanging simulated data, and so developers are forced to choose
between efficiency and duplication of effort.
\item What are the output formats of simulators? Newick, or sequence data
basically. What's the problem with this? We don't have mutations.
Much more efficient to have the tree and mutations than the
sequence data. Fundamentally, the output of coalescent simulations
should be trees, otherwise we are throwing a lot of information
away.
\item Side note somewhere: with efficient access to branch length
statisics~\citep{ralph2019efficiently}, maybe we don't need to
simulate mutations at all.
\item Msprime uses the \tskit\ library~\citep{tskit2021tskit}, which provides
extremely efficient interchange and data processing. A quick example
of how efficient.
\item Tskit has facilited unprecedented interoperability between simulators:
Forward simulations can also use tree
sequences~\citep{kelleher2018efficient,haller2018tree}. In particular, SLiM
3.0~\citep{haller2019slim} and fwdpp~\citep{thornton2014cpp} can now output
tree sequences. Msprime can take tree sequences simulated by these forward
simulators as input and complete the simulation (``recapitation'').
See \cite{haller2018tree} for more details.
\item Recently, Phastsim/USHER~\cite{demaio2021phastsim,turakhia2021ultrafast}
have used this idea of storing the
tree plus mutations, but approach is crude compared to tskit.
\end{itemize}
\subsection*{Development model}
The community development process for \msprime\ is enabled by following an
open source development practise with a strong emphasis on code quality
and correctness. To ensure a consistent style, we require that all code
follows the PEP8 guidelines. Unit tests for all additions are required, and
are automatically run on each pull request on a variety of continuous
integration testing platforms. For new simulation features, statistical tests
must also be added, comparing the distribution of results to either analytical
results or existing simulators.
\subsection*{Computing coalescence rates and lineage location probabilities}
Many popular approaches in population genetics use the distribution of coalescence rates between
pairs of lineages in one or more populations to infer effective population sizes over time
\cite{li2011inference,sheehan2013estimating,schiffels2014inferring}
or split times
and subsequent migration rates between populations
\cite{wang2020tracking}.
Given a demographic model, the
\texttt{msprime} DemographyDebugger is able to compute coalescence rates for two or more lineages drawn
from one or more populations at specified times in the past, providing a ground truth for coalescent rate-based
inference methods. To compute coalescence rates for a set of times $Ts=(t_0, t_1, \ldots)$ in the
past, we write
\begin{lstlisting}[frame=single]
dbg = demography.debug()
rates = dbg.coalescence_rate_trajectory(Ts, num_samples)
\end{lstlisting}
where \texttt{num\_samples} is a list with length equal to the number of populations that
specifies the number of sampled lineages from each population. For example,
to get pairwise rates within the same
population, this list would contain zeros for all populations except for the population index of interest,
which would have a $2$. To get the rate of coalescence for two lineages sampled from different
populations, we would set \texttt{num\_samples} to have ones in those population indices.
More simply, \texttt{msprime} can also compute the probability of a lineage that was in a given
population at some time to be found in other populations (through continuous or mass migration events).
For a given sample time $t\geq0$ and any list of times $Ts$ in the past, we write
\begin{lstlisting}[frame=single]
prob = dbg.lineage_probabilities(Ts, sample_time=t)
\end{lstlisting}
This function is also used to determine the time intervals that lineages can be found in each population,
given a list of contemporary and/or ancient samples. For a given sampling configuration, by writing
\begin{lstlisting}[frame=single]
lineage_dict = dbg.possible_lineage_locations(samples=samples)
\end{lstlisting}
we get a dictionary with epoch intervals as keys whose values are a list that indicates True or False for
whether each population could contain ancestral material to our samples.
If no sampling configuration is provided, we assume we sample lineages from each population at time zero.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results and Discussion: For software, the authors are expected to present a
% few examples that demonstrate the use of the software and benchmarks. Ideally,
% the examples should be simple. We encourage the use of boxes with scripts
% supported by a narrative. Benchmarks carried under more realistic situations
% than the one presented in the examples are also encouraged.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Results and Discussion}
\subsection*{Discussion}
% Simulation is useful, and it's really easy for a single locus.
% This has lead to a culture of developing lots of simulators
% rather than focusing on one
Simulation has played a key role in the development of coalescent theory
and its use in understanding observed patterns of genetic variation.
The basic model can be simulated for $n$ samples in $O(n)$ time
and is straightforward to implement---for example, Hudson provided a fully
functional and highly efficient simulator as a C source code listing in an
appendix to his seminal review paper~\citep{hudson1990gene}. Because of
this ease, the dominant approach to producing simulators for extensions
to this basic model has been to develop them independently. This is
often seen as a learning opportunity, since those developing extensions
to the coalescent must have an excellent understanding of the basic
model in order to develop extensions to the coalescent.
The coalescent with recombination, however, is \emph{not} straightforward
to implement, and numerous different approaches have been
proposed over the
years~\citep{hudson1983properties,griffiths1997ancestral,wiuf1999recombination,
mcvean2005approximating}. It has only recently become clear that it is
possible to implement the model efficiently, and only through the use
of sophisticated data structures and an intricate and subtle
algorithm~\citep{kelleher2016efficient}.
This efficiency, and the ability to handle recombination in a principled
way is vital with today's vast numbers of whole genome samples.
A significant advantage to simulating the coalescent with recombination
backwards in time using Hudson's algorithm rather than
left-to-right
approaches~\citep{wiuf1999recombination,chen2009fast,staab2014scrm}
is the ease with which additional processes can be integrated.
We have seen that processes such as gene conversion, complex
demography, instantaneous bottlenecks, multiple merger caolescents
and selective sweeps have been integrated seamlessly into
the simulations. Incorporating any of these extensions into
either the exact or approximate left-to-right processes would require
complex derivations. Because we are going backwards
in time, processes can be modelled as competing exponentials
using Gillespie's algorithm \jkcomment{I think?},
and therefore in principle we can
add as many extra processes as we like.
\jkcomment{so, backwards in time
sims in msprime is extensible for the future.}
% Making lots of simulators has given us a fragmented and poort
% quality ecosystem.
The approach of developing ``in-house'' simulators
has lead to a huge proliferation of different programs, with
slightly different feature sets and often incompatible interfaces.
It is a fragmented ecosystem with significant issues with software
quality (see, e.g.,~\cite{yang2014critical});
few simulators are actively maintained after publication.
There seems to be little point
in continuing the huge duplication of effort we have seen.
It is not possible to achieve the performance levels of
\msprime\ without investing substantially in software engineering.
As demonstrated here, \msprime\ is extensible and we would argue
that developing this simulator as an open-source community asset
is of more benefit to all.
There is also the welcome development of libraries such as
Quetzal~\citep{becheler2019quetzal} and libsequence~\citep{thornton2014cpp}
which break this pattern of monolithic simulation software.
Thus, development of \msprime\ is not completed and much remains
to be done.
% We need to acknowledge that for long genomes and for large population
% sizes msprime's performance does drop off and SMC simulators like
% scrm become more efficient.
Msprime's performance for large sample sizes is excellent, but
simulating species with large population sizes is still challenging.
An important avenue for future development is to improve performance
in these scenarios by incorporating data structures such as skip lists
and implementing more efficient coalescent approximations, as done
in \texttt{cosi2}~\citep{shlyakhter2014cosi2}.
In terms of mutation simulation, future work will need to do things like
Dawg~\citep{cartwright2005dna} and to implement
codon models and indels~\citep{fletcher2009indelible}.
% Finish up with a paragraph saying that there loads of exciting
% new things that simulation can do. We need MORE sims!!!
The ability to infer trees at scale for the first time
in the presence of
recombination~\citep{harris2019database,kelleher2019inferring,
speidel2019method,tang2019genealogy}
opens the possibility for new applications of coalescent simulations.
Also \stdpopsim\ is cool~\citep{adrion2019community}
\section*{Acknowledgments}
JK is supported by the Robertson Foundation.
Jere Koskela is supported in part by EPSRC grant EP/R044732/1.
[Others fill in their ACKs please]
\bibliographystyle{plainnat}
\bibliography{paper}
\appendix
\label{app-multiple-mergers}
\section{Multiple merger coalescent model}
Multiple-merger coalescents, in which a random number of ancestral
lineages may merge at a given time in one group, and only one such
group of lineages can merge at any given time (asynchronous multiple
mergers), are referred to as $\Lambda$-coalescents; the rate at which a given group of $k$ lineages out of a total of $b$ lineages merges is given by ($a\ge 0$ constant)
\begin{equation}\label{lambdabk}
\lambda_{b, k} = \int_0^1 x^{k-2}(1-x)^{b-k}\Lambda(dx) + a\one{k=2}, \quad 2 \le k \le b,
\end{equation}
where $\Lambda$ is a finite measure on the (Borel subsets of) unit interval without an atom at zero \citep{DK99,P99,S99}. The total rate is given by
\be\label{lambdab}
\lambda_{b} = \int_0^1 \left(1 - (1-x)^{b} - bx(1-x)^{b-1} \right)x^{-2}\Lambda(dx) + \binom{b}{2},
\ee
\citep{S99} which can be useful in applications, in particular provided the antiderivative can be explicitly identified.
A larger class of multiple-merger coalescents involving simultaneous
multiple mergers of distinct groups of ancestral lineages also exists
\citep{S00}. These are commonly referred to as $\Xi$-coalescents, and
can be shown to be the limits of ancestral processes derived from
population models incorporating diploidy (or more general polyploidy)
\citep{BBE13,blath2016site}, and certain models of selection \citep{DS04}.
To describe a general $\Xi$-coalescent let $\Delta$ denote the
infinite simplex \be\label{Delta} \Delta := \{ (x_1, \ldots ): x_1 \ge
x_2 \ge \cdots \ge 0, \sum_{j}x_j \le 1\}; \ee for any
$r \in \{1,2, \ldots\}$ let $k_1, \cdots, k_r \ge 2$, and
$b = s + k_1 + \cdots + k_r$ be the total number of blocks in a given
partition ($s = b - k_1 - \cdots - k_r$ is the number of blocks not
participating in the mergers at the given time). The existence of
simultaneous multiple-merger coalescents was proved by \cite{S00}.
There exists a finite measure $\Xi$ on $\Delta$, with
$\Xi = \Xi_0 + a\delta_0$, the measure $\Xi_0$ has no atom at zero,
$a >0$ is fixed, and the rate at which one sees a given (simultaneous)
merger of ancestral lineages with merger sizes $k_1, \ldots, k_r$,
$s = n - k_1 - \cdots - k_r$, is given by
\begin{esplit}{xi}
\lambda_{n; k_1, \ldots, k_r; s} & = \int_\Delta \sum_{\ell = 0}^s \sum_{\substack {i_1, \ldots, i_{r+\ell} = 1\\ \text{all distinct}} }^\infty \binom{s}{\ell} x_{i_1}^{k_1} \cdots x
_{i_{r}}^{k_r} x_{i_{r+1}} \cdots x_{i_{r+\ell}}\left(1 - \sum_{j=1}^\infty x_j \right)^{s-\ell} \frac{1}{ \sum_{j=1}^\infty x_j^2 } \Xi_0(dx) \\
& + a\one{r=1, k_1 = 2}.
\end{esplit}%
The number of such $(k_1, \ldots, k_r)$ mergers is
\be\label{N}
\mathcal{N}(b; k_1, \ldots, k_r ) = \binom{b}{k_1 \ldots k_r\, s} \frac{1}{ \prod_{j=2}^b\ell_j! },
\ee
\citep{S00}, in particular $\mathcal{N}(b;2) = b(b-1)/2$, and one can compute the total rate of a $(k_1, \ldots, k_r)$ merger as
\be\label{lambdabkall}
\lambda(n; k_1, \ldots, k_r) = \mathcal{N}(b; k_1, \ldots, k_r ) \lambda_{n; k_1, \ldots, k_r; s}.
\ee
The total rate is given by, with
$n\ge 2$ denoting the total number of ancestral lineages,
\be
\label{Xilambdab}
\lambda_{n} = \int_\Delta \left(1 - \sum_{\ell = 0}^n \sum_{i_1 \neq
\cdots \neq i_\ell } \binom{n}{\ell} x_{i_1}\cdots x_{i_\ell}\left(1
- \sum_{i=1}^\infty x_i\right)^{n-\ell} \right)
\frac{1}{\sum_{j=1}^\infty x_j^2}\Xi_0(dx) + a\binom{n}{2} \ee
\citep{S00}. Viewing coalescent processes strictly as mathematical
objects, it is clear that the class of $\Xi$-coalescents contains
$\Lambda$-coalescents as a specific example (i.e.\ allowing at most
one group of lineages to merge each time), and the class of
$\Lambda$-coalescents contain the Kingman-coalescent as a special
case. However, viewed as limits of ancestral processes derived from
specific population models they are not nested, since one would
obtain $\Lambda$-coalescents when deriving coalescent processes from
haploid population models incorporating sweepstakes reproduction and
high fecundity, and $\Xi$-coalescents for diploid populations. One
should therefore apply the models as appropriate, i.e.\
$\Lambda$-coalescents to data (e.g.\ mtDNA data) inherited in a
haploid fashion, and $\Xi$-coalescents to e.g.\ autosomal data
inherited in diploid or polyploid fashion \citep{blath2016site}.
In \msprime\ we have incorporated two examples of multiple-merger coalescents.
One is a diploid extension \citep{BBE13} of the haploid model of sweepstakes
reproduction considered by \cite{EW06}, which is a haploid Moran model adapted
to sweepstakes reproduction. Let $N$ denote population size, and take $\psi
\in (0,1]$ to be fixed. In every generation, with probability
$1-\varepsilon_N$ a single individual (picked uniformly at random) perishes,
and one of the surviving individuals (sampled uniformly at random) produces one
offspring; with probability $\varepsilon_N$ a total of $\lfloor \psi N \rfloor$
individuals perish, and of the remaining individuals a single individual
produces $\lfloor \psi N \rfloor -1 $ offspring. Taking $\varepsilon_N =
1/N^\gamma$ for some $\gamma > 0$, \cite{EW06} obtain specific examples of
$\Lambda$-coalescents, where the $\Lambda$ measure in Eq.~\eqref{lambdabk} is a
point mass at $\psi$. The simplicity of this model does allow one to obtain
some explicit mathematical results (see e.g.\
\cite{EF2018,Matuszewski2017,Der2012,Freund2020}). The model considered by
\cite{EW06} has also been applied in algorithms for simulating gene genealogies
within phylogenies \citep{zhu2015hybrid}. The specific model incorporated into
\msprime\ is the diploid version \citep{BBE13} of the model studied by
\cite{EW06}, which would be necessary in order to incorporate recombination.
In the model considered by \cite{BBE13}, a single pair of diploid individuals
contribute offspring in each generation, selfing is excluded, and each
offspring is assigned one chromosome from each parent. There are, therefore,
four parent chromosomes involved in each reproduction event, which can lead to
up to four simultaneous mergers. Let \begin{esplit}{Cconst} C_{b; k_1, \ldots,
k_r;s } &= \frac{4}{\psi^2} \sum_{\ell = 0}^{s \wedge (4-r)} \binom{s}{\ell}
(4)_{r+\ell} (1-\psi)^{s-\ell }\left( \frac{\psi}{4} \right) ^{k_1 + \cdots +
k_r + \ell}, \\ \end{esplit} where $C_{b; k_1, \ldots, k_r;s }$ corresponds
to the coalescence rate $ \lambda_{b;k_1, \ldots, k_r;s}$ (see Eq~\eqref{xi})
of a $\Xi$-coalescent on $(\psi/4, \psi/4, \psi/4, \psi/4, 0, \ldots)$. The
total merger rate (see Eq~\eqref{lambdabkall}) is then
\be\label{xidirlambdabk} \lambda(b;k_1, \ldots, k_r) = \mathcal{N}(b; k_1,
\ldots, k_r ) \left( \frac{c\psi^2/4}{ 1 + c\psi^2/4}C_{b; k_1, \ldots, k_r;s
} + \frac{ \one{r=1, k_1 = 2} }{1 + c\psi^2/4} \right), \ee and the total
coalescence rate becomes \begin{esplit}{xilambdab} \lambda_b & =
\frac{c\psi^2/4}{1 + c\psi^2/4} \frac{4 }{\psi^2 } \left(1 - \sum_{\ell = 0
}^{b\wedge 4} \binom{b}{\ell} (4)_\ell \left( \frac{\psi}{4} \right)^\ell (1 -
\psi)^{b-\ell } \right) + \frac{1}{1 + c\psi^2/4} \binom{b}{2}. \\
\end{esplit} The interpretation of \eqref{xilambdab} is that with probability
$1/(1 + c\psi^2/4)$ a `small' reproduction event occurs, in which the parent
pair produce one diploid offspring; with probability $c\psi^2/4/(1 +
c\psi^2/4)$ a `large' reproduction event occurs, in which the parent pair
produce $\lfloor \psi N \rfloor$ offspring.
The other multiple-merger coalescent model incorporated in \msprime\ is derived
from an adaptation of the haploid population model considered by
\cite{schweinsberg03} to diploid populations \citep{BLS15}. In the haploid
version, in each generation individuals independently produce random numbers of
juveniles, where the random number of juveniles $(X)$ produced by a given
individual has the stable law \be\label{jX} \lim_{k\to \infty} C k^\alpha
\prb{X \ge k} = 1 \ee with index $\alpha > 0$, and $C > 0$ is a normalising
constant. One can interpret Eq.~\eqref{jX} as stating the form of the
probability distribution for number of juveniles at least on the order of the
population size. If the random total number of juveniles $(S_N)$ produced
in this way is at least the population size $(N)$, then one samples $N$
juveniles uniformly at random without replacement to form the next generation
of reproducing individuals; if $S_N < N$ one simply carries on with the
same set of individuals. However, assuming $\EE{X} > 1$, one can show
that $\prb{S_N < N}$ decays exponentially fast in $N$ \citep{schweinsberg03}.
If $\alpha \ge 2$ the ancestral process converges to the Kingman-coalescent; if
$1 \le \alpha < 2$ the ancestral process converges to a specific case of a
$\Lambda$-coalescent, where the $\Lambda$ measure in Eq.~\eqref{lambdabk} is
associated with the Beta$(2-\alpha, \alpha)$ probability distribution, i.e.\
\be\label{Fbeta} \Lambda(dx) = \one{0< x \le 1} \frac{1}{B(2-\alpha,\alpha)}
x^{1 - \alpha}(1-x)^{\alpha - 1} dx, \ee where $B(2-\alpha,\alpha)$ is the
beta function $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$, $a,b > 0$
\citep{schweinsberg03}.
This model has been adapted to diploid populations
by \cite{BLS15}, where the resulting coalescent process
is a four-fold $\Xi$-coalescent on $(x/4, x/4, x/4, x/4, 0, 0, \ldots)$,
where $x$ is a random variate with the Beta$(2-\alpha,\alpha)$ distribution.
The merger rate (see Eq.~\eqref{xi}) is then ($1 \le r \le 4$)
\be\label{xibeta} \lambda_{b;k_1, \ldots, k_r} = \sum_{\ell = 0}^{ (b -
k)\wedge (4-r) } \binom{b-k}{\ell} \frac{ (4)_{r+\ell} }{4^{k+\ell}}
\frac{B(k+\ell - \alpha, b-k-\ell + \alpha ) }{B(2-\alpha,\alpha)} \ee
\citep{blath2016site,BLS15}. The interpretation of Eq.~\eqref{xibeta} is that the
random number of diploid juveniles each diploid pair of parents produces
is governed by the law in Eq.~\eqref{jX}, and each diploid juvenile is
assigned one chromosome from each parent (selfing is excluded).
The model in Eq~\eqref{jX} assumes that individuals can produce arbitrarily
large numbers of juveniles. Considering diploid juveniles, this assumption is
probably rather strong, since diploid juveniles are at least fertilised eggs,
and so it is reasonable to suppose that the number of juveniles surviving to
the particular life stage we are modeling cannot be arbitrarily large. With
this in mind, we also consider an adaptation of the Schweinsberg model, where
the random number of juveniles $(X)$ produced by a given parent pair is
distributed according to \be\label{jtr} \prb{X=k} = \one{1 \le k \le \phi(N)}
\frac{\phi(N+1)^\alpha }{ \phi(N+1)^\alpha - 1 } \left( \frac{1}{k^\alpha} -
\frac{1}{(k+1)^\alpha} \right) , \ee where $\phi(N)$ is a deterministic strict
upper bound on the number of juveniles produced by any given parent pair (see
also \citep{Eldon2018}). One can follow the calculations in
\citep{schweinsberg03} or \citep{BLS15} to show that if $1 < \alpha < 2$
then $(k = k_1 + \cdots + k_r)$ then the merger rate (see Eq.~\eqref{xibeta})
is
\be \lambda_{b;k_1, \ldots, k_r} = \sum_{\ell = 0}^{ (b - k)\wedge (4-r) }
\binom{b-k}{\ell} \frac{ (4)_{r+\ell} }{4^{k+\ell}} \frac{B(M; k+\ell - \alpha,
b-k-\ell + \alpha ) }{B(M;2-\alpha,\alpha)}
\ee
with $B(z;a,b) := \int_0^z
t^{a-1}(1-t)^{b-1}dt$ for $a,b>0$ and $0< z\le 1$, and \be M := \frac{K}{K+m}
\one{\phi(N) = KN} + \one{\phi(N)/N \to \infty } \ee where $K > 0$ is a
constant and $m := \lim_{N\to \infty} \EE{X} = 1 + 2^{1-\alpha}/(\alpha - 1)$
\citep{CDEE2020,AEKKZ2020}. In other words, the measure $(\Lambda)$ driving
the multiple mergers is of the same form as in Eq.~\eqref{Fbeta} with $0 < x
\le M$ in the case $1 < \alpha < 2$ and $\phi(N) \ge KN$. If $\alpha \ge 2$
or $\phi(N)/N \to 0$ then the ancestral process converges (in the sense of
convergence of finite-dimensional distributions) to the Kingman-coalescent
\citep{CDEE2020,AEKKZ2020}.
\begin{comment}%
A coalescent process is a continuous-time Markov process taking values among
the partitions of $\IN := \{1,2, \ldots \}$, such that the restriction to any
finite $n \in \IN $ takes values among partitions of $[n] := \{1, 2, \ldots,
n\}$. Write $\one{A} = 1$ if $A$ holds, and zero otherwise. Let $\cP_n$
denote the set of partitions of $[n]$. In the classical Kingman-coalescent,
the only possible transitions are the mergers of pairs of blocks (elements of a
partition $\pi \in \cP_n$), one pair at a time. The $n$ leaves (corresponding
to the sampled DNA sequences) are arbitrarily labelled from 1 to $n$, and the
blocks of a partition represent the common ancestors of the labels of each
block. The initial state is (usually) taken as $\{ \{1\}, \ldots, \{n\}\}$,
and the final state, i.e.\ when the most recent common ancester is reached, as
$\{ [n]\}$. A block in a partition of $[n]$ represents an ancestor of the
leaves in the block, i.e.\ the block $\{i_1, \ldots, i_k\}$ in a given
partition of $[n]$ is an ancestor of the $k$ leaves $i_1, \ldots, i_k \in [n]$,
and the leaves correspond to arbitrarily labelled DNA sequences in the sample.
\end{comment}