Ploidy

nQuire: a statistical framework for ploidy estimation using next generation sequencing

Ploidy

Ploidy has traditionally been investigated by measuring DNA content using flow cytometry.
It can also be inferred from next-generation sequencing (NGS) data, either by examining k-mer distributions or by assessing the distribution of allele frequencies at biallelic single nucleotide polymorphisms (SNPs).

NGS

Disadvantages:
- It does not provide summary statistics that quantify how well the data fit the expected distributions.
- It is preceded by the identification of variable sites (“SNP calling”), which is carried out using methodologies that benefit from a previously known ploidy level.

This method was primarily developed for resequencing studies.

nQuire

nQuire models the distribution of base frequencies at variable sites as a Gaussian Mixture Model (GMM) and uses maximum likelihood to assess the empirical data under the assumptions of diploidy, triploidy and tetraploidy, selecting the most plausible ploidy model.

$$\log L = \sum_{i=1}^{n} \log \sum_{j=1}^{3} \alpha_j \, N(x_i \mid \mu_j, \sigma_j)$$

where the mixture proportions satisfy $\sum_{j=1}^{3} \alpha_j = 1$.
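The log-likelihood above can be evaluated directly. The following is a minimal sketch in plain Python, not nQuire's own implementation; the function names and the five example frequencies are hypothetical, and $σ_j$ is treated as a standard deviation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_log_likelihood(xs, alphas, mus, sigmas):
    """log L = sum_i log sum_j alpha_j * N(x_i | mu_j, sigma_j)."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")
    total = 0.0
    for x in xs:
        mixture = sum(a * normal_pdf(x, m, s)
                      for a, m, s in zip(alphas, mus, sigmas))
        total += math.log(mixture)
    return total

# Hypothetical base frequencies at five variable sites (illustration only).
xs = [0.24, 0.26, 0.49, 0.52, 0.74]
log_l = gmm_log_likelihood(xs, [1/3, 1/3, 1/3], [0.25, 0.5, 0.75], [0.05, 0.05, 0.05])
```

With a single component and proportion 1, the function reduces to the sum of log normal densities.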

Expectation-Maximization (EM) algorithm

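$$P(Z_i = j \mid x_i) = \frac{\alpha_j N(x_i \mid \mu_j, \sigma_j)}{\sum_{k=1}^{3} \alpha_k N(x_i \mid \mu_k, \sigma_k)} = \gamma_{Z_i}(j)$$

Here $Z_i$ is the latent variable indicating which mixture component generated observation $x_i$, and $\gamma_{Z_i}(j)$ is the posterior probability that it was component $j$. Computing these probabilities is the E-step.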
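The posterior probabilities can be computed per observation by normalizing the weighted component densities. This is a sketch under stated assumptions: `e_step` and `normal_pdf` are hypothetical helper names, and the input values are made up for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def e_step(xs, alphas, mus, sigmas):
    """Return gammas[i][j] = P(Z_i = j | x_i) for every observation x_i."""
    gammas = []
    for x in xs:
        weighted = [a * normal_pdf(x, m, s)
                    for a, m, s in zip(alphas, mus, sigmas)]
        norm = sum(weighted)  # denominator: sum over all components
        gammas.append([w / norm for w in weighted])
    return gammas

# Two hypothetical observations under a three-component mixture.
gammas = e_step([0.25, 0.5], [1/3, 1/3, 1/3], [0.25, 0.5, 0.75], [0.05, 0.05, 0.05])
```

Each row of `gammas` sums to 1. An observation far from every component can make the denominator underflow to zero, which this sketch does not guard against.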

M-step

$$S_j = \sum_{i=1}^{n} \gamma_{Z_i}(j)$$

$$\hat{\mu}_j = \frac{1}{S_j} \sum_{i=1}^{n} \gamma_{Z_i}(j) \, x_i$$

$$\hat{\sigma}_j^2 = \frac{1}{S_j} \sum_{i=1}^{n} \gamma_{Z_i}(j) \, (x_i - \mu_j)^2$$

$$\hat{\alpha}_j = \frac{S_j}{n}$$
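The update formulas translate into a few weighted sums. The sketch below is illustrative only; `m_step` is a hypothetical name, and it plugs the freshly updated mean into the variance update, which is the usual EM choice and an assumption here, since the formula above writes $μ_j$.

```python
import math

def m_step(xs, gammas):
    """Re-estimate proportions, means and standard deviations from responsibilities.

    gammas[i][j] is the posterior probability that x_i belongs to component j.
    """
    n = len(xs)
    n_components = len(gammas[0])
    alphas, mus, sigmas = [], [], []
    for j in range(n_components):
        s_j = sum(g[j] for g in gammas)                      # S_j
        mu_j = sum(g[j] * x for g, x in zip(gammas, xs)) / s_j
        var_j = sum(g[j] * (x - mu_j) ** 2
                    for g, x in zip(gammas, xs)) / s_j       # uses the updated mean
        alphas.append(s_j / n)
        mus.append(mu_j)
        sigmas.append(math.sqrt(var_j))
    return alphas, mus, sigmas

# Hypothetical hard assignments: first two points to component 0, last two to component 1.
xs = [0.2, 0.3, 0.7, 0.8]
gammas = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
alphas, mus, sigmas = m_step(xs, gammas)
```

With these hard assignments the updates reduce to per-group proportions, means and standard deviations.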

The log-likelihood is calculated after each M-step, and the next E-step is initiated unless the log-likelihood has changed by less than $\epsilon = 0.01$ since the previous M-step.
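The stopping rule can be sketched as a loop that alternates the two steps and records the log-likelihood after every M-step. This is a minimal illustration, not nQuire's code: `em_fit`, the `max_iter` cap, the starting values and the simulated data are all assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(xs, alphas, mus, sigmas):
    return sum(math.log(sum(a * normal_pdf(x, m, s)
                            for a, m, s in zip(alphas, mus, sigmas)))
               for x in xs)

def em_fit(xs, alphas, mus, sigmas, eps=0.01, max_iter=1000):
    """Alternate E- and M-steps until log L changes by less than eps."""
    history = [log_likelihood(xs, alphas, mus, sigmas)]
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each x_i.
        gammas = []
        for x in xs:
            w = [a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)]
            total = sum(w)
            gammas.append([v / total for v in w])
        # M-step: weighted re-estimation of all parameters.
        new_alphas, new_mus, new_sigmas = [], [], []
        for j in range(len(alphas)):
            s_j = sum(g[j] for g in gammas)
            mu_j = sum(g[j] * x for g, x in zip(gammas, xs)) / s_j
            var_j = sum(g[j] * (x - mu_j) ** 2 for g, x in zip(gammas, xs)) / s_j
            new_alphas.append(s_j / len(xs))
            new_mus.append(mu_j)
            new_sigmas.append(math.sqrt(var_j))
        alphas, mus, sigmas = new_alphas, new_mus, new_sigmas
        # Convergence check on the change in log-likelihood.
        history.append(log_likelihood(xs, alphas, mus, sigmas))
        if abs(history[-1] - history[-2]) < eps:
            break
    return alphas, mus, sigmas, history

# Simulated base frequencies around 1/3 and 2/3 (illustration only).
rng = random.Random(0)
xs = ([rng.gauss(1/3, 0.04) for _ in range(300)]
      + [rng.gauss(2/3, 0.04) for _ in range(300)])
alphas, mus, sigmas, history = em_fit(xs, [0.5, 0.5], [0.3, 0.7], [0.1, 0.1])
```

The `history` list makes it easy to check that the log-likelihood does not decrease between iterations, which is the expected behavior of EM.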

E-step

The log-likelihoods of the three fixed models, in which the means and mixture proportions are held at their expected values, are:

$$\log L_{\text{diploid}} = \sum_{i=1}^{n} \log N(x_i; 0.5, \sigma)$$

$$\log L_{\text{triploid}} = \sum_{i=1}^{n} \log \sum_{j=1}^{2} 0.5 \cdot N(x_i; \mu_j, \sigma_j), \quad \mu_j \in \{0.33, 0.67\}$$

$$\log L_{\text{tetraploid}} = \sum_{i=1}^{n} \log \sum_{j=1}^{3} 0.33 \cdot N(x_i; \mu_j, \sigma_j), \quad \mu_j \in \{0.25, 0.5, 0.75\}$$
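The fixed-model likelihoods differ only in their component means and proportions, so they can share one function. In this sketch `FIXED_MODELS` and `fixed_log_likelihood` are hypothetical names, a single standard deviation `sigma` is passed in and shared by all components (a simplifying assumption), and the tetraploid proportions use 1/3 where the formula above rounds to 0.33.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# (proportions, means) for each fixed model; means as written in the notes.
FIXED_MODELS = {
    "diploid": ([1.0], [0.5]),
    "triploid": ([0.5, 0.5], [0.33, 0.67]),
    "tetraploid": ([1/3, 1/3, 1/3], [0.25, 0.5, 0.75]),
}

def fixed_log_likelihood(xs, ploidy, sigma):
    """Log-likelihood of xs under a fixed model with a shared standard deviation."""
    alphas, mus = FIXED_MODELS[ploidy]
    return sum(math.log(sum(a * normal_pdf(x, m, sigma)
                            for a, m in zip(alphas, mus)))
               for x in xs)

# Hypothetical base frequencies clustered around 0.5.
xs = [0.48, 0.5, 0.52]
scores = {name: fixed_log_likelihood(xs, name, 0.05) for name in FIXED_MODELS}
```

For frequencies clustered around 0.5, the diploid model scores highest, the tetraploid model next (it has one component at 0.5), and the triploid model lowest.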

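Each fixed model is then compared with the free model, in which all parameters are estimated by EM:

$$\Delta \log L_{\text{diploid}} = \log L_{\text{free}} - \log L_{\text{diploid}}$$

$$\Delta \log L_{\text{triploid}} = \log L_{\text{free}} - \log L_{\text{triploid}}$$

$$\Delta \log L_{\text{tetraploid}} = \log L_{\text{free}} - \log L_{\text{tetraploid}}$$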
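The three differences can be computed and compared in a couple of lines. The sketch below assumes that the fixed model with the smallest difference, meaning the one closest to the free model, is taken as the most plausible ploidy; the function names and the log-likelihood values are hypothetical.

```python
def delta_log_likelihoods(log_l_free, log_l_fixed):
    """Return log L_free - log L_fixed for every fixed model."""
    return {name: log_l_free - value for name, value in log_l_fixed.items()}

def best_ploidy(deltas):
    """Pick the fixed model whose log-likelihood is closest to the free model."""
    return min(deltas, key=deltas.get)

# Hypothetical log-likelihood values, for illustration only.
deltas = delta_log_likelihoods(
    1250.0,
    {"diploid": 1240.0, "triploid": 900.0, "tetraploid": 1100.0},
)
choice = best_ploidy(deltas)
```

Here the diploid model has the smallest difference, so it would be selected under the stated assumption.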

GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes

AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data

ConPADE: Genome Assembly Ploidy Estimation from Next-Generation Sequencing Data
