Questions about preprocessing the data and velocity_genes #808

huizizhang949 · 2022-01-07T11:40:33Z


Hello,

I have some questions about preprocessing the data and velocity_genes in the output.

In the paper, it says 'the count matrices are size normalized to the median of total molecules across cells'. Could you please explain what 'normalized to the median of total molecules across cells' means?

For the actual implementation in the package, one of the tutorials (https://scvelo.readthedocs.io/VelocityBasics/) just mentions 'normalizing every cell by its total size and logarithmizing X', which, to my understanding, only requires dividing the spliced (or unspliced) counts for each cell by the total spliced (or unspliced) counts in that cell (and hence does not involve the median). Is my understanding correct?
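To make the two readings concrete, here is a minimal NumPy sketch of what I mean. The function names `normalize_by_total` and `normalize_to_median` and the toy matrix are my own, and I am only assuming this is what the paper and tutorial describe; it is not taken from the scvelo source.

```python
import numpy as np

def normalize_by_total(counts):
    """Reading 1: divide each cell (row) by its own total count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Cells with zero total are left as zeros instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def normalize_to_median(counts):
    """Reading 2: rescale each cell so its total equals the median total across cells."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    # Assumption: the median is taken over cells with a non-zero total.
    target = np.median(totals[totals > 0])
    return normalize_by_total(counts) * target

# Toy spliced matrix: 3 cells x 2 genes, with totals 4, 8 and 40.
toy = np.array([[1, 3], [2, 6], [10, 30]])
by_total = normalize_by_total(toy)    # every row sums to 1
to_median = normalize_to_median(toy)  # every row sums to the median total, 8
```

If this is right, the two readings differ only by one constant factor (the median total), which would explain why the tutorial does not mention the median explicitly.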

Moreover, the tutorial also mentions 'Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched'. My question is this: here X is the spliced count matrix and has been normalized and logarithmized, while the unspliced counts are only normalized. When fitting the dynamical model (equation (4) in the paper), are we then using normalized and logarithmized spliced values for s(t) but only normalized unspliced values for u(t)? That would differ from the paper (the parameter inference section just says 'u^{obs} and s^{obs} are size-normalized unspliced and spliced counts'). So which transformed values are actually used as s(t) and u(t) in equation (4) when fitting the model in the Python package?

In addition, the paper also mentions 'u^{obs} is rescaled to have the same variance as s^{obs}'. Is this applied in scv.pp.filter_and_normalize? How do you rescale to ensure the same variance?
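My guess at what the rescaling could look like is sketched below: multiply the unspliced values of each gene by the ratio of standard deviations. The helper `rescale_to_match_variance` and the toy arrays are hypothetical, and I do not know whether the package does it this way or at which step.

```python
import numpy as np

def rescale_to_match_variance(u, s):
    """Scale each gene (column) of u so that its variance equals that of s."""
    u = np.asarray(u, dtype=float)
    s = np.asarray(s, dtype=float)
    std_u = u.std(axis=0)
    std_s = s.std(axis=0)
    # Genes with constant u cannot be rescaled; their factor is left at 1.
    factor = np.divide(std_s, std_u, out=np.ones_like(std_u), where=std_u > 0)
    return u * factor, factor

# Toy data: 3 cells x 2 genes; the second gene has no unspliced counts.
u_obs = np.array([[1, 0], [3, 0], [5, 0]])
s_obs = np.array([[2, 1], [6, 2], [10, 3]])
u_rescaled, factor = rescale_to_match_variance(u_obs, s_obs)
```

In this toy case the first gene gets a factor of 2, after which its unspliced and spliced variances agree.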

As for filtering genes in scv.pp.filter_and_normalize, what is the definition/formula of 'dispersion' used in choosing 'n_top_genes'?
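For reference, the definition I would naively expect is the variance-to-mean ratio per gene, with the `n_top_genes` largest values kept. The sketch below implements only that simple version. The function `top_genes_by_dispersion` is hypothetical, and the package may well use a normalized or binned variant instead.

```python
import numpy as np

def top_genes_by_dispersion(X, n_top_genes):
    """Rank genes (columns) by variance / mean and return the top indices."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    # Genes with zero mean get dispersion 0 instead of a division by zero.
    dispersion = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)
    top = np.argsort(dispersion)[::-1][:n_top_genes]
    return dispersion, np.sort(top)

# Toy matrix: 3 cells x 3 genes (constant, highly variable, mildly variable).
X_toy = np.array([[1, 0, 5], [1, 10, 5], [1, 20, 6]])
dispersion, top = top_genes_by_dispersion(X_toy, n_top_genes=2)
```

Here the constant first gene has dispersion 0 and is dropped, while the second and third genes are kept.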

In the discussion in #686, you mention that 'parameter inference is only run on variables marked as velocity_genes', but in the tutorial https://scvelo.readthedocs.io/DynamicalModeling/, I notice that there are estimated parameter values even for genes with velocity_genes=False (please see the attached figure, gene Sntg1). Apart from that, why are there NaN values, and what can lead to NaN values in general?

Lastly, what do the 'highly_variable_genes' and 'highly_variable' columns mean, and how do they differ? Thank you!

I am sorry for the long questions.
