Asteroid on non-randomly missing data #5

LPDagallier · 2023-06-13T21:36:38Z

Hi Benoit,

Thanks for Asteroid, it looks like a very promising tool!
This is not an issue with the program, but more of a question.
From what I understand of the paper, Asteroid performs well with a high proportion of missing data when the data are missing because of a stochastic process of data deletion (in the case of simulated datasets) or data absence (in the case of empirical datasets).

Do you have any idea how Asteroid performs when data are missing non-randomly?
For example, when a dataset combines a few species represented by many genes (e.g. a phylogenomic dataset) with many species represented by only a few genes (e.g. Sanger sequencing/barcode data) (see e.g. https://doi.org/10.1093/molbev/msad109).

Did you try to simulate missing data in a non-random manner?

I'm curious to know whether Asteroid would perform similarly well with high levels of non-random missing data.

Thanks,
Léo-Paul

BenoitMorel · 2023-06-14T08:08:22Z

Hi Léo-Paul,

That's a very good question. Although we don't specifically address this case in our paper, it is actually a very good example of a case where Asteroid should work quite well compared with tools that suffer from systematic biases, such as Astrid. I would expect that as missing data become more systematic, tools such as Astrid get even worse, while Asteroid should not be affected much: the lack of data (= less information) will always be a problem for any tool, but there should not be any systematic bias due to missing data with Asteroid.

Have you tried running it on such a dataset? I am always happy to know whether our approaches work well (or not :)) on empirical datasets.

I hope this helps,
Benoit


LPDagallier · 2023-06-14T19:37:51Z

Hi Benoit,
OK, I see. Yes, Asteroid should indeed be less affected than other tools.
I haven't tried it on such a dataset yet, but I'm planning to in the coming months. I will let you know how it goes :)
Cheers,
Léo-Paul
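
For anyone who wants to test this before an empirical dataset is available, here is a minimal sketch of how the non-random pattern described above (a few species with many genes, many species with a few marker genes) could be simulated as a presence/absence map. It is not taken from the Asteroid paper or code base: the function names, species labels, and default parameters are all hypothetical, and it only decides which species are present in which gene. Pruning the actual gene trees or alignments accordingly would still have to be done with a tree library of your choice.

```python
import random


def simulate_nonrandom_presence(n_genomic=5, n_barcode=45, n_genes=100,
                                n_markers=3, p_marker=0.8, seed=42):
    """Return {species: set of gene indices} with structured missing data.

    Hypothetical setup: 'genomic' species are present in every gene,
    'barcode' species are present only in a small fixed set of marker genes.
    """
    rng = random.Random(seed)
    genes = list(range(n_genes))
    markers = genes[:n_markers]  # the few loci targeted by Sanger/barcoding

    presence = {}
    for i in range(n_genomic):
        # Phylogenomic species: sampled for all genes.
        presence[f"genomic_{i}"] = set(genes)
    for i in range(n_barcode):
        # Barcode species: each marker is sequenced with probability p_marker.
        sampled = {g for g in markers if rng.random() < p_marker}
        if not sampled:
            # Guarantee at least one marker so the species is not empty.
            sampled = {rng.choice(markers)}
        presence[f"barcode_{i}"] = sampled
    return presence


def missing_fraction(presence, n_genes):
    """Fraction of empty cells in the species-by-gene matrix."""
    total = len(presence) * n_genes
    present = sum(len(gene_set) for gene_set in presence.values())
    return 1 - present / total


def taxa_per_gene(presence, n_genes):
    """For each gene index, the sorted list of species that have it."""
    return [sorted(sp for sp, gene_set in presence.items() if g in gene_set)
            for g in range(n_genes)]


if __name__ == "__main__":
    presence = simulate_nonrandom_presence()
    print(f"missing fraction: {missing_fraction(presence, 100):.2f}")
    per_gene = taxa_per_gene(presence, 100)
    print(f"taxa in marker gene 0: {len(per_gene[0])}")
    print(f"taxa in non-marker gene 50: {len(per_gene[50])}")
```

The per-gene taxon lists returned by `taxa_per_gene` could then be used to restrict each simulated gene tree to the species that are "sequenced" for that gene, which gives a strongly structured missing-data pattern instead of the uniform random deletion used in most simulations.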
