Can songbird be used in a Jupyter notebook without command line wrappers? #155
Great - Songbird should be able to deal with large datasets without a problem.
Yes, statsmodels is in the backend, so all of the formula formatting applies here.
Thanks @mortonjt, I'll give this a try. Sorry, that was poor phrasing on my end. I didn't mean "large dataset"; I meant a bunch of different versions of my features that I wanted to run DGE analysis on. It's a metatranscriptomics dataset, so there are a lot of hierarchical features I can collapse. I've installed the standalone version in a separate environment (I'm using Python 3.8 in my main environment). I will try out that version if I'm unable to get the pieces in the Jupyter notebook working properly. All implementations require tensorflow and there are no purely statsmodels or scikit-learn implementations, right?
Right, everything requires tensorflow atm, but we are currently refactoring everything to use Stan in Birdman.
I've gotten songbird to run (had the ModuleNotFound error but realized I needed to install tensorflow-estimator==1.15.1). I have a few questions regarding running and results:
For reference, here was my command:
songbird multinomial \
--input-biom ../../counts/metaT/featurecounts.no_multimapping.formatted.biom \
--metadata-file ../../metadata/caries.metaT.metadata.lite.tsv \
--formula "C(CariesPhenotype, Treatment('Caries-free'))" \
--epochs 10000 \
--differential-prior 0.5 \
--summary-interval 1 \
--summary-dir output
Does this fit seem standard?
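For readers unsure what the formula in this command encodes: `C(CariesPhenotype, Treatment('Caries-free'))` requests treatment (dummy) coding with `Caries-free` as the reference level. Below is a minimal plain-Python sketch of what that coding amounts to. The helper `treatment_code`, the non-reference level name `Caries`, and the sample values are made up for illustration, and this is not how the formula backend or Songbird implements it internally.

```python
def treatment_code(values, reference):
    """Dummy-code a categorical column against a reference level.

    Returns (column_names, rows): an intercept column plus one 0/1
    indicator column for every non-reference level.
    """
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in data")
    others = [lvl for lvl in levels if lvl != reference]
    # Illustrative column labels only; real tools use their own naming.
    names = ["Intercept"] + [f"[T.{lvl}]" for lvl in others]
    rows = [[1] + [int(v == lvl) for lvl in others] for v in values]
    return names, rows


# Hypothetical metadata column; only 'Caries-free' comes from the thread.
phenotype = ["Caries-free", "Caries", "Caries", "Caries-free"]
names, rows = treatment_code(phenotype, reference="Caries-free")
print(names)
for row in rows:
    print(row)
```

Under this coding, reference-level samples have zeros in every indicator column, so the coefficient fitted for an indicator column is read relative to the reference level.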
Ok that makes sense. I appreciate the clarification, I must have been confusing it w/ logistic regression. I've read through the README but I'm still a bit confused.
Here's the most relevant info I found regarding differentials:
From my understanding, the Q2 metric is for the entire model as a whole and not the individual features. Is there a way to assess statistical significance for features or am I making an incorrect assumption about
In the example, you have your model:
songbird multinomial \
--input-biom data/redsea/redsea.biom \
--metadata-file data/redsea/redsea_metadata.txt \
--formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
--epochs 10000 \
--differential-prior 0.5 \
--training-column Testing \
--summary-interval 1 \
--summary-dir results
and your null model:
songbird multinomial \
--input-biom data/redsea/redsea.biom \
--metadata-file data/redsea/redsea_metadata.txt \
--formula "1" \
--epochs 10000 \
--differential-prior 0.5 \
--training-column Testing \
--summary-interval 1 \
--summary-dir results
The only parameter change is the formula specifying the null model. However, when I ran this on mine it appears to have overwritten my
Thanks again for your help. A colleague has been talking about
Right ... FDR correction is quite flawed if you don't have absolute abundances. If you don't have absolute abundances, it isn't possible to do any meaningful statistical significance testing on a per-feature level. See our paper on this: https://www.nature.com/articles/s41467-019-10656-5
I don't completely understand question 2a. What do you mean the parameters should be the same? Regarding your last question, yes, save them to separate directories.
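The advice to keep the model and the null model in separate directories can be sketched in a notebook as follows. This only builds the two argument lists and does not run anything. It uses only the flags that appear in the commands quoted in this thread; the helper name `songbird_command` and the `results/model` and `results/null` layout are assumptions made for illustration.

```python
import shlex
from pathlib import Path


def songbird_command(formula, summary_dir, biom, metadata,
                     epochs=10000, differential_prior=0.5,
                     summary_interval=1, training_column=None):
    """Build the argument list for one `songbird multinomial` run."""
    args = [
        "songbird", "multinomial",
        "--input-biom", str(biom),
        "--metadata-file", str(metadata),
        "--formula", formula,
        "--epochs", str(epochs),
        "--differential-prior", str(differential_prior),
    ]
    if training_column is not None:
        args += ["--training-column", training_column]
    args += ["--summary-interval", str(summary_interval),
             "--summary-dir", str(summary_dir)]
    return args


# Settings shared by both runs, taken from the example quoted above.
shared = dict(
    biom="data/redsea/redsea.biom",
    metadata="data/redsea/redsea_metadata.txt",
    training_column="Testing",
)
out_root = Path("results")  # hypothetical directory layout

model_cmd = songbird_command(
    "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate",
    out_root / "model", **shared)
null_cmd = songbird_command("1", out_root / "null", **shared)

# Print shell-quoted commands for inspection before running them.
for cmd in (model_cmd, null_cmd):
    print(shlex.join(cmd))
```

Because both lists come from the same helper, they differ only in the formula and the summary directory, so the null run cannot overwrite the model run. Either list could then be passed to `subprocess.run(cmd, check=True)`.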
That's a great point about abundances and differentials. Going to read your songbird and qurro papers to get more familiar with reference frames this week. I encounter differentials in a lot of my datasets so a method to properly analyze them would be really useful.
Very much looking forward to BIRDMAn. It's always nice seeing some properly implemented Bayesian methodologies applied to problems I'm actually interested in instead of radon contamination. It helps me understand the underlying stats much better.
I worded this poorly. I meant to ask if parameters such as
Got it - yes it is preferable if
I'll let @mortonjt answer your first 3-4 questions, but I think I can help out with questions 5 and 6 (and maybe a bit with questions 2 and 4):
The differentials are from Songbird, yes. The regression model run using Songbird was comparing each of the fish body sites with sea water samples (with the goal of identifying features that did a good job distinguishing samples from each body site from a sea water sample). Section 1 of the supplementary material (should be freely available online, let me know if you can't access it and I can send you a PDF) of the paper goes into more detail about this. If you're interested in seeing how we ran Songbird to generate these differentials, this notebook demonstrates that (please note that this analysis was done before we had the validation-against-a-null-model stuff added to the Songbird README, so this doesn't account for that).
So, to answer your question: I guess you could think of the y-axis label in this plot, instead of
The Numerator and Denominator legends in this figure apply to another log-ratio. This log-ratio is computed using a few of the highest and lowest ranked features, going by the rankings of the differentials (this is the general workflow proposed in the Morton/Marotz paper, which the Qurro paper provides one method for doing). In Figure 2(a) of the paper, which you showed here, the numerator of this log-ratio was all features annotated as Shewanella (which, from manual inspection, we observed were mostly highly-ranked for the gill differentials), and the denominator was the 98 lowest-ranked features in the gill differentials.
So, for each sample represented in this plot in Figure 2(b) of the Qurro paper: ... the y-axis could be rewritten as
The general takeaway is that, using Songbird and Qurro, we can narrow things down to a log-ratio of these two sets of features that are demonstrably good at separating gill samples from seawater samples. This is certainly not the only log-ratio that could do that, though!
This is the same idea as the stuff described above -- these features were used as the "reference" after computing the differentials, in computing the log-ratios of features based on their rankings in the differentials. I don't think the paper's mention of these as a reference was in relation to the ALR procedure Songbird uses internally.
One nice method we added to Qurro is called "autoselection", where you just take the log-ratio of the top N / bottom N (or bottom N / top N, if you prefer) ranked features based on the Songbird differentials. This is a similar idea as to the denominator in Figure 2 of the Qurro paper, which is just an aggregate of "all of the bottom-ranked features for these differentials".
When we apply this to the tooth-brushing data used in the Morton/Marotz paper (taking the log-ratio of the bottom 5% to top 5% of features -- we say ... Which very clearly distinguishes the samples based on before / after brushing.
This is often a good place to get started -- if the resulting log-ratio gives you a clear distinction between metadata categories (and it often does if Songbird was able to find some sort of signal based on your formula), it can be worth investigating what specific features within these groups of features are causing this separation / if any of these features have a plausible biological reason for causing this separation / etc. For example, one of the Haemophilus ASVs is included in the yellow group of features (i.e. it's one of the top 5% highest ranked features in Songbird's differentials).
We could move on from autoselection to creating log-ratios of features using more targeted filtering -- for example, keeping the numerator as an aggregate of the lowest-ranked features (say, those with differentials less than
We could stop here, or we could try to filter the numerator to a single taxonomic group of features as well. (Actinomyces isn't included in the numerator features shown here -- its differential is slightly too high for that -- but perhaps we could select it anyway based on prior knowledge about the oral microbiome for what we'd expect to be a particularly "stable" microbe.)
So, to sum up: to an extent it's possible to identify these trends just from the data, but there are of course lots and lots of "degrees of freedom" here. Ideally, the whole manual inspection thing roughly described here would be replaced with a formal procedure describing how exactly to test for discriminatory log-ratios (removing the obvious temptation of trying a hundred different log-ratios until something separates the data). It should be possible to describe an algorithm that looks at these differentials and the log-ratios of features in a more automated way that is less prone to human bias (the ideas you propose with clustering / looking at null models are interesting!), but to my knowledge that exact sort of method doesn't exist (yet).
That said, there have been some recent interesting papers that try to select useful log-ratios for distinguishing between groups of samples (albeit without running Songbird first): CoDaCoRe and selbal both aim to solve this general problem. Hope this helps clarify things!
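The "top N / bottom N" idea described above can be sketched in plain Python. This is a rough illustration and not Qurro's actual code: it assumes the aggregate on each side is a plain sum of counts, that the log is natural, and that samples with a zero on either side are skipped. The function name and the toy data are made up.

```python
import math


def autoselect_log_ratio(differentials, sample_counts, n):
    """Log-ratio of the n highest-ranked to the n lowest-ranked features.

    differentials: dict mapping feature -> differential
    sample_counts: dict mapping sample -> {feature: count}
    Returns a dict mapping sample -> log-ratio, or None when either
    side sums to zero (the log-ratio is undefined there).
    """
    if n < 1 or 2 * n > len(differentials):
        raise ValueError("n must be >= 1 and leave top/bottom sets disjoint")
    ranked = sorted(differentials, key=differentials.get)  # ascending
    bottom, top = ranked[:n], ranked[-n:]
    ratios = {}
    for sample, counts in sample_counts.items():
        num = sum(counts.get(f, 0) for f in top)
        den = sum(counts.get(f, 0) for f in bottom)
        ratios[sample] = math.log(num / den) if num > 0 and den > 0 else None
    return ratios


# Toy data, invented for illustration.
differentials = {"f1": -2.0, "f2": -0.5, "f3": 0.1, "f4": 1.5, "f5": 3.0}
sample_counts = {
    "before": {"f1": 10, "f2": 30, "f3": 7, "f4": 5, "f5": 5},
    "after": {"f1": 2, "f2": 3, "f4": 20, "f5": 30},
    "sparse": {"f3": 4},
}
ratios = autoselect_log_ratio(differentials, sample_counts, n=2)
for sample, value in ratios.items():
    print(sample, value)
```

Swapping `top` and `bottom` gives the bottom N / top N variant; it only flips the sign of each log-ratio.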
I'd like to try out songbird but I have a lot of data to prototype.
Creating intermediate files really puts a bottleneck in my workflow, and I was wondering if I would be able to do the following:
Also, I noticed you have some statsmodels tutorials. Does that mean there is a version of songbird using statsmodels as the backend?