Memory errors for large datasets #141
Comments
On Wed, Jun 2, 2021 at 9:08 PM Marcus Fedarko wrote:
One relatively straightforward thing that might help with this (for the QIIME 2 version, at least) is adding an extra command or parameter* that disables the construction of the biplot. I was running Songbird on a large-ish dataset (~60k features, ~100 samples: this matrix <https://github.com/fedarko/283-project/blob/main/data/GSE131512_cancerTPM.txt>), where the call to np.linalg.svd(differentials) here <https://github.com/biocore/songbird/blob/61a4ca5c8ceb6400bc756ba38cbd74824ac0d277/songbird/q2/_method.py#L104> caused an error about there not being enough memory to allocate for the array (it was something like 16 GB of memory that was needed? This was on my laptop). I commented out the biplot code so that this line <https://github.com/biocore/songbird/blob/61a4ca5c8ceb6400bc756ba38cbd74824ac0d277/songbird/q2/_method.py#L120> was always used to create an empty biplot, and then Songbird seemed to work without a problem.
With the advent of BIRDMAn this may not be an urgent issue, though.
* This might need to be a command, I guess, since I don't think QIIME 2 currently has ways of varying the number of outputs. Or it could just be a parameter along the lines of "if you specify this, an empty biplot will be generated".
|
On Wed, Jun 2, 2021 at 9:32 PM Jamie Morton wrote:
Yeah, SVD on 60k features is insane. I'd recommend looking into dask for this sort of thing, or randomized PCA:
https://examples.dask.org/machine-learning/svd.html
https://scikit-learn.org/0.15/modules/generated/sklearn.decomposition.RandomizedPCA.html
I guess it also depends on what you are trying to accomplish. With 60k features >> 100 samples, chances are your system is underdetermined.
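For context on the numbers in this exchange: NumPy's `np.linalg.svd` defaults to `full_matrices=True`, so for an input with n rows it returns a square n x n `U`. The sketch below is not Songbird's code. It assumes the differentials form a tall dense matrix (features x a handful of columns; the 2,000 x 6 shape is made up) and shows a thin SVD via `full_matrices=False`, which is a simpler alternative to the dask or randomized approaches suggested above, not an implementation of them. The function names `full_u_bytes` and `thin_svd` are hypothetical.

```python
import numpy as np


def full_u_bytes(n_rows, itemsize=8):
    """Bytes needed for the square (n_rows x n_rows) U of a full SVD."""
    return n_rows * n_rows * itemsize


def thin_svd(matrix):
    """Reduced SVD: for an (n, k) input with n > k, U is (n, k), not (n, n)."""
    return np.linalg.svd(matrix, full_matrices=False)


# Hypothetical stand-in for a differentials matrix: 2,000 features x 6 columns.
rng = np.random.default_rng(0)
differentials = rng.normal(size=(2000, 6))

u, s, vt = thin_svd(differentials)
print(u.shape, s.shape, vt.shape)  # (2000, 6) (6,) (6, 6)

# The thin factors still reconstruct the input exactly (up to rounding).
assert np.allclose((u * s) @ vt, differentials)

# Size of the full U alone for 60,000 rows of float64 values.
print(full_u_bytes(60_000) / 1e9, "GB")  # 28.8 GB
```

Whether the thin factors are enough for the biplot depends on how the downstream code uses `U`, which this thread does not spell out.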
|
On Wed, Jun 2, 2021 at 11:04 PM Marcus Fedarko wrote:
Fair, thanks. I didn't really need a biplot (I just wanted the differentials), but those options could be useful if people start needing biplots from that sort of data.
You're right, it's a lot of features... mostly for a proof-of-concept analysis. I imagine this is a pretty niche problem for people to run into in practice (hopefully ;).
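The "parameter that yields an empty biplot" idea from earlier in the thread could look roughly like the sketch below. Everything here is hypothetical: `Biplot` is a plain stand-in for whatever ordination object the plugin really returns, and `build_biplot` and `skip_biplot` are invented names. The point is only that the function always returns the same kind of output, so the number of outputs never varies.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Biplot:
    """Hypothetical stand-in for an ordination result."""
    singular_values: np.ndarray
    feature_coords: np.ndarray


def build_biplot(differentials, skip_biplot=False):
    """Return a Biplot; an empty placeholder when skip_biplot is True."""
    if skip_biplot:
        # No SVD is attempted, so no large intermediate arrays are allocated.
        return Biplot(singular_values=np.zeros(0), feature_coords=np.zeros((0, 0)))
    matrix = np.asarray(differentials, dtype=float)
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return Biplot(singular_values=s, feature_coords=u * s)


rng = np.random.default_rng(1)
diffs = rng.normal(size=(50, 3))  # made-up differentials: 50 features x 3 columns

print(build_biplot(diffs, skip_biplot=True).feature_coords.shape)  # (0, 0)
print(build_biplot(diffs).feature_coords.shape)                    # (50, 3)
```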
|
Got it. Note that this only applies to the q2 plugin; the standalone version doesn't have this.
|
For datasets with >10k samples, the memory requirements can be quite high.
If there isn't enough memory available, this can throw an out-of-memory error.
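As a rough back-of-the-envelope check, the size of a dense samples x features array can be estimated before running anything. This sketch assumes the table is held densely in memory, which the issue does not confirm; the helper names, the 60,000-feature count, and the 16 GB budget are all illustrative.

```python
def dense_bytes(n_samples, n_features, itemsize=8):
    """Bytes for a dense n_samples x n_features array (8 bytes = float64)."""
    return n_samples * n_features * itemsize


def fits_in_budget(n_samples, n_features, budget_gb, itemsize=8):
    """True if a single dense copy of the table fits in budget_gb gigabytes."""
    return dense_bytes(n_samples, n_features, itemsize) <= budget_gb * 1e9


# 10,000 samples x 60,000 features in float64: one copy is 4.8 GB.
print(dense_bytes(10_000, 60_000) / 1e9)                # 4.8
print(fits_in_budget(10_000, 60_000, budget_gb=16))     # True
print(fits_in_budget(100_000, 60_000, budget_gb=16))    # False
```

Real usage is typically higher than one copy, since intermediate arrays and gradients add to the total.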