API cleanup: Drop argument center? #187
Comments
Yes, I'd also be in favour of deprecation. If I understand correctly it's both:
A common use case is that we want to calculate (mean, variance), which means that we often end up calculating the 'mean' twice. One could get around this by providing
I know that speed is an absolute priority for
I guess you could say that several of them already do two things; they just don't share it. There are other cases where this could be useful, e.g. returning the number of elements used when
FWIW, matrixStats was born when we needed to do lots of these (n, mu, sigma, ...) calculations across millions (sic!) of probesets in microarray data. Using the traditional R methods was just way too slow. So, matrixStats provides a way for developers to reach into these common low-level tasks with minimal overhead, things you'd otherwise turn to C/C++ to implement. I don't want to close that door because there's always someone out there who finds it useful, as we did in the early days. The problem is to find a balance between API coverage and bloat.
On a related note, this low-level performance at the R level is also why I've always said no to wrapping it up in S3 generics: it's just too expensive for many of the use cases. I'm sharing this because I think you guys are coming from other use cases where these types of overheads are relatively small compared to the other overheads.
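To make the (n, mu, sigma) use case concrete, here is a minimal base-R sketch of a helper that returns the count, mean, and variance of each row while computing the mean only once. The name row_mean_var() is hypothetical and not part of matrixStats; it only illustrates the idea of sharing the intermediate result, not how any native implementation works.

```r
# Hypothetical helper (NOT part of matrixStats): return per-row count,
# mean, and variance together so the mean is computed only once.
row_mean_var <- function(x, na.rm = FALSE) {
  stopifnot(is.matrix(x), is.numeric(x))
  # Number of values actually used in each row
  n <- if (na.rm) rowSums(!is.na(x)) else rep(ncol(x), nrow(x))
  mu <- rowMeans(x, na.rm = na.rm)
  # 'mu' has one element per row, so recycling subtracts mu[i] from row i.
  ss <- rowSums((x - mu)^2, na.rm = na.rm)
  list(n = n, mean = mu, var = ss / (n - 1))
}

x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 10), nrow = 3, byrow = TRUE)
res <- row_mean_var(x)
```

The returned n matters when missing values are dropped, because the denominator then differs from row to row.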
Sorry to hijack the conversation, but I am really curious: did you eventually move to a better way of doing things (outside of matrixStats), or did you rather stop having data with millions of rows, so that speed stopped being an issue?
Here's a possible deprecation plan:
How fast this can be done will depend on how many package developers as well as end-users rely on
It might also be that someone has a strong case to support
It wasn't one table of millions of rows; the data came from outside in the form of millions of small matrix-like structures (e.g. Affymetrix data; see the affxparser package on Bioconductor). We did not have control of the format of these input data. The total amount of data was/is also larger than RAM, so we couldn't/can't just load it all in and reshuffle. This type of data was/is used in many different pipelines and packages, so there was never an option to solve the data format once and for all (say, using a database). It's a constantly moving target, especially in the world of science where new models and methods are constantly developed and investigated.
I think this is actually at the root of all the confusion, misunderstandings, frustration, and sadness, around this issue ;-)
I don't have a strong use case to support
and not the alternate formula:
then it doesn't seem that it would be too hard to do. This would:
Just my 2 cents.
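The formulas referred to above are not shown here, so the following sketch rests on an assumption: that "primary formula" means the sum of squared deviations from center, and "alternate formula" means the shortcut based on the sum of squares. Under that assumption, the two agree when center is the true mean and diverge otherwise, which is where a wrong center silently produces wrong results. The function names are hypothetical.

```r
# ASSUMPTION: these are the two formulas under discussion; the thread's own
# formulas are not reproduced here. Function names are hypothetical.
var_primary <- function(x, center) {
  sum((x - center)^2) / (length(x) - 1)
}
var_alternate <- function(x, center) {
  n <- length(x)
  (sum(x^2) - n * center^2) / (n - 1)
}

x <- c(1, 2, 3, 4, 10)

# With the true mean, both formulas reproduce var(x).
ok_primary   <- var_primary(x, center = mean(x))
ok_alternate <- var_alternate(x, center = mean(x))

# With an incorrect center, they disagree; the alternate form can even
# turn negative, which is impossible for a variance.
bad_primary   <- var_primary(x, center = 6)
bad_alternate <- var_alternate(x, center = 6)
```

Restricting center to the primary form would at least keep the result non-negative and interpretable as a mean squared deviation around the supplied point.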
Yes, an alternative path is to tighten up the definition of
Note, changing from the alternative to the primary formula requires making sure that no one is relying on the alternative formula being implemented. As I mentioned in #183 (comment), we might see another group of people who wish to use
and not
Note, if there are missing values dropped, one cannot just "adjust" afterward by
I still have to think about the
…matrixStats.center.onUse'/env var 'MATRIXSTATS_CENTER_ONUSE' for controlling this [#187]
…ble, e.g. even if we defunct it for others in revdep check [#187]
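As a rough illustration of how an R option with an environment-variable fallback could control the behavior, here is a minimal sketch. The option and variable names come from the commit message above; everything else, including the accepted values "ignore", "deprecated", and "defunct" and both function names, is an assumption and may not match what matrixStats actually does.

```r
# Hypothetical sketch; accepted values and function names are ASSUMPTIONS.
center_on_use <- function() {
  # The R option takes precedence; fall back to the environment variable.
  value <- getOption("matrixStats.center.onUse", default = NULL)
  if (is.null(value)) {
    value <- Sys.getenv("MATRIXSTATS_CENTER_ONUSE", unset = "ignore")
  }
  match.arg(value, c("ignore", "deprecated", "defunct"))
}

signal_center_use <- function(fcn = "rowVars") {
  msg <- sprintf("Argument 'center' of %s() is deprecated", fcn)
  switch(center_on_use(),
    ignore     = invisible(NULL),              # stay silent
    deprecated = warning(msg, call. = FALSE),  # warn but continue
    defunct    = stop(msg, call. = FALSE)      # refuse to continue
  )
}
```

A setup like this would let reverse-dependency checks run with the strictest setting while ordinary users still see only a warning, or nothing at all.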
Ok, so I've made the decision to keep the
As said in my previous comment, we'll keep
Since
Argument center for rowVars() exists for historical reasons. It causes confusion and introduces a significant risk for mistakes/bugs, e.g. #183, PeteHaitch/DelayedMatrixStats#68, and const-ae/sparseMatrixStats#13.
The main reason for its existence is that, in the beginning, all these functions were implemented in plain R, and there was then a significant performance gain from reusing an already estimated center point, e.g.
The benefit of this is smaller today, or even nonexistent, because when using center we no longer use the native code but the old R code branch.
Due to the problems that come from using an incorrect value of center, should we deprecate/drop center from the API?
Argument center is currently used in:
cc/ @const-ae, @hpages, @LTLA, @PeteHaitch