Unexpectedly high memory usage for large models. #1
Comments
Thanks for your feedback. For really big problems you probably want to use Spark, but there's no reason why this code should not work for models that run fine in Python/statsmodels. Can I check that you are allocating plenty of heap memory? The JVM default is piffling. E.g. if you are running via sbt, are you starting it with extra heap, like `sbt -mem 12000` for 12 GB of heap?
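For concreteness, either of these should start sbt with a 12 GB heap (`-mem` takes megabytes):

```bash
sbt -mem 12000 run

# or pass the JVM option through the sbt launcher:
sbt -J-Xmx12G run
```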
Regarding your second question, there is no code for this currently. There should be a "verbose" or "debug" option when calling the function that will print (or log) diagnostic information while running. I'll file a separate issue for that.
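To sketch the kind of per-iteration feedback I have in mind, here is a generic IRLS loop for logistic regression with a hypothetical `verbose` flag (written against Breeze, but purely illustrative, not the actual scala-glm internals):

```scala
import breeze.linalg._
import breeze.numerics.{abs, sigmoid}

// Generic IRLS for logistic regression with a hypothetical `verbose`
// flag; illustrative only, not scala-glm's actual implementation.
def irls(y: DenseVector[Double], X: DenseMatrix[Double],
         maxIts: Int = 25, tol: Double = 1e-8,
         verbose: Boolean = false): DenseVector[Double] = {
  var beta = DenseVector.zeros[Double](X.cols)
  var delta = Double.MaxValue
  var it = 0
  while (it < maxIts && delta >= tol) {
    val eta = X * beta
    val mu = sigmoid(eta)
    val w = mu.map(m => m * (1.0 - m)) // IRLS weights
    val z = eta + (y - mu) /:/ w       // working response
    val wX = X(::, *) *:* w            // scale row i of X by w(i)
    // weighted least-squares step via the normal equations
    // (a QR decomposition of sqrt(W)X would be more stable)
    val betaNew = (X.t * wX) \ (wX.t * z)
    delta = max(abs(betaNew - beta))
    beta = betaNew
    it += 1
    if (verbose) println(f"iter $it%2d: max|dbeta| = $delta%.3e")
  }
  beta
}
```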
Thanks for your reply! Even if I set a much larger heap, I still run out of memory.
Does it run OK if (say) you halve the number of observations and covariates? If so, I suspect the problem is just that I haven't used in-place operations or similar tricks from numerical linear algebra to avoid copying and allocation. I could easily imagine that being a bit more careful with the numerics could save a factor of 2 or 4 on memory.
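For example, here is the sort of thing I mean, in Breeze (purely illustrative):

```scala
import breeze.linalg.{DenseVector, axpy}

val w = DenseVector.ones[Double](5)
val g = DenseVector.fill(5)(0.1)

// allocating: each arithmetic operation creates a fresh vector
val wNew = w - (g * 0.01)

// in place: axpy computes w += alpha * g with no temporaries
axpy(-0.01, g, w)
```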
It runs as expected for smaller numbers of observations and features. I believe the time complexity of most optimizer algorithms for this problem is quadratic in the number of features, though, so going from e.g. 100 to 500 feature columns causes a slowdown. I have no idea how memory consumption grows as a function of those parameters for a non-optimized covariance calculation versus an optimized one, though. At any rate, this is still quite a nice library. As far as I can tell, only Spark and scala-glm provide standard errors for logistic regression estimates in the Scala ecosystem. Since this is such a fundamental need for any regression fit, I expect scala-glm could attract a lot of attention.
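For a rough sense of scale (standard IRLS accounting, not measurements of scala-glm): each iteration forms X^T W X in O(n p^2) time and solves a p x p system in O(p^3), so going from 100 to 500 columns makes the dominant term roughly 25x more expensive. Memory-wise, the design matrix itself is n * p * 8 bytes (about 400 MB at n = 10^6, p = 50) plus O(p^2) for the covariance matrix, so an OOM on a multi-gigabyte heap suggests several full-size copies of X are being materialized.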
When testing how well the `LogisticGlm` model scales with a large toy data set, I find on my local machine (16 GB RAM) that I hit out-of-memory errors even for fairly small problem sizes. Here is some example code to make a toy logistic regression (the exact constants are illustrative):
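```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.stats.distributions.{Gaussian, RandBasis}
import scalaglm.{Glm, LogisticGlm}

object ToyLogistic extends App {
  implicit val rb: RandBasis = RandBasis.withSeed(42)
  val rng = new scala.util.Random(42)

  val n = 1000000 // observations
  val p = 50      // covariates

  // illustrative data-generating process: standard normal covariates,
  // random true coefficients, Bernoulli response via the logistic link
  val X = DenseMatrix.rand(n, p, Gaussian(0.0, 1.0))
  val b = DenseVector.rand(p, Gaussian(0.0, 1.0))
  val prob = (X * b).map(eta => 1.0 / (1.0 + math.exp(-eta)))
  val y = prob.map(pr => if (rng.nextDouble() < pr) 1.0 else 0.0)

  val fit = Glm(y, X, (1 to p).map("x" + _), LogisticGlm)
  fit.summary
}
```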
With this problem size (1 million observations of 50 features), I immediately get an OOM error.
This is a fairly small problem instance. If I generate the data set with `numpy`, for example, and serialize it to a binary file on disk, it is less than 5 GB. And there is no trouble loading this data and fitting the model (even with the standard error calculations) with the `statsmodels` or `scikit-learn` libraries in Python. What are the root causes of such unexpectedly high memory usage in scala-glm?
A secondary question is how to monitor convergence for this large data set. I can increase the number of iterations, but there is no per-iteration feedback during model fitting to indicate whether the fit is converging or not.