ScalaNLP Project
The purpose of this document is to (loosely!) plot out the direction of ScalaNLP over the next months and years.
Jason Baldridge (@jasonbaldridge), Dan Garrette (@dhgarrette), and David Hall (@dlwh) sat down over Skype recently to talk. We were mostly talking about the direction of the ScalaNLP project, taken as a whole, and what we want it to be. Among other things, we decided that clandestine Skype meetings should be avoided in favor of a mailing list.
Said mailing list is here: https://groups.google.com/group/scalanlp-discuss. Please join it if you are interested in talking about the direction of the overall ScalaNLP project, or the organization of any of its libraries.
Contributors
We're also very interested in getting more contributors! We're happy to talk about projects, things you'd like to do, whatever. If you do scientific computing, ML, or NLP, we would like to have you! Thanks to everyone who's contributed code and documentation so far!
There are currently 5 projects associated with the ScalaNLP ecosystem:
- Breeze: Linear algebra, numerics, visualization
- Nak: Machine Learning
- Epic: Natural Language Processing and Structured Prediction
- Junto: Label Propagation
- Puck: Super-fast GPU parser
Breeze is now focused on the core algebra and numerics aspects. The visualization code is deprecated and will not be further developed. The dependencies between the projects are simple:
- Breeze -> {Everything Else}
- Nak -> Epic
- Epic -> Puck
- Junto is a standalone library that is not currently used by the others.
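The dependency list above can be written down as a small graph, which makes it easy to answer questions like "what does Puck pull in?". This is only a sketch: it reads "A -> B" as "B depends on A" (an assumption about the arrow direction), the object and method names are hypothetical, and Junto is left out because the text only says that nothing else uses it.

```scala
object ScalaNlpDeps {
  // Direct dependencies per project, reading "A -> B" as "B depends on A" (assumption).
  val directDeps: Map[String, Set[String]] = Map(
    "Breeze" -> Set.empty[String],
    "Nak"    -> Set("Breeze"),
    "Epic"   -> Set("Breeze", "Nak"),
    "Puck"   -> Set("Breeze", "Epic")
  )

  // Everything a project needs, directly or indirectly.
  // The recursion terminates because the graph has no cycles.
  def transitiveDeps(project: String): Set[String] = {
    val direct = directDeps.getOrElse(project, Set.empty[String])
    direct ++ direct.flatMap(transitiveDeps)
  }

  // A valid build order: a project always has strictly more transitive
  // dependencies than anything it depends on, so sorting by that count works.
  def buildOrder: List[String] =
    directDeps.keys.toList.sortBy(p => transitiveDeps(p).size)
}
```

For example, `ScalaNlpDeps.transitiveDeps("Puck")` collects Breeze, Nak, and Epic, so a change in Breeze potentially affects every other project in the map.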
Breeze has recently slimmed down to just the linear algebra and the sampling and optimization components.
Spire provides a great basis for the algebra components of Breeze, modulo mutability and a few choices of operator names. The biggest concern here is to not raise the abstraction level in Breeze any higher than it already is.
Progress: None.
David has never really wanted to maintain a visualization library: it's not his focus, and he doesn't really even use it. He's happy to leave what's there and fix bugs, but he feels that someone else would be better suited to maintaining and expanding it.
One thing that seems clear is that there's a lot of awesome stuff going on in the info-visualization community right now (e.g. d3.js), and we're more or less making MATLAB-esque plots. That's fine as it is, but it shouldn't be the premier data visualization library for Scala. Processing and Prefuse both look like good JVM-based options to wrap. Maybe we just output d3 and open a browser window? (ugh.)
See also https://github.com/sameersingh/scalaplot
Progress: None.
Nak has recently incorporated breeze-learn. It also provides a Scala wrapper around the Java port of Liblinear, a great library for training models using logistic regression and support vector machines.
Epic began XXX
Chalk has shed its previous OpenNLP roots and taken on breeze-process. Currently, it is quite impoverished as a text processing library, but this will be changing soon. One focus is on creating text annotation pipelines that have UIMA-like functionality without the heavy-handedness of UIMA, possibly building on the Akka toolkit. (UIMA pretty much casts NLP components as actors: compare UIMA's initialize, process, and shutdown with Akka's preStart, receive, and postStop. However, it uses mutable data structures and seems to leave plenty of room for errors in programs that happily compile.)
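To make the mutability point concrete, here is a minimal sketch of what an annotation pipeline over immutable data could look like. It is not Chalk's API and uses neither UIMA nor Akka; every name in it (`Annotation`, `Document`, `Annotator`, and the two stages) is hypothetical and chosen only for illustration.

```scala
object AnnotationPipelineSketch {
  // An immutable span over the original text (hypothetical type).
  final case class Annotation(label: String, start: Int, end: Int)

  // A document is never mutated; each stage returns a new copy with extra annotations.
  final case class Document(text: String, annotations: Vector[Annotation] = Vector.empty) {
    def withAnnotations(more: Seq[Annotation]): Document =
      copy(annotations = annotations ++ more)
  }

  // A pipeline stage is a pure function from Document to Document.
  type Annotator = Document => Document

  // Marks each maximal run of non-whitespace characters as a "token".
  val whitespaceTokenizer: Annotator = doc => {
    val tokens = "\\S+".r
      .findAllMatchIn(doc.text)
      .map(m => Annotation("token", m.start, m.end))
      .toVector
    doc.withAnnotations(tokens)
  }

  // Adds a "capitalized" annotation for every token starting with an upper-case letter.
  val capitalizedTagger: Annotator = doc => {
    val caps = doc.annotations.collect {
      case Annotation("token", s, e) if doc.text.charAt(s).isUpper =>
        Annotation("capitalized", s, e)
    }
    doc.withAnnotations(caps)
  }

  // Stages compose with ordinary function composition.
  val pipeline: Annotator = whitespaceTokenizer andThen capitalizedTagger
}
```

Calling `AnnotationPipelineSketch.pipeline(AnnotationPipelineSketch.Document("Breeze powers Epic"))` returns a new document carrying both token and capitalized annotations, while the input document is left untouched. Because a stage can only add to a copy, a later stage cannot silently corrupt what an earlier one produced.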
- What is our relationship to large-scale libraries like Spark and OptiML? Should we be playing the large-scale game?
- What about Factorie?
- As we start offloading more functionality, we need to start worrying about dependency weight. We should identify good libraries and try to stick to them as much as possible without letting the dependency graph get out of control. See the next section.
Though there are a number of related but separate ScalaNLP projects, we'd like to have some reasonably common core of dependencies for various needs. Here's a list for our own reference of dependencies to be used across ScalaNLP.
At this stage, this is a HashMap that may have collisions, and that's okay -- we can discuss and sort out any duplication if we feel it is necessary.
- Option parsing: scallop
- Logging: ScalaLogging
- Testing: ScalaTest
- Configuration: Typesafe's config
- JVM-based BLAS: netlib-java
- can all be nativized but it's a pain.
- Native BLAS: jblas
- csv: opencsv
(Logging note: we use SLF4J for logging in Breeze, which means you can pick your own backend. By default, no logging information is emitted. You can easily add basic logging by adding the SLF4J Simple dependency to your build.sbt file.)
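As a rough illustration of the logging note, a build.sbt fragment along these lines adds the SLF4J Simple backend. The version number shown is an assumption (one published release among many); pick the one that matches the SLF4J API version already on your classpath.

```scala
// build.sbt (fragment)
// Assumption: 1.7.36 is only an example version; align it with your slf4j-api version.
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.36"
```

With this backend on the classpath, log output goes to standard error; swapping in a different SLF4J backend only requires changing this dependency.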
Breeze is a numerical processing library for Scala. http://www.scalanlp.org