Upcoming events

Sprint planning: 1 April 2011

Places

In Paris: at Logilab's (104 boulevard Blanqui, Paris) - Metro 6 - Glacière

In Boston at MIT (36-537: 5th floor of building 36)

On IRC (#scikit-learn on irc.freenode.net)

People present

Please add your skills, interests, or planned task to help organize the sprint and pair people on tasks. To share as much knowledge as possible, each task would ideally be tackled by a pair of people with different skills.

At Logilab, Paris (from 9:00 to 19:00):

Gaël Varoquaux: task: code review, pair programming on specific tasks where needed.
Julien Miotte
Feth Arezki: could help with coding (w/ the logger?), LaTeX. Interested in learning about scikit.
Nelle Varoquaux
Fabian Pedregosa
Vincent Michel: task: code review, pair programming. features: Ward's clustering.

At MIT, Boston:

Alexandre Gramfort: task: code review and pair programming
Demian Wassermann: task: Gaussian Processes with sparse data
Satra Ghosh: task: Ensemble Learning, random forests
Nico Pinto
Pietro Berkes

On IRC (from around 9am Brasília time, GMT-3):

Alexandre Passos: task: minibatch k-means

Tasks

In addition to the tasks listed below, it is useful to consider any issue in the tracker: https://github.com/scikit-learn/scikit-learn/issues

Easy

Improve test coverage: run 'make test-coverage' after installing the coverage module, find low-hanging fruit to improve coverage, and add tests. Try to test the logic, and do not simply aim to increase the number of lines covered.

Py3k support: almost everything is in the Py3k pull request. What remains is to check some failing tests in joblib and to backport the latest joblib into the source tree (I'm not sure of the status of this).

Not requiring expertise in machine learning

Logging: create a logger (using the standard library's 'logging' module) for the scikit, plus a couple of simple print functions to replace 'print' calls throughout the scikit. Talk to Gael Varoquaux about this task.
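As a starting point for discussion, the logging task could look roughly like the sketch below. The logger name "scikits.learn" and the helper names info() and enable_verbose() are assumptions made for illustration, not an agreed design.

```python
import io
import logging

# Hypothetical logger name; the project may settle on another one.
logger = logging.getLogger("scikits.learn")
# Stay silent by default so that library users are not flooded with output.
logger.addHandler(logging.NullHandler())


def info(msg, *args):
    """Drop-in replacement for an informational 'print' call."""
    logger.info(msg, *args)


def enable_verbose(level=logging.INFO, stream=None):
    """Attach a stream handler (stderr by default) to make messages visible."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(name)s] %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


# Usage: capture the output in a buffer instead of printing it.
buffer = io.StringIO()
enable_verbose(stream=buffer)
info("fitting %d samples", 10)
captured = buffer.getvalue()
```

Keeping the formatting arguments separate from the message lets the logging module skip string formatting when the message is filtered out.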

Prettify the PDF documentation - for instance modify the LaTeX stylesheet so that blocks are less ugly. Talk to Gael Varoquaux about this task.

Multiple figures in documentation examples: when generating the documentation, figures plotted via matplotlib are captured using the code in doc/sphinxext/gen_rst.py. However, currently only the first figure is captured. It would be nice to capture all the figures.

Restore the 'source' link on the documentation: the html template does not give a 'source' link to the rst source of the file. This should be added back.

Thumbnails for examples: it would be cool to have thumbnails generated for the examples, à la matplotlib.

Data download cleanup: running the examples and building the docs can download a lot of data. We need to hunt down every data downloading step and rationalize it so that it does not leave unnecessary files behind (such as zip files that have already been expanded) and does not download files to different locations: the scikit tree is starting to take a lot of space on disk.
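One possible shape for this cleanup is a single helper that every download step goes through. The sketch below is only an illustration: the function name extract_and_clean() and the "scikit_learn_data" directory are hypothetical, and the demo builds a toy archive locally instead of downloading anything.

```python
import os
import shutil
import tempfile
import zipfile


def extract_and_clean(archive_path, data_home):
    """Extract a zip archive under data_home, then delete the archive."""
    name = os.path.splitext(os.path.basename(archive_path))[0]
    target = os.path.join(data_home, name)
    if not os.path.isdir(target):
        os.makedirs(target)
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target)
    # Only the expanded copy is kept; the zip file would just waste space.
    if os.path.exists(archive_path):
        os.remove(archive_path)
    return target


# Demo on a toy archive created in a temporary directory.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "toy_dataset.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("readme.txt", "toy dataset")

target = extract_and_clean(archive_path, os.path.join(workdir, "scikit_learn_data"))
extracted = sorted(os.listdir(target))
archive_left = os.path.exists(archive_path)
shutil.rmtree(workdir)  # clean up the demo itself
```

Routing all datasets through one data_home directory also addresses the "different locations" part of the task.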

Rationalize images in documentation: we have 56 MB of images generated in the documentation (doc/_build/html/_images). First, we should save JPGs instead of PNGs: it shrinks this directory to 45 MB (not a huge gain, granted). Second, the same file is saved many times. I need to understand what is going on and fix that.

BallTree wrapping in Cython: redo the wrapping of the C++ ball tree code using Cython and make sure that the resulting class is pickleable.

Affinity propagation using sparse matrices: the affinity propagation algorithm (scikits.learn.cluster.affinity_propagation_) should be able to work on sparse input affinity matrices without converting them to dense. A good implementation should make this efficient on very large data.

Branch merging

A lot of good work is waiting for small fixes in branches:

merge Hierarchical Clustering (merge in the HCluster v2 pull request)

merge LDA improvements

Machine learning tasks

Improve the documentation: if you understand some aspects of machine learning, you can help make the scikit rock without writing a line of code: http://scikit-learn.sourceforge.net/developers/index.html#documentation

Mini-batch k-means (algorithm 1 of http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf): this should be of moderate difficulty and could give great speed improvements to k-means.
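To give an idea of the amount of work, here is a minimal NumPy sketch of the mini-batch update as I read algorithm 1 of the paper linked above: sample a batch, cache the nearest center of each sample, then move each center with a per-center learning rate of 1 / count. The function name, its parameters, and the toy data are assumptions for illustration; a real implementation would need convergence checks and sparse input support.

```python
import numpy as np


def minibatch_kmeans(X, k, batch_size=10, n_iter=100, seed=0):
    """Return k centers fitted with mini-batch updates (illustrative sketch)."""
    rng = np.random.RandomState(seed)
    # Initialise each center with a randomly picked sample.
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.randint(0, len(X), batch_size)]
        # Cache the nearest center of every sample before updating anything.
        dist = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dist.argmin(axis=1)
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center learning rate
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers


# Toy data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centers = minibatch_kmeans(X, k=2)
```

With this learning rate, each center is the running mean of all the samples ever assigned to it, which is why the cost per iteration depends on the batch size rather than on the size of the dataset.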

Matrix factorization (Sparse PCA, NNMF)

Add transform to LDA + pipe LDA with covariance estimator

More ambitious/long term tasks

Random Forest

Fused Lasso

Group Lasso

MultiTask Lasso

KMeans with triangle inequality

Manifold learning

Bayesian classification (e.g. RVM)

Past sprints
