
Maintaining constant factors with skip_modes #25

Open
davidhbrann opened this issue Jan 2, 2020 · 6 comments
Comments

@davidhbrann

Hi Alex,

Thanks for this nice package! I've been applying it to some single-cell RNA seq data in our lab (tensors with cell-types x genes x conditions), and it's yielding promising results.

Like Kelly, I've been interested in fitting the tensor decomposition on subsets of the data (cells or genes in this case) and then applying that decomposition to additional datasets.

It looks like this is implemented via the init and skip_modes arguments, as you described in #20. However, even in the output of the refitting example that you provided, the skip_modes factors have slightly different values than their init values. Although similar, they are not actually being held constant.

Presumably, this is because of the calls to U.rebalance in the ALS algorithms? Not sure it's that big of an issue, but I didn't know the best way to address it. Should I use the bcd algorithm instead, add an option to ignore the rebalancing when refitting, or do you have any other suggestions?

Best,
David

@davidhbrann
Author

Sorry to repost, but I've encountered another small bug running the code below.

Uext = U.factors.copy()
Uext.factors[1] = np.random.randn(*Uext.factors[1].shape)
# Fit model to the full dataset, only fitting the final set of factors.
V = tt.ncp_bcd(X, rank=24, init=Uext, skip_modes=[0, 1], verbose=True)

Here's the traceback:

NCP_BCD: iteration 1, objective 1.1158774944307905, improvement inf.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-92-31e1d436f07a> in <module>
      5 
      6 # Fit model to the full dataset, only fitting the final set of factors.
----> 7 V = tt.ncp_bcd(X, rank=24, init=Uext, skip_modes=[0, 1], verbose=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/tensortools/optimize/ncp_bcd.py in ncp_bcd(X, rank, random_state, init, skip_modes, negative_modes, **options)
    139 
    140             # Compute Gradient.
--> 141             grad = Um[n].dot(grams) - p
    142 
    143             # Enforce nonnegativity (project onto nonnegative orthant).

IndexError: too many indices for array

My guess is that this is due to https://github.com/ahwillia/tensortools/blob/304b467c0a404b638db7458d8363c82c152fa016/tensortools/optimize/ncp_bcd.py#L162, which makes Um a 0-d np.array that cannot be indexed, rather than a KTensor.

Would just calling U_old.copy() work?
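A minimal illustration of the failure mode (using a hypothetical stand-in class, not tensortools' actual KTensor): calling np.copy on an arbitrary Python object wraps it in a 0-d object array, which then can't be indexed, whereas the object's own copy method preserves the interface.

```python
import numpy as np

class KTensor:
    """Hypothetical stand-in for a factor container (illustration only)."""
    def __init__(self, factors):
        self.factors = factors
    def __getitem__(self, i):
        return self.factors[i]
    def copy(self):
        return KTensor([f.copy() for f in self.factors])

U = KTensor([np.ones((5, 2)), np.ones((6, 2))])

bad = np.copy(U)      # wraps U in a 0-d object array, not a KTensor
print(bad.shape)      # ()
# bad[0] raises IndexError: too many indices for array

good = U.copy()       # preserves the KTensor interface
print(good[0].shape)  # (5, 2)
```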

@davidhbrann
Author

This might also be obvious, but ncp_bcd will fail if you include the last dimension in skip_modes.

NCP_BCD: iteration 1, objective 0.4673757824789901, improvement inf.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-135-f4bdde02fcd6> in <module>
      3 
      4 # Fit model to the full dataset, only fitting the final set of factors.
----> 5 V = tt.ncp_bcd(X, rank=24, init=Uext, skip_modes=[0, 2], verbose=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/tensortools/optimize/ncp_bcd.py in ncp_bcd(X, rank, random_state, init, skip_modes, negative_modes, **options)
    154         # Correction and extrapolation.
    155         grams *= U[N - 1].T.dot(U[N - 1])
--> 156         obj_bcd = 0.5 * (sci.sum(grams) - 2 * sci.sum(U[N-1] * p) + normX**2 )
    157 
    158         extraw = (1 + sci.sqrt(1 + 4 * extraw_old**2)) / 2.0

ValueError: operands could not be broadcast together with shapes (396,24) (890,24) 

This is because the earlier loop over the modes skips N-1, so p = unfold(X, n).dot(kr) is never computed for mode N-1, which produces the ValueError.

It looks like one might also encounter a similar error on line 148 in ncp_hals.

@ahwillia
Collaborator

ahwillia commented Jan 2, 2020

Excellent -- sounds like a really cool application! I've always wanted to try some of these analyses on gene expression data.

All these questions and comments are very helpful. Let me see if I'm understanding correctly.

Like Kelly, I've been interested in fitting the tensor decomposition on subsets of the data (cells or genes in this case) and then applying that decomposition to additional datasets. It looks like this is implemented via the init and skip_modes arguments

Can you explain your thinking here a bit more? My understanding is that you are looking to validate the model on heldout data. What we did in our paper was to hold out randomized subsets of the data, since you can't entirely leave out a full 2-D slice of the tensor and still be able to fit the model. I have a blog post that explains the thinking behind this in a bit more detail. I hope that clears up what I'm trying to describe above.

Now, to do this with tensortools you'll want to use the mask option that is available for most of the optimization methods. Here is a quick sketch:

tensor =   # ... this is your 3d data
mask = np.random.rand(*tensor.shape) > .1  # True = fit, False = held out (~10% of entries).

# Fit model, ignoring the held-out entries.
result = ncp_hals(tensor, rank, mask=mask, ...)

# Compute residuals and training/test mean squared error.
sq_resids = (result.factors.full() - tensor) ** 2
train_mse = np.mean(sq_resids[mask])
test_mse = np.mean(sq_resids[~mask])

I didn't test that ^^^ out, so check it over and use with care.

However, even in the output of the refitting example that you provided, the skip_modes factors have slightly different values than their init values. Although similar, they are not actually being held constant. Presumably, this is because of the calls to U.rebalance in the ALS algorithms?

I think this is okay, because rebalance should just be rescaling the magnitude of the factors that are being skipped. The overall scale of the factors is arbitrary anyway --- I can divide the low-d factors along one mode of the tensor by some amount and multiply the factors along another mode by the same amount without changing the model's predictions.
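That scale invariance can be illustrated with a toy CP model in plain numpy (hypothetical random factors, not tensortools code): rescaling the columns of one mode's factors and inversely rescaling another mode's leaves the reconstructed tensor unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
rank = 4
A, B, C = (rng.random((d, rank)) for d in (5, 6, 7))

def cp_full(A, B, C):
    # Reconstruct the full tensor from CP factors: sum_r a_r outer b_r outer c_r.
    return np.einsum('ir,jr,kr->ijk', A, B, C)

s = rng.random(rank) + 0.5      # per-component rescaling factors
X1 = cp_full(A, B, C)
X2 = cp_full(A * s, B / s, C)   # rebalanced factors, same model

print(np.allclose(X1, X2))  # True
```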

Sorry to repost, but I've encountered another small bug running the code below. This might also be obvious, but ncp_bcd will fail if you include the last dimension in skip_modes

Good catch on these. I'll try to fix now. I would generally recommend using ncp_hals instead of ncp_bcd.

@ahwillia
Collaborator

ahwillia commented Jan 2, 2020

I think this should fix the problems you identified. Let me know if you agree -- ahwillia@200c9c3

cc @klmcguir -- have you been using skip_modes on the final dimension of the tensor? If so, you might want to re-run the model fits and make sure nothing changes in your results. Either way you probably want to pull and update your code.

@davidhbrann
Author

Hi Alex,

Yes, I think that commit should fix those errors. Thank you!

As for my first question, I think I might have been unclear. Your suggestion of the mask makes sense for cross-validating on held-out data from a single dataset, but I was thinking of using the function analogously to what I've been doing for PCA and NMF. There, I call .fit on one dataset and then .transform on another. With this scRNA-seq data, I decompose a matrix of cells x genes, and I often get a set of rank-k gene loadings from one dataset. Then I'm interested in the scores for those k factors for a separate set of cells, which may or may not include any of the ones I originally fit on. Likewise with NMF, I often solve for a new W while keeping H constant.
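The matrix version of that refit can be sketched in plain numpy (hypothetical shapes and variable names; a real NMF transform would use a nonnegative solver rather than unconstrained least squares): hold the gene loadings H fixed and solve a least-squares problem for the new scores W.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
H = rng.random((k, 50))       # gene loadings fit on dataset 1 (held fixed)
X_new = rng.random((20, 50))  # new cells x genes matrix

# Solve min_W ||X_new - W @ H||_F^2 for the new scores W, keeping H fixed.
W_new, *_ = np.linalg.lstsq(H.T, X_new.T, rcond=None)
W_new = W_new.T               # one row of k scores per new cell

print(W_new.shape)  # (20, 3)
```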

Extending that to the tensor, I'd fit a rank k decomposition to a 3D tensor of conditions x cell-types x genes. I'm interested in holding the gene factors constant while refitting on either new conditions or cell-types or both. This is where I was thinking I'd use skip_modes. Similarly, I often make multiple restarts where I randomly sub-sample equal cells for each cell-type. For the tensor decomposition and other analyses, I then make pseudo-cells where I aggregate (sum/average/etc) across all the cells for each cell-type to generate the condition x cell-type x gene tensor. By holding the gene factors constant while refitting the weights of the other two, I can then compare for each cell-type the distributions of weights across restarts for components of interest.

Does that approach make sense?

@ahwillia
Collaborator

ahwillia commented Jan 3, 2020

That approach potentially makes sense, but you should be careful with how you interpret the outcome. Even in the NMF case you aren't really cross-validating the full model, because when you call transform on the heldout data you are in some sense fitting new model parameters while peeking at the test set. I think the speckled holdout pattern or something similar to it is the cleaner approach. Here are two nice papers to reference on this (also our Neuron paper has some brief discussion on these topics):

In short, I think it may depend on your goals. If you want to evaluate the performance of the model quantitatively on heldout data I would be careful with what you're doing. If you are just trying to create a visualization of some sort then what you're suggesting is probably fine (and I think similar to what Kelly has been doing).
