Input-Output Dimension #2

Open
ekokrek opened this issue Jul 10, 2020 · 3 comments

Comments

@ekokrek

ekokrek commented Jul 10, 2020

Hello,

I tried to predict a high-resolution matrix using the processed file that you shared.

My assumptions were:

  • This file contains the raw low-resolution matrix (downsampling ratio: 1/16).
  • It is the downsampled version of the GM12878 primary or replicate intrachromosomal contacts.

I placed the above file into the specified directory. Then, choosing the 1/16 model parameters, I ran the following command:

python data_predict.py -lr 40kb -ckpt save/deephic_raw_16.pth -c GM12878

The chromosome I am interested in is chromosome 12. Its size is 133,851,895 bases, so binning at 10kb should give 13,386 bins. However, the predicted chromosome 12 matrix has dimensions 13,398 x 13,398. When I checked the input file, I saw that the 'sizes' key in the dictionary holds this same value of 13398 for chromosome 12. The same discrepancy occurs in other chromosomes too.
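The bin-count arithmetic above can be checked with a short sketch. The helper name `expected_bins` is hypothetical and is not part of the repository; the numbers are the ones quoted in this comment.

```python
import math

def expected_bins(chrom_length_bp: int, resolution_bp: int) -> int:
    """Number of fixed-width bins needed to cover a chromosome (hypothetical helper)."""
    return math.ceil(chrom_length_bp / resolution_bp)

chr12_length = 133_851_895  # chromosome 12 length quoted above
resolution = 10_000         # 10kb bins

n_expected = expected_bins(chr12_length, resolution)
n_in_file = 13_398          # value reported under the 'sizes' key for chromosome 12

# Expected bin count, and how many extra bins the processed file carries
print(n_expected, n_in_file - n_expected)  # 13386 12
```

So the processed file carries 12 more bins for chromosome 12 than the chromosome length alone would suggest.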

So the question is:
How are these shapes/sizes calculated?

Thanks in advance!

@omegahh
Owner

omegahh commented Jul 21, 2020

Sorry for the delayed reply. The Hi-C data were downloaded from GSE63525, and only .tar.gz files were available when we downloaded them.

I checked the raw data from GSE63525 (e.g. GSE63525_GM12878_primary_intrachromosomal_contact_matrices.tar.gz). The largest binned coordinate in the three-column tab-separated file (chr12_10kb.RAWobserved) is 133840000, but there are 13398 values in the bias file (chr12_10kb.KRnorm/SQRTVCnorm/VCnorm). The processed matrix is expanded to 13398 to match the bias file. The values in the bias file are NaN when the row index is larger than 13384, so the corresponding values in the Hi-C matrix are all zeros.
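A minimal sketch of the expansion described above, using toy data rather than the real GSE63525 files. The function `dense_from_sparse` and all values are hypothetical and are not taken from the repository's preprocessing code; the sketch only illustrates how sizing the dense matrix by the bias vector leaves trailing all-zero rows and columns.

```python
import numpy as np

def dense_from_sparse(records, bias, resolution):
    """Build a symmetric dense matrix sized to the bias vector (hypothetical helper)."""
    n = len(bias)  # matrix size follows the bias vector, not the largest coordinate
    mat = np.zeros((n, n), dtype=float)
    for start_i, start_j, count in records:
        i = int(start_i // resolution)
        j = int(start_j // resolution)
        mat[i, j] = count
        mat[j, i] = count
    return mat

# Toy stand-ins for a three-column RAWobserved file and a norm vector
resolution = 10_000
records = [(0, 0, 5.0), (0, 10_000, 2.0), (10_000, 20_000, 3.0)]
bias = np.array([1.0, 0.9, 1.1, np.nan, np.nan])  # trailing NaNs, as in the thread

mat = dense_from_sparse(records, bias, resolution)
nan_bins = np.isnan(bias)

# Rows belonging to NaN-bias bins received no contacts in this toy example
print(mat.shape, mat[nan_bins].sum())  # (5, 5) 0.0
```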

@ekokrek
Author

ekokrek commented Jul 22, 2020

Yes, I realized that the dimensions are taken from the KRnorm vector.
However, I saw that there are NaNs in both the initial rows and the final rows.
I couldn't decide from which end I should trim the predicted matrix,
since I don't know the normalization procedure very well.

So, I guess the final rows and columns are the "extra/trimmable" NaN values. Would you agree with that?
My main aim is to compare the final chromosome matrix with other predicted matrices, so I don't want to shift the values in any way and end up with a low similarity value.

Thanks again for the reply :)

@omegahh
Owner

omegahh commented Jul 23, 2020

Yes, I agree with you. According to the description in the README file (GSE63525_GM12878_primary_README.rtf):

To normalize this entry using the KR normalization vector, one would divide 59.0 by the 8001st line ((40000000/5000)+1=8001) and the 8021st line ((40100000/5000)+1=8021) of GM12878_primary/5kb_resolution_intrachromosomal/chr1/MAPQGE30/chr1_5kb.KRnorm.

This shows that genome locations are converted to line numbers in the bias vector with no offset at the beginning, so I think the final rows and columns can be omitted.
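The mapping in the README quote and the resulting trim can be sketched as follows. Both helper names are hypothetical, and the 8 x 8 array is a small stand-in for the 13398 x 13398 prediction (which would be cut to 13386 under the bin count discussed earlier in the thread).

```python
import numpy as np

def bias_line_number(position_bp: int, resolution_bp: int) -> int:
    """1-based line in the norm vector for a bin start coordinate (hypothetical helper)."""
    return position_bp // resolution_bp + 1

def trim_to_expected(matrix: np.ndarray, n_bins: int) -> np.ndarray:
    """Keep the first n_bins rows and columns, dropping only the trailing ones."""
    return matrix[:n_bins, :n_bins]

# The two examples from the README quote: no offset at the start of the vector
print(bias_line_number(40_000_000, 5_000))  # 8001
print(bias_line_number(40_100_000, 5_000))  # 8021

# Small stand-in for the predicted matrix
predicted = np.arange(64, dtype=float).reshape(8, 8)
trimmed = trim_to_expected(predicted, 6)

# Trimming from the end leaves every remaining index unchanged
print(trimmed.shape, trimmed[2, 3] == predicted[2, 3])  # (6, 6) True
```

Because only trailing rows and columns are dropped, bin i of the trimmed matrix still corresponds to the same genomic interval, so comparisons against other predicted matrices are not shifted.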
