scFoundation on GEARS with pretrained model used #37

Open
YOU-k opened this issue Jul 1, 2024 · 4 comments

YOU-k commented Jul 1, 2024

Hi there, thanks for the nice work!
I am trying to follow your code for the perturbation prediction task.
Based on the code you provided:
pre_in = x.clone().reshape(num_graphs, self.num_genes+1)
x = x.reshape(num_graphs, self.num_genes+1)[:,:-1]
the last column of pre_in should be the total counts. The total counts column is removed from x, so only the expression values are retained in x.
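
To make the shapes concrete, here is a minimal sketch of that reshape-and-slice step. It uses NumPy as a stand-in for the PyTorch tensors in the quoted code (`copy()` in place of `clone()`), with toy sizes. The assumption that the extra last entry per cell is the total counts comes from the reading above and is not confirmed by the code itself.

```python
import numpy as np

num_graphs = 2  # hypothetical batch size (number of cells in the batch)
num_genes = 4   # toy value; the real model works with far more genes

# Flattened batch: for each cell, num_genes expression values followed by one
# extra entry, assumed here to be the total counts.
x = np.array([
    1.0, 0.0, 2.0, 3.0, 6.0,  # cell 0: 4 genes + extra entry
    0.0, 5.0, 1.0, 0.0, 6.0,  # cell 1: 4 genes + extra entry
])

pre_in = x.copy().reshape(num_graphs, num_genes + 1)  # keeps the extra column
expr = x.reshape(num_graphs, num_genes + 1)[:, :-1]   # drops the last column

print(pre_in.shape)    # (2, 5)
print(expr.shape)      # (2, 4)
print(pre_in[:, -1])   # [6. 6.]
```

So `pre_in` still carries the extra column, while the sliced array holds only the per-gene values.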

However, the pretrained model provided on GitHub has the bin type 'auto_bin'. Does that mean the total counts were not part of the input when the model was pretrained?
If so, and I would like to use it to get embeddings for GEARS, what should I do with the total counts?

Also, it seems that pre_in is used directly as the input to the pretrained model. Does this mean that the input data has already been reformatted to have 19264 genes?

YOU-k (Author) commented Jul 2, 2024

Also, 'pad_token_id': 103 and 'mask_token_id': 102 are stored in the config file, but according to the CSV file there are genes that have the same token IDs.

WhirlFirst (Collaborator) commented

Hi,
Yes, the input should have 19264 genes; please see our tutorial. 'auto_bin' also uses the mean total counts. The pad and mask tokens are processed with the gene expression values, not with the gene names, so it is expected that some genes have the same token ID.
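
One way to read this reply is that the pad and mask IDs act as sentinels in the sequence of expression values, while gene identity is carried in a separate sequence of gene indices. Under that reading, a gene whose index happens to be 103 does not collide with a pad marker. The sketch below illustrates the idea with toy NumPy arrays; the array names and layout are assumptions for illustration, not the scFoundation implementation.

```python
import numpy as np

PAD_TOKEN_ID = 103   # value quoted from the config in this thread
MASK_TOKEN_ID = 102  # value quoted from the config in this thread

# Hypothetical cell: three expression values and the gene index of each one.
values = np.array([2.5, 0.7, 1.1])
gene_ids = np.array([102, 103, 5000])  # two gene indices equal a token ID

# Pad the value sequence to a fixed length with the pad sentinel.
max_len = 5
padded_values = np.full(max_len, float(PAD_TOKEN_ID))
padded_values[: len(values)] = values

# Padding is detected in the value sequence only.
is_pad = padded_values == PAD_TOKEN_ID
print(is_pad.tolist())  # [False, False, False, True, True]

# The gene index 103 is not affected by that check.
print(gene_ids[1])  # 103
```

This only works if real expression values never equal the sentinel, which is assumed here rather than taken from the repository.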

YOU-k (Author) commented Jul 23, 2024

Thanks for the response!
If I understand correctly, since the gene expression values are used as the input, the position of each gene matters. Does this mean the genes must always be in the same order as in the OS_scRNA_gene_index.csv you provided, so that the positions identify the genes? And what position ID should the total counts have? Is it 19264?
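
If the column order does need to match the reference list, the alignment could be sketched as below. This is a minimal illustration, not code from the repository: `align_to_reference` is a hypothetical helper, the four-gene reference list is a toy stand-in for the 19264-gene list, and filling missing genes with zeros is an assumption.

```python
import numpy as np

def align_to_reference(expr, genes, reference_genes):
    """Reorder the columns of expr (cells x genes) to follow reference_genes.

    Genes missing from the input are filled with zeros (an assumption);
    genes that are not in the reference are dropped.
    """
    col_of = {g: i for i, g in enumerate(genes)}
    aligned = np.zeros((expr.shape[0], len(reference_genes)), dtype=expr.dtype)
    for j, g in enumerate(reference_genes):
        if g in col_of:
            aligned[:, j] = expr[:, col_of[g]]
    return aligned

# Toy stand-in for the reference gene order.
reference_genes = ["A1BG", "A1CF", "A2M", "A2ML1"]

# Input data with a different gene order and one gene outside the reference.
genes = ["A2M", "TP53", "A1BG"]
expr = np.array([[3.0, 9.0, 1.0],
                 [0.0, 4.0, 2.0]])

aligned = align_to_reference(expr, genes, reference_genes)
print(aligned.tolist())  # [[1.0, 0.0, 3.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
```

After alignment, column j always refers to the j-th reference gene, regardless of the order in the original data.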

YOU-k (Author) commented Aug 6, 2024

Another related question. In the GEARS tutorial there is a gatherdata function that processes the input expression data and the gene IDs:
[image]
I understand that the pad value for the expression values can be the one given in the config file, 103. But should the gene IDs range from 0 to 19264? If so, how can 103 also be used as the pad token for them? Please correct me if I am wrong.
Many thanks!
