scFoundation on GEARS with pretrained model used #37

YOU-k · 2024-07-01T14:47:38Z

Hi there, thanks for the nice work!
I am trying to follow your code for the perturbation prediction task.
Based on the code you provided:
pre_in = x.clone().reshape(num_graphs, self.num_genes+1)
x = x.reshape(num_graphs, self.num_genes+1)[:,:-1]
the last column in pre_in should be the total counts. x has the total counts removed, which means that only expression values are retained in x.
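To make the two reshapes concrete, here is a minimal sketch with toy sizes. It uses NumPy rather than the PyTorch tensors in the snippet above (so `.copy()` stands in for `.clone()`), and `num_graphs = 2` and `num_genes = 4` are hypothetical values chosen only for illustration. It assumes each cell is laid out as `num_genes` expression values followed by one extra value.

```python
import numpy as np

num_graphs = 2  # hypothetical batch size
num_genes = 4   # toy value; the real input discussed here has 19264 genes

# Flat input: per cell, num_genes expression values followed by one extra value
x = np.arange(num_graphs * (num_genes + 1), dtype=np.float32)

# Keeps every column, including the extra last one
pre_in = x.copy().reshape(num_graphs, num_genes + 1)

# Drops the last column, leaving only the per-gene values
expr = x.reshape(num_graphs, num_genes + 1)[:, :-1]

print(pre_in.shape)  # (2, 5)
print(expr.shape)    # (2, 4)
```

In this toy layout, `pre_in[:, -1]` is the extra column and `pre_in[:, :-1]` is identical to `expr`.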

However, when I look at the pretrained model to be used, the bin type of the one provided on GitHub is 'auto_bin'. Does that mean the total counts are not used in the input to the pretrained model?
And if I would like to use it to get embeddings for GEARS, what should I do with the total counts?

Also, it seems that pre_in is used directly as input to the pretrained model. Does this mean that the input data has already been reformatted to have 19264 genes?

YOU-k · 2024-07-02T03:12:42Z

Also, 'pad_token_id': 103 and 'mask_token_id': 102 are stored in the config file, while there are genes that have the same token IDs according to the CSV file.

WhirlFirst · 2024-07-17T18:56:26Z

Hi,
Yes, the input should have 19264 genes; please see our tutorial. 'auto_bin' also uses the mean total counts. The pad and mask tokens are applied to the gene expression values, not to the gene names, so some genes share the same token ID.
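One way to read this is that the special tokens and the gene positions live in separate arrays, so the numbers 102 and 103 can appear in both without clashing. The following is a toy sketch of that reading, not the scFoundation implementation: the array names are hypothetical, and it assumes real expression values never equal 102 or 103.

```python
import numpy as np

PAD_TOKEN_ID = 103   # values quoted from the config file in this thread
MASK_TOKEN_ID = 102

# Hypothetical positions of four genes in the gene list
gene_positions = np.array([101, 102, 103, 104])

# Expression values for the same four slots; the last slot is padding
expr_values = np.array([0.0, 2.3, 1.1, PAD_TOKEN_ID], dtype=np.float32)

# Padding is detected in the value array, not in the position array
is_pad = expr_values == PAD_TOKEN_ID

print(is_pad.tolist())  # [False, False, False, True]
```

Here the gene at position 103 carries the value 1.1 and is not treated as padding, because only the value array is compared against the pad token.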

YOU-k · 2024-07-23T08:32:40Z

Thanks for the response!
If I understand correctly, when the gene expression values are used, the positions of the genes should matter. So does this mean the gene IDs should always follow the order you provided in OS_scRNA_gene_index.csv to indicate the positions of those genes? And what position ID should the total counts have? Is it 19264?
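If the column order does need to match a reference gene list, the alignment could be sketched as below. This is only an illustration under that assumption: `reference_genes` is a four-gene stand-in for the 19264-gene list, the gene names are made up, and filling missing genes with zeros is a choice made here rather than something confirmed in this thread.

```python
import numpy as np

reference_genes = ["A", "B", "C", "D"]  # stand-in for the reference gene order
my_genes = ["C", "A", "E"]              # genes in the user's own data
my_expr = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])   # cells x genes

# Map each reference gene to its column position
col = {gene: i for i, gene in enumerate(reference_genes)}

# Start from zeros so genes absent from the data stay at 0
aligned = np.zeros((my_expr.shape[0], len(reference_genes)), dtype=my_expr.dtype)
for j, gene in enumerate(my_genes):
    if gene in col:  # genes outside the reference list ("E") are dropped
        aligned[:, col[gene]] = my_expr[:, j]

print(aligned.tolist())  # [[2.0, 0.0, 1.0, 0.0], [5.0, 0.0, 4.0, 0.0]]
```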

YOU-k · 2024-08-06T06:33:32Z

Another related question: I found that in the GEARS tutorial there is a gatherdata function to process the input expression data and gene IDs:

I understand that the pad for expression values could be the one provided in the config file, where the pad is 103. But for the gene IDs, should they range from 0 to 19264? If so, how can 103 be used as the pad token here as well? Please correct me if I am wrong.
Many thanks!
