Reporting initial result of data analysis #24

Open
swiri021 opened this issue Sep 20, 2021 · 17 comments · Fixed by #30, #42 or #84

@swiri021
Contributor

No description provided.

@swiri021
Contributor Author

swiri021 commented Sep 26, 2021

@kicheolkim It seems CIS and RR differ significantly in DiseaseDuration (notebook link), so if we use DiseaseDuration to find features relevant to early and late markers, those features might simply be genes that are differentially expressed between RR and CIS. Do you think CIS and RR actually correspond to early and late stages of MS? For example, CIS might really be the early stage of MS and RR the middle stage.

@kicheolkim
Contributor

@swiri021 I don't think RR can be considered a middle stage. As far as I know, in the clinic, CIS just means the patient has experienced neurodegeneration for the first time. If the patient has another attack, it is considered the RR stage. After the CIS stage, a patient can have more attacks or no further attacks. That's why I think the data is good for studying disease mechanisms but may not be good for diagnostic biomarkers.

@swiri021 swiri021 linked a pull request Sep 30, 2021 that will close this issue
@swiri021 swiri021 changed the title Initial result of data analysis Reporting initial result of data analysis Sep 30, 2021
@swiri021
Contributor Author

swiri021 commented Sep 30, 2021

@kicheolkim
Contributor

kicheolkim commented Oct 2, 2021

wow... it's very interesting...
Is the state single gene? or gene set? Is there a list of the genes?

@swiri021
Contributor Author

swiri021 commented Oct 2, 2021

wow... it's very interesting... Is the state single gene? or gene set? Is there a list of the genes?

The activation score is calculated from gene sets (gene signatures), and DEG is a single list of differentially expressed genes, as you know. So here the DEG model uses one gene as one feature, and the Activation Score model uses one pathway score as one feature.
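
To make the difference between the two feature types concrete, here is a minimal sketch. It is not the scoring method used in the notebook: as a stand-in, it assumes a pathway score is the mean per-gene z-score over the gene set. The function names, gene names, gene sets, and expression values are all hypothetical toy inputs.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Standardize each gene across samples; constant genes become all zeros."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out


def gene_level_features(expr, deg_list):
    """DEG-style features: one gene = one feature (genes missing from expr are skipped)."""
    return {g: expr[g] for g in deg_list if g in expr}


def pathway_scores(expr, gene_sets):
    """Pathway-style features: one gene set = one feature.

    Assumption: score = mean z-score of the member genes per sample.
    """
    z = zscore_genes(expr)
    scores = {}
    for name, genes in gene_sets.items():
        members = [z[g] for g in genes if g in z]
        if members:
            # zip(*members) iterates over samples; average across member genes
            scores[name] = [mean(col) for col in zip(*members)]
    return scores


# Hypothetical toy data: 3 genes x 3 samples
expr = {"G1": [1.0, 2.0, 3.0], "G2": [2.0, 4.0, 6.0], "G3": [5.0, 5.0, 5.0]}
gene_sets = {"SET_A": ["G1", "G2"], "SET_B": ["G3", "G9"]}
deg_list = ["G1", "G7"]

deg_features = gene_level_features(expr, deg_list)
scores = pathway_scores(expr, gene_sets)
```

With thousands of genes but far fewer gene sets that pass filtering, the pathway representation can end up with a much smaller feature count than the gene-level one.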

@swiri021 swiri021 pinned this issue Oct 4, 2021
@swiri021
Contributor Author

swiri021 commented Oct 4, 2021

wow... it's very interesting... Is the state single gene? or gene set? Is there a list of the genes?

I think this may be due to the number of features. I lowered the DEG log2 fold-change threshold to 0.58 (1.5-fold), and performance is better than with the 2-fold threshold. Still, the activation score model shows a narrower gap in AUC between the validation set and the test set. Anyway, I am going to dig deeper into the pathway features.
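
For reference, the 0.58 threshold is log2(1.5), and a 2-fold threshold is log2(2) = 1. The sketch below shows how the two cutoffs select different numbers of DEG features. The result records, the `filter_degs` helper, and the adjusted p-value cutoff of 0.05 are assumptions for illustration, not values from the analysis.

```python
import math


def filter_degs(results, min_fold, max_padj=0.05):
    """Keep genes whose |log2 fold change| meets the fold threshold.

    results: list of dicts with keys "gene", "log2fc", "padj" (hypothetical layout).
    max_padj: assumed significance cutoff; not stated in the thread.
    """
    threshold = math.log2(min_fold)
    return [
        r["gene"]
        for r in results
        if abs(r["log2fc"]) >= threshold and r["padj"] <= max_padj
    ]


# Hypothetical differential expression results
results = [
    {"gene": "A", "log2fc": 1.2, "padj": 0.01},
    {"gene": "B", "log2fc": -0.7, "padj": 0.02},
    {"gene": "C", "log2fc": 0.3, "padj": 0.001},
    {"gene": "D", "log2fc": 2.0, "padj": 0.2},
]

degs_1_5_fold = filter_degs(results, min_fold=1.5)  # threshold ~0.585
degs_2_fold = filter_degs(results, min_fold=2.0)    # threshold 1.0
```

Lowering the threshold admits more genes as features, which is one way the feature count changes between the two DEG models.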

@kicheolkim
Contributor

kicheolkim commented Oct 4, 2021

So, the activation score is based on pathways, and the gene set is from DEG (DESeq2 results)?
Do you have a list of pathway and/or gene sets? I'm curious what pathways/genes are included.

@swiri021
Contributor Author

swiri021 commented Oct 4, 2021

So, the activation score is based on pathways, and the gene set is from DEG (DESeq2 results)? Do you have a list of pathway and/or gene sets? I'm curious what pathways/genes are included.

Yes, we have a list of pathways, and the activation score was calculated by using MSigDB. Additionally, I will let you know if we have more interesting points here.

@swiri021
Contributor Author

@kicheolkim @lacuss I got a weird signal in the data: Notebook link
That signal is related to the 'Sex' of the patients; unfortunately, it is also significantly related to the RR and CIS categories... maybe another source of noise?

@lacuss
Collaborator

lacuss commented Oct 18, 2021

@kicheolkim @lacuss I got a weird signal in the data: Notebook link
That signal is related to 'Sex' of patients, unfortunately, it is related to RR and CIS category significantly..... maybe another noise?

So… I looked up MS on the Mayo Clinic website. It says: “Sex-Women are more than two to three times as likely as men are to have relapsing-remitting MS”. Maybe this is the reason? I think we should dig further into the correlation.

@swiri021
Contributor Author

Yeah, I have seen a similar review paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3707353/, but we still can't say that the RR/CIS split is related to the sex factor, given the number of cases. Any thoughts? @kicheolkim, do we need to go further with the gender information? The top pathways for that clustering are:

RUNNE_GENDER_EFFECT_UP
PYEON_CANCER_HEAD_AND_NECK_VS_CERVICAL_DN
GSE5099_DAY3_VS_DAY7_MCSF_TREATED_MACROPHAGE_DN
GSE3982_MEMORY_CD4_TCELL_VS_BCELL_UP
GOMF_HISTONE_DEMETHYLASE_ACTIVITY_H3_K4_SPECIFIC
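
On the case-number concern, one way to check the Sex vs. RR/CIS association with small counts is Fisher's exact test on the 2x2 contingency table. The sketch below is a plain-Python two-sided version; the function name is mine and the counts are placeholders, not the counts from this dataset.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with p_obs are included
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Placeholder counts only (rows: female/male, columns: RR/CIS)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Replacing the placeholder counts with the real Sex-by-category counts would show whether the association holds up at the actual sample size.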

@swiri021
Contributor Author

  • The CD4, CD8, and CD14 datasets all show the same PCA pattern (Sex is the strongest factor clustering the data at the pathway level)

@swiri021
Contributor Author

These genes are the outliers that drive the male/female clustering. When these genes are removed from the list, the clustering by Sex disappears completely. Let me know if these genes are interesting or need more investigation. (Sorry for not converting the Entrez IDs.)

EntrezID pval fc
6192 3.013055e-25 5.829828
8284 3.013055e-25 5.683377
5616 3.013055e-25 5.566646
8653 3.013055e-25 5.436171
8287 3.013055e-25 5.414928
246126 3.013055e-25 5.140104
7404 3.013055e-25 4.902533
7544 3.013055e-25 3.396781
9086 3.013055e-25 2.624308
9087 3.013055e-25 1.204376
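
The removal step described above can be sketched as follows. The Entrez IDs are the ten listed in the table; the expression dictionary, its values, the `GENE_X` key, and the `drop_genes` helper are hypothetical stand-ins for the real feature matrix.

```python
# Entrez IDs of the ten outlier genes listed above
SEX_OUTLIER_ENTREZ = {
    "6192", "8284", "5616", "8653", "8287",
    "246126", "7404", "7544", "9086", "9087",
}


def drop_genes(expr, exclude):
    """Return a copy of the gene -> per-sample values mapping without excluded genes."""
    return {gene: values for gene, values in expr.items() if gene not in exclude}


# Hypothetical expression values keyed by Entrez ID (as strings);
# "GENE_X" is a placeholder for any gene not in the outlier list
expr = {
    "6192": [9.1, 0.2, 8.7],
    "7404": [6.3, 0.1, 6.0],
    "GENE_X": [4.2, 4.0, 4.4],
}

filtered = drop_genes(expr, SEX_OUTLIER_ENTREZ)
```

The filtered mapping can then be fed back into the same PCA or clustering step to confirm that the Sex separation is gone.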

@swiri021 swiri021 linked a pull request Oct 20, 2021 that will close this issue
@lacuss
Collaborator

lacuss commented Oct 21, 2021

| input | name | symbol | alias (first 5) | HGNC |
|---|---|---|---|---|
| 8653 | DEAD-box helicase 3 Y-linked | DDX3Y | DBY | HGNC:2699 |
| 9086 | eukaryotic translation initiation factor 1A Y-linked | EIF1AY | eIF-4C | HGNC:3252 |
| 8284 | lysine demethylase 5D | KDM5D | HY, HYA, JARID1D, SMCY | HGNC:11115 |
| 5616 | protein kinase Y-linked (pseudogene) | PRKY | PRKXP3, PRKYP | HGNC:9444 |
| 6192 | ribosomal protein S4 Y-linked 1 | RPS4Y1 | RPS4Y, S4 | HGNC:10425 |
| 9087 | thymosin beta 4 Y-linked | TMSB4Y | TB4Y | HGNC:11882 |
| 246126 | taxilin gamma pseudogene, Y-linked | TXLNGY | CYorf15A, CYorf15B, TXLNG2P | HGNC:18473 |
| 8287 | ubiquitin specific peptidase 9 Y-linked | USP9Y | DFFRY, SPGFY2 | HGNC:12633 |
| 7404 | ubiquitously transcribed tetratricopeptide repeat containing, Y-linked | UTY | KDM6AL, KDM6C, UTY1 | HGNC:12638 |
| 7544 | zinc finger protein Y-linked | ZFY | ZNF911 | HGNC:1 |
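
As a small convenience, the mapping above can be applied to the earlier EntrezID/pval/fc table. This is only a sketch: the dictionary and rows are copied from the two tables in this thread, and the variable names are mine.

```python
# Entrez ID -> gene symbol, taken from the annotation table above
ENTREZ_TO_SYMBOL = {
    "8653": "DDX3Y", "9086": "EIF1AY", "8284": "KDM5D", "5616": "PRKY",
    "6192": "RPS4Y1", "9087": "TMSB4Y", "246126": "TXLNGY", "8287": "USP9Y",
    "7404": "UTY", "7544": "ZFY",
}

# (EntrezID, pval, fc) rows from the earlier comment
rows = [
    ("6192", 3.013055e-25, 5.829828),
    ("8284", 3.013055e-25, 5.683377),
    ("5616", 3.013055e-25, 5.566646),
    ("8653", 3.013055e-25, 5.436171),
    ("8287", 3.013055e-25, 5.414928),
    ("246126", 3.013055e-25, 5.140104),
    ("7404", 3.013055e-25, 4.902533),
    ("7544", 3.013055e-25, 3.396781),
    ("9086", 3.013055e-25, 2.624308),
    ("9087", 3.013055e-25, 1.204376),
]

# Replace each Entrez ID with its symbol; fall back to the ID if unmapped
annotated = [(ENTREZ_TO_SYMBOL.get(entrez, entrez), pval, fc) for entrez, pval, fc in rows]

for symbol, pval, fc in annotated:
    print(f"{symbol}\t{pval:.6e}\t{fc:.6f}")
```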

@swiri021
Contributor Author

As we expected, all of the genes are Y-linked.

@kicheolkim
Contributor

Sorry for the late reply. I was busy becoming a father this week :)
MS is more prevalent in women, and immune cells are strongly affected by gender. I used gender and age as covariates in my analysis.

@swiri021 swiri021 removed this from the Phase 0 milestone Nov 5, 2021
@swiri021 swiri021 added this to the Phase 1 milestone Nov 5, 2021
@swiri021
Contributor Author
