Cox coefficients documentation #54

Open
leonardosegurat opened this issue Nov 9, 2022 · 2 comments

Comments

@leonardosegurat

Is your feature request related to a problem? Please describe.

When the Cox regression coefficients are sent to a data table, only the beta terms are included and the h0 term is omitted, so the output is not enough to reproduce the results.

Since this was my first Cox regression, I did not know that the Cox risk score is calculated with a specific expression, and I naively calculated it as a multiple linear regression. The discrepancy led me to investigate. I found general documentation for Cox regression, but nothing in the widget's docs or by searching for Orange Data Mining, and the process was time-consuming.
Importantly, I found that there is an h0 term that is not included in the coefficients output, at least not that I could find, and I had to calculate it "by hand" on a spreadsheet by comparing my results with the widget's data output. Far from ideal.

Describe the solution you'd like

  1. Extend the Cox regression widget documentation to include the equation used to calculate the Cox risk score, or a reference to an explanation.
  2. Include the h0 term in the coefficients output, either as a separate output (Data / Coefficients / Constant) or within the coefficients output.

Describe alternatives you've considered

Additional context
The Cox risk score can be computed with the following equation: h(t) = h0(t) * exp(x1*b1 + x2*b2 + x3*b3 + ... + xn*bn)
where h0 indicates a "base risk" term, the x's correspond to the predictor features, and the b's correspond to their coefficients.
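
For illustration, here is a minimal sketch of the exponential part of that equation in plain Python. The function name `partial_hazard`, the coefficients and the feature values are all made up for this example; none of them come from Orange or from my data. It also shows that when two rows are compared, h0(t) cancels out of the ratio.

```python
import math

def partial_hazard(x, beta):
    """Return exp(x1*b1 + ... + xn*bn), the part of the Cox model without h0(t)."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

# Hypothetical coefficients and feature values, for illustration only.
beta = [0.5, -0.2]
patient_a = [1.0, 3.0]
patient_b = [0.0, 3.0]

score_a = partial_hazard(patient_a, beta)
score_b = partial_hazard(patient_b, beta)

# h(t) for each patient is h0(t) * score, so h0(t) cancels in the ratio
# and the hazard ratio depends only on the coefficients.
hazard_ratio = score_a / score_b
```

Here the two patients differ only in the first feature, so the ratio equals exp(0.5), independent of the baseline term.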

PS: If you're open to contributions, I'm willing to dedicate some time to researching and helping with documentation. I have no experience working with Open-Source Projects, and minimal coding experience. On the other hand, I do have a strong background in statistics as a Lean Six Sigma Black Belt, a taste for technology and a lot of admiration towards the Open-Source community.

@leonardosegurat leonardosegurat added the enhancement New feature or request label Nov 9, 2022
@JakaKokosar JakaKokosar self-assigned this Nov 11, 2022
@JakaKokosar
Member

Hey @leonardosegurat, I apologise for the late reply.

Orange uses the survival models implemented in the lifelines package. We recently updated the Cox regression widget to output not only the regression coefficients but also other statistics; look here.

[Screenshot 2022-11-23 at 12 29 49]

Indeed, there is no reason not to have an additional output channel for the estimated baseline hazard. I would imagine this would be a table with two columns: the first column is time and the second is the estimated baseline at that time point. Your thoughts?

As you noticed, the documentation is lacking and could use improvements. In Orange, the risk scores (sometimes referred to as the prognostic index) are the predicted partial hazards (the second part of the equation).
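
To make the two-column idea concrete, here is a rough sketch of how a time-indexed baseline table can be estimated with Breslow's method. This is plain Python written for illustration, not the code Orange or lifelines runs; the function name `breslow_baseline_hazard`, the toy data and the coefficient are all hypothetical. In lifelines itself, a fitted `CoxPHFitter` exposes `baseline_hazard_` and `predict_partial_hazard`, and as far as I recall its partial hazards are computed on mean-centered covariates, which this sketch does not do.

```python
import math
from collections import defaultdict

def breslow_baseline_hazard(times, events, X, beta):
    """Return [(time, baseline hazard)] with one row per distinct event time."""
    # Partial hazard exp(x . beta) for every subject.
    scores = [math.exp(sum(xi * bi for xi, bi in zip(row, beta))) for row in X]

    # Count observed events (not censored rows) at each time.
    deaths = defaultdict(int)
    for t, e in zip(times, events):
        if e:
            deaths[t] += 1

    table = []
    for t in sorted(deaths):
        # Risk set: everyone whose observed time is at least t.
        at_risk = sum(s for s, ti in zip(scores, times) if ti >= t)
        table.append((t, deaths[t] / at_risk))
    return table

# Hypothetical toy data: durations, event flags (1 = event, 0 = censored),
# one feature per subject, and a made-up coefficient.
times = [2.0, 4.0, 4.0, 7.0]
events = [1, 1, 0, 1]
X = [[1.0], [0.0], [1.0], [0.0]]
beta = [0.7]

baseline = breslow_baseline_hazard(times, events, X, beta)
```

The result has one row per distinct event time, which matches the table layout described above: time in the first column, estimated baseline in the second.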

@leonardosegurat
Author

Sounds good! Apologies for the late reply, and thanks for pointing me to the lifelines docs!

> I would imagine this would be a table with two columns; the first column is time and the second is the estimated baseline at that time point. Your thoughts?

As I understand it, risk scores / prognostic indexes are constant over time (at least in basic Cox regression), so the h0 output would be a single value that predicts survival over time and is scaled proportionally by whatever the right side of the formula resolves to.

I've got a few more suggestions, but I'd like to refine them a bit before opening a suggestion thread. For example, it'd be nice to have some sort of log-rank matrix when comparing cohorts in Kaplan-Meier plots, so that we can make pairwise comparisons rather than comparing all of the curves at once, or compare against the baseline curve (I'm using Select Rows for now). It would also be useful to have an option to plot the baseline survival curve along with the cohorts, to make comparisons. This can be accomplished with Edit Domain and concatenation, but it took me a while to get it working.
Perhaps this image will make the idea a bit clearer:
[image]
That's two low-risk curves (training and validation), the baseline curve (made with a 5th category and a copy of the whole dataset), and two high-risk curves (training and validation, again).

I'll be sure to open issues for these ideas once refined! (If they aren't being worked on already)

Thanks again, and keep up the good work!!
