Classification vs regression #2

Closed · TarekHC opened this issue Nov 30, 2020 · 21 comments

Comments

@TarekHC (Collaborator) commented Nov 30, 2020

We can use two approaches: multi-class classification and regression.

Multi-class classification:
The performance of most algorithms is quite poor (roughly 35-40% precision), but I generally chose algorithms that, when they mislabel an event, still land relatively close to the true type:

Each of these plots corresponds to a different energy bin (in log scale) and shows the confusion matrix of the classifier: the Y axis shows the true event types and the X axis the predicted ones.

[figure: confusion matrices of the classifier, one per energy bin]

As you can see, the "bad" events (event type 3) are generally well labeled across all energies, while the best events are more or less well labeled too. The intermediate event types seem rather random to me... But we will probably need to wait for the IRFs to see how good the separation really is. The best algorithm seems to be a one-vs-one ensemble of random forest classifiers.
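
For reference, a minimal sketch of such a one-vs-one random-forest setup with scikit-learn; the synthetic dataset here is only a stand-in, since the actual training variables are not listed in this thread:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

# Synthetic 4-class problem standing in for the real event-type data.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-vs-one ensemble of random forest classifiers.
clf = OneVsOneClassifier(RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(X_train, y_train)

# Confusion matrix: rows are true event types, columns the predicted ones.
cm = confusion_matrix(y_test, clf.predict(X_test))
```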

Regression:
Instead of just dividing the events into 4 groups, we can also try to estimate the expected angular difference between the true and reconstructed directions. For that, I used the same variables as in the previous step.

Following a similar approach as before, I show the true (Y) vs reconstructed (X) log10(angular difference):

[figure: true (Y) vs reconstructed (X) log10(angular difference)]

For the moment the best results are given by a Ridge linear regression, but I probably need to experiment more.
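
A minimal sketch of this kind of regression, assuming a scikit-learn Ridge model and synthetic stand-ins for both the reconstruction variables and the log10(angular difference) target:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic features stand in for the reconstruction variables; the target
# plays the role of log10(angular difference).
X, log_ang_diff = make_regression(n_samples=4000, n_features=10, noise=10.0,
                                  random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, log_ang_diff,
                                                    random_state=0)

reg = Ridge(alpha=1.0).fit(X_train, y_train)
pred_log_ang_diff = reg.predict(X_test)  # reconstructed log10(ang. diff.)
```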

The good thing about performing a regression is that we can decide the statistics falling into each event type during IRF production, whereas with classification we can only control the training statistics. I have not yet compared which method provides better classifications, but that will be trivial to do.

@orelgueta (Collaborator) commented

Decided to go with regression for now. Working on the performance (see #4).

@TarekHC (Collaborator, Author) commented Jan 7, 2021

Yes, makes sense to me. Although we should probably find a way to create a direct comparison between classification and regression. For instance, we could rank the test sample into the N event types for each regression algorithm applied. This way we can evaluate performance exactly the same way for classification and regression.
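
A possible sketch of this ranking idea, assuming equal-statistics quantile bins (pandas qcut) and toy arrays in place of the real regressor output:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.normal(size=5000)                      # true log10(ang. diff.)
y_pred = y_true + rng.normal(scale=0.5, size=5000)  # toy regressor output

n_types = 4
# Rank into N event types with equal statistics, by quantiles of each quantity.
true_type = pd.qcut(y_true, q=n_types, labels=False)
pred_type = pd.qcut(y_pred, q=n_types, labels=False)

# Now the regressor can be scored exactly like a classifier.
cm = confusion_matrix(true_type, pred_type)
```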

@orelgueta (Collaborator) commented

The way I understood it, the choice of regression was motivated by the freedom to divide the events after training, not by performance. I actually assume that we will be able to get slightly better performance with classification. Let's discuss this later and decide if there's a reason to go back and look at classification.

@TarekHC (Collaborator, Author) commented Jan 11, 2021

Yes, the reason to go for regression is clear. But I would find it very informative to show how much better or worse regression will be. If there were an enormous improvement with classification, then not having such control over the test statistics would not be such a big deal...

@orelgueta (Collaborator) commented

Yeah, I agree: if the performance is significantly better for classification, it might be worth it. We will keep this on the to-do list then. However, I would first like to solve all of the pending issues so we can get a good estimate of the regression performance (I am sure it will help for classification as well).

@orelgueta (Collaborator) commented

Testing the regression performance for classification yields the confusion matrix plots below: three plots, for 2, 3 and 4 event-type partitions. When partitioning into event types, the sample is divided based on the true angular difference so as to have equal statistics in each sub-sample.
This is all for on-source gammas, only the test sample.
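
A sketch of this equal-statistics partitioning and the per-energy-bin confusion matrices, with toy arrays standing in for the actual energies and angular differences:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n = 20000
energy = 10 ** rng.uniform(-1.5, 2.5, size=n)           # stand-in energies [TeV]
true_err = rng.lognormal(size=n)                        # true angular difference
pred_err = true_err * rng.lognormal(sigma=0.3, size=n)  # regressed estimate

n_types = 3
# Equal statistics in each sub-sample: cut at quantiles of the angular error.
true_type = pd.qcut(true_err, q=n_types, labels=False)
pred_type = pd.qcut(pred_err, q=n_types, labels=False)

# One confusion matrix per logarithmic energy bin, as in the plots below.
energy_bins = np.logspace(-1.5, 2.5, 9)
for lo, hi in zip(energy_bins[:-1], energy_bins[1:]):
    mask = (energy >= lo) & (energy < hi)
    cm = confusion_matrix(true_type[mask], pred_type[mask])
```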

Based on these results, I think it is fairly safe to say that 4 classes are too many. Maybe 3 classes are OK. We need to see what happens when we add all the samples (gamma-cone, electrons and protons).

In terms of classification vs regression, we need to run the classification again, but regression doesn't look too bad at the moment compared to the results in the first comment (hard to compare, of course).

[figures: confusion matrices for 2, 3 and 4 event types (All_confusion_matrix_n_types_2/3/4)]

@TarekHC (Collaborator, Author) commented Jan 22, 2021

Hi Orel. The plots look really really good! The performance seems way more than enough... So I'm really hoping the IRFs will look super different.

One suggestion: perhaps convert these to row-wise %? Meaning that in a row, 70% are correct, 15% not so bad, 7% wrong, etc...

To be fair in the classification vs regression comparison, we should also devote a bit of time to optimizing the classification exercise. I would definitely not spend a lot of time on it, but if you feel like playing with ML some more... it could be fun.

I'll open one issue now!

@orelgueta (Collaborator) commented Jan 23, 2021

I am not as impressed by the performance, to be honest, and I wonder how this will translate when looking at the IRFs, but we'll see.

> One suggestion: perhaps convert these to row-wise %? Meaning that in a row, 70% are correct, 15% not so bad, 7% wrong, etc...

Good idea, see corresponding plots below. (Still need to make the size change dynamically with the number of types, but it's a small detail.)

> To be fair in the classification vs regression comparison, we should also devote a bit of time to optimizing the classification exercise. I would definitely not spend a lot of time on it, but if you feel like playing with ML some more... it could be fun.

Yeah, I definitely have it already on the to-do list.

[figures: row-normalized confusion matrices for 2, 3 and 4 event types (All_1d_confusion_matrix_n_types_2/3/4)]
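
For reference, scikit-learn can produce this row-wise normalization directly; a toy sketch:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2, 2]  # toy true event types
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]  # toy predicted event types

# normalize="true" divides each row by its total, so every row sums to 1 and
# entries read as the fraction of each true type assigned to each predicted type.
cm_rowwise = confusion_matrix(y_true, y_pred, normalize="true")
```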

@TarekHC (Collaborator, Author) commented Jan 25, 2021

I am actually positively impressed by the performance! :-)

Regarding the plots, I was thinking of producing the same N-by-N plots with percentages instead of numbers of events, but the plots you produced are also super informative.

We will leave this issue open to show the performance of classification algorithms in the future.

@orelgueta (Collaborator) commented

Please see below the updated confusion matrices, after fixing the mistake mentioned in #4: the event-type bins are now defined based on the reconstructed angular error rather than the true one.
The differences compared to the previous results are not big, but it's always better to do things correctly.

[figures: updated confusion matrices for 2, 3 and 4 event types (All_confusion_matrix_n_types_2/3/4)]

@orelgueta (Collaborator) commented

Same as the previous plots, but this time the one-dimensional versions.

[figures: updated 1D confusion matrices for 2, 3 and 4 event types (All_1d_confusion_matrix_n_types_2/3/4)]

@TarekHC (Collaborator, Author) commented Jan 29, 2021

Hi Orel,

Interesting!

> The differences compared to the previous results are not big, but it's always better to do things correctly.

Actually, I think there is a very clear effect on the intermediate types (they now have a significantly worse determination across all energies), which is much more consistent with my previous tests (I was very surprised at how good your classification was in those!).

In any case, it is not a showstopper at all: it just means that we are very good at identifying good and bad events, and it might simply mean that the event types we define in the future are "very good", "very bad" and "average performance", not really requiring equal statistics.

@orelgueta (Collaborator) commented

> Actually, I think there is a very clear effect on the intermediate types (they now have a significantly worse determination across all energies), which is much more consistent with my previous tests (I was very surprised at how good your classification was in those!).

I wasn't referring to the 3- and 4-type cases because I am not sure it makes sense to define an "average performance" type if we can't classify events into it well. We can discuss later, of course, and maybe it will improve as we improve performance.

What's important now is that it looks like regression is not significantly worse than classification.

@orelgueta (Collaborator) commented

Latest results of the regression using the Prod5 sample (see #4 for more details, starting from the first comment mentioning Prod5 results).
Based on the results below, I still think we should define only two event types. I will also try classification with the new sample, but I doubt we will get significantly better results.

[figures: Prod5 confusion matrices for 2, 3 and 4 event types (All_confusion_matrix_n_types_2/3/4)]

@TarekHC (Collaborator, Author) commented Feb 3, 2021

Hi Orel,

Regarding defining only two event types, I'm not sure that would be the best approach, mainly due to statistics.

In the 4-type classification you show, 25% of the best/worst events are very well classified. This means that if we select just 2 event types (each with 50% of the signal), we would be "dirtying" the great resolution of those very good events with others that we know are not as good.

Maybe, as we have already discussed in the past, what we are seeing here is that we need 3 event types with uneven statistics (a possible split is sketched after the list):

  • Good events (the top e.g. 20% of events, which would provide excellent resolution)
  • Average events (most of the events, without a great classification, and average performance)
  • Bad events (the bottom 20%, only used for the few science cases in which one is searching for a signal and needs maximum effective area)
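
A possible sketch of such an uneven 20/60/20 split, cutting at quantiles of a (hypothetical) predicted angular error; the fractions are the illustrative numbers from this comment, not a tuned choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pred_err = rng.lognormal(size=10000)  # stand-in predicted angular error

# Cut at the 20% and 80% quantiles: type 0 = good (best 20%),
# type 1 = average (middle 60%), type 2 = bad (worst 20%).
event_type = pd.qcut(pred_err, q=[0.0, 0.2, 0.8, 1.0], labels=False)
```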

I feel we need to start calculating IRFs to be able to answer this question. For example, if we are able to improve the resolution by selecting just the top 10% of events, then it could definitely be worth it. If the resolution does not really improve when going from 20% to 10%, then it makes no sense to go further than that...

@orelgueta (Collaborator) commented

Yes, I agree that calculating the IRFs would be the way to go here to make a decision. Also, I will try making the confusion matrices for different partitionings of the events instead of the equal-statistics one.

@orelgueta (Collaborator) commented

Going back to the main topic of this issue: which is better, classification or regression?
I think the easiest way to test this is to look at the 1D confusion matrices for 3 event types below (top is regression, marked as "All", and bottom is classification).
It looks like classification can be 1-2% better in some energy bins at getting the correct classification. However, it is also 1-2% more likely to classify an event two types "off" from the true type. I think we can make our decision based on this, close this issue and move forward with regression. @TarekHC Thoughts?

[figures: 1D confusion matrices for 3 event types: regression (All_1d_confusion_matrix_n_types_3) and classification (MLP_small_classifier_ntypes_3_1d_confusion_matrix_n_types_3)]

@TarekHC (Collaborator, Author) commented Feb 4, 2021

Hi Orel,

These plots show that their performance is very comparable... So it seems reasonable to assume that the benefits provided by the regression approach (flexibility in deciding the amount of statistics for each event type at the IRF-calculation stage) are more relevant than a 1-2% improvement in classification.

Although what I would show in the paper is probably just an accuracy vs energy plot comparing our best classifier with the best regressor (what you have here, but as a classical 1D plot). Maybe we could add the "2 off" lines just for reference...
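
A rough sketch of such an accuracy-vs-energy plot with matplotlib, using toy event types in place of the actual classifier and regressor outputs:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 20000
energy = 10 ** rng.uniform(-1.5, 2.5, size=n)  # stand-in energies [TeV]
true_type = rng.integers(0, 3, size=n)
# Toy predictions with slightly different correct-classification rates.
reg_type = np.where(rng.random(n) < 0.55, true_type, rng.integers(0, 3, size=n))
clf_type = np.where(rng.random(n) < 0.57, true_type, rng.integers(0, 3, size=n))

energy_bins = np.logspace(-1.5, 2.5, 9)
centers = np.sqrt(energy_bins[:-1] * energy_bins[1:])

def accuracy_per_bin(true_t, pred_t, e, bins):
    """Fraction of correctly typed events in each energy bin."""
    acc = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (e >= lo) & (e < hi)
        acc.append(np.mean(true_t[m] == pred_t[m]))
    return np.array(acc)

plt.semilogx(centers, accuracy_per_bin(true_type, reg_type, energy, energy_bins),
             label="best regressor")
plt.semilogx(centers, accuracy_per_bin(true_type, clf_type, energy, energy_bins),
             label="best classifier")
plt.xlabel("Energy [TeV]")
plt.ylabel("Fraction of correctly typed events")
plt.legend()
plt.show()
```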

But again, these results seem pretty solid: classifiers don't seem to show a very significant improvement in performance over regressors (which is what we wanted!). :)

@TarekHC (Collaborator, Author) commented Feb 4, 2021

By the way, this could actually be a plot to add to the proceedings. It's a bit technical, but it's a good test that any reviewer would ask for.

@orelgueta (Collaborator) commented

Sure, I can make this plot for the paper, no problem. Considering the space limitations of the proceedings, I'm not sure it will go there as well, but we'll see when the time comes.

OK, so regression it is. Maybe we can even close this issue.

@TarekHC (Collaborator, Author) commented Feb 4, 2021

Yep, this is definitely good enough!

@TarekHC closed this as completed Feb 4, 2021