Feature selection #1

Open
TarekHC opened this issue Nov 30, 2020 · 11 comments
@TarekHC (Collaborator) commented Nov 30, 2020

The first thing I did was select the parameters that best separate the event types (from the PSF class, dividing the events into the 4 quartiles of the angular difference between the true and reconstructed directions).
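For concreteness, a minimal sketch of this quartile-based labelling, assuming a pandas DataFrame `df` with a hypothetical `ang_diff` column holding the angular difference between true and reconstructed direction (names are illustrative, not the actual ones in the code):

```python
# Sketch only: label each event by the quartile of its angular
# difference between true and reconstructed direction.
# "ang_diff" is a hypothetical column name.
import pandas as pd

df["event_type"] = pd.qcut(df["ang_diff"], q=4, labels=[1, 2, 3, 4])
```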

[image]

As I define the event types as a function of the reconstructed energy, I chose the following variables for the training:

log_reco_energy = log10 of the reconstructed energy
log_NTels_reco = log10 of the number of telescopes used
array_distance = distance to the array center
img2_ang = not sure how it is defined... the angle between the showers of the second-brightest telescope pair? No clue...
log_SizeSecondMax = log10 of the size of the second-brightest image

@orelgueta (Collaborator) commented:
As we already discussed, feature selection should be done per energy bin. It can actually be done "automatically" as part of a pipeline (see https://scikit-learn.org/stable/modules/feature_selection.html).
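For reference, a minimal sketch of what such a pipeline could look like with scikit-learn, fitted independently per energy bin; the `train_sets` mapping, feature-count `k`, and MLP hyperparameters are illustrative assumptions, not our actual configuration:

```python
# Sketch only: univariate feature selection inside a pipeline,
# one independently fitted model per energy bin.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.neural_network import MLPRegressor

def make_pipeline(k=5):
    return Pipeline([
        ("scale", StandardScaler()),
        # Keep the k features with the highest mutual information
        # with the regression target, chosen per energy bin.
        ("select", SelectKBest(mutual_info_regression, k=k)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000)),
    ])

# "train_sets" maps each energy bin to its (X, y) training data
# (hypothetical data structure).
models = {e_bin: make_pipeline().fit(X, y)
          for e_bin, (X, y) in train_sets.items()}
```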

In the meantime, I use a long list of variables in all energy bins (it should be extended even further). I paste below the distributions of these variables for two event types in the 0.200 < E < 0.258 TeV energy bin.
None of the variables looks too promising, and indeed the scores we get aren't great. We need further study, more variables, new ideas and discussions...

[images]

@orelgueta (Collaborator) commented:

For completeness, I paste here the plots of all of the variables currently used in the regression, including the new ones leading to the improvement reported in #4.
Note that these plots are not that useful: to get a real idea of the separation between the classes, one would have to look at a multidimensional plot or at all the different combinations of plots. Still, they give some idea.

[images]

@orelgueta (Collaborator) commented:

A better way to choose the features is by looking at feature-importance plots. Unfortunately, those are not provided out of the box for our best regressor (MLP), so I paste below the random-forest ones for a few energy bins. The performance of the random forest isn't great, as can be seen in #4, but perhaps this still provides some idea.
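For reference, a minimal sketch of how such plots can be produced; the fitted model `rf` and the `feature_names` list are assumptions:

```python
# Sketch only: plot the impurity-based feature importances of a
# fitted random forest for one energy bin.
import numpy as np
import matplotlib.pyplot as plt

importances = rf.feature_importances_  # one value per input feature
order = np.argsort(importances)        # ascending, so barh puts the
                                       # most important bar on top
plt.barh(np.array(feature_names)[order], importances[order])
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```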

[images]

@TarekHC (Collaborator, Author) commented Jan 7, 2021

Yes, I agree: feature importance is the way to decide how many features we add.

Just to be sure: are you certain all the variables listed here are reconstructed quantities? Because if we made the mistake of adding one or two true quantities, that could explain the performance improvement...

@orelgueta (Collaborator) commented:

I am not sure of anything. However, based on the names and distributions of the parameters I added, I don't think any of them is a true quantity. Also, none of them provides direct information on the direction (unlike the camera_offset), so I am a bit more confident that the result is OK. I still need to do a careful study of where this improvement came from. Then I can give you a more educated answer.

@orelgueta (Collaborator) commented Jan 8, 2021

To facilitate feature selection, I opened a new branch called "study_features" where I changed the structure slightly. I used it to study which of the variables I added is responsible for the big improvement seen in #4. From the plot below it is fairly clear that the average cross provided the improvement (the plot shows the scores obtained with all variables except the one in the legend). This makes sense, since the cross values are also used in the training.
Additional studies to follow.
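The leave-one-out ablation itself is straightforward to sketch, reusing the hypothetical `make_pipeline` from the earlier sketch; the train/test splits are assumptions:

```python
# Sketch only: retrain while dropping one feature at a time and
# compare the score to the all-features baseline; a large drop
# flags an important feature (e.g. av_cross in the plot below).
baseline = make_pipeline().fit(X_train, y_train).score(X_test, y_test)

drop_scores = {}
for feature in X_train.columns:
    model = make_pipeline().fit(X_train.drop(columns=feature), y_train)
    drop_scores[feature] = model.score(X_test.drop(columns=feature), y_test)

for feature, score in sorted(drop_scores.items(), key=lambda kv: kv[1]):
    print(f"{feature:>25s}: {score:.3f} (baseline {baseline:.3f})")
```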

[image: compare_scores]

@orelgueta (Collaborator) commented Jan 11, 2021

I produced plots similar to the one above to test all of the variables; see below. Once again, each plot shows the score obtained using all variables as inputs except the one mentioned in the legend for that curve.
These plots are not that informative, but they can help guide us in feature selection (I will work on that next).

[images: compare_scores_1 through compare_scores_5]

@orelgueta (Collaborator) commented:

Trying to study the current features a bit further, I made a few lists of the most promising features based on the plots in the comment above. I then trained with them and compared the results to training with all of the features (a sketch of such a comparison follows the lists below). The results are shown in the plot below. What I take from this is that we need to start choosing features separately for each energy bin. Also, it looks like using all the features is probably the best option for now.

features_1 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross']
features_2 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross', 'MWR', 'MLR']
features_3 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross', 'MSCW', 'MSCL']
features_4 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross', 'log_EmissionHeightChi2', 'log_DispDiff']
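A minimal sketch of this comparison, assuming the same hypothetical `make_pipeline` and train/test splits as in the earlier sketches:

```python
# Sketch only: train with each candidate feature list and with all
# features, and compare the resulting scores.
feature_sets = {
    "all": list(X_train.columns),
    "features_1": features_1,
    "features_2": features_2,
    "features_3": features_3,
    "features_4": features_4,
}
for name, cols in feature_sets.items():
    model = make_pipeline().fit(X_train[cols], y_train)
    print(name, model.score(X_test[cols], y_test))
```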

[image: scores_features_1]

@TarekHC (Collaborator, Author) commented Jan 25, 2021

Hi Orel,

Were you able to get similar performance (not better, but at least "close") with any tree-based method (BDTs, RFs...)? That would give us a much more direct way to assess feature importance.

Unfortunately, if BDTs don't reach comparably good performance, they won't be that informative...

@orelgueta (Collaborator) commented:

As you can see in the last plot in #4, RF is not really close, and BDT is not better either. I added it specifically because I wanted to test the feature importance, but I am not sure it would be very informative.
I did learn quite a bit from the studies above, though. The conclusion so far is that if efficiency is not an issue, I will keep using all of the available variables. If it becomes an issue, I have a small list of variables with which we reach almost equivalent performance.
Also, the next item on my to-do list is to apply automatic feature selection per energy bin, which might help in terms of efficiency (less so in terms of performance, I think).

BTW, the other reason I would like to avoid tree-based methods is that saving them to disk takes A LOT of space. While the MLP takes 3 MB for all energies, the trees take a few GBs.
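For what it's worth, compressed persistence helps a little but doesn't change the overall picture; a minimal sketch, assuming joblib and the hypothetical per-energy-bin `models` dict from above:

```python
# Sketch only: persist fitted models with compression; tree ensembles
# remain orders of magnitude larger than the MLP even when compressed.
import joblib

for e_bin, model in models.items():
    joblib.dump(model, f"model_{e_bin}.joblib", compress=3)
```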

@orelgueta (Collaborator) commented:

Added an additional training feature (see #44). Essentially, it is the difference between two position reconstruction methods, simple intersection and DISP. Naively, one would expect the two methods to give quite different results for "bad" events and similar results for "good" events. From the plot below it seems this indeed helps and improves the score (blue is without the new feature, red is with it; the training is for all off-axis angles using MLP_tanh).

Therefore, I decided to keep this as a nominal feature.
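The idea behind the feature can be sketched as the angular distance between the two reconstructed directions; the column names below are hypothetical, not the actual ones from #44:

```python
# Sketch only: great-circle separation between the simple-intersection
# and DISP direction reconstructions; small for "good" events, large
# for "bad" ones. All inputs are in radians.
import numpy as np

def angular_separation(az1, alt1, az2, alt2):
    return np.arccos(
        np.clip(
            np.sin(alt1) * np.sin(alt2)
            + np.cos(alt1) * np.cos(alt2) * np.cos(az1 - az2),
            -1.0, 1.0,
        )
    )

df["disp_diff"] = angular_separation(
    df["az_intersect"], df["alt_intersect"], df["az_disp"], df["alt_disp"]
)
```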

[image: scores_features_1]

orelgueta pushed a commit that referenced this issue Sep 15, 2022