Feature selection #1

Open
TarekHC opened this issue Nov 30, 2020 · 11 comments
@TarekHC (Collaborator) commented Nov 30, 2020

The first thing I did was select the parameters that best separate the event types (from the PSF class, dividing the events into the 4 quartiles of the angular difference between the true and reconstructed directions).
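For concreteness, a minimal sketch of this quartile-based labelling, assuming a pandas DataFrame `df` with a hypothetical `ang_diff` column holding the angular difference between true and reconstructed direction (names are illustrative, not the actual ones in the code):

```python
# Sketch only: label each event by the quartile of its angular
# difference between true and reconstructed direction.
# "ang_diff" is a hypothetical column name.
import pandas as pd

df["event_type"] = pd.qcut(df["ang_diff"], q=4, labels=[1, 2, 3, 4])
```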

[image]

As I define the event types as a function of the reconstructed energy, I chose the following variables for the training:

log_reco_energy = log10 of the reconstructed energy
log_NTels_reco = log10 of the number of telescopes used
array_distance = distance to the array center
img2_ang = not sure how it is defined... the angle between the showers of the second-brightest telescope pair? No clue...
log_SizeSecondMax = log10 of the size of the second-brightest image

@orelgueta (Collaborator) commented:
As we already discussed, feature selection should be done per energy bin. It can actually be done "automatically" as part of a pipeline (see https://scikit-learn.org/stable/modules/feature_selection.html).
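For reference, a minimal sketch of what such a pipeline could look like with scikit-learn, fitted independently per energy bin; the `train_sets` mapping, feature-count `k`, and MLP hyperparameters are illustrative assumptions, not our actual configuration:

```python
# Sketch only: univariate feature selection inside a pipeline,
# one independently fitted model per energy bin.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.neural_network import MLPRegressor

def make_pipeline(k=5):
    return Pipeline([
        ("scale", StandardScaler()),
        # Keep the k features with the highest mutual information
        # with the regression target, chosen per energy bin.
        ("select", SelectKBest(mutual_info_regression, k=k)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000)),
    ])

# "train_sets" maps each energy bin to its (X, y) training data
# (hypothetical data structure).
models = {e_bin: make_pipeline().fit(X, y)
          for e_bin, (X, y) in train_sets.items()}
```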

In the meantime, I use a long list of variables in all energy bins (it should be extended even further). I paste below the distributions of these variables for two event types in the 0.200 < E < 0.258 TeV energy bin.
None of the variables looks too promising, and indeed the scores we get aren't great. We need further study, more variables, new ideas and discussions...

[images]

@orelgueta (Collaborator) commented:

For completeness, I paste here the plots of all of the variables currently used in the regression, including the new ones leading to the improvement reported in #4.
Note that these plots are not that useful: to get a real idea of the separation between the classes, one would have to look at a multidimensional plot or at all the different combinations of plots. Still, they give some idea.

[images]

@orelgueta (Collaborator) commented:

A better way to choose the features is by looking at feature-importance plots. Unfortunately, those are not provided out of the box for our best regressor (MLP), so I paste below the random-forest ones for a few energy bins. The performance of the random forest isn't great, as can be seen in #4, but perhaps this still provides some idea.
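For reference, a minimal sketch of how such plots can be produced; the fitted model `rf` and the `feature_names` list are assumptions:

```python
# Sketch only: plot the impurity-based feature importances of a
# fitted random forest for one energy bin.
import numpy as np
import matplotlib.pyplot as plt

importances = rf.feature_importances_  # one value per input feature
order = np.argsort(importances)        # ascending, so barh puts the
                                       # most important bar on top
plt.barh(np.array(feature_names)[order], importances[order])
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```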

[images]

@TarekHC (Collaborator, Author) commented Jan 7, 2021

Yes, I agree: feature importance is the way to decide how many features we add.

Just to be sure: are you certain all the variables listed here are reconstructed quantities? Because if we made the mistake of adding one or two true quantities, that could explain the performance improvement...

@orelgueta (Collaborator) commented:

I am not sure of anything. However, based on the names and distributions of the parameters I added, I don't think any of them is a true quantity. Also, none of them provides direct information on the direction (unlike the camera_offset), so I am a bit more confident that the result is OK. I still need to do a careful study of where this improvement came from. Then I can give you a more educated answer.

@orelgueta (Collaborator) commented Jan 8, 2021

To facilitate feature selection, I opened a new branch called "study_features" where I changed the structure slightly. I used it to study which of the variables I added is responsible for the big improvement seen in #4. From the plot below it is fairly clear that the average cross provided the improvement (the plot shows the scores obtained with all variables except the one in the legend). This makes sense, since the cross values are also used in the training.
Additional studies to follow.
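The leave-one-out ablation itself is straightforward to sketch, reusing the hypothetical `make_pipeline` from the earlier sketch; the train/test splits are assumptions:

```python
# Sketch only: retrain while dropping one feature at a time and
# compare the score to the all-features baseline; a large drop
# flags an important feature (e.g. av_cross in the plot below).
baseline = make_pipeline().fit(X_train, y_train).score(X_test, y_test)

drop_scores = {}
for feature in X_train.columns:
    model = make_pipeline().fit(X_train.drop(columns=feature), y_train)
    drop_scores[feature] = model.score(X_test.drop(columns=feature), y_test)

for feature, score in sorted(drop_scores.items(), key=lambda kv: kv[1]):
    print(f"{feature:>25s}: {score:.3f} (baseline {baseline:.3f})")
```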

[image: compare_scores]

@orelgueta (Collaborator) commented Jan 11, 2021

I produced plots similar to the one above to test all of the variables; see below. Once again, each plot shows the score obtained using all variables as inputs except the one mentioned in the legend for that curve.
These plots are not that informative, but they can help guide us in feature selection (I will work on that next).

[images: compare_scores_1 through compare_scores_5]

@orelgueta (Collaborator) commented:

Trying to study the current features a bit further, I made a few lists of the most promising features based on the plots in the comment above. I then trained with them and compared the results to training with all of the features (a sketch of such a comparison follows the lists below). The results are shown in the plot below. What I take from this is that we need to start choosing features separately for each energy bin. Also, it looks like using all the features is probably the best option for now.

features_1 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross']
features_2 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross', 'MWR', 'MLR']
features_3 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross', 'MSCW', 'MSCL']
features_4 = ['img2_ang', 'log_SizeSecondMax', 'log_EmissionHeight', 'av_dist', 'av_cross', 'log_EmissionHeightChi2', 'log_DispDiff']
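A minimal sketch of this comparison, assuming the same hypothetical `make_pipeline` and train/test splits as in the earlier sketches:

```python
# Sketch only: train with each candidate feature list and with all
# features, and compare the resulting scores.
feature_sets = {
    "all": list(X_train.columns),
    "features_1": features_1,
    "features_2": features_2,
    "features_3": features_3,
    "features_4": features_4,
}
for name, cols in feature_sets.items():
    model = make_pipeline().fit(X_train[cols], y_train)
    print(name, model.score(X_test[cols], y_test))
```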

[image: scores_features_1]

@TarekHC (Collaborator, Author) commented Jan 25, 2021

Hi Orel,

Were you able to get similar performance (not better, but at least "close") with any tree-based method (BDTs, RFs...)? That would give us a much more direct way to assess feature importance.

Unfortunately, if BDTs don't reach comparably good performance, they won't be that informative...

@orelgueta (Collaborator) commented:

As you can see in the last plot in #4, RF is not really close, and BDT is not better either. I added it specifically because I wanted to test the feature importance, but I am not sure it would be very informative.
I did learn quite a bit from the studies above, though. The conclusion so far is that if efficiency is not an issue, I will keep using all of the available variables. If it becomes an issue, I have a small list of variables with which we reach almost equivalent performance.
Also, the next item on my to-do list is to apply automatic feature selection per energy bin, which might help in terms of efficiency (less so in terms of performance, I think).

BTW, the other reason I would like to avoid tree-based methods is that saving them to disk takes A LOT of space. While the MLP takes 3 MB for all energies, the trees take a few GBs.
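For what it's worth, compressed persistence helps a little but doesn't change the overall picture; a minimal sketch, assuming joblib and the hypothetical per-energy-bin `models` dict from above:

```python
# Sketch only: persist fitted models with compression; tree ensembles
# remain orders of magnitude larger than the MLP even when compressed.
import joblib

for e_bin, model in models.items():
    joblib.dump(model, f"model_{e_bin}.joblib", compress=3)
```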

@orelgueta (Collaborator) commented:

Added an additional training feature (see #44). Essentially, it is the difference between two position reconstruction methods, simple intersection and DISP. Naively, one would expect the two methods to give quite different results for "bad" events and similar results for "good" events. From the plot below it seems this indeed helps and improves the score (blue is without the new feature, red is with it; the training is for all off-axis angles using MLP_tanh).

Therefore, I decided to keep this as a nominal feature.
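The idea behind the feature can be sketched as the angular distance between the two reconstructed directions; the column names below are hypothetical, not the actual ones from #44:

```python
# Sketch only: great-circle separation between the simple-intersection
# and DISP direction reconstructions; small for "good" events, large
# for "bad" ones. All inputs are in radians.
import numpy as np

def angular_separation(az1, alt1, az2, alt2):
    return np.arccos(
        np.clip(
            np.sin(alt1) * np.sin(alt2)
            + np.cos(alt1) * np.cos(alt2) * np.cos(az1 - az2),
            -1.0, 1.0,
        )
    )

df["disp_diff"] = angular_separation(
    df["az_intersect"], df["alt_intersect"], df["az_disp"], df["alt_disp"]
)
```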

[image: scores_features_1]

orelgueta pushed a commit that referenced this issue Sep 15, 2022