Feature selection #1
As we already discussed, feature selection should be done per energy bin. It can actually be done "automatically" as part of a pipeline (see https://scikit-learn.org/stable/modules/feature_selection.html). In the meantime, I use a long list of variables in all energy bins (a list that should be extended even further). I paste below the distributions of these variables for two event types in the 0.200 < E < 0.258 TeV energy bin.
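To illustrate the "automatic" option, here is a minimal sketch of feature selection inside a scikit-learn pipeline, which one would fit separately for each energy bin. The data and the choice of `SelectKBest` with `f_regression` are illustrative assumptions, not our actual setup:

```python
# Sketch: univariate feature selection as a pipeline step, fit per energy bin.
# The data, k=5, and the scorer are hypothetical stand-ins.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # stand-in for the DL2 variables of one energy bin
y = X[:, 0] + 0.1 * rng.normal(size=200)  # only the first column carries signal here

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=5)),  # keep the 5 best-scoring features
    ("mlp", MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=0)),
])
pipe.fit(X, y)

# which columns survived the selection step
mask = pipe.named_steps["select"].get_support()
print(mask.sum())
```

The advantage is that the selection is refit automatically whenever the bin's training data changes, so no manual per-bin lists are needed.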
For completeness, I paste here the plots of all the variables currently used in the regression, including the new ones leading to the improvement reported in #4.
A better way to choose the features is by looking at feature-importance plots. Unfortunately, those are not provided out of the box for our best regressor (MLP), so I paste below the random forest importances for a few energy bins. The performance of the random forest isn't great, as can be seen in #4, but perhaps this still provides some idea.
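For reference, extracting and ranking the built-in importances from a random forest is a one-liner; the feature names and toy data below are hypothetical:

```python
# Sketch: feature-importance ranking from a random forest, used as a proxy
# for the MLP (which exposes no importances). Names/data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
names = ["log_reco_energy", "log_NTels_reco", "array_distance", "img2_ang"]
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name:16s} {imp:.3f}")
```

The importances sum to 1, so they give a relative (not absolute) picture, and they are only as trustworthy as the forest's own performance.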
Yes, I agree: feature importance is the way to decide on how many features we add. Just to be sure: are you sure all variables listed here are reconstructed quantities? Because if we made the mistake of adding one or two true quantities, then it could explain the performance improvement...
I am not sure of anything. However, based on the names and distributions of the parameters I added, I don't think any of them is a true quantity. Also, none of them provides direct information on the direction (unlike the camera_offset), so I am a bit more confident that the result is OK. I still need to do a careful study of where this improvement came from. Then I can give you a more educated answer.
To facilitate feature selection, I opened a new branch called "study_features" where I changed the structure slightly. I used it to study which of the variables I added is responsible for the big improvement seen in #4. From the plot below it is fairly clear that the average cross provided the improvement (the plot shows the scores with all variables except the one in the legend). This makes sense, since the cross values are also used in the training.
I produced similar plots to the one above to test all the variables, see plots below. Once again, each plot shows the score obtained using all variables as inputs except the one mentioned in the legend for that curve.
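This leave-one-feature-out procedure can be scripted generically. A minimal sketch, with a hypothetical estimator (`Ridge`), toy data, and invented feature names standing in for our real ones:

```python
# Sketch: leave-one-feature-out ablation -- retrain without each feature
# and compare cross-validated scores. Estimator/data/names are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
names = ["camera_offset", "av_cross", "log_SizeSecondMax"]
X = rng.normal(size=(200, 3))
y = X[:, 1] + 0.1 * rng.normal(size=200)  # here "av_cross" carries the signal

full = cross_val_score(Ridge(), X, y, cv=5).mean()
for i, name in enumerate(names):
    reduced = cross_val_score(Ridge(), np.delete(X, i, axis=1), y, cv=5).mean()
    print(f"without {name:18s}: {reduced:.3f}  (full: {full:.3f})")
```

A large score drop when a feature is removed marks it as important; a drop near zero means the feature is redundant (or its information is duplicated by another feature, which this test cannot distinguish).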
Trying to study the current features a bit further, I made a few lists of the most promising features based on the plots in the comment above. I then trained with them and compared to using all the features. The results are shown in the plot below. What I take from this is that we need to start choosing features separately for each energy bin. Also, it looks like using all the features is probably the best for now.
Hi Orel, Were you able to get similar performance (not better, but at least "close") with any tree-based method? (BDTs, RFs...). That would give us a much more direct way to assess feature importance. Unfortunately, if BDTs don't reach that great performance, then they won't be that informative...
As you can see in the last plot in #4, RF is not really close. BDT is not better either. I added it mainly because I wanted to test the feature importance, but I am not sure it would be very informative. BTW, the other reason why I would like to avoid tree-based methods is that saving them to disk takes A LOT of space. If MLP takes 3 MB for all energies, the trees take a few GB.
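The size gap is easy to reproduce, and joblib compression shrinks (but does not eliminate) it. A sketch with synthetic data, so the exact byte counts are illustrative only:

```python
# Sketch: on-disk size of an MLP vs. a random forest, with and without
# joblib compression. Data is synthetic; sizes are illustrative.
import os
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X.sum(axis=1)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=300, random_state=0).fit(X, y)

joblib.dump(rf, "rf.joblib")
joblib.dump(rf, "rf_c.joblib", compress=3)  # zlib-compressed pickle
joblib.dump(mlp, "mlp.joblib")
for f in ("rf.joblib", "rf_c.joblib", "mlp.joblib"):
    print(f, os.path.getsize(f), "bytes")
```

The forest stores every node of every tree, so its size grows with the training-set size; the MLP stores a fixed number of weights regardless of how much data it saw.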
Added an additional training feature (see #44). Essentially it is the difference between two position reconstruction methods, simple intersection and DISP. Naively, one would think that for "bad" events the two methods would provide quite different results, while for "good" events they would be similar. From the plot below it seems this indeed helps and improves the score (blue is without the new feature and red is with; the training is for all off-axis angles using MLP_tanh). Therefore, I decided to keep this feature as a nominal feature.
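If the feature is the plain distance between the two reconstructed camera positions, it can be computed in one vectorized step. The function and column names below are hypothetical (the real implementation is in #44):

```python
# Sketch: distance between two direction reconstructions (intersection vs. DISP)
# as a training feature. Function and array names are hypothetical.
import numpy as np

def reco_diff(x_int, y_int, x_disp, y_disp):
    """Euclidean distance (deg) between the two reconstructed positions."""
    return np.hypot(x_int - x_disp, y_int - y_disp)

# toy example: event 0 -- methods agree ("good"); event 1 -- they diverge ("bad")
d = reco_diff(np.array([0.10, 0.10]), np.array([0.20, 0.20]),
              np.array([0.11, 0.60]), np.array([0.21, 0.80]))
print(d)
```

The naive expectation from the comment above is then just that `d` is small for well-reconstructed events and large for poorly reconstructed ones, making it a natural discriminating input.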
The first thing I did was select the parameters that best separate the event types (from the PSF class, dividing the events through the 4 quartiles of the angular difference between true and reconstructed direction).
As I define the event types as a function of the reconstructed energy, I chose the following variables to be used for the training:
log_reco_energy = log10 of the reco energy
log_NTels_reco = log10 of the number of telescopes used
array_distance = distance to the array center
img2_ang = Not sure how it is defined... angle between the showers of the second brighter telescope pair? No clue...
log_SizeSecondMax = log10 of the size of the second brighter image