Home
This project is based on the Microsoft Malware Classification Challenge hosted on Kaggle in the spring of 2015. The competition was essentially to classify malware samples into 9 families. The data was available in two forms: raw byte code (hex dumps) and disassembled assembly code. For more specific information on the data and the competition, please visit this site.
After a summary review of the top 7 finishers in the competition, we were surprised to find that byte-code uni-grams alone were a highly predictive feature. To elaborate, the frequencies of single byte values such as 56, 00, and ff were enough to achieve 96% accuracy in one entry. The value of this feature was confirmed by performing InfoGain and ReliefF attribute-selection searches on a small sub-sample during pretesting. All of the top finishers used this feature as a base and squeezed out extra performance with features such as larger n-grams of the byte and assembly code, or other novel additions. Each top performer invariably used some form of tree-based ensemble classifier: Random Forests, traditional Gradient Boosted Trees, or XGBoost.
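As a rough sketch of what a byte uni-gram feature looks like, the snippet below counts token frequencies in a hex-dump string. The line format (a leading address column followed by two-character hex tokens, with `??` marking unreadable bytes) is assumed from the competition's `.bytes` files; this is an illustration, not the pipeline any finisher actually used.

```python
from collections import Counter

def byte_unigram_counts(hex_dump: str) -> Counter:
    """Count byte uni-gram frequencies in a hex dump.

    Assumes each line starts with an address column followed by
    two-character hex tokens; '??' marks bytes the dump could not read.
    """
    counts = Counter()
    for line in hex_dump.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        counts.update(tokens[1:])  # drop the leading address column
    return counts

# Tiny illustrative dump (format assumed, not taken from the real data)
sample = "00401000 56 8D 44 24 08 50\n00401010 56 00 FF ?? ?? FF"
print(byte_unigram_counts(sample).most_common(3))
```

The resulting per-file count vectors (one dimension per possible token) are what a tree-based classifier would train on.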
With any classifier that achieves close to 100% accuracy, there is always the concern that it will not generalize. Overfitting can be handled by carefully cross-validating the performance of the classifier within the data set. Even with that precaution, the concern remains that a classifier may be over-tuned to the data set presented and perform worse on data in the wild. With that reality in mind, we chose to leave our classifier's accuracy at 97.8% and instead experiment with web-based, client-facing features to enrich our project and gain industry-oriented experience to carry forward into future projects.
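The cross-validation precaution mentioned above can be sketched without any ML library: the key idea is that accuracy is always measured on a held-out fold the model never trained on, with class proportions preserved across folds. The function name and fold logic below are our own illustration, not the exact procedure used in the project.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs preserving class proportions.

    A minimal stand-in for a library cross-validator, showing how
    held-out folds estimate generalization rather than training fit.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Deal each class's (shuffled) indices round-robin into k folds
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

# Toy labels: 6 malicious, 4 benign samples
labels = ["mal"] * 6 + ["benign"] * 4
for train, test in stratified_kfold(labels, k=2):
    print(len(train), len(test))
```

In practice one would fit the classifier on each `train` split, score it on the matching `test` split, and report the mean held-out accuracy; a large gap between training and held-out accuracy is the overfitting signal this guards against.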