Skip to content

Latest commit

 

History

History
24 lines (16 loc) · 1.33 KB

README.md

File metadata and controls

24 lines (16 loc) · 1.33 KB

EECS731Project2

Analyzing Shakespeare Play data set downloaded from https://www.kaggle.com/kingburrito666/shakespeare-plays.

Part 1

The first part mainly focuses on analyzing the data set by pandas. We count the number of different plays, the number of players in each play, the number of playlines of each player in each play and etc.

Part 2

The second part focuses on the classification of player based on the features in other columns. We propose four diffirent models to do this classification task.

Challenge and idea

There is a challenge for this task is the string data in most columns of this data set. So we first convert those string data into int data. We propose to transfer the playline into the number of words and average number of characters in each word for each player.

Classification Models

The first model is decision tree. The second model is random forrest. The third model is support vector machine. The fourth model is non-bayesian regressor.

Conclusion

As the simulation shows, random forrest achieves the best performance at a classification rate of 0.65 and non-bayesian regressor achieves the worst performance at a classification rate of 0.05. We also found that support vector machine with linear kernel cannot be applied to multi-group classification problme.

For the details, please refer to the code.