Effort to analyze and clean up J2ME archives.
So far the procedure consists of 4 Python scripts:
1. Recursively extract ZIP, 7Z and RAR archives.
2. Create a JSON index of all JARs found, including their hashes and manifest data. This step also checks whether each JAR is a valid J2ME MIDlet, since some are broken files, Java desktop apps, or libraries.
3. Remove entries from the index that show signs of modification by third parties (such as pirate sites that put their own name in the manifest; we call these "bad keywords"). This step also de-duplicates files based on their hashes.
4. Sort the JAR files into directories based on each app's name, using a standard naming scheme, so variants of the same game are easy to find.
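
The indexing, filtering and de-duplication steps above could look roughly like the sketch below. It is not the code from the scripts in this repo: the function names (`parse_manifest`, `index_jar`, `clean_index`), the choice of SHA-1, the entry layout, and the rule "a JAR counts as a MIDlet if its manifest has a `MIDlet-1` attribute" are all assumptions made for illustration.

```python
import hashlib
import io
import zipfile


def parse_manifest(text):
    """Parse MANIFEST.MF text into a dict, joining continuation lines."""
    attrs = {}
    key = None
    for line in text.splitlines():
        if line.startswith(" ") and key is not None:
            # A line starting with a single space continues the previous value.
            attrs[key] += line[1:]
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            attrs[key] = value.strip()
    return attrs


def index_jar(data):
    """Return an index entry for the given JAR bytes, or None if rejected.

    Assumption: broken archives and JARs without a manifest are rejected,
    and so are manifests without MIDlet-1 (desktop apps, libraries).
    Only the exact name META-INF/MANIFEST.MF is looked up here.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as jar:
            raw = jar.read("META-INF/MANIFEST.MF")
    except (zipfile.BadZipFile, KeyError):
        return None
    manifest = parse_manifest(raw.decode("utf-8", errors="replace"))
    if "MIDlet-1" not in manifest:
        return None
    return {"sha1": hashlib.sha1(data).hexdigest(), "manifest": manifest}


def clean_index(entries, bad_keywords):
    """Drop entries whose manifest contains a bad keyword, then de-dupe by hash."""
    seen = set()
    kept = []
    for entry in entries:
        values = " ".join(entry["manifest"].values()).lower()
        if any(word.lower() in values for word in bad_keywords):
            continue  # manifest was probably modified by a third party
        if entry["sha1"] in seen:
            continue  # byte-identical file already indexed
        seen.add(entry["sha1"])
        kept.append(entry)
    return kept
```

Hashing the whole file means only byte-identical copies are treated as duplicates; a JAR that was merely repacked would need a different comparison.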
Some miscellaneous scripts are provided in this repo for further data analysis:
- vendor_count.py: Outputs a list of MIDlet vendors in the JSON index sorted by frequency.
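
As a rough idea of what a vendor count involves, here is a minimal sketch. It does not reproduce vendor_count.py: the function name `count_vendors` and the index layout (a list of entries, each with a `manifest` dict) are assumptions, and the sample data is made up.

```python
import json
from collections import Counter


def count_vendors(index_entries):
    """Return (vendor, count) pairs sorted by descending frequency."""
    counts = Counter(
        # Assumed layout: each entry keeps its manifest attributes in a dict.
        entry.get("manifest", {}).get("MIDlet-Vendor", "(unknown)")
        for entry in index_entries
    )
    return counts.most_common()


# Made-up sample index; a real run would json.load() the index file instead.
sample = json.loads(
    '[{"manifest": {"MIDlet-Vendor": "Acme"}},'
    ' {"manifest": {"MIDlet-Vendor": "Acme"}},'
    ' {"manifest": {"MIDlet-Vendor": "Other"}}]'
)
for vendor, n in count_vendors(sample):
    print(f"{n}\t{vendor}")
```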