Python MapReduce Program for Hadoop in Semi-Apriori Style

This is a simple Python MapReduce program that finds the top 100 frequent K-item patterns (the "n-K" patterns: n1 for single items, n2 for 2-item patterns, and so on).
I'm currently a student studying data science, so I tried to make things more comprehensible to other beginners like me.
It uses Hadoop 2.7.3 and Python 2.7.

What is Apriori and Why My Code is SEMI-Apriori

Apriori is an algorithm for finding frequent patterns.
The famous "beers and diapers" story is a great example of what Apriori does.

The major difference between my code and Apriori is that I do not use the concepts of "support" and "confidence".
Instead, my code finds the top 100 item patterns and the frequent patterns of their subsets.
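
To make the idea concrete, here is a minimal in-memory sketch of the "top 100 instead of support" approach. It is not the code from this repository: the function names `top_items` and `top_patterns` are hypothetical, and it assumes each transaction is a list of items and that an item counts once per transaction.

```python
from collections import Counter
from itertools import combinations


def top_items(transactions, limit=100):
    """Count single items and keep the `limit` most common ones.

    Assumption: an item is counted once per transaction.
    """
    counts = Counter(item for items in transactions for item in set(items))
    return [item for item, _ in counts.most_common(limit)]


def top_patterns(transactions, kept_items, size, limit=100):
    """Count `size`-item patterns built only from the kept items."""
    kept = set(kept_items)
    counts = Counter()
    for items in transactions:
        # Sorting gives every pattern one canonical order.
        candidates = sorted(set(items) & kept)
        counts.update(combinations(candidates, size))
    return counts.most_common(limit)


if __name__ == "__main__":
    baskets = [
        ["beer", "diapers", "milk"],
        ["beer", "diapers"],
        ["milk", "bread"],
        ["beer", "diapers", "bread"],
    ]
    kept = top_items(baskets, limit=2)
    print(top_patterns(baskets, kept, size=2))
```

Instead of discarding patterns below a minimum support, each round simply keeps a fixed number of the most frequent patterns and uses them to restrict the next round.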

Input

A plain text file with items separated by spaces.
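
As an illustration of how such a file could be read by a Hadoop Streaming mapper, here is a small sketch. It assumes one transaction per line, which the description above does not state explicitly, and `map_line` and `format_pairs` are hypothetical helper names rather than functions from the mapper files in this repository.

```python
import sys


def map_line(line):
    """Split one input line on whitespace and emit an (item, 1) pair per item."""
    return [(item, 1) for item in line.split()]


def format_pairs(pairs):
    """Render pairs as tab-separated key/value lines, as Hadoop Streaming expects."""
    return ["%s\t%d" % (key, value) for key, value in pairs]


if __name__ == "__main__":
    # Hadoop Streaming feeds the input file to the mapper on standard input.
    for raw_line in sys.stdin:
        for out_line in format_pairs(map_line(raw_line)):
            print(out_line)
```

Because `split()` is called without arguments, repeated spaces and the trailing newline are ignored.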

Configuration

No special configuration is needed, only some path changes in the Python files and renaming of the output files (see steps 5 and 6 in Usage).

Usage

1. Set up a single-node or multi-node Hadoop cluster.
   For beginners, please check Michael G. Noll's "Running Hadoop on Ubuntu Linux (Single-Node Cluster)" and "Writing an Hadoop MapReduce Program in Python". Those are great tutorials, and my code is mostly an edited version of his.
   Many thanks to him.
2. Run $sudo chmod +x [all mapper and reducer files] to make them executable.
3. Run the n1 MapReduce job on Hadoop using Apriori_mapper.py as the mapper and Apriori_reducer.py as the reducer.
4. If the job finishes successfully, download the part-00000 file from your output directory in HDFS.
5. Run $sort -k2 -n -r part-00000 >> [The path you wish]/MapRedSorted_n1_top100 to output a sorted counting result, then manually make a top 100 list by deleting the lines after line 100. (Come on, don't be lazy.)
   You should get something like this:

   Item1    12345
   Item2    9999
   Item3    4567
   ...
   ...
6. Repeat steps 2 to 4, just remember to change the mapper to the proper .py file for the n-K job you are going to do.
   For example, use Apriori_mapper_n2.py for the 2-item frequent pattern search.
   Remember to change the name of your top 100 list too (MapRedSorted_n2_top100, MapRedSorted_n3_top100, ...).
   Also, don't forget to set the file path to your top 100 list in line 9 of every mapper file.
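
For readers who want to see what the sorting step and the 2-item mapper are doing, here is a rough Python sketch. It is only an illustration under assumptions: `top_n` and `map_pairs` are hypothetical names, the count is assumed to be the last whitespace-separated field on each line of part-00000, and the real Apriori_mapper_n2.py may work differently.

```python
from itertools import combinations


def top_n(count_lines, n=100):
    """Sort "key<whitespace>count" lines by count, descending, and keep the first n.

    Roughly what sorting on the count column and cutting after line 100 does;
    the order of ties may differ from the sort command.
    """
    rows = []
    for line in count_lines:
        line = line.strip()
        if not line:
            continue
        key, count = line.rsplit(None, 1)  # the count is the last field
        rows.append((key, int(count)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


def map_pairs(line, top_items):
    """Emit a (pair, 1) record for every 2-item combination of top items in a line."""
    kept = sorted(set(line.split()) & set(top_items))
    return [(" ".join(pair), 1) for pair in combinations(kept, 2)]


if __name__ == "__main__":
    counts = ["Item1\t12345", "Item3\t4567", "Item2\t9999"]
    top = [key for key, _ in top_n(counts, n=2)]
    print(top)
    print(map_pairs("Item2 Item9 Item1", top))
```

Restricting the pairs to items from the top list keeps the number of candidate patterns small in every later job.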

jimmyzero3/semi-apriori-python-mapreduce-program-for-Hadoop
