Simple PHP text mining code (the TextMiner class) for basic content analysis: top n-gram keywords, stemming, and stopword removal. (Originally written years ago.)
Example 1: Top Keywords as a comma-separated string
$tm = new TextMiner();
//add any number of files and/or text
$tm->addFile("http://en.wikipedia.org/wiki/Data_mining");
$tm->addFile("http://freebase.com/search?limit=30&start=0&query=data+mining");
$tm->addText("Data mining text can also be added this way.");
$tm->convertToLower = TRUE; // optional
$tm->process();//should be called before accessing keywords
printa($tm->getTopNGrams(10,false));
echo $tm->printSummary();
Result:
data mining, knowledge discovery, machine learning, mining m, mining software, mining knowledge, discovery data, data analysis, doi 10, mining data
======================
Text: data mining - wikipedia the free encyclopedia data mining from wikipedia...
Total nGrams: 7009
======================
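The second argument `false` in `getTopNGrams(10,false)` appears to select the comma-separated form shown above. Assuming the underlying data is an n-gram => count array like the one in Example 2, the same string could be built by hand. This is a minimal sketch: `nGramsToString` is a hypothetical helper that is not part of TextMiner, and the counts are illustrative sample values.

```php
<?php
// Hypothetical helper (not part of TextMiner): joins the n-gram keys
// of a count array into one comma-separated string.
function nGramsToString(array $counts): string
{
    // array_keys() drops the counts and keeps the n-grams in array order.
    return implode(', ', array_keys($counts));
}

// Illustrative sample data, not real TextMiner output.
$counts = ['data mining' => 46, 'mining data' => 6, 'big data' => 3];
echo nGramsToString($counts), PHP_EOL;
```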
Example 2: Top Keywords as an array
$tm = new TextMiner();
//add any number of files and/or text
$tm->addFile("http://www.google.com/search?q=data+mining");
$tm->addText("Text can be added this way.");
$tm->convertToLower = TRUE; // optional
$tm->process();//should be called before accessing keywords
printa($tm->getTopNGrams(10));
echo $tm->printSummary();
Result:
Array
(
[data mining] => 46
[8206 cached] => 10
[cached similar] => 7
[mining data] => 6
[8206 ad] => 3
[big data] => 3
[oracle data] => 3
[mining 8206] => 3
[predictive analytics] => 3
[search search] => 2
)
======================
Text: data mining - google search search images maps play youtube news gmail drive more calendar translate...
Total nGrams: 483
======================
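The examples call `printa()`, which is not defined anywhere in this text. The `Array ( ... )` output suggests it wraps `print_r`. The following is an assumed stand-in, not the original helper, so that the examples can be run as written.

```php
<?php
// Assumed stand-in for printa(); the original helper may differ.
if (!function_exists('printa')) {
    function printa($value): void
    {
        // print_r(..., true) returns the dump as a string instead of printing it.
        // <pre> keeps its line breaks visible in a browser, and
        // htmlspecialchars() escapes any markup contained in the mined text.
        echo '<pre>' . htmlspecialchars(print_r($value, true)) . '</pre>' . PHP_EOL;
    }
}

printa(['data mining' => 46, 'big data' => 3]);
```

On the command line the `<pre>` tags and escaped characters are just noise; calling `print_r($value)` directly is enough there.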
Example 3: Top N-Grams (including lower N-grams) as an array
$tm = new TextMiner();
$tm->addFile("http://en.wikipedia.org/wiki/Data_mining");
$tm->setN(3);
$tm->convertToLower = TRUE;
$tm->includeLowerNGrams = TRUE; // also count shorter n-grams (with N=3: 1-grams and 2-grams)
$tm->process();
printa($tm->getTopNGrams(10));
echo $tm->printSummary();
Result:
Array
(
[data] => 398
[mining] => 257
[data mining] => 229
[knowledge] => 53
[discovery] => 49
[information] => 48
[analysis] => 47
[learning] => 43
[patterns] => 40
[edit] => 40
)
======================
Text: data mining - wikipedia the free encyclopedia data mining from wikipedia...
Total nGrams: 19173
======================
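To make the effect of `setN()` and `includeLowerNGrams` concrete, here is a minimal sketch of word n-gram counting. It is not TextMiner's implementation: `countNGrams` and its parameters are hypothetical, and the tokenizer (split on anything that is not a letter or digit) is an assumption.

```php
<?php
// Hypothetical helper, not TextMiner's code: counts word n-grams in a string.
function countNGrams(string $text, int $n, bool $includeLower = false, bool $toLower = true): array
{
    if ($toLower) {
        $text = strtolower($text);
    }
    // Assumed tokenizer: split on runs of characters that are not letters or digits.
    $words = preg_split('/[^a-z0-9]+/i', $text, -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);

    $counts = [];
    // With $includeLower, count every size from 1 up to $n; otherwise only size $n.
    for ($size = $includeLower ? 1 : $n; $size <= $n; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $gram = implode(' ', array_slice($words, $i, $size));
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }

    arsort($counts); // highest count first, keys preserved
    return $counts;
}

$counts = countNGrams('Data mining is fun. Data mining is useful.', 2, true);
print_r(array_slice($counts, 0, 5, true));
```

Because shorter n-grams occur at least as often as the longer ones that contain them, enabling the lower sizes tends to push single words to the top of the list, as in the result above.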
Example 4: Stemming
//Note: getStemCounts is static, so no TextMiner instance or process() call is needed
//STEMMING FUNCTIONALITY (requires class: Stemming.php)
//get stem counts as an array
$text = "The quick brown fox jumped over the lazy dog";
echo $text;
$words = explode(" ",$text);
$sc = TextMiner::getStemCounts($words);
printa($sc);
Result:
Array
(
[the] => 2
[lazi] => 1
[dog] => 1
[over] => 1
[fox] => 1
[quick] => 1
[brown] => 1
[jump] => 1
)
Example 5: Stemming, output in a table
//Note: outputStemTable is static, but here it is fed the n-grams of a processed TextMiner instance
//STEMMING FUNCTIONALITY (requires class: Stemming.php)
//output stems in a table
$tm = new TextMiner();
$tm->addFile("http://en.wikipedia.org/wiki/Data_mining");
$tm->process();//should be called before accessing keywords
TextMiner::outputStemTable($tm->getNGrams(),12);
Result:
STEM | WORDS |
---|---|
data min | (237) |
knowledge discoveri | (38) |
machine learn | (24) |
data set | (17) |
mining softwar | (15) |
discovery data | (13) |
data analysi | (12) |
Example 6: Removing stopwords
/* STATIC METHOD EXAMPLE */
//Note: removeStopWords is static, so no TextMiner instance or process() call is needed
//STOPWORD REMOVAL
$text = "The quick brown fox jumped over the lazy dog";
echo $text;
$words = explode(" ",strtolower($text));
printa(TextMiner::removeStopWords($words));
Result:
The quick brown fox jumped over the lazy dog
Array
(
[0] => quick
[1] => brown
[2] => fox
[3] => jumped
[4] => lazy
[5] => dog
)
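For readers curious how such a filter might work, here is a minimal sketch of stopword removal. It is not TextMiner's implementation: `removeStopWordsSketch` is a hypothetical name, and the default stopword list is a tiny illustrative sample, not the list the class ships with.

```php
<?php
// Hypothetical helper; the default stopword list is a small illustrative sample.
function removeStopWordsSketch(array $words, array $stopWords = ['the', 'over', 'a', 'an', 'and', 'of']): array
{
    // Flip the list so each lookup is a hash check instead of a linear scan.
    $lookup = array_flip($stopWords);
    $kept = array_filter($words, fn($w) => !isset($lookup[strtolower($w)]));
    return array_values($kept); // reindex from 0, matching the output above
}

$words = explode(' ', strtolower('The quick brown fox jumped over the lazy dog'));
print_r(removeStopWordsSketch($words));
```

With this sample list the sketch keeps quick, brown, fox, jumped, lazy, and dog, the same six words as the result above. A different stopword list would of course give different output.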