MATE-ML is a Method for Automatic Term Extraction based on Machine Learning.
This method treats term extraction as a classification task, since the purpose of extraction can be seen as classifying candidates into terms or non-terms. The figure below shows the four steps of MATE-ML, which are completely automated and can be adapted to the application in which the extracted terms will be used.
Input: a corpus, a general-language corpus (optional), and external knowledge (labeled words).
Text preprocessing:
cleans and standardizes the input data, identifies POS (part-of-speech) tags, removes stopwords, etc.
Feature extraction:
calculates linguistic, statistical, and hybrid features that describe the words of the input corpus.
Filter application:
applies feature and attribute (word) selection.
Classification of the candidate terms:
applies inductive or transductive classification algorithms in order to identify the terms.
Output: a list of extracted terms.
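The four steps can be sketched as a small Java pipeline. This is only an illustration of the data flow and not the MATE-ML implementation: the class MateMlSketch, all method names, the stopword list, the length-based filter, and the nearest-centroid classifier over a single frequency feature are assumptions chosen to keep the example short. The labeled words stand in for the external knowledge given as input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the four steps; none of these names come from MATE-ML.
public class MateMlSketch {

    // Assumed stopword list, for illustration only.
    static final Set<String> STOPWORDS = Set.of("the", "of", "a", "is", "and", "in", "to");

    // Step 1, text preprocessing: lowercase, split on non-letters, drop stopwords.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 2, feature extraction: a single statistical feature (raw frequency) per word.
    static Map<String, Integer> extractFeatures(List<String> tokens) {
        Map<String, Integer> frequency = new TreeMap<>();
        for (String token : tokens) {
            frequency.merge(token, 1, Integer::sum);
        }
        return frequency;
    }

    // Step 3, filter application: discard words shorter than minLength characters.
    static Map<String, Integer> applyFilter(Map<String, Integer> features, int minLength) {
        Map<String, Integer> kept = new TreeMap<>();
        features.forEach((word, freq) -> {
            if (word.length() >= minLength) {
                kept.put(word, freq);
            }
        });
        return kept;
    }

    // Step 4, classification: nearest centroid on frequency, trained on labeled words
    // (true = term, false = non-term). Ties are classified as non-terms.
    static List<String> classify(Map<String, Integer> features, Map<String, Boolean> labeled) {
        double termSum = 0, nonTermSum = 0;
        int termCount = 0, nonTermCount = 0;
        for (Map.Entry<String, Boolean> entry : labeled.entrySet()) {
            Integer freq = features.get(entry.getKey());
            if (freq == null) {
                continue; // labeled word does not occur among the candidates
            }
            if (entry.getValue()) {
                termSum += freq;
                termCount++;
            } else {
                nonTermSum += freq;
                nonTermCount++;
            }
        }
        if (termCount == 0 || nonTermCount == 0) {
            throw new IllegalArgumentException("need labeled examples of both classes");
        }
        double termCentroid = termSum / termCount;
        double nonTermCentroid = nonTermSum / nonTermCount;
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : features.entrySet()) {
            double freq = entry.getValue();
            if (Math.abs(freq - termCentroid) < Math.abs(freq - nonTermCentroid)) {
                terms.add(entry.getKey());
            }
        }
        return terms;
    }

    // Chains the four steps: corpus and labeled words in, list of terms out.
    static List<String> extractTerms(String corpus, Map<String, Boolean> labeled) {
        return classify(applyFilter(extractFeatures(preprocess(corpus)), 3), labeled);
    }

    public static void main(String[] args) {
        String corpus = "The neural network is a model. The neural network learns weights. "
                + "A model of the network is trained in epochs. It works well.";
        Map<String, Boolean> labeled = Map.of("network", true, "model", true, "works", false);
        System.out.println(extractTerms(corpus, labeled));
    }
}
```

Each step is a separate method, so a different filter or classifier can be swapped in without touching the others, which mirrors the adaptability described above.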
Note: The current version implements the first two steps:
Text preprocessing: br.usp.mateml.steps.preprocessing
Feature extraction: br.usp.mateml.steps.feature_extraction