This library is written in TypeScript and runs on Node.js. It generates questions automatically from text in Brazilian Portuguese (pt-br). The library is built from a corpus (corpora/macmorpho-v3/train) and uses a probabilistic bigram model to assign part-of-speech tags, even to unknown words. The corpus currently in use is Macmorpho-v3, a Brazilian Portuguese corpus (http://nilc.icmc.usp.br/macmorpho).
Temporary tests: http://www.murilokunze.com.br
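To make the idea of bigram tagging more concrete, here is a minimal, self-contained sketch of a bigram tagger decoded with the Viterbi algorithm. It is not this library's implementation: the names `trainBigramModel`, `viterbi`, and `toyCorpus` are hypothetical, the three-sentence corpus stands in for Macmorpho-v3, and the handling of unknown words (a uniform emission probability, so that the tag transitions decide) is an assumption chosen for simplicity.

```typescript
// Illustrative sketch only; every name below is hypothetical, not part of this library.
type TaggedWord = [string, string]; // [word, tag]

interface BigramModel {
  tags: string[];
  transition: Map<string, number>; // "prevTag nextTag" -> P(nextTag | prevTag)
  emission: Map<string, number>;   // "tag word" -> P(word | tag)
  vocabulary: Set<string>;
}

const START = "<s>"; // pseudo-tag marking the beginning of a sentence

function trainBigramModel(corpus: TaggedWord[][]): BigramModel {
  const tagCount = new Map<string, number>();
  const prevCount = new Map<string, number>();
  const transCount = new Map<string, number>();
  const emitCount = new Map<string, number>();
  const vocabulary = new Set<string>();
  const bump = (m: Map<string, number>, k: string) => m.set(k, (m.get(k) ?? 0) + 1);

  for (const sentence of corpus) {
    let prev = START;
    for (const [word, tag] of sentence) {
      const w = word.toLowerCase();
      vocabulary.add(w);
      bump(tagCount, tag);
      bump(prevCount, prev);
      bump(transCount, `${prev} ${tag}`);
      bump(emitCount, `${tag} ${w}`);
      prev = tag;
    }
  }

  const tags = Array.from(tagCount.keys());
  const transition = new Map<string, number>();
  // Add-one smoothing so that unseen tag bigrams keep a small non-zero probability.
  for (const prev of [START].concat(tags)) {
    for (const next of tags) {
      const seen = transCount.get(`${prev} ${next}`) ?? 0;
      transition.set(`${prev} ${next}`, (seen + 1) / ((prevCount.get(prev) ?? 0) + tags.length));
    }
  }
  const emission = new Map<string, number>();
  emitCount.forEach((count, key) => {
    const tag = key.split(" ")[0];
    emission.set(key, count / (tagCount.get(tag) ?? 1));
  });
  return { tags, transition, emission, vocabulary };
}

function viterbi(model: BigramModel, words: string[]): string[] {
  if (words.length === 0) return [];
  const { tags, transition, emission, vocabulary } = model;
  const emit = (tag: string, word: string): number => {
    const w = word.toLowerCase();
    // Assumption: an unknown word is equally likely under every tag.
    if (!vocabulary.has(w)) return 1 / tags.length;
    return emission.get(`${tag} ${w}`) ?? 0;
  };
  const trans = (prev: string, next: string) => transition.get(`${prev} ${next}`) ?? 0;

  // score[i][t]: best log-probability of a tag sequence ending in tags[t] at word i.
  const score: number[][] = [];
  const back: number[][] = [];
  for (let i = 0; i < words.length; i++) {
    score.push([]);
    back.push([]);
    for (let t = 0; t < tags.length; t++) {
      let best = -Infinity;
      let bestPrev = -1;
      if (i === 0) {
        best = Math.log(trans(START, tags[t]));
      } else {
        bestPrev = 0;
        for (let p = 0; p < tags.length; p++) {
          const s = score[i - 1][p] + Math.log(trans(tags[p], tags[t]));
          if (s > best) { best = s; bestPrev = p; }
        }
      }
      score[i][t] = best + Math.log(emit(tags[t], words[i]));
      back[i][t] = bestPrev;
    }
  }

  // Pick the best final tag, then follow the back-pointers to the start.
  const lastRow = score[words.length - 1];
  let current = 0;
  for (let t = 1; t < tags.length; t++) if (lastRow[t] > lastRow[current]) current = t;
  const result: string[] = new Array(words.length);
  for (let i = words.length - 1; i >= 0; i--) {
    result[i] = tags[current];
    current = back[i][current];
  }
  return result;
}

// Hypothetical toy corpus standing in for Macmorpho-v3.
const toyCorpus: TaggedWord[][] = [
  [["Maria", "NPROP"], ["gosta", "V"], ["de", "PREP"], ["música", "N"]],
  [["João", "NPROP"], ["gosta", "V"], ["de", "PREP"], ["café", "N"]],
  [["Ana", "NPROP"], ["bebe", "V"], ["café", "N"]],
];
const model = trainBigramModel(toyCorpus);
// "Pedro" never appears in the toy corpus, so its tag comes from the transitions alone.
console.log(viterbi(model, ["Pedro", "gosta", "de", "café"]).join(" ")); // NPROP V PREP N
```

Working in log space avoids numeric underflow on long sentences, and the add-one smoothing keeps a path alive even when a tag bigram was never observed in training.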
Getting part of speech tags from text
import DefaultViterbiTaggerFactory = require("./PartOfSpeechTagger/Factory/DefaultViterbiTaggerFactory");
import DefaultQuestionGeneratorFactory = require("./QuestionGenerator/Factory/DefaultQuestionGeneratorFactory");
import CorporaCYKParserFactory = require("./Parser/Factory/CorporaCYKParserFactory");
import Text = require("./Text");
import TaggedToken = require("./TaggedToken");
import CYKTable = require("./Parser/CYKTable");
let questionGenerator = DefaultQuestionGeneratorFactory.create();

// Both the parser and the tagger model are created asynchronously.
CorporaCYKParserFactory.create().then((parser) => {
    DefaultViterbiTaggerFactory.create().generateModel().then(tagger => {
        console.time("tagger");

        let phrases = "Murilo Kunze gosta de programar sozinho de noite.";
        let tokens = tagger.tag(phrases);
        let text = new Text(tokens);

        for (let phrase of text.getPhrases()) {
            console.log("-".repeat(50));
            console.log(`Text: ${phrase.toString()} \n`);
            console.log("Questions:");

            // Parse the tagged tokens, then generate questions from the CYK table.
            let cykTable: CYKTable = parser.parse(phrase.getTokens());
            for (let question of questionGenerator.generate(cykTable)) {
                console.log(question);
            }

            // Print the tagging details for each token.
            for (let token of phrase.getTokens()) {
                console.log("-".repeat(40));
                console.log(`word: ${token.getWord()}`);
                console.log(`tag: ${token.getTag()}`);
                console.log(`known word: ${token.getKnown()}`);
                console.log(`probability: ${token.getProbability()}`);
            }
        }

        console.timeEnd("tagger");
    });
});
Running on nodejs
Just run npm start.
The result should be as shown below:
Text: Murilo Kunze gosta de programar sozinho de noite.
Questions:
Quem gosta de programar?
Qual o nome da pessoa que gosta de programar?
Murilo Kunze gosta de programar?
----------------------------------------
word: Murilo Kunze
tag: NPROP
known word: true
probability: 0.07448322988440294
----------------------------------------
word: gosta
tag: V
known word: true
probability: 0.0971433588948077
----------------------------------------
word: de
tag: PREP
known word: true
probability: 0.10937787990826241
----------------------------------------
word: programar
tag: V
known word: true
probability: 0.08880982082104785
----------------------------------------
word: sozinho
tag: ADJ
known word: true
probability: 0.03734058192619177
----------------------------------------
word: de
tag: PREP
known word: true
probability: 0.11378864211986091
----------------------------------------
word: noite
tag: N
known word: true
probability: 0.3886058596720611
----------------------------------------
word: .
tag: END
known word: true
probability: 0.10145041539848605
tagger: 40.990ms
Tagset
PART OF SPEECH | TAG |
---|---|
ADJECTIVE | ADJ |
ADVERB | ADV |
SUBORDINATING CONNECTIVE ADVERB | ADV-KS |
SUBORDINATING RELATIVE ADVERB | ADV-KS-REL |
ARTICLE (def. or indef.) | ART |
COORDINATING CONJUNCTION | KC |
SUBORDINATING CONJUNCTION | KS |
INTERJECTION | IN |
NOUN | N |
PROPER NOUN | NPROP |
NUMERAL | NUM |
PARTICIPLE | PCP |
DENOTATIVE WORD | PDEN |
PREPOSITION | PREP |
ADJECTIVAL PRONOUN | PROADJ |
SUBORDINATING CONNECTIVE PRONOUN | PRO-KS |
PERSONAL PRONOUN | PROPESS |
SUBORDINATING CONNECTIVE RELATIVE PRONOUN | PRO-KS-REL |
SUBSTANTIVE PRONOUN | PROSUB |
VERB | V |
AUXILIARY VERB | VAUX |
CURRENCY SYMBOL | CUR |
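
When consuming the tagger output it can be handy to turn a tag back into a readable description. The sketch below encodes the table above as a lookup. `TAGSET` and `describeTag` are hypothetical names that are not part of this library, and the English descriptions are translations of the original Portuguese labels. Note that the sample output above also contains the tag END, which is not listed in the table, so the lookup falls back to a placeholder for it.

```typescript
// Hypothetical helper, not part of this library: tag -> description lookup.
const TAGSET = new Map<string, string>([
  ["ADJ", "adjective"],
  ["ADV", "adverb"],
  ["ADV-KS", "subordinating connective adverb"],
  ["ADV-KS-REL", "subordinating relative adverb"],
  ["ART", "article (definite or indefinite)"],
  ["KC", "coordinating conjunction"],
  ["KS", "subordinating conjunction"],
  ["IN", "interjection"],
  ["N", "noun"],
  ["NPROP", "proper noun"],
  ["NUM", "numeral"],
  ["PCP", "participle"],
  ["PDEN", "denotative word"],
  ["PREP", "preposition"],
  ["PROADJ", "adjectival pronoun"],
  ["PRO-KS", "subordinating connective pronoun"],
  ["PROPESS", "personal pronoun"],
  ["PRO-KS-REL", "subordinating connective relative pronoun"],
  ["PROSUB", "substantive pronoun"],
  ["V", "verb"],
  ["VAUX", "auxiliary verb"],
  ["CUR", "currency symbol"],
]);

// Returns the description for a tag, or a placeholder for tags outside the table.
function describeTag(tag: string): string {
  return TAGSET.get(tag) ?? "(not in tagset)";
}

console.log(describeTag("NPROP")); // proper noun
console.log(describeTag("END"));   // (not in tagset)
```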