-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to deal with indeterminacy? #41
Comments
OpenNLP allows to retrieve that information, I'd be happy to try and add it to the library, but I'm a Clojure newbie and can't be sure how much it'll take me :) |
@jacopofar that would be great if it could be added, perhaps on metadata? (or wherever it fits) |
I think should return a vector of parse trees along with their probabilities (currently it forces the value to 1). I made a first attempt at allowing to specify this here but have yet to write a test for it |
Here's some code I added to one of my projects that gives access to the probabilities for part-of-speech tagging. I imagine something similar could be done for parsing. (ns com.owoga.prhyme.util.nlp
(:require [opennlp.nlp :as nlp]
[opennlp.treebank :as tb]
[clojure.string :as string]
[clojure.java.io :as io]
[clojure.zip :as zip]
[com.owoga.prhyme.nlp.tag-sets.treebank-ii :as tb2])
(:import (opennlp.tools.postag POSModel POSTaggerME)))
(def tokenize (nlp/make-tokenizer (io/resource "models/en-token.bin")))
(def get-sentences (nlp/make-sentence-detector (io/resource "models/en-sent.bin")))
(def parse (tb/make-treebank-parser (io/resource "models/en-parser-chunking.bin")))
(def pos-tagger (nlp/make-pos-tagger (io/resource "models/en-pos-maxent.bin")))
;;;; The tagger that onennlp.nlp gives us doesn't provide access
;;;; to the probabilities of all tags. It gives us the probability of the
;;;; top tag through some metadata. But to get probs for all tags, we
;;;; need to do something like implement our own tagger.
(defprotocol Tagger
(tags [this sent])
(probs [this])
(top-k-sequences [this sent]))
(defn make-pos-tagger
[modelfile]
(let [model (with-open [model-stream (io/input-stream modelfile)]
(POSModel. model-stream))
tagger (POSTaggerME. model)]
(reify Tagger
(tags [_ tokens]
(let [token-array (into-array String tokens)]
(map vector tokens (.tag tagger #^"[Ljava.lang.String;" token-array))))
(probs [_] (seq (.probs tagger)))
(top-k-sequences [_ tokens]
(let [token-array (into-array String tokens)]
(.topKSequences tagger #^"[Ljava.lang.String;" token-array))))))
(def prhyme-pos-tagger (make-pos-tagger (io/resource "models/en-pos-maxent.bin")))
(comment
(let [phrase "The feeling hurts."]
(map (juxt #(.getOutcomes %)
#(map float (.getProbs %)))
(top-k-sequences prhyme-pos-tagger (tokenize phrase))))
;; => ([["DT" "NN" "VBZ" "."] (0.9758878 0.93964833 0.7375927 0.95285994)]
;; [["DT" "VBG" "VBZ" "."] (0.9758878 0.03690145 0.27251 0.9286113)])
) |
Evaluating
(treebank-parser ["What can happen in a second ."])
using the set-up in the README file here, I get the following parse:Actually I'm pretty sure the
JJ
should be anNN
. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?The text was updated successfully, but these errors were encountered: