A service for converting natural language queries into CMR search parameters
About ↟
This project aims to provide basic natural language processing (NLP) support for clients of the NASA Earthdata Common Metadata Repository (CMR) that want a better user experience when querying the CMR Search endpoints. The initial focus is on NLP support for spatio-temporal queries.
Future work will focus on supporting collection, granule, and variable identification from natural language queries.
Dependencies ↟
- Java
- `lein`
- `curl` (used to download English language models)
- `docker` and `docker-compose` (used to run a local Elasticsearch cluster)
Supported versions:
| cmr-nlp | Elasticsearch | Status |
|---|---|---|
| 0.1.0-SNAPSHOT | 6.5.2 | In development |
Usage ↟
There are several ways in which this project may be used:
- the NLP portion of the codebase as a library (in-memory NLP models will be required)
- the Geolocation functionality as a service (an Elasticsearch cluster, local or otherwise, will be required)
- both NLP and Geolocation running as a service (no in-memory models; requires Elasticsearch cluster)
Each approach requires slightly different setup.
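If you plan on the library approach, a minimal sketch of a project.clj dependency entry follows; the gov.nasa.earthdata group id is an assumption, not confirmed here:

```clojure
;; A hedged sketch of consuming cmr-nlp as a library; the artifact
;; coordinates are an assumption based on the version table above.
(defproject my-cmr-client "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [gov.nasa.earthdata/cmr-nlp "0.1.0-SNAPSHOT"]])
```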
Setup ↟
If running just the NLP portion of the code as a library, you will need to have the required OpenNLP models available to the JVM on the classpath. You may do this easily in a cloned cmr-nlp directory with the following command:
$ lein download-models
This executes the script resources/scripts/download-models, which may be adapted for use in your own project.
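Once the models are on the classpath, they can be loaded with the standard OpenNLP classes. Here's a minimal sketch, assuming the script places the standard English tokenizer model at models/en-token.bin (that resource path is an assumption, not confirmed here):

```clojure
(require '[clojure.java.io :as io])
(import '[opennlp.tools.tokenize TokenizerME TokenizerModel])

;; Load the English tokenizer model from the classpath and tokenize a
;; sample sentence. The resource path below is an assumption about
;; where download-models places the model.
(with-open [stream (io/input-stream (io/resource "models/en-token.bin"))]
  (let [model (TokenizerModel. stream)
        tokenizer (TokenizerME. model)]
    (vec (.tokenize tokenizer "What is the temperature of Lake Superior?"))))
```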
Starting up a local Elasticsearch+Kibana cluster is as simple as:
$ lein start-es
Note that this utilizes docker-compose under the hood.
Once started, Elasticsearch's Kibana interface will be available here:
TBD
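To confirm the cluster is up before proceeding, you can hit Elasticsearch's REST API directly. A quick sanity check, assuming the cluster listens on Elasticsearch's default port of 9200:

```clojure
;; Print the cluster health JSON; the port is an assumption based on
;; Elasticsearch's default configuration.
(println (slurp "http://localhost:9200/_cluster/health"))
```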
Before ingesting Geonames data, you need to:
- Start your Elasticsearch cluster (see above), and
- Download the Geonames gazetteer files locally:
$ lein download-geonames
Note that this will also unzip the two compressed files that get downloaded:
- allCountries.zip (340MB) uncompresses to 1.4GB
- shapes_all_low.zip (1MB) uncompresses to 3.1MB
With that done, you're ready to ingest the Geonames files into Elasticsearch:
$ lein ingest
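Once the ingest completes, you can sanity-check the indexed data with a direct Elasticsearch query. A hedged sketch, where the "geonames" index name is an assumption about how the ingest task names its index:

```clojure
;; Search the ingested Geonames data over the REST API; the index name
;; is an assumption.
(println (slurp "http://localhost:9200/geonames/_search?q=name:Superior"))
```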
NLP Library ↟
Start up a REPL, require the core namespace, and define a test query:
$ lein repl
(require '[cmr.nlp.core :as nlp])
(def query "What was the average surface temperature of Lake Superior last week?")
Tokenize:
[cmr.nlp.repl] λ=> (def tokens (nlp/tokenize query))
[cmr.nlp.repl] λ=> tokens
["What"
"was"
"the"
"average"
"surface"
"temperature"
"of"
"Lake"
"Superior"
"last"
"week"
"?"]
Tag the parts of speech:
[cmr.nlp.repl] λ=> (def pos (nlp/tag-pos tokens))
[cmr.nlp.repl] λ=> pos
(["What" "WP"]
["was" "VBD"]
["the" "DT"]
["average" "JJ"]
["surface" "NN"]
["temperature" "NN"]
["of" "IN"]
["Lake" "NNP"]
["Superior" "NNP"]
["last" "JJ"]
["week" "NN"]
["?" "."])
Get chunked phrases:
[cmr.nlp.repl] λ=> (nlp/chunk pos)
({:phrase ["What"] :tag "NP"}
{:phrase ["was"] :tag "VP"}
{:phrase ["the" "average" "surface" "temperature"] :tag "NP"}
{:phrase ["of"] :tag "PP"}
{:phrase ["Lake" "Superior"] :tag "NP"}
{:phrase ["last" "week"] :tag "NP"})
Find locations:
[cmr.nlp.repl] λ=> (nlp/find-locations tokens)
("Lake Superior")
Find dates:
[cmr.nlp.repl] λ=> (nlp/find-dates tokens)
("last week")
Get actual dates from English sentences:
[cmr.nlp.repl] λ=> (nlp/extract-dates query)
(#inst "2018-11-27T21:40:12.946-00:00")
The result is a collection because a query may contain more than one date (e.g., to indicate a range):
[cmr.nlp.repl] λ=> (def query2 "What was the average high temp between last year and two years ago?")
[cmr.nlp.repl] λ=> (nlp/extract-dates query2)
(#inst "2017-12-04T21:42:42.874-00:00"
#inst "2016-12-04T21:42:42.878-00:00")
Create a CMR temporal parameter query string from a natural language sentence:
[cmr.nlp.repl] λ=> (require '[cmr.nlp.query :as query])
[cmr.nlp.repl] λ=> (query/->cmr-temporal {:query query2})
{:query "What was the average high temp between last year and two years ago?"
:temporal "temporal%5B%5D=2016-12-12T13%3A58%3A05Z%2C2017-12-12T13%3A58%3A05Z"}
Which, when URL-decoded, gives us:
"temporal[]=2016-12-05T12:21:32Z,2017-12-05T12:21:32Z"
NLP via Elasticsearch ↟
TBD
Geolocation via Elasticsearch ↟
TBD
License ↟
Copyright © 2018 NASA
Distributed under the Apache License, Version 2.0.