Entity-Semantic Document Search

Please see the updated README here: https://github.com/forward-uiuc/Forward-Search-Installation-Guide. It contains sufficient guidance to install the entire system. The README below may contain additional information, but some of its instructions may be outdated.

Intro

This plugin extends Entity Search to support Entity-Semantic Document Search by adding operators such as @near, @contains, and @layout_near to hashtag queries. For example, users can type @near(#professor, #email, #phone) @contains(mining) to search for homepages of professors who work in data mining. Other queries can express the same task. The goal is to give users expressive operators so that they can describe the documents they are looking for in a semantic manner.

Essentially, the plugin parses a user query written in our predefined language, e.g., @near(#professor, #email, #phone) @contains(mining), into a form that Elasticsearch can execute, e.g., a SpanQuery, using information about the predefined layout of the index.
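
To make the parsing step concrete, here is a minimal Python sketch that splits a query in this language into operators and their arguments. It is an illustration only, not the plugin's actual parser (the plugin itself is written in Java). The names Operator and parse_query are hypothetical, and the sketch assumes that operators are not nested and that arguments are separated by commas or whitespace.

```python
import re
from typing import List, NamedTuple


class Operator(NamedTuple):
    """One parsed operator call (hypothetical structure, for illustration)."""
    name: str            # operator name without '@', e.g. "near"
    entities: List[str]  # arguments written as #type, stored without '#'
    keywords: List[str]  # plain keyword arguments


# One operator call: '@name', optional whitespace, then a parenthesised
# argument list. Nested parentheses are not supported (assumption).
_CALL = re.compile(r"@(\w+)\s*\(([^()]*)\)")


def parse_query(query: str) -> List[Operator]:
    operators = []
    for name, raw_args in _CALL.findall(query):
        # Arguments may be separated by commas and/or whitespace.
        args = [a for a in re.split(r"[,\s]+", raw_args.strip()) if a]
        entities = [a[1:] for a in args if a.startswith("#")]
        keywords = [a for a in args if not a.startswith("#")]
        operators.append(Operator(name, entities, keywords))
    return operators
```

For example, parse_query("@near(#professor, #email, #phone) @contains(mining)") yields a near operator with the entities professor, email, and phone, followed by a contains operator with the keyword mining. Translating this structure into span queries is the plugin's job and is not shown here.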

For more details on the structure of Elasticsearch plugins, see https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html. It describes the different kinds of plugins (discovery, analysis, mapper, ingest, and store) and explains how to develop and maintain plugins.

Install

To install the plugin, go to the plugin folder and run the following commands on a Linux or Mac machine (on a Windows machine, please look up how to install Elasticsearch plugins):

First, we need to compile the code:

./gradlew clean assemble

Then, we need to remove the plugin if it was installed previously:

path-to-elasticsearch-5.6.1/bin/elasticsearch-plugin remove elasticsearch-esdocumentsearch

Then, install the plugin:

path-to-elasticsearch-5.6.1/bin/elasticsearch-plugin install file://path-to-plugin/target/releases/elasticsearch-esdocumentsearch-5.6.1.zip

Please note that, for the plugin to work, we need to define the index schema as below and annotate the data as described in elasticsearch-cs-professors-crawler-annotator. The annotated data can then be imported automatically with the Bulk API:

PUT /entity_lucene_dinv_new_analysis/
{
  "mappings": {
    "d_document": {
      "properties": {
        "entityContent": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "physicalDoc": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      },
      "dynamic_templates": [
        {
          "entity_type": {
            "match_mapping_type": "string",
            "match": "_entity_*",
            "mapping": {
              "type": "text",
              "term_vector": "with_positions_offsets_payloads",
              "store": true,
              "analyzer": "entity_analyzer"
            }
          }
        },
        {
          "entity_type": {
            "match_mapping_type": "string",
            "match": "_xpos_entity_*",
            "mapping": {
              "type": "text",
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "xpos_entity_analyzer"
            }
          }
        }
      ]
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        },
        "keep_entity_word": {
          "type": "keep",
          "keep_words": [
            "oentityo"
          ]
        }
      },
      "analyzer": {
        "entity_analyzer": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": [
            "lowercase",
            "delimited_payload_filter",
             "keep_entity_word"
          ]
        },
        "xpos_entity_analyzer": {
          "type": "custom",
          "tokenizer": "layout_tokenizer",
          "filter": [
            "lowercase",
            "delimited_payload_filter",
             "keep_entity_word"
          ]
        },
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}
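
As a rough illustration of the Bulk API import, the Python sketch below builds a bulk request body for the index and type defined above. The helper name build_bulk_body and the sample field values are hypothetical. The sketch deliberately leaves out the _entity_* and _xpos_entity_* fields, because their annotation format is defined by elasticsearch-cs-professors-crawler-annotator and is not described here.

```python
import json


def build_bulk_body(docs, index="entity_lucene_dinv_new_analysis",
                    doc_type="d_document"):
    """Build a newline-delimited JSON body for the Elasticsearch Bulk API.

    `docs` is an iterable of (doc_id, source_dict) pairs. Each document
    becomes two lines: an action line and the document source.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample document using only the plain fields of the schema.
sample_body = build_bulk_body([
    ("1", {"name": "example page", "text": "placeholder page text"}),
])
```

The resulting string can be sent to the _bulk endpoint with the Content-Type header set to application/x-ndjson.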

Run

With the plugin, you can run a query as below (with a new request handler _es_document_search):

GET /entity_search_cs_departments/_es_document_search?
{
  "search_request":{
    "query": "@near ( #course #number )",
    "size" : 1000,
    "explain" : true
  }
}
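
The same request can also be issued from a script. The following Python sketch uses only the standard library. The host http://localhost:9200 and the function names build_search_request and run_search are assumptions made for illustration; the response format depends on the plugin and is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request


def build_search_request(index, query, size=10, explain=False,
                         host="http://localhost:9200"):
    """Build a GET request with a JSON body for _es_document_search.

    The host is an assumption; adjust it to your Elasticsearch node.
    """
    body = {"search_request": {"query": query, "size": size, "explain": explain}}
    return urllib.request.Request(
        url=f"{host}/{index}/_es_document_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def run_search(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running Elasticsearch node with the plugin installed):
# req = build_search_request("entity_search_cs_departments",
#                            "@near ( #course #number )",
#                            size=1000, explain=True)
# result = run_search(req)
```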

To understand the code

The plugin is an action plugin, which means it extends Elasticsearch’s runtime actions by adding a customized RESTful endpoint called _es_document_search.

The plugin contains a REST handler with a customized request handler and response handler. This is where we get the user query and restructure it into the format we want: ClusteringAction.java

Future work

Add more operators such as near corner, etc.
Create another plugin to customize ranking function
Add support for Natural Language Querying
