-
Notifications
You must be signed in to change notification settings - Fork 6
gselva/Simple-Tika-Server
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Simple Tika Server https://github.com/gselva ========================== Tika as a HTTP service to parse and get metadata and the textual content of documents (as json) such as below. Sample Invocation: ------------------ curl -T tika.pdf http://localhost:8080/tika/fulldata Sample Output: ------------- { "metadata": { "Author":"Apache", "Content-Length":243166, "Content-Type":"application/pdf", "Creation-Date":"2010-05-11T13:39:32Z", "Keywords":"", "Last-Modified":"2010-05-11T13:39:32Z", "created":"Wed May 11 14:39:32 BST 2010", "creator":"Google Chrome", "producer":"Mac OS X 10.6.7 Quartz PDFContext", "resourceName":"tika.pdf", "subject":"", "title":"Apache Tika - Apache Tika", "xmpTPg:NPages":4 }, "text":"Apache Tika\nApache Tika - a content analysis toolkit" } Based on https://github.com/maxcom/tikaserver-ex (thanks Max!) More info --------- Tika as server : https://issues.apache.org/jira/browse/TIKA-593 Json output : https://issues.apache.org/jira/browse/TIKA-213 Build and run ------------- Install Maven if you haven't, then, under project root folder do: mvn jetty:run Deploy ------ mvn war:war Then drop the war into the webapps folder of a servlet container like Jetty or Apache Tomcat. If using Simple Tika Server for parsing local files with HTTP GET, you'll need to setup JNDI entries, see below. Usage ----- HTTP PUT a document to parse with Tika PUT document to get metadata curl -T pom.xml http://localhost:8080/tika/metadata returns metadata as JSON PUT document to get text curl -T pom.xml http://localhost:8080/tika/text returns textual content, if any, as json PUT document to get metadata and text curl -T pom.xml http://localhost:8080/tika/fulldata returns metadata and textual content as json HTTP GET to parse a locally available document with Tika GET Metadata for files local to server ('metadata' option) http://localhost:8080/tika/metadata/localfiles/tika.pdf returns metadata as JSON 'localfiles' is a JNDI value that returns a URL (see jetty-env.xml for example) tika.pdf should be made available there before calling http://localhost:8080/tika/metadata/wikipedia/wiki/Douglas_Adams returns metadata as JSON 'wikipedia' is a JNDI value that returns a URL (see jetty-env.xml for example) "wiki/Douglas_Adams" is the resource you want to parse http://localhost:8080/tika/metadata/wikipedia/w/index.php?title=Talk:Douglas_Adams&action=history document part can have query params as well "w/index.php?title=Talk:Douglas_Adams&action=history" is treated as the source document here GET Text for files local to server ('text' option) http://localhost:8080/tika/text/localfiles/tika.pdf returns textual content, if any, as json 'localfiles' is a JNDI value that returns a URL (see jetty-env.xml for example) tika.pdf should be made available there before calling GET Metadata and Text for files local to server ('fulldata' option) http://localhost:8080/tika/fulldata/localfiles/tika.pdf returns metadata and textual content as json 'localfiles' is a JNDI value that returns a URL (see jetty-env.xml for example) tika.pdf should be made available there before calling HTTP Codes returned ------------------- 200 - Ok 404 - Document not found 415 - Unknown file type 422 - Unparsable document of known type (password protected documents and unsupported versions like Biff5 Excel) 500 - Internal error
About
Apache Tika as a http service, PUT files and get the metadata as JSON
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published