Revise parser to make html cleanup optional #16

roomthily · 2015-02-12T20:42:39Z

Related: #3 encoding problems.

So there's a parsing pathway for the NLP pipeline (clean everything) and a pipeline to the triplestore (text from the node, untouched).

Tasks:

unicode escape cruft removal
add both cleanups (html tag removal and unicode escape removal) as options to the xml parser - it's possible that we don't want to strip out the html tags for the triplestore
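
A rough sketch of what the optional cleanup could look like, assuming the html arrives escaped (or CDATA-wrapped) inside the node text. The function name `extract_text`, the flags `strip_html` and `strip_escapes`, both regexes and the sample document are hypothetical, not the current parser's API:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns; the real cleanup rules may well differ.
HTML_TAG = re.compile(r"<[^>]+>")
UNICODE_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")

# Made-up sample: escaped html plus a literal \u00a0 escape in the node text.
SAMPLE = r"<doc><abstract>&lt;p&gt;Sea ice\u00a0extent&lt;/p&gt;</abstract></doc>"


def extract_text(xml_string, xpath, strip_html=True, strip_escapes=True):
    """Return the text of each node matching xpath, with optional cleanup."""
    root = ET.fromstring(xml_string)
    results = []
    for node in root.findall(xpath):
        text = node.text or ""
        if strip_html:
            text = HTML_TAG.sub(" ", text)
        if strip_escapes:
            text = UNICODE_ESCAPE.sub(" ", text)
        if strip_html or strip_escapes:
            # Collapse the whitespace left behind by the removals.
            text = " ".join(text.split())
        results.append(text)
    return results


# NLP pathway: clean everything.
nlp_text = extract_text(SAMPLE, ".//abstract")
# Triplestore pathway: text from the node, untouched.
triple_text = extract_text(SAMPLE, ".//abstract", strip_html=False, strip_escapes=False)
```

Both flags default to on so the NLP pathway keeps its current behavior, and the triplestore harvest opts out explicitly.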

roomthily · 2015-02-20T04:36:15Z

See the rawresponse class: it takes the solr response to an xml string that etree can parse. Note that the html tag removal can't happen here; it runs against the xml text blocks instead. The same is likely true of any encoding issues related to the unicode escapes.

So: basic text cleanup just to get a parsable string, then the two other cleanup tasks run against the parsed xml.

roomthily · 2015-03-05T05:19:44Z

Note: the CDATA wrapper for raw_content is not part of the newer nutch plugin/extension/etc. So the removal step is still in place but is likely unnecessary.
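
For reference, a minimal sketch of a CDATA removal that is harmless when the wrapper is absent. The helper name `unwrap_cdata` and the regex are assumptions for illustration, not the code currently in the repo:

```python
import re

# Matches a string that is entirely one CDATA section (surrounding whitespace allowed).
CDATA_WRAPPER = re.compile(r"^\s*<!\[CDATA\[(.*)\]\]>\s*$", re.DOTALL)


def unwrap_cdata(raw_content):
    """Strip a single enclosing CDATA wrapper; return the input unchanged otherwise."""
    match = CDATA_WRAPPER.match(raw_content)
    return match.group(1) if match else raw_content
```

Written this way, the step is a no-op on responses from the newer plugin, so leaving it in place costs nothing.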

roomthily · 2015-03-05T05:20:50Z

We are only stripping out the unicode escape cruft if it precedes the initial XML tag - we just want an etree-parsable string.
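
Roughly, that pre-parse step amounts to the following. This is a sketch under the assumption that everything before the first `<` is cruft; `strip_leading_cruft` and `parse_response` are hypothetical names, not the rawresponse class methods:

```python
import xml.etree.ElementTree as ET


def strip_leading_cruft(response_text):
    """Drop everything before the first '<' so the string starts at the XML."""
    start = response_text.find("<")
    if start == -1:
        raise ValueError("no XML tag found in response")
    return response_text[start:]


def parse_response(response_text):
    """Parse a raw response after removing only the leading cruft."""
    return ET.fromstring(strip_leading_cruft(response_text))
```

Any escapes inside the document body are left alone at this stage; they are handled later against the xml text blocks.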

roomthily mentioned this issue Feb 20, 2015

Add second round of solr access for solr-to-triplestore harvest #28

Open
