Revise parser to make html cleanup optional #16

Open
1 of 2 tasks
roomthily opened this issue Feb 12, 2015 · 3 comments

Comments

@roomthily
Contributor

Related: #3 encoding problems.

So there are two parsing pathways: one for the NLP pipeline (clean everything) and one for the triplestore (text from the node, untouched).

Tasks:

  • unicode escape cruft removal
  • add both cleanups (html tag stripping and unicode escape removal) as options to the xml parser; we may not want to strip out the html tags for the triplestore
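
A minimal sketch of what the optional cleanup could look like, not the project's actual parser. The function name `node_text`, both flag names, and the regexes are hypothetical, and it assumes the "unicode escape cruft" means literal `\uXXXX` (or `\n`, `\r`, `\t`) sequences left in the text. With both flags off the node text comes back untouched (the triplestore case); with both on everything is cleaned (the NLP case).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: embedded html shows up as ordinary <tag ...> markup in the node text.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "unicode escape cruft" means literal escape sequences such as
# the six characters \u00a0, or a literal backslash followed by n, r or t.
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}|\\[nrt]")


def node_text(xml_string, path, strip_html=False, strip_escapes=False):
    """Return the text of the first node matching path, optionally cleaned."""
    root = ET.fromstring(xml_string)
    node = root.find(path)
    if node is None or node.text is None:
        return None
    text = node.text
    if strip_html:
        text = TAG_RE.sub(" ", text)
    if strip_escapes:
        text = ESCAPE_RE.sub(" ", text)
    if strip_html or strip_escapes:
        # Collapse the whitespace left behind by the substitutions.
        text = " ".join(text.split())
    return text


doc = "<doc><content>&lt;p&gt;Hello\\u00a0world&lt;/p&gt;</content></doc>"
untouched = node_text(doc, "content")
cleaned = node_text(doc, "content", strip_html=True, strip_escapes=True)
```

Keeping both flags off by default means the untouched pathway needs no special handling; only the NLP pipeline has to opt in.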
@roomthily
Contributor Author

See the rawresponse class, which takes the solr response to xml as a string parsable by etree. Note that the html tag removal can't happen here; it runs against the xml text blocks instead. The same likely holds for any encoding issues related to the unicode escape.

So: basic text cleanup just to get it to parse, and then the two other cleanup tasks run against the xml.

@roomthily
Contributor Author

Note: the CDATA wrapper for raw_content is not part of the newer nutch plugin/extension/etc. So the removal is there but likely unnecessary.
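
For reference, a rough sketch of a CDATA unwrap that is a no-op when the wrapper is absent, which is why leaving the removal in place should be harmless. The helper name `unwrap_cdata` is hypothetical and this is not the existing removal code.

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def unwrap_cdata(text):
    """Strip one surrounding CDATA wrapper; return the input unchanged if there is none."""
    stripped = text.strip()
    if stripped.startswith(CDATA_OPEN) and stripped.endswith(CDATA_CLOSE):
        return stripped[len(CDATA_OPEN):-len(CDATA_CLOSE)]
    return text


wrapped = unwrap_cdata("<![CDATA[<a>x</a>]]>")
plain = unwrap_cdata("<a>x</a>")
```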

@roomthily
Contributor Author

We are only stripping out the unicode escape cruft if it precedes the initial XML tag - we just want an etree-parsable string.
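
Roughly, the idea is to drop whatever sits in front of the first `<` and leave the rest alone. A minimal sketch, assuming the cruft is plain leading text; the helper name `trim_to_first_tag` is hypothetical and the sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET


def trim_to_first_tag(raw):
    """Drop everything before the first '<' so etree can parse the string."""
    start = raw.find("<")
    if start == -1:
        raise ValueError("no XML tag found in response")
    return raw[start:]


# Made-up sample: literal escape cruft ahead of the xml declaration.
raw = '\\ufeff\\n<?xml version="1.0"?><root><a>1</a></root>'
root = ET.fromstring(trim_to_first_tag(raw))
```

Escapes that appear later, inside the xml text blocks, are deliberately not touched at this stage.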
