rdfa doesn't scrape well #11
Similar effect with hasPart. This time the issue is the markup, which creates nodes with no content. This HTML:
Produces the following raw triples:
I convert to:
Thus we no longer have blank nodes, BUT we do have nodes with basically no information. On this single page there seem to be more than 10 instances of this, which ultimately produces a very cluttered and unhelpful page.
Google SDT Tool
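One way to spot the information-free nodes described above is to look for subjects whose only outgoing predicate is `rdf:type`. The following is a minimal sketch, assuming the triples are available as plain `(subject, predicate, object)` tuples; the function names `find_empty_nodes` and `prune_empty_nodes` and the sample node identifiers are hypothetical and not part of any23 or of the scraper discussed here.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def find_empty_nodes(triples):
    """Return subjects whose only outgoing predicate is rdf:type."""
    predicates_by_subject = {}
    for subject, predicate, _ in triples:
        predicates_by_subject.setdefault(subject, set()).add(predicate)
    return {
        subject
        for subject, predicates in predicates_by_subject.items()
        if predicates <= {RDF_TYPE}
    }


def prune_empty_nodes(triples):
    """Drop triples that describe or point to an information-free node."""
    empty = find_empty_nodes(triples)
    return [
        (s, p, o)
        for s, p, o in triples
        if s not in empty and o not in empty
    ]


# Hypothetical data: "n1" is typed but carries nothing else.
sample = [
    ("page", "hasPart", "n1"),
    ("n1", RDF_TYPE, "CreativeWork"),
    ("page", "name", "Title"),
    ("n2", RDF_TYPE, "CreativeWork"),
    ("n2", "name", "X"),
]
cleaned = prune_empty_nodes(sample)
```

Whether such nodes should be dropped or merely flagged is a judgment call; this sketch drops them along with the `hasPart` links pointing at them.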
Difference between any23 & Google
HTML source:
Output from Google:
Triple produced by any23:
Notice that the order in the HTML is Nucleus then Cytoskeleton, which is the order Google has too; however, the order is reversed by any23. Furthermore, much of the text found by Google is not detected by any23. Also, much of the text inside the HTML is missing from both the Google and any23 output. E.g., the HTML says "The cytoskeleton is a dynamic three-dimensional structure that fills the cytoplasm of cells", but this is missing from both Google and any23.
When properties are nested, the inner properties are pulled out to form their own triples, leaving the outer property looking rather messy. E.g., from https://www.uniprot.org/uniprot/Q62226 :
The triple representing the text property (in the 2nd line) ends up as:
Google SDT Tool
Leaves in the text that is removed by Any23; however, the result is still not easy to read and contains odd fragments. Better than Any23 though.
Extruct
Behaves in the same way as Google.
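The two behaviours described above (nested property text dropped from the outer value versus kept in it) can be illustrated with a small sketch. This is not how any23, Google's tool, or extruct are implemented; it only mimics the observed outcomes using the standard-library `html.parser`. The class `PropertyText`, the helper `property_text`, and the sample markup are hypothetical, and the sketch assumes well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they must not touch the stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class PropertyText(HTMLParser):
    """Collect the text of the element carrying property=<target>."""

    def __init__(self, target, keep_nested):
        super().__init__()
        self.target = target
        self.keep_nested = keep_nested
        self.stack = []   # property attribute (or None) of each open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(dict(attrs).get("property"))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        props = [p for p in self.stack if p is not None]
        if self.target not in props:
            return
        # Properties opened inside the target element.
        inner = props[props.index(self.target) + 1:]
        if inner and not self.keep_nested:
            return  # drop text that belongs to a nested property
        self.chunks.append(data)


def property_text(html, target="text", keep_nested=True):
    parser = PropertyText(target, keep_nested)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed fragments.
    return " ".join("".join(parser.chunks).split())


# Hypothetical markup, not taken from the UniProt page.
snippet = (
    '<div property="text">Located in the '
    '<span property="name">nucleus</span> and cytoplasm.</div>'
)
kept = property_text(snippet, keep_nested=True)
dropped = property_text(snippet, keep_nested=False)
```

With `keep_nested=False` the outer value loses the words that belong to the inner property, which is the kind of gap that makes the resulting triple hard to read.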