rdfa doesn't scrape well #11

Open
kcmcleod opened this issue May 6, 2019 · 2 comments

kcmcleod commented May 6, 2019

When RDFa properties are nested, the inner properties are pulled out to form their own triples, leaving the value of the outer property looking rather messy. For example, from https://www.uniprot.org/uniprot/Q62226 :

<div class="annotation" property="hasPart" typeof="CreativeWork">
<span property="text">Sonic hedgehog protein: The C-terminal part of the sonic hedgehog protein precursor displays an autoproteolysis and a cholesterol transferase activity (PubMed:
<a href="/citations/8824192">8824192</a>, PubMed:
<a href="/citations/7891723">7891723</a>). Both activities result in the cleavage of the full-length protein into two parts (ShhN and ShhC) followed by the covalent attachment of a cholesterol moiety to the C-terminal of the newly generated ShhN (PubMed:
<a href="/citations/8824192">8824192</a>). Both activities occur in the reticulum endoplasmic (PubMed:
<a href="/citations/21357747">21357747</a>). Once cleaved, ShhC is degraded in the endoplasmic reticulum (PubMed:
<a href="/citations/21357747">21357747</a>).
<span class="attribution ECO305">
<span class="attributionHeader ">1 Publication
<span class="showHideEvidence caret_grey displayThisInline"></span>
    </span>
    <span style="display:none" class="evidenceContainer">
<p class="attributionExplain">
<span class="context-help tooltipped-click html tipId-1">
<span style="display:none">
<span class="toolTipContent">&#xd; &lt;p>Manually curated information which has been inferred by a curator based on his/her scientific knowledge or on the scientific content of an article.&lt;/p>&#xd; &lt;p>&lt;a href="/manual/evidences#ECO:0000305">More...&lt;/a>&lt;/p>&#xd;
</span>
    </span>Manual assertion inferred by curator from
    <sup>i</sup>
    </span>
    </p>
    <ul>
        <li>
            <div class="Q62226#ref18 referenceAttribution">
                <div class="reference_header">Ref.18</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/21357747" typeof="ScholarlyArticle">
                        <strong property="name">"Processing and turnover of the Hedgehog protein in the endoplasmic reticulum."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Chen+X.%22&amp;sort=score" rel="nofollow">Chen X.</a>,
                        <a href="/uniprot/?query=author:%22Tukachinsky+H.%22&amp;sort=score" rel="nofollow">Tukachinsky H.</a>,
                        <a href="/uniprot/?query=author:%22Huang+C.H.%22&amp;sort=score" rel="nofollow">Huang C.H.</a>,
                        <a href="/uniprot/?query=author:%22Jao+C.%22&amp;sort=score" rel="nofollow">Jao C.</a>,
                        <a href="/uniprot/?query=author:%22Chu+Y.R.%22&amp;sort=score" rel="nofollow">Chu Y.R.</a>,
                        <a href="/uniprot/?query=author:%22Tang+H.Y.%22&amp;sort=score" rel="nofollow">Tang H.Y.</a>,
                        <a href="/uniprot/?query=author:%22Mueller+B.%22&amp;sort=score" rel="nofollow">Mueller B.</a>,
                        <a href="/uniprot/?query=author:%22Schulman+S.%22&amp;sort=score" rel="nofollow">Schulman S.</a>,
                        <a href="/uniprot/?query=author:%22Rapoport+T.A.%22&amp;sort=score" rel="nofollow">Rapoport T.A.</a>,
                        <a href="/uniprot/?query=author:%22Salic+A.%22&amp;sort=score" rel="nofollow">Salic A.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1083/jcb.201008090">J. Cell Biol. 192:825-838(2011)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/21357747">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/21357747">Europe PMC</a>] [
                        <a href="/citations/21357747">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
<strong>Cited for:</strong>
</span> REVIEW, FUNCTION.
                    </div>
                </div>
            </div>
        </li>
    </ul>
    </span>
    </span>
    <span class="attribution ECO269">
<span class="attributionHeader ">3 Publications
<span class="showHideEvidence caret_grey displayThisInline"></span>
    </span>
    <span style="display:none" class="evidenceContainer">
<p class="attributionExplain">
<span class="context-help tooltipped-click html tipId-2">
<span style="display:none">
<span class="toolTipContent">&#xd; &lt;p>Manually curated information for which there is published experimental evidence.&lt;/p>&#xd;
&lt;p>&lt;a href="/manual/evidences#ECO:0000269">More...&lt;/a>&lt;/p>&#xd;
</span>
    </span>Manual assertion based on experiment in
    <sup>i</sup>
    </span>
    </p>
    <ul>
        <li>
            <div class="Q62226#ref6 referenceAttribution">
                <div class="reference_header">Ref.6</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/7891723" typeof="ScholarlyArticle">
                        <strong property="name">"Proteolytic processing yields two secreted forms of sonic hedgehog."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Bumcrot+D.A.%22&amp;sort=score" rel="nofollow">Bumcrot D.A.</a>,
                        <a href="/uniprot/?query=author:%22Takada+R.%22&amp;sort=score" rel="nofollow">Takada R.</a>,
                        <a href="/uniprot/?query=author:%22McMahon+A.P.%22&amp;sort=score" rel="nofollow">McMahon A.P.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1128/MCB.15.4.2294">Mol. Cell. Biol. 15:2294-2303(1995)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/7891723">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/7891723">Europe PMC</a>] [
                        <a href="/citations/7891723">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
<strong>Cited for:</strong>
</span> PROTEOLYTIC PROCESSING, GLYCOSYLATION, SUBCELLULAR LOCATION.
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="Q62226#ref7 referenceAttribution">
                <div class="reference_header">Ref.7</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/7736596" typeof="ScholarlyArticle">
                        <strong property="name">"Floor plate and motor neuron induction by different concentrations of the amino-terminal cleavage product of sonic hedgehog autoproteolysis."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Roelink+H.%22&amp;sort=score" rel="nofollow">Roelink H.</a>,
                        <a href="/uniprot/?query=author:%22Porter+J.A.%22&amp;sort=score" rel="nofollow">Porter J.A.</a>,
                        <a href="/uniprot/?query=author:%22Chiang+C.%22&amp;sort=score" rel="nofollow">Chiang C.</a>,
                        <a href="/uniprot/?query=author:%22Tanabe+Y.%22&amp;sort=score" rel="nofollow">Tanabe Y.</a>,
                        <a href="/uniprot/?query=author:%22Chang+D.T.%22&amp;sort=score" rel="nofollow">Chang D.T.</a>,
                        <a href="/uniprot/?query=author:%22Beachy+P.A.%22&amp;sort=score" rel="nofollow">Beachy P.A.</a>,
                        <a href="/uniprot/?query=author:%22Jessell+T.M.%22&amp;sort=score" rel="nofollow">Jessell T.M.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1016/0092-8674(95)90397-6">Cell 81:445-455(1995)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/7736596">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/7736596">Europe PMC</a>] [
                        <a href="/citations/7736596">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
<strong>Cited for:</strong>
</span> FUNCTION, PROTEOLYTIC PROCESSING, AUTOCATALYTIC CLEAVAGE.
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="Q62226#ref8 referenceAttribution">
                <div class="reference_header">Ref.8</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/8824192" typeof="ScholarlyArticle">
                        <strong property="name">"Cholesterol modification of hedgehog signaling proteins in animal development."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Porter+J.A.%22&amp;sort=score" rel="nofollow">Porter J.A.</a>,
                        <a href="/uniprot/?query=author:%22Young+K.E.%22&amp;sort=score" rel="nofollow">Young K.E.</a>,
                        <a href="/uniprot/?query=author:%22Beachy+P.A.%22&amp;sort=score" rel="nofollow">Beachy P.A.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1126/science.274.5285.255">Science 274:255-259(1996)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/8824192">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/8824192">Europe PMC</a>] [
                        <a href="/citations/8824192">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
																<strong>Cited for:</strong>
															</span> CHOLESTERYLATION AT GLY-198, FUNCTION.
                    </div>
                </div>
            </div>
        </li>
    </ul>
    </span>
    </span>
    </span>
</div>

The triple representing the text property (on the second line of the markup above) ends up as:

http://bioschemas.org/crawl/v1/28/www.uniprot.org/uniprot/Q62226/781026336 http://schema.org/text  "Sonic hedgehog protein: The C-terminal part of the sonic hedgehog protein precursor displays an autoproteolysis and a cholesterol transferase activity (PubMed:8824192, PubMed:7891723). Both activities result in the cleavage of the full-length protein into two parts (ShhN and ShhC) followed by the covalent attachment of a cholesterol moiety to the C-terminal of the newly generated ShhN (PubMed:8824192). Both activities occur in the reticulum endoplasmic (PubMed:21357747). Once cleaved, ShhC is degraded in the endoplasmic reticulum (PubMed:21357747).1 Publication <p>Manually curated information which has been inferred by a curator based on his/her scientific knowledge or on the scientific content of an article.</p> <p><a href="/manual/evidences#ECO:0000305">More...</a></p> Manual assertion inferred by curator fromi
          
           
            
             
              Ref.18
             
             
              
               "Processing and turnover of the Hedgehog protein in the endoplasmic reticulum."
               
               , 
               , 
               , 
               , 
               , 
               , 
               , 
               , 
               , 
               
               
                [
               ] [
               ] [
               ]
              
              
               Cited for: REVIEW, FUNCTION.
              
             
            
          3 Publications <p>Manually curated information for which there is published experimental evidence.</p> <p><a href="/manual/evidences#ECO:0000269">More...</a></p> Manual assertion based on experiment ini
          
           
            
             
              Ref.6
             
             
              
               "Proteolytic processing yields two secreted forms of sonic hedgehog."
               
               , 
               , 
               
               
                [
               ] [
               ] [
               ]
              
              
               Cited for: PROTEOLYTIC PROCESSING, GLYCOSYLATION, SUBCELLULAR LOCATION.
              
             
            
           
            
             
              Ref.7
             
             
              
               "Floor plate and motor neuron induction by different concentrations of the amino-terminal cleavage product of sonic hedgehog autoproteolysis."
               
               , 
               , 
               , 
               , 
               , 
               , 
               
               
                [
               ] [
               ] [
               ]
              
              
               Cited for: FUNCTION, PROTEOLYTIC PROCESSING, AUTOCATALYTIC CLEAVAGE.
              
             
            
           
            
             
              Ref.8
             
             
              
               "Cholesterol modification of hedgehog signaling proteins in animal development."
               
               , 
               , 
               
               
                [
               ] [
               ] [
               ]
              
              
               Cited for: CHOLESTERYLATION AT GLY-198, FUNCTION.
              
             
            
          "

Google SDT Tool

Google's tool keeps the text that Any23 removes; however, the result is still hard to read and contains odd fragments. It is better than Any23's output, though.
Screenshot 2019-05-06 at 15 10 24

Extruct

Behaves in the same way as Google.
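
One possible post-processing step, sketched here in Python with only the standard library, is to take the text of the outer element while skipping any nested element that carries its own `property` attribute or is hidden with `display:none`, and then collapse the whitespace. This is not what Any23, Google or Extruct do; the names `OuterTextExtractor` and `outer_text` and the shortened snippet are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class OuterTextExtractor(HTMLParser):
    """Collect text of the outermost element, skipping nested RDFa
    properties and subtrees hidden with display:none (an assumption
    about what counts as noise, not a rule taken from any tool)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.skip_from = None  # depth at which the skipped subtree began
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if self.skip_from is not None:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        nested_property = self.depth > 1 and "property" in attrs
        if nested_property or "display:none" in style:
            self.skip_from = self.depth

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> carry no text

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_from is not None and self.depth == self.skip_from:
            self.skip_from = None
        self.depth -= 1

    def handle_data(self, data):
        if self.skip_from is None and self.depth >= 1:
            self.parts.append(data)


def outer_text(html):
    parser = OuterTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, including newlines, into single spaces.
    return " ".join("".join(parser.parts).split())


# Shortened, illustrative version of the markup quoted in this issue.
snippet = (
    '<span property="text">Degraded in the ER (PubMed:\n'
    '<a href="/citations/21357747">21357747</a>).\n'
    '<span class="attribution">\n'
    '<span class="attributionHeader">1 Publication</span>\n'
    '<span style="display:none" class="evidenceContainer">\n'
    '<div property="citation" typeof="ScholarlyArticle">'
    '<strong property="name">"Some title."</strong><br/>'
    '<a href="/x">Chen X.</a></div>\n'
    '</span></span></span>'
)
print(outer_text(snippet))
```

Whether the "1 Publication" header should also be dropped is a judgment call; the sketch only removes what is hidden or separately annotated.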


kcmcleod commented May 6, 2019

There is a similar effect with hasPart. This time the problem is that the markup creates nodes with no content.

This HTML:

<div class="annotation" property="hasPart" typeof="CreativeWork">Belongs to the 
  <a href="/uniprot/?query=family:%22hedgehog+family%22&amp;sort=score">hedgehog family</a>.
  <span class="attribution ECO305">
    <span class="attributionHeader tooltipped" title="Manual assertion inferred by 
    curator">Curated
    </span>
  </span>
</div>

Produces the following raw triples:

genid-2f27fdee3aaf4285a4db8253476df489-n61  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://schema.org/CreativeWork .
http://purl.uniprot.org/uniprot/Q62226  http://schema.org/hasPart  genid-2f27fdee3aaf4285a4db8253476df489-n61 .

I convert these to:

http://bioschemas.org/crawl/v1/30/www.uniprot.org/uniprot/Q62226/1168557303  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://schema.org/CreativeWork .
http://purl.uniprot.org/uniprot/Q62226  http://schema.org/hasPart  http://bioschemas.org/crawl/v1/30/www.uniprot.org/uniprot/Q62226/1168557303 .

Thus we no longer have blank nodes, BUT we do have nodes with basically no information. On this single page there seem to be more than 10 instances of this. Ultimately this produces a very cluttered and unhelpful page.
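
A minimal sketch of one way to tidy this up after extraction, assuming the triples are available as plain `(subject, predicate, object)` tuples: drop every node whose only outgoing triple is `rdf:type`, together with the links pointing at it. The function name `prune_type_only_nodes`, the `ex:` identifiers and the `text` triple on `ex:full` are hypothetical and only there to show a node that survives.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def prune_type_only_nodes(triples, protected=frozenset()):
    """Remove nodes that carry nothing but an rdf:type, and any triple
    that links to them. `protected` lists subjects to keep regardless."""
    outgoing = {}
    for s, p, _ in triples:
        outgoing.setdefault(s, set()).add(p)
    empty = {
        s for s, preds in outgoing.items()
        if preds == {RDF_TYPE} and s not in protected
    }
    return [(s, p, o) for s, p, o in triples if s not in empty and o not in empty]


PROTEIN = "http://purl.uniprot.org/uniprot/Q62226"
triples = [
    # A node like the ones described above: typed, but otherwise empty.
    ("ex:empty", RDF_TYPE, "http://schema.org/CreativeWork"),
    (PROTEIN, "http://schema.org/hasPart", "ex:empty"),
    # A hypothetical node that does carry a value, so it is kept.
    ("ex:full", RDF_TYPE, "http://schema.org/CreativeWork"),
    ("ex:full", "http://schema.org/text", "Belongs to the hedgehog family."),
    (PROTEIN, "http://schema.org/hasPart", "ex:full"),
]
kept = prune_type_only_nodes(triples)
for triple in kept:
    print(triple)
```

This only hides the symptom; the text "Belongs to the hedgehog family." is still lost unless the markup attaches it to a property.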

Google SDT Tool

Same result:
Screenshot 2019-05-06 at 14 42 30


kcmcleod commented May 9, 2019

Difference between Any23 and Google

HTML source:

<div property="hasPart" class="annotation">
   <ul class="noNumbering subcellLocations">
      <li class="Nucleus">
         <h6>Nucleus</h6>
         <ul>
            <li>
               <a href="/locations/SL-0191">Nucleus </a><a class="icon icon-generic tooltipped" data-tippy="The nucleus is the most obvious organelle in any eukaryotic cell. It is a membrane-bound organelle surrounded by double membranes which contains most of the cell's genetic material. It communicates with the surrounding cytosol via numerous nuclear pores." data-icon="i"></a> 
               <span class="attribution ECO269">
                  <span class="attributionHeader ">1 Publication<span class="showHideEvidence caret_grey displayThisInline"></span></span>
                  <span style="display:none" class="evidenceContainer">
                     <p class="attributionExplain"><span class="context-help tooltipped-click html tipId-1">Manual assertion based on experiment in<sup>i</sup></span></p>
                     <ul>
                        <li>
                           <div class="Q8K330#ref1 referenceAttribution">
                              <div class="reference_header">Ref.1</div>
                              <div class="reference_content">
                                 <div property="citation" resource="http://purl.uniprot.org/citations/14531860" typeof="ScholarlyArticle"><strong property="name">"Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."</strong><br/><a href="/uniprot/?query=author:%22Ohta+Y.%22&amp;sort=score" rel="nofollow">Ohta Y.</a>, <a href="/uniprot/?query=author:%22Kousaka+K.%22&amp;sort=score" rel="nofollow">Kousaka K.</a>, <a href="/uniprot/?query=author:%22Nagata-Ohashi+K.%22&amp;sort=score" rel="nofollow">Nagata-Ohashi K.</a>, <a href="/uniprot/?query=author:%22Ohashi+K.%22&amp;sort=score" rel="nofollow">Ohashi K.</a>, <a href="/uniprot/?query=author:%22Muramoto+A.%22&amp;sort=score" rel="nofollow">Muramoto A.</a>, <a href="/uniprot/?query=author:%22Shima+Y.%22&amp;sort=score" rel="nofollow">Shima Y.</a>, <a href="/uniprot/?query=author:%22Niwa+R.%22&amp;sort=score" rel="nofollow">Niwa R.</a>, <a href="/uniprot/?query=author:%22Uemura+T.%22&amp;sort=score" rel="nofollow">Uemura T.</a>, <a href="/uniprot/?query=author:%22Mizuno+K.%22&amp;sort=score" rel="nofollow">Mizuno K.</a><br/><a href="http://dx.doi.org/10.1046/j.1365-2443.2003.00678.x">Genes Cells 8:811-824(2003)</a>  [<a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/14531860">PubMed</a>] [<a property="sameAs" href="https://europepmc.org/abstract/MED/14531860">Europe PMC</a>] [<a href="/citations/14531860">Abstract</a>]</div>
                                 <div class="citedFor"><span class="details"><strong>Cited for:</strong></span> NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.</div>
                              </div>
                           </div>
                        </li>
                     </ul>
                  </span>
               </span>
            </li>
         </ul>
      </li>
      <li class="Cytoskeleton">
         <h6>Cytoskeleton</h6>
         <ul>
            <li>
               <a href="/locations/SL-0090">cytoskeleton </a><a class="icon icon-generic tooltipped" data-tippy="The cytoskeleton is a dynamic three-dimensional structure that fills the cytoplasm of cells. The cytoskeleton is responsible for cell movement, cytokinesis, and the organization of the organelles or organelle-like structures within the cell. The major components of the cytoskeleton are the microfilaments (of actin), microtubules (of tubulin), the intermediate filament systems and a fourth group, the MinD-ParA group, that appears to be unique to bacteria." data-icon="i"></a> 
               <span class="attribution ECO269">
                  <span class="attributionHeader ">1 Publication<span class="showHideEvidence caret_grey displayThisInline"></span></span>
                  <span style="display:none" class="evidenceContainer">
                     <p class="attributionExplain"><span class="context-help tooltipped-click html tipId-1">Manual assertion based on experiment in<sup>i</sup></span></p>
                     <ul>
                        <li>
                           <div class="Q8K330#ref1 referenceAttribution">
                              <div class="reference_header">Ref.1</div>
                              <div class="reference_content">
                                 <div property="citation" resource="http://purl.uniprot.org/citations/14531860" typeof="ScholarlyArticle"><strong property="name">"Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."</strong><br/><a href="/uniprot/?query=author:%22Ohta+Y.%22&amp;sort=score" rel="nofollow">Ohta Y.</a>, <a href="/uniprot/?query=author:%22Kousaka+K.%22&amp;sort=score" rel="nofollow">Kousaka K.</a>, <a href="/uniprot/?query=author:%22Nagata-Ohashi+K.%22&amp;sort=score" rel="nofollow">Nagata-Ohashi K.</a>, <a href="/uniprot/?query=author:%22Ohashi+K.%22&amp;sort=score" rel="nofollow">Ohashi K.</a>, <a href="/uniprot/?query=author:%22Muramoto+A.%22&amp;sort=score" rel="nofollow">Muramoto A.</a>, <a href="/uniprot/?query=author:%22Shima+Y.%22&amp;sort=score" rel="nofollow">Shima Y.</a>, <a href="/uniprot/?query=author:%22Niwa+R.%22&amp;sort=score" rel="nofollow">Niwa R.</a>, <a href="/uniprot/?query=author:%22Uemura+T.%22&amp;sort=score" rel="nofollow">Uemura T.</a>, <a href="/uniprot/?query=author:%22Mizuno+K.%22&amp;sort=score" rel="nofollow">Mizuno K.</a><br/><a href="http://dx.doi.org/10.1046/j.1365-2443.2003.00678.x">Genes Cells 8:811-824(2003)</a>  [<a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/14531860">PubMed</a>] [<a property="sameAs" href="https://europepmc.org/abstract/MED/14531860">Europe PMC</a>] [<a href="/citations/14531860">Abstract</a>]</div>
                                 <div class="citedFor"><span class="details"><strong>Cited for:</strong></span> NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.</div>
                              </div>
                           </div>
                        </li>
                     </ul>
                  </span>
               </span>
            </li>
         </ul>
      </li>
   </ul>
</div>

Output from Google:
Screenshot 2019-05-09 at 09 18 16
To view this on Google: https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fwww.uniprot.org%2Funiprot%2FQ8K330

Triple produced by any23:

http://purl.uniprot.org/uniprot/Q8K330  http://schema.org/hasPart  
          
           Cytoskeleton
            
             cytoskeleton  1 PublicationManual assertion based on experiment ini
                
                 
                  
                   
                    Ref.1
                   
                   
                    
                     "Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."
                     
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     
                     
                      [
                     ] [
                     ] [
                     ]
                    
                    
                     Cited for: NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.
                    
                   
                  
                
            
           Nucleus
            
             Nucleus  1 PublicationManual assertion based on experiment ini
                
                 
                  
                   
                    Ref.1
                   
                   
                    
                     "Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."
                     
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     
                     
                      [
                     ] [
                     ] [
                     ]
                    
                    
                     Cited for: NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.
                    
                   
                  
                
            
          
         

Notice that the order in the HTML is Nucleus then Cytoskeleton, which is also the order Google reports. HOWEVER, Any23 reverses the order. Furthermore, notice how much of the text found by Google is not detected by Any23.

ALSO notice that much of the text inside the HTML is missing entirely from both the Google and the Any23 output. E.g., the HTML says "The cytoskeleton is a dynamic three-dimensional structure that fills the cytoplasm of cells", but this is missing from both. In the source above, that sentence sits in a data-tippy attribute rather than in element content.
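
To illustrate the last two points, here is a small sketch using Python's standard `html.parser` on a cut-down version of the markup above. It collects text nodes in document order and, separately, the `data-tippy` attribute values. The class name `TextAndTooltips` is made up, and this says nothing about how Any23 or Google work internally; it only shows what a plain text-content walk would see.

```python
from html.parser import HTMLParser


class TextAndTooltips(HTMLParser):
    """Record text nodes and data-tippy attribute values separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []      # text content, in document order
        self.tooltips = []  # attribute values, never part of text content

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-tippy" and value:
                self.tooltips.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Cut-down version of the subcellular location markup quoted above.
snippet = (
    '<ul><li><h6>Nucleus</h6><a href="/locations/SL-0191">Nucleus </a>'
    '<a class="icon" data-tippy="The nucleus is the most obvious organelle '
    'in any eukaryotic cell." data-icon="i"></a></li>\n'
    '<li><h6>Cytoskeleton</h6><a href="/locations/SL-0090">cytoskeleton </a>'
    '<a class="icon" data-tippy="The cytoskeleton is a dynamic '
    'three-dimensional structure that fills the cytoplasm of cells." '
    'data-icon="i"></a></li></ul>'
)

parser = TextAndTooltips()
parser.feed(snippet)
parser.close()
print(parser.text)      # Nucleus entries come before Cytoskeleton entries
print(parser.tooltips)  # the descriptive sentences live only here
```

A walk like this keeps the Nucleus, Cytoskeleton order of the source, and the tooltip sentences never appear in the text list because they are attribute values.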
