Blocking seems not to be working #60

Open
skarampatakis opened this issue Apr 12, 2016 · 9 comments


@skarampatakis

Hi, we have been using Silk Single Machine to create links between two datasets. We would like to enable blocking to reduce running times, but nothing seems to happen.

It is stated here that blocking should be enabled by adding
[<Blocking blocks="100" />]

but with that, Java throws an error about a malformed configuration.

We changed it to
<Blocking blocks="100" />

Silk seems to run, but there is no reduction in running time. Is blocking actually being used but having no effect because of our data? Or is our configuration wrong?

@afeliachi
Contributor

Hi Sotirios,
What distance measure(s) are you using? Have you checked that the attribute indexing="true" is set in your script?

@skarampatakis
Author

Where should this attribute go? We are using a mix of distance measures (levenshtein, jaro, dice, jaccard, etc.) with thresholds of 0.2.

@afeliachi
Contributor

You will find it on every Compare element, for example:
<Compare id="comparison1" required="false" weight="1" metric="levenshteinDistance" threshold="1.0" indexing="true">
You have to set it to "true" for every distance measure you want to use for blocking.
Blocking is in fact based on the indexing of the values used in the comparisons (a block will contain similar values only).
As for the thresholds, I know they depend on the data and on how strict you want your interlinking to be, but I advise you to choose them carefully: 0.2 would work for normalized distance measures only. See the plugins documentation for more details on each distance measure.
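
To make the placement concrete, here is a sketch of a complete Compare element with indexing switched on. The attributes on Compare are the ones from the example above (note the space before indexing, which well-formed XML requires); the lowerCase transformation and the skos:prefLabel paths are illustrative placeholders, and the exact set of required attributes may differ between Silk versions.

```xml
<!-- Sketch only: the inputs are placeholders, adapt them to your data. -->
<Compare id="comparison1" required="false" weight="1"
         metric="levenshteinDistance" threshold="1.0" indexing="true">
  <!-- Value taken from the source entity (variable ?a) -->
  <TransformInput function="lowerCase">
    <Input path="?a/skos:prefLabel"/>
  </TransformInput>
  <!-- Value taken from the target entity (variable ?b) -->
  <TransformInput function="lowerCase">
    <Input path="?b/skos:prefLabel"/>
  </TransformInput>
</Compare>
```

Assuming levenshteinDistance is the unnormalized variant, a threshold of 1.0 here would be measured in edit operations, whereas a threshold such as 0.2 only makes sense for measures that are normalized to the range 0 to 1.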

@skarampatakis
Author

We added indexing="true" to every metric, with no significant change.

@afeliachi
Contributor

After taking a look at the code, I think blocking is activated by default: even if you don't add <Blocking blocks="100" />, blocking was working from the beginning.
The same goes for indexing.
To be sure, can you look in {user_home}/.silk/entityCache/{your_interlink_id}/? You should find two folders, "source" and "target", and each one should contain 100 folders representing the blocks. If that is the case, blocking is working just fine.

You can also try <Blocking blocks="10" />; you will probably see that the execution time becomes longer.
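
As a sketch, the experiment amounts to running the same link specification twice and changing only the Blocking element. Only one of the two elements should be present in a given run. That the number of block folders in the entity cache follows the blocks value is an assumption drawn from the check described above, not something confirmed here.

```xml
<!-- Run 1: the value from the documentation snippet.
     Assumed result: 100 block folders under both "source" and "target". -->
<Blocking blocks="100" />

<!-- Run 2: fewer and therefore larger blocks.
     If blocking is active, this run is expected to take longer. -->
<Blocking blocks="10" />
```

If both runs take the same time and the folder count does not change, that would suggest the Blocking element is being ignored.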

@skarampatakis
Author

I have tried running Silk Single Machine on different operating systems.
On Ubuntu 15.04, blocking seems to be working, but it gives bad results: a low number of links.
On Windows 10, Silk seems to ignore the blocking setting and gives better results for our task. While this sounds like a bug, at least we found out that blocking isn't helping us, so for now we will not be using it. The question is: how do I disable blocking? If I comment out the Blocking statement on Ubuntu, Silk enables blocking by default, as you mentioned. If we declare enabled="false", Silk gives no results.

@afeliachi
Contributor

This seems really odd. Normally enabled="false" should do the job.
Can you post your whole script, please? It may help in understanding the bug.
Meanwhile, I hope the project members can give a better answer to your question. I am mainly just a user like you :)

@skarampatakis
Author

Thank you for your time.

<?xml version="1.0"?>
<Silk>
  <Prefixes>
    <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#"/>
    <Prefix id="xsd" namespace="http://www.w3.org/2001/XMLSchema#"/>
    <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#"/>
    <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
    <Prefix id="sesame" namespace="http://www.openrdf.org/schema/sesame#"/>
    <Prefix id="fn" namespace="http://www.w3.org/2005/xpath-functions#"/>
    <Prefix id="skos" namespace="http://www.w3.org/2004/02/skos/core#"/>
  </Prefixes>
  <DataSources>
    <DataSource id="codelist1" type="file">
      <Param name="file" value="source.rdf"/>
      <Param name="format" value="RDF/XML"/>
    </DataSource>
    <DataSource id="codelist2" type="file">
      <Param name="file" value="target.rdf"/>
      <Param name="format" value="RDF/XML"/>
    </DataSource>
  </DataSources>

  <Blocking enabled="false"  />


  <Interlinks>
    <Interlink id="labels">
      <SourceDataset dataSource="codelist1" var="a">
        <RestrictTo></RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="codelist2" var="b">
        <RestrictTo></RestrictTo>
      </TargetDataset>
      <LinkageRule linkType="skos:closeMatch">
        <Aggregate type="max">
          <Compare metric="levenshtein" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>
          </Compare>
          <Compare metric="jaro" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>
          </Compare>
          <Compare metric="jaroWinkler" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="jaccard" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="dice" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
          <Compare metric="softjaccard" threshold="0.20" >
            <TransformInput function="lowerCase">
                <Input path="?a/skos:prefLabel"/>
            </TransformInput>  
            <TransformInput function="lowerCase">
                <Input path="?b/skos:prefLabel"/>
            </TransformInput>  
          </Compare>
        </Aggregate>
        <Filter limit="10"/>
      </LinkageRule>
    </Interlink>
  </Interlinks>
  <Outputs>
    <Output id="suggestions" type="file" minConfidence="0.5">
      <Param name="file" value="top10_project5.nt"/>
      <Param name="format" value="N-TRIPLE"/>
    </Output>
    <Output id="exactMatch" type="file" minConfidence="1">
      <Param name="file" value="exact_project5.nt"/>
      <Param name="format" value="N-TRIPLE"/>
    </Output>
    <Output id="score" type="alignment" minConfidence="0.5" maxConfidence="1">
      <Param name="file" value="score_project5.rdf"/>
      <Param name="format" value="RDF/XML"/>
    </Output>
  </Outputs>
</Silk>

@skarampatakis
Author

skarampatakis commented Apr 14, 2016

If we set enabled="false" on Ubuntu, Java throws an error:
/home/user/.silk/entityCache/labels/source/block0/parition0 (No such file or directory) - error loading resources
