Potential concurrency issues in AbstractMappedCommitter #67
Not sure I understand your question. The intent is that multiple threads can invoke the committer, and the committer will queue requests and process them in batches, regardless of which thread added the documents. Does that answer the question?
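To make the intent concrete, here is a minimal sketch of that "queue, then flush in batches" pattern. This is an illustration only, not the actual Norconex implementation; all names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the batching pattern described above.
public class BatchingSketch {

    private final List<String> queue = new ArrayList<>();
    private final int batchSize = 100;

    // Multiple crawler threads may call this concurrently, so access
    // to the shared queue is synchronized.
    public synchronized void add(String reference) {
        queue.add(reference);
        if (queue.size() >= batchSize) {
            commitBatch(new ArrayList<>(queue));
            queue.clear();
        }
    }

    private void commitBatch(List<String> batch) {
        // A real committer would push the whole batch to the target
        // repository here; which thread queued each document does not matter.
        System.out.println("Committing " + batch.size() + " documents");
    }
}
```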
If I create the following two job files:

```xml
<!-- crawljob1.xml -->
<httpcollector id="crawl1">
  <crawlerDefaults>
    ...
    <committer class="... LoggingCommiter" />
    ...
  </crawlerDefaults>
  <crawlers>
    <crawler id="crawl1">
      <startURLs>
        <url>http://coffee.someurl.com/</url>
      </startURLs>
      <referenceFilters>
        <filter class="$urlRegex" onMatch="include">^https?://coffee.someurl.com/.*$</filter>
      </referenceFilters>
    </crawler>
  </crawlers>
</httpcollector>
```

```xml
<!-- crawljob2.xml -->
<httpcollector id="crawl2">
  <crawlerDefaults>
    ...
    <committer class="... LoggingCommiter" />
    ...
  </crawlerDefaults>
  <crawlers>
    <crawler id="crawl2">
      <startURLs>
        <url>http://tea.someurl.com/</url>
      </startURLs>
      <referenceFilters>
        <filter class="$urlRegex" onMatch="include">^https?://tea.someurl.com/.*$</filter>
      </referenceFilters>
    </crawler>
  </crawlers>
</httpcollector>
```

Now if I start both jobs at the same time:
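For instance, launching both from the same working directory (assuming the stock launch script from the standard 2.x distribution; the script name varies by platform):

```sh
./collector-http.sh -a start -c crawljob1.xml &
./collector-http.sh -a start -c crawljob2.xml &
```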
This results in URLs from crawljob2 being printed in the log file for crawljob1, and vice versa. I would expect only URLs from crawljob1 in the log file for crawljob1, and likewise for crawljob2.
I agree with you. The behavior you describe does not make much sense. Did you make sure to specify different log directories in your configs? Maybe they are pointing to the same location? Even then, the log file names bear the name of the collector followed by the name of the crawler, which are unique in your example. What are the log names generated by each?
I can reproduce the behaviour with the following two configurations (a-config and b-config). Both configurations use the same simple committer. Try running both a-config and b-config from the same folder at the same time: the committer will be given IAddOperation instances from both jobs.
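The committer code itself did not survive in this thread. As a rough sketch, a logging committer along these lines (assuming the committer-core 2.x API; the exact hook signatures may differ slightly between versions, and the class name here is hypothetical) would reproduce the behaviour:

```java
import java.util.List;

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import org.apache.commons.configuration.XMLConfiguration;

import com.norconex.committer.core.AbstractMappedCommitter;
import com.norconex.committer.core.IAddOperation;
import com.norconex.committer.core.ICommitOperation;
import com.norconex.committer.core.IDeleteOperation;

// Hypothetical committer that prints each queued operation's reference
// instead of sending it to a search engine.
public class LoggingCommitter extends AbstractMappedCommitter {

    @Override
    protected void commitBatch(List<ICommitOperation> batch) {
        for (ICommitOperation op : batch) {
            if (op instanceof IAddOperation) {
                System.out.println("ADD " + ((IAddOperation) op).getReference());
            } else if (op instanceof IDeleteOperation) {
                System.out.println("DEL " + ((IDeleteOperation) op).getReference());
            }
        }
    }

    @Override
    protected void saveToXML(XMLStreamWriter writer) throws XMLStreamException {
        // No committer-specific settings to save.
    }

    @Override
    protected void loadFromXML(XMLConfiguration xml) {
        // No committer-specific settings to load.
    }
}
```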
Hello! I was able to reproduce your issue, and I was able to fix it by adding the queueDir parameter to the committer in each configuration, so the two jobs no longer share the same default file queue. Give it a try and let me know how it works for you.
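For example, something like the following in each job file (the queueDir element is the one supported by the file-queue based committers; the paths here are just illustrative):

```xml
<!-- In crawljob1.xml -->
<committer class="... LoggingCommiter">
  <queueDir>./committer-queue-crawl1</queueDir>
</committer>

<!-- In crawljob2.xml -->
<committer class="... LoggingCommiter">
  <queueDir>./committer-queue-crawl2</queueDir>
</committer>
```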
Thanks, it works fine after adding queueDir. I would suggest that the default queueDir include the httpcollector id. Then this case would be handled by the default configuration.
We thought about your suggestion, but the challenge is that the Committers are designed so they can be used by anything, not just the Collectors. In a context where a Committer is used without a Collector, we would have to request an arbitrary prefix, and nothing would prevent two users from specifying the same one. Still, we should figure out something for when Committers are used with Collectors, so I will create a feature request to that effect under the Norconex Committer Core project.
Since you have a viable solution for now and a new ticket has been created to address this better in the future (Norconex/committer-core#9), I am closing this issue. Feel free to re-open it or create a new one if something else pops up. For follow-ups on general Committer issues, use this link.
Documents from other simultaneously running jobs are added as ICommitOperation to a job's committer when the committer is based on AbstractMappedCommitter. A simple committer like this will log URLs from all simultaneously running jobs.
Is this how it should work, or should it be possible to use an AbstractMappedCommitter while running several jobs simultaneously?