Potential concurrency issues AbstractMappedCommitter #67

Closed
leonardsaers opened this issue Mar 10, 2015 · 8 comments

Comments

@leonardsaers

Documents from other simultaneously running jobs are added as ICommitOperation instances to a job's committer when that committer is based on AbstractMappedCommitter.

A simple committer like this one logs URLs from all simultaneously running jobs:

public class LoggingCommiter extends AbstractMappedCommitter {
    private static final Logger LOG = LogManager.getLogger(LoggingCommiter.class);

...

    @Override
    protected void commitBatch(List<ICommitOperation> batch) {
        for (ICommitOperation op : batch) {
            // Log the reference (URL) of each document addition in the batch.
            if (op instanceof IAddOperation) {
                LOG.info(((IAddOperation) op).getReference());
            }
        }
    }

}

Is this how it should work? Or should it be possible to use an AbstractMappedCommitter while running several jobs simultaneously?

@essiembre
Contributor

Not sure I understand your question. The intent is that multiple threads can invoke the committer, and that the committer will queue requests and process them in batches, regardless of which thread added the documents. Does that answer the question?
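
The queue-then-batch behavior described above can be pictured with a minimal sketch. This is only an illustration under stated assumptions: `BatchingQueueSketch` and its methods are hypothetical names, the queue is kept in memory, and none of this is the actual AbstractMappedCommitter code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Norconex implementation.
class BatchingQueueSketch {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> committed = new ArrayList<>();

    BatchingQueueSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    // Any thread may call add(); the lock keeps the pending list consistent.
    synchronized void add(String reference) {
        pending.add(reference);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Commits whatever is pending as one batch, whoever queued it.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        committed.add(new ArrayList<>(pending));
        pending.clear();
    }

    synchronized List<List<String>> committedBatches() {
        return new ArrayList<>(committed);
    }

    // Two threads feed the same queue; a batch may mix their references.
    static List<List<String>> demo() {
        BatchingQueueSketch queue = new BatchingQueueSketch(2);
        Thread t1 = new Thread(() -> { queue.add("a1"); queue.add("a2"); });
        Thread t2 = new Thread(() -> { queue.add("b1"); queue.add("b2"); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        queue.flush();
        return queue.committedBatches();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Which references share a batch depends on thread scheduling, so the sketch only guarantees that every queued reference is committed exactly once.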

@leonardsaers
Author

If I create the following two job files:

<!-- crawljob1.xml -->
<httpcollector id="crawl1">
  <crawlerDefaults>
...
    <committer class="... LoggingCommiter" />
...
  </crawlerDefaults>
  <crawlers>
    <crawler id="crawl1">
      <startURLs>
        <url>http://coffee.someurl.com/</url>
      </startURLs>
      <referenceFilters>
        <filter class="$urlRegex" onMatch="include">^https?://coffee.someurl.com/.*$</filter>
      </referenceFilters>
    </crawler>
  </crawlers>

</httpcollector>
<!-- crawljob2.xml -->
<httpcollector id="crawl2">
  <crawlerDefaults>
...
    <committer class="... LoggingCommiter" />
...
  </crawlerDefaults>
  <crawlers>
    <crawler id="crawl2">
      <startURLs>
        <url>http://tea.someurl.com/</url>
      </startURLs>
      <referenceFilters>
        <filter class="$urlRegex" onMatch="include">^https?://tea.someurl.com/.*$</filter>
      </referenceFilters>
    </crawler>
  </crawlers>

</httpcollector>

Now if I start both jobs at the same time:

nohup ./collector-http.sh -a start -c crawljob1.xml &
nohup ./collector-http.sh -a start -c crawljob2.xml &

This results in URLs from crawljob2 being printed in the log file for crawljob1, and vice versa.

I would only expect URLs from crawljob1 in the log file for crawljob1 and the same for crawljob2.

@essiembre
Contributor

I agree with you. The behavior you describe does not make much sense. Did you make sure to specify different log directories in your config? Maybe they are pointing to the same location? Even then, the log file names bear the name of the collector followed by the name of the crawler, which are unique in your example. What are the log names generated by each?

@leonardsaers
Author

I can reproduce the behaviour with the following two configurations:
https://github.com/leonardsaers/SimpleLoggingCommitter/blob/master/sample-norconex-config/a-config.xml
https://github.com/leonardsaers/SimpleLoggingCommitter/blob/master/sample-norconex-config/b-config.xml

Both configurations use the following simple committer:
https://github.com/leonardsaers/SimpleLoggingCommitter/blob/master/src/main/java/com/norconex/collector/committer/SimpleLoggingCommitter.java

Try running both a-config and b-config from the same folder at the same time.

The committer will be given IAddOperation instances from both jobs:

INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.jabbo.se/stores/210
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.jabbo.se/stores/190
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.norconex.com/search-ui-showcase-the-search-box/
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.norconex.com/enterprise-search-pitfalls-believing-in-magic/
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.jabbo.se/stores/162
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.jabbo.se/stores/173
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.jabbo.se/stores/154
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.jabbo.se/stores/90
INFO  [SimpleLoggingCommitter] Commit IAddOperation: http://www.norconex.com/google-search-appliance-gsa-a-journey-into-an-accessible-responsive-web-design/

@martinfou

Hello!

I was able to reproduce your issue, and I was able to fix it by adding the parameter <queueDir>queue-B</queueDir> to the committer configuration.

Give it a try and let me know how it works for you.

<committer class="com.norconex.collector.committer.SimpleLoggingCommitter">
  <queueDir>queue-B</queueDir>
</committer>
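
One way to picture why a distinct queueDir helps is the sketch below. It assumes, without confirmation from this thread, that the committer queue is a directory on disk and that a commit processes every queued file found there. `DirQueueSketch`, the `.ref` file extension, and the directory names `shared-queue` and `queue-A` are hypothetical stand-ins, not Norconex classes or defaults.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a committer with a directory-backed queue.
class DirQueueSketch {
    private final String owner;
    private final Path queueDir;
    private int counter;

    DirQueueSketch(String owner, Path queueDir) throws IOException {
        this.owner = owner;
        this.queueDir = Files.createDirectories(queueDir);
    }

    // Writes one queued reference as a file in the queue directory.
    void queue(String reference) throws IOException {
        Path file = queueDir.resolve(owner + "-" + (counter++) + ".ref");
        Files.write(file, reference.getBytes(StandardCharsets.UTF_8));
    }

    // Reads and deletes every queued file in the directory, whoever wrote it.
    List<String> commit() throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(queueDir, "*.ref")) {
            for (Path file : stream) {
                files.add(file);
            }
        }
        List<String> batch = new ArrayList<>();
        for (Path file : files) {
            batch.add(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            Files.delete(file);
        }
        Collections.sort(batch);
        return batch;
    }

    // Two jobs queue one URL each; returns what job A ends up committing.
    static List<String> commitFromA(String dirNameA, String dirNameB) {
        try {
            Path workDir = Files.createTempDirectory("queue-sketch");
            DirQueueSketch a = new DirQueueSketch("crawl1", workDir.resolve(dirNameA));
            DirQueueSketch b = new DirQueueSketch("crawl2", workDir.resolve(dirNameB));
            a.queue("http://coffee.someurl.com/");
            b.queue("http://tea.someurl.com/");
            return a.commit();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared:   " + commitFromA("shared-queue", "shared-queue"));
        System.out.println("separate: " + commitFromA("queue-A", "queue-B"));
    }
}
```

With a shared directory, job A's commit picks up both URLs; with separate directories it sees only its own, which matches the behavior reported in this thread.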

@leonardsaers
Author

Thanks, it works fine after adding queueDir.

I would suggest that the default queueDir include the httpcollector id. Then this would be handled by the default configuration.
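
A rough sketch of what such a default could look like, purely as an illustration: `QueueDirNamingSketch`, `defaultQueueDir`, and the `committer-queue-` prefix are hypothetical names, not part of any Norconex API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of Norconex Committer Core.
class QueueDirNamingSketch {

    // Derives a queue directory under baseDir from an owner id such as
    // the httpcollector id, replacing characters that are risky in file names.
    static Path defaultQueueDir(String baseDir, String ownerId) {
        String prefix = (ownerId == null || ownerId.trim().isEmpty())
                ? "default"
                : ownerId.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        return Paths.get(baseDir, "committer-queue-" + prefix);
    }

    public static void main(String[] args) {
        System.out.println(defaultQueueDir("work", "crawl1"));
        System.out.println(defaultQueueDir("work", "crawl2"));
    }
}
```

Two jobs with different ids get different directories, but two callers that pass the same id (or none) would still share one.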

@essiembre
Contributor

We thought about your suggestion, but the challenge is that the Committers are designed so they can be used by anything, not just the Collectors. This means that in a context where a Committer is used without a Collector, we would have to request an arbitrary prefix, and nothing would prevent two users from choosing the same one.

Still, we should figure out something when used with Collectors, so I will create a feature-request to that effect under the Norconex Committer Core project.

@essiembre
Contributor

Since you have a viable solution for now and a new ticket has been created to address this better in the future (Norconex/committer-core#9), I am closing this issue. Feel free to re-open or create a new one if something else pops up. For follow-ups on general Committer issues, use this link.
