Potential concurrency issues in AbstractMappedCommitter #67
Not sure I understand your question. The intent is that multiple threads can invoke the committer, and the committer will queue requests and process them in batches, regardless of which thread added the documents. Does that answer the question?
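To make the intent concrete, here is a minimal sketch of that "queue, then flush in batches" pattern. This is an illustration only, not the actual Norconex implementation; all names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the batching pattern described above.
public class BatchingSketch {

    private final List<String> queue = new ArrayList<>();
    private final int batchSize = 100;

    // Multiple crawler threads may call this concurrently, so access
    // to the shared queue is synchronized.
    public synchronized void add(String reference) {
        queue.add(reference);
        if (queue.size() >= batchSize) {
            commitBatch(new ArrayList<>(queue));
            queue.clear();
        }
    }

    private void commitBatch(List<String> batch) {
        // A real committer would push the whole batch to the target
        // repository here; which thread queued each document does not matter.
        System.out.println("Committing " + batch.size() + " documents");
    }
}
```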
If I create the following two job files:

```xml
<!-- crawljob1.xml -->
<httpcollector id="crawl1">
  <crawlerDefaults>
    ...
    <committer class="... LoggingCommiter" />
    ...
  </crawlerDefaults>
  <crawlers>
    <crawler id="crawl1">
      <startURLs>
        <url>http://coffee.someurl.com/</url>
      </startURLs>
      <referenceFilters>
        <filter class="$urlRegex" onMatch="include">^https?://coffee.someurl.com/.*$</filter>
      </referenceFilters>
    </crawler>
  </crawlers>
</httpcollector>
```

```xml
<!-- crawljob2.xml -->
<httpcollector id="crawl2">
  <crawlerDefaults>
    ...
    <committer class="... LoggingCommiter" />
    ...
  </crawlerDefaults>
  <crawlers>
    <crawler id="crawl2">
      <startURLs>
        <url>http://tea.someurl.com/</url>
      </startURLs>
      <referenceFilters>
        <filter class="$urlRegex" onMatch="include">^https?://tea.someurl.com/.*$</filter>
      </referenceFilters>
    </crawler>
  </crawlers>
</httpcollector>
```

Now if I start both jobs at the same time:
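For instance, launching both from the same working directory (assuming the stock launch script from the standard 2.x distribution; the script name varies by platform):

```sh
./collector-http.sh -a start -c crawljob1.xml &
./collector-http.sh -a start -c crawljob2.xml &
```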
This results in URLs from crawljob2 being printed in the log file for crawljob1, and vice versa. I would expect only URLs from crawljob1 in the log file for crawljob1, and likewise for crawljob2.
I agree with you. The behavior you describe does not make much sense. Did you make sure to specify different log directories in your configs? Maybe they are pointing to the same location? Even then, the log file names bear the name of the collector followed by the name of the crawler, which are unique in your example. What are the log names generated by each?
I can reproduce the behaviour with the following two configurations (a-config and b-config). Both configurations use the same simple committer. Try running both a-config and b-config from the same folder at the same time: the committer will be given IAddOperation instances from both jobs.
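The committer code itself did not survive in this thread. As a rough sketch, a logging committer along these lines (assuming the committer-core 2.x API; the exact hook signatures may differ slightly between versions, and the class name here is hypothetical) would reproduce the behaviour:

```java
import java.util.List;

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import org.apache.commons.configuration.XMLConfiguration;

import com.norconex.committer.core.AbstractMappedCommitter;
import com.norconex.committer.core.IAddOperation;
import com.norconex.committer.core.ICommitOperation;
import com.norconex.committer.core.IDeleteOperation;

// Hypothetical committer that prints each queued operation's reference
// instead of sending it to a search engine.
public class LoggingCommitter extends AbstractMappedCommitter {

    @Override
    protected void commitBatch(List<ICommitOperation> batch) {
        for (ICommitOperation op : batch) {
            if (op instanceof IAddOperation) {
                System.out.println("ADD " + ((IAddOperation) op).getReference());
            } else if (op instanceof IDeleteOperation) {
                System.out.println("DEL " + ((IDeleteOperation) op).getReference());
            }
        }
    }

    @Override
    protected void saveToXML(XMLStreamWriter writer) throws XMLStreamException {
        // No committer-specific settings to save.
    }

    @Override
    protected void loadFromXML(XMLConfiguration xml) {
        // No committer-specific settings to load.
    }
}
```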
Hello! I was able to reproduce your issue, and I was able to fix it by adding the queueDir parameter to the committer in each configuration, so the two jobs no longer share the same default file queue. Give it a try and let me know how it works for you.
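For example, something like the following in each job file (the queueDir element is the one supported by the file-queue based committers; the paths here are just illustrative):

```xml
<!-- In crawljob1.xml -->
<committer class="... LoggingCommiter">
  <queueDir>./committer-queue-crawl1</queueDir>
</committer>

<!-- In crawljob2.xml -->
<committer class="... LoggingCommiter">
  <queueDir>./committer-queue-crawl2</queueDir>
</committer>
```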
Thanks, it works fine after adding queueDir. I would suggest that the default queueDir include the httpcollector id. Then this case would be handled by the default configuration.
We thought about your suggestion, but the challenge is that the Committers are designed so they can be used by anything, not just the Collectors. In a context where a Committer is used without a Collector, we would have to request an arbitrary prefix, and nothing would prevent two users from specifying the same one. Still, we should figure out something for when Committers are used with Collectors, so I will create a feature request to that effect under the Norconex Committer Core project.
Since you have a viable solution for now and a new ticket has been created to address this better in the future (Norconex/committer-core#9), I am closing this issue. Feel free to re-open it or create a new one if something else pops up. For follow-ups on general Committer issues, use this link.
Documents from other simultaneously running jobs are added as ICommitOperation to a job's committer when the committer is based on AbstractMappedCommitter. A simple committer like this will log URLs from all simultaneously running jobs.
Is this how it should work, or should it be possible to use an AbstractMappedCommitter while running several jobs simultaneously?