Scraping event called multiple times #28

Open
dmarafetti opened this issue Nov 29, 2012 · 0 comments

Due to an issue in PhantomJS (issue 353; I've tested on 1.7.0 on both Mac OS X and Debian 6), the page.open() callback is called multiple times for some URLs. This is related to iframes being created within the page (you can find more details in the open PhantomJS issue).

pjscrape.js (master branch)

line 680 // run the scrape
line 681 page.open(url, function(status) {

Below is an example of the log output when scraping is invoked multiple times:

xxxxx@ip-xxxxxxxxx:~/crawler$ phantomjs   --web-security=no --load-images=no --ignore-ssl-errors=yes ./pjscrape-600e20a/pjscrape.js  ./bin/pjscrape-600e20a/pjscrape.js ./config.js

Using config file: src/main/resources/com/apicube/crawler/pjscrape/config.js
* Suite 0 starting
* Opening http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items

In the meantime, I've applied a workaround to stop the duplicate events. I use the visited array to check whether the page was already visited, adding a condition before line 700 as shown below:

pjscrape.js (master branch) line 700

    // skip the callback if this URL was already handled
    if (visited[url]) {
        log.msg('Page recalled: ' + url);
        return;
    }

    // mark as visited
    visited[url] = true;
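To illustrate the idea outside of pjscrape, here is a minimal, self-contained sketch of the same guard. It is not the actual pjscrape code: `oncePerUrl`, `fakePage`, and the example URL are hypothetical names made up for the illustration, and `fakePage.open` simply calls its callback three times to imitate the duplicate invocations described above.

```javascript
// Hypothetical helper (not part of pjscrape): wraps a page.open-style
// callback so that it runs at most once per URL.
function oncePerUrl(visited, url, callback) {
    return function (status) {
        if (visited[url]) {
            // Duplicate invocation, e.g. triggered by an iframe load.
            return false;
        }
        // mark as visited before running the real callback
        visited[url] = true;
        callback(status);
        return true;
    };
}

// Stand-in for the PhantomJS page object. It fires the callback three
// times for a single open() call to imitate the reported behavior.
var fakePage = {
    open: function (url, callback) {
        for (var i = 0; i < 3; i++) {
            callback('success');
        }
    }
};

var visited = {};
var scrapeCount = 0;
var url = 'http://example.com/article.html'; // placeholder URL

fakePage.open(url, oncePerUrl(visited, url, function (status) {
    scrapeCount += 1; // the real scrape would happen here
}));

console.log('scrapes run: ' + scrapeCount); // prints "scrapes run: 1"
```

The guard only hides the duplicate callbacks; the underlying PhantomJS behavior is unchanged, so a URL that legitimately needs to be scraped twice would also be skipped.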

Hope this helps fix the bug.
Diego
