Failure when URL is too long aborts crawling #8

Open
kevinmarks opened this issue Jan 28, 2018 · 3 comments

@kevinmarks
Contributor

If I patch this, will it recrawl the whole site, or only the files it doesn't already have?

`Processing: http://the-toast.net/The%20dining%20room%20will%20have%20been%20dismantled%20by%20now,%20the%20tableware%20and%20furniture%20packed%20up.%20The%20final%20cuttlefish%20soba%20noodles%20and%20iced%20rose-petal%20broth%20have%20left%20the%20kitchen;%20the%20last%20wild%20duck%20has%20been%20barbecued%20and%20carved;%20no%20more%20sweet%20potatoes%20will%20appear%20at%20the%20pass,%20simmering%20in%20molten%20sugar.%20After%20two%20years%20of%20planning%20for%20a%20five-week%20run%20of%2032%20lunches%20and%2032%20dinners,%20with%203,584%20guests%20fed%20approximately%2057,350%20courses,%20the%20labour%20of%20love%20that%20was%20Noma%27s%20time%20in%20Tokyo%20came%20to%20an%20end%20yesterday.
Filename: The%20dining%20room%20will%20have%20been%20dismantled%20by%20now,%20the%20tableware%20and%20furniture%20packed%20up.%20The%20final%20cuttlefish%20soba%20noodles%20and%20iced%20rose-petal%20broth%20have%20left%20the%20kitchen;%20the%20last%20wild%20duck%20has%20been%20barbecued%20and%20carved;%20no%20more%20sweet%20potatoes%20will%20appear%20at%20the%20pass,%20simmering%20in%20molten%20sugar.%20After%20two%20years%20of%20planning%20for%20a%20five-week%20run%20of%2032%20lunches%20and%2032%20dinners,%20with%203,584%20guests%20fed%20approximately%2057,350%20courses,%20the%20labour%20of%20love%20that%20was%20Noma%27s%20time%20in%20Tokyo%20came%20to%20an%20end%20yesterday.
Directory: ./the-toast.net/
fs.js:558
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^

Error: ENAMETOOLONG: name too long, open './the-toast.net/The%20dining%20room%20will%20have%20been%20dismantled%20by%20now,%20the%20tableware%20and%20furniture%20packed%20up.%20The%20final%20cuttlefish%20soba%20noodles%20and%20iced%20rose-petal%20broth%20have%20left%20the%20kitchen;%20the%20last%20wild%20duck%20has%20been%20barbecued%20and%20carved;%20no%20more%20sweet%20potatoes%20will%20appear%20at%20the%20pass,%20simmering%20in%20molten%20sugar.%20After%20two%20years%20of%20planning%20for%20a%20five-week%20run%20of%2032%20lunches%20and%2032%20dinners,%20with%203,584%20guests%20fed%20approximately%2057,350%20courses,%20the%20labour%20of%20love%20that%20was%20Noma%27s%20time%20in%20Tokyo%20came%20to%20an%20end%20yesterday.'
at Object.fs.openSync (fs.js:558:18)
at Object.fs.writeFileSync (fs.js:1223:33)
at Request._callback (/Users/kevinmarks/code/spiderpig/spider.js:145:12)
at Request.self.callback (/Users/kevinmarks/code/spiderpig/node_modules/request/request.js:188:22)
at emitTwo (events.js:106:13)
at Request.emit (events.js:191:7)
at Request.<anonymous> (/Users/kevinmarks/code/spiderpig/node_modules/request/request.js:1171:10)
at emitOne (events.js:96:13)
at Request.emit (events.js:188:7)
at IncomingMessage.<anonymous> (/Users/kevinmarks/code/spiderpig/node_modules/request/request.js:1091:12)
`
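
One possible shape for a patch, as a rough sketch rather than the actual spider.js code: shorten any filename that exceeds the filesystem limit, and catch write errors so a single bad URL does not abort the crawl. The names `safeFilename`, `writePage`, and `MAX_NAME_BYTES` are hypothetical, and the 255-byte limit is an assumption that holds on most common filesystems but not all.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Assumption: a single path component may be at most 255 bytes,
// which holds for most common filesystems but is not universal.
const MAX_NAME_BYTES = 255;

// Hypothetical helper (not in spider.js): return the name unchanged when it
// fits, otherwise a truncated prefix plus a short hash of the full name so
// that two different long URLs are very unlikely to map to the same file.
function safeFilename(name, maxBytes = MAX_NAME_BYTES) {
  if (Buffer.byteLength(name, 'utf8') <= maxBytes) {
    return name;
  }
  const hash = crypto.createHash('sha1').update(name).digest('hex').slice(0, 16);
  const suffix = '-' + hash;
  const budget = maxBytes - suffix.length;
  let prefix = name.slice(0, budget);
  // Trim further if multi-byte characters push the prefix over budget.
  while (Buffer.byteLength(prefix, 'utf8') > budget) {
    prefix = prefix.slice(0, -1);
  }
  return prefix + suffix;
}

// Hypothetical wrapper: write the page, but log and carry on if the write
// fails instead of letting the exception abort the whole crawl.
function writePage(dir, name, body) {
  const target = path.join(dir, safeFilename(name));
  try {
    fs.writeFileSync(target, body);
    return target;
  } catch (err) {
    console.error('Skipping ' + target + ': ' + err.code);
    return null;
  }
}
```

Note that truncating this way drops any file extension on an over-long name, so a real patch might want to preserve it.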

@aaronpk
Owner

aaronpk commented Jan 28, 2018

I don't think I ever made it do incremental crawls. That would be a good feature to add tho

@kevinmarks
Contributor Author

kevinmarks commented Jan 28, 2018 via email

@kevinmarks
Contributor Author

Looking at it, to make it incremental, I'd factor the URL-to-path logic into a function, call it at the beginning of process_link(), and then check whether a file already exists at that path.
If it does, set visited to true.
Then, if the file ends in .html, read it and run the link parsing on it.

What I'm not sure about is how that would handle redirects.
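
A minimal sketch of that idea, under some assumptions: `urlToPath` and `checkCache` are hypothetical names, the layout (one directory per host, `index.html` for paths ending in a slash) is a guess at how files are stored rather than what spider.js actually does, and `visited` is assumed to be a plain object keyed by URL. It does not attempt to answer the redirect question.

```javascript
const fs = require('fs');
const path = require('path');
const { URL } = require('url');

// Hypothetical helper: map a URL to a local file path under
// outputDir/<hostname>/. The layout is an assumption, not spider.js's own.
function urlToPath(link, outputDir = '.') {
  const u = new URL(link);
  let pathname = u.pathname;
  if (pathname.endsWith('/')) {
    pathname += 'index.html';
  }
  return path.join(outputDir, u.hostname, pathname);
}

// Hypothetical check to run at the start of process_link(): if the file is
// already on disk, mark the link as visited and, for .html files, return the
// saved markup so the caller can run link parsing on it without refetching.
function checkCache(link, visited, outputDir = '.') {
  const file = urlToPath(link, outputDir);
  if (!fs.existsSync(file) || !fs.statSync(file).isFile()) {
    return { cached: false, file: file, html: null };
  }
  visited[link] = true;
  const html = file.endsWith('.html') ? fs.readFileSync(file, 'utf8') : null;
  return { cached: true, file: file, html: html };
}
```

In process_link(), a `cached: true` result would skip the HTTP request entirely, and a non-null `html` would be fed to the existing link parser so that pages linked from already-saved files still get discovered.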
