add: maximum timeout for the crawler's overall run #137

Merged
merged 1 commit into from
Jun 29, 2023

Conversation

HeisenbergV
Contributor

@HeisenbergV HeisenbergV commented Jan 6, 2023

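There is currently no control over the maximum run time of the whole process, so this PR implements it.
The control is not exact, though; it is accurate only to within a few seconds. For example, with a setting of 20 seconds the crawler stops after roughly 20-23 seconds.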

Configuration:

  1. New config option: max-run-time, in seconds: ./crawlergo --max-run-time=20 www.xxx.com
  2. config.go: MaxRunTime, the default maximum run time, is 3600 seconds

Control points

  1. New-URL stage - task_main.go:addTask2Pool(): once the timeout is detected, no new URLs can be added and no new tabs are created
func (t *CrawlerTask) addTask2Pool(req *model.Request) {
	t.taskCountLock.Lock()
	// Stop accepting tasks once the crawl count limit is reached.
	if t.crawledCount >= t.Config.MaxCrawlCount {
		t.taskCountLock.Unlock()
		return
	} else {
		t.crawledCount += 1
	}

	// Reject the task if the overall deadline (Start + MaxRunTime seconds) has passed.
	if t.Start.Add(time.Second * time.Duration(t.Config.MaxRunTime)).Before(time.Now()) {
		t.taskCountLock.Unlock()
		return
	}
	t.taskCountLock.Unlock()
.....
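  2. Tab-creation stage, just before crawling - task_main.go:Task()
	// Set the tab timeout: if a maximum program run time is configured, use the smaller of the tab timeout and the program's remaining time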
	timeremaining := t.crawlerTask.Start.Add(time.Duration(t.crawlerTask.Config.MaxRunTime) * time.Second).Sub(time.Now())
	tabTime := t.crawlerTask.Config.TabRunTimeout
	if t.crawlerTask.Config.TabRunTimeout > timeremaining {
		tabTime = timeremaining
	}

	if tabTime <= 0 {
		return
	}

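Each tab's maximum timeout is controlled by TabRunTimeout, so the logic here is:
take the smaller of the process's remaining time and the tab's maximum timeout, and use it as the tab's timeout.
If no time is left, the tab is not created.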

@Qianlitp Qianlitp merged commit a02e03b into Qianlitp:master Jun 29, 2023