Fix Static Crawling Issue Due to Newly Implemented Anti-Scraping Mechanism #109
base: master
…ing of TWSE_EQUITIES and TPEX_EQUITIES
Hello JunTingLin! I think I ran into the same problem as you: the update function fails.
- Consider adding try-except blocks to handle potential exceptions.
- Use WebDriverWait(driver, 10).until rather than time.sleep.
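The two suggestions above can be sketched without Selenium installed. The helpers `wait_until` and `safe_wait` below are hypothetical names, not part of this PR; they only mimic the polling idea behind `WebDriverWait(driver, 10).until(...)`, which re-checks a condition until it is truthy or a timeout elapses, instead of sleeping for a fixed five seconds.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical stand-in for Selenium's WebDriverWait(...).until(...).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the page is ready
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


def safe_wait(condition, timeout=10.0, poll_interval=0.5):
    """Wrap the wait in try-except so a timeout does not crash the update."""
    try:
        return wait_until(condition, timeout, poll_interval)
    except TimeoutError:
        return None  # caller decides whether to retry or report the failure
```

With Selenium itself, the equivalent would presumably be `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "table")))` inside a `try`/`except TimeoutException` block. Waiting for a `table` element is an assumption about the target page, not something this PR confirms.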
driver.get(main_page_url)
time.sleep(5)  # wait for JavaScript rendering to finish
driver.get(url)
time.sleep(5)  # wait for JavaScript rendering to finish
A magic number is not a good approach :(
777
# Use WebDriver to visit the main page first, then the target URL
main_page_url = "https://isin.twse.com.tw"
driver.get(main_page_url)
time.sleep(5)  # wait for JavaScript rendering to finish
A magic number is not a good approach :(
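Hello author,
First, thank you for developing and sharing such a useful project. While using it, I found that since the Lunar New Year holiday, the original approach of using a static crawler (requests) to fetch all stock codes from http://isin.twse.com.tw/isin/C_public.jsp?strMode=2 no longer works. I suspect this is because the site has strengthened its anti-scraping measures.
To solve this, I modified the fetch_data function in fetch.py to use Selenium for dynamic crawling. Since some users may run this project in an environment without a GUI, I enabled headless mode. However, once headless mode was enabled, I frequently ran into connection failures. After some experimentation, I found a workable solution: first visit the main page https://isin.twse.com.tw and pause for a few seconds, then visit the target URL. With that, the required data can be retrieved successfully.
If there are any problems with my changes, or if there is a better solution, please feel free to contact me.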
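The workaround described above (main page first, then the target URL) could be expressed as a small function along the following lines. This is a sketch under assumptions, not the code in this PR: `fetch_with_warmup` and `wait_for_ready` are hypothetical names, and `driver` is treated as any object exposing `get(url)` and `page_source`, as a Selenium WebDriver does. Whether a readiness check can fully replace the fixed pause on the main page has not been verified against the site.

```python
MAIN_PAGE_URL = "https://isin.twse.com.tw"


def fetch_with_warmup(driver, url, wait_for_ready, main_page_url=MAIN_PAGE_URL):
    """Visit the main page first, then the target URL, and return its HTML.

    `wait_for_ready(driver)` is a caller-supplied callable that blocks until
    the current page has finished rendering (for example, a wrapper around
    Selenium's WebDriverWait, or a plain time.sleep as in the PR).
    """
    for page in (main_page_url, url):
        driver.get(page)        # load the page in the (headless) browser
        wait_for_ready(driver)  # give JavaScript time to finish rendering
    return driver.page_source   # HTML of the last page visited, i.e. `url`
```

Passing the wait as a parameter keeps the visiting order, which is the part the PR found to matter in headless mode, separate from the choice between a fixed sleep and an explicit wait.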