
Crawling error #105

Closed · BlackChila opened this issue Sep 27, 2024 · 20 comments
Labels: bug (Something isn't working)

BlackChila commented Sep 27, 2024

Hey, and thanks for this nice package!
I'm having the following issue: some websites are randomly not scraped, while others get scraped correctly. Which websites fail varies randomly from run to run. For the failed websites I get the following error:
[ERROR] 🚫 Failed to crawl https://random-website.com, error: 'NoneType' object has no attribute 'get'.

I save the .html files after scraping, and the websites affected by this bug end up as an HTML file containing just ['', None].

I tried updating all packages and also set up a new conda environment, but it didn't fix the issue. I am using WebCrawler, not the AsyncWebCrawler.
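
For reference, my failing calls look roughly like this (a sketch, assuming the synchronous WebCrawler API from the docs, i.e. warmup() followed by run()):

import time
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(url="https://random-website.com")
if not result.success:
    # This is where the 'NoneType' object has no attribute 'get' error shows up
    print(result.error_message)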

unclecode (Owner) commented

@BlackChila Would you please share a link to a page where you experienced this?

unclecode self-assigned this Sep 28, 2024
unclecode added the bug (Something isn't working) label Sep 28, 2024

BlackChila commented Sep 28, 2024

Hey unclecode, thanks for your answer!
I experience this on multiple pages, and I can access the pages manually, so my IP is not blocked. Whether crawl4ai can access a page also varies randomly: on some runs it succeeds, on others it doesn't, so I don't think it's an issue with the pages themselves. It affects e.g. https://de.wikipedia.org/wiki/Aral or https://www.aldi-nord.de/, but not on every run. In total it affects 20-30% of my crawled websites.

unclecode (Owner) commented

@BlackChila Thx for sharing. Please do us a favor and try the asynchronous method, and let's see whether you get something similar with it. If you still face issues, we'll start a stress test by crawling a set of links and websites to see when such things happen. But first, please try it with the asynchronous crawler and let me know. Thank you.
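
For reference, a minimal async test would look like this (a sketch, reusing one of the URLs and options already shown in this thread):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # bypass_cache=True forces a fresh fetch on every run
        result = await crawler.arun(url="https://de.wikipedia.org/wiki/Aral", bypass_cache=True)
        if result.success:
            print(result.markdown[:500])
        else:
            print(result.error_message)

asyncio.run(main())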


RhonnieAl commented Sep 28, 2024

@unclecode Getting the same NoneType error. Here are the logs:

INFO: Started server process [14039]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
Delaying for 10 seconds...
Resuming...
[LOG] 🕸️ Crawling https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ successfully!
[LOG] 🚀 Crawling done for https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, success: True, time taken: 0.74 seconds
[ERROR] 🚫 Failed to crawl https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str
url='https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/' html='' success=False cleaned_html=None media={} links={} screenshot=None markdown=None extracted_content=None metadata=None error_message='Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str' session_id=None responser_headers=None status_code=None
INFO: 127.0.0.1:62431 - "GET / HTTP/1.1" 200 OK

from fastapi import FastAPI, HTTPException
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from dotenv import load_dotenv
import asyncio
import json


load_dotenv() 

app = FastAPI()

@app.get("/")

async def crawl(url: str = "https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/"):
    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            
            # Introduce Delay
            print("Delaying for 10 seconds...")
            await asyncio.sleep(10)
            print("Resuming...")

            # Extract data
            result = await crawler.arun(url=url, bypass_cache=True)
            
            # Return data
            print(result)
            return result.dict()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)



BlackChila (Author) commented

Thanks @RhonnieAl for posting the same issue with the asynchronous method here!


ojaros commented Sep 30, 2024

Getting the same error when trying to crawl a Notion site:

[ERROR] 🚫 Failed to crawl, error: Failed to extract content from the website, error: can only concatenate str (not "NoneType") to str

xansrnitu commented

+1. I am encountering this same issue.

theguy000 added a commit to theguy000/crawl4ai that referenced this issue Oct 3, 2024
Related to unclecode#105

Fix the 'NoneType' object has no attribute 'get' error in `AsyncWebCrawler`.

* **crawl4ai/async_webcrawler.py**
  - Add a check in the `arun` method to ensure `html` is not `None` before further processing.
  - Raise a descriptive error if `html` is `None`.

* **crawl4ai/async_crawler_strategy.py**
  - Add a check in the `crawl` method of the `AsyncPlaywrightCrawlerStrategy` class to handle cases where `html` is `None`.
  - Raise a descriptive error if `html` is `None`.

* **tests/async/test_basic_crawling.py**
  - Add a test case to verify handling of `None` values for the `html` variable in the `test_invalid_url` function.

* **tests/async/test_error_handling.py**
  - Add a test case to verify handling of `None` values for the `html` variable in the `test_network_error` function.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/unclecode/crawl4ai/issues/105?shareId=XXXX-XXXX-XXXX-XXXX).
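
A guard along the lines this commit describes, sketched as a standalone helper rather than the actual diff:

from typing import Optional

def ensure_html(url: str, html: Optional[str]) -> str:
    # Fail fast with a descriptive error instead of letting a
    # 'NoneType' error surface later during content extraction.
    # (Hypothetical helper, not the actual crawl4ai code.)
    if html is None:
        raise ValueError(f"Failed to crawl {url}: the crawler strategy returned no HTML")
    return html
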
DhrubojyotiDey commented

@unclecode Thanks for the amazing library. I'm having a bit of trouble understanding the error I'm getting: in the run below, the library fails to crawl a news website for a specific news topic. I have used the Gemini API for authentication. The error output is as follows:
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 4.51 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.18 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 0
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 1
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 2
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 4.68 seconds.
Number of related items extracted: 0
[]

I have a few questions in mind.

  1. Why does this occur with every news site? (I am building a project that extracts news topics from given news sites.)
  2. On which sites has it worked previously? I tried the example from the library docs, "https://www.nbcnews.com/business", and got the same result.


mobyds commented Oct 17, 2024

I have the same error when using WebCrawler.

The problem is in the file utils.py:

File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\crawl4ai\utils.py", line 694, in get_content_of_website_optimized
    src = img.get('src', '')
          ^^^^^^^^^^^^^^^^^^
File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\bs4\element.py", line 1547, in get
    return self.attrs.get(key, default)
AttributeError: 'NoneType' object has no attribute 'get'

So I patched it with a try/except:

try:
    for img in imgs:
        src = img.get('src', '')
        if base64_pattern.match(src):
            # Replace base64 data with empty string
            img['src'] = base64_pattern.sub('', src)
except Exception:
    pass
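
A narrower guard would skip only the offending <img> nodes instead of aborting the whole loop on the first failure; something like this sketch (base64_pattern here stands in for the one defined in utils.py):

import re
from bs4 import BeautifulSoup

base64_pattern = re.compile(r'data:image/[^;]+;base64,')  # stand-in pattern

def strip_base64_srcs(soup: BeautifulSoup) -> None:
    for img in soup.find_all('img'):
        # Skip nodes whose attributes are gone (this is what triggers the
        # "'NoneType' object has no attribute 'get'" error in the traceback)
        if getattr(img, 'attrs', None) is None:
            continue
        src = img.get('src', '')
        if src and base64_pattern.match(src):
            img['src'] = base64_pattern.sub('', src)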

unclecode (Owner) commented

[screenshot]

@RhonnieAl Sorry for my delayed response. The links you are trying to crawl have very strong bot detection, which is why the browser won't navigate to the page. As for the error message, we made some adjustments in the new version, 0.3.7, so you can update to it and get a better message; I think I'm going to release it within a day or two.

One thing you can always do is set headless to false, so you can see what's happening and get an understanding of what's going on. The screenshot above shows what happens in this case. FYI, you can apply scripts and techniques using the hooks we have in our library before navigating to a page, to fix some of these issues. And if you use the new version, the error message contains useful information for you to try on different websites. Anyway, hopefully this is helpful for you.
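
For example, a minimal headless-off run (a sketch, reusing the failing Reuters URL from above):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # headless=False opens a visible browser window, so you can watch the
    # bot-detection page appear instead of the article
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        result = await crawler.arun(
            url="https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/",
            bypass_cache=True,
        )
        print(result.success, result.error_message)

asyncio.run(main())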

unclecode (Owner) commented

(quoting @DhrubojyotiDey's comment above)

Hi, would you please share your code snippet with me, so I can check it for you?

unclecode (Owner) commented

(quoting @mobyds' comment above)

@mobyds This is interesting. Would you please share the URL that caused this issue? Thx


mobyds commented Oct 17, 2024

https://chantepie.fr/

unclecode (Owner) commented

@mobyds It works for me; perhaps you can share your code with me, as well as your system specs.

[screenshot]

import os
import base64
from crawl4ai import AsyncWebCrawler

__data = "output"  # output directory; defined elsewhere in the original script

async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "https://chantepie.fr/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )

        # Save screenshot to file
        with open(os.path.join(__data, "chantepie.png"), "wb") as f:
            f.write(base64.b64decode(result.screenshot))

        print(result.markdown)
[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://chantepie.fr/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://chantepie.fr/ successfully!
[LOG] 🚀 Crawling done for https://chantepie.fr/, success: True, time taken: 5.08 seconds
[LOG] 🚀 Content extracted for https://chantepie.fr/, success: True, time taken: 0.29 seconds
[LOG] 🔥 Extracting semantic blocks for https://chantepie.fr/, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://chantepie.fr/, time taken: 0.32 seconds.


mobyds commented Oct 17, 2024

It was with WebCrawler, not with AsyncWebCrawler.

DhrubojyotiDey commented

@unclecode Hi,

(quoting @unclecode's request above for the code snippet)

Below is the code snippet I used for extraction. I find the issue most common with Hindustan Times and NDTV; the news block is not getting extracted completely.

import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from google.colab import userdata  # Colab secret store (assumed from the userdata.get usage)

url1 = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
url2 = "https://www.hindustantimes.com/world-news/israelhamas-war-live-updates-palestine-israel-latest-news-hamas-militant-group-attack-101696723677129.html"
urls = [url1, url2]

related_content = []
os.environ['GEMINI_API_KEY'] = userdata.get('my_key')

async def process_urls():
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            # Perform extraction for each URL (note: bypass_cache belongs to
            # arun, not the extraction strategy, and the loop variable `url`
            # should be crawled rather than the fixed `url1`)
            result = await crawler.arun(
                url=url,
                bypass_cache=True,
                extraction_strategy=LLMExtractionStrategy(
                    provider="gemini/gemini-pro",
                    api_token=os.environ['GEMINI_API_KEY'],
                    instruction="Extract only content related to Israel and hamas war and extract URL if available"
                ),
            )

            if result.extracted_content is not None:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    related_content.extend(extracted_data)  # Append extracted data for each URL
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON for {url}: {e}")
                    print(f"Raw extracted content: {result.extracted_content}")  # Debug raw content
            else:
                print(f"No content extracted by the LLM for {url}")

# Execute the asynchronous function
asyncio.run(process_urls())

print(f"Number of related items extracted: {len(related_content)}")
combined_data = [item.get('content') for item in related_content]
print(combined_data)

unclecode (Owner) commented

(quoting @mobyds: "It was with WebCrawler, not with AsyncWebCrawler.")

@mobyds Oh, I see. Yes, I think it's better to switch to async, because I plan to remove the synchronous version very soon. Additionally, I want to cut the dependency on Selenium and stick with Playwright. Anyway, if there are any other issues, don't hesitate to reach out. Thank you for trying our library.

unclecode (Owner) commented

@DhrubojyotiDey I followed the first link you shared here. The page is actually very long. Let me explain how the LLM extraction strategy works: by default there is a chunking stage, meaning that when you pass in the content, it is broken into smaller chunks, and every chunk is sent to the language model in parallel. This is designed for smaller language models, which may not have a long context window; this way we can make the most of them. If you're using a model that supports a long context window, such as Gemini in your code, the best way to handle it is either to turn this feature off or to use a very long chunk length. Here's an example of both approaches; in my case, they work perfectly. I hope this is helpful for you.

import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    extraction_strategy = LLMExtractionStrategy(
            provider='openai/gpt-4o-mini',
            api_token=os.getenv('OPENAI_API_KEY'),
            apply_chunking = False,  # turn chunking off entirely, or...
            # chunk_token_threshold = 2 ** 14  # ...use a very long chunk length (16k tokens)
            instruction="""Extract only content related to Israel and hamas war and extract URL if available"""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        extracted_content = json.loads(result.extracted_content)
        print(extracted_content)

    print("Done")

asyncio.run(main())


mobyds commented Oct 21, 2024

(quoting @unclecode's reply above)

OK, and thanks a lot for this very useful lib

unclecode (Owner) commented

You're welcome @mobyds
