
Crawling error #105

Closed · BlackChila opened this issue Sep 27, 2024 · 20 comments
Labels: bug (Something isn't working)

BlackChila commented Sep 27, 2024

Hey, and thanks for this nice package!
I'm having the following issue: some websites are randomly not scraped, while others get scraped correctly. Which websites fail varies randomly from run to run. For the failed websites I get the following error:
[ERROR] 🚫 Failed to crawl https://random-website.com, error: 'NoneType' object has no attribute 'get'.

I save the .html files after scraping, and the websites affected by this bug end up as an HTML file containing just ['', None].

I tried updating all packages and also set up a new conda environment, but it didn't fix the issue. I am using WebCrawler, not the AsyncWebCrawler.
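
For reference, my failing calls look roughly like this (a sketch, assuming the synchronous WebCrawler API from the docs, i.e. warmup() followed by run()):

import time
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(url="https://random-website.com")
if not result.success:
    # This is where the 'NoneType' object has no attribute 'get' error shows up
    print(result.error_message)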

unclecode (Owner) commented

@BlackChila Would you please share a link to a page where you experienced this?

unclecode self-assigned this Sep 28, 2024
unclecode added the bug (Something isn't working) label Sep 28, 2024

BlackChila commented Sep 28, 2024

Hey unclecode, thanks for your answer!
I experience this on multiple pages, and I can access the pages manually, so my IP is not blocked. Whether crawl4ai can access a page also varies randomly: on some runs it succeeds, on others it doesn't, so I don't think it's an issue with the pages themselves. It affects e.g. https://de.wikipedia.org/wiki/Aral or https://www.aldi-nord.de/, but not on every run. In total it affects 20-30% of my crawled websites.

unclecode (Owner) commented

@BlackChila Thx for sharing. Please do us a favor and try the asynchronous method, and let's see whether you get something similar with it. If you still face issues, we'll start a stress test by crawling a set of links and websites to see when such things happen. But first, please try it with the asynchronous crawler and let me know. Thank you.
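
For reference, a minimal async test would look like this (a sketch, reusing one of the URLs and options already shown in this thread):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # bypass_cache=True forces a fresh fetch on every run
        result = await crawler.arun(url="https://de.wikipedia.org/wiki/Aral", bypass_cache=True)
        if result.success:
            print(result.markdown[:500])
        else:
            print(result.error_message)

asyncio.run(main())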


RhonnieAl commented Sep 28, 2024

@unclecode Getting the same NoneType error. Here are the logs:

INFO: Started server process [14039]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
Delaying for 10 seconds...
Resuming...
[LOG] 🕸️ Crawling https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ successfully!
[LOG] 🚀 Crawling done for https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, success: True, time taken: 0.74 seconds
[ERROR] 🚫 Failed to crawl https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str
url='https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/' html='' success=False cleaned_html=None media={} links={} screenshot=None markdown=None extracted_content=None metadata=None error_message='Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str' session_id=None responser_headers=None status_code=None
INFO: 127.0.0.1:62431 - "GET / HTTP/1.1" 200 OK

from fastapi import FastAPI, HTTPException
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from dotenv import load_dotenv
import asyncio
import json


load_dotenv() 

app = FastAPI()

@app.get("/")

async def crawl(url: str = "https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/"):
    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            
            # Introduce Delay
            print("Delaying for 10 seconds...")
            await asyncio.sleep(10)
            print("Resuming...")

            # Extract data
            result = await crawler.arun(url=url, bypass_cache=True)
            
            # Return data
            print(result)
            return result.dict()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)



BlackChila (Author) commented

Thanks @RhonnieAl for posting the same issue with the asynchronous method here!


ojaros commented Sep 30, 2024

Getting the same error when trying to crawl a Notion site:

[ERROR] 🚫 Failed to crawl, error: Failed to extract content from the website, error: can only concatenate str (not "NoneType") to str

xansrnitu commented

+1. I am encountering this same issue.

theguy000 added a commit to theguy000/crawl4ai that referenced this issue Oct 3, 2024
Related to unclecode#105

Fix the 'NoneType' object has no attribute 'get' error in `AsyncWebCrawler`.

* **crawl4ai/async_webcrawler.py**
  - Add a check in the `arun` method to ensure `html` is not `None` before further processing.
  - Raise a descriptive error if `html` is `None`.

* **crawl4ai/async_crawler_strategy.py**
  - Add a check in the `crawl` method of the `AsyncPlaywrightCrawlerStrategy` class to handle cases where `html` is `None`.
  - Raise a descriptive error if `html` is `None`.

* **tests/async/test_basic_crawling.py**
  - Add a test case to verify handling of `None` values for the `html` variable in the `test_invalid_url` function.

* **tests/async/test_error_handling.py**
  - Add a test case to verify handling of `None` values for the `html` variable in the `test_network_error` function.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/unclecode/crawl4ai/issues/105?shareId=XXXX-XXXX-XXXX-XXXX).
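
A guard along the lines this commit describes, sketched as a standalone helper rather than the actual diff:

from typing import Optional

def ensure_html(url: str, html: Optional[str]) -> str:
    # Fail fast with a descriptive error instead of letting a
    # 'NoneType' error surface later during content extraction.
    # (Hypothetical helper, not the actual crawl4ai code.)
    if html is None:
        raise ValueError(f"Failed to crawl {url}: the crawler strategy returned no HTML")
    return html
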
DhrubojyotiDey commented

@unclecode Thanks for the amazing library. I'm having a bit of trouble understanding the error I'm getting: in the run below, the library fails to crawl a news website for a specific news topic. I have used the Gemini API for authentication. The error output is as follows:
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 4.51 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.18 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 0
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 1
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 2
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 4.68 seconds.
Number of related items extracted: 0
[]

I have a few questions in mind.

  1. Why does this occur with every news site? (I am building a project that extracts news topics from given news sites.)
  2. On which sites has it worked previously? I tried the example from the library docs, "https://www.nbcnews.com/business", and got the same result.


mobyds commented Oct 17, 2024

I have the same error when using WebCrawler.

The problem is in the file utils.py:

File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\crawl4ai\utils.py", line 694, in get_content_of_website_optimized
    src = img.get('src', '')
          ^^^^^^^^^^^^^^^^^^
File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\bs4\element.py", line 1547, in get
    return self.attrs.get(key, default)
AttributeError: 'NoneType' object has no attribute 'get'

So I patched it with a try/except:

try:
    for img in imgs:
        src = img.get('src', '')
        if base64_pattern.match(src):
            # Replace base64 data with empty string
            img['src'] = base64_pattern.sub('', src)
except Exception:
    pass
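
A narrower guard would skip only the offending <img> nodes instead of aborting the whole loop on the first failure; something like this sketch (base64_pattern here stands in for the one defined in utils.py):

import re
from bs4 import BeautifulSoup

base64_pattern = re.compile(r'data:image/[^;]+;base64,')  # stand-in pattern

def strip_base64_srcs(soup: BeautifulSoup) -> None:
    for img in soup.find_all('img'):
        # Skip nodes whose attributes are gone (this is what triggers the
        # "'NoneType' object has no attribute 'get'" error in the traceback)
        if getattr(img, 'attrs', None) is None:
            continue
        src = img.get('src', '')
        if src and base64_pattern.match(src):
            img['src'] = base64_pattern.sub('', src)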

unclecode (Owner) commented

[screenshot]

@RhonnieAl Sorry for my delayed response. The links you are trying to crawl have very strong bot detection, which is why the browser won't navigate to the page. As for the error message, we made some adjustments in the new version, 0.3.7, so you can update to it and get a better message; I think I'm going to release it within a day or two.

One thing you can always do is set headless to false, so you can see what's happening and get an understanding of what's going on. The screenshot above shows what happens in this case. FYI, you can apply scripts and techniques using the hooks we have in our library before navigating to a page, to fix some of these issues. And if you use the new version, the error message contains useful information for you to try on different websites. Anyway, hopefully this is helpful for you.
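
For example, a minimal headless-off run (a sketch, reusing the failing Reuters URL from above):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # headless=False opens a visible browser window, so you can watch the
    # bot-detection page appear instead of the article
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        result = await crawler.arun(
            url="https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/",
            bypass_cache=True,
        )
        print(result.success, result.error_message)

asyncio.run(main())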

unclecode (Owner) commented

(quoting @DhrubojyotiDey's comment above)

Hi, would you please share your code snippet with me, so I can check it for you?

unclecode (Owner) commented

(quoting @mobyds' comment above)

@mobyds This is interesting. Would you please share the URL that caused this issue? Thx


mobyds commented Oct 17, 2024

https://chantepie.fr/

unclecode (Owner) commented

@mobyds It works for me; perhaps you can share your code with me, as well as your system specs.

[screenshot]

import os
import base64
from crawl4ai import AsyncWebCrawler

__data = "output"  # output directory; defined elsewhere in the original script

async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "https://chantepie.fr/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )

        # Save screenshot to file
        with open(os.path.join(__data, "chantepie.png"), "wb") as f:
            f.write(base64.b64decode(result.screenshot))

        print(result.markdown)
[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://chantepie.fr/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://chantepie.fr/ successfully!
[LOG] 🚀 Crawling done for https://chantepie.fr/, success: True, time taken: 5.08 seconds
[LOG] 🚀 Content extracted for https://chantepie.fr/, success: True, time taken: 0.29 seconds
[LOG] 🔥 Extracting semantic blocks for https://chantepie.fr/, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://chantepie.fr/, time taken: 0.32 seconds.


mobyds commented Oct 17, 2024

It was with WebCrawler, not with AsyncWebCrawler.

DhrubojyotiDey commented

@unclecode Hi,

(quoting @unclecode's request above for the code snippet)

Below is the code snippet I used for extraction. I find the issue most common with Hindustan Times and NDTV; the news block is not getting extracted completely.

import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from google.colab import userdata  # Colab secret store (assumed from the userdata.get usage)

url1 = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
url2 = "https://www.hindustantimes.com/world-news/israelhamas-war-live-updates-palestine-israel-latest-news-hamas-militant-group-attack-101696723677129.html"
urls = [url1, url2]

related_content = []
os.environ['GEMINI_API_KEY'] = userdata.get('my_key')

async def process_urls():
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            # Perform extraction for each URL (note: bypass_cache belongs to
            # arun, not the extraction strategy, and the loop variable `url`
            # should be crawled rather than the fixed `url1`)
            result = await crawler.arun(
                url=url,
                bypass_cache=True,
                extraction_strategy=LLMExtractionStrategy(
                    provider="gemini/gemini-pro",
                    api_token=os.environ['GEMINI_API_KEY'],
                    instruction="Extract only content related to Israel and hamas war and extract URL if available"
                ),
            )

            if result.extracted_content is not None:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    related_content.extend(extracted_data)  # Append extracted data for each URL
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON for {url}: {e}")
                    print(f"Raw extracted content: {result.extracted_content}")  # Debug raw content
            else:
                print(f"No content extracted by the LLM for {url}")

# Execute the asynchronous function
asyncio.run(process_urls())

print(f"Number of related items extracted: {len(related_content)}")
combined_data = [item.get('content') for item in related_content]
print(combined_data)

unclecode (Owner) commented

(quoting @mobyds: "It was with WebCrawler, not with AsyncWebCrawler.")

@mobyds Oh, I see. Yes, I think it's better to switch to async, because I plan to remove the synchronous version very soon. Additionally, I want to cut the dependency on Selenium and stick with Playwright. Anyway, if there are any other issues, don't hesitate to reach out. Thank you for trying our library.

unclecode (Owner) commented

@DhrubojyotiDey I followed the first link you shared here. The page is actually very long. Let me explain how the LLM extraction strategy works: by default there is a chunking stage, meaning that when you pass in the content, it is broken into smaller chunks, and every chunk is sent to the language model in parallel. This is designed for smaller language models, which may not have a long context window; this way we can make the most of them. If you're using a model that supports a long context window, such as Gemini in your code, the best way to handle it is either to turn this feature off or to use a very long chunk length. Here's an example of both approaches; in my case, they work perfectly. I hope this is helpful for you.

import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    extraction_strategy = LLMExtractionStrategy(
            provider='openai/gpt-4o-mini',
            api_token=os.getenv('OPENAI_API_KEY'),
            apply_chunking = False,  # turn chunking off entirely, or...
            # chunk_token_threshold = 2 ** 14  # ...use a very long chunk length (16k tokens)
            instruction="""Extract only content related to Israel and hamas war and extract URL if available"""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        extracted_content = json.loads(result.extracted_content)
        print(extracted_content)

    print("Done")

asyncio.run(main())


mobyds commented Oct 21, 2024

(quoting @unclecode's reply above)

OK, and thanks a lot for this very useful lib

unclecode (Owner) commented

You're welcome @mobyds
