Extracting URL doesn't work if the page is non-html or invalid #4

Open
cuinjune opened this issue May 4, 2023 · 1 comment

cuinjune commented May 4, 2023

@alex000kim If you provide a raw GitHub README link, URL extraction doesn't work because the page is a plain string rather than HTML (extract() doesn't work on it).
Also, if the URL is invalid, the Slackbot responds: "I can't provide a response. Encountered an error: 'NoneType' object has no attribute 'lower'".

I was able to fix these issues by using the following updated code:

def is_html(content):
    content_start = content.lower().strip()[:15]
    return content_start.startswith("<!doctype html>") or content_start.startswith("<html>")

def augment_user_message(user_message, url_list):
    all_url_content = ''
    for url in url_list:
        downloaded = fetch_url(url)
        if downloaded is None:
            return user_message
        # Check if the content is HTML, then use extract() to clean and extract the main text content
        if is_html(downloaded):
            url_content = extract(downloaded, config=newconfig)
        else:
            url_content = downloaded
        user_message = user_message.replace(f'<{url}>', '')
        all_url_content = all_url_content + f' Contents of {url} : \n """ {url_content} """'
    user_message = user_message + "\n" + all_url_content
    return user_message

Please consider applying these changes to your code, or feel free to use a better solution if you know of any. I just wanted to share this with you.
Thank you so much for your great work by the way! :)
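
For anyone adapting the snippet above, here is a slightly more defensive sketch of the same idea. It rests on a few assumptions that are not part of the original code: the `fetch` and `extract_text` parameters are hypothetical stand-ins for `fetch_url` and `extract(..., config=newconfig)`, passed in so the example runs on its own; the HTML check also accepts tags with attributes such as `<html lang="en">`; and a failed download skips only that URL instead of returning early, so content already fetched from other URLs is kept.

```python
from typing import Callable, Iterable, Optional


def is_html(content: str) -> bool:
    """Heuristically decide whether a downloaded string looks like an HTML page."""
    head = content.lstrip()[:256].lower()
    # Prefix match so "<html lang=...>" and longer doctypes are accepted too.
    return head.startswith("<!doctype html") or head.startswith("<html")


def augment_user_message(
    user_message: str,
    url_list: Iterable[str],
    fetch: Callable[[str], Optional[str]],         # hypothetical stand-in for fetch_url
    extract_text: Callable[[str], Optional[str]],  # hypothetical stand-in for extract
) -> str:
    sections = []
    for url in url_list:
        downloaded = fetch(url)
        if downloaded is None:
            # Invalid or unreachable URL: skip it, keep processing the rest.
            continue
        if is_html(downloaded):
            # Fall back to the raw page if the extractor returns nothing.
            text = extract_text(downloaded) or downloaded
        else:
            # Plain text (e.g. a raw README) is used as-is.
            text = downloaded
        user_message = user_message.replace(f"<{url}>", "")
        sections.append(f' Contents of {url} : \n """ {text} """')
    if not sections:
        return user_message
    return user_message + "\n" + "".join(sections)
```

With this variant, a message that mixes one valid HTML page, one raw text file, and one invalid URL still gets the first two appended, and the invalid URL is left in the message untouched.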

@alex000kim
Owner

@cuinjune thanks, please create a PR with this change
