Fix Various Issues and Improve Scrapper Script #162

Open: hoz-efa wants to merge 3 commits into base: master


@hoz-efa hoz-efa commented Jun 13, 2024

Pull Request Summary

New Updates (Commits on Jun 15, 2024):

Error Handling Improvements

  • Improved Error Handling: Added comprehensive error handling in the enrol function to ensure the script continues processing even if an error occurs during the execution. Each critical operation is now wrapped in a try-except block, allowing the script to skip problematic iterations and proceed with the next.
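The skip-and-continue pattern described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `enrol_all` and the `enrol_course` callable are hypothetical names standing in for whatever the enrol function does per course.

```python
def enrol_all(courses, enrol_course):
    """Try every course; record failures instead of aborting the run.

    `enrol_course` is a hypothetical per-course callable (an assumption,
    not a function from the script).
    """
    enrolled, failed = [], []
    for course in courses:
        try:
            enrol_course(course)
        except Exception as exc:
            # Broad on purpose: one bad course must not stop the whole loop.
            failed.append((course, repr(exc)))
            continue
        enrolled.append(course)
    return enrolled, failed
```

Collecting the failures in a list, rather than only printing them, makes it possible to report a summary once the run finishes.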

Known Issues

  • Language Exclusion: The script still subscribes to free courses in languages that are set to be excluded in the settings. For example, courses in Arabic are being subscribed to even though Arabic is set to false in the duce-cli-settings.json. This issue will be addressed in future updates.

Screenshot (17)

Previous Changes (Commits on Jun 15, 2024):

Improve Link Extraction and Error Handling in Scrapers

  • Link Extraction:

    • Improved logic to handle special cases for click.linksynergy.com URLs, ensuring all valid links are captured by checking both murl= and RD_PARM1 parameters.
  • Nonce Extraction and Processing:

    • Corrected the extraction of the JSON string containing the nonce from the script tag in the cv function, and updated the processing to properly isolate and parse the JSON data, ensuring successful AJAX requests for fetching course data.
  • Error Handling:

    • Enhanced error handling across all functions to ensure the script continues processing remaining items if an error occurs, with retries for network requests and handling cases where required elements might not be found.
  • Progress Tracking:

    • Refined progress tracking within each scraper function to provide accurate updates on the scraping process.
  • Threading:

    • Utilized threading to parallelize scraping tasks, ensuring efficient processing of multiple sites.
  • Data Aggregation:

    • Improved the aggregation of scraped data into a unified list, maintaining consistency in the format and structure of the results.
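
As a rough illustration of the link-extraction bullet above, the destination URL can be pulled out of a click.linksynergy.com link by checking the `murl` and `RD_PARM1` query parameters in turn. This is a sketch using only the standard library; the function name `extract_target_url` is hypothetical and the PR's actual logic may differ.

```python
from urllib.parse import parse_qs, urlparse


def extract_target_url(link):
    """Return the destination URL carried by a click.linksynergy.com link.

    Falls back to the original link when it is not a linksynergy URL or
    when neither parameter is present.
    """
    parsed = urlparse(link)
    if parsed.netloc.lower() != "click.linksynergy.com":
        return link
    params = parse_qs(parsed.query)  # parse_qs also percent-decodes values
    for key in ("murl", "RD_PARM1"):
        values = params.get(key)
        if values:
            return values[0]
    return link
```

Returning the input unchanged for unrecognized links keeps the helper safe to apply to every scraped URL.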

These changes collectively improve the overall reliability, efficiency, and functionality of the script.
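
The threading and data-aggregation points above could look something like the sketch below: one thread per site, with a failure in one scraper isolated from the others, and the results merged into a single list. The names `run_scrapers` and `scrapers` are assumptions made for the example, not identifiers from the script.

```python
import threading


def run_scrapers(scrapers):
    """Run each scraper in its own thread and merge the results.

    `scrapers` maps a site name to a zero-argument callable returning a list.
    """
    results, errors = {}, {}
    lock = threading.Lock()

    def worker(name, func):
        try:
            links = func()
        except Exception as exc:
            # A failing site is recorded and does not affect the other threads.
            with lock:
                errors[name] = repr(exc)
            return
        with lock:
            results[name] = links

    threads = [
        threading.Thread(target=worker, args=(name, func))
        for name, func in scrapers.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Merge in the order the sites were given so the output is deterministic.
    combined = [
        link for name in scrapers if name in results for link in results[name]
    ]
    return combined, errors
```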

Previous Changes (Commits on Jun 10, 2024):

  1. Fixed Cloudscraper Session Error:

    • Resolved the error raised when creating a scraper session from the existing session s via cloudscraper.create_scraper(sess=s).
  2. Repaired Scrapers for Multiple Sites:

    • Fixed the scrapers for Disudemy, Coursevania, and iDownloadCoupon. Previously, the script was unable to start due to issues with these scrapers.
  3. Corrected e-Next API Link:

    • Updated the script with the correct link for the e-next API, ensuring proper API interaction.

The script is now functioning correctly.

Screenshot (4) Screenshot (7)

However, there are a couple of minor issues that need attention:

  • The script sometimes keeps retrying. I have attached a screenshot of this behavior for reference.
    Screenshot (5)

  • The tqdm progress bar is slightly glitchy, repeating the name of the website. Despite this, the backend operations work perfectly, so this issue is only with the display of the progress bar.
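
The cause of the repeated retries is not identified here. If it turns out that the retry loop has no upper bound (an assumption, not something confirmed above), one possible mitigation is to cap the number of attempts, as in this sketch. `fetch_with_retries` and its parameters are hypothetical names.

```python
import time


def fetch_with_retries(fetch, max_attempts=3, delay=1.0, sleep=time.sleep):
    """Call `fetch` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(delay * attempt)  # linear backoff between attempts
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` parameter exists only so the delay can be replaced in tests; callers would normally leave it at its default.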

Note: I have only tested these changes with the CLI version and have not verified them with the GUI version.

I am not familiar with the GUI version and have not tried running it, since I mostly use the CLI version. As for PySimpleGUI, you can register as a "Hobbyist" to get a developer key that is valid for a year... check it out here

These are the versions of my libraries from the requirements:

bs4                       0.0.2
cloudscraper              1.2.71
colorama                  0.4.6
html5lib                  1.1
requests                  2.31.0
requests-file             2.0.0
requests-toolbelt         1.0.0
tqdm                      4.66.4

To compare your installed versions with the list above, run this command in PowerShell:
pip list | findstr /R "bs4 requests html5lib cloudscraper pyopenssl browser_cookie3 colorama tqdm"
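
For anyone not on Windows, a cross-platform alternative to the command above is to query the installed versions from Python itself. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_versions` is hypothetical, and the package names are the ones from the command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["bs4", "requests", "html5lib", "cloudscraper", "pyopenssl",
             "browser_cookie3", "colorama", "tqdm"]
    for name, ver in installed_versions(names).items():
        print(f"{name:<20} {ver or 'not installed'}")
```

Unlike `findstr`, which matches substrings and therefore also lists packages such as requests-file, this looks up each name exactly.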

Please review these changes and let me know if any further modifications are needed.

hoz-efa and others added 3 commits June 10, 2024 21:08
Bugs and Fixes
Improved Error Handling: Added comprehensive error handling in the "enrol" function to ensure the script continues processing even if an error occurs during the execution.