WebScraper refactor into scrapeURL #714

Merged: 79 commits, merged into main on Nov 7, 2024

Conversation

@mogery (Member) commented Sep 28, 2024

Directives:

  • reduce state
    • stateless, functional programming paradigms to reduce debugging complexity
      • state that is required (e.g. current logger) is passed in an immutable meta object
    • sovereign modules that do not see the whole state (see e.g. engines/fire-engine/scrape.ts, which only receives a logger, not the whole meta object; a minimal sketch follows this list)
  • make the signal flow clear to ease debugging
    • intense verbosity in logging
    • modularity, make it clear where to add things in the future, make it easy to add things in the future without breaking stuff
      • define generic modules that can be implemented and appended to later (e.g. transformers, engines)
  • better error handling
    • using a Rust-like error model: exceptions are thrown freely instead of wrapping failures in {success: false, error: ...} objects
    • errors are always re-thrown with their original metadata (e.g. stack) intact. when retrying with a limit, previous errors are passed along via the cause property.
    • we may never swallow an error.
      • at points where errors are not directly re-thrown but are instead collected into an object/array (e.g. retry logic / EngineResultsTracker), unexpected errors should be explicitly logged and reported via Sentry.captureException
    • errors are only converted into a {success: false, ...} result object at the top level of scrapeURL, in order to avoid breaking other parts of the codebase; the original error is passed along in the error metadata
    • never determine what an error is by checking its message -- if you need a specific error that other parts of the codebase can detect, create a custom error class and use instanceof -- see error.ts for reference (a sketch of this pattern follows this list)
  • standalone
    • scrapeURL should never (even attempt to!) interface with the database. It should be its own standalone thing that could even be lifted out of firecrawl as a whole. To keep it fast, reliable, and maintainable, we need to keep its footprint minimal -- DB code can be handled by the surrounding components that are already tangled up in it anyway (e.g. queue-worker.ts)
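
The sketch below illustrates the "reduce state" directive: an immutable meta object owned by the orchestrator, with sovereign modules that only receive what they need. It is a simplified, hypothetical stand-in; the Meta fields, Logger shape, and hypotheticalEngineScrape are not the repo's actual types.

```ts
// Sketch only: simplified stand-ins for scrapeURL's meta object and logger.
// Names and fields are illustrative, not the actual ones from the repo.

interface Logger {
  debug(msg: string, extra?: Record<string, unknown>): void;
  child(bindings: Record<string, unknown>): Logger;
}

// Tiny console-backed logger so the sketch runs on its own.
function makeLogger(bindings: Record<string, unknown> = {}): Logger {
  return {
    debug: (msg, extra) => console.debug(msg, { ...bindings, ...extra }),
    child: (more) => makeLogger({ ...bindings, ...more }),
  };
}

// Required state (e.g. the current logger) is passed in an immutable meta object.
type Meta = Readonly<{
  id: string;     // id for this scrape, used to correlate log lines
  url: string;    // the URL being scraped
  logger: Logger; // pre-scoped logger carried through the pipeline
}>;

// A "sovereign" engine module never sees the whole meta object --
// it is handed only what it needs (here, just the URL and a logger).
async function hypotheticalEngineScrape(url: string, logger: Logger): Promise<string> {
  logger.debug("fetching", { url });
  const res = await fetch(url);
  return await res.text();
}

// The orchestrator owns the meta object and decides what each module sees.
async function runEngine(meta: Meta): Promise<string> {
  const engineLogger = meta.logger.child({ module: "engine", scrapeId: meta.id });
  return hypotheticalEngineScrape(meta.url, engineLogger);
}
```

Keeping modules blind to the full meta object makes it possible to test or swap an engine without touching the rest of the pipeline.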
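
And a sketch of the error-handling directives, again with hypothetical names: TimeoutError, RetryLimitError, and scrapeURLTopLevel illustrate the pattern and are not the classes actually defined in error.ts.

```ts
// Sketch only: illustrates the error model described above, not the actual
// classes in error.ts. All names here are hypothetical.

class TimeoutError extends Error {
  constructor(message: string, options?: { cause?: unknown }) {
    super(message, options);
    this.name = "TimeoutError";
  }
}

class RetryLimitError extends Error {
  constructor(message: string, options?: { cause?: unknown }) {
    super(message, options);
    this.name = "RetryLimitError";
  }
}

// Errors are re-thrown with the original metadata intact; when the retry
// limit is hit, the previous error travels along via the `cause` property.
async function withRetries<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
    }
  }
  throw new RetryLimitError(`failed after ${attempts} attempts`, { cause: lastError });
}

async function scrapeOnce(url: string): Promise<string> {
  const res = await fetch(url, { signal: AbortSignal.timeout(10_000) }).catch((err) => {
    // Wrap in a typed error so callers can use instanceof, never message matching.
    throw new TimeoutError(`request to ${url} failed or timed out`, { cause: err });
  });
  return await res.text();
}

// Only at the very top level is a thrown error converted into a
// { success: false, ... } object, so other parts of the codebase keep working.
async function scrapeURLTopLevel(url: string) {
  try {
    const content = await withRetries(() => scrapeOnce(url), 3);
    return { success: true as const, content };
  } catch (error) {
    if (!(error instanceof TimeoutError) && !(error instanceof RetryLimitError)) {
      // Unexpected error: never swallow it -- log and report it
      // (e.g. via Sentry.captureException) before returning.
      console.error("unexpected scrape error", error);
    }
    return { success: false as const, error };
  }
}
```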

@nickscamara (Member) commented Oct 3, 2024

  • Add sb
  • Integrate w/ v1
  • Make crawl not crash if scrapeURL throws
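
A rough sketch of the last point, under the assumption of a hypothetical crawl loop (crawlAll and the result shape are not the repo's actual code): a single page throwing should be recorded, not crash the whole crawl.

```ts
// Sketch only: a crawl loop that survives individual scrapeURL failures.
// crawlAll and the result shape are hypothetical, not the repo's API.
async function crawlAll(
  urls: string[],
  scrapeURL: (url: string) => Promise<unknown>,
) {
  const results: { url: string; ok: boolean; error?: unknown }[] = [];
  for (const url of urls) {
    try {
      await scrapeURL(url);
      results.push({ url, ok: true });
    } catch (error) {
      // One failing page is recorded; the crawl keeps going.
      results.push({ url, ok: false, error });
    }
  }
  return results;
}
```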

@nickscamara (Member) commented Nov 6, 2024

  • Reminder for Nick and Rafa to test scrape events + dashboard logging

@nickscamara (Member) commented Nov 7, 2024

  • Redlock for emails; the same issue exists in main right now. If someone is doing a big crawl, it will send them multiple emails. (see the sketch below)
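
One possible shape for the Redlock fix, assuming ioredis and node-redlock; the key names, TTLs, and sendLimitEmail helper are hypothetical, not the actual implementation.

```ts
// Sketch only: one way Redlock could prevent duplicate limit emails during a
// big crawl. Key names, TTLs, and sendLimitEmail are hypothetical.
import Redis from "ioredis";
import Redlock from "redlock";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const redlock = new Redlock([redis], { retryCount: 0 }); // don't queue up competing senders

async function maybeSendLimitEmail(
  teamId: string,
  sendLimitEmail: (teamId: string) => Promise<void>,
) {
  const lockKey = `locks:limit-email:${teamId}`;
  const sentKey = `sent:limit-email:${teamId}`;

  let lock;
  try {
    lock = await redlock.acquire([lockKey], 10_000); // hold the lock for at most 10s
  } catch {
    return; // another worker holds the lock; it will send (or already sent) the email
  }

  try {
    // Send at most once per window, even across many concurrent crawl jobs.
    if (await redis.get(sentKey)) return;
    await sendLimitEmail(teamId);
    await redis.set(sentKey, "1", "EX", 60 * 60); // suppress repeats for an hour
  } finally {
    await lock.release();
  }
}
```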

@mogery merged commit 8d467c8 into main on Nov 7, 2024
1 of 2 checks passed