WebScraper refactor into scrapeURL #714

Merged: 79 commits, merged into main on Nov 7, 2024

Conversation

@mogery (Member) commented Sep 28, 2024

Directives:

  • reduce state
    • stateless, functional programming paradigms to reduce debugging complexity
      • state that is required (e.g. current logger) is passed in an immutable meta object
    • sovereign modules that do not see the whole state (see e.g. engines/fire-engine/scrape.ts, which only receives a logger, not the whole meta object; a minimal sketch follows this list)
  • make the signal flow clear to ease debugging
    • intense verbosity in logging
    • modularity, make it clear where to add things in the future, make it easy to add things in the future without breaking stuff
      • define generic modules that can be implemented and appended to later (e.g. transformers, engines)
  • better error handling
    • using a Rust-like error model: exceptions are thrown freely instead of wrapping failures in {success: false, error: ...} objects
    • errors are always re-thrown with their original metadata (e.g. stack) intact. when retrying with a limit, previous errors are passed along via the cause property.
    • we may never swallow an error.
      • at points where errors are not directly re-thrown but are instead collected into an object/array (e.g. retry logic / EngineResultsTracker), unexpected errors should be explicitly logged and reported via Sentry.captureException
    • errors are only converted into a {success: false, ...} result object at the top level of scrapeURL, in order to avoid breaking other parts of the codebase; the original error is passed along in the error metadata
    • never determine what an error is by checking its message -- if you need a specific error that other parts of the codebase can detect, create a custom error class and use instanceof -- see error.ts for reference (a sketch of this pattern follows this list)
  • standalone
    • scrapeURL should never (even attempt to!) interface with the database. It should be its own standalone thing that could even be lifted out of firecrawl as a whole. To keep it fast, reliable, and maintainable, we need to keep its footprint minimal -- DB code can be handled by the surrounding components that are already tangled up in it anyway (e.g. queue-worker.ts)
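
The sketch below illustrates the "reduce state" directive: an immutable meta object owned by the orchestrator, with sovereign modules that only receive what they need. It is a simplified, hypothetical stand-in; the Meta fields, Logger shape, and hypotheticalEngineScrape are not the repo's actual types.

```ts
// Sketch only: simplified stand-ins for scrapeURL's meta object and logger.
// Names and fields are illustrative, not the actual ones from the repo.

interface Logger {
  debug(msg: string, extra?: Record<string, unknown>): void;
  child(bindings: Record<string, unknown>): Logger;
}

// Tiny console-backed logger so the sketch runs on its own.
function makeLogger(bindings: Record<string, unknown> = {}): Logger {
  return {
    debug: (msg, extra) => console.debug(msg, { ...bindings, ...extra }),
    child: (more) => makeLogger({ ...bindings, ...more }),
  };
}

// Required state (e.g. the current logger) is passed in an immutable meta object.
type Meta = Readonly<{
  id: string;     // id for this scrape, used to correlate log lines
  url: string;    // the URL being scraped
  logger: Logger; // pre-scoped logger carried through the pipeline
}>;

// A "sovereign" engine module never sees the whole meta object --
// it is handed only what it needs (here, just the URL and a logger).
async function hypotheticalEngineScrape(url: string, logger: Logger): Promise<string> {
  logger.debug("fetching", { url });
  const res = await fetch(url);
  return await res.text();
}

// The orchestrator owns the meta object and decides what each module sees.
async function runEngine(meta: Meta): Promise<string> {
  const engineLogger = meta.logger.child({ module: "engine", scrapeId: meta.id });
  return hypotheticalEngineScrape(meta.url, engineLogger);
}
```

Keeping modules blind to the full meta object makes it possible to test or swap an engine without touching the rest of the pipeline.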
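
And a sketch of the error-handling directives, again with hypothetical names: TimeoutError, RetryLimitError, and scrapeURLTopLevel illustrate the pattern and are not the classes actually defined in error.ts.

```ts
// Sketch only: illustrates the error model described above, not the actual
// classes in error.ts. All names here are hypothetical.

class TimeoutError extends Error {
  constructor(message: string, options?: { cause?: unknown }) {
    super(message, options);
    this.name = "TimeoutError";
  }
}

class RetryLimitError extends Error {
  constructor(message: string, options?: { cause?: unknown }) {
    super(message, options);
    this.name = "RetryLimitError";
  }
}

// Errors are re-thrown with the original metadata intact; when the retry
// limit is hit, the previous error travels along via the `cause` property.
async function withRetries<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
    }
  }
  throw new RetryLimitError(`failed after ${attempts} attempts`, { cause: lastError });
}

async function scrapeOnce(url: string): Promise<string> {
  const res = await fetch(url, { signal: AbortSignal.timeout(10_000) }).catch((err) => {
    // Wrap in a typed error so callers can use instanceof, never message matching.
    throw new TimeoutError(`request to ${url} failed or timed out`, { cause: err });
  });
  return await res.text();
}

// Only at the very top level is a thrown error converted into a
// { success: false, ... } object, so other parts of the codebase keep working.
async function scrapeURLTopLevel(url: string) {
  try {
    const content = await withRetries(() => scrapeOnce(url), 3);
    return { success: true as const, content };
  } catch (error) {
    if (!(error instanceof TimeoutError) && !(error instanceof RetryLimitError)) {
      // Unexpected error: never swallow it -- log and report it
      // (e.g. via Sentry.captureException) before returning.
      console.error("unexpected scrape error", error);
    }
    return { success: false as const, error };
  }
}
```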

@nickscamara (Member) commented Oct 3, 2024

  • Add sb
  • Integrate w/ v1
  • Make crawl not crash if scrapeURL throws
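
A rough sketch of the last point, under the assumption of a hypothetical crawl loop (crawlAll and the result shape are not the repo's actual code): a single page throwing should be recorded, not crash the whole crawl.

```ts
// Sketch only: a crawl loop that survives individual scrapeURL failures.
// crawlAll and the result shape are hypothetical, not the repo's API.
async function crawlAll(
  urls: string[],
  scrapeURL: (url: string) => Promise<unknown>,
) {
  const results: { url: string; ok: boolean; error?: unknown }[] = [];
  for (const url of urls) {
    try {
      await scrapeURL(url);
      results.push({ url, ok: true });
    } catch (error) {
      // One failing page is recorded; the crawl keeps going.
      results.push({ url, ok: false, error });
    }
  }
  return results;
}
```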

@nickscamara (Member) commented Nov 6, 2024

  • Reminder for Nick and Rafa to test scrape events + dashboard logging

@nickscamara (Member) commented Nov 7, 2024

  • Redlock for emails; the same issue exists in main right now. If someone is doing a big crawl, it will send them multiple emails. (see the sketch below)
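
One possible shape for the Redlock fix, assuming ioredis and node-redlock; the key names, TTLs, and sendLimitEmail helper are hypothetical, not the actual implementation.

```ts
// Sketch only: one way Redlock could prevent duplicate limit emails during a
// big crawl. Key names, TTLs, and sendLimitEmail are hypothetical.
import Redis from "ioredis";
import Redlock from "redlock";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const redlock = new Redlock([redis], { retryCount: 0 }); // don't queue up competing senders

async function maybeSendLimitEmail(
  teamId: string,
  sendLimitEmail: (teamId: string) => Promise<void>,
) {
  const lockKey = `locks:limit-email:${teamId}`;
  const sentKey = `sent:limit-email:${teamId}`;

  let lock;
  try {
    lock = await redlock.acquire([lockKey], 10_000); // hold the lock for at most 10s
  } catch {
    return; // another worker holds the lock; it will send (or already sent) the email
  }

  try {
    // Send at most once per window, even across many concurrent crawl jobs.
    if (await redis.get(sentKey)) return;
    await sendLimitEmail(teamId);
    await redis.set(sentKey, "1", "EX", 60 * 60); // suppress repeats for an hour
  } finally {
    await lock.release();
  }
}
```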

@mogery merged commit 8d467c8 into main on Nov 7, 2024
1 of 2 checks passed