WebScraper refactor into scrapeURL #714

Merged Nov 7, 2024 (79 commits)
Changes from 54 commits
Commits
7732c9d
feat: use strictNullChecking
mogery Sep 28, 2024
76c082a
feat: switch logger to Winston
mogery Sep 28, 2024
260c538
feat(scrapeURL): first batch
mogery Sep 28, 2024
cda065f
fix(scrapeURL): error swallow
mogery Sep 28, 2024
1c5a29c
fix(scrapeURL): add timeout to EngineResultsTracker
mogery Sep 28, 2024
775d994
fix(scrapeURL): report unexpected error to sentry
mogery Sep 28, 2024
4204223
chore: remove unused modules
mogery Sep 28, 2024
73ea367
feat(transfomers/coerce): warn when a format's response is missing
mogery Sep 29, 2024
d7755af
feat(scrapeURL): feature flag priorities, engine quality sorting, PDF…
mogery Sep 29, 2024
1576177
(add note)
mogery Sep 29, 2024
7253a50
feat(scrapeURL): wip readme
mogery Sep 30, 2024
5a115fe
feat(scrapeURL): LLM extract
mogery Sep 30, 2024
0e6c6b7
feat(scrapeURL): better warnings
mogery Oct 1, 2024
8adf50b
fix(scrapeURL/engines/fire-engine;playwright): fix screenshot
mogery Oct 4, 2024
236a4e7
feat(scrapeURL): add forceEngine internal option
mogery Oct 4, 2024
fc490b9
feat(scrapeURL/engines): scrapingbee
mogery Oct 4, 2024
29ca8ce
feat(scrapeURL/transformars): uploadScreenshot
mogery Oct 4, 2024
3355574
feat(scrapeURL): more intense tests
mogery Oct 4, 2024
33cde05
bunch of stuff
mogery Oct 4, 2024
5913437
Merge branch 'main' into mog/webscraper-refactor
mogery Oct 28, 2024
4ee3dec
get rid of WebScraper (mostly)
mogery Oct 28, 2024
94f0cf2
adapt batch scrape
mogery Oct 28, 2024
55236fb
add staging deploy workflow
mogery Oct 28, 2024
e0519ae
fix yaml
mogery Oct 28, 2024
3ee8cc7
fix logger issues
mogery Oct 28, 2024
64bb7ef
fix v1 test schema
mogery Oct 28, 2024
b866a18
Merge branch 'main' into mog/webscraper-refactor
mogery Oct 29, 2024
4a68289
feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions
mogery Oct 29, 2024
a74e404
scrapeURL: v0 backwards compat
mogery Oct 29, 2024
cd8a895
logger fixes
mogery Oct 29, 2024
13b3030
Merge branch 'main' into mog/webscraper-refactor
mogery Nov 4, 2024
136a3b5
feat(scrapeurl): v0 returnOnlyUrls support
mogery Nov 4, 2024
0a1cd5d
fix(scrapeURL/v0): URL leniency
mogery Nov 4, 2024
43f1c1a
fix(batch-scrape): ts non-nullable
mogery Nov 4, 2024
d41b2d8
fix(scrapeURL/fire-engine/chromecdp): fix wait action
mogery Nov 5, 2024
262e733
fix(logger): remove error debug key
mogery Nov 5, 2024
bc64ae3
feat(requests.http): use dotenv expression
mogery Nov 5, 2024
2a96717
fix(scrapeURL/extractMetadata): extract custom metadata
mogery Nov 5, 2024
cd53432
fix crawl option conversion
mogery Nov 5, 2024
8b69ccb
feat(scrapeURL): Add retry logic to robustFetch
mogery Nov 5, 2024
9144dba
fix(scrapeURL): crawl stuff
mogery Nov 5, 2024
6ba51aa
fix(scrapeURL): LLM extract
mogery Nov 5, 2024
7a1cf43
fix(scrapeURL/v0): search fix
mogery Nov 5, 2024
3f623fc
fix(tests/v0): grant larger response size to v0 crawl status
mogery Nov 5, 2024
fdec4e8
feat(scrapeURL): basic fetch engine
mogery Nov 5, 2024
96beff8
feat(scrapeURL): playwright engine
mogery Nov 5, 2024
e5385e6
Merge branch 'main' into mog/webscraper-refactor
mogery Nov 5, 2024
5e2124c
feat(scrapeURL): add url-specific parameters
mogery Nov 5, 2024
ed5a0d3
Update readme and examples
ericciarla Nov 5, 2024
0f208fa
added e2e tests for most parameters. Still a few actions, location an…
rafaelmmiller Nov 6, 2024
a539ad7
Merge remote-tracking branch 'origin/mog/webscraper-refactor' into te…
nickscamara Nov 6, 2024
9b271d7
fixed type
rafaelmmiller Nov 6, 2024
621df86
Nick:
nickscamara Nov 6, 2024
8151107
Update scrape.ts
nickscamara Nov 6, 2024
48025e4
Update index.ts
nickscamara Nov 6, 2024
3a374b2
added actions and base64 check
rafaelmmiller Nov 6, 2024
13b4eea
Merge branch 'test/e2e-tests-for-all-parameters' of https://github.co…
rafaelmmiller Nov 6, 2024
8616fe6
Nick: skipTls feature flag?
nickscamara Nov 6, 2024
3bd2790
403
rafaelmmiller Nov 6, 2024
1e098fe
todo
rafaelmmiller Nov 6, 2024
ec78240
todo
rafaelmmiller Nov 6, 2024
66a6f91
fixes
mogery Nov 6, 2024
7a54291
yeet headers from url specific params
mogery Nov 6, 2024
be40dcb
add warning when final engine has feature deficit
mogery Nov 6, 2024
461eda8
expose engine results tracker for ScrapeEvents implementation
mogery Nov 6, 2024
49801ac
ingest scrape events
mogery Nov 7, 2024
cc89094
Merge branch 'test/e2e-tests-for-all-parameters' into mog/webscraper-…
nickscamara Nov 7, 2024
692f42c
fixed some tests
rafaelmmiller Nov 7, 2024
40d8882
comment
rafaelmmiller Nov 7, 2024
7949ac2
Update index.test.ts
nickscamara Nov 7, 2024
054b73b
fixed rawHtml
rafaelmmiller Nov 7, 2024
3db2212
Merge branch 'mog/webscraper-refactor' of https://github.com/mendable…
rafaelmmiller Nov 7, 2024
e1806ac
Update index.test.ts
nickscamara Nov 7, 2024
8641829
update comments
mogery Nov 7, 2024
7198a28
move geolocation to global f-e option, fix removeBase64Images
mogery Nov 7, 2024
a02c42a
Nick:
nickscamara Nov 7, 2024
f9e775a
Merge branch 'mog/webscraper-refactor' of https://github.com/mendable…
nickscamara Nov 7, 2024
ec0542e
trim url-specific params
mogery Nov 7, 2024
b9e732b
Update index.ts
nickscamara Nov 7, 2024
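
Commit 8b69ccb in the list above adds retry logic to robustFetch, but that part of the diff is not included in this excerpt. As a rough illustration of the general technique only, here is a minimal sketch of a retry wrapper with exponential backoff. The names `withRetry`, `RetryOptions`, `tryCount`, and `baseDelayMs` are hypothetical and are not taken from the PR; the actual robustFetch may be structured quite differently.

```typescript
// Hypothetical helper, not the PR's robustFetch: it only illustrates
// the "retry a failing async operation" pattern.
type RetryOptions = {
  tryCount: number; // total attempts, including the first one
  baseDelayMs: number; // wait before the second attempt; doubles after each failure
};

async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { tryCount, baseDelayMs } = options;
  if (tryCount < 1) {
    throw new RangeError("tryCount must be at least 1");
  }

  let lastError: unknown;
  for (let attempt = 1; attempt <= tryCount; attempt++) {
    try {
      // Success on any attempt returns immediately.
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Only sleep if another attempt is still coming.
      if (attempt < tryCount) {
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

A caller would wrap whatever request it makes, for example `withRetry(() => fetch(url), { tryCount: 3, baseDelayMs: 500 })`. Rethrowing the last error rather than swallowing it matches the intent of the earlier "error swallow" fix in cda065f, though that link is an assumption about intent, not something the diff shows.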
2 changes: 0 additions & 2 deletions .github/archive/js-sdk.yml
@@ -8,7 +8,6 @@ env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
HOST: ${{ secrets.HOST }}
LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
@@ -21,7 +20,6 @@ env:
SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
HYPERDX_API_KEY: ${{ secrets.HYPERDX_API_KEY }}
HDX_NODE_BETA_MODE: 1

jobs:
2 changes: 0 additions & 2 deletions .github/archive/python-sdk.yml
@@ -8,7 +8,6 @@ env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
HOST: ${{ secrets.HOST }}
LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
@@ -21,7 +20,6 @@ env:
SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
HYPERDX_API_KEY: ${{ secrets.HYPERDX_API_KEY }}
HDX_NODE_BETA_MODE: 1

jobs:
2 changes: 0 additions & 2 deletions .github/archive/rust-sdk.yml
@@ -8,7 +8,6 @@ env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
HOST: ${{ secrets.HOST }}
LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
@@ -21,7 +20,6 @@ env:
SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
HYPERDX_API_KEY: ${{ secrets.HYPERDX_API_KEY }}
HDX_NODE_BETA_MODE: 1


2 changes: 0 additions & 2 deletions .github/workflows/ci.yml
@@ -12,7 +12,6 @@ env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
HOST: ${{ secrets.HOST }}
LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
POSTHOG_HOST: ${{ secrets.POSTHOG_HOST }}
NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
@@ -25,7 +24,6 @@ env:
SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
HYPERDX_API_KEY: ${{ secrets.HYPERDX_API_KEY }}
HDX_NODE_BETA_MODE: 1
FIRE_ENGINE_BETA_URL: ${{ secrets.FIRE_ENGINE_BETA_URL }}
USE_DB_AUTHENTICATION: ${{ secrets.USE_DB_AUTHENTICATION }}
32 changes: 32 additions & 0 deletions .github/workflows/deploy-image-staging.yml
@@ -0,0 +1,32 @@
name: STAGING Deploy Images to GHCR

env:
  DOTNET_VERSION: '6.0.x'

on:
  push:
    branches:
      - mog/webscraper-refactor
  workflow_dispatch:

jobs:
  push-app-image:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: './apps/api'
    steps:
      - name: 'Checkout GitHub Action'
        uses: actions/checkout@main

      - name: 'Login to GitHub Container Registry'
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{github.actor}}
          password: ${{secrets.GITHUB_TOKEN}}

      - name: 'Build Inventory Image'
        run: |
          docker build . --tag ghcr.io/mendableai/firecrawl-staging:latest
          docker push ghcr.io/mendableai/firecrawl-staging:latest
1 change: 0 additions & 1 deletion CONTRIBUTING.md
@@ -41,7 +41,6 @@ TEST_API_KEY= # use if you've set up authentication and want to test with a real
SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
2 changes: 1 addition & 1 deletion README.md
@@ -56,7 +56,7 @@ We provide an easy to use API with our hosted version. You can find the playgrou
Check out the following resources to get started:
- [x] **API**: [Documentation](https://docs.firecrawl.dev/api-reference/introduction)
- [x] **SDKs**: [Python](https://docs.firecrawl.dev/sdks/python), [Node](https://docs.firecrawl.dev/sdks/node), [Go](https://docs.firecrawl.dev/sdks/go), [Rust](https://docs.firecrawl.dev/sdks/rust)
- [x] **LLM Frameworks**: [Langchain (python)](https://python.langchain.com/docs/integrations/document_loaders/firecrawl/), [Langchain (js)](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/firecrawl), [Llama Index](https://docs.llamaindex.ai/en/latest/examples/data_connectors/WebPageDemo/#using-firecrawl-reader), [Crew.ai](https://docs.crewai.com/), [Composio](https://composio.dev/tools/firecrawl/all), [PraisonAI](https://docs.praison.ai/firecrawl/)
- [x] **LLM Frameworks**: [Langchain (python)](https://python.langchain.com/docs/integrations/document_loaders/firecrawl/), [Langchain (js)](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/firecrawl), [Llama Index](https://docs.llamaindex.ai/en/latest/examples/data_connectors/WebPageDemo/#using-firecrawl-reader), [Crew.ai](https://docs.crewai.com/), [Composio](https://composio.dev/tools/firecrawl/all), [PraisonAI](https://docs.praison.ai/firecrawl/), [Superinterface](https://superinterface.ai/docs/assistants/functions/firecrawl), [Vectorize](https://docs.vectorize.io/integrations/source-connectors/firecrawl)
- [x] **Low-code Frameworks**: [Dify](https://dify.ai/blog/dify-ai-blog-integrated-with-firecrawl), [Langflow](https://docs.langflow.org/), [Flowise AI](https://docs.flowiseai.com/integrations/langchain/document-loaders/firecrawl), [Cargo](https://docs.getcargo.io/integration/firecrawl), [Pipedream](https://pipedream.com/apps/firecrawl/)
- [x] **Others**: [Zapier](https://zapier.com/apps/firecrawl/integrations), [Pabbly Connect](https://www.pabbly.com/connect/integrations/firecrawl/)
- [ ] Want an SDK or Integration? Let us know by opening an issue.
1 change: 0 additions & 1 deletion SELF_HOST.md
@@ -62,7 +62,6 @@ TEST_API_KEY= # use if you've set up authentication and want to test with a real
SCRAPING_BEE_API_KEY= # use if you'd like to use as a fallback scraper
OPENAI_API_KEY= # add for LLM-dependent features (e.g., image alt generation)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
5 changes: 0 additions & 5 deletions apps/api/.env.example
@@ -33,8 +33,6 @@ SCRAPING_BEE_API_KEY=
# add for LLM dependednt features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
@@ -54,9 +52,6 @@ STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
STRIPE_PRICE_ID_GROWTH=
STRIPE_PRICE_ID_GROWTH_YEARLY=

HYPERDX_API_KEY=
HDX_NODE_BETA_MODE=1

# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=

2 changes: 1 addition & 1 deletion apps/api/jest.setup.js
@@ -1 +1 @@
-global.fetch = require('jest-fetch-mock');
+// global.fetch = require('jest-fetch-mock');
10 changes: 7 additions & 3 deletions apps/api/package.json
@@ -32,9 +32,11 @@
"@tsconfig/recommended": "^1.0.3",
"@types/body-parser": "^1.19.2",
"@types/cors": "^2.8.13",
"@types/escape-html": "^1.0.4",
"@types/express": "^4.17.17",
"@types/jest": "^29.5.12",
"@types/node": "^20.14.1",
"@types/pdf-parse": "^1.1.4",
"body-parser": "^1.20.1",
"express": "^4.18.2",
"jest": "^29.6.3",
@@ -53,9 +55,7 @@
"@bull-board/api": "^5.20.5",
"@bull-board/express": "^5.20.5",
"@devil7softwares/pos": "^1.0.2",
"@dqbd/tiktoken": "^1.0.13",
"@hyperdx/node-opentelemetry": "^0.8.1",
"@logtail/node": "^0.4.12",
"@dqbd/tiktoken": "^1.0.16",
"@nangohq/node": "^0.40.8",
"@sentry/cli": "^2.33.1",
"@sentry/node": "^8.26.0",
@@ -78,6 +78,7 @@
"date-fns": "^3.6.0",
"dotenv": "^16.3.1",
"dotenv-cli": "^7.4.2",
"escape-html": "^1.0.3",
"express-rate-limit": "^7.3.1",
"express-ws": "^5.0.2",
"form-data": "^4.0.0",
@@ -92,6 +93,7 @@
"languagedetect": "^2.0.0",
"logsnag": "^1.0.0",
"luxon": "^3.4.3",
"marked": "^14.1.2",
"md5": "^2.3.0",
"moment": "^2.29.4",
"mongoose": "^8.4.4",
@@ -114,6 +116,8 @@
"typesense": "^1.5.4",
"unstructured-client": "^0.11.3",
"uuid": "^10.0.0",
"winston": "^3.14.2",
"winston-transport": "^4.8.0",
"wordpos": "^2.1.0",
"ws": "^8.18.0",
"xml2js": "^0.6.2",