How to Crawl a Website Using Web Crawler?
Web Crawler is a built-in feature of our Scraper APIs. It’s a tool used to discover target URLs, select the relevant content, and have it delivered in bulk. It crawls websites in real-time and at scale to quickly deliver all content or only the data you need based on your chosen criteria.
There are three main tasks Web Crawler can do:
- Perform URL discovery;
- Crawl all pages on a site;
- Index all URLs on a domain.
Use it when you need to crawl through a site and receive parsed data in bulk, as well as to collect a list of URLs in a specific category or from an entire website.
There are three data output types you can receive when using Web Crawler: a list of URLs, parsed results, and HTML files. If needed, you can set Web Crawler to upload the results to your cloud storage.
You can easily control the crawling scope by adjusting its width and depth with filters. Web Crawler can also use various scraping parameters, such as geo-location and user agent, to increase the success rate of crawling jobs. Most of these scraping parameters depend on the Scraper API you use.
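To give you an early glimpse, here's a minimal sketch of the filters object within a request payload (the values are illustrative; the full parameter table appears below):
# A minimal sketch of the "filters" object controlling crawl scope.
# The regex patterns and depth are illustrative, not prescriptive.
filters = {
    "crawl": [".*"],    # Regex: which discovered URLs to include in the result.
    "process": [".*"],  # Regex: which URLs to scrape.
    "max_depth": 2      # How many links deep to follow from the starting URL.
}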
To control your crawling job, you need to use different endpoints. You can initiate, stop and resume your job, get job info, get the list of result chunks, and get the results. Below are the endpoints we’ll use in this crawling tutorial. For more information and output examples, visit our documentation.
- Endpoint: https://ect.oxylabs.io/v1/jobs
- Method: POST
- Authentication: Basic
- Request headers: Content-Type: application/json
The following endpoint will deliver the list of URLs found while processing the job:
- Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap
- Method: GET
- Authentication: Basic
- Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate
- Method: GET
- Authentication: Basic
The aggregate result can contain a lot of data, so we split it into multiple chunks based on the chunk size you specify. Use this endpoint to get the list of available chunk files.
- Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
- Method: GET
- Authentication: Basic
With this endpoint, you can download a particular chunk of the aggregate result. The contents of the response body depend on the output type you choose.
The result can be one of the following:
- An index (a list of URLs)
- An aggregate JSON file with all parsed results
- An aggregate JSON file with all HTML results
For your convenience, we’ve put all the available parameters you can use in the table below. It can also be found in our documentation.
Parameter | Description | Default Value |
---|---|---|
url | The URL of the starting point. | - |
filters | These parameters configure the breadth and depth of the crawling job, as well as determine which URLs should be included in the end result. See this section for more information. | - |
filters:crawl | Specifies which URLs Web Crawler will include in the end result. See this section for more information. | - |
filters:process | Specifies which URLs Web Crawler will scrape. See this section for more information. | - |
filters:max_depth | Determines the maximum length of URL chains Web Crawler will follow. See this section for more information. | 1 |
scrape_params | These parameters fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location. | - |
scrape_params:source | The scraper source to use (e.g., amazon for Amazon pages or universal for other sites). See this section for more information. | - |
scrape_params:geo_location | The geographical location that the result should be adapted for. See this section for more information. | - |
scrape_params:user_agent_type | Device type and browser. See this section for more information. | desktop |
scrape_params:render | Enables JavaScript rendering. Use when the target requires JavaScript to load content. To use this feature, set the parameter value to html. See this section for more information. | - |
output:type_ | The output type. We can return a sitemap (a list of URLs found while crawling) or an aggregate file containing HTML results or parsed data. See this section for more information. | - |
upload | These parameters describe the cloud storage location where you would like us to put the result once we're done. See this section for more information. | - |
upload:storage_type | The cloud storage type. The only valid value is s3 (for AWS S3). gcs (for Google Cloud Storage) is coming soon. | - |
upload:storage_url | The storage bucket URL. | - |
Using these parameters is straightforward, as you can pass them with the request payload. Below you can find code examples in Python.
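For instance, if you'd like the results delivered to your own bucket, the upload part of the payload could look like this minimal sketch (the bucket URL is a hypothetical placeholder):
# A minimal sketch of the "upload" object for cloud storage delivery.
# The bucket URL below is a hypothetical placeholder.
upload = {
    "storage_type": "s3",                   # AWS S3 is currently the only valid value.
    "storage_url": "s3://your-bucket-name"  # Replace with your own bucket URL.
}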
For simplicity, you can use Postman to make crawling requests. Download this Postman collection to try out all the endpoints of Web Crawler. Here’s a step-by-step video tutorial you can follow:
How to Crawl a Website: Step-by-step Guide
To make HTTP requests in Python, we’ll use the Requests library. Install it by entering the following in your terminal:
pip install requests
To deal with HTML results, we'll use the BeautifulSoup4 library to parse them and make them more readable. This step is optional, but you can install this library with:
pip install BeautifulSoup4
In the following example, we use the sitemap output type to create a job that crawls the Amazon homepage and gets a list of URLs found within the starting page. With the crawl and process parameters set to ".*", Web Crawler will follow and return any Amazon URL. These two parameters use regular expressions (regex) to determine which URLs should be crawled and processed. Be sure to visit our documentation for more details and useful resources.
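As a purely hypothetical illustration, narrower patterns would restrict the crawl to a subset of the site (the pattern below is made up for illustration and not taken from a real URL structure):
# A hypothetical, narrower filter: only follow and scrape URLs whose
# path contains "bestsellers" (the pattern is illustrative).
filters = {
    "crawl": [".*bestsellers.*"],
    "process": [".*bestsellers.*"],
    "max_depth": 2
}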
We don’t need to include the source parameter because we aren’t scraping content from the URLs yet. Using the json module, we write the data into a .json file, and then, with the pprint module, we print the structured content. Let’s see the example:
import requests, json
from pprint import pprint
# Set the content type to JSON.
headers = {"Content-Type": "application/json"}
# Crawl all URLs inside the target URL.
payload = {
"url": "https://www.amazon.com/",
"filters": {
"crawl": [".*"],
"process": [".*"],
"max_depth": 1
},
"scrape_params": {
"user_agent_type": "desktop",
},
"output": {
"type_": "sitemap"
}
}
# Create a job and store the JSON response.
response = requests.request(
'POST',
'https://ect.oxylabs.io/v1/jobs',
auth=('USERNAME', 'PASSWORD'), # Your credentials go here.
headers=headers,
json=payload,
)
# Write the decoded JSON response to a .json file.
with open('job_sitemap.json', 'w') as f:
json.dump(response.json(), f)
# Print the decoded JSON response.
pprint(response.json())
Depending on the request size, the process might take a bit of time. You can make sure the job is finished by checking the job information, as sketched below.
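Here’s a minimal sketch of such a check, assuming the job-info endpoint follows the same pattern as the endpoints above (https://ect.oxylabs.io/v1/jobs/{id}); see our documentation for the exact response fields:
import requests
from pprint import pprint

# Check the status of a crawling job. The endpoint pattern is assumed
# from the other endpoints above; consult the documentation to confirm.
info = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)
# Print the job information, which should include the job status.
pprint(info.json())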
When it’s done, send another request to the sitemap endpoint https://ect.oxylabs.io/v1/jobs/{id}/sitemap to retrieve the list of URLs. For example:
import requests, json
from pprint import pprint
# Store the JSON response containing URLs (sitemap).
sitemap = requests.request(
'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/sitemap', # Replace {id} with the job ID.
auth=('USERNAME', 'PASSWORD'), # Your credentials go here.
)
# Write the decoded JSON response to a .json file.
with open('sitemap.json', 'w') as f:
json.dump(sitemap.json(), f)
# Print the decoded JSON response.
pprint(sitemap.json())
To get parsed content, set the type_ parameter to parsed. Using the example below, we can crawl all URLs found on this Amazon page and then parse the content of each URL. This time, we’re using the amazon source, as we’re scraping content from the specified Amazon page. So, let’s see all of this put together in Python:
import requests, json
from pprint import pprint
# Set the content type to JSON.
headers = {"Content-Type": "application/json"}
# Parse content from the URLs found in the target URL.
payload = {
"url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
"=1679564333&rnid=502394&ref=sr_pg_1",
"filters": {
"crawl": [".*"],
"process": [".*"],
"max_depth": 1
},
"scrape_params": {
"source": "amazon",
"user_agent_type": "desktop"
},
"output": {
"type_": "parsed"
}
}
# Create a job and store the JSON response.
response = requests.request(
'POST',
'https://ect.oxylabs.io/v1/jobs',
auth=('USERNAME', 'PASSWORD'), # Your credentials go here.
headers=headers,
json=payload,
)
# Write the decoded JSON response to a .json file.
with open('job_parsed.json', 'w') as f:
json.dump(response.json(), f)
# Print the decoded JSON response.
pprint(response.json())
Note that if you want to use the geo_location parameter when scraping Amazon pages, you must set its value to the preferred location’s zip/postal code. For more information, visit this page in our documentation.
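For example, a scrape_params block adapted for a US location could look like this sketch (the 10001 zip code is purely illustrative):
# A hypothetical "scrape_params" block with a geo_location value.
# For Amazon, geo_location takes a zip/postal code; 10001 is illustrative.
scrape_params = {
    "source": "amazon",
    "geo_location": "10001",
    "user_agent_type": "desktop"
}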
Once the job is complete, you can check how many chunks your request has generated by querying the aggregate endpoint, and then download the content of each chunk with the https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk} endpoint.
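First, a minimal sketch that lists the available chunk files using the aggregate endpoint described earlier (see our documentation for the exact shape of the response):
import requests
from pprint import pprint

# Get the list of result chunks available for a finished job.
chunks = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)
# Print the list of available chunk files.
pprint(chunks.json())
Then, with the following code snippet, we're downloading and printing the first chunk: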
import requests, json
from pprint import pprint
# Store the JSON response containing parsed results.
parsed_results = requests.request(
'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1', # Replace {id} with the job ID.
auth=('USERNAME', 'PASSWORD'), # Your credentials go here.
)
# Write the decoded JSON response to a .json file.
with open('parsed_results_1.json', 'w') as f:
json.dump(parsed_results.json(), f)
# Print the decoded JSON response.
pprint(parsed_results.json())
The code to get HTML results doesn’t differ much from the code in the previous section. The only differences are that we’ve set the type_ parameter to html and used the universal source instead of amazon. Let’s see the code sample:
import requests, json
from pprint import pprint
# Set the content type to JSON.
headers = {"Content-Type": "application/json"}
# Index HTML results of URLs found in the target URL.
payload = {
"url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
"=1679564333&rnid=502394&ref=sr_pg_1",
"filters": {
"crawl": [".*"],
"process": [".*"],
"max_depth": 1
},
"scrape_params": {
"source": "universal",
"user_agent_type": "desktop"
},
"output": {
"type_": "html"
}
}
# Create a job and store the JSON response.
response = requests.request(
'POST',
'https://ect.oxylabs.io/v1/jobs',
auth=('USERNAME', 'PASSWORD'), # Your credentials go here
headers=headers,
json=payload,
)
# Write the decoded JSON response to a .json file.
with open('job_html.json', 'w') as f:
json.dump(response.json(), f)
# Print the decoded JSON response.
pprint(response.json())
Again, you’ll need to make a request to retrieve each chunk of the result. We’ll use the BeautifulSoup4 library to parse HTML, but this step is optional. We then write the parsed content to an .html file. The code example below downloads content from the first chunk:
import requests
from bs4 import BeautifulSoup
# Store the JSON response containing HTML results.
html_response = requests.request(
'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1', # Replace {id} with the job ID.
auth=('USERNAME', 'PASSWORD'), # Your credentials go here.
)
# Parse the HTML content.
soup = BeautifulSoup(html_response.content, 'html.parser')
html_results = soup.prettify()
# Write the HTML results to an .html file.
with open('html_results.html', 'w') as f:
f.write(html_results)
# Print the HTML results.
print(html_results)
You can modify these code samples as needed to fit your requirements.
This tutorial covered the fundamental aspects of using Web Crawler. We recommend looking at our documentation for more information on using the endpoints and query parameters. In case you have any questions, you can always contact us at [email protected] or via live chat on our website.