2div/scraper (forked from Chmod351/scraper)

A web scraper that obtains information from any website, using only its URL and the CSS class of the elements you want to scrape. It has no predefined purpose, so you can use it to gather information from any site you like.

Index 🔖

  • Objective ⭐
  • Documentation 📖
  • Custom Usage ⚙️
  • Usage Limitations
  • Contributors ❤️
  • Contributions 📈

Objective ⭐

The objective of this web scraper is to obtain information from any website, just by using its URL and the target CSS class that you want to scrape. It doesn't have a predefined purpose, so you can use it to gather information from any site you like.

Documentation 📖

Postman Documentation

Custom Usage ⚙️

Make a POST request to the /api/v1/scrappe endpoint at http://localhost:5000. The request body should contain the following parameters:

  • keyWord (string): The keyword to filter articles by (optional).
  • url (string): The URL of the web page to scrape (mandatory).
  • objectClass (string): The CSS class of the elements to scrape from the web page (mandatory).

The API endpoint responds with a JSON object containing the following properties:

  • state: A string indicating the state of the scraping process.
  • objects found: The number of objects found after filtering.
  • key-word: The keyword record used for filtering.
  • scanned webpage: The record of the webpage that was scraped.
  • found articles: An array of articles that match the filtering criteria.

If the response is too large, the API uses compression middleware to reduce its size. The scraper also uses a findOrCreate pattern with Mongoose so that repeated scrapes of the same website don't produce duplicate records in the database (a sketch of such a helper follows).
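
Mongoose has no built-in findOrCreate, so the sketch below shows one way such a helper could look. The helper name and arguments are illustrative rather than the repo's actual code, but the { doc, created } shape mirrors the key-word field in the response example further down.

// Minimal sketch of a findOrCreate helper for a Mongoose model;
// the repo's actual helper may differ.
async function findOrCreate(Model, query, defaults = {}) {
  // Reuse the existing document if one matches the query...
  const existing = await Model.findOne(query);
  if (existing) return { doc: existing, created: false };
  // ...otherwise create it, so repeated scrapes never duplicate records.
  const doc = await Model.create({ ...query, ...defaults });
  return { doc, created: true };
}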

Body Example

{
    "url": "https://www.url.com.ar",
    "objectClass": ".css-class-selector",
    "keyWord": "keyword"
}
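
As a usage sketch, the same request can be sent from Node 18+ with the built-in fetch; the port and payload come from the examples above, so adjust them to your deployment.

// Sketch: call the scraper endpoint (Node 18+, ESM top-level await).
const response = await fetch("http://localhost:5000/api/v1/scrappe", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://www.url.com.ar",
    objectClass: ".css-class-selector",
    keyWord: "keyword",
  }),
});
const scrapeResult = await response.json();
console.log(scrapeResult["objects found"]);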

Response Example

{
    "state": "success",
    "objects found": 2,
    "key-word": {
        "doc": {
            "_id": "64d40fa677d90019c57302ed",
            "keyword": "keyword",
            "createdAt": "2023-08-09T22:13:58.108Z",
            "updatedAt": "2023-08-10T17:08:08.459Z",
            "__v": 0,
            "usedTimes": 28
        },
        "created": false
    },
    "scanned webpage": {
        "_id": "64d3e3459686e7f4087acfdb",
        "cssClass": ".css-class-selector",
        "url": "https://www.url.com.ar",
        "__v": 0,
        "createdAt": "2023-08-09T19:04:37.137Z",
        "scrapedTimes": 69,
        "updatedAt": "2023-08-10T17:08:08.328Z"
    },
    "found articles": [
        {
            "_id": "64d4fcf821aef9f1dd17bbb8",
            "websiteTarget": "64d3e3459686e7f4087acfdb",
            "keywords": [
                "64d40fa677d90019c57302ed"
            ],
            "title": "Some Title",
            "link": "/some/link/related/to/the/article",
            "createdAt": "2023-08-10T15:06:32.535Z",
            "updatedAt": "2023-08-10T17:08:08.643Z",
            "__v": 2
        }
    ]
}
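
Note that the article link in the example is a relative path. If you need absolute URLs, one option is to resolve each link against the scanned webpage's url with the WHATWG URL API (reusing scrapeResult from the sketch above):

// Sketch: resolve relative article links against the scraped site's URL.
const base = scrapeResult["scanned webpage"].url;
const absoluteLinks = scrapeResult["found articles"].map(
  (article) => new URL(article.link, base).href,
);
// e.g. "https://www.url.com.ar/some/link/related/to/the/article"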

Export data to xlsx

  • Make a POST request to /api/v1/export/to-excel.
  • The request body should contain the following parameters:
      • scanned webpage (Object): the "scanned webpage" object returned by /api/v1/scrappe (mandatory).
      • found articles (Object array): the "found articles" array returned by /api/v1/scrappe (mandatory).

Body Example

{
    "scanned webpage": {
        "_id": "64d3e3459686e7f4087acfdb",
        "cssClass": ".css-class-selector",
        "url": "https://www.url.com.ar",
        "__v": 0,
        "createdAt": "2023-08-09T19:04:37.137Z",
        "scrapedTimes": 69,
        "updatedAt": "2023-08-10T17:08:08.328Z"
    },
    "found articles": [
        {
            "_id": "64d4fcf821aef9f1dd17bbb8",
            "websiteTarget": "64d3e3459686e7f4087acfdb",
            "keywords": [
                "64d40fa677d90019c57302ed"
            ],
            "title": "Some Title",
            "link": "/some/link/related/to/the/article",
            "createdAt": "2023-08-10T15:06:32.535Z",
            "updatedAt": "2023-08-10T17:08:08.643Z",
            "__v": 2
        }
    ]
}
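
A sketch of driving the export from Node, reusing scrapeResult from the earlier sketch. It assumes the endpoint responds with the generated .xlsx file as binary data, which is an assumption rather than documented behavior.

// Sketch: forward a previous scrape result to the export endpoint and
// save the spreadsheet. Assumes the response body is the .xlsx bytes.
import { writeFile } from "node:fs/promises";

const exportRes = await fetch("http://localhost:5000/api/v1/export/to-excel", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "scanned webpage": scrapeResult["scanned webpage"],
    "found articles": scrapeResult["found articles"],
  }),
});
await writeFile("export.xlsx", Buffer.from(await exportRes.arrayBuffer()));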

Usage Limitations

  • You can send at most 100 requests per 10 minutes (see the sketch after this list).
  • If the webpage has incorrect element nesting, the scraper will fail.
  • Before using this tool, please read the FAQ.
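
For context, a limit of 100 requests per 10 minutes is commonly implemented with the express-rate-limit middleware; the configuration below is an assumption about the setup, not the repo's confirmed implementation.

// Assumed shape of the server-side rate limit (express-rate-limit);
// the actual middleware in this repo may be configured differently.
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();
app.use(
  "/api/v1/",
  rateLimit({
    windowMs: 10 * 60 * 1000, // 10-minute window
    max: 100, // up to 100 requests per window per client
  }),
);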

Contributors ❤️

Special thanks to:

Contributions 📈

  • Contributions are welcome! Please read our guidelines.
