Scrapes both companies & internships from a variety of sources

intern-hub/scraper

scraper

This repository contains our scraping code. The scraper pulls company names from /r/cscareerquestions and then crawls the web for positions at those companies. We are also actively looking for other reliable sources of company names.

Installation

Make sure the following dependencies are installed on your system:

  • Docker

You will also need to place a valid hibernate.cfg.xml file in the src/main/resources folder. This file provides the SQL database connection details that the scraper uses to read and write companies and positions. See src/main/resources/hibernate.cfg.xml.example for an example.
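
Because the scraper depends on that file being present, it can help to check for it before starting a run. The following is a minimal sketch, not part of the repository: the function name has_hibernate_config is hypothetical, and it only checks that the file exists and is non-empty, not that its connection details are valid.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Succeeds only if hibernate.cfg.xml exists and is non-empty
# under the given repository root (defaults to the current directory).
has_hibernate_config() {
    repo_root="${1:-.}"
    [ -s "$repo_root/src/main/resources/hibernate.cfg.xml" ]
}

# Example: warn when the file is missing, pointing at the example file.
if ! has_hibernate_config .; then
    echo "hibernate.cfg.xml not found; see src/main/resources/hibernate.cfg.xml.example" >&2
fi
```

The check is deliberately shallow. A file that exists but contains wrong credentials will still pass, so a failed database connection at runtime remains possible.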

Usage

Run the following commands from the root of the repository.

To fetch all companies and save them to the database, ignoring duplicates, use:

scripts/start_docker.sh -c

To fetch positions for every company already in the database and save them, ignoring duplicates, use:

scripts/start_docker.sh -p
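
Since the positions step reads companies from the database, a full refresh would presumably run the two commands in order. The sketch below assumes that start_docker.sh exits with a non-zero status on failure, which this README does not confirm. The function name run_scrape is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of this repository).
# Runs the company step first, then the positions step only if the
# company step succeeded. Assumes the script signals failure through
# a non-zero exit status.
run_scrape() {
    script="${1:-scripts/start_docker.sh}"
    "$script" -c && "$script" -p
}
```

From the repository root, calling run_scrape with no arguments uses scripts/start_docker.sh. Passing a different path as the first argument is only useful for testing the wrapper against a stub.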
