This is a Scrapy project which crawls the University of Essex website for all courses in every subject, attempting to extract as much information about each course as possible.

The project retrieves all courses from all subjects. Below is an example of an extracted item, pretty-printed as JSON:
```json
{
  "count": 16,
  "subject": "Drama",
  "link": "/subjects/drama",
  "courses": {
    "Undergraduate": [
      {
        "name": "Drama",
        "degree_type": "BA",
        "link": "/courses/ug00097/1/ba-drama",
        "study_mode": "Full-time",
        "location": "Colchester Campus",
        "options": ["Year Abroad", "Placement Year"]
      },
      {
        "name": "Drama (Including Foundation Year)",
        "degree_type": "BA",
        "link": "/courses/ug00097/2/ba-drama",
        "study_mode": "Full-time",
        "location": "Colchester Campus",
        "options": []
      },
      {
        "name": "Drama and Literature",
        "degree_type": "BA",
        "link": "/courses/ug00098/1/ba-drama-and-literature",
        "study_mode": "Full-time",
        "location": "Colchester Campus",
        "options": ["Year Abroad", "Placement Year"]
      }
    ]
  }
}
```
All requirements use the latest versions available at the time of writing.
Clone the project repo and run the crawler from inside the `src` subfolder:
```shell
# Clone project repo
$ git clone https://github.com/MrManning/course-crawler.git course-crawler

# Navigate inside project
$ cd course-crawler/src

# Run the crawler
$ scrapy crawl courses
```
The scraped courses will be located inside `course-crawler/build/courses.json`.
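The output can be consumed like any JSON document. A minimal sketch, assuming `courses.json` holds an array of subject items shaped like the sample above (the inline literal stands in for the real file):

```python
import json

# Two truncated subject items in the shape of the sample above; on a real run,
# replace the literal with json.load(open("course-crawler/build/courses.json")).
sample = """[{"subject": "Drama",
              "courses": {"Undergraduate": [
                  {"name": "Drama", "degree_type": "BA"},
                  {"name": "Drama and Literature", "degree_type": "BA"}]}}]"""

for subject in json.loads(sample):
    for course in subject["courses"].get("Undergraduate", []):
        print(f"{subject['subject']}: {course['degree_type']} {course['name']}")
```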
As mentioned in the TODO list below, the crawler is able to retrieve all the courses for Undergraduate and Masters degree types, but the results are split across multiple objects. It should be possible to fix this by updating the item pipeline to check whether an item for that subject already exists and merge into it.
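One way the merging pipeline could look. This is a hypothetical sketch, not the project's actual pipeline: the class name `MergeCoursesPipeline` is invented, and the field names (`subject`, `courses`, `count`) are taken from the sample item above.

```python
class MergeCoursesPipeline:
    """Hypothetical Scrapy pipeline: merge items that share a subject."""

    def __init__(self):
        self.items = {}  # subject name -> merged item

    def process_item(self, item, spider):
        key = item["subject"]
        if key in self.items:
            merged = self.items[key]
            # Fold each degree type's course list into the existing item.
            for degree_type, courses in item["courses"].items():
                merged["courses"].setdefault(degree_type, []).extend(courses)
            # Recompute the total course count after merging.
            merged["count"] = sum(len(c) for c in merged["courses"].values())
        else:
            self.items[key] = item
        return self.items[key]
```

A real implementation would also need to drop the superseded partial items (e.g. by raising `scrapy.exceptions.DropItem` for duplicates and emitting the merged items at close), but the lookup-and-merge step is the core of the fix.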
- Fix bug which causes the crawler to run infinitely
- Retrieve all courses under degree type including courses hidden behind API call
- Make second depth a recursive method for courses hidden behind the API call
- Move CustomFeedStorage method to separate file
- Add sample item to README.md
- Reach a depth of 2
- Reach a depth of 3 to retrieve more course information
- Extract courses from each degree type (Undergraduate, Postgraduate, Research)
- Get all courses
- Combine duplicate results into one item
- Add unit tests
- Implement caching to avoid potentially being blocked and to improve performance
- Update repository folder structure
- Update README.md to include more badges
- Add repository license
- Main loop to start the crawler (Running from command line might be fine alone)
- Ability to set output file (If running in command line output file can be set otherwise the default can be used)
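For the caching item in the list above, Scrapy ships a built-in HTTP cache that can be switched on in `settings.py`. The values below are illustrative defaults, not the project's actual configuration:

```python
# Enable Scrapy's built-in HTTP cache so repeated runs replay stored
# responses instead of re-hitting the university site.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # re-fetch pages after one day (0 = never expire)
HTTPCACHE_DIR = "httpcache"        # stored under the project's .scrapy directory
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```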
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for more information.