
course-crawler

Python 3.7 | License: GPL v3

Overview

This is a Scrapy project which crawls the University of Essex website for all courses in every subject. It attempts to extract as much information about each course as possible.

Example data

The project retrieves all courses from all subjects. Below is an example of an extracted item, pretty-printed as JSON.

{
    "count": 16,
    "subject": "Drama",
    "link": "/subjects/drama",
    "courses": {
        "Undergraduate": [{
            "name": "Drama",
            "degree_type": "BA",
            "link": "/courses/ug00097/1/ba-drama",
            "study_mode": "Full-time",
            "location": "Colchester Campus",
            "options": ["Year Abroad", "Placement Year"]
        }, {
            "name": "Drama (Including Foundation Year)",
            "degree_type": "BA",
            "link": "/courses/ug00097/2/ba-drama",
            "study_mode": "Full-time",
            "location": "Colchester Campus",
            "options": []
        }, {
            "name": "Drama and Literature",
            "degree_type": "BA",
            "link": "/courses/ug00098/1/ba-drama-and-literature",
            "study_mode": "Full-time",
            "location": "Colchester Campus",
            "options": ["Year Abroad", "Placement Year"]
        }]
    }
}
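
A course entry like the ones above could be modelled with Scrapy's Item and Field classes. This is only an illustrative sketch based on the fields in the JSON; the project itself may use plain dicts or a different item definition:

# Illustrative only -- the field names mirror the JSON example above,
# not the project's actual item definitions.
import scrapy


class CourseItem(scrapy.Item):
    name = scrapy.Field()
    degree_type = scrapy.Field()
    link = scrapy.Field()
    study_mode = scrapy.Field()
    location = scrapy.Field()
    options = scrapy.Field()


class SubjectItem(scrapy.Item):
    subject = scrapy.Field()
    link = scrapy.Field()
    count = scrapy.Field()
    courses = scrapy.Field()  # mapping of degree type -> list of course entries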

Requirements

All requirements use the latest versions available at the time of writing.

Installation

Clone the project repository, then run the crawler from within the src subfolder.

# Clone project repo
$ git clone https://github.com/MrManning/course-crawler.git course-crawler

# Navigate inside project
$ cd course-crawler/src

# Run the crawler
$ scrapy crawl courses

The scraped courses will be written to course-crawler/build/courses.json.

Problems

As mentioned in the TODO below, the crawler is able to retrieve all the courses for the Undergraduate and Masters degree types, but the results are split across multiple objects. It should be possible to fix this by updating the item pipeline to check whether an item for the same subject already exists and merging the duplicates, as sketched below.
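
A minimal sketch of such a de-duplicating pipeline, assuming items are dicts shaped like the JSON example above; the class name, attribute names, and output path are hypothetical and not taken from the project:

# Hypothetical item pipeline sketch (names and paths are not from the project).
# Assumes each item has a "subject" key plus a "courses" mapping of
# degree type -> list of courses, as in the example data above.
import json


class MergeCoursesPipeline:
    def open_spider(self, spider):
        self.subjects = {}  # subject name -> merged item

    def process_item(self, item, spider):
        merged = self.subjects.setdefault(
            item["subject"],
            {"subject": item["subject"], "link": item.get("link"), "courses": {}},
        )
        # Fold this item's course lists into the merged entry for its subject.
        for degree_type, courses in item["courses"].items():
            merged["courses"].setdefault(degree_type, []).extend(courses)
        merged["count"] = sum(len(c) for c in merged["courses"].values())
        return item

    def close_spider(self, spider):
        # Write one combined object per subject (hypothetical output path).
        with open("../build/courses_merged.json", "w") as f:
            json.dump(list(self.subjects.values()), f, indent=4)

The pipeline would still need to be enabled through the ITEM_PIPELINES setting in settings.py.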

TODO

  • Fix bug so the crawler no longer runs infinitely
  • Retrieve all courses under each degree type, including courses hidden behind an API call
  • Make the second crawl depth a recursive method for courses hidden behind the API call
  • Move CustomFeedStorage method to separate file
  • Add sample item to README.md
  • Reach a depth of 2
  • Reach a depth of 3 to retrieve more course information
  • Extract courses from each degree type (Undergraduate, Postgraduate, Research)
    • Get all courses
    • Combine duplicate results into one item
  • Add unit tests
  • Implement caching to avoid potentially being blocked and to reduce crawl time (see the settings sketch below)
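
For the caching item above, Scrapy ships with a built-in HTTP cache middleware that can be switched on in settings.py. A minimal sketch with illustrative values, not the project's actual configuration:

# settings.py -- illustrative values only.
# Enables Scrapy's built-in HttpCacheMiddleware so repeated crawls reuse
# responses stored on disk instead of hitting the site again.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24  # keep cached responses for one day
HTTPCACHE_DIR = "httpcache"               # stored under the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]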

Potential TODO

  • Update repository folder structure
  • Update README.md to include more badges
  • Add repository license
  • Main loop to start the crawler (running it from the command line might be sufficient on its own)
  • Ability to set the output file (if run from the command line the output file can be set, otherwise the default is used; see the sketch below)
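
One possible way to provide a default output file while still allowing a command-line override is Scrapy's FEEDS setting together with the -o flag; the path and options below are assumptions, and how this would interact with the project's existing CustomFeedStorage is not covered here:

# settings.py -- illustrative sketch, not the project's configuration.
# Default feed location (relative to src/); a different file can be chosen
# at run time with: scrapy crawl courses -o some/other/path.json
FEEDS = {
    "../build/courses.json": {
        "format": "json",
        "indent": 4,
    },
}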

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for more information.
