While the IPEDS survey data contains a wealth of information, the data provided on the IPEDS website is not very user-friendly. This project provides files that create long-form panel Stata datasets of IPEDS surveys for use in academic research. The files are free to download, use, and edit as you wish. If they have been helpful in furthering your research, a citation would be appreciated.
You will need Python 3+ (the scripts will NOT run in Python 2) and Stata to run the code. Download a zip or tarball of the code by clicking the download link at the top of the page, then unzip or unpack the file, which should contain a single folder. I will refer to this folder as the IPEDS directory.
If any part of the IPEDS directory's absolute path contains a space (such as /users/naven/my documents/research/data/ipeds), Stata will incorrectly name every log file after the word before the first space (in this case my.log). I therefore recommend giving the IPEDS directory a name without spaces and placing it in a location whose parent folders also have no spaces in their names (such as /users/naven/documents/research/data/ipeds). Within the IPEDS directory, create a folder named downloads (all lowercase).
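The check described above can be done before running anything else. The following is only a minimal sketch and is not part of the project scripts; the function name prepare_ipeds_directory is hypothetical. It rejects a path that contains a space and otherwise creates the downloads folder.

```python
from pathlib import Path


def prepare_ipeds_directory(ipeds_dir):
    """Reject paths with spaces, then create the lowercase downloads folder.

    Hypothetical helper for illustration; not part of the project scripts.
    """
    root = Path(ipeds_dir).resolve()
    # A space anywhere in the absolute path causes the log naming problem.
    if " " in str(root):
        raise ValueError(f"Path contains a space: {root}")
    downloads = root / "downloads"
    downloads.mkdir(parents=True, exist_ok=True)
    return downloads
```

Calling prepare_ipeds_directory("/users/naven/documents/research/data/ipeds") would return the path to the downloads folder, while the path with "my documents" in it would raise a ValueError.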
As written, the current version works on Mac, Windows, and Linux only if Stata is installed in the default directory under the default name. If you installed Stata in a different directory or under a different name, paste the absolute path to the Stata executable between the double quotes of the stata_executable_path variable in the create_dta.py file.
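As an illustration, the edited variable might look like the sketch below. The path shown is a made-up example and must be replaced with the real location on your machine, and the helper stata_path_is_usable is hypothetical; it only confirms that the path points to an existing file before you start a long run.

```python
from pathlib import Path

# Made-up example value (an assumption); replace it with the absolute path
# to the Stata executable on your own machine.
stata_executable_path = "/usr/local/stata/stata-mp"


def stata_path_is_usable(path):
    """Return True if the path is non-empty and points to an existing file."""
    return bool(path) and Path(path).is_file()


if not stata_path_is_usable(stata_executable_path):
    print("Stata executable not found at:", stata_executable_path)
```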
The raw data to create the datasets can be downloaded from the IPEDS Data Center Complete Data Files. My preferred method to download all of the files simultaneously is to use the Firefox extension DownThemAll! with the archives filter. Save all of the files to the downloads folder inside the IPEDS directory.
Open the run_all.py file in the data_scrape folder in an IDE such as PyCharm or IDLE (included with Python) running Python 3+, and then click run. This will execute all of the Python scripts necessary to create individual year Stata datasets for each survey. Next, run the do files in the do_files folder to create panel Stata datasets for each survey.
These files use Python to facilitate the organization and creation of raw (uncleaned) Stata .dta files for each academic year from the IPEDS Data Center Complete Data Files. These individual year files can then be appended to create Stata .dta files that contain all years of data for each survey. As mentioned above, the scripts operate under the assumption that the IPEDS directory contains a subfolder named downloads to which all IPEDS files are originally downloaded.
The project scripts do the following:
- Delete duplicate/SPSS/SAS files
- Sort data dictionaries to the corresponding year within the codebooks folder
- Sort .do files that create .dta files to the corresponding year within the do_files folder
- Sort raw .csv files to the corresponding year within the raw_data folder
- Change the directory in each do file to the clean_data folder
- Replace the default insheet file path, which points to a location on the IPEDS servers, with a path to the raw_data folder
- Add a capture command to each label command in the .do files (some files have incorrect labels or illegal characters)
- Unzip all the files
- Create .dta files from the raw .csv files
- Delete all unnecessary unzipped files
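To make the capture step in the list above concrete, here is a minimal sketch of how a capture prefix could be added to every label command in the text of a do file. It is an assumption about one way to do this, not the project's actual code, and the function name add_capture_to_labels is hypothetical.

```python
import re

# Matches a line that starts (after optional indentation) with "label ".
LABEL_LINE = re.compile(r"^(\s*)(label\s)", re.IGNORECASE)


def add_capture_to_labels(do_file_text):
    """Prefix each label command with capture so that a bad label does not
    stop the do file. Lines that already start with capture are unchanged.

    Hypothetical helper for illustration; not the project's own function.
    """
    edited = []
    for line in do_file_text.splitlines():
        edited.append(LABEL_LINE.sub(r"\1capture \2", line))
    return "\n".join(edited)
```

With this sketch, a line such as label variable unitid "ID" becomes capture label variable unitid "ID", while an insheet line passes through untouched.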
The scripts also create a list of files for which a .dta file could not be successfully created from the raw .csv file.
A README file describes the order in which each script should be run, although the run_all.py script will automatically run the scripts in the correct order. The IPEDS files are originally downloaded as zipped files. However, the project also includes functions and scripts that edit unzipped files, in case the files have already been unzipped or edits only need to be made to an individual file.
After the Python scripts create the individual year .dta files, the do files append the individual years for each survey together and then clean the resulting datasets.
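For readers who want to see what the unzipping step amounts to, the following is a minimal sketch that uses Python's standard zipfile module. It assumes the archives are .zip files sitting in the downloads folder; the function name unzip_all and the target folder argument are hypothetical and do not describe the project's own implementation.

```python
import zipfile
from pathlib import Path


def unzip_all(downloads_dir, target_dir):
    """Extract every .zip archive in downloads_dir into target_dir.

    Returns the names of all extracted members.
    Hypothetical helper for illustration; not the project's own function.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(downloads_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(zf.namelist())
    return extracted
```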
If you notice any bugs or errors, please notify me at [email protected].