world-cities-analysis

Built With

Python 3.9.12, GeoPandas 0.9, GDAL 3.0.2, scikit-learn 1.1.1

Datasets

We use the 6,018 city boundaries defined in a .shp file (export it in EPSG:4326). After downloading, move the extracted files into "../../city-boundaries/" relative to the code directory.

Continents for each city were determined using another shapefile (download the "label areas" version). Move the extracted data to "../../continent-boundaries/", also relative to the code directory.
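
For reference, a minimal sketch (not the repository's code) of how the two shapefiles fit together: load the city boundaries with GeoPandas, make sure both layers are in EPSG:4326, and attach a continent to each city with a spatial join. The file names inside the two folders are assumptions.

import geopandas as gpd

# Load the 6,018 city boundaries and the continent "label areas"
# (the .shp file names are assumptions; use whatever the downloads contain).
cities = gpd.read_file("../../city-boundaries/city_boundaries.shp")
continents = gpd.read_file("../../continent-boundaries/continent_labels.shp")

# Reproject in case either layer was not exported in EPSG:4326.
cities = cities.to_crs(epsg=4326)
continents = continents.to_crs(epsg=4326)

# Each city inherits the attributes of the continent polygon it intersects.
# GeoPandas 0.9 uses op=; newer releases use predicate= instead.
cities_with_continent = gpd.sjoin(cities, continents, how="left", op="intersects")
print(cities_with_continent.head())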

Cities were analyzed based on the following extracted datasets.

Note that all of the extracted CSV data for our analyses has already been pushed to the "csv_data" folder of this repository. We recommend using those files instead of re-extracting the data as explained below.

  • Human Modification: the global human modification raster "gHM.tif" must first be reprojected to EPSG:4326 with GDAL:

> gdalwarp -t_srs EPSG:4326 '../../datasets/human_modification/gHM.tif' '../../datasets/human_modification/gHM_mod.tif'

After reprojecting, move the original "gHM.tif" to another unused folder or delete it, so that only the reprojected "gHM_mod.tif" remains in the dataset folder.
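
To confirm the reprojection before running the extraction, a quick check with the GDAL Python bindings (the same GDAL listed under "Built With") should report authority code 4326:

from osgeo import gdal, osr

# Open the reprojected raster and read its spatial reference.
ds = gdal.Open("../../datasets/human_modification/gHM_mod.tif")
srs = osr.SpatialReference(wkt=ds.GetProjection())
print(srs.GetAttrValue("AUTHORITY", 1))  # expected: "4326"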

  • Land Usage: We used the cropland and grazing-land data from 2017. Files are in 5-arcminute .asc format.
  • Elevation: The data is originally exported as a .shp file and must be rasterized to a new .tif with GDAL on the command line (a sketch of the per-city extraction pattern follows this list):
> gdal_rasterize -a MEAN_ELEV "../../datasets/elevation/GMTED2010_Spatial_Metadata.shp" -tr 0.008333333333333 0.008333333333333 "../../datasets/elevation/elevation.tif"
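
The actual extraction lives in city_data.py; as a rough illustration of the pattern it implements (clip each raster to a city polygon, then aggregate to one value per city), here is a hedged sketch. The use of rasterio, the "NAME" column, and the output file name are all assumptions; the repository may use the GDAL bindings directly.

import geopandas as gpd
import pandas as pd
import rasterio
from rasterio.mask import mask

# Both layers are assumed to be in EPSG:4326, matching the rasters above.
cities = gpd.read_file("../../city-boundaries/city_boundaries.shp")

rows = []
with rasterio.open("../../datasets/elevation/elevation.tif") as src:
    for _, city in cities.iterrows():
        # Clip the raster to the city polygon; filled=False returns a
        # masked array, so pixels outside the polygon are ignored by .mean().
        clipped, _ = mask(src, [city.geometry], crop=True, filled=False)
        rows.append({"city": city["NAME"], "mean_elev": float(clipped.mean())})

pd.DataFrame(rows).to_csv("../csv_data/elevation.csv", index=False)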

Unused but also available to extract

Run the following to see all available options to extract data using the city_data.py script:

> python city_data.py -h   
Usage: city_data.py [options]

Options:
  -h, --help          show this help message and exit
  -c C, --plotcity=C  City to be plotted (case sensitive; options are any
                      city in our database: Boston, Los Angeles, Tokyo,
                      Beijing, etc.)
  -w, --worldclim     Extract WorldClim data stored at
                      "../../datasets/worldclim"
  -p, --paleoclim     Extract PaleoClim data stored at
                      "../../datasets/paleoclim"
  -l, --landscan      Extract Landscan data stored at
                      "../../datasets/landscan"
  -b, --brightness    Extract Sky Brightness data stored at
                      "../../datasets/brightness"
  -r, --roads         Extract road density data stored at
                      "../../datasets/roads"
  -m, --human         Extract human modification data stored at
                      "../../datasets/human_modification"
  -u, --urban_heat    Extract urban heat data stored at
                      "../../datasets/urban_heat"
  -y, --land_use      Extract land use data stored at
                      "../../datasets/land_use"
  -e, --elevation     Extract elevation data stored at
                      "../../datasets/elevation"
  -g, --geodist       Output geographical distances between cities.
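
As an illustration of what -g/--geodist produces conceptually, pairwise great-circle distances between city centroids can be computed with pyproj (installed alongside GeoPandas). This is a sketch, not the script's implementation; the "NAME" column is an assumption, and with 6,018 cities the roughly 18 million pairs are better written to a file than printed:

import geopandas as gpd
from pyproj import Geod

cities = gpd.read_file("../../city-boundaries/city_boundaries.shp")
cent = cities.geometry.centroid          # adequate for coarse distances
geod = Geod(ellps="WGS84")

for i in range(len(cent)):
    for j in range(i + 1, len(cent)):
        # geod.inv returns forward azimuth, back azimuth, and distance (m).
        _, _, dist_m = geod.inv(cent.iloc[i].x, cent.iloc[i].y,
                                cent.iloc[j].x, cent.iloc[j].y)
        print(cities["NAME"].iloc[i], cities["NAME"].iloc[j], dist_m / 1000.0)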

City Analysis

Use the cities.py script with the following options to perform analyses on the data. Example output plots are located in the "figures" folder.

> python cities.py -h 
Usage: cities.py [options]

Options:
  -h, --help            show this help message and exit
  -s, --stability       Clusters the cities based on features located at
                        "../csv_data/". Then calculates the cluster stability
                        by running several iterations of clustering.
  -c, --cluster         Clusters city data located at "../csv_data/", then
                        plots the clusters in a geographic representation.
  -t, --centers_to_csv  Saves calculated cluster centroids to CSV file (use
                        with python cities.py -c or -p).
  -p, --pca             Clusters then plots the cities with their first two
                        Principal Components.
  -e, --elbow           Uses centroid CSV files for different k values, and
                        calculates Euclidean distances between cities and
                        clusters. Plots an elbow plot and generates a CSV of
                        cluster distances and of city-to-city distances.
  -m METHOD, --method=METHOD
                        Clustering method to perform. Type either "dbscan" or
                        "kmeans". (Default: kmeans)
  -k K, --num_clusters=K
                        Number of clusters to partition the data into for
                        k-means clustering. Default: 6
  -i C, --cluster_iters=C
                        Number of clustering iterations to calculate baseline
                        clusters. Default: 10
  -b B, --baseline_stability_iters=B
                        Number of iterations to calculate the stability of the
                        baseline clusters. Default: 100
  -n N, --stability_iters=N
                        Number of iterations to calculate the cluster
                        stability for each city. Default: 100
  -f F, --drop_features=F
                        Percentage of features to drop at each iteration of
                        stability calculations. Default: 10
  -d D, --drop_rows=D   Percentage of rows to drop at each iteration of
                        stability calculations. Default: 10
  -r R, --random_seed=R
                        Random seed initialization. Default: 10
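
To make the stability options concrete, here is a minimal sketch (not cities.py itself) of the procedure they describe: cluster the standardized city features with k-means, then repeatedly re-cluster perturbed copies of the data (dropping roughly the -f percentage of features and the -d percentage of rows) and score each run's agreement with the baseline labels. The feature file name and the adjusted Rand index as the agreement score are assumptions.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)                          # -r, --random_seed
features = pd.read_csv("../csv_data/features.csv", index_col=0)  # assumed name
X = StandardScaler().fit_transform(features)

# Baseline clustering (-k 6 and -m kmeans are the defaults).
baseline = KMeans(n_clusters=6, n_init=10, random_state=10).fit_predict(X)

scores = []
for _ in range(100):                                     # -n, --stability_iters
    keep_cols = rng.random(X.shape[1]) >= 0.10           # -f: drop ~10% of features
    keep_rows = rng.random(X.shape[0]) >= 0.10           # -d: drop ~10% of rows
    labels = KMeans(n_clusters=6, n_init=10).fit_predict(
        X[np.ix_(keep_rows, keep_cols)])
    # Compare against the baseline labels of the surviving rows.
    scores.append(adjusted_rand_score(baseline[keep_rows], labels))

print("mean stability (adjusted Rand index):", np.mean(scores))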

SLURM usage

Extraction to CSV can take several hours for many of the datasets, so we recommend using an HPC and scheduling jobs with a SLURM script. Example SLURM script to extract the LandScan data:

#!/bin/bash

# --------------------------------------------------------------
### PART 1: Requests resources to run your job.
# --------------------------------------------------------------
### Optional. Set the job name
#SBATCH --job-name=get_landscan_data
### SLURM reads %x as the job name and %j as the job ID
#SBATCH --output=%x-%j.out
### REQUIRED. Specify the PI group for this job
#SBATCH --account=[enter your group here]
### Optional. Request email when job begins and ends
#SBATCH --mail-type=ALL
### Optional. Specify email address to use for notification
#SBATCH --mail-user=[your email address]
### REQUIRED. Set the partition for your job.
#SBATCH --partition=standard
### REQUIRED. Set the number of cores that will be used for this job.
#SBATCH --ntasks=4
### REQUIRED. Set the number of nodes
#SBATCH --nodes=1
### REQUIRED. Set the memory required for this job.
#SBATCH --mem-per-cpu=5gb
### REQUIRED. Specify the time required for this job, hhh:mm:ss
#SBATCH --time=24:00:00


# --------------------------------------------------------------
### PART 2: Executes bash commands to run your job
# --------------------------------------------------------------
### Install necessary modules
source ~/.bashrc && conda activate cities
### Change to your script's directory
cd ~/world-cities-analysis/code
### Run your work
echo "Extracting Landscan population data..."
python city_data.py -l
sleep 10

Example SLURM script for calculating cluster stability:

#!/bin/bash

# --------------------------------------------------------------
### PART 1: Requests resources to run your job.
# --------------------------------------------------------------
### Optional. Set the job name
#SBATCH --job-name=cluster_stability
### SLURM reads %x as the job name and %j as the job ID
#SBATCH --output=%x-%j.out
### REQUIRED. Specify the PI group for this job
#SBATCH --account=[your group name]
### Optional. Request email when job begins and ends
#SBATCH --mail-type=ALL
### Optional. Specify email address to use for notification
#SBATCH --mail-user=[your email]
### REQUIRED. Set the partition for your job.
#SBATCH --partition=standard
### REQUIRED. Set the number of cores that will be used for this job.
#SBATCH --ntasks=4
### REQUIRED. Set the number of nodes
#SBATCH --nodes=1
### REQUIRED. Set the memory required for this job.
#SBATCH --mem-per-cpu=5gb
### REQUIRED. Specify the time required for this job, hhh:mm:ss
#SBATCH --time=24:00:00


# --------------------------------------------------------------
### PART 2: Executes bash commands to run your job
# --------------------------------------------------------------
### Install necessary modules
source ~/.bashrc && conda activate cities
### Change to your script's directory
cd ~/world-cities-analysis/code
### Run your work
echo "Analyzing k=2 cluster stabilities..."
python cities.py -s -k 2 -b 1000 -i 100 -n 10000

echo "Analyzing k=3 cluster stabilities..."
python cities.py -s -k 3 -b 1000 -i 100 -n 10000

echo "Analyzing k=4 cluster stabilities..."
python cities.py -s -k 4 -b 1000 -i 100 -n 10000

echo "Analyzing k=5 cluster stabilities..."
python cities.py -s -k 5 -b 1000 -i 100 -n 10000

echo "Analyzing k=6 cluster stabilities..."
python cities.py -s -k 6 -b 1000 -i 100 -n 10000

sleep 10

Authors

Kyle Arechiga - [email protected]

Cristian Roman Palacios - [email protected]
