world-cities-analysis

Built With

Python 3.9.12, GeoPandas 0.9, GDAL 3.0.2, scikit-learn 1.1.1

Datasets

We use the 6,018 city boundaries defined in a .shp file (export it in EPSG:4326). After downloading, move the extracted files into "../../city-boundaries/" relative to the code directory.

Continents for each city were determined using another shapefile (download the "label areas" version). Move the extracted data to "../../continent-boundaries/", also relative to the code directory.
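
For reference, a minimal sketch (not the repository's code) of how the two shapefiles fit together: load the city boundaries with GeoPandas, make sure both layers are in EPSG:4326, and attach a continent to each city with a spatial join. The file names inside the two folders are assumptions.

import geopandas as gpd

# Load the 6,018 city boundaries and the continent "label areas"
# (the .shp file names are assumptions; use whatever the downloads contain).
cities = gpd.read_file("../../city-boundaries/city_boundaries.shp")
continents = gpd.read_file("../../continent-boundaries/continent_labels.shp")

# Reproject in case either layer was not exported in EPSG:4326.
cities = cities.to_crs(epsg=4326)
continents = continents.to_crs(epsg=4326)

# Each city inherits the attributes of the continent polygon it intersects.
# GeoPandas 0.9 uses op=; newer releases use predicate= instead.
cities_with_continent = gpd.sjoin(cities, continents, how="left", op="intersects")
print(cities_with_continent.head())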

Cities were analyzed based on the following extracted datasets.

Note that all of the extracted CSV data for our analyses has already been pushed to the "csv_data" folder of this repository. We recommend using those files instead of re-extracting the data as explained below.

  • Human Modification: the global human modification raster "gHM.tif" must first be reprojected to EPSG:4326 with GDAL:

> gdalwarp -t_srs EPSG:4326 '../../datasets/human_modification/gHM.tif' '../../datasets/human_modification/gHM_mod.tif'

After reprojecting, move the original "gHM.tif" to another unused folder or delete it, so that only the reprojected "gHM_mod.tif" remains in the dataset folder.
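
To confirm the reprojection before running the extraction, a quick check with the GDAL Python bindings (the same GDAL listed under "Built With") should report authority code 4326:

from osgeo import gdal, osr

# Open the reprojected raster and read its spatial reference.
ds = gdal.Open("../../datasets/human_modification/gHM_mod.tif")
srs = osr.SpatialReference(wkt=ds.GetProjection())
print(srs.GetAttrValue("AUTHORITY", 1))  # expected: "4326"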

  • Land Usage: We used the cropland and grazing-land data from 2017. Files are in 5-arcminute .asc format.
  • Elevation: The data is originally exported as a .shp file and must be rasterized to a new .tif with GDAL on the command line (a sketch of the per-city extraction pattern follows this list):
> gdal_rasterize -a MEAN_ELEV "../../datasets/elevation/GMTED2010_Spatial_Metadata.shp" -tr 0.008333333333333 0.008333333333333 "../../datasets/elevation/elevation.tif"
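
The actual extraction lives in city_data.py; as a rough illustration of the pattern it implements (clip each raster to a city polygon, then aggregate to one value per city), here is a hedged sketch. The use of rasterio, the "NAME" column, and the output file name are all assumptions; the repository may use the GDAL bindings directly.

import geopandas as gpd
import pandas as pd
import rasterio
from rasterio.mask import mask

# Both layers are assumed to be in EPSG:4326, matching the rasters above.
cities = gpd.read_file("../../city-boundaries/city_boundaries.shp")

rows = []
with rasterio.open("../../datasets/elevation/elevation.tif") as src:
    for _, city in cities.iterrows():
        # Clip the raster to the city polygon; filled=False returns a
        # masked array, so pixels outside the polygon are ignored by .mean().
        clipped, _ = mask(src, [city.geometry], crop=True, filled=False)
        rows.append({"city": city["NAME"], "mean_elev": float(clipped.mean())})

pd.DataFrame(rows).to_csv("../csv_data/elevation.csv", index=False)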

Unused but also available to extract

Run the following to see all available options to extract data using the city_data.py script:

> python city_data.py -h   
Usage: city_data.py [options]

Options:
  -h, --help          show this help message and exit
  -c C, --plotcity=C  City to be plotted (case sensitive; options are any
                      city in our database: Boston, Los Angeles, Tokyo,
                      Beijing, etc.)
  -w, --worldclim     Extract WorldClim data stored at
                      "../../datasets/worldclim"
  -p, --paleoclim     Extract PaleoClim data stored at
                      "../../datasets/paleoclim"
  -l, --landscan      Extract Landscan data stored at
                      "../../datasets/landscan"
  -b, --brightness    Extract Sky Brightness data stored at
                      "../../datasets/brightness"
  -r, --roads         Extract road density data stored at
                      "../../datasets/roads"
  -m, --human         Extract human modification data stored at
                      "../../datasets/human_modification"
  -u, --urban_heat    Extract urban heat data stored at
                      "../../datasets/urban_heat"
  -y, --land_use      Extract land use data stored at
                      "../../datasets/land_use"
  -e, --elevation     Extract elevation data stored at
                      "../../datasets/elevation"
  -g, --geodist       Output geographical distances between cities.
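
As an illustration of what -g/--geodist produces conceptually, pairwise great-circle distances between city centroids can be computed with pyproj (installed alongside GeoPandas). This is a sketch, not the script's implementation; the "NAME" column is an assumption, and with 6,018 cities the roughly 18 million pairs are better written to a file than printed:

import geopandas as gpd
from pyproj import Geod

cities = gpd.read_file("../../city-boundaries/city_boundaries.shp")
cent = cities.geometry.centroid          # adequate for coarse distances
geod = Geod(ellps="WGS84")

for i in range(len(cent)):
    for j in range(i + 1, len(cent)):
        # geod.inv returns forward azimuth, back azimuth, and distance (m).
        _, _, dist_m = geod.inv(cent.iloc[i].x, cent.iloc[i].y,
                                cent.iloc[j].x, cent.iloc[j].y)
        print(cities["NAME"].iloc[i], cities["NAME"].iloc[j], dist_m / 1000.0)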

City Analysis

Use the cities.py script with the following options to perform analyses on the data. Example output plots are located in the "figures" folder.

> python cities.py -h 
Usage: cities.py [options]

Options:
  -h, --help            show this help message and exit
  -s, --stability       Clusters the cities based on features located at
                        "../csv_data/". Then calculates the cluster stability
                        by running several iterations of clustering.
  -c, --cluster         Clusters city data located at "../csv_data/", then
                        plots the clusters in a geographic representation.
  -t, --centers_to_csv  Saves calculated cluster centroids to CSV file (use
                        with python cities.py -c or -p).
  -p, --pca             Clusters then plots the cities with their first two
                        Principal Components.
  -e, --elbow           Uses centroid CSV files for different k values, and
                        calculates Euclidean distances between cities and
                        clusters. Plots an elbow plot and generates a CSV of
                        cluster distances and of city-to-city distances.
  -m METHOD, --method=METHOD
                        Clustering method to perform. Type either "dbscan" or
                        "kmeans". (Default: kmeans)
  -k K, --num_clusters=K
                        Number of clusters to partition the data into for
                        k-means clustering. Default: 6
  -i C, --cluster_iters=C
                        Number of clustering iterations to calculate baseline
                        clusters. Default: 10
  -b B, --baseline_stability_iters=B
                        Number of iterations to calculate the stability of the
                        baseline clusters. Default: 100
  -n N, --stability_iters=N
                        Number of iterations to calculate the cluster
                        stability for each city. Default: 100
  -f F, --drop_features=F
                        Percentage of features to drop at each iteration of
                        stability calculations. Default: 10
  -d D, --drop_rows=D   Percentage of rows to drop at each iteration of
                        stability calculations. Default: 10
  -r R, --random_seed=R
                        Random seed initialization. Default: 10
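
To make the stability options concrete, here is a minimal sketch (not cities.py itself) of the procedure they describe: cluster the standardized city features with k-means, then repeatedly re-cluster perturbed copies of the data (dropping roughly the -f percentage of features and the -d percentage of rows) and score each run's agreement with the baseline labels. The feature file name and the adjusted Rand index as the agreement score are assumptions.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)                          # -r, --random_seed
features = pd.read_csv("../csv_data/features.csv", index_col=0)  # assumed name
X = StandardScaler().fit_transform(features)

# Baseline clustering (-k 6 and -m kmeans are the defaults).
baseline = KMeans(n_clusters=6, n_init=10, random_state=10).fit_predict(X)

scores = []
for _ in range(100):                                     # -n, --stability_iters
    keep_cols = rng.random(X.shape[1]) >= 0.10           # -f: drop ~10% of features
    keep_rows = rng.random(X.shape[0]) >= 0.10           # -d: drop ~10% of rows
    labels = KMeans(n_clusters=6, n_init=10).fit_predict(
        X[np.ix_(keep_rows, keep_cols)])
    # Compare against the baseline labels of the surviving rows.
    scores.append(adjusted_rand_score(baseline[keep_rows], labels))

print("mean stability (adjusted Rand index):", np.mean(scores))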

SLURM usage

Extraction to CSV can take several hours for many of the datasets, so we recommend using an HPC and scheduling jobs with a SLURM script. Example SLURM script to extract the LandScan data:

#!/bin/bash

# --------------------------------------------------------------
### PART 1: Requests resources to run your job.
# --------------------------------------------------------------
### Optional. Set the job name
#SBATCH --job-name=get_landscan_data
### SLURM reads %x as the job name and %j as the job ID
#SBATCH --output=%x-%j.out
### REQUIRED. Specify the PI group for this job
#SBATCH --account=[enter your group here]
### Optional. Request email when job begins and ends
#SBATCH --mail-type=ALL
### Optional. Specify email address to use for notification
#SBATCH --mail-user=[your email address]
### REQUIRED. Set the partition for your job.
#SBATCH --partition=standard
### REQUIRED. Set the number of cores that will be used for this job.
#SBATCH --ntasks=4
### REQUIRED. Set the number of nodes
#SBATCH --nodes=1
### REQUIRED. Set the memory required for this job.
#SBATCH --mem-per-cpu=5gb
### REQUIRED. Specify the time required for this job, hhh:mm:ss
#SBATCH --time=24:00:00


# --------------------------------------------------------------
### PART 2: Executes bash commands to run your job
# --------------------------------------------------------------
### Install necessary modules
source ~/.bashrc && conda activate cities
### Change to your script's directory
cd ~/world-cities-analysis/code
### Run your work
echo "Extracting Landscan population data..."
python city_data.py -l
sleep 10

Example SLURM script for calculating cluster stability:

#!/bin/bash

# --------------------------------------------------------------
### PART 1: Requests resources to run your job.
# --------------------------------------------------------------
### Optional. Set the job name
#SBATCH --job-name=cluster_stability
### SLURM reads %x as the job name and %j as the job ID
#SBATCH --output=%x-%j.out
### REQUIRED. Specify the PI group for this job
#SBATCH --account=[your group name]
### Optional. Request email when job begins and ends
#SBATCH --mail-type=ALL
### Optional. Specify email address to use for notification
#SBATCH --mail-user=[your email]
### REQUIRED. Set the partition for your job.
#SBATCH --partition=standard
### REQUIRED. Set the number of cores that will be used for this job.
#SBATCH --ntasks=4
### REQUIRED. Set the number of nodes
#SBATCH --nodes=1
### REQUIRED. Set the memory required for this job.
#SBATCH --mem-per-cpu=5gb
### REQUIRED. Specify the time required for this job, hhh:mm:ss
#SBATCH --time=24:00:00


# --------------------------------------------------------------
### PART 2: Executes bash commands to run your job
# --------------------------------------------------------------
### Install necessary modules
source ~/.bashrc && conda activate cities
### Change to your script's directory
cd ~/world-cities-analysis/code
### Run your work
echo "Analyzing k=2 cluster stabilities..."
python cities.py -s -k 2 -b 1000 -i 100 -n 10000

echo "Analyzing k=3 cluster stabilities..."
python cities.py -s -k 3 -b 1000 -i 100 -n 10000

echo "Analyzing k=4 cluster stabilities..."
python cities.py -s -k 4 -b 1000 -i 100 -n 10000

echo "Analyzing k=5 cluster stabilities..."
python cities.py -s -k 5 -b 1000 -i 100 -n 10000

echo "Analyzing k=6 cluster stabilities..."
python cities.py -s -k 6 -b 1000 -i 100 -n 10000

sleep 10

Authors

Kyle Arechiga - [email protected]

Cristian Roman Palacios - [email protected]
