Builds, tests, runs and exports a model of human perception data for street view images based on feature encodings from CLIP and the K-Nearest-Neighbour algorithm.
Please see the Percept project for the mobile web survey app to gather and generate the raw data.
- Percept Map Explorer: end-to-end demo, click on a map and get modelled perception scores for that point.
- Percept Image Demo: generates modelled perception scores for a given image.
- Amsterdam KNN Model: a model generated by this code and the data from the Amsterdam case study.
pip install -r requirements.txt
Invokes the clip-retrieval tool to perform efficient encoding of images into vectors. This must be installed separately, e.g. pip install clip-retrieval,
or as part of the requirements.txt mentioned above.
We use the same naming convention for CLIP models as clip-retrieval, i.e. from their docs:
--clip-model <CLIP model to load>
"(default ViT-B/32). Specify it as "open_clip:ViT-B-32" to use the Open CLIP or "hf_clip:patrickjohncyh/fashion-clip" to use the Huggingface clip model."
usage: clip_retrieval_knn [-h] [--images-dir DIRECTORY] [--embeddings-dir DIRECTORY] --clip-model MODELNAME [--other-clip-retrieval-args ARGS] --geojson FILENAME [--demographics FILENAME] [-k K] [--training-split FLOAT]
[--randomize] [--random-seed INT] [--stratified] [--environmental] [--environmental-method METHOD] [--environmental-text-dir DIR] [--prompt-style NUM]
[--results-log FILENAME] [--normalization-method METHOD] [--skip-cache] [--read-only] [--quiet] [--extra-assertions] [--gender GENDER,...] [--region REGION] [--age AGE_MIN,AGE_MAX]
[--education LEVEL,...] [--export FILENAME]
K-nearest neighbour on CLIP encoded vectors
options:
-h, --help show this help message and exit
--images-dir DIRECTORY, -i DIRECTORY
Directory with images to be processed with clip-retrieval tool
--embeddings-dir DIRECTORY, -e DIRECTORY
Directory for embeddings output of clip-retrieval tool
--clip-model MODELNAME, -M MODELNAME
CLIP model name (see clip-retrieval tool help)
--other-clip-retrieval-args ARGS
Other command line args to pass to clip-retrieval
--geojson FILENAME, -g FILENAME
File with GeoJSON data from survey
--demographics FILENAME, -d FILENAME
CSV File with demographic data per rating from survey
-k K Value of K (number of nearest neighbours to include in cluster) or comma-separated list of k-values to try.
--training-split FLOAT
Portion of data to use for 'training', value between 0 and 1 (default: 0.8)
--randomize Randomly shuffle the data before splitting into training and testing sets.
--random-seed INT Seed for random number generator.
--stratified Use stratified sampling (stratified by rating).
--environmental Add environmental features into the model
--environmental-method METHOD
One of: append, average, slerp
--environmental-text-dir DIR
Path to dir containing prompt files for environmental vars
--prompt-style NUM One of: 0, 1
--results-log FILENAME, -L FILENAME
Append the results to this file (CSV format)
--normalization-method METHOD
softmax10** (default), softmax or divbysum
--skip-cache Do not look for or read any cached data.
--read-only Do not write any data to disk (cache or otherwise).
--quiet, -q Reduce output to minimum.
--extra-assertions Run additional assertions for testing purposes.
--gender GENDER,... Comma-separated list of surveyed people's genders to include in analysis
--region REGION Include in analysis only those ratings from people who claim to be from this stated region (NL, non-NL)
--age AGE_MIN,AGE_MAX
Include in analysis only those ratings from people whose stated age falls within this range
--education LEVEL,...
Comma-separated list of surveyed people's education levels to include in analysis (Primary, Secondary, Tertiary, University, Postgraduate)
--export FILENAME Instead of running KNN, export numpy arrays with CLIP vectors and scores to the given file.
python3 clip_retrieval_knn.py -g data.geojson --images-dir images/ \
-k 20 --clip-model open_clip:ViT-B-32
Load image filename and score data from data.geojson, load the image files themselves from the directory images/, and use K = 20 and the ViT-B-32 model from Open CLIP.
python3 clip_retrieval_knn.py -g data.geojson --images-dir images/ \
-k 40 --clip-model open_clip:ViT-H-14-378-quickgelu \
--environmental --environmental-method slerp --prompt-style 1
Load image filename and score data from data.geojson, load the image files themselves from the directory images/, and use K = 40 and the ViT-H-14-378-quickgelu model from Open CLIP. Also include the complementary environmental variables for each image location, which should also be found in data.geojson. Build prompts using Prompt Style 1, encode the resulting text into vectors, and then combine them with the image vectors using Spherical Linear Interpolation (slerp).
python3 clip_retrieval_knn.py -g data.geojson --images-dir images/ \
-k 10,20,30,40,50 --clip-model open_clip:ViT-H-14-378-quickgelu \
--training-split 0.7 --randomize --random-seed 1000 \
--results-log results.csv
Load image filename and score data from data.geojson, load the image files themselves from the directory images/, run the tests multiple times with different K values from 10 to 50, and use the ViT-H-14-378-quickgelu model from Open CLIP. Randomly shuffle the order of the images with a random seed of 1000. Put 70% of the (shuffled) data into the training set and the rest into the testing set. Write the results of the tests into the file results.csv (appending them to the end).
python3 clip_retrieval_knn.py -g data.geojson --images-dir images/ \
-k 10,20,30,40,50 --clip-model open_clip:ViT-H-14-378-quickgelu \
--demographics demo.csv --age 30,49
Load image filename and score data from data.geojson, load the image files themselves from the directory images/, run the tests multiple times with different K values from 10 to 50, and use the ViT-H-14-378-quickgelu model from Open CLIP. Filter the responses according to the demographic information found in demo.csv, keeping only those scores that were given by participants between the ages of 30 and 49.
python3 clip_retrieval_knn.py -g data.geojson --images-dir images/ \
-k 40 --clip-model open_clip:ViT-H-14-378-quickgelu --export model.npz
Load image filename and score data from data.geojson, load the image files themselves from the directory images/, and use K = 40 and the ViT-H-14-378-quickgelu model from Open CLIP. Export the resulting model to the file model.npz; do not run the tests.
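For reference, here is a minimal sketch of how exported arrays like these could be used to predict a score for a new CLIP vector. The .npz key names (vectors, scores) are assumptions, and a plain mean of the neighbours' scores is used here rather than the tool's normalization methods:

```python
# Minimal sketch of using an exported model for prediction.
# The key names "vectors" and "scores" are assumptions; check the
# export code for the actual array names.
import numpy as np

def predict_score(query_vec, train_vecs, train_scores, k=40):
    """Predict a perception score for one CLIP vector via K nearest neighbours."""
    # Cosine similarity between the query vector and every training vector
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = t @ q
    # Indices of the k most similar training images
    nearest = np.argsort(-sims)[:k]
    # Plain average of their human-rated scores (the real tool offers
    # several neighbour-weighting / normalization methods)
    return train_scores[nearest].mean()

model = np.load("model.npz")
train_vecs, train_scores = model["vectors"], model["scores"]
print(predict_score(train_vecs[0], train_vecs, train_scores, k=40))
```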
What do we mean by 'complementary environmental variables'?
These are environmental data (e.g. 'average street length') that are complementary to the imagery we already have. Each image is associated with a geographic location, and using that geographic location we can download from OpenStreetMap information about the street network and other surrounding points of interest or features. For example, for a given image location X, if we consider a buffer size of 300 metres, that means we take data such as 'the number of shops within 300 metres of location X' or 'the proportion of greenspace within a circle of radius 300 metres centring on location X'.
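As an illustration only (not the pipeline this project uses), a count such as "shops within 300 metres of location X" can be pulled from OpenStreetMap with the osmnx library; the tag choice and example coordinates below are assumptions:

```python
# Illustrative sketch: count shops within a 300 m buffer of an image
# location using osmnx. Tags and coordinates here are hypothetical.
import osmnx as ox

lat, lon = 52.3702, 4.8952   # hypothetical image location in Amsterdam
buffer_m = 300

# All OSM features tagged as shops within the buffer around the point
shops = ox.features_from_point((lat, lon), tags={"shop": True}, dist=buffer_m)
print(f"shops count (within buffer of size {buffer_m}m) is {len(shops)}")
```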
We can apply the complementary environmental variables that were generated for
each image location using the --environmental
option: in that case, we
produce text prompts for each image that describe the complementary
environmental variables and then run CLIP on the text prompts to create vectors
from the text. We then combine the text vectors with the image vectors in one
of three ways (a code sketch of each follows this list):
- append: put the two vectors end-to-end and create a new vector that is twice as long as the originals
- average: element-by-element average the vectors
- slerp: 'Spherical Linear Interpolation' finds a vector halfway between the text vector and the image vector. Effectively, it rotates both vectors towards each other at the same rate until they meet. In 3-D space this would be the point on a sphere halfway between two other points, lying on the same great circle through them; CLIP vectors live in a much higher-dimensional space, so the same idea is applied in its generalized form.
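A rough numpy sketch of these three combination methods, assuming two CLIP vectors of the same dimensionality (not necessarily the exact implementation in this codebase):

```python
# Sketch of the three combination methods for an image vector and a
# text vector of the same dimensionality.
import numpy as np

def combine(image_vec, text_vec, method="slerp", t=0.5):
    if method == "append":
        # End-to-end concatenation: result is twice as long as the inputs
        return np.concatenate([image_vec, text_vec])
    if method == "average":
        # Element-by-element mean
        return (image_vec + text_vec) / 2.0
    if method == "slerp":
        # Spherical linear interpolation; t=0.5 gives the halfway point
        a = image_vec / np.linalg.norm(image_vec)
        b = text_vec / np.linalg.norm(text_vec)
        omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between them
        if np.isclose(omega, 0.0):
            return a  # vectors already (nearly) coincide
        return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    raise ValueError(f"unknown method: {method}")
```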
The prompts generated by style 0 have raw numbers in them and look like this:
greenspace count (within buffer of size 100m) is 9; shops count (within buffer of size 100m) is 28; public transport count (within buffer of size 100m) is 8; sustenance count (within buffer of size 100m) is 6; education count (within buffer of size 100m) is 0; [...]; street length avg (within buffer of size 300m) is 29.089045454545474; orientation entropy (within buffer of size 300m) is 2.5227772640841017; median speed (within buffer of size 300m) is 30.0
The prompts generated by style 1 have numbers rewritten as quintiles encoded as one of ('very low', 'low', 'medium', 'high', 'very high') and look like this:
greenspace count (within buffer of size 100m) is very low; shops count (within buffer of size 100m) is medium; public transport count (within buffer of size 100m) is low; sustenance count (within buffer of size 100m) is very low; education count (within buffer of size 100m) is very low; [...]; street length avg (within buffer of size 300m) is very low; orientation entropy (within buffer of size 300m) is low; median speed (within buffer of size 300m) is low
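For illustration, rewriting a raw value as one of those quintile labels could look like the following sketch; the variable names and example data are hypothetical:

```python
# Illustrative sketch of mapping a raw feature value to a quintile
# label for prompt style 1. Example data is made up.
import numpy as np

LABELS = ["very low", "low", "medium", "high", "very high"]

def quintile_label(value, all_values):
    """Map a value to a quintile label relative to the whole dataset."""
    # Quintile boundaries at the 20th, 40th, 60th and 80th percentiles
    edges = np.quantile(all_values, [0.2, 0.4, 0.6, 0.8])
    return LABELS[np.searchsorted(edges, value, side="right")]

shop_counts = [3, 5, 9, 12, 28, 31, 40, 55, 70, 102]   # hypothetical data
label = quintile_label(28, shop_counts)
print(f"shops count (within buffer of size 100m) is {label}")
```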