
# Dataset Ingestion

This page describes the general workflow for uploading datasets so they are accessible via the Precomputed API.

## Naming Conventions

All datasets are stored in a bucket called `neuroglancer`. Dataset names are all lowercase and end with a version suffix, e.g. `zfish_v0`. To list all datasets currently available, run `gsutil ls gs://neuroglancer/` in a terminal or go to https://console.cloud.google.com/storage/browser/neuroglancer/?project=neuromancer-seung-import in your browser.

Inside a dataset "directory" you will find layers, which can be displayed directly in neuroglancer. A layer containing electron microscopy images is named `image`. The labeled image produced by watershed before any agglomeration is called `segmentation`; after agglomeration it is named, for example, `segmentation_0.2` if it was merged up to a threshold of 0.2. The output of the convnets is stored in a layer named `affinities`.

## Process

### Copy to build directory

The data we want to ingest is first copied to a `build/` subdirectory inside the layer, for example `gs://neuroglancer/zfish_v0/image/build`. This can be done from the command line with `gsutil -m cp ./path/to/data/* gs://neuroglancer/zfish_v0/image/build/`.
The data should follow the Chunked representation of volume data.
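
The filenames in `build/` encode the voxel bounds of each chunk (e.g. `0-1024_0-1024_0-128` in the ingestion example below). As a rough illustration, a staging script might look like the following sketch; the helper names and the use of `np.savez_compressed` for the `npz` encoding are assumptions, not the actual tooling.

```python
# Hypothetical helper for staging chunks into build/ with the
# "xmin-xmax_ymin-ymax_zmin-zmax" naming used by the chunked representation.
import numpy as np

def chunk_filename(offset, shape):
    """Name a chunk after the half-open voxel range it covers."""
    (x0, y0, z0), (sx, sy, sz) = offset, shape
    return "{}-{}_{}-{}_{}-{}".format(x0, x0 + sx, y0, y0 + sy, z0, z0 + sz)

def write_build_chunk(volume, offset, build_dir="./build"):
    """Write one 3D numpy chunk, npz-encoded, named by its global offset."""
    path = "{}/{}".format(build_dir, chunk_filename(offset, volume.shape))
    with open(path, "wb") as f:  # writing to a file handle avoids an added .npz suffix
        np.savez_compressed(f, chunk=volume)
    return path

# e.g. write_build_chunk(np.zeros((1024, 1024, 128), dtype=np.uint8), (0, 0, 0))
```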

### Creating ingestion tasks

Next, an "ingestion" task is created for each chunk and pushed to a Google Cloud pull queue.
An example of a task is:

```json
{
   "chunk_path": "gs://neuroglancer/zfish_v0/image/build/0-1024_0-1024_0-128",
   "chunk_encoding": "npz",
   "layer_path": "gs://neuroglancer/zfish_v0/image"
}
```

The task contains the path of the chunk to process and the encoding required to read it; the `info` file under `layer_path` contains all the information required to output files that follow the Precomputed API. We want all tasks to be independent of each other and to depend only on a single chunk. This means the chunk we are processing should be an integral multiple of neuroglancer's chunk size (defined in the `info` file); otherwise we would only be able to write partial chunks, requiring synchronization between tasks. Similarly, downsampling is done only as far as possible so that only complete chunks are written.
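
For illustration, the tasks for a small volume could be generated along these lines; the volume extent, chunk size, and the final queue push are assumptions standing in for the real pipeline.

```python
# A minimal sketch of generating ingestion tasks, one per build chunk.
import json

LAYER = "gs://neuroglancer/zfish_v0/image"
CHUNK = (1024, 1024, 128)          # size of each staged build chunk (assumed)
VOLUME = (2048, 2048, 256)         # assumed total dataset extent

def ingestion_tasks():
    for x in range(0, VOLUME[0], CHUNK[0]):
        for y in range(0, VOLUME[1], CHUNK[1]):
            for z in range(0, VOLUME[2], CHUNK[2]):
                bounds = "{}-{}_{}-{}_{}-{}".format(
                    x, x + CHUNK[0], y, y + CHUNK[1], z, z + CHUNK[2])
                yield {
                    "chunk_path": "{}/build/{}".format(LAYER, bounds),
                    "chunk_encoding": "npz",
                    "layer_path": LAYER,
                }

for task in ingestion_tasks():
    print(json.dumps(task))  # in practice: push the payload to the pull queue instead
```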

### Run digestion on a cluster

Finally, a cluster is created that runs containers which continuously process tasks until the queue is empty.
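
Conceptually each container runs a loop like the one below; the `lease_task`, `process_task`, and `delete_task` callables are placeholders for whatever queue client the digest image actually uses.

```python
# A schematic worker loop, as run inside each container.
import time

def worker_loop(lease_task, process_task, delete_task, idle_sleep=30):
    while True:
        task = lease_task()          # leasing hides the task from other workers
        if task is None:
            time.sleep(idle_sleep)   # queue may be momentarily empty
            continue
        process_task(task)           # ingest / downsample / mesh, etc.
        delete_task(task)            # acknowledge only after success, so a crashed
                                     # worker's lease expires and the task is retried
```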

### Creating Downsample tasks

Imagine the case in which the chunks stored in `build/` are of size 1024^3 and the chunk size defined in `info` is 64^3. Then there can only be four downsample levels, which probably won't be enough if the dataset is relatively large. More downsampling can be done after all the "ingestion" tasks are processed. The scales to be downsampled should be defined in the `info` file.
An example of a "Downsample" task:

```json
{
   "chunk_path": "gs://neuroglancer/zfish_v0/image/40_40_45/0-64_0-64_0-64",
   "layer_path": "gs://neuroglancer/zfish_v0/image"
}
```

This task will pull eight chunks from `gs://neuroglancer/zfish_v0/image/20_20_45/` covering a volume of 128^3, downsample it, and save the result at `chunk_path`.
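
For an image layer, the reduction itself amounts to averaging non-overlapping blocks. The sketch below shows a 2x2x2 averaging pass and is only an illustration of the idea; segmentation layers need a different reduction (e.g. striding or a mode filter), since averaging label ids is meaningless.

```python
import numpy as np

def downsample_2x(volume):
    """Average non-overlapping 2x2x2 blocks of a 3D array."""
    x, y, z = volume.shape
    assert x % 2 == 0 and y % 2 == 0 and z % 2 == 0
    blocks = volume.reshape(x // 2, 2, y // 2, 2, z // 2, 2)
    return blocks.mean(axis=(1, 3, 5)).astype(volume.dtype)

# e.g. a 128^3 source region becomes a single 64^3 chunk at the next scale:
# downsample_2x(np.zeros((128, 128, 128), dtype=np.uint8)).shape == (64, 64, 64)
```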

### Creating Mesh tasks

A meshing task can only be applied to a segmentation-type layer. An example of a "mesh" task:

```json
{
   "chunk_key": "gs://neuroglancer/zfish_v0/segmentation/5_5_45",
   "chunk_position": "0-1024_0-1024_0-128",
   "layer_path": "gs://neuroglancer/zfish_v0/segmentation",
   "lod": 0,
   "simplification": 5,
   "segments": []
}
```
  • "chunk_key is a path to an scale level, so that you can start meshing from a downsample segmentation for faster processing but where very thin objects might get lost.
  • "chunk_position" specifies the size and location of the chunks to process, the bigger it is the most memory it will be required.
  • "layer_path" path to the layer to find which other scales are available.
  • "lod": so far neuroglancer supports only one level of detail for meshes.
  • "simplification": 0 means no simplification, and 10 is the maximum where most detail is lost.
  • "segments": is a list of ids that are presents in the chunk being processed and that we want to produce meshes for, if empty all meshes are produced. When processing meshes it is required to have 2 pixel overlapping chunks for processing, if the chunks are not overlapping the marching cubes would produce inconsistent meshes.It is also desirable to pad with black the borders so that there are caps filling them.

### Creating MeanGraph tasks

A MeanGraph task is applied simultaneously to a segmentation layer and an affinities layer. An example of a "MeanGraph" task:

```json
{
   "chunk_position": "0-1024_1024-2048_0-128",
   "layer_path_affinities": "gs://neuroglancer/zfish_v0/affinities",
   "layer_path_segmentation": "gs://neuroglancer/zfish_v0/segmentation"
}
```

When creating mean affinity region graphs, a two-pixel overlap between the chunks being processed is also required.
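
A toy version of the computation, under the assumption that `aff[d, x, y, z]` stores the affinity between voxel `(x, y, z)` and its neighbor one step back along axis `d`, might look like this (the real pipeline is far more efficient):

```python
from collections import defaultdict
import numpy as np

def mean_affinity_graph(seg, aff):
    """Mean affinity per pair of touching, non-background segments."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d in range(3):
        shifted = np.roll(seg, 1, axis=d)
        # ignore the wrapped-around face introduced by np.roll
        valid = np.ones(seg.shape, dtype=bool)
        valid[tuple(slice(0, 1) if a == d else slice(None) for a in range(3))] = False
        mask = valid & (seg != shifted) & (seg != 0) & (shifted != 0)
        for a, b, w in zip(seg[mask], shifted[mask], aff[d][mask]):
            edge = (min(a, b), max(a, b))
            sums[edge] += float(w)
            counts[edge] += 1
    return {edge: sums[edge] / counts[edge] for edge in sums}
```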

### Creating Watershed tasks

A Watershed task reads a chunk from an affinities layer and writes the result to a segmentation layer. An example of a "Watershed" task:

```json
{
   "chunk_position": "0-1024_1024-2048_0-128",
   "crop_position": "128-896_128-896_16-112",
   "layer_path_affinities": "gs://neuroglancer/zfish_v0/affinities",
   "layer_path_segmentation": "gs://neuroglancer/zfish_v0/segmentation",
   "high_threshold": 0.99,
   "low_threshold": 0.1,
   "merge_threshold": 0.3,
   "merge_size": 800,
   "dust_size": 100
}
```

A chunk at "chunk_position" will be fetched from the affinities layer, watershed will be applied to it, and the result will be saved at the same position in the segmentation layer.

If "crop_position" is different that an empty string, the chunk will be cropped and save respecting the cropping offsets.
In the example above, the segmentation will be only saved in the x dimension from voxel 128 to 896. While for y it will start at 1152 and end at 1920.
Watershed is a global transform, which means that splitting the dataset into chunks for processing gives a different result. But running it in a single machine can be impractical, and writing a distributed one can led to a complex implementation.
A good compromise usually is to run it in overlapping chunks, despite the decrease in performance, the result is more similar to the global transform.
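
To make the crop arithmetic above concrete, here is a small sketch with a hypothetical parser for the `xmin-xmax_ymin-ymax_zmin-zmax` position strings:

```python
def parse_position(pos):
    """'0-1024_1024-2048_0-128' -> [(0, 1024), (1024, 2048), (0, 128)]"""
    return [tuple(int(v) for v in axis.split("-")) for axis in pos.split("_")]

def saved_bounds(chunk_position, crop_position):
    chunk = parse_position(chunk_position)   # global bounds of the task
    crop = parse_position(crop_position)     # offsets relative to the chunk
    return [(c0 + lo, c0 + hi) for (c0, _), (lo, hi) in zip(chunk, crop)]

print(saved_bounds("0-1024_1024-2048_0-128", "128-896_128-896_16-112"))
# -> [(128, 896), (1152, 1920), (16, 112)]
```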

As with every other task, the output region saved to the segmentation layer has to cover an integral number of the underlying chunks and be grid aligned. If two tasks are set to write to the same underlying chunk, the output will depend on the order of execution of the tasks, which is highly undesirable. The underlying chunk positions and sizes are defined in the `info` file of "layer_path_segmentation".

Advanced levels of wizardry are required to choose the remaining thresholds and sizes (a toy sketch after this list illustrates how the merge and dust rules interact):

  • "high_threshold": affinities larger than this value are considered infinity, which means the pair of voxels connected by this affinity edge is guaranteed to be in the same basin.
  • "low_threshold": edges with affinity less that this value are completely ignored.
  • "merge_threshold" and "merge_size": In the case of neighboring basins which the maximum affinity between them is larger than "merge_threshold" and the size of each of them is less than "merge_size" will be merged.
  • "dust_size": In the very end, we will inspect all the basins and the one smaller than this size will be remove by setting as background(value zero)"

## Kubernetes Cluster

First, create a Kubernetes cluster:

```bash
gcloud container --project "neuromancer-seung-import" clusters create "digest-cluster" --zone "us-east1-b" --machine-type "n1-standard-16" --image-type "GCI" --disk-size "100" --scopes "https://www.googleapis.com/auth/compute","https://www.googleapis.com/auth/devstorage.full_control","https://www.googleapis.com/auth/taskqueue","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/cloud-platform","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "20" --network "default" --enable-cloud-logging --no-enable-cloud-monitoring

# create a deployment called digest from that container image
gcloud config set container/cluster digest-cluster
kubectl run digest --image=gcr.io/neuromancer-seung-import/digest --replicas=640

# it is reasonable to run two containers per available core when doing ingestion tasks
```

To look at the logs of a given pod:

```bash
kubectl get pods
# assuming the id of a pod is digest-1115202986-dw9bc
kubectl logs digest-1115202986-dw9bc
```

Resize the cluster to 30 nodes:

```bash
gcloud container clusters resize digest-cluster --size 30
```

Resize the deployment to 30 replicas:

```bash
kubectl scale deployment digest --replicas=30
```

To manually create or update the Docker image:

```bash
cd neuroglancer/python
# build the container image
docker build --tag gcr.io/neuromancer-seung-import/digest .
# push the container to Google Cloud so that the Kubernetes cluster can access it
gcloud docker -- push gcr.io/neuromancer-seung-import/digest
```