Deep Neural Network Tracking
APT makes use of GPUs to train trackers using deep learning (DL). To make deep learning possible even if you don't have a powerful GPU on your local machine, there are four possible "backends". This is configurable under the Track>GPU/Backend Configuration menu.
- Amazon Web Services (AWS) Cloud. APT is run within MATLAB on your local computer. This should be compatible with any operating system (currently, we have tested on Linux and Windows). To use AWS, you must have an Amazon Web Services (AWS) account that has been configured for one or more EC2 instances (see AWS EC2 Quickstart). The APT user interface will continue to run locally on your computer, but DL-based training and tracking will both run in the cloud on AWS EC2.
- Your account is billed for EC2 usage/storage. This workflow is in development, so you should periodically check the AWS console/dashboard to monitor the state of your EC2 instances. At times the backend compute may become "disconnected" from the MATLAB client and require manual management.
- Janelia Research Campus (JRC) Cluster (current default). This option is Janelia-specific. Your computer must run MATLAB on Linux and have access to JRC's login1/login2 machines via ssh. Commands are sent to the JRC cluster by sshing to login1 or login2 and calling bsub. For this, ssh to login1 and login2 must be set up so that it does not prompt you for any information.
- login1.int.janelia.org and login2.int.janelia.org must be hosts you have sshed to before. If you have never sshed into them, do so once manually so that they become locally recognized ssh hosts. Type
ssh login1.int.janelia.org
in your terminal. If you have never sshed to login1 before, it will say something like:
$ ssh login1.int.janelia.org
The authenticity of host 'login1.int.janelia.org' can't be established.
ECDSA key fingerprint is <something>.
Are you sure you want to continue connecting (yes/no)? yes
Entering yes here will result in login1 being added to the file ~/.ssh/known_hosts, and when you ssh to login1.int.janelia.org again, it will not ask you to confirm the host's authenticity. Repeat the above for login2.int.janelia.org.
- You must have enabled password-less sshing on login1 and login2. For this to happen, your RSA public key must be included in the file ~/.ssh/authorized_keys on the Janelia file server. You can find information about this here. A command to do this is:
cat ~/.ssh/id_rsa.pub | ssh login1 'cat >> .ssh/authorized_keys'
- File paths on the Janelia file server should match paths to any relevant data on your local machine. E.g., if a movie you have labeled is at /groups/branson/home/bransonk/mymovie.avi, then /groups/branson/home/bransonk/mymovie.avi should be accessible both on login1 and on your local machine.
- Before trying this, please make sure that running ssh login1.int.janelia.org and ssh login2.int.janelia.org at the terminal does not ask you any questions and does not require any additional input; a minimal MATLAB check is sketched after this list. If you get a warning like this:
Warning: the RSA host key for 'login1.int.janelia.org' differs from the key for the IP address '<some IP address>'
Offending key for IP in <your computer path>/.ssh/known_hosts:31
either remove that key from your ~/.ssh/known_hosts file or contact the Janelia Helpdesk for assistance doing so.
- Hung/problematic jobs need to be manually managed via bsub (LSF) commands on login1/login2.
- Local Docker on Linux. Your machine runs MATLAB on Linux and has a GPU (with appropriate dependencies installed). DL runs on your local machine using the Docker image we created for code dependencies. See Linux & Docker Setup Instructions.
- Local Conda on Windows. Your machine runs MATLAB on Windows and has a GPU. See Windows & Conda Setup.
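For the JRC backend, the following is a minimal sketch (assumptions: MATLAB is running on Linux with ssh on the system path, and the movie path below is only a hypothetical example) for verifying from MATLAB that ssh is non-interactive and that a data path resolves both locally and on the cluster:
% Check that ssh to login1 works without prompting (BatchMode fails instead of prompting)
[st,~] = system('ssh -o BatchMode=yes login1.int.janelia.org echo ok');
fprintf('Non-interactive ssh to login1 OK: %d\n', st==0);
% Check that the same file path exists locally and on the cluster
moviePath = '/groups/branson/home/bransonk/mymovie.avi';   % hypothetical example path
fprintf('Movie exists locally: %d\n', exist(moviePath,'file')==2);
[st,~] = system(['ssh login1.int.janelia.org test -e ' moviePath]);
fprintf('Movie exists on login1: %d\n', st==0);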
Tracking with DL requires MATLAB R2017a or later and uses the Parallel Computing Toolbox.
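If you want to confirm these requirements from the MATLAB command window, a quick sketch using only standard MATLAB calls (R2017a corresponds to MATLAB version 9.2):
% Confirm the MATLAB release and that the Parallel Computing Toolbox is licensed
assert(~verLessThan('matlab','9.2'), 'APT DL tracking requires MATLAB R2017a or later.');
assert(license('test','Distrib_Computing_Toolbox')==1, 'The Parallel Computing Toolbox is required.');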
All new projects will contain one instance each of all known/available trackers. Legacy projects are similarly updated to have one tracker of each known type. Currently, the implemented types are
- Cascaded Pose Regression (CPR)
- Deep Convolutional Network - MDN: Mixture Density Network
- Deep Convolutional Network - UNet
- Deep Convolutional Network - DeepLabCut
Trackers can be live-switched under the Track> menu and retain state even when not selected. The Track and Train buttons, any visualized tracking results, and some menu subitems under the Track> menu apply to the currently selected tracker.
When saved, state for all trackers is saved and reloaded at project-load time. The saved project remembers which tracker was active at save-time and this tracker is restored.
To access the current tracker object, use the .tracker Labeler property as below. This can be useful from time to time for manual debugging or inspection of training and tracking.
tObj = lObj.tracker; % Currently selected tracker object
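For example, standard MATLAB calls can be used to inspect it; the class name and property list you see depend on which tracker is currently active:
class(tObj)          % which tracker implementation is active
properties(tObj)     % list properties available for inspection/debugging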
Tracking parameters can be set under Track>Configure tracking parameters.
Listed directly under DeepTrack are parameters/settings common to all deep networks: the local cache directory, how often training models should be saved, and so on.
- The Cache Dir is a local filesystem path used by APT to store trained models, log files, and other DL artifacts. This location must be set in order to track and train with deep networks.
Further parameters specific to the currently selected network architecture are located in a nested submenu. These parameters apply only to the current tracker/network architecture.
All parameter settings (for all trackers) are saved within an APT project.
This section assumes you have already created an AWS account, created Access Keys, installed and configured the AWS CLI, and performed these and all other steps in AWS EC2 Quickstart.
To start and communicate with EC2 instances, AWS EC2 Quickstart had you create an ssh Key Pair. You will need the name of this Key Pair, as well as the full path to the .pem file stored on your local workstation. The .pem file contains the private key for the key pair.
If you have not started any EC2 instances, you can do so now in MATLAB:
% keyName is the name of your key pair in a char/string
% pemFile is the full local path of your .pem file
aws = AWSec2(pemFile,'keyName',keyName);
% If this returns true, you are now spinning up an EC2 instance
tfsucc = aws.launchInstance()
Alternatively, you may have an existing EC2 instance, which you can inspect via the EC2 Console. If your instance is Stopped, select Instance State>Start to Start the instance and change its state to Running. The EC2 Console also lists the instanceID for your running instance.
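If you prefer to check the instance state from MATLAB rather than the EC2 Console, here is a minimal sketch using the AWS CLI (this assumes the CLI is installed and configured as in the Quickstart; the instance ID below is a hypothetical placeholder):
% Query the instance state via the AWS CLI from MATLAB
instanceID = 'i-0123456789abcdef0';   % replace with your instanceID
cmd = sprintf(['aws ec2 describe-instances --instance-ids %s ' ...
    '--query "Reservations[].Instances[].State.Name" --output text'], instanceID);
[st,res] = system(cmd);
if st==0, fprintf('Instance state: %s\n', strtrim(res)); end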
Once an EC2 instance is started, in APT use the Track>GPU/Backend Configuration/Set EC2 Instance menu item to input your instanceID and pemfile (specify the full local filesystem path for the latter). APT should now be ready to train and track on Amazon EC2!
When APT Trains or Tracks with a DL tracker, a system call is made and DL runs in a separate process, either locally or in the remote cloud. As the DL library refines a model or produces tracking output, a second background process, spawned in MATLAB using the Parallel Computing Toolbox, monitors the DL process.
During Training, the Training Monitor displays the training loss and distance (error in px) over training iterations. The plots in the Monitor should update roughly every 30 seconds. Menu options enable monitoring cluster/job status and log files, as well as forcing an immediate update of the plots.
The "Stop training" button stops/kills a running Training session. Since models are saved at fixed intervals (configurable in the Tracking Parameters), the most recently saved model will be saved and available for Tracking. You may choose to Stop training either because the training loss/distance has dropped to a satisfactorily low level, or because training with the current parameters is taking too long.
When "Stop training" is pressed, a stop/kill message is sent to the DL process. It may take some time for the DL process to wind down. Please wait for the Training Monitor to acknowledge the stopped job via 'x' marks on the training plots before performing further Train or Track actions within APT.
APT is designed to detect many runtime errors that can occur with DL backends. If an error is detected, a diagnostic message will be displayed and the Training Monitor will be stopped. In some cases, an unexpected error may require manual management of the DL process. In particular, a running bsub job (JRC backend) or AWS process may need to be manually killed.
If it is necessary to manually kill a DL Training process, the following code should be run to reset APT's deep learning infrastructure:
tObj = lObj.tracker; % current tracker
tObj.bgTrnReset;
Tracking with DL typically completes quite quickly. (With the JRC backend, the cluster can sometimes be full and a job will wait pending execution.) Since Tracking is typically fast, APT does not display any progress indication during DL Tracking. When tracking is complete, a message is displayed in the MATLAB Command Window and tracking results appear in the UI as appropriate.
As with Training, a system call is made and DL runs in a separate process; this DL process is itself monitored by a tracking monitor process that looks for tracking output in the filesystem. Again, APT attempts to detect runtime errors, and if an error is detected, in some cases a DL process must be manually killed. The following command checks on the state of the pending or most recent tracking job:
tObj = lObj.tracker; % current tracker object
tObj.trkPrintLogs;
If it is necessary to manually kill a Tracking DL process, run the following to reset APT:
tObj.bgTrkReset;
Since DL Training and Tracking are entirely separate DL processes with separate monitoring processes, once Training has begun and the first trained model is saved, Tracking can be done at any time. (Note: this is currently not true when using AWS as the backend, since it might require multiple EC2 instances. Currently, when using AWS, at most a single training or tracking session can be running for a given APT session at one time.) If Training is not complete, the most recent trained model is used for tracking.
- If your training monitor usually works but then seems to abruptly stop working: sometimes after a MATLAB session has been open for a long time, the local parallel pool gets into a bad or hung state, and the Training Monitor, which runs in a background thread, becomes unresponsive. Restarting MATLAB may address this problem; a lighter-weight pool restart is sketched after this list.
- To see what system command would be used to spawn a training or tracking job without actually triggering the process, set the tObj.dryRunOnly flag to true.
- Currently there is no way to track with a trained model from an iteration prior to the latest/final trained model.
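For the hung-parallel-pool issue in the first bullet above, a lighter-weight step you can try before restarting MATLAB is to tear down and recreate the local parallel pool (standard Parallel Computing Toolbox calls; this is a suggestion, not a guaranteed fix):
delete(gcp('nocreate'));   % shut down the current local pool, if any
parpool('local');          % start a fresh local pool
If the monitor still misbehaves after this, restart MATLAB as described above.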