Hi Liam,

Thanks for reaching out and describing your issues in detail! Let's take them one at a time:
(1) Good question. There isn't a definitive answer to this, but here's what we can recommend from experience: we've only rigorously quantified this in a study we did for the SLEAP paper (Fig. 2c), where accuracy is still increasing past 1,000 labels, but note the log scale on the x-axis.
(2) Good question. A common and rigorous way would be to generate three splits of your dataset: train, validation, and test.
SLEAP does a 90%/10% train/validation split automatically every time you train. This is because, in practice, generating three splits would mean that even less of your labeled data is used for training, and the validation and test sets are usually fairly close anyway.

The most rigorous way would account for the fact that the dataset is fairly small and perform cross-validation, where you generate 80/10/10 splits many times so that you get a distribution of accuracy metrics evaluated over different samplings of the splits. In practice, this is usually not feasible unless you have access to a lot of GPUs, since it would require training many neural networks.

Our recommendation is generally to rely on the validation set accuracy, which is what is reported by default within the GUI and saved to the model folder after each run. We also have a notebook on model evaluation that might be helpful for more detailed accuracy analyses. Note that this is all regarding pose estimation performance, not identity tracking, which is a whole 'nother can of worms...
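If you do want to try the cross-validation route, here's a minimal sketch of how you might generate repeated 80/10/10 splits over your labeled frames (plain NumPy, nothing SLEAP-specific; the label count, repeat count, and seed are just placeholders):

```python
import numpy as np

def make_splits(n_labels, n_repeats=10, fracs=(0.8, 0.1, 0.1), seed=0):
    """Generate repeated train/val/test index splits for cross-validation."""
    rng = np.random.default_rng(seed)
    n_train = int(fracs[0] * n_labels)
    n_val = int(fracs[1] * n_labels)
    splits = []
    for _ in range(n_repeats):
        idx = rng.permutation(n_labels)  # reshuffle all labeled frame indices
        splits.append({
            "train": idx[:n_train],
            "val": idx[n_train:n_train + n_val],
            "test": idx[n_train + n_val:],
        })
    return splits

# e.g., 600 labeled frames -> 10 different 80/10/10 samplings
for split in make_splits(600):
    print(len(split["train"]), len(split["val"]), len(split["test"]))
```

You'd then train one model per split and look at the spread of the resulting metrics.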
(3) The mAP metric is a pretty good holistic summary of the accuracy across your whole dataset (see the paper for details on how it's computed). The values depend a lot on the type of skeleton you have, but it ranges between 0 and 1 (with 1 being the best). Our "gold standard" models usually reach ~0.8 mAP. The other things to look at are the distributions of distances (Fig. 2 in the paper), which tell you about localization error, or the distributions of OKS values (see the metrics notebook), which are a more principled way of scoring pose predictions since they account for visibility and body size.
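If you want to pull those numbers out programmatically (this is what the metrics notebook walks through), something like the following should work; the model folder path is a placeholder, and the exact metric keys may differ slightly across SLEAP versions:

```python
import sleap

# Placeholder path -- point this at one of your trained model folders.
model_path = "models/my_run.centered_instance"

# Load the metrics computed on the validation split after training.
metrics = sleap.load_metrics(model_path, split="val")

print("mAP (OKS-based):", metrics["oks_voc.mAP"])
print("Mean localization error (px):", metrics["dist.avg"])
print("95th percentile error (px):", metrics["dist.p95"])
```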
(4 & 5) Hmm, this is tricky. What's likely happening is that SLEAP loses track of more than 1 animal for more frames than the window size, so when one of them returns, there's no easy way to figure out which of the original tracks it should be assigned to. Despite having 80 tracks, assuming everything is working correctly, you should find that there are always at most 3 tracks in any given frame.

We should definitely be able to do better, though! Try enabling "Connect Single Track Breaks" and using the "flow" method, which can improve the association after crossings like the one in your screenshot. You may also want to increase the frame window to 10.

Our tracker is definitely not ideal for harder cases, and we should add some more heuristic options that, while not ideal for everyone, might help with proofreading in some cases. A couple of relevant suggestions are in #797 and #737. Another might be to have an option to force assignment to a maximum of 3 tracks for the entire duration of the video, even if we're effectively just randomly guessing at times. If you have any other suggestions, please feel free to post them in Ideas! If tracking is prohibitively bad on your data after trying out some different settings, let us know and perhaps share your data with us over email so we can have a closer look.
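For reference, here's roughly how those settings look when running tracking from the command line; the video and model paths are placeholders, and it's worth double-checking the flag names against `sleap-track --help` on your installed version:

```bash
sleap-track my_video.mp4 \
    -m models/centroid_model \
    -m models/centered_instance_model \
    --tracking.tracker flow \
    --tracking.track_window 10 \
    --tracking.target_instance_count 3 \
    --tracking.pre_cull_to_target 1 \
    --tracking.post_connect_single_breaks 1
```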
(6) We're definitely always open to suggestions! The best way is to improve the model predictions, but I appreciate that that's not always enough. Generally, our approach has been to get the best predictions we can, such that 99%+ of frames are good enough, and then simply delete the egregiously bad ones during proofreading. This is faster than correcting them, and we can usually get good enough results by interpolating across missing frames during analysis (provided they happen infrequently enough). Again, we'd love to hear user feedback on how we can improve the labeling interface and workflow, so don't hesitate to ask for your dream features in Ideas!
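As a sketch of what that interpolation step can look like at analysis time (this is just pandas on exported coordinates, not a SLEAP feature; the array shape and gap lengths are assumptions):

```python
import numpy as np
import pandas as pd

# Assume `tracks` is (n_frames, n_nodes, 2) for one animal, with NaNs
# where egregiously bad predictions were deleted during proofreading.
n_frames, n_nodes = 1000, 5
tracks = np.random.rand(n_frames, n_nodes, 2)
tracks[100:103] = np.nan  # a short gap left by deleted frames

# Linearly interpolate across short gaps only, leaving long dropouts as NaN.
flat = pd.DataFrame(tracks.reshape(n_frames, -1))
filled = flat.interpolate(method="linear", limit=5, limit_area="inside")
tracks_filled = filled.to_numpy().reshape(n_frames, n_nodes, 2)
```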
(7) This sounds pretty bad! Data integrity is our top priority, so we should definitely follow up on this, especially if you can reproduce it. Do you mind filling out a Bug Report so we can try to reproduce it on our end?

Thanks for taking the time to ask all these questions -- I'm sure they'll be super helpful to other users browsing the discussions! Let me know if you have any follow-ups!

Cheers,
Talmo
Hello!
I am very new to SLEAP and have several questions to help improve my model. I began by following along with the tutorial, seeking help here, and reading relevant posts. My lab is hoping to use SLEAP for pose and identity tracking of 3 mice in a small box where they are free to interact.
I began by labeling 100 random frames across 5 videos, then entered the loop of training and correcting the model on 20-100 random frames at a time. I went through this process about 5-6 times (top-down pipeline), ending up with 500-600 frames and roughly 1500-1800 instances in my training set. (1) Do you recommend a certain number of frames to initially label, and is there an average or recommended number of final frames and instances to achieve an accurate model? (2) Should I run inference on a new video to assess accuracy, and what is the best method of evaluating the accuracy of the trained model?
I stopped this process after I was "satisfied" with the performance of the predictions (I'd estimate 85% accuracy on the final 100 randomly predicted frames). (3) What level of accuracy should I look for in my model?
I next went through with running inference on one entire video (8,999 frames), using the settings recommended in a previous post.
After running inference, I had around 80 identified tracks in total, which I had to go through and correct down to the 3 actual identities present. (4) Are there better settings I should use for identity tracking to limit this number? I did use the "target instances per frame" and "cull to count" settings. (5) However, is there a way to set a target number of identities across the whole video and specify a 3-animal project?
I then went through to assess the accuracy of the skeleton placement across frames, which was about 80% right. The only way I could seem to correct the 20% that were mislabeled (scrambled skeleton, backwards skeleton, piece of skeleton labeled on the wrong body, etc.) was to go through frame by frame making corrections. (6) Is there any way to make corrections across a clip, or otherwise improve the process of fixing mislabeled skeletons? (I hope and imagine that improving the training process will limit the number of mislabels and improve the accuracy of the model in general.)
Lastly, I recently updated SLEAP and lost all the predictions on my videos, left with only the instances I had labeled or corrected. This has happened before when closing the program, even after saving multiple times. (7) Is there any way to save predictions when closing the program?
Thank you for all of your help! Your team has been amazing in answering questions so far.
Please let me know if you would like to know anything more about the process or specifications I have used.
Liam