Real syllables being labeled as "noise" (point label = -1) #32

Open
ArKornreich opened this issue Jan 13, 2024 · 4 comments
Labels: status: completed, type: documentation, type: user help

Comments

@ArKornreich

Hi!

First of all, thank you so much for your earlier help!

I have gotten everything to run smoothly up to this point, but when I open my dataset in the app post-segmentation, I find a large number of syllables labelled as noise. Given that I did a careful and pretty thorough job of noise reduction before introducing my recordings into pykanto, there ought not to be much noise left to disregard. Is there anything I can do about this?

Thank you so much!

Ar K

nilomr self-assigned this Jan 14, 2024
@nilomr
Owner

nilomr commented Jan 14, 2024

Hi Ar,

Great to hear!

In this context, 'noise' just refers to data points that haven't been assigned to a cluster. Pykanto uses UMAP and HDBSCAN for a preliminary classification, which you can then adjust interactively. The number of data points without cluster membership depends on:

i) The nature of the data,
ii) How you configure the dimensionality reduction algorithm, and
iii) The parameters used for clustering.

There's no universal set of parameters that works well in all situations. This is because some datasets may violate the assumptions each algorithm makes, and in many cases discrete population-wide categories might not exist.
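
To see how much a given configuration leaves unassigned, it can help to quantify the share of points labelled -1. Below is a minimal sketch in plain Python, not pykanto code: the two label lists are made-up stand-ins for the output of two clustering runs, and `noise_fraction` is a hypothetical helper.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN's convention for points not assigned to any cluster


def noise_fraction(labels):
    """Return the share of points carrying the noise label."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == NOISE_LABEL) / len(labels)


# Hypothetical label vectors from two clustering runs on the same 10 units
strict_run = [0, 0, -1, 1, -1, -1, 1, -1, 0, -1]
lenient_run = [0, 0, 0, 1, -1, 1, 1, 1, 0, -1]

for name, labels in [("strict", strict_run), ("lenient", lenient_run)]:
    # Count cluster sizes, leaving the noise points out
    sizes = Counter(label for label in labels if label != NOISE_LABEL)
    print(f"{name}: noise={noise_fraction(labels):.0%}, cluster sizes={dict(sizes)}")
```

Comparing this number across parameter settings gives a rough sense of which choices drive the amount of unassigned data, although a lower noise share does not by itself mean the clusters are more meaningful.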

See:

Also, see this bit from the app notes:

Limitation 2: [...] the clustering process will work increasingly poorly with those [species] that have a large number of very variable elements. This is true of any clustering method: they will fail or produce spurious results if variation in the data is continuous.

If you attach a couple of screenshots of the interactive app I can also try to give you more targeted advice.

Hope that helps
— Nilo

@ArKornreich
Author

Thank you so much for the speedy response!

This helped considerably, deepest thanks!

@ArKornreich
Author

Shoot! One more question.

Once labeling occurs, is it possible to view/get data from songs as a sequence of these new labels? For instance, if I get syllable/unit clusters A, B, C, D, E, F, and G, is there a way I could see one of the songs in the dataset as CABGDEF or something like this?

Thank you again!

Best,

Ar K

@nilomr
Owner

nilomr commented Jan 15, 2024

Yes - here you go!
https://gist.github.com/nilomr/fd72373b7c2aaf0a717c151d7afa5244

There is no explicit way to do this in pykanto; the example above is a simple one. The idea is that you end up with common Python data structures (lists, pandas dataframes), so you can do whatever you need with the data while keeping the format standardised.
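
For readers without access to the gist, the general idea can be sketched with plain Python structures. This does not reproduce the gist or use any pykanto API: the records, the song IDs, and the `label_sequences` helper are all hypothetical, and it assumes each unit comes with a song ID, its position within the song, and a cluster label.

```python
from collections import defaultdict

# Hypothetical per-unit records: (song ID, unit position within the song, cluster label)
units = [
    ("song_01", 2, "B"),
    ("song_01", 0, "C"),
    ("song_01", 1, "A"),
    ("song_02", 0, "A"),
    ("song_02", 1, "A"),
    ("song_02", 2, "G"),
]


def label_sequences(records):
    """Group unit labels by song and join them in positional order."""
    by_song = defaultdict(list)
    for song_id, position, label in records:
        by_song[song_id].append((position, label))
    # Sorting the (position, label) pairs restores the order of units in each song
    return {
        song_id: "".join(label for _, label in sorted(pairs))
        for song_id, pairs in by_song.items()
    }


sequences = label_sequences(units)
print(sequences)
```

The same grouping could be done with a pandas dataframe if the labels are already stored in one; the dictionary version is used here only to keep the sketch dependency-free.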
