Real syllables being labeled as "noise" (point label = -1) #32

ArKornreich · 2024-01-13T16:03:16Z

Hi!

First of all, thank you so much for your earlier help!

I have gotten everything to run smoothly up to this point, but when I open my dataset in the app post-segmentation, I find that a large number of syllables are labelled as noise. Given that I did a careful and pretty thorough job of noise reduction before introducing my recordings into pykanto, there should not be much noise left to disregard. Is there anything I can do about this?

Thank you so much!

Ar K

nilomr · 2024-01-14T15:43:00Z

Hi Ar,

Great to hear!

In this context, 'Noise' just refers to data that hasn't been assigned to a cluster. Pykanto utilizes UMAP and HDBSCAN for a preliminary classification, which you can then adjust interactively. The number of data points without cluster membership depends on:

i) The nature of the data,
ii) How you configure the dimensionality reduction algorithm, and
iii) The parameters used for clustering.

There's no universal set of parameters that works well for all situations. This is because some datasets may violate the assumptions made by each algorithm, and in many cases discrete population-wide categories might not exist.
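To compare parameter settings, it can help to measure how many units end up without a cluster. The following is a minimal sketch, assuming the cluster labels are available as a plain Python list in which -1 marks unassigned points (as in the issue title). The function name `noise_fraction` and the example labels are hypothetical and not part of pykanto.

```python
from collections import Counter


def noise_fraction(labels):
    """Return the share of labels equal to -1 (points without a cluster)."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    return counts[-1] / len(labels)


# Hypothetical labels for ten units; -1 marks points left out of every cluster.
example_labels = [0, 0, 1, -1, 1, 2, -1, -1, 0, 2]
print(f"{noise_fraction(example_labels):.0%} of units are unassigned")
```

Running this after each change to the dimensionality reduction or clustering parameters gives a quick number to compare, although a lower noise fraction does not by itself mean the clusters are more meaningful.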

See:

UMAP parameter selection
HDBSCAN parameter selection
pykanto/pykanto/dataset.py, line 752 in 35d2218: def cluster_ids(

Also, see this bit from the app notes:

Limitation 2: [...] the clustering process will work increasingly poorly with those [species] that have a large number of very variable elements. This is true of any clustering method: they will fail or produce spurious results if variation in the data is continuous.

If you attach a couple of screenshots of the interactive app I can also try to give you more targeted advice.

Hope that helps
— Nilo

ArKornreich · 2024-01-14T19:29:01Z

Thank you so much for the speedy response!

This helped considerably, deepest thanks!

ArKornreich · 2024-01-14T21:25:13Z

Shoot! One more question.

Once labeling is done, is it possible to view or extract songs as sequences of these new labels? For instance, if I get syllable/unit clusters A, B, C, D, E, F, and G, is there a way I could see one of the songs in the dataset as CABGDEF or something like this?

Thank you again!

Best,

Ar K

nilomr · 2024-01-15T11:40:50Z

Yes - here you go!
https://gist.github.com/nilomr/fd72373b7c2aaf0a717c151d7afa5244

There is no explicit way to do this in pykanto; the example above is a simple one. The idea is that you end up with common Python data structures (lists, pandas dataframes), so you can do whatever you need with the data while keeping the format standardised.
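For readers who cannot open the gist, the general idea can be sketched with plain Python. This is not the gist's code: it assumes the labels have already been exported as rows of (song ID, unit onset, cluster label), and the row layout, the `label_sequences` helper and the `?` placeholder for unassigned units are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (song ID, unit onset in seconds, cluster label).
# The label "-1" stands for units that were not assigned to any cluster.
units = [
    ("song_01", 0.42, "A"),
    ("song_01", 0.10, "C"),
    ("song_01", 0.75, "B"),
    ("song_02", 0.05, "B"),
    ("song_02", 0.30, "-1"),
    ("song_02", 0.61, "A"),
]


def label_sequences(rows, noise_label="-1", noise_symbol="?"):
    """Group units by song, order them by onset, and join their labels."""
    by_song = defaultdict(list)
    for song_id, onset, label in rows:
        by_song[song_id].append((onset, label))
    return {
        song_id: "".join(
            noise_symbol if label == noise_label else label
            for _, label in sorted(items)  # sorts by onset first
        )
        for song_id, items in by_song.items()
    }


sequences = label_sequences(units)
print(sequences)
```

Keeping a visible placeholder for unassigned units, instead of dropping them, preserves the position of every unit in the song, which matters if the sequences are later used for transition counts.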

nilomr self-assigned this Jan 14, 2024

nilomr added the type: user help label Jan 14, 2024

NickleDave mentioned this issue Jan 15, 2024

ENH: Add Clusterer class + cluster module vocalpy/vocalpy#103

Open

2 tasks

nilomr added status: completed type: documentation Improvements or additions to documentation labels Jan 22, 2024
