Replies: 1 comment
-
Hi @denisw, Presidio itself is deterministic. The usual suspects are the NER model and Azure AI Language. Depending on the type of NER model you're using (spaCy, transformers, stanza), I would suggest looking into fixing the seed for those.
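For example, a minimal sketch of what fixing those seeds could look like before the analyzer is constructed. The seed value and the set of backends covered here are assumptions rather than a confirmed Presidio recipe; only the backend actually in use needs to be seeded:

```python
# Pin the RNGs of the common NER backends before loading the model.
import random

import numpy as np

random.seed(0)
np.random.seed(0)

try:
    # spaCy backend: fix_random_seed seeds Python, NumPy and (if installed) torch RNGs.
    import spacy
    spacy.util.fix_random_seed(0)
except ImportError:
    pass

try:
    # transformers backend: set_seed seeds random, NumPy and torch (incl. CUDA).
    from transformers import set_seed
    set_seed(0)
except ImportError:
    pass
```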
-
I am using Presidio together with Azure AI Language and some custom analyzers (including context enhancement), and thought it would be a good idea to create a regression test that takes some known input texts and checks that the anonymization result is the same as in the past.
I noticed, though, that this kind of test is flaky: sometimes the analysis and anonymization results differ slightly from the previous value for some runs, only to match again in some later run.
Is it expected that Presidio's recognizers are not fully deterministic? Is there some source of randomness that can perhaps be controlled? Or should I simply not count on the same text resulting in the same analyzer results and anonymized text?
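For context, the test is roughly of this shape. This is a minimal sketch using the default Presidio engines; the sample text and expected output are placeholders, and the real setup plugs in Azure AI Language plus the custom recognizers and context enhancement:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine


def test_anonymization_is_stable():
    text = "My name is John Smith and my phone number is 212-555-0100."

    # Default engines; assumes the default spaCy English model is installed.
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

    # Expected value captured from a previous run; the flakiness shows up
    # as intermittent mismatches on this assertion.
    assert anonymized.text == (
        "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
    )
```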