filtering expression - handling of missing values by != and !~ #2355

Open
aheinzel opened this issue Jan 18, 2025 · 0 comments

aheinzel commented Jan 18, 2025

Dear all,

I have a question about how missing values are handled by != and !~ in filtering expressions in bcftools v1.19+htslib-1.19. To illustrate it, here is a minimal VCF file with a single INFO tag named TAG.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	*	*	.	.	TAG=a,b,c	GT	0/0
chr1	2	.	*	*	.	.	TAG=a	GT	0/0
chr1	3	.	*	*	.	.	TAG=.	GT	0/.
chr1	4	.	*	*	.	.	TAG=.,.	GT	./0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.

With the standard string comparison operator, view -i TAG[*]!="." includes only sites that have at least one non-missing value for TAG:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -i TAG[*]!="." input.vcf; Date=Sat Jan 18 17:33:45 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	*	*	.	.	TAG=a,b,c	GT	0/0
chr1	2	.	*	*	.	.	TAG=a	GT	0/0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.
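
The output above is what one would expect if the [*] subscript means "true when any element satisfies the comparison". A minimal Python sketch of that reading follows. It only models the observed output and is not bcftools code; the names RECORDS and any_not_equal are made up for illustration.

```python
# TAG values per POS, copied from the minimal VCF above.
RECORDS = {
    1: ["a", "b", "c"],
    2: ["a"],
    3: ["."],
    4: [".", "."],
    5: ["a", ".", "c"],
}


def any_not_equal(values, literal):
    """Assumed reading of TAG[*]!="literal": true if any element differs."""
    return any(v != literal for v in values)


# Keep the sites for which the assumed expression is true.
kept = [pos for pos, values in RECORDS.items() if any_not_equal(values, ".")]
print(kept)  # [1, 2, 5], the same sites as in the output above
```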

Doing the same with the negated regex operator, view -i TAG[*]!~"\.", does not filter out any sites:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -i TAG[*]!~"\." input.vcf; Date=Sat Jan 18 17:37:27 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	*	*	.	.	TAG=a,b,c	GT	0/0
chr1	2	.	*	*	.	.	TAG=a	GT	0/0
chr1	3	.	*	*	.	.	TAG=.	GT	0/.
chr1	4	.	*	*	.	.	TAG=.,.	GT	./0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.
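
For comparison, this is what a plain "any element fails to match" reading of !~ would give for the same pattern. It is a hypothetical Python model in which a missing value is just the literal string "."; the helper any_not_match is not a bcftools function.

```python
import re

# TAG values per POS, copied from the minimal VCF above.
RECORDS = {
    1: ["a", "b", "c"],
    2: ["a"],
    3: ["."],
    4: [".", "."],
    5: ["a", ".", "c"],
}


def any_not_match(values, pattern):
    """Hypothetical reading of TAG[*]!~"pattern": true if any element has no match."""
    regex = re.compile(pattern)
    return any(regex.search(v) is None for v in values)


# Sites this model would keep for the pattern \. (a literal dot).
expected = [pos for pos, values in RECORDS.items() if any_not_match(values, r"\.")]
print(expected)  # [1, 2, 5]
```

Under this reading sites 3 and 4 would be dropped, whereas the bcftools output above keeps all five.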

Sorry, I just realized that this example may not be conclusive, so here is another one with view -i TAG[*]!~"[A-z]":

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -i TAG[*]!~"[A-z]" input.vcf; Date=Sun Jan 19 15:15:24 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	3	.	*	*	.	.	TAG=.	GT	0/.
chr1	4	.	*	*	.	.	TAG=.,.	GT	./0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.
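
One way to check which pattern separates the two possibilities is to model both in Python. The sketch below is hypothetical: neither function is taken from the bcftools source, and treating "." as a missing value that always satisfies !~ is only the assumption raised in this issue.

```python
import re

# TAG values per POS, copied from the minimal VCF above.
RECORDS = {
    1: ["a", "b", "c"],
    2: ["a"],
    3: ["."],
    4: [".", "."],
    5: ["a", ".", "c"],
}


def plain(values, pattern):
    """Model A: any element fails to match; "." is an ordinary string."""
    return any(re.search(pattern, v) is None for v in values)


def missing_is_true(values, pattern):
    """Model B (assumption): a missing element "." always counts as not matching."""
    return any(v == "." or re.search(pattern, v) is None for v in values)


# Evaluate both models for both patterns used in this issue.
results = {}
for pattern in (r"\.", r"[A-z]"):
    for model in (plain, missing_is_true):
        results[(pattern, model.__name__)] = [
            pos for pos, values in RECORDS.items() if model(values, pattern)
        ]

for key, kept in results.items():
    print(key, kept)
```

Both models keep sites 3, 4 and 5 for [A-z], so that output does not separate them. Only the \. pattern does: model A keeps sites 1, 2 and 5, while model B keeps all five, as in the bcftools output for that pattern. As an aside, in ASCII order the range A-z also covers the six characters between Z and a ([, \, ], ^, _ and `), so [A-Za-z] would be the stricter pattern; that makes no difference for this data.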

Does the negated regex operator automatically evaluate to true for missing values?

aheinzel changed the title from "filtering expression - handing of missing values by != and !=" to "filtering expression - handling of missing values by != and !~" on Jan 21, 2025