Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupted gene start/end fields #5

Open
abs-yy opened this issue Jul 20, 2022 · 0 comments
Open

Corrupted gene start/end fields #5

abs-yy opened this issue Jul 20, 2022 · 0 comments

Comments

@abs-yy
Copy link

abs-yy commented Jul 20, 2022

Hi, first of all, big thanks for the great software!
I've been trying to get UTR predictions using --addUTR in braker1 and got corrupted gene and transcript rows; the start/end positions of the gene/transcripts are not correct. There were 82 cases like this in my ~20,000 gene set. The results of a test run using the test files from the GitHub repository also have this error, so I presumed it was a GUSHR error. I'm not fluent in python so I just filed an issue.

Here's some genes from the test run that have this error. You can see that the gene and transcript rows edited by GUSHR has mistaken start and stop positions.

% perl -lane 'print if @F[3]>@F[4] && @F[2] eq "gene"' ~/temp/utrs.gtf  | cut -f 9 | sort | uniq | grep -f - ~/temp/utrs.gtf
2R	GUSHR	gene	10078919 	8467054 	.	+ 	.	g693
2R	GUSHR	transcript	10078919 	8467054 	.	+ 	.	g693.t1
2R	AnnotationFinalizer	5'-UTR	8467039 	8467054 	. 	+ 	. 	transcript_id "g693.t1"; gene_id "g693";
2R	AnnotationFinalizer	5'-UTR	10078919 	10079011 	. 	+ 	. 	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	start_codon	10079012	10079014	.	+	0	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10079012	10079182	1	+	0	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10081507	10081632	1	+	0	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10082642	10082967	1	+	0	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10083030	10083134	1	+	1	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10083262	10083326	1	+	1	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10083404	10083493	1	+	2	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	CDS	10083574	10083938	1	+	2	transcript_id "g693.t1"; gene_id "g693";
2R	AUGUSTUS	stop_codon	10083936	10083938	.	+	0	transcript_id "g693.t1"; gene_id "g693";
2R	AnnotationFinalizer	3'-UTR	10083939 	10084210 	. 	+ 	. 	transcript_id "g693.t1"; gene_id "g693";
2R	GUSHR	gene	10197107 	9752095 	.	+ 	.	g769
2R	GUSHR	transcript	10197107 	9752095 	.	+ 	.	g769.t1
2R	AnnotationFinalizer	5'-UTR	9752086 	9752095 	. 	+ 	. 	transcript_id "g769.t1"; gene_id "g769";
2R	AnnotationFinalizer	5'-UTR	10197107 	10197116 	. 	+ 	. 	transcript_id "g769.t1"; gene_id "g769";
2R	AnnotationFinalizer	5'-UTR	10648880 	10648911 	. 	+ 	. 	transcript_id "g769.t1"; gene_id "g769";
2R	AUGUSTUS	start_codon	10648912	10648914	.	+	0	transcript_id "g769.t1"; gene_id "g769";
2R	AUGUSTUS	CDS	10648912	10649625	1	+	0	transcript_id "g769.t1"; gene_id "g769";
2R	AUGUSTUS	stop_codon	10649623	10649625	.	+	0	transcript_id "g769.t1"; gene_id "g769";
2R	AnnotationFinalizer	3'-UTR	10649626 	10649826 	. 	+ 	. 	transcript_id "g769.t1"; gene_id "g769";
2R	GUSHR	gene	10005157	9999915	.	- 	.	g674
2R	GUSHR	transcript	10005157	9999915	.	- 	.	g674.t1
2R	AnnotationFinalizer	3'-UTR	9995034 	9996490 	. 	- 	. 	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	stop_codon	9996491	9996493	.	-	0	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9996491	9996621	1	-	2	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9997704	9997889	1	-	2	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9998271	9998346	1	-	0	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9998415	9998562	1	-	1	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9998707	9998802	1	-	1	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9999425	9999525	1	-	0	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	9999734	9999915	1	-	2	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	CDS	10005157	10005370	0.91	-	0	transcript_id "g674.t1"; gene_id "g674";
2R	AUGUSTUS	start_codon	10005368	10005370	.	-	0	transcript_id "g674.t1"; gene_id "g674";
2R	AnnotationFinalizer	5'-UTR	10005371 	10005377 	. 	- 	. 	transcript_id "g674.t1"; gene_id "g674";
2R	AnnotationFinalizer	5'-UTR	10024051 	10024202 	. 	- 	. 	transcript_id "g674.t1"; gene_id "g674";
2R	AnnotationFinalizer	5'-UTR	10025850 	10026019 	. 	- 	. 	transcript_id "g674.t1"; gene_id "g674";

After meddling around with gushr.py, I noticed at line 680 and 684 (and also in 693, 697), the both two values being compared were str variables, not int. Adding int() to these rows and fixing line 551 (A space was added before the tabular) resulted in a perfect test run gushr.py -t example/augustus.gtf -b example/RNAseq.bam -g example/genome.fa -o utrs2 and braker --addUTR run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant