Invalid character results in wrong error message ("All sequences must have the same length") #19

Open
niemasd opened this issue Mar 29, 2020 · 2 comments

niemasd commented Mar 29, 2020

I have one sequence (hCoV_19_Norway_1539_2020_EPI_ISL_417487) that tn93 keeps reporting as one character shorter than it actually is (or at least seems to be). I have attached a minimal working example below:

example.txt

I tried to run tn93 as follows:

cat example.aln | tn93 -l 1 -t 1

But I get the following error message:

All sequences must have the same length (29811), but sequence 'hCoV_19_Norway_1539_2020_EPI_ISL_417487' had length 29810

However, I tried checking it in Python (lines[3] is the problematic sequence):

lines = open('example.txt').readlines()

len(lines[1])  # prints 29812 (includes the newline at the end)
lines[1][:10]  # 'CTTCCCAGGT'
lines[1][-10:] # 'AATTTTAGT\n'
set(lines[1])  # {'\n', 'R', 'G', 'A', 'C', 'T', 'M'}

len(lines[3])  # prints 29812 (includes the newline at the end)
lines[3][:10]  # 'CTTCCCAGGT'
lines[3][-10:] # 'AATTTTAGT\n'
set(lines[3])  # {'V', 'S', '\n', 'R', 'G', 'I', 'A', 'C', 'Y', 'T'}

len(lines[5])  # prints 29812 (includes the newline at the end)
lines[5][:10]  # '----------'
lines[5][-10:] # 'AATTTTAGT\n'
set(lines[5])  # {'\n', 'G', 'A', '-', 'C', 'T'}

Excluding the newline character after every line (which is included in the lengths printed by the above code), each sequence has exactly 29811 characters.

The only weird character I see in the problematic sequence is I, which doesn't seem to be a standard IUPAC character. Thoughts?
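
A quick way to confirm this kind of suspicion is to scan each sequence line for characters outside the expected alphabet. The sketch below assumes the accepted alphabet is the IUPAC nucleotide codes plus `?` and `-` (which matches the ValidChars string quoted later in this thread), that header lines start with `>`, and that matching is case-insensitive. The function names are hypothetical and not part of tn93.

```python
from collections import Counter

# Assumed alphabet: IUPAC nucleotide codes plus '?' and '-'
# (the same characters as the ValidChars string quoted later in this thread).
VALID_CHARS = set("ACGTURYSWKMBDHVN?-")


def find_invalid_characters(sequence):
    """Count the non-whitespace characters in `sequence` outside VALID_CHARS."""
    # Upper-casing before the lookup is an assumption about case handling.
    return Counter(
        ch for ch in sequence
        if not ch.isspace() and ch.upper() not in VALID_CHARS
    )


def report_invalid(fasta_lines):
    """Yield (header, Counter) for every sequence line with invalid characters."""
    header = None
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()  # remember the most recent record name
        else:
            bad = find_invalid_characters(line)
            if bad:
                yield header, bad
```

Calling `list(report_invalid(open('example.txt')))` would then list each affected record together with a count of its offending characters.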


niemasd commented Mar 29, 2020

Actually, yes, it seems as though the I was the culprit. Replacing I with N makes tn93 run properly.
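
For reference, the replacement can be scripted. This is only a sketch of the workaround, with a hypothetical helper name. It assumes header lines start with `>` and leaves them untouched, since the record names here (e.g. the EPI_ISL part) contain the letter I themselves.

```python
def replace_invalid(fasta_lines, invalid="I", replacement="N"):
    """Return FASTA lines with `invalid` swapped for `replacement` in sequences only."""
    fixed = []
    for line in fasta_lines:
        if line.startswith(">"):
            fixed.append(line)  # header line: keep the record name as-is
        else:
            fixed.append(line.replace(invalid, replacement))
    return fixed
```

Writing the result back out, e.g. `open('example_replaced.txt', 'w').writelines(replace_invalid(open('example.txt')))`, gives a file with the same sequence lengths but only accepted characters.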

example_replaced.txt

Read 3 sequences of length 29811
Will perform 3 pairwise distance calculations
Progress: ID1,ID2,Distance
Progress:       0% (       0 links found,         -nan evals/sec)hCoV_19_Norway_1539_2020_EPI_ISL_417487,hCoV_19_Pakistan_Gilgit1_2020_EPI_ISL_417444,0.000335814
Progress:    33.3% (       1 links found,          inf evals/sec)hCoV_19_Norway_1538_2020_EPI_ISL_417486,hCoV_19_Norway_1539_2020_EPI_ISL_417487,6.70927e-05
hCoV_19_Norway_1538_2020_EPI_ISL_417486,hCoV_19_Pakistan_Gilgit1_2020_EPI_ISL_417444,0.000268644
Progress:     100% (       3 links found,          inf evals/sec)
{
        "Actual comparisons performed" :3,
        "Comparisons accounting for copy numbers " :3,
        "Total comparisons possible" : 3,
        "Links found" : 3,
        "Maximum distance" : 0.000336,
        "Sequences" : 3,
        "Mean distance" : 0.000224,
        "Histogram" : [[0.005,3],[0.01,0],[0.015,0],[0.02,0],[0.025,0],[0.03,0],[0.035,0],[0.04,0],[0.045,0],[0.05,0],[0.055,0],[0.06,0],[0.065,0],[0.07,0],[0.075,0],[0.08,0],[0.085,0],[0.09,0],[0.095,0],[0.1,0],[0.105,0],[0.11,0],[0.115,0],[0.12,0],[0.125,0],[0.13,0],[0.135,0],[0.14,0],[0.145,0],[0.15,0],[0.155,0],[0.16,0],[0.165,0],[0.17,0],[0.175,0],[0.18,0],[0.185,0],[0.19,0],[0.195,0],[0.2,0],[0.205,0],[0.21,0],[0.215,0],[0.22,0],[0.225,0],[0.23,0],[0.235,0],[0.24,0],[0.245,0],[0.25,0],[0.255,0],[0.26,0],[0.265,0],[0.27,0],[0.275,0],[0.28,0],[0.285,0],[0.29,0],[0.295,0],[0.3,0],[0.305,0],[0.31,0],[0.315,0],[0.32,0],[0.325,0],[0.33,0],[0.335,0],[0.34,0],[0.345,0],[0.35,0],[0.355,0],[0.36,0],[0.365,0],[0.37,0],[0.375,0],[0.38,0],[0.385,0],[0.39,0],[0.395,0],[0.4,0],[0.405,0],[0.41,0],[0.415,0],[0.42,0],[0.425,0],[0.43,0],[0.435,0],[0.44,0],[0.445,0],[0.45,0],[0.455,0],[0.46,0],[0.465,0],[0.47,0],[0.475,0],[0.48,0],[0.485,0],[0.49,0],[0.495,0],[0.5,0],[0.505,0],[0.51,0],[0.515,0],[0.52,0],[0.525,0],[0.53,0],[0.535,0],[0.54,0],[0.545,0],[0.55,0],[0.555,0],[0.56,0],[0.565,0],[0.57,0],[0.575,0],[0.58,0],[0.585,0],[0.59,0],[0.595,0],[0.6,0],[0.605,0],[0.61,0],[0.615,0],[0.62,0],[0.625,0],[0.63,0],[0.635,0],[0.64,0],[0.645,0],[0.65,0],[0.655,0],[0.66,0],[0.665,0],[0.67,0],[0.675,0],[0.68,0],[0.685,0],[0.69,0],[0.695,0],[0.7,0],[0.705,0],[0.71,0],[0.715,0],[0.72,0],[0.725,0],[0.73,0],[0.735,0],[0.74,0],[0.745,0],[0.75,0],[0.755,0],[0.76,0],[0.765,0],[0.77,0],[0.775,0],[0.78,0],[0.785,0],[0.79,0],[0.795,0],[0.8,0],[0.805,0],[0.81,0],[0.815,0],[0.82,0],[0.825,0],[0.83,0],[0.835,0],[0.84,0],[0.845,0],[0.85,0],[0.855,0],[0.86,0],[0.865,0],[0.87,0],[0.875,0],[0.88,0],[0.885,0],[0.89,0],[0.895,0],[0.9,0],[0.905,0],[0.91,0],[0.915,0],[0.92,0],[0.925,0],[0.93,0],[0.935,0],[0.94,0],[0.945,0],[0.95,0],[0.955,0],[0.96,0],[0.965,0],[0.97,0],[0.975,0],[0.98,0],[0.985,0],[0.99,0],[0.995,0],[1,0]]
}

I would suggest a more descriptive error message (e.g. "Invalid Character: I").

niemasd changed the title from "Incorrectly saying one sequence has invalid length?" to "Invalid character results in wrong error message ("All sequences must have the same length")" on Mar 29, 2020

spond commented Mar 29, 2020

Dear @niemasd,

The list of acceptable characters is:

const char ValidChars[] = "ACGTURYSWKMBDHVN?-",

And they are indeed IUPAC-based. I agree that a more descriptive error message might be in order (to suggest that the user look for non-IUPAC letters), but the current assumption is that most FASTA files are gonna have some non-sequence characters (e.g. newlines, spaces, etc.).
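
To illustrate why an unexpected letter shows up as a length mismatch, here is a toy model in Python. It is not tn93's actual code: it simply assumes that any character outside ValidChars is skipped when counting, the same way a newline or space would be, and the function names are hypothetical. Under that assumption a single skipped I would account for the 29810 vs. 29811 difference reported above, and the second function shows one possible shape for a more descriptive message.

```python
# Same characters as the ValidChars string quoted above.
VALID_CHARS = set("ACGTURYSWKMBDHVN?-")


def counted_length(sequence):
    """Length as seen by a reader that silently skips non-alphabet characters."""
    # Case-insensitive matching is an assumption of this sketch.
    return sum(1 for ch in sequence if ch.upper() in VALID_CHARS)


def describe_mismatch(name, sequence, expected_length):
    """Return an error message naming skipped letters, or None if lengths agree."""
    actual = counted_length(sequence)
    if actual == expected_length:
        return None
    # Whitespace is expected in FASTA files, so only report other characters.
    skipped = sorted(
        {ch for ch in sequence
         if not ch.isspace() and ch.upper() not in VALID_CHARS}
    )
    message = f"Sequence '{name}' had length {actual}, expected {expected_length}"
    if skipped:
        message += "; skipped non-IUPAC characters: " + ", ".join(
            repr(ch) for ch in skipped
        )
    return message
```

Reporting the skipped characters only when the lengths disagree keeps the tolerant handling of whitespace while still pointing the user at the likely cause.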

Best,
Sergei

spond self-assigned this on Mar 29, 2020