ChangeLog
Version 6.6.0 15-Jul-2013
Restrict and related applications now ignore matches where the
enzyme site is wider than the sequence length.
The SRS server at EMBL-EBI no longer serves the EMBL database!
EBI's SRS server databases in server.srs have been updated to
reflect their reduced service.
Reading large sequences is more efficient. Reference counted
strings are used for output. Where gaps do not need to be
replaced, a single copy of the sequence string is used for input,
processing and output.
New sequence format iguspto supports a variant of the
intelligenetics format with tolerance for format variants on
input.
Calculation of isoelectric point has been updated to use the same
data values as Expasy and the Open Bio packages. New data file
Epkexpasy.dat holds the values used by Expasy.
The final position of the reverse strand is now correctly numbered
in the output of sixpack and showseq.
Eukaryote join features in union were not correctly copied after
subfeatures were implemented to hold exons. The union code now
correctly relocates subfeatures.
Complex (join) feature positions were not relocated when the
parent sequence was trimmed by start and end position. This was
introduced when subfeatures were implemented, and is now
corrected.
New option -methionine for transeq translates any start codon as
methionine when a specific range is given (including 1 to end) and
an alternative genetic code is specified.
Wildcard filenames were broken by the query language rewrite. The
previous functionality is restored. Any query can use a wildcard
filename with '*' or '?' characters. The order in which files are
processed is determined by the operating system.
Dbxreport and dbxstat now support databases with a dbalias
(alternative base name for the database files).
Restriction digest applications occasionally reported more than
one identical match where several enzymes recognize the same
target site. The testing of isoschizomers has been improved to
catch these cases. In practice most runs are with only a few named
enzymes with different sites.
Fragment lengths in restrict are now included as extra columns in
the output, giving the fragments to the 5' and 3' side of each cut
in the forward strand. Note that the output includes all possible
cut sites, though it may be impossible for a double digest to
physically cut at each of two closely spaced sites.
The -name option of restrict had no effect on report output and
has been removed.
Cachedbfetch corrects bad EDAM references to EDAM_syntax: instead
of EDAM_format: in the definitions returned by EBI's dbfetch
and wsdbfetch servers.
Sequence identifiers now remove characters that may confuse output
file generation, changing to underscore any forward or backslash
(interpreted as host system paths), commas, semicolons and colons.
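The identifier clean-up described above can be sketched as follows. This is a minimal illustration, not the EMBOSS source: the function name clean_identifier is hypothetical, and only the five characters named in the entry are replaced.

```python
# Characters the entry lists as troublesome in output file names:
# forward slash, backslash, comma, semicolon and colon.
_UNSAFE = str.maketrans({c: "_" for c in "/\\,;:"})


def clean_identifier(name: str) -> str:
    """Return name with each unsafe character replaced by an underscore."""
    return name.translate(_UNSAFE)


if __name__ == "__main__":
    print(clean_identifier("P12345:frag/1"))
```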
Sequence input now warns for bad sequence characters when the
format is known. When auto-detecting the format the warnings are
turned off so that failed formats can silently be ignored, but
when reading further sequences from the same input file warnings
are enabled. They can be disabled for individual format parsers by
passing zero as the format code to seqAppendWarn.
New nibble (nib) format stores sequence data in half-byte binary
compressed format. The format is available for input and output,
but as a binary format can only be read from a file, not from a pipe.
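A sketch of half-byte packing, to show why two bases fit in each byte. The code table and the choice of the high nibble for the first base are assumptions taken from the commonly documented UCSC .nib layout; the file header and soft-masking bit are left out, and the EMBOSS implementation may differ in detail.

```python
# Assumed 4-bit codes (UCSC-style); upper-case input only in this sketch.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
BASES = {value: base for base, value in CODES.items()}


def pack(seq: str) -> bytes:
    """Pack two bases per byte, first base in the high-order nibble."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        high = CODES[seq[i]]
        # An odd-length sequence leaves the last low nibble as padding.
        low = CODES[seq[i + 1]] if i + 1 < len(seq) else 0
        out.append((high << 4) | low)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Reverse pack(); length is needed to drop any padding nibble."""
    bases = []
    for byte in data:
        bases.append(BASES[byte >> 4])
        bases.append(BASES[byte & 0x0F])
    return "".join(bases[:length])
```

Because the sequence length cannot be recovered from the packed bytes alone, a real file has to store it separately, which is one reason such a format needs a seekable file rather than a simple text stream.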
New GDE format for sequence input and output - a simple format
with a #id prefix.
Support added for SwissProt OH (viral host) records.
New sequence input associated qualifier (available for all
sequence inputs) -squick reads only the id, accession, description
and sequence, saving unnecessary parsing of more complex input
formats such as swissprot, embl and genbank.
String parsing objects are now reused rather than deleted to save
memory reallocation in parsing input streams with a large number
of entries. Input source code now uses reusable token objects
cleared only when the program exits.
Acdpretty now correctly preserves in-line comments in ACD files.
Efficiency improvements in matching sets of characters in strings,
especially in functions used for each entry in a large set of
input sequences.
New applications xmlget and xmltext read XML data, for example
from dbfetch:embl which offers emblxml format. Output can be as
input or in reformatted versions.
QA tests of EMBASSY applications look in a test/data directory in
the EMBASSY package as an alternative place for data files with
the TESTDATA: prefix.
Clustal omega data types added to knowntypes.standard file.
Ranges can use a syntax of start+len or start,+len to give the
length rather than the end position. The end is calculated from
the start and length and used internally. This syntax allows a
closer fit to the command line of primer3_core in eprimer32 where
ranges in the native application are always specified as start and
length.
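The conversion from start and length to start and end might look like the sketch below. The function name is hypothetical, and the end is assumed to be inclusive with 1-based positions (so end = start + len - 1); the entry itself only says that the end is calculated from the start and length.

```python
import re

# Accepts "start+len" and "start,+len", with optional surrounding spaces.
_LEN_FORM = re.compile(r"^\s*(\d+)\s*,?\s*\+\s*(\d+)\s*$")


def range_from_length(text):
    """Convert 'start+len' or 'start,+len' to an inclusive (start, end) pair."""
    match = _LEN_FORM.match(text)
    if match is None:
        raise ValueError(f"not a start+len range: {text!r}")
    start, length = int(match.group(1)), int(match.group(2))
    if length < 1:
        raise ValueError("length must be at least 1")
    # Assumption: inclusive end, so a length of 1 gives end == start.
    return start, start + length - 1
```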
List file inputs now report an error if any text follows the first
token on a line, unless it is a comment following a '#'
character. Previous versions treated any remaining text as a
comment and silently ignored it.
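The stricter list file rule can be illustrated with a small parser. This is a sketch under the assumption that tokens are separated by white space and that '#' always starts a comment; the function name and the example query are hypothetical.

```python
def parse_list_line(line):
    """Return the single token on a list file line, or None if there is none.

    Text after the first token is an error unless it follows a '#'.
    """
    content = line.split("#", 1)[0]  # drop any trailing comment
    tokens = content.split()
    if not tokens:
        return None  # blank line or comment-only line
    if len(tokens) > 1:
        extra = " ".join(tokens[1:])
        raise ValueError(f"unexpected text after {tokens[0]!r}: {extra!r}")
    return tokens[0]
```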
New sequence format iguspto supports a multi-line IG format used
by the US Patent Office. The multi-line descriptions are preserved
only if EMBOSS reads and writes in this format. We can add the
capability to any other multi-line input format where the original
description lines should be preserved. Other formats treat
descriptions as a single record to be wrapped where there is a
maximum record length (e.g. in EMBL format).
Programs dreg and preg now only report sequences where a pattern
match was found, which is the same behaviour as fuzznuc, fuzzpro
and fuzztran.
New code added to handle xml datatype. Supports multiple named XML
formats, using the DOM parsers to interpret data. Multiple XML
input formats are supported, but on output, in the absence of a
conversion method, the original XML is normally reported as plain
"xml" format.
In database definitions, "example" is now a list attribute which
can appear multiple times, allowing multiple example queries to be
defined as separate records, with possible documentation following
a '!' delimiter.
Showserver now scales the column headers better for long cache
file names.
Showdb now displays the taxons, examples, and aliases defined for
a database. Examples and aliases can be preceded by a count of the
number of each. All columns are displayed with -full; individual
elements are controlled by -numtaxons, -taxscope (-taxonomy is a
database type option), -examples, -numexamples, -aliases and
-numaliases.
Showdb now displays a count of the number of fields in addition to
the list of field names. New command line qualifier -numfields
controls the display of the field count.
Showdb now displays all types defined for a database, separated by
commas, but will only display a database once so that, for
example, a protein and protfeatures database will appear in the
protein database set first (if displayed). If only the features
databases are displayed then it will appear with them.
Showdb no longer shows the access levels (id, query and all) by
default for a database. New command line qualifier -access or the
existing -full qualifier will show these values.
Entrez access was specific to sequence data retrieval. Entrez
server retrievals can now automatically detect ID and accession
fields and read text entries with textget where a text format is
available.
GenBank-related protein formats Refseqp and Genpept are updated to
process all record types. Genpept feature handling is updated to
correct the handling of multiple locations by using subfeatures.
GenBank and Refseq formats now handle the full set of record types
including common species names, reference details and comments.
Dbtell -full reports any alias names for a database after the
definition.
Dbtell recognizes alias names for a database, reporting the master
database definition and a comment describing where the alias is
defined.
Dbtell -server reports the database definition for a server. All
attributes are reported in the database definition, whether
defined for the database or at the server level.
Servertell -full now reports the definitions of all databases for
the server, including all aliases defined in the server definition
file. Without -full an extra comment line in the output suggests
running with -full for more detailed information.
Restrict output now sorts by the position closest to the start for
matches on the reverse strand (for an asymmetric target
site). This sort change can produce additional matches in the
output of restover.
Embossversion is now set to fail with a message if the update
information URL is unreachable.
HTTP and FTP error messages were simplified and blank lines removed.
The valgrind.pl script has a new qualifier -debug which runs the
test with -debug on the command line.
Needle, needleall and water now fail with a "die" message if there
is insufficient virtual memory to calculate the alignment between
two long sequences.
Indexing with the dbx applications miscalculated the secondary
page capacity when the secondary page size is less than the
primary page size.
Ranges in a file can use a dash as a delimiter for the start and
end positions in addition to white space.
For all data types, format names can be replaced by EDAM format
term identifiers, for example 1927 for "embl". The format terms
are defined in the source code. We will need to define aliases or
use more complex queries if a format splits into a hierarchy
but this is unlikely in most cases.
On FreeBSD systems embossversion source code has quotes corrected
on the line that reports FreeBSDLF is defined.
Version 6.5.0 15-Jul-2012
On Windows (mEMBOSS) the user home directory is checked for the
.embossrc file and .embossdata directory, using emboss.default for
settings defined for all users.
Database definitions with multiple types and formats now check
that there is at least one valid format defined for each type of
data.
The qatest.pl script handles references to the user's home
directory on Windows. "~/" is replaced with the user's home
directory, with the full path or filename enclosed in quotes.
The qatest.pl script has a new qualifier -debug which runs the
test with -debug on the command line. For ACD utility tests the
application name is taken from the first command line parameter
and will not match the debug file so these will give an error for
an unknown .dbg file. For all other tests this is a simple way to
obtain debug output for a problematic test result.
EMBOSS supports SOAP protocol access using the Apache Axis2/C
library. We use version 1.6.0 for testing. Installation can be
tricky on some systems. We are happy to help anyone who finds
problems. A copy of the library is included in the initial 6.5.0.0
mEMBOSS build.
Date parsing for EMBL, GenBank, SwissProt, Refseq and related
formats has been made more robust.
New application embossupdate checks for the availability of an
updated EMBOSS distribution or patches from the EMBOSS website and
FTP server. Embossupdate can be run at the end of a successful
installation or reinstallation. We hope this will help our
users to keep their versions up to date more easily.
Feature data can be read from PIR and GCG formatted databases.
EDAM is updated to release 1.1. EDAM is used to define EMBOSS and
EMBASSY applications, to describe EMBOSS defined databases and
entries in the DRCAT data resource catalogue. This is a prerelease
from the EDAM team to ensure EMBOSS has the most recent set of
terms.
Lists and tables now support very large numbers, requiring long
integers (datatype ajulong) to represent the return values from
ajListGetLength, ajTableGetLength and ajTableGetSize. Further
extensions are planned in future releases.
Directory inputs now interpret ~/ or ~user/ in the user response
in the same way as file inputs.
Application embossversion -full now reports the versions of all
libraries, and all configuration settings used to compile EMBOSS,
plus the sizes of standard data types.
Dbxfasta has a new format "idsv" which finds sequence version
values if the accession number has a .number suffix.
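A sketch of how a .number suffix could be split from an accession. The function name is hypothetical, and returning the version as an integer is an assumption; the indexer may store the full accession.version string instead.

```python
import re

# An accession followed by a dot and a purely numeric suffix.
_VERSIONED = re.compile(r"^(?P<acc>[^.\s]+)\.(?P<ver>\d+)$")


def split_sequence_version(token):
    """Return (accession, version); version is None without a .number suffix."""
    match = _VERSIONED.match(token)
    if match is None:
        return token, None
    return match.group("acc"), int(match.group("ver"))
```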
Dbxflat creates a sequence version for UniProt entries using the
accession number and the sequence version from the DT records.
Dbx indexing stores secondary reference file positions only if the
database has more than one data file per entry. The entries file
records the number of files in the database and can if needed
store more than one reference file. Identifiers indexes can store
more entries per page for databases with one file (embl, uniprot),
but support reference files for gcg, pir and taxonomy indexing.
Dbx indexing supports separate caches for primary and secondary
pages. Larger caches can reduce the number of physical reads and
writes at the cost of a small increase in CPU time. The organism
and description indexes for large databases can have terms that
appear in a very large number of entries (e.g. 'protein' in
UniProt or 'bacteria' in EMBL). Secondary cache sizes up to 100k
can be used to try to reduce the physical page rewrites needed as
these indexes grow.
Dbx indexing supports a smaller size for secondary index
pages. These hold the lists of entry ids for indexed strings, and
the file offsets for non-unique identifiers (e.g. secondary
accession numbers). The environment variable EMBOSS_SECPAGESIZE
defaults to 512, a quarter of the EMBOSS_PAGESIZE value of 2048.
Resource definitions can specify field-specific secondary page
sizes using, for example, accsecpagesize: "256".
Dbx indexing applications (dbxflat, dbxfasta, dbxgcg, dbxedam,
dbxobo, dbxresource, dbxtax) secondary index files (e.g. keyword,
taxonomy and description indexes) are more compact. The entry ids
for each keyword are stored as a simple list unless more than one
index page is needed. As most indexed tokens are in only a few
entries this saves many pages while the index is being built. The
compressed index size is also smaller.
Dbxflat, dbxfasta and dbxgcg now report index terms that exceed the
maximum length (attributes idlen, acclen, deslen, orglen, keylen,
svlen, gilen). Each term beyond the current maximum is
reported. When the run is completed, the longest term length for
each index field is reported so that excessively large values can
be reduced.
Dbxflat, dbxfasta and dbxgcg have improved memory efficiency on
large indexing runs. Many more internal data structures are reused
in the parsers.
Window length options are renamed to -window consistently across
all EMBOSS applications. The change applies to pepwindow and
pepwindowall.
Multiple inputs to einverted gave inconsistent results as two
internal variables were not reset for each new sequence.
Resource definitions for uniprot (swissresource) and embl
(emblresource) are updated to allow the maximum size for database
index keys. If the database contains longer values in future they
will be truncated and the maximum size found by the parser will be
reported by dbxflat.
New resource definitions chebiresource and sworesource are
provided in emboss.standard to index ontologies with
exceptionally large index keys.
Ontologies CHEBI, ECO, GO, PW, RO, SO are updated.
Ontology SWO is added. This is the software ontology, in its OBO
format. Some identifiers are really URLs.
Sequence and other databases with an organism ('org') or taxonomy
('tax') index can restrict retrieval to one or more indexed
organism names or any other indexed level in the
taxonomy. Examples include EMBL or UniProt whether indexed locally
with dbxflat or accessed through the EBI's SRS server as srs:embl
or srs:uniprot. A new database attribute 'organisms' can be used
to define one or more organisms or taxonomy levels to restrict
data retrieval from the master index of the complete file. A value
using EMBOSS query syntax of "rattus|mus" will allow data from
both genera to be retrieved. Values can also be separated by tabs,
commas ',' or semicolons ';'. As organisms can include spaces we
chose not to allow space as a delimiter. The organisms attribute
is implemented for method "emboss" and "srswww" to allow remote
retrieval. We can implement organisms for other access methods if
there is a demand from the user community.
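Splitting an organisms value on the delimiters listed above could be sketched as follows. The function name is hypothetical; space is deliberately not a delimiter, so multi-word organism names survive intact.

```python
import re

# Delimiters named in the entry: '|', tab, ',' and ';' (not space).
_DELIMITERS = re.compile(r"[|\t,;]")


def split_organisms(value):
    """Return the organism names in value, trimmed, skipping empty parts."""
    return [part.strip() for part in _DELIMITERS.split(value) if part.strip()]
```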
Ontology databases can combine more than one branch of an ontology
in a single file. Examples include the Gene Ontology (GO) with
namespaces for cellular_location, molecular_function and
biological_process and EDAM with data, format, identifier,
operation and topic. A new database attribute 'namespace' can be
used to define one or more namespaces to restrict data retrieval
from the master index of the complete file. This is tricky for
EDAM data which is in the data or identifier namespaces. A value
using EMBOSS query syntax of "data|identifier" or spaced with
"data identifier" will allow data from both namespaces to be
retrieved. The namespace attribute is implemented for method
"emboss" (how the ontologies are indexed in the distribution) and
"srswww" to allow remote retrieval. We can implement namespace for
other access methods if there is a demand from the user community.
EDAM release 1.0 is included. Major changes were needed to EMBOSS
internals as the identifiers are all changed (different term ID
number and different prefix). ACD files and the DRCAT data
resource catalogue are updated with the nearest equivalent terms
from EDAM 1.0.
Assembly data is now loaded a few records at a time using a new
"loader" object. This allows very large files to be processed in
chunks.
Variation data is now loaded a few records at a time using a new
"loader" object. This allows very large files to be processed in
chunks.
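The idea behind a loader that hands over a few records at a time can be shown with a generator. This is only an illustration of chunked loading in general; it does not reproduce the EMBOSS loader object, and the names are hypothetical.

```python
from itertools import islice


def load_in_chunks(records, chunk_size):
    """Yield lists of at most chunk_size records from any iterable.

    Only one chunk is held in memory at a time, so the input can be
    far larger than available memory.
    """
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```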
Support for BioPerl/Open-Bio OBDA flatfile indexes is included as
database access method 'obda'. The indexing in BioPerl 1.6 is
broken for EMBL as the semicolon is not removed from
identifiers. The secondary index files have duplicated
records. Both problems should be fixed in a future BioPerl
release. Note also that OBDA indexing parses only the primary
accession number so that other accessions are not retrievable from
OBDA index files.
EMBL entries with a single (source) feature could ignore the
feature.
Output files for fuzznuc, fuzzpro, fuzztran, dreg and preg
included the pattern name and the pattern string in the last
release. The output format is changed to remove the space between
the pattern name and string so that parsers see the expected number
of space-delimited fields in the output.
The query language parser has been rewritten to handle the new
-iquery and -ioffset qualifiers. Badly formed queries may now
produce different error messages.
Any input type that uses queries, with the exception of URL
inputs, can use two new associated qualifiers. -ioffset is the
initial non-zero offset when reading from a file or a
URL. -iquery is the query field which can be applied to an FTP or
HTTP URL or to any query in a list file. These names also apply to
sequence and feature input where other qualifiers begin with 's'
and 'f' respectively.
FTP and HTTP URLs can now be used directly as input queries for
all data types in place of file names. EMBOSS automatically
detects the ftp:// or http:// prefix and uses the appropriate
protocol. Any query or offset is ignored as there is no way to
distinguish these from a genuine part of the URL.
Patterns for fuzznuc, fuzzpro and fuzztran can include escaped
codes to skip the expansion of ambiguity codes and look for them
explicitly in the input. A backslash (shells may need two) before
the code specifies an exact match, for example \S will only match
S in the input.
Patterns for fuzznuc with ambiguity codes are now expanded to
include the ambiguity code (and any overlapped ambiguity
codes). For example, S matches [GCS] and B (not A) matches
[TGCBSYK].
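The expansion rule and the backslash escape can be sketched together. This assumes that a code matches every IUPAC code whose bases all lie within its own set, which reproduces the two examples given (S and B). Only plain letter patterns are handled here, not the full pattern syntax, and the function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes and the bases each stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(code):
    """Return every code whose bases all fall within those of code."""
    allowed = set(IUPAC[code])
    return {c for c, bases in IUPAC.items() if set(bases) <= allowed}


def pattern_to_regex(pattern):
    """Translate a plain letter pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "\\" and i + 1 < len(pattern):
            # Escaped code: match that exact letter, with no expansion.
            parts.append(re.escape(pattern[i + 1].upper()))
            i += 2
            continue
        code = char.upper()
        if code in IUPAC:
            parts.append("[" + "".join(sorted(expand(code))) + "]")
        else:
            parts.append(re.escape(char))
        i += 1
    return re.compile("".join(parts))
```

With this sketch the pattern AS matches AG, AC and AS in the input, while A\S matches only AS.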
A new AJAX source file ajtagval.c handles general tag-value pairs
of strings which have uses beyond feature internals.
Pepwheel can plot up to 5 sets of residues, with a total of
"steps" at each level. Leucine zipper plots with a step of 7 and 2
turns required more residues to be visible. The updated pepwheel
rescales the size of the inner wheel to allow more residues to be
displayed.
Sequence and assembly reading in BAM format always fails if no
match was found in the first pass - attempting to read again could
loop with the same result as the file is rewound. Rereading is
intended for text formats such as FASTA where the next entry may
match.
Header files in AJAX and NUCLEUS have been cleaned to remove
redundant references. A new include file ajlib.h includes the core
set of ajdefine, ajarch, ajmem, ajmess, ajfmt and ajstr which were
almost universally included. Applications are expected to use
emboss.h as their only include, but references to ajax.h and
emboss.h in the libraries are now all replaced with the minimally
required set of include files.
The server.entrez file has been updated using a script
serverentrez.pl which queries Eutils to obtain a list of database
names and fields. An internal array is used to define the
datatypes and formats for each database as these are defined only
in a series of HTML tables in other pages.
Reading from the NCBI Entrez server failed. The cause was trimming
newlines from a reference-counted string where the data returned
has CR-LF format but only one character was removed.
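The corrected trimming behaviour amounts to removing both characters of a CR-LF pair rather than just one. A minimal sketch, with a hypothetical function name:

```python
def trim_line_end(line):
    """Remove one line terminator: CR-LF, or a lone LF or CR."""
    if line.endswith("\r\n"):
        return line[:-2]  # both characters, not just the newline
    if line.endswith("\n") or line.endswith("\r"):
        return line[:-1]
    return line
```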
New xygraph output device support for datafile formats. "bedgraph"
outputs in BedGraph format. "wig" outputs in Wiggle format.
The "sequence" attribute is implemented for xygraph outputs. If
set true, the X-axis label defaults to the name of the first input
and the source name used in datafile outputs is also the name of
the first input.
Dottup and dotmatcher now have the first sequence on the X axis
and the second on the Y axis. This follows standards for datafile
output of graphical data which default to the X axis relating to
the first input sequence.
Dbx index files from earlier releases defaulted to "secondary"
indexes. The test for an index with no "Type" parameter defined
now picks up the standard Identifier indexed fields (id, acc, sv
and gi) correctly. The files were identified by field name, but
the test was using the file extension.
Fuzznuc, fuzzpro, fuzztran, dreg and preg when searching with a
regular expression found only the largest possible match at each
start position. A new function in recent releases of the PCRE
regular expression library supports searching for all matches
using function ajRegExecallC instead of ajRegExecC. These
applications can now find all overlapping matches to a pattern
using a regular expression.
The PCRE library is updated to include the pcre_dfa_exec
function. This is called by ajRegExecall and ajRegExecallC. The
regular expression can be compiled as usual. The new calls set an
internal value to the number of matches found, retrievable by
ajRegGetMatches. Offsets (ajRegOffsetI) and substrings (ajRegSubI)
return these matches, starting at zero which is the longest match
(the same as in ajRegExec). Any shorter matches with the same
start are stored in place of bracketed substrings.
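To illustrate what "all matches with the same start" means, the sketch below finds them by brute force, testing every possible end position. It is not how pcre_dfa_exec works internally and does not use the AJAX functions; it only reproduces the ordering described, with the longest match first.

```python
import re


def all_matches_at(pattern, text, start):
    """Return every match of pattern that begins at start, longest first."""
    regex = re.compile(pattern)
    found = []
    # Try each end position from the end of the text back to start.
    for end in range(len(text), start - 1, -1):
        if regex.fullmatch(text, start, end):
            found.append(text[start:end])
    return found


if __name__ == "__main__":
    print(all_matches_at("A+C*", "AACC", 0))
```

An ordinary regular expression search would report only the first of these, which is why overlapping shorter matches were previously missed.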
Prettyplot options are changed to remove dependencies on other
options. Option -plurality (which depended on the sequence
alignment weight or the number of input sequences) is now -ratio
with a default of 0.5. This is exactly equivalent to the default
-plurality value or half the total weight. Option -resbreak is
replaced by -blocksperline with a default value of 1. This has the
same default output as the -resbreak option which defaulted to the
-residuesperline value.
All header files now have an @include comment block which includes
the LGPL licence and RCS tags. Header files are commented in
consistent sections. The C++ compile extern wrapper for C
declarations is now a macro to avoid indentation issues in emacs
and other editors.
All obsolete functions are moved to the end of source files and
wrapped in an #ifdef AJ__COMPILE_DEPRECATED block. The configure
option --enable-buildalldeprecated includes these functions in
compilation. Functions described in the 6.2.0 books are included
in a similar AJ__COMPILE_DEPRECATED_BOOK block and built with the
--enable-buildbookdeprecated configure option.
Diffseq produced incorrect results when reporting an insertion in
the second sequence. The error was introduced in release 6.0.0. It
is fixed by defining a "between" location for the insert site in
the first sequence, and by adding support for "between" features
to diffseq and other report formats. A new constructor
ajFeatNewBetween with one position makes creating such features
easier.
New function ajListDrop removes a node from a list by searching
for its address.
Test data includes a new EMBL data file syn.dat containing a
circular sequence.
GFF3 input combines features with the same ID under a generated
parent so that features can be linked as subfeatures and sorted
together. These features are identified by the Flags attribute and
excluded from GFF3 output.
GFF3 output is required to use different feature types for
parent and child. This is broken by the annotated parent feature
we need to represent EMBL/GenBank/DDBJ joins. For these, the
parent has a new type of biological_region with a new featflag
type=CDS (for example) so we can restore the correct internal
representation when reading the GFF3 file.
A new sequence associated qualifier -scircular defines a sequence
input as a circular molecule where this is not defined in the
input format, for example EMBL/GenBank and GFF3 have the
information but FASTA input does not. For feature input there is a
new -fcircular qualifier. Any circular definition in a sequence
format overrides this qualifier. Sequences with features are set
circular if the feature table input is defined as circular.
GFF3 format has been corrected using the online GFF3
validator. Protein feature type names are corrected to use the
current SO term name. Tags are converted to lower case on output
and back to standard case on input, for example /EC_number in EMBL
format, as GFF tags must start in lower case.
In GFF3 protein features now always use '.' for the
strand. Previous releases could also write '+'. Both are
acceptable as input.
GFF3 and GFF2 scores now use a general floating point format to
write 4 significant figures (rather than 3 decimal places) to cope
with very large and very small score values. Trailing zeroes
after the decimal point are omitted in this format. A score of
zero is written as a dot (missing value).
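The score formatting corresponds to the general ("g") floating point conversion with 4 significant figures. A sketch in Python, whose "g" format follows the same rules as C printf; the function name is hypothetical.

```python
def format_score(score):
    """Format a GFF score: 4 significant figures, or '.' for zero."""
    if score == 0:
        return "."  # zero is written as a missing value
    # The general format drops trailing zeroes and switches to exponent
    # notation for very large or very small values.
    return f"{score:.4g}"
```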
Sequence queries can use two alternative syntaxes for sequence
ranges. Appending :start:end allows a syntax similar to DAS
queries. Appending :start..end allows a syntax similar to
EMBL/GenBank locations in other entries. Both can be followed by
:r to reverse the sequence region.
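A sketch of how the two range suffixes could be recognised at the end of a query. The regular expression, the function name and the example database and entry names are assumptions for illustration; the real query parser handles many more cases.

```python
import re

# query, then :start:end or :start..end, then an optional :r
_RANGE_SUFFIX = re.compile(
    r"^(?P<query>.+?):(?P<start>\d+)(?::|\.\.)(?P<end>\d+)(?P<rev>:r)?$"
)


def split_query_range(query):
    """Return (query, start, end, reverse); start and end are None if absent."""
    match = _RANGE_SUFFIX.match(query)
    if match is None:
        return query, None, None, False
    return (
        match.group("query"),
        int(match.group("start")),
        int(match.group("end")),
        match.group("rev") is not None,
    )
```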
Sequences and reference sequences can be read from EMBL CON
division entries by using the same database with an ACC (accession
number) index to read the sequence fragments defined in the CO
record(s).
New code added to handle reference sequences in ajrefseq* source
files. The AjPRefseq object will hold large reference sequence data
in managed memory buffers.
Database definitions can use a new attribute "special" to give a
name=value definition for any attribute specific to one access
method. The first instances are SpeciesIdentifier for
ensemblgenomes databases, and tags for processing assembled entries
in CON (constructed) entries in EMBL. ConDatabase is the database
name used, ConField is the index field. By default CON entries use
the ACC field of the same database.
Standardized all licensing references in the libraries to GNU Lesser
GPL version 2.1. Added CVS keywords to record the CVS file
version, and the date and user of the latest commit.
Microbial genomes in ensemblgenomes have an enumerated species
code which must be included in a data retrieval request. The
codes are temporarily added to the comment attribute of the
databases in the server cache file. This will be replaced by a
more complete solution in the next release.
The DRCAT.dat file has a new set of lines to handle Nucleic Acids
Research classifications. A new NARCat line code is now separately
parsed by dbxresource into the NAR category name and the URI.
Long tag values in GFF3 format could exceed limits in the regular
expression. This is fixed by first testing for and replacing
escaped quotes and then using a simpler expression to extract
quoted string values.
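The two-step approach can be sketched as follows, in Python for illustration. The function name extract_quoted and the placeholder character are assumptions, and the backslash-escaped quote is assumed to be the escape being handled; the actual expressions used in the library are not shown in this entry.

```python
import re

_PLACEHOLDER = "\x01"  # assumed not to occur in real tag values


def extract_quoted(text):
    """Extract quoted strings, allowing backslash-escaped quotes inside."""
    # Step 1: hide escaped quotes so they cannot end a quoted string.
    hidden = text.replace('\\"', _PLACEHOLDER)
    # Step 2: a simple expression is now enough, with no alternation
    # that could hit limits on very long values.
    values = re.findall(r'"([^"]*)"', hidden)
    # Step 3: restore the quotes that were hidden in step 1.
    return [value.replace(_PLACEHOLDER, '"') for value in values]


print(extract_quoted('note="a \\"b\\" c";x="d"'))
```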
When reading ranges from a file the strings were overwritten by
the parser.
Application tcode results disagreed with the original
publication. The calculation parameters have been corrected.
EDAM.obo is updated. 28 terms were added. Descriptions were
updated and names changed.
Short descriptions of EMBOSS and EMBASSY applications have been
updated to use consistent terminology and grammar rules.
Dbxflat failed to parse the organism ('org') field of a GenBank
entry when another secondary field (keyword or description) was
also parsed in the same run.
Dbxflat and dbiflat now use a separate parser for SwissProt format
data files. Previous releases used the EMBL parser which failed to
identify the first word in the specially formatted SwissProt
description records. The change only affects the 'des' index
field.
Reading ABI format failed to read the sample name field and
machine name. The sample name is now correctly parsed. The sample
name is used by EMBOSS as the sequence identifier.
Formats specified on the command line were ignored by database
queries. This behaviour was correct in previous releases where
only one format was permitted, but honouring the format is required
from 6.4.0 where a database may have multiple possible formats. Any
format defined elsewhere on the command line is now used if there is
no format in the query string.
ACD files are stricter in checking ambiguous qualifiers. Options
that are also a short form of another qualifier now generate
warnings. These can be turned off with the application attribute
wrapper: "Y" where a third party command line is wrapped.
Showfeat had an option -type which was ambiguous. Changed the
options so those with a match option (-typematch) have a show
equivalent -typeshow to display the column.
Emma had options -dend and -slow which were short forms of other
qualifiers. They are renamed -dendreuse and -slowalign. The old
qualifier name will now give an "ambiguous qualifier" error
message and report the new name.
Eprimer3 and eprimer32 had options -otm and -osize which were
short forms of other qualifiers, and could cause confusion
between optimum and oligo values. They are renamed -opttm and
-optsize.
Helixturnhelix had an advanced option -sd which was a short form
of sequence qualifier -sdbname. It is renamed to -sdvalue.
Prettyplot had an option -box which was a short form of other
qualifiers. It has been renamed -doboxes to match the related
qualifier -docolour.
Showserver had an option -server which was a short form of
-serverversion (itself named to avoid a clash with -version). This
option is now renamed -servername.
Supermatcher and wordfinder had an option -errorfile which was a
longer form of the standard qualifier -error which can suppress
the reporting of error messages. The -errorfile qualifiers are
renamed -errfile.
Revseq added 'Reversed:' to the sequence description. For use
cases where the original sequence description is preferred
(e.g. FASTQ format formatted descriptions) a new -notag option
retains the original description.
Cirdna prints text inside solid blocks invisibly. When printed
outside, the text scaling was too small. The text scale is now
adjusted for the radius and sequence length so that labels should
be readable outside the box.
Fuzznuc, fuzzpro and fuzztran using a pattern file ignored the
command line -mismatch qualifier for the first pattern. The
default mismatch is now set to this value at the start of the
pattern matching loop in the library.
qatest.pl which runs the QA tests now checks for a qatest.dat file
in the EMBOSS source directory and additional qatest.dat files in
the test subdirectory for all EMBASSY packages found under the
source embassy/ directory. By providing individual qatest.dat
files for each package we can simplify testing for a core
distribution. Some of the older EMBASSY packages derived from
domainatrix have cross-dependencies where one test uses the output
of an application from another package. New AX and AY lines define
foreign tests which are executed even where a single EMBASSY
package has been specified with the -embassy=package qualifier on
the command line.
Version 6.4.0 15-Jul-2011
DBXFLAT can index FASTQ format short read sequence files, allowing
individual sequences to be rapidly retrieved by name.
GenPept format has changed since we last tested it. The LOCUS line
is simpler. EMBOSS now supports GenPept as documented and
distributed by NCBI.
Sequence in SAM format ignores the reference sequence
name. Previous releases saved it as the accession number, but this
is inappropriate as it is then reported as the identifier in EMBL
format.
The -help output (and documentation) for align and report output
types now includes the default format if defined in the ACD file.
New code added to handle variation data in ajvar* source
files. The AjPVar object will hold genetic variation data from the
Ensembl API and from VCF input files.
New access methods for URLs have been added as ajurlread.c and for
URL output methods as ajurlwrite.c - supporting collecting and
reporting of URLs as output. URLs are saved as an array of strings,
intended to be reported as a set of links to the underlying data.
Sequence format "raw" now only reads binary files, which means it
cannot be used for piped data. The change was needed to avoid
accepting binary data where a file has a NULL and then no newline,
for example ABI data files where the initial 'ABIF' could be read
as a valid sequence.
Application tcode failed to plot results for more than one
sequence. It also reported a plplot error when reading random
non-coding input. It also failed to report the threshold lines
when they were outside the range of observed scores.
Four new functions combine tables where the keys and values are of
the same types. In each case the tables are resized to the larger
of the hash array sizes, and then at each hash array position all
keys in both tables are compared. The functions differ only in the
actions taken when a match is or is not found. ajTableMergeAnd
keeps all keys that are in both tables. ajTableMergeEor is the
inverse keeping only keys that are in only one table.
ajTableMergeNot removes keys that are also in the second
table. ajTableMergeOr adds keys from the second table that do not
match. All remaining keys and values are deleted using the tables'
built-in destructor functions.
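The four merge behaviours can be illustrated with Python dictionaries standing in for tables. The function names below are hypothetical analogues, not the AJAX signatures. Deletion of dropped keys and values is left to Python's garbage collector, and keeping the first table's value when a key is in both tables is an assumption.

```python
def merge_and(first, second):
    """Keep only keys that are in both tables."""
    for key in list(first):
        if key not in second:
            del first[key]


def merge_eor(first, second):
    """Keep only keys that are in exactly one of the tables."""
    for key, value in second.items():
        if key in first:
            del first[key]
        else:
            first[key] = value


def merge_not(first, second):
    """Remove keys that are also in the second table."""
    for key in list(first):
        if key in second:
            del first[key]


def merge_or(first, second):
    """Add keys from the second table that are not already present."""
    for key, value in second.items():
        first.setdefault(key, value)


table = {"a": 1, "b": 2}
merge_eor(table, {"b": 20, "c": 30})
print(table)
```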
Some data resource catalogue applications failed when run with the
-debug option. Their debug calls have been updated.
New application dbtell reports the attributes for a database.
All messages written to the user are also logged to the debug file
to help locate where they are generated when debugging.
Applications showfeat, extractfeat and coderet are updated to
follow the new features/subfeatures data structures.
When using a simple numeric database identifier, the SV field is
only searched if it is defined.
Access to local SRS databases created an invalid command line for
getz with a stray '+' character needed only in the web version.
Nexus format input can now handle a missing taxlabels block by
using the matrix block to read sequence names.
GFF3 tag names are automatically converted to lower case unless
they match a known GFF3 "special" tag name.
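A short sketch of this rule in Python. The set lists the reserved attribute names from the GFF3 specification; the function name normalise_tag is hypothetical, and lower-casing the whole name (rather than only the first letter) is an assumption about how the conversion is applied.

```python
# Reserved attribute names defined by the GFF3 specification.
GFF3_SPECIAL_TAGS = frozenset({
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
})


def normalise_tag(name):
    """Lower-case a tag name unless it is a GFF3 special tag."""
    return name if name in GFF3_SPECIAL_TAGS else name.lower()


print(normalise_tag("Parent"), normalise_tag("EC_number"))
```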
GFF3 format has been rewritten to comply strictly with the GFF3
standard on the sequence ontology website. Characters are now
escaped in tag values. The 'featflag' tag has been changed to
convert the hex value into a readable list of flags, with some
flags now inferred from the content of the GFF line. The GFF3
special tags (all starting with an upper-case letter) are now
stored separately. The ID and Parent tags are used in
post-processing to build subfeatures which are stored under the
feature with an ID matching their first Parent tag.
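The ID/Parent post-processing step can be sketched as below, using plain Python dictionaries in place of feature objects. The function name build_subfeatures, the dictionary layout and the "subfeatures" key are hypothetical, and leaving a feature at the top level when its first Parent is unknown is an assumption.

```python
def build_subfeatures(features):
    """Nest each feature under the feature matching its first Parent."""
    by_id = {f["ID"]: f for f in features if "ID" in f}
    for feature in features:
        feature.setdefault("subfeatures", [])

    top_level = []
    for feature in features:
        parents = feature.get("Parent", [])
        # Only the first Parent tag decides where the feature is stored.
        parent = by_id.get(parents[0]) if parents else None
        if parent is None or parent is feature:
            top_level.append(feature)
        else:
            parent["subfeatures"].append(feature)
    return top_level


features = [
    {"type": "mRNA", "ID": "m1"},
    {"type": "exon", "Parent": ["m1", "m2"]},
    {"type": "exon", "Parent": ["m1"]},
]
for feature in build_subfeatures(features):
    print(feature["type"], len(feature["subfeatures"]))
```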
GFF3 input requires the optional EMBOSS type comment to identify a
protein GFF3 file as there is currently no safe way to distinguish
protein from nucleotide features using only the standard GFF3
format.
GFF3 sequence format input failed to read files with additional
## comment records after the header block. These comments are now
ignored.
Feature objects have been extended. A feature may now include a
list of subfeatures. This is intended to allow exons to be stored
under the feature to which they belong. With this new structure,
sorting feature tables becomes easy as there is no need to match
group tags and sort by ID. Features simply sort by their main
(parent) feature, with the other subfeatures (exons) unseen by the
sort algorithm.
Application restrict crashed when the enzyme list was empty. It
reported invalid enzyme names, but not 'no enzyme name given'.
Reference-counted lists are enabled with the constructor
ajListNewRef creating a reference-counted copy. Lists are only
deleted when the reference count falls to zero.
Reference-counted tables are enabled with the constructor
ajTableNewRef creating a reference-counted copy. Tables are only
deleted when the reference count falls to zero.
Table code has been rewritten to automatically delete keys unless
the table is created with a Const version of the constructor. All
table constructors are renamed, with the older names retained as
"deprecated" functions which do not delete keys or values. All
EMBOSS code has been changed to use the new function names.
New functions ajTableMatch, ajTableMatchC and ajTableMatchS test
whether a key is present in a table. They can be used where
ajTableFetch is inadequate because the value may be NULL. Some code used
ajTableFetchKey but this is intended only for case-insensitive keys.
Tables (AjPTable) have defined functions to hash and compare
keys. Two new functions can be defined to delete keys and
values. By default these are NULL and no keys or values are
deleted. The functions can be ajMemFree to simply free memory, or
more complex object destructors. As these require a void** argument
(all keys and values are void* internally) wrappers are needed
around object destructors. We recommend appending 'Void' to the
standard destructor name and casting the void** argument to pass
to the object-specific destructor.
Tables (AjPTable) can be resized using the ajTableResizeLen
function. When adding to a table with ajTablePut the table is
automatically resized when the number of entries exceeds an
average of 8 per bucket.
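A minimal sketch of the automatic resize rule, written in Python with a simple chained hash table. The class BucketTable and its methods are hypothetical, and doubling the bucket count on resize is an assumption; only the threshold of an average of 8 entries per bucket comes from the entry above.

```python
class BucketTable:
    MAX_AVERAGE = 8  # average entries per bucket before resizing

    def __init__(self, nbuckets=4):
        self.buckets = [[] for _ in range(nbuckets)]
        self.count = 0

    def resize(self, nbuckets):
        """Rehash every entry into a new array of buckets."""
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(nbuckets)]
        for key, value in old:
            self.buckets[hash(key) % nbuckets].append((key, value))

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)  # replace, count unchanged
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > self.MAX_AVERAGE * len(self.buckets):
            self.resize(2 * len(self.buckets))  # growth factor assumed

    def get(self, key, default=None):
        for existing, value in self.buckets[hash(key) % len(self.buckets)]:
            if existing == key:
                return value
        return default


table = BucketTable(4)
for i in range(33):
    table.put(i, i * i)
print(len(table.buckets), table.get(17))
```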
Function ajMemFree now accepts a void** argument and sets the
pointer to zero after freeing the memory. All EMBOSS code calls this
through the AJFREE macro which is now safer to use as the pointer
appears only once in the generated code.
Application digest conflicted with the name of a utility on some
systems. It has been renamed to pepdigest.
In the emboss.standard and emboss.default files certain attributes
can appear more than once if defined as type "ATTR_LIST" in the
ajnam.c source file. These include a new attribute 'field:' defined
once for each database query field, superseding the 'fields:'
list of field names. The 'field:' attribute has a list of field
names, with the first being the name preferred by EMBOSS and
others acceptable on the command line. A '!' delimiter marks the
end of the field names and the start of a free text description.
This style of description is also allowed for other attributes,
including 'taxon:' and the 'edam*:' attributes. The syntax is
taken from the metadata in OBO format.
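The 'field:' value layout can be sketched with a small Python parser. The function name parse_field_attribute and the example value are hypothetical, and the separator between field names (white space or commas) is an assumption; only the order of the names and the '!' delimiter come from the description above.

```python
def parse_field_attribute(value):
    """Split a 'field:' value into names and a free text description."""
    names_part, _, description = value.partition("!")
    names = names_part.replace(",", " ").split()
    if not names:
        raise ValueError("a field attribute needs at least one name")
    return {
        "preferred": names[0],      # the name preferred by EMBOSS
        "aliases": names[1:],       # also acceptable on the command line
        "description": description.strip(),
    }


print(parse_field_attribute("id identifier ! Entry identifier"))
```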
Data retrieval using the HTTP protocol now checks for redirects in
the header and replaces the file buffer with the results from the
new URL. This allows EMBOSS to read outdated URLs for database
access.
New trace functions ajTableFetchTrace and ajTablePutTrace help to
debug adding new keys to a table.
New parsing function ajStrTokenNextParseDelimiters returns the
delimiter string in addition to the token parsed from a string
token handler.
Application einverted could report a bad alignment if the matched
region reached the end of the search window. Matches which go
beyond the search window are now ignored. This bug was reported
with a very low threshold score and was unlikely to be noticed
with the default settings.
Sequence format treecon failed if the only line of input started
with a number. Failure to find a second record now simply returns
false.
Tables can now use integer keys and values of four types - integer
and long, signed and unsigned. The unsigned longs are used
internally for emblcd index reading and for b+tree index creation.
Report output from pattern matching applications (fuzznuc,
fuzzpro, fuzztran, dreg, preg) now includes the pattern as well
as the pattern name in the '*pat' or 'Pattern_name' feature tag
value.
New applications search the EDAM ontology by each of its query
fields, with common options to restrict the results to one of the
7 EDAM namespaces. Also new applications to look for EDAM terms with
each of the 5 common relationships for EDAM data terms:
has_input, has_output, is_identifier_of, is_format_of and
is_source_of. The sixth relationship has_attribute is only used by
the obsolete 'entity' namespace terms.
New application dbxresource indexes the data resource catalogue
DRCAT.dat which is distributed with EMBOSS. Most fields in DRCAT
are indexed. The EDAM and Taxon fields are used by other
applications to search the EDAM and TAXON databases for terms which
are in turn used to select DRCAT entries by taxon, data type,
format, identifier and resource.
Any menu (list and selection ACD types) which allows all options
to be selected now accepts "*" to select everything. This can be
the default (e.g. for database index fields) or can be specified
by the user with quotes to protect it from interpretation by the
Unix shell.
Tokens indexed with the dbx* programs now have white space indexed
as underscores. Any index files with spaces in the tokens need to
be re-indexed. This applies to keyword and organism indexes.
New code added to handle short read assemblies in ajassem* source
files. The AjPAssem object will hold large numbers of short reads
in managed memory buffers.
New templates for adding data types with specific formats for input
and output and data access methods. These templates are stored in
ajwxyz* source files with a script newdatatypes.pl to
automatically create new, properly named, stub functions in the
AJAX core and ajaxdb libraries.
Program nthseq now simply reports an error (not a fatal error) if
too few sequences were read.
Feature input and output was in one large file. This has now been
refactored with ajfeatdata.h for the data structures, ajfeatread.c
for input formats, ajfeatwrite.c for output formats and remaining
feature object handling code in ajfeat.c.
New access methods for text have been added as ajtextread.c and
for text output methods as ajtextwrite.c - supporting text and
(preserved) HTML and XML output. Text is saved as an array of
strings, intended to be used as one per input record although
storing the entire text in the first string is also possible.
Data queries have been made general. A new AjPQuery object handles
queries for any datatype, storing a list of field names and
queries, plus an operator (OR, AND, NOT, EOR, ELSE) for combining
fields. Previous releases had a hard-coded search for "id or
accession" which now uses the new query structure. Extensions to
the query language will allow more complex combinations, and will
allow any field to be defined for an external data resource
(e.g. fields for an SRSWWW server).
All data reading access methods have been restructured. Methods
that essentially return an open file with the pointer set to the
start of an entry (which covers most of the original access
methods) are moved to a new source file ajtextdb.c and use a new
AjPTextin input object which is included within AjPSeqin for
sequence input and AjPOboin for OBO term input. These functions
are generalized for any input data in some text-based file
format. Sequence access will first check for a text-based access
method, and then for a sequence-specific method (e.g. ensembl).
Other input datatypes can do the same. The code for OBO ontology
terms will use the new text access methods. Code for access to
other input data types (feature, alignment) will now be relatively
easy to add. Text retrieval of data from a new list of data
resources can also use these access methods.
Program einverted required at least one base between the halves of
an inverted repeat. Blunt joins are now reported where previous
versions reported a 2 base gap.
Error messages from database indexing now include the filename of
the index file. This is useful when identifying the indexing
operation where the problem occurred.
EMBOSS database index files are extended to mark numeric and
string index pages. In previous releases all were marked as
strings. Older index files remain valid for sequence retrieval,
but not for the new dbxreport index analysis application.