I am trying to cluster amino acid sequences (up to ~100,000, although the problem already occurs with only about 10,000 sequences) with a low sequence identity threshold (ideally 0.25-0.3). However, mmseqs easy-cluster fails with a segmentation fault at the prefilter step for any sequence identity threshold below 0.5. The problem occurs even when I allocate 600 GB of memory on our HPC.
Here are all of the arguments (I left everything but min-seq-id as default).
MMseqs Version: 16.747c6
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 4
k-mer length 0
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max sequence length 2000
Max results per query 20
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.8
Coverage mode 0
Compositional bias 1
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Include identical seq. id. false
Spaced k-mers 1
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads 128
Compressed 0
Verbosity 3
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.3
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max reject 2147483647
Max accept 2147483647
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Cluster mode 0
Max connected component depth 1000
Similarity type 2
Weight file name
Cluster Weight threshold 0.9
Single step clustering false
Cascaded clustering steps 3
Cluster reassign false
Remove temporary files true
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false
Database type 0
Shuffle input database true
Createdb mode 1
Write lookup file 0
Offset of numeric ids 0
MMseqs Output (for bugs)
createdb ../data/truncated_selenoprotein.fa ../data/tmp/6450526811545454666/input --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
Multiline fasta can not be combined with --createdb-mode 0
We recompute with --createdb-mode 1
Time for merging to input_h: 0h 0m 0s 1ms
Time for merging to input: 0h 0m 0s 1ms
[11010] 0s 25ms
Time for merging to input_h: 0h 0m 0s 0ms
Time for merging to input: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 29ms
Create directory ../data/tmp/6450526811545454666/clu_tmp
cluster ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu ../data/tmp/6450526811545454666/clu_tmp --max-seq-len 2000 --max-seqs 20 -c 0.8 --spaced-kmer-mode 1 --alignment-mode 3 -e 0.001 --min-seq-id 0.3 --remove-tmp-files 1
Set cluster sensitivity to -s 5.000000
Set cluster mode SET COVER
Set cluster iterations to 3
linclust ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/clu_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 128 --compressed 0 -v 3 --cluster-weight-threshold 0.9 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.3 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 2000 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 1 --force-reuse 0
kmermatcher ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.3 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 2000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 128 --compressed 0 -v 3 --cluster-weight-threshold 0.9
kmermatcher ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.3 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 2000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 128 --compressed 0 -v 3 --cluster-weight-threshold 0.9
Database size: 11025 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)
Generate k-mers list for 1 split
[=================================================================] 100.00% 11.02K 0s 63ms
Sort kmer 0h 0m 0s 89ms
Sort by rep. sequence 0h 0m 0s 86ms
Time for fill: 0h 0m 0s 4ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 327ms
rescorediagonal ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 128 --compressed 0 -v 3
[=================================================================] 100.00% 11.02K 0s 73ms
Time for merging to pref_rescore1: 0h 0m 0s 45ms=================>] 99.83% 11.01K eta 0s
Time for processing: 0h 0m 0s 242ms
clust ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_rescore1 ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 128 --compressed 0 -v 3 --cluster-weight-threshold 0.9
Clustering mode: Set Cover
[=================================================================] 100.00% 11.02K 0s 23ms
Sort entries
Find missing connections
Found 25579 new connections.
Reconstruct initial order
[=================================================================] 100.00% 11.02K 0s 17ms
Add missing connections
[=================================================================] 100.00% 11.02K 0s 1ms
Time for read in: 0h 0m 0s 149ms
Total time: 0h 0m 0s 203ms
Size of the sequence database: 11025
Size of the alignment database: 11025
Number of clusters: 4709
Writing results 0h 0m 0s 1ms
Time for merging to pre_clust: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 225ms
createsubdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/order_redundancy ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy -v 3 --subdb-mode 1
Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 4ms
createsubdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/order_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_filter1 -v 3 --subdb-mode 1
Time for merging to pref_filter1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 4ms
filterdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_filter1 ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_filter2 --filter-file ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/order_redundancy --threads 128 --compressed 0 -v 3
Filtering using file(s)
[=================================================================] 100.00% 4.71K 0s 49ms
Time for merging to pref_filter2: 0h 0m 0s 32ms=================> ] 98.39% 4.63K eta 0s
Time for processing: 0h 0m 0s 208ms
rescorediagonal ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_filter2 ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_rescore2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.3 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 128 --compressed 0 -v 3
[=================================================================] 100.00% 4.71K 0s 46ms
Time for merging to pref_rescore2: 0h 0m 0s 32ms=============> ] 93.56% 4.41K eta 0s
Time for processing: 0h 0m 0s 244ms
align ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_rescore2 ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.3 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 2000 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 128 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 4709 type: Aminoacid
Target database size: 4709 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 4.71K 0s 106ms
Time for merging to aln: 0h 0m 0s 33ms
6719 alignments calculated
6523 sequence pairs passed the thresholds (0.970829 of overall calculated)
1.385220 hits per query sequence
Time for processing: 0h 0m 0s 276ms
clust ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/aln ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 128 --compressed 0 -v 3 --cluster-weight-threshold 0.9
Clustering mode: Set Cover
[=================================================================] 100.00% 4.71K 0s 15ms
Sort entries
Find missing connections
Found 1814 new connections.
Reconstruct initial order
[=================================================================] 100.00% 4.71K 0s 8ms
Add missing connections
[=================================================================] 100.00% 4.71K 0s 0ms
Time for read in: 0h 0m 0s 128ms
Total time: 0h 0m 0s 176ms
Size of the sequence database: 4709
Size of the alignment database: 4709
Number of clusters: 3897
Writing results 0h 0m 0s 1ms
Time for merging to clust: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 194ms
mergeclusters ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/clu_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pre_clust ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/clust --threads 128 --compressed 0 -v 3
Clustering step 1
[=================================================================] 100.00% 4.71K 0s 14ms
Clustering step 2
[=================================================================] 100.00% 3.90K 0s 50ms
Write merged clustering
[=================================================================] 100.00% 11.02K 0s 127ms
Time for merging to clu_redundancy: 0h 0m 0s 34ms
Time for processing: 0h 0m 0s 202ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_filter1 -v 3
Time for processing: 0h 0m 0s 1ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref -v 3
Time for processing: 0h 0m 0s 0ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_rescore1 -v 3
Time for processing: 0h 0m 0s 24ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pre_clust -v 3
Time for processing: 0h 0m 0s 0ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy -v 3
Time for processing: 0h 0m 0s 1ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/input_step_redundancy_h -v 3
Time for processing: 0h 0m 0s 1ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_filter2 -v 3
Time for processing: 0h 0m 0s 19ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/pref_rescore2 -v 3
Time for processing: 0h 0m 0s 21ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/aln -v 3
Time for processing: 0h 0m 0s 21ms
rmdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/linclust/2847535345120249051/clust -v 3
Time for processing: 0h 0m 0s 1ms
createsubdb ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/clu_redundancy ../data/tmp/6450526811545454666/input ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/input_step_redundancy -v 3 --subdb-mode 1
Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 8ms
prefilter ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/input_step_redundancy ../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/pref_step0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 1 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 2000 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3
Query database size: 3897 type: Aminoacid
Estimated memory consumption: 1003M
Target database size: 3897 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 3.90K 0s 23ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 3.90K 0s 18ms
Index statistics
Entries: 91288
DB size: 488 MB
Avg k-mer size: 0.001426
Top 10 k-mers
GTNDAR 84
CFPDFV 63
SACNNY 59
VLFPFT 48
TNVHRL 40
TCYGCQ 39
GGMQKT 34
HPNGCP 30
CTENFQ 30
GSNDNR 28
Time for index table init: 0h 0m 0s 407ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 154
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 3897
Target db start 1 to 3897
../data/tmp/6450526811545454666/clu_tmp/3490336684826145900/cascaded_clustering.sh: line 102: 3296002 Segmentation fault (core dumped) $RUNNER "$MMSEQS" prefilter "$INPUT" "$INPUT" "${TMP_PATH}/pref_step$STEP" ${TMP}
Error: Prefilter step 0 died
Error: Search died
Context
I am trying to cluster with low sequence identity in order to produce training, testing, and validation datasets for a machine learning model to avoid data leakage due to sequence homology.
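For reference, the downstream step this clustering feeds into can be sketched as follows. This is a minimal sketch, not part of the failing run: it assumes the two-column, tab-separated cluster file (representative ID, member ID) that easy-cluster writes next to the output prefix, and the names `read_cluster_tsv` and `split_by_cluster`, as well as the 80/10/10 fractions, are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def read_cluster_tsv(path):
    """Yield (representative_id, member_id) pairs from a two-column TSV."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                rep, member = line.split("\t")[:2]
                yield rep, member


def split_by_cluster(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test.

    Every member of a cluster lands in the same split, so sequences that
    were clustered together never straddle two datasets.
    The fractions are illustrative targets, not exact guarantees.
    """
    clusters = defaultdict(list)
    for rep, member in pairs:
        clusters[rep].append(member)

    # Shuffle cluster representatives reproducibly.
    reps = sorted(clusters)
    random.Random(seed).shuffle(reps)

    names = ("train", "val", "test")
    total = sum(len(members) for members in clusters.values())
    targets = [fraction * total for fraction in fractions]

    splits = {name: [] for name in names}
    idx = 0
    for rep in reps:
        # Move on to the next split once the current one has reached its target.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= targets[idx]:
            idx += 1
        splits[names[idx]].extend(clusters[rep])
    return splits
```

Splitting at the cluster level rather than the sequence level is the point of the low identity threshold: the lower the threshold, the less homology remains between the resulting train, validation, and test sets.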
Your Environment
Here are some details about the machine...
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 6
BogoMIPS: 4400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp_epp vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 3 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 80 MiB (64 instances)
L3: 96 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerabilities:
Gather data sampling: Vulnerable: No microcode
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
Steps to Reproduce
mmseqs easy-cluster ../data/truncated_selenoprotein.fa ../data/cluster_truncated_selenoprotein ../data/tmp --min-seq-id 0.3 --max-seq-len 2000