Commit

minor corrections
ctb committed Sep 30, 2023
1 parent 4e97e8e commit 571c865
Showing 2 changed files with 14 additions and 14 deletions.
2 changes: 1 addition & 1 deletion doc/faq.md
@@ -217,7 +217,7 @@ matching genome.
## Can I use sourmash to determine the best reference genome for mapping my reads?

Yes! (And see the FAQ above,
-[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping).)
+[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping))

If you're interested in picking a single best reference genome (from a
large database) for read mapping, you can do the following:
26 changes: 13 additions & 13 deletions doc/sourmash-internals.md
@@ -647,24 +647,24 @@ n+1 problem

Since `sourmash gather` will pick only one "best match" if there
are several (and will ignore the others), the order of searching
-can matter for multiple collections. How does this work?
+can matter for large collections. How does this work?

In brief, sourmash doesn't guarantee a particular load order for
sketches in a single collection, but it _does_ guarantee that
collections are loaded and searched in their entirety in the order
that you provide them. So, for example, if you have a large zipfile
-database of sketches that contain some duplicates, you can't pick
-which of the duplicates will be chosen as a match; but you _can_
-provide your own collection of prioritized matches as a separate
-database. A practical application of this might be to provide the
-GTDB "representatives" database first on the command line, with the
-full GTDB database second, in order to prioritize choosing
-representative genomes as matches over the rest.
-
-This also plays a role in the order of reporting for `prefetch` -
-`prefetch` will report matching sketches in the order it encounters
-them, which will match the order in which collections are given
-to `sourmash prefetch` on the command line.
+database of sketches that contains duplicates, you can't predict which
+of the duplicates will be chosen as a match; but you _can_ build your
+own collection of prioritized matches as a separate database, and put
+it first on the command line. A practical application of this might
+be to list the GTDB "representatives" database first on the command
+line, with the full GTDB database second, in order to prioritize
+choosing representative genomes as matches over the rest.
+
+This also plays a role in the order of reporting for `prefetch`
+output - `prefetch` will report matching sketches in the order it
+encounters them, which will match the order in which collections are
+given to `sourmash prefetch` on the command line.
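
Note on the hunk above: the ordering behavior it describes can be illustrated with a small toy model. This is only a sketch and not sourmash's implementation. Plain Python sets stand in for sketches, the function `best_match` and all genome names are hypothetical, and it assumes that a tie between equally good matches goes to whichever match is encountered first, which is what the GTDB "representatives" example implies.

```python
# Toy model only: sets of integers stand in for sketches, and none of
# these names come from sourmash itself.

def best_match(query, collections):
    """Return (name, overlap) for the best match across collections.

    `collections` is a list of collections in command-line order; each
    collection is a list of (name, hashes) pairs. Assumption: on a tie,
    the match encountered first is kept.
    """
    best_name, best_overlap = None, 0
    for collection in collections:      # collections are searched in the order given
        for name, hashes in collection:
            overlap = len(query & hashes)
            if overlap > best_overlap:  # strict '>' keeps the earlier match on ties
                best_name, best_overlap = name, overlap
    return best_name, best_overlap


query = {1, 2, 3, 4, 5}
# Two equally good matches (overlap of 4), one in each collection.
representatives = [("rep_genome", {1, 2, 3, 4, 9})]
full_database = [("duplicate_genome", {1, 2, 3, 4, 8}), ("other_genome", {5, 6})]

print(best_match(query, [representatives, full_database]))
print(best_match(query, [full_database, representatives]))
```

Listing `representatives` first selects `rep_genome`; swapping the order selects `duplicate_genome`, even though both overlap the query equally. The toy model fixes the order inside each list, whereas the documentation says the load order within a single collection is not guaranteed, so only the ordering between collections should be relied on.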

## Formats natively understood by sourmash
