corpus CLI #488
Comments
This sounds like a job for
Playing Devil's Advocate here, why do we need to involve a
If, instead of restricting the positional argument to a single root directory, we allow passing an arbitrary number of directories and files, we can easily get this behavior:

$ cat files.txt
s3://my-bucket/foo/bar/spam.pdf
s3://my-bucket/foo/bar/ham.pdf
s3://my-bucket/foo/bar/eggs.pdf
$ alias ragna-ingest="python -c 'import sys; print(sys.argv[1:])'"
$ ragna-ingest $(cat files.txt)
['s3://my-bucket/foo/bar/spam.pdf', 's3://my-bucket/foo/bar/ham.pdf', 's3://my-bucket/foo/bar/eggs.pdf']

This is standard CLI behavior and also enables users to glob files with the shell, e.g.
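For illustration, a variadic positional argument like the one described above could look roughly like this. It is only a sketch using the standard library's argparse; the program name `ragna-ingest` is borrowed from the alias above, and the argument name `paths` is a hypothetical choice, not something Ragna defines.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "ragna-ingest" and "paths" are illustrative names only (assumptions).
    parser = argparse.ArgumentParser(prog="ragna-ingest")
    # nargs="+" accepts one or more values, so individual files, directories,
    # and shell-expanded globs all arrive as one flat list of strings.
    parser.add_argument("paths", nargs="+", help="files or directories to ingest")
    return parser


def parse_paths(argv: list[str]) -> list[str]:
    """Return the positional paths exactly as the shell passed them."""
    return build_parser().parse_args(argv).paths
```

With this shape, `ragna-ingest $(cat files.txt)` and `ragna-ingest docs/*.pdf` both reach the program as a plain list of strings, since the shell does the expansion before the CLI ever sees the arguments.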
Maybe there is a misunderstanding what the For an example, have a look at To answer your question about why we need it here:
Yeah, that is a good idea. However, I prefer using I wonder if we can merge
(After a short offline chat with @pmeier) I had a vague concern around latency and storage for large corpora (e.g. instantiating 100K
I opened #251 at one point in case it's of interest here.
To be able to release the corpus API, we need a way for users to CRUD the corpora on a given source storage. To make our lives a little easier, we are not targeting a UI for this yet, but should start with a CLI instead. Since the likely consumers of this feature are power users or admins, this is ok.
We can add a ragna corpus subcommand to the CLI. This in turn could have more subcommands:

- ragna corpus list: List all available corpora
- ragna corpus ingest: Ingest some documents into a given corpus (more on this later)
- ragna corpus delete: Delete a given corpus
- ragna corpus metadata: List all available metadata in a given corpus

Each command needs the source storage the action should be applied to. We have a few options here that we potentially can implement all:
- --source-storage flag that accepts an import string similar to what we do in our config file, e.g. --source-storage ragna.source_storages.Chroma
- --config and only accept a --source-storage listed there. Also allow passing the source storage by its display name similar to the API, since we know the options.
- --config parameter and --source-storage is not passed, offer the user an interactive list of available source storages to select from

ragna corpus ingest is the trickiest of them. IMO a reasonable default behavior would be

- LocalDocument.from_path on each path for which we have an available DocumentHandler
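The path-selection part of that default could be sketched as follows. This is a standalone illustration, not Ragna code: `SUPPORTED_SUFFIXES` is a hypothetical stand-in for "suffixes we have a DocumentHandler for", and `iter_ingestable_paths` is an invented helper name. In the real command, each yielded path would presumably be handed to LocalDocument.from_path.

```python
from pathlib import Path
from typing import Iterable, Iterator

# Hypothetical stand-in for the suffixes covered by an available DocumentHandler.
SUPPORTED_SUFFIXES = {".pdf", ".txt", ".md"}


def iter_ingestable_paths(
    inputs: Iterable[str], supported_suffixes: set[str] = SUPPORTED_SUFFIXES
) -> Iterator[Path]:
    """Yield every file under the given inputs whose suffix is supported."""
    for raw in inputs:
        path = Path(raw)
        # Directories are walked recursively; anything else is checked directly.
        candidates = sorted(path.rglob("*")) if path.is_dir() else [path]
        for candidate in candidates:
            if candidate.is_file() and candidate.suffix.lower() in supported_suffixes:
                yield candidate
```

Unsupported files are skipped silently here; whether the real command should warn or fail on them instead is a separate design question.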
From there on it is just calling SourceStorage.store and injecting them into Ragna's database.

The tricky part comes when opening this up to other behavior than just the default:

- ragna corpus ingest you want to ingest files from local disk?
- Document class has a from_path classmethod in order for us to create an arbitrary Document subclass if we have nothing more than a path?

I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.
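To make the proposed layout more concrete, here is a rough sketch of the ragna corpus subcommands together with a --source-storage flag that takes an import string. It uses only the standard library (argparse and importlib) rather than Ragna's actual CLI framework, and the positional `corpus_name` and `paths` arguments are assumptions added for illustration; the issue does not specify them.

```python
import argparse
import importlib


def resolve_import_string(import_string: str):
    """Resolve a string such as "package.module.Name" to the object it names."""
    module_name, _, attribute = import_string.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Name', got {import_string!r}")
    return getattr(importlib.import_module(module_name), attribute)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ragna")
    commands = parser.add_subparsers(dest="command", required=True)
    corpus = commands.add_parser("corpus")
    actions = corpus.add_subparsers(dest="action", required=True)

    # Every corpus action needs to know which source storage to act on.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--source-storage", required=True, type=resolve_import_string)

    actions.add_parser("list", parents=[shared])

    ingest = actions.add_parser("ingest", parents=[shared])
    ingest.add_argument("corpus_name")  # hypothetical positional argument
    ingest.add_argument("paths", nargs="+")  # hypothetical positional argument

    # delete and metadata both act on a single named corpus (assumed interface).
    for name in ("delete", "metadata"):
        actions.add_parser(name, parents=[shared]).add_argument("corpus_name")

    return parser
```

Under these assumptions, `ragna corpus list --source-storage ragna.source_storages.Chroma` would parse into a namespace whose `source_storage` attribute is the imported class. The --config and interactive-selection options from the list above are not covered by this sketch.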