Knowledge doc ingestion #148
base: main
Conversation
Signed-off-by: Aakanksha Duggal <[email protected]>
### 3.3 Introducing the Document Chunking Command

- **Command Overview**: We propose a new command, `ilab document format`, which will:
Is there a particular reason for using the word `format`? The first time I read the command without reading the whole document, it gave me the impression the command was for transforming a document from one format to another (e.g. pdf to md, or pdf to json). Should we consider a word or verb that more closely resembles what is actually happening in this step? For example:

- `ilab docs import --input path/to/document.pdf --output path/to/schema`
- `ilab docs ingest --input path/to/document.pdf --output path/to/schema`
- `ilab docs process --input path/to/document.pdf --output path/to/schema`
I propose using `ilab data` instead of `ilab docs` or `ilab document` - we already have this command group implemented in the CLI, and it would be good if users can keep their data manipulation in a single command group. I'd like some folks from UX to weigh in here as well.
If you think the document-related commands are going to expand considerably, then this might be the time to create a new command group. If document processing will remain a small subset, then integrating them under the `data` group would simplify the CLI structure.
Candidate verbs:

- `ingest`: Provide the document path to ingest and process into the desired schema.
- `process`: Provide the document path to process into the schema.
- `import`: Specify the document path to import and format according to the schema requirements.
- `chunk`: Enter the document path to split and structure for the schema.
`ilab data [verb] [path]`

- Pro: Keeps all data-related commands in one place, which makes for a unified experience and potentially easier discoverability of tasks. This simplifies the CLI structure.
- Con: The broad scope may clutter the group with varied tasks. Users focused on document-specific actions might find it harder to locate what they need.

`ilab docs`/`ilab document [verb] [path]`

- Pro: Creates a clear, dedicated space for document-specific commands, which potentially makes things easier for users working with document-related functions. This leaves a lot of room for scalability.
- Con: Adds another command group, fragmenting the CLI, especially if the document tasks/commands are minimal. Users might need to switch between command groups when working with both documents and other data types.
## 5. InstructLab Schema Overview

### Key Components:

- **Docling JSON Output**: The output from Docling will be the InstructLab schema, which serves as the backbone for both SDG and RAG workflows. Specific details such as the leaf node path or timestamp will be included as part of the file nomenclature.
As part of the ingestion command, should we consider a flag like `--metadata ./path-to-metadata.json`, where the pipeline could augment the metadata of the final output with information such as attribution, timestamps, ilab version, schema version, etc.?

cc @juliadenham
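A minimal sketch of how such a flag might be applied, assuming a hypothetical `apply_metadata` helper and field names (none of these come from the proposal itself):

```python
import json
from pathlib import Path


def apply_metadata(output_doc: dict, metadata_path: str) -> dict:
    """Merge user-supplied metadata (attribution, timestamps, versions, etc.)
    into the final ingestion output. Illustrative sketch only."""
    metadata = json.loads(Path(metadata_path).read_text())
    # Pipeline-produced fields stay authoritative; metadata fills in the rest.
    return {**metadata, **output_doc}
```

The merge order here is a deliberate (assumed) design choice: user metadata can add fields but cannot silently overwrite what the pipeline produced.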
This is great @aakankshaduggal, thank you for writing it up! I would like to see this shared in #sdg in upstream Slack, and maybe also sent out to [email protected] so community members can weigh in as well.
## 3. Proposed Approach

### 3.1 Custom InstructLab Schema Design
It is worth pointing out that we have an existing schema package maintained by the @instructlab/schema-maintainers: https://github.com/instructlab/schema. Right now this only covers the taxonomy schema, but we could extend it with classes designed specifically for this use case.
### 3.2 PDF and Document Conversion via Docling

- **Docling Integration**: We will leverage **Docling** to convert files into structured JSON, which will be the starting point for the InstructLab schema. Individual components will post-process this JSON as per the requirements of the specific SDG and RAG workflows.
Could we include a link here to somewhere folks can read more about Docling, and perhaps a bit on why Docling is the chosen solution?
### 3.3 Introducing the Document Chunking Command

- **Command Overview**: We propose a new command, `ilab document format`, which will:
  - Take a document path (defined in `qna.yaml`).
  - Format and chunk the document into the desired schema.

- **Implementation Details**: Initially, this functionality will be integrated into the existing SDG repository. Over time, it can evolve into a standalone utility, allowing external integrations and wider usage.
What would be the motivation for moving this out of the SDG repository? "allowing external integrations and wider usage" doesn't really tell me much
- **Current Challenge**: Knowledge documents are stored in Git-based repositories, which may be unfamiliar to many users.
- **Proposed Solution**:
  - Allow users to input a local directory and provide an automated script that:
Rather than a "script," why not just have this be part of the code? We can detect whether a given directory is git-tracked (e.g. by checking for a `.git` subdirectory) and do the manipulation described if not.
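A rough sketch of the detection described above (the function name and placement are illustrative, not part of any existing ilab code):

```python
from pathlib import Path


def is_git_tracked(directory: str) -> bool:
    """Heuristic: treat a directory as git-tracked if it or any parent
    contains a .git entry (a directory, or a file for worktrees/submodules)."""
    path = Path(directory).resolve()
    for candidate in (path, *path.parents):
        if (candidate / ".git").exists():
            return True
    return False
```

If the check returns `False`, the CLI could transparently initialize a repository (or otherwise perform the manipulation the comment describes) instead of asking the user to run a separate script.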
Here is a conceptual diagram illustrating the workflow from document ingestion to schema conversion and chunking:

![Knowledge_Document_Ingestion_Workflow](https://github.com/user-attachments/assets/06504b1b-bc8f-4909-b6a2-732a056613c5)
Very nice diagram!

Is this related? #120 cc @makelinux

One more: #64
@aakankshaduggal (trying to envision the overall flow in my head): I almost view this as proposing two independent yet related enhancements. One is the ability to define references to "documents" in a variety of ways, versus just through git references. The other is actually about new document formats and how they would be ingested. So, for the first: do you envision that a user will still declaratively define "pointers" in their taxonomy to the backing doc storage, similar to what is done today in knowledge, like the following example:
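(For context, a knowledge `qna.yaml` today references documents roughly like this; the values below are placeholders I'm adding for illustration, abridged from the taxonomy layout:)

```yaml
document:
  repo: https://github.com/example-org/knowledge-docs
  commit: <commit-sha>
  patterns:
    - "*.md"
```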
However, this "declarative definition" is now more flexible in the sense that it no longer necessarily has to be a git reference.

Would that then, in `ilab data generate`, when I am processing the leaf node, lead to the SDG process looking in a local path relative to the "taxonomy base" path for the documents to use in SDG? (Scoping this comment to enhancement one, which is really a document-independent topic.) Are there more specifics on the general number of formats we want to introduce? Do we have specifics on how that document-section enhancement would look? I ask about the other formats to see whether we are bringing in formats that need implicit dependencies (for example an S3 bucket, where the schema would somehow need a flexible way for the user to define how they want to interact with the COS bucket, which could differ across environments).
Then, enhancement two: the document type enhancement. First question: would it also be accurate to say that as we add new document types (independent of the ways we reference them), we are still going to keep the declarative nature of the taxonomy, where a user explicitly references the document in the taxonomy section? SDG would then, when looking at the document, determine its type, decide whether it needs to be processed by Docling and chunked, produce the chunks (in the example of a 3 MB PDF file, about 250 md chunks are produced), and handle ensuring those are processed as the "set" of documents for SDG. This would continue if multiple PDF files were defined?

I am curious whether you envision things remaining in that flow, versus what I would call a "pre-processing" flow, where users have to explicitly use the tooling to convert the PDF docs as a prerequisite step to setting up a taxonomy, then create a knowledge repo (that would always contain only markdown documents), and then create a leaf node that points to the markdown documents only. Does the difference make sense at a high level? So basically, in option 1, which I think is what we are after:

SDG knows in processing: OK, this doc is type PDF, so first convert the PDF document to markdown chunks automatically. Then, knowing all these chunks are the full set of "documents" for the leaf node, run them for the leaf node, and we are off to the races with the same flow we have currently. The same idea applies to docx files or any other file type we add.
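To make the chunking step above concrete, here is a toy sketch of size-based markdown chunking (the real pipeline would work from Docling's output; the function, paragraph-based splitting, and size limit here are made up for illustration):

```python
def chunk_markdown(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars.
    Illustrative only; not the actual SDG chunking implementation."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        # Flush the current chunk before it would exceed the limit.
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # account for the "\n\n" separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In the flow described above, each resulting chunk would become one entry in the "set" of documents handed to SDG for the leaf node.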
Related to #149 instructlab/sdg#324