Knowledge doc ingestion #148

aakankshaduggal · 2024-10-25T02:42:29Z

Signed-off-by: Aakanksha Duggal <[email protected]>

williamcaban · 2024-10-25T21:38:40Z

docs/sdg/knowledge-doc-ingestion.md

+
+### 3.3 Introducing the Document Chunking Command
+
+- **Command Overview**: We propose a new command, `ilab document format`, which will:


Is there a particular reason for using the word "format"? The first time I read the command, before reading the whole document, it gave me the impression that the command was for transforming a document from one format to another (e.g. pdf to md, or pdf to json). Should we consider a verb that more closely resembles what is actually happening in this step? For example:

ilab docs import --input path/to/document.pdf --output path/to/schema
ilab docs ingest --input path/to/document.pdf --output path/to/schema
ilab docs process --input path/to/document.pdf --output path/to/schema

I propose using ilab data instead of ilab docs or ilab document. We already have this command group implemented in the CLI, and it would be good if users could keep their data manipulation in a single command group. I'd like some folks from UX to weigh in here as well.

If we think the document-related commands are going to expand considerably, then this might be the time to create a new command group. If document processing will remain a small subset, then integrating it under the data group would simplify the CLI structure.

ingest: Provide the document path to ingest and process into the desired schema.
process: Provide the document path to process into the schema.
import: Specify the document path to import and format according to the schema requirements.
chunk: Enter the document path to split and structure for the schema.

‘ilab data [verb] [path]’

Pro: Keeps all data-related commands in one place, which gives users a unified experience and may make tasks easier to discover. It also simplifies the CLI structure.

Con: The broad scope may clutter the group with varied tasks. Users who are focused on document-specific actions might find it harder to locate the commands they need.

‘ilab docs/document’

Pro: Creates a clear, dedicated space for document-specific commands, which may make things easier for users working with document-related functions. It also leaves a lot of room to scale.

Con: Adds another command group, fragmenting the CLI, especially if the document tasks/commands are minimal. Users might need to switch between command groups if they are working with documents and other data types.

williamcaban · 2024-10-25T21:44:44Z

docs/sdg/knowledge-doc-ingestion.md

+## 5. InstructLab Schema Overview
+
+### Key Components:
+- **Docling JSON Output**: The output from Docling will be the instructlab schema, which serves as the backbone for both SDG and RAG workflows. For specific details around the leaf node path or timestamp, we will include that as a part of the file nomenclature.


As part of the ingestion command, should we consider a flag where the pipeline could augment the metadata of the final output like --metadata ./path-to-metadata.json to add information such as attribution, timestamps, ilab version, schema version, etc.?
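A minimal sketch of what such a metadata merge could look like, assuming the converted document is available as a Python dict and that metadata lives under a top-level `metadata` key. The function name `augment_metadata`, the `metadata` key, and the `ingested_at` field are hypothetical illustrations and are not part of the Docling output or of any existing `ilab` flag.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def augment_metadata(document, metadata_path=None, **defaults):
    """Return a copy of `document` with a merged top-level "metadata" mapping.

    Precedence, lowest to highest: metadata already in the document,
    pipeline-supplied `defaults` (e.g. schema version), then the user's
    JSON file. The "metadata" key is an assumption made for this sketch.
    """
    metadata = dict(document.get("metadata", {}))
    metadata.update(defaults)
    if metadata_path is not None:
        user_metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
        metadata.update(user_metadata)
    # Record an ingestion timestamp only if nothing above supplied one.
    metadata.setdefault("ingested_at", datetime.now(timezone.utc).isoformat())
    # Build a new dict so the caller's document is left untouched.
    return {**document, "metadata": metadata}
```

Letting the user-supplied file override pipeline defaults is one possible design choice; the opposite precedence may be preferable for fields such as the schema version.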

nathan-weinberg · 2024-10-29T17:21:43Z

cc @juliadenham

nathan-weinberg

This is great @aakankshaduggal thank you for writing it up!

I would like to see this shared in #sdg in upstream Slack and maybe also sent out to [email protected] so community members can weigh in as well.

nathan-weinberg · 2024-10-29T19:43:19Z

docs/sdg/knowledge-doc-ingestion.md

+
+## 3. Proposed Approach
+
+### 3.1 Custom InstructLab Schema Design


It is worth pointing out we have an existing schema package maintained by the @instructlab/schema-maintainers: https://github.com/instructlab/schema

Right now this is only for the Taxonomy schema, but we could extend it with classes designed specifically for this use case.

nathan-weinberg · 2024-10-29T19:44:02Z

docs/sdg/knowledge-doc-ingestion.md

+
+### 3.2 PDF and Document Conversion via Docling
+
+- **Docling Integration**: We will leverage **Docling** to convert files into structured JSON, which will be the starting point for the instructlab schema. Individual components will post-process this JSON as per the requirements of the specific SDG and RAG workflows.


Could we include a link here to somewhere where folks can read more about Docling, and perhaps a bit as to why Docling is the chosen solution here?

nathan-weinberg · 2024-10-29T19:47:09Z

docs/sdg/knowledge-doc-ingestion.md

+
+### 3.3 Introducing the Document Chunking Command
+
+- **Command Overview**: We propose a new command, `ilab document format`, which will:


I propose using ilab data instead of ilab docs or ilab document. We already have this command group implemented in the CLI, and it would be good if users could keep their data manipulation in a single command group. I'd like some folks from UX to weigh in here as well.

nathan-weinberg · 2024-10-29T19:47:50Z

docs/sdg/knowledge-doc-ingestion.md

+  - Take a document path (defined in `qna.yaml`).
+  - Format and chunk the document into the desired schema.
+
+- **Implementation Details**: Initially, this functionality will be integrated into the existing SDG repository. Over time, it can evolve into a standalone utility, allowing external integrations and wider usage.


What would be the motivation for moving this out of the SDG repository? "Allowing external integrations and wider usage" doesn't really tell me much.

nathan-weinberg · 2024-10-29T19:49:19Z

docs/sdg/knowledge-doc-ingestion.md

+
+- **Current Challenge**: Knowledge documents are stored in Git-based repositories, which may be unfamiliar to many users.
+- **Proposed Solution**:
+  - Allow users to input a local directory and provide an automated script that:


Rather than a "script," why not just have this be part of the code? We can detect if a given directory is git-tracked (e.g. by checking for a .git subdirectory) and do the manipulation described if not
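A minimal sketch of the detection described above, assuming a plain filesystem check is sufficient. The function name `is_git_tracked` and the `search_parents` option are hypothetical and are not taken from the InstructLab code base.

```python
from pathlib import Path


def is_git_tracked(directory, search_parents=False):
    """Return True if `directory` contains a `.git` entry.

    With `search_parents=True`, parent directories are checked as well,
    which covers a documents folder nested inside a repository.
    """
    path = Path(directory).resolve()
    candidates = [path, *path.parents] if search_parents else [path]
    # `.git` is a directory in a normal clone but a file in worktrees and
    # submodules, so test for existence rather than for a directory.
    return any((candidate / ".git").exists() for candidate in candidates)
```

A check like this only tells whether the directory sits inside a repository; it does not say whether individual files are committed, which would require asking git itself.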

nathan-weinberg · 2024-10-29T19:49:59Z

docs/sdg/knowledge-doc-ingestion.md

+
+Here is a conceptual diagram illustrating the workflow from document ingestion to schema conversion and chunking:
+
+![Knowledge_Document_Ingestion_Workflow](https://github.com/user-attachments/assets/06504b1b-bc8f-4909-b6a2-732a056613c5)


Very nice diagram!

nathan-weinberg · 2024-10-29T20:03:23Z

Is this related? #120 cc @makelinux

nathan-weinberg · 2024-10-29T20:05:35Z

This one from @jjasghar also seems related? #106

nathan-weinberg · 2024-10-29T20:09:12Z

One more #64

relyt0925 · 2024-10-31T01:25:37Z

@aakankshaduggal (trying to envision the overall flow in my head): I almost view this as proposing two independent yet related enhancements. One is the ability to define references to "documents" in a variety of ways rather than just through git references. The other concerns new document formats and how they would be ingested.

So do you envision that a user will still declaratively define "pointers" in their taxonomy to the backing doc storage, similar to what is done today in knowledge, like the following example:

document:
  repo: https://github.com/relyt0925/rbc-knowledge
  commit: 99dae176de4927940aee4faaeb0f645b3ee4582b
  patterns:
    - pdf_chunk*.md

However, this "declarative definition" is now more flexible in the sense that it no longer has to be just repo, commit, and patterns. It could be something like a file path within the base of a taxonomy, which could look something like this:

document:
  local_directory: documents/docchunks/ 
  patterns:
    - pdf_chunk0.md

Then, when ilab data generate processes the leaf node, would the SDG process look in a local path relative to the "taxonomy base" path for the documents to use in SDG?
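To make the question concrete, here is a minimal sketch of how such a lookup could work, assuming the `document` section has already been parsed into a dict. The `local_directory` key comes from the example above and is not an existing taxonomy field; the function name `resolve_documents` is hypothetical.

```python
from pathlib import Path


def resolve_documents(taxonomy_base, document):
    """Return the sorted files matched by a `document` section.

    `local_directory` is interpreted relative to the taxonomy base path,
    and each entry in `patterns` is applied as a glob inside it.
    """
    directory = Path(taxonomy_base) / document["local_directory"]
    matches = set()
    for pattern in document.get("patterns", []):
        # A literal file name is also a valid glob and matches itself.
        matches.update(p for p in directory.glob(pattern) if p.is_file())
    return sorted(matches)
```

Sorting the result keeps the document order stable across runs, which may matter if downstream chunk identifiers depend on it.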

(Scoping this comment to the first enhancement, which is really a document-independent topic.) Are there more specifics on the general number of formats that we want to introduce? Do we have specifics on what that document section enhancement would look like? I ask about the other formats to see whether we are bringing in formats that introduce implicit dependencies (for example, an S3 bucket, where the schema would then need a flexible way for the user to define how they want to interact with the COS bucket, which could differ between environments).

relyt0925 · 2024-10-31T01:30:38Z

Then 2: the document type enhancement

First question: would it also be accurate to say that as we add new document types (independent of the ways we reference them), we are still going to keep the declarative nature of the taxonomy, where a user explicitly references the document in the taxonomy section? SDG, when looking at the document, would then determine its type and whether it needs to be processed by Docling and chunked. It would then produce the chunks (in the example of a 3 MB PDF file, about 250 md chunks are produced) and ensure those are processed as the "set" of documents for SDG? This would continue if multiple PDF files were defined?

I am curious whether you envision things remaining in that flow, versus what I would call a "pre-processing" flow, where users have to explicitly use the tooling to convert the PDF docs as a prerequisite step to setting up a taxonomy, then create a knowledge repo (that would always contain only markdown documents), and then create a leaf node that points to the markdown documents only. Does the difference I am describing make sense at a high level?

So basically, in option 1, which I think is what we are after, SDG would see something like this as it's parsing the leaf node:

document:
  local_directory: documents/pdfs/ 
  patterns:
    - pdf1.pdf

Then, in processing, it would know: OK, this doc is of type PDF, so first I need to convert the PDF document to markdown chunks. Let me automatically do that. Then: OK, now I know all these chunks are the full set of "documents" I am running for the leaf node. Let me take that set and run it for the leaf node, and now we are off to the races with the same flow we have currently. Same idea for docx files or any other file type we add.
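A rough sketch of the option 1 flow described above, assuming the file type is decided from the extension and that conversion is delegated to a caller-supplied function. The names `NEEDS_CONVERSION` and `expand_documents` and the `convert` callback are hypothetical; `convert` stands in for the Docling conversion and chunking step and does not represent Docling's actual API.

```python
from pathlib import Path

# Hypothetical set of extensions that must be converted before SDG can use them.
NEEDS_CONVERSION = {".pdf", ".docx"}


def expand_documents(paths, convert):
    """Return the full set of markdown documents for a leaf node.

    `convert` is a callable that takes a Path and returns a list of
    markdown chunk strings. Files that already hold markdown are read
    and passed through unchanged.
    """
    documents = []
    for path in map(Path, paths):
        if path.suffix.lower() in NEEDS_CONVERSION:
            # One source file may expand into many chunks.
            documents.extend(convert(path))
        else:
            documents.append(path.read_text(encoding="utf-8"))
    return documents
```

With this shape, the rest of the SDG flow only ever sees a flat list of markdown documents, regardless of how many source files of each type the leaf node referenced.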

aakankshaduggal added 2 commits October 24, 2024 22:41

📝 Add Knowledge Document Ingestion Pipeline design proposal

5df756d

Signed-off-by: Aakanksha Duggal <[email protected]>

Merge branch 'instructlab:main' into knowledge-doc-ingestion

72cecb3

williamcaban reviewed Oct 25, 2024

aakankshaduggal marked this pull request as ready for review October 28, 2024 20:17

nathan-weinberg requested review from nathan-weinberg and cdoern October 29, 2024 17:14

nathan-weinberg reviewed Oct 29, 2024

nathan-weinberg linked an issue Oct 29, 2024 that may be closed by this pull request

Add Knowledge Document Ingestion Pipeline Design Proposal #149

Open

makelinux mentioned this pull request Oct 30, 2024

Create a repository for knowledge doc ingestion tool #152

Open
