Admittedly, creating a custom content parser is a bit cumbersome. However, once it's configured, it shouldn't change frequently.
Right-click on any part of the page and select "Inspect" to open the "Elements" tab of the Chrome Developer Tools.
An effective approach is to find the innermost (or outermost, depending on your use case) containing <div> of the selected element that has a unique id or class that can be used to distinguish the container.
Note: The container doesn't have to be a <div>. It can be any tag.
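The search for a distinguishing container can also be done from the DevTools console instead of by eye. The helper below is a minimal sketch and not part of Lumos: nearestContainerSelector() is a hypothetical name, and it assumes that the first id or class found while walking up the tree is a reasonable candidate. In the console, $0 refers to the element currently selected in the "Elements" tab.

```javascript
// Hypothetical helper (not part of Lumos): walk up from an element to the
// nearest ancestor (or the element itself) that has an id or a class, and
// return a selector string for it.
function nearestContainerSelector(element) {
  for (let node = element; node; node = node.parentElement) {
    // Prefer an id, since ids are meant to be unique within a page.
    if (node.id) return `#${node.id}`;

    // Otherwise fall back to the first class name, if there is one.
    // (SVG elements expose className as an object, so guard on the type.)
    const classes = typeof node.className === "string" ? node.className.trim() : "";
    if (classes) return `.${classes.split(/\s+/)[0]}`;
  }
  return null; // nothing up the tree carries an id or class
}

// In the DevTools console, after selecting an element in "Elements":
//   const selector = nearestContainerSelector($0);
//   document.querySelectorAll(selector).length; // 1 means the selector is unique
```

This finds the innermost candidate. An id or class containing special characters would need escaping (for example with CSS.escape()) before being used as a query.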
Determine the queries for the "selectors" list (and, if needed, the "selectorsAll" list) and add them to the existing content parser configuration in the Lumos Options page. See the documentation for querySelector() and querySelectorAll() for the full query syntax and more examples.
Example queries:
- Select element by tag name: tagName
- Select element by id (leading #): #elementId
- Select element by class name (leading .): .className
querySelector() supports complex selectors and negation (for example, the :not() pseudo-class).
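Before saving queries to the configuration, it can help to check what they actually match. The following is a rough sketch, not Lumos code: previewSelectors() is a hypothetical helper, and root stands for anything that exposes querySelector() and querySelectorAll(), such as document in the DevTools console.

```javascript
// Hypothetical helper (not part of Lumos): preview the text that a set of
// queries would match. Pass `document` as `root` in the DevTools console.
function previewSelectors(root, selectors = [], selectorsAll = []) {
  const preview = {};

  // querySelector() returns only the first match, or null if there is none.
  for (const query of selectors) {
    const element = root.querySelector(query);
    preview[query] = element ? [element.textContent.trim()] : [];
  }

  // querySelectorAll() returns every match.
  for (const query of selectorsAll) {
    preview[query] = Array.from(root.querySelectorAll(query), (element) =>
      element.textContent.trim()
    );
  }

  return preview;
}

// Example call in the DevTools console, including a negation query:
//   previewSelectors(document, ["#elementId"], [".className", "div:not(.className)"]);
```

An empty array for a query means nothing on the page matched it, which usually points to a typo or to content that is loaded after the initial page render.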
Example config for a single domain:
{
  "domain.com": {
    "chunkSize": 500,
    "chunkOverlap": 0,
    "selectors": [
      "tagName",
      "#elementId"
    ],
    "selectorsAll": [
      ".className"
    ]
  }
}
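To make the roles of the fields concrete, here is a simplified sketch of how such a configuration could be consumed. It is illustrative only and is not taken from the Lumos source: findParserConfig() and splitIntoChunks() are hypothetical names, the lookup assumes the configuration key matches the page hostname exactly, and the splitter assumes chunkSize and chunkOverlap are measured in characters.

```javascript
// Example configuration, mirroring the single-domain config above.
const contentConfig = {
  "domain.com": {
    chunkSize: 500,
    chunkOverlap: 0,
    selectors: ["tagName", "#elementId"],
    selectorsAll: [".className"],
  },
};

// Hypothetical lookup: assumes the config key equals the URL hostname exactly.
function findParserConfig(config, pageUrl) {
  const { hostname } = new URL(pageUrl);
  return Object.prototype.hasOwnProperty.call(config, hostname)
    ? config[hostname]
    : null;
}

// Simplified fixed-window splitter: assumes sizes are counted in characters.
// Each chunk starts (chunkSize - chunkOverlap) characters after the previous one.
function splitIntoChunks(text, chunkSize, chunkOverlap) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("expected 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const parserConfig = findParserConfig(contentConfig, "https://domain.com/article");
const chunks = splitIntoChunks("abcdefghij", 4, 1);
console.log(parserConfig.selectors, chunks);
```

With chunkOverlap set to 0, as in the example config, consecutive chunks share no text. A non-zero overlap repeats the tail of each chunk at the start of the next one.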