Add text as html to orig elements chunks #3779
…ml_to_orig_elements_chunks
unstructured/chunking/base.py
Outdated
@@ -774,6 +774,8 @@ def iter_kwarg_pairs() -> Iterator[tuple[str, Any]]:
        # -- Python 3.7+ maintains dict insertion order --
        ordered_unique_keys = {key: None for val_list in values for key in val_list}
        yield field_name, list(ordered_unique_keys.keys())
    elif strategy is CS.STRING_CONCATENATE:
        yield field_name, "".join(values)
Wouldn't it be better to use " ".join(val.strip() for val in values)?
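To make the difference concrete, here is a minimal sketch comparing the two join variants. The sample values and the helper names concat_plain and concat_stripped are hypothetical, chosen only for illustration; they are not taken from the PR or its test fixtures.

```python
# Hypothetical per-element HTML strings (an assumption, not the PR's fixtures).
values = ["<p>First </p> ", " <p>Second</p>\n", "<p>Third</p>"]


def concat_plain(values: list[str]) -> str:
    """Join exactly as in the diff: no separator, whitespace kept as-is."""
    return "".join(values)


def concat_stripped(values: list[str]) -> str:
    """Join as suggested: trim each value, then separate with one space."""
    return " ".join(val.strip() for val in values)


# Outer whitespace from each fragment leaks into the plain result.
print(repr(concat_plain(values)))
# Only whitespace around each fragment is trimmed; the space inside
# "<p>First </p>" is untouched because str.strip() works on the ends only.
print(repr(concat_stripped(values)))
```

Note that stripping normalizes the whitespace between fragments but does not change whitespace inside a tag, so it would not affect the trailing space before a closing tag.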
<Document>
<Page>
<Section>
<p>First </p>
Are the spaces here (before </p>) intended?
They are, a bit :P
I did that intuitively; when you look at some metadata, in the JSON it looks this way:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/documents/unstructured_json_output/example.json#L45
This is the same for all one-line HTML snippets.
But that isn't the case for HTML, so maybe I'll clean that up in the docstring.
LGTM!
This simplest solution doesn't drop HTML from metadata when merging Elements from HTML input. We still need to address how to handle nested elements, and if we want to have LayoutElements in the metadata of Composite Elements, a unit test showing the current behavior.
Note: metadata still contains orig_elements, which has all the metadata.