Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add text as html to orig elements chunks #3779

Merged
merged 9 commits into from
Nov 20, 2024

Conversation

plutasnyy
Copy link
Contributor

@plutasnyy plutasnyy commented Nov 11, 2024

This simplest solution doesn't drop HTML from metadata when merging Elements from HTML input. We still need to address how to handle nested elements, and if we want to have LayoutElements in the metadata of Composite Elements, a unit test showing the current behavior.
Note: metadata still contains orig_elements which has all the metadata.

@plutasnyy plutasnyy self-assigned this Nov 11, 2024
@plutasnyy plutasnyy marked this pull request as ready for review November 18, 2024 12:28
@@ -774,6 +774,8 @@ def iter_kwarg_pairs() -> Iterator[tuple[str, Any]]:
# -- Python 3.7+ maintains dict insertion order --
ordered_unique_keys = {key: None for val_list in values for key in val_list}
yield field_name, list(ordered_unique_keys.keys())
elif strategy is CS.STRING_CONCATENATE:
yield field_name, "".join(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be better " ".join(val.strip for val in values)?

<Document>
<Page>
<Section>
<p>First </p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the spaces here (before </p>) intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are, a bit :P
I did that intuitively; when you look at some metadata, it json it looks this way
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/documents/unstructured_json_output/example.json#L45

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same for all 1-line htmls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it isn't case for HTML, so mabe I clean that from docstirng

Copy link
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@plutasnyy plutasnyy added this pull request to the merge queue Nov 20, 2024
Merged via the queue into main with commit 85ecdab Nov 20, 2024
41 checks passed
@plutasnyy plutasnyy deleted the ML-468/add_text_as_html_to_orig_elements_chunks branch November 20, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants