Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content iterator #82

Merged
merged 33 commits into from
May 27, 2024
Merged

Content iterator #82

merged 33 commits into from
May 27, 2024

Conversation

chocolatkey
Copy link
Member

@chocolatkey chocolatkey commented Sep 7, 2023

Need to port over the tests too before this is ready

Notable changes:

  • Bump minimum Go version from 1.18 to 1.21
  • Adds content iterator based on the mobile toolkits, and also implemented in the ts-toolkit
  • Improvements to minimal reads mode for ZIPs
  • Change manifest.Properties to avoid mutations, also adding a Properties() getter to the Resource interface. See (Im)mutability in the toolkit APIs #88 for more info
  • Add deduplication for Contributors, a common issue in EPUB metadata
  • Fix concurrent map access fault in marshalling of manifest
  • Improve EPUB font deobfuscator
  • Fix bug in XML query of META-INF/display-options.xml

Future improvements:

  • Add the memory-efficient content iterator improvements as with the ts-toolkit
  • Full test suite for content iterator

@chocolatkey chocolatkey marked this pull request as ready for review May 18, 2024 04:53
return ""
}

// TODO partialCfi and domRange getters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK they are only used in Thorium, so we might not even need them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorium doesn't really use / need EPUB CFI. We generate them in our "locators" next to the data structure we actually rely on to make things work internally (DOM Range for precise locations, percentage progression otherwise ... oh, and by the way we still don't support positions list yet). We emit CFI "just because we can", and it's only the path to the element anyway, no character-level reference. To achieve feature-parity with our internal DOM Range (de)serialisation mechanism, we would need a properly-tested and conformant CFI implementation, which apparently doesn't exist :)

Also see futurepress/epub.js#1358

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This work is based on the kotlin-toolkit, which is presumably based on the documentation here: https://github.com/readium/architecture/blob/master/models/locators/extensions/html.md . partialCfi is just a string in kotlin, so it could theoretically omitted since you can access it if it exists via OtherLocations, but since DomRange is implemented in the kotlin-toolkit I would want to implement it in the go-toolkit as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extension helpers are implemented but never used. It might not be worth implementing it in the Go toolkit, unless you do want to generate partial CFI or DOM ranges at some point.

pkg/mediatype/sniffer.go Show resolved Hide resolved
pkg/parser/epub/parser.go Show resolved Hide resolved
pkg/pub/publication.go Show resolved Hide resolved
pkg/pub/service_content.go Show resolved Hide resolved
pkg/util/css.go Outdated Show resolved Hide resolved
Comment on lines +11 to +13
Text(separator *string) (string, error) // Extracts the full raw text, or returns null if no text content can be found.
Iterator() iterator.Iterator // Creates a new iterator for this content.
Elements() ([]element.Element, error) // Returns all the elements as a list.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Kotlin toolkit, text() and elements() are in the interface with a default implementation, but they should probably be extensions instead. The only thing that should be required to implement is Iterator.

Maybe we can remove them from the interface and keep only ContentText() and ContentElements() on the side?

Alternatively if we want to keep the API easy to use, Content could be a struct and have a IteratorFactory interface instead.

pkg/content/element/element.go Outdated Show resolved Hide resolved
pkg/content/iterator/html.go Show resolved Hide resolved
pkg/content/iterator/iterator.go Show resolved Hide resolved
Copy link
Member

@mickael-menu mickael-menu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @chocolatkey

@chocolatkey chocolatkey merged commit 7dabb92 into main May 27, 2024
1 check passed
@chocolatkey chocolatkey deleted the content-iterator branch May 27, 2024 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants