Break lifetime entanglement of TextExtract, TFIDF and Jieba #100
This is a more concrete example of what I was thinking in #99. I don't really love it...but wanted to throw it out there for consideration. Having gotten this far, I wonder maybe one could try adding a
moving all of the impl over to two The final result would be 2 APIs, one where
This creates UnboundTextExtract and UnboundTFIDF structs that implement a new JiebaKeywordExtract trait. Unlike KeywordExtract, JiebaKeywordExtract takes a Jieba struct in the keyword_extract() call. This enables instantiation of UnboundTFIDF and UnboundTextExtract without a Jieba instance, which lets them have separate lifetimes. For loading custom stop words or IDF dictionaries, this can avoid unnecessary object initialization costs. The original TextExtract<'a> and TFIDF<'a> structs become convenience facades over the Unbound variants, leaving the public API stable. The Unbound variants also implement the Default trait, allowing their new() methods to be more verbose. This in turn allows construction of empty variants of the objects without picking up the cost of cloning the default state just to overwrite it later in a load_dict() or set_stop_words() call.
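For readers skimming the thread, the facade arrangement described above could look roughly like the sketch below. It is only an illustration: the Jieba type here is a whitespace-splitting stand-in, the ranking is a plain frequency count rather than real TF-IDF, and the method signatures (keyword_extract, new_with_jieba, extract_tags) are assumed for the example, not copied from the PR diff.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for the real segmenter (assumption: it just splits on whitespace).
struct Jieba;

impl Jieba {
    fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        sentence.split_whitespace().collect()
    }
}

/// Trait that receives the segmenter at call time instead of at construction.
trait JiebaKeywordExtract {
    fn keyword_extract(&self, jieba: &Jieba, sentence: &str, top_k: usize) -> Vec<String>;
}

/// Holds only its own state, so it needs no lifetime parameter.
#[derive(Default)]
struct UnboundTFIDF {
    stop_words: BTreeSet<String>,
}

impl JiebaKeywordExtract for UnboundTFIDF {
    fn keyword_extract(&self, jieba: &Jieba, sentence: &str, top_k: usize) -> Vec<String> {
        // Placeholder scoring: count occurrences of each non-stop word.
        let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
        for word in jieba.cut(sentence) {
            if !self.stop_words.contains(word) {
                *counts.entry(word).or_insert(0) += 1;
            }
        }
        // Highest count first; ties broken alphabetically for determinism.
        let mut ranked: Vec<(&str, usize)> = counts.into_iter().collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
        ranked.into_iter().take(top_k).map(|(w, _)| w.to_string()).collect()
    }
}

/// Convenience facade: keeps the borrowed Jieba and delegates to the unbound variant.
struct TFIDF<'a> {
    jieba: &'a Jieba,
    inner: UnboundTFIDF,
}

impl<'a> TFIDF<'a> {
    fn new_with_jieba(jieba: &'a Jieba) -> Self {
        TFIDF { jieba, inner: UnboundTFIDF::default() }
    }

    fn extract_tags(&self, sentence: &str, top_k: usize) -> Vec<String> {
        self.inner.keyword_extract(self.jieba, sentence, top_k)
    }
}

fn main() {
    let jieba = Jieba;
    let tfidf = TFIDF::new_with_jieba(&jieba);
    println!("{:?}", tfidf.extract_tags("a b a c", 1));
}
```

Only the facade carries the 'a lifetime; the unbound struct can be built, configured, and stored independently of any Jieba instance.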
@messense I think this is ready for review now if you have a sec. The CI failures seem to be due to incorrect access tokens in GitHub...
Thanks! I'll take a look next week.
Exposes all the configuration assumptions of TFIDF and TextRank so they can be inspected and modified by the user. Adds doc tests showing basic usage.
Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.
It's fine to introduce breaking changes, a semver bump isn't a big issue.
Oh! In that case, hold off on review for a bit. Let me just replace the old APIs instead of introducing awkward names like JiebaKeywordExtract, etc. And if we're going to do a semver bump, before publishing, let me also try to make the hmm replaceable (either by completing the other open PR or writing a new one). Give me a week. Will ping again.
New APIs do not require binding a Jieba on construction allowing independent lifetime management and preservation of state.
Also clean up some of the unsafe call syntax.
Okay, I think I have everything stable now. The capi has NOT been updated to reflect the newer more flexible API. It just preserves the old semantics. If we are okay continuing down this path, I think there are some follow-up PRs that would
This would be a cohesive API restructuring in one semver bump.
Please run cargo fmt to format code.
Done, though it didn't mark the I also realized the PR title + summary were completely out of date so I rewrote them.
/// Creates a KeywordExtractConfig state that contains filter criteria as
/// well as segmentation configuration for use by keyword extraction
/// implementations.
pub fn new(stop_words: BTreeSet<String>, min_keyword_length: usize, use_hmm: bool) -> Self {
I'd remove the getters (not really useful) and use the builder pattern for KeywordExtractConfig, like
let config = KeywordExtractConfig::builder()
.add_stop_word("word")
.use_hmm(true)
// and other options
.build();
or without a separate builder type:
let config = KeywordExtractConfig::default()
.add_stop_word("word")
.use_hmm(true)
// and other options
;
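To make the second suggestion concrete, here is a minimal sketch of a config with consuming, chainable setters and no separate builder type. The field names follow the new() signature quoted in the diff above; the derived Default values (empty stop words, length 0, hmm off) are placeholders for the example and not the crate's actual defaults, and min_keyword_length() as a setter is an assumed name.

```rust
use std::collections::BTreeSet;

/// Hypothetical config shape; fields mirror the `new()` arguments in the diff.
#[derive(Debug, Clone, Default, PartialEq)]
struct KeywordExtractConfig {
    stop_words: BTreeSet<String>,
    min_keyword_length: usize,
    use_hmm: bool,
}

impl KeywordExtractConfig {
    /// Each setter takes `self` by value and returns it, so calls can be chained.
    fn add_stop_word(mut self, word: &str) -> Self {
        self.stop_words.insert(word.to_string());
        self
    }

    fn min_keyword_length(mut self, len: usize) -> Self {
        self.min_keyword_length = len;
        self
    }

    fn use_hmm(mut self, enabled: bool) -> Self {
        self.use_hmm = enabled;
        self
    }
}

fn main() {
    let config = KeywordExtractConfig::default()
        .add_stop_word("word")
        .min_keyword_length(2)
        .use_hmm(true);
    println!("{:?}", config);
}
```

A separate builder type mostly pays off when build() has to validate or precompute something; for plain field assignment the chained form above is usually enough.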
We can change this later, does not need to block merging this PR.
Added #101 to track.
Thank you for your contribution, I've sent you an invitation to collaborate on this project.
Thank you!! For the getters, filing an issue. I kinda wanna mess with completing that custom HMM extraction in the other open PR first, then attempt to bind these all into Elixir to see what happens. I worry that w/o the getters, it might be harder to query the configuration from other languages without having to duplicate (and synchronize) the state.
Modify KeywordExtract to take a Jieba instance during the keyword extraction call instead of during construction. Remove the Jieba reference and the lifetimes from struct TextRank as well as struct TFIDF.
This is desirable in situations where the extractor instance might outlive a single function call.
One specific case where this comes up is in bindings from other languages, where the Jieba instance may be refcounted and shared between API calls in a way that depends on the other language's calling semantics. In these situations, it is hard to have TextRank and TFIDF be bound to the Jieba instance via Rust's understanding of scope-based lifetimes.
Luckily, the KeywordExtract interface only uses the Jieba instance on execution of the extraction and arguably, the API should just take the Jieba instance in (or even possibly the resulting segments) to reduce coupling of the structs. In this PR, the Jieba instance was moved from construction down to the function where it was used.
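As a rough illustration of the binding scenario, the sketch below keeps a refcounted segmenter and a lifetime-free extractor side by side in one long-lived handle. Everything in it is assumed for the example: Jieba is a whitespace-splitting stand-in, the extraction logic is a placeholder rather than TextRank, and the extract_keywords signature and the Session type are hypothetical.

```rust
use std::sync::Arc;

/// Stand-in segmenter (assumption: splits on whitespace only).
struct Jieba;

impl Jieba {
    fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        sentence.split_whitespace().collect()
    }
}

/// The segmenter is an argument of the call, not a field of the extractor.
trait KeywordExtract {
    fn extract_keywords(&self, jieba: &Jieba, sentence: &str, top_k: usize) -> Vec<String>;
}

/// No reference to Jieba, hence no lifetime parameter.
#[derive(Default)]
struct TextRank {
    min_keyword_length: usize,
}

impl KeywordExtract for TextRank {
    fn extract_keywords(&self, jieba: &Jieba, sentence: &str, top_k: usize) -> Vec<String> {
        // Placeholder logic: keep the first `top_k` distinct words that are long enough.
        let mut seen: Vec<String> = Vec::new();
        for word in jieba.cut(sentence) {
            let long_enough = word.chars().count() >= self.min_keyword_length;
            if long_enough && !seen.iter().any(|s| s.as_str() == word) {
                seen.push(word.to_string());
            }
        }
        seen.truncate(top_k);
        seen
    }
}

/// Hypothetical handle a foreign-language binding might keep alive between calls.
struct Session {
    jieba: Arc<Jieba>,
    extractor: TextRank,
}

impl Session {
    fn keywords(&self, sentence: &str, top_k: usize) -> Vec<String> {
        // `&Arc<Jieba>` deref-coerces to `&Jieba` at the call site.
        self.extractor.extract_keywords(&self.jieba, sentence, top_k)
    }
}

fn main() {
    let shared = Arc::new(Jieba);
    let session = Session {
        jieba: Arc::clone(&shared),
        extractor: TextRank { min_keyword_length: 2 },
    };
    drop(shared); // the session's own refcount keeps the segmenter alive
    println!("{:?}", session.keywords("to be or not to be", 2));
}
```

With the old construction-time borrow, a struct like Session would have needed to be self-referential or to leak a reference; passing the segmenter per call sidesteps that.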
Since this is an API-breaking change, the PR also cleans up the API style. This brings the API more in line with Rust conventions per
https://rust-lang.github.io/api-guidelines/interoperability.html#types-eagerly-implement-common-traits-c-common-traits
Specifically:
CAVEAT: In the new code, TextExtract defaults to NOT using hmm in the Jieba cutting. In the old code, TextExtract had hmm on and TFIDF had it off. This inconsistency looks like an oversight.
fixes #99