Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break lifetime entanglement of TextExtract, TFIDF and Jieba #100

Merged
merged 12 commits into from
Apr 11, 2024

Conversation

awong-dev
Copy link
Collaborator

@awong-dev awong-dev commented Apr 4, 2024

Modify KeywordExtract to take a Jieba instance during the keyword extraction call instead of during construction. Remove the Jieba reference and the lifetimes from struct TextRank as well as struct TFIDF.

This is desirable in situations where the extractor instance might outlive a single function call.

One specific case where this comes up is in binding from other languages where the Jieba instance may be refcounted and shared between API calls in a way that depends on the other languages's calling semantics. In these situations, it is hard to have TextRank and TFIDF be bound to the Jieba instance via Rust's understanding of scoped-based lifetimes.

Luckily, the KeywordExtract interface only uses the Jieba instance on execution of the extraction and arguably, the API should just take the Jieba instance in (or even possibly the resulting segments) to reduce coupling of the structs. In this PR, the Jieba instance was moved from construction down to the function where it was used.

Since this is an API breaking change, the PR also cleans up the API style per This brings the API more in line with Rust conventions per

https://rust-lang.github.io/api-guidelines/interoperability.html#types-eagerly-implement-common-traits-c-common-traits

Specifically:

  • TFIDF was renamed TfIdf to follow Rust's capitalization conventions for acronyms.
  • KeywordExtract::extract_tags() was renamed to KeywordExtract::extract_keywords()
  • A KeywordExtractConfig object was created holding common code from both TextRank and TfIdf.
  • {TextExtract,TfIdf}::new() no longer encodes defaults and just the algorithm parameters as well as a KeywordExtractConfig object
  • TextExtract and TfIdf both implement the Default trait and expose a default() method with the default configuration.
  • API tests moved into doc tests further collocating code with test.

CAVEAT: In the new code, TextExtract defaults to NOT use hmm in the Jieba cutting. In the old code TextExtract had hmm on and TFIDF had hmm had it off. This inconsistency looks like an oversight.

fixes #99

@awong-dev
Copy link
Collaborator Author

This a more concrete example of what I was thinking in #99. I don't really love it...but wanted to throw it out there for consideration.

Having gotten this far, I wonder maybe one could try adding a

pub trait JiebaKeywordExtract {
    fn extract_tags(&self, &jieba: Jieba, sentence: &str, top_k: usize, allowed_pos: Vec<String>) -> Vec<Keyword>;
}

moving all of the impl over to two TFIDFNoJieba and TextExtractNoJieba structs w/o the Jieba instance. And then maintaining the existing API by turning TFIDF<'a> and TextExtract<'a> into thin facades that just contain the Jieba reference and an instance of one of the above.

The final result would be a 2 APIs, one where &Jieba is bound on new(), and one where it is bound on run...

This creates UnboundTextExtract and UnboundTFIDF struct that
implement a new JiebaKeywordExtract trait.

Unlike KeywordExtract, the JiebaKeywordExtract takes a Jieba struct in the
keyword_extract() call. This enables instantiation of UnboundTFIDF and
UnboundTextExtract without a Jieba instance which lets them have separate
lifetimes. For loading custom stop words or IDF dictionaries, this can
avoid unnecessary object initialization costs.

The original TextExtract<'a> and TFIDF<'a> stucts become convenience
facades over the Unbound variants leaving the public API stable.

The Unbound vairants also implement the Default trait allow their
new() methods to be more verbose. This in turn allows construction
of empty variants of the objects without picking up the cost of
cloning the default state just to overwrite it later in a load_dict()
or set_stop_words() call.
@awong-dev awong-dev changed the title WIPR: Add TFIDFState + APIs to avoid repeat loading of TFIDF data WIPR: Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba Apr 7, 2024
@awong-dev awong-dev changed the title WIPR: Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba Apr 7, 2024
@awong-dev
Copy link
Collaborator Author

awong-dev commented Apr 7, 2024

@messense I think this is ready for review now if you have a sec. The CI failures seem to be due to incorrect access tokens in github...

@messense
Copy link
Owner

messense commented Apr 7, 2024

Thanks! I'll take a look next week.

@messense messense self-requested a review April 7, 2024 12:41
Exposes all the configuration assumptions of TFIDF and TextRank
so they can be inspected and modified by the user.

Adds doc tests showing basic usage.
@awong-dev
Copy link
Collaborator Author

Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.

@messense
Copy link
Owner

messense commented Apr 8, 2024

Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.

It's fine to introduce breaking changes, a semver bump isn't a big issue.

@awong-dev
Copy link
Collaborator Author

Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.

It's fine to introduce breaking changes, a semver bump isn't a big issue.

Oh! In that case, hold off on review for a bit. Let me just replace the Old APIs instead of introducing awkward names like JiebaKeywordExtract, etc.

And if we're going to do a semver bump, before publishing, let me also try to make the hmm replaceable (either by completing the other open PR or writing a new one).

Give me a week. Will ping again.

New APIs do not require binding a Jieba on construction allowing
independent lifetime management and preservation of state.
Also clean up some of the unsafe call syntax.
@awong-dev
Copy link
Collaborator Author

awong-dev commented Apr 9, 2024

Okay, I think I have everything stable now.

The capi has NOT been updated to reflect the newer more flexible API. It just preserves the old semantics.

If we are okay continuing down this path, I think there are some follow-up PRs that would

  1. Add a Default trait for Jieba and make Jieba::new() not assume any defaults.
  2. Extract and Hmm struct out of Jieba and make cut take an Option<&Hmm> parameter instead of true/false
  3. Update capi to reflect the new strcuture.

This would be a cohesive api restructuring in one semver bump.

Copy link
Owner

@messense messense left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run cargo fmt to format code.

src/keywords/mod.rs Outdated Show resolved Hide resolved
@awong-dev awong-dev changed the title Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba Break lifetime entanglement of TextExtract, TFIDF and Jieba Apr 11, 2024
@awong-dev
Copy link
Collaborator Author

awong-dev commented Apr 11, 2024

Please run cargo fmt to format code.

Done though it didn't marke the pub(cargo) thing...just the change in jieba_benchmark.rs

I also realized the PR title + summary were completely out of date so I rewrote them.

/// Creates a KeywordExtractConfig state that contains filter criteria as
/// well as segmentation configuration for use by keyword extraction
/// implementations.
pub fn new(stop_words: BTreeSet<String>, min_keyword_length: usize, use_hmm: bool) -> Self {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove getters (not really useful) and use builder pattern for KeywordExtractConfig, like

let config = KeywordExtractConfig::builder()
    .add_stop_word("word")
    .use_hmm(true)
    // and other options
    .build();

or without a separate builder type:

let config = KeywordExtractConfig::default()
    .add_stop_word("word")
    .use_hmm(true)
    // and other options
    ;

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change this later, does not need to block merging this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #101 to track.

src/keywords/mod.rs Outdated Show resolved Hide resolved
@messense messense merged commit eec9f7a into messense:main Apr 11, 2024
5 of 7 checks passed
@messense
Copy link
Owner

Thank you for your contribution, I've sent you an invitation to collaborate on this project.

@awong-dev
Copy link
Collaborator Author

Thank you!!

For the getters, filing an issue. I kinda wanna mess with completing that custom HMM extraction in the other open PR first, then attempt to bind these all into elixir to see what happens.

I worry that w/o the getters, it might be harder to query the configuration into other languages without having to duplicate (and synchronize) the state.

@awong-dev awong-dev deleted the KeywordState branch April 11, 2024 05:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make APIs for TFIDF and TextRank that do NOT take a reference to Jieba?
2 participants