Break lifetime entanglement of TextExtract, TFIDF and Jieba #100

awong-dev · 2024-04-04T21:52:41Z

Modify KeywordExtract to take a Jieba instance during the keyword extraction call instead of during construction. Remove the Jieba reference and the lifetimes from struct TextRank as well as struct TFIDF.

This is desirable in situations where the extractor instance might outlive a single function call.

One specific case where this comes up is in binding from other languages where the Jieba instance may be refcounted and shared between API calls in a way that depends on the other languages's calling semantics. In these situations, it is hard to have TextRank and TFIDF be bound to the Jieba instance via Rust's understanding of scoped-based lifetimes.

Luckily, the KeywordExtract interface only uses the Jieba instance on execution of the extraction and arguably, the API should just take the Jieba instance in (or even possibly the resulting segments) to reduce coupling of the structs. In this PR, the Jieba instance was moved from construction down to the function where it was used.

Since this is an API breaking change, the PR also cleans up the API style per This brings the API more in line with Rust conventions per

https://rust-lang.github.io/api-guidelines/interoperability.html#types-eagerly-implement-common-traits-c-common-traits

Specifically:

TFIDF was renamed TfIdf to follow Rust's capitalization conventions for acronyms.
KeywordExtract::extract_tags() was renamed to KeywordExtract::extract_keywords()
A KeywordExtractConfig object was created holding common code from both TextRank and TfIdf.
{TextExtract,TfIdf}::new() no longer encodes defaults and just the algorithm parameters as well as a KeywordExtractConfig object
TextExtract and TfIdf both implement the Default trait and expose a default() method with the default configuration.
API tests moved into doc tests further collocating code with test.

CAVEAT: In the new code, TextExtract defaults to NOT use hmm in the Jieba cutting. In the old code TextExtract had hmm on and TFIDF had hmm had it off. This inconsistency looks like an oversight.

fixes #99

awong-dev · 2024-04-04T21:57:08Z

This a more concrete example of what I was thinking in #99. I don't really love it...but wanted to throw it out there for consideration.

Having gotten this far, I wonder maybe one could try adding a

pub trait JiebaKeywordExtract {
    fn extract_tags(&self, &jieba: Jieba, sentence: &str, top_k: usize, allowed_pos: Vec<String>) -> Vec<Keyword>;
}

moving all of the impl over to two TFIDFNoJieba and TextExtractNoJieba structs w/o the Jieba instance. And then maintaining the existing API by turning TFIDF<'a> and TextExtract<'a> into thin facades that just contain the Jieba reference and an instance of one of the above.

The final result would be a 2 APIs, one where &Jieba is bound on new(), and one where it is bound on run...

This creates UnboundTextExtract and UnboundTFIDF struct that implement a new JiebaKeywordExtract trait. Unlike KeywordExtract, the JiebaKeywordExtract takes a Jieba struct in the keyword_extract() call. This enables instantiation of UnboundTFIDF and UnboundTextExtract without a Jieba instance which lets them have separate lifetimes. For loading custom stop words or IDF dictionaries, this can avoid unnecessary object initialization costs. The original TextExtract<'a> and TFIDF<'a> stucts become convenience facades over the Unbound variants leaving the public API stable. The Unbound vairants also implement the Default trait allow their new() methods to be more verbose. This in turn allows construction of empty variants of the objects without picking up the cost of cloning the default state just to overwrite it later in a load_dict() or set_stop_words() call.

awong-dev · 2024-04-07T06:39:15Z

@messense I think this is ready for review now if you have a sec. The CI failures seem to be due to incorrect access tokens in github...

messense · 2024-04-07T12:41:11Z

Thanks! I'll take a look next week.

Exposes all the configuration assumptions of TFIDF and TextRank so they can be inspected and modified by the user. Adds doc tests showing basic usage.

awong-dev · 2024-04-08T08:28:19Z

Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.

messense · 2024-04-08T08:33:31Z

Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.

It's fine to introduce breaking changes, a semver bump isn't a big issue.

awong-dev · 2024-04-08T08:37:52Z

Note there is a behavior change with TextExtract. I can undo it, but honestly the more I look at this, the more I wonder if it'd be worth it to break API compat and remove the older APIs.

It's fine to introduce breaking changes, a semver bump isn't a big issue.

Oh! In that case, hold off on review for a bit. Let me just replace the Old APIs instead of introducing awkward names like JiebaKeywordExtract, etc.

And if we're going to do a semver bump, before publishing, let me also try to make the hmm replaceable (either by completing the other open PR or writing a new one).

Give me a week. Will ping again.

New APIs do not require binding a Jieba on construction allowing independent lifetime management and preservation of state.

Also clean up some of the unsafe call syntax.

awong-dev · 2024-04-09T07:47:40Z

Okay, I think I have everything stable now.

The capi has NOT been updated to reflect the newer more flexible API. It just preserves the old semantics.

If we are okay continuing down this path, I think there are some follow-up PRs that would

Add a Default trait for Jieba and make Jieba::new() not assume any defaults.
Extract and Hmm struct out of Jieba and make cut take an Option<&Hmm> parameter instead of true/false
Update capi to reflect the new strcuture.

This would be a cohesive api restructuring in one semver bump.

messense

Please run cargo fmt to format code.

src/keywords/mod.rs

awong-dev · 2024-04-11T04:00:35Z

Please run cargo fmt to format code.

Done though it didn't marke the pub(cargo) thing...just the change in jieba_benchmark.rs

I also realized the PR title + summary were completely out of date so I rewrote them.

messense · 2024-04-11T04:53:11Z

src/keywords/mod.rs

+    /// Creates a KeywordExtractConfig state that contains filter criteria as
+    /// well as segmentation configuration for use by keyword extraction
+    /// implementations.
+    pub fn new(stop_words: BTreeSet<String>, min_keyword_length: usize, use_hmm: bool) -> Self {


I'd remove getters (not really useful) and use builder pattern for KeywordExtractConfig, like

let config = KeywordExtractConfig::builder() .add_stop_word("word") .use_hmm(true) // and other options .build();

or without a separate builder type:

let config = KeywordExtractConfig::default() .add_stop_word("word") .use_hmm(true) // and other options ;

We can change this later, does not need to block merging this PR.

Added #101 to track.

src/keywords/mod.rs

messense · 2024-04-11T05:00:13Z

Thank you for your contribution, I've sent you an invitation to collaborate on this project.

awong-dev · 2024-04-11T05:34:22Z

Thank you!!

For the getters, filing an issue. I kinda wanna mess with completing that custom HMM extraction in the other open PR first, then attempt to bind these all into elixir to see what happens.

I worry that w/o the getters, it might be harder to query the configuration into other languages without having to duplicate (and synchronize) the state.

Add TFIDFState + APIs to avoid repeat loading of TFIDF data

538b99c

awong-dev force-pushed the KeywordState branch from 3fac582 to b40afc8 Compare April 7, 2024 06:23

clippy fix: remove unnecessary borrow.

8de5f29

awong-dev changed the title ~~WIPR: Add TFIDFState + APIs to avoid repeat loading of TFIDF data~~ WIPR: Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba Apr 7, 2024

awong-dev changed the title ~~WIPR: Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba~~ Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba Apr 7, 2024

messense self-requested a review April 7, 2024 12:41

awong-dev added 2 commits April 8, 2024 00:39

Add docs + tests and clean up API.

d6f8fb7

Exposes all the configuration assumptions of TFIDF and TextRank so they can be inspected and modified by the user. Adds doc tests showing basic usage.

DRY-up API/impl by extracting a KeywordExtractConfig struct

350fa11

awong-dev force-pushed the KeywordState branch from c24cddc to 350fa11 Compare April 8, 2024 08:15

awong-dev added 2 commits April 8, 2024 01:18

Update TextRank expecation to NOT use hmm

7074953

use jieba_rs::{} in doctests

dad8903

awong-dev added 2 commits April 8, 2024 19:49

Replace older {TFIDF,TextRank}<'a> APIs.

20169b0

New APIs do not require binding a Jieba on construction allowing independent lifetime management and preservation of state.

Fix capi to work with new KeywordExtract API.

afe8124

Also clean up some of the unsafe call syntax.

awong-dev force-pushed the KeywordState branch from 8b7b995 to afe8124 Compare April 9, 2024 07:33

Fix benchmarks

2788022

messense reviewed Apr 11, 2024

View reviewed changes

src/keywords/mod.rs Outdated Show resolved Hide resolved

awong-dev changed the title ~~Add JiebaKeywordExtract trait and Unbound{TextExtract,TFIDF} structs to break lifetime entanglement with Jieba~~ Break lifetime entanglement of TextExtract, TFIDF and Jieba Apr 11, 2024

Scope KeywordExtractConfig::filter() to create. Run cargo fmt

427c533

messense reviewed Apr 11, 2024

View reviewed changes

src/keywords/mod.rs Outdated Show resolved Hide resolved

Update src/keywords/mod.rs

6e53cf2

messense approved these changes Apr 11, 2024

View reviewed changes

messense merged commit eec9f7a into messense:main Apr 11, 2024
5 of 7 checks passed

awong-dev deleted the KeywordState branch April 11, 2024 05:34

awong-dev mentioned this pull request Apr 13, 2024

Add doctest support into coverage #105

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Break lifetime entanglement of TextExtract, TFIDF and Jieba #100

Break lifetime entanglement of TextExtract, TFIDF and Jieba #100

awong-dev commented Apr 4, 2024 •

edited

Loading

awong-dev commented Apr 4, 2024

awong-dev commented Apr 7, 2024 •

edited

Loading

messense commented Apr 7, 2024

awong-dev commented Apr 8, 2024

messense commented Apr 8, 2024

awong-dev commented Apr 8, 2024

awong-dev commented Apr 9, 2024 •

edited

Loading

messense left a comment

awong-dev commented Apr 11, 2024 •

edited

Loading

messense Apr 11, 2024

messense Apr 11, 2024

awong-dev Apr 11, 2024

messense commented Apr 11, 2024

awong-dev commented Apr 11, 2024

Break lifetime entanglement of TextExtract, TFIDF and Jieba #100

Break lifetime entanglement of TextExtract, TFIDF and Jieba #100

Conversation

awong-dev commented Apr 4, 2024 • edited Loading

awong-dev commented Apr 4, 2024

awong-dev commented Apr 7, 2024 • edited Loading

messense commented Apr 7, 2024

awong-dev commented Apr 8, 2024

messense commented Apr 8, 2024

awong-dev commented Apr 8, 2024

awong-dev commented Apr 9, 2024 • edited Loading

messense left a comment

Choose a reason for hiding this comment

awong-dev commented Apr 11, 2024 • edited Loading

messense Apr 11, 2024

Choose a reason for hiding this comment

messense Apr 11, 2024

Choose a reason for hiding this comment

awong-dev Apr 11, 2024

Choose a reason for hiding this comment

messense commented Apr 11, 2024

awong-dev commented Apr 11, 2024

awong-dev commented Apr 4, 2024 •

edited

Loading

awong-dev commented Apr 7, 2024 •

edited

Loading

awong-dev commented Apr 9, 2024 •

edited

Loading

awong-dev commented Apr 11, 2024 •

edited

Loading