-
Notifications
You must be signed in to change notification settings - Fork 771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align Logger Output with Documented Pipeline #2192
Comments
Agreed. If things are unclear because I unnecessarily tried to simplify things, then we need to be explicit instead and more inline with the documentation.
Sounds good! I suggest we keep all logging before the representation steps and improve the logging at the representation level. I like separating the c-TF-IDF from additional representations (although c-TF-IDF can also be an additional representation/aspect).
I generally try to refer to a "topic" as the output of any representation, whether that is c-TF-IDF. Implicitly, that does contain the output of the clustering but is not sufficient to me. So technically, I would say at the end of the c-TF-IDF where I attempt to extract topics from clusters.
But yeah, this definitely makes sense here since the definition of a "topic" has also shifted a bit with the introduction of additional representations, multimodal topic modeling, etc. All in all, sounds good to me! Thank you for considering this and taking the time 😄 |
So I made a draft now. The logging looks like
And it can get very verbose. For example, when reducing topics:
This may be an issue for you but for me is very handy, otherwise I would just set verbose to false. The logger is also aligned more closely to where the process is actually happening in the code. I did not modify the earlier stages though (embeddings, dim. red, and clustering) |
@pipa666 Thanks! There are a couple of things that I would prefer to be removed from the logging in order to prevent too much logging. I want to make sure there is a balance so that users also do not get information overload. Let's start with this one:
I would prefer to combine the first two Bag of Words loggers. Simply stating that the Moreover, I think I would prefer removing the "Setting multiplier for c-TF-IDF" since that is an almost instant procedure. As a result, it would look like this:
I'm also wondering whether we should add something like:
but that, in turn, makes the logging quite a lot... What do you think? |
I agree, indeed I had some struggles on the vectorizer part, as its very simple and I dont think many people customize that part (I really don't know). Your suggestion makes more sense, since it aligns with the docs. Regarding the Regarding the On the other side, if we want to reduce the logging, then I would suggest removing these statements from everywhere. I assume that if any process is not completed, then a python error will raise, so the statements are a bit unnecessary. I like their aesthetic though. |
Let's remove that part. I feel like this is something most users would not want.
Let's skip this for now. I might remove the completed statements at some point since you can already understand how long something took by the additional logging that will now be added.
I had the same idea. I like the aesthetic and it helps break up individual pieces. I would say, let's leave them as is and not add more. |
Feature request
Following some reflection based on a closed PR, I want to suggest a change in the logger.info messages to align with the documentation and modular pipeline of the library.
Context: In the current implementation, the code's logging verbosity reflects a simplified version of the pipeline:
Embeddings -> Dim. Red. -> Clustering -> Representation
However, in the documentation and associated research papers, the pipeline is presented with more detailed later stages:
Embeddings -> Dim. Red. -> Clustering -> Bag of Words (vectorizer + c-TF-IDF) -> Fine Tuning Representations
This mismatch between the actual implementation steps and the documented process can create confusion, especially for non-CS researchers (like myself) or contributors who rely on logging to understand the modularity and steps of the implementation.
Problem Description:
There’s ambiguity in the logging related to the term “Representation,” which might be interpreted differently depending on the pipeline step.
The pipeline section involving the c-TF-IDF is currently not logged at all, despite being a core part of the pipeline.
The logging output skips from:
Proposed Solution: To better reflect the detailed process, I propose to align it with the current documentation. A more detailed logging flow might look like:
Conceptual Clarification: It also comes to play that some concepts are still ambiguous an can be approached differently. For example, the definition of “topic” can be seen in two ways:
As such, the statement of "extracting topics from" is not a precise term, so I would avoid it in this new scheme, unless there is a clear cut point where topics are "defined".
Given that the c-TF-IDF step plays a foundational role in refining topic representations, I suggest treating it as a separate and essential component in the pipeline and reflecting this in the logger output.
Motivation
I want to improve this library, since I will probably be using during my research (already used it for a paper).
Your contribution
I can implement the solution, if agreed. I will probably start doing so for my own purposes, so a commit id will be available here for observation/comments when done.
The text was updated successfully, but these errors were encountered: