Naive Bayes #138
Conversation
> ⚠️ Review failed: the pull request is closed.

**Walkthrough**

This pull request is a comprehensive update to the Naive Bayes classification algorithms in the project. It adds documentation for the Naive Bayes algorithm, updates the project configuration with new dependencies, and introduces three new classifier implementations: Categorical, Gaussian, and Multinomial Naive Bayes. New test files validate these classifiers against established benchmarks.

Changes
Possibly related PRs
**Codecov Report**

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #138      +/-   ##
==========================================
+ Coverage   97.86%   98.31%   +0.45%
==========================================
  Files          11       15       +4
  Lines         795     1008     +213
==========================================
+ Hits          778      991     +213
  Misses         17       17
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
toyml/classification/naive_bayes.py (5)
**8-19**: Consider marking the fitted attributes as `init=False`.

The class attributes are fitted state rather than constructor parameters. Marking them with `field(..., init=False)` makes this explicit, keeps them out of the generated `__init__`, and improves readability and maintainability.

```diff
 @dataclass
 class GaussianNaiveBayes:
-    labels_: list[int] = field(default_factory=list)
+    labels_: list[int] = field(default_factory=list, init=False)
     """The labels in training dataset"""
-    class_count_: int = 0
+    class_count_: int = field(default=0, init=False)
     """The number of classes in training dataset"""
-    class_prior_: dict[int, float] = field(default_factory=dict)
+    class_prior_: dict[int, float] = field(default_factory=dict, init=False)
     """The prior probability of each class in training dataset"""
-    means_: dict[int, list[float]] = field(default_factory=dict)
+    means_: dict[int, list[float]] = field(default_factory=dict, init=False)
     """The means of each class in training dataset"""
-    vars_: dict[int, list[float]] = field(default_factory=dict)
+    vars_: dict[int, list[float]] = field(default_factory=dict, init=False)
     """The variance of each class in training dataset"""
```
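As a quick illustration of the effect of this suggestion (a minimal standalone sketch, not the project's actual class), `field(..., init=False)` keeps fitted state out of the generated constructor:

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    # constructor parameter: part of the generated __init__
    name: str
    # fitted state: excluded from the generated __init__
    labels_: list[int] = field(default_factory=list, init=False)


m = Model("gnb")       # labels_ cannot be passed here; doing so raises TypeError
m.labels_ = [0, 1]     # set during fitting instead
print(m.labels_)
```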
**21-28**: Verify the `fit` method's return type.

The `fit` method is declared to return `GaussianNaiveBayes`, but it actually returns `self`. While this is technically correct, since `self` is an instance of `GaussianNaiveBayes`, it is more idiomatic to declare the return type as `None` for methods that modify the instance in place and do not return a new instance.

```diff
-    def fit(self, dataset: list[list[float]], labels: list[int]) -> GaussianNaiveBayes:
+    def fit(self, dataset: list[list[float]], labels: list[int]) -> None:
         """Fit the naive bayes model"""
         self.labels_ = sorted(set(labels))
         self.class_count_ = len(set(labels))
         self.class_prior_ = {label: 1 / self.class_count_ for label in self.labels_}
         self.means_ = self._get_classes_means(dataset, labels)
         self.vars_ = self._get_classes_vars(dataset, labels)
-        return self
```

🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 23-28: toyml/classification/naive_bayes.py#L23-L28
Added lines #L23 - L28 were not covered by tests
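For context on the trade-off (an illustrative sketch, independent of the PR's code): returning `self` is also a common convention (scikit-learn estimators do it) because it allows fit-then-predict chaining in a single expression, which the `-> None` variant would rule out:

```python
class Chainable:
    """Toy estimator showing why fit() returning self can be convenient."""

    def fit(self, xs: list[int]) -> "Chainable":
        self.total = sum(xs)
        return self  # enables Chainable().fit(...).predict() chaining

    def predict(self) -> int:
        return self.total


# training and prediction chained in one expression
result = Chainable().fit([1, 2, 3]).predict()
print(result)  # 6
```

Whether to annotate `-> None` or `-> GaussianNaiveBayes` is therefore a design choice; the review's suggestion favors the in-place-mutation reading.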
**30-38**: Refactor the `predict` method to improve readability.

The `predict` method can be refactored to improve readability and reduce the number of lines of code. Here is a suggested refactoring:

```diff
     def predict(self, sample: list[float]) -> int:
         label_likelihoods = self._likelihood(sample)
-        raw_label_posteriors: dict[int, float] = {}
-        for label, likelihood in label_likelihoods.items():
-            raw_label_posteriors[label] = likelihood * self.class_prior_[label]
+        raw_label_posteriors = {
+            label: likelihood * self.class_prior_[label]
+            for label, likelihood in label_likelihoods.items()
+        }
         evidence = sum(raw_label_posteriors.values())
-        label_posteriors = {label: raw_posterior / evidence for label, raw_posterior in raw_label_posteriors.items()}
-        label = max(label_posteriors, key=lambda k: label_posteriors[k])
-        return label
+        label_posteriors = {label: raw_posterior / evidence for label, raw_posterior in raw_label_posteriors.items()}
+        return max(label_posteriors, key=label_posteriors.get)
```

🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 31-38: toyml/classification/naive_bayes.py#L31-L38
Added lines #L31 - L38 were not covered by tests
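A small standalone sketch of the argmax-over-dict idiom the refactor relies on (illustrative values, not the PR's data):

```python
# picking the key with the largest value, as in the suggested refactor
posteriors = {0: 0.2, 1: 0.7, 2: 0.1}

# lambda form (original code)
best_lambda = max(posteriors, key=lambda k: posteriors[k])

# dict.get form (suggested refactor) - equivalent at runtime, slightly terser
best_get = max(posteriors, key=posteriors.get)

print(best_lambda, best_get)  # 1 1
```

One caveat: `dict.get` returns an optional value, so strict type checkers may flag the `key=posteriors.get` form even though it runs correctly; the lambda form is the type-checker-friendly spelling.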
**40-55**: Address the TODO comment and consider using log-likelihood.

The TODO comment suggests calculating the log-likelihood instead of the likelihood. This is a good suggestion because it avoids underflow when many very small probabilities are multiplied together. Here is how the `_likelihood` method could be modified to compute the log-likelihood:

```diff
     def _likelihood(self, sample: list[float]) -> dict[int, float]:
         """
         Calculate the likelihood of each sample in each class
         """
         label_likelihoods: dict[int, float] = {}
         for label in self.labels_:
             label_means = self.means_[label]
             label_vars = self.vars_[label]
-            likelihood = 1.0
+            log_likelihood = 0.0
             for i, xi in enumerate(sample):
-                # TODO: try to calculate the log-likelihood
-                likelihood *= (1 / math.sqrt(2 * math.pi * label_vars[i])) * math.exp(
-                    -((xi - label_means[i]) ** 2) / (2 * label_vars[i])
-                )
+                log_likelihood += math.log(
+                    1 / math.sqrt(2 * math.pi * label_vars[i])
+                ) - ((xi - label_means[i]) ** 2) / (2 * label_vars[i])
-            label_likelihoods[label] = likelihood
+            label_likelihoods[label] = log_likelihood
         return label_likelihoods
```

Note that this change also affects `predict`: the prior would then need to be added as `math.log(self.class_prior_[label])` rather than multiplied, and the posterior normalization done in log space (e.g. via log-sum-exp).

🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 44-49: toyml/classification/naive_bayes.py#L44-L49
Added lines #L44 - L49 were not covered by tests
[warning] 51-51: toyml/classification/naive_bayes.py#L51
Added line #L51 was not covered by tests
[warning] 54-55: toyml/classification/naive_bayes.py#L54-L55
Added lines #L54 - L55 were not covered by tests
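To see concretely why the log-space suggestion matters (an illustrative sketch, independent of the PR's code): multiplying many small densities underflows float64 to exactly 0.0, while summing their logs stays finite:

```python
import math

p = 1e-20  # a single small per-feature density
n = 20     # number of features

# direct product: 1e-400 is below float64's range, so it underflows to 0.0
product = 1.0
for _ in range(n):
    product *= p

# log-space sum: a perfectly representable finite number
log_sum = sum(math.log(p) for _ in range(n))

print(product)  # 0.0
print(log_sum)  # about -921.03
```

Once the likelihood underflows to 0.0 for every class, the posterior normalization divides by zero and argmax becomes meaningless; the log-space version has no such failure mode.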
**72-89**: Address the TODO comment and handle the single-sample variance case.

The TODO comment suggests handling the simple sample variance case. This is important because the current implementation of the `_get_classes_vars` method uses the unbiased sample variance formula, which divides by `n - 1` rather than `n`; when a class has only one sample, this divides by zero. Here is one way to modify the method to handle that case:

```diff
         variances = {
             label: [
-                dimension_sum_of_square / (label_count[label] - 1)
+                dimension_sum_of_square / label_count[label]
+                if label_count[label] == 1
+                else dimension_sum_of_square / (label_count[label] - 1)
                 for dimension_sum_of_square in dimension_sum_of_squares
             ]
             for label, dimension_sum_of_squares in label_dimension_sum_of_squares.items()
         }
```

🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 73-73: toyml/classification/naive_bayes.py#L73
Added line #L73 was not covered by tests
[warning] 75-80: toyml/classification/naive_bayes.py#L75-L80
Added lines #L75 - L80 were not covered by tests
[warning] 82-82: toyml/classification/naive_bayes.py#L82
Added line #L82 was not covered by tests
[warning] 89-89: toyml/classification/naive_bayes.py#L89
Added line #L89 was not covered by tests
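A minimal standalone sketch of the edge case (hypothetical helper, not the PR's actual code): with a single sample, the unbiased `n - 1` denominator is zero, and the suggested guard falls back to dividing by `n`:

```python
def sample_variance(xs: list[float]) -> float:
    """Unbiased sample variance, guarded against the single-sample case."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    if n == 1:
        return ss / n  # guard from the suggested diff: yields 0.0
    return ss / (n - 1)


print(sample_variance([2.0, 4.0]))  # 2.0
print(sample_variance([3.0]))       # 0.0 instead of ZeroDivisionError
```

A variance of exactly 0.0 still breaks the Gaussian density downstream (it divides by the variance), so adding a small smoothing epsilon, as scikit-learn's `GaussianNB` does with `var_smoothing`, may be preferable in practice.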
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (1)

- toyml/classification/naive_bayes.py (1 hunks)
🧰 Additional context used

🪛 GitHub Check: codecov/patch

toyml/classification/naive_bayes.py — added lines not covered by tests: 23-28, 31-38, 44-49, 51, 54-55, 58, 60-66, 70, 73, 75-80, 82, 89
🔇 Additional comments (2)

toyml/classification/naive_bayes.py (2)

**1-6**: LGTM!

The imports are relevant and necessary for the implementation.

**57-70**: LGTM!

The `_get_classes_means` method is implemented correctly and follows the expected logic for calculating the means of each class.
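For readers following along, the per-class mean computation that this comment approves can be sketched as follows (a hypothetical standalone version, not the PR's exact helper):

```python
from collections import defaultdict


def classes_means(dataset: list[list[float]], labels: list[int]) -> dict[int, list[float]]:
    """Dimension-wise mean of the rows belonging to each class label."""
    n_dims = len(dataset[0])
    sums: dict[int, list[float]] = defaultdict(lambda: [0.0] * n_dims)
    counts: dict[int, int] = defaultdict(int)
    for row, label in zip(dataset, labels):
        counts[label] += 1
        for i, x in enumerate(row):
            sums[label][i] += x
    return {label: [s / counts[label] for s in dim_sums] for label, dim_sums in sums.items()}


print(classes_means([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 0, 1]))
# {0: [2.0, 3.0], 1: [5.0, 6.0]}
```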
Closes #121
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation