
Merge pull request #559 from mindsdb/staging
Release 1.3.0
paxcema authored Oct 7, 2021
2 parents 824862c + 31232c3 commit 96b5849
Showing 70 changed files with 2,002 additions and 1,219 deletions.
4 changes: 3 additions & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -1,7 +1,7 @@
---
name: Bug report
about: Create a report to help us improve
labels:
labels: Bug
---

## Your Environment
@@ -13,3 +13,5 @@ labels:


## How can we replicate it?
* What dataset did you use (link to it please)
* What was the code you ran
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,5 @@
---
name: Question
about: Ask a question
labels: question
---
8 changes: 8 additions & 0 deletions .github/ISSUE_TEMPLATE/suggestion.md
@@ -0,0 +1,8 @@
---
name: Suggestion
about: Suggest a feature, improvement, doc change, etc.
labels: enhancement
---



43 changes: 34 additions & 9 deletions CONTRIBUTING.md
@@ -10,26 +10,51 @@ We love to receive contributions from the community and hear your opinions! We w
* Submit a bug fix
* Propose new features
* Test Lightwood
* Solve an issue

# Code contributions
In general, we follow the "fork-and-pull" Git workflow.

1. Fork the Lightwood repository
2. Clone the repository
3. Make changes and commit them
4. Push your local branch to your fork
5. Submit a Pull request so that we can review your changes
6. Write a commit message
7. Make sure that the CI tests are GREEN
2. Check out the `staging` branch; this is the development version, which gets released weekly
3. Make changes and commit them
4. Make sure that the CI tests pass
5. Submit a Pull request from your fork to the `staging` branch of mindsdb/lightwood so that we can review your changes

>NOTE: Be sure to merge the latest from "upstream" before making a pull request!
> You will need to sign a CLA (Contributor License Agreement) for the code, since lightwood is under a GPL license
> Be sure to merge the latest from `staging` before making a pull request!
> You can run the test suite locally by running `flake8 .` to check style and `python -m unittest discover tests` to run the automated tests. This doesn't guarantee it will pass remotely, since we run on multiple environments, but it should work in most cases.

# Feature and Bug reports
We use GitHub issues to track bugs and features. Report them by opening a [new issue](https://github.com/mindsdb/lightwood/issues/new/choose) and filling out all of the required fields.

# Code review process
The Pull Request reviews are done on a regular basis.
Please, make sure you respond to our feedback/questions.

If your change could affect performance, we will run our private benchmark suite to validate it.

Please make sure you respond to our feedback and questions.

# Community
If you have additional questions or you want to chat with MindsDB core team, you can join our community [![Discourse posts](https://img.shields.io/discourse/posts?server=https%3A%2F%2Fcommunity.mindsdb.com%2F)](https://community.mindsdb.com/). To get updates on MindsDB’s latest announcements, releases, and events, [sign up for our newsletter](https://mindsdb.us20.list-manage.com/subscribe/post?u=5174706490c4f461e54869879&id=242786942a).
If you have additional questions or want to chat with the MindsDB core team, you can join our community Slack.

# Setting up a dev environment

- Clone lightwood
- `cd lightwood && pip install -r requirements.txt`
- Add it to your python path (e.g. by adding `export PYTHONPATH="/where/you/cloned/lightwood:$PYTHONPATH"` as a new line at the end of your `~/.bashrc` file)
- Check that the unit tests are passing by going into the directory where you cloned lightwood and running: `python -m unittest discover tests`

> If `python` defaults to Python 2.x in your environment, use `python3` and `pip3` instead

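As a minimal illustration of what the discovery step above does (using a hypothetical dummy test case rather than lightwood's real suite), `python -m unittest discover tests` essentially collects `unittest.TestCase` classes and runs them with a text runner:

```python
import unittest

# Hypothetical stand-in for one of the real test cases under tests/
class DummyEncoderTest(unittest.TestCase):
    def test_roundtrip(self):
        values = [1, 2, 3]
        self.assertEqual(list(reversed(list(reversed(values)))), values)

# `python -m unittest discover tests` does roughly this: collect test
# cases (here, a single one), then run them and report the result
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DummyEncoderTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True if every test passed
```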
## Setting up a vscode environment

Currently, the preferred environment for working with lightwood is vscode, a very popular Python IDE. However, any IDE should work; while we don't have guides for others, you can use the following as a template.

* Install and enable Settings Sync using your GitHub account (if you use multiple machines)
* Install Pylance (for type checking) and make sure to disable Pyright
* Go to `Python > Lint: Enabled` and disable everything *but* flake8
* Set `python.linting.flake8Path` to the full path to flake8 (found via `which flake8`)
* Set `Python › Formatting: Provider` to autopep8
* Add `--global-config=<path_to>/lightwood/.flake8` and `--experimental` to `Python › Formatting: Autopep8 Args`
* Install Live Share and Live Share Whiteboard
11 changes: 0 additions & 11 deletions dev/README.md

This file was deleted.

2 changes: 0 additions & 2 deletions dev/requirements.txt

This file was deleted.

2 changes: 1 addition & 1 deletion lightwood/__about__.py
@@ -1,6 +1,6 @@
__title__ = 'lightwood'
__package_name__ = 'lightwood'
__version__ = '1.2.0'
__version__ = '1.3.0'
__description__ = "Lightwood is a toolkit for automatic machine learning model building"
__email__ = "[email protected]"
__author__ = 'MindsDB Inc'
12 changes: 10 additions & 2 deletions lightwood/analysis/__init__.py
@@ -1,4 +1,12 @@
from lightwood.analysis.model_analyzer import model_analyzer
# Base
from lightwood.analysis.analyze import model_analyzer
from lightwood.analysis.explain import explain

__all__ = ['model_analyzer', 'explain']
# Blocks
from lightwood.analysis.base import BaseAnalysisBlock
from lightwood.analysis.nc.calibrate import ICP
from lightwood.analysis.helpers.acc_stats import AccStats
from lightwood.analysis.helpers.feature_importance import GlobalFeatureImportance


__all__ = ['model_analyzer', 'explain', 'ICP', 'AccStats', 'GlobalFeatureImportance', 'BaseAnalysisBlock']
96 changes: 96 additions & 0 deletions lightwood/analysis/analyze.py
@@ -0,0 +1,96 @@
from typing import Dict, List, Tuple, Optional

from lightwood.api import dtype
from lightwood.ensemble import BaseEnsemble
from lightwood.analysis.base import BaseAnalysisBlock
from lightwood.data.encoded_ds import EncodedDs
from lightwood.encoder.text.pretrained import PretrainedLangEncoder
from lightwood.api.types import ModelAnalysis, StatisticalAnalysis, TimeseriesSettings


def model_analyzer(
predictor: BaseEnsemble,
data: EncodedDs,
train_data: EncodedDs,
stats_info: StatisticalAnalysis,
target: str,
ts_cfg: TimeseriesSettings,
dtype_dict: Dict[str, str],
accuracy_functions,
analysis_blocks: Optional[List[BaseAnalysisBlock]] = []
) -> Tuple[ModelAnalysis, Dict[str, object]]:
"""
Analyses the model on a validation subset to evaluate accuracy, estimate feature importance and generate a
calibration model for estimating confidence in future predictions.
Additionally, any user-specified analysis blocks (see class `BaseAnalysisBlock`) are also called here.
:return:
runtime_analyzer: This dictionary object gets populated in a sequential fashion with data generated from
any `.analyze()` block call. This dictionary object is stored in the predictor itself, and used when
calling the `.explain()` method of all analysis blocks when generating predictions.
model_analysis: `ModelAnalysis` object that contains core analysis metrics, not necessarily needed when predicting.
"""

runtime_analyzer = {}
data_type = dtype_dict[target]

# retrieve encoded data representations
encoded_train_data = train_data
encoded_val_data = data
data = encoded_val_data.data_frame
input_cols = list([col for col in data.columns if col != target])

# predictive task
is_numerical = data_type in (dtype.integer, dtype.float, dtype.array, dtype.tsarray, dtype.quantity)
is_classification = data_type in (dtype.categorical, dtype.binary)
is_multi_ts = ts_cfg.is_timeseries and ts_cfg.nr_predictions > 1
has_pretrained_text_enc = any([isinstance(enc, PretrainedLangEncoder)
for enc in encoded_train_data.encoders.values()])

# raw predictions for validation dataset
normal_predictions = predictor(encoded_val_data) if not is_classification else predictor(encoded_val_data,
predict_proba=True)
normal_predictions = normal_predictions.set_index(data.index)

# ------------------------- #
# Run analysis blocks, both core and user-defined
# ------------------------- #
kwargs = {
'predictor': predictor,
'target': target,
'input_cols': input_cols,
'dtype_dict': dtype_dict,
'normal_predictions': normal_predictions,
'data': data,
'train_data': train_data,
'encoded_val_data': encoded_val_data,
'is_classification': is_classification,
'is_numerical': is_numerical,
'is_multi_ts': is_multi_ts,
'stats_info': stats_info,
'ts_cfg': ts_cfg,
'accuracy_functions': accuracy_functions,
'has_pretrained_text_enc': has_pretrained_text_enc
}

for block in analysis_blocks:
runtime_analyzer = block.analyze(runtime_analyzer, **kwargs)

# ------------------------- #
# Populate ModelAnalysis object
# ------------------------- #
model_analysis = ModelAnalysis(
accuracies=runtime_analyzer['score_dict'],
accuracy_histogram=runtime_analyzer['acc_histogram'],
accuracy_samples=runtime_analyzer['acc_samples'],
train_sample_size=len(encoded_train_data),
test_sample_size=len(encoded_val_data),
confusion_matrix=runtime_analyzer['cm'],
column_importances=runtime_analyzer['column_importances'],
histograms=stats_info.histograms,
dtypes=dtype_dict
)

return model_analysis, runtime_analyzer
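The loop over `analysis_blocks` above threads a single `runtime_analyzer` dictionary through every block in sequence, so later blocks can read what earlier blocks stored. A standalone sketch of that accumulation pattern, with hypothetical dummy blocks instead of lightwood's real ones:

```python
from typing import Dict


class DummyBlock:
    """Hypothetical stand-in for an analysis block: it reads prior
    results from `info` and adds one entry of its own."""

    def __init__(self, key: str, value: object):
        self.key = key
        self.value = value

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        # Each block sees everything previous blocks stored in `info`
        info[self.key] = self.value
        return info


blocks = [
    DummyBlock('score_dict', {'acc': 0.9}),
    DummyBlock('cm', [[5, 0], [1, 4]]),
]

runtime_analyzer: Dict[str, object] = {}
for block in blocks:
    runtime_analyzer = block.analyze(runtime_analyzer, target='y')

print(sorted(runtime_analyzer))  # ['cm', 'score_dict']
```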
46 changes: 46 additions & 0 deletions lightwood/analysis/base.py
@@ -0,0 +1,46 @@
from typing import Tuple, Dict, Optional

import pandas as pd
from lightwood.helpers.log import log


class BaseAnalysisBlock:
"""Class to be inherited by any analysis/explainer block."""
def __init__(self,
deps: Optional[Tuple] = ()
):

self.dependencies = deps # can be parallelized when there are no dependencies @TODO enforce

def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
"""
This method should be called once during the analysis phase, or not called at all.
It computes any information that the block may either output to the model analysis object,
or use at inference time when `.explain()` is called (in this case, make sure all needed
objects are added to the runtime analyzer so that `.explain()` can access them).
:param info: Dictionary where any new information or objects are added. The next analysis block will use
the output of the previous block as a starting point.
:param kwargs: Dictionary with named variables from either the core analysis or the rest of the prediction
pipeline.
"""
log.info(f"{self.__class__.__name__}.analyze() has not been implemented, no modifications will be done to the model analysis.") # noqa
return info

def explain(self,
row_insights: pd.DataFrame,
global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
"""
This method should be called once during the explaining phase at inference time, or not called at all.
Additional explanations can be at an instance level (row-wise) or global.
For the former, return a data frame with any new insights. For the latter, a dictionary is required.
:param row_insights: dataframe with previously computed row-level explanations.
:param global_insights: dict() with any explanations that concern all predicted instances or the model itself.
:returns:
- row_insights: modified input dataframe with any new row insights added here.
- global_insights: dict() with any explanations that concern all predicted instances or the model itself.
"""
log.info(f"{self.__class__.__name__}.explain() has not been implemented, no modifications will be done to the data insights.") # noqa
return row_insights, global_insights
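To make the interface above concrete, here is a hypothetical custom block sketched as a standalone class. The name `PredictionRangeBlock`, the `out_of_range` column, and the kwargs passed in the usage lines are all invented for illustration; a real block would subclass `BaseAnalysisBlock` and receive its kwargs from the pipeline.

```python
from typing import Dict, Tuple

import pandas as pd


class PredictionRangeBlock:
    """Hypothetical block: analyze() records the prediction range seen on
    the validation data; explain() flags out-of-range predictions."""

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        preds = kwargs['normal_predictions']['prediction']
        # Store whatever explain() will need inside the runtime analyzer
        info['pred_min'] = float(preds.min())
        info['pred_max'] = float(preds.max())
        return info

    def explain(self, row_insights: pd.DataFrame,
                global_insights: Dict[str, object],
                **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
        info = kwargs['analysis']  # the dict populated by analyze()
        row_insights['out_of_range'] = (
            (row_insights['prediction'] < info['pred_min']) |
            (row_insights['prediction'] > info['pred_max'])
        )
        return row_insights, global_insights


block = PredictionRangeBlock()
info = block.analyze(
    {}, normal_predictions=pd.DataFrame({'prediction': [1.0, 2.0, 3.0]}))
rows, _ = block.explain(
    pd.DataFrame({'prediction': [0.5, 2.5]}), {}, analysis=info)
print(rows['out_of_range'].tolist())  # [True, False]
```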