
Merge pull request #559 from mindsdb/staging
Release 1.3.0
paxcema authored Oct 7, 2021
2 parents 824862c + 31232c3 commit 96b5849
Showing 70 changed files with 2,002 additions and 1,219 deletions.
4 changes: 3 additions & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -1,7 +1,7 @@
---
name: Bug report
about: Create a report to help us improve
labels:
labels: Bug
---

## Your Environment
@@ -13,3 +13,5 @@ labels:


## How can we replicate it?
* What dataset did you use (link to it please)
* What was the code you ran
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,5 @@
---
name: Question
about: Ask a question
labels: question
---
8 changes: 8 additions & 0 deletions .github/ISSUE_TEMPLATE/suggestion.md
@@ -0,0 +1,8 @@
---
name: Suggestion
about: Suggest a feature, improvement, doc change, etc.
labels: enhancement
---



43 changes: 34 additions & 9 deletions CONTRIBUTING.md
@@ -10,26 +10,51 @@ We love to receive contributions from the community and hear your opinions! We w
* Submit a bug fix
* Propose new features
* Test Lightwood
* Solve an issue

# Code contributions
In general, we follow the "fork-and-pull" Git workflow.

1. Fork the Lightwood repository
2. Clone the repository
3. Make changes and commit them
4. Push your local branch to your fork
5. Submit a Pull request so that we can review your changes
6. Write a commit message
7. Make sure that the CI tests are GREEN
2. Check out the `staging` branch; this is the development version, which gets released weekly
3. Make changes and commit them
4. Make sure that the CI tests pass
5. Submit a Pull request from your fork to the `staging` branch of mindsdb/lightwood so that we can review your changes

>NOTE: Be sure to merge the latest from "upstream" before making a pull request!
> You will need to sign a CLA (Contributor License Agreement) for the code, since lightwood is under a GPL license
> Be sure to merge the latest from `staging` before making a pull request!
> You can run the test suite locally by running `flake8 .` to check style and `python -m unittest discover tests` to run the automated tests. This doesn't guarantee it will pass remotely, since we run on multiple environments, but it should work in most cases.

# Feature and Bug reports
We use GitHub issues to track bugs and features. Report them by opening a [new issue](https://github.com/mindsdb/lightwood/issues/new/choose) and filling out all of the required fields.

# Code review process
The Pull Request reviews are done on a regular basis.
Please, make sure you respond to our feedback/questions.

If your change could affect performance, we will run our private benchmark suite to validate it.

Please make sure you respond to our feedback and questions.

# Community
If you have additional questions or you want to chat with MindsDB core team, you can join our community [![Discourse posts](https://img.shields.io/discourse/posts?server=https%3A%2F%2Fcommunity.mindsdb.com%2F)](https://community.mindsdb.com/). To get updates on MindsDB’s latest announcements, releases, and events, [sign up for our newsletter](https://mindsdb.us20.list-manage.com/subscribe/post?u=5174706490c4f461e54869879&id=242786942a).
If you have additional questions or want to chat with the MindsDB core team, you can join our community Slack.

# Setting up a dev environment

- Clone lightwood
- `cd lightwood && pip install -r requirements.txt`
- Add it to your python path (e.g. by adding `export PYTHONPATH="/where/you/cloned/lightwood:$PYTHONPATH"` as a new line at the end of your `~/.bashrc` file)
- Check that the unit tests are passing by going into the directory where you cloned lightwood and running: `python -m unittest discover tests`

> If `python` defaults to Python 2.x in your environment, use `python3` and `pip3` instead

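As a minimal illustration of what the discovery step above does (using a hypothetical dummy test case rather than lightwood's real suite), `python -m unittest discover tests` essentially collects `unittest.TestCase` classes and runs them with a text runner:

```python
import unittest

# Hypothetical stand-in for one of the real test cases under tests/
class DummyEncoderTest(unittest.TestCase):
    def test_roundtrip(self):
        values = [1, 2, 3]
        self.assertEqual(list(reversed(list(reversed(values)))), values)

# `python -m unittest discover tests` does roughly this: collect test
# cases (here, a single one), then run them and report the result
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DummyEncoderTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True if every test passed
```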
## Setting up a vscode environment

Currently, the preferred environment for working with lightwood is vscode, a very popular Python IDE. However, any IDE should work; while we don't have guides for others, you can use the following as a template.

* Install and enable Settings Sync using your GitHub account (if you use multiple machines)
* Install Pylance (for type checking) and make sure to disable Pyright
* Go to `Python > Lint: Enabled` and disable everything *but* flake8
* Set `python.linting.flake8Path` to the full path to flake8 (found via `which flake8`)
* Set `Python › Formatting: Provider` to autopep8
* Add `--global-config=<path_to>/lightwood/.flake8` and `--experimental` to `Python › Formatting: Autopep8 Args`
* Install Live Share and Live Share Whiteboard
11 changes: 0 additions & 11 deletions dev/README.md

This file was deleted.

2 changes: 0 additions & 2 deletions dev/requirements.txt

This file was deleted.

2 changes: 1 addition & 1 deletion lightwood/__about__.py
@@ -1,6 +1,6 @@
__title__ = 'lightwood'
__package_name__ = 'lightwood'
__version__ = '1.2.0'
__version__ = '1.3.0'
__description__ = "Lightwood is a toolkit for automatic machine learning model building"
__email__ = "[email protected]"
__author__ = 'MindsDB Inc'
12 changes: 10 additions & 2 deletions lightwood/analysis/__init__.py
@@ -1,4 +1,12 @@
from lightwood.analysis.model_analyzer import model_analyzer
# Base
from lightwood.analysis.analyze import model_analyzer
from lightwood.analysis.explain import explain

__all__ = ['model_analyzer', 'explain']
# Blocks
from lightwood.analysis.base import BaseAnalysisBlock
from lightwood.analysis.nc.calibrate import ICP
from lightwood.analysis.helpers.acc_stats import AccStats
from lightwood.analysis.helpers.feature_importance import GlobalFeatureImportance


__all__ = ['model_analyzer', 'explain', 'ICP', 'AccStats', 'GlobalFeatureImportance', 'BaseAnalysisBlock']
96 changes: 96 additions & 0 deletions lightwood/analysis/analyze.py
@@ -0,0 +1,96 @@
from typing import Dict, List, Tuple, Optional

from lightwood.api import dtype
from lightwood.ensemble import BaseEnsemble
from lightwood.analysis.base import BaseAnalysisBlock
from lightwood.data.encoded_ds import EncodedDs
from lightwood.encoder.text.pretrained import PretrainedLangEncoder
from lightwood.api.types import ModelAnalysis, StatisticalAnalysis, TimeseriesSettings


def model_analyzer(
predictor: BaseEnsemble,
data: EncodedDs,
train_data: EncodedDs,
stats_info: StatisticalAnalysis,
target: str,
ts_cfg: TimeseriesSettings,
dtype_dict: Dict[str, str],
accuracy_functions,
analysis_blocks: Optional[List[BaseAnalysisBlock]] = []
) -> Tuple[ModelAnalysis, Dict[str, object]]:
"""
Analyses the model on a validation subset to evaluate accuracy, estimate feature importance and generate a
calibration model for estimating confidence in future predictions.
Additionally, any user-specified analysis blocks (see class `BaseAnalysisBlock`) are also called here.
:return:
runtime_analyzer: This dictionary object gets populated in a sequential fashion with data generated from
any `.analyze()` block call. This dictionary object is stored in the predictor itself, and used when
calling the `.explain()` method of all analysis blocks when generating predictions.
model_analysis: `ModelAnalysis` object that contains core analysis metrics, not necessarily needed when predicting.
"""

runtime_analyzer = {}
data_type = dtype_dict[target]

# retrieve encoded data representations
encoded_train_data = train_data
encoded_val_data = data
data = encoded_val_data.data_frame
input_cols = list([col for col in data.columns if col != target])

# predictive task
is_numerical = data_type in (dtype.integer, dtype.float, dtype.array, dtype.tsarray, dtype.quantity)
is_classification = data_type in (dtype.categorical, dtype.binary)
is_multi_ts = ts_cfg.is_timeseries and ts_cfg.nr_predictions > 1
has_pretrained_text_enc = any([isinstance(enc, PretrainedLangEncoder)
for enc in encoded_train_data.encoders.values()])

# raw predictions for validation dataset
normal_predictions = predictor(encoded_val_data) if not is_classification else predictor(encoded_val_data,
predict_proba=True)
normal_predictions = normal_predictions.set_index(data.index)

# ------------------------- #
# Run analysis blocks, both core and user-defined
# ------------------------- #
kwargs = {
'predictor': predictor,
'target': target,
'input_cols': input_cols,
'dtype_dict': dtype_dict,
'normal_predictions': normal_predictions,
'data': data,
'train_data': train_data,
'encoded_val_data': encoded_val_data,
'is_classification': is_classification,
'is_numerical': is_numerical,
'is_multi_ts': is_multi_ts,
'stats_info': stats_info,
'ts_cfg': ts_cfg,
'accuracy_functions': accuracy_functions,
'has_pretrained_text_enc': has_pretrained_text_enc
}

for block in analysis_blocks:
runtime_analyzer = block.analyze(runtime_analyzer, **kwargs)

# ------------------------- #
# Populate ModelAnalysis object
# ------------------------- #
model_analysis = ModelAnalysis(
accuracies=runtime_analyzer['score_dict'],
accuracy_histogram=runtime_analyzer['acc_histogram'],
accuracy_samples=runtime_analyzer['acc_samples'],
train_sample_size=len(encoded_train_data),
test_sample_size=len(encoded_val_data),
confusion_matrix=runtime_analyzer['cm'],
column_importances=runtime_analyzer['column_importances'],
histograms=stats_info.histograms,
dtypes=dtype_dict
)

return model_analysis, runtime_analyzer
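The loop over `analysis_blocks` above threads a single `runtime_analyzer` dictionary through every block in sequence, so later blocks can read what earlier blocks stored. A standalone sketch of that accumulation pattern, with hypothetical dummy blocks instead of lightwood's real ones:

```python
from typing import Dict


class DummyBlock:
    """Hypothetical stand-in for an analysis block: it reads prior
    results from `info` and adds one entry of its own."""

    def __init__(self, key: str, value: object):
        self.key = key
        self.value = value

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        # Each block sees everything previous blocks stored in `info`
        info[self.key] = self.value
        return info


blocks = [
    DummyBlock('score_dict', {'acc': 0.9}),
    DummyBlock('cm', [[5, 0], [1, 4]]),
]

runtime_analyzer: Dict[str, object] = {}
for block in blocks:
    runtime_analyzer = block.analyze(runtime_analyzer, target='y')

print(sorted(runtime_analyzer))  # ['cm', 'score_dict']
```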
46 changes: 46 additions & 0 deletions lightwood/analysis/base.py
@@ -0,0 +1,46 @@
from typing import Tuple, Dict, Optional

import pandas as pd
from lightwood.helpers.log import log


class BaseAnalysisBlock:
"""Class to be inherited by any analysis/explainer block."""
def __init__(self,
deps: Optional[Tuple] = ()
):

self.dependencies = deps # can be parallelized when there are no dependencies @TODO enforce

def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
"""
This method should be called once during the analysis phase, or not called at all.
It computes any information that the block may either output to the model analysis object,
or use at inference time when `.explain()` is called (in this case, make sure all needed
objects are added to the runtime analyzer so that `.explain()` can access them).
:param info: Dictionary where any new information or objects are added. The next analysis block will use
the output of the previous block as a starting point.
:param kwargs: Dictionary with named variables from either the core analysis or the rest of the prediction
pipeline.
"""
log.info(f"{self.__class__.__name__}.analyze() has not been implemented, no modifications will be done to the model analysis.") # noqa
return info

def explain(self,
row_insights: pd.DataFrame,
global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
"""
This method should be called once during the explaining phase at inference time, or not called at all.
Additional explanations can be at an instance level (row-wise) or global.
For the former, return a data frame with any new insights. For the latter, a dictionary is required.
:param row_insights: dataframe with previously computed row-level explanations.
:param global_insights: dict() with any explanations that concern all predicted instances or the model itself.
:returns:
- row_insights: modified input dataframe with any new row insights added here.
- global_insights: dict() with any explanations that concern all predicted instances or the model itself.
"""
log.info(f"{self.__class__.__name__}.explain() has not been implemented, no modifications will be done to the data insights.") # noqa
return row_insights, global_insights
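To make the interface above concrete, here is a hypothetical custom block sketched as a standalone class. The name `PredictionRangeBlock`, the `out_of_range` column, and the kwargs passed in the usage lines are all invented for illustration; a real block would subclass `BaseAnalysisBlock` and receive its kwargs from the pipeline.

```python
from typing import Dict, Tuple

import pandas as pd


class PredictionRangeBlock:
    """Hypothetical block: analyze() records the prediction range seen on
    the validation data; explain() flags out-of-range predictions."""

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        preds = kwargs['normal_predictions']['prediction']
        # Store whatever explain() will need inside the runtime analyzer
        info['pred_min'] = float(preds.min())
        info['pred_max'] = float(preds.max())
        return info

    def explain(self, row_insights: pd.DataFrame,
                global_insights: Dict[str, object],
                **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
        info = kwargs['analysis']  # the dict populated by analyze()
        row_insights['out_of_range'] = (
            (row_insights['prediction'] < info['pred_min']) |
            (row_insights['prediction'] > info['pred_max'])
        )
        return row_insights, global_insights


block = PredictionRangeBlock()
info = block.analyze(
    {}, normal_predictions=pd.DataFrame({'prediction': [1.0, 2.0, 3.0]}))
rows, _ = block.explain(
    pd.DataFrame({'prediction': [0.5, 2.5]}), {}, analysis=info)
print(rows['out_of_range'].tolist())  # [True, False]
```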