Update numpy to latest (#1799)
* initial commit

* bump pandas min 1.5.0

* testing

* testing

* update numpy

* update numpy

* update numpy

* update numpy

* loosen numpy

* loosen numpy

* remove python 3.8

* remove python 3.8

* revert read file

* lint fix

* Updated release notes and pinned numpy under 2.0.0

* incorrect pr num

* update minimum requirements

* update minimum dask

* min spark version

* set min scikit-learn for min spark

* first pass fix doc build

* second pass build docs

* Add line ending

Added line ending to file

* Add line to end of release notes

* Missing blank line

In release notes

* Revert "Merge remote-tracking branch 'origin/integrate_string_arrow' into update_numpy"

This reverts commit dc4ba5b, reversing
changes made to f59074b.

* Doc updates, tests pass

All tests pass with upgraded libs

* update min pyarrow in test reqs

Update the minimum pyarrow package in the test requirements

* moto.mock_s3 -> moto.mock_aws

name change in library

* Updated min req for moto

* boto3 min updated

moto upgrade requires an upgrade to boto3

* parquet - try forcing INT96 timestamp

Workaround for the Minimum Dependencies (Spark) test

* Remove temp parquet file

Used for local manual test and slipped through

* Updates per PR review

* Modified min requirements

Based on running action against branch

* Missing = in requirements

dumb

* Incorrect scikit-learn version

Should be 0.22, not 0.2.2

* Min scikit-learn 1.1.0

* spark requires python-dateutil 2.8.2

* pyspark min 3.5.0 to pass tests

* "revert" cast in _get_histogram_values.py

Not clear why this was necessary. Tests pass without it.

* _get_histogram_values cast re-added

With an updated filter

---------

Co-authored-by: Parthiv Naresh <[email protected]>
Co-authored-by: Christopher Park <[email protected]>
3 people authored Feb 2, 2024
1 parent 8c4a3d6 commit 98e4735
Showing 25 changed files with 133 additions and 75 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/build_docs.yaml
@@ -11,11 +11,11 @@ env:
ALTERYX_OPEN_SRC_UPDATE_CHECKER: False
jobs:
build_docs:
name: 3.8 build docs
name: 3.9 build docs
runs-on: ubuntu-latest
strategy:
matrix:
python_version: ["3.8"]
python_version: ["3.9"]
steps:
- name: Checkout repository
uses: actions/checkout@v3
@@ -26,12 +26,12 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python_version }}
cache: 'pip'
cache: 'pip'
cache-dependency-path: 'pyproject.toml'
- uses: actions/cache@v3
id: cache
with:
path: ${{ env.pythonLocation }}
path: ${{ env.pythonLocation }}
key: ${{ matrix.python_version }}-lint-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01
- name: Install apt requirements
run: |
@@ -42,7 +42,7 @@ jobs:
- name: Install woodwork with doc dependencies (not using cache)
if: steps.cache.outputs.cache-hit != 'true'
run: |
python -m pip install .[dev]
python -m pip install ".[docs]"
- name: Install woodwork with no doc dependencies (using cache)
if: steps.cache.outputs.cache-hit == 'true'
run: |
6 changes: 3 additions & 3 deletions .github/workflows/install_test.yaml
@@ -14,7 +14,7 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest]
python_version: ["3.8", "3.9", "3.10", "3.11"]
python_version: ["3.9", "3.10", "3.11"]
runs-on: ${{ matrix.os }}
steps:
- name: Checkout repository
@@ -26,12 +26,12 @@
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python_version }}
cache: 'pip'
cache: 'pip'
cache-dependency-path: 'pyproject.toml'
- uses: actions/cache@v3
id: cache
with:
path: ${{ env.pythonLocation }}
path: ${{ env.pythonLocation }}
key: ${{ matrix.os- }}-${{ matrix.python_version }}-install-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01
- name: Build woodwork package
run: |
4 changes: 2 additions & 2 deletions .github/workflows/latest_dependency_checker.yaml
@@ -12,10 +12,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python 3.8
- name: Set up Python 3.9
uses: actions/setup-python@v4
with:
python-version: '3.8.x'
python-version: '3.9.x'
- name: Install pip and virtualenv
run: |
python -m pip install --upgrade pip
20 changes: 10 additions & 10 deletions .github/workflows/tests_with_latest_deps.yaml
@@ -17,7 +17,7 @@ jobs:
strategy:
fail-fast: true
matrix:
python_version: ["3.8", "3.9", "3.10", "3.11"]
python_version: ["3.9", "3.10", "3.11"]
directories: ["All Other Tests", "Testing Table Accessor", "Testing to Disk with LatLong", "All other Serialization"]
steps:
- name: Set up python ${{ matrix.python_version }}
@@ -49,47 +49,47 @@ jobs:
python -m pip install unpacked_sdist/[dask]
cd unpacked_sdist
coverage erase
- if: ${{ matrix.python_version != 3.8 && matrix.directories == 'Testing to Disk with LatLong' }}
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Testing to Disk with LatLong' }}
name: Run testing to Disk with LatLong Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0
- if: ${{ matrix.python_version != 3.8 && matrix.directories == 'All other Serialization' }}
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'All other Serialization' }}
name: Run all other Serialization Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0
- if: ${{ matrix.python_version != 3.8 && matrix.directories == 'Testing Table Accessor' }}
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Testing Table Accessor' }}
name: Run Table Accessor Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_table_accessor.py -n 2 --durations 0
- if: ${{ matrix.python_version != 3.8 && matrix.directories == 'All Other Tests' }}
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'All Other Tests' }}
name: Run all other Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/ -n 2 --ignore=woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_table_accessor.py --durations 0
- if: ${{ matrix.python_version == 3.8 && matrix.directories == 'Testing to Disk with LatLong' }}
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Testing to Disk with LatLong' }}
name: Run Testing to Disk with LatLong Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.8 && matrix.directories == 'All other Serialization' }}
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'All other Serialization' }}
name: Run all other Serialization Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.8 && matrix.directories == 'Testing Table Accessor' }}
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Testing Table Accessor' }}
name: Run Table Accessor Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_table_accessor.py -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.8 && matrix.directories == 'All Other Tests' }}
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'All Other Tests' }}
name: Run all other Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/ -n 2 --ignore=woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_table_accessor.py --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.8 }}
- if: ${{ matrix.python_version == 3.9 }}
name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
6 changes: 3 additions & 3 deletions .github/workflows/tests_with_minimum_deps.yaml
@@ -7,7 +7,7 @@ on:
- main
jobs:
py38_unit_tests_minimum_dependencies:
name: Tests - 3.8 Minimum Dependencies
name: Tests - 3.9 Minimum Dependencies
runs-on: ubuntu-latest
strategy:
matrix:
@@ -18,10 +18,10 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.ref }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
- name: Set up python 3.8
- name: Set up python 3.9
uses: actions/setup-python@v4
with:
python-version: 3.8
python-version: 3.9
- name: Install woodwork - minimum tests requirements
run: |
python -m pip install -e . --no-dependencies
@@ -27,7 +27,7 @@ jobs:
echo "PREVIOUS_HASH=$(git rev-parse --short HEAD~1)" >> $GITHUB_ENV
echo "Previous commit hash: ${{ env.PREVIOUS_HASH }}"
- name: Run airflow tests and generate report
run: |
run: |
curl --location --request POST '${{ secrets.AIRFLOW_BASE_URL }}dags/woodwork_run_tests_generate_report/dagRuns' \
-u '${{ secrets.AIRFLOW_WW_USER }}:${{ secrets.AIRFLOW_WW_PASS }}' \
--header 'Content-Type: application/json' \
@@ -36,7 +36,7 @@
"description": null,
"n_trials": 1,
"pytest_args": {},
"python_version": "3.8",
"python_version": "3.9",
"scenarios_yaml": "woodwork_scenarios.yaml",
"woodwork_branch_previous": "${{ env.PREVIOUS_HASH }}",
"woodwork_branch_new": "${{ env.CURRENT_HASH }}",
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -27,7 +27,7 @@ repos:
- id: add-trailing-comma
name: Add trailing comma
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: 'v0.1.13'
rev: 'v0.1.14'
hooks:
- id: ruff
types_or: [ python, pyi, jupyter ]
2 changes: 1 addition & 1 deletion .readthedocs.yaml
@@ -16,7 +16,7 @@ formats: []
build:
os: "ubuntu-22.04"
tools:
python: "3.8"
python: "3.9"
apt_packages:
- openjdk-11-jre-headless
jobs:
5 changes: 4 additions & 1 deletion docs/source/guides/using_woodwork_with_dask_and_spark.ipynb
@@ -322,7 +322,10 @@
"Woodwork allows column names of any format that is supported by the DataFrame. However, Dask DataFrames do not currently support integer column names.\n",
"\n",
"### Setting DataFrame Index\n",
"When specifying a Woodwork index with a pandas DataFrame, the underlying index of the DataFrame will be updated to match the column specified as the Woodwork index. When specifying a Woodwork index on a Dask or Spark DataFrame, however, the underlying index will remain unchanged.\n"
"When specifying a Woodwork index with a pandas DataFrame, the underlying index of the DataFrame will be updated to match the column specified as the Woodwork index. When specifying a Woodwork index on a Dask or Spark DataFrame, however, the underlying index will remain unchanged.\n",
"\n",
"### Dask `string[pyarrow]`\n",
"Woodwork may have issues with the new string storage model used by Dask. To workaround this, add `dask.config.set({'dataframe.convert-string': False})`, prior to running dask operations.\n"
]
}
],
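The new documentation note above names the config switch without showing it in context. Below is a minimal sketch of how it might be applied before initializing Woodwork on a Dask DataFrame; the DataFrame contents and column names are illustrative, not taken from the guide.

```python
# Sketch of the workaround described in the note above, assuming dask[dataframe]
# and woodwork are installed. The data below is illustrative.
import dask
import dask.dataframe as dd
import pandas as pd
import woodwork as ww  # noqa: F401  # registers the .ww accessor

# Keep object-dtype strings instead of Dask's pyarrow-backed string storage.
dask.config.set({"dataframe.convert-string": False})

pdf = pd.DataFrame({"id": [0, 1, 2], "name": ["a", "b", "c"]})
ddf = dd.from_pandas(pdf, npartitions=1)
ddf.ww.init(index="id")  # per the note above, the underlying Dask index is left unchanged
print(ddf.ww.schema)
```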
12 changes: 6 additions & 6 deletions docs/source/install.md
@@ -1,6 +1,6 @@
# Install

Woodwork is available for Python 3.8 - 3.11. It can be installed from PyPI, conda-forge, or from source.
Woodwork is available for Python 3.9 - 3.11. It can be installed from PyPI, conda-forge, or from source.

To install Woodwork, run the following command:

@@ -123,7 +123,7 @@ You can do so by installing it as a package inside a container (following the no
creating a new image with Woodwork pre-installed, using the following commands in your `Dockerfile`:

```dockerfile
FROM --platform=linux/x86_64 python:3.8-slim-buster
FROM --platform=linux/x86_64 python:3.9-slim-buster
RUN apt update && apt -y update
RUN apt install -y build-essential
RUN pip3 install --upgrade --quiet pip
@@ -135,11 +135,11 @@ Woodwork has several other Python dependencies that are used only for specific m

| Dependency | Min Version | Notes |
|-------------------|-------------|----------------------------------------|
| boto3 | 1.10.45 | Required to read/write to URLs and S3 |
| boto3 | 1.34.32 | Required to read/write to URLs and S3 |
| smart_open | 5.0.0 | Required to read/write to URLs and S3 |
| pyarrow | 4.0.1 | Required to serialize to parquet |
| dask[distributed] | 2021.10.0 | Required to use with Dask DataFrames |
| pyspark | 3.2.0 | Required to use with Spark DataFrames |
| pyarrow | 15.0.0 | Required to serialize to parquet |
| dask[distributed] | 2024.1.0 | Required to use with Dask DataFrames |
| pyspark | 3.5.0 | Required to use with Spark DataFrames |


# Development
14 changes: 14 additions & 0 deletions docs/source/release_notes.rst
@@ -13,6 +13,20 @@ Release Notes
.. Thanks to the following people for contributing to this release:
v0.28.0
====================
* Enhancements
* Fixes
* Changes
* Upgraded numpy to < 2.0.0 :pr:`1799`
* Documentation Changes
* Added dask string storage note to "Other Limitations" in Dask documentation :pr:`1799`
* Testing Changes
* Upgraded moto and boto3 :pr:`1799`

Thanks to the following people for contributing to this release:
:user:`cp2boston`, :user:`gsheni`

v0.27.0 Dec 12, 2023
====================
* Fixes
24 changes: 11 additions & 13 deletions pyproject.toml
@@ -11,7 +11,6 @@ classifiers = [
"Topic :: Scientific/Engineering",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
@@ -28,14 +27,14 @@ maintainers = [
]
keywords = ["data science", "machine learning", "typing"]
license = {file = "LICENSE"}
requires-python = ">=3.8,<4"
requires-python = ">=3.9,<4"
dependencies = [
"pandas >= 1.4.3",
"scikit-learn >= 0.22",
"scikit-learn >= 1.1.0",
"python-dateutil >= 2.8.1",
"scipy >= 1.10.0",
"importlib-resources >= 5.10.0",
"numpy >= 1.22.0, <1.25.0",
"numpy >= 1.25.0, <2.0.0",
]

[project.urls]
@@ -51,19 +50,19 @@ test = [
"pytest >= 7.0.1",
"pytest-cov >= 2.10.1",
"pytest-xdist >= 2.1.0",
"boto3 >= 1.10.45",
"moto[all] >= 3.0.7",
"boto3 >= 1.34.32",
"moto[all] >= 5.0.0",
"smart-open >= 5.0.0",
"pyarrow >= 4.0.1, <13.0.0",
"pyarrow >= 14.0.1"
]
dask = [
"dask[dataframe] >= 2022.11.1",
]
spark = [
"pyspark >= 3.2.2",
"pandas >= 1.4.3, <2.0.0",
"numpy < 1.24.0",
"pyarrow >= 4.0.1, <13.0.0",
"pyspark >= 3.5.0",
"pandas >= 2.0.0",
"numpy >= 1.25.0",
"pyarrow >= 14.0.1",
]
updater = [
"alteryx-open-src-update-checker >= 3.1.0"
@@ -83,8 +82,7 @@ docs = [
dev = [
"ruff >= 0.1.6",
"pre-commit >= 2.20.0",
"click >= 7.1.2, <8.1.0",
"woodwork[docs, dask, spark, test]",
"click >= 8.1.7"
]
complete = [
"woodwork[dask, spark, updater]",
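The moto and boto3 pins above go with the `moto.mock_s3 -> moto.mock_aws` rename called out in the commit message: moto 5 consolidates the per-service decorators into a single `mock_aws`. A hedged sketch of the new pattern follows; the bucket, key, and payload are illustrative and are not copied from the woodwork test suite.

```python
# Illustrative use of moto 5's mock_aws decorator; everything S3-specific here
# is made up for this sketch rather than taken from woodwork's tests.
import boto3
from moto import mock_aws


@mock_aws
def test_s3_roundtrip():
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="example-bucket")
    s3.put_object(Bucket="example-bucket", Key="data.csv", Body=b"id,name\n1,a\n")
    body = s3.get_object(Bucket="example-bucket", Key="data.csv")["Body"].read()
    assert body.startswith(b"id,name")
```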
2 changes: 1 addition & 1 deletion woodwork/serializers/parquet_serializer.py
@@ -111,7 +111,7 @@ def _save_parquet_table_to_disk(self):
**table_metadata,
}
table = table.replace_schema_metadata(combined_meta)
pq.write_table(table, update_file)
pq.write_table(table, update_file, use_deprecated_int96_timestamps=True)

# Remove checksum files which prevent deserialization if present due to updated parquet header
crc_files = [f for f in files if Path(f).suffix == ".crc"]
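The one-line change above adds `use_deprecated_int96_timestamps=True`, which the commit message describes as a workaround for the Minimum Dependencies (Spark) job. A standalone sketch of the same pyarrow call is below; the DataFrame and output path are made up for illustration.

```python
# Sketch of writing parquet with INT96 timestamps via pyarrow; the data and
# file name are illustrative. Without the flag, pyarrow writes its default
# timestamp representation, which the pinned minimum Spark appeared not to
# read back cleanly in CI per the commit message.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01", "2024-02-02 12:30:00"])})
table = pa.Table.from_pandas(df)
pq.write_table(table, "example.parquet", use_deprecated_int96_timestamps=True)
```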
11 changes: 10 additions & 1 deletion woodwork/statistics_utils/_get_histogram_values.py
@@ -12,7 +12,16 @@ def _get_histogram_values(series, bins=10):
histogram (list(dict)): a list of dictionary with keys `bins` and
`frequency`
"""
values = pd.cut(series, bins=bins, duplicates="drop").value_counts().sort_index()

if pd.api.types.is_numeric_dtype(series.dtype) or pd.api.types.is_bool_dtype(
series.dtype,
):
series = series.astype(float)
values = (
pd.cut(x=series.to_numpy(), bins=bins, duplicates="drop")
.value_counts()
.sort_index()
)
df = values.reset_index()
df.columns = ["bins", "frequency"]
results = []
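The commit messages above say the reason for this cast was not entirely clear. One hedged reading is that nullable integer and boolean extension dtypes need the float cast so that `pd.NA` becomes `np.nan` before binning with `pd.cut`. The series below is illustrative, not taken from the test suite.

```python
# Illustrative sketch of the cast added above: a nullable Int64 series is cast
# to float so missing values become NaN before pd.cut bins the values.
import pandas as pd

series = pd.Series([1, 2, 2, 3, None], dtype="Int64")

values = (
    pd.cut(x=series.astype(float).to_numpy(), bins=3, duplicates="drop")
    .value_counts()
    .sort_index()
)
print(values)
```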
4 changes: 3 additions & 1 deletion woodwork/tests/accessor/test_column_accessor.py
@@ -101,7 +101,7 @@ def test_accessor_init_with_schema_errors(sample_series):
new_dtype = "<U0"
else:
ltype_dtype = "category"
new_dtype = "object"
new_dtype = "string"

error = re.escape(
f"dtype mismatch between Series dtype {new_dtype}, and Categorical dtype, {ltype_dtype}",
Expand Down Expand Up @@ -163,6 +163,8 @@ def test_accessor_init_with_logical_type(sample_series):
def test_accessor_init_with_invalid_logical_type(sample_series):
if _is_spark_series(sample_series):
series_dtype = "<U0"
elif _is_dask_series(sample_series):
series_dtype = "int64"
else:
series_dtype = "object"
series = sample_series.astype(series_dtype)