Commit

Merge branch 'main' into i267

AryanBakliwal committed Apr 18, 2024
2 parents e127fff + 27c240d commit a60cec7
Showing 9 changed files with 615 additions and 13 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/ci.yaml
@@ -113,10 +113,10 @@ jobs:
pip install . --upgrade pip
- name: Test notebooks
run: >
python -m ipykernel install --user --name bitinfo
python -m ipykernel install --user --name bitinfo-docs
jupyter nbconvert --to html --execute docs/*.ipynb
--ExecutePreprocessor.kernel_name=bitinfo
--ExecutePreprocessor.kernel_name=bitinfo-docs
install:
name: 'install xbitinfo, ${{ matrix.os }}'
runs-on: '${{ matrix.os }}'
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -5,6 +5,7 @@ CHANGELOG
X.X.X (unreleased)
------------------

* Add basic filter to remove artificial information from bitinformation (:pr:`280`, :issue:`209`) `Ishaan Jain`_.
* Add support for additional datatypes in :py:func:`xbitinfo.xbitinfo.plot_bitinformation` (:pr:`218`, :issue:`168`) `Hauke Schulz`_.
* Drop python 3.8 support and add python 3.11 (:pr:`175`) `Hauke Schulz`_.
* Implement basic retrieval of bitinformation in python as alternative to julia implementation (:pr:`156`, :issue:`155`, :pr:`126`, :issue:`125`) `Hauke Schulz`_ with helpful comments from `Milan Klöwer`_.
311 changes: 311 additions & 0 deletions docs/ArtificialInformation_Filter.ipynb
@@ -0,0 +1,311 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "895e1d5a",
"metadata": {},
"source": [
"# Artificial information filtering\n",
"\n",
"In simple terms the bitinformation is retrieved by checking how variable a bit pattern is. However, this approach cannot distinguish between actual information content and artifical information content. By studying the distribution of the information content the user can often identify clear cut-offs of real information content and artificial information content.\n",
"\n",
"The following example shows how such a separation of real information and artificial information can look like. To do so, artificial information is artificially added to an example dataset by applying linear quantization. Linear quantization is often applied to climate datasets (e.g. ERA5) and needs to be accounted for in order to retrieve meaningful bitinformation content. An algorithm that aims at detecting this artificial information itself is introduced."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c37dd36",
"metadata": {},
"outputs": [],
"source": [
"import xarray as xr\n",
"import xbitinfo as xb\n",
"import numpy as np"
]
},
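{
"cell_type": "markdown",
"id": "bitpattern-illustration",
"metadata": {},
"source": [
"To make the idea of bit-pattern variability concrete, the following short example (an illustrative addition, not part of the original workflow) prints the raw bit patterns of a few similar `float32` values: the leading bits agree, while the trailing mantissa bits differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bitpattern-illustration-code",
"metadata": {},
"outputs": [],
"source": [
"# print the raw bit patterns of a few similar float32 values\n",
"vals = np.array([273.15, 273.16, 273.17], dtype=\"float32\")\n",
"for v, b in zip(vals, vals.view(\"uint32\")):\n",
"    print(v, np.binary_repr(b, width=32))"
]
},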
{
"cell_type": "markdown",
"id": "e8e1424f",
"metadata": {},
"source": [
"## Loading example dataset\n",
"We use here the openly accessible CONUS dataset. The dataset is available at full precision."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b18b9e24",
"metadata": {},
"outputs": [],
"source": [
"ds = xr.open_zarr(\n",
" \"s3://hytest/conus404/conus404_monthly.zarr\",\n",
" storage_options={\n",
" \"anon\": True,\n",
" \"requester_pays\": False,\n",
" \"client_kwargs\": {\"endpoint_url\": \"https://usgs.osn.mghpcc.org\"},\n",
" },\n",
")\n",
"# selecting water vapor mixing ratio at 2 meters\n",
"data = ds[\"ACSWDNT\"]\n",
"# select subset of data for demonstration purposes\n",
"chunk = data.isel(time=slice(0, 2), y=slice(0, 1015), x=slice(0, 1050))\n",
"chunk"
]
},
{
"cell_type": "markdown",
"id": "535ce421",
"metadata": {},
"source": [
"## Creating dataset copy with artificial information\n",
"### Functions to encode and decode"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3a7c7ae",
"metadata": {},
"outputs": [],
"source": [
"# Encoding function to compress data\n",
"def encode(chunk, scale, offset, dtype, astype):\n",
" enc = (chunk - offset) * scale\n",
" enc = np.around(enc)\n",
" enc = enc.astype(astype, copy=False)\n",
" return enc\n",
"\n",
"\n",
"# Decoding function to decompress data\n",
"def decode(enc, scale, offset, dtype, astype):\n",
" dec = (enc / scale) + offset\n",
" dec = dec.astype(dtype, copy=False)\n",
" return dec"
]
},
{
"cell_type": "markdown",
"id": "fa6f26c7",
"metadata": {},
"source": [
"### Transform dataset to introduce artificial information"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c09e3cf3",
"metadata": {},
"outputs": [],
"source": [
"xmin = np.min(chunk)\n",
"xmax = np.max(chunk)\n",
"scale = (2**16 - 1) / (xmax - xmin)\n",
"offset = xmin\n",
"enc = encode(chunk, scale, offset, \"f4\", \"u2\")\n",
"dec = decode(enc, scale, offset, \"f4\", \"u2\")"
]
},
{
"cell_type": "markdown",
"id": "7126810d",
"metadata": {},
"source": [
"## Comparison of bitinformation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05ef8a94",
"metadata": {},
"outputs": [],
"source": [
"# original dataset without artificial information\n",
"orig_info = xb.get_bitinformation(\n",
" xr.Dataset({\"w/o artif. info\": chunk}),\n",
" dim=\"x\",\n",
" implementation=\"python\",\n",
")\n",
"\n",
"# dataset with artificial information\n",
"arti_info = xb.get_bitinformation(\n",
" xr.Dataset({\"w artif. info\": dec}),\n",
" dim=\"x\",\n",
" implementation=\"python\",\n",
")\n",
"\n",
"# plotting distribution of bitwise information content\n",
"info = xr.merge([orig_info, arti_info])\n",
"plot = xb.plot_bitinformation(info)"
]
},
{
"cell_type": "markdown",
"id": "de1ecb7e",
"metadata": {},
"source": [
"The figure reveals that artificial information is introduced by applying linear quantization. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8600d4b8",
"metadata": {},
"outputs": [],
"source": [
"keepbits = xb.get_keepbits(info, inflevel=[0.99])\n",
"print(\n",
" f\"The number of keepbits increased from {keepbits['w/o artif. info'].item(0)} bits in the original dataset to {keepbits['w artif. info'].item(0)} bits in the dataset with artificial information.\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "fa80f988",
"metadata": {},
"source": [
"In the following, a gradient based filter is introduced to remove this artificial information again so that even in case artificial information is present in a dataset the number of keepbits remains similar."
]
},
{
"cell_type": "markdown",
"id": "3f7a7c2e",
"metadata": {},
"source": [
"## Artificial information filter\n",
"The filter `gradient` works as follows:\n",
"\n",
"1. It determines the Cumulative Distribution Function(CDF) of the bitwise information content\n",
"2. It computes the gradient of the CDF to identify points where the gradient becomes close to a given tolerance indicating a drop in information.\n",
"3. Simultaneously, it keeps track of the minimum cumulative sum of information content which is threshold here, which signifies at least this much fraction of total information needs to be passed.\n",
"4. So the bit where the intersection of the gradient reaching the tolerance and the cumulative sum exceeding the threshold is our TrueKeepbits. All bits beyond this index are assumed to contain artificial information and are set to zero in order to cut them off.\n",
"5. You can see the above concept implemented in the function get_cdf_without_artificial_information in xbitinfo.py\n",
"\n",
"Please note that this filter relies on a clear separation between real and artificial information content and might not work in all cases."
]
},
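{
"cell_type": "markdown",
"id": "gradient-filter-sketch",
"metadata": {},
"source": [
"The next cell is a minimal sketch of this idea on a plain NumPy array of per-bit information content. The helper `gradient_cutoff` and the toy information profile are illustrative assumptions for this notebook, not the xbitinfo implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "gradient-filter-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch of the gradient filter idea -- not the xbitinfo implementation\n",
"def gradient_cutoff(info_per_bit, threshold=0.7, tolerance=0.001):\n",
"    \"\"\"Return the first bit index from which on information is treated as artificial.\"\"\"\n",
"    # CDF of the bitwise information content\n",
"    cdf = np.cumsum(info_per_bit) / np.sum(info_per_bit)\n",
"    # gradient of the CDF; small values mean the following bits add little information\n",
"    grad = np.gradient(cdf)\n",
"    # first bit where the gradient reaches the tolerance while the CDF\n",
"    # already exceeds the threshold marks the cut-off\n",
"    for i, (g, c) in enumerate(zip(grad, cdf)):\n",
"        if g < tolerance and c >= threshold:\n",
"            return i\n",
"    return len(info_per_bit) - 1\n",
"\n",
"\n",
"# toy profile: real information in the first bits, a gap, then an artificial tail\n",
"toy = np.r_[np.linspace(0.3, 0.02, 10), np.zeros(5), np.full(8, 0.05)]\n",
"gradient_cutoff(toy)"
]
},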
{
"cell_type": "code",
"execution_count": null,
"id": "b0ab6633",
"metadata": {},
"outputs": [],
"source": [
"xb.get_keepbits(\n",
" arti_info,\n",
" inflevel=[0.99],\n",
" information_filter=\"Gradient\",\n",
" **{\"threshold\": 0.7, \"tolerance\": 0.001}\n",
")"
]
},
{
"cell_type": "markdown",
"id": "21c6369d",
"metadata": {},
"source": [
"With the application of the filter the keepbits are closer/identical to their original value in the dataset without artificial information. The plot of the bitinformation visualizes this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e9183b2",
"metadata": {},
"outputs": [],
"source": [
"plot = xb.plot_bitinformation(arti_info, information_filter=\"Gradient\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
5 changes: 3 additions & 2 deletions docs/conf.py
@@ -46,7 +46,7 @@
"sphinx.ext.autosummary",
"sphinx.ext.extlinks",
"sphinx.ext.intersphinx",
"sphinxcontrib.napoleon",
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinx_copybutton",
"IPython.sphinxext.ipython_directive",
@@ -92,7 +92,7 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
@@ -130,6 +130,7 @@
"home_page_in_toc": False,
"extra_navbar": "",
"navbar_footer_text": "",
"navigation_with_keys": False,
}

html_context = {
5 changes: 3 additions & 2 deletions docs/environment.yml
@@ -2,7 +2,7 @@ name: bitinfo-docs
channels:
- conda-forge
dependencies:
- python=3.9
- python
- matplotlib-base
- numpy
- pooch
@@ -18,11 +18,12 @@ dependencies:
- tqdm
- pytest
- sphinx!=4.4.0
- sphinxcontrib-napoleon
- sphinx-copybutton
- sphinx-book-theme>=0.1.7
- myst-nb
- numcodecs>=0.10.0
- intake-xarray
- s3fs
- pip
- pip:
- -e ../.