Improve notebook execution performance (#6)
* Added papermill custom engine to reuse it for notebook execution

* Updated file path to absolute instead of relative

* Added the ability to get the customization file from the result

* Implemented logic to use a separate papermill notebook client for each notebook

* Implemented async_execute for papermill engine

* Added EngineBusyError

* Configured compose.yml to build j-sp image instead of pulling from repository

* updated readme with changes in rpt-viewer

* Optimised Dockerfile

---------

Co-authored-by: molotgor
Nikita-Smirnov-Exactpro and molotgor authored Sep 23, 2024
1 parent 1676bbb commit 9ab041c
Showing 11 changed files with 721 additions and 109 deletions.
9 changes: 7 additions & 2 deletions Dockerfile
@@ -3,8 +3,8 @@ FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app
-# Copy the current directory contents into the container at /app
-COPY . /app
+# Copy requirements.txt into the container at /app
+COPY requirements.txt /app/

# groupadd --system - create a system group
# useradd --system - create a system account
@@ -41,5 +41,10 @@ ENV PIP_CONFIG_FILE="${HOME}/.pip/pip.conf"
RUN mkdir -p "${PYTHON_SHARED_LIB_PATH}"
RUN echo 'umask 0007' >> "${HOME}/.bashrc"

# Copy the json_stream_provider module into the container at /app
COPY json_stream_provider /app/json_stream_provider
# Copy the distribution files into the container at /app
COPY LICENSE NOTICE README.md package_info.json server.py /app/

ENTRYPOINT ["python", "/app/server.py"]
CMD ["/var/th2/config/custom.json"]
1 change: 1 addition & 0 deletions NOTICE
@@ -0,0 +1 @@
This project includes code from https://github.com/nteract/papermill/blob/2.6.0 which is licensed under the BSD License.
22 changes: 20 additions & 2 deletions README.md
@@ -9,6 +9,7 @@ This python server is made to launch Jupyter notebooks (*.ipynb) and get results
* `notebooks` (Default value: /home/jupyter-notebook/) - path to the directory with notebooks. `j-sp` searches the specified folder recursively for files with the `ipynb` extension.
* `results` (Default value: /home/jupyter-notebook/results) - path to the directory for run results. `j-sp` resolves result files with the `jsonl` extension against the specified folder.
* `logs` (Default value: /home/jupyter-notebook/logs) - path to the directory for run logs. `j-sp` puts run logs into the specified folder.
* `out-of-use-engine-time` (Default value: 3600) - out-of-use time interval in seconds. `j-sp` unregisters the engine related to a notebook when the notebook has not been run for longer than this interval.

### mounting:

@@ -37,6 +38,7 @@ spec:
notebooks: /home/jupyter-notebook/
results: /home/jupyter-notebook/j-sp/results/
logs: /home/jupyter-notebook/j-sp/logs/
out-of-use-engine-time: 3600
mounting:
- path: /home/jupyter-notebook/
pvcName: jupyter-notebook
@@ -100,7 +102,7 @@ chmod -R g=u user_data/
#### start command
```shell
cd local-run/with-jupyter-notebook
-docker compose up
+docker compose up --build
```
#### clean command
```shell
@@ -119,9 +121,25 @@ docker compose build

## Release notes:

### 0.0.7

* Custom engine holds separate papermill notebook client for each file.

### 0.0.6

* Added papermill custom engine to reuse it for notebook execution.
  A separate engine is registered for each notebook and unregistered after 1 hour of out-of-use time by default.
* Updated local run with jupyter-notebook:
  * updated th2-rpt-viewer:
    * `JSON Reader` page polls execution status every 50 ms instead of every 1 sec
    * `JSON Reader` page now uses virtuoso for rendering lists
    * `JSON Reader` page now has search. Its values can be loaded from a `json` file containing an array of objects with `pattern` and `color` fields. A notebook run can create such a file, and it is loaded into the UI if it is created at the path given by the `customization_path` parameter.
    * Added ability to create multiple `JSON Reader` pages.
    * `JSON Reader` page now has compare mode.
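
A minimal sketch of how a notebook might produce such a search customization file. Only the `pattern` and `color` field names come from the release note; the helper name `write_customization`, the color format, and the example path are assumptions made for illustration.

```python
import json
from pathlib import Path


def write_customization(path, rules):
    """Write search rules as a JSON array of {"pattern", "color"} objects.

    `rules` is an iterable of (pattern, color) pairs. The color format
    (hex strings here) is an assumption, not something the README specifies.
    """
    payload = [{"pattern": pattern, "color": color} for pattern, color in rules]
    target = Path(path)
    # Create missing parent directories so the notebook does not fail on a fresh volume.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload
```

A notebook cell could then call, for example, `write_customization(customization_path, [("ERROR", "#ff0000")])`, where `customization_path` is the notebook parameter mentioned above.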

### 0.0.5

-* added `umask 0007` to `~/.bashrc` file to provide rw file access for `users` group
+* added `umask 0007` to `~/.bashrc` file to provide rw file access for `users` group
* added `/file` request for loading content of single jsonl file
* removed ability to get any file from machine via `/file` REST APIs
* added sorting on requests `/files/notebooks` and `/files/results`
Empty file.
259 changes: 259 additions & 0 deletions json_stream_provider/custom_engines.py
@@ -0,0 +1,259 @@
# Copyright 2024 Exactpro (Exactpro Systems Limited)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging.config
import time
from datetime import datetime

from papermill.clientwrap import PapermillNotebookClient
from papermill.engines import NBClientEngine, NotebookExecutionManager, PapermillEngines
from papermill.utils import remove_args, merge_kwargs, logger


class EngineKey:
    def __init__(self, client_id, notebook_file):
        self.client_id = client_id
        self.notebook_file = notebook_file

    def __hash__(self):
        # Combine attributes for a unique hash
        return hash((self.client_id, self.notebook_file))

    def __eq__(self, other):
        if isinstance(other, EngineKey):
            return self.client_id == other.client_id and self.notebook_file == other.notebook_file
        return False

    def __iter__(self):
        return iter((self.client_id, self.notebook_file))

    def __str__(self):
        return f"{self.client_id}:{self.notebook_file}"


class EngineHolder:
    _key: EngineKey
    _client: PapermillNotebookClient
    _last_used_time: float
    _busy: bool = False

    def __init__(self, key: EngineKey, client: PapermillNotebookClient):
        self._key = key
        self._client = client
        self._last_used_time = time.time()

    def __str__(self):
        return f"Engine(key={self._key}, last_used_time={self._last_used_time}, is_busy={self._busy})"

    async def async_execute(self, nb_man):
        if self._busy:
            raise EngineBusyError(
                f"Notebook client related to '{self._key}' has been busy since {self._get_last_used_date_time()}")

        try:
            self._busy = True
            # refresh the timestamp on every run so that the out-of-use timeout
            # counts from the last execution, not from engine creation
            self._last_used_time = time.time()
            # accept new notebook into (possibly) existing client
            self._client.nb_man = nb_man
            self._client.nb = nb_man.nb
            # reuse client connection to existing kernel
            output = await self._client.async_execute(cleanup_kc=False)
            # renumber executions
            for i, cell in enumerate(nb_man.nb.cells):
                if 'execution_count' in cell:
                    cell['execution_count'] = i + 1

            return output
        finally:
            self._busy = False

    def get_last_used_time(self) -> float:
        return self._last_used_time

    def close(self):
        self._client = None

    def _get_last_used_date_time(self):
        return datetime.fromtimestamp(self._last_used_time)


class EngineBusyError(RuntimeError):
    pass


class CustomEngine(NBClientEngine):
    out_of_use_engine_time: int = 60 * 60
    metadata_dict: dict = {}
    logger: logging.Logger

# The code of this method is derived from https://github.com/nteract/papermill/blob/2.6.0 under the BSD License.
# Original license follows:
#
# BSD 3-Clause License
#
# Copyright (c) 2017, nteract
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# * Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Modified by Exactpro for https://github.com/th2-net/th2-json-stream-provider-py
    @classmethod
    async def async_execute_notebook(
            cls,
            nb,
            kernel_name,
            output_path=None,
            progress_bar=True,
            log_output=False,
            autosave_cell_every=30,
            **kwargs,
    ):
        """
        A wrapper to handle notebook execution tasks.
        Wraps the notebook object in a `NotebookExecutionManager` in order to track
        execution state in a uniform manner. This is meant to help simplify
        engine implementations. This allows a developer to just focus on
        iterating and executing the cell contents.
        """
        nb_man = NotebookExecutionManager(
            nb,
            output_path=output_path,
            progress_bar=progress_bar,
            log_output=log_output,
            autosave_cell_every=autosave_cell_every,
        )

        nb_man.notebook_start()
        try:
            await cls.async_execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
        finally:
            nb_man.cleanup_pbar()
            nb_man.notebook_complete()

        return nb_man.nb

    # this method has been copied from the issue comment
    # https://github.com/nteract/papermill/issues/583#issuecomment-791988091
    @classmethod
    async def async_execute_managed_notebook(
            cls,
            nb_man,
            kernel_name,
            log_output=False,
            stdout_file=None,
            stderr_file=None,
            start_timeout=60,
            execution_timeout=None,
            **kwargs
    ):
        """
        Performs the actual execution of the parameterized notebook locally.
        Args:
            nb_man (NotebookExecutionManager): Wrapper for execution state of a notebook.
            kernel_name (str): Name of kernel to execute the notebook against.
            log_output (bool): Flag for whether or not to write notebook output to the
                configured logger.
            start_timeout (int): Duration to wait for kernel start-up.
            execution_timeout (int): Duration to wait before failing execution (default: never).
        """

        def create_client():  # TODO: should be static
            # Exclude parameters that are named differently downstream
            safe_kwargs = remove_args(['timeout', 'startup_timeout'], **kwargs)

            # Nicely handle preprocessor arguments prioritizing values set by engine
            final_kwargs = merge_kwargs(
                safe_kwargs,
                timeout=execution_timeout if execution_timeout else kwargs.get('timeout'),
                startup_timeout=start_timeout,
                kernel_name=kernel_name,
                log=logger,
                log_output=log_output,
                stdout_file=stdout_file,
                stderr_file=stderr_file,
            )
            cls.logger.info(f"Created papermill notebook client for {key}")
            return PapermillNotebookClient(nb_man, **final_kwargs)

        # TODO: pass client_id
        key = EngineKey("", nb_man.nb['metadata']['papermill']['input_path'])
        engine_holder: EngineHolder = cls.get_or_create_engine_metadata(key, create_client)
        return await engine_holder.async_execute(nb_man)

    @classmethod
    def create_logger(cls):
        cls.logger = logging.getLogger('engine')

    @classmethod
    def set_out_of_use_engine_time(cls, value: int):
        cls.out_of_use_engine_time = value

    @classmethod
    def get_or_create_engine_metadata(cls, key: EngineKey, func):
        cls.remove_out_of_date_engines(key)

        engine_holder: EngineHolder = cls.metadata_dict.get(key)
        if engine_holder is None:
            engine_holder = EngineHolder(key, func())
            cls.metadata_dict[key] = engine_holder

        return engine_holder

    @classmethod
    def remove_out_of_date_engines(cls, exclude_key: EngineKey):
        now = time.time()
        dead_line = now - cls.out_of_use_engine_time
        out_of_use_engines = [key for key, metadata in cls.metadata_dict.items() if
                              key != exclude_key and metadata.get_last_used_time() < dead_line]
        for key in out_of_use_engines:
            engine_holder: EngineHolder = cls.metadata_dict.pop(key)
            engine_holder.close()
            cls.logger.info(
                f"unregistered '{key}' papermill engine, last used time {now - engine_holder.get_last_used_time()} sec ago")


class CustomEngines(PapermillEngines):
    async def async_execute_notebook_with_engine(self, engine_name, nb, kernel_name, **kwargs):
        """Fetch a named engine and execute the nb object against it."""
        return await self.get_engine(engine_name).async_execute_notebook(nb, kernel_name, **kwargs)


# Instantiate a CustomEngines instance, register the default engine and entry points
exactpro_papermill_engines = CustomEngines()
exactpro_papermill_engines.register(None, CustomEngine)
exactpro_papermill_engines.register_entry_points()
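
The reuse pattern in this file (one cached client per key, a busy guard, and eviction of entries that have been idle longer than a timeout) can be sketched independently of papermill. The following is a minimal, hypothetical illustration: `Holder`, `HolderCache`, `BusyError` and the injectable `clock` are names invented here and are not part of the commit or of papermill. In this sketch the idle timer is refreshed after each run and busy entries are never evicted; both are design choices of the sketch.

```python
import asyncio
import time


class BusyError(RuntimeError):
    """Raised when a cached holder is asked to run while a run is in progress."""


class Holder:
    def __init__(self, key, clock=time.time):
        self.key = key
        self._clock = clock
        self.last_used = clock()
        self.busy = False

    async def run(self, job):
        # Reject overlapping runs instead of queueing them.
        if self.busy:
            raise BusyError(f"'{self.key}' is busy")
        self.busy = True
        try:
            return await job()
        finally:
            self.busy = False
            # Refresh the idle timer so eviction counts from the last run.
            self.last_used = self._clock()


class HolderCache:
    def __init__(self, max_idle_seconds, clock=time.time):
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._holders = {}

    def get_or_create(self, key):
        # Evict stale entries first, but never the key being requested.
        self._evict_idle(exclude=key)
        holder = self._holders.get(key)
        if holder is None:
            holder = Holder(key, self._clock)
            self._holders[key] = holder
        return holder

    def _evict_idle(self, exclude):
        deadline = self._clock() - self._max_idle
        stale = [k for k, h in self._holders.items()
                 if k != exclude and not h.busy and h.last_used < deadline]
        for key in stale:
            del self._holders[key]

    def keys(self):
        return set(self._holders)
```

Injecting the clock keeps the eviction logic testable without sleeping; a caller that hits `BusyError` can report the conflict to the user rather than start a second run against the same kernel.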
20 changes: 17 additions & 3 deletions log_configuratior.py → json_stream_provider/log_configuratior.py
@@ -1,8 +1,24 @@
# Copyright 2024 Exactpro (Exactpro Systems Limited)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging.config
import os

log4py_file = '/var/th2/config/log4py.conf'
-def configureLogging():
+
+
+def configure_logging():
    if os.path.exists(log4py_file):
        logging.config.fileConfig(log4py_file, disable_existing_loggers=False)
        logging.getLogger(__name__).info(f'Logger is configured by {log4py_file} file')
@@ -30,5 +46,3 @@ def configureLogging():
}
logging.config.dictConfig(default_logging_config)
logging.getLogger(__name__).info('Logger is configured by default')
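
The fallback pattern used by `configure_logging` (load a file-based config when it exists, otherwise apply a built-in default) can be sketched as follows. The default dictionary is hidden in the collapsed part of this diff, so the one below is a hypothetical stand-in, as are the function's parameters and its return value.

```python
import logging
import logging.config
import os


def configure_logging(config_file, level="INFO"):
    """Use an INI-style logging config if present, else a built-in default.

    Returns "file" or "default" to show which branch was taken
    (the return value is an addition of this sketch).
    """
    if os.path.exists(config_file):
        logging.config.fileConfig(config_file, disable_existing_loggers=False)
        return "file"
    # Hypothetical default: a single console handler on the root logger.
    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
        },
        "handlers": {
            "console": {"class": "logging.StreamHandler", "formatter": "plain"},
        },
        "root": {"handlers": ["console"], "level": level},
    })
    return "default"
```

Passing `disable_existing_loggers=False` in both branches keeps loggers created at import time, such as the `engine` logger, active after configuration.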

