Improve logging and error handling when ingesting an entire folder #1132

NetroScript · 2023-10-29T02:53:12Z

Currently, ingesting an entire folder with the command-line script can be quite frustrating when errors occur.
Additionally, there is little feedback on the overall progress when many files are being ingested.

For this reason, this commit adds two things:

- Wrap the ingestion of each individual document in a try/except block
- Improve logging
  - Count the total number of documents and show the current document count
  - Optionally log to a file
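
As a rough illustration of the two points above, here is a minimal sketch of a loop that isolates each document in its own try/except and reports progress as "current/total". It is not the code from this PR: `ingest_all` and the `ingest_one` callable are hypothetical names that stand in for whatever actually ingests a single file.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, List

logger = logging.getLogger("ingest_sketch")


def ingest_all(paths: Iterable[Path], ingest_one: Callable[[Path], None]) -> List[Path]:
    """Ingest every path, logging progress; return the paths that failed.

    `ingest_one` is a hypothetical callable standing in for the real ingestion.
    """
    documents = list(paths)
    total = len(documents)
    failed: List[Path] = []
    for index, path in enumerate(documents, start=1):
        logger.info("Started ingesting file %d/%d: %s", index, total, path)
        try:
            ingest_one(path)
        except Exception as e:  # one bad file should not abort the whole run
            logger.error("Failed to ingest document: %s. Error: %s", path, e)
            failed.append(path)
        else:
            logger.info("Completed ingesting file %d/%d: %s", index, total, path)
    return failed
```

Returning the failed paths is a design choice of this sketch only; it makes it easy to retry or report them once the run has finished.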

Using the following command as an example:

poetry run python ./scripts/ingest_folder.py ./tests --log-file ./test.log

The new console output can be seen here:

With the following log file being created:

pabloogc · 2023-10-29T11:55:17Z

Please run the formatter and linter: make check

lopagela · 2023-10-29T16:12:12Z

Hey - I think loguru should be dropped from this project, as it does not let us see the libraries' logs.

See #1133 for usage of builtin python logs

NetroScript · 2023-10-29T17:00:02Z

The black tool runs correctly on the file. However, with the current configuration of the makefile, neither ruff nor mypy gets run for this specific file (at least not on the laptop I am on right now, as I am not on my home PC).

For that reason I manually ran

poetry run ruff check .\scripts\ingest_folder.py tests --fix
poetry run mypy .\scripts\ingest_folder.py

Originally I also used the logging module for the code I adjusted for myself, but when I decided to open a PR, I checked the other files and found loguru being used, so I switched my code to it.

I can also change it back, but I would say the decision lies with the repository maintainers.

pabloogc · 2023-10-29T18:13:36Z

@NetroScript check this PR #1133, we dropped loguru altogether and moved to the default logger.

Sorry for the ping pong changes but if you can adapt your PR to use the default logger I'll give the green.

This adds a total document count and also optionally logs processing start, completion and error to a file.

See zylon-ai#1133

NetroScript · 2023-10-29T21:47:40Z

I rebased onto main, logging looks like this now:

Console:

File:

Without a custom formatter for the file handler, just the messages are written to the log, so I decided on using a more uniformly formatted output with less information (no module name).

I am not entirely sure what the behavior of the logger library is (if it loads the config of private_gpt\__init__.py), but it at least looked the same so I did not add anything to scripts\__init__.py or similar.
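
To illustrate the remark about the file handler's formatter, here is a small sketch, assuming nothing beyond the standard logging module. It writes the same record through one FileHandler without a formatter and one with a custom formatter. The logger name, temporary paths and message are made up for the demo.

```python
import logging
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.log")
formatted_path = os.path.join(tmp_dir, "formatted.log")

demo = logging.getLogger("file_formatter_demo")  # hypothetical logger name
demo.setLevel(logging.INFO)
demo.propagate = False  # keep the demo output out of the root logger

# Handler without a formatter: only the bare message is written
plain_handler = logging.FileHandler(plain_path, mode="a")

# Handler with a custom formatter: timestamp and level are added
formatted_handler = logging.FileHandler(formatted_path, mode="a")
formatted_handler.setFormatter(
    logging.Formatter(
        "[%(asctime)s.%(msecs)03d] [%(levelname)s] %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
)

demo.addHandler(plain_handler)
demo.addHandler(formatted_handler)
demo.info("Started ingesting file 1/3")

# Flush and detach the handlers before reading the files back
for handler in (plain_handler, formatted_handler):
    handler.close()
    demo.removeHandler(handler)

with open(plain_path) as f:
    plain_text = f.read()
with open(formatted_path) as f:
    formatted_text = f.read()
```

Here `plain_text` holds only the message, while `formatted_text` carries the timestamp and level prefix in front of it.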

lopagela · 2023-10-29T22:39:56Z

scripts/ingest_folder.py

+    except Exception as e:
+        logger.error(f"Failed to ingest document: {changed_path}. Error: {e}")


Using logger.exception instead of logger.error adds the stack trace after the log message. I don't know if you want it, or if you just want the string representation of the error, so I'm putting it as a suggestion:

Suggested change

-    except Exception as e:
-        logger.error(f"Failed to ingest document: {changed_path}. Error: {e}")
+    except Exception as e:
+        logger.exception(f"Failed to ingest document: {changed_path}. Error: {e}")

As I don't know what the common practice is, I don't have a personal opinion on it.

The only point coming to my mind which potentially speaks against it, is that the structure of the log file will be interrupted, which might make parsing it more difficult.
After all, commonly JSON objects are logged for this reason, as they are still one entry, one line, but allow to include a stack trace for example.

I am happy to do the change if it is deemed as the better option.
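
For reference, the "one JSON object per line" approach mentioned above could look roughly like the following sketch. It is an assumption about how one might do it with the standard library, not something this PR implements. `JsonLineFormatter` and the field names are hypothetical.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # json.dumps escapes the newlines, so the trace stays on one line
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)


json_stream = io.StringIO()  # stands in for the log file
json_handler = logging.StreamHandler(json_stream)
json_handler.setFormatter(JsonLineFormatter())

json_logger = logging.getLogger("json_line_demo")  # hypothetical logger name
json_logger.propagate = False
json_logger.addHandler(json_handler)

try:
    raise ValueError("unsupported file type")
except Exception:
    json_logger.exception("Failed to ingest document: %s", "example.bin")

json_lines = json_stream.getvalue().splitlines()
parsed_entry = json.loads(json_lines[0])
```

The log still contains exactly one line per entry, and the stack trace can be recovered from the `traceback` field when parsing.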

After all, commonly JSON objects are logged for this reason, as they are still one entry, one line, but allow to include a stack trace for example.

I agree

The only point coming to my mind which potentially speaks against it, is that the structure of the log file will be interrupted, which might make parsing it more difficult.

If programmatic parsing is wanted, then indeed, it might not be the best solution. But if this file is meant to be read by humans, I'd suggest using exception, as it keeps (and displays) the full error instead of the error message only, which makes troubleshooting much easier.
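
The difference between the two calls can be seen in a short sketch using only the standard logging module. The logger name and the failing `int()` call are placeholders for the demo.

```python
import io
import logging

stream = io.StringIO()  # captures the output so both variants can be compared
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))

demo_logger = logging.getLogger("error_vs_exception_demo")  # hypothetical name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

try:
    int("not a number")
except Exception as e:
    # logger.error: a single line holding only the string form of the error
    demo_logger.error("Failed to ingest document: %s. Error: %s", "example.txt", e)
    error_output = stream.getvalue()

    # logger.exception: the same message at ERROR level, followed by the traceback
    demo_logger.exception("Failed to ingest document: %s. Error: %s", "example.txt", e)
    exception_output = stream.getvalue()[len(error_output):]
```

`error_output` is a single line, whereas `exception_output` spans several lines because the traceback is appended, which is the trade-off between easy parsing and easy troubleshooting discussed in this thread.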

lopagela · 2023-10-29T23:20:34Z

Without a custom formatter for the file handler, just the messages are written to the log, so I decided on using a more uniformly formatted output with less information (no module name).

I am not entirely sure what the behavior of the logger library is (if it loads the config of private_gpt\__init__.py), but it at least looked the same so I did not add anything to scripts\__init__.py or similar.

@NetroScript this is what happens under the hood when you call python scripts/ingest_folder.py:

Python loads the file scripts/ingest_folder.py and starts interpreting it:

- Process import argparse
- etc ...
- Process from private_gpt.di import root_injector
  - Start by running private_gpt/__init__.py, which initializes the root logger
  - Then run private_gpt/di.py
  - Finally import the symbol root_injector into the current scope
- etc ...
- ingest_service = root_injector.get(IngestService) gets executed
  - It basically creates a Python object with all its dependencies injected into its constructor, so all the required objects get initialized
  - At this point, all the models are loaded
- Then it parses the args, etc.

At this point, there is a logger that has been configured with logging.basicConfig (in private_gpt/__init__.py)

One can use logging.basicConfig before importing private_gpt, and, as explained in the logging documentation:

logging.basicConfig(**kwargs)
This function does nothing if the root logger already has handlers configured, unless the keyword argument force is set to True.

In a nutshell, the first call to logging.basicConfig takes priority, even if it was made with default arguments. If it is called with your own arguments in scripts/ingest_folder.py before private_gpt is imported, the log configuration will be as you want ✌️
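
A minimal sketch of the "first call wins" behavior, using only the standard library. The format strings are arbitrary, and the root handlers are cleared first purely so the demo starts from a clean state.

```python
import logging

root = logging.getLogger()
for existing in list(root.handlers):
    root.removeHandler(existing)  # start from a clean state for the demo

# First call: installs a handler on the root logger
logging.basicConfig(format="first: %(message)s")
handlers_after_first = list(root.handlers)

# Second call: does nothing, because the root logger already has a handler
logging.basicConfig(format="second: %(message)s")
handlers_after_second = list(root.handlers)

# Render a record through the installed handler to see which format is active
rendered = root.handlers[0].format(logging.makeLogRecord({"msg": "hello"}))
```

After both calls, `rendered` still uses the format from the first call, and the handler list is unchanged.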

I hope I did not confuse you 😅

TLDR proposing:

import logging
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()  # ... add the arguments, including --log-file
args = parser.parse_args()

# Configure the root logger before private_gpt is imported
logging.basicConfig(filename=args.log_file or None, format="...")

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...

NetroScript · 2023-10-29T23:39:06Z

@lopagela

Thanks for the detailed explanation 👌. I was mainly unsure whether __init__.py was run, as the entry point to the script is different (and my knowledge of Python modules is lacking). Your summary clears that up.

But wouldn't the shorthand

logging.basicConfig(filename=args.log_file or None, format = "...")

stop you from using two different formats for the output to the terminal and the file?

And at the same time, a basicConfig is already supplied before this code is run, so as the documentation says:

This function does nothing if the root logger already has handlers configured, unless the keyword argument force is set to True.

And when I tested force=True locally, the previous handler gets replaced entirely instead of appended to.
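
The observation about force=True can be reproduced with a small sketch (standard library only, Python 3.8 or later for the force argument). The StringIO streams are stand-ins for the console and the log file.

```python
import io
import logging

root = logging.getLogger()
for existing in list(root.handlers):
    root.removeHandler(existing)  # start from a clean state for the demo

# Simulates the configuration already applied by an earlier basicConfig call
logging.basicConfig(stream=io.StringIO())
original = root.handlers[0]

# force=True removes and closes the existing handlers before installing new ones
replacement = logging.StreamHandler(io.StringIO())
logging.basicConfig(handlers=[replacement], force=True)
handlers_after_force = list(root.handlers)

# To append instead of replace, add the extra handler explicitly
extra = logging.StreamHandler(io.StringIO())
root.addHandler(extra)
handlers_after_add = list(root.handlers)
```

With force=True the original handler is gone, while addHandler keeps whatever is already configured and adds to it.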

lopagela · 2023-10-30T20:39:49Z

But wouldn't the shorthand

logging.basicConfig(filename=args.log_file or None, format = "...")

stop you from using two different formats for the output to the terminal and the file?

@NetroScript You are completely correct; my bad, I misunderstood that you wanted to have the logs from stderr and the log_file 😅

Then, your solution is still valid -- what behavior do you want to have:

- Only put in the log file the logs emitted from scripts/ingest_folder.py (and the rest goes to stderr)
- Put all the logs emitted in the scope of this script in the log file + stderr

For the first case, I'd suggest doing:

import logging
import argparse
from pathlib import Path

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()  # ...
args = parser.parse_args()

# Set up logging to a file if a path is provided
if args.log_file:
    file_handler = logging.FileHandler(args.log_file, mode="a")
    file_handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s.%(msecs)03d] [%(levelname)s] %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(file_handler)

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...

For the second case, I'd suggest doing:

import logging
import argparse
import sys
from pathlib import Path

parser = argparse.ArgumentParser()  # ...
args = parser.parse_args()

# Build the file handler as in the first case
file_handler = logging.FileHandler(args.log_file, mode="a")  # ...
logging.basicConfig(
    level="...",
    format="...",
    handlers=[logging.StreamHandler(sys.stderr), file_handler],
)

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...

lopagela

The current solution is valid in the sense that the log file will contain the logs of this file only.

Code looks good; I'm still awaiting clarification on exactly which behavior is expected ✌️

pabloogc previously approved these changes Oct 29, 2023

NetroScript dismissed pabloogc’s stale review via 48f6618 October 29, 2023 16:54

NetroScript added 3 commits October 29, 2023 22:09

Improve logging when ingesting an entire folder

3d5a78f

This adds a total document count and also optionally logs processing start, completion and error to a file.

Lint and reformat code

4969889

Remove loguru in favor of simple logging module

776fd7a

See zylon-ai#1133

NetroScript force-pushed the improve-ingestion-logging branch from 48f6618 to 776fd7a Compare October 29, 2023 21:43

lopagela reviewed Oct 29, 2023

lopagela approved these changes Oct 30, 2023

imartinez approved these changes Oct 30, 2023

imartinez merged commit b0e2582 into zylon-ai:main Oct 30, 2023
6 checks passed
