
Improve logging and error handling when ingesting an entire folder #1132

Merged 3 commits on Oct 30, 2023

Conversation

@NetroScript (Contributor)

Currently, ingesting an entire folder using the command-line script can be quite annoying when errors happen.
Additionally, there is little feedback on the overall progress when there are many files to be ingested.

For this reason, this commit adds two things:

  • Wrapping each individual document in a try/except block (sketched below)
  • Improved logging
    • Count total documents and show the current document count
    • Optionally log to a file
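
For illustration, here is a minimal sketch of what such a loop could look like; documents and ingest are hypothetical stand-ins for the script's real objects, not the PR's exact code:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

documents = ["a.txt", "b.txt", "c.txt"]  # hypothetical input list

def ingest(path: str) -> None:
    ...  # the real ingestion work would happen here

total = len(documents)
for count, path in enumerate(documents, start=1):
    try:
        logger.info(f"({count}/{total}) Ingesting {path}")
        ingest(path)
    except Exception as e:
        # One failing document no longer aborts the whole run.
        logger.error(f"Failed to ingest document: {path}. Error: {e}")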

Using the following command as an example:

poetry run python ./scripts/ingest_folder.py ./tests --log-file ./test.log

The new console output can be seen here:

[screenshot: new console output]

With the following log file being created:

[screenshot: created log file]

@pabloogc (Collaborator) previously approved these changes on Oct 29, 2023

Please run the formatter and lint: make check

@lopagela (Contributor)

Hey - I think loguru should be dropped from this project, as it doesn't give the ability to see the libraries' logs.

See #1133 for usage of the built-in Python logging.

@NetroScript (Contributor, Author)

The black tool runs correctly for the file; however, with the current configuration of the Makefile, neither ruff nor mypy gets run for this specific file (at least not on the laptop I am on right now, as I am not at my home PC).

For that reason, I manually ran:

poetry run ruff check .\scripts\ingest_folder.py tests --fix
poetry run mypy .\scripts\ingest_folder.py

Originally I used the standard logging library for the code I adjusted for myself, but when I decided to open a PR, I checked the other files, found loguru being used, and adjusted my code to it.

I can also change it back, but I would say the decision lies with the repository maintainers.

@pabloogc (Collaborator)

@NetroScript check this PR: #1133. We dropped loguru altogether and moved to the default logger.

Sorry for the ping-pong changes, but if you can adapt your PR to use the default logger, I'll give it the green light.

Commit: This adds a total document count and also optionally logs processing start, completion, and errors to a file.
@NetroScript (Contributor, Author) commented on Oct 29, 2023

I rebased onto main; logging looks like this now:

Console:
[screenshot: console output]

File:
[screenshot: log file output]

Without a custom formatter for the file handler, only the messages are written to the log, so I decided on a more uniformly formatted output with less information (no module name).
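
(For reference, a standalone snippet showing that default; the handler and logger names are made up:)

import logging

handler = logging.FileHandler("demo.log")  # note: no setFormatter(...) call
log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("hello")  # demo.log contains just "hello": the fallback formatter is "%(message)s"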

I am not entirely sure what the behavior of the logging library is (whether it loads the config from private_gpt\__init__.py), but it at least looked the same, so I did not add anything to scripts\__init__.py or similar.

Comment on lines +72 to +73

except Exception as e:
    logger.error(f"Failed to ingest document: {changed_path}. Error: {e}")
@lopagela (Contributor)

Using exception instead of error adds the stack trace after the log message. I don't know if you want that, or if you just want the string representation of the error, so I'm putting it as a suggestion:

Suggested change:

- except Exception as e:
-     logger.error(f"Failed to ingest document: {changed_path}. Error: {e}")
+ except Exception as e:
+     logger.exception(f"Failed to ingest document: {changed_path}. Error: {e}")

@NetroScript (Contributor, Author)

As I don't know what the common practice is, I don't have a personal opinion on it.

The only point that comes to mind which potentially speaks against it is that the structure of the log file will be interrupted, which might make parsing it more difficult.
After all, JSON objects are commonly logged for this reason: they are still one entry on one line, but can include a stack trace, for example (see the sketch after this comment).

I am happy to make the change if it is deemed the better option.
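
For illustration, a minimal sketch of such single-line JSON logging; the formatter below is an assumption, not something the PR contains:

import json
import logging
import traceback

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, traceback included."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        if record.exc_info:
            entry["traceback"] = "".join(traceback.format_exception(*record.exc_info))
        return json.dumps(entry)  # still a single line, even with a stack trace

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo")
logger.addHandler(handler)

try:
    1 / 0
except Exception:
    logger.exception("Failed to ingest document")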

@lopagela (Contributor)

> After all, JSON objects are commonly logged for this reason: they are still one entry on one line, but can include a stack trace, for example.

I agree.

> The only point that comes to mind which potentially speaks against it is that the structure of the log file will be interrupted, which might make parsing it more difficult.

If programmatic parsing is wanted, then indeed it might not be the best solution, but if this file is meant to be read by humans, I'd suggest using exception, as it keeps (and displays) the full error instead of the error message only (making troubleshooting way easier).

@lopagela (Contributor)

> Without a custom formatter for the file handler, only the messages are written to the log, so I decided on a more uniformly formatted output with less information (no module name).
>
> I am not entirely sure what the behavior of the logging library is (whether it loads the config from private_gpt\__init__.py), but it at least looked the same, so I did not add anything to scripts\__init__.py or similar.

@NetroScript this is what is happening under the hood when you call python scripts/ingest_folder.py (a tiny runnable demonstration follows the list):

Python loads the file scripts/ingest_folder.py and starts to interpret it:

  • Process import argparse
  • etc ...
  • Process from private_gpt.di import root_injector
    • Start by running private_gpt/__init__.py, which initializes the root logger
    • Then run private_gpt/di.py
    • Finally, import the symbol root_injector into the current scope
  • etc ...
  • ingest_service = root_injector.get(IngestService) gets executed
    • It basically creates a Python object with all its dependencies injected into its constructor, so all the required objects get initialized
    • At this point, all the models are loaded
  • Then the args are parsed, etc.
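
A tiny, self-contained demonstration of that import behavior (throwaway files written to a temp directory; all names hypothetical):

import os
import sys
import tempfile

# Create a throwaway package on disk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "pkg"))
with open(os.path.join(root, "pkg", "__init__.py"), "w") as f:
    f.write('print("pkg/__init__.py runs first")\n')
with open(os.path.join(root, "pkg", "mod.py"), "w") as f:
    f.write("value = 42\n")

sys.path.insert(0, root)
from pkg.mod import value  # importing the submodule executes pkg/__init__.py first

print(value)  # 42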

At this point, there is a logger that has been configured with logging.basicConfig (in private_gpt/__init__.py).

One can use logging.basicConfig before importing private_gpt, and, as explained in the logging documentation:

> logging.basicConfig(**kwargs)
> This function does nothing if the root logger already has handlers configured, unless the keyword argument force is set to True.

In a nutshell, the first call to logging.basicConfig takes priority with default arguments. If this is initialized with your own arguments in scripts/ingest_folder.py before private_gpt is imported, the log configuration will be as you want ✌️
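
(A quick standalone way to see that documented behavior:)

import logging

logging.basicConfig(level=logging.DEBUG)  # first call: configures the root logger
logging.basicConfig(level=logging.ERROR)  # no-op: the root logger already has handlers
logging.getLogger(__name__).debug("still visible, because the first call won")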


I hope I did not confuse you 😅

TLDR, proposing:

import logging
import argparse
from pathlib import Path

parser = ...  # build the argument parser
args = parser.parse_args()

logging.basicConfig(filename=args.log_file or None, format="...")

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...

@NetroScript (Contributor, Author)

@lopagela

> [quotes @lopagela's full explanation above]

Thanks for the detailed explanation 👌. I was mainly unsure whether __init__.py was run, as the entry point to the script is different (and my knowledge of Python regarding modules is lacking). Your summary clears that up.


But wouldn't the shorthand

logging.basicConfig(filename=args.log_file or None, format="...")

stop you from using two different formats for the output to the terminal and the file?

And at the same time, a basicConfig is already supplied before this code runs, so, as the documentation says:

> This function does nothing if the root logger already has handlers configured, unless the keyword argument force is set to True.

And when I tested force=True locally, the previous handler got replaced entirely instead of being appended to.
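
(A standalone snippet reproducing that observation:)

import logging

logging.basicConfig()  # the root logger gets a stderr StreamHandler
print(logging.getLogger().handlers)  # one StreamHandler

logging.basicConfig(filename="demo.log", force=True)
print(logging.getLogger().handlers)  # only a FileHandler; the StreamHandler was removed, not kept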

@lopagela (Contributor)

> But wouldn't the shorthand
>
> logging.basicConfig(filename=args.log_file or None, format="...")
>
> stop you from using two different formats for the output to the terminal and the file?

@NetroScript You are completely correct; my bad, I misunderstood that you wanted to have the logs on both stderr and in the log_file 😅

Then, your solution is still valid -- which behavior do you want to have?

  • Only put the logs emitted from scripts/ingest_folder.py in the log file (and the rest goes to stderr)
  • Put all the logs emitted in the scope of this script in the log file + stderr

For the first case, I'd suggest doing:

import logging
import argparse
from pathlib import Path

logger = logging.getLogger(__name__)

parser = ...  # build the argument parser
args = parser.parse_args()

# Set up logging to a file if a path is provided
if args.log_file:
    file_handler = logging.FileHandler(args.log_file, mode="a")
    file_handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s.%(msecs)03d] [%(levelname)s] %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(file_handler)

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...
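
(One thing worth spelling out about the first case: the FileHandler is attached to this module's logger, so only records emitted through logging.getLogger(__name__) in the script land in the file, while those records, and the libraries' records, still propagate to the root logger's stderr handler.)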

For the second case, I'd suggest doing:

import logging
import sys
import argparse
from pathlib import Path

parser = ...  # build the argument parser
args = parser.parse_args()

file_handler = ...  # a FileHandler set up as in the first case
logging.basicConfig(level="...", format="...", handlers=[logging.StreamHandler(sys.stderr), file_handler])

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...

@lopagela (Contributor) left a review:

The current solution is valid in the sense that the log file will contain the logs of this file only.

Code looks good; still awaiting some clarification on exactly which behavior is wanted ✌️


@imartinez merged commit b0e2582 into zylon-ai:main on Oct 30, 2023. 6 checks passed.