Improve logging and error handling when ingesting an entire folder #1132
Please run the formatter and lint
Hey - I think See #1133 for usage of builtin python logs
The For that reason I manually ran
poetry run ruff check .\scripts\ingest_folder.py tests --fix
poetry run mypy .\scripts\ingest_folder.py
Originally I also used the I can also change it back, but I would say the decision lies with the repository maintainers.
@NetroScript check this PR #1133, we dropped loguru altogether and moved to the default logger. Sorry for the ping-pong changes, but if you can adapt your PR to use the default logger I'll give the green light.
This adds a total document count and also optionally logs processing start, completion and error to a file.
48f6618 to 776fd7a
I rebased onto main, logging looks like this now:
Without a custom formatter for the file handler, just the messages are written to the log, so I decided on using a more uniformly formatted output with less information (no module name).
I am not entirely sure what the behavior of the logger library is (if it loads the config of
except Exception as e:
    logger.error(f"Failed to ingest document: {changed_path}. Error: {e}")
Using exception instead of error adds the stack trace after the log message. I don't know if you want it, or if you just want the string representation of the error, so I am putting it as a suggestion:
- except Exception as e:
-     logger.error(f"Failed to ingest document: {changed_path}. Error: {e}")
+ except Exception as e:
+     logger.exception(f"Failed to ingest document: {changed_path}. Error: {e}")
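To make the difference concrete, here is a small self-contained sketch (not taken from the PR) that logs the same failure once with `logger.error` and once with `logger.exception`, capturing the output in an in-memory stream. The logger name, file name, and error message are made up for the demonstration.

```python
import io
import logging

stream = io.StringIO()
demo_logger = logging.getLogger("error_vs_exception_demo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo output out of the root logger
demo_logger.addHandler(logging.StreamHandler(stream))


def log_failure(use_exception: bool) -> str:
    """Log one simulated ingestion failure and return what was written."""
    stream.seek(0)
    stream.truncate(0)
    try:
        raise ValueError("unsupported file type")
    except Exception as e:
        if use_exception:
            # Same level as error(), but appends the current traceback.
            demo_logger.exception("Failed to ingest document: example.pdf. Error: %s", e)
        else:
            demo_logger.error("Failed to ingest document: example.pdf. Error: %s", e)
    return stream.getvalue()


error_output = log_failure(use_exception=False)
exception_output = log_failure(use_exception=True)
```

With `error`, the entry is a single line; with `exception`, the same line is followed by the multi-line traceback.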
As I don't know what the common practice is, I don't have a personal opinion on it.
The only point that comes to mind against it is that the structure of the log file will be interrupted, which might make parsing it more difficult.
After all, JSON objects are commonly logged for this reason: each entry stays on one line but can still include a stack trace, for example.
I am happy to make the change if it is deemed the better option.
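As an illustration of the JSON idea mentioned here, the following sketch shows one way to keep each log entry on a single line while still carrying the stack trace. It is not part of the PR; `JsonLineFormatter`, the field names, and the logger name are assumptions made for the example.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each record as one JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Newlines in the traceback are escaped by json.dumps,
            # so the entry still occupies a single line.
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for the log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLineFormatter())

json_logger = logging.getLogger("json_line_demo")
json_logger.setLevel(logging.INFO)
json_logger.propagate = False
json_logger.addHandler(handler)

try:
    raise ValueError("unsupported file type")
except Exception:
    json_logger.exception("Failed to ingest document: example.pdf")

lines = stream.getvalue().splitlines()
parsed = json.loads(lines[0])
```

Each line of the resulting log can be parsed independently with `json.loads`, which addresses the parsing concern at the cost of being harder to read by eye.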
After all, JSON objects are commonly logged for this reason: each entry stays on one line but can still include a stack trace, for example.
I agree
The only point that comes to mind against it is that the structure of the log file will be interrupted, which might make parsing it more difficult.
If programmatic parsing is wanted, then indeed, it might not be the best solution, but if this file is meant to be read by humans, I'd suggest using exception, as it keeps (and displays) the full error instead of the error message only (making troubleshooting way easier).
@NetroScript this is what is happening under the hood when you are calling Python loads the file
At this point, there is a logger that has been configured with One can use
In a nutshell, the first call to
I hope I did not confuse you 😅
TLDR proposing:
import logging
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()  # ... (argument definitions elided)
args = parser.parse_args()
logging.basicConfig(filename=args.log_file or None, format="...")
# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...
Thanks for the detailed explanation 👌. I was mainly unsure if
But wouldn't the shorthand logging.basicConfig(filename=args.log_file or None, format="...") stop you from using two different formats for the output to the terminal and the file? And at the same time, a
And when I tested
@NetroScript You are completely correct; my bad, I misunderstood that you wanted to have the logs from
Then, your solution is still valid -- what behavior do you want to have:
For the first case, I'd suggest doing:
import logging
import argparse
from pathlib import Path

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()  # ... (argument definitions elided)
args = parser.parse_args()

# Set up logging to a file if a path is provided
if args.log_file:
    file_handler = logging.FileHandler(args.log_file, mode="a")
    file_handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s.%(msecs)03d] [%(levelname)s] %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(file_handler)

# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...
For the second case, I'd suggest doing:
import logging
import argparse
import sys  # needed for sys.stderr below
from pathlib import Path

parser = argparse.ArgumentParser()  # ... (argument definitions elided)
args = parser.parse_args()
file_handler = logging.FileHandler(args.log_file, mode="a")  # ... (set up as in the first case)
logging.basicConfig(level="...", format="...", handlers=[logging.StreamHandler(sys.stderr), file_handler])
# Args are parsed, and logs configured, the job can start
from private_gpt.di import root_injector
# ...
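On the earlier question of two different formats: `logging.basicConfig` only assigns its `format` to handlers that do not already have a formatter, so a handler configured beforehand keeps its own. The sketch below demonstrates this under a few assumptions that are not in the thread: in-memory streams stand in for the terminal and the log file, the format strings are invented, and `force=True` (Python 3.8+) is used so the call takes effect even if the root logger was already configured.

```python
import io
import logging

console_stream = io.StringIO()  # stands in for sys.stderr
file_stream = io.StringIO()     # stands in for the log file

# This handler gets its own formatter before basicConfig is called.
file_handler = logging.StreamHandler(file_stream)
file_handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))

# This handler has no formatter, so basicConfig will assign `format` to it.
console_handler = logging.StreamHandler(console_stream)

logging.basicConfig(
    level=logging.INFO,
    format="%(name)s - %(message)s",
    handlers=[console_handler, file_handler],
    force=True,  # replace any handlers already attached to the root logger
)

# Any module's logger now propagates to both root handlers.
logging.getLogger("demo.module").info("Ingested 1/3")

console_output = console_stream.getvalue()
file_output = file_stream.getvalue()
```

Because both handlers sit on the root logger, records from every module reach both destinations, each in its own format.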
The current solution is valid in the sense that the log file will contain the logs of this file only
Code looks good, still awaiting some clarification on exactly which behavior is expected ✌️
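The point about the log file containing only this file's logs follows from where the handler is attached: a handler added to a named logger receives records from that logger (and its children), not from unrelated loggers. A small sketch of that behavior, with made-up logger names and an in-memory stream standing in for the log file:

```python
import io
import logging

script_log = io.StringIO()  # stands in for the log file

# Handler attached to the script's own logger only.
script_logger = logging.getLogger("ingest_script_demo")
script_logger.setLevel(logging.INFO)
script_logger.addHandler(logging.StreamHandler(script_log))

# A logger from an unrelated module; it has no link to the handler above.
other_logger = logging.getLogger("some_other_module_demo")
other_logger.setLevel(logging.INFO)

script_logger.info("from the script")
other_logger.info("from another module")

captured = script_log.getvalue()
```

Attaching the same handler to the root logger instead would capture both messages, which is the choice between the two cases discussed above.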
Currently, ingesting an entire folder with the command-line script can be quite annoying when errors happen.
Additionally, there is little feedback on the overall progress when there are many files to be ingested.
For this reason this commit adds two things:
try except block
Using the following command as an example:
The new console output can be seen here:
With the following log file being created: