Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make wfdb.rdann, wfdb.rdrecord, wfdb.rdsamp also accept pathlib.Path objects as the input record_name #346

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

wenh06
Copy link
Contributor

@wenh06 wenh06 commented Mar 18, 2022

Currently, the functions wfdb.rdann, wfdb.rdrecord, wfdb.rdsamp only accept str for the input record_name, as they have str specific operations. wfdb.rdheader currently accepts pathlib.Path objects as input record_name since the os.path module is able to handle pathlib.Path as well as str type input.

For example, on a linux machine

import pathlib, wfdb
sp = "/home/wenhao/Jupyter/wenhao/data/CPSC2021/training_I/data_0_1"
pp = pathlib.Path(sp)
wfdb.rdrecord(pp)

would raise error

AttributeError                            Traceback (most recent call last)
<ipython-input-6-754a0c10c458> in <module>
----> 1 wfdb.rdrecord(pp)

~/.local/lib/python3.6/site-packages/wfdb/io/record.py in rdrecord(record_name, sampfrom, sampto, channels, physical, pn_dir, m2s, smooth_frames, ignore_skew, return_res, force_channels, channel_names, warn_empty)
   3426         pn_dir = posixpath.join(dir_list[0], get_version(dir_list[0]), *dir_list[1:])
   3427 
-> 3428     if record_name.endswith('.edf'):
   3429         record = edf2mit(record_name, pn_dir=pn_dir, record_only=True)
   3430     elif record_name.endswith('.wav'):

AttributeError: 'PosixPath' object has no attribute 'endswith'

Sometimes one would like to work locally with pathlib, which was introduced in Python 3.4 (PEP 428), to better handle the file paths.

@bemoody
Copy link
Collaborator

bemoody commented Mar 21, 2022

Thank you! This looks like a great improvement.

I'm not entirely familiar with pathlib so I'm a little unsure of some of the details.

One thing to be wary of is that "file names" or "record names" might be either filesystem paths or URL paths. Some existing functions permit record_name to include a subdirectory when streaming data from PhysioNet; other functions don't allow that, and silently ignore any directory name in record_name (which I consider a bug.)

Here, for example, it looks like you're passing a pathlib.Path object to download._stream_annotation. Is that a good idea or not? I guess that posixpath.join will automatically convert the argument back into a string, but will it correctly use slashes if running on a Windows platform? And does that mean that rdann would now accept backslashes in URLs, only on Windows?

If we do want the download functions to accept a pathlib.Path as an alternative to str then that should be documented.

I'm not necessarily opposed to either saying "pathlib.Path may be used only for local files" or saying "pathlib.Path may be used either for local files or for URLs", but I do think it's important to handle URL arguments in the same way on all hosts.

@wenh06
Copy link
Contributor Author

wenh06 commented Mar 23, 2022

Thank you! This looks like a great improvement.

I'm not entirely familiar with pathlib so I'm a little unsure of some of the details.

One thing to be wary of is that "file names" or "record names" might be either filesystem paths or URL paths. Some existing functions permit record_name to include a subdirectory when streaming data from PhysioNet; other functions don't allow that, and silently ignore any directory name in record_name (which I consider a bug.)

Here, for example, it looks like you're passing a pathlib.Path object to download._stream_annotation. Is that a good idea or not? I guess that posixpath.join will automatically convert the argument back into a string, but will it correctly use slashes if running on a Windows platform? And does that mean that rdann would now accept backslashes in URLs, only on Windows?

If we do want the download functions to accept a pathlib.Path as an alternative to str then that should be documented.

I'm not necessarily opposed to either saying "pathlib.Path may be used only for local files" or saying "pathlib.Path may be used either for local files or for URLs", but I do think it's important to handle URL arguments in the same way on all hosts.

You're correct. On a Windows system, if you pass a Path object (actually a WindowsPath object) for the file_name, then error might occur when this file_name includes a subdirectory for

url = posixpath.join(config.db_index_url, pn_dir, file_name)

in the function download._stream_annotation. Such error can be avoided by forcing converting file_name to a PosixPath using pathlib.PurePosixPath as

url = posixpath.join(config.db_index_url, pn_dir, pathlib.PurePosixPath(file_name))

A drawback of pathlib is that it cannot handle the URLs correctly (mainly the double slashes), for example

>>> pathlib.PurePosixPath('https://physionet.org/files/')
PurePosixPath('https:/physionet.org/files')
>>> db_index_url = 'https://physionet.org/files/'
>>> sp = "./temp/13918_AV_psd.mat"
>>> pp = pathlib.Path(sp)
>>> db_index_url / pathlib.PurePosixPath("mitdb" / pp)
PurePosixPath('https:/physionet.org/files/mitdb/temp/13918_AV_psd.mat')

The following image is an example tested on a Windows machine
image

@wenh06
Copy link
Contributor Author

wenh06 commented Mar 23, 2022

One thing to be wary of is that "file names" or "record names" might be either filesystem paths or URL paths. Some existing functions permit record_name to include a subdirectory when streaming data from PhysioNet; other functions don't allow that, and silently ignore any directory name in record_name (which I consider a bug.)

I also think this is a bug:

>>> wfdb.rdrecord("x_mitdb/x_111", pn_dir="mitdb", sampto=2000)
NetFileNotFoundError: 404 Error: Not Found for url: https://physionet.org/files/mitdb/1.0.0/x_111.hea
>>> wfdb.rdrecord("x_111", pn_dir="mitdb/1.0.0/x_mitdb/", sampto=2000)
<wfdb.io.record.Record at 0x7f6cd4ee7c50>

@wenh06
Copy link
Contributor Author

wenh06 commented Mar 23, 2022

I think there's a lot of work to do to better deal with the paths. For example, I found currently wfdb detects whether a pn_dir contains the version number using

'/' not in db_dir

or

'.' not in pn_dir

I think it's better to specify a pattern for the database version and use re.search to detect it

DB_VERSION_PATTERN = re.compile("\d+.\d+\.\d+")

@cx1111
Copy link
Member

cx1111 commented Apr 25, 2022

Thanks very much for identifying the bugs. We're going to think a bit more about how we want the API to be in the next major version, such that the behavior is more explicit. Especially when adding cloud data sources.

Copy link
Contributor

@Ivorforce Ivorforce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to see this merged!

# local file
if pn_dir is None:
with open(record_name + '.' + extension, 'rb') as f:
with open(file_name, 'rb') as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paths offer a .open function, too, which might be more readable than using the global one.

@@ -4373,7 +4377,7 @@ def wrsamp(record_name, fs, units, sig_name, p_signal=None, d_signal=None,

"""
# Check for valid record name
if '.' in record_name:
if '.' in str(record_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As directories might contain dots, maybe this wants to check for record_name.name instead.

@wenh06
Copy link
Contributor Author

wenh06 commented May 29, 2022

Thanks very much for identifying the bugs. We're going to think a bit more about how we want the API to be in the next major version, such that the behavior is more explicit. Especially when adding cloud data sources.

I agree with you. I tried continuing to correct (enhance) path operations in other places for several hours but finally gave up. I hope that path operations would be designed in a more uniform and modern way in the next major version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants