-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#13 code for verification #34
Conversation
@akorosov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some things to change are indicated in the inline comments.
More generally:
- use library objects in the intended way. Take the time to go through the documentation before starting to code.
- even in a short stand-alone script, split your code into functions. It makes the code easier to read and test.
geospaas_harvesting/verify.py
Outdated
id_range = range(DatasetURI.objects.earliest('id').id, | ||
DatasetURI.objects.latest('id').id, 1000)# <=number for the length of retrieved | ||
for i in range(len(id_range)): | ||
try: | ||
retrieved_dataset_uris = DatasetURI.objects.filter( | ||
id__gte=id_range[i], id__lt=id_range[i+1]) | ||
except IndexError: | ||
retrieved_dataset_uris = DatasetURI.objects.filter( | ||
id__gte=id_range[i], id__lte=DatasetURI.objects.latest('id').id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not easily readable. Please take a look at the QuerySet
documentation and use use an existing mechanism to iterate over the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit.
geospaas_harvesting/verify.py
Outdated
if 'html' in content_type.lower() or 'text' in content_type.lower(): | ||
corrupted_url_set.add(dsuri.uri) | ||
|
||
with open(f"unverified_ones_at_{datetime.now().strftime('%Y-%m-%d|%H_%M_%S')}.txt", 'w') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please use a clear descriptive name: "ones" could be anything. Keep this in mind for naming in general.
- It's a very bad idea to use
|
in a file name since it's the pipe operator in Unix shells. - Either of these solutions would be more user-friendly:
- the name of the file could be passed as an argument to the script (with a default value if it is not provided).
- the script could write its output to stdout, and the user can then redirect it as they wish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit.
geospaas_harvesting/verify.py
Outdated
if DatasetURI.objects.get(uri=url).dataset.dataseturi_set.count() == 1: | ||
assert DatasetURI.objects.get(uri=url).dataset.delete()[0] == 2 | ||
else: | ||
assert DatasetURI.objects.get(uri=url).delete()[0] == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, we don't want this script to delete everything right away.
The remote repository could just be momentarily unavailable.
Writing to a file or stdout it is enough. It can then be checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit.
geospaas_harvesting/verify.py
Outdated
with open(f"unverified_ones_at_{datetime.now().strftime('%Y-%m-%d|%H_%M_%S')}.txt", 'w') as f: | ||
for url in corrupted_url_set: | ||
# Write down the urls on unverified_ones_at_blablabla.txt | ||
f.write(url + '\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit.
tests/test_verify.py
Outdated
from geospaas.catalog.models import Dataset, DatasetURI | ||
|
||
import geospaas_harvesting.verify as verify | ||
from tests.test_ingesters import IngesterTestCase as itc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to use an alias since there is no other class named "IngesterTestCase" here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit.
geospaas_harvesting/verify.py
Outdated
retrieved_dataset_uris = DatasetURI.objects.filter( | ||
id__gte=id_range[i], id__lte=DatasetURI.objects.latest('id').id) | ||
for dsuri in retrieved_dataset_uris: | ||
content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to get the Content-Type header (which might not be set) instead of relying on the status code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aperrin66
How we are going to identify a download link from its status code? What would that status be for a download link? is it distinguishable from a text
or html
response that might describe the failure of the download link instead of download link itself?
The reason for using content type is that it clearly shows that this link is ready for starting a download action. It shows it by setting the content type value as, for example,application/x-netcdf;charset=ISO-8859-1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the status code is different from 2**, there is a problem.
geospaas_harvesting/verify.py
Outdated
for dsuri in retrieved_dataset_uris: | ||
content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type') | ||
if 'html' in content_type.lower() or 'text' in content_type.lower(): | ||
corrupted_url_set.add(dsuri.uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrupted_url_set
could grow quite big. Why not write the url right away instead of storing all of them in memory, then iterating over them again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that it will not be problematic to store it in the memory. I thought that the ratio of corrupted ones to the healthy ones is extremely low!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit.
@aperrin66 I am 100 percent agree with separation with functions to write a better code. However, this is a tiny code that I think it is not a proper idea for this one(exclusive this code) to make functions out of it. |
geospaas_harvesting/verify.py
Outdated
main(filename=sys.argv[1] if len(sys.argv) == 2 else \ | ||
f"unverified_dataset_at_{datetime.now().strftime('%Y-%m-%d___%H_%M_%S')}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code to get the file name get be put in the main()
function since it's read from the command line.
It will be easier to write it in a clear and clean way there, rather than in this long line.
geospaas_harvesting/verify.py
Outdated
while init_index < DatasetURI.objects.count(): | ||
try: | ||
retrieved_dataset_uris = DatasetURI.objects.all()[init_index:init_index+interval] | ||
except IndexError: | ||
retrieved_dataset_uris = DatasetURI.objects.all()[init_index:] | ||
init_index += interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better but still more complicated that necessary.
https://docs.djangoproject.com/en/3.1/ref/models/querysets/#iterator
@aperrin66 |
geospaas_harvesting/verify.py
Outdated
if requests.head(dsuri.uri, allow_redirects=True).status_code==200: | ||
content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type') | ||
if 'html' in content_type.lower() or 'text' in content_type.lower(): | ||
f.write(dsuri.uri + os.linesep) | ||
else: | ||
f.write(dsuri.uri + os.linesep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe my previous comment was not clear enough.
Unless we run into a case that proves this wrong, let's assume that a status code in the 200-299 range is enough to validate a URL. There is no need to look at the headers for now. We can always add it later if the status code is not enough.
@aperrin66 |
b3467b2
to
8021be4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more things are indicated inline.
Please remember to check your code for PEP-8 conformity.
geospaas_harvesting/verify.py
Outdated
|
||
|
||
if __name__ == '__main__': | ||
main(filename=sys.argv[1] if len(sys.argv) == 2 else '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you leave that here? You don't need to pass an argument to main()
.
You can just put something like that at the beginning of main()
:
try:
filename = sys.argv[1]
except IndexError:
filename = "..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit
geospaas_harvesting/verify.py
Outdated
filename=f"unverified_dataset_at_{datetime.now().strftime('%Y-%m-%d___%H_%M_%S')}" | ||
with open(filename+".txt", 'w') as f: | ||
for dsuri in DatasetURI.objects.iterator(): | ||
if not str(requests.head(dsuri.uri, allow_redirects=True).status_code).startswith('2'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why bother with type conversions?
Something like that is both clearer and more efficient:
response = requests.head(dsuri.uri, allow_redirects=True)
if response.status_code < 200 or response.status_code > 299:
f.write(dsuri.uri + os.linesep)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit
tests/test_verify.py
Outdated
@mock.patch('requests.head') | ||
def test_download_link_responded_with_incorrect_status_code(self, mock_request, mock_open): | ||
"""Shall write dataset to file from database because of unhealthy download link""" | ||
mock_request.return_value = self.FakeResponseIncorrectStatusCode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that works too, and you don't have to declare an extra class:
mock_request.return_value.status_code = 504
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit
tests/test_verify.py
Outdated
self.assertEqual(len(mock_open.mock_calls), 5) | ||
self.assertTrue(mock_open.mock_calls[2][1][0].startswith('http://test.uri/dataset')) | ||
self.assertTrue(mock_open.mock_calls[3][1][0].startswith('http://anotherhost/dataset')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your tests might be easier to write if you just use a temporary directory to write the output files instead of mocking open()
.
See https://docs.python.org/3/library/tempfile.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the next commit
@aperrin66 |
Co-authored-by: Adrien Perrin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, two more things:
- please check spaces around operators.
- use a more explicit name for the script, like
verify_urls.py
c67b1ed
to
600f424
Compare
@aperrin66 |
closes nansencenter/django-geo-spaas#24 |
I removed nansencenter/django-geo-spaas#24 from the linked PRs because there is not code here to fix the wrong URLs, only to detect them. |
closes #13
@aperrin66
@akorosov
The code is ready for the issue. Only the tests are remaining. This code is in its final version and can be reviewed regardless of its tests.