#13 code for verification #34

Merged
merged 16 commits into master from issue13-verify-code on Sep 28, 2020
Conversation

@opsdep (Contributor) commented Sep 23, 2020

closes #13
@aperrin66
@akorosov
The code is ready for the issue; only the tests remain. The code is in its final form and can be reviewed independently of its tests.

@opsdep (Contributor, Author) commented Sep 23, 2020

@akorosov
@aperrin66
The tests are also ready in 09d0917. There will be no more changes to this PR unless your review requests them.

@aperrin66 (Member) left a comment:

Some things to change are indicated in the inline comments.
More generally:

  • use library objects in the intended way. Take the time to go through the documentation before starting to code.
  • even in a short stand-alone script, split your code into functions. It makes the code easier to read and test.

Comment on lines 22 to 30
id_range = range(DatasetURI.objects.earliest('id').id,
                 DatasetURI.objects.latest('id').id, 1000)  # <= number for the length of retrieved
for i in range(len(id_range)):
    try:
        retrieved_dataset_uris = DatasetURI.objects.filter(
            id__gte=id_range[i], id__lt=id_range[i+1])
    except IndexError:
        retrieved_dataset_uris = DatasetURI.objects.filter(
            id__gte=id_range[i], id__lte=DatasetURI.objects.latest('id').id)
@aperrin66 (Member):

This is not easily readable. Please take a look at the QuerySet documentation and use an existing mechanism to iterate over the table.

@opsdep (Contributor, Author):

Done in the next commit.

if 'html' in content_type.lower() or 'text' in content_type.lower():
    corrupted_url_set.add(dsuri.uri)

with open(f"unverified_ones_at_{datetime.now().strftime('%Y-%m-%d|%H_%M_%S')}.txt", 'w') as f:
@aperrin66 (Member):

  • Please use a clear descriptive name: "ones" could be anything. Keep this in mind for naming in general.
  • It's a very bad idea to use | in a file name since it's the pipe operator in Unix shells.
  • Either of these solutions would be more user-friendly:
    • the name of the file could be passed as an argument to the script (with a default value if it is not provided).
    • the script could write its output to stdout, and the user can then redirect it as they wish.
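For illustration, a minimal sketch of the stdout option (the function name write_urls is hypothetical):

import sys

def write_urls(urls, output=sys.stdout):
    # Write to stdout by default; the user can redirect the output
    # to a file or pipe it into another command
    for url in urls:
        output.write(url + '\n')

write_urls(['http://example.com/dataset1'])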

@opsdep (Contributor, Author):

Done in the next commit.

Comment on lines 44 to 47
if DatasetURI.objects.get(uri=url).dataset.dataseturi_set.count() == 1:
    assert DatasetURI.objects.get(uri=url).dataset.delete()[0] == 2
else:
    assert DatasetURI.objects.get(uri=url).delete()[0] == 1
@aperrin66 (Member):

As we discussed, we don't want this script to delete everything right away.
The remote repository could just be momentarily unavailable.
Writing to a file or stdout is enough; the output can then be checked.

@opsdep (Contributor, Author):

Done in the next commit.

with open(f"unverified_ones_at_{datetime.now().strftime('%Y-%m-%d|%H_%M_%S')}.txt", 'w') as f:
    for url in corrupted_url_set:
        # Write down the urls on unverified_ones_at_blablabla.txt
        f.write(url + '\n')
@opsdep (Contributor, Author):

Done in the next commit.

from geospaas.catalog.models import Dataset, DatasetURI

import geospaas_harvesting.verify as verify
from tests.test_ingesters import IngesterTestCase as itc
@aperrin66 (Member):

You don't need to use an alias since there is no other class named "IngesterTestCase" here.

@opsdep (Contributor, Author):

Done in the next commit.

retrieved_dataset_uris = DatasetURI.objects.filter(
    id__gte=id_range[i], id__lte=DatasetURI.objects.latest('id').id)
for dsuri in retrieved_dataset_uris:
    content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type')
@aperrin66 (Member):

Any particular reason to get the Content-Type header (which might not be set) instead of relying on the status code?

@opsdep (Contributor, Author):

@aperrin66
How would we identify a download link from its status code? What would that status be for a download link? Is it distinguishable from a text or HTML response that might describe the failure of the download, instead of being the download itself?

The reason for using the content type is that it clearly shows the link is ready for a download action, by setting the content type value to, for example, application/x-netcdf;charset=ISO-8859-1.

@aperrin66 (Member):

If the status code is different from 2xx, there is a problem.

for dsuri in retrieved_dataset_uris:
    content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type')
    if 'html' in content_type.lower() or 'text' in content_type.lower():
        corrupted_url_set.add(dsuri.uri)
@aperrin66 (Member):

corrupted_url_set could grow quite big. Why not write the url right away instead of storing all of them in memory, then iterating over them again?

@opsdep (Contributor, Author):

I thought it would not be a problem to store them in memory, since the ratio of corrupted URLs to healthy ones should be extremely low.

@opsdep (Contributor, Author):

Done in the next commit.

@opsdep (Contributor, Author) commented Sep 24, 2020

@aperrin66
All of your comments are addressed in d8106d1. If you are OK with it, notify me and I will write tests based on it.

I agree 100 percent that separating code into functions makes for better code. However, this is such a tiny script that I don't think it is worthwhile, for this code specifically, to break it into functions.

Comment on lines 39 to 40
main(filename=sys.argv[1] if len(sys.argv) == 2 else \
     f"unverified_dataset_at_{datetime.now().strftime('%Y-%m-%d___%H_%M_%S')}")
@aperrin66 (Member):

The code to get the file name can be put in the main() function since it's read from the command line.
It will be easier to write it in a clear and clean way there, rather than in this long line.

Comment on lines 25 to 30
while init_index < DatasetURI.objects.count():
    try:
        retrieved_dataset_uris = DatasetURI.objects.all()[init_index:init_index+interval]
    except IndexError:
        retrieved_dataset_uris = DatasetURI.objects.all()[init_index:]
    init_index += interval
@aperrin66 (Member) commented Sep 24, 2020:

This is better but still more complicated than necessary.
https://docs.djangoproject.com/en/3.1/ref/models/querysets/#iterator
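For context, a minimal sketch of what iterator() enables (the print call is a placeholder for the actual check):

from geospaas.catalog.models import DatasetURI

# iterator() streams rows from the database in chunks instead of
# caching the whole queryset in memory
for dataset_uri in DatasetURI.objects.iterator():
    print(dataset_uri.uri)  # placeholder for the actual URL verification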

@opsdep (Contributor, Author) commented Sep 24, 2020

@aperrin66
54883ad is ready and I am writing tests for it. Agreed?

Comment on lines 26 to 31
if requests.head(dsuri.uri, allow_redirects=True).status_code==200:
    content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type')
    if 'html' in content_type.lower() or 'text' in content_type.lower():
        f.write(dsuri.uri + os.linesep)
else:
    f.write(dsuri.uri + os.linesep)
@aperrin66 (Member):

Maybe my previous comment was not clear enough.

Unless we run into a case that proves this wrong, let's assume that a status code in the 200-299 range is enough to validate a URL. There is no need to look at the headers for now. We can always add it later if the status code is not enough.

@opsdep (Contributor, Author) commented Sep 24, 2020

@aperrin66
All addressed in 573c056.

@aperrin66 (Member) left a comment:

A few more things are indicated inline.
Please remember to check your code for PEP-8 conformity.



if __name__ == '__main__':
    main(filename=sys.argv[1] if len(sys.argv) == 2 else '')
@aperrin66 (Member):

Why do you leave that here? You don't need to pass an argument to main().
You can just put something like this at the beginning of main():

try:
    filename = sys.argv[1]
except IndexError:
    filename = "..."

@opsdep (Contributor, Author):

Done in the next commit

filename=f"unverified_dataset_at_{datetime.now().strftime('%Y-%m-%d___%H_%M_%S')}"
with open(filename+".txt", 'w') as f:
for dsuri in DatasetURI.objects.iterator():
if not str(requests.head(dsuri.uri, allow_redirects=True).status_code).startswith('2'):
@aperrin66 (Member):

Why bother with type conversions?
Something like that is both clearer and more efficient:

response = requests.head(dsuri.uri, allow_redirects=True)
if response.status_code < 200 or response.status_code > 299:
    f.write(dsuri.uri + os.linesep)

@opsdep (Contributor, Author):

Done in the next commit

@mock.patch('requests.head')
def test_download_link_responded_with_incorrect_status_code(self, mock_request, mock_open):
    """Shall write dataset to file from database because of unhealthy download link"""
    mock_request.return_value = self.FakeResponseIncorrectStatusCode()
@aperrin66 (Member):

That works too, and you don't have to declare an extra class:
mock_request.return_value.status_code = 504
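For instance, a self-contained sketch of that suggestion (the URL is hypothetical):

from unittest import mock

import requests

with mock.patch('requests.head') as mock_head:
    # Configure the attribute on the mock's return value directly,
    # no dedicated fake-response class needed
    mock_head.return_value.status_code = 504
    assert requests.head('http://test.uri/dataset').status_code == 504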

@opsdep (Contributor, Author):

Done in the next commit

Comment on lines 32 to 34
self.assertEqual(len(mock_open.mock_calls), 5)
self.assertTrue(mock_open.mock_calls[2][1][0].startswith('http://test.uri/dataset'))
self.assertTrue(mock_open.mock_calls[3][1][0].startswith('http://anotherhost/dataset'))
@aperrin66 (Member):

Your tests might be easier to write if you just use a temporary directory to write the output files instead of mocking open().
See https://docs.python.org/3/library/tempfile.html
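A rough sketch of that approach (file name and content hypothetical):

import os
import tempfile

# Write to a real file in a throwaway directory instead of mocking open();
# the directory is removed automatically when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, 'unverified_urls.txt')
    with open(output_path, 'w') as output_file:
        output_file.write('http://test.uri/dataset' + os.linesep)
    with open(output_path) as output_file:
        assert output_file.read().startswith('http://test.uri/dataset')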

@opsdep (Contributor, Author):

Done in the next commit

@opsdep (Contributor, Author) commented Sep 25, 2020

@aperrin66
All addressed in c67b1ed.

@aperrin66 (Member) left a comment:

Almost there, two more things:

  • please check spaces around operators.
  • use a more explicit name for the script, like verify_urls.py

@opsdep (Contributor, Author) commented Sep 25, 2020

@aperrin66
664b116 is ready.

@opsdep (Contributor, Author) commented Sep 25, 2020

closes nansencenter/django-geo-spaas#24
@akorosov
@aperrin66
This PR also closes nansencenter/django-geo-spaas#24 at the same time, because we have developed code that checks the previously ingested datasets.

@aperrin66 aperrin66 merged commit e2fbd0b into master Sep 28, 2020
@aperrin66 aperrin66 deleted the issue13-verify-code branch September 28, 2020 06:54
@aperrin66 (Member) commented Sep 28, 2020

I removed nansencenter/django-geo-spaas#24 from the linked PRs because there is no code here to fix the wrong URLs, only to detect them.

Development

Successfully merging this pull request may close these issues.

Add checking mechanism for ingested data
2 participants