#13 code for verification #34

Merged
merged 16 commits into master from issue13-verify-code on Sep 28, 2020
Conversation

@opsdep (Contributor) commented Sep 23, 2020

closes #13
@aperrin66
@akorosov
The code is ready for the issue; only the tests remain. The code is in its final form and can be reviewed independently of its tests.

@opsdep (Contributor, Author) commented Sep 23, 2020

@akorosov
@aperrin66
The tests are also ready in 09d0917. There will be no more changes to this PR unless your review requests them.

@aperrin66 (Member) left a comment:

Some things to change are indicated in the inline comments.
More generally:

  • use library objects in the intended way. Take the time to go through the documentation before starting to code.
  • even in a short stand-alone script, split your code into functions. It makes the code easier to read and test.

Comment on lines 22 to 30
id_range = range(DatasetURI.objects.earliest('id').id,
                 DatasetURI.objects.latest('id').id, 1000)  # <= number for the length of retrieved
for i in range(len(id_range)):
    try:
        retrieved_dataset_uris = DatasetURI.objects.filter(
            id__gte=id_range[i], id__lt=id_range[i+1])
    except IndexError:
        retrieved_dataset_uris = DatasetURI.objects.filter(
            id__gte=id_range[i], id__lte=DatasetURI.objects.latest('id').id)
@aperrin66 (Member):

This is not easily readable. Please take a look at the QuerySet documentation and use an existing mechanism to iterate over the table.

@opsdep (Contributor, Author):

Done in the next commit.

if 'html' in content_type.lower() or 'text' in content_type.lower():
    corrupted_url_set.add(dsuri.uri)

with open(f"unverified_ones_at_{datetime.now().strftime('%Y-%m-%d|%H_%M_%S')}.txt", 'w') as f:
@aperrin66 (Member):

  • Please use a clear descriptive name: "ones" could be anything. Keep this in mind for naming in general.
  • It's a very bad idea to use | in a file name since it's the pipe operator in Unix shells.
  • Either of these solutions would be more user-friendly:
    • the name of the file could be passed as an argument to the script (with a default value if it is not provided).
    • the script could write its output to stdout, and the user can then redirect it as they wish.
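For illustration, a minimal sketch of the stdout option (the function name write_urls is hypothetical):

import sys

def write_urls(urls, output=sys.stdout):
    # Write to stdout by default; the user can redirect the output
    # to a file or pipe it into another command
    for url in urls:
        output.write(url + '\n')

write_urls(['http://example.com/dataset1'])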

@opsdep (Contributor, Author):

Done in the next commit.

Comment on lines 44 to 47
if DatasetURI.objects.get(uri=url).dataset.dataseturi_set.count() == 1:
    assert DatasetURI.objects.get(uri=url).dataset.delete()[0] == 2
else:
    assert DatasetURI.objects.get(uri=url).delete()[0] == 1
@aperrin66 (Member):

As we discussed, we don't want this script to delete everything right away.
The remote repository could just be momentarily unavailable.
Writing to a file or stdout is enough; the output can then be checked.

@opsdep (Contributor, Author):

Done in the next commit.

with open(f"unverified_ones_at_{datetime.now().strftime('%Y-%m-%d|%H_%M_%S')}.txt", 'w') as f:
    for url in corrupted_url_set:
        # Write down the urls on unverified_ones_at_blablabla.txt
        f.write(url + '\n')
@opsdep (Contributor, Author):

Done in the next commit.

from geospaas.catalog.models import Dataset, DatasetURI

import geospaas_harvesting.verify as verify
from tests.test_ingesters import IngesterTestCase as itc
@aperrin66 (Member):

You don't need to use an alias since there is no other class named "IngesterTestCase" here.

@opsdep (Contributor, Author):

Done in the next commit.

retrieved_dataset_uris = DatasetURI.objects.filter(
    id__gte=id_range[i], id__lte=DatasetURI.objects.latest('id').id)
for dsuri in retrieved_dataset_uris:
    content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type')
@aperrin66 (Member):

Any particular reason to get the Content-Type header (which might not be set) instead of relying on the status code?

@opsdep (Contributor, Author):

@aperrin66
How would we identify a download link from its status code? What would that status be for a download link? Is it distinguishable from a text or HTML response that might describe the failure of the download, instead of being the download itself?

The reason for using the content type is that it clearly shows the link is ready for a download action, by setting the content type value to, for example, application/x-netcdf;charset=ISO-8859-1.

@aperrin66 (Member):

If the status code is different from 2xx, there is a problem.

for dsuri in retrieved_dataset_uris:
    content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type')
    if 'html' in content_type.lower() or 'text' in content_type.lower():
        corrupted_url_set.add(dsuri.uri)
@aperrin66 (Member):

corrupted_url_set could grow quite big. Why not write the url right away instead of storing all of them in memory, then iterating over them again?

@opsdep (Contributor, Author):

I thought it would not be a problem to store them in memory, since the ratio of corrupted URLs to healthy ones should be extremely low.

@opsdep (Contributor, Author):

Done in the next commit.

@opsdep (Contributor, Author) commented Sep 24, 2020

@aperrin66
All of your comments are addressed in d8106d1. If you are OK with it, notify me and I will write tests based on it.

I agree 100 percent that separating code into functions makes for better code. However, this is such a tiny script that I don't think it is worthwhile, for this code specifically, to break it into functions.

Comment on lines 39 to 40
main(filename=sys.argv[1] if len(sys.argv) == 2 else \
     f"unverified_dataset_at_{datetime.now().strftime('%Y-%m-%d___%H_%M_%S')}")
@aperrin66 (Member):

The code to get the file name can be put in the main() function since it's read from the command line.
It will be easier to write it in a clear and clean way there, rather than in this long line.

Comment on lines 25 to 30
while init_index < DatasetURI.objects.count():
    try:
        retrieved_dataset_uris = DatasetURI.objects.all()[init_index:init_index+interval]
    except IndexError:
        retrieved_dataset_uris = DatasetURI.objects.all()[init_index:]
    init_index += interval
@aperrin66 (Member) commented Sep 24, 2020:

This is better but still more complicated than necessary.
https://docs.djangoproject.com/en/3.1/ref/models/querysets/#iterator
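For context, a minimal sketch of what iterator() enables (the print call is a placeholder for the actual check):

from geospaas.catalog.models import DatasetURI

# iterator() streams rows from the database in chunks instead of
# caching the whole queryset in memory
for dataset_uri in DatasetURI.objects.iterator():
    print(dataset_uri.uri)  # placeholder for the actual URL verification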

@opsdep (Contributor, Author) commented Sep 24, 2020

@aperrin66
54883ad is ready and I am writing tests for it. Agreed?

Comment on lines 26 to 31
if requests.head(dsuri.uri, allow_redirects=True).status_code==200:
    content_type = requests.head(dsuri.uri, allow_redirects=True).headers.get('content-type')
    if 'html' in content_type.lower() or 'text' in content_type.lower():
        f.write(dsuri.uri + os.linesep)
else:
    f.write(dsuri.uri + os.linesep)
@aperrin66 (Member):

Maybe my previous comment was not clear enough.

Unless we run into a case that proves this wrong, let's assume that a status code in the 200-299 range is enough to validate a URL. There is no need to look at the headers for now. We can always add it later if the status code is not enough.

@opsdep (Contributor, Author) commented Sep 24, 2020

@aperrin66
All addressed in 573c056.

@aperrin66 (Member) left a comment:

A few more things are indicated inline.
Please remember to check your code for PEP-8 conformity.



if __name__ == '__main__':
    main(filename=sys.argv[1] if len(sys.argv) == 2 else '')
@aperrin66 (Member):

Why do you leave that here? You don't need to pass an argument to main().
You can just put something like this at the beginning of main():

try:
    filename = sys.argv[1]
except IndexError:
    filename = "..."

@opsdep (Contributor, Author):

Done in the next commit

filename=f"unverified_dataset_at_{datetime.now().strftime('%Y-%m-%d___%H_%M_%S')}"
with open(filename+".txt", 'w') as f:
for dsuri in DatasetURI.objects.iterator():
if not str(requests.head(dsuri.uri, allow_redirects=True).status_code).startswith('2'):
@aperrin66 (Member):

Why bother with type conversions?
Something like that is both clearer and more efficient:

response = requests.head(dsuri.uri, allow_redirects=True)
if response.status_code < 200 or response.status_code > 299:
    f.write(dsuri.uri + os.linesep)

@opsdep (Contributor, Author):

Done in the next commit

@mock.patch('requests.head')
def test_download_link_responded_with_incorrect_status_code(self, mock_request, mock_open):
    """Shall write dataset to file from database because of unhealthy download link"""
    mock_request.return_value = self.FakeResponseIncorrectStatusCode()
@aperrin66 (Member):

That works too, and you don't have to declare an extra class:
mock_request.return_value.status_code = 504
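For instance, a self-contained sketch of that suggestion (the URL is hypothetical):

from unittest import mock

import requests

with mock.patch('requests.head') as mock_head:
    # Configure the attribute on the mock's return value directly,
    # no dedicated fake-response class needed
    mock_head.return_value.status_code = 504
    assert requests.head('http://test.uri/dataset').status_code == 504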

@opsdep (Contributor, Author):

Done in the next commit

Comment on lines 32 to 34
self.assertEqual(len(mock_open.mock_calls), 5)
self.assertTrue(mock_open.mock_calls[2][1][0].startswith('http://test.uri/dataset'))
self.assertTrue(mock_open.mock_calls[3][1][0].startswith('http://anotherhost/dataset'))
@aperrin66 (Member):

Your tests might be easier to write if you just use a temporary directory to write the output files instead of mocking open().
See https://docs.python.org/3/library/tempfile.html
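A rough sketch of that approach (file name and content hypothetical):

import os
import tempfile

# Write to a real file in a throwaway directory instead of mocking open();
# the directory is removed automatically when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, 'unverified_urls.txt')
    with open(output_path, 'w') as output_file:
        output_file.write('http://test.uri/dataset' + os.linesep)
    with open(output_path) as output_file:
        assert output_file.read().startswith('http://test.uri/dataset')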

@opsdep (Contributor, Author):

Done in the next commit

@opsdep (Contributor, Author) commented Sep 25, 2020

@aperrin66
All addressed in c67b1ed.

@aperrin66 (Member) left a comment:

Almost there, two more things:

  • please check spaces around operators.
  • use a more explicit name for the script, like verify_urls.py

@opsdep (Contributor, Author) commented Sep 25, 2020

@aperrin66
664b116 is ready.

@opsdep (Contributor, Author) commented Sep 25, 2020

closes nansencenter/django-geo-spaas#24
@akorosov
@aperrin66
This PR also closes nansencenter/django-geo-spaas#24 at the same time, because we have developed code that checks the previously ingested datasets.

@aperrin66 aperrin66 merged commit e2fbd0b into master Sep 28, 2020
@aperrin66 aperrin66 deleted the issue13-verify-code branch September 28, 2020 06:54
@aperrin66 (Member) commented Sep 28, 2020

I removed nansencenter/django-geo-spaas#24 from the linked PRs because there is no code here to fix the wrong URLs, only to detect them.

Development

Successfully merging this pull request may close these issues.

Add checking mechanism for ingested data
2 participants