Downloading data from Zooniverse; classification_export.status_code == 403 error #38

Open
beckynevin opened this issue Apr 27, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@beckynevin
Collaborator

Describe the bug
The last cell of the citizen science notebook (the one that fetches the classifications from Zooniverse using the Panoptes client) fails intermittently, roughly one run in ten.

To Reproduce
Steps to reproduce the behavior:

  1. Restart the kernel
  2. Scroll down to the last cell in citizen science notebook
  3. Run the cell
  4. Follow the directions to log in with your Zooniverse credentials.
  5. The cell often succeeds, so keep restarting the kernel and rerunning the cell until you see the error.

Expected behavior
The classifications download without error. In other words, classification_export.status_code == 200 and classification_export.ok == True.

Actual behavior
Roughly one run in ten, classification_export.status_code == 403.

Screenshots

EDC Output

INPUT
# This cell is set up to run independently from all of the above cells
import panoptes_client, utils
panoptes_client.Panoptes.connect(login="interactive")
# This project_id is found on Zooniverse by selecting 'build a project' and then selecting the project
# You don't need to be the project owner.
project_id = 19539
classification_export = panoptes_client.Project(project_id).get_export('classifications')
list_rows = []
counter = 0
# If the following line throws an error, restart the kernel and rerun the cell.
for row in classification_export.csv_reader():
    if counter == 0:
        header = row
    else:
        list_rows.append(row)
    counter += 1
df = utils.pandas.DataFrame(list_rows, columns = header)
df

SAMPLE OUTPUT
Enter your Zooniverse credentials...
Username:  rebecca.nevin
 ········
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
Input In [1], in <cell line: 14>()
     10 counter = 0
     11 # I get a weird error if I run the rest of this notebook first and don't rerun the import and call
     12 # to panoptes_client above: 
     13 # Error: iterator should return strings, not bytes (the file should be opened in text mode)
---> 14 for row in classification_export.csv_reader():
     16     if counter == 0:
     17         header = row

Error: iterator should return strings, not bytes (the file should be opened in text mode)
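
For reference, the message in this traceback comes from Python's standard csv module, which raises it whenever the iterable it reads yields bytes rather than str. The minimal sketch below reproduces the message with the standard library alone. It does not call panoptes_client, the helper name first_row and the column names are placeholders, and whether the failed export response hands bytes to the reader for this reason is an assumption rather than something confirmed here.

```python
import csv


def first_row(lines):
    """Return the first CSV row parsed from an iterable of lines."""
    return next(csv.reader(lines))


# Text lines parse normally.
header = first_row(["col_a,col_b"])
print(header)

# Bytes lines are rejected by csv.reader with the message seen above.
try:
    first_row([b"col_a,col_b"])
except csv.Error as error:
    message = str(error)
    print(message)
```

If this is what happens in the notebook, checking classification_export.status_code before calling csv_reader() would surface the 403 instead of the less informative csv error.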

Additional context
Here is the code we wrote to handle this case: it checks the status code before reading the export and asks the user to try again on a 403. We are not including this in the alpha version of the code release, but we'd like to include it down the road. Currently, we just have one comment that recommends re-running the cell if it fails.

# I currently have this cell set up to run independently from all of the above cells
#from panoptes_client import Panoptes, Project
import panoptes_client, utils
panoptes_client.Panoptes.connect(login="interactive")
# This project_id is found on Zooniverse by selecting 'build a project' and then selecting the project
# I also don't think you need to be the project owner, but I'm not sure
project_id = 19539
classification_export = panoptes_client.Project(project_id).get_export('classifications')
list_rows = []
counter = 0
# I get a weird error if I run the rest of this notebook first and don't rerun the import and call
# to panoptes_client above: 
# Error: iterator should return strings, not bytes (the file should be opened in text mode)
if classification_export.status_code == 200 and classification_export.ok == True:
    for row in classification_export.csv_reader():

        if counter == 0:
            header = row
        else:
            #print(row)
            list_rows.append(row)
        counter += 1

    df = utils.pandas.DataFrame(list_rows, columns = header)
    print(df)
elif classification_export.status_code == 403:
    print("There was an issue with the request, please try again in a minute.")
else:
    print(classification_export.status_code)
    print(classification_export.text)
@beckynevin added the bug label Apr 27, 2023
@clareh
Contributor

clareh commented May 16, 2023

I had a discussion with someone who is keen to use our pipeline down the road, and they raised a concern about the delay in getting results. They think the ~24-hour wait to get classifications will impact their ability to do science. Perhaps worth discussing in this context?

@bnord
Collaborator

bnord commented May 27, 2023

@clareh What specific concerns did they have about the delay? Why does a 24-hour delay affect their science capacity?

@beckynevin
Collaborator Author

Maybe the above two comments should be moved to a separate discussion? They don't seem related to this bug, but rather to the general topic of how to fetch data.

@bnord
Collaborator

bnord commented May 30, 2023

@clareh Could you start an issue or a new discussion on this?

@eatyourgreens self-assigned this Jun 6, 2023
@eatyourgreens
Collaborator

Hi! I've added myself to this as the Zooniverse contact.

My first thought is that the failed requests may be using expired Authorization headers, but I will investigate.

@ericdrosas87
Contributor

Thank you @eatyourgreens !

@eatyourgreens
Collaborator

eatyourgreens commented Jun 8, 2023

Hi again,

Do you know if the classification export is being requested after its signed URL has expired? Here's an example of an expired link:
https://panoptesuploads.blob.core.windows.net/private/project_classifications_export/2659a7c3-043d-45c7-8cef-c0fbae185cc5.csv?sp=r&sv=2018-11-09&se=2023-06-07T22%3A08%3A14Z&sr=b&sig=rnOa82WJhSROjG61If1qZ0QLIGcHT3KADJptlQB%2BoAE%3D
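
As a rough illustration, the expiry time can be read straight from a link like this one: in an Azure Storage shared access signature URL, the `se` query parameter carries the expiry timestamp in UTC. The sketch below parses it with the standard library only. The function names and the shortened example.invalid URL are hypothetical; with a real export you would pass the URL actually returned (for a requests-style response, that is presumably its url attribute, which is an assumption).

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url: str) -> datetime:
    """Return the UTC expiry time stored in the URL's `se` query parameter.

    Raises KeyError if the URL has no `se` parameter.
    """
    query = parse_qs(urlsplit(url).query)  # parse_qs also decodes %3A to ":"
    expiry_text = query["se"][0]
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """Return True if the signed URL's expiry time has already passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= signed_url_expiry(url)


# Hypothetical shortened URL that reuses the expiry value from the link above.
example_url = "https://example.invalid/export.csv?sp=r&se=2023-06-07T22%3A08%3A14Z"
expiry = signed_url_expiry(example_url)
print(expiry.isoformat())
print(is_expired(example_url))
```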

The URLs expire 3 minutes after they're generated, so maybe that's the cause of the problem?

If the signed URL has expired, I think that you need to retry and generate a new URL.
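
A minimal sketch of such a retry, assuming that simply requesting the export again after a pause eventually returns a fresh URL. The helper fetch_with_retry and its parameters are hypothetical and not part of panoptes_client; the one-minute default delay is taken from the "try again in a minute" message in the workaround above, not from any documented limit.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    is_ok: Callable[[T], bool],
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until is_ok(result) is true or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = fetch()
        if is_ok(result):
            return result
        if attempt < attempts:
            sleep(delay_seconds)  # pause before requesting the export again
    raise RuntimeError(f"no successful response after {attempts} attempts")


# Possible use in the notebook cell (not executed here, needs a login):
# classification_export = fetch_with_retry(
#     lambda: panoptes_client.Project(project_id).get_export('classifications'),
#     lambda response: response.status_code == 200,
# )
```

The sleep function is passed in as a parameter so the helper can be exercised without actually waiting.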

@eatyourgreens
Collaborator

eatyourgreens commented Jun 8, 2023

zooniverse/panoptes#4209 might fix this, once it’s deployed to Panoptes production.

Credit to @yuenmichelle1 for figuring out the caching problem: those classification links are good for 3 minutes but Panoptes caches them for 5 minutes, so there's a 2-minute window in which Panoptes can give you an expired link.
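
The timing can be worked through in a few lines. This sketch assumes, for illustration only, that the signed URL is generated at the moment the response is cached; the function names are hypothetical and nothing here reflects how Panoptes is actually implemented.

```python
def stale_window_seconds(url_lifetime: float, cache_ttl: float) -> float:
    """Seconds per cache cycle during which the cached URL has already expired."""
    return max(0.0, cache_ttl - url_lifetime)


def cached_url_is_expired(cache_age: float, url_lifetime: float) -> bool:
    """Return True if a cached URL of the given age has outlived its signature."""
    return cache_age > url_lifetime


URL_LIFETIME = 3 * 60  # signed links are good for 3 minutes
CACHE_TTL = 5 * 60     # cached responses are kept for 5 minutes

window = stale_window_seconds(URL_LIFETIME, CACHE_TTL)
print(window)

# A request served from a 4-minute-old cache entry gets an expired link.
print(cached_url_is_expired(cache_age=4 * 60, url_lifetime=URL_LIFETIME))
```

Under the same assumption, a cache lifetime no longer than the URL lifetime would leave no such window.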

@ericdrosas87
Contributor

Thank you for the update, @eatyourgreens. We'll retest soon.


5 participants