Qn about caching #21

Open
lmoresi opened this issue May 8, 2020 · 7 comments

@lmoresi

lmoresi commented May 8, 2020

Nice work Thomas (and others) ...

Here is something I noticed about how you cache the downloaded files.

In step 3 of SeismoSD.ipynb you currently don't use cached data from "today", but I think this runs the risk that tomorrow you would treat that file as fine even though it holds the incomplete data from running the notebook in the middle of the day. One simple check might be the creation date of the cached file, though that does not really verify that the file is not corrupted.

The same argument applies to the downloaded data from step 3 and presumably to the npz files in step 4 too. There is no check of whether the npz file is out of date compared to the mseed file, which would help, I think.
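
A minimal sketch of the npz-versus-mseed check mentioned above, comparing modification times. The function name `npz_is_stale` and its arguments are hypothetical and not part of the notebook; the sketch assumes each npz file is derived from exactly one mseed file.

```python
import os

def npz_is_stale(mseed_path, npz_path):
    """Return True if the .npz result should be (re)computed.

    Hypothetical helper: assumes one .npz is derived from one .mseed file.
    """
    if not os.path.isfile(npz_path):
        return True   # nothing cached yet
    if not os.path.isfile(mseed_path):
        return False  # no source to compare against; keep the cached result
    # Stale when the waveform file was modified after the result was written.
    return os.path.getmtime(mseed_path) > os.path.getmtime(npz_path)
```

Like the creation-date idea, this only compares timestamps; it says nothing about whether either file is corrupted.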

@ThomasLecocq
Owner

Yep, just had the same issue... and manually deleted the files to be sure they would be reprocessed... The whole process was originally meant to be run once.

I'm ooooooopen for a solution (os.path.getmtime or else is ok for me)
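
One possible shape for an `os.path.getmtime`-based check, as a sketch only: a cached file is trusted when it was last written after the UTC day it covers had ended. The name `cache_is_complete` and the one-hour `margin` are assumptions for illustration, not something taken from the notebook, and the standard-library `datetime` is used here instead of the pandas/ObsPy types the notebook works with.

```python
import os
from datetime import datetime, timedelta, timezone

def cache_is_complete(path, day, margin=timedelta(hours=1)):
    """True if `path` exists and was last written after `day` had ended (UTC).

    `day` is a datetime.date; `margin` (assumed value) allows for data that
    reaches the server a little after midnight.
    """
    if not os.path.isfile(path):
        return False
    # Midnight UTC at the end of the day the file is supposed to cover.
    day_end = datetime(day.year, day.month, day.day, tzinfo=timezone.utc) + timedelta(days=1)
    written = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return written >= day_end + margin
```

A file fetched at midday would fail this test on every later run and be fetched again, which removes the need to delete it by hand.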

@lmoresi
Author

lmoresi commented May 8, 2020 via email

@ThomasLecocq
Owner

The download/backfill logic is interesting, but for a systematic, cron-style workflow I'd use MSNoise. In preparation for MSNoise 2.0 I have already merged the PSD calculations. As soon as you "scan" an archive (however you fill that archive), MSNoise detects the new jobs to do and only processes those.

@FMassin
Collaborator

FMassin commented May 8, 2020

I think using the notebook for this kind of thing is a complicated strategy. I would rather advise wrapping the fdsn.client interface into the module as a new alternative mode to --pqlx. We could also add an SDS interface, for the lucky ones who have direct access to data archive storage...

@ThomasLecocq
Owner

sure thing... the idea was to provide a simple plotter for people.

@ThomasLecocq
Owner

I mean, the elaborate way of handling massive datasets, without duplication from SDS archives etc., is already implemented in MSNoise. So the notebook shouldn't grow much more complex; that's not its goal.

@lmoresi
Author

lmoresi commented May 20, 2020

Yes to all of the discussion - this is a dirty old hack !

I brought this up because of a small in-class project to build these plots automatically, on a daily schedule, for a single site using GitHub Actions and push them back to the repository so that they appear in the README (example: https://github.com/ANU-RSES-Education/SeismicNoise_AuSIS_UHS ). This is for the Australian Seismometers in Schools, to give the students a chance to see what you guys are up to without needing to run the codes.

I don't have a bulletproof way to do this, but it is similar to what is requested in issue #23:

  1. Make a change in step 3
safety_window = pd.Timedelta('2 days')
today = pd.to_datetime(UTCDateTime.now().date)

# ... existing code 

for day in pbar:
    datestr = day.strftime("%Y-%m-%d")
    fn  = "{}_{}_{}.mseed".format(dataset, datestr, nslc)
    fnz = "{}_{}_{}.npz".format(dataset, datestr, nslc)
    
    if (today-day > safety_window) and (os.path.isfile(fn) or (os.path.isfile(fnz) and not force_reprocess)):
        pbar.set_description("Using cache - %s" % fn)
        continue
    else:
        pbar.set_description("Fetching    - %s" % fn)
        try: 
           # etc 
  2. A corresponding change in step 4
    for mseedid in list(set([tr.id for tr in stall])):
        fn_out = os.path.join("..","data","{}_{}_{}.npz".format(dataset, datestr, mseedid))
        if (today-day > safety_window) and (os.path.isfile(fn_out) and not force_reprocess):
            continue
        st = read(fn_in, sourcename=mseedid)
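
The condition used in step 4 can be isolated as a small function, sketched here with the standard-library `datetime` types rather than the pandas `Timedelta` and ObsPy `UTCDateTime` used above. The name `should_use_cache` and the `cached` flag (standing in for the `os.path.isfile` result) are hypothetical.

```python
from datetime import date, timedelta

def should_use_cache(day, today, cached, force_reprocess=False,
                     safety_window=timedelta(days=2)):
    """Decide whether an existing cached file for `day` can be reused.

    Mirrors the step 4 test: reuse only if the file exists, reprocessing is
    not forced, and `day` is more than `safety_window` in the past.
    """
    if force_reprocess or not cached:
        return False
    return (today - day) > safety_window
```

With the two-day window, files for the last two days are always fetched and processed again, so a partial download from a mid-day run is overwritten on a later run.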

I can submit a PR if you would like me to.


3 participants