Tool "forgets" login status and rescans whole blog #404

thomas694 · 2022-12-04T15:27:18Z

Discussed in #396

^{Originally posted by stuwm November 21, 2022}
When extracting files from hidden tumblrs, the tool is "forgetting" that I have already logged into Tumblr. When clicking on "Authenticate", the tool is not asking for credentials, but switches to "logged in" status immediately. This typically happens after 5-10 mins.
Along with this behaviour the tool is recognizing "latest post" days from years back, although newer posts exist. Then the tool is starting a rescan for that tumblr.

…ogged out - Tumblr started to replace cookies far before their expiration date which lead to logouts, because the local cookie store wasn't constantly updated. - Now every cookie update request in a response is persisted.

- The normal crawler didn't recognize pinned posts.

thomas694 · 2022-12-04T15:41:16Z

"forgets login status"
Tumblr started to replace cookies far before their expiration date which lead to logouts, because the local cookie store wasn't constantly updated.
Now every cookie update request in a response is persisted.
"rescan whole blog"
The crawler for normal blogs seems to rescan whole blog on update.
The normal crawler didn't recognize pinned posts.

jjdnl · 2022-12-10T05:00:48Z

I have the same 'forgot login status' issue for the last 3 days now. Is there an update somewhere? Update check says that i'm up to date. version 2.9.2.368

MSWallack · 2022-12-22T15:33:18Z

I installed v2.9.3 but the app still forgot the authentication status. I closed the app and deleted cookies.json. Same result (except that the authentication window stayed white; I tried again and this time the window showed my tumblr blog). I closed the app and rebooted. Same result. Thank you for your efforts, but it must be something else going on.

AshleyPinner · 2023-01-04T10:51:24Z

Version: 2.9.3.376

Some plain observations of this behaviour, as it's happening to me. I can replicate it with the blogs I have in my queue (any number will suffice, as it's all based on time rather than quantity). Clearing the EBWebView folder + cookies.json, and starting afresh, I'll be logged out again in under 30 mins.

Things to note: cookies.json doesn't get updated when I get logged out, but more weirdly it also doesn't get updated when I log in again after being logged out. (it obviously gets created if I delete and log in afresh, but after that no changes are made to it)

The logs do also complain about an an uncaught exception in

TumblThree/src/TumblThree/TumblThree.Applications/Services/WebRequestFactory.cs

Line 129 in 068c5f1

    
           using (var response = await request.GetResponseAsync().TimeoutAfter(shellService.Settings.TimeOut) as HttpWebResponse)

:

20230104 10:05:44.549	Err 	TumblrHiddenCrawler.AddUrlsToDownloadListAsync: System.Net.WebException: The remote server returned an error: (404) Not Found.
   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Extensions.TaskTimeoutExtension.<TimeoutAfter>d__0`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Services.WebRequestFactory.<ReadRequestToEndAsync>d__12.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Services\WebRequestFactory.cs:line 129
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Parser.GfycatParser.<RequestGfycatCajax>d__6.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Parser\GfycatParser.cs:line 43
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at TumblThree.Applications.Parser.GfycatParser.<SearchForGfycatUrlAsync>d__8.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Parser\GfycatParser.cs:line 96
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Crawler.AbstractTumblrCrawler.<AddGfycatUrlAsync>d__31.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\AbstractTumblrCrawler.cs:line 160
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Crawler.TumblrHiddenCrawler.<AddExternalPhotoUrlToDownloadListAsync>d__47.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TumblrHiddenCrawler.cs:line 716
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at TumblThree.Applications.Crawler.TumblrHiddenCrawler.<AddUrlsToDownloadListAsync>d__23.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TumblrHiddenCrawler.cs:line 393

However, since these are all hidden blogs, I'm unsure how much of that is cause and how much of that is effect. The only clue in the logs that you've been logged out is this combo:

20230104 09:58:16.101	Err 	TumblrHiddenCrawler:IsBlogOnlineAsync: User not logged in
20230104 09:58:20.601	Err 	You need to login to tumblr.com for downloading <blogname>. Go to the Settings->Connection->Authenticate., The remote server returned an error: (401) Unauthorized.

Which is reflected in the errors at the top of the UI, and isn't a precursor to being logged out.

I'm happy to run a version with any kind of extended debug logging enabled since I'm in a replicable case (or is there any options for that already? Verbose logging doesn't tell me much more than TumblrHiddenCrawler.Crawl:Start), as right now my workflow is adding two blogs at a time, and re-queuing them after doing the login shuffle should I get "logged out" during their processing, which isn't exactly scaleable for ~300 blogs!

thomas694 · 2023-01-04T20:45:34Z

Thanks for your observations.

Are you leaving the settings dialog by using the save button or without? The cookies.json should be updated, because after a new login the cookies should be different.

In the error message I see GfycatParser. Maybe the linked image is no longer valid or something like that.

Do you download only hidden blogs or mixed together with normal blogs? Can you confirm that it works without being logged out if you download only newly added hidden blogs as written in the linked discussion #396?
Currently there's no option for more log output. Thanks for offering your help.

AshleyPinner · 2023-01-26T10:41:05Z

Not sure what you changed between 2.9.3 and 2.10 towards this bug, but last night it completed a full ~250 blogs download without being logged out after a period of time.

thomas694 · 2023-01-29T00:41:03Z

You can always compare the tagged versions.
The following things changed related to Tumblr crawlers, but none of them should be affecting this problem:

high-res image retrieval in normal crawler
"Date to" is including that day
user agent "Chrome 109" instead of "Chrome 107"

I wish it would be solved, but I just don't believe it yet.

MSWallack · 2023-01-30T17:50:17Z

If I recall (I'm not presently at the computer on which I use the app), the user agent in my config is Firefox. Should I change it to Chrome (and if so, what should that field say)?

thomas694 · 2023-01-31T23:08:13Z

It might help, maybe not. Currently it's "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" (no quotes).

MSWallack · 2023-02-03T16:55:09Z

I checked and my user agent is exactly what you listed for the current user agent setting. So I guess that means we can rule that out as the culprit.

AshleyPinner · 2023-02-06T23:52:14Z

So.. it's returned. Updating to 2.10.0.384 gave a respite of about 2 weeks, and now it can no longer retain logged in status properly.

MSWallack · 2023-02-09T20:31:11Z

I haven't done any programming since the dark ages (late '80s and early '90s) but the more I think about it, the more convinced I am that one of the keys to figuring out what's going on is the fact that even after getting a 501 error, when I click the Authenticate button, I do NOT need to authenticate again. The Tumblr window that pops up goes straight to my blog as if I've already logged in. I wonder whether it would be possible for the app to simply mimic the steps of re-authenticating when it encounters a 501 error.

AshleyPinner · 2023-02-09T23:35:02Z

Something to note, when I get Error 9 ("You need to login to tumblr.com for downloading <blog>. Go to the Settings -> Connection -> Authenticate."), the LastId inside the <blog>.tumblr file gets reset to 0. I suspect these are related somehow, as now when you requeue that blog, it'll scan the whole thing (ForceRecan is left to false, mind).

I ponder if the relation here is that:

you make too many requests inside a short period of time
tumblr expires your cookie because you've made too many requests
the next request is 501'ed, and you get Error 9 for that blog
this sets the LastId to 0 for that blog
you then make too many requests when you requeue the blog, starting back at the first point

Thus this is why it doesn't affect new blogs, if they don't have many posts, they get a proper LastId and are able to be fetched without requesting the whole blog again.

Things that need to happen when a blog is 501'ed:

If no files have been downloaded during the attempt, DO NOT reset the LastId, and DO NOT update the Last Complete Crawl (both of these are happening)
If files have been downloaded, but it errors partway through, update the LastId, and set the Last Complete Crawl to the date/time of the last post (so you have two ways to ensure you continue from the right point)
Possibly let the whole queue be paused if you get a 501 during a crawl, so it doesn't attempt to continue without authentication, as that's just not a good idea anyway.

thomas694 · 2023-02-13T22:57:20Z

@AshleyPinner @MSWallack

Please use this test version, it doesn't reset the LastId and logs the cookies.
For running the test, delete your TumblThree.log before and make sure that logging is set to 'Verbose' and you only put hidden blogs to the queue.
Run it and after you get this error you can stop it. Please provide any information that may be useful here
and send me the last 2-3 pages of the log file by email.

MSWallack · 2023-02-14T15:45:00Z

I installed and used the test version last night. The bad news is that I wasn't able to devote much time to it. The good news is that it never generated the 501 error. I downloaded a bunch of hidden blogs (about 20) and stayed logged in the whole time. I closed the program and restarted. Before doing anything else, I checked and I was still logged in (this hasn't been the case since the problem started). I then downloaded a bunch of non-hidden blogs and everything appeared fine. I'll try to test more over the next few days.

MSWallack · 2023-02-15T14:47:13Z

The good news: I let the app download a large batch of blogs (both hidden and not) and it seems to working just fine. No 501 errors.

The bad news: The app does appear to be rescanning each blog in full (or at least processing each blog more slowly than I remember.

thomas694 · 2023-02-16T22:58:43Z

That's good news.
As mentioned earlier the LastId may have been reset so it takes a full scan until the last position is remembered again.

MSWallack · 2023-02-17T15:10:00Z

So what was the problem and what did you change to fix the problem?

AshleyPinner · 2023-02-17T19:44:09Z

Ok, so sadly the logout still happens with the new version, but it now is more clear in the logs.

20230217 19:23:02.134	Vrb 	 blog
20230217 19:23:02.134	Vrb 	  tmgioct: {An ID}
20230217 19:23:02.134	Vrb 	  logged_in: 1
20230217 19:23:02.134	Vrb 	  cl_pref: show
20230217 19:23:02.134	Vrb 	  euconsent-v2: {some data}
20230217 19:23:02.134	Vrb 	  euconsent-v2-noniab: AAYA
20230217 19:23:02.134	Vrb 	  euconsent-v2-analytics: 0

and then (the error is async, but logs after the cookie change):

20230217 19:23:02.249	Err 	You need to login to tumblr.com for downloading {blog}. Go to the Settings->Connection->Authenticate., The remote server returned an error: (401) Unauthorized.

and thus:

20230217 19:23:02.243	Vrb 	 blog
20230217 19:23:02.244	Vrb 	  tmgioct: {Same ID as before, unchanged}
20230217 19:23:02.244	Vrb 	  logged_in: 0
20230217 19:23:02.244	Vrb 	  cl_pref: block
20230217 19:23:02.244	Vrb 	  euconsent-v2: {same some data as before}
20230217 19:23:02.244	Vrb 	  euconsent-v2-noniab: AAYA
20230217 19:23:02.244	Vrb 	  euconsent-v2-analytics: 0

It should be noted that 'main' still thinks it's logged in, but 'blog' isn't. To test this I threw in 109 blogs, and it errored after getting the queue down to 73 (which is roughly better than before?)

The LastId was thankfully unchanged, so there's a bit of success there?

thomas694 · 2023-02-19T00:18:01Z

So what was the problem and what did you change to fix the problem?

The problem, that some are being logged out, is known, the exact cause remains unclear. Actually I didn't fix too much yet. I changed that in case of errors the last position is no longer reset.

but it now is more clear in the logs.

Is there also a section "Cookies response" in your log? The especially interesting time is when the exception occurs.

AshleyPinner · 2023-02-19T03:23:38Z

Is there also a section "Cookies response" in your log? The especially interesting time is when the exception occurs.

Here's a whole section from the logs directly one after the other:

20230219 03:13:32.337	Vrb 	TumblrHiddenCrawler.Crawl:Start
20230219 03:13:34.315	Vrb 	Cookies request
20230219 03:13:34.315	Vrb 	 main
20230219 03:13:34.315	Vrb 	  logged_in: 1
20230219 03:13:34.315	Vrb 	  language: %252Cen_US
20230219 03:13:34.315	Vrb 	  euconsent-v2-noniab: AAYA
20230219 03:13:34.315	Vrb 	  euconsent-v2: {consent data}
20230219 03:13:34.315	Vrb 	  pfu: 12{some numbers removed}69
20230219 03:13:34.315	Vrb 	  sid: aVB{data before the dot}.{data after the dot}bML
20230219 03:13:34.315	Vrb 	  cl_pref: show
20230219 03:13:34.315	Vrb 	  euconsent-v2-analytics: 0
20230219 03:13:34.315	Vrb 	  tz: Europe%252FLondon
20230219 03:13:34.315	Vrb 	  search-displayMode: 2
20230219 03:13:34.315	Vrb 	  sid: deleted
20230219 03:13:34.315	Vrb 	  pfu: deleted
20230219 03:13:34.315	Vrb 	  tmgioct: {Same tmgioct as before}
20230219 03:13:34.315	Vrb 	  logged_in: 0
20230219 03:13:34.315	Vrb 	  cl_pref: block
20230219 03:13:34.315	Vrb 	  euconsent-v2: {consent data, same as before}
20230219 03:13:34.315	Vrb 	  euconsent-v2-noniab: AAYA
20230219 03:13:34.315	Vrb 	  euconsent-v2-analytics: 0
20230219 03:13:34.315	Vrb 	 blog
20230219 03:13:34.315	Vrb 	  tmgioct: {Same tmgioct as before}
20230219 03:13:34.315	Vrb 	  logged_in: 0
20230219 03:13:34.315	Vrb 	  cl_pref: block
20230219 03:13:34.315	Vrb 	  euconsent-v2: {consent data, same as before}
20230219 03:13:34.315	Vrb 	  euconsent-v2-noniab: AAYA
20230219 03:13:34.315	Vrb 	  euconsent-v2-analytics: 0
20230219 03:13:34.419	Vrb 	Cookies error
20230219 03:13:34.420	Vrb 	 main
20230219 03:13:34.420	Vrb 	  logged_in: 1
20230219 03:13:34.420	Vrb 	  language: %252Cen_US
20230219 03:13:34.420	Vrb 	  euconsent-v2-noniab: AAYA
20230219 03:13:34.420	Vrb 	  euconsent-v2: {same content as before}
20230219 03:13:34.420	Vrb 	  pfu: 12{some numbers removed}69
20230219 03:13:34.420	Vrb 	  sid: aVB{data before the dot}.{data after the dot}bML
20230219 03:13:34.420	Vrb 	  cl_pref: show
20230219 03:13:34.421	Vrb 	  euconsent-v2-analytics: 0
20230219 03:13:34.421	Vrb 	  tz: Europe%252FLondon
20230219 03:13:34.421	Vrb 	  search-displayMode: 2
20230219 03:13:34.421	Vrb 	  tmgioct: {Same tmgioct as before}
20230219 03:13:34.421	Vrb 	  logged_in: 0
20230219 03:13:34.421	Vrb 	  cl_pref: block
20230219 03:13:34.421	Vrb 	  euconsent-v2: {consent data, same as before}
20230219 03:13:34.421	Vrb 	  euconsent-v2-noniab: AAYA
20230219 03:13:34.422	Vrb 	  euconsent-v2-analytics: 0
20230219 03:13:34.422	Vrb 	 blog
20230219 03:13:34.422	Vrb 	  tmgioct: {Same tmgioct as before}
20230219 03:13:34.422	Vrb 	  logged_in: 0
20230219 03:13:34.422	Vrb 	  cl_pref: block
20230219 03:13:34.422	Vrb 	  euconsent-v2: {Consent Data, same as before}
20230219 03:13:34.423	Vrb 	  euconsent-v2-noniab: AAYA
20230219 03:13:34.423	Vrb 	  euconsent-v2-analytics: 0
20230219 03:13:34.423	Err 	Auth error: The remote server returned an error: (401) Unauthorized.

Lightly edited to ensure that privacy is maintained. Though the sid: deleted and pfu: deleted lines are untouched from the logs themselves.

thomas694 · 2023-02-21T21:09:11Z

I guess you started the blog download with already "broken" cookies. I think we need to look at the time when it changes from successful to failing requests, so the last working and the first failing request.

AshleyPinner · 2023-02-21T22:12:57Z

If you can make a build that stops the queue instantly on a 401 then I can get you that log with ease; right now it's trying to sift through a lot of cooke reports that look identical (and trying to post them in such a way that doesn't give away any kind of privacy).

Currently when you hit a 401, pausing the queue manually takes a while (and isn't instant), which then spits a lot of errors into the logs.

thomas694 · 2023-02-26T20:04:09Z

It should be possible with a text editor to search from the end for "(401)".
Else you can place a breakpoint in the HandleUnauthorizedWebException method of the test_version branch.

MSWallack · 2023-03-30T14:19:42Z

I had to postpone downloading any additional files via the app for a few weeks while I attended to some other matters. Anyway, when I came back to the app it was once again logging me out regularly. I updated to the new test version mentioned in the most recent Discussion post. Same problem. However, I did note two things that may be meaningful: First, the program now shows a 401 error where it previously showed a 501 error. Second, in the dialog with the Authenticate button, even after clicking Authenticate and then closing the popup Tumblr window, a logout button does not appear and the "not logged in" message continues to be displayed.

…tinue - Occasionally something happens to the cookies and Tumblr returns an auth error on the next request. - In case of an auth error the cookies are fetched from the WebView instance again and the request is repeated. - It's not totally clear yet why this happens.

AshleyPinner · 2023-08-30T16:52:21Z

The latest update (2.11.2.421) doesn't seem to behave correctly; when it hits an auth error, it doesn't halt the worker threads, so while it's trying to reauth, the rest of the queue gets marked up as an error.

Worst still, this seems to introduce a strange behaviour where the Concurrent Blogs limit is ignored, it then starts up multiple blogs past this limit (I have it set to 4, it'll hit 10 during this bug), and then download absolutely nothing because it's past the limit.

Finally, everytime I start the app in the latest version, it thinks I'm not logged into Tumblr.

As an aside, can we persist the 'Newly downloaded items' number until the blog gets a new crawl? Part of what makes this most frustrating is that this number is ephemeral, and is lost if you need to relaunch the app for any reason (eg, it's gotten stuck and you have to kill it in task manager). This means that while it's gotten the files, and while I know which blogs were crawled, I can't be sure which ones actually got an update (for context, there's ~540 blogs in the index, and roughly 40 have new content when I crawl once per day)

thomas694 · 2023-09-02T21:18:19Z

I cannot reproduce your first three points. But as I said before, I don't get this error often anyway. If someone can support debugging here, help is welcome.

But it should be possible to persist this number.

thomas694 · 2023-09-18T23:55:32Z

By default the new behavior is disabled now. It can be enabled by an option in the advanced settings.

thomas694 added a commit that referenced this issue Dec 4, 2022

Fix issue #404 Blog crawler seems to rescan whole blog

ed3e684

- The normal crawler didn't recognize pinned posts.

thomas694 added the help wanted Extra attention is needed label Sep 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tool "forgets" login status and rescans whole blog #404

Tool "forgets" login status and rescans whole blog #404

thomas694 commented Dec 4, 2022

thomas694 commented Dec 4, 2022

jjdnl commented Dec 10, 2022

MSWallack commented Dec 22, 2022

AshleyPinner commented Jan 4, 2023 •

edited

Loading

thomas694 commented Jan 4, 2023

AshleyPinner commented Jan 26, 2023

thomas694 commented Jan 29, 2023

MSWallack commented Jan 30, 2023

thomas694 commented Jan 31, 2023

MSWallack commented Feb 3, 2023

AshleyPinner commented Feb 6, 2023

MSWallack commented Feb 9, 2023

AshleyPinner commented Feb 9, 2023 •

edited

Loading

thomas694 commented Feb 13, 2023

MSWallack commented Feb 14, 2023

MSWallack commented Feb 15, 2023

thomas694 commented Feb 16, 2023 •

edited

Loading

MSWallack commented Feb 17, 2023

AshleyPinner commented Feb 17, 2023

thomas694 commented Feb 19, 2023

AshleyPinner commented Feb 19, 2023 •

edited

Loading

thomas694 commented Feb 21, 2023

AshleyPinner commented Feb 21, 2023 •

edited

Loading

thomas694 commented Feb 26, 2023

MSWallack commented Mar 30, 2023

AshleyPinner commented Aug 30, 2023

thomas694 commented Sep 2, 2023

thomas694 commented Sep 18, 2023

Tool "forgets" login status and rescans whole blog #404

Tool "forgets" login status and rescans whole blog #404

Comments

thomas694 commented Dec 4, 2022

Discussed in #396

thomas694 commented Dec 4, 2022

jjdnl commented Dec 10, 2022

MSWallack commented Dec 22, 2022

AshleyPinner commented Jan 4, 2023 • edited Loading

thomas694 commented Jan 4, 2023

AshleyPinner commented Jan 26, 2023

thomas694 commented Jan 29, 2023

MSWallack commented Jan 30, 2023

thomas694 commented Jan 31, 2023

MSWallack commented Feb 3, 2023

AshleyPinner commented Feb 6, 2023

MSWallack commented Feb 9, 2023

AshleyPinner commented Feb 9, 2023 • edited Loading

thomas694 commented Feb 13, 2023

MSWallack commented Feb 14, 2023

MSWallack commented Feb 15, 2023

thomas694 commented Feb 16, 2023 • edited Loading

MSWallack commented Feb 17, 2023

AshleyPinner commented Feb 17, 2023

thomas694 commented Feb 19, 2023

AshleyPinner commented Feb 19, 2023 • edited Loading

thomas694 commented Feb 21, 2023

AshleyPinner commented Feb 21, 2023 • edited Loading

thomas694 commented Feb 26, 2023

MSWallack commented Mar 30, 2023

AshleyPinner commented Aug 30, 2023

thomas694 commented Sep 2, 2023

thomas694 commented Sep 18, 2023

AshleyPinner commented Jan 4, 2023 •

edited

Loading

AshleyPinner commented Feb 9, 2023 •

edited

Loading

thomas694 commented Feb 16, 2023 •

edited

Loading

AshleyPinner commented Feb 19, 2023 •

edited

Loading

AshleyPinner commented Feb 21, 2023 •

edited

Loading