Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

301 redirect error on paradoxValidateLinks should not be hardcoded as an error #404

Open
skvithalani opened this issue Jan 10, 2020 · 4 comments

Comments

@skvithalani
Copy link

skvithalani commented Jan 10, 2020

We are using paradox 0.6.8 version and while running paradoxValidateLinks we get a whole bunch of 301 redirect errors, mostly saying that

Received a 301 Moved Permanently on external link, location redirected to is [https://github.com/tmtsoftware/myproject/blob/master/..] at somepage.md

and we have set github.base_url to https://github.com/tmtsoftware/myproject/tree/master/...

By looking at the code of validateExternalLink method

private def validateExternalLink(capturedLink: CapturedLink, errorContext: ErrorContext, logger: ParadoxLogger) = {
    logger.info(s"Validating external link: ${capturedLink.link}")

    def reportError = reportErrorOnSources(errorContext, capturedLink.allSources)(_)
    val url = capturedLink.link.toString

    try {
      val response = Jsoup.connect(url)
        .userAgent("Paradox Link Validator <https://github.com/lightbend/paradox>")
        .followRedirects(false)
        .ignoreHttpErrors(true)
        .ignoreContentType(true)
        .execute()

      // jsoup doesn't offer any simple way to clean up, the only way to close is to get the body stream and close it,
      // but if you've already read the response body, that will throw an exception, and there's no way to check if
      // you've already tried to read the response body, so we can't do that in a finally block, we have to do it
      // explicitly every time we don't want to consume the stream.
      def close() = response.bodyStream().close()

      if (response.statusCode() / 100 == 3) {
        close()
        reportError(s"Received a ${response.statusCode()} ${response.statusMessage()} on external link, location redirected to is [${response.header("Location")}]")
      } else if (response.statusCode() != 200) {
        close()
        reportError(s"Error validating external link, status was ${response.statusCode()} ${response.statusMessage()}")
      } else {
        if (capturedLink.hasFragments) {
          validateFragments(url, response.parse(), capturedLink.fragments, errorContext)
        } else {
          close()
        }
      }
    } catch {
      case NonFatal(e) =>
        reportError(s"Exception occurred when validating external link: $e")
        logger.debug(e)
    }
  }

Here, followRedirects is hardcoded false in validateExternalLink method. Hence, can we take a customised boolean setting from users like followRedirects to let them choose whether paradoxValidateLinks errors out on 301 or not ?

We can also think to take custom error code list to fail only on limited errors mentioned (white list of error codes or black list of error codes or both available as settings and if both are provided then default to white list).

@skvithalani skvithalani changed the title 301 redirect error on paradoxValidateLinks 301 redirect error on paradoxValidateLinks should not be hardcoded as an error Jan 10, 2020
@skvithalani skvithalani changed the title 301 redirect error on paradoxValidateLinks should not be hardcoded as an error 301 redirect error on paradoxValidateLinks should not be hardcoded as an error Jan 10, 2020
@ennru
Copy link
Member

ennru commented Jan 11, 2020

There is a lot of potential to improve the link checking. Adding a followRedirects would help your specific case, for my projects I changed most links manually to the redirect target instead.
Creating a link check report as I suggested in #367 would make that configuration less important.

@skvithalani
Copy link
Author

skvithalani commented Jan 11, 2020

@ennru Thank you for replying to the issue.

We are planning to add paradoxValidateLinks in one of our nightly builds(not in dev/prod build), even though it is not recommended to do so, as it can fail randomly in case any external site is unreachable. But, we would still like to go ahead and add it in our build to get regular feedback. Also, we are fine with seldom failure in our nightly builds because of unreachable link.

With this understanding, 301 is likely to come when we will add new external links. Hence, #367 is a very good approach to fix linking errors by looking at a file (CSV) but in our case particularly we would want deterministic failures and possibly skip 301 to add paradoxValidateLinks in our nightly build.

Does this seem to be a reasonable and feasible scenario in your opinion?

Also, there are couple of links in our documentation for which we require https://github.com/tmtsoftware/myproject/tree/master/.. as our base url (where we point to github directory) and all others require https://github.com/tmtsoftware/myproject/blob/master/.. as our base url (where we point to concrete files and not directory), so what is the recommended approach to address both base urls without getting 301 error?

@ennru
Copy link
Member

ennru commented Jan 13, 2020

Don't you get FORBIDDEN from Github when you try this? They are very quick on blocking the requests from this link checker when I try it.
I think creating a report in a file and diffing that against an expected outcome seems more attractive than trying to configure something more advanced. I believe the link checking should maybe not even be a feature of Paradox itself. My take on it before this was introduced here was to build an external link checker working on the generated files: https://github.com/ennru/site-link-validator (but didn't get to a usable state...)

@skvithalani
Copy link
Author

skvithalani commented Jan 13, 2020

Surprisingly, I am not getting FORBIDDEN so far. In fact, I had to add github.dir.base_url in addition to already existing github.base_url value in our build.

This is how we use it in md files:
[csw-location-client]($github.dir.base_url$/csw-location/csw-location-client)

After having github.dir.base_url and github.base_url in our build I was able to get rid of github related 301 errors. But this does not seem like a clean solution. I would still expect some smartness to skip such 301 redirect errors from validation tool. Otherwise, in a big project it is a maintenance overhead to look for tree and blob specifically.

Also, when it comes to getting a regular feedback from our nightly build, I think it is important to have such validation tasks run regularly so that we don't create a big pile of maintenance at the time of release to fix all invalid links. And for that matter, I agree and see your point, that if we use the file approach to validate against some expected output, it will solve the problem where I expect 301 errors for github links. One disadvantage of this is that the file would turn out to have all the guthub related 301 errors, which are more than 100 in our project.

So again, my natural question is, can't we still simplify the file such that I don't need over 100 entries bypassing 301 errors ? Wouldn't it be better to customise what errors I want to fail on ?

Also, I checked out your https://github.com/ennru/site-link-validator, it seems pretty cool. If it would be in working condition with customisable errors to fail on and having first class support of paradox directives, I think, it would be a good fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants