Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken URL fragments in the repo #25421

Closed
OnkarRuikar opened this issue Mar 17, 2023 · 12 comments
Closed

Broken URL fragments in the repo #25421

OnkarRuikar opened this issue Mar 17, 2023 · 12 comments
Labels
MDN:Project Anything related to larger core projects on MDN

Comments

@OnkarRuikar
Copy link
Contributor

OnkarRuikar commented Mar 17, 2023

There are many broken URL fragments in mdn/content.
Almost all the time when a header(## header) gets modified, contributors don't update related fragments all over the repo.

Following is the list of broken fragments data:
https://gist.github.com/OnkarRuikar/fa14827682a37f04f23c49189bfa0701

@github-actions github-actions bot added the needs triage Triage needed by staff and/or partners. Automatically applied when an issue is opened. label Mar 17, 2023
@teoli2003
Copy link
Contributor

Note that if you group them by broken fragments (rather than by page with a broken link), we can look for one and solve all occurrences, and the list can be resolved more quickly.

@OnkarRuikar
Copy link
Contributor Author

Note that if you group them by broken fragments (rather than by page with a broken link), we can look for one and solve all occurrences, and the list can be resolved more quickly.

But then we'll get multiple commits on each file :)
If that is not an issue then I've added sorted list of unique urls to OP.

@teoli2003
Copy link
Contributor

teoli2003 commented Mar 17, 2023

Note that if you group them by broken fragments (rather than by page with a broken link), we can look for one and solve all occurrences, and the list can be resolved more quickly.

As long as we don't do one commit per flaw… I think this is ok: there is about 20 times the same broken link to HTTP/Cache#varying… If all are fixed in the same PR, it is quick to review, and I don't mind that the other flaws in these files are solved separately.

@Josh-Cena Josh-Cena added bulk update An update to a mass amount of data, or scripts/linters related to such changes chore A routine task. and removed needs triage Triage needed by staff and/or partners. Automatically applied when an issue is opened. labels May 15, 2023
@Josh-Cena Josh-Cena added MDN:Project Anything related to larger core projects on MDN and removed bulk update An update to a mass amount of data, or scripts/linters related to such changes labels Jun 23, 2023
@Josh-Cena Josh-Cena removed the chore A routine task. label Aug 3, 2023
@Josh-Cena
Copy link
Member

@OnkarRuikar: I don't think this issue can be kept up to date. Can you share the script used to generate the gists and every time we try to fix something we can run the script?

@OnkarRuikar
Copy link
Contributor Author

every time we try to fix something we can run the script?

I don't think this issue can be kept up to date.

We don't have to. The current script prevents new occurrences from happening. The gist still hold true, e.g. content/files/en-us/learn/tools_and_testing/client-side_javascript_frameworks/ember_interactivity_events_state/index.md Learn/JavaScript/Building_blocks/Events#addeventlistener_and_removeeventlistener is still broken.

@Josh-Cena
Copy link
Member

What about anchors that happen to be fixed, though? For example, the very first item, "content/files/en-us/glossary/script-supporting_element/index.md Web/HTML/Kinds_of_HTML_content#script-supporting_elements", was fixed in #28000.

@Josh-Cena
Copy link
Member

Anyway I'll drop my script used in https://github.com/orgs/mdn/discussions/279:

import fs from "node:fs/promises";
import path from "node:path";
import { load } from "cheerio";

async function* getFiles(dir) {
  const dirents = await fs.readdir(dir, { withFileTypes: true });
  for (const dirent of dirents) {
    const res = path.join(dir, dirent.name);
    if (dirent.isDirectory()) {
      yield* getFiles(res);
    } else if (res.endsWith("plain.html")) {
      yield Promise.all([res, fs.readFile(res, "utf8")]);
    }
  }
}

const files = getFiles("build");
/** @type {Record<string, string[]>} */
const pathToIds = { __proto__: null };
/** @type {Record<string, string[]>} */
const pathToLinks = { __proto__: null };

for await (const [filename, content] of files) {
  const $ = load(content);
  pathToLinks[filename] = $("a")
    .map((i, el) => $(el).attr("href"))
    .get()
    .map(
      (url) =>
        new URL(
          url.startsWith("#")
            ? `${filename.replace(/build\/|\/plain.html/g, "")}${url}`
            : url,
          "https://developer.mozilla.org"
        )
    )
    .filter(
      (url) =>
        url.hostname === "developer.mozilla.org" &&
        url.hash &&
        url.hash !== "#languages-switcher-button"
    );
  pathToIds[filename] = $("[id]")
    .map((i, el) => $(el).attr("id"))
    .get();
}

for (const [filename, urls] of Object.entries(pathToLinks)) {
  const failures = [];
  for (const url of urls) {
    const referencedPath = path.join(
      "build",
      url.pathname
        .replaceAll("*", "_star_")
        .replaceAll("::", "_doublecolon_")
        .replaceAll(":", "_colon_")
        .replaceAll("?", "_question_")
        .toLowerCase(),
      "plain.html"
    );
    const ids = pathToIds[referencedPath];
    if (ids) {
      const target = ids.find((id) => id === decodeURI(url.hash.slice(1)));
      if (!target) {
        failures.push(
          `Anchor not found: ${url
            .toString()
            .replace("https://developer.mozilla.org", "")}`
        );
      }
    } else if (
      !["/", "/en-US/docs/MDN/Contribute/Feedback", "/docs/MDN/About"].includes(
        url.pathname
      )
    ) {
      failures.push(
        `Page not found: ${url
          .toString()
          .replace("https://developer.mozilla.org", "")}`
      );
    }
  }
  if (failures.length) {
    const sourceFile = path.resolve(
      filename
        .replace("build/en-us/docs", "files/en-us")
        .replace("plain.html", "index.md")
    );
    console.log(`In ${sourceFile}:
${failures.map((e) => `  - ${e}`).join("\n")}
`);
  }
}

@OnkarRuikar
Copy link
Contributor Author

OnkarRuikar commented Jun 27, 2024

Anyway I'll drop my script used in github.com/orgs/mdn/discussions/279:

The script doesn't seem to handle definition list anchors:

In /home/onkar/GitHub/mdn/yari/client/files/en-us/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.md:
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#hearing_impairment
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#visual_impairment
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#pausing_capacity
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#keyboard_capacity

I've updated the broken fragment data in OP. As per my script there seems only 617 broken URL fragments left. The script runs on custom platform, which i developed years ago. It'll take some effort to make is stand alone.

@bsmth We've already plugged the hole so there aren't going to be new broken frags. For existing ~617 ones we need to create a project and a online spreadsheet to track them while fixing them.

@Josh-Cena
Copy link
Member

Josh-Cena commented Jun 27, 2024

The script doesn't seem to handle definition list anchors:

I don't see those problems in my output. Make sure you have built using the latest content.

Also, the pre-commit hook isn't able to catch anchor errors generated from macros.

@OnkarRuikar
Copy link
Contributor Author

The script doesn't seem to handle definition list anchors:

I don't see those problems in my output. Make sure you have built using the latest content.

I built it in yari. Doing it in content works.

Also, the pre-commit hook isn't able to catch anchor errors generated from macros.

Yes, for macros the entire content needs to be built. Witch won't be favourable before every commit or in GitHub workflows.

@Josh-Cena
Copy link
Member

Closing in favor of mdn/mdn#561

@bsmth
Copy link
Member

bsmth commented Jul 2, 2024

Closing in favor of mdn/mdn#561

good idea, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MDN:Project Anything related to larger core projects on MDN
Projects
None yet
Development

No branches or pull requests

4 participants