
Solutions and Tools for Dealing with Broken Links in Web Pages

A couple of months ago a post by Leo Blanchette got to the front page of Hacker News, and there was an interesting discussion on dealing with broken links and external content. The main problem is links that become out of date due to paywalls, altered content, or content getting taken down.

I’ve been running this blog since May 2008. If you’ve run a content-driven site for even a fraction of that, you know that link rot is a problem. In this post I’ll go over some of the suggestions in that thread along with some tools to use to check for broken links.

Linking to Archives by Default?

The gist of Leo’s original post was that he prefers to link to archive.org versions of links, essentially snapshots of the intended resource. In theory this sounds like a good idea, but as some in the thread pointed out, it’s not optimal.

First of all, the page has to exist in an archived snapshot. Fortunately, archive.org makes this easy to do. To save any page, just visit the following URL with the link you want to archive appended to it:

https://web.archive.org/save/

For example, if I want to save the home page for this blog I would visit:

https://web.archive.org/save/https://www.impressivewebs.com

Or I could visit web.archive.org/save and enter the page manually, with the option to save error pages too.
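If you have a batch of links to archive, the save URL pattern described above is easy to generate in a script. Here's a minimal sketch; the `buildSaveUrl` name is my own, and it only builds the URL (actually visiting that URL is what triggers the snapshot).

```javascript
// Prefix described above: append the page URL to request a new snapshot.
const SAVE_PREFIX = "https://web.archive.org/save/";

// Hypothetical helper: returns the "save this page" URL for a given page.
function buildSaveUrl(pageUrl) {
  // new URL() throws a TypeError on malformed input, so bad links fail early.
  const parsed = new URL(pageUrl);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("Only http(s) pages can be archived: " + pageUrl);
  }
  // The target is appended as-is (not percent-encoded), matching the
  // example URL shown earlier.
  return SAVE_PREFIX + pageUrl;
}

console.log(buildSaveUrl("https://www.impressivewebs.com"));
// → https://web.archive.org/save/https://www.impressivewebs.com
```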

This is possible from a reader/blogger perspective, but due to takedowns and preventative measures by many websites, archive.org is not completely reliable for finding older pages. There are a number of other services for archiving pages, many of which have gained popularity in recent years (e.g., archive.is). But I’m not sure of the long-term reliability of those either, and they’re not as popular as archive.org. For that reason, I tend to favour using archive.org.

As mentioned, many in the discussion on Hacker News disagreed with defaulting a link to the archived version. The primary reason not to do this is that it takes traffic away from the original content, which seems unethical and may even affect page rank for the resource (I can’t imagine Google gives ranking juice to a site as a result of inbound links to archived pages).

A few of the comments in the discussion that caught my eye proposed alternative solutions to this problem.

Mark Graham, who manages the Wayback Machine at the Internet Archive, wrote the following:

We suggest/encourage people link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift).

Interestingly, he also said:

BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time… so that we are able to fix them if/when they break.

Good to know!

Mark’s solution is mainly a content solution. For example, you might include a link then put the archived link in parentheses next to it. But that’s a messy and tedious solution. I don’t want my content littered with unnecessary links. The discussion went on to suggest HTML solutions that I think are much better.

A workable HTML solution, similar to one suggested by a user in the thread, looks something like this:

<a href="https://www.impressivewebs.com/"
   data-archive-url="https://web.archive.org/web/20201124215517/https://www.impressivewebs.com/"
   data-archive-date="2020-11-24"
   >My Blog</a>

This uses HTML’s data-* attributes (which can be named anything you want as long as the name begins with data-) to store the archived URL and the date it was archived. I think this is the best and cleanest solution. Of course, on its own this won’t send the user to the archived page in the event of link rot or content drift; the attributes only record where the archived copy lives.
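To make those attributes do something, a small script could swap in the archived URL once a link is known to be dead. The following is only a sketch: the `applyArchiveFallback` name and the `isBroken` flag are my own inventions, and how you decide that a link is broken is left up to you. It works on anything with `href` and `dataset` properties, which includes real anchor elements.

```javascript
// Hypothetical helper: given a link (anything with `href` and `dataset`,
// such as an <a> element) and a flag saying whether the original URL is
// known to be broken, point the link at its archived copy.
// Returns true if the link was changed, false otherwise.
function applyArchiveFallback(link, isBroken) {
  const archiveUrl = link.dataset.archiveUrl;   // from data-archive-url
  const archiveDate = link.dataset.archiveDate; // from data-archive-date

  // Leave the link alone if it still works or has no archived copy.
  if (!isBroken || !archiveUrl) {
    return false;
  }

  link.href = archiveUrl;
  if (archiveDate) {
    // Tell the reader they are getting a snapshot, not the live page.
    link.title = "Archived copy from " + archiveDate;
  }
  return true;
}

// In a browser, usage might look like this (brokenUrls is an assumed Set
// of URLs that your own link checker reported as dead):
//   document.querySelectorAll("a[data-archive-url]").forEach((a) => {
//     applyArchiveFallback(a, brokenUrls.has(a.href));
//   });
```

Keeping the original `href` in the markup and only swapping it at runtime means the original site still gets the link by default, which addresses the traffic concern raised in the thread.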

Ideally, browsers would be able to recognize that the link is down and automatically direct the user to the archived URL. But this presents problems of its own. What if the page isn’t down, but only changed? How would the browser detect this? What if the page redirects to another URL? That wouldn’t necessarily be detected either.

To address these problems, others proposed new HTML features, similar in spirit to the data-* attributes:

It’s interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF- a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup.

But again, while this seems effective in theory, it feels like you’re asking the browser to do too much, which could have performance or even security implications. I don’t know if there’s a perfect technical solution to this, but the HTML solution I presented seems simple enough for those who are concerned about link rot.

Even with a data-* attribute in place, you still need a way to discover whether bad links exist. True, it’s much harder and more tedious to discover links that aren’t technically broken but that redirect to a new page. And it’s even more difficult to find pages whose content has been updated in a way that makes the link less relevant. But you can do your part by incorporating any of a number of tools.
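A basic script can at least separate hard failures from redirects. The sketch below uses the standard `fetch` API (built into browsers and recent versions of Node); the function names and the three state labels are my own, and it makes no attempt to detect content drift, which still needs a human eye.

```javascript
// Hypothetical helper: classify the outcome of requesting a URL.
// `response` only needs the fields `ok`, `status`, `redirected`, and `url`,
// all of which the standard fetch Response object provides.
function classifyLinkResponse(response) {
  if (!response.ok) {
    // Status outside the 200-299 range, e.g. 404 or 500.
    return { state: "broken", detail: "HTTP " + response.status };
  }
  if (response.redirected) {
    // Not dead, but worth a manual look: the content may have moved.
    return { state: "redirected", detail: response.url };
  }
  return { state: "ok", detail: "HTTP " + response.status };
}

// Fetch a URL and classify it. Network-level failures (DNS errors,
// refused connections) are treated as broken.
async function checkLink(url) {
  try {
    const response = await fetch(url, { redirect: "follow" });
    return classifyLinkResponse(response);
  } catch (err) {
    return { state: "broken", detail: String(err) };
  }
}
```

Something like `checkLink("https://www.impressivewebs.com/").then(console.log)` would run a single check. This is better suited to a build or maintenance script than to code running on the page itself, since cross-origin requests in the browser are restricted by CORS.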

First of all, as a reader, if you come across an outdated or broken link, you can check the web archive for existing snapshots. Browser extensions make this easier.

Wayback Machine Add-on for Firefox

If you use Brave, a broken page will display a bar at the top allowing you to check the Wayback Machine for an archived version.

Wayback Machine Feature in Brave Browser

This is a setting you can enable or disable:

Enabling the Wayback Machine Feature in Brave Browser

I’m not sure if other browsers have this ability, but the extensions linked above will accomplish the same thing.
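If you’d rather script the lookup than rely on an extension, the Internet Archive has an availability endpoint at archive.org/wayback/available that returns JSON describing the closest snapshot for a URL. Here’s a rough sketch of building the request and reading the result. The function names are mine, and the response shape in the comments is my understanding of the API, so check it against the current documentation before relying on it.

```javascript
// Build the request URL for the Wayback Machine availability endpoint.
function buildAvailabilityUrl(pageUrl) {
  return "https://archive.org/wayback/available?url=" + encodeURIComponent(pageUrl);
}

// Hypothetical helper: pull the closest snapshot out of the parsed JSON.
// Assumed shape:
//   { archived_snapshots: { closest: { available, url, timestamp } } }
function closestSnapshot(apiResponse) {
  const snapshots = apiResponse && apiResponse.archived_snapshots;
  const closest = snapshots && snapshots.closest;
  if (!closest || !closest.available) {
    return null; // no snapshot on record
  }
  // Timestamps look like 20201124215517 (YYYYMMDDhhmmss).
  const t = String(closest.timestamp);
  const date = t.slice(0, 4) + "-" + t.slice(4, 6) + "-" + t.slice(6, 8);
  return { url: closest.url, date: date };
}

// Example with a hand-written response object (not real API output):
const sampleResponse = {
  archived_snapshots: {
    closest: {
      available: true,
      url: "https://web.archive.org/web/20201124215517/https://www.impressivewebs.com/",
      timestamp: "20201124215517",
    },
  },
};
console.log(closestSnapshot(sampleResponse));
```

The returned `url` and `date` are the two values you’d need if you wanted to store an archived copy alongside a link.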

Of course, those are solutions from a reader’s perspective. From a developer perspective, in addition to some extra HTML, you’ll also want to do a check for broken links. This can be automated or handled manually every so often with any number of tools. Below are a few that I’ve come across over the years.

broken-link-checker

Can be used as a CLI tool or programmatically via JavaScript.

linkinator

Similar to the previous tool; can be used as a CLI tool or programmatically via JavaScript.

Checkbot

Tests for broken links in addition to a number of SEO and security checks. Free for small websites.

Dead Link Checker

Online dead link checker that tests the first 2000 links free.

Broken Link Checker for WordPress

A popular WordPress plugin for checking dead links.

There are also a number of website monitoring solutions, usually not free, that include broken link checks as part of their service. I won’t list any of those here, but you can try this search in my newsletter’s archives, which should pull up lots of issues that included monitoring tools.

Any Other Ideas?

If you have any other suggestions or know of another tool to deal with broken links in web pages or even the potential for broken links in the future, feel free to post a comment below.


3 Responses

  1. Nardi says:

    Checkbot is a great plugin. With a few more tweaks it’ll be perfect.

  2. Nabi says:

    Broken link checker is good one. It serves my purpose incredibly. I refer to many. Observation is fine.

  3. Great write-up! i will use this in the future.
