There are several good reasons you might need to find every URL on a website, but your exact goal will determine what you're looking for. For example, you might want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.
In this post, I'll walk you through a few tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
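If you'd rather skip the browser plugin, the Wayback Machine's public CDX API can return a URL list programmatically. Here's a minimal sketch, assuming Python with the requests library and using example.com as a placeholder domain; check the CDX documentation for current parameter behavior before relying on it.

```python
import requests

# Query the Wayback Machine CDX API for URLs archived under a domain.
# "example.com" is a placeholder; swap in your own domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "prefix",   # everything under the domain/path prefix
        "output": "json",
        "fl": "original",        # only return the original URL field
        "collapse": "urlkey",    # deduplicate repeated captures of the same URL
        "limit": 10000,
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is a header row
print(len(urls), "archived URLs found")
```

Keep the resulting list in the notebook, or paste it into a spreadsheet, for the deduplication step at the end.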
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
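If the standard export isn't enough, the Search Analytics API can page through far more rows than the UI allows. Below is a rough sketch, assuming Python with the google-api-python-client and google-auth libraries, a service account that has been granted access to the property, and placeholder dates, property, and file names.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder key file; the service account must be added as a user in Search Console.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    body = {
        "startDate": "2024-01-01",  # placeholder date range
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,          # per-request maximum
        "startRow": start_row,
    }
    resp = service.searchanalytics().query(
        siteUrl="sc-domain:example.com", body=body
    ).execute()
    rows = resp.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(len(pages), "pages with impressions")
```

Keep in mind this only surfaces pages that earned impressions in the chosen date range, so it complements rather than replaces the other sources.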
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
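If you'd rather pull the same data programmatically, the GA4 Data API can return pagePath values with a filter equivalent to the segment above. This is a rough sketch, assuming Python with the google-analytics-data client library, Application Default Credentials already configured, and a placeholder property ID and date range.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses Application Default Credentials

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[{"start_date": "2024-01-01", "end_date": "2024-03-31"}],
    # Equivalent of the /blog/ segment: only page paths containing /blog/
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(len(blog_paths), "blog paths")
```

Swap the filter value to slice off other sections of the site and repeat, which is how you work around the per-report row limit.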
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be huge, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but several tools are available to simplify the process; see the sketch after this list for a lightweight starting point.
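Even a few lines of code can pull the unique paths out of a standard access log. This is a minimal sketch, assuming Python and an Apache/Nginx combined-format file named access.log; CDN logs often use different formats, so treat the regex as illustrative rather than definitive.

```python
import re

# Matches the request line of a common/combined log format entry, e.g.
# ... "GET /blog/post-1?utm_source=x HTTP/1.1" 200 ...
request_re = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = request_re.search(line)
        if match:
            # Strip query strings so /blog/post-1?utm_source=x and /blog/post-1 count once
            paths.add(match.group(1).split("?")[0])

print(len(paths), "unique paths requested")
```

Filtering the same lines by user agent (for example, entries containing "Googlebot") also tells you which of those paths search engine crawlers actually requested.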
Combine, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
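For the notebook route, a small pandas snippet is usually enough to merge and deduplicate the exports. This is a rough sketch, assuming Python with pandas installed and that each source has been saved as a one-column CSV of URLs; the file names and the normalization rules (lowercase host, no query string, no trailing slash) are just examples to adapt.

```python
from urllib.parse import urlsplit, urlunsplit

import pandas as pd

# Placeholder file names: one column of URLs per export.
files = ["archive_org.csv", "gsc_pages.csv", "ga4_pages.csv", "log_paths.csv"]

def normalize(url: str) -> str:
    """Lowercase the scheme/host, drop query string and fragment, trim trailing slashes."""
    parts = urlsplit(str(url).strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))

frames = [pd.read_csv(name, header=None, names=["url"]) for name in files]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().map(normalize)

deduped = urls.drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False, header=False)
print(len(deduped), "unique URLs written to all_urls.csv")
```

Whether you normalize query strings away depends on your site; keep them if parameters create genuinely distinct pages.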
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!