sitemap.xml

Tell Crawlers What to Crawl and Index for Maximum Crawl Budget Efficiency

Tags: SEO, Website, Technical
Author: Steven Hsu

A sitemap.xml is a structured file that lists the important URLs of a website so search engines can discover and crawl them more efficiently. It acts as a roadmap for search engines, helping them identify which URLs exist and when important pages were last updated.

Search engines can still discover pages through internal links, external links, and other crawl paths. A sitemap does not replace good site architecture. It gives search engines an additional discovery signal, especially for large websites, newly launched sites, frequently updated content, or websites with complex structures.

A sitemap.xml does not guarantee indexing. Its real job is to make important URLs easier for search engines to discover, revisit, and evaluate.

What Is a sitemap.xml?

A sitemap.xml is an XML file that lists the URLs a website wants search engines to discover. It usually sits at a predictable location, such as:

https://www.example.com/sitemap.xml

Some websites use one sitemap file. Larger websites often use a sitemap index that links to multiple sitemap files, such as page sitemaps, post sitemaps, product sitemaps, image sitemaps, or video sitemaps.

A sitemap should not be treated as a list of every URL that exists on a website. It should list the URLs that are useful, canonical, crawlable, and intended for search discovery.

What a Sitemap.xml Does

The main purpose of a sitemap is to help search engines discover important content more efficiently. It gives crawlers a structured list of URLs that the site owner wants to make visible for crawling and indexing consideration.

A sitemap can tell search engines:

  • Which important URLs exist on the site
  • When those URLs were last significantly updated
  • Where specialized content such as images, videos, or news articles may be found
  • How larger groups of URLs are organized through sitemap index files

This is useful because search engines do not always discover every page immediately through links alone. A new page may not have many internal links yet. A deep page may sit several clicks away from the homepage. A large website may have thousands of URLs spread across different templates and content types.

How Search Engines Use Sitemaps

Search engines use sitemaps as a discovery and crawl-support signal. When a sitemap is submitted through Google Search Console or Bing Webmaster Tools, or referenced in robots.txt, crawlers can use it to find URLs and detect changes more efficiently.

A sitemap can help search engines:

  1. Discover new pages faster
  2. Revisit updated pages more efficiently
  3. Find URLs that may be difficult to discover through links alone
  4. Understand which canonical URLs the site prefers to expose
  5. Monitor sitemap groups separately in search tools

Indexing still depends on crawlability, content quality, canonical signals, duplication, internal linking, page value, and whether the page is eligible to appear in search results.

Basic Structure of a Sitemap.xml

A sitemap is written in XML, which stands for Extensible Markup Language. It follows a standardized structure so search engines can parse the file consistently.

A simplified sitemap looks like this:

Sitemap Example
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2026-03-15</lastmod>
  </url>

  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2026-03-10</lastmod>
  </url>

</urlset>

The most important element is <loc>, which defines the full canonical URL of the page. Sitemap URLs should be fully qualified absolute URLs, not relative paths.

<lastmod> shows when the page was last significantly updated. This should reflect meaningful changes, such as updates to the main content, structured data, links, or page information. It should not change just because the footer year changed or the page was rebuilt without meaningful content changes.

Older sitemap examples often include <changefreq> and <priority>. These fields are part of the sitemap protocol, but Google ignores them. For modern SEO, <loc> and accurate <lastmod> values matter more than trying to assign artificial priority scores.
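The structure described above can be read back with a few lines of Python. The following is a minimal sketch using only the standard library; the function names `parse_sitemap` and `is_absolute` are illustrative assumptions, and the sample document simply repeats the example from this section.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the sitemap protocol (taken from the example above).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2026-03-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2026-03-10</lastmod>
  </url>
</urlset>
"""


def parse_sitemap(xml_bytes):
    """Return a list of (loc, lastmod) tuples; lastmod is None if absent."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


def is_absolute(url):
    """True when the URL has an http(s) scheme and a host, as <loc> requires."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for loc, lastmod in parse_sitemap(SAMPLE):
    print(loc, lastmod, "absolute" if is_absolute(loc) else "NOT absolute")
```

A check like `is_absolute` is only a basic sanity test. It does not confirm that the URL is canonical or indexable.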

Types of Sitemaps

Modern websites may use different sitemap types depending on the content and scale of the site.

| Sitemap Type | Best Used For | What It Helps Search Engines Discover |
| --- | --- | --- |
| XML Sitemap | Standard website URLs | Pages, posts, products, services, and other indexable URLs |
| Image Sitemap | Image-heavy websites | Important images that may not be easily discovered through normal crawling |
| Video Sitemap | Pages with video content | Video metadata such as title, description, thumbnail, and duration |
| News Sitemap | News publishers | Recently published news articles |
| Sitemap Index | Large or segmented websites | Multiple sitemap files grouped under one index file |

Large websites often benefit from separating sitemaps by content type. For example, a website may have one sitemap for pages, one for blog posts, one for products, and one for images.

This structure can make sitemap management cleaner and help with reporting. In Google Search Console, separate sitemap files can make it easier to identify which content groups have discovery, crawling, or indexing issues.

Sitemap Size Limits

Search engines impose limits on sitemap files. A single sitemap can contain:

  • Up to 50,000 URLs
  • Up to 50 MB uncompressed

If a website exceeds either limit, the sitemap should be split into multiple files and organized through a sitemap index.
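One way to respect these limits is to split the URL list into chunks before writing each file. The sketch below assumes Python 3.8 or later and uses only the standard library; `chunk_urls` and `build_urlset` are hypothetical helper names, not part of any sitemap tool.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000              # URL limit per sitemap file
MAX_BYTES = 50 * 1024 * 1024   # 50 MB uncompressed


def chunk_urls(entries, size=MAX_URLS):
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def build_urlset(entries):
    """Serialize (loc, lastmod) pairs into one sitemap document (bytes)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # lastmod is optional; omit it rather than guess
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_all(entries):
    """Return one serialized sitemap per chunk, checking the byte limit."""
    files = []
    for chunk in chunk_urls(entries):
        data = build_urlset(chunk)
        if len(data) > MAX_BYTES:
            raise ValueError("sitemap exceeds 50 MB; use a smaller chunk size")
        files.append(data)
    return files
```

The URL count is checked by construction and the byte size after serialization, because long URLs can hit the size limit before the count limit.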

Example Sitemap Index Structure
/sitemap-index.xml
/pages-sitemap.xml
/posts-sitemap.xml
/products-sitemap.xml
/images-sitemap.xml
/videos-sitemap.xml

For small websites, one sitemap is usually enough. For larger websites, splitting sitemaps by content type can make the system easier to maintain and audit.
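A sitemap index is itself a small XML file that points at the child sitemaps. As a rough sketch, assuming the file names from the example structure above are served from https://www.example.com, it could be generated like this (the helper name `build_sitemap_index` is an assumption):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Child sitemaps from the example structure; the host is assumed.
CHILD_SITEMAPS = [
    "https://www.example.com/pages-sitemap.xml",
    "https://www.example.com/posts-sitemap.xml",
    "https://www.example.com/products-sitemap.xml",
    "https://www.example.com/images-sitemap.xml",
    "https://www.example.com/videos-sitemap.xml",
]


def build_sitemap_index(sitemap_urls):
    """Serialize a <sitemapindex> listing each child sitemap URL (bytes)."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


print(build_sitemap_index(CHILD_SITEMAPS).decode("utf-8"))
```

The output would be saved as /sitemap-index.xml. Each `<loc>` in the index must be an absolute URL, just like page URLs in a regular sitemap.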

Best Practices for Sitemap.xml

A sitemap should be clean, accurate, and aligned with the website’s canonical URL strategy. It should not be treated as a dumping ground for every possible URL.

Include Only Indexable URLs

A sitemap should include URLs that are intended to be discovered and considered for indexing. Pages blocked by robots.txt, marked with noindex, redirected, duplicated, or removed should not appear in the sitemap.

If a sitemap includes URLs that search engines cannot or should not index, it creates conflicting signals. The sitemap says “this URL is important,” while the page-level signals say “do not index this URL” or “this URL is not the preferred version.”
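The rule can be expressed as a simple filter applied before the sitemap is generated. This is a sketch under stated assumptions: the `Page` record and its fields are hypothetical and stand in for whatever crawl or CMS data a site actually has.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical per-URL record, e.g. from a crawl export or CMS query."""
    url: str
    status: int              # HTTP status code returned for the URL
    noindex: bool            # True if a noindex directive is present
    blocked_by_robots: bool  # True if robots.txt disallows the URL
    canonical: str           # canonical URL declared by the page


def is_sitemap_eligible(page):
    """Keep only live, indexable, self-canonical URLs."""
    return (
        page.status == 200              # drops redirects and removed pages
        and not page.noindex
        and not page.blocked_by_robots
        and page.canonical == page.url  # drops duplicates pointing elsewhere
    )


pages = [
    Page("https://www.example.com/about", 200, False, False,
         "https://www.example.com/about"),
    Page("https://www.example.com/old-about", 301, False, False,
         "https://www.example.com/about"),
    Page("https://www.example.com/thanks", 200, True, False,
         "https://www.example.com/thanks"),
]
sitemap_urls = [p.url for p in pages if is_sitemap_eligible(p)]
print(sitemap_urls)
```

Running the same filter as an audit against an existing sitemap is one way to surface the conflicting signals described above.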

Use Canonical URLs

Each sitemap URL should represent the preferred canonical version of the page.

For example, if the canonical URL is:

https://www.example.com/about

then the sitemap should not list alternate versions such as:

http://example.com/about
https://example.com/about
https://www.example.com/about/
https://www.example.com/about?source=nav

The sitemap should reinforce the canonical version, not introduce more URL variation.
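A small check can flag how a candidate URL differs from the canonical one. The sketch below compares URL components with the standard library; the function name `variation_reasons` is an assumption, and the URLs are the ones listed above.

```python
from urllib.parse import urlsplit

CANONICAL = "https://www.example.com/about"


def variation_reasons(candidate, canonical=CANONICAL):
    """Return which URL components differ from the canonical URL."""
    c, k = urlsplit(candidate), urlsplit(canonical)
    reasons = []
    if c.scheme != k.scheme:
        reasons.append("scheme")
    if c.netloc.lower() != k.netloc.lower():
        reasons.append("host")
    if c.path != k.path:
        reasons.append("path")
    if c.query != k.query:
        reasons.append("query")
    return reasons


for url in [
    "http://example.com/about",
    "https://example.com/about",
    "https://www.example.com/about/",
    "https://www.example.com/about?source=nav",
    "https://www.example.com/about",
]:
    print(url, variation_reasons(url) or "canonical")
```

Any URL that returns a non-empty list is a variation and would be left out of the sitemap in favor of the canonical version.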

Keep the Sitemap Updated

A sitemap should reflect the current state of the website. When important pages are added, removed, redirected, or significantly updated, the sitemap should update accordingly.

For active websites, dynamic or auto-generated sitemaps are usually better than manually maintained files. A CMS, framework, or sitemap generator can keep the sitemap synchronized with published content.

The <lastmod> value should also be accurate. It should change when the actual page content changes in a meaningful way, not every time the site is redeployed.
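One possible way to keep <lastmod> stable across redeploys is to store a fingerprint of the page's main content and only move the date when the fingerprint changes. The sketch below is an assumption about how a generator might do this, not a description of any particular CMS. The names `content_fingerprint` and `next_lastmod` are hypothetical, and a hash treats every text change as meaningful, so it is only a proxy.

```python
import hashlib
from datetime import date


def content_fingerprint(main_content):
    """Hash the main content after collapsing whitespace differences."""
    normalized = " ".join(main_content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(previous, main_content, today):
    """Return the record to store for this page.

    `previous` is the stored {"fingerprint", "lastmod"} dict, or None for a
    new page. The date only changes when the content fingerprint changes.
    """
    fingerprint = content_fingerprint(main_content)
    if previous and previous["fingerprint"] == fingerprint:
        return previous  # rebuilt without content changes: keep the old date
    return {"fingerprint": fingerprint, "lastmod": today.isoformat()}


first = next_lastmod(None, "About our  company", date(2026, 3, 10))
redeploy = next_lastmod(first, "About our company\n", date(2026, 3, 15))
edited = next_lastmod(redeploy, "About our new company", date(2026, 3, 15))
print(first["lastmod"], redeploy["lastmod"], edited["lastmod"])
```

Only the main content should be passed in. Including the footer or other template markup would bring back the problem of dates changing on every rebuild.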

Submit the Sitemap to Search Engines

A sitemap can be submitted through Google Search Console and Bing Webmaster Tools. It can also be referenced in the robots.txt file:

Sitemap: https://www.example.com/sitemap.xml

For larger websites, it is common to submit the sitemap index rather than every individual sitemap file.
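To confirm that a robots.txt file exposes the sitemap reference, the Sitemap lines can be read with Python's standard `urllib.robotparser` module (the `site_maps()` method requires Python 3.8 or later). In this sketch the robots.txt content is an illustrative assumption built around the Sitemap line shown above, and `sitemaps_from_robots` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; only the Sitemap line comes from the example above.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""


def sitemaps_from_robots(robots_text):
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


print(sitemaps_from_robots(ROBOTS_TXT))
```

The same parser can fetch a live file with `set_url()` and `read()`. Parsing a string keeps the sketch runnable without network access.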

Avoid Low-Quality or Duplicate Pages

A sitemap should focus on important URLs. Low-value pages, duplicate pages, thin pages, temporary URLs, internal search results, filtered parameter URLs, and test pages should usually be excluded.

This helps keep the sitemap clean and makes it easier to diagnose indexing issues. If a sitemap contains thousands of weak or duplicate URLs, it becomes harder to understand which pages actually matter.

Pages That Should Not Be in a Sitemap

A sitemap should include URLs that are useful, canonical, crawlable, and intended for search discovery. Pages that are private, duplicated, temporary, redirected, blocked, or marked noindex should usually be excluded.

| Type of Page | Why It Should Usually Be Excluded | Better Handling |
| --- | --- | --- |
| Internal search pages | Often create thin or duplicate result pages | Exclude from sitemap; consider robots.txt controls if crawl waste is high |
| Checkout and account pages | Private, transactional, or not useful as search landing pages | Exclude from sitemap; protect private areas properly |
| Parameter pages | Can create duplicate or near-duplicate URL variations | Exclude from sitemap; use canonicals or crawl controls where needed |
| Temporary campaign pages | Often short-term, ad-only, or not intended for organic search | Exclude unless they are evergreen and indexable |
| Admin or backend URLs | Not public search content | Exclude and protect with authentication |
| Print pages | Usually duplicate layout versions of existing pages | Exclude and canonicalize if needed |
| Redirected URLs | No longer the final destination | Include only the final canonical URL |
| Noindex pages | Explicitly not meant for search results | Exclude from sitemap |

This keeps the sitemap aligned with the URLs the website actually wants search engines to discover and evaluate.
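Several of these exclusions can be approximated with URL pattern rules applied before the sitemap is written. The sketch below is illustrative only: every path pattern in `EXCLUSION_RULES` is a hypothetical assumption and would need to match the site's real URL structure, and `exclusion_reason` is an assumed helper name. Redirected and noindex pages cannot be detected from the URL alone and need page-level data.

```python
import re
from urllib.parse import urlsplit

# Hypothetical path patterns; adjust to the site's actual URL structure.
EXCLUSION_RULES = [
    ("internal search", re.compile(r"^/search(/|$)")),
    ("checkout or account", re.compile(r"^/(checkout|cart|account)(/|$)")),
    ("admin or backend", re.compile(r"^/admin(/|$)")),
    ("print version", re.compile(r"^/print(/|$)")),
]


def exclusion_reason(url):
    """Return why a URL should be left out of the sitemap, or None to keep it."""
    parts = urlsplit(url)
    if parts.query:
        return "parameter URL"
    for reason, pattern in EXCLUSION_RULES:
        if pattern.search(parts.path):
            return reason
    return None


for url in [
    "https://www.example.com/about",
    "https://www.example.com/about?source=nav",
    "https://www.example.com/search/hotels",
    "https://www.example.com/checkout",
]:
    print(url, "->", exclusion_reason(url) or "keep")
```

Returning a reason rather than a boolean makes the output usable as an audit report for an existing sitemap.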

Common Mistakes to Avoid

A sitemap.xml is simple in concept, but it often becomes messy when websites scale or when teams add URLs without a clear rule.

A good sitemap should be boring and consistent. It should reflect the real canonical structure of the website.

Why Sitemaps Matter for SEO

A sitemap is not a direct ranking factor, but it supports SEO by improving discovery, crawl efficiency, and sitemap-level diagnostics.

It is especially useful for:

  • Large websites with thousands of pages, such as ecommerce, media, or travel websites
  • New websites with limited backlinks and weak external discovery signals
  • Websites with complex navigation or deep content structures
  • Websites with frequently updated content, such as blogs, publications, or inventory-driven platforms
  • Websites with rich media that may need image or video sitemap support

For example, a hotel group website may have pages for properties, rooms, offers, restaurants, experiences, destinations, blog posts, and booking flows. A sitemap can help search engines discover the important public URLs across that structure.

A clean sitemap does not fix poor architecture, but it supports it. The strongest setup is still a combination of clear internal linking, crawlable navigation, canonical URLs, strong content, and an accurate sitemap.

Summary

A sitemap.xml is a structured file that helps search engines discover the important URLs of a website. It supports crawling and indexing by giving search engines a clear list of canonical, indexable pages.

It does not guarantee indexing, and it does not replace strong internal linking or clean site architecture. Its role is to support discovery, especially when a website is large, new, frequently updated, or structurally complex.

The best sitemaps are accurate, current, and intentionally limited to URLs that matter. They include canonical URLs, use accurate <lastmod> values, exclude low-value or non-indexable pages, and are submitted through search tools or referenced in robots.txt.
