SEO site analysis for duplicates: how to parse sitemap.xml identifies hidden problems

Websites

30.03.2026

Content

Introduction
What is it sitemap.xml and why is it needed
What duplicates are found on websites
Problems with hreflang on multilingual websites
Broken pages in the site map
How sitemap parsing helps you find problems
Typical reasons for duplicates
Practical SEO Audit Checklist
Conclusion
Full listing of the audit team

Introduction

Duplicates on a website are one of the most insidious SEO problems. They do not cause errors in the browser, do not break the layout and do not interfere with users. But for search engines, duplicate content is a serious signal of poor resource quality. Google may lower the position of the site, if you find the same titles, descriptions, URLS or repetitive in its structure.

sitemap.xmlautomatically

What is it sitemap.xml and why is it needed

lastmodchangefreqpriority

Search robots use this file as a map to crawl the site. If the page is in the sitemap, the robot will visit it first. If there is no page, it may be indexed with a delay or not indexed at all.

That is why the quality of the site map directly affects SEO indicators.:

Indexing speed
Completeness of coverage
A signal of trust

But if the map contains errors — duplicate URLs, broken links, incorrect annotations — the effect is reversed. The robot spends its "crawling budget" crawling the same pages, and the site receives penalty signals.

What duplicates are found on websites

When auditing a sitemap, there are three main types of duplicates that need to be checked. Each of them affects SEO in different ways, but they all reduce the effectiveness of indexing.

Duplicates of the URL in the site map itself

/services/contacts

The search robot will process both occurrences, but it will spend twice as much resources. On large sites with thousands of pages, this leads to a significant loss of the crowding budget.

<title>

<title><title>

A typical example: by default, CMS substitutes the site name in all headings, and the developer forgets to set a unique title for each page. As a result, dozens of pages receive a title like "Company name — Home".

<meta description>

click-through rate (CTR)

A common situation is that the description is set by default at the template level, and if the author has not filled in the field manually, the page receives a general description of the site. During the audit, such duplicates are easy to detect — several pages will have the same description value.

Problems with hreflang on multilingual websites

hreflang

The three most common problems with hreflang are:

No backlink.I must
Lack of self-reference.
A link to a non-existent page.

It is almost impossible to detect these problems manually on a site with hundreds of pages. Automatic sitemap parsing with hreflang link verification is the only reliable way to keep a multilingual website in good condition.

Broken pages in the site map

200 OK

When the search robot finds a link to a non-existent page in the site map, it interprets this as a sign of poor quality of service for the resource. Single errors will not lead to disaster, but the systematic presence of broken links in the sitemap reduces the trust of the search engine in the site as a whole.

Checking the HTTP status of all URLs from the site map allows you to:

Identify deleted pages that you forgot to remove from the sitemap
Detect "silent" server errors that are not visible to ordinary visitors
Find redirection chains that slow down indexing

How sitemap parsing helps you find problems

Manual site map verification is ineffective even for small projects. A website with a blog, a catalog of services and a portfolio easily reaches 100-200 pages, and with a multilingual version — twice as many. Automatic parsing solves this problem in seconds.

The algorithm of auditing through parsing sitemap.xml It includes several stages:

Loading and parsing XML.
Checking for duplicate URLs.
hreflang validation.
Page crawling.
Extraction of meta data.
Grouping and searching for duplicates.

The result of the parsing is a structured report that shows the exact number of issues in each category and specific URLs that require attention.

Typical reasons for duplicates

Understanding the causes helps not only to correct, but also to prevent duplicates. Here are the most common situations:

Mixing static and dynamic data.
Template meta tags.
Massive database updates.
Incorrect hreflang generation.
Forgotten deleted pages.

Practical SEO Audit Checklist

once a weekonce a month

What to check during each audit:

URL Duplicates
HTTP statuses
Uniqueness of <title>
Uniqueness of the <meta description>
The correctness of hreflang
Relevance of lastmod
Completeness of the map

Automating these checks is the optimal solution. A script running on a schedule will detect the problem on the same day it appeared, rather than a month later during manual verification.

Conclusion

system process

Automatic site map parsing solves three key tasks:

Speed
Completeness
Regularity

Investing in SEO audit automation pays off multiple times: you spend time setting it up once, and the system works for you all the time, protecting the site's position and ensuring high-quality indexing of all pages.

Full listing of the Laravel sitemap audit team

seo:audit-sitemap

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Http\Client\Pool;
use Illuminate\Http\Client\Response;
use Illuminate\Support\Collection;
use Illuminate\Support\Facades\Http;

class AuditSitemap extends Command
{
    protected $signature = 'seo:audit-sitemap
        {url : URL of the sitemap.xml to parse}
        {--timeout=15 : HTTP request timeout in seconds per page}
        {--concurrency=5 : Max concurrent HTTP requests when fetching pages}';

    protected $description = 'Parse sitemap.xml and detect duplicate URLs, <title>, <meta description>, broken pages and hreflang issues';

    private const XHTML_NS = 'http://www.w3.org/1999/xhtml';

    private Collection $brokenPages;

    public function handle(): int
    {
        $this->brokenPages = collect();

        $sitemapUrl = $this->argument('url');
        $timeout = (int) $this->option('timeout');

        $this->info("Fetching sitemap: {$sitemapUrl}");

        $entries = $this->parseSitemap($sitemapUrl);

        if ($entries === null) {
            $this->error('Failed to fetch or parse the sitemap.');

            return self::FAILURE;
        }

        $locs = $entries->pluck('loc');

        $this->info("Found {$locs->count()} URL entries in the sitemap.");
        $this->newLine();

        $duplicateUrls = $locs->countBy()->filter(fn (int $count) => $count > 1);
        $this->reportDuplicateUrls($duplicateUrls);

        $hreflangIssues = $this->validateHreflang($entries);
        $this->reportHreflangIssues($hreflangIssues);

        $uniqueUrls = $locs->unique()->values();
        $this->info("Fetching {$uniqueUrls->count()} unique pages...");

        $pages = $this->fetchPages($uniqueUrls, $timeout);
        $this->newLine();

        $this->reportBrokenPages();

        $duplicateTitles = $this->findDuplicates($pages, 'title');
        $this->reportDuplicates($duplicateTitles, 'title');

        $duplicateDescriptions = $this->findDuplicates($pages, 'description');
        $this->reportDuplicates($duplicateDescriptions, 'meta description');

        $this->printSummary($duplicateUrls, $duplicateTitles, $duplicateDescriptions, $hreflangIssues);

        $hasIssues = $duplicateUrls->isNotEmpty()
            || $duplicateTitles->isNotEmpty()
            || $duplicateDescriptions->isNotEmpty()
            || $this->brokenPages->isNotEmpty()
            || $hreflangIssues->isNotEmpty();

        return $hasIssues ? self::FAILURE : self::SUCCESS;
    }

    private function parseSitemap(string $url): ?Collection
    {
        try {
            $response = Http::timeout(15)->get($url);
        } catch (\Throwable $e) {
            $this->error("HTTP error: {$e->getMessage()}");

            return null;
        }

        if ($response->failed()) {
            return null;
        }

        $previousUseErrors = libxml_use_internal_errors(true);
        $xml = simplexml_load_string($response->body());
        libxml_use_internal_errors($previousUseErrors);

        if ($xml === false) {
            return null;
        }

        $nodes = [];
        foreach ($xml->url as $node) {
            $nodes[] = $node;
        }

        return collect($nodes)
            ->map(fn (\SimpleXMLElement $node) => [
                'loc' => trim((string) $node->loc),
                'hreflang' => $this->parseHreflangLinks($node),
            ])
            ->filter(fn (array $entry) => $entry['loc'] !== '')
            ->values();
    }

    private function parseHreflangLinks(\SimpleXMLElement $node): array
    {
        $links = [];
        foreach ($node->children(self::XHTML_NS)->link as $link) {
            $links[] = $link;
        }

        return collect($links)
            ->mapWithKeys(function (\SimpleXMLElement $link) {
                $attrs = $link->attributes();

                return [(string) ($attrs['hreflang'] ?? '') => trim((string) ($attrs['href'] ?? ''))];
            })
            ->filter(fn (string $href, string $lang) => $lang !== '' && $href !== '')
            ->all();
    }

    private function validateHreflang(Collection $entries): Collection
    {
        $locSet = $entries->pluck('loc')->flip();

        $hreflangMap = $entries
            ->filter(fn (array $e) => ! empty($e['hreflang']))
            ->pluck('hreflang', 'loc');

        return $hreflangMap
            ->flatMap(fn (array $alternates, string $loc) => $this->checkAlternates($loc, $alternates, $locSet, $hreflangMap))
            ->values();
    }

    private function checkAlternates(string $loc, array $alternates, Collection $locSet, Collection $hreflangMap): array
    {
        $issues = collect($alternates)
            ->flatMap(fn (string $href, string $lang) => $this->checkSingleAlternate($loc, $lang, $href, $locSet, $hreflangMap))
            ->all();

        $hasSelfRef = in_array($loc, $alternates, true);

        if (! $hasSelfRef) {
            $issues[] = [
                'type' => 'missing_self_ref',
                'loc' => $loc,
                'detail' => 'Missing self-referencing hreflang annotation',
            ];
        }

        return $issues;
    }

    private function checkSingleAlternate(string $loc, string $lang, string $href, Collection $locSet, Collection $hreflangMap): array
    {
        $issues = [];

        if (! $locSet->has($href)) {
            $issues[] = [
                'type' => 'missing_in_sitemap',
                'loc' => $loc,
                'detail' => "hreflang=\"{$lang}\" points to {$href} which is NOT in the sitemap",
            ];

            return $issues;
        }

        if ($href === $loc) {
            return $issues;
        }

        if (! $hreflangMap->has($href)) {
            $issues[] = [
                'type' => 'missing_reciprocal',
                'loc' => $loc,
                'detail' => "hreflang=\"{$lang}\" points to {$href}, but that page has no hreflang annotations",
            ];

            return $issues;
        }

        if (! in_array($loc, $hreflangMap->get($href), true)) {
            $issues[] = [
                'type' => 'missing_reciprocal',
                'loc' => $loc,
                'detail' => "hreflang=\"{$lang}\" points to {$href}, but that page does not link back",
            ];
        }

        return $issues;
    }

    private function fetchPages(Collection $urls, int $timeout): Collection
    {
        $concurrency = max(1, (int) $this->option('concurrency'));
        $bar = $this->output->createProgressBar($urls->count());
        $bar->start();

        $pages = $urls->chunk($concurrency)
            ->flatMap(function (Collection $chunk) use ($timeout, $bar) {
                $chunkUrls = $chunk->values()->all();

                $responses = Http::pool(fn (Pool $pool) => collect($chunkUrls)
                    ->each(fn (string $pageUrl) => $pool->as($pageUrl)->timeout($timeout)->get($pageUrl))
                );

                return collect($chunkUrls)
                    ->mapWithKeys(function (string $pageUrl) use ($responses, $bar) {
                        $bar->advance();

                        return [$pageUrl => $this->processResponse($pageUrl, $responses[$pageUrl] ?? null)];
                    })
                    ->filter();
            });

        $bar->finish();
        $this->newLine();

        return $pages;
    }

    private function processResponse(string $url, mixed $response): ?array
    {
        if ($response === null || $response instanceof \Throwable) {
            $this->brokenPages[$url] = $response instanceof \Throwable ? $response->getMessage() : 'No response';

            return null;
        }

        if ($response->status() !== 200) {
            $this->brokenPages[$url] = $response->status();
        }

        if ($response->failed()) {
            return null;
        }

        $html = $response->body();

        return [
            'title' => $this->extractTitle($html),
            'description' => $this->extractMetaDescription($html),
        ];
    }

    private function extractTitle(string $html): ?string
    {
        if (! preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $matches)) {
            return null;
        }

        $title = trim(html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));

        return $title !== '' ? $title : null;
    }

    private function extractMetaDescription(string $html): ?string
    {
        $patterns = [
            '/<meta\s[^>]*name\s*=\s*["\']description["\']\s[^>]*content\s*=\s*["\']([^"\']*?)["\']/si',
            '/<meta\s[^>]*content\s*=\s*["\']([^"\']*?)["\']\s[^>]*name\s*=\s*["\']description["\']/si',
        ];

        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $html, $m)) {
                $desc = trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));

                return $desc !== '' ? $desc : null;
            }
        }

        return null;
    }

    private function findDuplicates(Collection $pages, string $field): Collection
    {
        return $pages
            ->filter(fn (array $meta) => ($meta[$field] ?? null) !== null)
            ->groupBy(fn (array $meta) => $meta[$field], preserveKeys: true)
            ->filter(fn (Collection $group) => $group->count() > 1)
            ->map(fn (Collection $group) => $group->keys());
    }

    private function reportBrokenPages(): void
    {
        if ($this->brokenPages->isEmpty()) {
            $this->info('All pages returned HTTP 200.');
            $this->newLine();

            return;
        }

        $this->error("Found {$this->brokenPages->count()} page(s) with non-200 HTTP status:");

        $rows = $this->brokenPages->map(fn ($status, $url) => [$url, $status])->values()->all();
        $this->table(['URL', 'Status'], $rows);
        $this->newLine();
    }

    private function reportHreflangIssues(Collection $issues): void
    {
        if ($issues->isEmpty()) {
            $this->info('All hreflang annotations are valid.');
            $this->newLine();

            return;
        }

        $labels = [
            'missing_in_sitemap' => 'Alternate URL not found in sitemap',
            'missing_reciprocal' => 'Missing reciprocal hreflang link',
            'missing_self_ref' => 'Missing self-referencing hreflang',
        ];

        $this->error("Found {$issues->count()} hreflang issue(s):");

        $issues->groupBy('type')->each(function (Collection $group, string $type) use ($labels) {
            $this->newLine();
            $this->warn('  '.($labels[$type] ?? $type).':');

            $group->each(function (array $issue) {
                $this->line("    {$issue['loc']}");
                $this->line("      {$issue['detail']}");
            });
        });

        $this->newLine();
    }

    private function reportDuplicateUrls(Collection $duplicates): void
    {
        if ($duplicates->isEmpty()) {
            $this->info('No duplicate URLs found in the sitemap.');
            $this->newLine();

            return;
        }

        $this->error("Found {$duplicates->count()} duplicate URL(s) in the sitemap:");

        $rows = $duplicates->map(fn (int $count, string $url) => [$url, $count])->values()->all();
        $this->table(['URL', 'Occurrences'], $rows);
        $this->newLine();
    }

    private function reportDuplicates(Collection $duplicates, string $label): void
    {
        if ($duplicates->isEmpty()) {
            $this->info("No duplicate <{$label}> values found.");
            $this->newLine();

            return;
        }

        $this->error("Found {$duplicates->count()} duplicate <{$label}> value(s):");

        $duplicates->each(function (Collection $urls, string $value) {
            $truncated = mb_strlen($value) > 80 ? mb_substr($value, 0, 80) : $value;
            $this->warn("  \"{$truncated}\":");

            $urls->each(fn (string $url) => $this->line("    {$url}"));
        });

        $this->newLine();
    }

    private function printSummary(
        Collection $duplicateUrls,
        Collection $duplicateTitles,
        Collection $duplicateDescriptions,
        Collection $hreflangIssues,
    ): void {
        $this->newLine();
        $this->info('Audit Summary');

        $summaryLine = fn (string $label, Collection $items) => sprintf(
            '  %-30s %d',
            $label,
            $items->count(),
        );

        $this->line($summaryLine('Non-200 pages:', $this->brokenPages));
        $this->line($summaryLine('Duplicate URLs:', $duplicateUrls));
        $this->line($summaryLine('Hreflang issues:', $hreflangIssues));
        $this->line($summaryLine('Duplicate title:', $duplicateTitles));
        $this->line($summaryLine('Duplicate meta description:', $duplicateDescriptions));
    }
}

Автор SoftRest