How to merge two sitemap.xml files

January 26, 2013

Since flipping the switch on the Jekyll transition a couple of weeks ago, one of the to-dos that has persisted is what to do about the fact that I now have two sitemap.xml files to contend with: the first from the regular blog, and the second from the photoblog. These obviously needed to be merged, and last night I whipped up the following Ruby code (explained below) to do just that.

desc "Merge two sitemap files"
task :merge do

    first_path  = "/path/to/first/sitemap"
    second_path = "/path/to/second/sitemap"
    merged_path = "/path/to/merged/sitemap"

    # Read the first sitemap; keep its header (first 12 lines) and footer (last line).
    content1 = File.readlines( first_path )
    header = content1.slice!( 0..11 )
    footer = [ content1.slice!( -1 ) ]
    File.delete( first_path )

    # Read the second sitemap and discard its header and footer.
    content2 = File.readlines( second_path )
    content2.slice!( 0..11 )
    content2.slice!( -1 )

    # Write the header, both sets of URLs, and the footer to the merged file.
    File.open( merged_path, 'w' ) do |f3|
        f3.write( ( header + content1 + content2 + footer ).join )
    end

    puts "Sitemaps have been merged!"

end

Yeah, you could accomplish this in a million different ways using any number of languages, but I decided to go with Ruby (and Rake tasks) to keep with the Jekyll theme, and because I know nearly nothing about the language and thought it’d be fun. That said, if the above can be made any more efficient or elegant, please let me know.

As you can see, the code is short, mainly because sitemap.xml files have a simple, predictable structure, which makes it easy to grab what we need from each.

This was written with two conditions in mind: 1) We have two sitemap.xml files being generated by two separate systems (be it Jekyll or whatever); and 2) The first sitemap.xml corresponds to a blog that is updated more frequently than the second.

The first thing we do is read in the sitemap.xml file of the blog that is updated most frequently. (You’ll need to change /path/ to correspond to the location of this file. If you’re using Jekyll, this file will be wherever you tell Jekyll to write your generated files, likely .../_site/.) The code stores each line in the content1 array, and then peels from that array information corresponding to the “header” and “footer” of the sitemap.xml file, which are stored in the header and footer arrays, respectively.

The “header” stuff is the XML declaration and the opening urlset tag required for sitemap.xml files. You’ll notice that the “header” in my code actually is 12 lines long; that’s because I’m also including the initial, root URL of the sitemap as part of the “header”:

<url>
    <loc>http://hypertext.net</loc>
    <lastmod>2013-01-26</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
</url>

The above is found in both sitemap.xml files, but I only want it to exist once in the merged file, and so I’ve decided to grab it from the first file.

The “footer” contains just the closing urlset tag.
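
To make the header/footer split concrete, here is a small sketch of how slice! carves up an array of lines. The sample sitemap is made up and shorter than a real one (the second URL is hypothetical), so its header is 8 lines rather than the 12 mentioned above; header_length is a name introduced just for this illustration and would need adjusting for your own file.

```ruby
# Hypothetical, shortened sitemap; a real header may span more lines.
lines = <<~XML.lines
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
      <loc>http://hypertext.net</loc>
      <lastmod>2013-01-26</lastmod>
      <changefreq>daily</changefreq>
      <priority>1.0</priority>
  </url>
  <url>
      <loc>http://hypertext.net/example-post/</loc>
  </url>
  </urlset>
XML

# XML declaration + urlset tag + the 6-line root <url> block.
header_length = 8

header = lines.slice!( 0...header_length )  # removes and returns the first 8 lines
footer = [ lines.slice!( -1 ) ]             # removes and returns the closing </urlset> line

# What is left in lines is only the remaining <url> entry.
puts lines.join
```

Because slice! modifies the array in place, whatever remains after the two calls is exactly the part that gets concatenated into the merged file.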

Once the information from the first sitemap.xml file has been retained, we want to delete this file, because we don’t want two sitemap files (this first one and the merged one we create later) to be uploaded to our server when next we do a push/sync of the site. (Granted, web crawlers are going to read only the one you point them to, but why waste the time/bandwidth required to upload the unused file? In my case, it’d be an extra 700KB each time I pushed the site.)

The code next acts similarly on the second sitemap.xml file. (Again, you’ll want to change /path/ to point to this file.) We again remove from the content array (content2 this time) the “header” and “footer” information, though we don’t store these anywhere as we already have them from the first sitemap.xml file. Unlike the first sitemap file, we don’t want to delete this one: it’s likely that we’re going to update our first blog again before we update the second one, and we want the second sitemap file to be there this next time around; otherwise the merged sitemap won’t contain the second blog at all.

Finally, the code concatenates the information we’ve gathered, namely the header, all of the content from the first and second sitemap files, and the footer, and writes this to the new sitemap file that we specify (i.e., the one we’re going to want web crawlers to use). (If using Jekyll, this likely will be .../_site/sitemap.xml.)
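
If counting header lines feels fragile, one possible variation (a sketch, not what the task above does) is to find the split points by searching for tags instead. It assumes each tag sits on its own line and that the shared root url block is the first entry in both files; split_sitemap and merge_sitemaps are hypothetical helper names.

```ruby
# Split sitemap lines into [header, body, footer] without a hardcoded line count.
# Assumes one tag per line and that the root <url> block comes first.
def split_sitemap( lines )
  first_close = lines.index  { |l| l.include?( "</url>" ) }     # end of the root <url> block
  last_close  = lines.rindex { |l| l.include?( "</urlset>" ) }  # closing urlset tag
  raise ArgumentError, "not a sitemap" if first_close.nil? || last_close.nil?

  header = lines[ 0..first_close ]
  body   = lines[ (first_close + 1)...last_close ]
  footer = lines[ last_close..-1 ]
  [ header, body, footer ]
end

# Keep the header and footer of the first file, and the bodies of both.
def merge_sitemaps( first_lines, second_lines )
  header, body1, footer = split_sitemap( first_lines )
  _, body2, _           = split_sitemap( second_lines )
  ( header + body1 + body2 + footer ).join
end

# Possible usage (paths are placeholders, as in the task above):
#   merged = merge_sitemaps( File.readlines( "/path/to/first/sitemap" ),
#                            File.readlines( "/path/to/second/sitemap" ) )
#   File.write( "/path/to/merged/sitemap", merged )
```

The trade-off is that this depends on the layout of the generated XML rather than on a line count, so it would still need adjusting if a generator puts several tags on one line.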

When to run?

You’ll want to run this after each build of your more frequently updated blog. It’ll grab the current first sitemap.xml file (from the first blog), delete it, grab the current second sitemap.xml (from the second blog), and then write the combination to the final sitemap.xml you want to get pushed to the server when next you push/sync your site.

Obviously, if there’s a lot of lag between updating your second blog and then updating your first blog, the sitemap.xml file that exists on the server could be slightly outdated (i.e., it won’t contain the stuff recently added to the second blog), but this really isn’t a big deal, and will resolve itself when next you update the first blog.
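
One way to make sure the merge always follows a build is to declare the build as a Rake prerequisite. This is only a sketch: the build task and its command are assumptions (substitute whatever builds your site), and the merge body is abbreviated to its final line.

```ruby
require 'rake'   # not needed inside a Rakefile; included so the sketch runs standalone

desc "Build the more frequently updated blog"
task :build do
  sh "jekyll build"   # assumption: replace with the command that builds your site
end

desc "Merge two sitemap files"
task :merge => :build do
  # The merging code from the task shown earlier would go here.
  puts "Sitemaps have been merged!"
end
```

With this in place, running rake merge would run the build first and then the merge, so the two steps can’t get out of order.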
