WordPress: Hack “Google XML Sitemap” plugin to solve too many URLs in one file


Google XML Sitemap is a lightweight plugin that automatically generates a sitemap for your WordPress site.

The generated sitemap can be organized in several ways, for example by post type and grouped by month. However, if a specific post type has too many posts in one month, the sitemap file for that month becomes very large.

For one thing, large files are unfriendly for search engines to crawl. More importantly, a search engine only processes up to 50,000 URLs in a single sitemap file, so any URLs beyond the first 50,000 are ignored. If you use Google Search Console, an error is reported in the Sitemaps section.

By default, Google XML Sitemap does not support limiting the number of URLs in one sitemap file. You could switch to another plugin that offers such a setting. However, as I mentioned at the beginning of the article, Google XML Sitemap is a lightweight tool that runs fast and uses very few resources.

So I prefer to hack the plugin code to meet my requirements.

The hacked file is sitemap-builder.php under the plugin folder.

The function Index() is responsible for generating the top-level file (http://domain/sitemap.xml).

Modify the code according to your requirement.

For example, I want to split the monthly XML file of a specific post type into three. Posts from the 1st to the 10th day of the month go into one sitemap file, posts from the 11th to the 20th go into a second, and posts from the 21st to the last day go into a third.
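The day ranges above can be sketched as a small lookup. This is only an illustration: sitemapPartForDay() is a hypothetical helper of mine, not something the plugin provides.

```php
<?php
// Hypothetical helper (not part of the plugin): map a day of the month
// to one of the three sitemap parts described above.
function sitemapPartForDay(int $day): int
{
    if ($day < 1 || $day > 31) {
        throw new InvalidArgumentException("Day out of range: $day");
    }
    if ($day <= 10) {
        return 1; // days 1-10
    }
    if ($day <= 20) {
        return 2; // days 11-20
    }
    return 3;     // days 21 to the end of the month
}

// Show the part for the boundary days of each range.
foreach ([1, 10, 11, 20, 21, 31] as $day) {
    echo $day, ' => part ', sitemapPartForDay($day), PHP_EOL;
}
```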

Here I add two more files for the specific post type (by default there is already one). Notice that the month field is hacked: I add 20 or 40 to it, and this encoded value is used later to tell the files apart.

foreach($posts as $post) {
  if($postType=="xxx") {
    // Extra files for the hacked post type: month+20 marks days 11-20, month+40 marks days 21-31
    $gsg->AddSitemap("pt", $postType . "-" . sprintf("%04d-%02d", $post->year, $post->month+20), $gsg->GetTimestampFromMySql($post->last_mod));
    $gsg->AddSitemap("pt", $postType . "-" . sprintf("%04d-%02d", $post->year, $post->month+40), $gsg->GetTimestampFromMySql($post->last_mod));
  }
  // Default file, added for every post type (days 1-10 for the hacked type)
  $gsg->AddSitemap("pt", $postType . "-" . sprintf("%04d-%02d", $post->year, $post->month), $gsg->GetTimestampFromMySql($post->last_mod));
}
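To make the encoding concrete, here is a minimal standalone sketch of the names this loop produces for one month. sitemapNamesForMonth() is a hypothetical helper of mine that only mirrors the sprintf() call above; the year and month passed in are example values.

```php
<?php
// Hypothetical helper (not part of the plugin): build the three sitemap
// names for one month, using the same "%04d-%02d" format and the same
// +20 / +40 month offsets as the hacked loop.
function sitemapNamesForMonth(string $postType, int $year, int $month): array
{
    $names = [];
    foreach ([20, 40, 0] as $offset) {
        $names[] = $postType . '-' . sprintf('%04d-%02d', $year, $month + $offset);
    }
    return $names;
}

// Example values only: post type "xxx", March 2020.
print_r(sitemapNamesForMonth('xxx', 2020, 3));
```

Because real months never exceed 12, the encoded values 21 to 32 and 41 to 52 cannot collide with a normal month number.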

The function BuildPosts() processes each sitemap file added by the function above. First, I take the hacked month field and recover the real $month together with the day range.

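if($postType == "xxx") {
  if($month >= 21 && $month <= 32) {
    // Encoded as month+20: second file, days 11-20
    $month -= 20;
    $dayFirst = 11;
    $dayLast = 20;
  } elseif ($month >= 41 && $month <= 52) {
    // Encoded as month+40: third file, days 21-31
    $month -= 40;
    $dayFirst = 21;
    $dayLast = 31;
  } else {
    // Plain month: first file, days 1-10
    $dayFirst = 1;
    $dayLast = 10;
  }
}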
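The same decoding can be tried outside the plugin with a small sketch. decodeSitemapMonth() is a hypothetical helper of mine that follows the branches above and returns the real month with its day range.

```php
<?php
// Hypothetical helper (not part of the plugin): turn the encoded month
// back into [month, dayFirst, dayLast], following the branches above.
function decodeSitemapMonth(int $encoded): array
{
    if ($encoded >= 21 && $encoded <= 32) {
        return [$encoded - 20, 11, 20]; // second file
    }
    if ($encoded >= 41 && $encoded <= 52) {
        return [$encoded - 40, 21, 31]; // third file
    }
    return [$encoded, 1, 10];           // first file
}

// Decode the three encoded values that belong to March.
foreach ([3, 23, 43] as $encoded) {
    [$month, $dayFirst, $dayLast] = decodeSitemapMonth($encoded);
    echo "$encoded => month $month, days $dayFirst-$dayLast", PHP_EOL;
}
```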

Then, add two more conditions to the SQL query. The default value of dayFirst is 1 and of dayLast is 31, so other post types are not affected.

AND DAY(p.post_date_gmt) >= dayFirst
AND DAY(p.post_date_gmt) <= dayLast
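One way to build these two conditions is sketched below. dayRangeCondition() is a hypothetical helper of mine, and how the fragment gets appended to the plugin's actual query is left as an assumption. The %d placeholders force both values to integers before they reach the SQL string.

```php
<?php
// Hypothetical helper (not part of the plugin): build the extra WHERE
// conditions. The defaults 1 and 31 cover the whole month, so post types
// that are not hacked keep their original result set.
function dayRangeCondition(int $dayFirst = 1, int $dayLast = 31): string
{
    return sprintf(
        ' AND DAY(p.post_date_gmt) >= %d AND DAY(p.post_date_gmt) <= %d',
        $dayFirst,
        $dayLast
    );
}

// Fragment for the second file (days 11-20).
echo dayRangeCondition(11, 20), PHP_EOL;
```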

Congratulations! This is all you need to modify.

Refresh the top-level sitemap and you will see that, for each month, the sitemap of the hacked post type is divided into 3 files. The 1st file contains posts from the 1st to the 10th, the 2nd contains posts from the 11th to the 20th, and the 3rd contains posts from the 21st to the end of the month.

You can modify according to your requirements.