LiteSpeed Cache Crawler Blacklist
By default , LSCWP's built-in crawler will add a URI to the blacklist if the following conditions are met:
- The page is not cacheable by design or by default. In other words, any pages that send the response header
x-litespeed-cache-control: no-cache
- The page doesn't respond with the following headers:
HTTP/1.1 200 OK HTTP/1.1 201 Created HTTP/2 200 HTTP/2 201
Knowing these conditions can help us to troubleshoot pages that are unexpectedly blacklisted.
Problem
Particular pages are being added to the blacklist after the first crawling, but when you check manually (through the browser or through curl) you see the x-litespeed-cache
header and 200 OK
status code. So, why are the URIs ending up in the blacklist?
Investigation
Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class.
Open the following file: litespeed-cache/lib/litespeed/litespeed-crawler.class.php
Add the following code at line 273, to allow us to log more information:
LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ;
Now, when the crawler processes a URI, the $headers
will be written to the debug log.
Run the crawler manually, and check grep headers /path/to/wordpress/wp-content/debug.log
. You should see something like this:
So here is the problem: most of the logs show the header is HTTP/1.1 200 OK
but a few of them are empty. It's the empty ones that are being added to the blacklist.
But why, if you manually run a curl, it looks fine?
[root@test ~]# curl -I -XGET https://example.com/product-name-1 HTTP/1.1 200 OK Date: Thu, 11 Jul 2019 20:57:54 GMT Content-Type: text/html; charset=UTF-8 Transfer-Encoding: chunked Connection: keep-alive Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly Cf-Railgun: direct (starting new WAN connection) Link: <https://example.com/wp-json/>; rel="https://api.w.org/" Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly Vary: Accept-Encoding X-Litespeed-Cache: miss X-Litespeed-Cache-Control: public,max-age=604800 X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_ X-Powered-By: PHP/7.3.6 Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct" Server: cloudflare CF-RAY: 5f5db4fd1c234c56-AMS
This URI returns 200
, and x-litespeed-cache-control: public
, so why is the header empty in the previous debugging process?
To figure it out, we can mimic the exact options the PHP curl used, and see what's going on.
To add another debug log code to grab the curl options the crawler used, add following code into litespeed-cache/lib/litespeed/litespeed-crawler.class.php
at line 627, directly before return $options ;
, like so:
$options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ; LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ; return $options ;
Now, manually crawl it again to get the all the options.
07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{ "19913":true, "42":true, "10036":"GET", "52":false, "10102":"gzip", "78":10, "13":10, "81":0, "64":false, "44":false, "10023":["Cache-Control: max-age=0","Host: example.com"], "84":2, "10018":"lscache_runner ", "10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516","10022":"litespeed_hash=qwert" }'
The numbers you see are PHP curlset reference code. An internet search reveals that the 78
and 13
are particularly interesting. They represent curl connection timeout
and curl timeout
.
Let's apply these options to our curl command.
[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1 curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received
So this confirms a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.
Let's do one more test to confirm it:
[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner https://example.com/product-name-1/ Establish Connection: 0.006s TTFB: 16.455s Total: 16.462s
So yes. The page without cache takes more than 16 seconds to load, which results in a curl timeout. That is the reason why the debug log shows an empty header, the 200
status is never received by the crawler, and the URL is blacklisted.
Solution
We need to determine where the timeout is set, and increase it. Use grep
:
[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php" includes/litespeed-cache-crawler.class.php: $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ; inc/crawler.class.php: $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ; lib/litespeed/litespeed-crawler.class.php: CURLOPT_CONNECTTIMEOUT => 10, lib/litespeed/litespeed-crawler.class.php: CURLOPT_TIMEOUT => 10,
The last result, shows curl timeout
is defined there. Open litespeed-cache/lib/litespeed/litespeed-crawler.class.php
and somewhere around lines 561-572, raise the timeout from 10
to something higher, like 30
.
CURLOPT_RETURNTRANSFER => true, CURLOPT_HEADER => true, CURLOPT_CUSTOMREQUEST => 'GET', CURLOPT_FOLLOWLOCATION => false, CURLOPT_ENCODING => 'gzip', CURLOPT_CONNECTTIMEOUT => 10, CURLOPT_TIMEOUT => 10, CURLOPT_SSL_VERIFYHOST => 0, CURLOPT_SSL_VERIFYPEER => false, CURLOPT_NOBODY => false, CURLOPT_HTTPHEADER => $this->_curl_headers,
Crawl manually again, and you will see that all of the previously blacklisted URIs are no longer being added to the blacklist.
NOTE: This is a temporarily solution, manually editing the code to raise the timeout. LiteSpeed Cache's crawler timeout default will be changed to 30
seconds in LSCWP 2.9.8.4 , and will be made configurable in LSCWP 3.0 in case you need an even longer time.