Intermittent 401 errors when Google crawls a public site
Hi all,
We have a public site where all pages are available to the guest user profile. When accessing the site through a browser, we've never seen an authorization required page. However, when Googlebot attempts to crawl the site we get intermittent 401 Authorization Required responses.
I've tried hitting the site from the seoconsultants check-headers tool, and I get intermittent 401 and 302 responses. The 302 I've just received says the page has moved to cust_maint/site_down/maintenance.html. Web analyzer gives the same results, yet Fiddler shows that the browsers are receiving nothing but 200 responses.
Has anyone else seen behaviour like this? It's proving difficult to track down, as every time I raise a case with support they close it, saying they don't support the Google client. That's not the issue here, though - I need to understand why the Salesforce server is returning the responses that it is.
So I finally got to the bottom of this and I'm posting to hopefully save others some pain.
It turned out to be a bug in my code that determines which browser the user is accessing the site with by processing the User-Agent header.
Googlebot doesn't send a User-Agent header when crawling, so I ended up causing a null pointer exception. This appears to cause Salesforce to carry out a server-side redirect to a standard platform error page, which requires a user login. Unfortunately, from the client's perspective it just looks like the page you tried to access required authentication.
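For anyone hitting the same thing, here's a sketch in Apex of the kind of null-safe check that fixes it (the browser names and structure are illustrative, not my actual code):

```apex
// Read the User-Agent header defensively. Googlebot may omit the header
// entirely, so the map lookup can return null - calling a method on the
// raw value is exactly what caused my null pointer exception.
String userAgent = ApexPages.currentPage().getHeaders().get('User-Agent');
String browser = 'unknown';
if (String.isNotBlank(userAgent)) {
    String ua = userAgent.toLowerCase();
    if (ua.contains('firefox')) {
        browser = 'Firefox';
    } else if (ua.contains('chrome')) {
        browser = 'Chrome';
    }
    // ...further browser checks as required...
}
```

With the blank/null guard in place, a crawler with no User-Agent just falls through to the 'unknown' branch instead of triggering the platform error page.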
I managed to track this down by accessing the site through telnet and typing HTTP requests by hand - a fine way to lose a morning!
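If you want to reproduce this by hand, here's a sketch of the kind of raw request I was typing (example.force.com is a placeholder - substitute your own site's domain):

```shell
# A Googlebot-style request: note there is no User-Agent header at all.
# example.force.com is a placeholder - substitute your own site's domain.
HOST=example.force.com
printf 'GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$HOST" > request.txt
cat request.txt
# To actually send it over a raw TCP connection (same idea as telnet):
# nc "$HOST" 80 < request.txt
```

If the response comes back 401 here but 200 from a normal browser, the User-Agent header is the variable to investigate.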
All Answers
The 302 redirect you are getting, to cust_maint/site_down/maintenance.html, means that the Salesforce instance was down for maintenance when the page was requested.
Hi Ryan,
Thanks for the reply. Unfortunately this doesn't stack up with what I'm seeing.
For example, this morning I've run a number of requests to check server response headers from seoconsultants (http://www.seoconsultants.com/tools/headers), and each and every time I get the following response:
However, in between trying this I am carrying out hard refreshes from my browser with Fiddler enabled and I see nothing but 200 responses, which indicates to me that the site isn't down.
If I try to access the site as a Googlebot, using http://www.avivadirectory.com/bethebot/, I get a 401 response and am taken to the login page. Again, I've never seen this when accessing the site from a regular browser. The only difference that I can see is the User-Agent header. Do Salesforce Sites return different responses based on the browser's User-Agent header?
I'm at a loss as to how to proceed on this one - the platform appears to be returning incorrect responses, yet there's nothing I can do to influence those responses.
I'm not an expert at all in this, but would it have anything to do with the robots.txt file? I just know Google doesn't pick up a Salesforce site until you actually create a robots.txt file, since Salesforce blocks crawlers by default.
Are you sure the bot is using exactly the same address as your browser?
www.ecohomesquad.com doesn't seem to take you to the same page as http://ecohomesquad.force.com (the nav is broken on the former), and I've found that Force.com Sites don't seem to follow CNAME aliases completely - you can end up dumped at a login page if it's not a direct CNAME.
Your robots.txt file on your site, ecohomesquad.force.com/robots.txt, is not allowing Googlebot to crawl the site. You need to go into Sites in Setup and load a new robots.txt file.
It would look something like this if you want all the pages to be crawlable:
<apex:page contentType="text/plain">
User-agent: *
Allow: /
</apex:page>
My robots.txt is set up correctly.
This wouldn't account for 401 errors, as that is an error returned by the web server. Google inspects the robots.txt file and then decides whether it should crawl; if it shouldn't, it logs that as the reason it didn't carry out the crawl.