Friday, June 22, 2012

Getting Crawl Errors Using the Google Webmaster Tools API

Ranking well in Google matters for most sites these days, and to get there, SEOs need data about how Google crawls the site. Google provides Webmaster Tools for this, but its web interface does not show every crawl error: it lists only a limited number of URLs (1,000). Google also provides the Webmaster Tools API, with client libraries for various languages, which can retrieve more than 10,000 URLs. In my tests it downloaded around 120,000 URLs.
In this article, I will explain how to get the crawl errors using the Google Webmaster Tools API with the Java client library.
Steps to get the crawl errors for your site from Webmaster Tools:
Step 1: Download the Webmaster Tools Java client library

You can download the gdata client library files from here. On that page you will see two zip files: one contains the source files of the gdata-client library, and the other contains the samples. The samples zip contains the files needed to develop the Java client.

Step 2: Include the following jar files in your class path

Make sure that the following jar files are on your class path:

1. gdata-webmastertools-2.0.jar
2. gdata-client-1.0.jar
3. guava-11.0.jar
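
If you are not sure whether the jars were picked up, a quick check like the one below can help before you run the full example. This is only a sketch: ClasspathCheck and isOnClasspath are hypothetical names I made up for illustration (they are not part of the gdata library), and the class names being checked are simply the ones imported by CrawlErrors.java.

```java
// Hypothetical helper, not part of the gdata library: reports which of the
// classes imported by CrawlErrors.java can be loaded from the class path.
class ClasspathCheck {

    // Returns true if the named class can be found on the current class path.
    static boolean isOnClasspath(String className) {
        try {
            // "false" means: locate the class, but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "com.google.gdata.client.webmastertools.WebmasterToolsService",
            "com.google.gdata.data.webmastertools.CrawlIssuesFeed",
            "com.google.gdata.util.ServiceException"
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found   " : "MISSING ") + name);
        }
    }
}
```

Run it with the same class path you plan to use for the example. Any line marked MISSING suggests that a jar is absent or misnamed.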


Step 3: Copy and paste the code below into your program
The following example retrieves the crawl errors and prints them to the console.

CrawlErrors.java


import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

import com.google.gdata.client.webmastertools.WebmasterToolsService;
import com.google.gdata.data.webmastertools.CrawlIssueEntry;
import com.google.gdata.data.webmastertools.CrawlIssuesFeed;
import com.google.gdata.util.AuthenticationException;
import com.google.gdata.util.ServiceException;

/*
 * Retrieves crawl errors from Google Webmaster Tools using the API.
 */
public class CrawlErrors {
    // The site URL must end with "/" (e.g. http://www.example.com/).
    static String SITE = "http://www.mysite.com/";

    // Base URL of the Webmaster Tools feeds; the crawl errors are read from a feed.
    static String FEEDS_PATH = "https://www.google.com/webmasters/tools/feeds/";

    /*
     * Builds the request URL for one page of crawl errors.
     * @param site  site for which errors are retrieved
     * @param start index of the first error to retrieve
     * @param count number of errors to retrieve
     */
    public static URL getCrawlIssuesPageUrl(String site, int start, int count)
            throws UnsupportedEncodingException, MalformedURLException {
        String crawlIssuesUrl = FEEDS_PATH + URLEncoder.encode(site, "UTF-8")
                + "/crawlissues/"
                + "?start-index=" + start
                + "&max-results=" + count;
        return new URL(crawlIssuesUrl);
    }

    // Prints the fields of a single crawl issue to the console.
    public static void printCrawlIssueEntry(CrawlIssueEntry entry) {
        System.out.println("\tId: " + entry.getId());
        System.out.println("\tCrawl Type: " + entry.getCrawlType().getCrawlType());
        System.out.println("\tIssue Type: " + entry.getIssueType().getIssueType());
        System.out.println("\tUrl: " + entry.getUrl().getUrl());
        System.out.println("\tDetail: " + entry.getDetail().getDetail());
    }

    public static void main(String[] args) {
        // With these settings, up to 10,000 URLs are retrieved (100 pages * 100 issues).
        int ISSUES_PER_PAGE = 100; // Google returns at most 100 URLs per request.
        int MAX_PAGES = 100;       // maximum number of pages to request
        int startFrom = 1;         // index of the first error to fetch (e.g. 1001 to skip the first 1,000)
        String USERNAME = "UserName@gmail.com";
        String PASSWORD = "Password";

        // First authenticate.
        WebmasterToolsService myService;
        try {
            myService = new WebmasterToolsService("Example");
            myService.setUserCredentials(USERNAME, PASSWORD);
        } catch (AuthenticationException e) {
            System.out.println("Error while authenticating.");
            return;
        }

        // Get the crawl issues page by page, with a fixed number of issues per page.
        for (int i = 0; i < MAX_PAGES; i++) {
            try {
                // Get the feed page.
                URL crawlIssuesPageUrl = getCrawlIssuesPageUrl(
                        SITE, i * ISSUES_PER_PAGE + startFrom, ISSUES_PER_PAGE);
                CrawlIssuesFeed feed = myService.getFeed(crawlIssuesPageUrl, CrawlIssuesFeed.class);

                if (feed == null) {
                    System.out.println("The feed could not be retrieved");
                    return;
                }
                // An empty page means there are no more issues, so stop requesting pages.
                if (feed.getEntries().isEmpty()) {
                    break;
                }

                // Print the crawl issues.
                System.out.println("Crawl Issues for site: " + SITE);
                System.out.println("Issues from: " + feed.getStartIndex()
                        + " to: " + (feed.getStartIndex() + feed.getItemsPerPage() - 1)
                        + " out of a total: " + feed.getTotalResults());
                for (CrawlIssueEntry entry : feed.getEntries()) {
                    printCrawlIssueEntry(entry);
                }
            } catch (UnsupportedEncodingException e) {
                System.out.println("Unable to encode the site ID.");
                return;
            } catch (MalformedURLException e) {
                System.out.println("The feed URL is not valid.");
                return;
            } catch (IOException e) {
                System.out.println("Unable to retrieve the feed: Network error.");
                return;
            } catch (ServiceException e) {
                System.out.println("Unable to retrieve the feed. Server unavailable.");
                e.printStackTrace();
                return;
            }
        }
    }
}
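
To make the paging in the program easier to follow, here is a small standalone sketch that needs no gdata jars. It rebuilds the same feed URL string as getCrawlIssuesPageUrl and lists the start index of each page (i * ISSUES_PER_PAGE + startFrom). FeedUrlSketch, buildUrl and startIndexes are hypothetical names used only for illustration, and the total of 250 issues is a made-up number.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it only mirrors the URL and paging arithmetic
// of CrawlErrors.java and sends no request to Google.
class FeedUrlSketch {
    static final String FEEDS_PATH = "https://www.google.com/webmasters/tools/feeds/";

    // Same concatenation as getCrawlIssuesPageUrl, returned as a plain String.
    static String buildUrl(String site, int start, int count) {
        try {
            return FEEDS_PATH + URLEncoder.encode(site, "UTF-8")
                    + "/crawlissues/"
                    + "?start-index=" + start
                    + "&max-results=" + count;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Start index of each page. Stops at maxPages, or earlier once the
    // (assumed) total number of issues has been covered.
    static List<Integer> startIndexes(int startFrom, int perPage, int maxPages, int totalResults) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < maxPages; i++) {
            int start = i * perPage + startFrom;
            if (start > totalResults) {
                break;
            }
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // prints https://www.google.com/webmasters/tools/feeds/http%3A%2F%2Fwww.mysite.com%2F/crawlissues/?start-index=1&max-results=100
        System.out.println(buildUrl("http://www.mysite.com/", 1, 100));
        // prints [1, 101, 201]
        System.out.println(startIndexes(1, 100, 100, 250));
    }
}
```

Note how the site name is percent-encoded inside the path, which is why it has to be passed through URLEncoder before it is appended to FEEDS_PATH.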

I have found that this program sometimes returns nothing, because retrieving the errors can take too long. In that case, do the following:
1. Run the program in debug mode.
2. Put a breakpoint at the line
    URL crawlIssuesPageUrl........
    and step over 5 to 7 times.
3. Then resume the program (stop stepping and let it run).
4. It will keep running from there.
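
Stepping through in the debugger seems to help simply because it slows the requests down. If that is the case, the same effect might be achieved in code by retrying a failed request after a short pause. Below is a minimal sketch of that idea. RetrySketch and retry are hypothetical names of my own (not part of the gdata library), and the sketch assumes that a slow response shows up as an exception; if the feed simply comes back empty instead, the task itself would have to throw in that case.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper, shown only to illustrate the idea.
class RetrySketch {

    // Calls the task up to maxAttempts times, pausing delayMillis after each
    // failed attempt. Returns the first successful result; if every attempt
    // fails, throws a RuntimeException wrapping the last failure.
    static <T> T retry(Callable<T> task, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
        throw new RuntimeException("All " + maxAttempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        // Simulated request that fails twice before it succeeds.
        final int[] calls = {0};
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.io.IOException("simulated timeout");
            }
            return "feed page";
        }, 5, 100);
        // prints "feed page after 3 calls"
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

In CrawlErrors.java, the getFeed call could be wrapped in the same way, for example retry(() -> myService.getFeed(crawlIssuesPageUrl, CrawlIssuesFeed.class), 3, 5000). I have not measured which delay works best, so treat those numbers as placeholders.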
Hope this was helpful.

4 comments:

  1. Thank you for your article. In original Google API doc I couldn't find what is FEED_PATH but I could find it in your article.

  2. You have done really a superb job with your web site. Marvelous stuff is here to read.
