A web crawler is a relatively simple automated program, or script, that methodically scans or “crawls” through Internet pages to create an index of the data it’s looking for. These programs are usually made to be used only once, but they can be programmed for long-term use as well. Crawlers have several uses, perhaps the most popular being search engines, which use them to provide web surfers with relevant websites. Other users include linguists and market researchers, or anyone trying to search for information on the Internet in an organized manner. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. Crawler programs can be purchased on the Internet or from many companies that sell computer software, and they can be downloaded to most computers.
Common Uses of Web Crawlers
There are various uses for web crawlers, but essentially a web crawler may be used by anyone seeking to collect information from the Internet. Search engines frequently use web crawlers to collect information about what is available on public web pages. Their primary purpose is to collect data so that when Internet surfers enter a search term, the search engine can quickly provide them with relevant websites. Linguists may use a web crawler to perform textual analysis; that is, they may comb the Internet to determine which words are commonly used today. Market researchers may use a web crawler to determine and assess trends in a given market.
Web crawling is an important method for collecting data on, and keeping up with, the rapidly expanding Internet. A vast number of web pages are added every day, and information is constantly changing. A web crawler is a way for search engines and other users to regularly ensure that their databases are up to date. Web crawlers can also be put to illegal uses, such as hacking a server to obtain more information than is freely given.
How a Web Crawler Works
When a search engine’s web crawler visits a web page, it “reads” the visible text, the hyperlinks, and the content of the various tags used in the site, such as keyword-rich meta tags. Using the information gathered by the crawler, a search engine then determines what the site is about and indexes the information. The website is then included in the search engine’s database and its page ranking process.
Web crawlers may operate one time only, say for a particular one-time project. If the purpose is long-term, as is the case with search engines, web crawlers may be programmed to comb through the Internet periodically to determine whether there have been any significant changes. If a site is experiencing heavy traffic or technical difficulties, the spider may be programmed to note that and revisit the site later, ideally after the technical issues have subsided.
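As a rough illustration of that “reading” step, the sketch below pulls the title, the named meta tags, and the link targets out of an HTML string using PHP’s built-in DOMDocument class. The function name read_page and the sample markup are made up for this example, and a real search engine does far more than this.

```php
<?php
// Hypothetical helper: summarize what a crawler might "read" from one page.
function read_page(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $page = ['title' => '', 'meta' => [], 'links' => []];

    // Page title, if there is one.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $page['title'] = trim($titles->item(0)->textContent);
    }

    // Named meta tags such as "keywords" or "description".
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name !== '') {
            $page['meta'][$name] = $meta->getAttribute('content');
        }
    }

    // The href of every hyperlink on the page.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $page['links'][] = $href;
        }
    }

    return $page;
}

// Made-up sample page for demonstration only.
$sample = '<html><head><title>Tea Shop</title>'
        . '<meta name="keywords" content="tea, green tea"></head>'
        . '<body><p>Welcome</p><a href="/about">About</a>'
        . '<a href="http://example.com/">Partner</a></body></html>';

print_r(read_page($sample));
```

The returned array is the kind of raw material an indexer could store; deciding what the page is “about” is a separate step that this sketch does not attempt.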
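One common way to implement the “come back later” behaviour is to wait longer after each failed visit. The sketch below assumes a simple doubling delay with an upper limit; the function name revisit_delay and the default values are illustrative assumptions, not how any particular search engine schedules its crawls.

```php
<?php
// Hypothetical helper: seconds to wait before revisiting a site that was
// busy or failing. The delay doubles with each consecutive failure and
// is capped at $max seconds.
function revisit_delay(int $failures, int $base = 60, int $max = 3600): int
{
    if ($failures < 1) {
        return 0; // no failures recorded, so no extra wait
    }
    // Limit the exponent so the intermediate value stays a sensible integer.
    $delay = $base * (2 ** min($failures - 1, 20));
    return (int) min($delay, $max);
}

foreach ([1, 2, 3, 10] as $failures) {
    echo $failures . ' failure(s): wait ' . revisit_delay($failures) . " seconds\n";
}
```

With these defaults the wait grows from one minute to two, then four, and never exceeds one hour.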
How to Build a Basic Web Crawler
Web crawlers, sometimes called scrapers, automatically scan the Internet, attempting to glean the context and meaning of the content they find. The web wouldn’t function without them. Crawlers are the backbone of search engines, which combine them with clever algorithms to work out the relevance of your page to a given keyword set.
The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links – then report back to Google HQ and add the information to their huge database.
Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.
Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kinds of crawlers, so it’s important you understand how they work and how easy they are to make.
To make a simple crawler, we’ll be using one of the most common programming languages of the web: PHP. Don’t worry if you’ve never programmed in PHP; I’ll be taking you through each step and explaining what each part does. I am going to assume a basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.
Before we start, you will need a server to run PHP. You have a number of options here:
- If you host your own blog using WordPress, you already have one, so upload the files you write via FTP and run them from there. Matt showed us some free FTP clients for Windows you could use.
- If you don’t have a web server but do have an old PC sitting around, then you could follow Dave’s tutorial here to turn an old PC into a web server.
- Just one computer? Don’t worry – Jeffry showed us how we can run a local server inside of Windows or Mac.
Getting Started
We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.
First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:
<?php
include_once('simple_html_dom.php');
phpinfo();
?>
Access the file through your internet browser. If everything has gone right, you should see a big page of debug and server information printed out, all from that one little line of code! It’s not really what we’re after, but at least we know everything is working.
The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, any commands we have must be finished with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.
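<?php
include_once('simple_html_dom.php');
$target_url = "http://www.tokyobit.com/";
// Create a DOM object and load the target page into it
$html = new simple_html_dom();
$html->load_file($target_url);
// Print the href attribute of every link on the page
foreach ($html->find('a') as $link) {
    echo $link->href . "<br />";
}
?>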
You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real world situation, Google would ignore internal links and simply look at what other websites you’re linking to, but that’s outside the scope of this tutorial.
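Although proper link analysis is beyond this tutorial, a first step towards it can be sketched in a few lines: compare the host of each link with the host of the page it came from. The function name external_links and the sample links are made up for illustration, and this uses PHP’s built-in parse_url rather than anything from Simple HTML DOM.

```php
<?php
// Hypothetical helper: keep only links whose host differs from the page's host.
function external_links(array $hrefs, string $pageUrl): array
{
    $pageHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));
    $external = [];
    foreach ($hrefs as $href) {
        $host = parse_url($href, PHP_URL_HOST);
        // Relative links such as "/about" have no host, so they count as internal.
        if (is_string($host) && strtolower($host) !== $pageHost) {
            $external[] = $href;
        }
    }
    return $external;
}

// Made-up list of links, as the second example might have printed them.
$links = ['/about', 'http://www.tokyobit.com/blog', 'http://example.com/page'];
print_r(external_links($links, 'http://www.tokyobit.com/'));
```

Note that this simple comparison treats www.tokyobit.com and tokyobit.com as different hosts, so a real crawler would need a smarter rule for deciding what counts as the same site.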
If you’re running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you’d like to examine.
That code was quite a jump from the last example, so let’s go through in pseudo-code to make sure you understand what’s going on.
Include once the simple HTML DOM helper file.
Set the target URL as http://www.tokyobit.com.
Create a new simple HTML DOM object to store the target page
Load our target URL into that object
For each link <a> that we find on the target page
- Print out the HREF attribute
That’s it for today, but if you’d like a bit of a challenge, try modifying the second example so that instead of searching for links (<a> elements), it grabs images (<img>) instead. Remember, the src attribute of an image specifies the URL for that image, not href.
How to Increase Google’s Crawl Frequency
While reviewing the pages crawled per day in Google Webmaster Tools, I noticed that out of the gate we had an instant crawl of nearly all site pages.
I passed this on to the client, who quickly replied, “Why is it crawling more pages now than it used to crawl?”
Seeing this reminded me yet again of all the reasons why sound on-site SEO practices can help increase crawl frequency.
Through the redesign, we enacted several SEO elements, which have helped to allow, and in some instances entice, crawling bots to frequent the site more often… and more pages at that. Let’s examine how those elements increased Google’s crawl frequency.
Why You Should Care About Crawl Frequency
SEO, to many, hinges upon attaining enhanced visibility for highly searched terms as well as referring this traffic to their sites. Taking our blinders off for a moment, there are a few things we have to remember.
We want to rank many pages on a site, not just the homepage. Additionally, we’re actively making changes to our sites and we want bots to see this as quickly as possible and as deep within the site as possible.
Redesign/Site Migration or Not, No Excuses
As mentioned above, the redesign effort was a good opportunity to enhance crawl frequency because so many good SEO changes were taking place at once. Additionally, so much new and refreshed content drives the bots nuts, giving them much more to peruse on the site. Thanks, Google Caffeine!
For many out there, you can’t enjoy the opportunity of a full-scale redesign, platform change, and SEO overhaul of a site all at once. If this is you, then the list below is a working order of the standard SEO practices you can work on to improve crawl frequency on your site.
Get ’em on the Site
- Run a DNS check, Ping and Traceroute check of the site to assess if there are any issues with site pages loading with regard to connectivity or any other server issues. Can the bots even access your pages?
- Run a page load speed report of your 10 most important pages to review how fast your pages are loading. Crawlers lack patience. Are you asking too much of them?
- Utilize parameter-free static/clean URLs on the site. Bots have long had issues with parameter crawling. Yes, they can often see their way through these now, but why not make it easier for them to crawl the site?
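To get a quick idea of how many of your URLs still carry parameters, a small audit along these lines could help. It is only a sketch: the function name urls_with_parameters and the sample URLs are invented for illustration, and it relies on PHP’s built-in parse_url.

```php
<?php
// Hypothetical audit helper: report which URLs carry query-string parameters.
function urls_with_parameters(array $urls): array
{
    $flagged = [];
    foreach ($urls as $url) {
        $query = parse_url($url, PHP_URL_QUERY);
        // parse_url returns null when the URL has no query string.
        if (is_string($query) && $query !== '') {
            $flagged[] = $url;
        }
    }
    return $flagged;
}

// Made-up URLs: the first is "clean", the second is parameter-driven.
$urls = [
    'http://example.com/products/green-tea',
    'http://example.com/product.php?id=42&cat=7',
];
print_r(urls_with_parameters($urls));
```

Feeding it the list of links gathered by a crawl of your own site would show which pages are candidates for cleaner URLs.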
Hand Them the Keys to the Site
- Review your robots.txt file as well as your usage of meta robots tags. What pages are you withholding from them?
- Have an XML sitemap as well as HTML sitemap.
- Enlist supplemental navigation on-site (i.e., footer navigation, breadcrumb navigation, and relevant internal linking in copy). Create pathways to make a site easy to crawl.
- Fix internal links that result in 404 errors, and ensure that external links open in new windows. You don’t want to stop the crawl and you don’t want to usher them away.
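If your platform doesn’t produce an XML sitemap for you, a bare-bones one can be generated from a list of page URLs. The sketch below assumes the basic urlset/url/loc layout of the sitemaps.org protocol and leaves out optional fields such as lastmod; the function name build_sitemap and the sample URLs are illustrative.

```php
<?php
// Hypothetical helper: build a minimal XML sitemap from a list of page URLs.
function build_sitemap(array $urls): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // Escape characters such as & that are not allowed raw in XML.
        $loc = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= "  <url><loc>{$loc}</loc></url>\n";
    }
    $xml .= "</urlset>\n";
    return $xml;
}

// Made-up URLs for demonstration only.
echo build_sitemap([
    'http://example.com/',
    'http://example.com/blog?page=2&sort=new',
]);
```

The output would typically be saved as sitemap.xml in the site root, where it can be submitted through the search engine’s webmaster tools.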
Entice & Lure Them
Generate fresh content! This may be the most important point in the checklist.
Give them a reason to feel they should come back on a regular basis. This doesn’t mean you need new content site-wide every month, but it does mean refreshing existing content on a quarterly basis and maintaining site sections (news, blog, etc.) that continuously add content to the domain.
Generate links and social citations to your site. This can be a large scale task in itself. Think of it this way: the more links you have out on the web, the greater your chances are of attracting crawling bots. Think of links as portals into your site.
Conclusion
As you can see, there are many components that aid in enhancing your bot crawl frequency and depth of crawl. These are also many of the foundational elements of SEO. This helps to reinforce that crawl frequency is, after all, a very important aspect of SEO itself.