What is PageRank and how does it work?
To explain PageRank properly, we have to start in the early years of the internet. Since the early days of the World Wide Web, search engines have developed different methods to rank web pages. To this day, the occurrence of a search phrase within a document is one of the major factors in the ranking techniques of the major search engines; it can be weighted by the length of the document (ranking by keyword density) or by the emphasis the phrase receives through HTML tags.

As these techniques developed, more and more automatically generated web pages flooded the web. To maintain an overview, the concept of link popularity was developed. PageRank is one of the methods Google uses to determine a page’s relevance or importance. This simple idea is the core concept of Google's PageRank as it is used today.

The PageRank itself is a numeric value that represents the importance of a page. When one page links to another page, Google treats this as casting a vote of support for the other page. The more votes a page has, the more important that page must be. And if the voting page is important too, its vote carries more weight. Google calculates a page's importance from the votes cast for it.
The basic approach of PageRank is that a document is considered more important the more other documents link to it, but those inbound links do not count equally, or, in Google's own words:

"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

or

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web."

Let's break this down for readers who are not into mathematics:

1. PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page linking to page A, all the way up to “PR(Tn)” for the last such page.

2. C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is “C(T1)”, “C(Tn)” for page n, and so on for all pages.

3. PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page “n”, the share of the vote page A will get is “PR(Tn)/C(Tn)”.

4. d(... - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by 0.85 (the factor “d”).

5. (1 - d) - The (1 – d) bit at the beginning is a bit of probability math magic so the “sum of all web pages' PageRanks will be one”: it adds in the bit lost by the d(.... It also means that if a page has no links to it (no backlinks), it will still get a small PR of 0.15 (i.e. 1 – 0.85) under the first form of the formula. (Aside: the Google paper says “the sum of all pages” but they mean “the normalised sum”, otherwise known as “the average” to you and me.)

6. N - The total number of pages on the web.

First of all, we see that PageRank does not rank web sites as a whole, but is determined for each page individually. Further, the PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.
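To make the formula concrete, here is a minimal Python sketch that evaluates the first form, PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), for a single page. The function name `pagerank_of_page` and the sample values for T1 and T2 are made up for illustration; they do not come from Google.

```python
def pagerank_of_page(inbound, d=0.85):
    """Evaluate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) for one page.

    `inbound` is a list of (PR(T), C(T)) pairs, one per page T linking to A.
    """
    # Each linking page passes on its PageRank divided by its outgoing links.
    votes = sum(pr / out_links for pr, out_links in inbound)
    return (1 - d) + d * votes


# Hypothetical backlinks: T1 has PR 1.0 and 2 outgoing links,
# T2 has PR 0.5 and 5 outgoing links.
pr_a = pagerank_of_page([(1.0, 2), (0.5, 5)])
print(round(pr_a, 4))  # 0.15 + 0.85 * (0.5 + 0.1)

# A page without any backlinks still receives the baseline (1 - d).
print(round(pagerank_of_page([]), 4))
```

Note that this only evaluates the formula once, with the PageRanks of T1 and T2 taken as given; in practice those values depend on other pages in turn.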

Be aware, however, that Google does not count all links. Webmasters cannot control which sites link to them, but they can control which sites they link out to, so certain outbound links can cause Google to ignore a site. Check the sites you link to: a link to a PR0 site would be unwise.

How is the PageRank calculated?
This is where the whole matter gets a little tricky. The PR of each page depends on the PR of the pages pointing to it. But we won’t know what PageRank those pages have until the pages pointing to them have their PR calculated, and so on. It sounds like a circle, and indeed, it is a circle.

But Google gives an easy explanation for this:

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.

That means that we can just start calculating a page's PR without knowing the final PageRank of the pages linking to it. It sounds weird, but each time we repeat the calculation, we get closer to the final value. All we have to do is keep calculating until the numbers stop changing much.
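The iteration can be sketched in a few lines of Python. This is an illustrative sketch, not Google's implementation: the three-page `web` graph, the starting guess, and the stopping tolerance `tol` are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=1000, normalized=False):
    """Repeat the PageRank formula until the values stop changing much.

    `links` maps each page to the list of pages it links to.
    With normalized=False the base term is (1 - d); with normalized=True
    it is (1 - d) / N, matching the second form of the formula.
    """
    pages = list(links)
    n = len(pages)
    base = (1 - d) / n if normalized else (1 - d)
    # Any starting guess works; these keep the total constant from step one.
    pr = {p: (1.0 / n if normalized else 1.0) for p in pages}
    for _ in range(max_iter):
        new = {p: base for p in pages}
        for src, targets in links.items():
            share = d * pr[src] / len(targets)  # vote split over outgoing links
            for dst in targets:
                new[dst] += share
        change = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if change < tol:
            break
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print({p: round(v, 3) for p, v in ranks.items()})
print(round(sum(ranks.values()), 3))                             # total for the (1 - d) form
print(round(sum(pagerank(web, normalized=True).values()), 3))    # total for the (1 - d) / N form
```

In this sketch the values of the first form add up to the number of pages (an average of one per page), while the values of the second form add up to one.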

The Random Surfer Model

In the Google explanations we can find another addition to this algorithm, the so-called Random Surfer Model:

They consider PageRank as a model of user behaviour, where a surfer clicks on links at random with no regard towards content.

The random surfer visits a web page with a certain probability which derives from the page's PageRank. The probability that the random surfer clicks on any one link is given solely by the number of links on that page. This is why a page's PageRank is not passed on completely to a page it links to, but is divided by the number of links on the page.

So, the probability for the random surfer reaching one page is the sum of probabilities for the random surfer following links to this page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer Model, therefore, is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.

The probability that the random surfer keeps clicking on links is given by the damping factor d, which, being a probability, is set between 0 and 1. The higher d is, the more likely the random surfer is to keep clicking links. Since the surfer jumps to another page at random after he stops clicking links, that probability is implemented in the algorithm as the constant (1-d). Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank.

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the frequency of hits on that page by the random surfer. It can be understood as a Markov process in which the states are pages and the transitions are the links between pages, all equally probable. If a page has no links to other pages, it becomes a sink and makes the whole model unusable, because sink pages trap the random surfer forever. However, the solution is quite simple: if the random surfer arrives at a sink page, it picks another URL at random and continues surfing.

To be fair to pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually q=0.15, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature.
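The random surfer can also be simulated directly. The following Python sketch is only an illustration under stated assumptions: the four-page `web` graph (with page D as a sink), the number of steps, and the fixed random seed are made up. The surfer follows a random link with probability d and otherwise, or whenever it is stuck on a sink, jumps to a random page.

```python
import random


def simulate_surfer(links, steps=100_000, d=0.85, seed=0):
    """Return the fraction of steps the random surfer spends on each page."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = links[current]
        if targets and rng.random() < d:
            current = rng.choice(targets)  # keep clicking: all links equally likely
        else:
            current = rng.choice(pages)    # bored, or trapped on a sink: random jump
    return {p: count / steps for p, count in visits.items()}


# Hypothetical web: D has no outgoing links, so it is a sink.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
frequencies = simulate_surfer(web)
for page, freq in sorted(frequencies.items()):
    print(page, round(freq, 3))
```

The visit frequencies add up to one, and pages with more or better inbound links (C in this example) are visited more often than pages with fewer (B).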

False or Spoofed PageRank
While the PR shown is accurate for most sites, it must be noted that it is also easily manipulated. A current flaw is that any low-PageRank page that is redirected, via a 302 server header or a "Refresh" meta tag, to a high-PR page acquires the PR of the destination page. In theory, a new PR0 page with no incoming links can be redirected to the Google home page, which is a PR10, and by the next PageRank update the new page will be upgraded to a PR10. This is called spoofing and is a known failing or bug in the system. Any page's PR can be spoofed to a higher or lower number of the webmaster's choice, and only Google has access to the real PR of the page.

For SEO purposes webmasters often buy links for their sites. As links from higher PR pages are believed to be more valuable they tend to be more expensive.

What does SEO mean?
Search engine optimization (SEO) is a set of methods aimed at improving the ranking of a website in search engine listings. The term also refers to an industry of consultants that carry out optimization projects on behalf of clients' sites.

Using search engines, visitors can find sites in a variety of ways: via paid-for advertisements in the search engine results pages (SERPs), via third parties who are listed in the search engines, or via "organic" listings, i.e. the results the search engines present users. SEO is primarily concerned with improving the visibility of a site in the organic search results.

High rankings in the organic search results can provide targeted traffic for a site. Obtaining that traffic by other means can be expensive. For particularly competitive terms, the cost per click can run to several dollars or more when pay-per-click advertising or banner advertising is used. For even moderately competitive terms the cost can range from a few cents to several tens of dollars per visitor. Given those costs, it often makes sense for site owners to optimize their sites for organic search.

Other sites target a specific population, with particular needs or interests. Many businesses try to optimize their sites for large numbers of highly specific keywords that indicate a prospective customer who is ready to buy their product. Focusing on desired traffic can generate more high-quality sales leads, and fewer time-wasting inquiries.

History of SEO
SEO began in the mid-1990s, as the first search engines were cataloging the early web. Initially, all a webmaster needed to do was submit a site to the various engines, which would run spider programs to "crawl" the site and store the collected data. The search engines then sorted the information by topic and served results based on the pages they had spidered. As the number of documents online kept growing, and more webmasters realised the value of organic search listings, it became imperative for search engines to sort the vast collection of pages they had spidered and display the most relevant pages first. This was the start of a search engine vs. SEO struggle that continues to this day.

Initially, search engines were guided by the webmasters themselves. Early versions of search algorithms relied on webmaster-provided information like meta tags. Meta tags provided a guide to each page's content and relevant keywords. Soon some webmasters began to abuse meta tags, causing their pages to rank for irrelevant searches. In response, search engines developed more complex algorithms, taking into account a wider range of factors, but they still relied largely on what are today known as "on-site" factors. Examples of on-site factors include:

* Keywords in the domain name

* Keywords in the site's directory and file names

* Page titles and tags: for example, a phrase marked up as an H1 (heading) element was considered to contain keywords relevant to the page

* Ratio of the keyword(s) to other words on the page, the keyword density

* Content of alternate text provided in the form of Alt tags for images, noframes text for browsers not able to display framed pages, etc.


The inherent flaw in relying so extensively on those factors was that webmasters and SEOs had full control over them and could "optimize" their pages for better rankings. Search engines had to adapt again to ensure their SERPs showed the most relevant pages rather than the best optimized ones.
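Keyword density, one of the on-site factors listed above, shows how easy these factors are to control. The Python sketch below is a rough illustration only: it assumes density means the number of keyword occurrences divided by the total number of words, uses a simple made-up tokenizer, and says nothing about how any real search engine measures it.

```python
import re


def keyword_density(text, keyword):
    """Share of the words in `text` that are exactly `keyword` (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


# Made-up page text: 11 words, 3 of which are the keyword.
sample = "Cheap flights to Rome. Book cheap flights today and fly cheap."
density = keyword_density(sample, "cheap")
print(f"{density:.1%}")
```

Because the page author chooses every word, a figure like this can be pushed up or down at will, which is exactly the weakness described above.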

A new search engine emerged with a new kind of thinking. Google was started by two PhD students at Stanford University, Sergey Brin and Larry Page, and brought a new concept to ranking web pages. This concept, called PageRank, was, for many years, the mainstay of the Google algorithm. PageRank relied heavily on incoming links and used the logic that each link to a page is a vote for that page's value. The more incoming links a page had the more "worthy" it was. The value of each incoming link itself varied directly based on the PageRank of the page it was coming from and inversely on the number of outgoing links on that page. PageRank proved to be very good at serving relevant results. Google became the most popular and successful search engine. Because PageRank measured an off-site factor, it was more difficult to manipulate - at first.

Given time, and the realization that PageRank was the new game in town, webmasters focused on exchanging, buying, and selling links on a massive scale. PageRank's reliance on the link as a vote of confidence in a page's value was undermined as many webmasters sought to garner links purely to influence Google into sending them more traffic, irrespective of whether the link was useful to human site visitors.

It was time for Google and other search engines to look at a wider range of off-site factors. There were other reasons to develop more intelligent algorithms. The Internet was reaching a vast population of non-technical users who were often unable to use advanced querying techniques to reach the information they were seeking and the sheer volume and complexity of the indexed data was vastly different to the early days. Search engines had to develop predictive, semantic, linguistic and heuristic algorithms.

The PageRank metric itself is still displayed in the Google Toolbar, but it is only one of several factors that Google considers in ranking pages.

Today, most search engines keep their methods and ranking algorithms secret. A search engine may use hundreds of factors in ranking the listings on its SERPs; the factors themselves and the weight each carries may change continually.

Much current SEO thinking on what works and what doesn't is largely speculation and informed guesswork. Some SEOs have carried out controlled experiments to gauge the effects of different approaches to search optimization.

The following, though, are some of the considerations search engines could be building into their algorithms, and the list of Google patents may give some indication as to what is in the pipeline:

* Age of site

* Length of time domain has been registered

* Age of content

* Regularity with which new content is added

* Age of link and reputation of linking site

* Standard on-site factors

* Negative scoring for on-site factors (for example, a dampening for sites with extensive keyword meta tags indicative of having been SEO-ed)

* Uniqueness of content

* Related terms used in content (the terms the search engine associates as being related to the main content of the page)

* External links, the anchor text in those external links and in the sites/pages containing those links

* Citations and research sources (indicating the content is of research quality)

* Stem-related terms in the search engine's database (finance/financing)

* Incoming backlinks and anchor text of incoming backlinks

* Negative scoring for some incoming backlinks (perhaps those coming from low value pages, reciprocated backlinks, etc.)

* Rate of acquisition of backlinks: too many too fast could indicate "unnatural" link buying activity

* Text surrounding outward links and incoming backlinks. A link following the words "Sponsored Links" could be ignored

* Use of "rel=nofollow" to suggest that the search engine should ignore the link

* Depth of document in site

* Metrics collected from other sources, such as monitoring how frequently users hit the back button when SERPs send them to a particular page

* Metrics collected from sources like the Google Toolbar, Google AdWords/Adsense programs, etc.

* Metrics collected in data-sharing arrangements with third parties (like providers of statistical programs used to monitor site traffic)

* Rate of removal of incoming links to the site

* Use of sub-domains, use of keywords in sub-domains and volume of content on sub-domains… and negative scoring for such activity

* Semantic connections of hosted documents

* Rate of document addition or change

* IP of hosting service and the number/quality of other sites hosted on that IP

* Other affiliations of linking site with the linked site (do they share an IP? have a common postal address on the "contact us" page?)

* Technical matters like the proper use of robots.txt

* Hosting uptime

* Whether the site serves different content to different categories of users (cloaking)

* Broken outgoing links not rectified promptly

* Unsafe or illegal content

* Quality of HTML coding, presence of coding errors

* Actual click through rates observed by the search engines for listings displayed on their SERPs

* Hand ranking by humans of the most frequently accessed SERPs
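Of the items above, "rel=nofollow" is one a webmaster can inspect directly in a page's markup. As a small illustration, the Python sketch below uses the standard library's `html.parser` to separate ordinary links from links marked nofollow. The class name `LinkCollector` and the sample HTML are made up, and how a search engine actually treats such links is not shown here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets, split by whether rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


html = (
    '<p><a href="https://example.com/a">A</a> '
    '<a rel="nofollow" href="https://example.com/b">B</a></p>'
)
collector = LinkCollector()
collector.feed(html)
print("followed:", collector.followed)
print("nofollow:", collector.nofollow)
```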

Relationship between SEO and Search Engines
In the early 2000s, search engines and SEO firms attempted to establish an unofficial 'truce'. There are several tiers of SEO firms, and the more reputable companies employ content-based optimizations which meet with the search engines' (reluctant) approval. These techniques include improvements to site navigation and copywriting, designed to make websites more intelligible to search engine algorithms.

Search engines have also reached out to the SEO industry, and are frequent sponsors and guests at SEO conferences and seminars. In fact, with the advent of paid inclusion, search engines now have a vested interest in the health of the optimization community.

Getting discovered by search engines
New sites no longer need to be submitted to search engines to be listed. A simple link from an established site will get the search engines to visit the new site and spider its contents. It is rarely more than a few days from the acquisition of the link to all the main search engine spiders visiting and indexing the new site.

Naturally, this means that it is good practice to have some means (such as a site map, or plain hypertext links) so that once a spider finds part of a site, it can navigate to the rest. Otherwise, individual, isolated, dead-end pages must be found one-by-one from outside the site; any pages that are not linked to from outside can only be found by links internal to the site.
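Whether a spider that finds one page can reach the rest of the site can be checked with a breadth-first walk over the internal links. The Python sketch below is an illustration only; the `site` dictionary is a made-up link map, and a real check would build it by crawling the pages.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search: minimum number of clicks from `start` to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: each page maps to the internal pages it links to.
site = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/widget"],
    "/products/widget": [],
    "/orphan": ["/"],  # no other page links here
}
depths = click_depths(site, "/")
unreachable = sorted(set(site) - set(depths))
print(depths)
print("not reachable from the home page:", unreachable)
```

Pages that end up in `unreachable` can only be found through links from outside the site, and pages with a large depth sit many clicks away from the home page.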

For search engines that have their own paid submission, such as Yahoo, it may save some time to pay a nominal fee for submission.

"Ethical" methods

So-called "ethical" methods of SEO involve following the search engines' guidelines as to what is and what isn't acceptable. Their advice generally is to create content for the user, not the search engines; to make that content easily accessible to their spiders; and not to try to game their system. Webmasters often make critical mistakes when designing or setting up their web sites, and "poison" them so that they will not rank well. Ethical SEO attempts to discover and correct such mistakes, for example unreadable menus, broken links, temporary redirects, or a generally poor navigation structure that places pages too many clicks from the home page.

Because search engines are text-centric, many of the same methods that are useful for web accessibility are also advantageous for SEO. Methods are available for optimizing graphical content, even Flash animation (by placing a paragraph or division within, and at the end of the enclosing OBJECT tag), so that search engines can interpret the information.

Some methods considered ethical by the search engines:

* Using a robots.txt file to grant permissions to spiders to access, or avoid, specific files and directories in the site

* Using a short and relevant page title to name each page

* Using a reasonably sized description meta tag without excessive use of keywords, exclamation marks or off topic comments

* Keeping the page accessible via links from other pages on the site and, preferably, from a sitemap

* Developing links via natural methods: Google doesn't elaborate on this somewhat vague guideline, but buying a link from an off-topic page purely because it has a high PageRank is probably not considered acceptable. Dropping an email to a fellow webmaster telling him about a great article you've just posted, and requesting a link, is most likely acceptable.
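As an illustration of the robots.txt item above, the Python sketch below feeds a small sample file to the standard library's `urllib.robotparser` and asks which URLs a spider may fetch. The rules, the user agent name "ExampleBot", and the example.com URLs are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every spider may crawl the site except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" stands in for any spider's user agent string.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))
print(parser.can_fetch("ExampleBot", "https://example.com/private/data.html"))
```

Well-behaved spiders perform a check of this kind before requesting a page; robots.txt is a request to spiders, not an access control mechanism.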

"Unethical" methods

As search engines operate in a highly automated way, it is often possible for webmasters to use methods and tactics not approved by search engines to gain better rankings. These methods often go unnoticed unless an employee of the search engine manually visits the site and notices the activity, or a change in the ranking algorithm causes the site to lose the advantage thus gained. Sometimes a company will employ an SEO consultant to evaluate competitors' sites and report "unethical" optimization methods to the search engines.

So-called "unethical" methods may include:

Keyword spamming (or keyword stuffing) involves the insertion of hidden, random text on a webpage to raise the keyword density, or ratio of keywords to other words on the page. Text can be hidden from the visitor's view in many different ways. A popular technique is to color the text so that it blends with the background. Using CSS "Z" positioning to place text "behind" an image, and therefore out of view of the visitor, is also common. Another way is to use CSS absolute positioning to place the text several feet away from the page center, again out of the visitor's physical view but plainly text that any search engine would pick up in a crawl of the page. Invisible text is a bad idea, as of 2005, because the top search engines can apparently detect it.

Abusing NOSCRIPT tags is another way to place hidden content within a page so that the search engines will index it but the visitor won't see it. NOSCRIPT tags are also a valid optimization method for displaying an alternative representation of JavaScript content, such as dynamically generated content. The NOSCRIPT tag is not unethical in itself, only when misused.

The inserted text sometimes includes words that are frequently searched (such as "sex") even if those terms bear little connection to the content of the page. The goal in these cases is plainly to increase traffic at all costs whether that traffic is relevant or not. Once traffic comes to the page, the unethical webmaster may hope to monetize the traffic by displaying ads.

Spamdexing is the promotion of irrelevant, chiefly commercial, pages through abuse of the search algorithms. Many search engine administrators consider any form of search engine optimization used to improve a website's page rank as spamdexing. However, over time a widespread consensus has developed in the industry as to what are and are not acceptable means of boosting one's search engine placement and resultant traffic.

Cloaking refers to any of several means to serve up a different page to the search-engine spider than will be seen by human users. It can be an attempt to mislead search engines regarding the content on a particular web site. It should be noted, however, that cloaking can also be used to ethically increase accessibility of a site to users with disabilities, or to provide human users with content that search engines aren't able to process or parse. It is also used to deliver content based on a user's location; Google themselves use IP delivery, a form of cloaking, to deliver results.