- Category: January - February 2009
Did you know that search engines do not really search the World Wide Web directly? Each one searches a database of the full text of Web pages selected from the billions of Web sites out there residing on servers. When you search the Web using a search engine, you are always searching a somewhat stale copy of the real Web page. When you click on links provided in a search engine's search results, you retrieve from the server the current version of the page.
Search engine databases are selected and built by computer robot programs called spiders. Although they are said to "crawl" the Web in their hunt for pages to include, in truth they stay in one place. Spiders find pages for potential inclusion by following the links in the pages they already have in their database. They cannot think, type a URL, or use judgment to "decide" to go look something up and see what's on the Web about it. Forget the sci-fi movies: computers, although getting more sophisticated all the time, are still brainless.
If a Web page is never linked to from other pages, search engine spiders cannot find it. The only way a brand-new page (one that no other page has ever linked to) can get into a search engine is for its URL to be submitted by a human to the search engine companies as a request that the new page be included. All search engine companies offer ways to do this.
After spiders find pages, they pass them on to another computer program for "indexing." This program identifies the text, links, and other content in the page and stores it in the search engine database's files so that the database can be searched by keyword (and by whatever more advanced approaches are offered), and the page will be found if your search matches its content. These reachable pages make up the "surface Web"; pages that have no chain of links from a page in the spider's initial list are invisible to that spider and are not part of the surface Web it defines.
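The crawl-then-index process described above amounts to a breadth-first traversal over links. Here is a minimal sketch of that idea; the URLs, page contents, and the in-memory `WEB` mapping are all hypothetical stand-ins for real network fetches:

```python
from collections import deque

# A toy in-memory "web": URL -> (page text, list of outgoing links).
# All URLs and content here are hypothetical.
WEB = {
    "http://a.example": ("home page", ["http://b.example", "http://c.example"]),
    "http://b.example": ("about page", ["http://a.example"]),
    "http://c.example": ("news page", []),
    "http://orphan.example": ("never linked to", []),  # no inbound links
}

def crawl(seed_urls):
    """Breadth-first crawl: follow links found in pages already fetched."""
    index = {}                       # URL -> page text (the "database")
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    while frontier:
        url = frontier.popleft()
        if url not in WEB:           # dead link; nothing to fetch
            continue
        text, links = WEB[url]
        index[url] = text            # hand the page off to the indexer
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["http://a.example"])
# The orphan page is never reached -- exactly the situation described above:
# with no inbound links, a spider following links can never discover it.
```

The only way the orphan page would enter `index` is by adding its URL to the seed list, which is the programmatic analogue of a human submitting the URL to the search engine.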
What’s the difference between PageRank and Toolbar PageRank?
Internal PageRank is updated continuously, while toolbar PageRank is updated only every 2-3 months. Toolbar PageRank is a single integer from 0 to 10, while the internally calculated PageRank is more like a floating-point number. And the final answer: who cares?
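The floating-point nature of internal PageRank is easy to see in the classic power-iteration formulation. The tiny link graph, damping factor, and iteration count below are illustrative assumptions, not Google's actual parameters:

```python
# A minimal power-iteration sketch of PageRank on a hypothetical link
# graph. The 0.85 damping factor is the value from the original
# PageRank paper; everything else here is made up for illustration.
DAMPING = 0.85

def pagerank(graph, iterations=50):
    """graph: node -> list of nodes it links to. Returns float ranks."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_ranks = {node: (1.0 - DAMPING) / n for node in graph}
        for node, outlinks in graph.items():
            if not outlinks:         # dangling node: spread rank evenly
                for other in graph:
                    new_ranks[other] += DAMPING * ranks[node] / n
            else:
                share = DAMPING * ranks[node] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# Internally the values are floats that sum to 1; the toolbar would
# collapse each of them onto a coarse 0-10 integer scale.
```

Note that C outranks B here even though both receive a link from A, because C also collects B's rank; small structural differences produce fractional differences that a 0-10 integer can never show.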
What is Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI)?
The process of analyzing the relationships between terms across a set of documents. The engine looks not only at the query but also for common terms in the document set. Documents that are semantically similar to the query will carry more weight than those that are not. This is an often misunderstood concept.
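A toy sketch of the first stage of LSA may help: build term vectors for documents and compare them by cosine similarity. (Real LSA then applies a singular value decomposition to this term-document matrix so that related terms collapse into shared "concept" dimensions; that step is omitted here.) The documents are invented for illustration:

```python
import math
from collections import Counter

# Hypothetical mini-corpus: two documents about cats, one about finance.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "a cat lay on a rug",
    "d3": "stock prices fell sharply today",
}

def vector(text):
    """Raw term-frequency vector for one document."""
    return Counter(text.split())

def cosine(v1, v2):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm

v = {name: vector(text) for name, text in docs.items()}
# d1 and d2 share terms ("cat", "on"), so they score higher against
# each other than either does against the finance document d3.
```

Even this crude version shows the core idea: relevance comes from relationships among terms in the document set, not from matching the query string alone.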
What is Phrase Based Indexing and Retrieval and what role does it play?
Phrase based indexing classifies good and bad phrases, based on certain criteria, across the entire document. The number of phrases and their proximity are taken into account. It is also capable of predicting the presence of other phrases on the page, and it will assign a higher or lower value depending on whether those phrases are present or not.
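The prediction idea can be illustrated with a loose co-occurrence sketch, in the spirit of phrase-based indexing: phrases that appear together in a training corpus predict each other, and a page containing the expected companions of its phrases scores higher. The corpus and phrase lists below are invented:

```python
from collections import Counter
from itertools import combinations

# Hypothetical training corpus: each document is a set of phrases
# already extracted from it.
CORPUS = [
    {"search engine", "web crawler", "page rank"},
    {"search engine", "page rank", "link analysis"},
    {"apple pie", "baking soda"},
]

# Count how often each pair of phrases occurs in the same document.
cooccur = Counter()
for doc in CORPUS:
    for pair in combinations(sorted(doc), 2):
        cooccur[pair] += 1

def related(phrase):
    """Phrases seen alongside `phrase` anywhere in the corpus."""
    out = set()
    for (a, b), count in cooccur.items():
        if phrase == a:
            out.add(b)
        elif phrase == b:
            out.add(a)
    return out

def score(page_phrases):
    """+1 for each expected companion phrase actually on the page."""
    total = 0
    for phrase in page_phrases:
        total += len(related(phrase) & (page_phrases - {phrase}))
    return total

# A page whose phrases mutually predict each other outscores a page
# with an arbitrary mix of unrelated phrases.
print(score({"search engine", "page rank"}))    # mutually predictive
print(score({"search engine", "baking soda"}))  # unrelated mix
```

This is only the flavor of the technique: a page stuffed with one phrase but missing its natural companions looks statistically unnatural, which is part of why phrase-based methods are hard to game.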
Google Lore - what are ‘Hilltop’, ‘Florida’, and ‘Big Daddy’?
Hilltop: An old and often contested algorithm that scores pages based on links from expert documents and topical relevancy. The theory behind it was to decrease the possibility of manipulation through buying high-PR links from off-topic pages. This was implemented during the Florida update, which is our next topic.
Florida: The highly controversial update implemented by Google in November of 2003, much to the chagrin of many seasonal retail properties. There were several theories as to what was included in this update: an over-optimization filter, a competitive-term filter, and the Hilltop algorithm. This update had catastrophic results for many Web merchants.
Big Daddy: A test data center used by Google to preview algorithm changes. This information was made public around November of 2005 by Matt Cutts and allowed marketers to preview upcoming SERPs.
What is a shingling algorithm and how is it used?
A shingling algorithm is a page segmentation method similar to VIPS, but less resource intensive and more likely to be used in search engine algorithms. Shingling algorithms look for blocks of content that do not occur frequently across a Web site, and for blocks with certain desired features. When the engine stores this information, navigation, advertisements, and other non-content areas are omitted. This increases speed, saves storage space, and theoretically makes the results more relevant because of the increase in unique content.
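A minimal w-shingling sketch makes the mechanism concrete: slide a window of w words over a block of text, collect the resulting shingles, and compare blocks with Jaccard similarity. Blocks that reappear nearly verbatim across pages (navigation, footers) score close to 1 and can be dropped as non-content. The texts and window size below are illustrative:

```python
def shingles(text, w=3):
    """The set of all w-word shingles (overlapping windows) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical blocks from two pages on the same site: near-identical
# navigation text versus a unique article block.
nav1 = "home about contact privacy policy terms of service"
nav2 = "home about contact privacy policy terms of use"
article = "shingling algorithms segment pages into blocks of content"

print(jaccard(shingles(nav1), shingles(nav2)))     # high: boilerplate
print(jaccard(shingles(nav1), shingles(article)))  # zero: unique content
```

A site-wide pass would compute these similarities for every block; blocks whose shingle sets recur across many pages are flagged as boilerplate and excluded from the stored index, leaving only the unique content.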