Searching The Web




Introduction To Web Searching
According to a study published by Cyveillance, the World Wide Web is estimated to contain more than two billion pages of publicly accessible information. As if the Web's immense size weren't enough to strike fear into the hearts of all but the most intrepid surfers, consider that the Web continues to grow at an exponential rate: tripling in size over the past two years, according to one estimate.

Add to this the fact that the Web lacks the bibliographic control standards we take for granted in the print world: there is no equivalent of the ISBN to uniquely identify a document; no standard system of cataloguing or classification analogous to those developed by the Library of Congress; no central catalogue listing the Web's holdings. In fact, many, if not most, Web documents lack even the name of their author and the date of publication. Imagine that you are searching for information in the world's largest library, where the books and journals (stripped of their covers and title pages) are shelved in no particular order and without reference to a central catalogue. A researcher's nightmare? Without question. The World Wide Web defined? Not exactly. Instead of a central catalogue, the Web offers a choice of dozens of different search tools, each with its own database, command language, search capabilities, and method of displaying results.

Given all this, you clearly need to familiarize yourself with a variety of search tools and develop effective search techniques if you hope to take advantage of the resources the Web offers without spending many fruitless hours flailing about, and eventually drowning, in a sea of irrelevant information.

SEARCH ENGINES AND SUBJECT DIRECTORIES

The two basic approaches to searching the Web are search engines and subject directories (also called portals).

Search engines allow the user to enter keywords that are run against a database (most often created automatically by "spiders" or "robots"). Based on a combination of criteria (established by the user and/or the search engine), the search engine retrieves from its database the WWW documents that match the keywords entered by the searcher. It is important to note that when you are using a search engine you are not searching the Internet "live", as it exists at this very moment. Rather, you are searching a fixed database that was compiled some time before your search.

While all search engines are intended to perform the same task, each goes about this task in a different way, which leads to sometimes amazingly different results. Factors that influence results include the size of the database, the frequency of updating, and the search capabilities. Search engines also differ in their search speed, the design of the search interface, the way in which they display results, and the amount of help they offer.

In most cases, search engines are best used to locate a specific piece of information, such as a known document, an image, or a computer program, rather than a general subject.

Examples of search engines include:

The growth in the number of search engines has led to the creation of "meta" search tools, often referred to as multi-threaded search engines. These allow the user to search multiple databases simultaneously, via a single interface. While they do not offer the same level of control over the search interface and search logic as individual search engines do, most multi-threaded engines are very fast. Recently, the capabilities of meta-tools have been extended with such useful features as the ability to sort results by site, by type of resource, or by domain; the ability to select which search engines to include; and the ability to modify results. These improvements have greatly increased the effectiveness and utility of the meta-tools.
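The core mechanism of a meta-tool, querying several engines at once and merging what comes back, can be sketched in a few lines of Python. The engine functions below are placeholders standing in for real query-translation and result-parsing code; no actual meta-tool works this way verbatim.

  from concurrent.futures import ThreadPoolExecutor

  def search_engine_a(query):
      # Placeholder: a real meta-tool would submit the query to one
      # engine's web interface and parse the returned result page.
      return [("http://a.example/1", 1), ("http://a.example/2", 2)]

  def search_engine_b(query):
      return [("http://b.example/1", 1)]

  ENGINES = [search_engine_a, search_engine_b]

  def metasearch(query):
      # Query every engine simultaneously, one thread per engine --
      # hence "multi-threaded" search.
      with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
          result_lists = pool.map(lambda engine: engine(query), ENGINES)

      # Merge the lists, dropping duplicate URLs.
      merged, seen = [], set()
      for results in result_lists:
          for url, rank in results:
              if url not in seen:
                  seen.add(url)
                  merged.append((url, rank))
      return merged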

Popular multi-threaded search engines include:

Subject-specific search engines do not attempt to index the entire Web. Instead, they focus on searching for Web sites or pages within a defined subject area, geographical area, or type of resource. Because these specialized search engines aim for depth of coverage within a single area, rather than breadth of coverage across subjects, they are often able to index documents that are not included even in the largest search engine databases. For this reason, they offer a useful starting point for certain searches. The table below lists some of the subject-specific search engines by category. For a more comprehensive list of subject-specific search engines, see one of the following directories of search tools:

Table of selected subject-specific search engines, by category:

  Regional (Canada)
  Regional (Other)
  Companies
  People (e-mail addresses)
  People (postal addresses & telephone numbers)
  Images
  Jobs
  Games
  Software
  Health/Medicine
  Education/Children's sites


How Search Engines Work
Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. (There are some disadvantages to calling part of the Internet the World Wide Web -- a large set of arachnid-centric names for tools is one of them.) In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.

How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
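That crawling loop can be sketched in a few lines of Python. This is a minimal illustration only: the seed URL is a placeholder, and the tag-stripping and link-extraction below are deliberately crude compared with a real spider.

  from collections import deque
  from urllib.parse import urljoin
  from urllib.request import urlopen
  import re

  def crawl(seeds, max_pages=100):
      queue = deque(seeds)     # URLs waiting to be visited
      seen = set(seeds)        # never queue the same page twice
      word_lists = {}          # URL -> list of words found there

      while queue and len(word_lists) < max_pages:
          url = queue.popleft()
          try:
              html = urlopen(url).read().decode("utf-8", errors="ignore")
          except OSError:
              continue         # unreachable page: skip it

          # Build the list of words on the page (crudely: strip tags first).
          text = re.sub(r"<[^>]+>", " ", html)
          word_lists[url] = re.findall(r"[A-Za-z]+", text.lower())

          # Follow every link found within the page.
          for link in re.findall(r'href="([^"#]+)"', html):
              absolute = urljoin(url, link)
              if absolute not in seen:
                  seen.add(absolute)
                  queue.append(absolute)

      return word_lists

  # A real spider starts from lists of heavily used servers and
  # popular pages; this seed is purely a placeholder.
  lists = crawl(["http://www.example.com/"])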

Google.com began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at a time, and each spider could keep about 300 connections to Web pages open at once. At peak performance, using four spiders, the system could crawl over 100 pages per second, generating around 600 kilobytes of data each second (an average of roughly six kilobytes per page).

Keeping everything running quickly meant building a system to feed the necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the Domain Name System (DNS) service that translates a server's name into an address, Google ran its own DNS to keep delays to a minimum.

When the Google spider looked at an HTML page, it took note of two things:

  • The words within the page
  • Where the words were found

The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.

These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
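Assuming a page has already been parsed into its title, sub-headings, link texts, and lines of body text, that selective approach might be sketched as follows (the function and its details are invented for illustration; Lycos's actual spider is not public):

  from collections import Counter

  def selective_index(title, subheadings, link_texts, body_lines):
      """Keep the words a selective spider might keep: title words,
      sub-heading words, link words, the 100 most frequent words on
      the page, and every word in the first 20 lines of text."""
      words = set()
      for text in [title, *subheadings, *link_texts]:
          words.update(text.lower().split())

      body_words = " ".join(body_lines).lower().split()
      words.update(w for w, _ in Counter(body_words).most_common(100))

      for line in body_lines[:20]:
          words.update(line.lower().split())
      return words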

Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags.

Once the spiders have completed the task of finding information on Web pages (and we should note that this is a task that is never actually completed -- the constantly changing nature of the Web means that the spiders are always crawling), the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users: the information stored with the data, and the method by which the information is indexed.

In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times, or whether the page contained links to other pages containing the word. In other words, there would be no way of building the "ranking" list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a "weight" to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
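A weighted index of this kind can be sketched as follows. The locations and weight values here are invented for illustration, since each commercial engine keeps its own formula to itself:

  # Hypothetical weights by location on the page; every engine
  # uses its own (secret) values.
  LOCATION_WEIGHT = {"title": 10, "meta": 8, "heading": 5,
                     "link": 4, "body": 1}

  index = {}   # word -> {URL: accumulated weight}

  def add_occurrence(word, url, location):
      """Record one occurrence of `word` at `url`, weighted by
      where on the page it appeared."""
      entry = index.setdefault(word, {})
      entry[url] = entry.get(url, 0) + LOCATION_WEIGHT[location]

  def rank(word):
      """Pages containing `word`, highest accumulated weight first."""
      return sorted(index.get(word, {}).items(),
                    key=lambda pair: pair[1], reverse=True)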

Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes (of 8 bits each) to store weighting information: whether the word was capitalized, its font size, its position, and other details to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping. As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it's ready for indexing.
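The bit-packing just described is easy to illustrate with ordinary bit operations. The field widths below (1 bit for capitalization, 3 bits for font size, 12 bits for word position) are one plausible way to fill 16 bits, chosen for the example rather than copied from any engine:

  def pack_hit(capitalized, font_size, position):
      """Pack three ranking factors into 16 bits (2 bytes)."""
      assert 0 <= font_size < 8 and 0 <= position < 4096
      return (capitalized << 15) | (font_size << 12) | position

  def unpack_hit(packed):
      return ((packed >> 15) & 0x1,    # capitalized?
              (packed >> 12) & 0x7,    # font size (0-7)
              packed & 0xFFF)          # position (0-4095)

  hit = pack_hit(capitalized=1, font_size=3, position=42)
  assert unpack_hit(hit) == (1, 3, 42)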

An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.

In English, some letters begin many words while others begin few; you'll find, for example, that the "M" section of the dictionary is much thicker than the "X" section. This disparity means that finding a word beginning with a very "popular" letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference and reduces the average time it takes to find an entry. It also separates the index from the actual entry: the hash table contains the hashed number along with a pointer to the actual data, which can be stored in whichever way is most efficient. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
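The moving parts can be sketched as follows: a hash function spreads words evenly across a fixed number of buckets, and each entry stores a pointer to the data rather than the data itself. (Python's built-in dictionaries already work this way internally; the explicit version below is only to make the mechanism visible.)

  NUM_BUCKETS = 1024   # the predetermined number of divisions

  def hash_word(word):
      """Attach a numerical value to a word, spreading words evenly
      across the buckets regardless of their first letter."""
      value = 0
      for ch in word:
          value = (value * 31 + ord(ch)) % NUM_BUCKETS
      return value

  buckets = [[] for _ in range(NUM_BUCKETS)]   # the hash table itself

  def insert(word, data_pointer):
      # Store the word with a pointer to the actual entry, which can
      # live wherever storage is most efficient.
      buckets[hash_word(word)].append((word, data_pointer))

  def lookup(word):
      for stored_word, pointer in buckets[hash_word(word)]:
          if stored_word == word:
              return pointer
      return None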

Subject Directories/Portals
Subject directories, or portals, are hierarchically organized indexes of subject categories that allow the Web searcher to browse through lists of Web sites by subject in search of relevant information. They are compiled and maintained by humans, and many include a search engine for searching their own database.

Subject directory databases tend to be smaller than those of the search engines, which means that result lists tend to be smaller as well. However, there are other differences between search engines and subject directories that can lead to the latter producing more relevant results. For example, while a search engine typically indexes every page of a given Web site, a subject directory is more likely to provide a link only to the site's home page. Furthermore, because their maintenance includes human intervention, subject directories greatly reduce the probability of retrieving results out of context.

Because subject directories are arranged by category and because they usually return links to the top level of a web site rather than to individual pages, they lend themselves best to searching for information about a general subject, rather than for a specific piece of information.

Examples of subject directories include:

Specialized subject directories
Due to the Web's immense size and constant transformation, keeping up with important sites in all subject areas is humanly impossible. Therefore, a guide compiled by a subject specialist to important resources in his or her area of expertise is more likely than a general subject directory to produce relevant information and is usually more comprehensive than a general guide. Such guides exist for virtually every topic. For example, Voice of the Shuttle (http://vos.ucsb.edu) provides an excellent starting point for humanities research. Film buffs should consider starting their search with the Internet Movie Database (http://us.imdb.com).

Just as multi-threaded search engines attempt to provide simultaneous access to a number of different search engines, some web sites act as collections or clearinghouses of specialized subject directories. Many of these sites offer reviews and annotations of the subject directories included and most work on the principle of allowing subject experts to maintain the individual subject directories. Some clearinghouses maintain the specialized guides on their own web site while others link to guides located at various remote sites.

Examples of clearinghouses include:

SEARCH STRATEGY

Regardless of the search tool being used, the development of an effective search strategy is essential if you hope to obtain satisfactory results. A simplified, generic search strategy might consist of the following steps:
  1. Formulate the research question and its scope
  2. Identify the important concepts within the question
  3. Identify search terms to describe those concepts
  4. Consider synonyms and variations of those terms
  5. Prepare your search logic
This strategy should be applied to a search of any electronic information tool, including library catalogues and CD-ROM databases. However, a well-planned search strategy is especially important when the database in question is as large, amorphous, and fast-evolving as the World Wide Web. Along with the characteristics already mentioned in the Introduction, another factor that underscores the need for an effective Web search strategy is that most search engines index every word of a document. This method of indexing tends to greatly increase the number of results retrieved while decreasing their relevance, because of the increased likelihood of words being found in an inappropriate context. When selecting a search engine, one factor to consider is whether it allows the searcher to specify which part(s) of the document to search (e.g., URL, title, first heading) or whether it simply defaults to searching the entire document.

Search logic refers to the way in which you, and the search engine you are using, combine your search terms. For example, the search Okanagan University College could be interpreted as a search for any of the three search terms, all of the search terms, or the exact phrase; depending on the logic applied, the results of the three searches would differ greatly. All search engines have some default method of combining terms, but their documentation does not always make it easy to ascertain which method is in use. Reading the online Help and experimenting with different combinations of words can both help in this regard. Most search engines also allow the searcher to modify the default search logic, either with pull-down menus or with special operators, such as the + sign to require that a search term be present and the - sign to exclude a term from a search.
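The effect of the + and - operators can be illustrated with a small query parser. This is a sketch of the general idea only; real engines differ, for instance in how they treat the remaining unmarked terms:

  def parse_query(query):
      """Split a query into required (+), excluded (-) and
      optional (unmarked) terms."""
      required, excluded, optional = [], [], []
      for term in query.split():
          if term.startswith("+"):
              required.append(term[1:].lower())
          elif term.startswith("-"):
              excluded.append(term[1:].lower())
          else:
              optional.append(term.lower())
      return required, excluded, optional

  def matches(page_words, query):
      """Does a page satisfy the query? Every +term must appear, no
      -term may appear, and (in this sketch) at least one unmarked
      term must appear if any were given."""
      required, excluded, optional = parse_query(query)
      words = set(w.lower() for w in page_words)
      return (all(t in words for t in required)
              and not any(t in words for t in excluded)
              and (not optional or any(t in words for t in optional)))

  assert matches(["Okanagan", "University"], "+Okanagan -College")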

Boolean logic is the term used to describe certain logical operations that are used to combine search terms in many databases. The basic Boolean operators are represented by the words AND, OR and NOT. Variations on these operators, sometimes called proximity operators, that are supported by some search engines include ADJACENT, NEAR and FOLLOWED BY. Whether or not a search engine supports Boolean logic, and the way in which it implements it, is another important consideration when selecting a search tool. The following diagrams illustrate the basic Boolean operations.

[Venn diagrams: AND retrieves only documents containing both terms (the intersection); OR retrieves documents containing either term (the union); NOT retrieves documents containing the first term while excluding those containing the second.]
Boolean operators are most useful for complex searches, while the + and - operators are often adequate for simple searches.
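In implementation terms, the three operators map directly onto set operations over an index recording which pages contain which words, as this small sketch with an invented three-word index shows:

  # Hypothetical mini-index: word -> set of pages containing it.
  index = {
      "okanagan":   {"page1", "page2"},
      "university": {"page2", "page3"},
      "college":    {"page2", "page4"},
  }

  def pages(word):
      return index.get(word, set())

  # AND: documents containing both terms (intersection).
  both = pages("okanagan") & pages("university")     # {'page2'}

  # OR: documents containing either term (union).
  either = pages("okanagan") | pages("university")   # {'page1','page2','page3'}

  # NOT: documents with the first term but not the second (difference).
  first_only = pages("okanagan") - pages("college")  # {'page1'}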

Dogpile

URL: http://www.dogpile.com
Interface: Single search box within a very busy, commercially oriented interface. Allows you to restrict your search to one of the following categories: The Web, Images, Audio/MP3, Action, News, FTP, Discussion, Small Biz, and Streaming Media.
Engines Searched: Dogpile searches the following search engines: AltaVista, Bay9, Direct Hit, Dogpile Web Catalog, FindWhat, Google, GoTo.com, Kanoodle, LookSmart, Lycos, Open Directory, RealNames, Sprinks by About, and Yahoo, but it will only search three at a time.

Search Features:

Search logic and syntax: Dogpile defaults to a Boolean AND search. Enclose phrases in quotation marks. Use + and - to require and exclude search terms.

Limit options: The only mechanism Dogpile offers for limiting a search is the choice of category: The Web, Images, Audio/MP3, Action, News, FTP, Discussion, Small Biz, or Streaming Media.

Dogpile also lets you limit which search engines it uses. You can select up to five engines per column; selecting column 1 and its five (or fewer) engines limits that page of results to those engines alone. This function writes a cookie to your hard drive to retain the settings for future searches.

Truncation: None

Case sensitivity: None

Results:

What is displayed: The search engine that was used and the number of hits it returned. Results include the document title, the first few words of text, and the URL.

Order of results: Determined by the engines searched and their own ranking rules.

Other features: A variety of commercial (shopping) services


Ixquick

URL: http://www.ixquick.com
Interface: Single search box. Ixquick allows the user to search one of the following categories: “Web”, “News”, “MP3”, and “Pictures”.
Engines Searched: Ixquick searches the following search engines: AOL, AltaVista, Euro Seek, Excite, Fast Search, GoTo.com, HotBot, LookSmart, MSN, NBCi,  Webcrawler, Yahoo.

Search Features:

Search logic and syntax: Ixquick is one of the few metasearch engines that allows natural-language, simple, and Boolean searches. Ixquick defaults to a Boolean AND search. Enclose phrases in quotation marks. Use + and - to require and exclude search terms.

Limit options: Allows the user to choose whether to search “The Web”, “MP3”, “News”, or “Pictures”. The user also has the option to select from the following interface languages: Deutsch, English, English UK, Español, Français, Italiano, Nederlands, Português.

On the results page you have the option to limit or select the search engines you wish to use for a subsequent search.

Truncation: Ixquick determines which of its underlying search engines support wildcards, and it translates and passes truncated terms only to the engines able to respond to them.

Case sensitivity: Yes. A capitalized search term retrieves only pages where the term is also capitalized; a lower-case term disregards case and retrieves everything.
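That rule is simple enough to state as a small predicate. The sketch below is a restatement of the behaviour just described, not Ixquick's own code:

  def term_matches(term, page_word):
      """Case rule as described above: a capitalized term must match
      exactly; a lower-case term matches any casing."""
      if term != term.lower():            # term contains capitals
          return page_word == term        # exact, case-sensitive match
      return page_word.lower() == term    # case-insensitive match

  assert term_matches("Paris", "Paris") and not term_matches("Paris", "paris")
  assert term_matches("paris", "PARIS")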

Results:

What is displayed: Results include the document title, the first few words of text, the URL, the search engines that found the site, and the number of hits from each engine. Clicking on a search engine's name takes you to that engine's list of related topic pages. All search results open in a new page, so your Ixquick search page is always available.

Order of results: Ixquick awards one star to a result for each search engine that ranks that site in its top ten results. A five-star result, in other words, is ranked in the top ten by five search engines.
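The star scoring amounts to counting, for each result, how many engines placed it in their top ten. A small sketch with invented engine names and URLs:

  def star_scores(engine_results):
      """engine_results maps an engine name to its ordered result
      URLs; a result earns one star per engine ranking it top-ten."""
      stars = {}
      for ranked_urls in engine_results.values():
          for url in ranked_urls[:10]:
              stars[url] = stars.get(url, 0) + 1
      return stars

  scores = star_scores({
      "EngineA": ["u1", "u2", "u3"],
      "EngineB": ["u2", "u1"],
      "EngineC": ["u2"],
  })
  assert scores["u2"] == 3   # in three engines' top ten -> three stars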

Refining results: Clicking the "More Like This" link retrieves pages that are "related" to the current result.

Other features: 

  • Search for pages that link to a given page (e.g. link:http://www.ouc.bc.ca)


Metacrawler

URL: http://www.metacrawler.com

Interface:  

Normal Search: Single search box. Metacrawler allows the user to choose a search for any word, all words, or the phrase contained within the search box. The interface is very busy and commercial in orientation.

Power Search: In power-search mode, Metacrawler offers the ability to customize how it performs its search. The user can select which search engines to use, the order in which results are displayed, and the domain (by country, or by type of institution such as government or education). Options are also provided to limit the total number of results and the number of results per page. The layout is very user friendly. Metacrawler uses cookies to save customizations to your hard drive.

Engines Searched:  Metacrawler searches the following search engines: AltaVista, DirectHit, Excite, FindWhat.com, Google, GoTo.com, Internet Keywords, Kanoodle, LookSmart, Lycos, MetaCatalog, Sprinks by About, Webcrawler.

Search Features:

Search logic and syntax: Metacrawler defaults to Boolean AND. Enclose phrases in quotation marks. Use + and - to require and exclude search terms.

Limit options: Allows the user to choose whether to search “The Web”, “Directory”, “Audio/MP3”, “Image”, “News Groups”, “Action”, or “Streaming Media”. Power search lets the user select which engines to search and which domain (an educational site, a government site, or a country of choice). You may also specify how long you are willing to wait and how many results appear on each results page. Search by Country lets the user select a country of choice to refine the search.

Metacrawler also provides the option to customize your search settings. The user can set the colour scheme and the default interface (Normal search, Slide Show, Power search, Low bandwidth); select which engines are searched; set keyword defaults, domains, the time-out length, and the number of results and how they are viewed; choose whether the cursor appears in the search box each time; and decide whether search parameters are saved automatically every time an option is changed and a query issued. All of this is accomplished by writing a cookie to your hard drive to retain the information.

Truncation: None

Case sensitivity: None

Results:

What is displayed: Results include the document title, the first few words of text, the URL, the search engines that found the site, and a link to more sites similar to this title.

Order of results: Metacrawler's ranking algorithm orders pages by how frequently each document is accessed; a document accessed 100 times, for example, will rank higher than a document accessed 50 times.
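Ranking by access frequency is simply a sort by popularity count, as this tiny sketch with invented figures shows:

  # Hypothetical access counts gathered by the engine.
  access_counts = {"docA": 100, "docB": 50, "docC": 75}

  # Most-accessed documents first: docA (100), docC (75), docB (50).
  ranking = sorted(access_counts, key=access_counts.get, reverse=True)
  assert ranking == ["docA", "docC", "docB"]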

Refining results: Clicking the "More Like This" link retrieves pages that are "related" to the current result.

Other features:  

  • View the most popular searches

  • A variety of commercial services


ProFusion

URL: http://www.profusion.com
Interface: Single search box. Provides search options below the search text box, allowing you to customize your search.
Engines Searched: ProFusion searches the following search engines: AltaVista, Yahoo!, GO, LookSmart, Britannica, Lycos, About, Excite, DirectHit.

Search Features:

Search logic and syntax: ProFusion defaults to a Boolean AND search, but a pull-down menu allows you to change the search logic. Enclose phrases in quotation marks. Use + and - to require and exclude search terms.

Limit options: ProFusion provides the option to limit your search to one of the following categories: Web Search, Business, Computing, Discussion, Entertainment, Health, Investment, MP3, Sports, Kids Fun, Genealogy, People. It also lets you choose the search sources: Best 3, Fastest 3, All, or You Choose, so that either ProFusion picks the search engines or you pick them yourself. You may also predetermine the number of results that will be presented.

Truncation: None

Case sensitivity: None.

Results:

What is displayed: The search engine that was used and the number of hits it returned. Results include the document title, the first few words of text, and the URL; the search engine that found the document is also presented as an active link (though it will only take you to that engine's search page). ProFusion also provides a "track it" feature, which requires your e-mail address: Intelliseek's Web Tracker will e-mail you when the page changes or is updated.

Order of results: Results are displayed in order of decreasing relevance.

Other features: 

  • Search for pages that link to a given page (e.g. link:http://www.ouc.bc.ca)


SurfWax

URL: http://www.surfwax.com
Interface: Single search box. Very simple layout; the only option is a choice of how many results you wish returned.
Engines Searched: Surfwax searches the following search engines: All The Web, AltaVista, Excite, HotBot, InfoSeek, OpenDirectory, SearchEdu,  Webcrawler, Yahoo, Yahoo News, Ditto.

Search Features:

Search logic and syntax: SurfWax defaults to a Boolean AND search. Enclose phrases in quotation marks. Use + and - to require and exclude search terms.

Limit options: Uses a series of "dimensions". The first dimension helps the user focus the search. The second provides what the creators call a SearchBlanket™, which generates a concise list of results. The third lets the user review a site for more relevant information pertaining to the search. The fourth lets you personalize SurfWax to your preferred style of searching.

Truncation: None

Case sensitivity: None.

Results:

What is displayed: The display page is divided into two halves. The right side displays the results of your search; the left side initially lists the various search engines that were used in generating the results. Clicking the green arrowhead in front of a result opens a description page in the left-hand frame. This “Snapshot” shows the criteria used to find the page, an abstract of the site, and some key points found on the site or page, along with a series of site "focus" words that let you refine your search further if you wish. When a link is chosen, it opens on a separate page, so you always have access to the search results page.

Order of results: Unknown.

Refining results: Clicking the focus button lets you refine your search. SurfWax then provides a series of terms related to the word or words you are searching for. For example, "Hurricane" brought up the following choices: a severe tropical cyclone usually with heavy rains and winds moving a [definition truncated at this point]

Broader << cyclone  

 tropical storm marked by extremely low pressure and circular winds [definition truncated at this point]

  Broader<< mile<< statute mile<< stat mi<< land mile<< mi<< speed<< velocity  

  Narrower << circular winds<< extremely low pressure 

Other features: 

  • Search for pages that link to a given page (e.g. link:http://www.ouc.bc.ca)