
Finding Information on the Internet
Session 1: Searching the World Wide Web
Dr. Hesham Azmi
Program of Information Science
Dept. of Mass Comm. & Information Science
Tuesday 17/1/2006
What is the Internet?
 The Internet is a computer network made up of
thousands of networks worldwide. No one knows exactly
how many computers are connected to the Internet. It is
certain, however, that these number in the millions and
are growing.
 No one is in charge of the Internet. There are organizations that develop technical aspects of this network and set standards for creating applications on it.
 All computers on the Internet communicate with one
another using the Transmission Control Protocol/Internet
Protocol suite, abbreviated to TCP/IP.
 The Internet offers a variety of access protocols. Many of these protocols feature programs that allow users to search for and retrieve material made available by the protocol.
COMPONENTS OF THE INTERNET
 WORLD WIDE WEB
 E-MAIL
 TELNET
 FTP
 E-MAIL DISCUSSION GROUPS
 USENET NEWS
 FAQ, RFC, FYI
 CHAT & INSTANT MESSAGING
Information on the Internet
 The Internet provides access to a wealth of information on countless
topics contributed by people throughout the world. On the Internet, a
user has access to a wide variety of services: vast information
sources, electronic mail, file transfer, interest group membership,
interactive collaboration, multimedia displays, and more.
 The Internet is not a library in which all its available items are
identified and can be retrieved by a single catalogue. In fact, no one
knows how many individual files reside on the Internet. The number
runs into a few billion and is growing at a rapid pace.
 The Internet is a self-publishing medium. This means that anyone with little or no technical skill and access to a host computer can publish on the Internet. Also be aware that the addresses of Internet sites frequently change, and Web sites can disappear altogether. Do not expect stability on the Internet.
 One of the most efficient ways of conducting research on the
Internet is to use the World Wide Web. Since the Web includes most
Internet protocols, it offers access to a great deal of what is available
on the Internet.
WORLD WIDE WEB
 The World Wide Web (abbreviated as the Web or WWW) is a system of Internet servers that supports hypertext to access several Internet protocols on a single interface. Almost every protocol type available on the Internet is accessible on the Web, including e-mail, FTP, Telnet, and Usenet News. In addition to these, the World Wide Web has its own protocol: Hypertext Transfer Protocol, or HTTP.
 The Web gathers together these protocols into a single system.
 Because of the Web's ability to work with multimedia and advanced programming languages, the Web is the fastest-growing component of the Internet.
HOW TO FIND INFORMATION ON THE WEB
 There are a number of basic ways to access information on the Web:
 Go directly to a site if you have the address
 Browse
 Conduct a search using a Web search engine
 Explore a subject directory
 Explore the information stored in live databases on the Web, known as the "deep Web"
 Join an e-mail discussion group or Usenet newsgroup
GO DIRECTLY TO A SITE IF YOU HAVE THE ADDRESS
 URL stands for Uniform Resource Locator. The URL
specifies the Internet address of the electronic
document. Every file on the Internet, no matter what its
access protocol, has a unique URL. Web browsers use
the URL to retrieve the file from the host computer and
the directory in which it resides. This file is then
downloaded to the user's computer and displayed on the
monitor.
 This is the format of a URL: protocol://host.second-level-domain.top-level-domain/path/filename
READING WEB ADDRESSES
 First, you need to know how to read a web address, or URL (Uniform Resource Locator). Let's look at the URL for this tutorial:
http://www.sc.edu/beaufort/library/pages/bones/lesson1.shtml
Here's what it all means:
 "http" means hypertext transfer protocol and refers to the format used to transfer and deal with information
 "www" stands for World Wide Web and is the general name for the host server that supports text, graphics, sound files, etc. (It is not an essential part of the address, and some sites choose not to use it)
 "sc" is the second-level domain name and usually designates the server's owner or location, in this case, the University of South Carolina
 "edu" is the top-level domain name (see below)
 "beaufort" is the directory name
 "library" is the sub-directory name
 "pages" and "bones" are the folder and sub-folder names
 "lesson1" is the file name
 "shtml" is the file type extension and, in this case, stands for "scripted hypertext mark-up language" (that's the language the computer reads). The addition of the "s" indicates that the server will scan the page for commands that require additional insertion before the page is sent to the user.
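The breakdown above can be checked programmatically. As an illustrative sketch, Python's standard urllib.parse module splits a URL into its protocol, host, and path parts; the tutorial URL shown above is used as the sample input:

```python
from urllib.parse import urlparse

# Parse the tutorial URL from the example above.
url = "http://www.sc.edu/beaufort/library/pages/bones/lesson1.shtml"
parts = urlparse(url)

print(parts.scheme)   # the protocol: "http"
print(parts.netloc)   # the host name: "www.sc.edu"
print(parts.path)     # the directory path and file name

# The top-level domain is the last label of the host name.
tld = parts.netloc.split(".")[-1]
print(tld)            # "edu"
```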
Top Level Domains
 Only a few top-level domains are currently recognized, but this is changing. Here is a list of the domains generally accepted by all:
 .edu -- educational site (usually a university or college)
 .com -- commercial business site
 .gov -- U.S. governmental/non-military site
 .mil -- U.S. military sites and agencies
 .net -- networks, Internet service providers, organizations
 .org -- U.S. non-profit organizations and others
Additional Top Level Domains
 In mid-November 2000, the Internet Corporation for Assigned Names and Numbers (ICANN) voted to accept seven additional suffixes, which are expected to be made available to users:
 .aero -- restricted use by the air transportation industry
 .biz -- general use by businesses
 .coop -- restricted use by cooperatives
 .info -- general use by both commercial and non-commercial sites
 .museum -- restricted use by museums
 .name -- general use by individuals
 .pro -- restricted use by certified professionals and professional entities
CONDUCT A SEARCH USING A WEB SEARCH ENGINE
 An Internet search engine allows the user to enter
keywords relating to a topic and retrieve information
about Internet sites containing those keywords.
 Search engines located on the Web have become quite
popular as the Web itself has become the Internet's
environment of choice. Web search engines have the
advantage of offering access to a vast range of
information resources located on the Internet.
 Web search engines tend to be developed by private
companies, though most of them are available free of
charge.
Search Engines
 A Web search engine service consists of three
components:
 Spider: Program that traverses the Web from
link to link, identifying and reading pages
 Index: Database containing a copy of each Web
page gathered by the spider
 Search engine mechanism: Software that enables users to query the index and that usually returns results ranked by term relevancy
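The three components can be illustrated with a toy Python sketch. The pages, their text, and the AND-only query logic below are made-up assumptions; the "spider" step is simulated by supplying page text directly rather than following real links:

```python
# Simulated spider output: page text supplied directly instead of
# being fetched by traversing links on the live Web.
pages = {
    "page1.html": "search engines index the web",
    "page2.html": "subject directories are built by people",
    "page3.html": "web search engines rank results",
}

# Index: map each word to the set of pages containing it.
index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

# Search mechanism: return pages containing all of the query words.
def search(query):
    matches = [index.get(word, set()) for word in query.split()]
    return set.intersection(*matches) if matches else set()

print(search("web search"))   # pages containing both "web" and "search"
```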
Search Engines
 With most search engines, you fill out a form with your search terms
and then ask that the search proceed. The engine searches its index
and generates a page with links to those resources containing some
or all of your terms. These resources are usually presented in
ranked order. Term ranking was once a popular ranking method, in
which a document appears higher in your list of results if your
search term appears many times, near the beginning of the
document, close together in the document, in the document title, etc.
These may be thought of as first generation search engines.
 A more sophisticated development in search engine technology is
the ordering of search results by concept, keyword, site, links or
popularity. Engines that support these features may be thought of as
second generation search engines. These engines offer
improvements in the ranking of results.
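The first-generation term ranking described above (frequency of the term, position near the beginning, etc.) can be sketched in a few lines of Python. The scoring formula and sample documents are illustrative assumptions, not any engine's actual algorithm:

```python
# A simplified first-generation ranking: score a document higher when
# the search term appears more often and nearer the beginning.
def term_score(document, term):
    words = document.lower().split()
    positions = [i for i, w in enumerate(words) if w == term.lower()]
    if not positions:
        return 0.0
    frequency = len(positions)
    # An earlier first occurrence earns a larger position bonus.
    position_bonus = 1.0 / (1 + positions[0])
    return frequency + position_bonus

docs = [
    "python tutorial for beginners",
    "searching basics and a tutorial plus another tutorial",
]
# Present the documents in ranked order, best match first.
ranked = sorted(docs, key=lambda d: term_score(d, "tutorial"), reverse=True)
```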
Search Engines
 It is important to stress that, by their very nature, search engines cannot index the entire content of the Net. Since the content of the Internet changes continuously, there will always be a delay in indexing it. The possible theoretical exception is Google, whose proprietary engine takes a 'picture' of the Net every time it is accessed. But in practice it is estimated that no search engine indexes more than about 30% of the Web's content.
HOW TO FORMULATE QUERIES
1. Identify your concepts. When conducting any database search, you need to break down your topic into its component concepts.
2. List keywords for each concept. Once you have identified your concepts, you need to list keywords which describe each concept. Some concepts may have only one keyword, while others may have many.
3. Specify the logical relationships among your keywords. Once you know the keywords you want to search, you need to establish the logical relationships among them. The formal name for this is Boolean logic. Boolean logic allows you to specify the relationships among search terms by using any of three logical operators: AND, OR, NOT.
Simple vs. Advanced Search
 Simple search
Very broad: retrieves thousands of irrelevant files
 Advanced search
Narrows the search using:
Boolean operators
Phrase searching
Field searching
Truncation
Boolean Operators
 A AND B (files containing both terms)
 A OR B (files containing at least one of the terms)
 A NOT B (files containing term A only)
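The three Boolean operators map directly onto set operations. In this Python sketch the file names are made up; each set stands for the files a search engine found containing a given term:

```python
# Hypothetical result sets: files containing term A and term B.
a = {"file1", "file2", "file3"}   # files containing term A
b = {"file2", "file4"}            # files containing term B

print(a & b)   # A AND B: files containing both terms
print(a | b)   # A OR B: files containing at least one of the terms
print(a - b)   # A NOT B: files containing term A only
```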
QUICK TIPS
NOTE: These tips will work with most search engines in their basic search option.
 Use the plus (+) and minus (-) signs in front of words to force their inclusion and/or exclusion in searches.
EXAMPLE: +meat -potatoes
(NO space between the sign and the keyword)
 Use double quotation marks (" ") around phrases to ensure they are searched exactly as is, with the words side by side in the same order.
EXAMPLE: "bye bye miss american pie"
(Do NOT put quotation marks around a single word.)
 Put your most important keywords first in the string.
EXAMPLE: dog breed family pet choose
 Type keywords and phrases in lower case to find both lower and upper case versions. Typing capital letters will usually return only an exact match.
EXAMPLE: president retrieves both president and President
 Use truncation (or stemming) and wildcards (e.g., *) to look for variations in spelling and word form.
EXAMPLE: librar* returns library, libraries, librarian, etc.
EXAMPLE: colo*r returns color (American spelling) and colour (British spelling)
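How a truncation pattern matches word variants can be shown by translating it into a regular expression. This is an illustration of the idea, not any search engine's actual code; the helper name is hypothetical:

```python
import re

# Translate a truncation pattern such as "librar*" or "colo*r" into a
# regular expression: each "*" may stand for zero or more word characters.
def wildcard_to_regex(pattern):
    return re.compile("^" + re.escape(pattern).replace(r"\*", r"\w*") + "$")

librar = wildcard_to_regex("librar*")
print(bool(librar.match("libraries")))   # matches
print(bool(librar.match("librarian")))   # matches

colo = wildcard_to_regex("colo*r")
print(bool(colo.match("color")))    # matches ("*" covers zero letters)
print(bool(colo.match("colour")))   # matches ("*" covers the "u")
```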
QUICK TIPS
 Know whether or not the search engine you are using maintains a stop word list. If it does, don't use known stop words in your search statement. Also, consider trying your search on another engine that does not recognize stop words.
 Combine phrases with keywords, using the double quotes and the plus (+)
and/or minus (-) signs.
EXAMPLE: +cowboys +"wild west" -football -dallas
(In this case, if you use a keyword with a +sign, you must put the +sign in
front of the phrase as well. When searching for a phrase alone, the +sign is
not necessary.)
 When searching within a document for the location of your keyword(s), use
the "find" command on that page.
 Know the default (basic) settings your search engine uses (OR or AND).
This will have an effect on how you configure your search statement
because, if you don't use any signs (+, -, " "), the engine will default to its
own settings.
CREATING A SEARCH STATEMENT
When structuring your query, keep the following tips in mind:
 Be specific
EXAMPLE: Hurricane Hugo
 Whenever possible, use nouns and objects as keywords
EXAMPLE: fiesta dinnerware plates cups saucers
 Put most important terms first in your keyword list; to ensure that they will be searched, put a +sign in front of each one
EXAMPLE: +hybrid +electric +gas +vehicles
 Use at least three keywords in your query
EXAMPLE: interaction vitamins drugs
 Combine keywords, whenever possible, into phrases
EXAMPLE: "search engine tutorial"
CREATING A SEARCH STATEMENT
 Avoid common words, e.g., water, unless they're part of
a phrase
EXAMPLE:
"bottled water"
 Think about words you'd expect to find in the body of the
page, and use them as keywords
EXAMPLE: anorexia bulimia eating disorder
 Write down your search statement and revise it before you type it into a search engine query box
EXAMPLE: +"South Carolina" +"financial aid" +applications +grants
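The conventions above (+ in front of required terms, double quotes around phrases) are mechanical enough to automate. This Python sketch assembles a search statement from lists of terms; the helper name and argument names are hypothetical:

```python
# Assemble a search statement from required keywords, required phrases,
# and excluded keywords, following the + / - / "" conventions above.
def build_statement(required_terms=(), required_phrases=(), excluded=()):
    parts = ['+%s' % t for t in required_terms]
    parts += ['+"%s"' % p for p in required_phrases]
    parts += ['-%s' % t for t in excluded]
    return " ".join(parts)

query = build_statement(
    required_terms=["applications", "grants"],
    required_phrases=["South Carolina", "financial aid"],
)
print(query)   # +applications +grants +"South Carolina" +"financial aid"
```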
Meta Search Engines
 Utilities that search more than one search engine and/or subject directory at once and then compile the results in a sometimes convenient display, sometimes consolidating all the results into a uniform format and listing. Some offer added-value features like the ability to refine searches, customize which search engines or directories are queried, the time spent in each, etc. Some you must download and install on your computer, whereas most run as server-side applications.
 Examples:
 Dogpile ( http://www.dogpile.com )
 Webcrawler ( http://www.webcrawler.com )
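The consolidation step that meta search engines perform can be sketched as merging ranked result lists. The engine names, site names, and the rank-sum scoring rule below are all illustrative assumptions, not how Dogpile or Webcrawler actually work:

```python
# Hypothetical ranked result lists from two search engines.
engine_results = {
    "engineA": ["siteX", "siteY", "siteZ"],
    "engineB": ["siteY", "siteX", "siteW"],
}

# Merge by summing each site's rank positions; lower totals rank higher.
# A site an engine omits is penalized with a rank just past the list end.
def merge(results):
    penalty = max(len(ranked) for ranked in results.values()) + 1
    all_sites = {site for ranked in results.values() for site in ranked}
    scores = {
        site: sum(
            ranked.index(site) if site in ranked else penalty
            for ranked in results.values()
        )
        for site in all_sites
    }
    # Break ties alphabetically so the ordering is deterministic.
    return sorted(all_sites, key=lambda site: (scores[site], site))

print(merge(engine_results))
```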
Subject Directories
 built by human selection -- not by computers or robot programs
 organized into subject categories, classification of pages by subjects -- subjects not standardized and vary according to the scope of each directory
 NEVER contain the full text of the web pages they link to -- you can only search what you can see (titles, descriptions, subject categories, etc.) -- use broad or general terms
 huge range in size, from small and specialized to large, but smaller than most search engines
 often carefully evaluated and annotated (but not always!)
When to use directories?
 Directories are useful for general topics, for
topics that need exploring, for in-depth research,
and for browsing.
 There are two basic types of directories:
1. Academic and professional directories, often created and maintained by subject experts to support the needs of researchers
2. Commercial portals that cater to the general public and are competing for traffic
Be sure you use the directory that appropriately meets your needs.
Subject Directories
 INFOMINE, from the University of California, is a good example of an academic subject directory
 Yahoo is a famous example of a commercial portal
Examples of Specialized Directories
 EXAMPLES OF SUBJECT-SPECIFIC DATABASES (i.e., VORTALS):
 Educator's Reference Desk (educational information)
 Expedia (travel)
 Internet Movie Database (movies)
 Jumbo Software (computer software)
 Kelley Blue Book (car values)
 Monster Board (jobs)
 Motley Fool (personal investment)
 MySimon (comparison shopping)
 PsychCrawler (psychology resources)
 Roller Coaster Database (roller coasters)
 SearchEdu (college & university sites)
 Voice of the Shuttle (humanities research)
 WebMD (health information)
WHAT ARE THE PROS AND CONS OF SUBJECT DIRECTORIES?
 PROS:
Directory editors typically organize directories hierarchically into browsable
subject categories and sub-categories. When you're clicking through
several subject layers to get to an actual Web page, this kind of
organization may appear cumbersome, but it is also the directory's
strength. Because of the human oversight maintained in subject directories,
they have the capability of delivering a higher quality of content.
 They may also provide fewer results out of context than search engines.
 CONS:
Unlike search engines, most directories do not compile databases of their
own. Instead of storing pages, they point to them. This situation sometimes
creates problems because, once accepted for inclusion in a directory, the
Web page could change content and the editors might not realize it. The
directory might continue to point to a page that has been moved or that no
longer exists.
 Dead links are a real problem for subject directories, as is a perceived bias
toward e-commerce sites.
WHAT IS THE "INVISIBLE WEB"?
 There is a large portion of the Web that search engine spiders cannot, or may not, index. It has been dubbed the "Invisible Web" or the "Deep Web" and includes, among other things, password-protected sites, documents behind firewalls, archived material, the contents of certain databases, and information that isn't static but is assembled dynamically in response to specific queries.
 Web profilers agree that the "Invisible Web," which is made up of
thousands of such documents and databases, accounts for 60 to 80
percent of existing Web material. This is information you probably
assumed you could access by using standard search engines, but
that's not always the case. According to the Invisible Web Catalog,
these resources may or may not be visible to search engine spiders,
although today's search engines are getting better and better at
finding and indexing the contents of "Invisible Web" pages.
Sources to locate Invisible Web
 http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
Criteria for Critical Evaluation of Information on the Internet
 Evaluating Information Content on the Internet:
 Purpose
 Intended Audience
 Scope
 Currency
 Authority
 Bibliography
 Objectivity
 Accuracy
Criteria for Critical Evaluation of Information on the Internet
 Evaluating Information Structure on the Internet:
 Design
 Software Requirements
 Hardware Requirements
 Style
 Uniqueness
Criteria for Critical Evaluation of Information on the Internet
 Evaluating Information Accessibility on the Internet:
 Restrictions
 Stability
 Security