How to go about digging deeper on the Web

By JACKIE LOOHAUIS

of the Journal Sentinel staff

Sunday, May 13, 2001

This spring, a New York man started feeling ill and signed onto his computer for some help.

He hunted on the Net for a site that would tell him about his symptoms. He found none on his first search-engine try. Using other search techniques, he found a medical site with an answer: He needed to be rushed to the hospital immediately. He was, and had an emergency bypass operation minutes after his arrival.

The coronary patient was lucky that he could find what he needed in his Internet search without a whole lot of delay. Because the notion that the Net is a vast encyclopedia just waiting to be opened by search engines and directories such as Alta Vista or Yahoo! is a myth.

Traditional search engines have access to only a fraction of 1% of what exists on the Web, according to BrightPlanet, an Internet search company, which notes that as many as 550 billion pieces of content are hidden from most search engine scrutiny. These documents make up what is known as "The Deep Web."

Undercover and undercovered, the vast reservoir of the Deep Web is estimated to be 500 times larger than the "surface" World Wide Web. And, according to BrightPlanet, the Deep Web is the fastest-growing category of new information on the Net.

"There's a huge amount of information you can't find entirely or easily via a search engine," says Net search guru Gary Price, a librarian at George Washington University, and co-author of the upcoming book "The Invisible Web" (CyberAge Books, $29.95). "The material on the Web is unorganized, very ephemeral. There's no rhyme or reason, no language control. The Web is a huge directory that's very hard to get at."

What's hidden?

What makes up the depths of The Deep Web? The biggest part of this invisible Web is information stored in databases -- massive libraries of Web content unsearchable through such tools as Yahoo! and Google. You have to know they exist before you can search them.

Such a database would be the Government Printing Office listings at www.access.gpo.gov/sudocs/aces/aaces002.html. There are thousands more.

Other aspects of the Net remain hidden in deep waters, too.

"There are tons of things out there," says Tara Calishain of Researchbuzz.com, an online Internet guide. "Pay content sources, lots of genealogy sources. The Library of Congress (www.loc.gov) has fabulous collections you can't find on Alta Vista."

Several types of information are most elusive for search engines -- bibliographies, multimedia files, information that comes in .pdf files (Adobe's portable document format). "News is dreadful," says Calishain. "Search engines don't cover it. It's tough to find breaking news."

Some sites, such as Amazon.com, have sections so far from the surface of their home pages that they, too, can be classified as Deep Web, says David Crane, a spokesman for search engine Google (www.google.com). An example, says Crane, is "the section that specifically offers a `portable compact disc player by Sony.' "

But the deepest Deep Web drop-off is in the category of government, and it's getting deeper.

"More and more city and county governments are putting their offerings on the Web. The State of Pennsylvania has a new crime reporting database (ucr.psp.state.pa.us/UCR/ComMain.asp), and more and more of that kind of thing is coming up now," says Calishain.

Why stuff is `hidden'

There are a number of reasons why these types of pages wear camouflage against search engines.

"Search engines are not easy to use," says Calishain. "Natural language was supposed to be the big savior, but that hasn't happened."

Take the screw-ups that everyday English can cause on a "natural language" search tool like Ask Jeeves (www.aj.com), for instance. Ask "How tall is a giraffe?" and you might get an answer. Ask "A giraffe is how tall?" and the search engine will see a different, perhaps unanswerable, question. Also, "You call it a `bubbler,' I call it a `water fountain,' " says Price.
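The "bubbler" problem is easy to reproduce. What follows is only a minimal sketch in Python of plain keyword matching, not a description of how Ask Jeeves or any real search engine works; the function names, the sample page text and the tiny synonym table are all made up for illustration.

```python
def keyword_match(query, document):
    """Return True only if every word of the query appears in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())


# Hypothetical synonym table; a real one would need to be far larger.
SYNONYMS = {"bubbler": ["water fountain", "drinking fountain"]}


def expand(query):
    """Return the query plus variants with known synonyms swapped in."""
    query = query.lower()
    variants = [query]
    for term, alternatives in SYNONYMS.items():
        if term in query.split():
            variants += [query.replace(term, alt) for alt in alternatives]
    return variants


# Made-up page text that uses the other regional term.
page = "Where to find a water fountain in the park"

plain_hit = keyword_match("bubbler park", page)
expanded_hit = any(keyword_match(q, page) for q in expand("bubbler park"))

print(plain_hit)     # False: the page never uses the word "bubbler"
print(expanded_hit)  # True: "water fountain park" matches
```

A searcher can do by hand what the synonym table does here: try the query again with the other word for the same thing.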

Sometimes portions of the Web remain invisible because they only surface for money.

John December, president of Milwaukee's December Communications, says, "There is proprietary, for-sale content, or parts of the Web that are accessible by subscription like LexisNexis. People don't realize that not everything is free on the Web. They're shocked when they find out."

But the depths of the Web remain invisible largely because of the way search engines work.

They get their information two ways. First, a smattering of sites are indexed because authors submit their own Web pages for listings.

But search engines find most of their material by "crawling" or "spidering" the documents, following one hypertext link to another, like ripples in a pond. These ripples can obscure the waters for a searcher by providing too many indiscriminate results, sometimes returning hits in the millions. Some Web designers even manipulate the system simply by invisibly coding one word over and over again on a page to get better play in the search-results listing.

The age of a document is also a factor in its visibility. New documents are found through links from older documents, and older pages with a larger number of references have a far greater chance of being indexed by a search engine.
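A toy model makes the link-following idea concrete. The Python sketch below assumes a made-up six-page web held in memory (the page names and the LINKS table are hypothetical) and crawls it breadth-first from a single starting page. It is an illustration of the general technique, not any search engine's actual software, which, as noted later, is a guarded secret.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "home": ["news", "catalog"],
    "news": ["home", "archive"],
    "catalog": ["home"],
    "archive": [],
    "database-record": [],  # reachable only by filling in a search form
    "new-page": ["home"],   # links out, but nothing links to it yet
}


def crawl(seed, links):
    """Breadth-first crawl: return pages in the order they are discovered."""
    seen = {seed}
    order = []
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def inbound_counts(links):
    """Count how many other pages link to each page."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


indexed = crawl("home", LINKS)
hidden = sorted(set(LINKS) - set(indexed))
counts = inbound_counts(LINKS)

print(indexed)  # ['home', 'news', 'catalog', 'archive']
print(hidden)   # ['database-record', 'new-page']
print(counts["home"], counts["new-page"])  # 3 0
```

In this sketch the crawler never reaches the database record or the brand-new page, because no page it visits links to either one. The heavily referenced home page, by contrast, can be reached from almost anywhere.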

Also, because millions of Web pages exist, new pages wait a long time before conventional search engines record them. BrightPlanet estimates that search-engine listings are often as much as three or four months out of date.

Some Web experts think the answer is to put a human face on the search engine robot. At Yahoo!, human editors can do indexing by hand, checking subject lists for obvious errors like putting Yogi Bear under Yogi Berra. About.com bills itself as "The Human Internet," with a system of human "guides" who gather and create hundreds of thousands of pages about everything from cell phones to alcoholism. But this type of hand indexing can't match the speed of automated search engines.

Finding the goods

So what do searchers need to get at the Deep Web? The answers are somewhere between a whisk broom and dynamite.

First the big blast: Step away from one-size-fits-all search-engine techniques.

"The search engine itself has poor documentation. It doesn't give information. So many people just sit down at one search engine and think it's as good as any other," says December, who teaches about the Net at the University of Wisconsin-Milwaukee.

How different search engines work is sometimes "a tightly guarded software secret. Exactly how it happens they don't want us to know," says December. But it is sometimes possible to pick a search engine that will delve more deeply into the right Web tunnel. For instance, Google recently added the ability to launch searches for .pdf files.

Next, know what it is you don't know. Beverley Pickering-Reyna, adjunct instructor in the School of Information Studies at UWM, gives her students tips on how to formulate questions for searching the Web.

"Be as specific as you can be in designing your search query," she advises. "If you are looking for a specific database, look for key terms that will define that database."

Some new software may make your mining operation easier -- for a price. LexiBot is a downloadable program designed by BrightPlanet that gathers information from 600 search sites and databases simultaneously (cost: $89.95).

But searchers should also collect what Price calls their own "tool belt of things to go to depending on your problems. You need to have a range of resources, a reference room of databases." In other words, bookmark early and often whenever you find valuable sites. Check out Price's tool belt at gwis2.circ.gwu.edu/~gprice/listof.htm.

Expert advice

Two groups of Web experts are also making it their business to provide searchers with information on Deep Web sources.

Calishain's Researchbuzz.com (www.researchbuzz.com) chronicles search engines, new data collections ("Online Legal Information in Denmark, Norway and Sweden"), browser software and other Deep Web mining tools that "a research librarian, journalist, educator and others would find helpful, from the perspective of someone who's really going to use it."

And in the early '90s at the University of Wisconsin-Madison, the Internet Scout Project (scout.cs.wisc.edu) was started with funding from the National Science Foundation to "inform the higher education and research communities about resources on the Internet," says Scout Director Rachael Bower. The project posts detailed reports each Friday to keep searchers, including the general public, "up to speed" on Deep Web sources.
