Chapter 2: Search-engine indexing

The WorldWideWeb (``Web'') grew from a collection (network) of computers that could ask each other for files. The files were formatted in a standard format, html, and the files were requested and downloaded with an exchange of communications (a protocol), http, based on ftp (file transfer protocol).

If you knew the name of the file (e.g., www.mtv.com), you could ask for a copy of it to display on your computer. But there was no way to ask, ``find me MTV's website.''

The Alta-Vista search engine was a program that you could ask to find a web page. AltaVista found and downloaded as many html files as it could find. It extracted the words from each file and saved them in a monstrous table that a user could search at www.altavista.com. When you asked to find the page for, say MTV, Alta Vista searched its table for the web pages that mentioned MTV.

See the figures in MacCormick's book. Here's the one from Page 16, which shows three Web pages and the table constructed from them:

 
(Recall that the entries 1-2, 3-2 for cat signifies that the word occurs in Page 1, position 2, and in Page 3, position 2.) Chapter 2 addresses the issue, ``how do we use the word table to find web pages?'' Read the chapter, and concentrate on Pages 12-19. An example: find all web pages that have the precise phrase, "cat sat":
  1. look up cat in the table and extract the page-position list, 1-2, 3-2.
  2. look up sat in the table and extract the page-position list, 1-3, 3-7.
  3. for each entry in cat's list, call it #page-#num, see if there is a matching entry, #page-#(num+1) in sat's list:

    Indeed, 1-2 matches with 1-3, but 3-2 has no match.

  4. collect and return the page numbers that have a match, here, only Page 1.

Questions for discussion

  1. How does Alta-Vista find the Web pages to build its word table? (Hint: .com web-domain names are saved at a site managed by the Verisign company. Start from there.)
  2. Should the placement of words in the Web page impact the word table (e.g., MTV appears in the title bar or in a header line versus MTV appearing in the text body or in a link address)?
  3. Why was Alta Vista built? That is, what problem did it solve? What do we require that it do?

    Every computer program is built to satisfy a requirement or a specification. The more precisely stated the requirement, the easier it is to understand how to build a program that meets it.

    State, as precisely as you can: what is the requirement that Alta Vista attempts to satisfy? (See the very end of this page for a possible answer.)

From examples to algorithms

The "cat sat" example stated above is called a use-case --- a case that shows how the computer program should perform the example.

It is critical that we generalize from use-cases to a general pattern --- an algorithm --- for looking up Web pages that match an arbitrary phrase (and not just "cat sat"). Here is an algorithm that works for a two-word search phrase:

INPUT: a two-word phrase of form,  "Word1 Word2"
OUTPUT: an  AnswerList,  a list of the numbers of the Web pages that contain the phrase

ALGORITHM:
(0.  The  AnswerList  starts with nothing saved in it.)

1. Extract from the table the list of  Page#-Position#  pairs for  Word1.  Call it  List1.

2. Extract from the table the list of  Page#-Position#  pairs for  Word2.  Call it  List2.

3. For each  page#-pos#  pair in List1,
      search  List2  to see if there is a pair,  page#-(pos#+1).
        (that is, the page# is the same and pos# differs by +1)
      if yes, then include  page#  in the  AnswerList.
      if no, then ignore this pair.

4. Announce all the page numbers in  AnswerList
The algorithm must be "precise enough" that we can explain it to a computer in a "computer language" (C, Java, Python, etc.)

Exercises

  1. Go to Page 16 of MacCormick and use the word table there (it's the same one as that pictured above) to calculate the results of these three use-cases:
    cat stood
    "cat stood"
    cat OR stood
    Write down, in words, how you calculated each use-case.
  2. Go to https://support.google.com/websearch/answer/2466433 and find at least two other operations like OR.
  3. Generalize the algorithm in the previous narrative so that it searches for a phrase of one or more words, that is, the input is a phrase of form, "Word1 Word2 Word3 ... Word_n", where n > 0. Try your algorithm on these use cases:
    "on the mat"
    "a cat"
    "mat"
  4. Modify the algorithm in the previous narrative so that it searches for multiple words, not necessarily as a phrase. The input pattern is Word1 Word2 ... Word_n. Try the example on some use-cases, like
    mat the
    a cat
    elephant
  5. Modify the algorithm so that it searches for multiple possible words, Word1 OR Word2 OR ... OR Word_n. Try the example on some use-cases.
  6. Do you remember what sets are, and do you remember the set operations union and intersection? If so, write the answers to the previous two questions in terms of sets, union, and intersection.
  7. How does Alta Vista build the word table from the Web pages that it has collected? As best you can, complete this algorithm whose requirement is to build the word table:
    INPUT: a list of Web pages
    OUTPUT: a table that contains all the words in all the Web pages...
    
    ALGORITHM:
    (0. Table starts empty)
    1. for each Web page, call it Page#P, in the list of pages,
          for each word, W, within Page#P,
              ... 
    2. Return the  Table.
    
  8. The word table on Page 16 is shown as it would be written on a two-dimensional sheet of paper, but computers rarely store information in this format. The choice of format for a table is a fundamental topic of computing (called data structures).

    Here is a "spelling tree" format for the table:

    
    
    Explain how to use the spelling tree to conduct the search for cat stood: write down the one-by-one steps a computer uses to "read" the relevant "branches"/"links" in the "tree".

    Next, add the contents of this web page to the above spelling tree:

    PAGE 4:
    can my cat stand up
    

    In your data structures course, you will learn how to use pointers/handles to build spelling trees that grow inside computer storage.



An answer to the earlier discussion question: The requirement for a Web-search program might be stated as, ``build a program that lets its user enter a phrase and then tells the user a list of the Web pages that are relevant to the phrase.''

But what do we mean by ``relevant''? Alta Vista interprets ``relevant'' as ``mentions the phrase.'' But there should be better definitions, yes? How we define the term, ``relevant,'' leads us to how to build the program....