RuSSIR 2014     Summer School 2014

Document Analysis and Retrieval in Scientific Digital Libraries


  • Welcome to the RuSSIR 2014 course!


Instructors: Sujatha Das G., Cornelia Caragea, Xiaoli Li, C. Lee Giles
Contact: Sujatha at: gsdas [[at]] cse [[dot]] psu [[dot]] edu
Cornelia at: ccaragea [[at]] unt [[dot]] edu
Relevance: This course is aimed at exposing the RuSSIR attendees to the challenges involved in designing and implementing the back-end tasks of a large-scale IR system. Machine learning techniques are studied as an alternative to rule-based and heuristic approaches traditionally adopted in domain-specific IR applications. This hands-on course provides complementary practical experience for RuSSIR attendees familiar with IR problems, models and concepts.
Description: We discuss the application of machine learning techniques in large-scale IR systems, using digital libraries as representative systems. Digital library portals provide various IR applications for focused repositories of digital objects such as documents, video, and images. This course is based on our experience with document processing, retrieval and analysis tasks in CiteSeerX, a digital library portal for scientific documents in Computer Science and related areas. We discuss various topics including: web crawling; document classification; content analysis with topic modeling tools; metadata extraction using sequential labeling; and ranking algorithms such as PageRank and HITS used in social and information network analysis. Each self-contained course lecture has three parts: (1) a review of related IR concepts and models, (2) a presentation of sample state-of-the-art applications illustrating the discussed theory, and (3) a guided, hands-on exercise for attendees using publicly available IR and data mining tools on large document collections obtained from well-known digital library portals.
Prerequisites: Course attendees are expected to have basic knowledge of Information Retrieval and Machine Learning.


Recommended Textbooks for Machine Learning

  • Pattern Recognition and Machine Learning, by Christopher Bishop.
  • Machine Learning, by Tom Mitchell.
  • The Elements of Statistical Learning: Data Mining, Inference and Prediction, by Trevor Hastie, Robert Tibshirani, Jerome Friedman (available online at:

Recommended Textbooks for Information Retrieval

  • Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (available online at:
  • Mining the Web: Discovering Knowledge from Hypertext, by Soumen Chakrabarti.
  • Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.