RuSSIR 2014 Summer School

Document Analysis and Retrieval in Scientific Digital Libraries

Schedule (tentative) and Class Notes

Date | Lecture | Description and Reading Material | NB
08/18/2014 | Introduction: Machine Learning in Digital Libraries [slides]
  • Machine Learning: Basics
  • Information Retrieval Systems: Basics
  • ML in a practical IR system (CiteSeerX)
  • Crawling the Web
  • Reading:
    • Manning et al.: Chapter 20.
    • Tutorial on "Document Analysis and Retrieval Tasks in Scientific Digital Libraries" by Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, C. Lee Giles [pdf]
-
08/19/2014 | Text Classification in Digital Libraries [slides]
  • Generative vs. Discriminative Classifiers
  • The "Bag of Words" Representation
  • Classification Tasks in CiteSeerX [slides]
  • Reading:
    • Manning et al.: Chapters 13 and 15.
    • Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles. "Researcher homepage classification using unlabeled data." In: Proceedings of the 22nd International World Wide Web Conference (WWW 2013), Rio de Janeiro, Brazil, 2013. [pdf]
    • Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. [pdf]
    • Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das Gollapalli, Madian Khabsa, Pradeep Teregowda, and C. Lee Giles. "Automatic Identification of Research Articles from Crawled Documents." In: Proceedings of the WSDM Workshop: Web-Scale Classification: Classifying Big Data from the Web (WSC 2014), New York City, 2014. [pdf]
-
08/20/2014 | Data Clustering [slides]
  • Recap [slides]
  • Types of Clustering: Hierarchical vs. Flat
  • The K-means algorithm
  • Clustering using Topic Models [slides]
  • Reading:
    • Manning et al.: Chapters 16, 17 and 18.
    • Sujatha Das Gollapalli, C. Lee Giles, Prasenjit Mitra, and Cornelia Caragea. "On Identifying Academic Homepages for Digital Libraries." In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011), Ottawa, Ontario, Canada, 2011. [pdf]
    • Saurabh Kataria, Prasenjit Mitra, Sumit Bhatia. "Utilizing Context in Generative Bayesian Models for Linked Corpus." In: Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), Atlanta, Georgia, USA, 2010. [pdf]
    • Saurabh Kataria, Prasenjit Mitra, Cornelia Caragea, and C. Lee Giles. "Context Sensitive Topic Models for Author Influence in Document Networks." In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, 2011. [pdf]
-
08/21/2014 | Information Extraction (IE) [slides]
  • Discriminative vs. Generative Models [slides]
  • IE Tasks: Named Entity Recognition, Sequence Labeling, Relation Extraction
  • IE Tasks in CiteSeerX [slides]
  • Reading:
    • Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha, Zhenyue Zhang, and Edward A. Fox. "Automatic document metadata extraction using support vector machines." In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2003), Houston, Texas, USA, 2003. [pdf]
    • Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles. "Extracting Researcher Metadata with Labeled Features." In: Proceedings of the 2014 SIAM International Conference on Data Mining (SDM 2014), Philadelphia, Pennsylvania, USA, 2014. [pdf]
    • Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. [pdf]
-
08/22/2014 | Link Analysis in Document Networks [slides]
  • Searching the Web
  • The PageRank Algorithm
  • Link Analysis Applications in CiteSeerX [slides]
  • Reading:
    • Manning et al.: Chapter 21.
    • Sujatha Das Gollapalli, Prasenjit Mitra, and C. Lee Giles. "Ranking Authors in Digital Libraries." In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011), Ottawa, Ontario, Canada, 2011. [pdf]
    • Sujatha Das Gollapalli and Cornelia Caragea. "Extracting Keyphrases from Research Papers using Citation Networks." In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Quebec City, Quebec, Canada, 2014. [pdf]
    • Cornelia Caragea, Florin Bulgarov, Andreea Godea, and Sujatha Das Gollapalli. "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 2014. [pdf]
-