08/18/2014 | Introduction: Machine Learning in Digital Libraries [slides] |
- Machine Learning: Basics
- Information Retrieval Systems: Basics
- ML in a practical IR system (CiteSeerX)
- Crawling the Web
- Reading:
- Manning et al.: Chapter 20.
- Tutorial on "Document Analysis and Retrieval Tasks in Scientific Digital Libraries" by Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, C. Lee Giles [pdf]
| - |
08/19/2014 | Text Classification in Digital Libraries [slides] |
- Generative vs. Discriminative Classifiers
- The "Bag of Words" Representation
- Classification Tasks in CiteSeerX [slides]
- Reading:
- Manning et al.: Chapters 13 and 15.
- Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles. "Researcher homepage classification using unlabeled data." In: Proceedings of the 22nd International World Wide Web Conference (WWW 2013), Rio de Janeiro, Brazil, 2013. [pdf]
- Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. [pdf]
- Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das Gollapalli, Madian Khabsa, Pradeep Teregowda, and C. Lee Giles. "Automatic Identification of Research Articles from Crawled Documents." In: Proceedings of the WSDM Workshop on Web-Scale Classification: Classifying Big Data from the Web (WSC 2014), New York City, 2014. [pdf]
| - |
08/20/2014 | Data Clustering [slides] |
- Recap [slides]
- Types of Clustering: Hierarchical vs. Flat
- The K-means algorithm
- Clustering using Topic Models [slides]
- Reading:
- Manning et al.: Chapters 16, 17 and 18.
- Sujatha Das Gollapalli, C. Lee Giles, Prasenjit Mitra, and Cornelia Caragea. "On Identifying Academic Homepages for Digital Libraries." In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011), Ottawa, Ontario, Canada, 2011. [pdf]
- Saurabh Kataria, Prasenjit Mitra, Sumit Bhatia. "Utilizing Context in Generative Bayesian Models for Linked Corpus." In: Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), Atlanta, Georgia, USA, 2010. [pdf]
- Saurabh Kataria, Prasenjit Mitra, Cornelia Caragea, and C. Lee Giles. "Context Sensitive Topic Models for Author Influence in Document Networks." In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, 2011. [pdf]
| - |
08/21/2014 | Information Extraction (IE) [slides] |
- Discriminative vs. Generative Models [slides]
- IE Tasks: Named Entity Recognition, Sequence Labeling, Relation Extraction
- IE Tasks in CiteSeerX [slides]
- Reading:
- Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha, Zhenyue Zhang, and Edward A. Fox. "Automatic document metadata extraction using support vector machines." In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2003), Houston, Texas, USA, 2003. [pdf]
- Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles. "Extracting Researcher Metadata with Labeled Features." In: Proceedings of the 2014 SIAM International Conference on Data Mining (SDM 2014), Philadelphia, Pennsylvania, USA, 2014. [pdf]
- Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. [pdf]
| - |
08/22/2014 | Link Analysis in Document Networks [slides] |
- Searching the Web
- The PageRank Algorithm
- Link Analysis Applications in CiteSeerX [slides]
- Reading:
- Manning et al.: Chapter 21.
- Sujatha Das Gollapalli, Prasenjit Mitra, and C. Lee Giles. "Ranking Authors in Digital Libraries." In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011), Ottawa, Ontario, Canada, 2011. [pdf]
- Sujatha Das Gollapalli and Cornelia Caragea. "Extracting Keyphrases from Research Papers using Citation Networks." In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Quebec City, Quebec, Canada, 2014. [pdf]
- Cornelia Caragea, Florin Bulgarov, Andreea Godea, and Sujatha Das Gollapalli. "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 2014. [pdf]
| - |