Keyphrase Extraction in Document Networks
Keyphrases concisely describe a document using a small set of phrases. For example, the keyphrases "social networks" and "interest targeting" quickly provide a high-level topic description (i.e., a summary) of a document about targeting user interests to recommend services such as products and news in the context of social networks. Given today’s very large document collections, keyphrases are important not only for summarizing a document, but also for searching and retrieving relevant information. However, keyphrases are not always directly available; they must be gleaned from the many details in a document. This project addresses the problem of automatic keyphrase extraction from research papers, which enable the sharing and dissemination of scientific discoveries. The goal of the project is to explore accurate approaches that use document networks to automatically discover and extract keyphrases, helping readers handle and digest more information in less time in the era of "big data."
Although much research has been done on automatic keyphrase extraction, no previous approaches have captured the impact of documents on one another via the citation relation that connects documents in a network. This project will investigate models that take into account the links between citing and cited documents in a document network, and will explore various qualitative and quantitative aspects of the question: "What are the key phrases or concepts in a document?" We will design and develop scalable iterative algorithms that capture different aspects of documents (e.g., topics or concepts), as well as the impact of one document on another (e.g., influence or topic evolution) in a document network. The results of this research will feed directly into the CiteSeerX digital library.
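To make the idea of an iterative, citation-aware scoring scheme concrete, here is a minimal sketch. It is not the project's algorithm: it is a generic PageRank-style ranking over a word co-occurrence graph, where the graph is built from the target document together with the citation contexts that mention it. The function names (`build_graph`, `rank_words`), the window size, the damping factor, and the context weight of 0.5 are all assumptions chosen for illustration, and the sketch omits steps a real system would need, such as stopword filtering and assembling ranked words into phrases.

```python
from collections import defaultdict


def build_graph(token_lists, weights, window=2):
    """Build an undirected, weighted word co-occurrence graph.

    token_lists: one token list per text (the target document first,
                 then any citation contexts); hypothetical input format.
    weights:     one weight per text, so that citation contexts can
                 count more or less than the document itself.
    window:      how many following words each word is linked to.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens, weight in zip(token_lists, weights):
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    graph[word][other] += weight
                    graph[other][word] += weight
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Score words with a fixed number of PageRank-style updates."""
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    # Total edge weight per node, used to normalize outgoing score.
    strength = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            incoming = sum(
                scores[m] * w / strength[m] for m, w in graph[n].items()
            )
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


# Toy usage: one document plus one citation context weighted at 0.5.
document = "keyphrase extraction from research papers using citation networks".split()
contexts = ["citation networks improve keyphrase extraction".split()]
graph = build_graph([document] + contexts, [1.0, 0.5])
scores = rank_words(graph)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Weighting each text separately is one simple way to let cited and citing material influence the ranking to different degrees; other designs, such as biasing the random jump toward certain words, are equally plausible.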
Publications
Cornelia Caragea and Corina Florescu. "Venue Classification of Research Papers in Scholarly Digital Libraries." In: Proceedings of the 22nd International Conference on Theory and Practice of Digital Libraries (TPDL 2018), Porto, Portugal, 2018. [pdf]
Gilad Katz, Cornelia Caragea, and Asaf Shabtai. "Vertical Ensemble Co-Training for Text Classification." In: ACM Transactions on Intelligent Systems and Technology (ACM TIST 2018), 2018. [Accepted.]
Corina Florescu and Cornelia Caragea. "PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents." In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada, 2017. [pdf]
Corina Florescu and Cornelia Caragea. "A Position-Biased PageRank Algorithm for Keyphrase Extraction." In: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), Student Abstract and Poster Program (SA-17), San Francisco, California, USA, 2017. [pdf] [poster]
Corina Florescu and Cornelia Caragea. "A New Scheme for Scoring Phrases in Unsupervised Keyphrase Extraction." In: Proceedings of the 39th European Conference on Information Retrieval (ECIR 2017), Aberdeen, Scotland, UK, 2017. [pdf]
Cornelia Caragea. "Identifying Descriptive Keyphrases from Scholarly Big Data." In: The Workshop “Artificial Intelligence for Data Science,” Co-located with the Neural Information Processing Systems Conference (AI4DataSci 2016), Barcelona, Spain, 2016. [pdf]
Corina Florescu and Cornelia Caragea. "An Unsupervised Algorithm for Keyphrase Extraction." In: The Workshop for Women in Machine Learning, Co-located with the Neural Information Processing Systems Conference (WiML 2016), Barcelona, Spain, 2016. [pdf]
Lucas Sterckx, Cornelia Caragea, Thomas Demeester, and Chris Develder. "Supervised Keyphrase Extraction as Positive Unlabeled Learning." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, 2016. [pdf]
Cornelia Caragea, Jian Wu, Sujatha Das Gollapalli, and C. Lee Giles. "Document Type Classification in Online Digital Libraries." In: Proceedings of the Twenty-Eighth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2016), Phoenix, AZ, USA, 2016. [pdf] [slides]
Sujatha Das Gollapalli, Krutarth Patel, and Cornelia Caragea. "A Search/Crawl Framework for Automatically Acquiring Scientific Documents." CoRR, 2016. [pdf] [arXiv]
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Suppawong Tuarob, Alexander Ororbia, Douglas Jordan, Prasenjit Mitra and C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." Artificial Intelligence Magazine (AI Magazine), 36(3): 35-48, 2015.
Cornelia Caragea, Florin Bulgarov, and Rada Mihalcea. "Co-Training for Topic Classification of Scholarly Data." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, 2015. [pdf] [slides]
Florin Bulgarov and Cornelia Caragea. "A Comparison of Supervised Keyphrase Extraction Models." In: The International World Wide Web Conference (WWW 2015), Poster Program, Florence, Italy, 2015. [pdf] [code and data]
Cornelia Caragea, Florin Bulgarov, Andreea Godea, and Sujatha Das Gollapalli. "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach." (Using citation contexts in a supervised approach to improve keyphrase extraction.) In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 2014. [pdf] [poster] [code and data]
Sujatha Das Gollapalli and Cornelia Caragea. "Extracting Keyphrases from Research Papers using Citation Networks." (Using citation contexts in an unsupervised approach to improve keyphrase extraction.) In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Quebec City, Quebec, Canada, 2014. Full Oral Presentation. [pdf] [slides] [code and data]
Shibamouli Lahiri, Sagnik Ray Choudhury, and Cornelia Caragea. "Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks." (Using network centrality measures to extract keywords and keyphrases in documents.) CoRR, 2014. [pdf] [arXiv]
Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, C. Lee Giles. "PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search." In: Proceedings of the 8th International Conference on Knowledge Capture (K-Cap 2015), Palisades, NY, USA, 2015.
Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra and C. Lee Giles. "Improving Researcher Homepage Classification with Unlabeled Data." To appear in: ACM Transactions on the Web (ACM TWeb), 2015.
Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, and C. Lee Giles. Tutorial on "Document Analysis and Retrieval Tasks in Scientific Digital Libraries." In: Proceedings of the 8th Russian Summer School in Information Retrieval (RuSSIR 2014), Nizhny Novgorod, Russia, 2014. [pdf]
Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Pablo Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. [pdf]
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander Ororbia, Douglas Jordan, and C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." In: Proceedings of the 26th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2014), co-located with AAAI 2014, Quebec City, Quebec, Canada, 2014. [pdf]
Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das Gollapalli, Madian Khabsa, Pradeep Teregowda, and C. Lee Giles. "Automatic Identification of Research Articles from Crawled Documents." In: Proceedings of the Workshop: Web-Scale Classification: Classifying Big Data from the Web, co-located with WSDM (WSC 2014), New York, 2014. [pdf] [slides]
Invited Talks and Presentations
- Learning to Extract Descriptive Keyphrases from Scholarly Big Data. Invited Speaker at the IEEE MetroCon, Arlington Convention Center, TX, October 26, 2017.
- Learning to Extract Descriptive Keyphrases from Scholarly Big Data. Invited Talk at Southern Methodist University (SMU), September 29, 2017.
- Learning to Extract Descriptive Keyphrases from Scholarly Big Data at the First Conference on Recent Advances in Artificial Intelligence (RAAI-17), Bucharest, Romania, June 19, 2017.
- Keyphrase Extraction in Citation Networks: How do Citation Contexts Help? at the University of the Andes, Bogota, Colombia, June 15, 2016.
- Keyphrase Extraction for Scholarly Big Data at the MetroCon, Discovery Through Engineering, Arlington, Texas, October 22, 2015. [MetroCon 2015 Program].
- Keyphrase Extraction for Scholarly Big Data at the Big Scholarly Data: Birds of a Feather Session, co-located with Microsoft Research Faculty Summit, Redmond, WA, July 10, 2015.
- Extracting Keyphrases from Research Papers using Citation Networks at the Researcher Luncheon and Poster Forum at the Fort Worth Museum of Science and History in the Research and Learning Center, Fort Worth, TX, March 7, 2015. [Picture at the Museum].
- Keyphrase Extraction in Citation Networks: How do Citation Contexts Help?, University of Texas at Austin, Austin, Texas, February 13, 2015.
- Keyphrase Extraction in Citation Networks: How do Citation Contexts Help?, University of Michigan, Ann Arbor, Ann Arbor, Michigan, November 21, 2014.
- Big Data and its Implications for Research Methodology and Funding, University of North Texas, Denton, at TARDIS - The Advances in Research DesIgns Symposium, Denton, Texas, November 7, 2014.
International Summer Schools
"Knowledge Discovery in Social and Information Networks." The University of the Andes, Bogota, Colombia, Summer 2018, June 05-19, 2018. [website]
"Knowledge Discovery in Social and Information Networks." The University of the Andes, Bogota, Colombia, Summer 2016, June 09-24, 2016. [website]
"RuSSIR 2014: Document Analysis and Retrieval in Scientific Digital Libraries: Case studies in applying Machine Learning for Information Retrieval." The 8th Russian Summer School in Information Retrieval (RuSSIR 2014), Summer 2014. [website]
Related Workshops
- Cornelia Caragea, Madian Khabsa, Sujatha Das Gollapalli, C. Lee Giles, Alex Wade. "The IJCAI-16 Workshop on Scholarly Big Data: AI Perspectives, Challenges and Ideas." Presented at: The International Joint Conference on Artificial Intelligence 2016 (IJCAI 2016), July 14-15, 2016, New York City, USA. [website]
- Cornelia Caragea, C. Lee Giles, Alex Wade, Doina Caragea, Vu Ha, Madian Khabsa, Irwin King, Jie Tang. "The AAAI-16 International Workshop on Scholarly Big Data: AI Perspectives, Challenges and Ideas." Presented at: The Association for the Advancement of Artificial Intelligence 2016 (AAAI 2016), 12-17 February 2016, Phoenix, AZ, USA. [website]
- Sujatha Das Gollapalli, Cornelia Caragea, C. Lee Giles, Xiaoli Li. "The ACL-15 International Workshop on Novel Computational Approaches to Keyphrase Extraction." Presented at: The 53rd Annual Meeting of the Association for Computational Linguistics 2015 (ACL 2015), 30-31 July 2015, Beijing, China. [website]
- Cornelia Caragea, C. Lee Giles. "The AAAI-15 International Workshop on Scholarly Big Data: AI Perspectives, Challenges and Ideas." Presented at: The Association for the Advancement of Artificial Intelligence 2015 (AAAI 2015), 25-26 January 2015, Austin, TX, USA. [website]
People
Faculty
- Cornelia Caragea, Main PI of the project
- C. Lee Giles, Co-PI at Penn State
Students
- Krutarth Patel, UNT
- Indish Cholleti, UNT
- Clement Cole, UNT
- Nathan Contreras, UNT
- Namchi Do, Trinity University, visiting student at UNT through the Distributed Research Experiences for Undergraduates (DREU)
- Lucas Sterckx, Visiting graduate student: PhD student at Ghent University - iMinds, Belgium
This research is supported by the National Science Foundation under awards #1423337 and #1422951. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.