From Data to Knowledge: Extracting and Utilizing Scholarly Knowledge Graphs
Knowledge bases today are central to the successful utilization of information available in the large and growing amounts of digital data on the Web. Such technologies have started to transform Web search from keyword matching to discovery, learning, and creativity, which are crucial to promoting the goal of knowledge discovery. Unfortunately, the search for information remains inherently difficult for significant portions of the Web such as the Scholarly Web, which contains many millions of scientific documents. For example, PubMed has over 20 million documents, whereas Google Scholar is estimated to have more than 100 million. Open-access digital libraries such as CiteSeerX, which acquire freely available research articles from the Web, are seeing their document collections grow as well. Despite notable advances by scholarly search portals, semantic search technologies that “understand” complex concepts and their relations and can systematically satisfy users’ intricate information needs are yet to be investigated on the Scholarly Web. The goal of this project is to design solutions that make information more accessible and comprehensible to Scholarly Web users in particular, and Web users in general, and help them discover knowledge more effectively and efficiently. The approach is to develop an integrated framework focused on the extraction and utilization of scholarly knowledge graphs in online scholarly environments. Educationally, this work will involve training graduate, undergraduate, and high-school students, particularly encouraging the participation of women and underrepresented groups in the research efforts; curriculum development and integration of research into courses taught by the PI; exposure of students to industry and international experiences; and education for the general public.
The project will target the following research objectives: (1) explore the construction of scholarly knowledge graphs that combine evidence from multiple resources in an open information extraction framework; (2) design and develop novel algorithms for the detection and analysis of interesting and previously unknown connections between concepts, in order to advance knowledge discovery on the Scholarly Web; and (3) investigate the utility of scholarly knowledge graphs in a question answering system. The results of this research will be integrated into the CiteSeerX digital library. The software, tools, and benchmark datasets developed during the course of this project will be made publicly available. All findings will be shared with the research community through publications in academic journals and presentations at Information Retrieval, Text Mining, and Natural Language Processing conferences. For further information, see the project web page: http://people.cs.ksu.edu/~ccaragea/skg.html.
Selected Publications related to Scientific Data Analysis
Corina Florescu and Cornelia Caragea. "A Position-Biased PageRank Algorithm for Keyphrase Extraction." In: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), Student Abstract and Poster Program (SA-17), San Francisco, California, USA, 2017. [pdf] [poster]
Corina Florescu and Cornelia Caragea. "A New Scheme for Scoring Phrases in Unsupervised Keyphrase Extraction." In: Proceedings of the 39th European Conference on Information Retrieval (ECIR 2017), Aberdeen, Scotland, UK, 2017. [pdf]
Cornelia Caragea. "Identifying Descriptive Keyphrases from Scholarly Big Data." In: The Workshop “Artificial Intelligence for Data Science,” Co-located with the Neural Information Processing Systems Conference (AI4DataSci 2016), Barcelona, Spain, 2016. [pdf]
Corina Florescu and Cornelia Caragea. "An Unsupervised Algorithm for Keyphrase Extraction." In: The Workshop for Women in Machine Learning, Co-located with the Neural Information Processing Systems Conference (WiML 2016), Barcelona, Spain, 2016. [pdf]
Lucas Sterckx, Cornelia Caragea, Thomas Demeester, and Chris Develder. "Supervised Keyphrase Extraction as Positive Unlabeled Learning." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, 2016. [pdf]
Cornelia Caragea, Jian Wu, Sujatha Das Gollapalli, and C. Lee Giles. "Document Type Classification in Online Digital Libraries." In: Proceedings of the Twenty-Eighth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2016), Phoenix, AZ, USA, 2016. [pdf] [slides]
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Suppawong Tuarob, Alexander Ororbia, Douglas Jordan, Prasenjit Mitra and C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." Artificial Intelligence Magazine (AI Magazine), 36(3): 35-48, 2015.
Cornelia Caragea, Florin Bulgarov, and Rada Mihalcea. "Co-Training for Topic Classification of Scholarly Data." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, 2015. [pdf] [slides]
Florin Bulgarov and Cornelia Caragea. "A Comparison of Supervised Keyphrase Extraction Models." In: The International World Wide Web Conference (WWW 2015), Poster Program, Florence, Italy, 2015. [pdf] [code and data]
Cornelia Caragea, Florin Bulgarov, Andreea Godea, and Sujatha Das Gollapalli. "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach." (Using citation contexts in a supervised approach to improve keyphrase extraction.) In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 2014. [pdf] [poster] [code and data]
Sujatha Das Gollapalli and Cornelia Caragea. "Extracting Keyphrases from Research Papers using Citation Networks." (Using citation contexts in an unsupervised approach to improve keyphrase extraction.) In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Quebec City, Quebec, Canada, 2014. Full Oral Presentation. [pdf] [slides] [code and data]
Shibamouli Lahiri, Sagnik Ray Choudhury, and Cornelia Caragea. "Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks." (Using network centrality measures to extract keywords and keyphrases in documents.) CoRR, 2014. [pdf] [arXiv]
Other CiteSeerX Related Publications
Sujatha Das Gollapalli, Krutarth Patel, and Cornelia Caragea. "A Search/Crawl Framework for Automatically Acquiring Scientific Documents." CoRR, 2016. [pdf] [arXiv]
Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, C. Lee Giles. "PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search." In: Proceedings of the 8th International Conference on Knowledge Capture (K-Cap 2015), Palisades, NY, USA, 2015.
Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra and C. Lee Giles. "Improving Researcher Homepage Classification with Unlabeled Data." To appear in: ACM Transactions on the Web (ACM TWeb), 2015.
Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, and C. Lee Giles. Tutorial on "Document Analysis and Retrieval Tasks in Scientific Digital Libraries." In: Proceedings of the 8th Russian Summer School in Information Retrieval (RuSSIR 2014), Nizhny Novgorod, Russia, 2014. [pdf]
Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Pablo Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. [pdf]
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander Ororbia, Douglas Jordan, and C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." In: Proceedings of the 26th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2014), co-located with AAAI 2014, Quebec City, Quebec, Canada, 2014. [pdf]
Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das Gollapalli, Madian Khabsa, Pradeep Teregowda, and C. Lee Giles. "Automatic Identification of Research Articles from Crawled Documents." In: Proceedings of the Workshop: Web-Scale Classification: Classifying Big Data from the Web, co-located with WSDM (WSC 2014), New York, 2014. [pdf] [slides]
Invited Talks and Presentations
- Keyphrase Extraction in Citation Networks: How do Citation Contexts Help? at the University of the Andes, Bogota, Colombia, June 15, 2016.
- Keyphrase Extraction for Scholarly Big Data at the MetroCon, Discovery Through Engineering, Arlington, Texas, October 22, 2015. [MetroCon 2015 Program].
- Keyphrase Extraction for Scholarly Big Data at the Big Scholarly Data: Birds of a Feather Session, co-located with Microsoft Research Faculty Summit, Redmond, WA, July 10, 2015.
- Extracting Keyphrases from Research Papers using Citation Networks at the Researcher Luncheon and Poster Forum at the Fort Worth Museum of Science and History in the Research and Learning Center, Fort Worth, TX, March 7, 2015. [Picture at the Museum].
- Keyphrase Extraction in Citation Networks: How do Citation Contexts Help?, University of Texas at Austin, Austin, Texas, February 13, 2015.
- Keyphrase Extraction in Citation Networks: How do Citation Contexts Help?, University of Michigan, Ann Arbor, Ann Arbor, Michigan, November 21, 2014.
- Big Data and its Implications for Research Methodology and Funding, University of North Texas, Denton, at TARDIS - The Advances in Research DesIgns Symposium, Denton, Texas, November 7, 2014.
International Summer Schools
"Knowledge Discovery in Social and Information Networks." The University of the Andes, Bogota, Colombia, Summer 2016, June 09-24, 2016. [website]
"RuSSIR 2014: Document Analysis and Retrieval in Scientific Digital Libraries: Case studies in applying Machine Learning for Information Retrieval." The 8th Russian Summer School in Information Retrieval (RuSSIR 2014), Summer 2014. [website]
Related Workshops
- Cornelia Caragea, Madian Khabsa, Sujatha Das Gollapalli, C. Lee Giles, Alex Wade. "The IJCAI-16 Workshop on Scholarly Big Data: AI Perspectives, Challenges and Ideas." Presented at: The International Joint Conference on Artificial Intelligence 2016 (IJCAI 2016), July 14-15, 2016, New York City, USA. [website]
- Cornelia Caragea, C. Lee Giles, Alex Wade, Doina Caragea, Vu Ha, Madian Khabsa, Irwin King, Jie Tang. "The AAAI-16 International Workshop on Scholarly Big Data: AI Perspectives, Challenges and Ideas." Presented at: The Association for the Advancement of Artificial Intelligence 2016 (AAAI 2016), 12-17 February 2016, Phoenix, AZ, USA. [website]
- Sujatha Das Gollapalli, Cornelia Caragea, C. Lee Giles, Xiaoli Li. "The ACL-15 International Workshop on Novel Computational Approaches to Keyphrase Extraction." Presented at: The 53rd Annual Meeting of the Association for Computational Linguistics 2015 (ACL 2015), 30-31 July 2015, Beijing, China. [website]
- Cornelia Caragea, C. Lee Giles. "The AAAI-15 International Workshop on Scholarly Big Data: AI Perspectives, Challenges and Ideas." Presented at: The Association for the Advancement of Artificial Intelligence 2015 (AAAI 2015), 25-26 January 2015, Austin, TX, USA. [website]
People
Faculty
- Cornelia Caragea, the PI of the project
Students involved in this research:
- Corina Florescu, UNT
- Krutarth Patel, UNT
- Indish Cholleti, UNT
- Clement Cole, UNT
- Nathan Contreras, UNT
- Namchi Do, Trinity University, visiting student at UNT through the Distributed Research Experiences for Undergraduates (DREU)
- Lucas Sterckx, Visiting graduate student: PhD student at Ghent University - iMinds, Belgium
This research is supported by the National Science Foundation.