CiteSeerX: A Scholarly Big Dataset
The dataset and code are available upon request by sending an email to ccaragea@ksu.edu.
The CiteSeerX digital library stores and indexes research articles in Computer Science and related fields that are found on the Web. Its main purpose is to make it easier for researchers worldwide to search for scientific information. CiteSeerX is growing rapidly. The figure below shows the growth in the number of documents crawled from the Web, the number of ingested documents, and the number of documents (research papers) indexed by CiteSeerX between 2008 and 2013. As the figure shows, the number of crawled documents increased from fewer than two million to over thirteen million, while the number of indexed papers increased from fewer than one million to almost three million.
CiteSeerX has proven to be a powerful resource in many data mining, machine learning, and information retrieval applications that use rich metadata such as titles, abstracts, authors, venues, and reference lists. Metadata extraction in CiteSeerX is performed using automated techniques. Although fairly accurate, these techniques still produce noisy metadata.
We have come to understand that data quality is essential in digital libraries so that data can be searched and retrieved in a timely fashion, in the appropriate form, for the appropriate person, and with the appropriate level of confidence. Data quality remains a challenge in CiteSeerX. In an effort to improve it, we designed an approach to building a scholarly big dataset, derived from CiteSeerX, that is substantially cleaner than the entire set. This dataset can benefit research projects in fields such as Machine Learning, Data Mining, Information Retrieval, and Natural Language Processing.
An example of a CiteSeerX metadata record from our dataset, corresponding to a paper by Mimno and McCallum, is available here (only one reference is shown to avoid clutter). In addition to the metadata available on the CiteSeerX page, the dataset provides citation contexts for cited papers, where a citation context is defined as a window of words surrounding a citation mention (i.e., n words on each side of the citation mention).
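The window-of-words definition of a citation context can be illustrated with a minimal Python sketch. This is not the extraction code used to build the dataset. It assumes whitespace tokenization and bracketed numeric citation markers such as "[3]"; the function name citation_contexts, the pattern, and the sample sentence are hypothetical and chosen only for illustration.

```python
import re

# Assumption: citation mentions look like "[3]" or "[12]". Real papers use
# many other citation styles that this simple pattern does not cover.
CITATION_PATTERN = re.compile(r"\[\d+\]")


def citation_contexts(text, n=5):
    """Return up to n words on each side of every citation mention in text."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    contexts = []
    for i, token in enumerate(tokens):
        if CITATION_PATTERN.search(token):
            contexts.append({
                "mention": token,
                "left": tokens[max(0, i - n):i],    # up to n words before
                "right": tokens[i + 1:i + 1 + n],   # up to n words after
            })
    return contexts


sample = "Topic models such as LDA [3] have been applied to citation networks."
for ctx in citation_contexts(sample, n=3):
    print(" ".join(ctx["left"]), "|", ctx["mention"], "|", " ".join(ctx["right"]))
```

With n=3, the sketch prints "such as LDA | [3] | have been applied" for the sample sentence. Near the start or end of a text the window is simply truncated, so a context may contain fewer than n words on one side.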
Representative Publications
Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernandez-Ramirez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset." In: Proceedings of the 36th European Conference on Information Retrieval (ECIR'14), Amsterdam, Netherlands, 2014. [pdf]
Other Related Publications
Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das Gollapalli, Madian Khabsa, Pradeep Teregowda, and C. Lee Giles. "Automatic Identification of Research Articles from Crawled Documents." In: Proceedings of the Workshop: Web-Scale Classification: Classifying Big Data from the Web, co-located with WSDM (WSC'14), New York City, 2014. [pdf] [slides]
Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra, and C. Lee Giles. "Can't see the forest for the trees? A citation recommendation system." In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL'13), Indianapolis, Indiana, USA, 2013. [pdf] [data]
Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, and C. Lee Giles. "Researcher Homepage Classification using Unlabeled Data." In: Proceedings of the 22nd International World Wide Web Conference (WWW'13), Rio de Janeiro, Brazil, 2013. [pdf]
Wenyi Huang, Saurabh Kataria, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles, and Lior Rokach. "Recommending Citations: Translating Papers into References." In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM'12), Maui, Hawaii, 2012. [pdf]
Saurabh Kataria, Prasenjit Mitra, Cornelia Caragea, and C. Lee Giles. "Context Sensitive Topic Models for Author Influence in Document Networks." In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11), Barcelona, Spain, 2011. [pdf]
Sujatha Das, C. Lee Giles, Prasenjit Mitra, and Cornelia Caragea. "On Identifying Academic Homepages for Digital Libraries." In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL'11), Ottawa, Canada, 2011. [pdf]