
CiteSeerX: A Scholarly Big Dataset

The dataset and code are available upon request by sending an email to ccaragea@ksu.edu.

The CiteSeerX digital library stores and indexes research articles in Computer Science and related fields that are found on the Web. Its main purpose is to make it easier for researchers worldwide to search for scientific information. CiteSeerX is growing rapidly. The figure below shows the increase between 2008 and 2013 in the number of documents crawled from the Web, the number of ingested documents, and the number of documents (research papers) indexed by CiteSeerX. Over this period, the number of crawled documents increased from fewer than two million to over thirteen million, whereas the number of indexed papers increased from fewer than one million to almost three million.

[Figure: number of crawled, ingested, and indexed documents in CiteSeerX, 2008 to 2013]

CiteSeerX has proven to be a powerful resource in many data mining, machine learning, and information retrieval applications that use rich metadata such as titles, abstracts, authors, venues, and reference lists. Metadata extraction in CiteSeerX is done using automated techniques. Although fairly accurate, these techniques still produce noisy metadata.

Data quality is essential in digital libraries so that data can be searched and retrieved in a timely fashion, in the appropriate form, by the appropriate person, and with the appropriate level of confidence. Data quality remains a challenge in CiteSeerX. To address it, we designed an approach to building a scholarly big dataset, derived from CiteSeerX, that is substantially cleaner than the full CiteSeerX collection. This dataset can benefit research projects in many fields, such as Machine Learning, Data Mining, Information Retrieval, and Natural Language Processing.

An example of a CiteSeerX metadata record from our dataset, corresponding to a paper by Mimno and McCallum, is available here (only one reference is shown to avoid clutter). In addition to the metadata available on the CiteSeerX page, the dataset provides citation contexts for cited papers, where a citation context is defined as a window of words surrounding a citation mention (i.e., n words on each side of the citation mention).
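
To make the definition of a citation context concrete, here is a minimal sketch of extracting a window of n words on each side of a citation mention. It is not the code used to build the dataset: the function name `citation_contexts`, the whitespace tokenization, the bracketed mention format `[3]`, and the default value of `n` are all assumptions made for illustration.

```python
def citation_contexts(text, mention, n=5):
    """Return (left, right) word windows around each occurrence of `mention`.

    Assumptions (illustrative only, not the dataset's actual pipeline):
    - words are obtained by splitting on whitespace;
    - `mention` appears in the text as its own whitespace-separated token(s).
    """
    tokens = text.split()
    mention_tokens = mention.split()
    m = len(mention_tokens)
    contexts = []
    for i in range(len(tokens) - m + 1):
        if tokens[i:i + m] == mention_tokens:
            # Up to n words before the mention (fewer near the start of the text).
            left = tokens[max(0, i - n):i]
            # Up to n words after the mention (fewer near the end of the text).
            right = tokens[i + m:i + m + n]
            contexts.append((left, right))
    return contexts


# Hypothetical sentence and citation marker, used only to show the call.
sentence = "Topic models such as LDA [3] have been widely applied to text"
for left, right in citation_contexts(sentence, "[3]", n=2):
    print(" ".join(left), "|", " ".join(right))
```

Splitting on whitespace keeps the sketch short. A real pipeline would likely need a proper tokenizer and a more robust way of locating citation mentions, for example author-year styles or markers attached to punctuation.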

Representative Publications

Other Related Publications