The DataBase

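pySciSci provides a standardized interface for working with several of the major datasets in the Science of Science, including OpenAlex, Microsoft Academic Graph (MAG), Web of Science (WoS), DBLP, American Physical Society (APS), and PubMed.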

The storage and processing frameworks are highly generalizable and can be extended to other databases not mentioned here. Please contribute an interface to your data on the GitHub project page!

Each dataset is accessed as a customized variant of the BibDataBase class, a container of pandas DataFrames that handles all data loading and pre-processing.

Currently, we provide direct data access only to DBLP and PubMed. All other data must be manually downloaded from the data provider before processing.

To facilitate data movement and lower memory overhead when the complete tables are not required, we pre-process the data tables into smaller chunks. When loading a table into memory, the user can either load the full table by referencing the table name as a database property, or specify one or more filters to load only a subset of the data.
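
The idea of chunked storage with optional filtering at load time can be illustrated with plain pandas. This is a minimal sketch, not pySciSci's own implementation: the helper names `write_chunks` and `load_chunks`, the CSV file layout, and the toy table are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy publication table (illustrative data only).
pub = pd.DataFrame({
    "PubId": [1, 2, 3, 4, 5, 6],
    "PubYear": [1999, 2001, 2005, 2010, 2015, 2020],
})


def write_chunks(df, directory, chunk_size):
    """Hypothetical helper: split df into numbered CSV chunk files."""
    directory = Path(directory)
    for i, start in enumerate(range(0, len(df), chunk_size)):
        chunk = df.iloc[start:start + chunk_size]
        chunk.to_csv(directory / f"pub{i}.csv", index=False)


def load_chunks(directory, keep=None):
    """Hypothetical helper: read chunks one at a time.

    If `keep` is given, it is called on each chunk and must return a
    boolean mask, so rows are discarded before the chunks are combined.
    """
    parts = []
    for path in sorted(Path(directory).glob("pub*.csv")):
        chunk = pd.read_csv(path)
        if keep is not None:
            chunk = chunk[keep(chunk)]
        if not chunk.empty:
            parts.append(chunk)
    return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()


with tempfile.TemporaryDirectory() as tmp:
    write_chunks(pub, tmp, chunk_size=2)
    # Load the complete table.
    full = load_chunks(tmp)
    # Load only publications from 2005 onward.
    recent = load_chunks(tmp, keep=lambda c: c["PubYear"] >= 2005)
```

Because the filter is applied chunk by chunk, the full table never has to sit in memory when only a subset is needed.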

Basic Data WorkFlow

Every dataset in pySciSci is first pre-processed into a standardized tabular format based around DataFrame objects.

First usage only:
  • Download Data

  • Preprocess Data

All other usages:
  • Apply Data Filter

  • Load Only the DataFrame you need

DataSet Examples

DataFrames

Each dataset is partitioned into several DataFrames containing information for different bibliometric objects that are accessed as properties of the BibDataBase:
  • pub: The DataFrame keeping publication information, including publication date, journal, title, etc. Each PubId occurs only once. Columns depend on the specific datasource.

  • author: The DataFrame keeping author names and personal information. Each AuthorId occurs only once. Columns depend on the specific datasource.

  • affiliation: The DataFrame keeping affiliation names and websites. Each AffiliationId occurs only once. Columns depend on the specific datasource.

  • journal: The DataFrame keeping journal names and websites. Each JournalId occurs only once. Columns depend on the specific datasource.

  • fieldinfo: The DataFrame keeping field names and levels. Each FieldId occurs only once. Columns depend on the specific datasource.

There are also DataFrames which contain edge lists linking the different bibliometric data objects:
  • pub2ref: The DataFrame linking publications to their references (or citations).

  • pub2field: The DataFrame linking publications to their fields.

  • paa: The DataFrame linking publications to authors to affiliations. Columns depend on the specific datasource.

Two additional DataFrames are created during processing to store the most commonly used citation data for future reference:
  • impact: The DataFrame linking publications to their citation counts. Columns depend on the specific datasource and processing options.

  • pub2refnoself: The DataFrame linking publications to their references where all self-citations are removed.
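
To make the relationship between the edge lists and the processed tables concrete, the sketch below derives an impact-like table and a self-citation-free reference list from toy data using plain pandas. It is not the pySciSci implementation. The column names `CitingPubId`, `CitedPubId`, and `CitationCount` are assumptions, and a self-citation is taken here to mean that the citing and cited publications share at least one author.

```python
import pandas as pd

# Toy edge list: each row says "CitingPubId cites CitedPubId".
pub2ref = pd.DataFrame({
    "CitingPubId": [2, 3, 3, 4],
    "CitedPubId": [1, 1, 2, 1],
})

# Toy publication-author table (affiliations omitted for brevity).
paa = pd.DataFrame({
    "PubId": [1, 2, 3, 4],
    "AuthorId": ["a", "a", "b", "c"],
})

# Impact-like table: number of citations received by each publication.
impact = (
    pub2ref.groupby("CitedPubId")
    .size()
    .rename("CitationCount")
    .reset_index()
    .rename(columns={"CitedPubId": "PubId"})
)

# Attach the authors of both ends of every citation.
citing = paa.rename(columns={"PubId": "CitingPubId", "AuthorId": "CitingAuthorId"})
cited = paa.rename(columns={"PubId": "CitedPubId", "AuthorId": "CitedAuthorId"})
edges = pub2ref.merge(citing, on="CitingPubId").merge(cited, on="CitedPubId")

# Citation pairs in which at least one author appears on both publications.
shared = edges["CitingAuthorId"] == edges["CitedAuthorId"]
self_pairs = edges.loc[shared, ["CitingPubId", "CitedPubId"]].drop_duplicates()

# Anti-join: keep only the citations that are not self-citations.
marked = pub2ref.merge(
    self_pairs, on=["CitingPubId", "CitedPubId"], how="left", indicator=True
)
pub2refnoself = marked.loc[
    marked["_merge"] == "left_only", ["CitingPubId", "CitedPubId"]
].reset_index(drop=True)
```

In this toy data, publication 2 cites publication 1 and both list author "a", so that single edge is the only one dropped.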

Filters

Some datasets contain a wide range of publication types, from many time periods and fields, spanning many different topics. Often, it is useful to focus only on a subset of the available data. For example, the MAG contains many different document types including journal publications, books, patents, and others.

Filters can be applied to the BibDataBase to ensure only a desired subset of the data is loaded into memory.

There are four default filters provided by pySciSci:
  • YearFilter

  • DocTypeFilter

  • FieldFilter

  • JournalFilter
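
The general idea behind combining several filters can be sketched in plain pandas. The classes below are hypothetical stand-ins written for this illustration. They are not the pySciSci filter classes and do not reflect their signatures, and the `DocType` labels are invented.

```python
import pandas as pd

# Toy publication table (illustrative data only).
pub = pd.DataFrame({
    "PubId": [1, 2, 3, 4, 5],
    "PubYear": [1995, 2003, 2008, 2008, 2021],
    "DocType": ["journal", "journal", "book", "journal", "patent"],
})


class RangeFilter:
    """Hypothetical filter: keep rows whose column lies in [min_value, max_value]."""

    def __init__(self, column, min_value, max_value):
        self.column = column
        self.min_value = min_value
        self.max_value = max_value

    def mask(self, df):
        return df[self.column].between(self.min_value, self.max_value)


class ValueFilter:
    """Hypothetical filter: keep rows whose column takes one of the allowed values."""

    def __init__(self, column, allowed):
        self.column = column
        self.allowed = list(allowed)

    def mask(self, df):
        return df[self.column].isin(self.allowed)


def apply_filters(df, filters):
    """Keep only the rows that pass every filter."""
    keep = pd.Series(True, index=df.index)
    for f in filters:
        keep &= f.mask(df)
    return df[keep]


# Journal publications from 2000 through 2010.
subset = apply_filters(
    pub,
    [RangeFilter("PubYear", 2000, 2010), ValueFilter("DocType", ["journal"])],
)
```

Each filter only produces a boolean mask, so filters on year, document type, field, or journal can be combined in any order.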

Property Dictionaries

Two property dictionaries are created for quick reference without the need to load the complete publication DataFrame:
  • pub2year: mapping between the PubId and PubYear

  • pub2doctype: mapping between the PubId and DocType (when available)
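
A small pandas sketch shows why such a dictionary is convenient: publication years can be attached to an edge list without merging in the full publication table. The toy data and the column names `CitingPubId`, `CitedPubId`, and `CitationAge` are assumptions for illustration, and the dictionary is built by hand here rather than obtained from pySciSci.

```python
import pandas as pd

# Toy publication table (illustrative data only).
pub = pd.DataFrame({
    "PubId": [1, 2, 3],
    "PubYear": [1990, 2000, 2010],
    "Title": ["First", "Second", "Third"],
})

# A pub2year-style mapping from PubId to PubYear.
pub2year = pub.set_index("PubId")["PubYear"].to_dict()

# Toy edge list: each row says "CitingPubId cites CitedPubId".
pub2ref = pd.DataFrame({
    "CitingPubId": [2, 3, 3],
    "CitedPubId": [1, 1, 2],
})

# Look up years through the dictionary instead of joining on the pub table.
pub2ref["CitingYear"] = pub2ref["CitingPubId"].map(pub2year)
pub2ref["CitedYear"] = pub2ref["CitedPubId"].map(pub2year)

# Years elapsed between the cited and the citing publication.
pub2ref["CitationAge"] = pub2ref["CitingYear"] - pub2ref["CitedYear"]
```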

BibDataBase

OpenAlex

For an initial example, see Getting Started With OpenAlex.

Microsoft Academic Graph (MAG)

For an initial example, see Getting Started With MAG.

Web of Science (WoS)

For an initial example, see Getting Started With WoS.

DBLP Computer Science Bibliography (DBLP)

For an initial example, see Getting Started With DBLP.

American Physical Society (APS)

For an initial example, see Getting Started With APS.

PubMed

For an initial example, see Getting Started With PubMed.

Custom DB

For an initial example, see Getting Started With Custom DB.