The DataBase¶
- pySciSci provides a standardized interface for working with several of the major datasets in the Science of Science, including:
Microsoft Academic Graph (MAG)
Clarivate Web of Science (WOS)
American Physics Society (APS)
The storage and processing frameworks are highly generalizable, and can be extended to other databases not mentioned here. Please contribute an interface to your data on the github project page!
Each dataset is accessed as a customized variant of the BibDataBase class, a container of python Pandas data frames pandas, that handles all data loading and pre-processing.
Currently, we provide direct data access only to DBLP and PubMed. All other data must be manually downloaded from the data provider before processing.
To facilitate data movement and lower memory overhead when the complete tables are not required, we pre-process the data tables into smaller chunks. When loading a table into memory, the user can quickly load the full table by referencing the table name as a database property or specify multiple filters to load only a subset of the data.
Basic Data WorkFlow¶
Every dataset in pySciSci is first pre-processed into a standardized tabular format based around DataFrame objects.
- First usage only:
Download Data
Preprocess Data
- All other usages:
Apply Data Filter
Load Only the DataFrame you need
DataSet Examples¶
DataFrames¶
- Each dataset is partitioned into several DataFrames containing information for different bibliometric objects that are accessed as properties of the BibDataBase:
pub: The DataFrame keeping publication information, including publication date, journal, title, etc.. Each PubId occurs only once. Columns depend on the specific datasource.
author: The DataFrame keeping author names and personal information. Each AuthorId occurs only once. Columns depend on the specific datasource.
affiliation: The DataFrame keeping affilations names and websites. Each AffiliationId occurs only once. Columns depend on the specific datasource.
journal: The DataFrame keeping journal names and websites. Each JournalId occurs only once. Columns depend on the specific datasource.
fieldinfo: The DataFrame keeping field names and levels. Each FieldId occurs only once. Columns depend on the specific datasource.
- There are also DataFrames which contain edge lists linking the different bibliometric data objects:
pub2ref: The DataFrame linking publications to their references (or citations).
pub2field: The DataFrame linking publications to their fields.
paa: The DataFrame linking publications to authors to affiliations. Columns depend on the specific datasource.
- And two processed DataFrames are created which contain the most popular citation counts for future reference:
impact: The DataFrame linking publications to their citations counts. Columns depend on the specific datasource and processing options.
pub2refnoself: The DataFrame linking publications to their references where all self-citations are removed.
Filters¶
Some datasets contain a wide-range of publication types, from many times, in many fields, spanning many different topics. Often, it is useful to focus only on a subset of the available data. For example, the MAG contains many different document types including journal publications, books, patents, and others.
Filters can be applied to the BibDataBase to ensure only a desired subset of the data is loaded into memory.
- There are four default filters provided by pySciSci:
YearFilter
DocTypeFilter
FieldFilter
JournalFilter
Property Dictionaries¶
- Two property dictionaries are created for quick reference without the need to load the complete publication dataframe:
pub2year: mapping between the PubId and PubYear
pub2doctype: mapping between the PubId and DocType (when available)
BibDataBase¶
OpenAlex¶
For initial example, see Getting Started With OpenAlex.
Microsoft Academic Graph (MAG)¶
For initial example, see Getting Started With MAG.
Web of Science (WoS)¶
For initial example, see Getting Started With WoS.
DBLP Computer Science Bibliography (DBLP)¶
For initial example, see Getting Started With DBLP.
American Physics Society (APS)¶
For initial example, see Getting Started With APS.
PubMed¶
For initial example, see Getting Started With PubMed.
Custom DB¶
For initial example, see Getting Started With Custom DB.