Wikidata5m is a million-scale knowledge graph dataset with an aligned corpus. It integrates the Wikidata knowledge graph with Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.
The dataset is distributed as a knowledge graph, a corpus, and aliases. The inductive splits used in the original paper are also provided.
Wikidata5m follows the identifier system used in Wikidata. Each entity and relation is identified by a unique ID. Entities are prefixed by Q, while relations are prefixed by P.
The knowledge graph is stored in the triplet list format. For example, the following line corresponds to <Donald Trump, position held, President of the United States>.
Q22686 P39 Q11696
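As a minimal sketch of reading this format, the snippet below parses one triplet line into its head, relation, and tail IDs. The names `Triplet` and `parse_triplet` are hypothetical, and the code assumes the three fields are separated by whitespace (spaces or tabs), which the description above does not state explicitly.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    head: str      # entity ID, e.g. "Q22686"
    relation: str  # relation ID, e.g. "P39"
    tail: str      # entity ID, e.g. "Q11696"


def parse_triplet(line: str) -> Triplet:
    """Parse one line of the triplet list into (head, relation, tail)."""
    # Assumption: fields are separated by whitespace (spaces or tabs).
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    head, relation, tail = fields
    # Entity IDs start with Q and relation IDs start with P.
    if not (head.startswith("Q") and relation.startswith("P") and tail.startswith("Q")):
        raise ValueError(f"unexpected ID prefix in: {line!r}")
    return Triplet(head, relation, tail)


triplet = parse_triplet("Q22686 P39 Q11696")
print(triplet.head, triplet.relation, triplet.tail)
```

Rejecting lines with the wrong number of fields or an unexpected prefix is a design choice for this sketch; a loader that needs to be lenient could skip such lines instead.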
Each line in the corpus is a document, indexed by entity ID. The following line shows the description for Donald Trump.
Q22686 Donald John Trump (born June 14, 1946) is the 45th and current president of the United States ...
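A corpus line can be split into its entity ID and document text along the following lines. This is only a sketch: `parse_document` is a hypothetical helper, and it assumes the ID is the first whitespace-delimited token and that everything after it is the description.

```python
from typing import Tuple


def parse_document(line: str) -> Tuple[str, str]:
    """Split one corpus line into (entity_id, text)."""
    # Assumption: the entity ID is the first token; the rest is the document.
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty corpus line")
    entity_id = parts[0]
    # A line holding only an ID yields an empty document.
    text = parts[1] if len(parts) == 2 else ""
    return entity_id, text


entity_id, text = parse_document(
    "Q22686 Donald John Trump (born June 14, 1946) is the 45th and current president"
)
print(entity_id)
print(text)
```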
Each line of the alias data lists the aliases of one entity or relation, indexed by its ID. The following line shows the aliases of Donald Trump.
Q22686 donald john trump 45th president of the united states @realdonaldtrump ...
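Because a single alias may itself contain spaces, an alias line cannot be split on spaces alone. The sketch below assumes that the ID and the individual aliases are separated by tab characters; this delimiter is an assumption and is not stated in the description above, so adjust `sep` if the files use something else. `parse_aliases` is a hypothetical helper name.

```python
from typing import List, Tuple


def parse_aliases(line: str, sep: str = "\t") -> Tuple[str, List[str]]:
    """Split one alias line into (identifier, [alias, ...])."""
    # Assumption: the ID and each alias are delimited by `sep` (tab by default).
    fields = line.rstrip("\n").split(sep)
    identifier = fields[0]
    if not identifier:
        raise ValueError("alias line has no identifier")
    # Drop empty fields that a trailing delimiter would produce.
    aliases = [alias for alias in fields[1:] if alias]
    return identifier, aliases


identifier, aliases = parse_aliases(
    "Q22686\tdonald john trump\t45th president of the united states\t@realdonaldtrump"
)
print(identifier, aliases)
```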