wikipedia-list-extractor

Read entries from Wikipedia lists

Wikipedia has lists of objects (e.g. monuments), often referenced by governmental data (e.g. heritage protection). This module helps to extract data from these lists.

Example: The sub-pages of [Denkmalgeschützte Objekte in Österreich](https://de.wikipedia.org/wiki/Denkmalgesch%C3%BCtzte_Objekte_in_%C3%96sterreich) list all heritage-protected objects in Austria. This module returns the individual items of this list as JSON objects. Within this module, the ID of this list is 'AT-BDA'. Items can be referenced by their ID (e.g. 'id-24536'), by their Wikidata ID (e.g. 'Q1534177'), or by their page plus index (e.g. 'Liste der denkmalgeschützten Objekte in Wien/Innere Stadt/E–He#69').

There's a demo-application where you can view items on a map: https://openstreetmap.at/demo-wikipedia-list-extractor (Source).

In data/ there are config files for each type of list.

Usage

Stand-alone with NodeJS server (included with the dev dependencies)

git clone https://github.com/plepe/wikipedia-list-extractor
cd wikipedia-list-extractor
npm install
npm start

Point your browser to http://localhost:8080/ for the interactive app.

Try the list 'AT-BDA' with the ID 'id-24536' or 'Q1534177'. Both IDs should return the Goethedenkmal in Vienna.

Additionally, the stand-alone server exposes an HTTP API which you can query: http://localhost:8080/api/<list>/<id>

  • where list is the ID of a list (e.g. INT-UNESCO)
  • where id is one or several IDs, comma-separated

Example:

curl http://localhost:8080/api/INT-UNESCO-de/91,80
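
The same request can be issued from JavaScript. The sketch below builds the URL from a list ID and an array of item IDs and then fetches it. `buildApiUrl` and `fetchItems` are hypothetical helper names, not part of this module. The sketch assumes that the API answers with JSON, that the server accepts percent-encoded path segments, and that a global `fetch` is available (Node.js 18 or later, or a browser).

```javascript
// Hypothetical helper: compose the API URL from a list ID and item IDs.
// IDs are percent-encoded so that characters such as '/' or '#' stay
// inside one path segment (assumption: the server decodes them again).
function buildApiUrl (baseUrl, list, ids) {
  const idPart = ids.map(id => encodeURIComponent(id)).join(',')
  return `${baseUrl}/api/${encodeURIComponent(list)}/${idPart}`
}

// Hypothetical helper: query the stand-alone server.
// Assumption: the response body is JSON.
async function fetchItems (baseUrl, list, ids) {
  const response = await fetch(buildApiUrl(baseUrl, list, ids))
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`)
  }
  return response.json()
}
```

With the server from `npm start` running, `fetchItems('http://localhost:8080', 'INT-UNESCO-de', ['91', '80'])` would request the same URL as the curl call above.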

As module within a NodeJS application

Wikipedia List Extractor uses a few modules (node-fetch, jsdom) as indirect dependencies (so they don't get compiled when using browserify). These have to be exposed as global variables. This can be done by requiring wikipedia-list-extractor/node.

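// Expose node-fetch and jsdom as global variables (see above)
require('wikipedia-list-extractor/node')
// Assumption: the package's main export is the MediawikiListExtractor class
const MediawikiListExtractor = require('wikipedia-list-extractor')

// Load the list definition 'INT-UNESCO-de' from the data/ folder
let extractor = new MediawikiListExtractor('INT-UNESCO-de', null, {
  path: 'node_modules/wikipedia-list-extractor/data',
})
// Request the items with the IDs 91 and 80
extractor.get(['91', '80'], (err, result) => {
  console.log(err, JSON.stringify(result, null, '  '))
})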
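
Since `get` takes a Node-style callback, it can be wrapped in a Promise for use with `async`/`await`. The following is a minimal sketch: `getItems` is a hypothetical helper, and the stub object merely stands in for a real extractor so that the snippet runs on its own. The shape of the stub's result is made up.

```javascript
// Hypothetical helper: wrap the callback-style get() in a Promise.
function getItems (extractor, ids) {
  return new Promise((resolve, reject) => {
    extractor.get(ids, (err, result) => {
      if (err) {
        reject(err)
      } else {
        resolve(result)
      }
    })
  })
}

// Stand-in for a real extractor (illustration only): echoes the requested IDs.
const stubExtractor = {
  get (ids, callback) {
    callback(null, ids.map(id => ({ id })))
  }
}

getItems(stubExtractor, ['91', '80'])
  .then(items => console.log(JSON.stringify(items)))
  .catch(err => console.error(err))
```

In an application, the stub would be replaced by the `extractor` instance created above.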

Stand-alone on a PHP server (e.g. with Apache2)

cd /var/www/html
git clone https://github.com/plepe/wikipedia-list-extractor
cd wikipedia-list-extractor
npm install

Point your browser to https://server/wikipedia-list-extractor

You have to select 'Run code in browser', as the PHP code does not implement the server side.

As module within a web application in a browser

Wikipedia does not allow requests from a web browser unless they originate from a Wikipedia page, so a proxy is required. Supply the URL of the proxy in the options when creating the MediawikiListExtractor:

// def is the file data/INT-UNESCO.json as Javascript Object
let extractor = new MediawikiListExtractor('INT-UNESCO', null, {
  path: 'node_modules/wikipedia-list-extractor/data',
  proxy: 'proxy/?'
})
extractor.get(['91', '80'], (err, result) => {
  console.log(err, result)
})

See proxy/index.php or proxy/index.js for examples.

List definition files

The list definition files are YAML files in the data/ folder. The basic structure:

title:
  en: List for something
param:
  ... Definition for a source or several sources

Definition of a source:

language: de
source: https://de.wikipedia.org
pageTitleMatch: Liste der Kunstwerke
renderedFields:
  id:
    column: 2
    regexp: /<a[^>]*>([0-9]+)<\/a>/
    type: html
  wikidata:
    column: 3
    regexp: /<a href="https:\/\/www.wikidata.org\/wiki\/(Q[0-9]+)">Wikidata<\/a>/
    type: html
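
To illustrate how a `renderedFields` entry could be applied, the sketch below picks a table cell by `column` and returns the first capture group of `regexp`. This is not the module's actual code: `parseRegexp` and `extractField` are hypothetical names, the row is a made-up array of cell HTML strings, and treating `column` as a zero-based index is an assumption.

```javascript
// Turn a '/pattern/flags' string from a definition file into a RegExp.
function parseRegexp (text) {
  const match = /^\/(.*)\/([a-z]*)$/.exec(text)
  if (!match) {
    throw new Error(`Not a /pattern/flags string: ${text}`)
  }
  return new RegExp(match[1], match[2])
}

// Apply one renderedFields entry to a row (array of cell HTML strings).
function extractField (row, fieldDef) {
  const cell = row[fieldDef.column] // assumption: zero-based column index
  if (cell === undefined) return null
  if (!fieldDef.regexp) return cell
  const match = parseRegexp(fieldDef.regexp).exec(cell)
  return match ? match[1] : null
}

// Made-up table row for illustration
const row = [
  '<img src="photo.jpg">',
  'Goethedenkmal',
  '<a href="#id-24536">24536</a>',
  '<a href="https://www.wikidata.org/wiki/Q1534177">Wikidata</a>'
]

// The two field definitions from the source definition above
const fields = {
  id: { column: 2, regexp: String.raw`/<a[^>]*>([0-9]+)<\/a>/` },
  wikidata: {
    column: 3,
    regexp: String.raw`/<a href="https:\/\/www.wikidata.org\/wiki\/(Q[0-9]+)">Wikidata<\/a>/`
  }
}

console.log(extractField(row, fields.id), extractField(row, fields.wikidata))
```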

For sources, the following options are possible:

| Field | Description |
| --- | --- |
| language | Language of this list |
| source | URL of the Mediawiki / Wikipedia where this list is to be found |
| pageTitleMatch | The template title for pages which build this page (e.g. there might be a list of artwork for each town). This is a regular expression for Mediawiki CirrusSearch, so there might be some restrictions. |
| template | Mediawiki pages use the specified template (or, when this is an array, templates) for rendering content. |
| rawIdField | The id of the item can be read from this field (in the template in page source). |
| rawAnchorField | The HTML anchor of the item can be read from this field (in the template in page source). |
| rawWikidataField | The wikidata id of the item can be read from this field (in the template in page source). |
| renderedTableClass | In rendered output, the table in the page can be detected from this class. |
| renderedIdField | The id of the item can be read from this field (in the rendered output, see renderedFields). If the id is empty ('', null, ...), the item will be ignored. |
| renderedAnchorField | The HTML anchor of the item can be read from this field (in the rendered output, see renderedFields). |
| renderedWikidataField | The wikidata id of the item can be read from this field (in the rendered output, see renderedFields). |
| renderedFields | Hashed array of fields, see below. |
| wikidataFields | Optionally load the specified list of fields from the matching wikidata item. Example: [{property: P31, field: "is_a"}, ...] |

Advanced Fields:

| Field | Description |
| --- | --- |
| pages | List of pages which constitutes the whole dataset (e.g. for getAll, which returns all items). |
| rawAnchorTemplate | Complex HTML anchor for the item. Uses Twig syntax to compile the anchor. Available parameters: item.field (with each field from the template), page (page title), index (index of the item on this page). |
| rawIdTemplate | More complex ID and aliases for the item. Uses Twig syntax to compile the ID/Aliases (one alias per line). Make sure that the first result is always the same as the first ID in renderedIdTemplate. Available parameters: item.field (with all fields from the template), page (page title), index (index of the item on this page). |
| renderedAnchorTemplate | Complex HTML anchor for the item. Uses Twig syntax to compile the anchor. Available parameters: item.field (with each field from the template), page (page title), index (index of the item on this page). |
| renderedIdTemplate | More complex ID and aliases for the item. Uses Twig syntax to compile the ID/Aliases (one alias per line). Make sure that the first result is always the same as the first ID in rawIdTemplate. Available parameters: item.field (with all parsed fields from the rendered page), page (page title), index (index of the item on this page). |
| wikidataIdTemplate | Additional aliases for the item. Uses Twig syntax to compile the alias (one alias per line). Available parameters: item.P1234 (with all properties specified in wikidataFields). |
| idToQuery | When searching for an ID, how to search on the Mediawiki site. idToQuery uses Twig syntax to generate the query, with multiple lines prefixed by a query option and =; available parameter: id (the id we are looking for). Query options: field (which field to query), value (which value to query), wikidataProperty and wikidataValue (value can't be found in the page source, needs to query wikidata first -> use wikidata item id as value), page (doesn't need to search, just load the specified page). |
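
The `idToQuery` output format (one `option=value` pair per line) can be pictured with a small parser. This sketch is illustrative only: `parseQueryOptions` is a hypothetical name, the field name `ObjektID` is made up, and the module's own parsing may differ.

```javascript
// Hypothetical parser for rendered idToQuery output:
// one 'option=value' pair per line.
function parseQueryOptions (text) {
  const options = {}
  for (const line of text.split('\n')) {
    const trimmed = line.trim()
    if (trimmed === '') continue // skip blank lines
    const pos = trimmed.indexOf('=')
    if (pos === -1) continue // ignore lines without '='
    // Split at the first '=' only, so values may themselves contain '='
    options[trimmed.slice(0, pos).trim()] = trimmed.slice(pos + 1).trim()
  }
  return options
}

// Example of what a rendered template might produce for the id '24536'
const rendered = 'field=ObjektID\nvalue=24536\n'
console.log(parseQueryOptions(rendered))
```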

Rendered fields parameters:

| Parameter | Description |
| --- | --- |
| column | Table column |
| type | 'html' (default), 'image' (parse url, width, height from first image in this field) |
| domQuery | CSS style query for a DOM node in the cell. |
| domAttribute | Use the value of this attribute of the DOM node (or of the cell, if domQuery was not specified). |
| regexp | A regular expression, where the first capture group is the resulting value (to exclude patterns, use a non-capturing group: /foo(?:bar)(bla)/ -> "bla"). |
| modify | A TwigJS template, which can modify value. The following parameters are available: value (if column was specified; the result after column, domQuery, domAttribute, regexp), row (the full table row as array), index (the n-th item on this page), page (the name of the Wikipedia page). |
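
The note on `regexp` relies on standard JavaScript regular expression behaviour: a non-capturing group `(?:...)` has to match but is not numbered, so the first capture group is the one that follows it. A quick check with the pattern from the parameter description:

```javascript
// (?:bar) must be present in the input but does not count as a capture
// group, so group 1 is (bla).
const match = /foo(?:bar)(bla)/.exec('foobarbla')
console.log(match[1]) // prints "bla"
```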