Querying blog metadata via SPARQL

Since the start of 2025, I have been recording additional metadata for the posts on this blog.

The suggestion came from Frank Reichert via Mastodon as a comment on the post “Archives: Fostering collection development through citizen participation”.

To this end, at the start of the year I added further JSON-LD-based metadata formats such as CodeMeta and Linked Art for some posts and improved the [Schema.org](https://schema.org/) markup.

Source data

Only this Schema.org source file is used to capture the necessary triples. All required data and their sources are extracted from it.

Even though there are still no “great” applications, SPARQL queries can now be run on this data. To keep things interesting, some data from Wikidata has been included in the corpus.

To create this corpus, the blog’s Schema.org data is used as a starting point, with the Linked Art and CodeMeta entries incorporated. The resulting graph is then enriched with data from Wikidata and converted into the HDT format.
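
Incorporating the Linked Art and CodeMeta entries can be pictured as merging node objects that share an `@id`. The following is only a minimal sketch, assuming each source has already been turned into an expanded, flattened list of JSON-LD nodes (full IRIs as keys). The function name `merge_graphs` and the example IRIs in the usage note are hypothetical and not taken from the blog's build script.

```python
from typing import Any, Dict, Iterable, List

Node = Dict[str, Any]


def _as_list(value: Any) -> List[Any]:
    """Wrap single values so every property can be handled as a list."""
    return value if isinstance(value, list) else [value]


def merge_graphs(graphs: Iterable[List[Node]]) -> List[Node]:
    """Merge flattened JSON-LD node lists, combining nodes with the same @id.

    Assumption: all inputs use full IRIs as keys (expanded form), so that
    equal keys from different sources really mean the same property.
    """
    merged: Dict[str, Node] = {}
    for graph in graphs:
        for node in graph:
            node_id = node["@id"]
            target = merged.setdefault(node_id, {"@id": node_id})
            for key, value in node.items():
                if key == "@id":
                    continue
                existing = target.setdefault(key, [])
                for item in _as_list(value):
                    # Skip values that an earlier source already supplied.
                    if item not in existing:
                        existing.append(item)
    return list(merged.values())
```

Called as `merge_graphs([schema_nodes, linked_art_nodes, codemeta_nodes])`, this returns one node per `@id`, with the property values of all sources combined and duplicates dropped. Enrichment with Wikidata and the conversion to HDT would be separate steps that are not shown here.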

This approach is necessary because the number of entities has risen massively, particularly since the start of this year: originally, only the tags were mapped to the respective Wikidata entities, but now a Python script based on spaCy extracts entities semi-automatically from the post texts. Querying Wikidata live via SPARQL very quickly runs into its rate limit and results in error messages. Serialising the data into a JSON-LD graph, on the other hand, results in a large file, hence the compact HDT format.
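
One way to stay below such a rate limit when enriching the graph offline is to bundle many entities into a single request instead of sending one query per entity. The following is only a sketch of that idea, not the script used for this blog: the helper names (`chunked`, `build_batch_query`, `batch_queries`), the batch size of 50 and the selection of properties (taken from the example queries further down) are assumptions.

```python
import re
from typing import Iterable, Iterator, List

# Properties that appear in the example queries of this post:
# occupation, date of birth, instance of, coordinate location.
PROPERTIES = ("P106", "P569", "P31", "P625")

# Accept only well-formed Wikidata item IDs such as "Q42".
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_query(qids: List[str]) -> str:
    """Build one CONSTRUCT query that covers every entity in `qids`."""
    entities = " ".join(f"wd:{qid}" for qid in qids)
    properties = " ".join(f"wdt:{pid}" for pid in PROPERTIES)
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "CONSTRUCT { ?entity ?property ?value }\n"
        "WHERE {\n"
        f"  VALUES ?entity {{ {entities} }}\n"
        f"  VALUES ?property {{ {properties} }}\n"
        "  ?entity ?property ?value .\n"
        "}\n"
    )


def batch_queries(qids: Iterable[str], batch_size: int = 50) -> List[str]:
    """Deduplicate the IDs and return one query string per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    unique = list(dict.fromkeys(qids))  # keeps the first-seen order
    invalid = [qid for qid in unique if not QID_PATTERN.match(qid)]
    if invalid:
        raise ValueError(f"not valid Wikidata item IDs: {invalid}")
    return [build_batch_query(chunk) for chunk in chunked(unique, batch_size)]
```

The sketch only builds the query strings. Sending them (with a pause between requests) and storing the returned triples is left out, because those details depend on the HTTP client and on the limits that apply at the time.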

Query

Examples

  • Blog posts about artists born in the 19th century; this query could return even more results, but the data isn't available yet.
PREFIX schema: <http://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?blogPost ?artist ?artistLabel 
       (GROUP_CONCAT(DISTINCT ?occLabel; separator=", ") AS ?occupations)
       ?birthDate
WHERE {
  ?blogPost a schema:BlogPosting ;
            schema:about ?artist .

  ?artist wdt:P106 ?directOccupation .

  ?directOccupation wdt:P279* ?occupation .
  VALUES ?occupation {
    wd:Q483507   # artist
    wd:Q1028181  # painter
    wd:Q1281618  # sculptor
    wd:Q42973    # architect
    wd:Q15296    # writer
  }

  OPTIONAL {
    ?artist rdfs:label ?artistLabel .
    FILTER(LANG(?artistLabel) = "de")
  }
  OPTIONAL {
    ?directOccupation rdfs:label ?occLabel .
    FILTER(LANG(?occLabel) = "en")
  }

  ?artist wdt:P569 ?birthDate .
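  # Keep only birth years from 1800 up to and including 1899
  FILTER(year(xsd:dateTime(str(?birthDate))) >= 1800 &&
         year(xsd:dateTime(str(?birthDate))) < 1900)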
}
GROUP BY ?blogPost ?artist ?artistLabel ?birthDate
ORDER BY ?birthDate
  • Coordinates of the places mentioned in blog posts, restricted to places typed as “museum” or “exhibition” in Wikidata.
PREFIX schema: <http://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?blogPost ?place ?placeLabel ?coordinates
WHERE {
  ?blogPost a schema:BlogPosting ;
            schema:about ?place .

  ?place wdt:P31/wdt:P279* ?type .
  VALUES ?type { wd:Q33506 wd:Q464980 }

  ?place wdt:P625 ?coordinates .

  OPTIONAL {
    ?place rdfs:label ?placeLabel .
    FILTER(LANG(?placeLabel) = "en")
  }
}
  • Search for articles matching the term (keyword) “woodcut” from the Getty Art & Architecture Thesaurus, AAT (http://vocab.getty.edu/aat/300041405).
PREFIX schema: <http://schema.org/>
PREFIX aat: <http://vocab.getty.edu/aat/>
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?article ?object ?technique ?techniqueLabel
WHERE {
  ?article a schema:BlogPosting ; 
           schema:about ?object .

  ?object a crm:E22_Human-Made_Object ;
          crm:P108i_was_produced_by ?production .

  ?production crm:P32_used_general_technique ?technique .

  VALUES ?technique { aat:300041405 }

  ?technique rdfs:label ?techniqueLabel .
}
  • Number of triples
SELECT (COUNT(*) AS ?count)
WHERE {
  ?s ?p ?o .
}

Potential issues

During implementation, I noticed a few minor issues with Chrome: after reloading the page several times, memory errors occurred – presumably the browser is not clearing a tab’s memory properly or is caching more than necessary. In this case, an error message (e.g. “memory access out of bounds”) appears in the status bar and the browser needs to be restarted.

Implementation

The HDT file created in the first step is loaded using the Wasm build of the Rust library HDT. Its contents are then converted in memory so that they can be used in Oxigraph (also compiled from Rust to Wasm). Strictly speaking, Oxigraph is not necessary at this stage, as HDT can also execute SPARQL queries. However, Oxigraph has the advantage of also being able to execute federated (distributed) SPARQL queries.
