COW: Converter for CSV on the Web

This package is a comprehensive tool (COW [1]) for batch conversion of multiple datasets expressed in CSV. It uses a JSON schema expressed using an extended version of the CSVW standard, to convert CSV files to RDF in scalable fashion.


Instead of using the command line tool there is also the webservice cattle, providing the same functionality that COW provides without having to install it. CSV files can be uploaded to the service and a JSON schema will be created, using that JSON schema cattle is able to create a RDF structured graph. More information about cattle, including how to use it, can be found at: https://github.com/CLARIAH/cattle.


CSV on the Web (CSVW) is a W3C standard for metadata descriptions for tabular data. Typically, these data reside in CSV files. CSVW metadata is captured in .csv-metadata.json files that live alongside the CSV files that they describe. For instance, a CSV file called data.csv and its metadata data.csv-metadata.json would be hosted at:

http://example.com/data.csv
http://example.com/data.csv-metadata.json

Another feature of CSVW is that it allows the specification of a mapping (or interpretation) of values in the CSV in terms of RDF. The tableSchema element in CSVW files defines per column what its properties should be, but may also define custom mappings to e.g. URIs in RDF.

Interestingly, the JSON format used by CSVW metadata is an extension of the JSON-LD specification, a JSON-based serialization for Linked Data. As a consequence of this, the CSVW metadata can be directly attached (as provenance) to the RDF resulting from a CSVW-based conversion.

This is exactly what the COW converter does.

The rest of this documentation will be fairly technical, for some hands-on examples you can take a look at the Wiki.

Features & Limitations

Compared to the CSVW specification, the converter has a number of limitations and extra features. These are:

  1. COW does not perform any schema checking, and ignores any and all parts of the CSVW Specification that are not directly needed for the RDF conversion.
  2. COW extends the CSVW specification in several ways:
  • Advanced formatting of URLs and values
  • Dealing with multiple null values and null values for one or more other columns.
  • Simple SKOS support (generating collections and schemes)
  • Optionally skipping/not skipping empty cells
  • A default set of namespace prefixes
  1. COW does some smart guessing:
  • Determining file encoding
  • Determining the delimiter
  • Generating a skeleton schema for any CSV file (see here)
  1. COW produces extensive provenance:
  • Converted data is encapsulated in a Nanopublication
  • The original CSVW schema is encapsulated in the np:hasProvenance graph associated with the nanopublication.

Installation

Prerequisites

  • Python 2.7 (installed on most systems)
  • pip
  • virtualenv (simply pip install virtualenv) [2]

Installing with pip (preferred)

Open up a terminal and instantiate a virtual Python environment:

virtualenv .

Activate the virtual environment:

source bin/activate

Install cow in the new environment:

pip install cow_csvw

To upgrade a previously installed version of COW, do:

pip install --upgrade cow_csvw

(you might need permissions if you’re installing outside a virtualenv). To check the version currently installed:

cow_tool --version

To get help:

cow_tool

Usage

The primary command line script for CSVW-based conversion is cow_tool. It can be used for two tasks:

  1. Generating a skeleton CSVW JSON-Schema for a specific CSV file.
  2. Using such a schema to convert a CSV file to RDF (in NQuads format)

General usage instructions can be obtained by running cow_tool -h:

usage: cow_tool [-h] [--dataset DATASET] [--delimiter DELIMITER]
                [--quotechar QUOTECHAR] [--processes PROCESSES]
                [--chunksize CHUNKSIZE] [--base BASE]
                {convert,build} file [file ...]

The table below gives a brief description of each of these options.

Commandline options for cow_tool
Option Explanation
dataset Specifies the name of the dataset, if it is different from the filename with the .csv extension stripped.
delimiter Forces the use of a specific delimiter when parsing the CSV file (only used with build option)
quotechar Forces the use of a specific quote character (default is ", only used with build option)
processes Specifies the number of parallel processes to use when converting a CSV file (default is 4)
chunksize Specifies the number of lines that will be passed to each process (default is 5000)
base The base for URIs generated with the schema (only used with build option, the default is http://data.socialhistory.org)
{convert,build} The convert option triggers a conversion to RDF for the files specified in file [file ...]. The build option generates a skeleton JSON schema for the files specified.
file [file ...] A list of files to be converted (or “built”); any unix-style wildcards are allowed.

Generating a Skeleton Schema

Since JSON is a rather verbose language, and we currently do not have a convenient UI for constructing CSVW schema files, COW allows you to generate a skeleton schema for any CSV file.

Suppose you want to build a skeleton schema for a file imf_gdppc.csv (from [4]) that looks like:

1
2
3
4
5
6
7
8
9
Rank;Country;Int
1;Qatar;131,063
2;Luxembourg;104,906
3;Macau;96,832
4;Singapore;90,249
5;Brunei Darussalam;83,513
6;Kuwait;72,675
7;Ireland;72,524
8;Norway;70,645

Make sure you have your virtual environment enabled (if applicable), and run:

cow_tool build imf_gdppc.csv --base=http://example.com/resource

The --base option specifies the base for all URIs generated through the schema. This is https://iisg.amsterdam/ by default (see http://datalegend.net)

This will generate a file called imf_gdppc.csv-metadata.json with the following contents:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
{
 "dialect": {
  "quoteChar": "\"",
  "delimiter": ";",
  "encoding": "ascii"
 },
 "dcat:keyword": [],
 "dc:license": {
  "@id": "http://opendefinition.org/licenses/cc-by/"
 },
 "dc:publisher": {
  "schema:name": "CLARIAH Structured Data Hub - Datalegend",
  "schema:url": {
   "@id": "http://datalegend.net"
  }
 },
 "url": "imf_gdppc.csv",
 "@context": [
  "http://csvw.clariah-sdh.eculture.labs.vu.nl/csvw.json",
  {
   "@base": "http://example.com/resource/",
   "@language": "en"
  },
  {
   "owl": "http://www.w3.org/2002/07/owl#",
   "napp-eng81": "https://iisg.amsterdam/napp/dataset/englandwales1881/",
   "dbo": "http://dbpedia.org/ontology/",
   "clioctr": "https://iisg.amsterdam/clio/country/",
   "hisclass": "https://iisg.amsterdam/hisclass/",
   "hisco-product": "https://iisg.amsterdam/hisco/product/",
   "ldp": "http://www.w3.org/ns/ldp#",
   "clio": "https://iisg.amsterdam/clio/",
   "occhisco": "https://iisg.amsterdam/napp/OCCHISCO/",
   "dbr": "http://dbpedia.org/resource/",
   "skos": "http://www.w3.org/2004/02/skos/core#",
   "xml": "http://www.w3.org/XML/1998/namespace/",
   "sdmx-concept": "http://purl.org/linked-data/sdmx/2009/concept#",
   "napp": "https://iisg.amsterdam/napp/",
   "prov": "http://www.w3.org/ns/prov#",
   "sdmx-code": "http://purl.org/linked-data/sdmx/2009/code#",
   "napp-can91": "https://iisg.amsterdam/napp/dataset/canada1891/",
   "hiscam": "https://iisg.amsterdam/hiscam/",
   "dbpedia": "http://dbpedia.org/resource/",
   "np": "http://www.nanopub.org/nschema#",
   "hisclass5": "https://iisg.amsterdam/hisclass5/",
   "canfam-auke": "https://iisg.amsterdam/canfam/auke/",
   "dcterms": "http://purl.org/dc/terms/",
   "schema": "http://schema.org/",
   "foaf": "http://xmlns.com/foaf/0.1/",
   "sdv": "http://example.com/resource/vocab/",
   "hisco": "https://iisg.amsterdam/hisco/",
   "bibo": "http://purl.org/ontology/bibo/",
   "sdmx-dimension": "http://purl.org/linked-data/sdmx/2009/dimension#",
   "hsn": "https://iisg.amsterdam/hsn2013a/",
   "dc": "http://purl.org/dc/terms/",
   "hisco-relation": "https://iisg.amsterdam/hisco/relation/",
   "hisco-status": "https://iisg.amsterdam/hisco/status/",
   "dbp": "http://dbpedia.org/property/",
   "clioprop": "https://iisg.amsterdam/clio/property/",
   "csvw": "http://www.w3.org/ns/csvw#",
   "clioind": "https://iisg.amsterdam/clio/indicator/",
   "dc11": "http://purl.org/dc/elements/1.1/",
   "qb": "http://purl.org/linked-data/cube#",
   "canfam-dimension": "http://data.socialhistory.org/vocab/canfam/dimension/",
   "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
   "canfam": "https://iisg.amsterdam/canfam/dataset/canada1901/",
   "napp-sct81": "https://iisg.amsterdam/napp/dataset/scotland1881/",
   "sdmx-measure": "http://purl.org/linked-data/sdmx/2009/measure#",
   "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
   "sdr": "http://example.com/resource/",
   "xsd": "http://www.w3.org/2001/XMLSchema#",
   "time": "http://www.w3.org/2006/time#",
   "napp-dimension": "http://data.socialhistory.org/vocab/napp/dimension/"
  }
 ],
 "dc:title": "imf_gdppc.csv",
 "@id": "http://example.com/resource/imf_gdppc.csv",
 "dc:modified": {
  "@value": "2018-11-14",
  "@type": "xsd:date"
 },
 "tableSchema": {
  "aboutUrl": "{_row}",
  "primaryKey": "Rank",
  "columns": [
   {
    "datatype": "string",
    "titles": [
     "Rank"
    ],
    "@id": "http://example.com/resource/imf_gdppc.csv/column/Rank",
    "name": "Rank",
    "dc:description": "Rank"
   },
   {
    "datatype": "string",
    "titles": [
     "Country"
    ],
    "@id": "http://example.com/resource/imf_gdppc.csv/column/Country",
    "name": "Country",
    "dc:description": "Country"
   },
   {
    "datatype": "string",
    "titles": [
     "Int"
    ],
    "@id": "http://example.com/resource/imf_gdppc.csv/column/Int",
    "name": "Int",
    "dc:description": "Int"
   }
  ]
 }
}

The exact meaning of this structure is explained in the section below.

Converting a CSV file

If we now want to convert our example file imf_gdppc.csv, you first make sure you have your virtual environment enabled (if applicable), and run:

cow_tool convert imf_gdppc.csv

This will produce a file imf_gdppc.csv.nq that holds an NQuads serialization of the RDF.

This is also the preferred method for converting multiple files at the same time. For instance, if you want to convert all CSV files in a specific directory, simply use unix-style wildcards:

cow_tool convert /path/to/some/directory/*.csv

Going back to our running example, the resulting RDF will be serialized as N-Quads. This is a computer friendly but not so much human friendly serialization so for the benefit of (human) readability below the RDF will be represented in the TriG serialization:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
@prefix ns1: <http://www.w3.org/ns/prov#> .
@prefix ns2: <http://www.w3.org/ns/csvw#> .
@prefix ns3: <http://schema.org/> .
@prefix ns4: <http://purl.org/dc/terms/> .
@prefix ns5: <urn:uuid:5> .
@prefix ns6: <http://www.nanopub.org/nschema#> .
@prefix ns7: <https://iisg.amsterdam/vocab/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://iisg.amsterdam/imf_gdppc/pubinfo/48422b27/2018-11-14T10:59> {
    <https://iisg.amsterdam/imf_gdppc/nanopublication/48422b27/2018-11-14T10:59> ns1:generatedAtTime "2018-11-14T10:59:00"^^xsd:dateTime ;
        ns1:wasGeneratedBy <https://github.com/CLARIAH/wp4-converters> .
}

<https://iisg.amsterdam/imf_gdppc/provenance/48422b27/2018-11-14T10:59> {
    <https://iisg.amsterdam/imf_gdppc/assertion/48422b27/2018-11-14T10:59> ns1:generatedAtTime "2018-11-14T10:59:00"^^xsd:dateTime ;
        ns1:wasDerivedFrom <http://example.com/resource/imf_gdppc.csv>,
            <https://iisg.amsterdam/48422b27cba4a0e68c9c66d0f7ca614ec688dfcb> .

    <http://example.com/resource/__row_> ns1:wasDerivedFrom "http://example.com/resource/{_row}"^^xsd:string .

    <http://example.com/resource/imf_gdppc.csv> ns4:license <http://opendefinition.org/licenses/cc-by/> ;
        ns4:modified "2018-11-14"^^xsd:date ;
        ns4:publisher [ ns3:name "CLARIAH Structured Data Hub - Datalegend"@en ;
                ns3:url <http://datalegend.net/> ] ;
        ns4:title "imf_gdppc.csv"@en ;
        ns2:dialect [ ns2:delimiter ";" ;
                ns2:encoding "ascii" ;
                ns2:quoteChar "\"" ] ;
        ns2:tableSchema [ ns2:aboutUrl <http://example.com/resource/__row_> ;
                ns2:column ( <http://example.com/resource/imf_gdppc.csv/column/Rank> <http://example.com/resource/imf_gdppc.csv/column/Country> <http://example.com/resource/imf_gdppc.csv/column/Int> ) ;
                ns2:primaryKey "Rank" ] ;
        ns2:url "imf_gdppc.csv"^^xsd:anyURI .

    <http://example.com/resource/imf_gdppc.csv/column/Country> ns4:description "Country"@en ;
        ns2:datatype xsd:string ;
        ns2:name "Country" ;
        ns2:title "Country"@en .

    <http://example.com/resource/imf_gdppc.csv/column/Int> ns4:description "Int"@en ;
        ns2:datatype xsd:string ;
        ns2:name "Int" ;
        ns2:title "Int"@en .

    <http://example.com/resource/imf_gdppc.csv/column/Rank> ns4:description "Rank"@en ;
        ns2:datatype xsd:string ;
        ns2:name "Rank" ;
        ns2:title "Rank"@en .
}

ns5:db490c7-50c3-4ad6-b0df-d48fe3dfa984 {
    <https://iisg.amsterdam/48422b27cba4a0e68c9c66d0f7ca614ec688dfcb> ns7:path "/tmp/V2RY7QULW9/web_interface/91a7c0a271826cf3e7e5b470dfd5e345/imf_gdppc.csv"^^xsd:string ;
        ns7:sha1_hash "48422b27cba4a0e68c9c66d0f7ca614ec688dfcb"^^xsd:string .

    <https://iisg.amsterdam/imf_gdppc/nanopublication/48422b27/2018-11-14T10:59> a ns6:Nanopublication ;
        ns6:hasAssertion <https://iisg.amsterdam/imf_gdppc/assertion/48422b27/2018-11-14T10:59> ;
        ns6:hasProvenance <https://iisg.amsterdam/imf_gdppc/provenance/48422b27/2018-11-14T10:59> ;
        ns6:hasPublicationInfo <https://iisg.amsterdam/imf_gdppc/pubinfo/48422b27/2018-11-14T10:59> .

    <https://iisg.amsterdam/imf_gdppc/assertion/48422b27/2018-11-14T10:59> a ns6:Assertion .

    <https://iisg.amsterdam/imf_gdppc/provenance/48422b27/2018-11-14T10:59> a ns6:Provenance .

    <https://iisg.amsterdam/imf_gdppc/pubinfo/48422b27/2018-11-14T10:59> a ns6:PublicationInfo .
}

<https://iisg.amsterdam/imf_gdppc/assertion/48422b27/2018-11-14T10:59> {
    <http://example.com/resource/0> ns7:Country "Qatar"^^xsd:string ;
        ns7:Int "131,063"^^xsd:string ;
        ns7:Rank "1"^^xsd:string .

    <http://example.com/resource/1> ns7:Country "Luxembourg"^^xsd:string ;
        ns7:Int "104,906"^^xsd:string ;
        ns7:Rank "2"^^xsd:string .

    <http://example.com/resource/2> ns7:Country "Macau"^^xsd:string ;
        ns7:Int "96,832"^^xsd:string ;
        ns7:Rank "3"^^xsd:string .

    <http://example.com/resource/3> ns7:Country "Singapore"^^xsd:string ;
        ns7:Int "90,249"^^xsd:string ;
        ns7:Rank "4"^^xsd:string .

    <http://example.com/resource/4> ns7:Country "Brunei Darussalam"^^xsd:string ;
        ns7:Int "83,513"^^xsd:string ;
        ns7:Rank "5"^^xsd:string .

    <http://example.com/resource/5> ns7:Country "Kuwait"^^xsd:string ;
        ns7:Int "72,675"^^xsd:string ;
        ns7:Rank "6"^^xsd:string .

    <http://example.com/resource/6> ns7:Country "Ireland"^^xsd:string ;
        ns7:Int "72,524"^^xsd:string ;
        ns7:Rank "7"^^xsd:string .

    <http://example.com/resource/7> ns7:Country "Norway"^^xsd:string ;
        ns7:Int "70,645"^^xsd:string ;
        ns7:Rank "8"^^xsd:string .
}

What does this mean?

  • Everything in https://iisg.amsterdam/imf_gdppc/provenance/48422b27/2018-11-14T10:59 is the RDF representation of the CSVW JSON schema.

  • Everything in https://iisg.amsterdam/imf_gdppc/assertion/48422b27/2018-11-14T10:59 is the RDF representation of the CSV file.

    Since the global aboutUrl is set to {_row}, every row is represented in RDF as a resource with the base URI concatenated with the row number. The column names are used as predicates to relate the row resource to a string literal representation of the value of a cell in that row.

  • The graph ns5:db490c7-50c3-4ad6-b0df-d48fe3dfa984 is the default graph that contains the Nanopublication.

The Schema

The COW converter uses the CSWV standard syntax for defining mappings from CSV to RDF graphs. These mappings are all defined in the tableSchema dictionary. For a full reference of the things you can do, we refer to the CSV on the Web (CSVW) specification and in particular to the document on Generating RDF from Tabular Data on the Web.

Important: COW does not purport to implement the full CSVW specification, nor has it been tested against the official test suite. In fact, COW extends and deviates from the CSVW specification in several important ways.

We document the most important differences in the section below, and give a short overview of how schemas can be defined.

Differences and Extensions

  1. While CSVW allows only for simple references to values in a column using the curly-brackets syntax (e.g. {name} to refer to the value of the name column at the current row), COW interprets the strings containing these references in two ways:
  1. as Python Format Strings, and
  2. as Jinja2 Templates

This allows for very elaborate operations on row contents (e.g. containing conditionals, loops, and string operations.) [3].

  1. CSVW allows only to specify a single null value for a column; when the cell in that column is equal to the null value, it is ignored for RDF conversion. COW extends the CSVW treatment of null values in two ways:
  1. multiple potential null values for a column, expressed as a JSON list, and
  2. conditional on values in another column, as a JSON-LD list (using the @list keyword)
  1. COW allows the use of csvw:collectionUrl and csvw:schemeUrl on column specifications. This will automatically cast the value for valueUrl to a skos:Concept, and adds it to the collection or scheme respectively indicated by these urls using a skos:member or skos:inScheme predicate.
  2. By default COW skips cells that are empty (as per the CSVW specification), setting the csvw:parseOnEmpty attribute to true overrides this setting. This is useful when an empty cell has a specific meaning.
  3. Column specifications with a xsd:anyURI datatype are converted to proper URIs rather than Literals with the xsd:anyURI datatype. This allows for conditionally generating URIs across multiple namespaces using Jinja2 templates, see issue #13 .
  4. Column specifications in COW should have a JSON-LD style @id attribute. This ensures that all predicates generated through the conversion are linked back to the RDF representation of the CSVW JSON schema that informed the conversion.
  5. COW converts column names to valid Python dictionary keys. In general this means that spaces in column names will be replaced with underscores.
  6. For convenience, COW uses a default set of namespaces, specified in the src/converter/namespaces.yaml file, that will be used to interpret namespace prefix use in the JSON schema. Any namespace prefixes defined in the JSON schema will override the default ones.

Short Overview

A very simple tableSchema may have the following structure:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
"tableSchema": {
  "aboutUrl": "{_row}",
  "primaryKey": "Rank",
  "columns": [
    {
     "@id": "http://example.com/resource/imf_gdppc.csv/column/Rank",
     "dc:description": "Rank",
     "datatype": "string",
     "name": "Rank"
    }
  ]
}

For the conversion to RDF, only the aboutUrl and columns attributes are of importance.

aboutUrl

The aboutUrl attribute defines a template for all URIs that occur in the subject position of triples generated by the converter. It may appear in the tableSchema or in one of the columns. If defined in the tableSchema, it acts as a global template that may be overriden by individual columns.

We explain URL template expansion here.

columns

The columns array defines a schema for each column, and any additional virtual columns. The distinction between the two is important, as non-virtual columns must actually be present in the CSV (schema compliance) while virtual columns only instruct the conversion to RDF.

In the schema above, we state that the column identifiable with the name Rank specifies a literal value, with the datatype of string (a shorthand for xsd:string). The titles array gives a number of alternative

Column Attributes

Every column is a dictionary that may have the following attributes.

Attributes usable in column specifications
Attribute Explanation
name Specifies the column to which this column specification applies. If no propertyUrl is defined on the column, the value for name will be used to generate the URL for the predicate position of the triple generated.
virtual If set to true, the column specification is not taken into account when validating a CSV file against this schema.
aboutUrl Overrides the global aboutUrl template defined for the schema. This template will be used to generate the subject URL of the triple.
valueUrl If present, this template will be used to generate the object URL of the triple. Otherwise, the value for name is used to retrieve the value for that cell, to generate a URL.
datatype Specifies that this column should result in a triple where the object is a Literal with the datatype specified here (for common XML Schema datatypes, it is possible to drop the xsd: prefix). The value of the literal is then the value of the cell in this row indicated by the value of name. Special case: when the datatype is xsd:anyURI COW creates a URI rather than a literal value.
csvw:value Specifies that this column should result in a triple where the object is a Literal with the default xsd:string datatype (unless otherwise specified in the datatype attribute). The literal value for this cell is determined by applying the ref::template expansion <template-expansion> rule to this row. Can only be used in virtual columns.
csvw:parseOnEmpty When set to true, specifies that this column should be processed even when the cell corresponding to this column in this row is empty.
null Specifies that this template does not apply if the cell in this column in this row corresponds to the value specified here. Can take a single value (as per the CSVW spec) or an array of values.
lang Specifies the language tag for the literal in the object position, but only if the datatype is set to be string.
collectionUrl Specifies that the valueUrl (or equivalent) should be of type skos:Concept and that it is a skos:member of the URL generated by applying the collectionUrl template.
schemeUrl Specifies that the valueUrl (or equivalent) should be of type skos:Concept and that it is skos:inScheme the URL generated by applying the schemeUrl template.

Template Expansion

When a CSV file is processed, COW does this row by row in the file, producing a dictionary where key/value pairs correspond to column headers and the value of the cell. So for:

Rank;Country;Int
1;Qatar;131063

the first row becomes [5]

row = {'Rank': 1, 'Country': 'Qatar', 'Int': 131063}

For each row, COW then applies each column definition in the columns array (i.e. not for each column in the CSV file).

The URL templates in the attributes aboutUrl, propertyUrl, valueUrl, and the regular template in the csvw:value are used to generate URLs and Literal values from the values of the cells in a specific row.

The values for the URL templates that the parser receives are interpreted as URLs. This means that they are expanded relative to the @base URI of the CSVW JSON schema file, unless they are explicitly preceded by a defined namespace prefix.

The names of Jinja2 or Python formatting field names should correspond to the keys of the dictionary (i.e. to the column names). COW supports a special CSVW field name _row that inserts the row number. This means that our row now becomes:

row = {'Rank': 1, 'Country': 'Qatar', 'Int': 131063, '_row': 1}

COW always first applies the Jinja2 template, and then the Python formatting.

For instance (assuming a @base of http://example.com/), we define an aboutUrl with the special _row key as a Python string formatting field name, and Country as a Jinja2 field name:

"aboutUrl": "{_row}/{{Country}}"

the JSON-LD parser interprets the value for aboutUrl as the following URI:

"http://example.com/{_row}/{{Country}}"

we then apply the Jinja2 formatting (Template("http://example.com/{_row}{{Country}}").render(**row)):

"http://example.com/{_row}/Qatar"

followed by the Python formatting ("http://example.com/{_row}/{{Country}}".format(**row)):

"http://example.com/1/Qatar"

For csvw:value attributes this works similarly, with the exception that the JSON-LD parser will not interpret these fields as URIs:

"csvw:value": "{_row}/{{Country}}"

is parsed as:

"{_row}/{{Country}}"

This means that one can use Jinja2 conditional formatting on csvw:value atributes in combination with an xsd:anyURI value for datatype to generate custom URIs that do not fit within a defined namespace.

Jinja2 is a very expressive templating language. To give a small example, we could define a virtual column that allows us to specify whether a country is http://example.com/rich or http://example.com/poor depending on whether the GDP is over 100k.

Our virtual column may look as follows:

1
2
3
4
5
6
{
  "virtual": "true",
  "aboutUrl": "{Country}",
  "propertyUrl": "rdf:type",
  "valueUrl": "{% if Int > 100000 %}rich{% else %}poor{% endif %}"
}

This will produce, for Qatar and Singapore, the respective triples:

<http://example.com/Qatar>     rdf:type <http://example.com/rich> .
<http://example.com/Singapore> rdf:type <http://example.com/poor> .

FAQ: Frequently Asked Questions

What’s the general structure of the resulting .csv-schema.json file?
  • block 1 contains information on the csv file such as delimiter and encoding
  • block 2 provides information on the licensing
  • block 3 provides information on the publisher (you)
  • block 4 contains the reference to the file, your base URI, language and short-hand uri definitions, and finally some provenance info, such as date first created
  • block 5 starts with tableSchema. This is the section that you will want to focus on, as this is where you’ll describe your variables in RDF. Note that the aboutUrl just under tableSchema is the default aboutUrl.
After I build my file I get all kinds of @id rows. What should I do with those?
The purpose of these rows is that they identify the schema instruction itself. This way we can explicitly refer generated RDF back to the part of the schema that produced it.
How can I expand the description of my variable in RDF?

You can add information by adding a ‘virtual block’. For example, suppose we started out with the block above, we could expand on it in the follow way:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
{
 "datatype": "string",
 "titles": [
 "institution"
 ],
 "name": "institution",
 "dc:description": "The research institute providing the study"
 },
 {
 "virtual": true,
 "propertyUrl": "foaf:Organization",
 "valueUrl": "institute/{institution}"
}

If our base url would be http://example.com, it would create a series of URI’s representing all the different values (institutes) in the dataset, like so: http://example.com/institute/Institute4AwesomeResearch, http://example.com/institute/StarWarsAcademy, etc..

My variable contains web addresses (URLs). How can I have them as proper addresses rather than strings?
Change the datatype to: xsd:anyURI
Can I assign values conditionally, i.e. use an if-statement?

Indeed you can. For example, HISCO codes ought to be 5 digits, but sometimes the first digit is a zero and then gets ‘lost’. So we want to add that zero back, when HISCO has less than 5 digits. However, we don’t want to add a zero, to the codes indicating missing HISCO’s: (-1, -2 (and -3 in the Dutch Historical Sample version of HISCO)). Ergo:

"aboutUrl": "hisco:{% if not HISCO in ['-1','-3','-2']%}{HISCO:0>5}{% else %}{HISCO}{% endif %}"

Also see the section Commonly used Template Formatting

What does ValueError: Expecting , delimiter: line 161 column 5 (char 5716) mean?
This might happen when you use the csvw-tool to convert a dataset. It indicates that you probably forgot a comma (or other delimiter) in the line before.
What does ValueError:"No JSON object could be decoded" mean?
In general it means that the Json schema is not syntactically valid. This could mean anything from missing commas, brackets, closing quotes, incorrect string quotes or invalid structure. E.g. this error occurs when your second to final } is followed by a comma.
When converting, I get the following error: "Exception: Could not find source file or necessary metadata file in path..."
It’s likely your trying to convert the .csv-metadata.json file rather than the .csv file itself.
I’ve added a language tag using "lang": "whatever_language", but why doesn’t show @whatever_language in the n-quads file?
The "lang": part should be placed directly after the specification of the skos:pref/hidden/altLabel.

Commonly used Template Formatting

  • Leading zeroes: {variable:0>N}, where N is the number of digits to fill up to.
  • If-else statements: {% if conditional_variable=="something" %} value_if {% else %} value_else {% endif %}.
  • Convert to string and concatenate: {{variable ~ 'string'}}, e.g. if variable has value “Hello” then the result would be “Hello string”. Note the double braces.
  • Arithmetic: use double braces and cast as numeric first, e.g. {{variable|float() * 1000}}.
  • Lowercase, uppercase, etc.: {{variable|lower()}}`. Note the double brace.
  • String slices: {{variable[n:m]}} as described here.

API Documentation

Indices and tables

Footnotes

Footnotes

[1]COW: **C**SV **O**n the **W**eb.
[2]These instructions use virtualenv but you can also install all packages globally, or use an alternative such as conda.
[3]In the future we may enable the Jinja2 plugin mechanism. This will allow running custom Python functions as filters over values.
[4]https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28PPP%29_per_capita
[5]Assuming that you have the proper locale settings that instructs Python to interpret the comma as a thousands separator.