Module validate_elter

Intro

Validactylus is a small Python command line tool to validate CSV data against a ruleset laid down as a JSON schema.

Its entry point is the module validate_elter, run from the command line. It takes a CSV data file, fetches a topic-specific schema and a shared schema from an eLTER schema store, and validates the CSV data line by line against these schemas.

Validation errors – if any – are returned as a JSON array of error objects.
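
Because the result is a plain JSON array, it can be post-processed with any JSON parser. The following is a minimal sketch that groups messages by CSV line. The helper name `group_errors_by_line` and the sample string are illustrative assumptions; only the keys "line", "path" and "message" are taken from the tool's sample output.

```python
import json
from collections import defaultdict


def group_errors_by_line(raw_output):
    """Group validation messages by CSV line number (hypothetical helper)."""
    grouped = defaultdict(list)
    for error in json.loads(raw_output):
        grouped[error["line"]].append(f'{error["path"]}: {error["message"]}')
    return dict(grouped)


# Illustrative output string, shaped like the error objects the tool prints.
raw = json.dumps([
    {"line": 1, "path": "LAT",
     "message": "'100' is not of type 'number'"},
    {"line": 3, "path": "SITE_CODE",
     "message": "'qwer' does not match '^https://deims.org/'"},
])

for line, messages in group_errors_by_line(raw).items():
    print(line, messages)
```

An empty array (`[]`) means that no violations were found, in which case the helper returns an empty dictionary.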

Example

Windows

py -m validate_elter path/to/data.csv -r station -rs shared

Linux

python validate_elter.py path/to/data.csv -r station -rs shared

(make sure to specify the full path to validate_elter.py)

Sample output (truncated)

[{"line": 1, "path": "SITE_CODE", "message": "'qwer' does not match 
    '^https://deims.org/[a-zA-Z0-9]{8}-([a-zA-Z0-9]{4}-){3}[a-zA-Z0-9]{12}$'"},
    {"line": 1, "path": "LAT", "message": "'100' is not of type 'number'"},

Here, -r selects the topic-specific ruleset (in this example: for a station description) and -rs the shared rules (common to several topics, e. g. longitude). The rulesets can be given as names ("station"), file names ("station.json") or full URLs (currently: "https://raw.githubusercontent.com/eLTER-RI/elter-ci-schemas/main/schemas/station.json").
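
The three accepted spellings can be reduced to one URL with the standard library alone. The sketch below mirrors the approach used in the module source (strip a trailing ".json", re-append it, then join with the base URL); the function name `expand_ruleset` and the example.org URL are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Base URL as given in the text above; it may change in the future.
SCHEMA_BASE_URL = ("https://raw.githubusercontent.com/eLTER-RI/"
                   "elter-ci-schemas/main/schemas/")


def expand_ruleset(fragment, base_url=SCHEMA_BASE_URL):
    """Turn "station", "station.json" or a full URL into a schema URL."""
    # Strip any trailing ".json" first so the suffix is appended exactly once.
    name = re.sub(r"(\.json)+$", "", fragment)
    # urljoin leaves absolute URLs untouched and resolves bare names
    # against the base URL.
    return urljoin(base_url, name + ".json")


print(expand_ruleset("station"))
print(expand_ruleset("station.json"))
print(expand_ruleset("https://www.example.org/schemas/foo.json"))
```

The first two calls yield the same URL below the base; the third returns the absolute URL unchanged.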

"""
# Intro

<img src = "../assets/validactylus.svg" width = 200>

Validactylus is a small Python **command line tool** to validate CSV data
against a ruleset laid down as a [JSON schema](https://json-schema.org/).

Its entry point is the module `validate_elter`, run from the command line.
It takes a CSV data file, fetches a topic-specific schema and a shared schema
from an eLTER schema store, and validates the CSV data line by line against
these schemas.

Validation errors -- if any -- are returned as a JSON array of error objects.

**Example**


Windows
```
py -m validate_elter path/to/data.csv -r station -rs shared
```

Linux
```
python validate_elter.py path/to/data.csv -r station -rs shared
```

*(make sure to specify the full path to `validate_elter.py`)*

**Sample output** (truncated)

```
[{"line": 1, "path": "SITE_CODE", "message": "'qwer' does not match 
    '^https://deims.org/[a-zA-Z0-9]{8}-([a-zA-Z0-9]{4}-){3}[a-zA-Z0-9]{12}$'"},
    {"line": 1, "path": "LAT", "message": "'100' is not of type 'number'"},
```    






Here, `-r` selects the topic-specific ruleset (in this example: for a station
description) and `-rs` the shared rules (common to several topics, e. g.
longitude).
The rulesets can be given as names ("station"), file names ("station.json") or
full URLs (currently: 
"https://raw.githubusercontent.com/eLTER-RI/elter-ci-schemas/main/schemas/station.json").

"""

# render html docs with pdoc (from /src folder):
# python -m pdoc validate_elter.py --html -o ../docs --force

import argparse
import jsonschema
import requests
import re
import csv
import json
from urllib.parse import urljoin, quote
import referencing.jsonschema # for in-memory registration of schemas






def get_remote_schemas(url_schema_topic,
                       url_schema_shared):
    """
    Retrieve validation schemas from a remote schema store (e. g. a dedicated
    GitHub repo).
    Raises a ValueError unless both schemas are retrieved
    with an HTTP status of 200.
    
    Parameters
    ----------
    
    url_schema_topic : str
        full URL of topic-specific schema (e. g. site description)
        
    url_schema_shared : str
        full URL of shared schema (containing common 
        definitions shared by several topic-specific schemas)

    Returns
    -------
    rs : dict
        the parsed schemas under the keys "schema_topic" and "schema_shared"
    """

    max_waiting = 5  # request timeout in seconds
    rs = {
        "schema_topic" : requests.get(url_schema_topic,
                                      timeout = max_waiting),
        "schema_shared" : requests.get(url_schema_shared,
                                       timeout = max_waiting)
    }

    if not all(v.status_code == 200 for v in rs.values()):
        raise ValueError("failed to retrieve "
                         f'"{url_schema_topic}" and/or '
                         f'"{url_schema_shared}"')

    # decode server byte response to UTF-8 and parse each schema from JSON:
    rs = {k: json.loads(v.content.decode("UTF-8")) for k, v in rs.items()}

    return rs


def register_schemas(schema_topic, schema_shared,
                     spec = referencing.jsonschema.DRAFT202012):
    """
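    Registers schemas (JSON objects written in JSON Schema)
    locally in a Registry object.
    
    This is necessary to allow references from one schema to another,
    e. g. a topic-specific schema for site description referring to 
    common definitions (like latitude or site code format) stored in a
    shared schema.
    
    Parameters
    ----------
    schema_topic : Dict
        topic-specific schema, parsed (`json.loads`) from a remote JSON
        resource
    schema_shared : Dict
        schema with shared definitions, parsed (`json.loads`) from a remote
        JSON resource
    spec : referencing.Specification
        JSON Schema dialect of both schemas, default: draft 2020-12

    Returns
    -------
    registry : referencing.Registry
        a registry holding both schemas under the URIs
        "https://example.com/schema_topic" and
        "https://example.com/schema_shared"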
    """

    schema_topic_resource = (
        referencing.Resource(contents = schema_topic,
                             specification = spec)
        )
    schema_shared_resource = (
        referencing.Resource(contents = schema_shared,
                             specification = spec)
        )

    return referencing.Registry().with_resources([
        ("https://example.com/schema_topic", schema_topic_resource),
        ("https://example.com/schema_shared", schema_shared_resource)
    ])


def get_validator(schema_topic, registry):
    """
    Build a validator for the topic-specific schema.

    Parameters
    ----------
    schema_topic : Dict
        topic-specific schema, parsed (`json.loads`) from a JSON string
    registry : referencing.Registry
        registry holding the topic-specific and the shared schema, as
        returned by `register_schemas`
    
    Returns
    -------
    validator : jsonschema.Draft202012Validator
        a validator which uses the schema "schema_topic" including references
        to a common schema "schema_shared" stored in the schema
        registry. This validator can be used to validate an instance (=data
        to be checked) like so: `validate_file(file_path, validator)`.  
               
    """
    # use the schema passed in as argument, not a module-level variable:
    validator = jsonschema.Draft202012Validator(
        schema = schema_topic,
        registry = registry)
    return validator
    

def validate_file(file_path, validator, delimiter = ";"):
    """
    Validate a CSV file row by row against a schema.

    Parameters
    ----------
    file_path : str
        full path to CSV file to be validated
    validator : jsonschema.Draft202012Validator
        a validator (package jsonschema) to check each CSV row against
    delimiter : str
        column separator of the CSV file, default: ";" (semicolon)

    Returns
    -------
    v_results : str
        A string describing a JSON array of error objects encountered during
        validation, each object consisting of CSV line number (data rows are
        counted from 1, the header row is not counted), invalid parameter
        and schema violation.

    """
    v_results = []
    with open(file_path, newline = "") as csv_data:
        reader = csv.DictReader(csv_data, delimiter = delimiter)
        for i, row in enumerate(reader, start = 1):
            # round-trip through JSON so that the row is a plain JSON object:
            instance = json.loads(json.dumps(row))
            v_results.extend([{"line" : i,
                               "path" : ','.join(str(p) for p in e.path),
                               "message" : e.message}
                              for e in validator.iter_errors(
                                      instance = instance)
                              ])
    v_results = json.dumps(v_results)
    return v_results


if __name__ == "__main__":

    # get centrally managed schemas here:
    schema_base_url = ("https://raw.githubusercontent.com/eLTER-RI/"
                       "elter-ci-schemas/main/schemas/")

    # currently (Apr. 2024) available topic schemas:
    topic_choices = ['data_mapping', 'data_observation', 'event', 'license',
                     'mapping', 'method', 'reference', 'sample', 'station']

    ## use JSONSchema version DRAFT202012:
    spec = referencing.jsonschema.DRAFT202012

    parser = argparse.ArgumentParser(prog = "elter_validate",
                                     description = "validate a CSV "
                                     "using JSON schema",
                                     epilog = "HTH")
    parser.add_argument("file_path", type = str, # positional (first) argument
                        help = "path to the CSV file to validate"
                        )

    parser.add_argument("-u", "--schema-base", type = str,
                        default = schema_base_url,
                        help = "base url for remote schemas,"
                               f" default: {schema_base_url}"
                        )

    # required: without a topic schema there is nothing to validate against
    parser.add_argument("-r", "--rules", type = str, required = True,
                        choices = topic_choices,
                        help = "name of a topic-specific schema")

    parser.add_argument("-rs", "--shared-rules", type = str,
                        default = "shared",
                        help = ("name of a schema with definitions shared by"
                                " topic schemas, default: \"shared\""))

    parser.add_argument("-delim", type = str, default = ";",
                        help = "column separator, default: \";\" (semicolon)")

    # command line arguments to dictionary "args":
    args = vars(parser.parse_args())

    # sanitize url paths to schemas
    args = {k: quote(v, safe = ":./_-") if re.search("rules", k) else v
            for k, v in args.items()}

    # expand schema name to full URL, whether supplied as foo, foo.json
    # or https://www.my_schemahost.org/schemas/foo.json;
    # the base URL honours the -u/--schema-base option:
    base_url = args["schema_base"]

    def expand_path(fragment):
        return urljoin(base_url,
                       re.sub("(\\.json)+$", "", fragment) + ".json")

    args = {k: expand_path(v) if re.search("rules", k) else v
            for k, v in args.items()}

    schemas = get_remote_schemas(args["rules"], args["shared_rules"])
    registry = register_schemas(schemas["schema_topic"],
                                schemas["schema_shared"],
                                spec)

    validator = get_validator(schemas["schema_topic"], registry)
    result = validate_file(args["file_path"], validator,
                           delimiter = args["delim"])
    print(result)
    



        
         
    

Functions

def get_remote_schemas(url_schema_topic, url_schema_shared)

Retrieve validation schemas from a remote schema store (e. g. a dedicated GitHub repo). Raises a ValueError unless both schemas are retrieved with an HTTP status of 200.

Parameters

url_schema_topic : str
full URL of topic-specific schema (e. g. site description)
url_schema_shared : str
full URL of shared schema (containing common definitions shared by several topic-specific schemas)
def get_remote_schemas(url_schema_topic,
                       url_schema_shared):
    """
    Retrieve validation schemas from a remote schema store (e. g. a dedicated
    GitHub repo).
    Raises a ValueError unless both schemas are retrieved
    with an HTTP status of 200.
    
    Parameters
    ----------
    
    url_schema_topic : str
        full URL of topic-specific schema (e. g. site description)
        
    url_schema_shared : str
        full URL of shared schema (containing common 
        definitions shared by several topic-specific schemas)

    Returns
    -------
    rs : dict
        the parsed schemas under the keys "schema_topic" and "schema_shared"
    """

    max_waiting = 5  # request timeout in seconds
    rs = {
        "schema_topic" : requests.get(url_schema_topic,
                                      timeout = max_waiting),
        "schema_shared" : requests.get(url_schema_shared,
                                       timeout = max_waiting)
    }

    if not all(v.status_code == 200 for v in rs.values()):
        raise ValueError("failed to retrieve "
                         f'"{url_schema_topic}" and/or '
                         f'"{url_schema_shared}"')

    # decode server byte response to UTF-8 and parse each schema from JSON:
    rs = {k: json.loads(v.content.decode("UTF-8")) for k, v in rs.items()}

    return rs
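
The "all or nothing" retrieval can be tried without network access by passing the fetching step in as an argument. This is only a sketch under stated assumptions: `load_schemas`, `fetch_with_requests`, the `fake_store` dictionary and the example.org URLs are hypothetical and not part of the module. Unlike the function above, the sketch names the URL that failed.

```python
import json


def load_schemas(urls, fetch):
    """
    urls  : dict mapping a schema key to its full URL
    fetch : callable taking a URL and returning (status_code, body_bytes)
    """
    responses = {key: fetch(url) for key, url in urls.items()}
    failed = [urls[key] for key, (status, _) in responses.items()
              if status != 200]
    if failed:
        raise ValueError("failed to retrieve: " + ", ".join(failed))
    return {key: json.loads(body.decode("utf-8"))
            for key, (_, body) in responses.items()}


def fetch_with_requests(url, timeout=5):
    """Network fetcher; imports requests lazily so the sketch runs offline."""
    import requests
    response = requests.get(url, timeout=timeout)
    return response.status_code, response.content


# Offline stand-in for a schema store (hypothetical URLs and contents).
fake_store = {
    "https://example.org/station.json": (200, b'{"type": "object"}'),
    "https://example.org/shared.json": (404, b""),
}

try:
    load_schemas({"schema_topic": "https://example.org/station.json",
                  "schema_shared": "https://example.org/shared.json"},
                 fetch=fake_store.get)
except ValueError as err:
    print(err)
```

Against a real schema store, `fetch=fetch_with_requests` would take the place of `fake_store.get`.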
def get_validator(schema_topic, registry)

Parameters

schema_topic : Dict
topic-specific schema, loads'ed from JSON string
registry : referencing.Registry
registry holding the topic-specific and the shared schema, as returned by register_schemas

Returns

validator : jsonschema.Validator
a validator which uses the schema "schema_topic" including references to a common schema "schema_shared" stored in the schema registry. This validator can be used to validate an instance (=data to be checked) like so: validate_file(file_path, validator).
def get_validator(schema_topic, registry):
    """
    Build a validator for the topic-specific schema.

    Parameters
    ----------
    schema_topic : Dict
        topic-specific schema, parsed (`json.loads`) from a JSON string
    registry : referencing.Registry
        registry holding the topic-specific and the shared schema, as
        returned by `register_schemas`
    
    Returns
    -------
    validator : jsonschema.Draft202012Validator
        a validator which uses the schema "schema_topic" including references
        to a common schema "schema_shared" stored in the schema
        registry. This validator can be used to validate an instance (=data
        to be checked) like so: `validate_file(file_path, validator)`.  
               
    """
    # use the schema passed in as argument, not a module-level variable:
    validator = jsonschema.Draft202012Validator(
        schema = schema_topic,
        registry = registry)
    return validator
def register_schemas(schema_topic, schema_shared, spec=<Specification name='draft2020-12'>)

Registers schemas (JSON objects written in JSON Schema) locally in a Registry object.

This is necessary to allow references from one schema to another, e. g. a topic-specific schema for site description referring to common definitions (like latitude or site code format) stored in a shared schema.

Parameters

schema_topic : Dict
topic-specific schema, loads'ed from remote JSON resource
schema_shared : Dict
schema with shared definitions, loads'ed from remote JSON resource

def register_schemas(schema_topic, schema_shared,
                     spec = referencing.jsonschema.DRAFT202012):
    """
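    Registers schemas (JSON objects written in JSON Schema)
    locally in a Registry object.
    
    This is necessary to allow references from one schema to another,
    e. g. a topic-specific schema for site description referring to 
    common definitions (like latitude or site code format) stored in a
    shared schema.
    
    Parameters
    ----------
    schema_topic : Dict
        topic-specific schema, parsed (`json.loads`) from a remote JSON
        resource
    schema_shared : Dict
        schema with shared definitions, parsed (`json.loads`) from a remote
        JSON resource
    spec : referencing.Specification
        JSON Schema dialect of both schemas, default: draft 2020-12

    Returns
    -------
    registry : referencing.Registry
        a registry holding both schemas under the URIs
        "https://example.com/schema_topic" and
        "https://example.com/schema_shared"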
    """

    schema_topic_resource = (
        referencing.Resource(contents = schema_topic,
                             specification = spec)
        )
    schema_shared_resource = (
        referencing.Resource(contents = schema_shared,
                             specification = spec)
        )

    return referencing.Registry().with_resources([
        ("https://example.com/schema_topic", schema_topic_resource),
        ("https://example.com/schema_shared", schema_shared_resource)
    ])
def validate_file(file_path, validator, delimiter=';')

Validate a CSV file row by row against a schema.

Parameters

file_path : str
full path to CSV file to be validated
validator : jsonschema.Draft202012Validator
a validator (package jsonschema) to check each CSV row against
delimiter : str
column separator of the CSV file, default: ";" (semicolon)

Returns

v_results : str
A string describing a JSON array of error objects encountered during validation, each object consisting of CSV line number (data rows are counted from 1, the header row is not counted), invalid parameter and schema violation.

def validate_file(file_path, validator, delimiter = ";"):
    """
    Validate a CSV file row by row against a schema.

    Parameters
    ----------
    file_path : str
        full path to CSV file to be validated
    validator : jsonschema.Draft202012Validator
        a validator (package jsonschema) to check each CSV row against
    delimiter : str
        column separator of the CSV file, default: ";" (semicolon)

    Returns
    -------
    v_results : str
        A string describing a JSON array of error objects encountered during
        validation, each object consisting of CSV line number (data rows are
        counted from 1, the header row is not counted), invalid parameter
        and schema violation.

    """
    v_results = []
    with open(file_path, newline = "") as csv_data:
        reader = csv.DictReader(csv_data, delimiter = delimiter)
        for i, row in enumerate(reader, start = 1):
            # round-trip through JSON so that the row is a plain JSON object:
            instance = json.loads(json.dumps(row))
            v_results.extend([{"line" : i,
                               "path" : ','.join(str(p) for p in e.path),
                               "message" : e.message}
                              for e in validator.iter_errors(
                                      instance = instance)
                              ])
    v_results = json.dumps(v_results)
    return v_results