ADDED tests/test_docs_v0.1.md
Index: tests/test_docs_v0.1.md
==================================================================
--- /dev/null
+++ tests/test_docs_v0.1.md
@@ -0,0 +1,69 @@
+This document describes a **single accompanying Python script** that illustrates how you might validate your CSV data by loading it into an **in-memory SQLite** database and performing the various checks (File Integrity, Schema-Level, Key Constraints, Logic Validations, Data Quality, etc.). The script is somewhat verbose for clarity. In practice, you'd tailor it to exactly match your real CSV paths, column names, constraints, and business rules.
+
+You can run the script from the command line, providing paths to your CSV files. It handles:
+
+1. **Parsing CSV** (verifies row length).
+2. **Creating SQLite tables** with constraints.
+3. **Loading CSV data** into those tables.
+4. **Running validation queries** (duplicate primary keys, foreign key mismatches, domain checks, etc.).
+5. **Reporting** results and exiting with success/failure.
+
+---
+
+## How The Script Works
+
+1. **Command-Line Arguments**
+   - `--spatial`, `--linguistic`, `--temporal`, `--sources` specify paths to each CSV.
+
+2. **CSV Format Check**
+   - `check_csv_format(...)` ensures each row has the expected number of columns. If there's a mismatch, the script reports the errors and stops.
+
+3. **In-Memory SQLite**
+   - We open `sqlite3.connect(":memory:")` so everything runs in RAM. When the script ends, the database disappears (ephemeral).
+
+4. **Table Creation**
+   - `create_tables(conn)` sets up the schema. You'll see placeholders for constraints like foreign keys. SQLite requires `PRAGMA foreign_keys = ON;` if you want foreign key enforcement. You can add that right after connecting, e.g.:
+     ```python
+     conn.execute("PRAGMA foreign_keys = ON;")
+     ```
+   - Note that some constraints (e.g., composite foreign keys or advanced checks) may require custom triggers in SQLite.
+
+5. **Loading CSVs**
+   - `load_csv_into_table(...)` uses a generic approach to read each CSV and insert rows. By default, it skips the header row (with `next(reader, None)`). Adjust if your CSVs don't have headers.
+
+6. **Validation Tests**
+   - `run_tests(conn)` executes queries covering the checks this project needs:
+
+   - **Schema-Level & Key Constraints**
+     - Duplicate primary key checks in Spatial, Linguistic, etc.
+     - Foreign key checks (e.g., Linguistic.SPATID must exist in Spatial).
+
+   - **Logic-Based Validations**
+     - `Temporal` data must obey rules about `EVENT`, `OBJID`, `OBJNAME`, `START < END`, etc.
+
+   - **Data Quality**
+     - Checking coordinate ranges in `Spatial`.
+     - Allowed language codes.
+     - Allowed `EVENT` values.
+
+7. **Error Handling**
+   - The script accumulates all errors, then prints them at the end and exits with status code 1 if there are any. A zero exit code means success.
+
+8. **Execution**
+   - Example usage:
+     ```bash
+     python validate_csvs.py \
+       --spatial /path/to/Spatial.csv \
+       --linguistic /path/to/Linguistic.csv \
+       --temporal /path/to/Temporal.csv \
+       --sources /path/to/Sources.csv
+     ```
+
+---
+
+## Customizing the Script
+
+- **Columns & Orders**: Make sure the lists in `load_csv_into_table` precisely match your CSV layout and header-skipping logic.
+- **Constraints**: If you need more advanced constraints (e.g., composite foreign keys or complex CHECK constraints), you can add them when creating tables. SQLite is somewhat limited with multi-column foreign keys but supports them with the right syntax or triggers, as sketched below.
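+  A minimal sketch of the multi-column case (the index name `idx_ling_id_toponym` is illustrative, and enforcement also requires `PRAGMA foreign_keys = ON;`):
+
+  ```python
+  # SQLite only accepts a composite FK whose parent columns carry a UNIQUE index:
+  conn.execute("CREATE UNIQUE INDEX idx_ling_id_toponym ON Linguistic(ID, TOPONYM);")
+  # Then, inside CREATE TABLE Temporal (...), you could declare:
+  #   FOREIGN KEY (LINGID, LINGNAME) REFERENCES Linguistic(ID, TOPONYM)
+  ```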
+- **Extra Validation**: Add or remove queries in `run_tests` for your domain-specific logic.
+- **Performance**: If your CSVs are large, you may want to optimize with `executemany()` or `PRAGMA` tweaks. For moderate data sizes, the above approach is fine.

ADDED tests/test_v0.1.py
Index: tests/test_v0.1.py
==================================================================
--- /dev/null
+++ tests/test_v0.1.py
@@ -0,0 +1,422 @@
+#!/usr/bin/env python3
+import argparse
+import csv
+import sqlite3
+import sys
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Validate toponymical CSV files by loading into SQLite and checking constraints."
+    )
+    parser.add_argument("--spatial", required=True, help="Path to the Spatial CSV file.")
+    parser.add_argument("--linguistic", required=True, help="Path to the Linguistic CSV file.")
+    parser.add_argument("--temporal", required=True, help="Path to the Temporal CSV file.")
+    parser.add_argument("--sources", required=True, help="Path to the Sources CSV file.")
+    return parser.parse_args()
+
+def check_csv_format(csv_path, expected_num_columns, delimiter=';'):
+    """
+    Basic file integrity & parsing check:
+    - Ensures each row has the expected number of columns.
+    - Reads the entire file to confirm there are no parse errors (e.g., unbalanced quotes).
+    """
+    errors = []
+    with open(csv_path, mode="r", encoding="utf-8-sig") as f:
+        reader = csv.reader(f, delimiter=delimiter)
+        for line_num, row in enumerate(reader, start=1):
+            if len(row) != expected_num_columns:
+                errors.append(f"Row {line_num} in {csv_path} has {len(row)} columns; expected {expected_num_columns}.")
+    return errors
+
+def create_tables(conn):
+    """
+    Creates tables in SQLite with constraints approximating your documentation.
+    Some constraints are added as CHECK or references. Adapt as needed.
+    """
+    cur = conn.cursor()
+
+    # Spatial Table
+    cur.execute("""
+        CREATE TABLE IF NOT EXISTS Spatial (
+            ID INTEGER PRIMARY KEY,
+            LAT REAL,
+            LON REAL,
+            OFFNAME TEXT NOT NULL,
+            LANG TEXT NOT NULL,
+            CLASS TEXT,
+            TYPE TEXT,
+            DISTRICT TEXT,
+            DOUBT INTEGER,
+            LANDMARK TEXT,
+            COMMENTS TEXT,
+            OTHER TEXT
+            -- Additional constraints or indexes if desired
+        );
+    """)
+
+    # Linguistic Table
+    cur.execute("""
+        CREATE TABLE IF NOT EXISTS Linguistic (
+            ID INTEGER PRIMARY KEY,
+            SPATID INTEGER,   -- references Spatial(ID)
+            DOUSPAT INTEGER,
+            MAINID INTEGER,   -- references Linguistic(ID)
+            TOPONYM TEXT NOT NULL,
+            TOPFORMS TEXT,
+            DOUTOPO INTEGER,
+            LANG TEXT NOT NULL,
+            DOULANG INTEGER,
+            PRONUNC TEXT,
+            DOUPRON INTEGER,
+            ETYM TEXT,
+            ORIGIN TEXT,
+            COMMENTS TEXT,
+            OTHER TEXT,
+            FOREIGN KEY (SPATID) REFERENCES Spatial(ID)
+                ON DELETE CASCADE ON UPDATE CASCADE,
+            FOREIGN KEY (MAINID) REFERENCES Linguistic(ID)
+                ON DELETE CASCADE ON UPDATE CASCADE
+        );
+    """)
+
+    # Temporal Table
+    # Using a composite key approach is trickier in SQLite, so we replicate that logic in run_tests().
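+    # A trigger-based sketch for enforcing the (LINGID, LINGNAME) pair at
+    # insert time (the trigger name is illustrative; left commented out
+    # because run_tests() performs the same check):
+    #
+    #   CREATE TRIGGER temporal_lingname_check
+    #   BEFORE INSERT ON Temporal
+    #   WHEN NEW.LINGID IS NOT NULL
+    #   BEGIN
+    #       SELECT RAISE(ABORT, 'LINGNAME does not match Linguistic.TOPONYM')
+    #       WHERE NOT EXISTS (
+    #           SELECT 1 FROM Linguistic
+    #           WHERE ID = NEW.LINGID AND TOPONYM = NEW.LINGNAME
+    #       );
+    #   END;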
+    cur.execute("""
+        CREATE TABLE IF NOT EXISTS Temporal (
+            ID INTEGER PRIMARY KEY,
+            LINGID INTEGER,   -- references Linguistic(ID)
+            LINGNAME TEXT,    -- references Linguistic(TOPONYM) for the same ID
+            START INTEGER NOT NULL,
+            DOUSTART INTEGER,
+            END INTEGER,
+            DOUEND INTEGER,
+            EVENT TEXT NOT NULL,
+            OBJID INTEGER,
+            OBJNAME TEXT,
+            COMMENTS TEXT,
+            OTHER TEXT,
+            FULLTEXT TEXT NOT NULL,
+            FOREIGN KEY (LINGID) REFERENCES Linguistic(ID)
+                ON DELETE CASCADE ON UPDATE CASCADE
+        );
+    """)
+
+    # Sources Table
+    cur.execute("""
+        CREATE TABLE IF NOT EXISTS Sources (
+            ID TEXT PRIMARY KEY,
+            TYPE TEXT NOT NULL,
+            AUTHOR TEXT,
+            TITLE TEXT NOT NULL,
+            YEAR INTEGER,
+            PUBLISHER TEXT,
+            CITATION TEXT,
+            COMMENTS TEXT,
+            PDF TEXT,
+            OTHER TEXT
+        );
+    """)
+
+    conn.commit()
+
+def load_csv_into_table(conn, csv_path, table_name, columns, delimiter=';'):
+    """
+    Generic CSV loader.
+    'columns' is a list of the column names (in the same order as in the CSV).
+    Example:
+        columns = ["ID","LAT","LON","OFFNAME",...]
+    """
+    cur = conn.cursor()
+
+    # Build placeholders for the INSERT statement, e.g. 5 columns -> (?, ?, ?, ?, ?)
+    placeholders = ", ".join(["?"] * len(columns))
+    col_names = ", ".join(columns)
+    insert_sql = f"INSERT INTO {table_name} ({col_names}) VALUES ({placeholders})"
+
+    with open(csv_path, mode="r", encoding="utf-8-sig") as f:
+        reader = csv.reader(f, delimiter=delimiter)
+        next(reader, None)  # skip the header row; remove this if your CSVs have no headers
+        for line_num, row in enumerate(reader, start=2):
+            # Treat empty CSV fields as NULL so the IS NULL / IS NOT NULL checks
+            # in run_tests() behave as intended. Adjust casting if needed
+            # (e.g., int(row[0]) for ID).
+            row = [value if value != "" else None for value in row]
+            try:
+                cur.execute(insert_sql, row)
+            except sqlite3.IntegrityError as exc:
+                # NOT NULL violations and duplicate primary keys surface here;
+                # report them with row context instead of a raw traceback.
+                print(f"Load error in {csv_path}, line {line_num}: {exc}")
+                sys.exit(1)
+
+    conn.commit()
+
+def run_tests(conn):
+    """
+    Runs a series of tests and returns a list of error messages (if any).
+    If there are no errors, the list will be empty.
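+
+    Groups covered: schema-level & key constraints (duplicate IDs, foreign
+    keys, the LINGID/LINGNAME pairing), logic-based validations on Temporal
+    events, and data-quality checks (coordinate ranges, allowed codes).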
+    """
+    errors = []
+    c = conn.cursor()
+
+    # -------------------------------------------------
+    # 1) Schema-Level & Key Constraints Tests
+    #    Duplicate-ID checks and foreign key existence checks
+    # -------------------------------------------------
+
+    # 1A) Check for duplicate IDs in each table
+    for table in ("Spatial", "Linguistic", "Temporal", "Sources"):
+        c.execute(f"""
+            SELECT ID, COUNT(*) AS cnt
+            FROM {table}
+            GROUP BY ID
+            HAVING cnt > 1
+        """)
+        for row in c.fetchall():
+            errors.append(f"[{table}] Duplicate ID found: {row[0]}")
+
+    # 1B) Check foreign key: Linguistic.SPATID must exist in Spatial
+    c.execute("""
+        SELECT L.ID, L.SPATID
+        FROM Linguistic L
+        LEFT JOIN Spatial S ON L.SPATID = S.ID
+        WHERE L.SPATID IS NOT NULL
+          AND S.ID IS NULL
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Linguistic] SPATID={row[1]} not found in Spatial (Linguistic.ID={row[0]})")
+
+    # 1C) Check foreign key: Temporal.LINGID must exist in Linguistic
+    c.execute("""
+        SELECT T.ID, T.LINGID
+        FROM Temporal T
+        LEFT JOIN Linguistic L ON T.LINGID = L.ID
+        WHERE T.LINGID IS NOT NULL
+          AND L.ID IS NULL
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Temporal] LINGID={row[1]} not found in Linguistic (Temporal.ID={row[0]})")
+
+    # (Composite key check) For each row in Temporal,
+    # LINGNAME should match the TOPONYM in the row of Linguistic where ID = LINGID.
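+    # A true composite FK to Linguistic(ID, TOPONYM) would need a UNIQUE index
+    # over those parent columns, so we verify the pair with a LEFT JOIN instead.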
+    c.execute("""
+        SELECT T.ID, T.LINGID, T.LINGNAME, L.TOPONYM
+        FROM Temporal T
+        LEFT JOIN Linguistic L ON T.LINGID = L.ID
+        WHERE L.ID IS NOT NULL
+    """)
+    for temporal_id, lingid, lingname, top_name in c.fetchall():
+        if lingname and lingname != top_name:
+            errors.append(f"[Temporal] LINGNAME='{lingname}' does not match Linguistic.TOPONYM='{top_name}' for LINGID={lingid} (Temporal.ID={temporal_id})")
+
+    # -------------------------------------------------
+    # 2) Logic-Based Validations
+    # -------------------------------------------------
+
+    # 2A) In Temporal, if EVENT='ACTIVE' => END must be NULL
+    c.execute("""
+        SELECT ID, END, EVENT
+        FROM Temporal
+        WHERE EVENT='ACTIVE' AND END IS NOT NULL
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Temporal] EVENT=ACTIVE but END is not null for row ID={row[0]}")
+
+    # 2B) If EVENT='CEASE' => OBJID and OBJNAME must be empty
+    #     (empty CSV fields are loaded as NULL, so IS NOT NULL suffices)
+    c.execute("""
+        SELECT ID, OBJID, OBJNAME
+        FROM Temporal
+        WHERE EVENT='CEASE' AND (OBJID IS NOT NULL OR OBJNAME IS NOT NULL)
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Temporal] EVENT=CEASE but OBJID/OBJNAME present for ID={row[0]}")
+
+    # 2C) If EVENT in ('RENAME','MERGEIN') => OBJID/OBJNAME must not be empty
+    c.execute("""
+        SELECT ID, EVENT, OBJID, OBJNAME
+        FROM Temporal
+        WHERE EVENT IN ('RENAME','MERGEIN')
+          AND (OBJID IS NULL OR OBJNAME IS NULL)
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Temporal] EVENT={row[1]} but OBJID/OBJNAME is missing (Temporal.ID={row[0]})")
+
+    # 2D) Check year consistency in Temporal (START < END if both exist)
+    c.execute("""
+        SELECT ID, START, END
+        FROM Temporal
+        WHERE END IS NOT NULL AND START >= END
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Temporal] START >= END for ID={row[0]} => (START={row[1]}, END={row[2]})")
+
+    # -------------------------------------------------
+    # 3) Data Quality Checks (examples)
+    # -------------------------------------------------
+
+    # 3A) Check LAT/LON ranges in Spatial (NULL coordinates pass)
+    c.execute("""
+        SELECT ID, LAT, LON
+        FROM Spatial
+        WHERE (LAT < -90 OR LAT > 90 OR LON < -180 OR LON > 180)
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Spatial] ID={row[0]} has out-of-range coordinates (LAT={row[1]}, LON={row[2]})")
+
+    # 3B) Check EVENT in Temporal is among allowed values
+    c.execute("""
+        SELECT ID, EVENT
+        FROM Temporal
+        WHERE EVENT NOT IN ('MERGEIN', 'ACTIVE', 'RENAME', 'CEASE')
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Temporal] Invalid EVENT='{row[1]}' for ID={row[0]}")
+
+    # 3C) Check LANG codes in Spatial
+    allowed_langs = {'RUS','CHU','HIM','MEM','ERZ','TAT','UNK','OTH'}  # adapt if needed
+    c.execute("SELECT ID, LANG FROM Spatial")
+    for row in c.fetchall():
+        if row[1] not in allowed_langs:
+            errors.append(f"[Spatial] ID={row[0]} has unknown LANG code '{row[1]}'")
+
+    # 3D) Check LANG codes in Linguistic
+    c.execute("SELECT ID, LANG FROM Linguistic")
+    for row in c.fetchall():
+        if row[1] not in allowed_langs:
+            errors.append(f"[Linguistic] ID={row[0]} has unknown LANG code '{row[1]}'")
+
+    # Additional checks can be added similarly ...
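+    # For example, one more sketch (an assumption about the data -- remove
+    # this if DOUBT is not meant to be a 0/1 flag in your schema):
+    c.execute("""
+        SELECT ID, DOUBT
+        FROM Spatial
+        WHERE DOUBT IS NOT NULL AND DOUBT NOT IN (0, 1)
+    """)
+    for row in c.fetchall():
+        errors.append(f"[Spatial] ID={row[0]} has non-binary DOUBT value '{row[1]}'")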
+
+    return errors
+
+def main():
+    args = parse_args()
+
+    # --------------------------------------------------------------------------
+    # 0) Basic CSV integrity checks before loading into SQLite
+    #    We verify that each CSV has the expected number of columns.
+    #    Adapt the expected column counts and order to match your actual files.
+    # --------------------------------------------------------------------------
+    # Expected layouts (from the documentation):
+    #   Spatial:    ID, LAT, LON, OFFNAME, LANG, CLASS, TYPE, DISTRICT, DOUBT, LANDMARK, COMMENTS, OTHER => 12 columns
+    #   Linguistic: ID, SPATID, DOUSPAT, MAINID, TOPONYM, TOPFORMS, DOUTOPO, LANG, DOULANG, PRONUNC, DOUPRON, ETYM, ORIGIN, COMMENTS, OTHER => 15 columns
+    #   Temporal:   ID, LINGID, LINGNAME, START, DOUSTART, END, DOUEND, EVENT, OBJID, OBJNAME, COMMENTS, OTHER, FULLTEXT => 13 columns
+    #   Sources:    ID, TYPE, AUTHOR, TITLE, YEAR, PUBLISHER, CITATION, COMMENTS, PDF, OTHER => 10 columns
+    #
+    # The header row is checked as well: it must have the same number of columns
+    # as the data rows. It is then skipped during loading (see load_csv_into_table).
+
+    spatial_errors = check_csv_format(args.spatial, 12, delimiter=';')
+    linguistic_errors = check_csv_format(args.linguistic, 15, delimiter=';')
+    temporal_errors = check_csv_format(args.temporal, 13, delimiter=';')
+    sources_errors = check_csv_format(args.sources, 10, delimiter=';')
+
+    all_format_errors = spatial_errors + linguistic_errors + temporal_errors + sources_errors
+    if all_format_errors:
+        print("CSV Format/Parsing Errors Detected:")
+        for err in all_format_errors:
+            print(" -", err)
+        sys.exit(1)  # Fail immediately if the CSV structure is off
+
+    # --------------------------------------------------------------------------
+    # Create an in-memory SQLite database & define tables
+    # --------------------------------------------------------------------------
+    conn = sqlite3.connect(":memory:")
+    # Foreign keys are verified manually in run_tests(); add
+    # conn.execute("PRAGMA foreign_keys = ON;") here if you want SQLite
+    # to enforce them during loading instead.
+    create_tables(conn)
+
+    # --------------------------------------------------------------------------
+    # Load CSV data into each table
+    # --------------------------------------------------------------------------
+    # You must pass the actual column names (and order) that match your CSVs.
+    # If your CSV includes headers, the easiest approach is to skip them in
+    # load_csv_into_table and define columns in the correct order. Adapt as needed.
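+    # (Alternative sketch, not used here: read the header itself to avoid
+    # hard-coding the order, e.g.
+    #     with open(args.spatial, encoding="utf-8-sig") as f:
+    #         columns = csv.DictReader(f, delimiter=';').fieldnames
+    # We keep explicit lists below so the expected layout stays visible.)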
+
+    load_csv_into_table(
+        conn,
+        args.spatial,
+        "Spatial",
+        ["ID","LAT","LON","OFFNAME","LANG","CLASS","TYPE","DISTRICT","DOUBT","LANDMARK","COMMENTS","OTHER"]
+    )
+
+    load_csv_into_table(
+        conn,
+        args.linguistic,
+        "Linguistic",
+        ["ID","SPATID","DOUSPAT","MAINID","TOPONYM","TOPFORMS","DOUTOPO","LANG","DOULANG",
+         "PRONUNC","DOUPRON","ETYM","ORIGIN","COMMENTS","OTHER"]
+    )
+
+    load_csv_into_table(
+        conn,
+        args.temporal,
+        "Temporal",
+        ["ID","LINGID","LINGNAME","START","DOUSTART","END","DOUEND","EVENT","OBJID",
+         "OBJNAME","COMMENTS","OTHER","FULLTEXT"]
+    )
+
+    load_csv_into_table(
+        conn,
+        args.sources,
+        "Sources",
+        ["ID","TYPE","AUTHOR","TITLE","YEAR","PUBLISHER","CITATION","COMMENTS","PDF","OTHER"]
+    )
+
+    # --------------------------------------------------------------------------
+    # Run all validation tests
+    # --------------------------------------------------------------------------
+    errors = run_tests(conn)
+
+    # --------------------------------------------------------------------------
+    # Report results
+    # --------------------------------------------------------------------------
+    if errors:
+        print("Validation Errors Detected:")
+        for e in errors:
+            print(" -", e)
+        print(f"Total errors: {len(errors)}")
+        sys.exit(1)
+    else:
+        print("All validations passed successfully!")
+        sys.exit(0)
+
+if __name__ == "__main__":
+    main()