Chuvash Toponymy Database v.0.1

test_docs_v0.1.md at [e19594e4c4]

File tests/test_docs_v0.1.md artifact c1fc71d1c6 part of check-in e19594e4c4


This document describes a **single accompanying Python script** illustrating how you might validate your CSV data by loading it into an **in-memory SQLite** database and performing the various checks (File Integrity, Schema-Level, Key Constraints, Logic Validations, Data Quality, etc.). The script is somewhat verbose for clarity. In practice, you'd tailor it to exactly match your real CSV paths, column names, constraints, and business rules.

You can run the script from the command line, providing paths to your CSV files. It handles:

1. **Parsing CSVs** (verifying that each row has the expected number of columns).
2. **Creating SQLite tables** with constraints.
3. **Loading CSV data** into those tables.
4. **Running validation queries** (duplicate primary keys, foreign key mismatches, domain checks, etc.).
5. **Reporting** results and exiting with success/failure.

---

## How The Script Works

1. **Command-Line Arguments**
   - `--spatial`, `--linguistic`, `--temporal`, `--sources` specify paths to each CSV.
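
   A minimal sketch of this argument parsing using `argparse`; the helper name `parse_args` and the description string are assumptions, but the four option names come from the list above:

   ```python
   import argparse

   def parse_args(argv=None):
       """Build the CLI described above; all four CSV paths are required."""
       parser = argparse.ArgumentParser(description="Validate toponymy CSV files.")
       parser.add_argument("--spatial", required=True, help="Path to Spatial.csv")
       parser.add_argument("--linguistic", required=True, help="Path to Linguistic.csv")
       parser.add_argument("--temporal", required=True, help="Path to Temporal.csv")
       parser.add_argument("--sources", required=True, help="Path to Sources.csv")
       return parser.parse_args(argv)
   ```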

2. **CSV Format Check**
   - `check_csv_format(...)` ensures each row has the expected number of columns. If there’s a mismatch, the script raises errors and stops.
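
   A possible implementation of this check (the exact signature of `check_csv_format` is an assumption); it returns error strings rather than raising, so the caller decides when to stop:

   ```python
   import csv

   def check_csv_format(path, expected_columns, has_header=True):
       """Return error strings for rows whose column count is wrong."""
       errors = []
       with open(path, newline="", encoding="utf-8") as f:
           reader = csv.reader(f)
           if has_header:
               next(reader, None)  # skip the header row
           for lineno, row in enumerate(reader, start=2 if has_header else 1):
               if len(row) != expected_columns:
                   errors.append(
                       f"{path}: line {lineno}: expected {expected_columns} "
                       f"columns, got {len(row)}"
                   )
       return errors
   ```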

3. **In-Memory SQLite**
   - We open `sqlite3.connect(":memory:")` so everything runs in RAM. When the script ends, the database disappears (ephemeral).

4. **Table Creation**
   - `create_tables(conn)` sets up the schema. You’ll see placeholders for constraints like foreign keys. SQLite requires `PRAGMA foreign_keys = ON;` if you want foreign key enforcement. You can add that right after connecting, e.g.:
     ```python
     conn.execute("PRAGMA foreign_keys = ON;")
     ```
   - Note that constraints SQLite cannot express declaratively (e.g., advanced cross-table logic rules) may require custom triggers.
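
   One way `create_tables` might look. The table layouts below are illustrative placeholders built from the column names mentioned in this document (`SPATID`, `OBJNAME`, `EVENT`, `START`, `END`), not the real schema; note that `START` and `END` clash with SQLite keywords, so they are quoted:

   ```python
   import sqlite3

   def create_tables(conn):
       """Set up a minimal illustrative schema with key constraints."""
       conn.execute("PRAGMA foreign_keys = ON;")  # enable FK enforcement
       conn.executescript("""
           CREATE TABLE Spatial (
               SPATID  INTEGER PRIMARY KEY,
               LAT     REAL,
               LON     REAL
           );
           CREATE TABLE Linguistic (
               LINGID  INTEGER PRIMARY KEY,
               SPATID  INTEGER NOT NULL REFERENCES Spatial(SPATID),
               LANG    TEXT
           );
           CREATE TABLE Temporal (
               TEMPID  INTEGER PRIMARY KEY,
               OBJID   INTEGER,
               OBJNAME TEXT,
               EVENT   TEXT,
               "START" INTEGER,
               "END"   INTEGER
           );
       """)
   ```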

5. **Loading CSVs**
   - `load_csv_into_table(...)` uses a generic approach to read each CSV and insert rows. By default, it skips the header row (with `next(reader, None)`). Adjust if your CSVs don’t have headers.
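
   A generic loader along these lines (the signature and column handling are assumptions):

   ```python
   import csv
   import sqlite3

   def load_csv_into_table(conn, path, table, columns, has_header=True):
       """Read a CSV file and insert its rows into `table`.
       `columns` must list the target columns in CSV file order."""
       placeholders = ", ".join("?" for _ in columns)
       col_list = ", ".join(f'"{c}"' for c in columns)
       sql = f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})'
       with open(path, newline="", encoding="utf-8") as f:
           reader = csv.reader(f)
           if has_header:
               next(reader, None)  # skip the header row
           conn.executemany(sql, reader)
       conn.commit()
   ```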

6. **Validation Tests**
   - `run_tests(conn)` executes queries covering the checks outlined earlier:

   - **Schema-Level & Key Constraints**
     - Duplicate primary key checks in Spatial, Linguistic, etc.
     - Foreign key checks (e.g., Linguistic.SPATID must exist in Spatial).

   - **Logic-Based Validations**
     - `Temporal` data must obey rules about `EVENT`, `OBJID`, `OBJNAME`, `START < END`, etc.

   - **Data Quality**
     - Checking coordinate ranges in `Spatial`.
     - Allowed language codes.
     - Allowed `EVENT` values.
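
   The checks above could be sketched like this; the table and column layouts (`Spatial(SPATID, LAT, LON)`, `Linguistic(LINGID, SPATID, LANG)`, `Temporal(EVENT, "START", "END")`) are illustrative assumptions, not the actual schema:

   ```python
   def run_tests(conn):
       """Return a list of error strings from representative validation queries."""
       errors = []

       # Key constraints: duplicate primary key values in Spatial.
       for (spatid,) in conn.execute(
           "SELECT SPATID FROM Spatial GROUP BY SPATID HAVING COUNT(*) > 1"
       ):
           errors.append(f"Spatial: duplicate SPATID {spatid}")

       # Foreign keys: Linguistic.SPATID must exist in Spatial.
       for (spatid,) in conn.execute(
           "SELECT SPATID FROM Linguistic "
           "WHERE SPATID NOT IN (SELECT SPATID FROM Spatial)"
       ):
           errors.append(f"Linguistic: SPATID {spatid} missing from Spatial")

       # Logic: event intervals must satisfy START < END.
       for (rowid,) in conn.execute(
           'SELECT rowid FROM Temporal WHERE "START" >= "END"'
       ):
           errors.append(f"Temporal: row {rowid} has START >= END")

       # Data quality: coordinates must fall in valid ranges.
       for (spatid,) in conn.execute(
           "SELECT SPATID FROM Spatial "
           "WHERE LAT NOT BETWEEN -90 AND 90 OR LON NOT BETWEEN -180 AND 180"
       ):
           errors.append(f"Spatial: SPATID {spatid} has out-of-range coordinates")

       return errors
   ```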

7. **Error Handling**
   - The script accumulates all errors, then prints them at the end and exits with status code 1 if there are any. A zero exit code means success.
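
   A minimal sketch of this reporting step (the function name `report_and_exit` is an assumption):

   ```python
   import sys

   def report_and_exit(errors):
       """Print accumulated errors and exit non-zero if any were found."""
       if errors:
           for err in errors:
               print(f"ERROR: {err}", file=sys.stderr)
           sys.exit(1)  # non-zero exit code signals validation failure
       print("All checks passed.")
       sys.exit(0)
   ```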

8. **Execution**
   - Example usage:
     ```bash
     python validate_csvs.py \
       --spatial /path/to/Spatial.csv \
       --linguistic /path/to/Linguistic.csv \
       --temporal /path/to/Temporal.csv \
       --sources /path/to/Sources.csv
     ```

---

## Customizing the Script

- **Columns & Order**: Make sure the column lists passed to `load_csv_into_table` precisely match your CSV layout, and adjust the header-skipping logic to suit your files.
- **Constraints**: If you need advanced constraints (e.g., composite foreign keys or extra check constraints), add them when creating tables. SQLite supports multi-column foreign keys with the standard `FOREIGN KEY (a, b) REFERENCES …` syntax; rules it cannot express declaratively can be enforced with triggers.
- **Extra Validation**: Add or remove queries in `run_tests` for your domain-specific logic.
- **Performance**: If your CSVs are large, you may want to optimize with `executemany()` or `PRAGMA` tweaks. For moderate data sizes, the above approach is fine.
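
As an illustration of the multi-column foreign key syntax mentioned above (the table and column names here are hypothetical, not part of the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON;")  # required for FK enforcement
conn.executescript("""
    -- Parent table with a composite primary key.
    CREATE TABLE SourcePages (
        SRCID INTEGER,
        PAGE  INTEGER,
        PRIMARY KEY (SRCID, PAGE)
    );
    -- Child table with a multi-column (composite) foreign key.
    CREATE TABLE Citations (
        CITID INTEGER PRIMARY KEY,
        SRCID INTEGER,
        PAGE  INTEGER,
        FOREIGN KEY (SRCID, PAGE) REFERENCES SourcePages (SRCID, PAGE)
    );
""")
conn.execute("INSERT INTO SourcePages VALUES (1, 10)")
conn.execute("INSERT INTO Citations VALUES (1, 1, 10)")  # OK: parent row exists
```

Inserting a `Citations` row whose `(SRCID, PAGE)` pair has no matching `SourcePages` row raises `sqlite3.IntegrityError`, with no trigger needed.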