Chuvash Toponymy Database v.0.1: 2025_Feb_12_chatGPT

File data/csv/db_experiment/v0.3/2025_Feb_12_chatGPT_chat.txt artifact 166674fd08 on branch trunk

Hello, ChatGPT!

I need your help with table data transformation.

This is the structure of the source csv file

headers look like that:
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM

data rows could be several types
First one is the most frequent -- CHUV and RUS columns are not empty and REFID column is empty
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM
23;55,983707;46,827595;YS;Атапай;Адабай;;;;деревня;ойконим;Моргаушский р-н;;;;чув. яз. имя (СЧЯ);CHV / TAT;От имени чуваша-язычника Атапай: ата 'отец', пай 'богатый'.;;12;;;

The second type is quite frequent too -- CHUV column is empty, RUS is not empty, REFID is empty
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM
69;56,015002;46,346179;YS;;Алексеевка;;;;деревня;ойконим;Ядринский р-н;;;;;;;;24;Алешкина - название деревни Алексеевка до 1924 года;;

Third type is the rarest -- CHUV is not empty, RUS and REFID are empty
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM
244;54,942062;47,721401;MS;Патирек;;;;;деревня;ойконим;Шемуршинский р-н;;;;;;;;52;;;

Fourth type is not frequent -- CHUV, RUS and REFID columns are not empty
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM
445;55,075314;47,695053;MS;Пулак;Булаково;;;;деревня;ойконим;Батыревский р-н;;;;;;Название на 1923 год - Плаккасси.;446;93;;;От р. Була или личного имени Булак.

Fifth type -- CHUV column is empty, RUS is not empty, REFID is not empty
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM
523;55,75605;46,358486;MS;;Верхнее Аккозино;;;;деревня;ойконим;Красночетайский р-н;;;;;;Рус. название от имени Акаҫ.;713;105;;;

Sixth type -- CHUV column is not empty, RUS is empty, REFID is not empty
ID;LAT;LON;coordcommenttemp;CHUV;RUS;GEOCOMMENT;CORRECTION;HISTORIC;OBJTYPE;OBJCLASS;ABSREF;OLD2025;HISTCOMM;RELREF;ETYMSAV;ORIG;DUBANOV;REFID;PAGE;COMMIDR;COMMSAV;ETYM
594;;;;Малти Чиперуй;;;;;деревня;ойконим;Мариинско-Посадский р-н;;;;;;;399;85;;;От чув. чипер 'красивый' и уй 'поле'.

About REFID column -- if the value is present in this field it means that two records (basically two strings in sorce.csv file) are connected, because it is the one spatial object (an entity having LAT and LON attributes) with several names (CHUV and RUS field in several records from source file).

I want you to write python script transforming the source file into 2 csv files: spatial.csv and linguistic.csv

Here is the specification of spatial table
"## 1. Spatial Table

### Purpose:
The spatial table contains unique geographic entities, their locations, and relevant metadata.

### Columns:
- ID (int, PK): Unique identifier for each geographic object, here goes the ID from source file
- LAT, LON (float): Estimated geographic coordinates, LAT and LON field from source file
- OFFNAME (string): Usually the RUS field goes here (It is the RUS name for the location) -- only if the RUS name is absent in the record itself and absent in the record connected through REFID field, only in that case CHUV name is used to fill the OFFNAME field.
- LANG (string): The language of the name, using codes such as RUS for RUS name and CHU for CHUV name.
- CLASS (string): The OBJCLASS field from source file.
- TYPE (string): The OBJTYPE field from source file.
- DISTRICT (string, nullable): The ABSREF field.
- DOUBT (int, nullable): This field is empty for now.
- LANDMARK (string, nullable): The RELREF field.
- COMMENTS (text, nullable): The GEOCOMMENT field.
- OTHER (text, nullable): This field is empty for now."

Here is the specification of linguistic table
"## 2. Linguistic Table

### Purpose:
The linguistic table stores different names (and their variants) of the same geographic object across different languages and sources, etymological information and other linguistic metadata. Every CHUV name and RUS name (toponym) must have its own record in this table.

### Columns:
- ID (int, PK): Unique identifier for each toponym entry, independent from the spatial table.
- SPATID (int, FK → Spatial.ID): Links the toponym to its corresponding spatial object.
- DOUSPAT (int, nullable): Empty for now.
- MAINID (int, FK → Linguistic.ID): See below, the logic of this field is connected with REFID column from source file
- TOPONYM (string): The attested name as recorded from sources, here goes CHUV field or RUS field
- TOPFORMS (text, nullable): Empty for now.
- DOUTOPO (int, nullable): Empty for now.
- LANG (string): RUS for RUS name or CHU for CHUV name.
- DOULANG (int, nullable): Empty for now.
- PRONUNC (text, nullable): Empty for now.
- DOUPRON (int, nullable): Empty for now.
- ETYM (text, nullable): The concatenation of ETYMSAV and ETYM fields.
- ORIGIN (string, nullable): The ORIG field.
- COMMENTS (text, nullable): The concatenation of CORRECTION, HISTORIC, HISTCOMM, DUBANOV, PAGE, COMMIDR, COMMSAV in the format: "CORRECTION: <text from the CORRECTION field> | HISTORIC: <text from the CORRECTION field> | and so on".
- OTHER (text, nullable): Empty for now."

coordcommenttemp field is not needed

The logic behind REFID column from the source table goes like that: 
- if REFID column is empty, then the RUS name (if not empty) from the record goes to OFFNAME column of spatial table and ID goes to spatial table and linked records from linguistic table
- if it is not empty, it means there are several connected records in the source file with structure like that:
ID_1, LAT, LON, ... REFID_1
ID_2, empty, empty, ... REFID_2
ID_3, empty, empty, ... REFID_3
etc
REFID_1 equals REFID_2 or REFID_3, REFID_2 equals REFID_1 and REFID_3 equals REFID_1 too etc
The source record with the LAT and LON filled is the main one (sometimes it is not the first one, maybe the second or third) others are secondary records:
- its' ID goes to the spatial table's ID field
- its' RUS name (if it is not present it means the error in source data) goes to OFFNAME column of spatial table
- filling out of the linguistic table for main and secondary source records requires special care, bear with me -- 
the RUS name of the main record needs to be processed first, its' linguistic.ID (unique, remember) goes to the MAINID field of its own record (so linguistic.ID equals linguistic.MAINID for the RUS name of the main record) and into all linguistic.MAINID fields of connected names, namely, into inguistic.MAINID field for CHUV name (if present) of the main record and into inguistic.MAINID field for CHUV name and RUS name for all secondary source records

Remember, official name, the spatial.OFFNAME, is usually in RUS column, only if RUS column is empty **and** the REFID column is empty, only in that case we should use CHUV name as OFFNAME.

Now please write a python script for the described data transformation. If there's anything unclear just ask, I'll add the necessary details.
2025_Feb_12_chatGPT_chat.txt at trunk

File data/csv/db_experiment/v0.3/2025_Feb_12_chatGPT_chat.txt artifact 166674fd08 on branch trunk