Postgresql – Importing data from file with ID dependencies

postgresql

I have this little conundrum going on: I want to bulk import historical data from multiple CSV files like binance_btc_usd.csv with structure of time|value into table id|time|value where id refers to the identification of what exchange and what currency pair is going on, which is the file name. I dont want to bake-in an ID into the CSV files, as they may not be valid next time I re-run the script. COPY doesn't seem like the thing I am looking for.

CREATE TABLE history_data ( 
    id    SERIAL PRIMARY KEY, 
    pair  INTEGER NOT NULL,
    time  TIMESTAMP,
    value DOUBLE PRECISION, 
    CONSTRAINT pair_pair_id_fk FOREIGN KEY (pair) REFERENCES pair (id) ON DELETE CASCADE
);

CREATE TABLE pair ( 
    id    SERIAL PRIMARY KEY, 
    exchange  INTEGER NOT NULL,
    base_coin INTEGER NOT NULL,
    quoted_coin INTEGER NOT NULL 
    # and other ... 

    CONSTRAINT exchange_currency_id_fk FOREIGN KEY (exchange) REFERENCES exchange (id) ON DELETE CASCADE
    CONSTRAINT base_coin_id_fk FOREIGN KEY (base_coin) REFERENCES currency (id) ON DELETE CASCADE
    CONSTRAINT quoted_coin_id_fk FOREIGN KEY (quoted_coin) REFERENCES (id) ON DELETE CASCADE
    CONSTRAINT pair_pair_id_fk FOREIGN KEY (pair) REFERENCES pair (id) ON DELETE CASCADE
);

Best Answer

The only thing I can think of is a combination of a staging table and some shell scripting.

First step: import the raw file into a staging table, e.g. history_data_import which only contains time and value.

Second step: extract the corresponding values for the pair table from the filename and create a SQL statement like the following:

with pair_id as (
  select id
  from pair
  where exchange = 'binance' --<< this is "generated"
    and base_coin = 'btc' --<< this is "generated"
    and quoted_coin = 'usd' --<< this is "generated"
)
insert into history_data (pair, time, value)
select pi.id, imp.time, imp.value
from history_data_import imp
  cross join pair_id pi;

This assumes that the combination of (exchange, base_coin, quoted_coin) is unique in the pair table so that only a single row is returned when looking up the id.

How exactly you extract the parts of the filename is a different topic and depends on the operating system you are using. On Linux this is probably a piece of cake for awk gurus. But this would better be asked on e.g. https://superuser.com/