Part of my day job involves loading data from lots of different files into a database, and one way this can go horribly wrong is if I attempt to load the same file twice. Either the database will get mad at me because I’ll be trying to violate referential integrity, or my end users will get mad at me when I manage to fill the database with duplicate data, or I’ll get mad at me for having to waste several hours sorting out a self made mess. Here’s my homebrewed solution for this, to allow me to take complicated code and entrust it to other people (or to myself, if it’s a Friday afternoon and I’m stressed and distracted with a hundred other things). If you’re not interested in databases, avert your eyes now.
I used to have lots of scripts stored as .sql files, and I’d pick the one I wanted when I wanted to run some code. But since I’d often have to edit these files to change one tiny part of them (the part that said which file I’d want to load into the database) I soon realised that it would be better to turn my scripts into functions in postgres. That way a complicated process that could be hundreds of lines of code becomes something simple like this:
SELECT load_spend_data('C:\Users\jforeman\Temp\DB Feeds\Spend_2013-11-01.csv');
However, that load_spend_data procedure that I wrote is too dumb to know if the data I’m asking it to load has already been shovelled into the database or not. I began to think about how to get the procedure to check that the file hadn’t already been loaded.
And then, because in general I like to generalise, I realised that instead of rewriting each of my load procedures to take advantage of the file checking logic, I could write another function that would wrap around them, like a cuddly teddy bear, and apply the same logic to every load. Less rewriting code means more time doing more fun things, and more time reusing code means less code to have to debug or worry about going out of date.
There’s two things required for this: a table that stores all the files we’ve already loaded:
CREATE TABLE load_file_log ( filename varchar(300) NOT NULL, load_process varchar(200), start_date date, start_time time, end_date date, end_time time, PRIMARY KEY (filename) );
…and a function to handle the loading:
CREATE OR REPLACE FUNCTION generic_loader (procedure_name character varying, input_file character varying) RETURNS character VARYING AS $BODY$ DECLARE resultstate varchar(350); DECLARE filecheck varchar(300); BEGIN -- Here we check to make sure the file isn't already loaded SELECT filename FROM load_file_log INTO filecheck WHERE filename = input_file; IF filecheck = input_file THEN SELECT input_file || ' already loaded' INTO resultstate; ELSE EXECUTE 'SELECT ' || procedure_name || '(''' || input_file || ''') ' ; INSERT INTO load_file_log VALUES (input_file, procedure_name, current_date, current_timestamp, date(clock_timestamp()), clock_timestamp()); SELECT procedure_name || ' loaded ' || input_file INTO resultstate; END IF; RETURN resultstate; END $BODY$ LANGUAGE plpgsql VOLATILE COST 100;
Now I can happily share my workload with other people: it’s a lot less frightening to give somebody one line of code
SELECT generic_loader ('load_blacklisted_stuff', 'C:\Users\jforeman\Temp\DB Feeds\Data2013.csv')
tell them how to change it, and know they’ll struggle to break it or themselves, than to go on holiday having provided somebody the database equivalent of a fully-automatic shooting-yourself-in-the-foot device.
Plus, it actually gives the users a nice message, both if the file does load, or if the procedure refuses to load it because it’s already been put in. And although it’s nice to be important, it’s also important to be nice.
This doesn’t solve problems like people changing the name of a file in order to try to feed it into the database twice, or the file being full of rubbish to begin with, but eliminating some of the consequences of carelessness is better than being wide open.