Last year, I fell off the radar on Artificial Intelligence because, from an engineering and management standpoint, my perception was that responses from prompts were very difficult to use in test-driven development.
My goal over April is to offer some more perspectives on AI from an end user’s point of view. I primarily use GPT 4.x and Claude Sonnet with GitHub Copilot. In my work, I use those models daily to ask questions, not necessarily to code with them; I use them to brainstorm.
One of the reasons I use it this way is that, after twenty-five years of typing on a keyboard, I am a very fast typist, but I just don’t care so much about writing boilerplate code over and over again. One thing I am looking to generative AI to do is move past template-based generators like the ones NestJS and others use.
Not that there is anything wrong with templated generators - ironically, AI works better with a template as well! What I mean is that, to me, the logical expansion is from hard-coded generators like this:
nest generate <type> <name> [options]
to prompt-based generators that can write more code for me to review. I feel it is very important to emphasize that last part: for my review.
What I have been experimenting with in different languages is what Copilot is using as its source information and what answers it is giving out. I can say that the results are still hit-or-miss, and SQL code generation is not the golden goose.
Why? Even with a well-crafted, very specific prompt, these LLMs still get the answer completely wrong about half of the time. Speaking with my colleagues at https://www.humanservices.ai, we are seeing similar outputs:
Code quality varies drastically.
The same prompt may generate disparate results; there is little consistency from run to run.
From my perspective, it really struggles with computer science basics.
It still has a tendency to hold onto items in memory that you have asked it to forget.
It tends to fall back on copying from source material in your workspace, like third-party dependencies you have installed in your VS Code workspace.
For the rest of this article, though, I am going to focus on an interesting task that perplexed me: data quality and creating good data models with AI. Here is the scenario, condensed for brevity.
I am working on a data warehouse, and the data model is large enough that I am using AI to help me write some boilerplate data warehouse “intermediate” tables, to see how it reasons about data aggregation.
To tip-toe into this, I have queries that I need to work with that throw errors when I run them, some esoteric, some I already know the answers to. So, I ask generic questions about the errors and how Copilot would debug them. I keep the actual SQL query out of the prompt, as including it tends to break the responses from the model.
Recently, I’ve been asking something like this to see how Copilot reasons without a lot of context:
I need a PostgreSQL table with X, Y and Z expected results.
but with enough detail to make a start. Then I will supply a well-formed table that I personally wrote to good code quality standards, and see what it generates. In my example, it generated something like this:
CREATE TABLE IF NOT EXISTS public.generic_table (
generic_table_id SERIAL NOT NULL,
generic_table_uuid UUID DEFAULT gen_random_uuid(),
text_column TEXT,
int_column INTEGER,
float_column NUMERIC(10, 4),
date_column DATE,
foreign_key_column INTEGER NOT NULL,
jsonb_column JSONB,
created_at TIMESTAMP WITHOUT TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITHOUT TIME ZONE DEFAULT NOW(),
CONSTRAINT generic_table_pkey PRIMARY KEY (generic_table_id),
CONSTRAINT generic_table_foreign_key_column_fkey FOREIGN KEY (foreign_key_column)
REFERENCES public.foreign_table (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
);
Now, I asked Copilot to generate this for me to prove a point. I asked it to make me a randomly generated table, using a table I had already written with good code quality as a reference template. Copilot generated the table above, but it missed something interesting:
created_at TIMESTAMP WITHOUT TIME ZONE DEFAULT NOW(),
What the above definition declares is that this field will default to the current timestamp on insert. However, because the column is still nullable, someone can later update it to NULL, and that is still a valid value. That breaks data quality, and on a bulk update it could be disastrous.
I changed that to not allow nulls.
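For reference, the tightened definition looks like this; the same line reappears in the revised table further down:
-- NULL can no longer be written to this column on insert or update
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),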
Moving on, I asked Copilot to write a query using what one might reasonably consider a good strategy for working with null values: leveraging the COALESCE keyword.
I would like a select query that coalesces null values from that table
And Copilot returned this:
SELECT
generic_table_id,
COALESCE(generic_table_uuid, gen_random_uuid()) AS generic_table_uuid,
COALESCE(text_column, 'Default Text') AS text_column,
COALESCE(int_column, 0) AS int_column,
COALESCE(float_column, 0.0) AS float_column,
COALESCE(date_column, CURRENT_DATE) AS date_column,
foreign_key_column, -- Assuming this column is NOT NULL, no need for COALESCE
COALESCE(jsonb_column, '{}'::JSONB) AS jsonb_column,
COALESCE(created_at, NOW()) AS created_at,
COALESCE(updated_at, NOW()) AS updated_at
FROM
public.generic_table;
Now, the question we need to ask of this answer is: how is this working and is it correct?
Let’s start with “is this correct?”. Well… technically, yes. Basically, what it has done is parrot back the definition of the table and fill in the defaults. However, there are issues here concerning data quality.
And honestly, I would not have coalesced anything unless the query truly needed it. However, because I asked Copilot to do so, it was happy to oblige; the generator inferred from my prompt that COALESCE was a good idea in the first place. AI isn’t really handing out advice unless you ask it to, and even then that advice is over-coded and not at a level that protects folks from inadvertently making bad decisions.
Let’s talk a bit more about the net effect of that SELECT query:
Every time one runs this query, NULL UUIDs rotate to a new UUID.
'Default Text' is meaningful in this case, for example if folks need a default, however this could have been accomplished in the table definition (see the snippet after this list).
Adding zeros to the numeric columns may or may not be appropriate. Since we have not told Copilot what the upper and lower boundaries of the acceptable range are, Copilot simply fell back to zero.
The date COALESCE suffers the same problem as the UUIDs: a null date rotates to the current date every time the query runs.
JSONB rightfully defaults to an empty object, however once again this could have been accomplished with a default value. We also don’t know what the JSON schema is meant to be here, so this could just as well be an array, since that is also valid JSON.
Created and updated dates default to the current timestamp from the database. Once again, these rotate with the date and time the query was run.
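As a rough sketch of what I mean by pushing defaults into the table definition instead of the query (the NO_ENTRY sentinel here is illustrative; it shows up again later in this article):
-- Illustrative only: set the defaults once at the table level so every query does not have to coalesce
ALTER TABLE public.generic_table
    ALTER COLUMN text_column SET DEFAULT 'NO_ENTRY',
    ALTER COLUMN jsonb_column SET DEFAULT '{}'::JSONB;
A default like this only kicks in when an insert leaves the column out entirely; it does not stop an explicit NULL, which is why the NOT NULL constraints in the revised table below still matter.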
The net effect of the generated query, then, is poor data quality. So, let’s go back to the actual data definition and look at what a table focused on data quality could look like.
CREATE TABLE IF NOT EXISTS public.generic_table (
generic_table_id SERIAL NOT NULL,
generic_table_uuid UUID NOT NULL DEFAULT gen_random_uuid(),
text_column TEXT,
int_column INTEGER,
float_column NUMERIC DEFAULT '-infinity', -- infinity requires an unconstrained NUMERIC (PostgreSQL 14 or later)
date_column DATE,
foreign_key_column INTEGER NOT NULL,
jsonb_column JSONB NOT NULL DEFAULT '{}'::JSONB,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),
CONSTRAINT generic_table_pkey PRIMARY KEY (generic_table_id),
CONSTRAINT generic_table_foreign_key_column_fkey FOREIGN KEY (foreign_key_column)
REFERENCES public.foreign_table (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION,
CONSTRAINT created_at_check CHECK (created_at >= NOW()),
CONSTRAINT updated_at_check CHECK (updated_at >= created_at),
CONSTRAINT int_column_check CHECK (int_column BETWEEN -10000000 AND 10000000),
CONSTRAINT date_column_check CHECK (
date_column >= (CURRENT_DATE - INTERVAL '500 years') AND
date_column <= (CURRENT_DATE + INTERVAL '500 years')
)
);
This still needs to be vetted on my local DB; it is pretty close, but it requires testing. What I am doing is focusing on having the database support the application by building sanity checks into the table for what data should be collected. That is, the database is the last line of defense if something is missed in data validation further up the stack. This is especially important in distributed teams.
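To make that concrete, here is a hypothetical insert the revised table would reject on its own (the values, including the foreign key, are made up for the example):
-- Rejected by int_column_check: 20,000,000 falls outside the -10,000,000 to 10,000,000 range,
-- even if the application layer forgot to validate it
INSERT INTO public.generic_table (int_column, foreign_key_column)
VALUES (20000000, 1);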
The original generated table was in effect “unbounded”, and what this table does now is:
Generates UUIDs and says that UUIDs cannot be null.
Leaves the default value for the integer as NULL, and constrains the data to between negative ten million and positive ten million.
The float defaults to negative infinity, which is a special value. This can seem controversial, but because the value is so far outside any normal range, it makes out-of-bound numbers easy to spot. Is it necessary? That’s debatable.
The date is constrained to a 1,000-year window: 500 years before or after the current date.
And so on and so forth. For me, when I audit data quality in older systems, I check whether a query needs to COALESCE into a default value. If it does, I pick defaults at the outer ranges for that type, values that obviously create an anomaly (a sketch follows the list below):
DATE DEFAULT '4713-01-01 BC' - this sits at the documented low end of the range of dates PostgreSQL can store.
DEFAULT '-infinity' - for float, real, and double precision column types (and for unconstrained NUMERIC columns on PostgreSQL 14 and later).
-32,768, -2,147,483,648, -9,223,372,036,854,775,808 for int2 (SMALLINT), int4 (INT), int8 (BIGINT) respectively.
NO_ENTRY or XXXXXX for text fields, which for example could be used for filtering later in a BI tool unless another value is needed.
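Here is a rough sketch of what that audit-style query could look like against the table above, with the sentinel choices mirroring the list (adjust them to whatever stands out in your own data):
-- Audit-style query: NULLs collapse into obviously out-of-range sentinel values
SELECT
    generic_table_id,
    COALESCE(text_column, 'NO_ENTRY') AS text_column,
    COALESCE(int_column, -2147483648) AS int_column,
    COALESCE(float_column, '-infinity'::NUMERIC) AS float_column, -- NUMERIC infinity needs PostgreSQL 14+
    COALESCE(date_column, DATE '4713-01-01 BC') AS date_column
FROM
    public.generic_table;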
Again, what this approach does is create a predictable anomaly in a data graph that stands out easily. It also makes data entry issues easier to spot: for example, if a date is typed as 1875 instead of 1975 (my birth year), a true null would still sit far further in the past. I have to shout out my colleague Josiah, as we worked this out a few years ago.
Speaking of dates, Copilot has a tendency to choose the Unix epoch, 00:00:00 UTC on January 1, 1970, as the default SQL date. Since January 1, 1970 could be a valid date, that recommendation, in my humble opinion, shouldn’t be used.
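As a quick sketch of why the epoch makes a poor sentinel, compare these two illustrative defaults:
-- The Unix epoch is a plausible real-world date, so defaulted values blend in with legitimate data
date_column DATE DEFAULT '1970-01-01',
-- A sentinel at the far edge of PostgreSQL's date range cannot be mistaken for real data entry
date_column DATE DEFAULT '4713-01-01 BC',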
What I am seeing with AI-generated tables and data models is that we humans are not giving enough written information in a prompt to get the specificity we need. Also, quite frankly, almost every database I have seen or worked on over the last twenty years has exhibited table definitions that only tacitly reinforce data quality. It’s not uncommon for legacy systems to need data quality checks to find anomalies as they scale over time.
To generate this “data quality orientated” table, one would need a prompt like this:
Write a PostgreSQL CREATE TABLE IF NOT EXISTS statement for a table named "generic_table" with the following requirements:
- A primary key column named "generic_table_id" of type SERIAL.
- A UUID column named "generic_table_uuid" with a default value generated by gen_random_uuid().
- A TEXT column named "text_column".
- An INTEGER column named "int_column" with a constraint that its value must be between -10,000,000 and 10,000,000.
- An unconstrained NUMERIC column named "float_column" with a default value of '-infinity'.
- A DATE column named "date_column" with a constraint that its value must be within 500 years before or after the current date.
- An INTEGER column named "foreign_key_column" that is a foreign key referencing the "id" column of the "foreign_table" table in the "public" schema.
- A JSONB column named "jsonb_column" with a default value of an empty JSON object ('{}').
- A TIMESTAMP WITHOUT TIME ZONE column named "created_at" with a default value of NOW() and a constraint that it must be greater than or equal to the current time.
- A TIMESTAMP WITHOUT TIME ZONE column named "updated_at" with a default value of NOW() and a constraint that it must be greater than or equal to "created_at".
- Ensure all constraints, including primary key, foreign key, and column-specific checks, are properly defined.
And at that point, one could just write the table by hand. It will be faster, and that is what I do for the most part. From my experimentation, though, it would be interesting to figure out how to work on a data model iteratively with AI, where some boilerplate “quality” standards could be baked into the output and then refined. However, the time and cost to benefit ratio may not be there, as tools like pgAdmin could be more consistent in the long run.
My next test will be to try using RPC and JSON Schema definitions and see what AI can do by extrapolating data models from templates, or perhaps something like Mermaid, which has a predictable syntax that could act as an SQL template.
There is a good chance that, with more context and examples, the output could require less re-work and be more production ready.
Thanks for reading, don’t code tired!