Hi everyone,
Recently I have been deeply involved in reading about and understanding different parts of PostgreSQL, especially the cumulative and static statistics collector. Even more fascinating were the concepts of WAL, replication, and archiving, which are some of the building blocks of how PostgreSQL handles multiple transactions.
To test different features, I needed data, and lots of it. I nearly filled my 1 TB HDD generating test files to check out all types of queries, both short-lived and long-running.
generate_series is a PostgreSQL function that was my helper in all of these tests. It is a simple function that takes a start, a stop, and an optional step, and generates a series of rows over that range. With a few tweaks, this simple function can generate a wide variety of data, which is extremely helpful in understanding PostgreSQL.
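For example, the three-argument form looks like this:

```sql
-- start, stop, and step: every even number from 2 to 10
SELECT * FROM generate_series(2, 10, 2);
```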
Here are some of the tricks I read/learned/used to generate sample data for lots of testing.
Integer Series
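A minimal sketch of the kind of query I mean, with table and column names of my own choosing:

```sql
-- Materialize a million sequential integers into a test table
CREATE TABLE test_int AS
SELECT id FROM generate_series(1, 1000000) AS s(id);
```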
Character and Strings
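The basic building block here is chr(), which turns a code point into a character; a small sketch combining it with random():

```sql
-- One random lowercase letter per row
SELECT chr(ascii('a') + (random() * 25)::int) AS letter
FROM generate_series(1, 10);
```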
A slightly more complex query can generate names:
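Roughly like this (my own reconstruction of the approach; the 8-letter length and the aliases are arbitrary choices):

```sql
SELECT id,
       (SELECT string_agg(chr(ascii('a') + (random() * 25)::int), '')
          FROM generate_series(1, 8 + id * 0)) AS name
FROM generate_series(1, 10) AS s(id);
```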
The most important part here is the id*0, which forces a new name to be generated for every row. It contributes nothing to the calculated value, but because it references the outer id, it turns the string_agg subquery into a correlated one that runs again for every row instead of being evaluated just once.
Timestamping Time
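generate_series works just as well with timestamps and an interval step; a small sketch (the range is arbitrary):

```sql
-- One row per hour for January 2023
SELECT ts
FROM generate_series('2023-01-01 00:00'::timestamp,
                     '2023-01-31 23:00'::timestamp,
                     interval '1 hour') AS s(ts);
```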
Some Fun Tweaks
Generate All IPv4 Addresses in CIDR Notation
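The idea is a cross join of four series, one per octet; a sketch of it (the table and alias names are mine):

```sql
-- All 4,294,967,296 addresses: read the caution below before running!
CREATE TABLE all_ips AS
SELECT format('%s.%s.%s.%s/32', a, b, c, d)::cidr AS ip
FROM generate_series(0, 255) AS a,
     generate_series(0, 255) AS b,
     generate_series(0, 255) AS c,
     generate_series(0, 255) AS d;
```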
Caution before executing this query: the number of rows would be 256^4 = 4,294,967,296, roughly 4.2 billion, and it will take up a lot of space. I tried it on my system with just the addresses starting with 0 (0.*.*.*); that alone took about 5 minutes to execute and produced a table of ~14 GB, so run it at your own risk.
A Normal Distribution
From PostgreSQL 16, a new function, random_normal, is included in core, which makes this easy. Until then, writing the function manually is the best bet available.
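A sketch of both options, with an assumed mean of 10 and standard deviation of 2; the manual version uses the standard Box-Muller transform:

```sql
-- PostgreSQL 16+: built-in random_normal(mean, stddev)
SELECT random_normal(10, 2) FROM generate_series(1, 100000);

-- Older versions: Box-Muller over two uniform randoms.
-- ln(1 - random()) avoids ln(0), since random() is in [0, 1).
SELECT 10 + 2 * sqrt(-2 * ln(1 - random())) * cos(2 * pi() * random()) AS value
FROM generate_series(1, 100000);
```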
Generating Data with Weights
Suppose we want our data to be generated as follows:
- 75% of data should be 1
- 15% of data should be 2
- 5% of data should be 3
- 3% of data should be 4
- Remaining 2% should be 5
After some trial and error, I came up with a query that works this way:
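The original query is on the fiddle linked below; here is my own sketch of the approach, with the weights scaled up to 100,000 (names like start_w and end_w are mine):

```sql
WITH weights (value, start_w, end_w) AS (
    VALUES (1,     0,  75000),   -- 75%
           (2, 75000,  90000),   -- 15%
           (3, 90000,  95000),   --  5%
           (4, 95000,  98000),   --  3%
           (5, 98000, 100000)    --  2%
),
random_values AS (
    -- random() is volatile, so a fresh value is drawn for every row
    SELECT id, floor(random() * 100000)::int AS random_value
    FROM generate_series(1, 100000) AS s(id)
)
SELECT r.id, w.value
FROM random_values r
JOIN weights w
  ON r.random_value >= w.start_w
 AND r.random_value <  w.end_w;
```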
It works by generating a random_value for each row and then checking which weight's range that value falls into, returning the corresponding value. The start and end boundaries of each range are the weights, but scaled up from 100 to a higher power of ten for more granular control.
To test this, I checked the fraction of rows generated for each value, and here is the result:
| value | fraction |
|-------|----------|
| 1     | 0.750275 |
| 2     | 0.149736 |
| 3     | 0.05017  |
| 4     | 0.029985 |
| 5     | 0.019834 |
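For reference, the check itself can be as simple as this, assuming the generated rows were saved into a table (weighted_data is a hypothetical name):

```sql
-- Fraction of rows per generated value
SELECT value,
       count(*)::float8 / sum(count(*)) OVER () AS fraction
FROM weighted_data
GROUP BY value
ORDER BY value;
```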
You can try it out yourself at https://dbfiddle.uk/3yBKJYFz.
After celebrating my joy at having written this function, I found another way of doing it: a blog post by Depesz that uses a for loop to find random values in the range. Clever.
Most of these ideas originated while reading articles, blogs, and problems where a dataset was needed to understand and analyze a query better. I guess that will be it for this one too.
Fin.