Data Generator

I need a data generator to stress test our server. I could easily create a little utility to add (somewhat) realistic data, but I’m wondering if anyone has anything they’re in love with. Our DB is Postgres and has uuid, bytea, and jsonb fields, but it’s only a couple of dozen tables, and of those maybe a dozen could really use a few tens of millions of rows of data.
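
For a rough sense of what that little utility could look like, here’s a minimal Python sketch, assuming psycopg2 and Faker, and a hypothetical events table (not our real schema) with uuid, bytea, and jsonb columns:

```python
# Minimal sketch of a home-grown generator (assumes psycopg2 + Faker are installed;
# the "events" table is a stand-in, not our actual schema):
#   CREATE TABLE events (id uuid PRIMARY KEY, payload bytea, attrs jsonb, created_at timestamptz);
import os
import uuid

import psycopg2
import psycopg2.extras
from faker import Faker

fake = Faker()
psycopg2.extras.register_uuid()  # let uuid.UUID objects map onto the uuid column type

def make_row():
    return (
        uuid.uuid4(),                              # uuid primary key
        psycopg2.Binary(os.urandom(64)),           # random bytea blob
        psycopg2.extras.Json(                      # small jsonb document
            {"name": fake.name(), "city": fake.city(), "score": fake.random_int(0, 100)}
        ),
        fake.date_time_this_year(),
    )

conn = psycopg2.connect("dbname=stress_test")      # adjust the DSN for your setup
with conn, conn.cursor() as cur:
    for _ in range(1_000):                         # 1,000 batches of 10,000 rows = 10M rows
        psycopg2.extras.execute_values(
            cur,
            "INSERT INTO events (id, payload, attrs, created_at) VALUES %s",
            [make_row() for _ in range(10_000)],
            page_size=10_000,
        )
conn.close()
```

COPY would be faster at real volume, but batched inserts keep the sketch short.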

I’ve had good luck with this one: https://generatedata.com/
The online version is row-limited, but you can download it and run it in-house.

I’ve used this one as well. It’s highly configurable. If you create a $25 account, the only limitation is your browser’s download size limit. Or, as you said, you can download the code from GitHub and run it in-house.

Edit: I thought it was free (maybe it once was?), but the fee is very reasonable.

Also https://www.mockaroo.com/ and https://www.tonic.ai/. Mockaroo also lets you create mock back-end APIs based on a work-in-progress database schema as well as generate mock data. Tonic, I believe, is better at simulating realistic user data using AI if you can seed it with actual real-world data, but if your project is greenfield that may not be a viable option.

Bob (notwithstanding the fact that you are a superior developer to me), the exercise might well be just a stress test of the back end, but you’ll be able to shake out a great many other issues, both back- and front-end, with a substantial mock dataset in play. It’s also an early chance to design BI or management reporting (if that’s in your scope of work). In my experience, large mock datasets always inspire or trigger improvements to the database schema.

Kind regards, Andrew


I am fortunate to have a huge dogpile of real data to test with at leisure. That is the best of all worlds IMO, especially if the application involves, as mine does, extensive standardization and cleanup of the data. I don’t think that mock data products would simulate odd spacing, misspellings, random notations, etc.

In that regard, I have found government data to be a great way to shake things down. I have 1.5 million records from a federal business licensing agency that are full of anomalous punctuation, duplication, and inconsistency. By the time I finish iterating on that, I think I’ll be able to produce correct and matchable output from data entered by a crazed raccoon on drugs!
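
If you don’t have a dogpile like that to draw on, one option is to generate clean mock data and then deliberately degrade it. A rough sketch of the idea (the specific mutations here are just examples):

```python
import random

NOTATIONS = ["(see file)", "**", "N/A?", "-- DUP --"]  # example junk annotations

def dirty(value: str, rate: float = 0.3) -> str:
    """Randomly degrade a clean string to mimic messy real-world input."""
    if random.random() < rate:                      # odd spacing
        value = "  " + value.replace(" ", "   ")
    if random.random() < rate and len(value) > 3:   # transposition "typo"
        i = random.randrange(len(value) - 2)
        value = value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if random.random() < rate:                      # inconsistent casing
        value = value.upper() if random.random() < 0.5 else value.lower()
    if random.random() < rate:                      # random notation tacked on
        value = f"{value} {random.choice(NOTATIONS)}"
    return value

print(dirty("Acme Widget Company, LLC"))
```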

Yeah, that’s kind of the point. I’m not exceptionally happy with the schema that’s already in place, and we are likely to have an explosion of data coming in the next several months. I’d like to get a handle on performance with large datasets. As a developer I have a very minimal database, and even our QA folks don’t keep their databases around very long, so everyone is using fresh, clean databases all day long.

Plus, this is the first project I’ve worked on that uses UUIDs for primary and foreign keys, and I’d like to get some idea of what kind of penalties we’re going to get with those too.
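
Before touching the real schema, I may throw together a quick side-by-side to put a number on it; something like this rough sketch (assuming psycopg2; the table names and row count are arbitrary):

```python
import time
import uuid

import psycopg2
import psycopg2.extras

N = 1_000_000
psycopg2.extras.register_uuid()
conn = psycopg2.connect("dbname=stress_test")   # adjust for your environment
conn.autocommit = True

def bench(table: str, id_type: str, rows) -> None:
    """Create a throwaway table, time N inserts, and report the PK index size."""
    with conn.cursor() as cur:
        cur.execute(f"DROP TABLE IF EXISTS {table}")
        cur.execute(f"CREATE TABLE {table} (id {id_type} PRIMARY KEY, note text)")
        start = time.perf_counter()
        psycopg2.extras.execute_values(
            cur, f"INSERT INTO {table} (id, note) VALUES %s", rows, page_size=10_000
        )
        elapsed = time.perf_counter() - start
        cur.execute(f"SELECT pg_relation_size('{table}_pkey')")
        index_bytes = cur.fetchone()[0]
        print(f"{table}: {elapsed:.1f}s for {N:,} rows, PK index {index_bytes / 1e6:.0f} MB")

# bigint keys arrive in order; random (v4) uuid keys land all over the btree
bench("t_bigint", "bigint", ((i, "x") for i in range(N)))
bench("t_uuid", "uuid", ((uuid.uuid4(), "x") for _ in range(N)))
conn.close()
```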
