I work as a developer advocate. Part of my job is to build DEMOs and answer database questions online. To do my job well, I often need to spin up new database instances in the cloud or on my own machine. The database could be PostgreSQL, Cassandra, ScyllaDB, or something else, depending on the DEMO I am building or the question I’m trying to answer. There are days when I spin up 10+ different database instances from scratch. And it’s not just the database itself: I rarely need an empty database, so I also need to ingest some sample data to start with. For that, I need to answer the following questions:
- What type of sample data do I need? Can it be static or do I need something dynamic like time-series? Does the use case matter (should it be financial data, IoT, or something else)? Etc.
- How many tables do I need? Do I need JOINs? If so, one table is not going to be enough. I will need at least two tables so I can demonstrate a JOIN either within the DB itself (PostgreSQL) or within the app (ScyllaDB/Cassandra).
- Where can I source this type of data (for free)? This is the last step, but it often takes most of my time. I usually look at places like Kaggle and GitHub or just take my chances with Google. It’s rare that I find the perfect dataset. I usually find datasets that need work or are not free. But once I find a good, high-quality, free public dataset, I try to reuse it often.
It can take anywhere between 10 minutes and multiple hours to find or create the perfect dataset that I need for my DEMO or presentation or whatever. And, as I mentioned, I sometimes do this multiple times a day. It’s a lot of time for setting up sample databases that you only need temporarily. So anyway, I wanted to streamline this process as much as possible so I don’t waste time spinning up databases, finding datasets and ingesting data. I came up with this GitHub repo: PostgreSQL database samples with Docker
Using this repo, I can spin up PostgreSQL containers within seconds, preloaded with curated datasets. Exactly what I needed. Here’s how it works:
- Choose what dataset you want to use. Currently it only supports `movies` and `stocks`, but I plan to add more datasets in the future. When you choose a dataset, you also choose the schema with it. So make sure to look up the schema and maybe even the CSV data files before spinning up PostgreSQL to make sure you get what you need as far as schema, tables, and data.
- Build the image.
- Run the container.
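To make the build and run steps concrete, here is a minimal Python sketch that assembles the two Docker commands for a chosen dataset. It is not copied from the repo: the image tag, the assumption that each dataset lives in a directory named after it, and the assumption that the image is based on the official `postgres` image (which needs `POSTGRES_PASSWORD`) are all mine, so check the repo's README for the exact commands.

```python
import shlex


def docker_commands(dataset, password="postgres", host_port=5432):
    """Return the (build, run) argument lists for one sample dataset.

    Assumptions (not taken from the repo): the build context is a
    directory named after the dataset, and the image extends the
    official postgres image, which requires POSTGRES_PASSWORD.
    """
    image = f"postgres-sample-{dataset}"  # hypothetical image tag
    # docker build -t <tag> <build context directory>
    build = ["docker", "build", "-t", image, dataset]
    # docker run in detached mode, exposing PostgreSQL's default port
    run = [
        "docker", "run", "-d",
        "--name", image,
        "-e", f"POSTGRES_PASSWORD={password}",
        "-p", f"{host_port}:5432",
        image,
    ]
    return build, run


if __name__ == "__main__":
    # Print the commands instead of executing them
    for cmd in docker_commands("movies"):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try. Passing each list to `subprocess.run(cmd, check=True)` would execute it, provided Docker is installed and the directory layout matches the assumptions above.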
That’s it! No more searching for datasets, finding out that they’re not free, looking up PostgreSQL Docker commands for the Nth time, and other nonsense. You just have to know what sample data you need for the demo or presentation or whatever you’re building, and then you have a PostgreSQL database running in a Docker container in seconds.
Contribute curated datasets
I welcome you to contribute to the project. I expect this repo to grow and include more and more samples from different use cases (IoT, finance, streaming, time-series, sports, etc.) and datasets with different structures (one table, two tables, more tables, complex relations with lots of JOINs, etc.). The general requirements are that the dataset is publicly available, free to use for any purpose, and in CSV format. There’s a quick guide in the repo on how to add a new dataset.
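Since datasets have to be in CSV format, it can help to sanity-check a file before contributing it. The sketch below is my own suggestion, not part of the repo's guide: the function name `check_csv` and the specific checks (non-empty header names, no duplicate columns, same number of fields in every record) are assumptions about what makes a CSV easy to load.

```python
import csv
import io


def check_csv(text):
    """Return a list of problems found in CSV text (empty list if none).

    These checks are illustrative only; the repo may have
    different or additional requirements.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    if any(not name.strip() for name in header):
        problems.append("header has an empty column name")
    if len(set(header)) != len(header):
        problems.append("header has duplicate column names")
    # Records are numbered from 1, with the header as record 1
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_no}: expected {len(header)} fields, got {len(row)}"
            )
    return problems
```

To check a file on disk, read it first, for example `check_csv(open("movies.csv", newline="").read())`, where `movies.csv` is a placeholder file name.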
Hopefully, this project will save some time for you as well!