
Ask HN: What's a good, clean, public dataset with many entity relationships? - lobster_johnson
I need some data to populate a database with for stress testing.

For example: A movie/TV dataset with movies, TV shows, releases, episodes, actors, directors, companies, genres, etc., and relationships between them. I'd like at least a few million entities in total, and an order of magnitude more relationships between them.

I'd like JSON, or, at the very least, CSV/TSV. The dataset should also be modern, fully denormalized and cleaned up, with fields that are easy to understand. I'm not looking for a triplet dataset, as they can be difficult to untangle, but it might do.

Focusing on movies, I've found some datasets that are too minimal and/or focused on the wrong things (e.g. MovieLens), or too messy to easily consume (IMDb, Wikipedia), or hidden behind granular APIs (TheMovieDB, OMDb).

Does such a thing exist?
======
DougWebb
Microsoft provides an example database, AdventureWorks, for SQL Server. I've
used it a lot at my company; we make a product that generates Enterprise back-
office applications off of existing databases, and we've created a bunch of
demos using AdventureWorks.

[https://github.com/Microsoft/sql-server-samples/releases/tag...](https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks)

~~~
lobster_johnson
Thanks, but it looks like that data requires SQL Server.

~~~
DougWebb
If you browse around the GitHub repository, you can find CSV files for all of
the tables, which you might be able to work with. Also, search Google Images
for "AdventureWorks Schema" to get an idea of how complex the database is. It
seems to be the kind of dataset you're looking for, just maybe not in exactly
the format you'd like.

~~~
lobster_johnson
Thanks, I couldn't find anything but SQL Server dumps.

However, I've looked at the schema, and it's an OLTP database. I'm looking for
something that's more about real-world entities and public facts, and less
about nitty-gritty business-transactional details. Movies, music, people,
organization structures, literature, public infrastructure, etc.

------
PaulHoule
Look at TPC workload generators; those have a knob you can turn to control the
size of the database.
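
To make the "knob" idea concrete, here is a minimal sketch of a scale-factor generator. It is not a TPC tool and does not follow any TPC schema: the function name `generate_dataset`, the movie/person/credit layout, and the base sizes (1,000 movies and 2,000 people per unit of scale) are all assumptions chosen for illustration. It emits synthetic rows, with roughly ten relationships per entity.

```python
import random


def generate_dataset(scale_factor, seed=0):
    """Build synthetic entities and relationships sized by scale_factor.

    Hypothetical layout, not a TPC schema: per unit of scale it creates
    1,000 movies, 2,000 people, and 30,000 credits linking them.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs match
    n_movies = 1000 * scale_factor
    n_people = 2000 * scale_factor

    movies = [{"id": f"m{i}", "title": f"Movie {i}"} for i in range(n_movies)]
    people = [{"id": f"p{i}", "name": f"Person {i}"} for i in range(n_people)]

    roles = ["actor", "director", "writer"]
    credits = []
    for movie in movies:
        # 30 distinct people per movie: 30,000 credits against 3,000
        # entities per unit of scale, i.e. ten relationships per entity.
        for person in rng.sample(people, 30):
            credits.append({
                "movie_id": movie["id"],
                "person_id": person["id"],
                "role": rng.choice(roles),
            })

    return {"movies": movies, "people": people, "credits": credits}
```

Calling `generate_dataset(scale_factor=1000)` would aim for about three million entities and thirty million relationships, though at that size you would want to stream rows to disk rather than hold them in lists.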

~~~
lobster_johnson
Looking for real data, though.

