Pplapi: A virtual database of the entire human population (pplapi.com)
109 points by caseysoftware on Jan 11, 2017 | 43 comments

This is part of my research and I'm happy to answer any questions.

I gave a preliminary poster about this a few months ago:


This project is under active development as I wrap up my PhD.

Typically, when one does a university research study in the US involving human subjects, there are IRB requirements (e.g. http://humansubjects.stanford.edu/new/resources/consent/). I see research mentioned on the website, and also that the data set covers the entire world population. Have you run into anything like this?

It's a synthetic world population derived from public population data; these are not real people (i.e. human subjects) and the source raw data were not identifiable or even individual-level data.

I nevertheless think there are a number of issues raised by the possibility of a human population database, but to be clear: pplapi does not have actual people in it. I am currently writing about some of these issues.

Until I looked at the actual data, I didn't realize this was synthetic either. The developer might want to make that clear in the header text. "Virtual Database" doesn't make this clear. I think "virtual database synthesizing the entire human population" might make it clearer.

That said, this is very impressive and has my imagination tingling with ways to apply it.

It was very clear to me from the title.

And secondly, I don't believe the author should pander to people who won't even visit the site and instead judge from titles.

I think virtual can mean either stored/accessed on a computer or simulated on a computer, which is what confused me.

where do you hear virtual used to mean digital data and not simulated or emulated? a virtual machine is a synthetic digital representation of physical hardware. virtual reality is a fake representation of reality. virtual memory is a simulacrum of ram. a virtual assistant is not a dedicated helper.

i just googled the word virtual

     not physically existing as such but made by software to appear to do so.

     in effect or essence, if not in fact or reality; imitated, simulated. 

     simulated in a computer or online.

I suppose a counterexample could be a virtual library, which is a digital representation of a library, not a simulation. Still, the word virtual is not wrong in this context; if anything is confusing, it is the word order. Maybe "A Database of Virtual ..." instead of "A Virtual Database".

Thanks for providing a definition for the word "virtual."

I think part of the problem here is that I've invented something we don't have precise language for, yet. I call the agent population synthetic and I call the database virtual, and that's just the nomenclature I'm using.

I'm absolutely shocked at the number of people who feel deceived that this isn't an actual database of the entire human population.

This is a serious problem, though. My work exists at the intersection of social psychology and artificial intelligence, and these fields do not share enough vocabulary. There are major barriers to communication between "intellectual silos," and I think all of the confusion in this comment thread about the word "virtual" illustrates this point very nicely.

like I said, I think it is word order. your word order makes it sound like the database itself is virtual not the contents of the database.

How to make it clear what this data represents is a semantic conundrum. If I take a "virtual" tour of New York City, I expect to see an actual geographically-accurate approximation of New York City, not buildings randomly generated based on the probabilities of buildings having various dimensions of those found in New York City strewn across a google map.

What do we call such a thing? After reading the author's paper, I think "synthetic" is most appropriate because we would be synthesizing a city like NYC, but not a virtual representation of NYC. In fact, I find the word "virtual" in this context somewhat misleading.

if virtual means simulated, I don't find it misleading. simulated demographic population statistics implies an approximation.

It was not at all clear to me. I spent some time navigating the site and believed at first that it was just anonymized real data. Using the word "synthetic" and something like "these are not real people" on the front page would go a long way towards explaining the purpose and intent of this project.

Yeah, maybe just "Simulated ..." or "Synthetic ..."?

I did look at the data, and wasn't sure if it was real people, or what the source was.

I then searched the site, and couldn't find a description of what the project was, nor the data it contained.

Finally I came here to the comments to find out.

I consider the site a failure if it doesn't answer obvious questions about what it is, an unfortunate failure that is common in new 'lean startup' pages, etc.

In defense against these sites that want me to sign up before I know what they are, I simply forget them and their names and never mention them to anyone. It helps me sleep knowing I'm only supporting honest sites that explain what they are and what they do .. ymmv

Yeah, this almost got me, until I started to see a bunch of 3-7 year olds making six figures. I had no idea toddler modeling could pay this much.

Me: Mom, y didn't you sign me up for this?

Mom: Cuz you never asked for it...

Me: Uhhhh, I was little. I barely knew anything.

This is really fascinating.

Have you given any thought to the possibility of providing this synthetic population on a finer grained basis? It could be extremely useful if this were available for, e.g., each U.S. county (or even something smaller than that).

Edit: I noticed that each person has a lat/long location. So maybe a different way of framing my question would be, at what level of geographic granularity does this reflect differences in the distribution of the various characteristics? And, assuming that it reflects sub-national variations, have you considered allowing a random agent to be selected within an arbitrary geographic area?

Absolutely. This is an active project and I have a plugin system for synthesizing additional data sources. In the research literature, I have seen several impressive US census-derived synthetic population projects, so finer-grained data are a future direction.

Great! I've seen the U.S. Census Bureau's American Community Survey, which has extremely fine-grained data, but with a few gaps for, e.g., religion, Internet access, etc. Do you know of anything else as granular as ACS?

This, again, is really really interesting. When it comes to the fine-grained U.S. data, we may be approaching if-you-don't-build-it-I-will territory. :)

you need to add a way to adopt one (aka give it a name).

I'd adopt one for $1 to help fund your research.

you need a way to find the closest one to you.


That's a really clever idea!

    DOB: 2012-09-09
    Age: 5
    Language: Russian
    Religion: Muslim 10-15%
    Income: $18466 USD

I'm sure this type of data point exists in real life in terms of identity theft schemes...

edit: from browsing more random "agents", there are a lot of 5 y/o's making tens of thousands of $ per year:

    Location
    Country: United States
    GPS: (36.073868, -103.923638)

    Demographics
    Sex: Male
    DOB: 2012-04-11
    Age: 5
    Language: English
    Religion: Protestant
    Income: $57834 USD
    Internet: True

Ahh, to be young.

It works by taking independent distributions and sampling them independently.

So any time you lack cross-correlation data (and age vs income isn't a widely available public data source) it will assume the data is uncorrelated, and you'll get this kind of error.
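To make the point above concrete, here is a minimal sketch of independent sampling. It is not pplapi's actual code: the bracket boundaries, the weights, and the helper names `sample_from_marginal` and `sample_agent` are all made up for illustration. Because age and income are drawn separately, nothing stops a young child from landing in a high income bracket.

```python
import random

# Hypothetical marginal distributions; brackets and weights are illustrative
# assumptions, not pplapi's real source data.
AGE_BRACKETS = [(0, 14), (15, 64), (65, 90)]
AGE_WEIGHTS = [0.19, 0.66, 0.15]
INCOME_BRACKETS = [(0, 15_000), (15_000, 60_000), (60_000, 200_000)]
INCOME_WEIGHTS = [0.25, 0.50, 0.25]


def sample_from_marginal(rng, brackets, weights):
    """Pick a bracket by weight, then a uniform integer inside that bracket."""
    low, high = rng.choices(brackets, weights=weights, k=1)[0]
    return rng.randint(low, high)


def sample_agent(rng):
    """Draw age and income independently, ignoring any correlation between them."""
    return {
        "age": sample_from_marginal(rng, AGE_BRACKETS, AGE_WEIGHTS),
        "income": sample_from_marginal(rng, INCOME_BRACKETS, INCOME_WEIGHTS),
    }


rng = random.Random(0)
agents = [sample_agent(rng) for _ in range(10_000)]

# Children with an adult-sized income: an artifact of sampling independently.
rich_kids = [a for a in agents if a["age"] <= 7 and a["income"] >= 15_000]
print(f"{len(rich_kids)} of {len(agents)} agents are aged 7 or under with income >= $15,000")
```

Fixing this would need a joint (or conditional) distribution of income given age, which is exactly the cross-correlation data that is hard to find publicly.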

Whether it reduces the utility of the data depends on the use case. I suspect it often will.

I did think that referencing social networks is a bit off, since this isn't a social network model. We have those, this isn't one.

Looks like there's an off-by-one error with ages
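One plausible cause, sketched below, is computing age as a plain difference of calendar years. This is only a guess at what the site does; the function names are hypothetical, and the viewing date is assumed to be the date of this thread (2017-01-11).

```python
from datetime import date


def age_by_year_difference(dob, today):
    """Naive age: ignores whether this year's birthday has happened yet."""
    return today.year - dob.year


def age_in_completed_years(dob, today):
    """Age in completed years: subtract one if the birthday is still ahead."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


viewed_on = date(2017, 1, 11)  # assumed viewing date
for dob in (date(2012, 9, 9), date(2012, 4, 11)):  # DOBs quoted in the thread
    print(dob, age_by_year_difference(dob, viewed_on), age_in_completed_years(dob, viewed_on))
```

Under that assumption, both agents quoted in this thread would be 4 in completed years, while the year difference gives the 5 shown on their pages.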

Maybe the website should tell that this is a synthetic dataset, it is not clear from it.

This and the many similar comments are pretty confusing to me; with the common-sense context of how massively impossible it would be to have actual individual row-level facts on the entire population (with any choice for column[s]) "virtual database of the entire human population" seems it should be sufficiently clear about the nature of the project.

After a while it was also clear to me that this is synthetic data, but at first the wording sounds like this is real data, which led to a bit of confusion.

In quite the coincidence, when I attempted to get a random human in the US, I got a simulated person 3mi from my (very rural) house. When I searched the coords, it was in the middle of a forest. The demographics seemed off for my area (too young, income too low, there is no internet access in that area, unlikely religion)

Is the precision limited to the country level?

Geographically, yes the precision is at the country level. Within each country, there is another breakdown by age and sex, another by religion, another for language, and so on. So, there are a few ways - beyond geography - that you can slice the results and they are going to be proportional to reality.

Of course, the devil is in the details, so when you zoom in on any individual you can see the "uncanny valley" of how the simulated agents aren't quite right.

The limit seems to be 2-7171922938 [1][2]; for some reason, 1 throws an error [3].

Edit: I should have read the documentation first [4].

> The current database contains 7,171,922,938 agents and is approximately 6.8 TB in size.

[1] http://pplapi.com/2.html

[2] http://pplapi.com/7171922938.html

[3] http://pplapi.com/1.html

[4] http://pplapi.com/docs/
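For anyone who wants to script this, here is a small sketch that builds agent page URLs from the pattern in the links above. The ID bounds come from this comment (with 1 excluded because it errors); the helper names are hypothetical, and this is not the documented API, which may differ and is rate limited.

```python
import random

# Bounds observed in this thread; id 1 currently returns an error.
MIN_ID = 2
MAX_ID = 7_171_922_938


def agent_url(agent_id):
    """Build the HTML page URL for an agent id, following the links above."""
    if not MIN_ID <= agent_id <= MAX_ID:
        raise ValueError(f"agent id must be between {MIN_ID} and {MAX_ID}")
    return f"http://pplapi.com/{agent_id}.html"


def random_agent_url(rng):
    """Pick a uniformly random valid agent id and return its page URL."""
    return agent_url(rng.randint(MIN_ID, MAX_ID))


print(random_agent_url(random.Random()))
```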

Naturally. :) I wonder why id 1 isn't working ...

Thanks for the report!

Wow, fascinating. It was fun to just keep clicking random and read the stats of the virtual populace.

What does age 0 (unborn) mean, and how did those entries get in there? E.g. http://pplapi.com/3187300229.html

Do you plan to simulate deaths as time goes by and an agent's stated birthdate puts its age above expected mortality, and to add new "babies" as well? Is that what I'm seeing? If not, it would be cool.

Thanks! Yes, I plan for agents to support death. In fact, I plan to backfill the entire human species into the database, most of whom are dead. I estimate this will grow the database by 80-110 billion agents.
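A back-of-the-envelope sketch of what that backfill could mean for storage, using the figures quoted from the docs elsewhere in this thread (7,171,922,938 agents, approximately 6.8 TB). The assumptions are loud ones: 1 TB is taken as 10^12 bytes, and the average size per agent is assumed to stay the same for backfilled agents.

```python
# Figures quoted from the pplapi docs in this thread.
CURRENT_AGENTS = 7_171_922_938
CURRENT_SIZE_BYTES = 6.8e12  # assumes 1 TB = 10**12 bytes

# Assumption: backfilled agents take the same average space as current ones.
bytes_per_agent = CURRENT_SIZE_BYTES / CURRENT_AGENTS

# Estimated growth from the comment above: 80 to 110 billion agents.
low_tb = 80e9 * bytes_per_agent / 1e12
high_tb = 110e9 * bytes_per_agent / 1e12

print(f"about {bytes_per_agent:.0f} bytes per agent")
print(f"backfill adds roughly {low_tb:.0f} to {high_tb:.0f} TB")
```

That works out to a little under 1 KB per agent, so the backfill would add on the order of 76 to 104 TB if the per-agent size holds.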

Imagine that you are born as a random person, and ask yourself whether the outcome would be better than your existing country/wealth.

This is fascinating. It's not quite clear to me however if these are real people who have been anonymized or if they are a unitary statistical representation of the population?

Let's say I was looking for a startup opportunity. What kind of products can I build with this?

They are not real people, but it is not a "unitary statistical representation" either. At the country level, crossed by the age/sex level, pplapi agents track pretty closely to real humans. Other dimensions (e.g. income) have more noise in them, so they are less reliable.

That explains why a 3 year old in Egypt had an income of $2500 USD or so. I was wondering if it included family income or something else not obvious from a cursory glance.

> Currently, Agent Space contains a simulated entry for each of the ~7 billion humans alive in 2014

Ah, I interpreted "based on" to mean that this might be an extension of that work. What the extension comprises is a bit ambiguous.

Some say the issue of capitalism (and other social and economic theories) is that it makes the presumption that humans are rational, which is a pretty crappy simulation of human behaviour. Could this be an improvement on that?

Very interesting concept.

Can someone download the entire data set in JSON format, even for a small fee?

The public API is rate limited. However, if you want to purchase a dump, we can work something out.

Super cool idea.
