
Pplapi: A virtual database of the entire human population - caseysoftware
http://pplapi.com/
======
idm
This is part of my research and I'm happy to answer any questions.

I gave a preliminary poster about this a few months ago:

[http://imiller.utsc.utoronto.ca/media/ci-2016/2016-06-03_a-s...](http://imiller.utsc.utoronto.ca/media/ci-2016/2016-06-03_a-synthetic-world-population.pdf)

This project is under active development as I wrap up my PhD.

~~~
divbit
Typically, when one does a university research study in the US involving human
subjects, there are IRB requirements (e.g.
[http://humansubjects.stanford.edu/new/resources/consent/](http://humansubjects.stanford.edu/new/resources/consent/)).
I see research mentioned on the website, and also that the entire world
population is mentioned in the data set. Have you run into anything like this?

~~~
idm
It's a synthetic world population derived from public population data; these
are not real people (i.e. human subjects) and the source raw data were not
identifiable or even individual-level data.

I nevertheless think there are a number of issues raised by the possibility of
a human population database, but to be clear: pplapi does not have actual
people in it. I am currently writing about some of these issues.

~~~
ideonexus
Until I looked at the actual data, I didn't realize this was synthetic either.
The developer might want to make that clear in the header text. "Virtual
Database" doesn't make this clear. I think "virtual database _synthesizing_
the entire human population" might make it clearer.

That said, this is very impressive and has my imagination tingling with ways
to apply it.

~~~
philtar
It was very clear to me from the title.

And secondly, I don't believe the author should pander to people who won't
even visit the site and instead judge from titles.

~~~
divbit
I think virtual can mean either stored/accessed on a computer or simulated
on a computer, which is what confused me.

~~~
basch
where do you hear virtual used to mean digital data and not simulated or
emulated? a virtual machine is a synthetic digital representation of physical
hardware. virtual reality is a fake representation of reality. virtual memory
is a simulacrum of ram. a virtual assistant is not a dedicated helper.

i just googled the word virtual

    
    
         not physically existing as such but made by software to appear to do so.
    
         in effect or essence, if not in fact or reality; imitated, simulated. 
    
         simulated in a computer or online.
    

I suppose a counterexample could be a virtual library, which is a digital
representation of a library, not a simulation. but still the word virtual is
not wrong in this context. if anything is confusing it is the word order.
maybe A Database of Virtual ... instead of A Virtual Database

~~~
idm
Thanks for providing a definition for the word "virtual."

I think part of the problem here is that I've invented something we don't have
precise language for, yet. I call the agent population _synthetic_ and I call
the database _virtual_, and that's just the nomenclature I'm using.

I'm absolutely shocked at the number of people who feel deceived that this
isn't an actual database of the entire human population.

This is a serious problem, though. My work exists at the intersection of
social psychology and artificial intelligence, and these fields do not share
enough vocabulary. There are major barriers to communication between
"intellectual silos," and I think all of the confusion in this comment thread
about the word "virtual" illustrates this point very nicely.

~~~
basch
like I said, I think it is word order. your word order makes it sound like the
database itself is virtual, not the contents of the database.

------
sharemywin
you need to add a way to adopt one (aka give it a name).

I'd adopt one for $1 to help fund your research.

you need a way to find the closest one to you.

[http://www.w3schools.com/html/html5_geolocation.asp](http://www.w3schools.com/html/html5_geolocation.asp)
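
A rough sketch of the "closest one to you" part, assuming the visitor's coordinates are already known (e.g. from the browser geolocation API linked above) and that candidate agents carry GPS coordinates. The agent ids, the sample list, and the helper names are hypothetical, and nothing here reflects how pplapi actually stores or queries locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_agent(lat, lon, agents):
    """Return the agent dict closest to (lat, lon) by linear scan."""
    return min(agents, key=lambda a: haversine_km(lat, lon, a["lat"], a["lon"]))


# Hypothetical candidates; ids and coordinates are for illustration only.
agents = [
    {"id": "a", "lat": 36.073868, "lon": -103.923638},
    {"id": "b", "lat": 43.78, "lon": -79.19},
    {"id": "c", "lat": 55.75, "lon": 37.62},
]

closest = nearest_agent(40.71, -74.01, agents)
print(closest["id"])
```

A linear scan like this only works for a small candidate list; searching billions of agents would need some kind of spatial index.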

~~~
idm
That's a really clever idea!

------
dom0

        DOB: 2012-09-09
        Age: 5
        Language: Russian
        Religion: Muslim 10-15%
        Income: $18466 USD
    

Hmmm

~~~
24gttghh
I'm sure this type of data point exists in real life in terms of identity
theft schemes...

edit: from browsing more random "agents", there are a lot of 5 y/o's making
tens of thousands of $ per year:

    Location
        Country: United States
        GPS: (36.073868, -103.923638)
    Demographics
        Sex: Male
        DOB: 2012-04-11
        Age: 5
        Language: English
        Religion: Protestant
        Income: $57834 USD
        Internet: True

Ahh, to be young.

------
legulere
Maybe the website should say that this is a synthetic dataset; it is not
clear from the site itself.

~~~
lotyrin
This and the many similar comments are pretty confusing to me. Given the
common-sense context of how massively impossible it would be to have actual
individual row-level facts on the entire population (with any choice of
column[s]), "virtual database of the entire human population" seems like it
should be sufficiently clear about the nature of the project.

~~~
legulere
After a while it was also clear to me that this is synthetic data, but at
first the wording sounds like this is real data, which led to a bit of
confusion.

------
howderek
In quite the coincidence, when I attempted to get a random human in the US, I
got a simulated person 3 mi from my (very rural) house. When I searched the
coords, it was in the middle of a forest. The demographics seemed off for my
area (too young, income too low, there is no internet access in that area,
unlikely religion).

Is the precision limited to the country level?

~~~
idm
Geographically, yes, the precision is at the country level. Within each
country, there is another breakdown by age and sex, another by religion,
another for language, and so on. So, there are a few ways - beyond geography -
that you can slice the results and they are going to be proportional to
reality.

Of course, the devil is in the details, so when you zoom in on any individual
you can see the "uncanny valley" of how the simulated agents _aren't quite
right_.
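
To illustrate why individuals can look odd while the aggregates still hold, here is a minimal sketch. It assumes each attribute is drawn independently from its own country-level distribution; that independence is an assumption made for illustration, as are the distributions, the numbers, and the `sample_agent` helper. It is not pplapi's actual generator.

```python
import random

# Hypothetical country-level marginals (illustrative numbers, not real data).
AGE_BANDS = {"0-14": 0.19, "15-64": 0.66, "65+": 0.15}
LANGUAGES = {"English": 0.79, "Spanish": 0.13, "Other": 0.08}


def draw(dist, rng):
    """Pick one key from a {value: probability} mapping."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_agent(rng, mean_income=55000.0):
    """Draw one synthetic agent; every attribute is sampled independently."""
    return {
        "age_band": draw(AGE_BANDS, rng),
        "language": draw(LANGUAGES, rng),
        # Income ignores age entirely, so children can end up with a salary.
        "income": round(rng.expovariate(1.0 / mean_income)),
    }


rng = random.Random(0)
agents = [sample_agent(rng) for _ in range(10000)]

share_children = sum(a["age_band"] == "0-14" for a in agents) / len(agents)
children_with_income = sum(
    a["age_band"] == "0-14" and a["income"] > 0 for a in agents
)
print(f"share aged 0-14: {share_children:.2f}")
print(f"children with an income: {children_with_income}")
```

The share of each age band comes out close to the input proportion, but nothing stops an individual child from being assigned an adult-sized income.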

------
guessmyname
The limit seems to be 2-7171922938 [1][2]; for some reason 1 throws an error
[3].

Edit: I should have read the documentation first [4].

> The current database contains 7,171,922,938 agents and is approximately 6.8
> TB in size.

[1] [http://pplapi.com/2.html](http://pplapi.com/2.html)

[2] [http://pplapi.com/7171922938.html](http://pplapi.com/7171922938.html)

[3] [http://pplapi.com/1.html](http://pplapi.com/1.html)

[4] [http://pplapi.com/docs/](http://pplapi.com/docs/)
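
The URL pattern in the links above can be sketched in a few lines. The id range (2 to 7171922938) and the `http://pplapi.com/<id>.html` pattern are taken from this comment; the helper names are hypothetical, and the code only builds URLs without making any request.

```python
import random

MIN_ID = 2               # id 1 returned an error when this was posted
MAX_ID = 7_171_922_938   # agent count quoted from the docs


def agent_url(agent_id):
    """Build the HTML page URL for one agent id, rejecting out-of-range ids."""
    if not MIN_ID <= agent_id <= MAX_ID:
        raise ValueError(f"agent id must be between {MIN_ID} and {MAX_ID}")
    return f"http://pplapi.com/{agent_id}.html"


def random_agent_url(rng=random):
    """Pick a uniformly random valid id and return its URL."""
    return agent_url(rng.randint(MIN_ID, MAX_ID))


print(agent_url(2))
print(random_agent_url())
```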

~~~
idm
Naturally. :) I wonder why id 1 isn't working ...

Thanks for the report!

------
evolve2k
Wow, fascinating. It was fun to just keep clicking random and read the stats of
the virtual populace.

What does it mean by age 0 (unborn)? How are those entries in there? E.g.
[http://pplapi.com/3187300229.html](http://pplapi.com/3187300229.html)

Do you plan to simulate deaths as time goes by, once an agent's stated birthdate
puts their age above expected mortality, and add new "babies" as well? Is that
what I'm seeing? If not, it would be cool.

~~~
idm
Thanks! Yes, I plan for agents to support death. In fact, I plan to backfill
the entire human species into the database, most of whom are dead. I estimate
this will grow the database by 80-110 billion agents.
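
A back-of-the-envelope for that growth, using the figures quoted from the docs elsewhere in this thread (7,171,922,938 agents in about 6.8 TB). It assumes storage scales linearly with agent count and that 1 TB means 10^12 bytes; both are assumptions, and the helper name is made up.

```python
CURRENT_AGENTS = 7_171_922_938   # from the docs quote in this thread
CURRENT_SIZE_TB = 6.8            # from the docs quote in this thread
BYTES_PER_TB = 10**12            # assumption: decimal terabytes

# Average storage per agent implied by the current database.
bytes_per_agent = CURRENT_SIZE_TB * BYTES_PER_TB / CURRENT_AGENTS


def added_size_tb(extra_agents):
    """Extra storage in TB if each new agent costs the current average."""
    return extra_agents * bytes_per_agent / BYTES_PER_TB


low = added_size_tb(80e9)
high = added_size_tb(110e9)
print(f"{bytes_per_agent:.0f} bytes per agent")
print(f"{low:.0f} to {high:.0f} TB of additional data")
```

Under those assumptions each agent costs a little under 1 KB, so the backfill would add roughly 76 to 104 TB.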

------
kwikiel
Imagine that you are born as a random person, and ask yourself if the outcome
would be better than your existing country/wealth.

------
ReedJessen
This is fascinating. It's not quite clear to me, however, whether these are
real people who have been anonymized or a unitary statistical representation
of the population.

Let's say I was looking for a startup opportunity. What kind of products can I
build with this?

~~~
idm
They are not real people, but it is not a "unitary statistical representation"
either. At the country level, crossed by the age/sex level, pplapi agents
track pretty closely to real humans. Other dimensions (e.g. income) have more
noise in them, so they are less reliable.

~~~
dhimes
That explains why a 3 year old in Egypt had an income of $2500 USD or so. I
was wondering if it included family income or something else not obvious from
a cursory glance.

------
AAAton
Some say the issue of capitalism (and other social and economic theories) is
that it makes the presumption that humans are rational, which is a pretty
crappy simulation of human behaviour. Could this be an improvement on that?

------
beamatronic
Very interesting concept.

Can someone download the entire data set in JSON format, even for a small fee?

~~~
idm
The public API is rate limited. However, if you want to purchase a dump, we
can work something out.

------
george_ciobanu
Super cool idea.

