
This is part of my research and I'm happy to answer any questions.

I gave a preliminary poster about this a few months ago:

http://imiller.utsc.utoronto.ca/media/ci-2016/2016-06-03_a-s...

This project is under active development as I wrap up my PhD.




Typically, when one does a university research study in the US involving human subjects, there are IRB requirements (e.g. http://humansubjects.stanford.edu/new/resources/consent/). I see research mentioned on the website, and also that the data set covers the entire world population. Have you run into anything like this?


It's a synthetic world population derived from public population data; these are not real people (i.e. human subjects) and the source raw data were not identifiable or even individual-level data.

I nevertheless think there are a number of issues raised by the possibility of a human population database, but to be clear: pplapi does not have actual people in it. I am currently writing about some of these issues.


Until I looked at the actual data, I didn't realize this was synthetic either. The developer might want to make that clear in the header text. "Virtual Database" doesn't make this clear. I think "virtual database synthesizing the entire human population" might make it clearer.

That said, this is very impressive and has my imagination tingling with ways to apply it.


It was very clear to me from the title.

And secondly, I don't believe the author should pander to people who won't even visit the site and instead judge from titles.


I think "virtual" can mean either stored/accessed on a computer or simulated on a computer, which is what confused me.


Where do you hear "virtual" used to mean digital data rather than simulated or emulated? A virtual machine is a synthetic digital representation of physical hardware. Virtual reality is a fake representation of reality. Virtual memory is a simulacrum of RAM. A virtual assistant is not a dedicated helper.

I just googled the word "virtual":

     not physically existing as such but made by software to appear to do so.

     in effect or essence, if not in fact or reality; imitated, simulated. 

     simulated in a computer or online.

I suppose a counterexample could be a virtual library, which is a digital representation of a library, not a simulation. Still, the word "virtual" is not wrong in this context; if anything is confusing, it is the word order. Maybe "A Database of Virtual ..." instead of "A Virtual Database".


Thanks for providing a definition for the word "virtual."

I think part of the problem here is that I've invented something we don't have precise language for, yet. I call the agent population synthetic and I call the database virtual, and that's just the nomenclature I'm using.

I'm absolutely shocked at the number of people who feel deceived that this isn't an actual database of the entire human population.

This is a serious problem, though. My work exists at the intersection of social psychology and artificial intelligence, and these fields do not share enough vocabulary. There are major barriers to communication between "intellectual silos," and I think all of the confusion in this comment thread about the word "virtual" illustrates this point very nicely.


Like I said, I think it is the word order. Your word order makes it sound like the database itself is virtual, not the contents of the database.


How to make it clear what this data represents is a semantic conundrum. If I take a "virtual" tour of New York City, I expect to see a geographically accurate approximation of New York City, not buildings randomly generated from the distribution of building dimensions found in New York City and strewn across a Google map.

What do we call such a thing? After reading the author's paper, I think "synthetic" is most appropriate because we would be synthesizing a city like NYC, but not a virtual representation of NYC. In fact, I find the word "virtual" in this context somewhat misleading.


If "virtual" means simulated, I don't find it misleading. Simulated demographic population statistics imply an approximation.


It was not at all clear to me. I spent some time navigating the site and believed at first that it was just anonymized real data. Using the word "synthetic" and something like "these are not real people" on the front page would go a long way towards explaining the purpose and intent of this project.


Yeah, maybe just "Simulated ..." or "Synthetic ..."?


I did look at the data, and wasn't sure if it was real people, or what the source was.

I then searched the site, and couldn't find a description of what the project was, nor the data it contained.

Finally I came here to the comments to find out.

I consider the site a failure if it doesn't answer obvious questions about what it is, an unfortunate failure that is common in new 'lean startup' pages, etc.

In defense against these sites that want me to sign up before I know what they are, I simply forget them and their names and never mention them to anyone. It helps me sleep knowing I'm only supporting honest sites that explain what they are and what they do. YMMV.


Yeah, this almost got me, until I started seeing a bunch of 3-7 year olds making six figures. I had no idea toddler modeling could pay this much.

Me: Mom, y didn't you sign me up for this?

Mom: Cuz you never asked for it...

Me: Uhhhh, I was little. I barely knew anything.


This is really fascinating.

Have you given any thought to the possibility of providing this synthetic population on a finer-grained basis? It could be extremely useful if this were available for, e.g., each U.S. county (or even something smaller than that).

Edit: I noticed that each person has a lat/long location. So maybe a different way of framing my question would be, at what level of geographic granularity does this reflect differences in the distribution of the various characteristics? And, assuming that it reflects sub-national variations, have you considered allowing a random agent to be selected within an arbitrary geographic area?


Absolutely. This is an active project and I have a plugin system for synthesizing additional data sources. In the research literature, I have seen several impressive US census-derived synthetic population projects, so finer-grained data are a future direction.


Great! I've seen the U.S. Census Bureau's American Community Survey, which has extremely fine-grained data, but with a few gaps for, e.g., religion, Internet access, etc. Do you know of anything else as granular as ACS?

This, again, is really, really interesting. When it comes to the fine-grained U.S. data, we may be approaching if-you-don't-build-it-I-will territory. :)



