US Presidential Election $25k Database Bounty Review (dolthub.com)
145 points by mjangle1985 on Feb 16, 2021 | 36 comments



This is interesting, but why isn't this data easily available by default? I understand that the Presidential Election is actually a separate election in each state, but why wouldn't each state's election authority make that data readily available by default? This really feels like data that should just be out there for anyone to download and analyze.

In Canada, the equivalent data is readily available from Elections Canada[0]. For Provincial elections, the story is a bit more mixed, but open by default is the general rule. Elections Alberta, for example, provides Excel files with poll-by-poll results[1] - it's not as easy to work with as Elections Canada's CSVs, but a little Python can get it into a more reasonable format.

[0] https://www.elections.ca/content.aspx?section=res&dir=rep/of...

[1] https://officialresults.elections.ab.ca/orResultsPGE.cfm?Eve...


The data is easily available in most places. The issue is that every place uses a different reporting format, including using slightly different names for the same candidates. The value of the bounty is getting this data into a single schema recording all the votes for every precinct in the country, with all names normalized.
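A minimal sketch of the name-normalization step, assuming a hand-maintained alias table. The spellings in `CANONICAL_NAMES` and the function `normalize_candidate` are hypothetical illustrations, not taken from the bounty repository or any real state file:

```python
import re

# Hypothetical alias table: keys are lowercased, punctuation-stripped spellings
# as they might appear in different state or county files.
CANONICAL_NAMES = {
    "joseph r biden": "Joe Biden",
    "joe biden": "Joe Biden",
    "biden joseph r": "Joe Biden",
    "donald j trump": "Donald Trump",
    "donald trump": "Donald Trump",
}


def normalize_candidate(raw: str) -> str:
    """Map a raw reported name to its canonical form; fail loudly if unmapped."""
    key = re.sub(r"[^a-z ]", "", raw.lower())  # drop punctuation and digits
    key = " ".join(key.split())                # collapse repeated whitespace
    if key not in CANONICAL_NAMES:
        raise KeyError(f"unmapped candidate name: {raw!r}")
    return CANONICAL_NAMES[key]
```

Raising on an unknown spelling, rather than passing it through, means a new variant gets added to the table by hand instead of silently creating a second candidate.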


What level are the differences at? I can see each state having a different reporting format. Do the differences go down to the electoral district or county or something?

In Canada it's nice and easy - federal elections are handled by Elections Canada, provincial elections are handled by each province's election authority, and we don't have cases where an election at one level results in a person at a higher level getting into office. Well, except for Alberta's senate elections which are somewhat farcical anyway (senators are appointed by the Prime Minister, so the results of this particular election are basically vague suggestions that the PM sometimes follows).


New York is a great example of weirdness.

New York allows a candidate to be a party's nominee even if the candidate isn't a member of that party. So you had Joe Biden as the nominee for the Democratic Party and then Joe Biden also listed as the nominee for the Working Families Party.

Just weirdness like that abounds in the data in almost every state.
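To illustrate why this matters for the data, here is a rough sketch of rolling several party lines up into one total per candidate. The row layout and the vote counts are made up for the example:

```python
from collections import defaultdict

# Made-up rows of (candidate, party line, votes); the counts are illustrative only.
rows = [
    ("Joe Biden", "Democratic", 100),
    ("Joe Biden", "Working Families", 10),
    ("Other Candidate", "Other Party", 90),
]


def totals_by_candidate(rows):
    """Sum votes per candidate, ignoring which party line they were cast on."""
    totals = defaultdict(int)
    for candidate, _party, votes in rows:
        totals[candidate] += votes
    return dict(totals)


totals = totals_by_candidate(rows)
```

Keying on the candidate alone only works once the names are normalized; otherwise the two lines would count as two different people.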

In addition, a lot of the precinct-level reporting was done at the county level. Many states don't publish a single CSV containing all precinct-level voting data, so you have to go to each county to get it. Some states have a lot of counties. PA, for example, has 67, and each county publishes data in a different format with different values.

It's tedious and honestly impossible to automate (at least in the case of PA).


That's fascinating! Thank you for the explanation!

Our elections in Canada are a lot simpler - it's interesting to see how our neighbour does things!


Canada. Canada! Canada! — another Canadian :)


> So you had Joe Biden as the nominee for the Democratic Party and then Joe Biden also listed as the nominee for the Working Families party.

I’m confused how this works. The candidate themselves doesn’t have to claim they are from a particular party to be listed as such? That seems wrong or misleading but what is the point? Is it some kind of hack to garner enough votes for a party to trigger some kind of funding?


Parties don't get votes (at least in most elections); the candidates do. The candidates can be members of a party and may be endorsed by them. In NY it sounds like a party can endorse a non-member. This doesn't sound fundamentally wrong or misleading to me.


Differences are down to the reporting entity. In the US this is usually the state, but is often the county. I don't think I saw any state with congressional district level differences.


Thanks - this is definitely clearing up some of my misconceptions about how the US votes!


Not only is it not easily available, but there's no current way to validate the vote casting or tabulation process. Biden literally couldn't prove that he won even if he cared to.

Some say this is a bug, but it sounds more like a feature. If not, why wouldn't it be fixed when the technology exists? And why would people go above and beyond to have the current technology installed?


Seems like a lot of this data is sourced from OpenElections[0][1]? Not sure what makes this better than OpenElections' data, especially considering OpenElections seems to have done a large portion of the difficult wrangling work...

There should definitely be better attribution for this data.

[0] https://twitter.com/derekwillis/status/1361508657154961408

[1] https://github.com/openelections


A portion of it was sourced from OpenElections, yes. These sources were attributed in the PRs, but not in the README for the repository. We've corrected this oversight in attribution and issued an apology.


"A lot" of the data wasn't sourced from Open Elections. Derek's claim is not true. As far as I know one user used Open Elections data for a portion of their total contributions (and attributed it in their PR).

"Many" of the contributors (myself included) used primary sources for their data.

The one contributor that used OE data cited it and it was a small portion of their overall contributions.


This “data bounty” thing is really cool.


I participated in this bounty; it was a blast.

The team at DoltHub is great and extremely accessible on their Discord. These bounties seem like a great use case for their tech.

If you're into git and data (like I am) then these bounties are just awesome.


So you acquired and preprocessed some data for them? How did they verify its correctness?

I agree this looks like a great enterprise.


So data was accepted via pull requests.

The maintainer will review a PR and either accept the PR or ask you to modify it to better fit their requirements or reject it.

For example, one of their requirements was no zero-vote rows. That's a pretty simple SQL query on the database and can be checked before the maintainer does a merge.
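A rough sketch of what such a check could look like. The table name `vote_tallies` comes from the dataset, but the columns here are assumptions, and Python's built-in `sqlite3` stands in for the actual Dolt database:

```python
import sqlite3

# In-memory stand-in database; the column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vote_tallies (precinct TEXT, candidate TEXT, votes INTEGER)"
)
conn.executemany(
    "INSERT INTO vote_tallies VALUES (?, ?, ?)",
    [
        ("Precinct 1", "Candidate A", 120),
        ("Precinct 1", "Candidate B", 95),
        ("Precinct 2", "Candidate A", 40),
        ("Precinct 2", "Candidate B", 0),  # violates the no-zero-vote-rows rule
    ],
)

# Any row returned here would need to be removed before the PR is merged.
zero_rows = conn.execute(
    "SELECT precinct, candidate FROM vote_tallies WHERE votes = 0"
).fetchall()

for precinct, candidate in zero_rows:
    print(f"zero-vote row: {precinct} / {candidate}")
```

An empty result means the data passes this particular requirement; it says nothing about whether the non-zero counts are right.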

All data was required to be sourced. I got most of my data from state and county websites so those links were included with the comments in the PR.

In addition, I was in communication with the team via their Discord, so they would also ask for changes to PRs there.


Interesting, thanks for sharing. So for the latest one where they have a bounty for assembling the largest healthcare dataset -- how do they determine who gets what portion of the bounty? It's not just winner takes all right?

This data looks cool too, I'll have a look in the Discord...


It's divided based upon total additions to the final data set, I believe. They have a GitHub repo for their bounty board that shows that calculation, I think.

I think the final calculation is based on the percent you've added to the final dataset.

edit:

Yeah, here's the repo that calculates the final payment: https://github.com/dolthub/bounties
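As I understand it, the idea is a proportional split. This is only a sketch under that assumption, not the actual code from the repo; the function `split_bounty`, the contributor names, and the contribution counts are all hypothetical, and the real calculation may count contributions differently:

```python
def split_bounty(total: float, contributions: dict[str, int]) -> dict[str, float]:
    """Split `total` in proportion to each contributor's share of the final data set."""
    grand_total = sum(contributions.values())
    if grand_total == 0:
        raise ValueError("no contributions to split the bounty over")
    return {
        name: total * count / grand_total
        for name, count in contributions.items()
    }


# Hypothetical contributors and counts of surviving additions.
payouts = split_bounty(25_000, {"alice": 3_000, "bob": 1_000})
```

With these made-up numbers, a contributor responsible for three quarters of the final data gets three quarters of the $25k.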


Do they require you to "show your work" or somehow demonstrate that your methodology is sound? Is there any requirement that it be repeatable (e.g. let's say they find a minor issue in data you sourced. If you were required to provide code that does the work, rather than just the data, it could be fixed and re-run).

I'm kind of fascinated by the process, but I am having a hard time figuring out how this can really work. It can't really be as simple as paying people to shove arbitrary data of unknown value in their dbs, can it?


Join the discord and hit them with some questions. I'm sure they'll be happy to answer you.


Why not answer here? It’s an opportunity to generate more hacker interest. I’m probably not going to sign up for discord to ask one question.


So is this basically a combination of Mechanical Turk and a bug bounty?

Say I decided to skip work tomorrow and try to get a bounty, or at least part of one. What do I do?


The hospital data bounty is the only one they have running right now, but it's still early, so there's lots left to do and plenty of data left to source.

They are launching another bounty later this week for college course data.


The voting data consists of four tables:

candidates, counties, precincts, vote_tallies

Without a set of open tables for voters and votes there is no chain of custody. Without a chain of custody this data is academic at best. There is no reliable way to verify it.

Those two tables remove privacy from the voter, and open up the risk of voter intimidation, but that doesn’t change the fact that this data, and any findings associated with it, are, at best, interesting.

The Bounty concept, on the other hand, is commendable.


How is this not already a public resource offered by the US federal government? Surely it would be trivial to require states to collect and report their election data to the fed using a standardized schema.


Elections are the responsibility of the individual states, so right there, there are at least 51 entities with their own data sets. Of course, in many states, elections are handled at the county level, so that’s more entities involved. And at least in my state, individual precincts may have different voting systems, within a single county.

Basically, it’s a decentralized mess.


It can stay a decentralized mess. All the federal government has to do is require states to collect and report specific data in a specific format. They don't have to take over the election process. Apparently the states are already keeping records of that data, since this project was able to reach 100% for the 2016 election, so they're already halfway there.


How much do you want to be taxed for this service? We got it from strangers for $25,000 in 2 months. I'm guessing it would be slightly more if dictated by statute.


"The fed" does not mean "the federal government." "The Fed" means the Federal Reserve.


Yeah, sorry. Hopefully it was clear what I meant in context.


>How is this not already a public resource offered by the US federal government?

Because the Constitution protects how States run their elections from federal restrictions/requirements.


This bounty model for data is really interesting. What's dolthub's business model? Clearly the data must be worth more than the bounty.


CEO of DoltHub here.

The data created by bounties is free and open.

Our business model is to sell database licenses. We're a database startup. Bounties are the thing that shows off our capabilities the best, so we consider it marketing. It could be more in the future, i.e., a two-sided marketplace, but right now, we're just getting started.


Thanks for your answer! I really like this bounty model. It's really creative and productive.



