This is interesting, but why isn't this data easily available by default? I understand that the Presidential Election is actually a separate election in each state, but why wouldn't each state's election authority make that data readily available by default? This really feels like data that should just be out there for anyone to download and analyze.
In Canada, the equivalent data is readily available from Elections Canada[0]. For Provincial elections, the story is a bit more mixed, but open by default is the general rule. Elections Alberta, for example, provides Excel files with poll-by-poll results[1] - it's not as easy to work with as Elections Canada's CSVs, but a little Python can get it into a more reasonable format.
The data is easily available in most places. The issue is that every place uses a different reporting format, including using slightly different names for the same candidates. The value of the bounty is getting this data into a single schema recording all the votes for every precinct in the country, with all names normalized.
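To make the name problem concrete, here is a minimal sketch of how the normalization could work. The alias table, function names, and spellings are all hypothetical and not the bounty's actual code: clean up each raw spelling, then look it up in a hand-maintained map.

```python
import re
import unicodedata

# Hypothetical alias table: keys are cleaned spellings, values are canonical names.
ALIASES = {
    "joseph r biden": "Joe Biden",
    "joseph r biden jr": "Joe Biden",
    "biden joseph r": "Joe Biden",
    "joe biden": "Joe Biden",
    "donald j trump": "Donald Trump",
    "trump donald j": "Donald Trump",
    "donald trump": "Donald Trump",
}


def clean(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def normalize_candidate(name: str) -> str:
    """Return the canonical name, or raise so unknown spellings get reviewed by hand."""
    key = clean(name)
    if key not in ALIASES:
        raise KeyError(f"unmapped candidate name: {name!r}")
    return ALIASES[key]
```

Raising on an unmapped spelling, rather than guessing, is a deliberate choice here: a silent fuzzy match could merge two different candidates.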
What level are the differences at? I can see each state having a different reporting format. Do the differences go down to the electoral district or county or something?
In Canada it's nice and easy - federal elections are handled by Elections Canada, provincial elections are handled by each province's election authority, and we don't have cases where an election at one level results in a person at a higher level getting into office. Well, except for Alberta's senate elections which are somewhat farcical anyway (senators are appointed by the Prime Minister, so the results of this particular election are basically vague suggestions that the PM sometimes follows).
New York allows candidates to be the nominees even if the candidates aren't in that party. So you had Joe Biden as the nominee for the Democratic Party and then Joe Biden also listed as the nominee for the Working Families party.
Just weirdness like that abounds in the data in almost every state.
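A rough sketch of what that means for the data, with made-up precincts and vote counts: the same candidate shows up once per party line, so the rows have to be summed per candidate while the party line is kept as its own column.

```python
from collections import defaultdict

# Made-up rows for illustration: (precinct, candidate, party_line, votes)
rows = [
    ("P-001", "Joe Biden", "Democratic", 410),
    ("P-001", "Joe Biden", "Working Families", 35),
    ("P-001", "Other Candidate", "Party X", 380),
]


def totals_by_candidate(rows):
    """Sum votes per (precinct, candidate) across all party lines."""
    totals = defaultdict(int)
    for precinct, candidate, _party_line, votes in rows:
        totals[(precinct, candidate)] += votes
    return dict(totals)


if __name__ == "__main__":
    for key, votes in totals_by_candidate(rows).items():
        print(key, votes)
```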
In addition, a lot of the precinct-level reporting was done at the county level. Many states don't have a single CSV containing all precinct-level voting data, so you have to go to each county to get it. Some states have a lot of counties. PA, for example, has 67, and each county publishes data in a different format with different values.
It's tedious and honestly impossible to automate (at least in the case of PA).
> So you had Joe Biden as the nominee for the Democratic Party and then Joe Biden also listed as the nominee for the Working Families party.
I’m confused how this works. The candidate themselves doesn’t have to claim they are from a particular party to be listed as such? That seems wrong or misleading but what is the point? Is it some kind of hack to garner enough votes for a party to trigger some kind of funding?
Parties don't get votes (at least in most elections), the candidates do. The candidates can be members of a party and may be endorsed by it. In NY, it sounds like a party can endorse a non-member. This doesn't sound fundamentally wrong or misleading to me.
Differences are down to reporting entity. In the US this is usually the state, but is often the county. I don't think I saw any state with congressional district level differences.
Not only is it not easily available, but there's no current way to validate the vote casting or tabulation process. Biden literally couldn't prove that he won even if he cared to.
Some say this is a bug, but sounds more like a feature. If not, why wouldn't it be fixed when the technology exists? And why would people go over and above to have the current technology installed?
Seems like a lot of this data is sourced from OpenElections[0][1]? Not sure what makes this better than OpenElections' data, especially considering OpenElections seems to have done a large portion of the difficult wrangling work...
There should definitely be better attribution for this data.
A portion of it was sourced from OpenElections, yes. This was attributed in the PRs, but not in the README for the repository. We've corrected this oversight in attribution and issued an apology.
"A lot" of the data wasn't sourced from Open Elections. Derek's claim is not true. As far as I know one user used Open Elections data for a portion of their total contributions (and attributed it in their PR).
"Many" of the contributors (myself included) used primary sources for their data.
The one contributor that used OE data cited it and it was a small portion of their overall contributions.
The maintainer will review a PR and either accept the PR or ask you to modify it to better fit their requirements or reject it.
For example, one of their requirements was no zero-vote rows. That's a pretty simple SQL query on the database, and it can be checked before the maintainer does a merge.
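Something along these lines, as a sketch only: the table and column names are hypothetical, and SQLite is standing in for whatever database the project actually uses.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vote_tallies (precinct TEXT, candidate TEXT, votes INTEGER)"
)
conn.executemany(
    "INSERT INTO vote_tallies VALUES (?, ?, ?)",
    [
        ("P-001", "Joe Biden", 445),
        ("P-001", "Donald Trump", 380),
        ("P-002", "Write-in", 0),  # violates the "no zero-vote rows" rule
    ],
)


def zero_vote_rows(conn):
    """Return every (precinct, candidate) pair recorded with zero votes."""
    query = (
        "SELECT precinct, candidate FROM vote_tallies "
        "WHERE votes = 0 ORDER BY precinct, candidate"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(zero_vote_rows(conn))
```

An empty result means the PR passes that particular check.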
All data was required to be sourced. I got most of my data from state and county websites so those links were included with the comments in the PR.
In addition, I was in communication with the team via their Discord, so they would also ask for changes to PRs there.
Interesting, thanks for sharing. So for the latest one where they have a bounty for assembling the largest healthcare dataset -- how do they determine who gets what portion of the bounty? It's not just winner takes all right?
This data looks cool too, I'll have a look in the Discord...
It's divided based on total additions to the final data set, I believe. They have a GitHub repo for their bounty board that shows that calculation, I think.
I think the final calculation is based on the percent you've added to the final dataset.
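If that's right, the arithmetic would look roughly like the sketch below. This assumes a plain proportional split by rows added; the contributor names and counts are made up, and the real scoreboard may weight things differently.

```python
def bounty_shares(additions, total_bounty):
    """Split total_bounty in proportion to each contributor's additions.

    additions maps a contributor name to the number of rows (or cells)
    they added to the final dataset. Hypothetical rule, not the official one.
    """
    total = sum(additions.values())
    if total == 0:
        return {name: 0.0 for name in additions}
    return {
        name: total_bounty * count / total
        for name, count in additions.items()
    }


if __name__ == "__main__":
    print(bounty_shares({"alice": 600, "bob": 300, "carol": 100}, 10_000))
```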
Do they require you to "show your work" or somehow demonstrate that your methodology is sound? Is there any requirement that it be repeatable (e.g. let's say they find a minor issue in data you sourced. If you were required to provide code that does the work, rather than just the data, it could be fixed and re-run).
I'm kind of fascinated by the process, but I am having a hard time figuring out how this can really work. It can't really be as simple as paying people to shove arbitrary data of unknown value in their dbs, can it?
Hospital data bounty is the only one they have running right now but that one is still early so there's lots left to do and plenty of data left to source.
They are launching another bounty later this week for college course data.
Without a set of open tables for voters and votes there is no chain of custody. Without a chain of custody this data is academic at best. There is no reliable way to verify it.
Those two tables remove privacy from the voter, and open up the risk of voter intimidation, but that doesn’t change the fact that this data, and any findings associated with it, are, at best, interesting.
The Bounty concept, on the other hand, is commendable.
How is this not already a public resource offered by the US federal government? Surely it would be trivial to require states to collect and report their election data to the fed using a standardized schema.
Elections are the responsibility of the individual states, so right there you have at least 51 entities with their own data sets. Of course, in many states, elections are handled at the county level, so that's more entities involved. And at least in my state, individual precincts may have different voting systems within a single county.
It can stay a decentralized mess. All the federal government has to do is require states to collect and report specific data in a specific format. They don't have to take over the election process. Apparently the states are already keeping records of that data, since this project was able to reach 100% for the 2016 election, so they're already halfway there.
How much do you want to be taxed for this service? We got it from strangers for $25,000 in 2 months. I'm guessing it would be slightly more if dictated by statute.
Our business model is to sell database licenses. We're a database startup. Bounties are the thing that shows off our capabilities the best, so we consider it marketing. It could be more in the future, i.e. a two-sided marketplace, but right now we're just getting started.
[0] https://www.elections.ca/content.aspx?section=res&dir=rep/of...
[1] https://officialresults.elections.ab.ca/orResultsPGE.cfm?Eve...