I'm the instructor of an upcoming Coursera course. A couple of observations from my point of view:
* I wish there were a way to fund online education through philanthropy/donations. Coursera being for-profit leaves a bit of a bad taste in the mouth. At a practical level, it complicates which images I can use in my lectures and still qualify as fair use.
* After several years the site is far from being at a point where an instructor can log on and upload content. The interface is constantly changing, confusing, and buggy. My university has a dedicated team who help out instructors with putting their material online and even they are often confused about how to edit this or upload that.
Overall I'm glad that Coursera exists and is finding a revenue stream; my own undergraduate education would have been vastly different if I'd had access to the material that's available today.
To address your first point, Edx.org is very similar to Coursera, but is a non-profit organization that releases all its software as open source (https://open.edx.org/).
For your second point, EdX Studio (https://studio.edx.org/) is focused on being accessible and easy to use for instructors - we hear good things from course staff about usability compared to Coursera.
Interesting observations and notes. Glad to see a little bit of background/context regarding the mechanics. For my own perspective: I have a background in teaching and an advanced degree in education course design.
Regarding point 1, my understanding of Fair Use within an education environment is that an instructor using protected material in the context of a lecture or assignment is, by default, an instance of Fair Use. Much of it hinges on the scope of the use - as in, photocopying a chapter or a short story is okay, but photocopying the entire book is not. With images, I think you're well in the clear. I can understand where you're coming from with your concern, I just don't believe it to be material.
> My university has a dedicated team who help out instructors with putting their material online and even they are often confused about how to edit this or upload that.
This scenario strikes me as counter-intuitive from a savings perspective, because now there are two layers involved: instructors and IT support. Actually, it sounds like a terrible waste of overhead and expense that the university is laying out. Will Coursera reimburse your institution for the burden, or is it so small compared to the revenue brought in through Coursera that the expense is immaterial?
I get a macabre laugh out of learning that Coursera actually kind of sucks at its main value proposition of being a technology platform for education, in that it's not user friendly for actual educators. Yeah, it's a 'disruption' platform, sure. Just seems to me like throwing a baseball into an Olympic swimming pool.
> This scenario strikes me as counter-intuitive from a savings perspective, because now there are two layers involved: instructors and IT support. Actually, it sounds like a terrible waste of overhead and expense that the university is laying out. Will Coursera reimburse your institution for the burden, or is it so small compared to the revenue brought in through Coursera that the expense is immaterial?
If it's anything like my university was, the team he's referring to didn't exclusively support Coursera; their purpose is to provide faculty assistance with managing online content in general. Whether it's the university's internal Blackboard site or Coursera, they support whatever platforms the professors are using (assuming it's a university approved platform).
So the marginal expense of supporting Coursera is likely negligible, unless they've somehow managed to make a worse interface than Blackboard and it's particularly resource intensive to support.
DonorsChoose.org makes it easy to help classrooms in need. Public school teachers post classroom project requests which range from pencils for poetry to microscopes for mitochondria.
I bet there's a way you could sneak in support for your Coursera teaching.
As for uploading content, that seems like a really tough and time-consuming problem. Are you allowed to put together a wiki or webpage? I was doing some prep for an Operating Systems course and read this excellent blog post about why textbooks should be free. In the post, the writer mentions that "perfect is the ultimate enemy of good", so he decided to write the initial draft of a textbook purely in plain text rather than properly format it with something like LaTeX. Getting the necessary content out there seems like a good first step for you and your team.
> * I wish there were a way to fund online education through philanthropy/donations. Coursera being for-profit leaves a bit of a bad taste in the mouth. At a practical level, it complicates what images I can use in my lectures and qualify as fair use.
There are ways to do this. The problem is that they don't readily scale. Self-funding systems scale much better than systems that require ever-increasing amounts of external funding.
It's a matter of requirements for external resources. If a system needs constant infusions of external money, its ability to grow will be determined by its ability to bring in such external money. This is how non-profits tend to work and why they dedicate such attention to fundraising.
Systems that generate what they need to grow don't have the same constraints. If money is what your business needs to grow and it generates a significant yearly profit, then your business can meet its own needs to enable growth.
Is that clearer? Some sorts of systems, when functioning correctly, will tend to be self-perpetuating. Others will, as an artifact of structure, require endless external resourcing.
It's not about that. It's about not needing continual fresh infusions. This is why GE, which doesn't need to raise money every six months, is more likely to be around in ten years than any given startup trying to raise a B round.
There are ways for the developers to earn the donations asked. Edx is an example of the other one.
I live off of MOOC courses; I learn everything from them, and that's coming from a current college student.
This probably belongs here. It's pretty close to what you're looking for, and it won me the Miami Bitcoin hackathon a few weeks ago. Source code is in the youtube description: https://www.youtube.com/watch?v=zlmk5tnoKBQ
Piazza is asking me for a princeton.edu email address (I tried to sign up through the link you gave above).
Can I sign up for the course if I am not a Princeton student (or any student of a university, for that matter)?
> To limit enrollment in your class, by default a school email domain is required to self-enroll. If your school does not provide students with school email addresses, instead a class access code is automatically set to limit enrollment in your class.
Don't forget to mention that their consensus algorithm simply doesn't work. That's why they made it 100% centralized "temporarily" (without even asking, so apparently it was 100% centralized even before that). They themselves said exactly that in a blog post. Also, the creator of Ripple dumped all his coins before joining Stellar.
Here's the Stellar team's blog post about this. While there's no attempt to conceal this, it's also not being spelled out clearly enough: the Ripple/Stellar "consensus" model is centralized and does NOT provide the guarantees that Proof-of-Work does in Bitcoin.
Quantcast was sued for resuscitating browser cookies when Flash LSOs persisted, i.e. taking the cookie value from the LSO and re-cookie-ing the browser. Quantcast and Clearspring settled for $2.5M. The crux of the matter seemed to be that users didn't know such data was stored in Flash cookies or associated with Quantcast, making it hard to opt out (I'm not sure whether that alone is illegal), and that the practice violated Quantcast's and the third-party sites' privacy agreements, which does appear to be illegal. A lawsuit outline for one plaintiff is here and the full text of the initial filing here. I naively assume there is a clear parallel to this case, though perhaps Verizon and Turn have thoroughly privacy-policied their way out, somewhere in 30 pages of legalese.
According to Jonathan Mayer,
> Commercial supercookies, fingerprinting, and zombie cookies are tolerated (if not permitted) under current United States law. [...] Any associated consumer deception, however, is a violation of the Federal Trade Commission Act and parallel state statutes.
That's a good question. One of the points I'll make in the follow-up post that I promised is that it is indeed possible to capture the motivation of such an attacker in game theory, and in fact, this has been done.  However, it makes the model less elegant and introduces parameters. The more of these complexities you wish to model, the less tractable the model becomes.
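To give a flavour of what "introduces parameters" means, here's a toy sketch (my own illustration, not something from the post): folding in an attacker who values something outside the system means every additional consideration shows up as yet another free parameter.

    # Toy illustration: compare an honest participant's expected payoff with
    # that of an attacker whose value comes from outside the system (ideology,
    # shorting, surveillance...). Every number below is a made-up parameter,
    # which is exactly the modelling cost being described.

    def honest_payoff(reward_per_round, win_probability, rounds):
        # Expected payoff from simply playing by the rules.
        return reward_per_round * win_probability * rounds

    def attack_payoff(external_value, success_probability, attack_cost):
        # Expected payoff from attacking: value realised outside the system,
        # discounted by the chance the attack works, minus its cost.
        return external_value * success_probability - attack_cost

    # Attacking is "rational" whenever its expected payoff beats honest play.
    # Add detection risk, reputation, price impact, etc., and each one becomes
    # another parameter, making the model less elegant and less tractable.
    attack = attack_payoff(external_value=1_200_000, success_probability=0.2, attack_cost=150_000)
    honest = honest_payoff(reward_per_round=500, win_probability=0.1, rounds=1_000)
    print("attack" if attack > honest else "honest", "play is preferred under these made-up numbers")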
One of the surveillance attacks pointed out in the post is the NSA piggybacking on advertising cookies. Details in the Snowden leaks were scant, so we did some research to figure out just how far the NSA could go with this technique. Very far, as it turns out. Here's a blog post with a link to our research paper: https://freedom-to-tinker.com/blog/dreisman/cookies-that-giv...
One of our conclusions was that tracking companies switching to HTTPS would help, but a large majority would have to switch to make any difference, because of the sheer number of trackers (Section 4.1). This proposal or something like it is probably necessary if we're to see that magnitude of change.
It studies the same problem the author here exploited, but goes a lot further in trying to infer things about the targeted individual from the analytics that FB provides to the advertiser. The paper won the Privacy Enhancing Technologies award.
That was back in 2010. If the author succeeded anyway, it seems that Facebook was careless in their implementation of the fix.
In this case they used Custom Audience Targeting which is different to what's used in the paper. CAT is advertising based on a list of personally identifiable information about your "targets" that you supply, which can include user IDs, phone numbers, e-mail addresses, etc. I can't confirm but suspect no limit is required for this since you already have their personal data.
As an aside, an interesting use of this system I've seen in the Internet marketing world is to buy e-mail or phone number lists from list vendors and use them to create custom audiences in Facebook. You can't e-mail them because that's spam and you'll be busted very quickly, but using them as fodder for custom audiences on platforms like Facebook is a way of laundering the info, in a sense. (I've not tried it myself but thought the idea was clever.)
I'm one of the authors of the paper. One of our findings was that third-party cookie blocking is only marginally effective. See the tables under cookie syncing in our summary or in our full paper. Trackers bypass cookie blocking in a variety of creative ways. We're currently investigating the bypassing mechanisms in more detail.
On the other hand, add-ons like Ghostery work much better.
People need to stop promoting Ghostery. It's made by Evidon, yet another tracking company. Just check out their website and it's obvious that everyone who installs this extension is doing them a huge favor:
Slightly off-topic but something with my Iceweasel/FF install (v31) borks the layout of the blog completely -- content is shifted far to the right of the viewport for some reason. Looked fine on another computer, so not sure what's up. Also works fine in chromium. FYI.
Actually, I went one step further than this: I dump that hosts file into dnsmasq and set that as the primary DNS server for the home network. So anything connected via WiFi or ethernet (phones, tablets, laptops, etc.) will also have tracking blocked.
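For anyone who wants to replicate this, the relevant dnsmasq configuration is only a couple of lines (the paths and router IP are examples; the extra hosts file is whatever tracker blocklist you use):

    # /etc/dnsmasq.conf (excerpt)
    # Serve the blocklist in addition to the normal /etc/hosts; its entries
    # point tracking domains at 0.0.0.0 so lookups go nowhere.
    addn-hosts=/etc/hosts.trackers

    # Hand this resolver out via DHCP so phones, tablets and laptops on the
    # LAN pick it up automatically (example router address).
    dhcp-option=option:dns-server,192.168.1.1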
Social graphs are not fast mixing. This used to be widely assumed, but recent empirical measurements have refuted the assumption. If I recall correctly, random walks tend to get stuck in cities and other highly dense subgraphs.
Six degrees of separation refers to shortest paths, and fast mixing is decidedly _not_ the intuition behind it.
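For anyone who wants to poke at this themselves, here's a rough sketch of one standard way to quantify it, via the spectral gap of the (lazy) random-walk matrix; a small gap means slow mixing. This illustrates the general technique, not the methodology of any particular paper.

    import numpy as np
    import networkx as nx

    def spectral_gap(G):
        # Transition matrix of a simple random walk, made "lazy" so all
        # eigenvalues are non-negative; the gap below the top eigenvalue
        # controls the mixing time (small gap => slow mixing).
        A = nx.to_numpy_array(G)
        P = A / A.sum(axis=1, keepdims=True)
        P = 0.5 * (np.eye(len(P)) + P)
        eigenvalues = np.sort(np.linalg.eigvals(P).real)[::-1]
        return 1.0 - eigenvalues[1]

    # Toy comparison: a well-connected random graph vs. two dense communities
    # joined by only a few edges (a crude stand-in for "cities").
    well_mixed = nx.gnp_random_graph(200, 0.1, seed=1)
    clustered = nx.planted_partition_graph(2, 100, 0.2, 0.005, seed=1)
    print(spectral_gap(well_mixed), spectral_gap(clustered))  # second value is much smaller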
EDIT: Wow, I've read the paper you linked to. So they use data-sets from SNAP, like Facebook A and Facebook B.
But, if I understand correctly, these data-sets were formed by combining "egonets", which means they took a set of people and a small ball around them, and put all those balls in the graph. That explains the horrendous eigenvalues.
No. None of the graphs they use are ego-nets. That would be methodologically ridiculous, as you point out.
In fact, the main problem with the paper is the opposite -- the Facebook graphs they use are regional networks borrowed from . This means that Facebook's actual mixing time should be _dramatically higher_ than the times measured in the paper because of the tendency of random walks to get stuck in regional networks. I believe this is the reason their Facebook-A and Facebook-B mixing times are much lower than the others, such as LiveJournal.
Alvisi et al. have a couple of related papers specifically looking at the implications of our new understanding of social-graph random walks for sybil defenses. [2, 3]
What are your recommendations for anonymizing PHI data, for example? Saying there is no way to anonymize data isn't really a solution and beyond that, I don't think it's true.
What practitioners need is a canonical reference or toolset that you could feed a CSV file, tell it which columns should be scrambled, and out comes an anonymized data set. Yes, I realize the "tell it" part is of concern; that could be mitigated by smarter tools and more knowledgeable practitioners - knowledge gained from a canonical reference.
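To make the idea concrete, here's a toy sketch of the shape of tool I mean (hypothetical file and column names; I'm not claiming that hashing columns is by itself sufficient de-identification):

    import csv
    import hashlib

    # Toy sketch: replace selected columns of a CSV with a keyed hash so the
    # raw values can't be read back. Real de-identification has to deal with
    # quasi-identifiers (dates, ZIP codes, rare diagnoses...), not just direct
    # identifiers, so treat this as illustration only.
    def scramble(infile, outfile, columns, secret="change-me"):
        with open(infile, newline="") as f, open(outfile, "w", newline="") as g:
            reader = csv.DictReader(f)
            writer = csv.DictWriter(g, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                for col in columns:
                    row[col] = hashlib.sha256((secret + row[col]).encode()).hexdigest()[:12]
                writer.writerow(row)

    # Hypothetical usage:
    # scramble("visits.csv", "visits_deidentified.csv", ["patient_id", "ssn"])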
Regardless of whether PHI can be anonymized in a fool-proof way, we can agree that careful anonymization is better than a superficial one, and so your question is valid and important.
We can't automate the process (in part because the transformations necessary are much more complex than "scrambling"), but knowledgeable practitioners can go a long way. I'm knowledgeable but not a practitioner, so I'm not the best source.
In #8 of our report (on the Heritage Health data), you'll notice that while I took Khaled El Emam to task for claims about quantifying risk, I do acknowledge that he did a very good job of de-identification. I don't think there's exactly a "canonical reference" (except HIPAA's superficial list of 18 identifiers), but reports written by practitioners like El Emam are probably useful documents.
In order to have minimally robust anonymization you must strip out all implicit or explicit references to locations and times. Unfortunately, space and time values are material to the analysis of most non-trivial data models (they certainly are for health data), so stripping them out is not really an option.
Few people appreciate the robustness and generalizability of spatiotemporal coincidence analysis for reconstructing relationships in anonymized data, both within and across many unrelated sources of anonymized data, even sources that identify entities that are not people (like anonymous vehicle tracking). There are enough anonymous entity tracking data sources available to algorithmically reconstruct relationships to most other "anonymous" data sources. I've seen it done many times, and the capabilities are jaw-dropping in part because they violate human intuition about what is possible with such data sets.
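Mechanically, the core of such an analysis is not exotic; a rough sketch of the linking step (made-up field names, bucketing and threshold) looks like this:

    from collections import defaultdict

    # Rough sketch of spatiotemporal coincidence linking: two datasets with
    # unrelated opaque IDs are joined by finding ID pairs that show up in the
    # same coarse (time, place) cell unusually often. Everything here is
    # illustrative.
    def link(traces_a, traces_b, min_coincidences=4):
        ids_at = defaultdict(set)                 # (hour, cell) -> IDs from dataset A
        for rec in traces_a:
            ids_at[(rec["hour"], rec["cell"])].add(rec["id"])

        scores = defaultdict(int)                 # (id_a, id_b) -> shared sightings
        for rec in traces_b:
            for id_a in ids_at[(rec["hour"], rec["cell"])]:
                scores[(id_a, rec["id"])] += 1

        return {pair: n for pair, n in scores.items() if n >= min_coincidences}

    # Feed it records shaped like {"id": ..., "hour": ..., "cell": ...} from,
    # say, an "anonymous" transit dataset and an "anonymous" vehicle-tracking one.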
Do you have opinions on/is there research regarding how much "fuzzing" you have to do to get substantial rewards in making space/time data anonymous? That is, if my legitimate use for a space/time dataset can stand that data being handled at the resolution of "hours" and "within 100 meters," does it win me significant privacy benefits to scrub out more significant figures?
There is plenty of good work on anonymising and fuzzing location data. Unfortunately, the results show that the amount of fuzzing required for anonymity is huge. "Hours and 100m" is far from enough; only something like "year and state" might work.
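(For concreteness, the kind of fuzzing being discussed is just truncation of each record, roughly as below; the field names are made up.)

    from datetime import datetime

    # Illustrative "hours and ~100m" coarsening: truncate the timestamp to the
    # hour and round coordinates to 3 decimal places (~110m in latitude,
    # ~70-110m in longitude depending on where you are).
    def coarsen(record):
        t = datetime.fromisoformat(record["time"])
        return {
            "time": t.replace(minute=0, second=0, microsecond=0).isoformat(),
            "lat": round(record["lat"], 3),
            "lon": round(record["lon"], 3),
        }

    print(coarsen({"time": "2013-03-25T14:37:02", "lat": 48.85837, "lon": 2.29448}))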
Some of the notable results are:
* Golle and Partridge 2009 (http://xenon.stanford.edu/~pgolle/papers/commute.pdf) - Just the home and work location at city block level is enough to uniquely identify 50% of the US population. Home and work at zip code level is enough to uniquely identify 5% of US population. Home and work county is still enough to identify 1% of people to a set of 6 candidates.
* de Montjoye et al. 2013 (http://www.nature.com/srep/2013/130325/srep01376/full/srep01...) - If you have a time-location dataset of people with hourly accuracy for time and cell-tower accuracy for location (100m in cities, a few km in rural areas), then four randomly picked points per person uniquely identify 95% of them. Just two randomly picked points per person uniquely identify 50% of people.
The main conclusion is that it is not possible to release an "anonymised" location dataset of people that is still useful for mobility research. The only realistic approach seems to be to have strict privacy regulations and NDAs with people who are given access to this data.
But more generally, all research into anonymisation and preventing de-anonymisation is difficult because it's not known what other data sources the attacker has access to. If I have a time-location dataset of anonymous people, and if I see from Facebook when a friend visited Paris and Barcelona, then it becomes trivial to match these dates and cities against the location traces, and find their full movement trace. Similarly, if I can get an anonymous dataset of phone calls, then I can make 20 missed calls at 4am to a friend, and later look for that pattern in the data to find their other calls.
The best example of this was from the Netflix dataset of anonymous movie ratings ( http://en.wikipedia.org/wiki/Differential_privacy#Netflix_Pr... ) - people were anonymous in their dataset, but some people also rated movies with visible identities on IMDB. Correlating the two datasets allowed researchers to discover the identities of people in the Netflix dataset.
Differential privacy is a different way of releasing data that avoids the problems of re-identification. The theory is well developed, the tools are starting to get there, but the hardest part is that it requires a behavior change from data analysts -- you need to formulate the desired computation algorithmically instead of just poking around the data. This has proved to be a formidable barrier.
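To give a flavour of what that change looks like in practice, the canonical building block is the Laplace mechanism: the analyst specifies a query and a privacy parameter up front and gets back a noisy answer rather than the rows. A minimal sketch (toy data, arbitrary epsilon):

    import numpy as np

    # Minimal sketch of the Laplace mechanism: answer a counting query with
    # noise scaled to the query's sensitivity divided by epsilon, instead of
    # releasing the underlying records.
    def private_count(values, predicate, epsilon):
        true_count = sum(1 for v in values if predicate(v))
        sensitivity = 1.0          # adding/removing one person changes a count by at most 1
        return true_count + np.random.laplace(0.0, sensitivity / epsilon)

    ages = [34, 29, 41, 57, 62, 38, 45]    # toy data
    # "How many people are over 40?", answered with epsilon = 0.5
    print(private_count(ages, lambda a: a > 40, epsilon=0.5))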
> you need to formulate the desired computation algorithmically instead of just poking around the data. This has proved to be a formidable barrier.
Very true. Furthermore, if you start to think seriously about making differential privacy guarantees across a whole company, it gets really hard and impractical.
It's not enough, as you might naively hope, to certify individual data analysis jobs separately as being "differentially private". Taken cumulatively they can amount to something which isn't.
It's not even sufficient to certify whole planned and delimited programmes of use for whole datasets as differentially private, if the company also holds (or wants to hold in future) other data on some of the same users and there's some chance of new actions being taken which directly or indirectly depend on both sources of data.
For the theory to truly apply, one needs to track (in a particular formal sense) how much "privacy budget" is used up by pretty much every user-data-dependent action taken across a company and across all datasets you hold pertaining to any overlapping set of users. Once your preallocated budget is used up it's pretty much game over in terms of taking any further actions based on any data (acquired now or in future) on any of those users, unless these actions can be based on inferences which were already obtained within the original budget.
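Under the simplest (sequential) composition rule, the bookkeeping amounts to something like the toy sketch below; real deployments use tighter composition theorems, but the organisational burden is the same in spirit.

    # Toy sketch of sequential-composition bookkeeping: every data-dependent
    # action spends some epsilon against a preallocated budget, and once the
    # budget is gone no further such actions are allowed. Numbers are arbitrary.
    class PrivacyBudget:
        def __init__(self, total_epsilon):
            self.remaining = total_epsilon

        def spend(self, epsilon):
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted for this user population")
            self.remaining -= epsilon

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.spend(0.5)                      # analytics job A
    budget.spend(0.3)                      # analytics job B
    try:
        budget.spend(0.4)                  # job C would exceed the budget
    except RuntimeError as e:
        print(e)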
These aren't necessarily insurmountable problems, and I'm sure there are new tools being developed to help which I'm not up to date with. (One which I am aware of is Microsoft's PINQ project.)
Still, it seems the organisations with the biggest chance of actually making this work are those whose relationships with sets of users are inherently transitory and/or firewalled off from each other. For example, a B2B company that operates one-off surveys for clients, throwing away the raw data afterwards.
The fundamental problem is that differential privacy guarantees a degree of resistance to an incredibly strong adversary -- one with unlimited prior knowledge about your users, who is able to bring unlimited intelligence and computational resources to bear on drawing inferences about them based on every action you ever take conditional on user data. It's impressive that one can obtain any guarantees at all under this model, but perhaps not surprising that it can be hard to scale up and compose the guarantees without things blowing up.
That's not to say that there isn't value in trying to obtain differential privacy guarantees for smaller-scale pieces of work. It's still considerably better than other, more naive pseudo-anonymisation methods.