I don't know if that's a silly suggestion.
The tricky bit here is closing the analog loophole (using this camera to record a carefully constructed, high-resolution fake) and preventing the HSM from signing anything that wasn't recorded by the camera lens.
Like, I can imagine an attack where one records GPS signals at some nearby place and then plays them back in slightly different orders/rates to try to fool a receiving device into thinking it is somewhere else (nearby).
But I don't know how much of a time delay would be needed to pull that off. Could a timestamping service respond quickly enough to prevent this attack for internet-connected devices?
I tried looking at how GPS signals work (how frequently they send time information and how detailed it is for civilian receivers), but it seemed complicated and I got confused and/or didn't try hard enough, so I never arrived at an answer for how long it would take to spoof a position if one could only delay real signals one had received.
Edit: the purpose of making the GPS coordinates unspoofable is that even if the screen-in-front-of-camera attack were carried out, it would have to be done at the same location and time as the recording claims to have happened.
Nope, classic replay-attack case (just record the GPS signal and replay it to the device at the desired location). You'd need a true time reference inside the device, e.g. an atomic clock, to make it work (so you'd authenticate the signed time against true time).
There is another way, however. If we assume the hardware is tamper-proof (otherwise drastically different methods are needed), then with strict timing we can devise a challenge-response system that's immune to replay attacks thanks to relativity: simply transmit a signal A, have a known third party (e.g. US government servers in cellphone towers) sign your signal, Sig(A), and retransmit it, then check that the delay matches the propagation delay you'd expect from the cell-tower distance, plus the fixed (and immutable, since it would be government-controlled) processing delay. Your tamper-proof crypto-camera would record its location and whether it trusts that location. Using cellular signals is also better because GPS doesn't work indoors and is sensitive to interference.
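A minimal sketch of that delay check, assuming tamper-proof hardware; sign_at_tower and verify_signature are hypothetical stand-ins for the tower's signing service and its public-key check, and the delay/tolerance constants are made-up placeholders, not real tower parameters:

    import time

    C = 299_792_458.0            # speed of light, m/s
    PROCESSING_DELAY_S = 0.002   # assumed fixed, published signing delay at the tower
    TOLERANCE_S = 0.0005         # slack allowed for jitter

    def location_plausible(tower_distance_m, nonce, sign_at_tower, verify_signature):
        """sign_at_tower(nonce) -> signature from the tower's signing service;
        verify_signature(nonce, signature) -> bool against the tower's public key."""
        t0 = time.monotonic()
        signature = sign_at_tower(nonce)      # round trip: send A, receive Sig(A)
        rtt = time.monotonic() - t0
        expected = 2 * tower_distance_m / C + PROCESSING_DELAY_S
        # A relayed/replayed signal has to travel farther, so it arrives late;
        # arriving earlier than light allows is physically impossible.
        return verify_signature(nonce, signature) and abs(rtt - expected) <= TOLERANCE_S

The point is that any relay adds distance and therefore delay, so a reply outside the expected window means the device refuses to mark the location as trusted.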
Since we're adding a cellular connection to our device, it would also be a good idea to log its position on the state-controlled servers (again, this can be done with cryptographic safety assuming a non-tampered device), along with some kind of intrusion detection system. As soon as it detected an attempt at tampering, it would relay that attempt to the servers, storing the intrusion and invalidating the authenticity of subsequent recordings; having the device destroy its key would probably also be wise.
And now that I think about it, you'd probably want to put several keys/auths in the device, from different organizations -- not only governments. That way, if the government authentication is positive but the NGOs' don't match, you can suspect a government-backed forgery attempt (and analogously vice versa).
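A rough sketch of that cross-check, with hypothetical verifier callbacks standing in for the per-organization signature checks:

    # `verifiers` maps an organization name to a function(claim, signature) -> bool,
    # backed by that organization's public key stored in the device.
    def cross_check(claim, signatures, verifiers):
        verdicts = {org: verify(claim, signatures.get(org, b""))
                    for org, verify in verifiers.items()}
        if all(verdicts.values()):
            return "trusted"
        if any(verdicts.values()):
            # e.g. the government key verifies but the NGO keys don't (or vice versa):
            # treat it as a possible targeted forgery rather than a clean failure.
            return "suspect: attestations disagree: " + repr(verdicts)
        return "untrusted"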
I guess in the case of faking political speeches and the like, it comes down to some trust model regarding who owns the camera.
 https://news.ycombinator.com/item?id=16014047 [Google claims near-human accuracy at imitating a person speaking from text]
I've been wondering about this ever since I first started reading about 1x1 convolutions a while back.
My background is not in artificial neural networks, but I understand the single-neuron operation: a linear combination of the inputs (plus an optional bias/offset input), so that part behaves like any linear correlator, followed by a nonlinear but typically differentiable squashing function such as a sigmoid.
I understand how convolutional neural networks operate, and that the synaptic weights correspond to filter kernel weights (like point spread functions, or impulse responses).
Given this engineering-flavored interpretation, can someone explain to me what use a convolution with a 1x1 filter has?
I like to think of 1x1 convolutions as pixel-wise dense layers: at each pixel, every output channel is a linear combination of the input channels.
Here's a toy example of a useful 1x1 convolution: you could convert a color image to greyscale by doing a 1x1 convolution with the kernel (.33, .33, .33).
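A minimal sketch of that toy example, using PyTorch purely for illustration:

    import torch
    import torch.nn as nn

    # A 1x1 convolution is a per-pixel linear map across channels: here it mixes
    # the 3 RGB channels into 1 grey channel with the weights (.33, .33, .33).
    to_grey = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=1, bias=False)
    with torch.no_grad():
        to_grey.weight.copy_(torch.tensor([0.33, 0.33, 0.33]).view(1, 3, 1, 1))

    rgb = torch.rand(1, 3, 64, 64)   # batch of one random 64x64 "image"
    grey = to_grey(rgb)              # shape (1, 1, 64, 64)

With many input and output channels, the same operation mixes feature maps at each location, which is why it's often used to cheaply change the channel count.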
again, thanks for the succinct and clear explanation!
It is good to point out what they aren't doing, but they should also point out what they are doing.
Humor me by considering:
"Today I did not go to the zoo, I did not go to school, ..., and I did not go by foot, nor by vehicle without weels, nor by a vehicle with only one wheel, nor by a vehicle that had 3 or more wheels, and the vehicle was not motorised, and I did not need to stand up on the vehicle"
While all of that is true, and it is possibly important to point out what I didn't do, it's generally more helpful to describe what I did do, like "I rode my bicycle to the supermarket."
Increasing blondeness also increases smile.
Just as an aside here: "Blondeness" and "beard" are probably just the labels the authors found correspond most closely to the latent variables in this case. That means there won't be a perfect translation between those words and what these variables actually respond to in the network.
So although the training data may have been biased towards more smiling blonde people, it doesn't necessarily have to have been; it might just be that whatever this latent variable encodes does something else in edge cases where there are few examples.
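To illustrate the general idea (this is not the authors' code, just a generic sketch of attribute-direction manipulation in a latent space, with a random vector standing in for the learned direction):

    import numpy as np

    # In practice the "blondeness" direction would come from, e.g., the difference
    # between mean latents of images labeled blonde vs. not blonde. Because such a
    # direction is only correlated with the label, moving along it can drag other
    # attributes (like smiling) with it.
    rng = np.random.default_rng(0)
    z = rng.standard_normal(512)               # latent code of one face
    blonde_dir = rng.standard_normal(512)      # placeholder for the learned direction
    blonde_dir /= np.linalg.norm(blonde_dir)

    z_more_blonde = z + 2.0 * blonde_dir       # nudge the code along the direction
    # image = generator(z_more_blonde)         # hypothetical decoder/generator call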
See their code snippet halfway down their page, or "Semantic Manipulation" on page 8 of their linked paper.
Our negative so far is that most of the interpolations really look more like two existing pictures photoshopped together than like a new face generated from a latent space and knowledge of faces. Sorry, I don't have the vocabulary and concepts of visual composition to say why; it just looks "shooped".