
>I found this part to be the most interesting. If this current proclamation is a prelude to the overhaul of H1B system in a way that would make it work like described above, then it is somewhat exciting for a couple of reasons.

I think most people who've dealt with the immigration system would find that view naive. This administration has proven time and time again with regards to immigration that they will pay lip service to making improvements while almost always simply making life harder for immigrants and people on visas.

See: how they suspended H-1B premium processing for a while some time back, also claiming that it was in service of "overhauling" the H-1B system.


>See: how they suspended H-1B premium processing for a while some time back, also claiming that it was in service of "overhauling" the H-1B system.

Except this time they didn't say they suspended H-1B in order to "overhaul" the system, or that by the time the suspension is over we will get a new system. They just said they have longer-term reforms in that area in their plans.

Which makes perfect sense in my eyes, because swapping to an H1B system similar to what Canada has is a massive change. I don't see it happening on a timescale shorter than 5 years from start to finish.


So a reelection cliffhanger. Does actually seem plausible for this administration.


Is there any reasonable explanation why the previous administration, which controlled both the House and Senate (albeit for a short period) and won two terms, didn't make any meaningful progress on overhauling the process and making it merit-based?


The Democrats spent all of the period when they had a supermajority in the Senate and a majority in the House barely being able to pass the Affordable Care Act (and subsequently sustaining great political damage for it). After Scott Brown replaced the late Ted Kennedy, almost everything was subject to filibuster and it became really hard to do anything substantial. Comprehensive immigration reform came close in 2013 (in the wake of the 2012 Republican presidential election loss) and passed the Senate with bipartisan support, but died in the Republican-controlled House.


This. The Democratic party never intended to pass DACA. All they wanted was to create shock waves, and funny how that turned out. If President Obama had wanted to legislate DACA, he would've done it in his first term. He made sure to get elected twice, and in his last term he chose to introduce DACA. It is clear from the last election's voting that the majority of America doesn't want DACA. Doesn't matter how much news outlets push for it. All the Democrats wanted was to polarize their base. They're no different from DJT.


The House killed the immigration reform bill that had passed the Senate by an overwhelming bipartisan margin in 2013. Blame Cantor and Boehner.


> This administration has proven time and time again with regards to immigration that they will pay lip service to making improvements

Not much different from the 8 years of the previous administration [1]. One good thing about this administration is that it hurts SV, and that might bring attention to the problem of green card backlogs and the resulting indentured servitude and labor market distortions.

Also, note that the person putting a hold on the bipartisan S386 [2] bill is Sen. Durbin, a Democrat. So, please don't lay all the blame for this situation at the Republicans' or the President's doorstep.

[1] Executive Action: Support High-skilled Business and Workers - http://www.dhs.gov/sites/default/files/publications/14_1120_...

[2] Fairness for High-Skilled Immigrants Act of 2019 - https://www.congress.gov/bill/116th-congress/senate-bill/386...


Very likely there are people with different views who are pulling things in different directions.


I seem to find myself in the minority, but I don't think distill.pub is a particularly good model for publicizing research.

distill.pub heavily favors fancy and interactive visualization over actually meaningful research content. This is not to say that the research publicized on distill.pub is not meaningful, but that it is biased toward research that lends itself to fancy visualizations. So you end up seeing a lot of tweakable plots, image augmentations, and attention-weight visualizations. It is also further biased toward research groups that have the resources to create a range of D3 plots with sliders, carved out of actual research time.

For instance, I don't think BERT could ever make it into a distill.pub post. Despite completely upending the NLP field over the last 2 years, it has no fancy plots, multi-headed self-attention is too messy to visualize, and its setup is dead simple. You could maybe have one gif explaining how masked language modeling works. The best presentation of the significance of BERT is "here is a table of results showing BERT handily beating every other hand-tweaked implementation for every non-generation NLP task we could find with a dead-simple fine-tuning regime, and all it had was masked language modeling."

To give another example: I think it's one of the reasons why a lot of junior researchers spend time trying to extract structure from attention and self-attention mechanisms. As someone who's spent some time looking into this topic, you'll find a ton of one-off analysis papers, and next to no insights that actually inform the field (other than super-trivial observations like "tokens tend to attend to themselves and adjacent tokens").


Excuse me - what "draconian quarantine laws"?



How is that draconian? If you're issued a stay-home notice, STAY AT HOME.


"Draconian" as in I don't see it practicable in my country. Not a moral judgment.


One is a language generation model, the other is a fill-in-the-blank model. It sounds like they might be similar, but in practice the objectives are different enough (in particular because of the "bi-directional" aspect of BERT-type models) that the models end up learning different things.
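
To make the distinction concrete, here's a toy sketch of the two loss computations (illustrative only, not either model's actual code; shapes and values are made up):

  import torch
  import torch.nn.functional as F

  # Toy illustration of the two training objectives (not the real models).
  vocab_size = 100
  tokens = torch.tensor([[5, 23, 7, 42, 9]])            # a toy token sequence
  logits = torch.randn(1, tokens.size(1), vocab_size)   # stand-in for model outputs

  # GPT-style (autoregressive LM): predict token t+1 from tokens up to t,
  # so the targets are the sequence shifted left and attention is left-to-right only.
  ar_loss = F.cross_entropy(
      logits[:, :-1].reshape(-1, vocab_size),
      tokens[:, 1:].reshape(-1),
  )

  # BERT-style (masked LM): hide a subset of positions and predict the originals,
  # with the model free to attend to the full context on both sides of each blank.
  mask = torch.tensor([[False, True, False, False, True]])
  mlm_loss = F.cross_entropy(logits[mask], tokens[mask])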


Those are question-answering and language-understanding benchmarks respectively, neither of which has been suitable for language-generation model evaluation since GPT-1 was roundly beaten by BERT. GPT-2 didn't evaluate on them either.


How are you using GPT-2 with an expanded context window? I was under the impression that the maximum context window was fixed.


I wrote code to repeat the wpe variable N times along the context axis during model load time.

Specifically, the code checks whether the model's expected shape is greater than the shape of the tensor in the snapshot on disk. If so, it repeats the values from the snapshot N times along the context axis to fill the expected larger shape.

At that point, you can just set context window to a larger value, then train.
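
For anyone curious, here's a rough sketch of that idea as I understand it from the description (my own reconstruction, not the actual code; the checkpoint key is illustrative):

  import numpy as np

  # If the graph expects a larger position-embedding matrix than the checkpoint
  # provides, tile the checkpoint's wpe along the context axis to fill the
  # expected shape.
  def expand_wpe(wpe_from_ckpt, target_n_ctx):
      ckpt_n_ctx, n_embd = wpe_from_ckpt.shape           # e.g. (1024, 768)
      if target_n_ctx <= ckpt_n_ctx:
          return wpe_from_ckpt[:target_n_ctx]
      n_repeats = -(-target_n_ctx // ckpt_n_ctx)         # ceiling division
      tiled = np.tile(wpe_from_ckpt, (n_repeats, 1))     # repeat along context axis
      return tiled[:target_n_ctx]                        # e.g. (3072, 768)

  # wpe = expand_wpe(checkpoint["model/wpe"], target_n_ctx=3072)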


Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)


Yeah, it seems to "work" in the sense that if you repeat wpe size (1024, 768) three times, so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.

Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.

The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.


My counter-arguments (as a huge PyTorch fan) are:

1. GPT hasn't really been about model/architectural experimentation, just scale. GPT-2 and GPT were architecturally very similar. Scale, especially at the scale of GPT-*, is one avenue where TensorFlow does have an edge over PyTorch.

2. Work on GPT-3 probably started quite a while ago.


Minimum =/= lowest effort.

Especially with the caveat of "with a programming background", it is far easier to reason about and debug PyTorch with just Python knowledge, compared to TensorFlow/Keras, which sooner or later requires you to learn a condensed history of TensorFlow/Keras development to understand why things are the way they are.

In my opinion,

  import lib
  lib.train("imagenet", "resnet50", epochs=10)
  lib.eval()
is NOT a good example of a beginner-friendly library. It's a thin wrapper that hides all of the actual complexity behind "Train ImageNet in 3 lines of code!"
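
For contrast, here's a hedged sketch of what the same kind of workflow looks like when the training loop is explicit (the model and data loader are placeholders, not a real recipe):

  import torch
  from torch import nn, optim

  # The loop is ordinary Python, so you can step through it in a debugger
  # with nothing more than Python knowledge.
  model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
  opt = optim.Adam(model.parameters(), lr=1e-3)
  loss_fn = nn.CrossEntropyLoss()

  def train(loader, epochs=10):
      for _ in range(epochs):
          for images, labels in loader:        # any torch DataLoader
              opt.zero_grad()
              loss = loss_fn(model(images), labels)
              loss.backward()
              opt.step()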


Fair; maybe minimum isn't the right word. More like "minimum without full abstraction."

The Keras examples are a good reference (e.g. https://www.tensorflow.org/tutorials/keras/classification ); even without an AI background, you get a sense of both what's going on and how to tweak the model to improve it.
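
From memory, that tutorial looks roughly like the following (a sketch, not a verbatim copy of the page):

  import tensorflow as tf

  # Roughly the shape of the linked tutorial: the layer stack is visible,
  # so it's clear where you would add capacity or regularization to improve it.
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
  x_train, x_test = x_train / 255.0, x_test / 255.0

  model = tf.keras.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation="relu"),
      tf.keras.layers.Dense(10),
  ])
  model.compile(optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=["accuracy"])
  model.fit(x_train, y_train, epochs=10)
  model.evaluate(x_test, y_test)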


The reason why Keras became so popular is that it borrowed a lot of concepts from Lua Torch (which predates even Theano), and anyone who worked with Torch immediately recognizes this when reading Keras code. But Torch was Lua, so naturally it received less recognition than it deserved. You will not lose anything by simply moving to PyTorch.


Besides vocabulary/characters, in what sense is Mandarin not simple?


Tones


Which is a massive sense in which it is not simple! For non-native speakers (I'm a native English speaker), tones were nearly impossible to master, and I think I still have a terrible grasp of them. It requires a complete rewiring of the way you speak a language, sure, but it also requires a complete rewiring of how you hear it!


Tones are simple (in the sense of being governed by linguistic rules), just foreign.


Yes, the Reformer is basically trading off noisier gradients for faster training / memory savings.


Yep, and I'm not saying it's a bad approach! Just trying to answer "why is that any worse than, say, starting with randomly initialized weights in general?" wrt gradient passing.

I'm not sure I'd agree with the "noisy" characterization (which to me implies stochasticity); this is just blocking off the flow of gradient information to save memory.
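
To illustrate the distinction I mean, here's a generic PyTorch sketch of stopping gradient flow (not the Reformer's actual mechanism):

  import torch

  # Detaching a tensor blocks gradient flow through it deterministically;
  # nothing stochastic is added, which is why "noisy" feels like the wrong word.
  prev_chunk = torch.randn(4, 8, requires_grad=True)
  current = torch.randn(4, 8, requires_grad=True)

  out = (prev_chunk.detach() + current).sum()   # gradients stop at the detach
  out.backward()
  print(prev_chunk.grad)   # None: nothing flowed back through the detached path
  print(current.grad)      # populated as usual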

