hlfshell's comments

Location: San Diego, CA

Remote: Preferred unless it's a great fit

Willing to relocate: Yes, for the right offer

Technologies: PyTorch, TensorFlow, Python, Golang, AWS, GCP, Kubernetes, Robotics, LLMs

Resume/CV: https://github.com/hlfshell https://github.com/hlfshell/resume

Site/Portfolio: https://hlfshell.ai

Email: kchester@gmail.com

12 years of experience building backend systems, including five at various robotics startups working on data pipelines and automated deep learning training pipelines. Completing my Master of Science in Robotics Engineering; actively looking for a new role in robotics, AI, or other interesting spaces. Currently investigating language-enriched models and AI with reinforcement learning!


I worked with this poster for a few years before moving on to break into a different industry. Fantastic engineer, great person to work for. I highly recommend sending your resume their way.

Go by example is an established favorite. https://gobyexample.com/


I've been using Go for 8 years now, and started with this. To be quite honest, I still refer to it sometimes for quick lookups on syntax, pools, waitgroups, or channels. Beautifully done tutorial of Go; it can be done in a few hours and gives you a good base knowledge of the language.

To the creator, a sincere thank you!


I love Go. Now to continue with modern systems languages, here's ziglings: https://codeberg.org/ziglings


Innovator's Dilemma demonstrated wonderfully right here.

If VR is the next big thing but will take years of niche, smaller-profit-margin products to launch, established tech companies will struggle to justify the product's existence, whereas a smaller player could go all in, achieve roughly equivalent results, and be quite content.


The next big thing since people were playing DOOM with headsets at conferences back in 1994.


Yes. Interviewed someone that the team loved. We said definitely hire.

HR said the candidate didn't accept the offer and went elsewhere. What HR didn't know is that the person messaged me on LinkedIn saying they enjoyed the team and were disappointed we passed, respectfully asking what I would recommend they work on for future interviews.

Opinions on whether to reach out after a rejection aside, it highlighted what I already suspected: the job was never going to be filled, and even as the team lead I was being lied to, along with the rest of the team.

When there was yet another round of layoffs months later, the spot remained, unsurprisingly, unfilled.


That is wild. Let me get this straight - the company actively lied to the team to get their hopes up that help was coming... by actually taking on the extra overhead of sourcing a new hire?

Were you guys actively trying to fill the role (screening dozens of candidates, several going through final rounds, etc.), or was this just some passive thing where recruiting tried to push someone who looked great?


We ranked candidates out of four, and this candidate was a four across the board. Experience in our tech stack, demonstrated competency, passed the lunch test, etc. The team was eager to get them started because we were desperate for more hands. The team had at one point been six people and we were already short-handed; through resignations we were down to four, with me taking on the role of senior engineer and team lead.

This was unequivocal: HR lied and kept up the mirage of "help is coming, don't worry!"

Yes, there's the overhead of us spending time on the interviews, but a few hours here and there once a month is enough to keep the charade going. Applications would go first to HR, and then they'd put resumes in front of me. I always wondered why I couldn't just see everyone who applied and why it had to be filtered first; after this event I understood why. I suspect there were plenty more applicants than I was told about.


At my current job, we had a position that an absolutely stellar candidate supposedly ghosted HR on. Same thing: the candidate reached out to me asking about next steps, and I was like, "let me check". I went and asked HR, was told they'd handle the rest, and was told not to talk to candidates outside of HR-scheduled interviews. Never heard back again.

That was 3 months ago (and 3 months into the position posting), and the position is still unfilled.


Same question - were you and are you continuing to actively interview candidates?

That's the mystery for me: why put the team through a charade that actively harms their productivity?

The only thing I can think of is that the candidate disclosed something that would be a legal red flag, and HR can't divulge the issue without the threat of a lawsuit. Or something came up in a background check, but I imagine it didn't get to that point if the candidate reached out to you directly.


The "why" is because people quit environments that are toxic, but hold out if there are signs of improvement. People don't want to quit - it's a large investment of time and investment to job hunt. It's annoying as hell. So if you can string together promises:

- We'll hire more people, it's just so hard to find candidates!
- We're doing bonuses right after evaluations, but we're doing a new evaluation system that's taking a bit longer than we expected this year
- We're just finalizing talks with a customer, we'll have them signed next month, we swear

That delays people from looking, delays that large investment, and lets you extract engineering time with little investment of your own.

If you ran a company that couldn't afford another engineer for a team that desperately needed one, would you paint the bleak picture for your team and risk losing the engineers you have, or lie? If you have morals and would be upfront, great; but that's unfortunately not a universal answer.


Honestly, the weird thing is we don't need another person at all.


Yes, we're still interviewing.

Harming productivity only matters for hourly staff (in the minds of bad executive leaders). Salaried folks have to finish the work no matter how long it takes, and it doesn't cost the company more.


Sometimes this can happen if the candidate's expectations are outside the position's pay range and the company is only willing to make it work if they are a perfect fit.

If your interviewing notes were simply a "would recommend to hire", they'll pass.


This is a bit of a misnomer. Each expert is a sub-network that specializes in some subset of understanding we can't readily characterize.

During training, a routing network is penalized if it does not distribute training tokens evenly across the experts. This prevents any one or two sub-networks from becoming the primary ones.

The result is that each token has essentially even probability of being routed to one of the sub-networks, with the underlying logic of why that sub-network is the "expert" for that token being beyond our understanding or description.
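For intuition, here's a minimal sketch of what such a router and penalty can look like in PyTorch. The auxiliary loss follows the Switch Transformer style load-balancing term; the class and variable names are illustrative, not any particular model's implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKRouter(nn.Module):
        """Illustrative token router with a load-balancing auxiliary loss."""

        def __init__(self, d_model: int, n_experts: int, k: int = 2):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts, bias=False)
            self.n_experts = n_experts
            self.k = k

        def forward(self, x):  # x: (tokens, d_model)
            probs = F.softmax(self.gate(x), dim=-1)   # (tokens, n_experts)
            weights, chosen = probs.topk(self.k, dim=-1)

            # Load-balancing term: fraction of tokens dispatched to each
            # expert (counting each token's top choice) times the mean
            # router probability for that expert. Minimized when both are
            # uniform at 1/n_experts, punishing the router for favoring
            # one or two experts.
            dispatch = F.one_hot(chosen[:, 0], self.n_experts).float().mean(0)
            importance = probs.mean(0)
            aux_loss = self.n_experts * (dispatch * importance).sum()

            return weights, chosen, aux_loss

During training, aux_loss would be added to the language modeling loss with a small coefficient.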


I heard MoE reduces inference costs. Is that true? Don't all the sub-networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on the same hardware.)

Edit: Apparently each part of the network is on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.

I imagine that may reduce quality somewhat, though? By forcing it to distribute problems equally across all of them, whereas in reality you'd expect task types to follow a Pareto distribution.


>I heard MoE reduces inference costs

Computational costs, yes. You still take the same amount of time processing the prompt, but each token generated through inference costs less computationally than if it were run through _all_ of the expert sub-networks.
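Rough back-of-the-envelope numbers, purely illustrative and loosely in the shape of an 8-expert, top-2 model (not an exact accounting of any real checkpoint):

    # Hypothetical 8-expert model with top-2 routing. Attention and
    # embedding weights are shared; only the feed-forward blocks are
    # replicated per expert. All numbers are made up for illustration.
    shared = 2e9           # params used for every token
    per_expert_ffn = 5e9   # feed-forward params in one expert
    n_experts, top_k = 8, 2

    total_params = shared + n_experts * per_expert_ffn  # ~42B held in memory
    active_params = shared + top_k * per_expert_ffn     # ~12B of compute per token

    print(f"total: {total_params / 1e9:.0f}B, active per token: {active_params / 1e9:.0f}B")

Memory still has to hold all ~42B parameters, but each generated token only pays the FLOPs of the ~12B active ones.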


It should increase quality since those layers can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.

We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".


The latter. Yes, it all needs to stay in memory.


Why do we expect this to perform better? Couldn’t a regular network converge on this structure anyways?


It doesn't perform better, and until recently MoE models actually underperformed their dense counterparts. The real gain is sparsity: you have this huge x-parameter model that performs like an x-parameter model, but you don't have to use all those parameters at once every time, so you save a lot on compute in both training and inference.


It is a type of ensemble model. A regular network could do it, but an MoE will select a subset to do the task faster than the whole model would.


Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of a bigger model (more storage) with the advantages of smaller models at inference time (faster, less memory required). When you do inference, tokens hit a small layer that load-balances the experts, then activate 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.

Maybe a real expert can confirm if this is correct :)
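That intuition roughly matches how a sparse MoE feed-forward layer is commonly written. Here's a minimal sketch, assuming top-2 routing and small two-layer experts - the names and sizes are illustrative, and real implementations batch the dispatch far more efficiently:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        """Sketch of a sparse MoE feed-forward layer with top-k routing."""

        def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            self.gate = nn.Linear(d_model, n_experts, bias=False)
            self.k = k

        def forward(self, x):  # x: (tokens, d_model)
            probs = F.softmax(self.gate(x), dim=-1)
            weights, chosen = probs.topk(self.k, dim=-1)
            weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = chosen[:, slot] == e  # tokens whose slot-th pick is expert e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

All eight experts' weights live in the layer, but any given token only runs through two of them.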


Sounds like the "you only use 10% of your brain" myth, but actually real this time.


Almost :) The model chooses experts in every block. For a typical 7B with 8 experts and 32 blocks, there are 8^32 = (2^3)^32 = 2^96 possible paths through the whole model.


Not quite, you don't save memory, only compute.


A decent loose analogy might be database sharding.

Basically you're sharding the neural network by "something" that is itself tuned during the learning process.


Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?


Not really. The “expert” term is a misnomer; it would be better put as “brain region”.

Human brains seem to do something similar, inasmuch as blood flow (and hence energy use) per region varies depending on the current problem.


Any idea why everyone seems to be using 8 experts? (Or was GPT-4 using 16?) Did we just try different numbers and find 8 was the optimum?


Probably because 8 GPUs is a common setup, and with 8 experts you can put each expert on a different GPU.
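A hedged sketch of that placement idea - make_expert and the device layout are hypothetical, and real systems use batched all-to-all dispatch rather than this naive loop:

    import torch

    # Pin one expert's weights to each of 8 GPUs; route tokens to the
    # device holding their chosen expert, then gather results back.
    n_experts = 8
    devices = [torch.device(f"cuda:{i}") for i in range(n_experts)]
    experts = [make_expert().to(dev) for dev in devices]  # make_expert() is hypothetical

    def run_expert(e: int, tokens: torch.Tensor) -> torch.Tensor:
        out = experts[e](tokens.to(devices[e]))
        return out.to(devices[0])  # gather everything back on one device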


Has anyone tried MoE at smaller scales? e.g. a 7B model that's made of a bunch of smaller ones? I guess that would be 8x1B.

Or would that make each expert too small to be useful? TinyLlama is 1B and it's almost useful! I guess 8x1B would be Mixture of TinyLLaMAs...


There is Qwen1.5-MoE-A2.7B, which was made by upcycling the weights of Qwen1.5-1.8B, splitting it and finetuning it.


Yes, there are many fine-tunes on Hugging Face. Search "8x1B huggingface".


The previous Mixtral is 8x7B.


Yes and no. LLMs definitely struggle with long-horizon planning in complex situations, and can hallucinate or misunderstand environmental state such that they recommend nonsensical or impossible steps. But we are starting to see more research that "grounds" LLM answers in a known state. Where they shine in the planning stage is that they provide contextual heuristics we didn't have available before without a lot of expert domain programming.

LLMs in their current state are certainly not the answer, but they point to some exciting possibilities that weren't feasible before because of sheer complexity.
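As a toy sketch of what "grounding" can mean in practice: validate a proposed plan against a tracked environment state before executing it. Everything here (the action set, the plan format) is hypothetical:

    # Reject planner output that references actions or objects not
    # present in the tracked environment state. All names are hypothetical.
    VALID_ACTIONS = {"pick", "place", "open", "close"}

    def ground_plan(steps: list[str], objects_in_scene: set[str]) -> list[str]:
        grounded = []
        for step in steps:
            action, _, target = step.partition(" ")
            if action in VALID_ACTIONS and target in objects_in_scene:
                grounded.append(step)
            # otherwise: drop the step, or re-prompt the LLM with the violation
        return grounded

    plan = ["pick mug", "open portal", "place mug"]
    print(ground_plan(plan, {"mug", "table"}))  # -> ['pick mug', 'place mug']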


I absolutely adore these kinds of DIY projects, where the design is easily expandable and a good community builds a feedback loop of better and better features.

LTT recently featured deej in their cheap gadgets video. https://youtu.be/8BxVi6YiicQ

