natpat's comments

Or - accept that you'll be "inauthentic" for a while as you get back on your feet. Attract companionship and let it help you get back on the path to self acceptance. It's not a bad thing to rely on people and let them help you, even if you have to hide parts of yourself.


Hi, I'm Nathan. I'm a SWE with experience in large scale data processing and backend, with a recent focus on ML/Natural Language Processing.

  Location: Vancouver, BC
  Remote: Yes, in Canada
  Willing to relocate: No
  Technologies: Python, Java, Spark, Kubernetes, FastAPI
  Résumé/CV: https://natpat.net/static/cv.pdf
  Email: nmtpatel7337 [at] gmail [dot] com
Whilst I am a strong technical contributor, I'm also interested in leadership, mentorship and learning. My favourite challenge recently has been leading a highly motivated and skilled team of SWEs, AI Researchers and Data Scientists. I'm interested in a position where I can continue working with strong co-workers and help mould and lead strong, productive and happy teams. I also love learning and being surrounded by people from backgrounds and fields different from my own. I've had exposure to bio-tech and NLP, but would also be interested in other ML fields, robotics, health, eco/green energy, and probably more.

I'm also very interested in game development, mainly as a hobby - see https://natpat.net/games. Would be very open to roles in that sector too!

Sadly, I can only accept roles that can provide me with a Canadian work permit, which likely excludes many companies based only in the US.


Completely off topic - but I love the aesthetic of the post. "Vanilla HTML" is a design that isn't used enough. It's something I tried to apply to my personal blog, but I think it's been done much better here.


> One of our conclusions is that not everything on Twitter is a good candidate for an algorithm, and in this case, how to crop an image is a decision best made by people.

This seems like it should have been a foregone conclusion. What was the driving force in the first place to think cropping images with an AI model was desirable? Seems like ML was a solution looking for a problem here, and I'm glad they've realised that.


It seems obvious in retrospect. Calling it a foregone conclusion is too harsh.

Twitter crops photos to fit their preview formats. It seems like an obvious improvement to show people's faces when cropping, etc.


Right but... we've been cropping images in web applications since... y'know, pretty much ever. Using ML to do this was always pretty ridiculous overkill. Give the users an image cropper, and be done with it.


I can't see why this is overkill. You're eliminating a step from the image-posting process and making it so users don't have to crop an image twice (once for the full image, and a second time for the preview). A manual cropper makes sense when you're writing a CMS or blogging platform like WordPress, but for Twitter it just adds friction.

So, previously, the preview was just cropped in the center. But this made some images look funny, since people's faces would get chopped off.

Coming up with a workable solution to this with ML is not especially hard. You can get things like face detection off the shelf, maybe just tell your autocropper, "crop closer to the face" and have a demo within a couple days (and then much more effort to productionize it). From there, you can start introducing ML models to improve on your basic face detection. (I'm not counting face detection as ML.)
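
Concretely, a quick-and-dirty version with an off-the-shelf detector could be as small as the sketch below (Python with OpenCV's stock Haar cascade; the output size, the centre-crop fallback, and all the names are made up for illustration, not whatever Twitter actually shipped):

  # Hypothetical sketch: off-the-shelf face detection driving a preview crop.
  # Assumes OpenCV (cv2) is installed; output dimensions are placeholders.
  import cv2

  def autocrop_preview(path, out_w=600, out_h=335):
      img = cv2.imread(path)
      h, w = img.shape[:2]
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
      cascade = cv2.CascadeClassifier(
          cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
      faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

      if len(faces) > 0:
          # Centre the crop window on the largest detected face.
          x, y, fw, fh = max(faces, key=lambda f: f[2] * f[3])
          cx, cy = x + fw // 2, y + fh // 2
      else:
          cx, cy = w // 2, h // 2  # no face found: fall back to a centre crop

      left = min(max(cx - out_w // 2, 0), max(w - out_w, 0))
      top = min(max(cy - out_h // 2, 0), max(h - out_h, 0))
      return img[top:top + out_h, left:left + out_w]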

This is not a case where some massive ML model is being brought in to save two seconds of your time. This is a very natural and obvious application of ML, at a company which already does ML at scale, in a way that sounds like it has a good chance at improving the appearance of the site without introducing additional friction.

Instagram gets around this by encouraging everyone to take square photos.


I don’t think anyone is saying “I will always prefer to crop every photo and everyone else should too”. I think the point is closer to, if I may borrow a Simpsons line, “I liked your half-assed underparenting a lot more than your half-assed overparenting”. It’s actually impressive that Twitter didn’t object “but I was using my whole ass”, which is basically their default trope when they address user complaints.


> I don’t think anyone is saying “I will always prefer to crop every photo and everyone else should too”.

The parent comments are basically saying that, just not in such an exaggerated way. "Give the users an image cropper, and be done with it."


Look at the before and after pictures of when they released the ML crop:

https://blog.twitter.com/engineering/en_us/topics/infrastruc...

All those examples show large improvement. Of course they might cherrypick images with large improvement for their blog advertising the feature. But still, it illustrates why people would think it's a good idea.

Of course they don't seem to consider the idea of not cropping at all.


This is super interesting. I've recently also been working on a similar concept: we have a reasonable amount of data (in the terabytes) that's fairly static and that I need to search fairly infrequently (but sometimes in bulk). A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3. Random access of a file on S3 is pretty fast, and running in an EC2 instance means latency to S3 is almost nil. Cheap, fast and effective.

We're using some custom Python code to build a Marisa Trie as our index. I was wondering whether there are alternatives to this setup?
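
Roughly, the shape of it is the sketch below. The bucket, key and ID format are placeholders; marisa_trie's RecordTrie and boto3's ranged GetObject are the real APIs:

  # Rough sketch of the setup described above: a small in-memory trie mapping
  # doc id -> (offset, length) within one big file on S3, plus a ranged GET.
  # Bucket, key, and ID format are placeholders.
  import boto3
  import marisa_trie

  s3 = boto3.client("s3")
  BUCKET, KEY = "my-corpus-bucket", "corpus/docs.bin"

  # Built offline from (doc_id, (offset, length)) pairs; "<QI" packs a u64 offset
  # and a u32 length per entry, so the whole index stays comfortably in memory.
  index = marisa_trie.RecordTrie("<QI", [
      ("doc-0001", (0, 512)),
      ("doc-0002", (512, 2048)),
  ])

  def fetch(doc_id):
      offset, length = index[doc_id][0]
      resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                           Range=f"bytes={offset}-{offset + length - 1}")
      return resp["Body"].read()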


You could look at AWS Athena, especially if you only query infrequently and can wait a minute on the search results. There are some data layout patterns in your S3 bucket that you can use to optimize the search. Then you have true pay-per-use querying and don't even have to run any EC2 nodes or code yourself.
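
For example, with the data partitioned by some column (the database, table and partition layout below are just placeholders, not your setup), a query from Python is a couple of boto3 calls:

  # Hedged sketch: querying partitioned data in S3 via Athena from Python.
  # Database, table and partition names are made up for illustration.
  import boto3

  athena = boto3.client("athena")
  resp = athena.start_query_execution(
      QueryString="""
          SELECT doc_id, body
          FROM docs                 -- external table over s3://my-corpus-bucket/corpus/
          WHERE dt = '2021-05-01'   -- partition column keeps the scanned bytes (and bill) small
            AND doc_id = 'doc-0001'
      """,
      QueryExecutionContext={"Database": "corpus_db"},
      ResultConfiguration={"OutputLocation": "s3://my-corpus-bucket/athena-results/"},
  )
  # Poll get_query_execution until it succeeds, then read get_query_results.
  print(resp["QueryExecutionId"])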


> that I need to search fairly infrequently (but sometimes in bulk).

What do you mean by search? Full-text search? Do you need to run custom code on the original data?

> A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3.

Yes, it's like keeping the block index of an sstable (as in RocksDB) in memory. The next step is to have a local cache on the EC2 node. And the step after that is to have a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's present on any of your other nodes.
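
A rough sketch of that first step, a local disk cache in front of the ranged S3 reads (cache path and naming scheme are made up; eviction left out for brevity):

  # Per-node cache for S3 range reads, keyed by (bucket, key, offset, length).
  import hashlib
  import os
  import boto3

  s3 = boto3.client("s3")
  CACHE_DIR = "/mnt/nvme/s3cache"

  def cached_range_get(bucket, key, offset, length):
      name = hashlib.sha1(f"{bucket}/{key}:{offset}:{length}".encode()).hexdigest()
      path = os.path.join(CACHE_DIR, name)
      if os.path.exists(path):
          with open(path, "rb") as f:
              return f.read()
      body = s3.get_object(Bucket=bucket, Key=key,
                           Range=f"bytes={offset}-{offset + length - 1}")["Body"].read()
      os.makedirs(CACHE_DIR, exist_ok=True)
      with open(path, "wb") as f:
          f.write(body)
      return body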

Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or an arbitrary dataset. You can use nginx/Varnish as a reverse proxy, but that isn't distributed. There is Alluxio, but it's single-master.


> Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or an arbitrary dataset. You can use nginx/Varnish as a reverse proxy, but that isn't distributed. There is Alluxio, but it's single-master.

If you think about this some more, it becomes a distributed key-value store that supports both disk and memory access. You could write one using an open-source Raft library, or a possible candidate is TiKV from PingCAP.


> If you think about this some more, it becomes a distributed key-value store that supports both disk and memory access. You could write one using an open-source Raft library, or a possible candidate is TiKV from PingCAP.

My whole point was not building it ;)

There's also https://github.com/NVIDIA/aistore


> What do you mean by search?

"Search" is maybe too strong a word - "lookup" is probably more accurate. I have a couple of identifiers for each document, from which I want to retrieve the full doc.

I'm not sure what you mean by running custom code on the data. I usually do some kind of transformation afterwards.

I didn't find anything either, which is why I was wondering if I was searching for the wrong thing.


How big is each document? If documents are big, keep each one as a separate file and store the IDs in a database. If documents are small, then you want something like https://github.com/rockset/rocksdb-cloud as a building block.


Combining data at rest with a slim index structure and a common access method (like HTTP) was the idea behind a key-value store for JSON I once wrote: https://github.com/miku/microblob

I first thought of building a custom index structure, but found that I did not need everything in memory all the time. Using an embedded leveldb works just fine.


There might be a much better alternative, but it really depends on the nature of your keys.

Because the crux of S3 is latency, you can also decide to encode the docs in blocks and retrieve more data than is actually needed.

For this demo, the index from DocID to offset in S3 takes 1.2 bytes per doc. For a log corpus, we end up with 0.2 bytes per doc.
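
To illustrate the block idea (block size and framing below are just placeholders): store one offset per block of N docs instead of one per doc, fetch the containing block with a ranged GET, and scan it locally. That's roughly how the per-doc index cost can get down to around a byte.

  # Hypothetical illustration: index one S3 offset per block of docs, then
  # over-fetch the containing block and scan it locally for the wanted doc.
  BLOCK_SIZE = 128  # docs per block; bigger blocks -> smaller index, more over-fetch

  def build_block_index(doc_offsets):
      """doc_offsets: sorted byte offsets of docs 0..n-1; keep one offset per block."""
      return [doc_offsets[i] for i in range(0, len(doc_offsets), BLOCK_SIZE)]

  def block_byte_range(block_index, file_size, doc_no):
      """Byte range to request from S3 for the block containing doc_no."""
      b = doc_no // BLOCK_SIZE
      start = block_index[b]
      end = block_index[b + 1] - 1 if b + 1 < len(block_index) else file_size - 1
      return start, end  # use as Range: bytes=start-end, then scan the block for the doc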


You might want to check out Snowflake for something like this; it makes searching pretty easy, especially as your data seems semi-static. We use it pretty extensively at work and it's great.

For your use case it'll be very cheap if you don't access it constantly (you can probably get away with the extra-small instances, which are billed per minute).

Not affiliated in any way, just a suggestion.


Also check out Dremio, with Parquet files stored on S3.


This is the kind of thing I value in Rails. Active Storage [1] has been around for a few years and it solves all of this. All the metadata you care about is in the database: content type, file size, image dimensions, creation date, storage path.

[1] https://guides.rubyonrails.org/active_storage_overview.html


Hi, I'm Nathan. I'm a SWE with experience in backend and large scale data processing, with a recent focus on ML/Natural Language Processing. Looking for an opportunity to relocate to Vancouver.

  Location: Currently London, looking for Vancouver
  Remote: Don't mind
  Willing to relocate: Yes, to Vancouver!
  Technologies: Python, Java, Spark, Kubernetes, FastAPI
  Résumé/CV: https://natpat.net/static/cv.pdf
  Email: nmtpatel7337 [at] gmail [dot] com
Whilst I am a strong technical contributor, I'm also interested in leadership, mentorship and learning. My favourite challenge recently has been leading a highly motivated and skilled team of SWEs, AI Researchers and Data Scientists. I'm interested in a position where I can continue working with strong co-workers and help mould and lead strong, productive and happy teams.

I also love learning and being surrounded by people from backgrounds and fields different from my own. I've had exposure to bio-tech and NLP, but would also be interested in other ML fields, robotics, health, eco/green energy, and probably more.

I'm also very interested in game development, mainly as a hobby - see https://natpat.net/games. Would be very open to roles in that sector too!


There are lots of great ways to ensure your collision detection is perfect. Sadly, none of them run at 60fps.


Seems like they have 24/25

> each day until this Christmas, we will explain one core piece.

https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-w...

