Applied ML engineer with 8 years building ML Products in search & retrieval, and recently agentic LLM systems in production. At Eka.care I built agent-native retrieval (MCP tools, hybrid late-interaction reranking) grounding LLMs in external data, and open-sourced an LLM evaluation library (KARMA, adopted by the Gates Foundation).
Currently doing ML consulting and independent research on agentic search, training models with GRPO to plan, issue, and refine their own search-tool queries. Strong in Python and Go.
I have EU work authorization, no sponsorship needed.
Location: Göttingen, Germany (work authorization available)
Remote: Yes (preferred)
Willing to relocate: Yes (anywhere in Germany, open to EU)
Technologies: Python, Go, PyTorch, vLLM, TensorRT, Kubernetes, ElasticSearch, Vespa, Apache Beam, PySpark, AWS, GCP
Résumé/CV: https://bluenotebook.io/about
Email: nikhil.kasukurthi [at] gmail [dot] com
ML Engineer with 8 years building production AI systems.
Most recently Lead Data Scientist at Eka.care (healthcare, 100K+ doctors), where I deployed a Speech LLM (Whisper + Gemma 2) via custom vLLM plugins, cutting inference costs 60%.
Built medical search from query log analysis through query decomposition on ElasticSearch (nDCG@10 +55%, relevance +160%).
Designed MedAssist, an agentic LLM platform with MCP, adopted by Apollo Hospitals. Open-sourced KARMA (https://github.com/eka-care/KARMA-OpenMedEvalKit), an LLM evaluation library for Indian healthcare scenarios.
Architected model serving on Kubernetes (vLLM, RayServe, TensorRT), cutting costs 50% vs SageMaker.
Before that: search ranking at Udaan (India's largest B2B marketplace, +10% conversion via A/B-tested LTR models), research at NCBS-TIFR (published in Bioinformatics), and clinical AI at SigTuple (retinal disease detection through CE certification).
3 peer-reviewed papers (IEEE ISBI, Bioinformatics). Recently completed Stanford CS336 (LLMs from scratch) with distributed training on H100 clusters.
Looking for: Senior/Staff ML Engineer, Applied Scientist, or AI Engineer roles. Especially interested in search/retrieval, LLM infrastructure, or applied ML.
Author here. I wanted to train Nanochat d26 to GPT-2 level and had to pick between three H100 variants on Runpod.
SXM was the most expensive per hour but cheapest to finish:
SXM: 702ms/step - ~$37 (using vast.ai)
PCIe: 1,412ms/step - ~$112 (runpod)
NVL: 2,032ms/step - ~$181 (runpod)
My first SXM run hit 1,295ms, barely faster than PCIe.
The Nsight OS runtime summary led me to suspect CPU starvation.
I found a higher vCPU instance on Vast.ai which hit 700ms.
The 128 vCPU SXM instance also hit ~700ms, so it wasn't CPU count.
Looking at the network topology on Runpod and vast.ai, the first instance had GPUs split 4+4 across two NUMA nodes. NCCL's data transfer uses NVSwitch and is unaffected, but the control threads run on CPU. Cross-socket latency on every pthread_cond_signal added up.
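A rough way to check for this kind of split before committing to a long run is to read each GPU's NUMA node from Linux sysfs (`nvidia-smi topo -m` shows similar information). This is a minimal sketch, not the method used for the measurements above: the function names are hypothetical, and it assumes a Linux host that exposes `/sys/bus/pci/devices` with NVIDIA's PCI vendor ID `0x10de`.

```python
from collections import defaultdict
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA devices


def read_gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> NUMA node for NVIDIA display/3D controllers.

    Returns an empty dict when sysfs is unavailable (e.g. not on Linux).
    A node of -1 means the kernel did not report NUMA affinity.
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            numa = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip devices with missing or unreadable attributes
        # PCI class 0x03xxxx covers display controllers, including GPUs
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            nodes[dev.name] = numa
    return nodes


def group_by_numa(gpu_nodes):
    """Group PCI addresses by the NUMA node they are attached to."""
    groups = defaultdict(list)
    for addr, node in gpu_nodes.items():
        groups[node].append(addr)
    return {node: sorted(addrs) for node, addrs in groups.items()}


def is_split_across_sockets(gpu_nodes):
    """True if GPUs report more than one known NUMA node."""
    return len({n for n in gpu_nodes.values() if n >= 0}) > 1


if __name__ == "__main__":
    found = read_gpu_numa_nodes()
    print(group_by_numa(found))
    print("split across NUMA nodes:", is_split_across_sockets(found))
```

On a 4+4 instance this would report two groups of four addresses. Whether that split actually hurts step time depends on where the framework and NCCL threads get scheduled, so treat it as a hint rather than a diagnosis.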
NVL was the most confusing result: NCCL kernel times were nearly identical to PCIe, but step times were 44% worse. Only 4 of the 28 GPU pairs share NVLink on NVL; the rest fall back to PCIe. I don't have a full explanation for this yet.
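For reference, the ratios behind these comparisons can be recomputed from the step times quoted above. The sketch below only uses those three numbers; `run_cost` is a hypothetical helper showing how step time, step count, and hourly price combine, and the step count and hourly prices of the actual runs are not given here.

```python
# Step times in milliseconds, as quoted in the comment above.
STEP_MS = {"SXM": 702, "PCIe": 1412, "NVL": 2032}


def slowdown(variant, baseline):
    """How many times slower `variant` is per step than `baseline`."""
    return STEP_MS[variant] / STEP_MS[baseline]


def run_cost(step_ms, total_steps, usd_per_hour):
    """Total cost of a run: wall-clock hours times the hourly price.

    total_steps and usd_per_hour are inputs you supply; they are not
    taken from the original measurements.
    """
    hours = step_ms * total_steps / 1000 / 3600
    return hours * usd_per_hour


if __name__ == "__main__":
    print(f"NVL vs PCIe: {slowdown('NVL', 'PCIe'):.2f}x")   # the 44% gap
    print(f"PCIe vs SXM: {slowdown('PCIe', 'SXM'):.2f}x")
    print(f"NVL vs SXM:  {slowdown('NVL', 'SXM'):.2f}x")
```

Since cost scales linearly with step time for a fixed number of steps, SXM stays cheaper to finish than PCIe as long as its hourly price is less than about 2x the PCIe price.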
The scroll trigger was something I’ve seen and wanted to play around with, but I know it’s controversial so I added the toggle as well (upper left corner).
Interesting work. In the examples I can see that quite a few of them have the terracotta/warm-cream colour palette. Was that an explicit choice to keep them in the prompts?
From the official frontend-design skill, on multiple occasions, unprompted, even I received the same warm-cream tones for different projects. Wondering if it's a new latent direction the model chooses to go to avoid safe/generic patterns.
Beyond the developer, the user massively benefits from MCP. Like you said, using any other SDK to build is a very valid approach but then you are tied down to a single client that you have built on that SDK.
If you would like to switch clients, then you have to build it yourself. MCP solves this very well, since any MCP-supported client can use the same tools/resources that you have built.
A neat trick in the Vespa (a vector DB, among other things) documentation is to use a hex representation of vectors after converting them to binary.
This trick can be used to reduce your payload sizes.
Vespa supports this format, which is particularly useful when the same vectors are referenced multiple times in a document. For ColBERT- or ColPali-like cases (where you have many embedding vectors), this can massively reduce the size of the vectors stored on disk.
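A minimal sketch of the idea in plain Python, not Vespa's own code: binarize each float vector by sign, pack eight bits into one byte, and hex-encode the bytes. The function names are hypothetical, and the `> 0` threshold and most-significant-bit-first packing are assumptions; check the Vespa documentation for the exact feed format it expects.

```python
def binarize_to_hex(vector):
    """Sign-binarize a float vector and return it as a hex string.

    Each group of 8 dimensions becomes one byte (first dimension in the
    most significant bit), and each byte becomes two hex characters.
    """
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return bytes(packed).hex()


def encode_multivector(vectors):
    """Hex-encode a list of vectors (e.g. one per token or patch),
    keyed by position, as a multi-vector model like ColBERT would need."""
    return {index: binarize_to_hex(vec) for index, vec in enumerate(vectors)}


if __name__ == "__main__":
    token_vectors = [[0.3, -0.1, 0.7, -0.9, 0.2, -0.4, 0.5, -0.6] * 16] * 3
    print(encode_multivector(token_vectors))
```

With this packing, a 128-dimensional float vector shrinks to 16 bytes, or 32 hex characters in the payload, at the cost of keeping only the sign of each dimension.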
LLMs = Latency? That's how most of us perceive it. When examining the timing breakdown of a request on Claude, you'll notice that the majority of the time is spent in Content Download—essentially, decoding output tokens.
In the blog, I discuss how partial JSON validation can help in workflow-driven LLM products.
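To make the idea concrete, here is one possible sketch of validating JSON while it is still streaming. It is an illustration under my own assumptions, not necessarily the approach from the blog: it closes any open string and brackets in the received prefix, tries to parse the result, and then type-checks whichever fields have arrived. The function names and the `expected_types` schema are hypothetical.

```python
import json


def try_parse_partial(prefix):
    """Parse a JSON prefix by closing open strings/brackets.

    Returns the parsed value, or None if the prefix cannot be completed
    this simply (e.g. it ends on a trailing comma or a dangling key).
    """
    closers = []          # closing brackets we still owe, innermost last
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]":
            if not closers or closers.pop() != ch:
                return None  # mismatched bracket: already invalid
    candidate = prefix + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def fields_valid_so_far(partial, expected_types):
    """Check that every field received so far has the expected type."""
    if not isinstance(partial, dict):
        return False
    return all(
        isinstance(partial[key], expected_types[key])
        for key in partial
        if key in expected_types
    )


if __name__ == "__main__":
    schema = {"action": str, "steps": list}
    chunk = '{"action": "search", "steps": ["parse que'
    parsed = try_parse_partial(chunk)
    print(parsed, fields_valid_so_far(parsed, schema))
```

The point of checking early is that a workflow can reject or act on a response before the model has finished decoding all output tokens, instead of waiting for the full payload.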
In one of my earlier jobs a few years back, we were training deep learning models on VMs with GPUs. Back then the tooling was not as extensive as it is now (VS Code did not have Remote SSH yet).
So we would SSH into the VM and do our work there. This involved a lot of debugging through vim, since it's quicker to make in-place edits and re-run experiments. It taught me a lot about effective debugging and writing code for the VM.
I'm interested in any tips you figured out for debugging in that environment. I find a GUI debugger to be an essential tool for this kind of work. It's the thing that keeps me using vscode remote vs just vim on the server (which I'd prefer if all I needed was editing).
Remote: Yes (preferred; hybrid also fine)
Willing to relocate: No (remote, or hybrid within Germany)
Technologies: Python, Go, search & retrieval, ranking, RAG, agentic LLM systems (MCP, tool use), LLM post-training (GRPO/RLHF), evaluation, vLLM, Kubernetes, AWS/GCP, ML Product building
Résumé/CV: https://bluenotebook.io/about (LinkedIn https://linkedin.com/in/nikhil-kasukurthi, GitHub https://github.com/nikhil-kasukurthi)
Email: nikhil.kasukurthi@gmail.com
Applied ML engineer with 8 years building ML Products in search & retrieval, and recently agentic LLM systems in production. At Eka.care I built agent-native retrieval (MCP tools, hybrid late-interaction reranking) grounding LLMs in external data, and open-sourced an LLM evaluation library (KARMA, adopted by the Gates Foundation).
Currently doing ML consulting and independent research on agentic search, training models with GRPO to plan, issue, and refine their own search-tool queries. Strong in Python and Go.
I have EU work authorization, no sponsorship needed.