Ask HN: Applying Open Source ML and LLM in side projects – where to start?
13 points by danproductman on July 20, 2023 | 4 comments
Hi all,

The open-source ML and LLM community is developing rapidly, and I've been following its progress closely for the past few months. As a frontend developer with over a decade of experience and some backend dabbling, I'm eager to get hands-on with these tools in small side projects and see what they can do.

My tinkering nature has led me to set up VPS instances, build data-scraping tools, and even create fairly complex iOS apps with Core Data in Swift, apart from my professional frontend work building large FE applications. I'm not afraid to pick up new languages like Python or libraries like Pandas. However, in the realm of ML and LLMs I feel a bit overwhelmed, unsure where to start or whether the skills I have can be effectively applied.

What I'm aiming for is hands-on implementation of this technology in real-world scenarios. For instance, I meticulously categorize my expenses each month and would like to create a model that automatically classifies transactions based on historical data. Can open-source tools help me achieve this, and if so, what should I learn to make it happen?

Another project I'm working on involves a straightforward analytics tool. I envision using NLP in its search interface to guide users seamlessly toward their goals. Imagine users asking questions like "What's my conversion rate for X in June?" and being navigated to a preconfigured page with the right filters. Or posing a query like "How is retention defined for Y?" and receiving a snippet of relevant documentation with a link for further reading.

I'm genuinely curious and eager to hear from all of you on the best path forward. What ML libraries or NLP frameworks would suit my projects? How can I effectively leverage these open-source tools for practical applications? Your insights and advice would be invaluable in helping me take the first steps.

Thank you in advance!

Dan




I would suggest getting your feet wet with the Hugging Face Spaces free/Pro plan to get started, then their APIs once you get the hang of setting things up there. After that you can start setting up LangChain pipelines, or direct vector DB queries to figure out which columns to use or which SQL queries to formulate (for the analytics project).

As for the expense classifier, you can try zero-shot classification between your n categories plus an "other" bucket. Models like Flan-T5/T5/Flan-UL2/DistilBART can do this (~7B-40B param LLMs can too, but would be overkill).
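A minimal sketch of that zero-shot approach with the Hugging Face `transformers` pipeline (assuming `pip install transformers torch`; the category list and model choice here are illustrative placeholders, not recommendations from the thread):

```python
# Sketch of zero-shot expense classification with Hugging Face transformers.
# Assumes `pip install transformers torch`; the categories and the model
# name are placeholders.

CATEGORIES = ["food", "transportation", "rent", "subscriptions", "other"]

def classify_transaction(description: str) -> str:
    """Return the most likely category for one transaction description."""
    from transformers import pipeline  # lazy import; first call downloads a model

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    result = classifier(description, candidate_labels=CATEGORIES)
    return result["labels"][0]  # labels come back sorted by score, best first
```

Calling `classify_transaction("UBER TRIP 07/14")` should land on "transportation", though with a small label set it's worth keeping the "other" bucket for anything ambiguous.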


I was in a similar boat, and I built a project called Endless Academy - https://www.endless.academy/ . It helped me both brush up on some new tools, and scratch an itch to build something.

To start, just using an LLM API (like Anthropic or OpenAI) and a light wrapper like Microsoft's guidance library will be enough for the AI piece. If you want to get more complex, you can add in semantic search with an embedding model and a vector database. But don't do that off the bat.

For your use case, you won't need ML off the bat, either. When/if you do need ML models like classifiers, I'd use scikit-learn.
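For the expense-categorization use case specifically, a scikit-learn classifier on historical data can be very small. A sketch, where the transactions and categories are made up and you'd load your real expense history instead:

```python
# Sketch: train a classifier on historically categorized transactions with
# scikit-learn. The transactions and categories here are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "UBER TRIP SAN FRANCISCO", "LYFT RIDE", "SHELL GAS STATION",
    "WHOLE FOODS MARKET", "TRADER JOES", "CHIPOTLE 1234",
    "NETFLIX.COM", "SPOTIFY PREMIUM", "HULU SUBSCRIPTION",
]
categories = [
    "transportation", "transportation", "transportation",
    "food", "food", "food",
    "subscriptions", "subscriptions", "subscriptions",
]

# Character n-grams cope well with the all-caps, abbreviated text on statements.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, categories)

prediction = model.predict(["UBER TRIP 07/14"])[0]
```

With a few months of labeled statements this kind of model gets you surprisingly far before you ever need an LLM.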

For the queries, I would skip pandas, and just use SQL. You can use an LLM to turn natural language into SQL queries, then just show the query results in an interface. The hardest part will actually be mapping the queries into the interface, and vice versa.
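The "execute the generated SQL" half of that loop can stay very small. A sketch with stdlib sqlite3, where the query string stands in for what the LLM would actually return:

```python
# Sketch of executing LLM-generated SQL against a local database.
# The query string below stands in for the model's output.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (name TEXT, month TEXT, conversions INTEGER, visits INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("signup", "June", 30, 600),
    ("signup", "July", 45, 500),
])

llm_generated_sql = (
    "SELECT CAST(conversions AS REAL) / visits AS conversion_rate "
    "FROM events WHERE name = 'signup' AND month = 'June'"
)

# Crude guardrail: only run read-only statements the model produces.
assert llm_generated_sql.lstrip().upper().startswith("SELECT")

rows = conn.execute(llm_generated_sql).fetchall()
conversion_rate = rows[0][0]  # 30 / 600 = 0.05
```

A read-only database connection (or a SELECT-only allowlist like the assert above, but stricter) is worth having before you let model-written SQL anywhere near real data.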

For my stack, I used FastAPI for the backend, and SvelteKit for the frontend. I highly recommend this stack for LLM applications - the async paradigm works well for streaming LLM outputs, and you get nice reactivity on the frontend.


I think you need to figure out what your real goal is and focus. If you want to learn, then focus on tutorials until you build real competence. I would start with Python and move on from there.

If your goal is to build a product for sale, I would focus on one idea and build a demo using APIs that already exist, like OpenAI or one of its competitors. If it's a hard problem and users find it very useful, then your costs don't matter much: you just tack them on and charge for it.

The first goal is exploratory and the other is specific. They're probably incompatible. It's like saying "I want to be a chef" (you have to focus on the fundamentals first) and "I want a cake tonight." Either sign up for chef school or buy a cake mix (or go to a bakery).

Sorry it doesn't directly answer your question.


The thing is, the OpenAI models still seem much more practical because they are significantly smarter than other models, especially the open-source ones. You can use them without needing to train them, and creating the data for training can be a big effort.

Based on the projects you mentioned, my suggestion would be:

- Get signed up with OpenAI and make sure you have GPT-4 API access set up.

- Even if GPT-4 access isn't available immediately, you can get pretty far with gpt-3.5-turbo.

- Find a tutorial on using the OpenAI LLM API.

- Get the tutorial working and then modify it for your first use case. The prompt can be something like:

You will receive a financial transaction encoded as JSON. Output a classification as JSON in this format: { "transaction_type": "food" }, where transaction_type is one of "food", "other", "transportation", etc.

- For the analytics project, you will want to find an OpenAI LLM tutorial for using their new Functions feature. Give it functions like configureAndShowPage(filters) or maybe even queryAnalyticsData(sql). The AI would write those function calls and parameters out on the fly based on a user question; your program would receive the calls, execute them, and display the result to the user.

- For the analytics documentation search, look up something like "OpenAI embedding search" or "llamaindex starter tutorial".
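To make the Functions flow concrete, here is a sketch of the dispatch side. The handler names echo the ones mentioned above, and the payload is hand-written, standing in for what the model would emit for a user question:

```python
# Sketch of dispatching an OpenAI-style function call.
# The payload below is hypothetical, standing in for the model's output.
import json

def configure_and_show_page(filters):
    return f"showing analytics page with filters {filters}"

def query_analytics_data(sql):
    return f"running query: {sql}"

HANDLERS = {
    "configureAndShowPage": lambda args: configure_and_show_page(args["filters"]),
    "queryAnalyticsData": lambda args: query_analytics_data(args["sql"]),
}

# e.g. what the model might return for "What's my conversion rate for X in June?"
model_call = {
    "name": "configureAndShowPage",
    "arguments": json.dumps({"filters": {"metric": "conversion_rate", "month": "June"}}),
}

args = json.loads(model_call["arguments"])  # the API returns arguments as a JSON string
result = HANDLERS[model_call["name"]](args)
```

Note the arguments come back as a JSON string you have to parse yourself, and the model can produce malformed or unexpected arguments, so validate before executing.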

As far as open source, I know some of the recent ones are showing more promise, but I still believe they will need significant training to really be useful for most tasks. But I would be really interested to hear if that is not the case, or how someone with a lot of experience with the open source models would approach your use cases.

I assume actually the embedding search might work okay with one of the latest embedding models. Still more hassle and possibly more expensive to run than OpenAI.
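The core of that embedding search is just cosine similarity over vectors. A sketch, where the 3-d vectors are toy stand-ins for real embeddings (which would come from OpenAI's embeddings endpoint or a local open model):

```python
# Sketch of embedding search: rank documentation snippets by cosine similarity.
# The 3-d vectors are toy stand-ins for real embedding-model output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

doc_vectors = {
    "How retention is defined": [0.9, 0.1, 0.0],
    "Setting up funnels":       [0.1, 0.8, 0.2],
    "Billing and invoices":     [0.0, 0.2, 0.9],
}

def search(query_vec, top_k=1):
    ranked = sorted(doc_vectors, key=lambda t: cosine(query_vec, doc_vectors[t]),
                    reverse=True)
    return ranked[:top_k]

# A question like "How is retention defined for Y?" would embed near the
# first doc, so its (hypothetical) query vector lands closest to it.
best = search([0.85, 0.15, 0.05])[0]
```

At small scale a dict and a sort like this is fine; a vector database only becomes worth it once the corpus is too big to scan linearly.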

Is the idea to use open source only because you are worried your wife will get mad if you spend $20 on the OpenAI API or something? I mean, I get it if you just prefer open source in general; I would like to also. It's just that until the coding and other abilities of open-source models get better, it seems much more practical to skip all the training and hosting and just use the OpenAI API with their general-purpose, capable models.



