Ask HN: Applying Open Source ML and LLM in side projects – where to start?
13 points by danproductman on July 20, 2023 | 4 comments
Hi all,

The open-source ML and LLM community is developing rapidly, and I've been following its progress closely for the past few months. As a frontend developer with over a decade of experience and some backend dabbling, I'm eager to get hands-on with these tools in small side projects and see what they can do.

My tinkering nature has led me to set up VPS instances, build data-scraping tools, and even create fairly complex iOS apps with Core Data in Swift, apart from my professional frontend work building large FE applications. I'm not afraid to pick up new languages like Python or libraries like Pandas. However, in the realm of ML and LLMs I feel a bit overwhelmed, unsure where to start or whether the skills I have can be effectively applied.

What I'm aiming for is hands-on implementation of this technology in real-world scenarios. For instance, I meticulously categorize my expenses each month and would like to create a model that automatically classifies transactions based on historical data. Can open-source tools help me achieve this, and if so, what should I learn to make it happen?

Another project I'm working on involves a straightforward analytics tool. I envision using NLP in its search interface to guide users seamlessly toward their goals. Imagine users asking questions like "What's my conversion rate for X in June?" and being navigated to a preconfigured page with the right filters. Or posing a query like "How is retention defined for Y?" and receiving a snippet of relevant documentation with a link for further reading.

I'm genuinely curious and eager to hear from all of you on the best path forward. What ML libraries or NLP frameworks would suit my projects? How can I effectively leverage these open-source tools for practical applications? Your insights and advice would be invaluable in helping me take the first steps.

Thank you in advance!

Dan




I would suggest getting your feet wet with the Hugging Face Spaces free/Pro plan to get started, then their APIs once you get the hang of setting things up there. After that you can start setting up LangChain pipelines, or direct vector DB queries to figure out which columns to use or which SQL queries to formulate (for the analytics project).

As for the expense classifier, you can try zero-shot classification between your n categories plus an "other" bucket. Models like Flan-T5/T5/Flan-UL2/DistilBART can do this (~7B-40B param LLMs can too, but would be overkill).
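A minimal sketch of that zero-shot approach with the Hugging Face `transformers` pipeline (assuming `pip install transformers torch`; the category list and model choice here are illustrative placeholders, not recommendations from the thread):

```python
# Sketch of zero-shot expense classification with Hugging Face transformers.
# Assumes `pip install transformers torch`; the categories and the model
# name are placeholders.

CATEGORIES = ["food", "transportation", "rent", "subscriptions", "other"]

def classify_transaction(description: str) -> str:
    """Return the most likely category for one transaction description."""
    from transformers import pipeline  # lazy import; first call downloads a model

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    result = classifier(description, candidate_labels=CATEGORIES)
    return result["labels"][0]  # labels come back sorted by score, best first
```

Calling `classify_transaction("UBER TRIP 07/14")` should land on "transportation", though with a small label set it's worth keeping the "other" bucket for anything ambiguous.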


I was in a similar boat, and I built a project called Endless Academy - https://www.endless.academy/ . It helped me both brush up on some new tools, and scratch an itch to build something.

To start, just using an LLM API (like Anthropic or OpenAI) and a light wrapper like Microsoft's guidance library will be enough for the AI piece. If you want to get more complex, you can add in semantic search with an embedding model and a vector database. But don't do that off the bat.

For your use case, you won't need ML off the bat, either. When/if you do need ML models like classifiers, I'd use scikit-learn.
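For the expense-categorization use case specifically, a scikit-learn classifier on historical data can be very small. A sketch, where the transactions and categories are made up and you'd load your real expense history instead:

```python
# Sketch: train a classifier on historically categorized transactions with
# scikit-learn. The transactions and categories here are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "UBER TRIP SAN FRANCISCO", "LYFT RIDE", "SHELL GAS STATION",
    "WHOLE FOODS MARKET", "TRADER JOES", "CHIPOTLE 1234",
    "NETFLIX.COM", "SPOTIFY PREMIUM", "HULU SUBSCRIPTION",
]
categories = [
    "transportation", "transportation", "transportation",
    "food", "food", "food",
    "subscriptions", "subscriptions", "subscriptions",
]

# Character n-grams cope well with the all-caps, abbreviated text on statements.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, categories)

prediction = model.predict(["UBER TRIP 07/14"])[0]
```

With a few months of labeled statements this kind of model gets you surprisingly far before you ever need an LLM.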

For the queries, I would skip pandas, and just use SQL. You can use an LLM to turn natural language into SQL queries, then just show the query results in an interface. The hardest part will actually be mapping the queries into the interface, and vice versa.
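The "execute the generated SQL" half of that loop can stay very small. A sketch with stdlib sqlite3, where the query string stands in for what the LLM would actually return:

```python
# Sketch of executing LLM-generated SQL against a local database.
# The query string below stands in for the model's output.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (name TEXT, month TEXT, conversions INTEGER, visits INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("signup", "June", 30, 600),
    ("signup", "July", 45, 500),
])

llm_generated_sql = (
    "SELECT CAST(conversions AS REAL) / visits AS conversion_rate "
    "FROM events WHERE name = 'signup' AND month = 'June'"
)

# Crude guardrail: only run read-only statements the model produces.
assert llm_generated_sql.lstrip().upper().startswith("SELECT")

rows = conn.execute(llm_generated_sql).fetchall()
conversion_rate = rows[0][0]  # 30 / 600 = 0.05
```

A read-only database connection (or a SELECT-only allowlist like the assert above, but stricter) is worth having before you let model-written SQL anywhere near real data.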

For my stack, I used FastAPI for the backend, and SvelteKit for the frontend. I highly recommend this stack for LLM applications - the async paradigm works well for streaming LLM outputs, and you get nice reactivity on the frontend.


I think you need to figure out what your real goal is and focus. If you want to learn, then focus on tutorials until you build real competence. I would start with Python and move on from there.

If your goal is to build a product for sale, I would focus on one idea and build a demo using APIs that already exist, like OpenAI or one of its competitors. If it's a hard problem and users find it very useful, then your costs don't matter much: you just tack them on and charge for it.

The first goal is exploratory and the other is specific. They're probably incompatible. It's like saying "I want to be a chef" (you have to focus on the fundamentals first) and "I want a cake tonight." Either sign up for chef school or buy a cake mix (or go to a bakery).

Sorry it doesn't directly answer your question.


The thing is, the OpenAI models still seem much more practical because they are significantly smarter than other models, especially the open-source ones. You can use them without needing to train them, and creating the data for training can be a big effort.

Based on the projects you mentioned, my suggestion would be:

- Get signed up with OpenAI and make sure you have GPT-4 API access set up.

- Even if GPT-4 access isn't available immediately, you can get pretty far with gpt-3.5-turbo.

- Find a tutorial on using the OpenAI LLM API.

- Get the tutorial working and then modify it for your first use case. The prompt can be something like:

You will receive a financial transaction encoded as JSON. Output a classification as JSON in this format: { "transaction_type": "food" }, where transaction_type is one of "food", "other", "transportation", etc.

- For the analytics project, you will want to find an OpenAI LLM tutorial for using their new Functions feature. Give it functions like configureAndShowPage(filters) or maybe even queryAnalyticsData(sql). The AI would write those function calls and parameters out on the fly based on a user question; your program would receive the calls, execute them, and display the result to the user.

- For the analytics documentation search, look up something like "OpenAI embedding search" or "llamaindex starter tutorial".
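To make the Functions flow concrete, here is a sketch of the dispatch side. The handler names echo the ones mentioned above, and the payload is hand-written, standing in for what the model would emit for a user question:

```python
# Sketch of dispatching an OpenAI-style function call.
# The payload below is hypothetical, standing in for the model's output.
import json

def configure_and_show_page(filters):
    return f"showing analytics page with filters {filters}"

def query_analytics_data(sql):
    return f"running query: {sql}"

HANDLERS = {
    "configureAndShowPage": lambda args: configure_and_show_page(args["filters"]),
    "queryAnalyticsData": lambda args: query_analytics_data(args["sql"]),
}

# e.g. what the model might return for "What's my conversion rate for X in June?"
model_call = {
    "name": "configureAndShowPage",
    "arguments": json.dumps({"filters": {"metric": "conversion_rate", "month": "June"}}),
}

args = json.loads(model_call["arguments"])  # the API returns arguments as a JSON string
result = HANDLERS[model_call["name"]](args)
```

Note the arguments come back as a JSON string you have to parse yourself, and the model can produce malformed or unexpected arguments, so validate before executing.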

As far as open source, I know some of the recent ones are showing more promise, but I still believe they will need significant training to really be useful for most tasks. But I would be really interested to hear if that is not the case, or how someone with a lot of experience with the open source models would approach your use cases.

I assume actually the embedding search might work okay with one of the latest embedding models. Still more hassle and possibly more expensive to run than OpenAI.
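The core of that embedding search is just cosine similarity over vectors. A sketch, where the 3-d vectors are toy stand-ins for real embeddings (which would come from OpenAI's embeddings endpoint or a local open model):

```python
# Sketch of embedding search: rank documentation snippets by cosine similarity.
# The 3-d vectors are toy stand-ins for real embedding-model output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

doc_vectors = {
    "How retention is defined": [0.9, 0.1, 0.0],
    "Setting up funnels":       [0.1, 0.8, 0.2],
    "Billing and invoices":     [0.0, 0.2, 0.9],
}

def search(query_vec, top_k=1):
    ranked = sorted(doc_vectors, key=lambda t: cosine(query_vec, doc_vectors[t]),
                    reverse=True)
    return ranked[:top_k]

# A question like "How is retention defined for Y?" would embed near the
# first doc, so its (hypothetical) query vector lands closest to it.
best = search([0.85, 0.15, 0.05])[0]
```

At small scale a dict and a sort like this is fine; a vector database only becomes worth it once the corpus is too big to scan linearly.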

Is the idea to use open source only because you are worried your wife will get mad if you spend $20 on the OpenAI API or something? I mean, I get it if you just prefer open source in general; I would like to also. It's just that until the coding and other abilities of open-source models get better, it seems much more practical to skip all the training and hosting and just use the OpenAI API with their general-purpose, capable models.



