Hacker News
Show HN: I scraped 25M Shopify products to build a search engine (searchagora.com)
317 points by pencildiver on Dec 13, 2023 | 268 comments
Hi HN! I built Agora as a side-project leading up to the holiday season. I wanted to find an easier way to find Christmas gifts, without needing to go store-by-store.

My wife asked me for a pair of red shoes for Christmas. I quickly typed it into Google and found a combination of ads from large retailers and links to a 1948 movie called 'Red Shoes'. I decided to build Agora to solve my own problem (and stay happily married). The product is a search engine that automatically crawls thousands of Shopify stores and makes them easily accessible with a search interface. There are a few additional features to enhance the buying experience, including saving products, filters, reviews, and popular products.

I've started with exclusively Shopify stores and plan to expand the crawler to other e-commerce platforms like BigCommerce, WooCommerce, Wix, etc. The technical challenge I've found is keeping the search speed and performance strong as the data set becomes larger. There are about 25 million products on Agora right now. I'll ramp this up carefully to make sure we don't compromise the search speed and user experience.

I'd love any feedback!




I hope you have better luck than I did!

A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine on products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing.. over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.

I still maintain that this is a good idea, and constantly have to fight off the urge to "try again", however, to do it properly, I think funding would be necessary, or finding some way to organically gain a lot of users.

Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.


>but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users

In the EU there are many price comparison engines with millions or billions of products. I don't know how popular they are. Some monetize through ads, some have partnerships with stores and you can buy directly from the search results.

I generally search first on the local Amazon equivalent; if I don't like what I see, I search on a smaller store. If I still can't find the product, or dislike the products or prices, I search Google. If I am still not satisfied with the results, I search on comparison engines.

I also have a browser extension called Pricy that polls the comparison engines, so once I land on a product page I know which store has the best price and what the price history was over the last year.

Probably many people have similar patterns. I expect people in the US to search Amazon first, unless they are after a very niche product.

I think you can have a better monetization proposal if, instead of just search, you build a sales platform, so people can buy directly after searching, without hopping to various websites.


Unfortunately many of these "comparison" websites have a business model built on affiliate fees.

It doesn't take much imagination to predict which products show up as "best" or "cheapest".

And the fairer ones have to keep playing cat and mouse with shops lowering pricing when they detect a scraper coming by. Or employ tricks to make their shipping seem free, lowering their overall price on the comparison platform.


> It doesn't take much imagination to predict which products show up as "best" or "cheapest".

Never seen a "best" outside of amazon, which does weird shit even without any affiliate fees. And "cheapest" is not really up to the site, unless they want to go under quite quickly.


Many if not all are like that. It's like everyone wants to take advantage of the lack of perfect information in the marketplace, as opposed to actually being helpful for consumers.


We were intentionally limiting the number of products and shops we were indexing due to opex. We needed to keep it low enough to provide ourselves with enough runway to keep things floating for longer.

PriceRunner is another site which operates in a similar space. We had plans to build out the price tracking and a number of other features, so that we would appeal more to users who had your use cases. Sadly, we weren't getting enough traction. We did have regular users from the EU, but we simply couldn't seem to get in front of enough eyeballs for it to matter. At least at first, I expect that a large amount of the traffic to a new site like this has to be driven by Google, and we failed on that front as well. I'm not an SEO expert, so there were likely many things we did wrong or didn't even do which led to this situation.

Re: a sales platform, that's a pretty big challenge to take on, which would require massive investment up front. Not sure that's a viable route for most. We did have plans to address the "without hopping to various websites" problem, as we identified that as problematic for users very early on. The solution was relatively simple, but required more money to build out. We simply ran out of funds before we could get there.


<< We did have plans to address the "without hoping to various websites" problem, as we identified that as problematic for users very early on. The solution was relatively simple, but required more money to build out. We simply ran out of funds before we could get there. >>

What were your plans to solve this problem?


> In EU there are many price comparison engines with millions or billions of products. I don't know how popular they are.

Anecdotally, I guess, I'd say extremely popular. I never search for products anywhere else.


Yeah, here in Czechia I always look at https://www.heureka.cz/ first.


What do you consider the local Amazon variant? And which country?


Amazon has no direct presence in Switzerland, but you can order a fraction of its products from neighboring countries. Many products are not available, mainly because nobody wants to deal with customs once the product crosses the EU border.

Amazon itself never moved into Switzerland in the first place for many reasons (small market, unusual customs situation, relatively high salary for warehouse workers), and in the meantime the largest Swiss supermarket chain created an Amazon clone which became hugely popular pretty much immediately: Galaxus.ch


If you hadn't said that it's basically the Amazon in Switzerland, I'd have thought that this is some blogspam dropshipping site...


Amazon is a blogspam drop shipping site in Europe


There are alternatives throughout Europe. The Balkans have Emag, Benelux has bol.com. I think in both regions Amazon is less popular. I'm sure there are other examples.


The Netherlands has plenty of them. Tweakers.net is a price tracker for electronics and such (eg: computer parts, phones, laptops etc) and usually it's easier to find a shop cheaper than Amazon. I have some go to stores for my needs because their content is organised way better than Amazon. I also find some alternatives better than Amazon because they have free next day shipping, something that's not free on Amazon.


Emag in Romania. I hate it, they bought most of the competition, they did a lot of anticompetitive things, but it's really easy to buy from them.


At some point, a couple of years ago when they introduced the marketplace, I actually thought they were aiming for an "exit" to Amazon. They really got the service part of e-commerce nailed down. Merchant quality is and always will be an issue, but it is the same as on Amazon.


hagglezon.com to compare Amazon variant prices


bol.com in the Netherlands


I'm curious why you consider lack of users to be the problem. I would have described it as lack of revenue.

What plans did you have for generating revenue from the site? (Serious question - given your low costs it would seem like a tiny amount of revenue would have been enough.)


Our business model revolved around referrals, so lack of users directly translated to lack of revenue. While it's true that even if we had millions of users but none of them were buying sponsored items we would have had a revenue problem, that wasn't the problem we were facing, as the few users we did have were in fact purchasing sponsored items.


Then the problem seems to be the lack of users.

Have you tried having a YouTube channel, TikTok, Facebook, Twitter, or a blog, and explaining daily how you built the website and how your platform is going to help users?


We did have channels on various sites, yes. However, it's difficult to maintain a steady stream of content there for people to consume. Not only that, but you have the same discoverability problems as you do for the main site. Also, a blog outlining how you built the site may be of limited value. At least my experience on that front was that it would generate short-lived bursts of traffic, but wouldn't generate returning users. So I think those articles were mostly appealing to technical users, and not necessarily users who were looking to do some shopping. Of course technical users do also shop, but after reading a technical article, they probably aren't looking to immediately shop, and without some other mechanism putting the site in front of them again when they needed to shop, we would miss the opportunity.


Thanks for sharing this! If you're up for it, I'd love to talk more about your experience, especially the technical tooling. Working as fast as I can to understand the right way to approach the tech, as there are tradeoffs with performance and price. I'm at support @ searchagora .com


What strategies did you consider or implement to attract more users, and what would you do differently now to ensure better user acquisition?


We had no capital, so advertising or solutions that basically involved "throwing money at the problem" were off the table for us.

We spent time posting in forums helping people find items they were looking for, and we had a few posts here on HN that generated short-lived, explosive traffic bursts. I remember those days we had posts get picked up on HN, it was always an exciting night!

We were looking at influencers and at getting our name out by getting bloggers to talk about us, but, again, without capital, our options were very limited here. I'm sure someone with more of a marketing background would have found a bunch of ways we could have generated organic user growth, but neither I nor my business partner had that skill set.

If I were to do it again, I think I would try to get someone with a marketing background involved to help gain traction. Without that, even the best product in the world will die of starvation if no one finds it.


Looks like symptoms of no market. Maybe you were solving a problem already solved by Amazon? Most shops on Shopify also use Amazon.


Many shops do double list, this is true. However, I don't think it's a solved problem. There are many people who do not want to shop on Amazon for their own reasons. There are also people who want to shop locally, and Amazon provides no mechanism to do so (that I'm aware of). There are also many smaller shops who simply cannot afford to list on Amazon, as there are considerable fees associated with running a successful business there. It was these smaller shops who we were initially building to serve, to provide a funnel for them.

Still, there were problems with our solution that if addressed may have provided a better market fit. If we had had more runway, we would have worked to address them, but that simply wasn't in the cards.


To me it seems like a small market. And worse, it's hard to conquer that small market since it's very fragmented. Even if you had money for advertisements, it still would have been hard.

On the plus side, though, if you had the skills to build that platform, you certainly have the skill to build a more profitable and easier to monetize platform.


>looks like simptoms of no market. maybe you were solving a problem already solved by amazon ? most shops on shopify also use amazon

FAANGS get around this by creating problems that they will offer to solve.


Not in all countries though. In many countries Amazon isn't present, or isn't as popular or omnipresent.

That's an opportunity, I guess.


> We spent time posting in forums helping people find items they were looking for,

Did you run any analytics on how much overlap there was across Shopify sites on "similar items" (Alibaba resellers/dropshippers)?


We didn't, no, but we spent a lot of time sifting through our catalog, and there was a _tremendous_ amount of crap in there. We manually curated and purged shops that were obviously just dropshipping or looked like outright scams.


Can't you sample ten random products, then ask an LLM to rate the shop on a scale from drop-shipped to artisanal as a first approximation?
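A minimal sketch of that suggestion in Python, under stated assumptions: the 1-10 scale, the prompt wording, and the `ask_llm` callable are all hypothetical stand-ins (no particular LLM API is implied), and the model is assumed to reply with a bare integer.

```python
import random
from typing import Callable, Optional, Sequence


def build_rating_prompt(shop_name: str, titles: Sequence[str]) -> str:
    """Build a prompt asking for a 1-10 score (1 = drop-shipped, 10 = artisanal)."""
    listing = "\n".join(f"- {title}" for title in titles)
    return (
        f"Here are sample products from the shop '{shop_name}':\n"
        f"{listing}\n"
        "On a scale from 1 (pure drop shipping) to 10 (artisanal), "
        "rate this shop. Answer with a single integer."
    )


def rate_shop(
    shop_name: str,
    product_titles: Sequence[str],
    ask_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    sample_size: int = 10,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample up to `sample_size` products and return the clamped LLM score."""
    rng = rng or random.Random()
    titles = list(product_titles)
    sample = rng.sample(titles, min(sample_size, len(titles)))
    reply = ask_llm(build_rating_prompt(shop_name, sample))
    score = int(reply.strip())  # assumes the reply is a bare integer
    return max(1, min(10, score))
```

A score like this would only be a first-pass filter; shops near the threshold would still need a manual look.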


I doubt it would be that easy, but, ya, using some form of automation is necessary. We devised a few rudimentary ways to filter out the chaff, and they did quite well at removing the garbage. Still, some would slip through, so it required vigilance to remove them when you happened to see them.


Wow, it's cool to see this idea trending on HN! Full disclosure, I'm one of the co-founders at https://www.marmalade.co. Speaking from personal experience, it’s been a long road getting from the universe of all Shopify products to a curated inventory that’s easy for people to shop on. While ChatGPT isn't going to replace human curation anytime soon, the AI tailwind has made it much easier to build search and recommendation systems. On our end, we've definitely caught the semantic search bug. Watch out for it - you’ll wake up one day with a cross-modal hybrid search index on pinecone and any number of models on huggingface :). However, as you rightly point out, user growth is still the key. We're working toward launching a community aspect of the platform in the coming months as a solution.


Your site looks good, and your results are fantastic! Job well done. I did hit a server error though, so obviously still some issues to work out, but overall, really well done. Moving to semantic search was one of my top priorities before we went under, but I struggled to justify the costs of it as we were operating on a shoestring budget.

Best of luck to you and your team on user acquisition!


What was the process for scraping 25M products?

I have always used standard python tools like selenium, bs4 and the like. But I'm guessing none of these work at scale.

Could you talk about your process and key bottlenecks at that scale a little bit? Also, how much did it cost?

______________

A recommendation for how to improve search.

Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-caption...) and generate captions for all your images.

Then for search, a simple vector store index would be a great retrieval solution here. It is better to run search over the generated captions as well.

Both are pretty cheap and can be done reliably within 20-30 lines of code each in python. 3rd party tools for these are pretty stable.
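To give a rough idea of what "a simple vector store index" means, here is a brute-force sketch in plain Python. It assumes embeddings for the captions already exist (how they are produced is out of scope here), and the names `cosine` and `top_k` are illustrative, not from any library. A real deployment would likely use an approximate-nearest-neighbour index instead of a linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],  # product id -> caption embedding
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k product ids whose embeddings are closest to the query."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With toy two-dimensional vectors, `top_k([1.0, 0.0], {"red-shoe": [1.0, 0.0], "blue-hat": [0.0, 1.0]}, k=1)` ranks `red-shoe` first.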


Great suggestions, looking into this right now. First time building something like this so definitely new to some of these tools.

For scraping: Found that every Shopify store has a public JSON file that is available at the same route. The JSON file lives at [Base URL]/products.json. For example, the store for Wild Fox has their JSON file available here: https://www.wildfox.com/products.json.

Built a crawler in plain JavaScript to run through a list of stores that I bought from a site called "Built With", access each store's JSON file with the product listing data, and scrape the exact data we want for Agora. I then store it in Mongo and currently use Mongo Atlas Search (I saw they released Vector Search but haven't looked at it). It has been a process of trial and error to pick the data fields that are required for the front-end experience without drastically increasing the size of the data set. And after initially using React, I switched to Next.js to make it easier to structure the URLs of each product listing page.
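The crawl step described above can be sketched roughly as follows. This is in Python rather than the JavaScript the crawler was actually written in, and it is not Agora's code. The field names (`products`, `variants`, `price`, `images`, `src`, `handle`, `vendor`) are assumptions about the payload shape, and pagination, rate limiting, and error handling are left out.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urljoin
from urllib.request import urlopen


def products_url(base_url: str) -> str:
    """Build the [Base URL]/products.json route for a store."""
    return urljoin(base_url.rstrip("/") + "/", "products.json")


def slim_product(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields a search front end needs (tags are dropped)."""
    variants = raw.get("variants") or []  # assumed field name
    images = raw.get("images") or []      # assumed field name
    return {
        "id": raw.get("id"),
        "title": raw.get("title"),
        "handle": raw.get("handle"),
        "vendor": raw.get("vendor"),
        "price": variants[0].get("price") if variants else None,
        "image": images[0].get("src") if images else None,
    }


def parse_products(payload: str) -> List[Dict[str, Any]]:
    """Parse a products.json payload into slimmed-down product records."""
    data = json.loads(payload)
    return [slim_product(p) for p in data.get("products", [])]


def fetch_products(base_url: str, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Download and parse one store's product listing (first page only)."""
    with urlopen(products_url(base_url), timeout=timeout) as resp:
        return parse_products(resp.read().decode("utf-8"))
```

Storing only the image URL rather than the image file, as the sketch does, keeps the records small.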

Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.

A few improvements that have helped so far:

- Having 2 separate Search Indexes, one for the 'brand' and one for the 'product'. There's a second public JSON file that is available on all Shopify stores with relevant store data at [Base URL]/meta.json. For example: https://wildfox.com/meta.json

- Removing the "tags" that are provided by store owners on Shopify. I believe these are placed for SEO reasons. These were 1 - 50 words / product so removing these reduced the data size we're dealing with. The tradeoff is that they can't be used to improve the search experience now.

Hope this helps. Still wrapping my head around all of this.


2.2k/mo right off the bat is pretty steep, especially if you're paying that while the search response reliably takes over 10 seconds.

Why would you shovel 1.5k into MongoDB's pockets right off the bat? Especially when ElasticSearch is much better suited to what you're trying to do?


Sounds like someone drank the Mongo kool-aid. You absolutely do not need Mongo, let alone Mongo Atlas. 25 million documents with ecommerce products is measly and should fit in a single 600 GB server.


Probably not even that - 25mil is nothing really. A normalised schema in an RDBMS would handle that without sweating.


You could run this entire stack (yes, even for 25 million products) using Kubernetes in a $40/month Linode + Elasticsearch + Cloudflare free plan.


If you're already on AWS, I recommend switching to postgres for now. For context, I have 3 RDS instances, each multi zone, with the biggest instance storing several billion records. My total bill for all 3 last month was $661.

Postgres has full text search, vector search, and jsonb. With jsonb you can store and index json documents like you would in Mongo.

- https://www.postgresql.org/docs/current/textsearch.html - https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-rd...


You can even do Elastic-level full text search in Postgres with pg_bm25 (disclaimer: I am one of the makers of pg_bm25). Postgres truly rules, agree on the rec :)
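For readers unfamiliar with BM25, the ranking function behind this kind of full-text search, here is a small in-memory sketch in Python. It is not how pg_bm25 or Elasticsearch implement it; the whitespace tokenizer, the `k1`/`b` defaults, and the class name are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class BM25Index:
    """Tiny in-memory Okapi BM25 index over {doc_id: text}."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_freqs = {d: Counter(tokenize(t)) for d, t in docs.items()}
        self.doc_len = {d: sum(tf.values()) for d, tf in self.term_freqs.items()}
        self.avg_len = sum(self.doc_len.values()) / max(len(docs), 1)
        self.doc_freq: Counter = Counter()
        for tf in self.term_freqs.values():
            self.doc_freq.update(tf.keys())  # number of docs containing each term

    def idf(self, term: str) -> float:
        n = self.doc_freq.get(term, 0)
        total = len(self.term_freqs)
        return math.log(1 + (total - n + 0.5) / (n + 0.5))

    def score(self, query: str, doc_id: str) -> float:
        tf = self.term_freqs[doc_id]
        # Length normalisation: longer-than-average documents are penalised.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def search(self, query: str, k: int = 10) -> List[Tuple[str, float]]:
        scored = [(d, self.score(query, d)) for d in self.term_freqs]
        hits = [(d, s) for d, s in scored if s > 0]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)[:k]
```

On a three-product toy catalog, a query for "red shoes" ranks the product containing both words above products that match only one of them.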


I have trouble seeing how this is possible.

$220 per instance gets you 8 GB of RAM, which is way, way below the index size if you are indexing billions of vectors.


how big is the disk for the biggest instance?


Pretty small still at 500gb. It only stores hot data right now and a subset of what's important. Most of our data is in S3.


Disclaimer: I am building https://pricetracker.wtf

You may want to look at Hetzner, and cut your costs by about 90%.

Feel free to reach me, email in profile.


In your footer you have a lot of links like "kitchenaid price tracker" and "best buy price tracker". Have these links helped?


hey! this is cool, I take it you are based in the US?

How long have you been working on this?


On and off for a year, with more time allocated since June. Yes I am in California.


I’ll second the comments that $2k/month is alarmingly high, especially for the performance that you seem to be getting. When I shoved ~40M webpages into a stock ElasticSearch instance running on a 2013-era server I bought for $200 (on eBay), it handled the load when I hit the HN front page just fine. Either you’re being drastically overcharged or there’s something horribly inefficient in your setup that could probably be tweaked fairly easily to bring your prices down.


I'm biased, but I'd recommend exploring Typesense for search.

It's an open source alternative to Algolia + Pinecone, optimized for speed (since it's in-memory) and an out-of-the-box dev experience. E-commerce is also a very common use-case I see among our users.

Here's a live demo with 32M songs: https://songs-search.typesense.org/

Disclaimer: I work on Typesense.


I can also highly recommend TypeSense and have no affiliation. You'll save a lot of money and get much faster results.


You’re spending $2k/mo to run this?? Holy hell.


> I'm currently not storing the image files, so that reduces the cost as well.

I wonder if someone catches on and replaces all your image URLs to the fuzzy testicle egg cup[0], will that negatively impact reputation?

0: http://i.imgur.com/32R3qLv.png


I index 40M paragraphs of legal text, bm25 and vector similarity search, at < 200ms query time, on a single $80/month Hetzner server. Email in profile if you’d like to talk.


>Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.

It will probably cost you just $100 to rent a server from Hetzner and do the same thing. I would also use Redis or another kind of cache to hit the DB less.
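As an illustration of the caching idea, here is a minimal in-process sketch in Python. It is not Redis: a shared cache like Redis would serve the same purpose across multiple servers, while this version lives in one process. The class name, the 60-second default, and the `compute` callback are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache query results for a short time so repeat searches skip the DB."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or call `compute` and cache its result."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()  # e.g. the actual database search
        self._store[key] = (now, value)
        return value
```

Popular queries such as "red shoes" would then hit the database at most once per TTL window. Note that this sketch never evicts old keys, so a real version would need a size bound.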


Take a look at TypeSense. Faster, better filtering, and much much cheaper if you’re going the cloud version


Sounds like you used an incorrect instance type/size on Atlas


> site called "Built With",

Do you have a link? And are they any good?



I specifically asked the author if he could add some extra info on Builtwith.

I can Google. But then I don't know if it's truly the site the author was talking about. And I certainly don't know his or her insights on that site.


Berkes wanted to do good by sharing a provision with the OP, in case he/she buys something at builtwith.

We all know how to Google. :)


Managed Elasticsearch could cut your cost by at least an order of magnitude.


Oh... no... $1500/mo?


Yo fuck mongo just use RDS or some digitalocean DB. Or really just use opensearch/elasticsearch, or even typesense (don't bother with raft it's so broken) or meilisearch


We’ve interacted before on Twitter and GitHub, and I want to address your point about Raft in Typesense since you mention it explicitly:

I can confidently say that Raft in Typesense is NOT broken.

We run thousands of clusters on Typesense Cloud serving close to 2 Billion searches per month, reliably.

We have airlines using us, a few national retailers with 100s of physical stores in their POS systems, logistic companies for scheduling, food delivery apps, large entertainment sites, etc - collectively these are use cases where a downtime of even an hour could cause millions of dollars in loss. And we power these reliably on Typesense Cloud, using Raft.

For an n-node cluster, the Raft protocol only guarantees auto-recovery for a failure of up to (n-1)/2 nodes (rounded down). Beyond that, manual intervention is needed. This is by design to prevent a split-brain situation. This is not a Typesense thing, but a Raft protocol thing.
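The arithmetic behind that limit, as a small Python sketch (the function names are illustrative): a Raft cluster needs a majority of nodes to make progress, so the number of failures it can absorb is whatever is left after a majority.

```python
def raft_quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def raft_tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains: floor((n - 1) / 2)."""
    return (n - 1) // 2
```

For example, a 3-node cluster has a quorum of 2 and tolerates 1 failure, and a 5-node cluster tolerates 2. A 4-node cluster still tolerates only 1, which is why odd cluster sizes are the usual choice.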


As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.


25 million products is really not much at all to scrape.


>I have always used standard python tools like selenium, bs4 and the like

There's nothing to scrape. You just download a JSON file that the site owners kindly put at your disposal.

Scraping is a more complex process, where you have to work around rate limiting and captchas. For the tool I built I wrote tens of thousands of lines of code and I still find daily issues I have to deal with if I want to scrape a particular web page, issues I don't always have the time to solve.


I love your approach; you found a problem and developed a solution for it. And then you got the courage to share with the larger technical community. Good on you.

There are obviously some rough edges (multiple duplicate products, issues with product links linking to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.

Keep going! At the least, you'll come out of this with an excellent project in your portfolio.


Thank you, that means a lot. It has definitely been a whirlwind of emotions since posting on HN but glad I did. It's definitely an MVP so going to work fast to improve it.


Shopify has tried a few times to build a tool like this but hasn’t ever managed to get any traction. I think that missing any curation at all could be what eventually kills it. Their current attempt is https://shop.app and a query for red shoes is mostly red shoes.


Ya, curation is sadly required in the Shopify ecosystem. There are millions of shops, and there is a tonne of garbage. It's also difficult (but not impossible) to properly classify items so that you can better target results for a given query. One of the first problems that anyone attempting this will run into is the amount of mature content available on Shopify shops. Innocent queries turn up many NSFW images that may offend some users, so you have to be able to get on top of that one pretty quick.

I remember in one case, I found what appeared to be an escort service listing "models" on Shopify. It was super creepy. I needed to get in front of that one pretty quick as well, as it was turning up in results.


> a query for red shoes is mostly red shoes

well I get mostly black shoes lol

Edit: ah no, they just use half a page for shoe shops first with black shoes as logo??


ads baby


I built this a couple years ago (now defunct) for the same reason :) The public JSON endpoints on shopify stores make it pretty easy to get the data. You mentioned using Mongo but it sounds expensive. I honestly think you could do this with just elastic or even postgres full text search and save money.

Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)
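A hedged sketch of how such a link could be built in Python. The `/cart/{variant_id}` route comes from the example above; the `:{quantity}` suffix and the `discount` query parameter are assumptions about how quantities and coupon codes are expressed, and whether a given store still honors the route is not guaranteed.

```python
from typing import Optional
from urllib.parse import urlencode


def cart_permalink(
    shop_domain: str,
    variant_id: int,
    quantity: int = 1,
    discount_code: Optional[str] = None,
) -> str:
    """Build a direct-to-cart URL for one variant (URL format is assumed)."""
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    url = f"https://{shop_domain}/cart/{variant_id}:{quantity}"
    if discount_code:
        # Assumed parameter name for coupon codes.
        url += "?" + urlencode({"discount": discount_code})
    return url
```

For instance, `cart_permalink("hapaboardshop.com", 42165521907955, quantity=2)` builds a link for two units of the variant in the example.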

A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad Shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.


Thanks for the heads up! I spent some time trying to get the cart route to work. Doesn't seem to be supported anymore (link you sent leads to a 404 page). Tried it with every combination of Product ID, Variant ID, etc. Let me know if you have any ideas on how to get this to work. It would be a great feature to add to Agora.

And I agree on quality over quantity. Writing a script to remove all stores that are shutdown, products that are sold out, and a few other characteristics. Heavily focusing on the search algorithm and data quality now.
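A cleanup pass like that could look roughly like the following Python sketch. The record shape is hypothetical: it assumes each stored product keeps a `store` key and a `variants` list whose entries carry an `available` flag, and that a set of shut-down store domains has been collected separately (for example, stores whose product listing no longer loads).

```python
from typing import Any, Dict, Iterable, List, Set


def is_sold_out(product: Dict[str, Any]) -> bool:
    """A product counts as sold out when none of its variants is available."""
    variants = product.get("variants") or []  # assumed field name
    return not any(v.get("available") for v in variants)


def prune_catalog(
    products: Iterable[Dict[str, Any]],
    closed_stores: Set[str],  # hypothetical: domains of shut-down stores
) -> List[Dict[str, Any]]:
    """Drop products from closed stores and products that are sold out."""
    return [
        p for p in products
        if p.get("store") not in closed_stores and not is_sold_out(p)
    ]
```

Running this periodically, rather than once, would also catch stores that close or sell out after the initial crawl.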


I didn't know about the link to checkout. That's a slightly nicer user experience for sure. Still, it's confusing for users who want to do more shopping at the same time. I had users who clicked on a number of items, clicked "add to cart" in each one (all different shops), and then couldn't figure out how to checkout on the main site afterwards! Obviously people were looking for a more complete one-stop-shopping experience than I was providing at the time.


I mean a single checkout from multiple shopify stores isn't really possible (at least by 3rd parties)

My hypothesis is that, if you could drive traffic to your site and offer a fast checkout experience, there's probably multiple ways to monetize that. Driving the traffic is the hard part.


>otherwise you risk your site being littered with kitchy t-shirts or drop-shipping garbage.

You mean like Amazon?


Hey, I have a Shopify store that sells e-paper calendars / smart screens. I tried to search for it but I could not find it. What should I do so your crawler can find me?

https://shop.invisible-computers.com


You’re live on Agora:

https://www.searchagora.com/products/invisible-calendar-6266...

Thinking that we should have a page where store owners can submit their URL to be crawled.


Cool, thanks!


Super cool product! I'm currently using a list of Shopify stores, so it's still limited (i.e. wanted to start with a relatively small list to focus on the search experience). I'll submit your URL to the crawler now. If you want to reach out to support @ searchagora.com , I'd love to get your feedback as a Shopify store owner.


Hi, you could drop an email to onboard@peppyhop.com and we will be happy to onboard you. Please mention your target geography, for example whether you would like to target the Indian market or the US market.


always be closing lol


There are a few conferences dedicated to ecommerce search. Mices is pretty good. I did not go there this year but I know some of the people behind it. Good community and lots of stuff happening.

Two points here.

- 25 million is really not a lot for most search engines. Something like Elasticsearch can easily deal with that if you deal with it properly. And there are plenty of equally capable solutions. I have worked with logging clusters that process log entries in those numbers on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this. But a couple of simple servers with decent CPUs and memory and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used. Anything below that is easy to deal with.

- The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query and your job is to pick the best 3, 5, 10 (whatever fits on your screen) ones. This is hard.

So, what makes for a good answer is the key question to answer. All the naive solutions for this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low quality search engine not quite solving the problem. The bar is high these days for a good search engine and most of the better ecommerce companies have highly skilled search teams working on this.


Great site. Having built a search engine that needed to handle product data on a similar scale, it's not an easy thing to manage.

Some observations:

- Don't use infinite scrolling, it's an outdated UI practice that leads to bad user experience. It also makes the footer entirely unviewable.

- Clicking on a product card image does not reliably open up the product. I have to randomly click on it a few times (Chrome, Brave)

- Clicking on product card image and title leads to different actions, this is a bit unexpected, should show some hint of the difference.

- The product page pop up will reset the search list when closed, this messes up my search navigation, breaks the flow of browsing.


Searching is slow (kinda expected that right now), but after clicking a product and then hitting back, I have to wait for the search again.

Not at computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to load search results again.


Just upgraded the storage and put in a few fixes so it's working a bit faster now. Working on caching some responses locally as we speak. Great idea.


Have Swedish family. Searched dala because family wants traditional Christmas ornaments. Sure enough, there were several results that were 10x cheaper than what I could find on the first page of Big Search Company. Great job!


Amazing, glad you were able to find it. I also just learned about what a "Dala" is :)


The Terms page goes to "Jaggi Enterprises", "A Modern Investment Fund. We buy, build, and invest in software companies with recurring revenue.".

So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?


Jaggi is a "fake it until you make it" portfolio. Most of the companies it runs are just lorem ipsum placeholder sites. I think it is likely true that this is a solo dev.



You mean the site is not owned by Jaggi?

Then why would the terms and privacy links go to Jaggi?


OP here. Yup, I am in the process of starting a holding company LLC for my software products and small investments. Just went ahead and deleted 2 from the Investments page that are not launched yet but still in-development (just had landing pages up for those). Wasn't planning on releasing the Jaggi site yet, as I'm still wrapping my head around the holding company structure / it's new to me.

Agora has been a side project of mine. TBH in retrospect, I wish I would have given this post more thought as the servers / search performance wasn't prepared for any significant traffic. So definitely didn't game HN.


Agora also doesn't return red shoes for the search query "red shoes". Seems like you haven't fully solved the problem yet :)

From a technical perspective, crawling 25M products is impressive but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure, I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be a better positioning for Agora than simply being a search engine.


Definitely have not solved the problem yet! The search algorithm prioritizes the brand called "Red Wing Shoe" so still figuring out ways to show real 'red shoes'. Have been thinking about passing the images through a detection tool and tagging them to enhance the search experience.

Re: Value Proposition. Absolutely, I think focusing on the SMB-angle and 'local shopping' will help direct users better. I'll definitely take this into account.


Best of luck on your marriage


You could try the method we used for our vector search demo for e-commerce (all open source, natch) - use CLIP to get vector embeddings for product pictures and then use these for boosting or matching. https://opensourceconnections.com/blog/2023/03/22/building-v... Our demo works pretty well for searches like 'blue network cable' when the colour isn't always explicitly mentioned in the product data.
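A rough sketch of the "boosting" idea above, in plain Python. It assumes the query and each product image have already been turned into vectors by a CLIP-style model (that step is not shown), and the `weight` parameter and function names are hypothetical, not taken from the linked demo.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def boosted_score(text_score, query_vec, image_vec, weight=0.5):
    """Add a weighted image-similarity term on top of the keyword relevance score."""
    return text_score + weight * cosine(query_vec, image_vec)

def rank(products, query_vec, weight=0.5):
    """Sort products (dicts with 'text_score' and 'image_vec') by boosted score, best first."""
    return sorted(
        products,
        key=lambda p: boosted_score(p["text_score"], query_vec, p["image_vec"], weight),
        reverse=True,
    )
```

With this shape, a product whose text never mentions the colour can still move up if its picture is close to the query vector.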


I believe Shopify built their own app / website where you can search for products exclusively from Shopify merchants. https://shop.app/


Great project. If you continue to crawl the data, be sure to save it so you can detect price changes a la camelcamelcamel.
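One way to keep that history cheaply is to store a new row only when the price actually changes between crawls. This is a sketch with an in-memory dict standing in for the database; the names are made up for illustration.

```python
from datetime import date
from decimal import Decimal

def record_price(history, product_id, day, price):
    """Append (day, price) to a product's history only if the price changed.

    Returns True when a new entry was stored, False when the price was unchanged.
    """
    entries = history.setdefault(product_id, [])
    if entries and entries[-1][1] == price:
        return False
    entries.append((day, price))
    return True

def lowest_price(history, product_id):
    """Return the lowest recorded price for a product, or None if nothing is recorded."""
    entries = history.get(product_id, [])
    return min((price for _, price in entries), default=None)
```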


For all of Amazon's faults, the fact that they tolerate CCC does drive a lot of my online purchases there. CCC used to track other sites, and was eventually blocked on all of them. If more sites want my business, showing their pricing history (either from internal data, or by letting someone build the DB) would go a long way.


Amazon doesn't allow price alerting/tracking on their affiliate program anymore, you need explicit written consent.

I am the owner of https://pricetracker.wtf and got the boot today.


Is it somehow known that CCC hasn’t been co-opted by Amazon? Frankly I figured Amazon would have bought them out a decade ago, but maybe the CCC founders have a stronger ethical compass than I do.


is camel camel whitelisted by amazon? or can any scraper work


Amazon associates program doesn't normally allow price trackers except with written approval


Great call! I am doing back-ups on Mongo and this is a good use-case for tracking changes. Also trying to figure out how to detect if a product is sold out or no longer being sold.


I worked on a competitor to CamelCamelCamel years ago. We had this exact issue since people would often click through to a page where the price or availability were different from what we were showing

Ultimately, we ended up adding an interstitial page between the product listing on our site and the page on the seller's site

This interstitial checked to see if we checked the price in the last couple of minutes, and if not, it would run a quick scrape of the page to ensure that we had the most up to date information

I can't remember exactly what the messaging or behavior was when there was a difference. I think there was a message that was displayed if the prices were different. Or if the product was actually out of stock, it would pull the user back into our site with a toast explaining that the product was no longer available

Anything less aggressive than this resulted in more customers experiencing price/availability errors or simply leaving the site, and anything more aggressive resulted in angry site owners who were losing bandwidth to our bots
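The freshness check behind that interstitial could look roughly like this. The two-minute window is my reading of "the last couple of minutes", and the function name is hypothetical.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(minutes=2)  # assumption: "a couple of minutes"

def needs_rescrape(last_checked, now, max_age=MAX_AGE):
    """Return True if the stored price is missing or older than max_age."""
    if last_checked is None:
        return True
    return now - last_checked > max_age
```

Widening `max_age` means fewer requests to the seller's site but more stale prices, which is the trade-off described above.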

> Also trying to figure out how to detect is a product is sold out or not being sold anymore

In these cases, either the page will say as much (eg: "Product Unavailable"), have some kind of stock or status code hidden beneath the UI to show that it's not available, or the target page will simply vanish from the web. However, none of these are guarantees. A site could say that a product has been discontinued, but the item could come back later, or under a different SKU, or whatever else
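A sketch of how those signals might be combined for a Shopify listing. It assumes each variant in the product JSON carries a boolean `available` field, which is an assumption about the feed rather than something confirmed in this thread, and as noted above none of these signals is a guarantee.

```python
def classify_listing(http_status, product):
    """Classify a listing as 'gone', 'sold_out', 'available', or 'unknown'.

    http_status: status code from fetching the product URL.
    product: parsed product dict, or None if the body could not be parsed.
    """
    if http_status in (404, 410):
        return "gone"  # the page has vanished from the web
    if http_status != 200 or product is None:
        return "unknown"  # e.g. temporary server error; retry later
    flags = [v.get("available") for v in product.get("variants", [])]
    if not flags or any(f is None for f in flags):
        return "unknown"  # the assumed stock field is missing
    return "available" if any(flags) else "sold_out"
```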


For those unaware, Shopify already has platform wide search. You can use https://shop.app/ (or the app), and it also has some chatbot thing that can offer suggestions


Yes, this has been available for a few years now. Initially, they only indexed a very small number of shops, so it was less useful. Based on a few queries, it seems like they are still using some form of text-based search with rank boosting. Seems like they still aren't searching their entire base of shops, but they have increased the number of shops for sure, and they seem to be continuing to invest in the product, which is nice. It seems more useful now than it did the last time I checked!


Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?


Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).


> and between $100k - $1m in revenue

Does "Built With" provide that data? How accurate do you think it might be?


https://www.shopify.com/robots.txt lists a lot of sitemap files, which tend to be a good starting point.


Did this suddenly get changed? Nothing but "# ,: # ,' | # / : # --' / # \/ />/ # /" is shown now.


It's just your browser's HTML parser. Line 6:

  #                         / <//_\
This is being interpreted as a malformed HTML closing tag, which (according to the HTML5 parsing algorithm published by WHATWG) gets treated as a comment. The file doesn't contain any > past this point. This leaves the uncommented contents from lines 1–6:

  #                               ,:
  #                             ,' |
  #                            /   :
  #                         --'   /
  #                         \/ />/
  #                         /
Or, with whitespace collapsed:

  # ,: # ,' | # / : # --' / # \/ />/ # /
Which should be exactly what you observe.

Ref: https://html.spec.whatwg.org/multipage/parsing.html https://developer.mozilla.org/en-US/docs/Web/CSS/white-space...
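The whitespace-collapsing step can be reproduced in a few lines. This sketch only models the collapsing of whitespace runs (it does not run an HTML parser), and the spacing inside the sample strings is approximate, which does not matter once runs are collapsed.

```python
# The six lines that remain visible before the parser starts treating
# the rest of the file as a comment (copied from the explanation above).
visible_lines = [
    r"#                               ,:",
    r"#                             ,' |",
    r"#                            /   :",
    r"#                         --'   /",
    r"#                         \/ />/",
    r"#                         /",
]

def collapse_whitespace(text):
    """Collapse every run of spaces and newlines into a single space."""
    return " ".join(text.split())

collapsed = collapse_whitespace("\n".join(visible_lines))
print(collapsed)
```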


Weird. I think it did change. Google cache shows a 2229 line file: https://webcache.googleusercontent.com/search?q=cache%3Ahttp...


Seems it might be looking at the referrer. Loading https://www.shopify.com/robots.txt from clicking the link shows the weird line while opening it in a private browser window shows the right one.


For some reason, "view source" gets the right list. Maybe a referer issue like someone else said.


Looks like it's just Shopify's own pages and not anything related to actual stores.


It seems sort of questionable to use the list of things to not scrape as a starting point for scraping.... I mean, I get it's not actually enforced.


Not really sure why all the answers here are flagged, but you may be mistaken.

The robots.txt does not exclusively list what not to scrape.

It provides information on which parts are allowed and which are not (disallowed).

It also provides sitemaps for crawlers as a starting point with more information (eg. which sites are available and how often are they updated, etc.)


Since ~2009 many crawlers recognize "Sitemap:" directives in robots.txt to link to sitemaps: https://en.wikipedia.org/wiki/Robots.txt#Sitemap
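For illustration, pulling the "Sitemap:" lines out of a robots.txt body takes only a few lines. This is a hand-rolled sketch; Python 3.8+ also exposes the same information through `urllib.robotparser.RobotFileParser.site_maps()`. The sample URLs in the usage note are made up.

```python
def sitemap_urls(robots_txt):
    """Return the URLs from all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments (assumes sitemap URLs contain no '#').
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The field name is case-insensitive; the value is everything after the first colon.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

For example, a body containing `Sitemap: https://example.com/sitemap.xml` next to the usual `User-agent` and `Disallow` lines yields just that one URL.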


Shopify shops always have /collections, /products, and /pages in their URL. If you have a regular Shopify site, you're not allowed to change them. I don't know if Shopify Plus clients can change them.

Shopify sites also have shop-name.com/products.json which has URLs that point to cdn.shopify.com


That's funny, I made a domain-specific version of this for Canadian coffee deals.

https://beangrid.mcconomy.org/


Which coffee seems to hit the best in Canada (your take). I find the espresso in Canada hasn't been as good as the coffee brands in the US but I'm open to possibilities.

Also like the project!


I enjoy Café St-Henri's Godshot when I can get it at a discount. Anything from Stereo is great, and I've enjoyed Monogram and De Mello. If I was buying at regular price, I would often get Social Coffee.

Of course I put this together because Black Friday is when I load up on (relatively) cheap coffee and chuck it in the freezer, so this time of year I always branch out and try new places and new offerings from familiar places. I built this list mostly from a Reddit compilation I found, and I've been slowly updating the source URL list as I learn of new Canadian roasters that happen to be Shopify customers.


personal somewhat-pedestrian list: Pilot, Detour, Reunion, Propeller, Phil and Sebastian


Super cool project (especially as a coffee lover myself)!


The fun part was figuring out how I was going to put the site up without hosting ;)


github?


Yes - I have a daily cron-based scrape & commit job which updates the table data source CSV, along with github hosting for the static components.


What technology did you use to build the scraper and how did you get around the usual challenges (anti bot, ip banning, etc) with scraping large amounts of data?


Scraper is built in Javascript and a Mongo database. Probably not the most scalable way to do it, but I found that all Shopify stores have a public JSON file available at [Base URL]/products.json. So found a list of stores, built a crawler to go store-by-store, and standardized the data on my end.

Here's an example: https://www.wildfox.com/products.json
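The store-by-store crawl and standardization step might be sketched like this (in Python rather than the JavaScript OP used). The `limit` and `page` query parameters and the `title` / `handle` / `vendor` fields are assumptions about the feed, and the network fetch itself is left out so the sketch stays self-contained.

```python
from urllib.parse import urlencode

def products_page_url(base_url, page, limit=250):
    """Build the URL for one page of a store's public product feed.

    Assumption: the feed accepts 'limit' and 'page' query parameters.
    """
    query = urlencode({"limit": limit, "page": page})
    return f"{base_url.rstrip('/')}/products.json?{query}"

def normalize(base_url, product):
    """Reduce one raw product dict to the fields a search index would need."""
    handle = product.get("handle", "")
    return {
        "store": base_url.rstrip("/"),
        "title": product.get("title", ""),
        "vendor": product.get("vendor", ""),
        "url": f"{base_url.rstrip('/')}/products/{handle}",
    }
```

A crawler would request page 1, 2, 3, ... for each store and stop when a page comes back with an empty product list.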


What’s the trade off using js for this? Would it have been much faster to use go or something?


Oh nice, you deserve great things in life for this comment!


How did you detect that it was a Shopify store?


Not OP but:

"Does the site have a /products.json file?" is a good first check :) And if it does, "Does that format match the format of a Shopify store?" is another good followup question.


There are lots of telltale endpoints that you could just HEAD for a 200 vs 404. Or even just the products.json itself is a pretty good giveaway.

Or an even better way I’ve done in the past (to check which competitor’s platform a list of prospects is using in bulk) is just to use the DNS — a Shopify shop will be CNAMEd to a certain Shopify hostname.
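Both checks from the comments above (feed format and DNS) can be sketched as pure functions. The required keys and the `myshopify.com` suffix are assumptions for illustration, and the DNS lookup itself is not shown: the function just takes whatever CNAME target a resolver returned.

```python
REQUIRED_PRODUCT_KEYS = {"id", "title", "handle", "variants"}  # assumed feed fields
SHOPIFY_CNAME_SUFFIX = "myshopify.com"  # assumed CNAME target domain

def looks_like_shopify_products(payload):
    """Return True if a parsed /products.json body has the expected shape.

    An empty product list still counts, since a new store may have no products.
    """
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    return all(
        isinstance(p, dict) and REQUIRED_PRODUCT_KEYS.issubset(p) for p in products
    )

def is_shopify_cname(cname_target):
    """Return True if a CNAME target is the assumed Shopify domain or a subdomain of it."""
    host = cname_target.rstrip(".").lower()
    return host == SHOPIFY_CNAME_SUFFIX or host.endswith("." + SHOPIFY_CNAME_SUFFIX)
```

The DNS check is cheaper in bulk but will miss stores that sit behind another proxy or CDN, so the two checks complement each other.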


In another comment, OP wrote:

> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).


Ah, that makes more sense, I used BuiltWith before.


Looked in the source of a random Shopify store, there are 200+ occurrences of "shopify", that's a clue :)


Did you only get the schema.json?


Excellent work


ooo that is a hot tip!


(not the OP, but I have some experience with Shopify)

Shopify stores publish their product catalog at /products.json. From personal experience, you can hammer it pretty hard without being rate limited.

A challenge is that the pricing info in that endpoint is based on the stock Shopify catalog fields, and can be misleading depending on the specific theme customizations that the merchant uses.


Cool project!

As you scale, you may benefit from these two projects I maintain, which Big Tech also uses :)

https://github.com/unum-cloud/usearch - for faster search

https://github.com/unum-cloud/uform - for cheaper multi-lingual multi-modal embeddings

Feel free to reach out with feedback and feature requests!


Can't believe I missed this. Taking a look at both repos now. The further I get into this space, the less I feel like I know. Appreciate you sharing, will reach out!


Don’t worry, our solutions aren’t well known. They are either used by enthusiasts, or the Big Tech, and the latter don’t like mentioning that :)


I've been following your work for a while, was really excited to play with UDisk but I guess that got dropped in favor of AI solutions?


Due to constrained resources we had to prioritize the smaller and simpler projects - USearch, UForm, UCall, and on the personal side - StringZilla, and SimSIMD. That’s a lot for a team without revenue and VC funding.

I am still actively thinking about open-sourcing UDisk. Thanks for keeping tabs on us :)


This is cool


When I search for “op-1”, partial match like “Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op” gets ranked higher than “teenage engineering op-1”. I would expect the opposite.
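One possible fix is a re-ranking pass that puts titles containing the query as an exact token sequence ahead of partial matches. This is only a sketch: I don't know how Agora tokenizes, and the function names are made up.

```python
import string

def _tokens(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation.

    Inner hyphens are kept, so "op-1" stays a single token.
    """
    stripped = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in stripped if t]

def has_exact_phrase(query, title):
    """Return True if the query's tokens appear consecutively in the title."""
    q, t = _tokens(query), _tokens(title)
    n = len(q)
    return n > 0 and any(t[i:i + n] == q for i in range(len(t) - n + 1))

def rerank(query, titles):
    """Move exact-phrase matches to the front, keeping the original order otherwise."""
    return sorted(titles, key=lambda title: not has_exact_phrase(query, title))
```

Because `sorted` is stable, results within each group keep whatever order the underlying engine gave them.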


Really neat. I tried your search for red shoes, and I found some, er, unexpected imagery on page 1.

One thing you could do is add semantic search so when a user searches "red shoes," the index returns images that look like red shoes even if the metadata doesn't say anything about color or item types. To do this, I'd use a model like CLIP. Here's an example of using CLIP and Supabase to do semantic image search: https://blog.roboflow.com/how-to-use-semantic-search-supabas...


Awesome, thanks for the suggestion / link! Actually left another comment about potentially doing semantic image search to improve results so wrapping my head around it now.



I see some Red Wing shoe products first, then some bearings, some shoes, then someone in lingerie with red heels, and then more red shoes. I know Google is the horse to beat right now but they showed me nothing but red shoes to buy plus a few ancillary results for such a generic search - and if I was a good husband I would add a few words to narrow my search.


This is great - just a couple UI things bugging me. 1. When clicking "Open" on a product, the user should be able to open that in a separate tab. Currently that's not possible; I'm sure because it's being delivered in a single page (can't check now because you're getting hugged to death by HN).

2. When the server's slow, as it just was, there should be some kind of waiter / loader to immediately show the user that the "Open" click was sent on a product. Otherwise people will keep clicking it (or worse, clicking other products) and there's no indication that it's loading.

3. Once a product is open, it's not clear how to get back from it. I see the "X" in the corner, but doing that seems to take me back to a blank search page, not to my search results. The back button also doesn't take me back to the search results...


Thanks for sharing this! Definitely wasn't expecting this level of traffic so didn't account for some front-end loading experiences. Implementing these now.

For 3, thinking to let the back button work the same as the "x", that way a user can return to where they are in a search result regardless of what they click on.


HN— Not sure if anyone will see this but I wanted to thank you all for the support. Although I haven't slept much since going live, it has been amazing getting early feedback from the community.

Agora is still in MVP stage but getting better by the day. Just pushed a big update: fixing an image shifting bug, a blur effect on loading, Redis for caching, brand pages, architecture fixes, and several other things. Currently working on improving the relevancy algorithm, adding all ~5 million Shopify stores, and then adding WooCommerce stores over the next few days.

If you have any suggestions or ideas, reach out to me at support @ searchagora .com :)


On the page where you show details about the product, I would like to have it include the same product from other Shopify stores by doing an image similarity comparison.

And then highlight how the price compares.

For example, here are some pretty crazy red shoes. But they are too expensive for me. Would be interesting to see if this is the only store selling these shoes, or if someone else has the same shoes much cheaper.

https://www.searchagora.com/products/vasco-4-47fb0f87-5b89-4...


How are you planning to monetize this? You mentioned you are spending around $2K just to run it. Is there a commission strategy or ads? Or populate with your products at one point so you sell your own thing?


It looks like from the "become a merchant" you can directly boost your products which I assume will prioritize them and hence drive more traffic to your Shopify store


Idea! Shopify has a ton of resellers that sell junk from China. If you figure out how to avoid them, your life would be 10x easier.


I was wondering on a tech solution for this lately.

What I now do when I shop, is that I compare the images (and descriptions) of listings on Amazon, bol.com or Shopify with listings on Ali and Temu.

If they're exactly the same, run. If they differ slightly, look closer (and most likely run). I guess automating this could make for a solution to at least detect cheap resellers.


This is amazing for finding cute collectibles from my favorite TV show that I would otherwise not have noticed among random t-shirts and other "slap the picture and call it co-branded" products! I'm not super sure how long it is going to be around, but I think I'm gonna keep playing with it for a while.


Really happy to hear this. I'll do my best to keep this around :)


I searched for 'pão francês' and my store was the #1 result. I think you're doing it right! :)


Awesome! It would be good to listen to the enter key when typing in a search query. Your privacy and terms links point to what appears to be the saas code framework you used (just a guess). I was looking for your contact/email so I can ask you some questions.


Enter key should work, but the loading speed is very slow right now. Fixing that right now :)

You can reach out at support @ searchagora.com. Would love to talk!


Hey, I'm the CEO of Meilisearch. If your issue is performance, I would love to give you a try with Meilisearch. You'll be able to create an "as you type" experience with our engine that responds in less than 50ms!


Do you plan to add filters: price etc?

I was about to add 'reviews' as well to the above list but decided not to, as they are not always trustworthy. AI is now advanced enough that it could be used to detect fake reviews and exclude them from sampling.


Yes. There's a very basic price range filter right now. Working on adding a ships to, location, and a few others. Open to any ideas that would help in the shopping experience.

There are 'reviews' now and made the decision to only let authenticated users leave reviews so they are more trustworthy (i.e. thinking is that adding more friction will lead to higher quality reviews).


cool project. You might have noticed, but there's a non-trivial amount of fraud on Shopify (fake shops, info stealers, etc). Might be interesting to look at that dataset and explore a bit =) it's quite fascinating


I've definitely noticed that already. Any advice on how to spot that?

Another challenge is that there are products sold on the original site and third-party marketplaces, both of which could sell on Shopify. So need to find a way to automatically detect the type of store.


you might check this for inspiration (https://seguranca-informatica.pt/shopping-trap-the-online-st...).

I used to have a huge IOC collection, but now stopped tracking them.

things like HTML markup, pricing patterns, IPs might inform on specific clusters of fraudsters.

I don't think Shopify cares tbh


Super helpful, thank you.


What are the telltales, for spotting those?

And how aggressively does Shopify verify/police them?


I have no clue how to implement a search but maybe some words are more important than others.

I searched for "mens dress shirts button long sleeve" and after about 6 results it was all women's clothing.


I'm a Shopify store owner myself. I saw there is a $99 per month to get your product verified, how would this compete in terms of CPC with a traditional channel such as google ads or meta ads?


Not the OP but counting on an average of $3 per click in the US across those channels, I'd say this pricing is way more effective with the amount of searches the site is getting.
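As back-of-the-envelope arithmetic, using the $99 per month fee and the $3 average cost per click quoted above (both taken at face value, and ignoring conversion rates):

```python
def break_even_clicks(monthly_fee, cost_per_click):
    """Clicks per month at which a flat fee costs the same as paying per click."""
    if cost_per_click <= 0:
        raise ValueError("cost_per_click must be positive")
    return monthly_fee / cost_per_click

print(break_even_clicks(99, 3))  # 33.0
```

So the flat fee only wins if a boosted product gets more than about 33 clicks a month.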


Amazing job! I have one question: how did you find the price of every product? I mean, every product page has a different id or class that identifies the price. Do you use a regex?


Thanks! Actually a lot easier than you'd expect. Not touching anything on the front-end of the Shopify store.

Every Shopify store has a public JSON file at [Base URL]/products.json with 'price' as a field. Example here: https://wildfox.com/products.json

One thing I messed up on originally was not pulling the 'currency' field which is actually in a different public JSON file called 'meta.json'. Example here: https://wildfox.com/meta.json

Separately, this was the primary reason to only start with US stores: to make sure the currency shows up correctly and to purposely limit the initial audience to keep loading times reasonable. Working on adding all Shopify stores in the world now (a list of about 5 million active stores from what I have found).
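Joining the two files could look like the sketch below. It assumes the price sits on each variant as a decimal string and that meta.json exposes a top-level `currency` field; both are assumptions about the JSON layout, and only the 'price' and 'currency' field names come from the comment above.

```python
from decimal import Decimal

def variant_prices(product, meta):
    """Pair each variant's price (from products.json) with the store currency (from meta.json).

    Returns a list of (variant_title, price, currency) tuples.
    """
    currency = meta.get("currency")
    rows = []
    for variant in product.get("variants", []):
        raw_price = variant.get("price")
        if raw_price is None:
            continue  # skip variants without a price rather than guessing
        rows.append((variant.get("title"), Decimal(str(raw_price)), currency))
    return rows
```

Using `Decimal` instead of floats avoids rounding surprises when prices are later compared or summed.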


Clear! Thanks!! :)


Aside: The ending of the 1948 "The Red Shoes" was funny to me, but I think I was a little loopy after slogging thru it. I don't know if I recommend it or not.


I definitely need to watch it now.


I like it.

I need to be able to filter search to if it will deliver to my country.

It desperately needs some indication that your action is being processed, like a spinner, when you search.


Absolutely. Working on a "ships to" filter and enhancing the 'price' filter.

Also fixing the loading experience as we speak. Wasn't expecting this level of traffic so didn't account for slow server speed with the front-end experience.


Or even "ships from" / "located in".

I'm in Europe and don't want to deal with customs hassles or delays from shipping. Etsy and Reverb both have this option which I never fail to use.


Thanks! It is reasonably fast but slow enough that a cue is needed so I know the input triggered.


What's your revenue model? I see you expanded on the details of your $1.5K monthly cost, but I fail to see how you make money. Affiliate fees?


Right now, charging Shopify store owners $99 / product / month to give them a 'verified' tag and boost their product in search results. Currently not making money on affiliate fees.

I wanted to first prove that people would actually use this / find value in it. Fortunately a few merchants have reached out already via email to talk through the business model so this will likely evolve as we learn more.


What are you verifying?


wow! Nice work. I've been trying to build an index of Shopify stores. Did you search for all domains pointing to Shopify's name servers?


I don't think that would work as many people also use Cloudflare etc.

You may try using BuiltWith which is a paid service:

https://trends.builtwith.com/websitelist/Shopify


It works fine. Just issue a HEAD request when you are unsure and rotate proxies a lot. Takes a bit of infra but definitely possible.


I mean simply querying nameservers won't work.


Worked well for me, great job. I searched for something I've been looking for and found some interesting options I haven't seen before.


You should def give Algolia and Typesense a try. You can get 10k in free Algolia credits for the first year too via Secret (startup deals site).


Will do. Thanks for the heads up!


Could you make it so, that I can easily open a product in a new tab. I like to compare lots of products at the same time.


Absolutely. Currently on the search results page, you can click on the title / price area to open the product URL in a new tab. Open to any suggestions as well. So the product URL is accessible from the search results page or if you click on the product image to open the product listing and click on 'visit product'. Let me know if that makes sense and if you have any suggestions to make it better!


Love it! Some improvements are needed on search but it is an amazing MVP. I'll use this for my late Christmas shopping


Why not manticore as backend? Much better perf than ES, less memory intense, sql syntax. Just fantastic all round!!


Clicking an item could show you similar items before it takes you to the item (or have capability for similar)


"There's about 25 million products on Agora right now."

How many stores are represented in the index?


https://www.searchagora.com/about

Seems he is indexing nearly 650k shops.


If you check the "about" section you can see how many merchants were added, over 640k at the moment


Incredible. Where can I connect with you? Want to pick your brain & swap some thoughts :)


heh, I used to work on the data team at Shopify. I built something similar to search internal dbs for secret santa gifts based on some weird criteria. Scraping might have a large margin of error because a lot of products tend to be ephemeral.

Neat project though!


Agreed on the large margin of error. Working on a bot to store and convert the images to webp to improve performance. Having the bot do a check for any images that don't exist and removing those listings. Will likely also need to triangulate this with a 404 check. Recently added an option for users to mark a product as "sold out" on the search results which will help as well.

Unrelated, but what was the "weird criteria" for the secret santa exchange? Half joking but also helps with figuring out filters :)


Any Unicode input (Japanese or Greek text for example) currently causes a 500 error.


So cool, good luck in the marriage, you made a very cool thing!!


Thank you! Tied the knot back in May, so both marriage and Agora are new to me. Open to advice on either.


Amazing. Why doesn't Shopify build this natively?


How did you find the list of shopify stores and names?


Found a list on a site called "Built With". Mentioned this in another comment but I think it's meant for building sales outreach email lists.

Once you have the store URL, you can get general store information at [Base URL]/meta.json

Here's an example: https://wildfox.com/meta.json


how did you avoid ip based blocking? rotating proxies?


Maybe I'm clearly ignorant, but how does this differ from Klarna (https://www.klarna.com)?


Is there a similarity at all? One is a search engine, the other a leeching "buy now pay later" scheme as a service.


Do you mean pricerunner?

There's similar price comparison sites, but they don't index every store available.

You basically have to submit your listing.


Is this really within the TOS of Shopify?


Does it matter? Or you can't do anything not explicitly allowed by law?

Shopify is a company spotted in so many shenanigans that I would personally welcome anything that undermines its business.


It matters when you get sued.


Living in fear is pathetic


Ignoring reality is stupid.



But seems to have filters (lots of liquor stores use Shopify) - shop.app shows only candy and swag[0], while searchagora shows ~130k results for the actual product [1]

[0]: https://shop.app/search/results?query=Baileys+Irish+Cream+Li... [1]: https://www.searchagora.com/search?query=Baileys%20Irish%20C...


Was about to share the same link, it seems like competing against Shopify would prove quite the challenge.

The real way to differentiate IMO is with a targeted UX for different niches rather than the one search engine to satisfy all queries.


Yeah absolutely. I hadn't heard of Shop until today but the value proposition is definitely similar. In the next week, I'll add other e-commerce platforms like BigCommerce, WooCommerce, support for custom built sites, etc. to really differentiate the user experience.


Oh cool. That works a lot better than OP's


Amazing! Does it have an api?


Not yet! Definitely something to consider as I upgrade the architecture. Would it be helpful to have API access to all products on Agora for your own app?


Built the same thing a while back while collecting a lead list for sales. Not bothered to keep data updated but was a fun thing to build in a couple days. (disclaimer mobile experience is meh cause it was a fun project)

https://zensear.ch

How did you find a list of all Shopify stores? I ended up just checking every .com, .net, etc as I didn't find an easy way to figure it out directly from Shopify.


In another comment, OP writes:

> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).


Nice. But can I ask what motivated you? I don't see any affiliate details in the links - do you monetise at all?


Wanted to play with typesense.

The tech behind building something like this isn't hard. Marketing and traffic is. No point in monetizing with no users.


It looks like from the "become a merchant" you can directly boost your products which I assume will prioritize them and hence drive more traffic to your Shopify store


how did you get a list of the 25 million stores to crawl?


Basically it’s Amazon


Antizon


where did you find a list of shopify stores to scrape


> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).


Super cool!!!


Thanks! Laughed audibly at your comment in the other thread about Mongo haha

Looking into all options right now. A tradeoff between current stability, price, and performance.


Great idea


Gg


Incredible. Would love to connect with you. Where can I find you LOL


I'm sorry, but I have to question whether this heartfelt story about looking out for your wife is in any way real.

The website certainly doesn't look like a side project. It has a fully fledged system for merchants to advertise on Agora for a fee and an affiliate system offering $50 commissions to onboard merchants, and the ToS and Privacy Policy link to a website with the following mission statement:

> We buy, build, and invest in software companies with recurring revenue and product-led growth.


OP here. Yup, haven't launched the holding company yet, but the idea is to have an LLC for all my software projects. Still a work-in-progress in both thinking and execution. Agora specifically came from a personal need and is obviously still an MVP project.

Spun up a Merchant Page and Affiliate Program page in a few hours on Webflow using a template. There is a merchant dashboard built but the 'affiliate program' is a test.


How is "Jaggi Enterprises LCC" involved? The terms, for example, lead there.


[flagged]


OP said in another comment:

> ... narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).


https://news.ycombinator.com/newsguidelines.html

I wouldn't be happy to hear someone calling something I worked on "dogshit" to be honest. I learned from his work today and appreciate his approach. It doesn't hurt to be kind.


[flagged]


Google is a data scraper. OpenAI is a data scraper. Pretty much every major company uses data obtained from a data scraper.


Bad experience with someone?


Why? They're cataloguing public data. Do you think the same of search engines - which are equally scraping the web?

If it's not useful to you, nobody is forcing you to use this product.


You really didn't think about this. Search engine crawlers are specifically designed to ignore sites that have the right setups. In other words: they try to be good citizens of the web. Data scrapers, on the other hand, are known to steal others' IP and use it to jump-start their own services.

When you do that you're using all the bandwidth of the site while devaluing everything they've built. They will have invested significant time and resources building that IP. But data scrapers think that just because it's on a public website they can leech it and do what they like. No, there is such a thing as copyright and respecting authors. Fuck anyone who says otherwise.

If you want to be a little kiddiot and steal content you're free to do so. But you're just undermining the work of everyone you steal from. Making it less and less viable for them to make more of it in the future. I stand by what I said. Data miners are fucking parasites and the web would be better off without them.


Hm, in this case he is driving business to these websites. Quite the opposite of OpenAI and the scrapers you are talking about, no?


He is putting a high amount of stress on Shopify servers. OpenAI is already in trouble with data scraping practices. Hypothetically if this site becomes successful one day, people will do whatever it takes to become #1 on the search results page and will drive business to only a handful of companies.


You're right in that it probably would be nice to have a civilized opt-out button like Google. But that's not hard to add (especially hypothetically if this site becomes successful). In my mind, total opposite of OpenAI thieving bulk data and giving 0 back to anybody.
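To make the "not hard to add" part concrete, here is a rough sketch of a crawler honoring a robots.txt opt-out, using Python's standard `urllib.robotparser`. The `ExampleCrawler` user agent, the robots.txt contents, and the `allowed_urls` helper are assumptions for illustration only; nothing in this thread says how Agora's crawler actually handles opt-outs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a store might publish: the crawler named
# "ExampleCrawler" is opted out entirely, everyone else may crawl
# anything except the cart.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /

User-agent: *
Disallow: /cart
"""

URLS = [
    "https://shop.example/products/red-shoes",
    "https://shop.example/cart",
]


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network call
    return [u for u in urls if parser.can_fetch(user_agent, u)]


print(allowed_urls(ROBOTS_TXT, "ExampleCrawler", URLS))
print(allowed_urls(ROBOTS_TXT, "OtherBot", URLS))
```

In a real crawler the robots.txt text would be fetched from each store before any product pages are requested, and a store that disallows the crawler would simply be skipped.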


Arrested? That’s a hot take.

What is OP doing that would warrant such a response? If anything, they’re providing free advertising to their “victims”.


This isn't worth the cost or effort. Shopify already has an internal tool with this functionality that they are planning to publicize.


There is no need to limit it to that, most shops have some kind of product feed.



