ravetcofx's comments

As a business owner: thank God. That was an annoying, stressful feature I disabled ages ago. Even when enabled it generated little to no leads, but it created pressure around response time, which affected your placement and was featured prominently on the listing.

Location: Vancouver BC

Remote: Yes

Willing to relocate: No

Technologies: Full Stack Linux, Embedded Systems, Python, Objective-C, Java, JS, MySQL, Git, REST APIs, Microsoft 365/Azure, Docker, Nextcloud, some PHP.

Resume: https://docs.google.com/document/d/1xVsce1-ojfkjdSEhAzWlwH2A...

Email: Corbin AT no-bs.ca

LinkedIn: https://www.linkedin.com/in/corbin-auriti/

I have a background as a developer for Linux Mint, and for the past nine years I have run my own IT and computer repair business catering to the unique needs of small businesses. This role has sharpened my client-relations and community-engagement skills. In addition to my technical expertise, I have a keen interest in urban planning and Geographic Information Systems (GIS), and I integrate these interests into my professional work where possible. My dedication to continuous learning ensures that I stay current with emerging technologies and industry best practices.


What I keep learning as a skeptic: there are untold health benefits in nature (when something natural is scientifically proven, it's medicine), but those benefits certainly aren't delivered by crystals or homeopathy. And there is a lot to learn from knowledge passed down generationally; even if it isn't scientific, it likely has a chemical or biological basis that needs to be isolated. Dozens to hundreds of generations knew their shit about staying alive, maybe just not the minutiae.

Do you know of any way to build a fast index you can run grep against? I'd love something as instantaneous as "Everything" on Windows, but for full text on Linux, so I can just dump everything in a directory.

Have you tried the more modern tools like ripgrep, ack, etc.?

Or for something more comprehensive (to also search PDF, docx, etc.) there is ripgrep-all:

https://github.com/phiresky/ripgrep-all
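
Typical usage mirrors plain rg; for example (the query and path here are made up):

    rga 'quarterly report' ~/Documents   # also matches text inside PDFs, docx, sqlite, zip, ...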


As others have said, ripgrep et al. are faster than regular grep. You would probably also get much faster results with an alias that excludes directories you don't expect results in (e.g. I don't normally grep in /var at all); something like the alias sketched below.

I have seen some recommendations for recoll, but I haven't used it so can't comment. Anecdotally, I normally just use ripgrep in my home directory (it's almost always somewhere in ~ if I don't remember where it is), and it's fast enough as long as my home directory is local (i.e. not on NFS).
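
A rough sketch of what I mean (untested; which directories to exclude is entirely a matter of taste):

    # skip directories that rarely contain what I'm looking for
    alias rgq='rg -g "!node_modules" -g "!var"'
    rgq 'some string' ~    # still recursive, just less noise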


Tracker is an open source project for exactly that; it has been around for over ten years now. https://tracker.gnome.org/overview/
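
If you do try it, newer releases ship a CLI for querying the index; from memory (treat the exact subcommand as an assumption and check tracker3 help):

    tracker3 search 'meeting notes'   # full-text query against the existing index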

Try ripgrep.

Not everyone who smokes cigarettes agrees that they are a problem

Accessing the dataset to train from scratch will be the biggest hurdle; a lot of the pile has had the ladder pulled up since GPT-4.

https://huggingface.co/datasets/HuggingFaceFW/fineweb has 15T tokens of cleaned and deduplicated English web data.
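
For anyone tempted: the hub CLI can fetch it, though the full dump is on the order of 45 TB, so you likely want one of the sample subsets. A sketch (the subset path is an assumption; check the dataset card for the real layout):

    # pull a small sample instead of the whole dump
    huggingface-cli download HuggingFaceFW/fineweb \
        --repo-type dataset \
        --include 'sample/10BT/*' \
        --local-dir fineweb-sample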

Holy crap. Does Hugging Face charge for bandwidth if you're downloading 45 terabytes?

Fun trivia: downloading 45TB costs about $60, according to Cloudflare.
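
Back-of-envelope, that's just their round numbers divided out:

    echo 'scale=4; 60 / (45 * 1024)' | bc   # ≈ 0.0013 dollars/GB, i.e. about $1.33/TB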

That's what Cloudflare charges. It costs them around 6 cents.

That's what they said it costs on their blog, not that they charge that. https://blog.cloudflare.com/aws-egregious-egress

Where are you getting 6 cents from?


Wish I could say I'm surprised you're getting downvotes. Carrier costs are some of the lowest costs for hosting providers, yet that fact seems to elude a majority of the community here.

I believe they are hosting it on Cloudflare, which doesn't charge for egress.

More specifically, Cloudflare R2 doesn't charge for egress, and Cloudflare doesn't charge for egress to members of the Bandwidth Alliance, which includes Azure, Google Cloud, Oracle, Alibaba Cloud, and others, though critically not AWS.

They very much do charge egress fees elsewhere.


Someone will come along and say "Why don't you just mirror Anna's Archive?" in 3...2...1...

I think between Anna's Archive, fineweb, and as many GitHub repos as you can scrape, you can get a pretty decent dataset.

I doubt Anna's Archive would produce a good model on its own though.


I suppose you wouldn't be able to use it for external services, but internally, I'm sure you can find some books that fell off the back of a truck...

No reason you can't go external. GPT was trained using ebook torrent sites

OpenAI has enough money to hire lawyers to defend it until the end of time though

I'm okay with paying for datasets

Depends on how the courts rule. If the copyright maximalists prevail, only the wealthiest entities will be able to afford to license a useful data set.

Paradoxically enough, this is the outcome that most "Hacker News" denizens seem to be rooting for.


It's almost as if people believe in fairness and compensating people for their work.

Also, it's worth noting that this is only true as long as we're stuck in the "must train on the entire sum total of human output ever created" local minimum for machine learning. Given that most biological entities learn with much less data, this might well be the thing that prods ML research into an approach that isn't "IDK, buy a few containers of GPUs and half a DC of storage, and see if that makes things better".


> It's almost as if people believe in fairness and compensating people for their work.

Yet in this case we are talking about compensating the compilers/massagers/owners of the datasets, not the original authors from wherever the data was originally scraped.


Copyright is hideously broken, but in theory: the owners only own it because they compensate the authors, which they only do out of an expectation of future profit (on average).

That theory's a fantasy, because extractive systems involving gatekeepers get established, but in this specific case, enforcing copyright would make things fairer for authors. There's no extractive copyright-taking gatekeeper for websites: scrapers don't get copyright, so can't re-license the material they've scraped (unless it's permissively-licensed or something).


I'd still get most of my dataset from torrents, but I could pay for specific things like high-quality source code.

I've been using the free version of cal.com, which has been phenomenal, and there's a self-hostable option, which is nice.

♥ Thanks man! We're putting a lot of love into our product and are happy to help anyone looking to move away from the old x.ai scheduling.

They have Stacks already, which kind of does a similar thing: https://support.apple.com/en-ca/guide/mac-help/mh35846/14.0/...

What will happen when the domain gets flagged by every provider and banned from use?

At many places, freedom TLDs are already shitlisted.

What’s a freedom TLD?

>What’s a freedom TLD?

It was probably a typo/autocorrect of "freenom" :

https://www.freenom.com/en/freeandpaiddomains.html

The significance is that the ".ml" TLD is on that list, and this thread's domain is "email.ml". Some entities block every TLD registered with Freenom.
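
For example, a mail admin might shitlist them all with a Postfix access map along these lines (a hypothetical sketch; tk/ml/ga/cf/gq were Freenom's free TLDs, and you'd wire the map in via check_sender_access in smtpd_sender_restrictions):

    # /etc/postfix/sender_access.pcre
    /\.(tk|ml|ga|cf|gq)$/   REJECT  Freenom TLD senders not accepted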


Does Microsoft even push or care about .NET anymore? They seemed to move on after UWP, and now that seems to be forgotten in favor of more web apps.


You have to understand that Windows comes from a different division than .NET, and they have no overlap. Microsoft isn't a cohesive company. .NET comes from the Developer Division (DevDiv) and UWP comes from the Windows division (now Server & Cloud). The Windows folks have always hated .NET, and the developer division has been lukewarm about UWP.

The Microsoft panel of this comic sums it up nicely: https://bonkersworld.net/organizational-charts


Microsoft cares about .NET. It runs the corporate world, like Java.

