ccgreg's comments

The western Internet has a bunch of government archives, in addition to the Internet Archive and Common Crawl.

Many of the government archives are not public for copyright reasons.


Both the Internet Archive and Common Crawl have tools that reveal actual crawl dates. Search engines are not really intended to be archives, so it's no surprise that they aren't very good archives.

Is it, though? I think you have to define what your search engine is searching to make a claim like that. The Internet Archive and Common Crawl are search engines in their own right (though I will say Common Crawl has its own incentives discouraging the discoverability of old sites, through the methodology and limitations of its web crawling).

What are you doing when you use their services? Searching.


I wonder if Glaze/Nightshade makes it difficult for software to describe the image for a blind person?


Not really, because Nightshade should have made image labeling more difficult, but if you try it, you'll see that it doesn't do anything; multimodal models are too powerful nowadays to be fooled by small adversarial noise generated using CLIP LPIPS (small enough not to be too noticeable to us).

And Glaze does not try to interfere with labeling.


Antimatter has the same electrodynamic properties as matter.


I live in downtown Palo Alto, and it has a lot of mid-rise buildings and is walkable. Expanding the area that this is true for is not something completely different.


Parts of California pay way more for energy because their utilities started fires, and the resulting liabilities led to bankruptcy.

Other parts of California pay pretty normal rates.


What parts pay normal rates? Are they heavily dependent upon solar, wind, and batteries? Or are they using gas or nuclear?


Places that aren't on SCE, SDG&E, or PG&E. My utility is municipal, and my rates are about 17 cents a kWh. It's awesome.


If you live in Santa Clara your rate will be around $0.166 per kWh.

I think a lot of folks there have solar. And by 2026 they say they will have 50% of electricity sourced via sustainable means.


I hear it is 2x to 3x that much depending on time of day in Santa Clara. Also, people built solar when there were different rules. The new rules make it much less advantageous.


> I hear it is 2x to 3x that much depending on time of day in Santa Clara.

Sounds like an amazing opportunity to make bank trading energy off-hours (as modeled by TX). Guess what technology enables exactly this!


Batteries charged by natural gas?


That's right, batteries! They enable us to capitalize on energy price fluctuations.

You can of course charge them however you like, but I have a feeling Santa Clara's purported 3x price fluctuations are not due entirely to natural gas.


No. If you get the time of use plan in Santa Clara it's 4 cents more on-peak vs. off-peak [1]. The situation with PG&E's rates is completely asinine and is a political failure more than anything else.

[1] https://www.siliconvalleypower.com/home/showpublisheddocumen...


What's the point of a time of use plan if it only varies 4 cents? Does that change behavior?


Well, it's a 25% discount, it just doesn't sound like much because PG&E's rates are so insane.


You’re probably hearing that from people who are in Santa Clara but somehow served by PG&E rather than Silicon Valley Power (the municipal utility). Maybe they are right on the city boundary, or they live in some other city in Santa Clara County (SVP only serves the city of Santa Clara). PG&E off-peak rates for Silicon Valley are 40-60 cents/kWh.


There are a bunch of Vera Rubin-sized mirrors in orbit, that's why this one is the size that it is.


No, there aren't. JWST has the largest mirror in orbit, and it is a few meters smaller, despite being built from many smaller mirror segments. The mirror size here is largely set by the size of a tunnel on the way to the summit in Chile, a limit that also applies to the other nearby telescopes.

It’s also really hard to ship mirrors much larger than this on a boat.


Apologies, I was thinking of Roman, not Rubin.


Common Crawl Foundation | REMOTE | Full and part-time | https://commoncrawl.org/ | web datasets

I'm the CTO at the Common Crawl Foundation, which has a 17-year-old, 8-petabyte crawl & archive of the web. Our open dataset has been cited in nearly 10,000 research papers, and is the most-used dataset in the AWS Open Data program. Our organization is also very active in the open source community.

We are expanding our engineering team. We're looking for someone who is:

* Excited about our non-profit, open data mission

* Proficient with Python, and hopefully also some Java

* Proficient with cloud systems such as Spark/PySpark

* Willing to learn the rest: crawling, parsing, indexing, etc.

Contact me at jobs zat commoncrawl zot org.


This is interesting. I just sent my application. A little bit late, but I'm hopeful.


This looks like an exciting opportunity. I sent you an email just now. Thanks.


Very excited. Just applied.


Thanks!


We like calling Common Crawl a crawl, not a scraper. Our 17-year-old dataset predates the current AI explosion.


I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low background steel as an example.

