
Common Crawl Foundation | REMOTE | Full and part-time | https://commoncrawl.org/ | web datasets

I'm the CTO at the Common Crawl Foundation, which has a 17-year-old, 9-petabyte crawl and archive of the web. Our open dataset has been cited in nearly 10,000 research papers and is the most-used dataset in the AWS Open Data program. Our organization is also very active in the open source community.

We are expanding our engineering team. We're looking for people who are:

* Excited about our non-profit, open data mission

* Proficient with Python, and hopefully also some Java

* Proficient with cloud systems such as Spark/PySpark

* Willing to learn.

Our current team is composed of engineers who do some data science, and data scientists who do some engineering. We are focused on improving our crawl, making new data products, and using these new data products to improve our crawl.

If you'd like a little tour of what our data looks like, please see https://github.com/commoncrawl/whirlwind-python/
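
Here's a rough sketch of the kind of thing that tour covers, assuming the requests and warcio libraries are installed. The crawl label below is just an example; substitute a current one from the list at https://index.commoncrawl.org/ before running it.

    # A rough end-to-end peek: look a URL up in the per-crawl index, then
    # fetch just that one record from the WARC file with an HTTP Range request.
    import io
    import json
    import requests
    from warcio.archiveiterator import ArchiveIterator

    CRAWL = "CC-MAIN-2024-10"  # example crawl label; substitute a current one

    # 1. Where was commoncrawl.org captured in this crawl?
    idx = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": "commoncrawl.org", "output": "json"},
    )
    idx.raise_for_status()
    hit = json.loads(idx.text.splitlines()[0])
    print(hit["timestamp"], hit["url"], hit["filename"])

    # 2. Pull only that record out of the archive with a Range request.
    offset, length = int(hit["offset"]), int(hit["length"])
    warc = requests.get(
        "https://data.commoncrawl.org/" + hit["filename"],
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    for record in ArchiveIterator(io.BytesIO(warc.content)):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            print(uri, len(record.content_stream().read()), "bytes")

The Range request is what makes it practical to pull a single page out of a multi-petabyte archive without downloading a whole WARC file.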

Interested? Contact us at jobs zat commoncrawl zot org. Please include a cover letter addressing the above points. Thank you for your interest!


Given the geographical locations of some of your people, is it fair to say that you might accept applicants from a number of different time zones?


Yes


Crawl budget is relevant to every site in Common Crawl.


> Although Common Crawl has been essential to the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco–based organization was best known prior to the AI boom for its value as a research tool. “Common Crawl is caught up in this conflict about copyright and generative AI,” says Stefan Baack, a data analyst at the Mozilla Foundation who recently published a report on Common Crawl’s role in AI training. “For many years it was a small niche project that almost nobody knew about.”

> Prior to 2023, Common Crawl did not receive a single request to redact data. Now, in addition to the requests from the New York Times and this group of Danish publishers, it’s also fielding an uptick of requests that have not been made public.


I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs, and most of the 10,000 research papers using our corpus have nothing to do with machine learning.

Edit: Publishers Target Common Crawl In Fight Over AI Training Data https://www.wired.com/story/the-fight-against-ai-comes-to-a-...
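
If you want to double-check whether it's your robots.txt turning CCBot away (as opposed to a server-side user-agent blocklist), here's a rough sketch using the standard-library parser, with example.com as a stand-in:

    # Quick check of what a site's robots.txt says to a few crawler user agents.
    # Standard library only; example.com is just a stand-in domain.
    from urllib.robotparser import RobotFileParser

    site = "https://example.com"   # substitute the site you care about
    rp = RobotFileParser()
    rp.set_url(site + "/robots.txt")
    rp.read()

    for agent in ("CCBot", "GPTBot", "*"):
        print(agent, "allowed:", rp.can_fetch(agent, site + "/"))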


First off, I want to thank you and the other members of the CC Foundation; the CC dataset is an incredible resource for everyone.

Much of the UA data, including CCBot, is from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these services are not AI scraping bots. I've now excluded CCBot[1] and the archival services from the recommended configs.

[0] https://darkvisitors.com/agents/ccbot

[1] https://github.com/anthmn/ai-bot-blocker/commit/ae0c2c40fd08...


Thank you!


The western Internet has a bunch of government archives, in addition to the Internet Archive and Common Crawl.

Many of the government archives are not public for copyright reasons.


Both the Internet Archive and Common Crawl have tools that reveal actual crawl dates. Search engines are not really intended to be archives, so it's no surprise that they aren't very good archives.
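
A rough sketch of what I mean by those tools, assuming the public CDX/index endpoints and using example.com and an example crawl label as stand-ins:

    # Both archives expose the capture time for every record.
    import json
    import requests

    # Internet Archive: the CDX API returns one row per capture.
    ia = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": "example.com", "output": "json", "limit": 5},
    ).json()
    fields, rows = ia[0], ia[1:]
    for row in rows:
        capture = dict(zip(fields, row))
        print("IA", capture["timestamp"], capture["original"])

    # Common Crawl: each per-crawl index returns JSON Lines with a timestamp.
    cc = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
        params={"url": "example.com", "output": "json", "limit": 5},
    )
    for line in cc.text.splitlines():
        rec = json.loads(line)
        print("CC", rec["timestamp"], rec["url"])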


Is it, though? I think you have to define what your search engine is searching to make a claim like that. The Internet Archive and Common Crawl (which, I will say, has its own incentives that discourage the discoverability of old sites, given the methodology and limitations of its web crawling) are search engines in their own right.

What are you doing when you use their services? Searching.


I wonder if Glaze/Nightshade makes it difficult for software to describe the image for a blind person?


Not really. Nightshade was supposed to make image labeling more difficult, but if you try it you'll see that it doesn't do anything; multimodal models are too powerful nowadays to be fooled by small adversarial noise generated with CLIP and an LPIPS constraint (small enough not to be too noticeable to us).

And Glaze does not try to interfere with labeling.


Antimatter has the same electrodynamic properties as matter.


I live in downtown Palo Alto, and it has a lot of mid-rise buildings and is walkable. Expanding the area for which that's true is not something completely different.


Parts of California pay way more for energy because their utilities started wildfires and were driven into bankruptcy.

Other parts of California pay pretty normal rates.


What parts pay normal rates? Are they heavily dependent upon solar, wind, and batteries? Or are they using gas or nuclear?


Places that aren't on SCE, SDG&E, or PG&E. My utility is municipal, and my rates are about 17 cents a kWh. It's awesome.


If you live in Santa Clara, your rate will be around $0.166 per kWh.

I think a lot of folks there have solar. And they say that by 2026, 50% of their electricity will be sourced via sustainable means.


I hear it is 2x to 3x that much depending on time of day in Santa Clara. Also, people installed solar when there were different rules; the new rules make it much less advantageous.


> I hear it is 2x to 3x that much depending on time of day in Santa Clara.

Sounds like an amazing opportunity to make bank trading energy off-hours (as modeled by TX). Guess what technology enables exactly this!


Batteries charged by natural gas?


That's right, batteries! They enable us to capitalize on energy price fluctuations.

You can of course charge them however you like, but I have a feeling Santa Clara's purported 3x price fluctuations are not due entirely to natural gas.


No. If you get the time-of-use plan in Santa Clara, it's 4 cents more on-peak vs. off-peak [1]. The situation with PG&E's rates is completely asinine and is a political failure more than anything else.

[1] https://www.siliconvalleypower.com/home/showpublisheddocumen...


What's the point of a time-of-use plan if it only varies by 4 cents? Does that change behavior?


Well, it's a 25% discount; it just doesn't sound like much because PG&E's rates are so insane.


You’re probably hearing that from people who are in Santa Clara but somehow served by PG&E rather than Silicon Valley Power (the municipal utility). Maybe they are right on the city boundary, or they live in some other city in Santa Clara County (SVP only serves the city of Santa Clara). PG&E off-peak rates for Silicon Valley are 40-60 cents/kWh.

