> Although Common Crawl has been essential to the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco–based organization was best known prior to the AI boom for its value as a research tool. “Common Crawl is caught up in this conflict about copyright and generative AI,” says Stefan Baack, a data analyst at the Mozilla Foundation who recently published a report on Common Crawl’s role in AI training. “For many years it was a small niche project that almost nobody knew about.”
> Prior to 2023, Common Crawl did not receive a single request to redact data. Now, in addition to the requests from the New York Times and this group of Danish publishers, it’s also fielding an uptick of requests that have not been made public.
I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs and most of the 10,000 research papers using our corpus have nothing to do with Machine Learning.
First off, I want to thank you and the other members of the CC Foundation; the CC dataset is an incredible resource for everyone.
Much of the user-agent data, including CCBot, comes from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these services are not AI scraping bots. I've now excluded CCBot[1] and the archival services from the recommended configs.
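For anyone who wants to verify what a given site's robots.txt currently says about these crawlers, here's a quick sketch using only the Python standard library; example.com is just a placeholder.

```python
# Check whether a site's robots.txt blocks CCBot, the Internet Archive's
# crawler (ia_archiver), or AI bots. example.com is a placeholder URL.
from urllib import robotparser

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

for agent in ("CCBot", "ia_archiver", "GPTBot", "*"):
    print(f"{agent}: allowed = {rp.can_fetch(agent, 'https://example.com/')}")
```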
Both the Internet Archive and Common Crawl have tools that reveal actual crawl dates. Search engines are not really intended to be archives, so it's no surprise that they aren't very good archives.
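For the curious, here's a minimal sketch of pulling actual crawl dates for a URL out of the Common Crawl index; the crawl ID below is illustrative (any published crawl works), and the Internet Archive exposes a similar CDX API at web.archive.org/cdx/search/cdx.

```python
# Look up when a URL was actually fetched, via the Common Crawl CDX index.
# Requires the `requests` package. The crawl ID is illustrative.
import json
import requests

crawl = "CC-MAIN-2024-33"  # illustrative crawl ID; pick any published crawl
resp = requests.get(
    f"https://index.commoncrawl.org/{crawl}-index",
    params={"url": "commoncrawl.org", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# One JSON record per line; the 14-digit timestamp (YYYYMMDDhhmmss)
# is the moment the page was actually crawled.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["timestamp"], record["url"])
```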
Is it, though? To make a claim like that, you have to define what your search engine is searching. The Internet Archive and Common Crawl (which, I'd note, has its own incentives that discourage the discoverability of old sites, given the methodology and limitations of its web crawling) are search engines in their own right.
What are you doing when you use their services? Searching.
Not really. Nightshade was supposed to make image labeling more difficult, but if you try it, you'll see that it doesn't do anything: multimodal models are too powerful nowadays to be fooled by small adversarial noise generated using CLIP and LPIPS (small enough not to be too noticeable to us).
And Glaze does not try to interfere with labeling.
I live in downtown Palo Alto, and it has a lot of mid-rise buildings and is walkable. Expanding the area that this is true for is not something completely different.
I hear it is 2x to 3x that much depending on time of day in Santa Clara. Also, people built solar when there were different rules. The new rules make it much less advantageous.
That's right, batteries! They enable us to capitalize on energy price fluctuations.
You can of course charge them however you like, but I have a feeling Santa Clara's purported 3x price fluctuations are not due entirely to natural gas.
No. If you get the time-of-use plan in Santa Clara, it's 4 cents more on-peak vs. off-peak [1]. The situation with PG&E's rates is completely asinine, and it's a political failure more than anything else.
You’re probably hearing that from people who are in Santa Clara but somehow served by PG&E rather than Silicon Valley Power (the municipal utility). Maybe they are right on the city boundary, or they live in some other city in Santa Clara County (SVP only serves the city of Santa Clara). PG&E off-peak rates for Silicon Valley are 40-60 cents/kWh.
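To put some numbers on why it's the size of the spread, not the absolute rate, that matters for batteries, here's a back-of-the-envelope sketch; every figure below is an assumption for illustration, not a quote from either utility's tariff sheet.

```python
# Back-of-the-envelope battery arbitrage: charge off-peak, discharge on-peak,
# one cycle per day. All numbers are illustrative assumptions, not tariff data.
BATTERY_KWH = 13.5       # Powerwall-class capacity, assumed
ROUND_TRIP_EFF = 0.90    # round-trip efficiency, assumed

def daily_value(off_peak_rate, on_peak_rate):
    """Dollars saved per day by shifting one full battery of energy."""
    return BATTERY_KWH * (on_peak_rate * ROUND_TRIP_EFF - off_peak_rate)

# A ~4 cent spread (the SVP time-of-use case above; absolute levels assumed):
print(f"4-cent spread:  ${daily_value(0.12, 0.16):.2f}/day")   # ~$0.32/day
# A ~20 cent spread (PG&E-style rates in the 40-60 cent range, per above):
print(f"20-cent spread: ${daily_value(0.40, 0.60):.2f}/day")   # ~$1.89/day
```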
I'm the CTO at the Common Crawl Foundation, which has a 17 year old, 9 petabyte crawl & archive of the web. Our open dataset has been cited in nearly 10,000 research papers, and is the most-used dataset in the AWS Open Data program. Our organization is also very active in the open source community.
We are expanding our engineering team. We're looking for people who are:
* Excited about our non-profit, open data mission
* Proficient with Python, and hopefully also some Java
* Proficient at cloud systems such as Spark/PySpark
* Willing to learn.
Our current team is composed of engineers who do some data science, and data scientists who do some engineering. We are focused on improving our crawl, making new data products, and using these new data products to improve our crawl.
If you'd like a little tour of what our data looks like, please see https://github.com/commoncrawl/whirlwind-python/
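And if you'd like a taste before cloning that repo, here's a minimal, unofficial sketch of streaming the first response record out of one crawl's WARC files; it assumes the `requests` and `warcio` packages, the usual per-crawl `warc.paths.gz` layout, and an illustrative crawl ID.

```python
# Stream the first HTTP response record from one WARC file of a recent crawl.
# Requires `requests` and `warcio`; the crawl ID is illustrative.
import gzip

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2024-33"  # illustrative; any published crawl ID works
BASE = "https://data.commoncrawl.org"

# Each crawl publishes a gzipped list of its WARC file paths.
paths = gzip.decompress(
    requests.get(f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz", timeout=60).content
)
first_warc = paths.decode().splitlines()[0]

# Stream the gzipped WARC file and stop at the first response record.
with requests.get(f"{BASE}/{first_warc}", stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
            print(record.content_stream().read(200))  # first bytes of the page
            break
```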
Interested? Contact us at jobs zat commoncrawl zot org. Please include a cover letter addressing the above points. Thank you for your interest!