The coming IP war over facts derived from books (abe-winter.github.io)
211 points by awinter-py 17 days ago | 97 comments



There's no IP war coming over facts derived from books because copyright doesn't cover facts derived from books and the other forms of IP (trademarks, patents) are even less relevant.

I've done some work with corpus linguistics and quantitative linguistics, and large parts of these disciplines essentially are about facts derived from books in some manner. Modern approaches tend to involve machine learning, deep neural networks and other things fashionable on hackernews, but in general that's an old, traditional area that was working on facts derived from books for decades before the "ML era".

To work on facts derived from books, we're sourcing all kinds of books and other written language, such as newspapers. Some publishers and authors are cooperative and helpful for such research; some are uncooperative and intentionally make working on their sources difficult. But even in cases of disagreement and conflict there's no "IP war". The conflict in our case tends to be about practical convenience of access, not about IP, because they don't really have a leg to stand on in claiming a copyright violation. They hold the copyright on the original text, which gives them certain exclusive rights, and there's a bunch of intermediary data that we can't make available to the public without their permission. But these rights don't extend to facts derived from that text, and we legally don't need their permission to work on, analyze, transform, publish and use stuff based on facts in the text or facts about the text. We can do that openly even if they've explicitly made it clear that they don't want us to. That's nothing new; that's established law that probably predates modern computers.


I found it interesting from a legal point of view when someone pointed out that the recent "AI dungeon generator" that was using BERT to act like a game master was in some occurrences basically copying (relevant) excerpts from books.

Can an AI commit copyright infringement? BERT probably "knows" that Cthulhu is a giant thing evoking squids, tentacle and non-orthonormic dimensions. These are facts based on books, but you can produce copyright infringement based on those facts. It is called "producing a derived work".

In the past years I never managed to get anyone with legal knowledge interested in what they saw as a totally impossible scenario: the idea that AI could one day produce original work by learning its craft, like humans do, from copyrighted works. Their criterion was "if you fed copyrighted work into an algorithm to produce a new work, then that's a derived work".

Humans are somehow imbued with a magic property that allows them to read WH40K books, watch Alien and Predator movies, then produce the Starcraft universe, and have it count as original work.

We do have a philosophico-legal discussion to have there. And it's way overdue, if I may. Copyright is already late in acknowledging the internet; DL-generated work will be even more of a conundrum for it.


> Humans are somehow imbued with a magic property that allows them to read WH40K books, watch Alien and Predator movies, then produce the Starcraft universe, and have it count as original work.

I feel like this is speculation. Do you have any citations? It seems to me that in this fuzzy area an AI will be judged identically to a human. You give an example where the StarCraft universe is created and considered original work, but there are many cases where a human learns their craft from copyrighted work and creates a derived work; fanfic is a huge genre of examples. I suspect that in the legal arena the nature and content of the new work will be far more influential in the status of the copyright than the details of how the new work was created.

So, while I think it's likely that something generated by an AI that looks like original work will be considered original work, I think a question that is less clear, and much more important, is whether the AI itself would be considered a derived work. In some ways, it can be argued that an AI is a transformation of the original representation, and that substantial portions of the original work are/can be maintained within the AI itself, just as a work can be transformed by a compressor but still be considered to maintain the copyright. However, afaik this question remains untested.

IANAL and all.

Also, I think:

> BERT probably "knows" that Cthulhu is a giant thing evoking squids, tentacle and non-orthonormic dimensions.

Is highly debatable.


The underlying presumption for copyright is that the work was created by a legal concept called "natural person" which ties into the framework of "legal personhood".

One is a natural person simply by the mere fact of having been born. And that's what makes all the difference.

This idea is pretty much the basis of large swathes of jurisprudence across the world, really.

AI, as such, is legally speaking no different from a simple pencil when it comes to writing a book. It's a tool through which a natural person creates a creative work thus establishing a copyright on the part of the natural person.

See, what most people fail to see is that copyright isn't tied to the creative work; it's tied to its creator. Hence why copyright ceases to exist some arbitrary amount of time (20, 40, 70 years) after the creator - a natural person - has died.

So, when you say "a neural network acquires copyright by itself when it generates a new creative work", you are forced to consider whether a neural network is a "person". Which is a can of worms in itself. (consider animals as persons - case: monkey selfie)

https://en.wikipedia.org/wiki/Legal_person https://en.wikipedia.org/wiki/Natural_person


Obviously neural networks created with current technology are not persons in any sense, legal or otherwise.

I think the more interesting questions are whether the operator of the neural network is the legal author of its creations and whether such creations satisfy the creativity requirements for copyright.

I think the operator would be the author of the work, similar to how the operator of a camera or word processor is the author of works created by those tools. However I think in some cases the work may not meet the creativity requirement.

> “[T]he requisite level of creativity is extremely low.” Even a “slight amount” of creative expression will suffice. ... An author’s expression does not need to “be presented in an innovative or surprising way,” but it “cannot be so mechanical or routine as to require no creativity whatsoever.”

https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...


> However I think in some cases the work may not meet the creativity requirement.

Exactly. Copyright is very murky in that respect. The basic notion of "creativity" is willfully vaguely defined to ensure that there's a universal maxim.

For instance, did you know that databases are copyrightable? Even when their constituent parts consist of uncopyrightable facts? Copyright law considers the database as a "collection" or a whole, and so the entire collection can be seen as a creative work. But copyright only applies to the whole, not the constituent parts.

https://www.bitlaw.com/copyright/database.html

Another example: suppose you digitize an ancient piece of pottery by making a digital photograph. Have you then created a new creative work of art? Some would argue you did. Why? Because you didn't make a 1:1 copy of the pottery by creating a new physical pot with similar materials. You created an image using a particular mechanism, introducing elements such as lighting, color, contrast,... that may give your image an original element.

The latter example is actually a legal problem for digitization programs of cultural collections. Institutions hire a photographer to digitize a collection, but then discover that the images are pretty much unusable because the photographer is able to enforce their own copyright, i.e. demand a licensing fee every time someone wants to use or download an image. Which implies that institutions are also forced to add legal provisions in any contracts pertaining to the transfer of rights.

Hence why copyright law is rife with exceptions and exemptions. For instance, did you know that any image made by the U.S. Government automatically ends up in the public domain?

https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...

The problem with copyright is that digital technology is innately the act of creating copies. Each time I send a request over the Internet, I basically create a copy of the 1's and 0's stored at the other side. The basic tenets of copyright don't concern themselves with conceptual models and higher abstractions. They go back to the fact that a string of 1's and 0's was created on a physical carrier and then copied over to another carrier.

But that's not how humans work. We don't really apply notions of copyright to the physical representation on a disk; we apply them to the ephemeral, assembled representation on our screens and displays. This mismatch is what creates a ton of tension in this space.


Ok, legal personhood is something interesting I hadn't considered in my original response, and good to know about.

Then, is a work produced by a tool that is itself found to violate copyright also judged to violate that copyright?

Would a reasonable analogy be a pencil that has the works of H.P. Lovecraft etched onto its surface?


> I feel like this is speculation. Do you have any citations?

Is this speculation to say that Starcraft is protected by copyright as an original work?

Is it speculation to say that it borrows heavily from existing universes?


It's speculation to imply that humans alone are imbued with a magical property as related to copyright.


To your last point, I imagine parent meant that BERT "knows" that Cthulhu is geometrically proximate to giant things, squids, tentacles and non-orthonormic dimensions.


I meant that if you tell it to predict the end of the sentence "And before them stood Lord Cthulhu. They saw..." while it has ingested Lovecraft works, it is very likely to use the word "tentacles" in the following description.
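To make that concrete, here is a minimal sketch of the idea using a toy bigram counter. It is not how BERT works internally (BERT is a neural masked language model), and the three training sentences are made-up stand-ins, not text from any real book. The function names `train_bigrams` and `predict_next` are hypothetical. It only illustrates that a model that has seen "saw" followed mostly by "tentacles" will tend to predict "tentacles".

```python
from collections import Counter, defaultdict

# Made-up stand-in sentences (an assumption for illustration),
# NOT excerpts from any real book.
corpus = [
    "they saw tentacles rising from the sea",
    "they saw tentacles coiling in the dark",
    "they saw stars above the city",
]


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


model = train_bigrams(corpus)
print(predict_next(model, "saw"))
```

The prediction is just a statistic over the ingested text, which is the sense of "knows" I had in mind.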


> We do have a philosophico-legal discussion to have there.

There is not that much to discuss. If you leave vested interests out, that is. Future generations will see copyright in the same way we look at feudalism or slavery today. Assuming we avoid the future pointed to by Idiocracy.


I am by no means an expert on this area of the law, but I think this is a really interesting topic that I tried to explore in a recent dissertation.

I think the issues come from the fact that copyright law really fails to represent the realities of creativity in humans. As you point out, the laws don't really address the fact that often the things we create are based on all of our experiences and consumption of creative works, yet a machine which can produce the same process may fall foul of the exclusive rights of reproduction and adaptation. Is it merely the fact that humans have consciousness which means that we are able to do this without violating copyright law?

At least in the US there is more flexibility around derivative works, which gives creators of derivative works some avenue to enforce exclusive rights over their creations, or at least avoid claims from original rights holders. Here in the UK we really lack such flexibility, with the only exceptions along the same lines being 'fair dealing', which is not really a fair comparison because it requires the derivative creator to jump through a bunch of hoops.

Having said that, I'm not sure derivative works are really a suitable legal definition for AI created works, but until we can have a conversation about the role of originality and creativity and the role of consciousness in those processes, this imperfect definition will probably continue to be applied to those works.

Lawrence Lessig writes a lot about this sort of thing, if you are interested.

EDIT: Also the academic Omri Rachum-Twaig recently wrote a book called 'Copyright Law and Derivative Works: Regulating Creativity' which also covers a lot of issues that are interesting, such as the disconnect between the psychology of creativity and the structure of copyright law.


I think it comes down to who publishes it. If someone posts it on their blog, then it’s technically them writing it.

Pretty sure ai and ghost writers fall into the same legal situation.


A single fact isn't protected by copyright, but if I understand correctly, collections of facts are, if creative work is involved. The article seems to describe digesting all the facts in a book, and making them available to third parties in a way that competes with the book itself. I can see copyright being an issue there.


The classic example of this in copyright law is the recipe.

A recipe can not be protected by copyright. This is one of the reasons that online recipe pages have turned into long personal stories with (incidentally) a recipe at the bottom.

A recipe book, however, does have protection -- due to the creative work found in organizing the recipes, choosing which ones to include and to put near each other, and any creative work associated with introductions, photos, or other new expression.

That means that legally, you could buy a ton of recipe books, and then make your own by copying and pasting just the ones you like. You could use the recipes unchanged, but you can't reuse the photos or any descriptive text, or anything but the bare recipe.

Similar logic should apply to the publication or reuse of bare facts.

Of course, law is complicated and nuanced, and lawyers/judges/legislators don't always understand new technology well enough to apply existing principles properly to new worlds.


As I understand it, while a cooking process is not protected by copyright, the prose describing a recipe can be. So if the recipe is more than just a matter-of-fact exposition of instructions you'd have to rewrite it in your own words.


Yes, any creative expression in the recipe might be protected by copyright.

And dish names might be protectable as trademarks, especially if they are not merely descriptions of the foodstuff.

The ingredients list and preparation steps, or any other text which is purely functional and without a creative component, is what is not protectable.

Interestingly, computer code is generally copyright protected, despite being literally steps a machine follows to perform a task. Courts have ruled that, because there are so many different ways to express any software of nontrivial size, the way the code is written (including comments, variable names, organization, etc ) represents sufficient creative expression to be protected.

I'm actually somewhat surprised that binaries still get the protection, especially since with modern compiler optimizations, it seems like any creative expression in your code would be gone by the time the compiler was done with it.

But hey, as I said above, law is strange


The text of a recipe is protected by copyright, just like the source code for a computer program is protected.

A mechanical transformation of the recipe (say, converting it to all caps, or changing the font) will still be protected as a derivative work, just like the binaries for a computer program are protected as a derivative work.

A re-phrasing of the recipe in someone else's words which results in the same dish is not protected, just like a re-implementation of a piece of software is not protected.


"Courts have ruled that, because there are so many different ways to express any software of nontrivial size, the way the code is written (including comments, variable names, organization, etc ) represents sufficient creative expression to be protected."

If I removed comments and/or translated the code in a literal manner, I'm sure it would still qualify as plagiarism in school. Are you implying that it would not or should not be a copyright violation in that case? I have no idea legally, but I would assume the worst.


The answer is, it's complicated, but:

> If I removed comments and/or translated the code in a literal manner, I'm sure it would still qualify as plagiarism in school.

Plagiarism and copyright violation have nothing to do with each other. Academics routinely copy large segments of text and rely on fair use exemptions to avoid breaching copyright. Meanwhile, copying a couple of innocuous sentences can rise to plagiarism when it would not be a substantive copyright violation.


> If I removed comments and/or translated the code in a literal manner

That is copyright infringement. See SAS Institute, Inc. v. S&H Comp. Sys., 605 F. Supp. 816 (M.D. Tenn. 1985)


> A recipe can not be protected by copyright.

Could some pro-IP person help me reconcile following statements:

1. If there was no copyright, "nobody" would write books/create art, thus we absolutely need to have copyright

2. Recipes have no copyright, but we are flooded by old and new recipes all the time.

If you claim new recipes are somehow less work than a daily comic strip, and thus need less protection, please ask a chef at a Michelin-starred restaurant how much work it takes to come up with new recipes.


> This is one of the reasons that online recipe pages have turned into long personal stories with (incidentally)

One of the most annoying features of a cooking site. A ten-page story about your childhood in Michigan is not necessary for a peach pie recipe.


it is super annoying. And then when you get to the bottom, the recipe itself is an animated gif


And I don't mind a bit of story telling. Or background on why something is done a certain way etc. But we don't need a memoir


> That means that legally, you could buy a ton of recipe books, and then make your own by copying and pasting just the ones you like. You could use the recipes unchanged...

At the scale of Google Books, we're not talking about using copyrighted recipes unchanged or copying and pasting. If you buy a ton of cookbooks, scan them, make them searchable by the world and derive new information about cooking from analyzing them, can you profit from the derived knowledge? Should that change if the publisher or author made an active decision not to make their content available electronically, or asked you to exclude them from your analysis?


In theory, if you stripped out absolutely all of the content except for the bare ingredients-list-and-procedures of the recipes (possibly you would also have to drop the names), you could make a recipe book called "all of the recipes ever published in any of the 140 most popular languages" and you'd be fine.

You could then make your recipe book searchable, sure.


This is super interesting to me, and was not something I knew about before. This is probably a naive train of thought, but can recipes be patented? It would be very interesting if not. Because in my mind a recipe patent would be very similar to a software patent. It's describing a way to put code together to produce a desired outcome.


I'm not an IP lawyer, I just play one on the interwebz, so don't pretend like this is legal advice, but:

Yes, technically, you could patent a recipe, provided that it was sufficiently novel and sufficiently non-obvious to a practitioner of ordinary skill in the art.

An ordinary recipe that only takes well-known ingredients and combines them in well-understood ways, applying well-known techniques is going to have a difficult time passing either the novelty or non-obviousness tests.

I would anticipate that, if your recipe-based patent application were to prevail, that your recipe would need to include some preparation steps that are themselves new and unusual. For example, if you described a novel method of processing an ingredient, or a novel way of combining two ingredients that relied on some previously-unexplored aspect of their chemistry (such as making use of the small ash content of coconut milk products, for example)


Taco Bell successfully patented the hexagonal tortilla folding pattern they use for Crunch Wrap Supremes.


That's fascinating. Was that a design or utility patent?

Oh dear me, it was a utility patent.

They spent years trying to get it approved and ultimately abandoned the application (presumably realizing they were never going to get it)

Application #US20080020106A1


Funnily enough, because of this similarity, the un-patentability of a cooking recipe was a frequently used example to highlight the absurdity of patenting software in the campaign against software patents in Europe.


"A recipe can not be protected by copyright"

And how many introductions to programming compare a program to a recipe? Yet programs are copyrightable. If you're using logic, you already have lost.

I found an amusing web page today - someone wrote a book on programming for Windows, and also has a website, and on a particular page, they have a helpful snippet of code, which is really just a wrapper/reference to a Windows system call. Literally one line, no additional logic. However, they have ~5 lines of copyright declaration above it, saying it is theirs and you can only use it if you buy their book.

I thought this was really funny, given that they are essentially claiming a portion of the API defined by Microsoft just because they wrote a (pretty much the only possible) line that accesses it. It seems to be a controversial area of copyright recently.


I think for the same reason as the recipe, this one line calling of a Windows API wouldn't be covered by copyright. As I understand it, once the code is entirely functional and is unable to express any creativity you could argue it isn't covered by copyright as it exists as a fact.

I'm sure if you dig around enough you'll find really simple recipes that claim to be covered by copyright.


In fact, there is valuable and original content in writing that you should call this API function in those circumstances. But it's not embodied in the code. Once you've been to the page, it's in your head, and you can either remember it or write about it somewhere, neither of which, presumably, would be legally actionable.

It hardly seems like a coincidence that RMS has spent years advocating for sharing software by comparing it to sharing recipes.


When it comes to actual recipes, I find that most things I like come from a lot of repetition to adjust the details and figure out the pitfalls, so even if it's generally similar to what was published, a lot of value has been added from my perspective.

Code can be written deliberately in a kind of unfinished, unrefined way, and let those who would use it go through a similar process.

But the GPL seems significantly different from the absence of copyright.


This is an area of law which is still emerging.

In my opinion, it is emerging very foolishly and nonsensically, but it is still emerging, and so much of it is not completely settled.

Oracle V Google, for example, raised and then incompletely answered some questions like these


> if I understand correctly, collections of facts are

That's what a guy who made a book of trivia thought when he sued Trivial Pursuit for using all of his facts. Turned out, he was wrong.

https://en.wikipedia.org/wiki/Trivial_Pursuit#Fred_Worth_law...


I followed the link and didn't see where it said they used all his facts.


The defendants not using all of his facts distinguished this case from ones where the plaintiff won.

https://law.justia.com/cases/federal/appellate-courts/F2/827...

Worth's reliance on cases involving infringement of one directory by another, see, e.g., Leon v. Pacific Tel. & Tel. Co., 91 F.2d 484 (9th Cir. 1937) (telephone directories), or one list by another, see, e.g., Eckes v. Card Prices Update, 736 F.2d 859 (2d Cir. 1984), is not persuasive. In Leon, plaintiff's entire selection of names and numbers were copied and listed in numerical instead of alphabetical order. Leon, 91 F.2d at 484-85. In Eckes, the plaintiff published a list of 18,000 common baseball cards and selected 5,000 of those cards as "premium" cards; the defendant's listing selected substantially the same 5,000 cards as "premium" cards. Eckes, 736 F.2d at 860-61.


Creative work is stealing from others' work in a way that nobody can trace.

Current law restricts copying but doesn't protect ideas or facts.

Stupid laws are reaching the tipping point where they end.


I had a public repository of some of my notes and a gigantic chunk of those were highlights from books I've read that could be combined in a summary if one was inclined to doing so. I've stopped maintaining them because I didn't want to lose my GitHub account over that, though I don't even use it now since Microsoft bought it.


Use Google Books to sync your highlights automatically to a separate Google Doc for each book. Even the iOS app is superior to iBooks with the possibility to checkbox per book which should be available offline.


Nothing done by an algorithm is copyrightable either, as far as I know. It comes into the non-human "monkey selfie" category. The only copyrightable output would be taken directly from human-created copyrightable input, i.e., the original books.


Is this true? What are the extents of this? There are many movies that have tons of imagery created by algorithms that are surely covered by copyright. Is it enough for the algorithm itself to express creative expression? How about the inputs? Is any creative expression enough?


Copyright law is flawed.


> copyright doesn't cover facts derived from books

EU database right can and does.


Yes, EU database directive is relevant to local copyright law and can apply in certain cases (dictionaries is probably a very relevant example, and other types of books that are close to printed databases), but they do not apply in the vast majority of cases, for the 'stereotypical book'. For example, the database rights would not apply to extracting facts out of a physics textbook, and it would not apply to extracting a model of author's style from their complete works.


Contrary to your original comment there is lots of room for uncertainty, especially seen globally, in this area. While the copyright / fact extraction framing the author chose might not be the correct one in the years to come, this is certainly an area of volatility - at least here in the EU. Your comment w.r.t. "convenience of access" is pretty restricted here, no ML algorithm just boots up, grows legs, and magically walks into a public library to read books for fact finding. The process of extracting facts from books includes lots of steps that are subject to IP and other legislation over here (from crawling the content in the first place if your source is the internet, duplicating/storing the content if it is subject to copyright, ...).

While I'm not trying to suggest that your ultimate conclusion might not turn out to be correct, especially in the context of US legislation, that's far from settled and I expect to see quite a few lawsuits whenever this hypothetical actually affects a major player with some market power.


Are there any instances of that actually happening, in regards to a book being treated as a database?

It kinda makes sense, Guinness Book of Records, being an example.


Westlaw page numbers and Lexis citations. Carl Malamud has been involved in a number of struggles in the US. Mostly since Feist things have gone in the direction of intellectual freedom in the US and intellectual privatization in the EU.


"The coming IP landgrab of facts derived from books".

There probably will be such an attempt, making it at least as far as hearings in Congress. (Hopefully no further - hopefully no attempted legislation.)


Copyright law is gonna change pretty soon.


That complaint about books and stealing personally strikes me as deeply silly even by permission culture standards. The whole point of books is to learn from them. Proper summarization already separates plagiarism from original content (even if it is preferable to provide citations). It doesn't matter how it is derived - either the end product is fair use or it is effectively unauthorized publishing from including too much source content.

We should be rejoicing at the ability to have an assistant that digests the world's libraries not worrying that someone might make a profit off of it without permission.


But we won't have an assistant that digested the world's libraries. We'll have an advertising company gatekeeping the digitally digested world's libraries.

I think that's worth worrying about. As well, if Google, in their drive to monetize content that they don't own, causes the various publishers and IP owners to go on the legal attack, any other option/startup will be quickly dissuaded from building a similar, or better, assistant.


The HathiTrust is a partnership of the academic libraries involved in Google Books and other digitization efforts. They offer the Google-originated scans free to the public for works that are already in the public domain, and allow university members and research partners to access scans of books that are still under copyright.

https://www.hathitrust.org/partnership

They're not going to compete with Google in software development, but Google isn't the sole gatekeeper of the book scans.


Deliberately creating uncertainty around the copyright in ML-created works (through legislation), would be a low-key and indirect way of impeding the automation of creative work. Not that I'm advocating it.


I particularly like the author's hierarchy of information value. It applies to what individuals should be reading too. But I would probably separate blogs/articles into "clickbait" and "serious" and put the latter category equal with books. It's important to be very selective with your internet resources: most of the internet regurgitates information in a continuous and boring cycle, while select corners push novel content and engaging ideas.

What does the author mean by "CRS"? Coordinate Reference Systems?


Congressional Research Service? https://crsreports.congress.gov/


I'd tend to agree with this categorizing certain free blogs/web content above paywalled newspaper content. I can't really remember the last time information available from a news organization actually allowed me to change my behavior such that it helped me achieve any of my long term goals, or even modified my long term goals for that matter. Impacts of things like extreme weather and economic trends, and recently disease trends are about the only thing news is ok for, and even then, I find good comment threads give me a better general sense for the actual severity of things to come.


For weather tracking you could try windy.com or your local weather service forecast. Both are more engaging and beneficial than other ways of getting weather news!


I can absolutely assure you that FB and G's ad targeting algorithms are more complex and significantly more performant than "cluster and sort by click rate per keyword." For one, you rank ads on an eCPM basis, which includes bid, but leaving that aside, trust me, just because you don't know how ads systems work, that doesn't mean they're not working.
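For anyone unfamiliar with eCPM ranking, here's a minimal sketch of the textbook version for cost-per-click ads: expected revenue per thousand impressions is the bid per click times the predicted click-through rate times 1000. This is an illustration under stated assumptions, not a description of FB's or G's actual systems; the `Ad` class, the ad names, and all the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid_per_click: float   # hypothetical advertiser bid, dollars per click
    predicted_ctr: float   # hypothetical model estimate of P(click | impression)


def ecpm(ad: Ad) -> float:
    """Expected revenue per 1000 impressions for a cost-per-click ad."""
    return ad.bid_per_click * ad.predicted_ctr * 1000


def rank_ads(ads):
    """Order ads from highest to lowest eCPM."""
    return sorted(ads, key=ecpm, reverse=True)


# Made-up candidates: the highest bid does not win if its predicted CTR is low.
ads = [
    Ad("high_bid_low_ctr", bid_per_click=2.00, predicted_ctr=0.005),
    Ad("low_bid_high_ctr", bid_per_click=0.50, predicted_ctr=0.03),
    Ad("middle", bid_per_click=1.00, predicted_ctr=0.012),
]

for ad in rank_ads(ads):
    print(ad.name, round(ecpm(ad), 2))
```

The point is that the ranking depends on both the bid and a predicted click probability, so the quality of the prediction model matters directly, which is where the ML comes in.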

Here's a clear example of ML improving search: voice search. You might not use it, but it's extremely popular in India and other developing markets. "G search has gotten worse as they’ve focused on recency in the index, gotten more tolerant of synonyms, and gotten less strict about quoted phrases." None of these are "machine learning" - these are product decisions. If you wanted to say "Google is not an ML company," you'd point to the outsized human influence on search rankings (see, e.g. https://static.googleusercontent.com/media/guidelines.raterh...).

Google Maps is extremely valuable as a proprietary dataset, and we're all making it better whenever we do a captcha, doing object recognition from streetview. So are YouTube, News, Translate, and so many others.

There are so many papers detailing practical metric improvements from ML; https://arxiv.org/abs/1810.09591 is one ("Replacing the manual scoring function with a gradient boosted decision tree (GBDT) model gave one of the largest step improvements in homes bookings in Airbnb’s history, with many successful iterations to follow", and deeper neural nets offered significant improvements after that).


this airbnb paper is great, thanks for posting this


> This will do to non-fiction books what youtube did to music: drive down the price in ways that makes distribution only economical for low-margin platforms. It could give G a monopoly on the market and create a disincentive for production of new knowledge.

This is just tiresome. Wasn't piracy supposed to doom us all?

It's regrettable to see Google gaining more power, but the copyright cartel doesn't have a solid moral standing from which to complain.


My thought exactly. The damage that piracy was supposed to bring about never actually happened; a number of studies suggest the contrary.

Then you have simplistic and ridiculous statements like

> When you free something that belongs to someone it’s called stealing.

As if anything that was ever invented, written, done or created by a human being was done so in complete isolation and not based on what others have done before.


> Wasn't piracy supposed to doom us all?

Piracy is what drove YouTube's monetization scheme, and there is a race to the bottom among musicians (which is certainly not only caused by piracy). So it may not be doom, but it's at least a PITA.


The opening bits about how “monetization of machine learning has been five years away for several years” remind me of how fusion has been twenty years away since at least the mid-fifties (so for about 65 years). And now I am wondering if anyone has ever looked at the track record of people saying “technology X will be usable in Y years” versus the actual amount of time things took to become usable, if ever.


I don't know any real-world figures, but there was a famous trope about science-fiction that only 10% of it ever became true, but what little becomes real is beyond what anybody had imagined.

Example: Asimov foresaw computers, robots, etc. but surprisingly (given his brilliance) failed to foresee networks.


We are bad at predicting the future. So what? If something is desirable, we should continue to work towards it, regardless of when it will be done. OK, a company or project or country can go bankrupt, or the original people working on it may pass on in the meantime, but progress is progress and sooner or later someone else will build on it!

Soldier on, till death and beyond! When did we become such suckers for short term results at the expense of the future...


Reminds me of the Popular Mechanics magazines promising new battery technologies any day now. Similarly, battery technologies that show great promise in the labs of 2015/16/17/18/19 are still 'soon on the market, any minute now'.


That's entertainment though; ever since I was a kid (child of the 1980's) I've never heard a single physicist confirm that 'new' battery tech, operating through different first principles, was anywhere in sight for commercial use.

All we do is refine the materials, construction, and designs based on the same core principles (this optimization has yielded decent but not paradigm-changing results, unlike what a "new S-curve" would entail). The perceived increase in "lasting power" of our portable devices compared to 40 years ago was largely due to Moore's Law, that is, optimization on the consumption side of the energy equation, not the source.


The first commercial lithium ion battery dates to 1991:

https://en.wikipedia.org/wiki/Lithium-ion_battery#Commercial...

The lithium ion battery has incrementally improved energy density every year since its commercialization (see Figure 4):

https://www.intechopen.com/books/ict-energy-concepts-towards...

Most breakthroughs or projected-breakthroughs that get Popular Science articles written about them are overstated or never materialize at all. But battery technology is improving over time. It's not improving at the pace of microelectronics, but hardly anything improves that quickly.


OK, so my perception may be wrong indeed.

I'm really nitpicking the theoretical aspect here, I guess. Is lithium-ion really different from a first-principles perspective? Is any battery technology not based on electrolyte principles?

See, when I look at a vapor engine, and compare it with an electrical engine, I really do have two different first principles driving motion, two different conversions of energy. When I look at a regular / convection oven (heated resistors) and compare it with a microwave oven, again two fundamentally different ways of heating a solid. Magnetic induction compared with thermodynamic heat conduction. X-ray compared with MRI. All of these are breakthroughs, using different first principles to complete the task.

I fail to see how battery technology is not all based on one and the same fundamental principle. Quoting your second link:

> Batteries are electrochemical devices that store electrical energy by directly converting it to a chemical form.

I'm not sure we were talking about the same thing, because I read your comment and it seems to fit my view; it actually substantiates it. Am I misunderstanding these concepts?

Edit: FWIW, I was a heavy daily user of portable music devices in the 1990s, and while new battery tech gave you an extra hour or so every iteration, none of it was life changing. It was a slow increment, not orders of magnitude, which is a sign that we're operating on the same principles, just with more efficiency. My point was that current battery life is fantastically aided by improvements on the consumption side, much more than on the source side. I'm not claiming there are none on the latter, not at all.

Edit 2: TL;DR: I believe there is no fundamentally new physics in "battery" (storing energy), it's been the same thing for centuries (and reportedly was invented but not used in Ancient times). Unlike many other technologies like engines, ovens, body imaging, etc. Please don't hesitate to teach me more.


Batteries are always based on chemistry. But since you said that portable device batteries today are little-changed from 40 years ago, I wanted to point out that the lithium ion battery is newer than that, and still improving.

That's why cordless saws and leaf blowers are practical now but weren't practical 40 years ago. Better batteries made them work. They didn't benefit from Moore's Law.


Ah, fair point about these tools. I think I understand better what you meant. Point very well taken, and thanks for the informed perspective.


This is broken thinking.

> If I’d published a non-fiction book in the last 100 years I’d put $10 right now into a class action to prevent this product from hitting the market.

Authors are 1/100th of a percent of the population. If we do create new ways for people, whose entire lives and minds are derivative of millennia of civilization, to own facts observed in the world around them, the primary funders and beneficiaries of such a change would be an even smaller class of people who collect much more of the sweat of the author's brow than the author ever will. The proper response to this is voting out of office any bums who vote for it. If this doesn't work the next step is the guillotine.

> Google will force us to create a new format for information by removing the profitability from the existing one.

The fact that actual scarcity is giving way to plenty in no way suggests that we ought to fight to impose artificial scarcity for the dubious privilege of ensuring that leeches can keep profiting in order to keep a minority of the money filtering down to the people who do the actual work. Perhaps we ought to discover a way for everyone to profitably enjoy the greater bounty instead of glorifying working for a living.

> It’s not like they didn’t tell us they were doing this. Their mission statement was to ‘free the world’s information’. Small wonder they don’t understand privacy. In this case we’re talking about information that’s protected by IP rights. When you free something that belongs to someone it’s called stealing.

Our inherent emotional reaction to real scarcity, based on the rivalrous nature of physical goods, is a poor foundation on which to build a case for inventing new rights designed to divvy up the world for the benefit of the rich. I'm sick unto death of hearing proponents of new and inventive varieties of imaginary property describing circumvention of their imaginary rights as "stealing". There are no words in keeping with the dignity of this site that I could use to aptly describe my feelings about the author's words. People like him are emphatically the enemies of the people.


EXACTLY! The article is infected with a really vile and evil way of thinking that is almost the equivalent of the "but think of the children" argument for justifying privacy invasion and censorship, but in the field of IP...


The system described by the author of this post actually already exists, and was indeed created by Google:

https://books.google.com/talktobooks/

It's just really not that good (yet)...


Pretty sure that ML is adding billions of dollars of value in ad ranking


The author is confused. The thing he's talking about is actually Wikipedia. Facts, mostly derived from books, is exactly how he describes Wikipedia. And indeed, it was transformative for education, and it did cause us to question parts of education.


No

There's lots wrong with this article

> Google has only one valuable proprietary dataset

They own none of the map/street view data?

Google already answers questions using books.

And if you too want the world's book repository you can just download it, just illegally. There are big megapacks of it. Way better data than Google Books. The only thing Google Books might beat you on is original documents for history.

But books are almost dead, maybe a decade?

ML with human assistance will be able to pop out quality books quite easily. It'll still take a human, but it'll do in a month what used to take years.


> But books are almost dead, maybe a decade?

Paper book format maybe is.

"Long form content" is what "book" means now and it is not going away.


This article makes no sense. Quality non-fiction books have always cited other quality non-fiction books and this is a Very Good Thing.


I wish I shared your certainty but ours is the age of the rent seeker. They don't care about 'Good.' They want to get paid and if there is any remotely feasible way to impose themselves they will, tradition be damned.


Maps are collections of facts about road locations, and if I write directions based on a map, doing so doesn't infringe on the map producer's copyright.

But if I'm starting a map company, and I scan in and trace the roads in my competitors' maps? I'd say that's less clear cut - and may well be copyright infringement, even though I'm extracting facts from their publication and creating a new publication containing the same facts.

If I use an entire copyrighted book to train an AI, is it more like the first example, or more like the second?


Maps and charts have been statutorily subject to copyright in the US since the Constitution was ratified. Facts, on the contrary, have been protected from copyright.


I always liked that many of the ancients' words are only known because somebody wrote "Plato tells us that Socrates said...", which is, in context, pretty much what monetising the actual semantic intent of those scanned books would be.


Where do "facts derived from books" come from? Archives and original research. Wouldn't it make more sense, and be economically (and legally) more defensible, to index those primary sources rather than books?


It's odd that the post mentions Wikipedia, as it is a counterargument to the central premise.

I.e., anyone contesting book fact aggregation would already have had to sue and win against Wikipedia, and Britannica before it.


If the cost of books went down I'd happily start buying them again. Until then, I can't justify paying $150 for non-fiction that will be outdated in 2 years.


The author makes a side note about getting access to Gmail. Aren't there at least a handful of third-party services that require access to the content of Gmail?


People already mentioned Maps, but isn't search history Google's most valuable proprietary dataset?


IANAL, but this seems like a good video response topic for a copyright attorney such as Leonard French.


I mean, it's not impossible for a competitor to redo the archiving of all books. Doesn't Amazon own all the books?


Good job!



