"Overview of SHARD: A System for Highly Available Replicated Data" it's the first paper to introduce the concept of database sharding. It was published in 1988 by the Computer Corporation of America.
It is referenced hundreds of times in many classic papers.
But, here's the thing. It doesn't exist.
Everyone cites Sarin, DeWitt & Rosenb[e|u]rg's paper but none have ever seen it. I've emailed dozens of academics, libraries, and archives - none of them have a copy.
So it blows my mind that something so influential is, effectively, a myth.
Huh. OK, here's something that might be interesting. I found another paper[1] that cites SHARD, but the citation is slightly different. Instead of being a CCA memo, it shows it as a Xerox memo:
Sunil Sarin, Mark DeWitt, and Ronni Rosenberg, "Overview of SHARD: A System for
Highly Available Replicated Data," Technical Report 162, Xerox Advanced Information Technology (May 1988).
EDIT:
OK, I think I get this now. I had read the Wikipedia blurb about CCA being acquired by Rocket earlier, but only just now did I keep reading further down to find this bit:
in 1984, CCA was purchased by Crowntek, a Toronto-based company.[8] Crowntek sold Computer Corporation of America's Advanced Information Technology division to Xerox Corporation in 1988.[9] The balance of CCA was acquired by Rocket Software, a Boston-based developer of enterprise infrastructure products,[2] in April 2010.
So it seems like the portion of CCA that would be of interest here, is probably the bit that sent to Xerox. Maybe somebody at Xerox could help turn up the missing document?
I doubt it will help, but I took a stab at pinging them on <strike>Twitter</strike> X.
I found LinkedIn profiles for Sunil Sarin, Mark Dewitt, and Ronni Rosenberg who all worked at CCA during this time period.
I've gone ahead and sent them each a message asking if they might be able to make the paper available.
If you'd like to get in contact with them yourself and are having trouble finding their LinkedIn, shoot me an email and I'll be happy to provide you links.
> Yes, I was involved, 35 years ago! I believe it was an internal CCA paper. I don't have a copy and I have no idea how to get it. Sorry about that. It does seem to be the earliest reference to "shard" in the DB context. (The other early reference pointed to in Wikipedia is from much later, 1997.)
> Fortunately, you need not go back 35 years to read about sharding; it's easy to get current info. Cheers.
I've now sent a message to Andy Youniss, CEO of Rocket Software to see if he can help.
Sunil Sarin, Mark DeWitt, and Ronni Rosenberg, "Overview of SHARD: A System for Highly Available Replicated Data," Technical Report 162, Xerox Advanced Information Technology (May 1988).
2. Per Wikipedia[1] Crowntek sold Computer Corporation of America's Advanced Information Technology division to Xerox Corporation in 1988.
To me this suggests that it was the "Advanced Information Technology division" specifically which would have had the paper in question, and that bit of CCA wound up with Xerox.
That said, it can't hurt to reach out anybody connected to this in any way. You never know who will wind up "knowing a guy who knows a gal, who knows a ..." or whatever.
I also sent a note to another former Computer Corporation of America employee that I found on LinkedIn. I don't know them personally, but we live near each other and have some common connections, so maybe they will at least receive my unsolicited message with some favor.
FWIW, I did get a response from the individual mentioned above. However they were unable to help with finding a copy of the paper. The search continues...
Someone who isn't me is getting in touch with a computer science librarian at a top 10 university about this paper. I'll update here with whatever I (don't) learn.
Going through the bibliography of other people's papers and theses, looking for papers that you better cite "for good luck", or because "you gotta cite that one" is a classic PhD student behavior (I've done it) and it's not terribly surprising that something like this can happen. In fact I'd expect it to be much more widespread...
Feeling the need to cite a particular work out of convention or for social reasons is understandable and very common AFAIK, but I'd consider it a part of academic rigour to at least take look at the work one is citing. Blindly citing without ever even laying one's eyes on the work doesn't sound quite right.
Of course if nobody can get their hands on a particular work, as seems to be the case here, that makes things kind of hard. But I'd expect most works you need to cite in a fast-moving field such as CS to be available at least somewhere, even if it takes a bit of effort.
One of the only notes I got from my MS thesis defense was one of the professors being annoyed that I had cited a result from someone else's paper that he had reported (effectively the same but derived differently and less conclusively) in one of his own papers. I added a note referring to his result and a citation to his paper and everybody went home happy.
Sometimes it's just an attempt to do the due diligence of citing the primary reference rather than the reference that cited the reference that cited the primary reference. I experienced this recently with a very widely cited basic fact in a very hard to come by technical report from the early 80s. I'd have bet dollars to cents it did in fact contain the statement everyone claimed it did.. however, just in case I made an inter-library loan request to actually read those couple sentences.
> Sometimes it's just an attempt to do the due diligence of citing the primary reference rather than the reference that cited the reference that cited the primary reference.
True, but in this case you should include both references.
Funny, in my high school literature class I clearly remember being chastised for having sources in my works cited; but not warranting their inclusion with an actual reference in the work.
It's kind of wild that LLMs and other models/sequences will be able to quickly suss out which papers have high levels of referential integrity.
>which papers have high levels of referential integrity
The problem with this, as I see it IMO, is that there could be references that are cited due to their influence on the thought process/writing process of the work - thus citing them gives contextual zeitgeist - and this is something that AI would not be able to muster...
SO a LACK of referential integrity should show that it is written by a human as opposed to an AI.
While it may be fair to say the number isn't "hundreds" (or maybe it is?) I will say that I'd take that Google Scholar number with a grain of salt. Just poking around looking for stuff in the spirit of this sub-thread, I've found 4 or 5 additional documents that cite the SHARD paper, and which aren't on that Google Scholar list.
I've found that Google Scholar's coverage gets a little sketchy on older stuff, and since we're talking way back in the 1980's here, I don't think it's surprising that some things are missing.
(Replying directly to increase the chances you see this.) I received a couple more responses, from Sunil Sarin and from Mark Dewitt. Sunil had this to say, which was certainly surprising to me:
> John, I'll have to look in my hard copy archives, which are very disorganized and will take me some time. But I don't believe you need this exact paper. I can probably point you at other papers that were in fact published (and are not just company TRs).
> Furthermore, our concoction of the acronym SHARD is different from "database sharding" as currently used. We were referring to the network, not the data, being "sharded", i.e., partitioned, and ensuring high database availability (with some loss of consistency - see "The CAP Theorem").
Mark Dewitt shared:
> Hi John, I believe that was an unpublished report we wrote for our DARPA funding agency. I may still have a copy, but it's currently buried under some stuff in the garage. There is at least one published paper that describes the SHARD architecture for a partially replicated database. It's not entirely obvious because I think the SHARD acronym didn't start getting used until after the paper was published in 1985. By 1987, Sunil Sarin had published another paper with Nancy Lynch in which they refer to SHARD and reference the 1985 paper.
> Here are the two citations:
> S. K. Sarin and N. A. Lynch, "Discarding Obsolete Information in a Replicated Database System," in IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 39-47, Jan. 1987, doi: 10.1109/TSE.1987.232564.
> S. K. Sarin, B. T. Blaustein and C. W. Kaufman, "System architecture for partition-tolerant distributed databases," in IEEE Transactions on Computers, vol. C-34, no. 12, pp. 1158-1163, Dec. 1985, doi: 10.1109/TC.1985.6312213.
Happens far more than you think. It can be an innocent (sort of) mistake, where authors see the citation in a previous paper and simply copy it into their own.
This feels like something that would, at large scale, be unhealthy for science as a whole. While existing papers have already gone through their own quality checks, this enables bad, misleading, or false statements to propagate which can end up being a blow to the credibility of the entire model. Shouldn't there be an ethical duty to due one's due diligence?
There's a crowd who tilt the other way -- if I might possibly have hinted at the idea before you then it's borderline malpractice to not reference me. In many fields it's common to directly reference what appear to be the bigger transitive references then, even if they didn't directly influence this work in particular. I'd personally want to see a twidge more evidence before bringing out the pitchforks.
Funny. I read your previous comment (with the ?! ending) as sarcastic. Now I see you were serious. I would be astonished if most authors have actually read even a fraction of the sources they cite.
This is really interesting, thanks for posting it here!
On a semi-related topic, I love mysteries like these - mystery songs, those Japanese kanji in Unicode that nobody knows what they mean or where they came from, paper towns on maps.
If anyone else has anything else to read along similar lines, please post it!
Here's something that seems related. Maybe one of these authors would have a copy of the other paper? Not sure if they would be among the set of folks you've already tried or not...
It's a good question. I mean, it's not entirely out of the question that the UO folks independently developed the same term without being aware of the SHARD research. But OTOH, it's entirely possible they were aware. Without talking to somebody that was there, I doubt we'll ever know for sure.
I guess I'm not too surprised, this seems like a corporate tech report. Some companies were good at having public archives of these (like Bell Labs) but I'm sure it takes a lot of resources to keep that up. It's essentially some company's internal Wiki page.
1988? As far as anyone can tell, the use of the term "shard" in the context of database replication originated with Ultima Online, which was released in 1997, and which used the term in connection with its underlying mythos (the idea of representing world instances as shards of Mondain's shattered gem).
So a documented reference to sharding that's earlier than that would be interesting to see.
(Disagree? Instead of downvoting, consider posting a citation that actually resolves to a real paper.)
You can find reports from before 1988 mentioning SHARD being in development, like the one from June 1986 linked in this sister comment: https://news.ycombinator.com/item?id=36849634
Some more, from 1989[1][2][3]. Which again, reference the missing "SHARD" paper, but contain enough detail to make it clear that the idea of SHARD existed, regardless of the status of that particular document.
"SHARD" is the name of the software - it was common back then to name systems using acronyms. It's not clear whether the paper/report actually uses the term "shard" in the sense that it is now used in distributed systems, or even whether it uses it at all.
One of the related papers I stumbled across, while not the SHARD paper, does go into a fair amount of detail about SHARD and the problem they were trying to address. One bit of verbiage here might be illuminating:
The new SHARD) (System for Highly Available Replicated Data) system under development at Computer Corporation of America (CCA) is designed to address the problems described above. It provides highly available distributed data processing in the face of communication failures (including network partitions). It does not guarantee serializability, nor does it preserve integrity constraints, but it does guarantee many practical and interesting properties of the database.
The reader is referred to [SBKJ for a detailed description of the architecture of the SHARI) system. Briefly the main ideas are as follows. The network consists of a collection of nodes, each of which has a copy of the complete database. (Full replication is a simplifying assumption we have used for our initial prototype, many of our ideas seem extendible to the case of partial replication, but this extension remains to be made.) Replication allows transactions to be processed locally, thus reducing communication costs and delays, and providing high availability.
So it sounds to me like their main concern was availability through replication, and not so much horizontal scalability (which seems to be more the "point" of modern day "sharding"). Yet I would probably claim that there is enough conceptual overlap to say that SHARD does relate to the modern use of sharding in some sense. Although it's hard to be sure without that original paper.
I don't know why parent comment is stirring up drama but:
1. Not available online doesn't mean the paper's existence is made up. It's a very bold claim to make for the authors that they cite work that is fabricated.
From the available information, this looks like a technical report by a, probably now defunct, company back in the 80s. If this was its only form of publication, and not on some conference proceedings for example, it would be only found available on select university libraries as a physical copy. But most important,
2. This isn't even as an impactful paper as the parent comment states. Or if its proposed concept is, the original idea is probably derived from some other paper that is indeed the one that is highly cited and most definitely available online.
Accumulative citations number from Google Scholar and IEEEXplore doesn't exceed fifteen for the particular paper though.
Not available online doesn't mean the paper's existence is made up.
True, but note that the post you're referring to does say:
I've emailed dozens of academics, libraries, and archives - none of them have a copy.
So this isn't somebody just saying "I couldn't find it with Google, therefore it doesn't exist."
From the available information, this looks like a technical report by a, probably now defunct, company back in the 80s.
Yeah, I think that's the key point. An internal technical memo from a private company, from that far back, isn't likely to be easy to find. It's quite possible that it's never been digitized and put on the 'net, and it it wasn't published in a journal, it may never have been archived by any university libraries or such-like.
That said, I'd be a little surprised if a copy didn't turn up somewhere, even if it means a former employee of CCA finding a copy in a desk drawer and providing it. But who knows?
It's funny in that it was such a pivotal work looking back at it. It's possible in some filing cabinet at Xerox or elsewhere (from what I gathered in the thread). That said, it may have simply seemed somewhat obvious to many at the time and nobody bothered to hold on to or archive it. I suspect it's true of a lot of now commonplace software design strategies.
> it may have simply seemed somewhat obvious to many at the time and nobody bothered to hold on to or archive it.
I suspect this has been the case for many especially last century, everything was moving so fast and there was so much choice, it really was a free for all with no major players dominating the market place and some people didnt know the significance of what they were doing or building. People then retire or get moved to different projects and the knowledge gets lost.
That's pretty interesting actually. Someone must have a copy somewhere. Seems like a real failure of scholarship if it's truly lost, and a serious argument against walled gardens-style publishing.
Have you contacted authors of papers citing the one you're looking for, especially of papers that appeared shortly after / in the 90s? Maybe one of them still has a paper copy lying around somewhere.
I was in a similar situation before with some math paper from the 50s that's nowhere to be found (neither online nor in library indices) and you'd be surprised how many professors still use paper copies.
Do Sarin, DeWitt & Rosenb[e|u]rg exist? Are they still alive? Tracking them down and going directly to the source would seem to be the way to go. Perhaps even enlisting some "big names" in the industry to ask around?
Integral Neural Networks (CVPR 2023 Award Candidate), a nifty way of building resizable networks.
My understanding of this work: A forward pass for a (fully-connected) layer of a neural network is just a dot product of the layer input with the layer weights, followed by some activation function. Both the input and the weights are vectors of the same, fixed size.
Let's imagine that the discrete values that form these vectors happen to be samples of two different continuous univariate functions. Then we can view the dot product as an approximation to the value of integrating the multiplication of the two continuous functions.
Now instead of storing the weights of our network, we store some values from which we can reconstruct a continuous function, and then sample it where we want (in this case some trainable interpolation nodes, which are convoluted with a cubic kernel). This gives us the option to sample different-sized networks, but they are all performing (an approximation to) the same operation. After training with samples at different resolutions, you can freely pick your network size at inference time.
You can also take pretrained networks, reorder the weights to make the functions as smooth as possible, and then compress the network, by downsampling. In their experiments, the networks lose much less accuracy when being downsampled, compared to common pruning approaches.
Going just by your description this sounds like they are doing operator learning. It's actually a very old idea. The proof that started operator learning is from 1988 I believe. Mathematicians have been playing around with the idea since 2016 at least.
Indeed, this seems closely related, thanks for the pointer!
Unfortunately I'm not deep enough into the topic to understand what their contribution to the theory part of it is. (they have some Supplementary Material in [INN Supp]). In the discussion of the Integral Neural Networks (INN) paper, there's this paragraph about an operator learning publication:
"In [24] the authors proposed deep neural networks with layers defined as functional operators. Such networks are designed for learning PDE solution operators, and its layers are continuously parameterized by MLPs only along the kernel dimensions. A re-discretization was investigated in terms of training on smaller data resolution and testing on higher input resolution. However, the proposed framework in [24] does not include continuous connections between filters and channels dimensions."
Also the weight permutation to perform the resampling on pretrained networks in INNs seems to be novel? And I guess it doesn't hurt that they're bringing new eyeballs to the topic, by providing examples of common networks and a PyTorch implementation.
Great understanding of the work!
I will add more details about INNs.
* In fact, INNs concept opens possibility to utilise differential analysis for DNNs parameters. Concept of sampling and integration can be combined with Nyquist theorem (https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampli...).
Analysing the FFT image of weights allows to create the measure of a layer capacity. Two different size DNNs can be equivalent after conversion to INN because max frequency is the same for both networks.
* Tuning the integration grid is actually first steps for fast knowledge extraction. We have tested INNs on discrete EDSR (super-resolution) and have prune without INN training in 1 minute. We can imagine situation when user fine-tunes GPT-4 for custom task just by integration grid tuning simultaneously reducing number of model parameters keeping only important slices along filters/rows/heads etc. Because of smooth parameters sharing new filters/rows/heads include "knowledge" of neighbours.
* Also interesting application is to utilise integral layers for fast frame interpolation. As conv2d in INN can produce any number of output channels i.e. frames.
Nice. I was wondering if something like this is possible a few days ago. The next step would be somehow extending the discrete->continuous concept to layers.
After finally learning some complex integrals/residue theory and seeing the connection to continuous and discrete signal processing I was very happy that the "magic trick" disappeared, your comment has me interested in pulling the string farther. Thanks!
Researchers bought up a bunch of seized phones from police auction sites and found about 25% of them were trivially unlockable and still held sensitive data about suspects and victims.
The result is obvious, but the question is demonstrably not. Good researchers know how to ask interesting questions that no one had bothered to ask before. Seeing clever work like this makes me reflect and continually ask myself "What cool angles am I missing?"
Isn’t novelty also arbitrary and subjective? I recall when everyone seemed to think there was novelty in applying ML to X, where X was something very specific. “Look, it works here too!” I doubt that’s considered novel now.
I suppose recent is subjective, but this one totally blew my mind. It's a 260 year old academic publication, which is older than the USA:
Mr. Bayes and Mr. Price. “An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S.” In: Philosophical Transactions of the Royal Society of London 53.0 (1763), pp. 370–418. DOI: 10.1098/rstl.1763.0053. URL: https://doi.org/10.1098/rstl.1763.0053
Not only is this publication amazing because it is the source of conditional probability which is the basis for Bayes's Theorem, but a dated original of the document is available. The thing I find really beautiful is that it was submitted 2 years after the authors death, by his friend who found the material when going through his things. We will never know how much was changed prior to submission, but I find it a really beautiful tribute to his friend and the co-author makes it clear that the attribution should go to his friend, and that letter still exists today (and it is a lovely read). I can't think of anything similar which has been maintained 260 years later, and still a somewhat useful academic publication. I felt privileged to cite this reference in my MSc thesis.
The authors show that a biological type laboratory ultracentrifuge can efficiently function as a near-universal isotope separator. Any element that can be dissolved as a salt in water -- the entire periodic table, excepting the noble gases -- can be enriched according to its relative mass. This can reduce the cost of refining certain isotopes like calcium-48 by orders of magnitude compared to the previous best techniques.
Left unsaid, but implied by its universality: the new technique is also a new approach to producing enriched fissile materials for nuclear reactors and weapons. It requires less chemical engineering sophistication than current processes which require production and handling of gaseous uranium hexafluoride.
Surely something about the technique would make it impractical for enriching uranium though... you're not going to be able to produce kilograms of 20% U235 with this are you?
Not with off-the-shelf biological centrifuges, no. I don't know enough about centrifuge design to judge how easy/hard it is to build a 60,000 G centrifuge that can handle liters instead of milliliters of liquid per cavity.
The other even more exotic possibility is to use something like this to enrich ordinary reactor grade plutonium to weapons grade plutonium-239. The amount of mass to process with plutonium is orders of magnitude less because spent fuel plutonium is already more than 50% Pu-239, versus 0.7% U-235 in natural uranium, and a bare sphere critical mass of Pu-239 is only 10 kg vs 52 kg for U-235:
The United States considered enriching waste fuel plutonium to weapons grade in the 1980s, when it contemplated another big nuclear weapons buildup against the USSR, but the laser based separation technology to be used was much more complicated than centrifuge separation. The project ended shortly after the USSR dissolved. It was called the Special Isotope Separation Project.
The 1988 environmental impact statement for the project gives some background information:
Could someone knowledgeable describe the differences between this new method and “traditional” enrichment of nuclear isotopes? If I’m not mistaken the latter also uses very high RPM rotation, and I believe I have read that the hardest part around managing it is the needed precision/balance of rotating something heavy at such high speed — a tiny imbalance can cause a wobble and make the whole thing break.
Hypothetically, if the technique can enrich U, Pu, or other fissile isotopes in decent quantities, you run into other issues. Stirring an aqueous enriched uranium solution is a great way to have a criticality accident (cf. https://en.wikipedia.org/wiki/Criticality_accident). That’s not to say it’s impossible to manage, just difficult. UF6, the current compound of choice for enrichment, is in the gas phase during the process.
Certain bacteria can directly assimilate a mixture of hydrogen, carbon dioxide, and nitrogen to produce protein. You could consider it an alternative to bacterial nitrogen fixation in root nodules with much higher productivity. Or you could consider it an alternative to the Haber-Bosch process with much milder reaction conditions -- ambient temperature and pressure. It's a way to turn intermittent electricity into protein with simple, robust equipment. I wouldn't be surprised if this or a related development ultimately supplants much of the current demand for synthetic nitrogen fertilizers.
We hear that eating vegetables is more efficient in ecological footprint than eating meat, since it cuts out the middle man. Is it yet more efficient to cut out the plants and get dietary protein from the bacteria that feed them?
Possibly, yes. This is the dream of so called "single cell protein" production. One type of SCP has been sold for years under the brand name Quorn (derived from a fungus rather than bacteria).
Bacterial protein may trigger allergic reactions in people and bacterial biomass is purine-rich which can also be a problem for people prone to gout. It's possible that cell engineering, directed evolutionary selection, or additional post-growth processing can minimize these problems.
I personally think that the more likely path is using fast-growing bacteria as feed for animal agriculture or aquaculture. Solar panels are so efficient at sunlight conversion compared to plants that you could farm salmon protein starting from bacterial pellets grown on solar derived hydrogen with per-hectare productivity comparable to conventionally farming soy beans. But the solar farm can go on saline, dry, contaminated, or otherwise agriculturally useless land. And salmon has slightly greater nutritional value than soy protein plus significantly greater market value.
Plants use water, sunlight, and organic chemicals. In plants, protein is made from soil nitrates. Bacteria don't feed the plants, but they can fix nitrogen from the air into the soil. If we really wanted to cut out the middleman, we'd turn the nitrogen we breathe directly into proteins. For now, we have to eat to obtain the essential amino acidd, Histidine; Isoleucine; Leucine; Lysine; Methionine; Phenylalanine; Threonine; Tryptophan; and Valine
We've been using the same API to communicate with our NICs since 1994. That API severely limits network throughput and latency. By simply changing the API (no new NIC) you can get 6x higher throughput in some apps and 43% lower latency.
> ... 2010. Before nobody believed that backprop can be GPU-accelerated.
When I was doing my master's in 2004-06, I talked to a guy whose MSc thesis was about running NNs with GPUs. My thought was: you're going to spend a TON of time fiddling with hacky systems code like CUDA, to get basically a minor 2x or 4x improvement in training time, for a type of ML algorithm that wasn't even that useful: in that era the SVM was generally considered to be superior to NNs.
So it wasn't that people thought it couldn't be done, it's that nobody saw why this would be worthwhile. Nobody was going around saying, "IF ONLY we could spend 20x more compute training our NNs, then they would be amazingly powerful".
It's easy to see in retrospect, but hard in prospect: the original paper [1] on GPU acceleration of NNs reports a measly 20x speedup. Assuming a bit of cherry-picking on the author's side to make get the paper published, the 'real-world speedup' will have been assumed by the readership to be less. But this triggered a virtuos cycle of continuous improvements at all levels that has been dubbed "winning the hardware lottery" [2].
[1] K.-S. Oh, K. Jung, GPU implementation of neural networks.
I went to a talk on "general purpose GPU programming" at the Colorado School of Mines around 2001 that covered exactly that topic. It was very disappointing to have my interest in FPGAs for this purpose be so entirely destroyed by a quirk of graphics card design.
Hinton also addressed the contribution of hardware performance advances to practical deep neural net applications in his talks in the mid-2000s.
As recent as 2012, I was working for a major US investment and financial services bank (one of the very big ones). An algo dev sent his resignation email to a mailing list with all technologists at the firm, stating he was resigning because they weren't taking FPGA's seriously. Yes, it was very unprofessional, but his statement about "let's see who is proved right" seems to be accurate.
That's cool and all, but the one thing that really made it possible to train deep neural nets was the use of backpropagation, and its polynomial time complexity.
By contrast, there are no known polynomial time algorithms for program synthesis and the standard approach is to search some large combinatorial space [1]. That's the case for all the classical approaches: SMT, SAT, planning and scheduling, etc. At the same time there's very powerful heuristics for all the other classical problems that can solve many problem instances efficiently.
____________
[1] The one exception to this is Inductive Logic Programming, i.e. the inductive synthesis of logic programs, for which we do know a polynomial time algorithm (but that is my work so I'm not pimping it here).
Thanks, wow, that's a great question to check my assumptions. I turns out I don't know where my claim that backpropagation's time complexity is polynomial comes from! I am repeating what I know from the time of my Master's in 2014, when I studied neural nets. It was most likely something discussed with my tutors during the course (I had some excellent tutors).
In my defense, it certainly doesn't seem to be just my own, personal belief. For example, here are lecture notes from a course on backpropagation, which conclude with a sketch proof of a linear time complexity of O(|V| + |E|) for the computational of a partial derivative over a network with V units and E connections, if I got that right:
There's also other similar calculations I could find floating free on the internet. Those generally seem to look at the complexity of calculating partial derivatives by automatic differentiation, basically.
Then there's all the scholarly articles that claim that training neural nets by backpropagation is not efficient (which is not the same as claiming than automatic differentiation is not efficient) and that the corresponding decision problem is somewhere in the class NP, or worse. I found a whole bunch of such papers while investigating my assumption of polynomiality just now, thanks to your question, and I even found some of them on my hard drive, which I didn't remember!
Well, it seems there is an ongoing debate, continuing all the way to date, and that goes rather further back than the Blum and Rivest paper. Most results I could find seem to be on the side of NP-completeness (or worse).
However, all those results must be examined side-by-side with the irrefutable empirical evidence that training deep neural nets with thousands of layers and millions of units, on lagre datasets no less, is common in practice.
I think the answer lies in the observation that, to compute any complexity result, one must make certain axiomatic assumptions about the structure of the machine (in the abstract sense) that they are investigating. These assumptions may well differ from the assumptions made in common practice for the same general kind of machine. So for example, it seems that the Blum & Rivest paper was criticised, even dismissed as irrelevant, at its time because it assumed a discrete activation function, when continuous functions were already (1992) the norm.
Especially with neural nets it seems that the wild variety of architectures makes it very hard to derive results with general applicability. The following letter to the editor of the journal Neural Networks from 1997, makes this point, and also notes that a popular assumption that a result applying to a simpler architecture can be generalised to more complex architectures is contradicted by the observation that adding more layers or units can _sometimes_ improve the efficiency of training, counterintuitively (though that is not, itself, an assumption that holds in general):
Unfortunately this is behind a paywall, but email me at the address associated with this github account: https://github.com/stassa and I can send you a copy (totally clandestinely! We'll break the law together :)
The bottom line is: I have no idea whether my claim above about the polynomial time complexity of backpropagation is right or wrong. _But_ it is certainly possible, _in practice_ to train deep neural nets on large datasets, _and_ this was possible even before the use of GPUs was common.
Which is not to say that hardware was not a very important factor in the current domination of deep neural nets in AI research. Hardware- and data.
I'm working on an inductive logic programming algorithm, and from my understanding the search space for logic programs is just as vast as other types of program synthesis. Do you have any information you can share about the polynomial time algorithm?
(Not that my LaTex formatting is anything to write home about!).
To clarify, the algorithm described in our paper doesn't reduce the size of the search space for logic programs- it sidesteps it.
Note I'm the corresponding author in the paper and my email is in the Springer version (top of the page). I'm always happy to answer questions about my work.
Can you point me to papers with reproducible benchmarking that achieves big speedups on those?
Modern GPUs are GP-GPUs: where GP means "general purpose": you can run any code on GPGPUs. But if you want to gain real speed-ups you will have to program in an
awkward style ("data parallel"). I am not aware of GPU acceleration of the work-horses of symbolic AI, such as Prolog, or SMT solving. There has been a lot of work on running SAT-solvers on GPUs, but I don't think this has really succeeded so far.
I think we're conflating two things: shallow/classic ML is not symbolic AI. I'm not sure "ML" even encompasses anything "symbolic"; I see symbolic AI and ML as subfields with little overlap.
I'm not saying symbolic AI has been GPU accelerated in the past, but that non-deep ML has been.
Back when I took AI courses in the early 90s, ML was anything that was trained by data. It did not refer exclusively to dense numerical or statistical methods. It included decision trees, which I think of as being closer to the symbolic camp...
There is no agreement on the exact meaning of ML and AI. They are often used interchangeably. And for good reason, because it's all about getting computers to learn. We should not squabble about semantics.
Can you point towards papers that report substantial GPU acceleration on what you call "non-deep ML"?
You're certainly right that AI/ML is often used interchangeably, though I think the distinction is pretty clear among practitioners: AI is the big circle, ML is a subset of AI, deep learning is a subset of ML. Symbolic AI is also a subset of AI, though I'm not familiar enough with it to say how much it interects the others.
So ML is AI, but just a small part of it.
As for papers, here's one showing GPU speedups for the 3 big GBM packages: https://arxiv.org/pdf/1809.04559.pdf. Their setup shows a 3-7x speedup for XGBoost.
The parent comment is talking about SVM, which is not a form of symbolic AI. SVM and other kernel based algorithms contain naturally data parallel steps like summing f(z-x[i]) for some kernel function f and all sample data x[i]. These will work great on GPU once you have a few thousand data points.
> Deep learning took off precisely when the ImageNet paper dropped around 2010. Before nobody believed that backprop can be GPU-accelerated.
Deep learning kicked off with RBMs because you didn't have to do backprop and there was a training algorithm called "contrastive divergence". Each layer could be done in turn, which meant you could stack them way deeper. In ~2008-2009 I implemented Hintons paper on GPUs, which meant I could do the same scale of thing that was taking weeks in matlab in hours on a gpu, and then there were lots of gpus available on the cluster in the uni. Lots of fun cuda hacking (I'm just glad cublas was around by then). The original published learning rates/etc are wrong if I remember right, they didn't match the code.
Good question. All supervised learning is a form of search with three components:
- Specification: what are you are looking for?
- Search space: were are you looking?
- Search mechanism: how are you going through the search space?
Program synthesis is simply learning where the search space is syntax.
In deep learning, taking the ImageNet paper as an example, the specification a bunch of photos with annotations, the search space is multi-variate real functions (encoded as matrix of floats) and the search mechanisms is
gradient descent (implemented as backprop) with a loss function.
I think this paper uses regular expressions an example of how to search fast over syntax. It claims not to be tied to regular expressions.
Regexps aren't even Turing complete as far as I know, if whatever they have in their paper works for arbitrary programs it would be shocking. I'll give it a read.
*Edit*: The algorithm in the paper is a DP like algorithm for building regexes. They use a matrix, and it has all the potential strings to be checked on one axis, and all the potential regex programs on the other axis, and in-between values (the actual matrix values) are booleans saying whether the string matches the program. The algorithm builds the matrix iteratively.
I haven't understood how regex evaluation is done, probably directly, but obviously this algorithm is only for checking whether a particular regex program matches an output rather than general purpose synthesis.
We'll have to wait for AI chips to really scale genetic programming, GPUs won't cut it.
There is no magic in GP. It is just another form of searching the space of programs, i.e. program synthesis. The search mechanism is a local, stochastic search, known to be especially inefficient (for example you may hit the same program multiple times). What's good about GP is how simple it is, so it's a good starting point.
This seems like an early paper and agreed with the consternation.
The paper, in a modern context and based solely on the abstract and having been in the community, is chipping at the "uninteresting" part of the problem. Around that time, program synthesis started switching to SMT (satisfiability modulo theory) methods, meaning basically a super powerful & general SAT solver for the broad search ("write a wild python program") and then, for specialized subdomains, have a good way to call out to optimized domain solvers ("write a tiny bit of floating point math here"). The paper would solve what the regex callout looks like.. which is specialized. We can argue regex is one of the most minimal viable steps towards moving to general programming on GPUs. Except as a person who does SIMD & GPU computing, optimizing compute over finite automata is not general nor representative and I don't expect to change my thinking much about the broader computational classes. To be fair to the authors... back then, synthesizing regex & sql were hard in practice even for boring cases.
Separately, nowadays synthesis has shifted to neural (copilot, gpt), and more interesting to me, neurosymbolic in R&D land. We're doing a lot of (simple) neurosymbolic in louie.ai, and I'm excited if/when we can get the SMT solver side in. Making GPT call Z3 & Coq were some of the first programs I tried with it :) Till then, there's a lot more interesting low-hanging fruit from the AI/ML side vs solvers, but feels like just a matter of time.
Ironically, one of its few comparisons was FlashFill / Excel.. which had been going the way of NNs and now LLMs for years now. The paper largely skips measured and non-superficial comparisons, so I wouldn't draw comparative conclusions based on their writeup.
RE Coq/Z3 vs GPUs, I think about them as different kinds of things, and thus wouldn't compare directly. A solver like z3 encapsulates search tricks for tasks where brute force approaches would effectively run forever. Their tricks are symptomatically necessary -- effective GPU acceleration requires finding a clean data parallel formulation, and if a more naive algorithm is picked for that GPU friendliness, but at the expense of inflating the workload asymptotically, the GPU implementation would have high utilization, and high speedups vs a CPU version... but would still be (much) worse than Z3. Instead of comparing them as inherently different approaches, the question is if the practically relevant core of Z3 can be GPU accelerated while preserving its overall optimization strategies.
Which takes us back to the broader point of this thread of GPUs for synthesis... LLMs are successfully GPU accelerating program synthesis of more than just regex. It's amazing that half of github code now uses them! Closer to this paper, folks are making inroads on GPU accelerating SMT solvers that the more powerful (classical) program synthesizers use, e.g., https://github.com/muhos/ParaFROST .
That's a great yet narrow request, which is great for POPL and less for PLDI. As a reference, I believe starcoder does worse than GPT4 and has taken time to publish results on program synthesis that subsumes regex. The flashfill team has academic PL roots and would also be good to compare to, as I mentioned.
With prompt engineering being so easy, and naive search/refinement over them so easy too, not comparing via benchmarks, vs brush-off related work section, to these general solvers seems strange.
I find the most interesting papers I read all come from the same place: The National Library of Medicine https://www.nlm.nih.gov/
Here's some recent papers I liked:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8413749/ -- Lithium is used as a mood stabilizer for bipolar and in other disorders. The form of Lithium used in psychiatry is Lithium Carbonate. But other forms also exist. As a supplement: there is Lithium Orotate which some people use to help them sleep, deal with stress, and so on. This paper puts forwards the idea that Lithium Orotate is preferable to Lithium Carbonate due to lower quantities being needed for the same therapeutic results. Resulting in less side-effects.
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525098/ -- In bipolar disorder its known that there are abnormalities in the presence of brain derived neurotrophic growth factor (BDNF.) What's interesting about this is treatments for bipolar help to increase BDNF which may be something of interest to those who are into nootropics.
The papers on this site are honestly some of the best written, in-depth, and accessible works I've come across anywhere. There's enough information here to live a better life if you're willing to sift through papers. No joke.
Not only did we get a whole new type of Chess engine, it was also interesting to see how the engine thought of different openings at various stages in its training. For instance, the Caro-Kann, which is my weapon of choice, was favored quite heavily by it for several hours and then seemingly rejected (perhaps it even refuted it?!) near the end.
The super cool thing about MuZero is that it learns the dynamics of the problem, i.e. you don't have to give it the rules of the game, which makes the algorithm very general. For example, DeepMind threw MuZero at video compression and found that it can reduce video sizes by 6.28% (massive for something like YouTube)[2][3].
Curious if anyone else knows examples of MuZero being deployed outside of toy examples?
To be fair, it uses MCTS, which requires many simulations of the game. For this, it needs to know which moves are valid, and when a player wins or loses the game.
So it does need to know the rules of the game, but it doesn't need any prior knowledge about which moves are better than others.
Not quite, you can define an illegal move as losing the game, and winning/losing is a “meta-observation” - ie if the player wins/loses, you don’t invoke another search.
I think you meant "recently published paper" but others are bringing up old stuff. I've always enjoyed Einstein's (translated) papers. They are both precise and readable, and his reputation for genius is well deserved!
This one is interesting, "On the Influence of Gravitation on the Propagation of Light" you can read here: https://einsteinpapers.press.princeton.edu/vol3-trans/393. This was his initial stab at GR. It was a simple approach, later abandoned, that considered light to slow down in a gravity well rather than remaining constant and specifying that mass warps spacetime. I suppose I like the idea that he pursued an idea that didn't work, and was forced to do something far more complex. It's kinda relatable.
The book Einstein's Lost Key by German physicist Alexander Unzicker is all about that idea and its history. Unzicker argues that it's still a good idea, hasn't been falsified since it makes much the same predictions as GR, and we should look into it further.
I read about criticism of General Relativity, that I haven't verified so grain of salt please. The criticism is in that the equations are so complex that only the simplest limits of them are studied and so the theory is therefore not verified.
I wonder if there is any correlation with either political or socio-political volatility, e.g., bread and circuses, conspiracy theories, cults, political dysfunction, and an inability to double-blind test painkillers.
I am slooooowwly working on my own language that is planned to use work from both you and the Effekt team, so i may send you all an email in the future when i start having questions.
But I have really limited free time and only work on a language for really specific reasons. Understand, nothing else seemed to try to solve the problem i have at hand.
That said, if the offer is a solid long term employment at good salary, i am open to negotiations. My email is on my blog and my profile iirc
Lots of interesting facts are strewn around the paper.
——-
The first paper that squarely talked about the language resource gap in CS/ML. Before this came out, it was hard to explain just how stark the gap between English and other languages was.
Lost in Translation: Large Language Models in Non-English Content Analysis
“I run the world’s largest historical outreach project and it’s on a cesspool of a website.” Moderating a public scholarship site on Reddit: A case study of r/AskHistorians
More is Different by P. W. Anderson (1972)'arguing that “at each level of complexity entirely new properties appear” — that is, although, for example, chemistry is subject to the laws of physics, we cannot infer the field of chemistry from our knowledge of physics.'
The paper https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/ander...
> that is, although, for example, chemistry is subject to the laws of physics, we cannot infer the field of chemistry from our knowledge of physics
I mean we could, with infinite computing power and enough time to look into every interesting phenomenon (and to evaluate the corresponding multi-particle Schrödinger equation numerically) but there are simply too many such phenomena, Schrödinger equations are tough to solve, and quantum mechanics is also not a great level of abstraction for the reason you mentioned.
no, that is the point of the paper, even with infinite computing power you could not predict. it is a very strong case against reductionism in science. The phenomenological examples he considers in the paper cannot be derived from basic principles of quantum mechanics.
Hm, I can't say I really understand most of the paper, so I'm not sure if the paper actually implies this, but even then, I find this unconvincing.
Quantum mechanics allow for many things to happen at a macro level, such as magnetism. If you start from pure quantum mechanics, and try to predict something like that emerging, you'd have to be pretty lucky to find the correct initial conditions to lead to such a phenomenon. But that does not mean that it is impossible to predict it at all.
Another take on this might be that understanding phenomena at different levels requires different abstractions and different modes of understanding. In that sense, the knowledge at one level often does not help us to understand things at a higher level. But I think that says more about our limited intelligence than about the phenomena and emergence in themselves.
“A classification of endangered high-THC cannabis (Cannabis sativa subsp. indica) domesticates and their wild relatives”
By McPartland and Small.
Moving on from cannabis sativa indica and cannabis sativa sativa to cannabis sativa indica Himalayansis and cannabis sativa indica asperrima depending on distribution from the original location of the extinct ancient cannabis wildtype.
Following this new classification, I believe there’s a third undocumented variety in North East Asia.
If anyone else has noticed the samesameification of cannabis strains and is wondering what the path forward is, this may be illuminating.
I recently read "Enabling tabular deep learning when d ≫ n with an auxiliary knowledge graph" (https://arxiv.org/pdf/2306.04766.pdf) for one of my graduate classes. Essentially, when there are significantly more data points than features (n >> d), machine learning usually works fine (assuming data quality, an underlying relationship, etc.). But, for sparse datasets where there are fewer data points than features (d >> n), most machine learning methods fail. There's just not enough data to learn all the relationships. This paper builds a knowledge graph based on relationships and other pre-existing knowledge of data features to improve model performance in this case. It's really interesting - I hadn't realized there were ways to get better performance in this case.
A new way of doing real time physics that dramatically outperforms state of the art by simply introducing a new algorithm. No crazy AI or incremental improvements to existing approaches.
"Cyclic Combinational Circuits", by Marc D. Riedel
> we present theoretical justification for the claim that the optimal form of some [combinational] circuits requires cyclic topologies. We exhibit families of cyclic circuits that are optimal in the number of gates, and we prove lower bounds on the size of equivalent acyclic circuits.
Analyzes a claim made in the 1950s by a prominent statistician: the frequency of interstate wars follows a simple Poisson arrival process and their severity follows a simple power-law distribution.
He was right, but it’s not clear why.
The paper is interesting because it shows how a new bit of knowledge creates a large number of known unknowns from previously unknown unknowns.
It also shows statistical magic was very much possible prior to computers.
"Bounding data races in space and time" was an interesting one I saw recently! It's discussing the memory models of programming languages, and how they can fail pretty horribly when data races occur, and then talks about ways to avoid those. OCaml's multicore support is based on this work, meaning it's memory safety guarantees in when data races occur are pretty interesting
TBH, most of the research papers I've read the past few years have been around diet/nutrition. I can't think of anything in terms of recently blew my mind, some of it is definitely interresting, but most of it is low-quality noise.
Probably the biggest thing that blows my mind is the suppression of the Minnesota Coronary Study in the early 1960's. Literally half a century of dis/misinformation from the govt, pharma and medical industries that was disproven long ago. Nothing higher quality or more definitive since.
Basically, the whole, limit cholesterol and saturated fat intake in favor of more grains and seed oils is based on a theory that was long disproven. And, it's still pushed to this day. Why, there's big money/business in pharma and agriculture (corn, soy, wheat).
It is referenced hundreds of times in many classic papers.
But, here's the thing. It doesn't exist.
Everyone cites Sarin, DeWitt & Rosenb[e|u]rg's paper but none have ever seen it. I've emailed dozens of academics, libraries, and archives - none of them have a copy.
So it blows my mind that something so influential is, effectively, a myth.