
10-16 hours is not enough at all. On a cloudy day, solar output will only be 15-20% of its clear-sky value. On top of that, your panels really only generate for 8 hours on a very good day - the sun is a lot dimmer in the early morning and late evening. Really, you need 2x storage for a good day; if you want to deal with two cloudy days you'd want 50-60 hours of storage.

Could you possibly read the article you're replying to again?

Even a skim shows that it discusses the contribution from wind, and a split that isn't 50/50, particularly to cover winter and night time. There is also discussion of a ~2% contribution from "other" and of how much storage capacity is required.

The article even goes into using wind & solar data for the simulation and further reducing the output to be conservative.


I obviously understand it's not a 100% solar system. If it were, you would need to be able to deal with at least two weeks of bad weather, not two days, and you would have to take winter into account (dropping to about 5 hours of generation instead of 8).

Additionally, mixing solar and wind is not as easy as it seems, because the two are correlated. If you have a major storm that makes wind energy impossible due to wind speeds above ~100 km/h, you will also have clouds making solar energy unworkable. I'm not aware of any simulation modelling a 95+% solar/wind grid for storage needs, taking into account extreme weather patterns, grid topology, and equipment damage, but if you know of one, please link it.

I don't see any article linked in the comment I replied to. Perhaps you're mixing up two comment chains.


It's likely enough battery capacity if you combine batteries with e-fuels for longer term storage.

Assuming batteries are used for all storage use cases is one of the classic errors of energy system analysis.


They did not bring it into existence. The MLP is older than the Hopfield network. The invention that made it practical was backpropagation, which wasn't used here at all.

Well no, not in a non-array programming language. In any language that has a semi-decent type/object system and some kind of functional programming support, `avg a+b` would just be `avg(a, b)`, which is not any easier or harder, with an array type defined somewhere. Once you write your basic array operations (which have to be written in q anyway, just in the stdlib), you can compose them just like you would in q, and get the same results. All of the bounds checking and for-loops are unnecessary; all you really need are a few HKTs that do fancy maps and reduces, which the most popular languages already have.
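For concreteness, here is a rough numpy sketch of the kind of composition I mean (the arrays and expressions are just illustrative):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    # elementwise op, then a reduction - composed the same way q composes `avg a+b`
    np.mean(a + b)                 # 7.0

    # the building blocks compose just as freely as q's primitives
    np.mean(np.abs(a - b))         # mean absolute difference -> 3.0
    np.add.reduce(a * b) / len(a)  # hand-rolled mean of the elementwise product -> 32/3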

A very real example of this is Julia. Julia is not really an array-oriented programming language; it's a general language with a strong type system and decent functional programming facilities, with some syntactic sugar that makes it look a bit array-oriented. You could write any q/k program in Julia and it would not be any more complex. For a decently complex program Julia will be faster, and in every case it will be easier to modify and read, and not any harder to write.


Why would it be avg(a, b)?

What if I want to take the average difference of two arrays?


mean(a - b)

I don't know what you mean by the q array operations being defined in the standard library. Yes there are things defined in .q, but they're normally thin wrappers over k which has array operations built in.

I don't consider an interpreted language having operations "built-in" to be significantly different from a compiled language having basic array operations in the stdlib or calling into a compiled language.

Hmm, why not? Using K or a similar array language is a very different experience to using an array library like numpy.

It is syntactically different, not semantically different. If you gave me any reasonable code in k/q I'm pretty confident I could write semantically identical Julia and/or numpy code.

In fact I've seen interop between q and numpy. The two mesh well together. The differences are aesthetic more than anything else.


There are semantic differences too with a lot of the primitives that are hard to replicate exactly in Julia or numpy. That's without mentioning the stuff like tables and IPC, which things like pandas/polars/etc don't really come close to in ergonomics, to me anyway.

Do you have examples of primitives that are hard to replicate? I can't think of many off the top of my head.

> tables and IPC

Sure, kdb doesn't really have an equal, though it is very niche. But for IPC I disagree. The facilities in k/q are neat and simple in terms of setup, but they don't offer anything better than what you can do with cloudpickle, and the lack of custom types makes effective, larger-scale IPC difficult without resorting to inefficient hacks.


None of the primitives are necessarily too complicated, but off the top of my head things like /: \: (encode, decode), all the forms of @ \ / . etc, don't have directly equivalent numpy functions. Of course you could reimplement the entire language, but that's a bit too much work.

Tables aren't niche, they're very useful! I looked at cloudpickle, and it seems to only do serialisation; I assume you'd need something else to do the IPC itself? The benefit of k's IPC is that it's pretty seamless.

I'm not sure what you mean by inefficient hacks; generally you wouldn't try to construct some complicated ADT in k anyway, and if you need to, you can still directly pass a dictionary or list or whatever your underlying representation is.


> None of the primitives are necessarily too complicated, but off the top of my head things like /: \: (encode, decode), all the forms of @ \ / . etc, don't have directly equivalent numpy functions. Of course you could reimplement the entire language, but that's a bit too much work.

@ and . can be done in numpy through ufuncs. Once you turn your unary or binary function into a ufunc using foo = np.frompyfunc, you then have foo.at(a, np.s_[fancy_idxs], (b?)), which is equivalent to @[a, fancy_idxs, f, b?]. The other ones are, like, 2 or 3 lines of code to implement, and you only ever have to do it once.
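A minimal sketch with numpy's built-in ufuncs (as far as I know the frompyfunc route behaves the same for custom functions; the arrays here are just for illustration):

    import numpy as np

    a = np.array([10.0, 20.0, 30.0, 40.0])

    # roughly q's amend-at @[a; 0 2; +; 5]: apply + with 5 at indices 0 and 2, in place
    np.add.at(a, [0, 2], 5.0)       # a -> [15., 20., 35., 40.]

    # same idea with a different binary ufunc
    np.maximum.at(a, [1, 3], 25.0)  # a -> [15., 25., 35., 40.]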

vs and sv are just pickling and unpickling.

> Tables aren't niche,

Yes, sorry, I meant that tables are only clearly superior in the q ecosystem in niche situations.

> I looked at cloudpickle, and it seems to only do serialisation, I assume you'd need something else to do IPC too? The benefit of k's IPC is it's pretty seamless.

Python already does IPC nicely through the `multiprocessing` and `socket` modules of the standard library. The IPC itself is very nice in most use cases if you use something like multiprocessing.Queue. The thing that's less seamless is that the default pickling operation has some corner cases, which cloudpickle covers.
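A minimal sketch with the stdlib alone (the worker function and payload are obviously just placeholders):

    import multiprocessing as mp

    def worker(q):
        # anything picklable goes through the queue; cloudpickle widens what
        # "picklable" means (lambdas, closures, etc.)
        q.put({"result": sum(range(10))})

    if __name__ == "__main__":
        q = mp.Queue()
        p = mp.Process(target=worker, args=(q,))
        p.start()
        print(q.get())  # {'result': 45}
        p.join()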

> I'm not sure what you mean by inefficient hacks, generally you wouldn't try to construct some complicated ADT in k anyway, and if you need to you can still directly pass a dictionary or list or whatever your underlying representation is.

It's a lot nicer and more efficient to just pass around typed objects than dictionaries. Being able to have typed objects whose types allow for method resolution and generics makes a lot of code so much simpler in Python. This in turn allows a lot of libraries and tricks to work seamlessly in Python and not in q. A proper type system and colocation of code with data make it a lot easier to deal with unknown objects - you don't need nested external descriptors to tag your nested dictionary and tell you what it is.


Again, I'm not saying anything is impossible to do; it's just about whether or not it's worth it. 2 or 3 lines for every type, every overload, every primitive, etc. adds up quickly.

I don't see how k/q tables are only superior in niche situations; I'd much rather (and do) use them over pandas/polars/external DBs whenever I can. The speed is generally overhyped, but it is significant enough that rewriting something from pandas often ends up much faster.

The last bits about IPC and typed objects basically boil down to python being a better glue language. That's probably true, but the ethos of array languages tends to be different, and less dependent on libraries.


If you're affiliated with Mila, it may be worth it to ask for the book to be bought for the shared library - I'm sure many people would peruse it :)


That makes a lot of sense! Might just order it through a local bookseller though.


Not saying it is or isn't overpriced, but 20k/year is actually a good price for something that can prevent homelessness. Just the avoided cost of extra medical care and/or jail, let alone social services and lost productivity, makes it worth it.


Members of society are not external organisms; they are the cells that make up the organism that is society itself. The most effective way in which organisms fight these issues - take cancer for example - is to prevent the issue to begin with, not to supercharge the immune system against itself. Much like an immune system too eager to go after its own cells can cause autoimmune disease, fighting antisocial behaviour primarily through repression leads to societal ruin.


But motor skills transfer extremely well. It's not uncommon for professional athletes to switch sports, some even repeatedly.


There are some famous-ass basketball players with mediocre but still existent MLB careers.


Wealth, network and fame transfer incredibly well between fields. Possibly better than anything else. That should be accounted for when reasoning about success in disparate fields. In addition to luck, of course.


> mediocre but still existent MLB careers.

If you have an MLB career at all, you are an elite baseball player.


Not really; the killer is latency, not throughput. It's very rare that a CPU actually runs out of memory bandwidth. The bandwidth is much more useful for the GPU.

95 GB/s is ~24 GB/s per core; at 4.8 GHz that's about 40 bits per core per cycle. You would have to be doing basically nothing useful with the data to be able to get through that much bandwidth.
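Spelling out the arithmetic (assuming the 95 GB/s is shared evenly across the 4 P-cores):

    bandwidth_Bps = 95e9          # bytes per second, shared
    cores = 4
    clock_hz = 4.8e9

    bytes_per_core_per_cycle = bandwidth_Bps / cores / clock_hz
    print(bytes_per_core_per_cycle * 8)   # ~39.6 bits per core per cycle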


For scientific/technical computing, which uses a lot of floating-point operations and a lot of array operations, when memory limits performance the limit is almost always caused by memory throughput and almost never by memory latency (in correctly written programs, which allow the hardware prefetchers to do their job of hiding the memory latency).

The resemblance to the behavior of GPUs is not a coincidence: GPUs are also mostly doing array operations.

So the general rule is that the programs dominated by array operations are sensitive mostly to the memory throughput.

This can be seen in the different effect of the memory bandwidth on the SPECint and SPECfp benchmark results, where the SPECfp results are usually greatly improved when memory with a higher throughput is used, unlike the SPECint results.


You are right that it's a limiting factor in general for that use case, just not in the case of this specific chip - this chip has far fewer cores per lane, so latency will be the limiting factor. Even then, I assure you that no scientific workload is going to consume 40 bits/clock/core. It's just a staggering amount of memory bandwidth; no correctly written program would hit this, you'd need abysmal cache hit ratios.

This processor has two lanes over 4 P-cores. Something like an EPYC-9754 has 12 lanes over 128 cores.


I agree that for CPU-only tasks, Lunar Lake has ample available memory bandwidth, but high memory latency.

However, the high memory bandwidth is intended mainly for the benefit of its relatively big GPU, which might have been able to use even higher memory throughputs.


40 bits per clock in an 8-wide core gets you 5 bits per instruction, and we have AVX-512 instructions to feed, with operand sizes 100x that (and there are multiple operands).

Modern chips do face the memory wall. See e.g. here (though about Zen 5), where they conclude in the same vein: "A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth."


The throughput of the AVX-512 computation instructions is matched to the throughput of loads from the L1 cache memory, on all CPUs.

Therefore to reach the maximum throughput, you must have the data in the L1 cache memory. Because L1 is not shared, the throughput of the transfers from L1 scales proportionally with the number of cores, so it can never become a bottleneck.

So the most important optimization target for the programs that use AVX-512 is to ensure that the data is already located in L1 whenever it is needed. To achieve this, one of the most important things is to use memory access patterns that will trigger the hardware prefetchers, so that they will fill the L1 cache ahead of time.

The main memory throughput is not much lower than that of the L1 cache, but the main memory is shared by all cores, so if all cores want data from the main memory at the same time, the performance can drop dramatically.
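A rough illustration of how much the access pattern matters (a sequential scan lets the prefetchers hide the latency, a random gather doesn't; the gathered version also pays for an extra copy, so this is only indicative):

    import numpy as np, time

    n = 20_000_000
    x = np.arange(n, dtype=np.float64)
    perm = np.random.permutation(n)          # a prefetcher-hostile access order

    t0 = time.perf_counter(); x.sum();       t1 = time.perf_counter()
    t2 = time.perf_counter(); x[perm].sum(); t3 = time.perf_counter()

    print(f"sequential: {t1 - t0:.3f}s  gathered: {t3 - t2:.3f}s")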


The processors that hit this wall have many, many cores per memory lane. It's just not realistic for this to be a problem with 2 lanes of DDR5 feeding 4 cores.

These cores cannot process 8 AVX-512 instructions at once; in fact they can't execute AVX-512 at all, as it's disabled on consumer Intel chips.

Also, AVX instructions operate on registers, not on memory, so you cannot have more than one register being loaded at once.

If you are running at ~4 instructions per clock, to actually saturate 40 bits per clock with 64-bit loads you'd need 1/6 of instructions to hit main memory (not cache)! (40/64 ≈ 0.63 loads per clock; at 4 instructions per clock, that's roughly one instruction in six.)


There might be a chicken-and-egg situation here - one often hears that there’s no point having wider SIMD vectors or more ALU units, as they would spend all their time waiting for the memory anyway.


The width and count of the SIMD execution units are matched to the load throughput from the L1 cache memory, which is not shared between cores.

Any number of cores with any count and any width of SIMD functional units can reach the maximum throughput, as long as it can be ensured that the data can be found in the L1 cache memories at the right time.

So the limitations on the number of cores and/or SIMD width and count are completely determined by whether in the applications of interest it is possible to bring the data from the main memory to the L1 cache memories at the right times, or not.

This is what must be analyzed in discussions about such limits.


CPUs generally achieve around 4-8 FLOPs per cycle. That means 256-512 bits per cycle. We're all doing AI, which means matrix multiplications, which means frequently rereading the same data (bigger than the cache) and doing one MAC with each piece of data read.


The most important algorithm in the world, matrix multiplication, just does a fused multiply add on the data. Memory bandwidth is a real bottleneck.


The importance of the matrix multiplication algorithm is precisely due to the fact that it is the main algorithm where the ratio between computational operations and memory transfers can be very large, so the memory bandwidth is not a bottleneck for it.

The right way to express a matrix multiplication is not the one wrongly taught in schools, with scalar products of vectors, but as a sum of tensor products of the column vectors of the first matrix with the row vectors of the second matrix that share the same index (i.e. the same position of the element on the main diagonal).

Computing a tensor product of two vectors, with the result accumulated in registers, requires a number of memory loads equal to the sum of the lengths of the vectors, but a number of FMA operations equal to the product of the lengths (i.e. for square matrices of size NxN, there are 2N loads and N^2 FMA operations for one tensor product, which multiplied by the N tensor products gives 2N^2 loads and N^3 FMA operations for the matrix multiplication).

Whenever the lengths of both vectors are no less than 2 and at least one length is no less than 3, the product is greater than the sum. With greater vector lengths, the ratio between product and sum grows very quickly, so when the CPU has enough registers to hold the partial sum, the ratio between the counts of FMA operations and of memory loads can be very large.
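A small sketch of that outer-product formulation (plain numpy rather than registers, but the load/FMA counting is the same):

    import numpy as np

    def matmul_outer(A, B):
        # matrix product as a sum of tensor (outer) products of A's columns with B's rows
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m))
        for i in range(k):
            # one tensor product: ~(n + m) loads feed n*m multiply-accumulates
            C += np.outer(A[:, i], B[i, :])
        return C

    A = np.random.rand(4, 3)
    B = np.random.rand(3, 5)
    assert np.allclose(matmul_outer(A, B), A @ B)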


Is it though? The matmul of two NxN matrices takes N^3 MACs and 2*N^2 memory accesses. So the larger the matrices, the more the arithmetic dominates (with some practical caveats, obviously).


This is more amortized optimization/reinforcement learning than randomized algorithms.


Certainly not. You can buy great boots for $250. I bought my Salomon boots 5 years ago, I wear them a heck of a lot in the Canadian winter, and they're good as new. They're much more comfortable than a Doc Martens boot ever was, even in its prime; they're resolable with a Vibram Megagrip sole; they have a carbon fiber backing plate that will never wear out; etc.

What happened is that technology advanced, and now good quality boots don't look like Docs anymore. The market for high-quality, vintage-construction boots is tiny, because people who only care about quality won't buy them anymore, which is why they are expensive: only a particular kind of buyer who cares a lot about quality and looks/nostalgia is going to buy them.

People who need high quality boots just won't be satisfied with heavy, fussy, non-breathable leather and rubber boots.

