All: submitted title was "`int('1' * 4301)` will raise ValueError starting with Python 3.10.7" and comments reference that, so you might want to take a look at both URLs.
Is this GitHub issue typical of the Python community nowadays?
A two-year-old CVE is picked up by gpshead as suddenly urgent enough to justify a fairly significant breaking change in a patch release, despite another issue (#90716) already tracking better solutions to the same problem.
gpshead submits a PR which doesn't solve the issue (the DoS still happens, and then a ValueError comes afterwards!), so mdickinson submits a PR which works around the problem in the first PR. Nobody else is apparently involved in discussing whether this change should even be made, and whether the fixed fix genuinely eliminates the DoS.
mdickinson then has to fight for several comments to convince gpshead to make an obviously-correct and zero-cost change to the integer type, with gpshead dropping a passive-aggressive "pedantically correct" when they finally fix it.
Then when other people finally notice the issue and start to question the need for the change, especially in the context of another issue discussing a better solution, they are ignored and the discussion is shut down with a reference to the Code of Conduct.
Why would anyone want to participate in a community this toxic?
I was the one who reported this originally in 2020. Christian said it wasn't a vulnerability because it was intended that people who used standard library functions should validate user input. A day later I get another email saying that a CVE had been assigned, and that it was now suddenly a vulnerability.
I wait a few months, follow up and nothing. Pretty much dropped it for 2 years and now I see that they finally fixed it and never let me know. I asked a few days ago whether I'd be credited in the CVE, still no reply...
Why do I feel as if there are serious communication issues here?
Digging through our history, a person who reported the same thing earlier than you never got a response at all. Like I said, we've identified organizational issues to be addressed.
(I honestly don't know who should be "credited" on the CVE nor do I have control over that, sorry)
Yep, that's how orgs work. Any org older than a certain number of years must delete itself and allow others to form another entity - or else it will amass warm bodies that collect titles and don't do anything useful. Unfortunately orgs will do everything to continue to exist, despite becoming irrelevant, inefficient, and losing their merit.
The code of conduct is a part of this: it is designed to prevent strong dissenting opinion by claiming it's offensive/improper. You cannot call out the true bad actors, because culturally we view offense as wrong and authority as correct. A CoC is basically an HR document; it's there to protect the org, not to do the right thing. History repeats itself.
Linking the Code of Conduct instead of actually pointing out anything specific is such a shitty approach to this. We're used to this kind of ToS-pointing from SV tech corporations, but it's sad that it seems to be getting more common in open source communities too.
that's all you need to know. Google makes some great stuff (the Go language), but their community has plenty of arrogant members who have no concept of, or interest in, nuanced discussions.
You mean, a guy who works at Google; for a second, I thought you were implying that name was an acronym for Google Python Services or something like that.
It's a quote from, at the very least, a recent movie. And really, my original post was "Google is a made up word", yet you used it. Even if you were okay using "google", but not "googler", saying the person "is google" is wrong - I actually clicked on the link expecting to see a corporate Google account, not just some random "person who works at Google".
I'm pretty sure the decision on how to address the bug (and the determination of whether it's even a bug) was not made by one dev. Other devs were involved, and the determination to make the change was made as a team. Having a better fix, e.g. something like what is suggested in #90716, is not precluded. As yet, no one has actually stepped up with a better int-to-str implementation, i.e. something that can be reviewed, tested, and then maintained in the long term. As discussed in #90716, there is not much point in trying to do something similar to what GMP does. People can just install that library and use it.
I'm not sure why people are so excited about this issue. It's not much different than sys.setrecursionlimit(). We know how to implement tail recursion. Python doesn't do it though and so there is a limit, set high enough that most people don't care. It seems a perfectly practical approach to me.
If you are writing a server in python, you should expect these sorts of "vulnerabilites" (ie performance problems) in exchange for the convenient abstractions that you get.
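The recursion-limit analogy is easy to see in code; a minimal sketch:

```python
import sys

def count_down(n):
    """Plain recursion; CPython does not eliminate tail calls."""
    return n if n == 0 else count_down(n - 1)

print(sys.getrecursionlimit())  # 1000 by default, high enough for sane code
print(count_down(500))          # 0: well under the limit

try:
    count_down(100_000)         # far past the limit
except RecursionError:
    print("RecursionError")     # a guard rail, not a bug in your loop
```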
but it's not a vulnerability in the traditional sense. No information gets leaked, no invalid memory is accessed. It's just slow. If being slow is a vulnerability, all python code is a vulnerability.
This is a bad take. The operation is not slow as in "takes longer", but as in "effectively never finishes". (that's what taking minutes on a single request means) That makes it trivial to DoS a service if you know it's running on a standard python runtime. And availability related to untrusted input is very much a vulnerability in a traditional sense.
They set the limit at (on my computer) 0.1ms. Also, their implementation is roughly 100x slower than it should be. Should python also error out on loops that run for more than 1000 iterations? The problem isn't that int parsing is broken, the problem is that web-servers aren't validating their inputs.
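If validation is the application's job, the guard is at least cheap to write; a hypothetical helper (the name and the 100-digit cap are made up for illustration):

```python
def safe_int(s, max_digits=100):
    """Reject absurdly long numerals before handing them to int()."""
    body = s.strip()
    if body[:1] in ('+', '-'):
        body = body[1:]              # a sign character isn't a digit
    if len(body) > max_digits:
        raise ValueError(f"numeral too long: {len(body)} digits")
    return int(s)

print(safe_int("  -4301 "))   # -4301
```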
It's different layers. You're in control of the loops, so no. (guess what though - recursion level is limited) You're not in control of the int() implementation, so that falls on python.
It's similar to what we've done with hash collisions - we can't count on everyone fixing that in every place released so far, so all languages added hash randomisation at startup. That one fortunately had no side effects like this one does.
> They set the limit at (on my computer) 0.1ms.
That seems reasonable to me. I'm not sure if you're saying it's too low or too high?
> the problem is that web-servers aren't validating their inputs.
And they won't. This would be repeated many times in various forms, affecting web servers, job queues, parsers, etc., and keep coming back for decades. We already have that with Java XML entities. They weren't fixed at the source, and we get a new implementation with that bug every day.
What do you mean by "all inputs"? Did you write your whole server from scratch? Did you verify the length of every single int that comes in your query? Fragment? Json? Header? Header part that may be parsed internally by python's standard libs? Chunk lengths? IP conversions? Every external library call? Variables that may turn numeric in future revisions? If that's true, how sure are you you haven't missed a single spot?
> like we have been doing for thirty years in web apps.
I have bad news for you. If everything you said is true, you've been working on some ideal code in a perfect team with no dependencies... But more likely you're going to have a bad collision with reality one day.
Could you point me to which part of my comment you found toxic? I thought I was careful to make sure it was only recounting facts, and the question I posed is genuine — as someone who used to contribute to Python (but hasn't worked with Python in a long time) I'd love to understand how things got to this point.
for what it's worth, I think you summarized the situation fairly accurately, but it's unclear what feeling you intend to convey with your comment other than to inflame readers emotionally
If you don't understand why I cited the code of conduct and redirected discussion to a more appropriate forum for constructive discussion, go read our code of conduct vs the language that was being directed at us and what being linked from this toxic site was about to bring.
There was no fighting. As soon as Mark piped up I was extremely pleased to see that he had found something that should've been obvious that we'd overlooked in the process of doing everything spread over time. Mark wasn't able to review the PR code before it was made public due to the current processes (lack of...) we're working to improve for the Python security response team.
"pedantically correct" was not intended to be read as passive aggressive. I use that term to mean exact vs almost when it comes to computations. I didn't need convincing. I wanted the reasoning to be made understandable to everyone else in the future (future selves included) who was going to read this code later. I still think there is room for better explanation of the math but that is true for large parts of Objects/longobject.c anyways.
I find your interpretation of events... amusing. :P
I went back and read the last comments before the CoC was invoked, and I read the CoC as well. Having done that, I don't understand which part of the CoC the comments were in violation of. Could you expand on why you think it was appropriate to shut the discussion down without even responding to those comments?
For anyone wondering, '1' * 4301 creates a string of '11111....' 4301 characters long. It doesn't result in an integer value of 4301 like in some other languages.
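Concretely (the new ValueError only appears on interpreters that ship the limit, so it is not shown here):

```python
s = '1' * 5
print(s)                  # 11111 -- sequence repetition, not arithmetic
print(int(s))             # 11111
print(len('1' * 4301))    # 4301 characters; parsed, that's a ~14,000-bit integer
```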
I find this a strange modification to the language, though probably not a particularly painful one. Has Python saved you from yourself when dealing with non-linear built-in algorithms before? IIRC it is also possible to have the regex engine take an inordinate amount of time for certain matching constructs (I think Stack Overflow was affected by this?), but the engine wasn't hobbled to throw in those cases; it is merely up to the user to write efficient regexes that aren't subject to those problems.
Backtracking regular expressions as an intentional or accidental DOS vector are a moderately well-known issue, and while I prefer that a standard library implementation be robust against them, I can see the POV that it's buyer beware.
Converting a string to an integer is somewhat less well known as a DOS vector, more painful to avoid as an application creator, and easier to fix in code.
So there's a cost-benefit argument that you should just do this before you rewrite your regex engine.
On the other hand, lots of buyers are not aware that it's an issue, and more frustratingly there are regex engines which are very resilient to it... but they are not widely used.
Python's stdlib will fall over on any exponential backtracking pattern, but last time I tried to make Postgres fall over I didn't succeed, even though it does have lookahead, lookbehind, and backrefs, so it should be susceptible to the issue (i.e. it's not a pure DFA).
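The stdlib failure mode is easy to reproduce with the classic nested-quantifier pattern; kept to tiny inputs here so it actually terminates (each extra 'a' roughly doubles the work on the failing case):

```python
import re

# nested quantifiers: exponential backtracking when the match must fail
evil = re.compile(r'^(a+)+$')

print(evil.match('a' * 20) is not None)   # True, instantly
print(evil.match('a' * 20 + 'b'))         # None, after ~2^20 backtracking steps
```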
This does seem like a strange level of handholding, even if the motivation makes lots of sense. If you start going down the road of protecting people who don't sanitize user input, you may have quite a long journey ahead...
Whether evaluating that expression results in undefined behaviour also depends on the basic execution character set and the bit width of the machine byte.
I worked with some MCUs whose only redeeming quality was their cost, and which had CHAR_BIT=16. The answer is basically everything. It made string processing profoundly annoying, even for C.
No, it was with xap [1] chips. TI is also on my list of annoying chip vendors because they're one of the few companies still making big endian ARM chips, which should die in a fire.
In Algol 68, you can do that; it's part of the standard prelude. I think that some people who'd worked on Algol 68 in the Netherlands also worked on the ABC language, where it's "1" ^^ 4301, and Guido worked on ABC before Python.
I was more expecting '1111' / 4 = '1', which would be the inverse operation. However, it opens up even more questions, like what to do if your string has mixed values, etc.
The string multiplication is about _joining_ strings, the inverse is about _splitting_ them in several parts. It's only confusing because the * appends the string to itself, the / is actually very clear.
Disagree. The inverse "string" * value is logically splitting, and then collapsing the repeated values. The logical split can be omitted, but the collapsing cannot.
I think how Rust does it is fine, but I agree operators are often a mess. Yesterday I was looking at a memory dump where there was a problem in a destructor (a double free was detected) and it was an absolute mess trying to figure out the exact execution location in source code since it was setting the value of a smart pointer which triggered a decrement of a reference counted value in turn triggering a free. It's junk like that which starts to convince me that Linus was right to avoid C++. Rust obviously also has destructors, but it doesn't have the nightmare that is inheritance+function overloading+implicit casting.
> and it was an absolute mess trying to figure out the exact execution location in source code since it was setting the value of a smart pointer which triggered a decrement of a reference counted value in turn triggering a free.
Yes, probably. Depends on the compiler settings. Stuff can get optimized out and stripped.
When writing the code in the first place, though, it's difficult to see problems like that because it's all hidden behind magic calls to copy constructors, move semantics, and destructor calls. Out of sight, out of mind.
I think it's separate from his point but some of those things could potentially be tail calls, meaning the functions actually leading to the free/delete might not be in the stacktrace even if they were called.
Succinct string operations are honestly about half of what I use Python for; the great numeric support, with bignums by default, and powerful libraries with overloads like numpy and tensorflow are the other half.
> Operator overloading sure seems to increase the prevalence of foot-guns, security issues, and other gotchas.
How exactly? What would you expect an expression like ('1' * 4301) to give you, and why would you think it would be different from ('caterpillar' * 4301)?
In Lua, the first is 4301 and the second is a runtime error. ('1' .. 4301) is "14301"; the equivalent of the weird thing Python is fixing would be spelled `tonumber(('1'):rep(4301))`, which is obviously wrong.
To my taste operator overloading is fine, but concatenation isn't addition, so they shouldn't be overloaded because... [gestures vaguely at a half dozen language]
Now I'm stumped. Isn't addition supposed to be commutative?
So yeah, without contracts in place, operator overloading is BAD. You can never know what the operator does, or what its properties are by just looking at how it's used. There's simply no enforced rules and so no-one's stopping you from doing
>>> class Complex:
...     def __init__(self, real, imag):
...         self.real = real
...         self.imag = imag
...     def __add__(self, other):
...         # deliberately wrong: subtracts instead of adding
...         return Complex(self.real - other.real, self.imag - other.imag)
...     def __repr__(self):
...         return f'Complex({self.real}+{self.imag}j)'
...
>>> x = Complex(1, 2)
>>> y = Complex(1, 2)
>>> x + y
Complex(0+0j)
Now this is intentionally malicious, of course, but plenty of libraries overload operators in non-intuitive ways, so that an operator's properties and behaviour aren't obvious. This is especially true if commutative operators are implemented as non-commutative (e.g. abusing '+' for concatenation instead of using another symbol like '&') or if the behaviour changes depending on the order of operands.
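The concatenation objection in two lines: '+' on strings keeps associativity but drops commutativity:

```python
# associative, as for + on Z
assert ('a' + 'b') + 'c' == 'a' + ('b' + 'c')

# but not commutative, unlike + on Z
assert 'ab' + 'cd' != 'cd' + 'ab'
```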
Yeah, all the examples you give rub me up the wrong way. I don't like use of + for concatenate, even though it is present in Rust which I like very much on the whole.
Here's what Rust tells programmers in core::ops (the sub-library whose safe Traits result in operator "overloading" for Rust). I put "overloading" in quotes because these operators just don't exist at all for your type if you don't implement the appropriate Trait.
> Implementations of operator traits should be unsurprising in their respective contexts, keeping in mind their usual meanings and operator precedence. For example, when implementing Mul, the operation should have some resemblance to multiplication (and share expected properties like associativity).
You know what, just for shits and giggles I actually DID open an introductory textbook on abstract algebra, just in case my undergraduate degree in maths failed me. Here's what the first chapter says about addition ('+') in Z:
Here are 4 elementary properties that + satisfies:
• (Associativity) a + (b + c) = (a + b) + c ∀a, b, c ∈ Z.
• (Existence of additive identity) a + 0 = 0 + a = a ∀a ∈ Z.
• (Existence of additive inverses) a + (−a) = (−a) + a = 0 ∀a ∈ Z.
• (Commutativity) a + b = b + a ∀a, b ∈ Z.
Also:
Key Observation: There are naturally occurring sets (other than Z and Q) which come equipped with a concept of + and ×, whose most basic properties are the same as those of the usual addition and multiplication on Z or Q.
I could go on with rings, fields, and vector spaces which also rely on the concept of addition as defined in Z, but I'm really curious to learn about addition being commonly used as a non-commutative operation, especially in the presence of an accompanying multiplication operation.
edit: I forgot to provide proper references, so here's another example including a reference:
When dealing with Abelian groups, it is customary to use additive notation.
That is, the group operation will be called "addition," and
we will "add" two elements rather than "multiply" them. We write g + h instead of g • h or gh, and ng replaces g^n.
[Walker, Elbert A., "Introduction to Abstract Algebra", 1987; Ch. 2, Pg. 70]
Other than that being a terrible name (it's almost impossible to be sure what it does without consulting documentation), I personally do prefer fewer implicit/overloaded operations.
What name would you suggest? That was my 5 min of thought version.
cc prefix for concatenate because that word is very long and it seemed likely that strings may have a large number of different concatenation focused functions that could all share the prefix.
Clone as the type of concatenation operation to perform.
I originally wrote a reply here that I deleted because I realised somebody introduced the wrong idea below. Yes, the repeat() function on strings does what you describe.
str::repeat() behaves exactly as you describe. "No".repeat(4) is "NoNoNoNo" and "What".repeat(0) is "" and so on.
Note that str::repeat() returns a String, not a str, because if n > 1 it obviously needs to allocate somewhere to put the String. As a result it is not available in environments which don't have String, like a tiny embedded device - for them str::repeat() does not exist whereas stuff like str::starts_with() and str::split_once() are fine.
Repeat is an iterator, so you can apply it to any type you want, not just strings. You can chain it with other iterators, or collect it into some data structure. But yes, repeat(0) returns an empty iterator.
This example makes a repeat(0) but asks for just the first 12 things in it, they are, of course, all twelve zeroes. Feel free to adjust it to ask for more if you think maybe there aren't any more, or they stop being zero.
> A huge integer will always consume a near-quadratic amount of CPU time in conversion to or from a base 10 (decimal) string with a large number of digits. No efficient algorithm exists to do otherwise.
I don’t believe that. I did a quick search and didn’t find much, but:
Let d_0, d_1, etc be decimal digits, little endian (so the number is d_0 + 10d_1, etc). The goal is to compute that quantity in binary.
The naive algorithm is to convert d_0 to binary. Then compute 10 in binary, multiply by d_1, and add it. Then multiply to compute 10^2 in binary, and accumulate 10^2 · d_2, and repeat. For n digits, there are n steps, and each step involves two multiplications by small factors and an addition. The overall time is O(n^2).
But I would try it differently. To convert a 2n-digit number, first convert the first n digits to binary (recursively) and convert the second n digits to binary. Then multiply the second result by 10^n and add it to the first result.
Let’s simplify this analysis by making n be a power of 2, so 2n = 2^k. Then the big powers of 10 are all 10 to a power of 2, so 10 needs to be squared k times. Additionally, there is one 2^k-digit multiplication, two 2^(k-1)-digit multiplications, four 2^(k-2)-digit multiplications, etc.
With FFT multiplication, multiplying is O(digits · poly(log(digits))), as is squaring. This will all add up to k levels of recursion, each acting on a total of 2^k or so bits, taking time O(2^k) times some log factors. This comes out to O(n · poly(log(n))).
I have not implemented this or done the math rigorously :)
edit: this is also buried in the issue. There’s also:
Indeed, the naive algorithm is quadratic due to the multiplications. Yet everyone familiar with even a bit of crypto and/or otherwise working with huge numbers knows that multiplication can be subquadratic.
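The recursive scheme above fits in a few lines of Python; a sketch only (a tuned version would precompute and reuse the powers of ten, and the subquadratic behaviour comes from Python's own big-int multiplication being better than schoolbook):

```python
def dec_to_int(s):
    """Divide-and-conquer decimal-string-to-integer conversion."""
    if len(s) <= 16:
        return int(s)                 # small chunk: the naive conversion is fine
    mid = len(s) // 2
    low_digits = len(s) - mid
    # high half shifted up by the number of low digits, plus the low half
    return dec_to_int(s[:mid]) * 10 ** low_digits + dec_to_int(s[mid:])

print(dec_to_int('1' * 20))   # 11111111111111111111
```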
Integers / floating point / bignum is such a problem in almost any language, with multiple competing implementations in most languages. I’d argue it’s probably a harder problem than Unicode support, although Python also managed to make that difficult with multiple implementations in the wild (both UCS-2 and UCS-4 are commonly used, depending on how it was compiled).
You are correct and further in the thread someone suggests basically an O(n log² n) algorithm which does basically what you say. The statement that no faster algorithm exists is false. The Python devs, in response, redirect users to a different thread in which they talk about other algorithms.
As you can see, the Python developers have chosen to leave the original incorrect statement about O(n²) time up in the initial post.
In most C and C++ libraries, itoa is actually done recursively like this: divide by 100000 to get the top 5 digits and the bottom 5 digits, and do the conversion for each in parallel (relying on the superscalar nature of CPUs for parallelism, no SIMD). For longs, you divide by 100000 twice and run 4 streams.
String to integer conversion, however, is a lot harder to do in log n time, but each iteration is usually faster. The same trick doesn't work - you can't efficiently do math on the base 10 string, so the equivalent division by 2^16 is very hard. I think it has to be done in linear time, but this expands to O(n log n) for arbitrary word width due to math ops.
However, a lot of what we do for atoi/itoa assumes you have a finite length. Same with the FFTs: the algorithms rely on finite length. Arbitrary-precision bignum libraries have a huge cost on trivial things like this, and it's part of the cost of doing business in Python.
There is a very good chance that the bignum library used here is not optimized for things like atoi and itoa - most bignum libraries are written for cryptography and math where these are not done frequently.
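For the reverse direction (itoa on bignums), the same divide-and-conquer idea works with divmod by a mid-sized power of ten; a sketch with arbitrary cut-offs (the digit-count estimate uses log10(2) ≈ 3/10):

```python
def int_to_dec(n):
    """Divide-and-conquer integer-to-decimal-string conversion, n >= 0."""
    if n < 10 ** 16:
        return str(n)                      # small: the builtin is fine
    half = n.bit_length() * 3 // 10 // 2   # roughly half of n's decimal digit count
    hi, lo = divmod(n, 10 ** half)
    # the low half must be zero-padded back to its full width
    return int_to_dec(hi) + int_to_dec(lo).rjust(half, '0')

print(int_to_dec(12345678901234567890))   # 12345678901234567890
```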
What's next? A default socket timeout of X seconds for security reasons? What a joke and rather scary that apparently everyone or the majority on the internal side agrees with this change.
You would also need a limit on how large of a file can be written by python. Otherwise you could have a web server that takes an upload and stores it on disk which could fill the disk of the host machine. We can't expect developers to check for this, so Python must be patched to not write files larger than 2 kilobytes.
This. I don't really understand the CPython decision-making process, but it just seems like common sense that anybody who would find this a good idea surely must be a very junior developer who shouldn't be allowed to commit directly to the master branch of your local corporate project just yet... But basically breaking a perfectly logical behaviour just like that, in a language used by millions of people... To me it's absolutely shocking.
> Everyone auditing all existing code for this, adding length guards, and maintaining that practice everywhere is not feasible nor is it what we deem the vast majority of our users want to do.
It's hard not to read this as "we want to use untrusted input everywhere with no consequences". Seems like we'll be kicking as many issues under the rug as we're fixing with this change, right?
Did they consider doing tainting (like Perl)? Input strings are marked as tainted, as is anything derived from them, except via some specific operations that untaint strings. If you use a tainted string for a security-sensitive operation, it fails. http://perlmeme.org/howtos/secure_code/taint.html
I think the broken link refers to the same information found here [0].
One of the nicest AND most frustrating parts of taint-mode, for myself, is that once activated it remains on until the end of execution. It's not scoped, which removes some of the headache of using such a thing. Switch on, and assume always on.
Back when we were writing websites in Perl (cough, 20 years ago), we also had taint mode on everywhere. It was fairly unobtrusive, and likely stopped a few attacks.
The main thing you had to remember was that the parameters (all tainted, of course) had to be filtered through a regular expression before you could use them.
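The Perl idiom translates roughly into this hypothetical Python sketch (Tainted and untaint are invented names; Python has no built-in taint tracking):

```python
import re

class Tainted:
    """Untrusted text that must pass a whitelist pattern before use."""
    def __init__(self, value):
        self._value = value

    def untaint(self, pattern):
        """Return the plain string only if it fully matches the pattern."""
        m = re.fullmatch(pattern, self._value)
        if m is None:
            raise ValueError("tainted value failed validation")
        return m.group(0)

param = Tainted("4301")
print(int(param.untaint(r"\d{1,10}")))   # 4301
```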
I read it the other way round - untrusted input is used in various places where doing such inline checks is prohibitively tricky. The examples given are quite telling: json, xmlrpc, logging. First two are everywhere in APIs. The third is just ... everywhere.
Are you really going to use a JSON or XML stream parser first before feeding it to the stdlib module? And one that does not try to expand the read values to native types? As for logging, that is certainly the place where you are not only expected, but often required to use untrusted input.
The fix feels like a heuristic and a compromise. None of the [easily available] solutions are robust, solid or performant, so someone picked an arbitrary threshold that should never be hit in sane code.
The linked issue mentions that GMP remains fast even in face of absurdly big numbers. No surprise, the library is literally designed for it: MP stands for multi-precision (ie. big int and friends).
this would all make more sense if python was using a reasonably fast string to int routine, but the one they are using is asymptotically bad, and the limit they chose is roughly a million times lower than it should have been.
> It's hard not to read this as "we want to use untrusted input everywhere with no consequences". Seems like we'll be kicking as many issues under the rug as we're fixing with this change, right?
It reminds me of this old "feature" PHP had, no doubt everyone who was on the Internet at the time of its popularity saw its unintended effects:
This is way too low; I've used RSA keys in base 10 with half the size of this string. The limit corresponds to only ~14,000-bit numbers, and there are 8192-bit keys. I'm pretty sure this will break some CTF challenges. The limit should be in the millions at the very least.
However, you shouldn't be passing million-digit numbers around as (decimal) text. Even if you're not at risk of DOS attacks, there's still the issue that it's very, very slow:
$ python3 -m timeit -s "s='1'*1000000" "i=int(s)"
1 loop, best of 5: 5.77 sec per loop
A ValueError alerting you to that fact could be considered a service.
Contrast and compare:
$ python3 -m timeit -s "s='1'*1000000" "i=int(s,16)"
200 loops, best of 5: 1.45 msec per loop
> However, you shouldn't be passing million-digit numbers around as (decimal) text
This is about numbers that are thousands of digits long, not millions. Regardless, why not? What's the alternative that supports easy exchange? If you stick it in some hexified representation, you still have to parse text and put it into some non-machine-native number container. It's going to be slow no matter what.
No, it's not going to be slow no matter what. Didn't you see my example? The hexadecimal non-machine-native textual representation was 4000 times faster than the decimal ditto. On a number that was much larger, I might add.
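The asymmetry exists because 16 is a power of two: conversion to or from hex is a linear bit-shuffle, while base 10 needs repeated big-number division. A round-trip through hex stays fast (and, incidentally, is exempt from the new digit limit):

```python
n = 3 ** 10_000           # about 4771 decimal digits
h = format(n, 'x')        # int -> hex text: linear time
assert int(h, 16) == n    # hex text -> int: also linear, round-trips instantly
```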
Sure it is. It's a concern that's addressed by using efficient libraries written in Rust or C.
Mostly.
One time I implemented RSA in Python. I needed it to read some legacy data with the wrong padding, something which the libraries couldn't (wouldn't) do. It performed just fine. The ternary pow() that does the heavy lifting is not quite as fast as the dedicated crypto libraries, but it was close enough.
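The three-argument pow() really is the whole trick; a toy illustration with textbook primes (insecure by construction, for shape only):

```python
# textbook RSA with tiny primes -- never use numbers this small for real keys
p, q = 61, 53
n = p * q                   # public modulus
e = 17                      # public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)         # private exponent: modular inverse (Python 3.8+)

msg = 42
cipher = pow(msg, e, n)     # modular exponentiation runs in C, even for bignums
print(pow(cipher, d, n))    # 42
```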
In that case, you aren't doing cryptography in Python. You're using cryptography in Python. I'm not just saying that to be pedantic, I've seen some crypto implemented in Python and it is awfully slow.
This is the case for most Python code, and is why many attempts to compile Python to native code result in slower overall performance.
> In that case, you aren't doing cryptography in Python. You're using cryptography in Python.
One of the great things about the Python community is that we don't have these kinds of artificial hangups. You just solve the problem, and if part of the solution involves something that someone wrote in a different programming language, that's fine.
Earlier this year, I implemented ECIES in Python. Not a very complicated algorithm, but it still needed to be done. So there is C involved in the underlying EC and AES implementations, big deal, I don't see how that should disqualify my ECIES work from being "doing cryptography".
-X int_max_str_digits=number
limit the size of int<->str conversions.
This helps avoid denial of service attacks when parsing untrusted data.
The default is sys.int_info.default_max_str_digits. 0 disables.
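The same knob is exposed at runtime via sys.set_int_max_str_digits (present on interpreters that ship the limit), plus a PYTHONINTMAXSTRDIGITS environment variable:

```python
import sys

if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(30_000)   # raise the per-process cap
big = int("9" * 20_000)                  # parses instead of raising ValueError
print(len(str(big)))                     # 20000
```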
this should not be a runtime configuration setting, fix the sodding algorithm to not be quadratic
will we be getting PHP style magic quotes soon? that also protects developers against untrusted input (bonus! this could be configured too!)
or an inability to pass strings into the regular expression module? that can also cause DoS
I was surprised to see this in a bugfix release, since it seems like a breaking change. But from reading, it seems that this was considered a security vulnerability (specifically a DoS opportunity) given the CVE status, so I imagine that compatibility concerns were secondary here.

This seems in line with how other languages do things from what I've seen; semver is important, but in a sense not every change is equally "breaking" to users, and breaking code that's unlikely to be common, and potentially not behaving correctly in the first place, is not going to cause as much friction as most other types of breaking changes.

Put another way, if there's a valid security concern, breaking things loudly forces users to double-check their usage of this sort of code and ensure that nothing risky is going on. (I don't personally have enough domain knowledge here to know whether the security concern is actually valid, but the decision to make this change in a patch release seems like a reasonable conclusion for people who determine that it is.)
> As a reminder to everybody the Python Community Code Of Conduct applies here.
> Closing. This is fixed. We'll open new issues for any follow up work necessary.
The issue was marked closed, because the associated work was completed and the PR was merged. The same comment happened to mention the code of conduct, but the code of conduct wasn't why the issue was closed--it was just because the work was done.
I think the comment mentioned the CoC because the previous comment, "This is appalling" was a bit rude.
> I think the comment mentioned the CoC because the previous comment, "This is appalling" was a bit rude.
The previous comment was indeed a bit rude. I personally wouldn't think it was rude enough to invoke a code of conduct.
Even just referring to a code of conduct has, IMO, a rather strong vibe of policing and perhaps even an implication of wrongdoing, more so than merely a suggestion to keep it calm.
I don't know the culture or context of Python development (either the language or CPython), but I'm inclined to agree with gdf that it's a bit weird to start reminding people of a CoC because of a slightly rude sentence or two, especially since the rest of the comment was reasonable technical argumentation even if unapologetic.
Even if closing the issue were entirely because of other reasons and benign (someone did still reference the issue in a commit later, though), it's all too easy to see the issue-closing comment as shutting out dissenting opinions, either because of a somewhat unpleasantly expressed argument or simply because "this is fixed, no further discussion needed".
The "this is appalling" comment may have been a bit rude but the closing one wasn't exactly a triumph in communication either.
> Even just referring to a code of conduct has, IMO, a rather strong vibe of policing and perhaps even an implication of wrongdoing, more so than merely a suggestion to keep it calm.
I'd say the opposite. A suggestion to "keep it calm" is inappropriate, because it carries the implication that someone is not calm. This is inappropriate because it is a comment on a person's emotional state rather than on what they say or how they say it.
In fact, if someone on my team said to "keep it calm", I'd take that person aside and explain, in private, the reasons why not to say that.
> Even if closing the issue were entirely because of other reasons and benign (someone did still reference the issue in a commit later, though), it's all too easy to see the issue-closing comment as shutting out dissenting opinions, [...]
If somebody thought that closing the issue shut out dissenting opinions, then that person has forgotten how GitHub issues work or how bug trackers work in general. Closing an issue just means that someone thinks that the work on it is done; it does not stop discussion on the issue. I can see why someone might forget and not realize that the issue was closed and not the discussion, but I don't think that it's a problem that someone visiting the bug from HN would forget how GitHub issues work for a minute.
With any online community above a certain size, there's a certain amount of policing not just of what is said, but where people have discussions. Anyone who regularly uses a forum, Subreddit, Discord server, IRC, Slack, etc. will see this pattern of behavior everywhere. For example--the discussion about whether this is the right way to fix a bug is a discussion which should be held elsewhere, where people can see the context and interested parties can respond to it.
Which is why there is a comment at the bottom,
> Please redirect further discussion to discuss.python.org.
It's crystal clear to me that this is not about shutting out dissenting voices, but just saying that this GitHub issue is the wrong place for this discussion.
You can see that there is a related issue which was closed, but there was a lot of discussion afterwards--but because the discussion was on-topic, the issue was not locked.
> I'd say the opposite. A suggestion to "keep it calm" is inappropriate, because it carries the implication that someone is not calm.
Perhaps a suggestion to "keep it calm" wouldn't be the best. English isn't my first language and my verbal expression isn't always the greatest. But referring to a code of conduct does also carry the implication that someone isn't minding that code, and I don't see how that would necessarily be better.
In my view, suggesting that someone isn't calm is less of a reprimand than suggesting they might be in breach of a code of conduct which, among other things, includes rules against outright harassment and other clearly reprehensible behaviour. It's normal to not be calm at times; it's another thing if someone needs to be reminded of the rules of a community. Perhaps it's a cultural thing but to me the latter is stronger judgement.
There may well be reasons for not saying to keep it calm (it sometimes simply doesn't work), but I can equally well see how people might see a reference to a CoC as strong-armed.
> If somebody thought that closing the issue shut out dissenting opinions, then that person has forgotten how GitHub issues work or how bug trackers work in general. Closing an issue just means that someone thinks that the work on it is done; it does not stop discussion on the issue.
That's fair enough. Perhaps the intention is clear enough within the community that it would indeed be deemed as simply closing that rather specific GitHub issue without implying that the matter is closed.
Human communication isn't always quite that simple, though. People get impressions from the way things are expressed. "This is fixed." makes it feel that there is nothing to be discussed about that particular change and that it is final.
I don't know the particular community well enough to know how it would be interpreted, though.
> Which is why there is a comment at the bottom,
>> Please redirect further discussion to discuss.python.org.
That's after the comment that closed the issue. Had it been in the issue-closing comment, that would have left a different taste to the closing.
> Perhaps a suggestion to "keep it calm" wouldn't be the best. English isn't my first language and my verbal expression isn't always the greatest. But referring to a code of conduct does also carry the implication that someone isn't minding that code, and I don't see how that would necessarily be better.
Yes, a suggestion to "keep it calm" is definitely bad.
It would be nice if there were an easy way for people who are not good at English to respond to comments without having to figure out the right way to respond. You could create a "standardized response", and have a committee of people with different backgrounds review the content of that response to ensure that it is clear and conveys the right messages.
That "standardized response" is the code of conduct.
A code of conduct is a beautiful thing. You do not need to be a skilled English speaker to send someone a link to the code of conduct.
> [...] but I can equally well see how people might see a reference to a CoC as strong-armed.
It sounds like these people may be prejudiced against the code of conduct, or prejudiced against codes of conduct in general. I think if you have such people in your community, that the right thing to do is to expose them to the code of conduct so they get used to it and realize that the code of conduct is not such a bad thing or a scary thing.
If you try and shield people in the community from the code of conduct out of fear that they might react poorly to it, then I think you're doing a disservice to the community.
Have you personally read the Python code of conduct? Or are you just imagining what is written in the code of conduct?
> Human communication isn't always quite that simple, though. People get impressions from the way things are expressed.
Yes... which is why for important messages, we have teams of people that review the messages over and over again to make sure that those messages are expressed properly and they are easy to understand. Messages like the code of conduct.
You cannot expect the same level of clarity from a two-line comment in a GitHub issue. This is why referring to the code of conduct is such a good idea--it is much clearer and easier to understand, because it has been reviewed so thoroughly.
Reading that thread left a bad taste in my mouth. OP sounds very toxic. Stuff like that is why I stay away from open source “communities” and stick to fixing issues that personally affect me.
If you need to make integers this big from decimal representations, I guess you could still use gmpy2.mpz(), and then either leave the result as an mpz object (which is generally drop-in compatible with Python's int type, with the addition of some optimized assembly implementations of arithmetic operations and some additional methods), or convert it to a Python int by calling int() on it.
Who uses a process per request for serving Python apps? That must be very uncommon. Even if you use a worker pool that isn’t going to restart a whole process just because of an errant exception in a request handler.
Also as noted if your whole process crashes because of errant input to int() you are beyond fucked in other ways.
There are inputs that can slow it down by hours. Maybe they set the limit too low. Maybe they should have instead merged the PR that improves the speed by a huge amount. They didn't.
> Chosen such that this isn't wildly slow on modern hardware and so that everyone's existing deployed numpy test suite passes before https://github.com/numpy/numpy/issues/22098 is widely available.
Could they not have modified the `int` function to `int(thingy, i_really_want_to_do_this=False)`?
Edit: Looks like they added a python argument to increase the limit. So if you really need this, I suppose you can search around until you figure out why it's not working and pass the correct argument to the python bin.
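For what it's worth, here is a minimal sketch of the knob in question (assumes Python 3.10.7+, where `sys.set_int_max_str_digits` exists; the limit can also be set at startup via the `-X int_max_str_digits` option or the `PYTHONINTMAXSTRDIGITS` environment variable):

```python
import sys

s = "1" * 5000  # 5000 decimal digits, above the default 4300-digit limit

if hasattr(sys, "set_int_max_str_digits"):  # guard: Python 3.10.7+ only
    try:
        int(s)  # raises ValueError under the new default limit
    except ValueError as exc:
        print(exc)
    sys.set_int_max_str_digits(6000)  # raise the per-interpreter limit
    print(int(s) % 10)  # now parses; prints 1
```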
Yeah, we must prevent DoS at all costs. It seems Python should never have had arbitrary-size integers in the first place, for "performance" reasons. Aren't int32/int64/int128 nice? The number of operations is always bounded. We should stick to them.
This was Python's behavior until Python 2; `long`, the arbitrary-precision integer, was a separate type, and `int` arithmetic overflow raised an OverflowError. One of the big changes in Python 2 was to imitate the behavior of Smalltalk and (most) Lisps by transparently overflowing `int` arithmetic into `long` instead of requiring an explicit `long()` cast. Python 3 eliminated the separate `long` type altogether.
Having been bitten by the Smalltalk behavior, I am skeptical that the Python 2 change was a good idea.
I keep wondering if it was as well, given code I've had to wrangle that _wants_ two's-complement fixed-size math in Python, both signed and unsigned. But our language tries not to have a bazillion different basic types, and the ill-defined Python <= 2 `int` being whatever the platform's `C long` could hold was not great, so simplifying to a single integer type in 3 was still a net win AFAICT.
It'd be nice to have a twos-complement fixed-size type too, but I think it's probably better that Python 2's int isn't that.
The problem with transparently overflowing to Python `long` is that, most of the time, the overflow is unintended, and the resulting performance collapse is a bug that's harder to track down than a ValueError.
> It takes about 50ms to parse an int string with 100,000 digits and about 5sec for 1,000,000 digits. The float type, decimal type, int.from_bytes(), and int() for binary bases 2, 4, 8, 16, and 32 are not affected.
Sure seems strange to set the limit to 4300. 50ms is not a DoS.
This is easy for huge corporations who live and breathe automated-DDoS protection without blinking an eye, but a major challenge for all of the little applications and small hosts.
For the curious: I've checked the corresponding perl5 code, and it is not affected by such JSON/YAML DoS attacks. It bails out early on overlarge bignums already.
Ruby, no idea.
If I'm understanding this correctly: the only way to convert an extremely large base10 string to an integer using the standard library is to muck with global interpreter settings?
It seems short sighted to not provide some function that mimics legacy functionality exactly. Even if it is something like int.parse_string_unlimited(). Especially since a random library can just set the cap to 0 and side-step the problem entirely.
    try:
        value = int(value_to_parse)
    except ValueError:
        import sys
        old_limit = sys.get_int_max_str_digits()
        sys.set_int_max_str_digits(0)  # 0 disables the limit entirely
        try:
            value = int(value_to_parse)
        finally:
            sys.set_int_max_str_digits(old_limit)
Or maybe just this:
    import sys

    class UnboundedIntParsing:
        def __enter__(self):
            self._old_limit = sys.get_int_max_str_digits()
            sys.set_int_max_str_digits(0)  # actually lift the limit on entry
            return self

        def __exit__(self, *args):
            sys.set_int_max_str_digits(self._old_limit)

    with UnboundedIntParsing():
        value = int(str_value)
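The same save/disable/restore dance can be written more compactly with `contextlib` (a sketch assuming Python 3.10.7+, where these `sys` hooks exist; the helper name is made up):

```python
import contextlib
import sys

@contextlib.contextmanager
def unlimited_int_parsing():
    # Temporarily disable the str -> int digit limit; 0 means "no limit".
    old_limit = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(0)
    try:
        yield
    finally:
        sys.set_int_max_str_digits(old_limit)  # restore even if parsing raises
```

Usage would be `with unlimited_int_parsing(): value = int(str_value)`; the `finally` guarantees the limit is restored even when the body raises.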
> A huge integer will always consume a near-quadratic amount of CPU time in conversion to or from a base 10 (decimal) string with a large number of digits. No efficient algorithm exists to do otherwise.
This is pretty interesting in itself. Are there other software components that have flagged & fixed this vulnerability? Seems like there should be many.
The best algorithm isn't quadratic. It's M(n)·log(n), where M(n) is the cost of your integer multiply (M(n) can theoretically be as low as n·log(n), but in practice the best algorithms in use are n·log(n)·log(log(n))). Python just didn't bother to implement them.
No, because 10 is not a power of 2, so any digit in the source (base 10) can affect any digit in the result (base 2). Converting from e.g. base 16 to base 2 is linear, because 16 is a power of 2.
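A toy illustration of that digit-locality argument (names made up for the sketch): each hex digit expands to a fixed four-bit pattern regardless of its neighbours, so base-16 to base-2 conversion is a linear table lookup. This is why the power-of-two bases 2, 4, 8, 16, and 32 are exempt from the limit.

```python
# Each hex digit maps to exactly four bits, independent of its neighbours.
HEX_TO_BITS = {d: format(int(d, 16), "04b") for d in "0123456789abcdef"}

def hex_to_bin(s: str) -> str:
    # One table lookup per digit: O(n) overall.
    bits = "".join(HEX_TO_BITS[c] for c in s.lower())
    return bits.lstrip("0") or "0"

print(hex_to_bin("ff"))  # 11111111
```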
It's stated to be CVE-2020-10735, which is apparently about a denial of service by forcing Python to inefficiently convert a very large string to an integer, using a potentially ridiculous amount of CPU time.
The CVE hasn't been published, but for example there's an explanation at
str.__mul__ is just a conveniently short way to demonstrate the issue, the target is pretty much any parsing routine exposed to outside users e.g. any JSON API.
Apologies, my comment was snark. The algorithm in question can be made soft-linear and faster implementations exist, so this seems like an incredibly myopic fix. Just make a bigger JSON blob and it will take longer to parse.
Because there is no linear-time algorithm for decimal-to-binary conversion. If we are to expose the bignum-aware `int` function to untrusted input, there should be some limit anyway. I do think the current limit of 4301 digits seems too low, though---if it were something like 1 million digits I would be okay.
there isn't a linear time algorithm, but there is an algorithm in O(n*log(n)^2) http://maths-people.anu.edu.au/~brent/pd/rpb032.pdf which is pretty close. it also seems weird to have a CVE for "some algorithms don't run in linear time". should there be a 4000 element maximum for the size of list passed to sort?
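For illustration, the divide-and-conquer idea behind those subquadratic bounds fits in a few lines (this is not CPython's implementation; the function name and the 64-digit cutoff are arbitrary): split the digit string in half, convert each half recursively, and recombine with one big multiply.

```python
def parse_decimal(s: str) -> int:
    # Divide-and-conquer decimal -> int conversion. With a subquadratic
    # multiplication routine the total cost is O(M(n) * log n) rather than
    # the schoolbook Theta(n^2) of digit-by-digit accumulation.
    if len(s) <= 64:  # small base case: direct parse, well under any limit
        return int(s)
    mid = len(s) // 2
    high, low = s[:mid], s[mid:]
    return parse_decimal(high) * 10 ** len(low) + parse_decimal(low)
```

A real implementation would also cache the powers of ten reused at each level; and since CPython's own multiply is only Karatsuba, the asymptotic win there is smaller than with a library like GMP.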
> should there be a 4000 element maximum for the size of list passed to sort?
Technically speaking, yes, there should be some limit if you are accepting an untrusted input. But there is a good argument for making this limit built-in for integers but not lists: integers are expected to be atomic while lists are widely understood as aggregates, therefore large integers can more easily propagate throughout an unsuspecting code base than large lists.
(Or, if you are just saying that once you have sub-quadratic algorithms you don't need language-imposed limits anymore, maybe you are right.)
Is there something bad going on with Python's internal representation of big integers, too? I thought I might have understood Tim Peters to be saying that in the latter thread.
It does look like gmpy2.mpz() is like 100 times faster than int() or something. Is this just because it's doing it all in assembly rather than in Python bytecodes, or are the Python data structures here also not so hot?
> It does look like gmpy2.mpz() is like 100 times faster than int() or something. Is this just because it's doing it all in assembly rather than in Python bytecodes, or are the Python data structures here also not so hot?
It's not the data structures. The data structures are really more or less the same: you have some array of words, with a length and a sign. The only real differences are in the particular length of word that you choose, which is not a very interesting difference.
Assembly language optimizations do tend to matter here, because you're working with the carry bit for lots of these operations, and each architecture also has some different way of multiplying numbers. Multiplying numbers is "funny" because it produces two words of output for one word of input.
There are also sometimes some different algorithms in use, and GMP uses some different algorithms depending on the size. Here's a page describing the algorithms used by GMP:
IMO, I wouldn't expect my language's built-in bigint type to use the best, most cutting-edge algorithms and lots of hand-tuned assembly. GMP is a specialized library for doing special things.
There are d additions, so the addition is linear time.
Each multiplication is potentially quadratic, but it seems optimizable since it’s never multiplication of two large numbers—always one large and one small number.
One mistake is assuming that the additions take constant time, but really they take ~n time each because you're adding up n-bit integers. If you accumulate your results from left to right you're adding an n-bit integer with a (~n-3)-bit integer, which results in an n-bit integer, which you add with an (~n-6)-bit integer, and so on. This sums up to \Theta(n^2).
Another issue you have not accounted for is how you convert to base two. If you want the final product to be in base two, your additions have to be in base two, so the result of your multiplications have to be in base two, which means you will have to convert 1000, 100, 10, 1 into base two as well. Again this takes \Theta(n^2) time if you cache results (1000_2 = 100_2 *_2 10_2): you're doing n operations, and the cost of each operation is 1, 2, 3, 4, ..., n (times some constant. 10^(n-1) * 10 all in base two takes \Theta(n) time).
But it's not actually worse than O(n^2), my mistake.
They’re probably seeing real-world performance being pretty good with this approach for some other reason, then; perhaps it’s quadratic, but those additions and multiplications are fast in practice, so the O(n^2) badness doesn’t show up for a while.
Each addition is linear in d, but there are d additions, so it's already quadratic before you even consider the multiplications.
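The accumulation being discussed is just Horner's rule; a minimal sketch of where the quadratic cost comes from (function name made up):

```python
def horner_parse(s: str) -> int:
    # acc grows to ~n bits, and acc * 10 must touch every word of acc,
    # so the d-th step costs Theta(d) word operations: Theta(d^2) total.
    acc = 0
    for ch in s:
        acc = acc * 10 + (ord(ch) - ord("0"))
    return acc
```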
In a power-of-2 base, the result of the multiplication is a constant number of digits (because the multiplication is just a shift of a single digit), so the additions could each be constant time in that case.
That means every limb operation should be done modulo 10^k, which would be pretty expensive and only makes sense if you don't do much computation with them so the base conversion will dominate the computation.
No. Many bignum libraries store numbers as sequences (or linked lists) of "digits", where each digit is an 8 to 256-bit binary number. Which is why I'm skeptical of lifthrasiir's claim that a bignum cannot be parsed in O(n) since it is analogous to initializing a list.