All: submitted title was "`int('1' * 4301)` will raise ValueError starting with Python 3.10.7" and comments reference that, so you might want to take a look at both URLs.
Is this GitHub issue typical of the Python community nowadays?
A two-year-old CVE is picked up by gpshead as suddenly urgent enough to justify a fairly significant breaking change in a patch release, despite another issue (#90716) already tracking better solutions to the same problem.
gpshead submits a PR which doesn't solve the issue (the DoS still happens, and then a ValueError comes afterwards!), so mdickinson submits a PR which works around the problem in the first PR. Nobody else is apparently involved in discussing whether this change should even be made, and whether the fixed fix genuinely eliminates the DoS.
mdickinson then has to fight for several comments to convince gpshead to make an obviously-correct and zero-cost change to the integer type, with gpshead dropping a passive-aggressive "pedantically correct" when they finally fix it.
Then when other people finally notice the issue and start to question the need for the change, especially in the context of another issue discussing a better solution, they are ignored and the discussion is shut down with a reference to the Code of Conduct.
Why would anyone want to participate in a community this toxic?
I was the one who reported this originally in 2020. Christian said it wasn't a vulnerability because it was intended that people who used standard library functions should validate user input. A day later I get another email saying that a CVE had been assigned, and that it was now suddenly a vulnerability.
I wait a few months, follow up and nothing. Pretty much dropped it for 2 years and now I see that they finally fixed it and never let me know. I asked a few days ago whether I'd be credited in the CVE, still no reply...
Why do I feel as if there are serious communication issues here?
Digging through our history, a person who reported the same thing earlier than you never got a response at all. Like I said, we've identified organizational issues to be addressed.
(I honestly don't know who should be "credited" on the CVE nor do I have control over that, sorry)
Yep, that's how orgs work. Any org older than a certain number of years must delete itself and allow others to form another entity - or else it will amass warm bodies that collect titles and don't do anything useful. Unfortunately orgs will do everything to continue to exist, despite becoming irrelevant, inefficient, and losing their merit.
The code of conduct is a part of this: it is designed to prevent strong dissenting opinion by claiming it's offensive/improper. You cannot call out the true bad actors, because culturally we view offense as wrong and authority as correct. A CoC is basically an HR document; it's there to protect the org, not to do the right thing. History repeats itself.
Linking the Code of Conduct instead of actually pointing out anything specific is such a shitty approach to this. We're used to this kind of ToS-pointing from SV tech corporations, but it's sad that it seems to be getting more common in open source communities too.
that's all you need to know. Google makes some great stuff (the Go language), but their community has plenty of arrogant members who have no concept of, or interest in, nuanced discussions.
You mean, a guy who works at Google; for a second, I thought you were implying that name was an acronym for Google Python Services or something like that.
It's a quote from, at the very least, a recent movie. And really, my original post was "Google is a made up word", yet you used it. Even if you were okay using "google", but not "googler", saying the person "is google" is wrong - I actually clicked on the link expecting to see a corporate Google account, not just some random "person who works at Google".
I'm pretty sure the decision on how to address the bug (and the determination of whether it's even a bug) was not made by one dev. Other devs were involved, and the determination to make the change was made as a team. Having a better fix, e.g. something like what is suggested in #90716, is not precluded. As yet, no one has actually stepped up with a better int-to-str implementation, i.e. something that can be reviewed, tested, and then maintained in the long term. As discussed in #90716, there is not much point in trying to do something similar to what GMP does. People can just install that library and use it.
I'm not sure why people are so excited about this issue. It's not much different than sys.setrecursionlimit(). We know how to implement tail recursion. Python doesn't do it though and so there is a limit, set high enough that most people don't care. It seems a perfectly practical approach to me.
If you are writing a server in python, you should expect these sorts of "vulnerabilites" (ie performance problems) in exchange for the convenient abstractions that you get.
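The recursion-limit analogy is easy to see in code; a minimal sketch:

```python
import sys

def count_down(n):
    """Plain recursion; CPython does not eliminate tail calls."""
    return n if n == 0 else count_down(n - 1)

print(sys.getrecursionlimit())  # 1000 by default, high enough for sane code
print(count_down(500))          # 0: well under the limit

try:
    count_down(100_000)         # far past the limit
except RecursionError:
    print("RecursionError")     # a guard rail, not a bug in your loop
```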
but it's not a vulnerability in the traditional sense. No information gets leaked, no invalid memory is accessed. It's just slow. If being slow is a vulnerability, all python code is a vulnerability.
This is a bad take. The operation is not slow as in "takes longer", but as in "effectively never finishes". (that's what taking minutes on a single request means) That makes it trivial to DoS a service if you know it's running on a standard python runtime. And availability related to untrusted input is very much a vulnerability in a traditional sense.
They set the limit at (on my computer) 0.1ms. Also, their implementation is roughly 100x slower than it should be. Should python also error out on loops that run for more than 1000 iterations? The problem isn't that int parsing is broken, the problem is that web-servers aren't validating their inputs.
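If validation is the application's job, the guard is at least cheap to write; a hypothetical helper (the name and the 100-digit cap are made up for illustration):

```python
def safe_int(s, max_digits=100):
    """Reject absurdly long numerals before handing them to int()."""
    body = s.strip()
    if body[:1] in ('+', '-'):
        body = body[1:]              # a sign character isn't a digit
    if len(body) > max_digits:
        raise ValueError(f"numeral too long: {len(body)} digits")
    return int(s)

print(safe_int("  -4301 "))   # -4301
```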
It's different layers. You're in control of the loops, so no. (guess what though - recursion level is limited) You're not in control of the int() implementation, so that falls on python.
It's similar to what we've done with hash collisions - we can't count on everyone fixing that in every place released so far, so all languages added hash randomisation at startup. That one fortunately had no side effects like this one does.
> They set the limit at (on my computer) 0.1ms.
That seems reasonable to me. I'm not sure if you're saying it's too low or too high?
> the problem is that web-servers aren't validating their inputs.
And they won't. This would be repeated many times in various forms, affecting web servers, job queues, parsers, etc., and keep coming back for decades. We already have that with Java XML entities. They weren't fixed at the source, and we get a new implementation with that bug every day.
What do you mean by "all inputs"? Did you write your whole server from scratch? Did you verify the length of every single int that comes in your query? Fragment? Json? Header? Header part that may be parsed internally by python's standard libs? Chunk lengths? IP conversions? Every external library call? Variables that may turn numeric in future revisions? If that's true, how sure are you you haven't missed a single spot?
> like we have been doing for thirty years in web apps.
I have bad news for you. If everything you said is true, you've been working on some ideal code in a perfect team with no dependencies... But more likely you're going to have a bad collision with reality one day.
Could you point me to which part of my comment you found toxic? I thought I was careful to make sure it was only recounting facts, and the question I posed is genuine — as someone who used to contribute to Python (but hasn't worked with Python in a long time) I'd love to understand how things got to this point.
for what it's worth, I think you summarized the situation fairly accurately, but it's unclear what feeling you intend to convey with your comment other than to inflame readers emotionally
If you don't understand why I cited the code of conduct and redirected discussion to a more appropriate forum for constructive discussion, go read our code of conduct vs the language that was being directed at us and what being linked from this toxic site was about to bring.
There was no fighting. As soon as Mark piped up I was extremely pleased to see that he had found something that should've been obvious that we'd overlooked in the process of doing everything spread over time. Mark wasn't able to review the PR code before it was made public due to the current processes (lack of...) we're working to improve for the Python security response team.
"pedantically correct" was not intended to be read as passive aggressive. I use that term to mean exact vs almost when it comes to computations. I didn't need convincing. I wanted the reasoning to be made understandable to everyone else in the future (future selves included) who was going to read this code later. I still think there is room for better explanation of the math but that is true for large parts of Objects/longobject.c anyways.
I find your interpretation of events... amusing. :P
I went back and read the last comments before the CoC was invoked, and I read the CoC as well. Having done that, I don't understand which part of the CoC the comments were in violation of. Could you expand on why you think it was appropriate to shut the discussion down without even responding to those comments?
For anyone wondering, '1' * 4301 creates a string of '11111....' 4301 characters long. It doesn't result in an integer value of 4301 like in some other languages.
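Concretely (the new ValueError only appears on interpreters that ship the limit, so it is not shown here):

```python
s = '1' * 5
print(s)                  # 11111 -- sequence repetition, not arithmetic
print(int(s))             # 11111
print(len('1' * 4301))    # 4301 characters; parsed, that's a ~14,000-bit integer
```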
I find this a strange modification to the language, though probably not a particularly painful one. Has Python saved you from yourself when dealing with non-linear built-in algorithms before? IIRC it is also possible to have the regex engine take an inordinate amount of time for certain matching constructs (I think Stack Overflow was affected by this?), but the engine wasn't hobbled to throw in those cases; it is merely up to the user to write efficient regexes that aren't subject to those problems.
Backtracking regular expressions as an intentional or accidental DOS vector are a moderately well-known issue, and while I prefer that a standard library implementation be robust against them, I can see the POV that it's buyer beware.
Converting a string to an integer is somewhat less well known as a DOS vector, more painful to avoid as an application creator, and easier to fix in code.
So there's a cost-benefit argument that you should just do this before you rewrite your regex engine.
On the other hand, lots of buyers are not aware that it's an issue, and more frustratingly there are regex engines which are very resilient to it... but they are not widely used.
Python's stdlib will fall over on any exponential backtracking pattern, but last time I tried to make Postgres fall over I didn't succeed, even though it does have lookahead, lookbehind, and backrefs, so it should be susceptible to the issue (i.e. it's not a pure DFA).
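The stdlib failure mode is easy to reproduce with the classic nested-quantifier pattern; kept to tiny inputs here so it actually terminates (each extra 'a' roughly doubles the work on the failing case):

```python
import re

# nested quantifiers: exponential backtracking when the match must fail
evil = re.compile(r'^(a+)+$')

print(evil.match('a' * 20) is not None)   # True, instantly
print(evil.match('a' * 20 + 'b'))         # None, after ~2^20 backtracking steps
```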
This does seem like a strange level of handholding, even if the motivation makes lots of sense. If you start going down the road of protecting people who don't sanitize user input, you may have quite a long journey ahead...
Whether evaluating that expression results in undefined behaviour also depends on the basic execution character set and the bit width of the machine byte.
I worked with some MCUs whose only redeeming quality was their cost, and which had CHAR_BIT=16. The answer is basically everything. It made string processing profoundly annoying, even for C.
No, it was with xap [1] chips. TI is also on my list of annoying chip vendors because they're one of the few companies still making big endian ARM chips, which should die in a fire.
In Algol 68, you can do that; it's part of the standard prelude. I think that some people who'd worked on Algol 68 in the Netherlands also worked on the ABC language, where it's "1" ^^ 4301, and Guido worked on ABC before Python.
I was more expecting '1111' / 4 = '1', which would be the inverse operation. However, it opens up even more questions, like what to do if your string has mixed values, etc.
The string multiplication is about _joining_ strings, the inverse is about _splitting_ them in several parts. It's only confusing because the * appends the string to itself, the / is actually very clear.
Disagree. The inverse "string" * value is logically splitting, and then collapsing the repeated values. The logical split can be omitted, but the collapsing cannot.
I think how Rust does it is fine, but I agree operators are often a mess. Yesterday I was looking at a memory dump where there was a problem in a destructor (a double free was detected) and it was an absolute mess trying to figure out the exact execution location in source code since it was setting the value of a smart pointer which triggered a decrement of a reference counted value in turn triggering a free. It's junk like that which starts to convince me that Linus was right to avoid C++. Rust obviously also has destructors, but it doesn't have the nightmare that is inheritance+function overloading+implicit casting.
> and it was an absolute mess trying to figure out the exact execution location in source code since it was setting the value of a smart pointer which triggered a decrement of a reference counted value in turn triggering a free.
Yes, probably. Depends on the compiler settings. Stuff can get optimized out and stripped.
When writing the code in the first place, though, it's difficult to see problems like that because it's all hidden behind magic calls to copy constructors, move semantics, and destructor calls. Out of sight, out of mind.
I think it's separate from his point but some of those things could potentially be tail calls, meaning the functions actually leading to the free/delete might not be in the stacktrace even if they were called.
Succinct string operations are honestly about half of what I use Python for; the great numeric support, with bignums by default, and powerful libraries with overloads like numpy and tensorflow are the other half.
> Operator overloading sure seems to increase the prevalence of foot-guns, security issues, and other gotchas.
How exactly? What would you expect an expression like ('1' * 4301) to give you, and why would you think it would be different from ('caterpillar' * 4301)?
In Lua, the first is 4301 and the second is a runtime error. ('1' .. 4301) is "14301"; the equivalent of the weird thing Python is fixing would be spelled `tonumber(('1'):rep(4301))`, which is obviously wrong.
To my taste operator overloading is fine, but concatenation isn't addition, so they shouldn't be overloaded because... [gestures vaguely at a half dozen language]
Now I'm stumped. Isn't addition supposed to be commutative?
So yeah, without contracts in place, operator overloading is BAD. You can never know what the operator does, or what its properties are by just looking at how it's used. There's simply no enforced rules and so no-one's stopping you from doing
>>> class Complex:
...     def __init__(self, real, imag):
...         self.real = real
...         self.imag = imag
...     def __add__(self, other):
...         # deliberately wrong: subtracts instead of adding
...         return Complex(self.real - other.real, self.imag - other.imag)
...     def __repr__(self):
...         return f'Complex({self.real}+{self.imag}j)'
...
>>> x = Complex(1, 2)
>>> y = Complex(1, 2)
>>> x + y
Complex(0+0j)
Now this is intentionally malicious, of course, but plenty of libraries overload operators in non-intuitive ways, so that an operator's properties and behaviour aren't obvious. This is especially true if commutative operators are implemented as non-commutative (e.g. abusing '+' for concatenation instead of using another symbol like '&') or if the behaviour changes depending on the order of operands.
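The concatenation objection in two lines: '+' on strings keeps associativity but drops commutativity:

```python
# associative, as for + on Z
assert ('a' + 'b') + 'c' == 'a' + ('b' + 'c')

# but not commutative, unlike + on Z
assert 'ab' + 'cd' != 'cd' + 'ab'
```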
Yeah, all the examples you give rub me up the wrong way. I don't like use of + for concatenate, even though it is present in Rust which I like very much on the whole.
Here's what Rust tells programmers in core::ops (the sub-library whose safe Traits result in operator "overloading" for Rust). I put "overloading" in quotes because these operators just don't exist at all for your type if you don't implement the appropriate Trait.
> Implementations of operator traits should be unsurprising in their respective contexts, keeping in mind their usual meanings and operator precedence. For example, when implementing Mul, the operation should have some resemblance to multiplication (and share expected properties like associativity).
You know what, just for shits and giggles I actually DID open an introductory textbook on abstract algebra, just in case my undergraduate degree in maths failed me. Here's what the first chapter says about addition ('+') in Z:
Here are 4 elementary properties that + satisfies:
• (Associativity) a + (b + c) = (a + b) + c ∀a, b, c ∈ Z.
• (Existence of additive identity) a + 0 = 0 + a = a ∀a ∈ Z.
• (Existence of additive inverses) a + (−a) = (−a) + a = 0 ∀a ∈ Z.
• (Commutativity) a + b = b + a ∀a, b ∈ Z.
Also:
Key Observation: There are naturally occurring sets (other than Z and Q) which come equipped with a concept of + and ×, whose most basic properties are the same as those of the usual addition and multiplication on Z or Q.
I could go on with rings, fields, and vector spaces which also rely on the concept of addition as defined in Z, but I'm really curious to learn about addition being commonly used as a non-commutative operation, especially in the presence of an accompanying multiplication operation.
edit: I forgot to provide proper references, so here's another example including a reference:
When dealing with Abelian groups, it is customary to use additive notation.
That is, the group operation will be called "addition," and
we will "add" two elements rather than "multiply" them. We write g + h instead of g • h or gh, and ng replaces g^n.
[Walker, Elbert A., "Introduction to Abstract Algebra", 1987; Ch. 2, Pg. 70]
Other than that being a terrible name (it's almost impossible to be sure what it does without consulting documentation), I personally do prefer fewer implicit/overloaded operations.
What name would you suggest? That was my 5 min of thought version.
cc prefix for concatenate because that word is very long and it seemed likely that strings may have a large number of different concatenation focused functions that could all share the prefix.
Clone as the type of concatenation operation to perform.
I originally wrote a reply here that I deleted because I realised somebody introduced the wrong idea below. Yes, the repeat() function on strings does what you describe.
str::repeat() behaves exactly as you describe. "No".repeat(4) is "NoNoNoNo" and "What".repeat(0) is "" and so on.
Note that str::repeat() returns a String, not a str, because if n > 1 it obviously needs to allocate somewhere to put the String. As a result it is not available in environments which don't have String, like a tiny embedded device - for them str::repeat() does not exist whereas stuff like str::starts_with() and str::split_once() are fine.
Repeat is an iterator, so you can apply it to any type you want, not just strings. You can chain it with other iterators, or collect it into some data structure. But yes, repeat(0) returns an empty iterator.
This example makes a repeat(0) but asks for just the first 12 things in it, they are, of course, all twelve zeroes. Feel free to adjust it to ask for more if you think maybe there aren't any more, or they stop being zero.
> A huge integer will always consume a near-quadratic amount of CPU time in conversion to or from a base 10 (decimal) string with a large number of digits. No efficient algorithm exists to do otherwise.
I don’t believe that. I did a quick search and didn’t find much, but:
Let d_0, d_1, etc be decimal digits, little endian (so the number is d_0 + 10d_1, etc). The goal is to compute that quantity in binary.
The naive algorithm is to convert d_0 to binary. Then compute 10 in binary, multiply by d_1, and add it. Then multiply to compute 10^2 in binary, and accumulate 10^2 · d_2, and repeat. For n digits, there are n steps, and each step involves two multiplications by small factors and an addition. The overall time is O(n^2).
But I would try it differently. To convert a 2n-digit number, first convert the first n digits to binary (recursively) and convert the second n digits to binary. Then multiply the second result by 10^n and add it to the first result.
Let’s simplify this analysis by making n be a power of 2, so 2n = 2^k. Then the big powers of 10 are all 10 to a power of 2, so 10 needs to be squared k times. Additionally, there is one 2^k-digit multiplication, two 2^(k-1)-digit multiplications, four 2^(k-2)-digit multiplications, etc.
With FFT multiplication, multiplying is O(digits · poly(log(digits))), as is squaring. This will all add up to k levels of recursion, each acting on a total of 2^k or so bits, taking time O(2^k) times some log factors. This comes out to O(n · poly(log(n))).
I have not implemented this or done the math rigorously :)
edit: this is also buried in the issue. There’s also:
Indeed, the naive algorithm is quadratic due to the multiplications. Yet everyone familiar with even a bit of crypto and/or otherwise working with huge numbers knows that multiplication can be subquadratic.
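The recursive scheme above fits in a few lines of Python; a sketch only (a tuned version would precompute and reuse the powers of ten, and the subquadratic behaviour comes from Python's own big-int multiplication being better than schoolbook):

```python
def dec_to_int(s):
    """Divide-and-conquer decimal-string-to-integer conversion."""
    if len(s) <= 16:
        return int(s)                 # small chunk: the naive conversion is fine
    mid = len(s) // 2
    low_digits = len(s) - mid
    # high half shifted up by the number of low digits, plus the low half
    return dec_to_int(s[:mid]) * 10 ** low_digits + dec_to_int(s[mid:])

print(dec_to_int('1' * 20))   # 11111111111111111111
```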
Integers / floating point / bignum is such a problem in almost any language, with multiple competing implementations in most languages. I’d argue it’s probably a harder problem than Unicode support, although Python also managed to make that difficult with multiple implementations in the wild (both UCS-2 and UCS-4 are commonly used, depending on how it was compiled).
You are correct and further in the thread someone suggests basically an O(n log² n) algorithm which does basically what you say. The statement that no faster algorithm exists is false. The Python devs, in response, redirect users to a different thread in which they talk about other algorithms.
As you can see, the Python developers have chosen to leave the original incorrect statement about O(n²) time up in the initial post.
In most C and C++ libraries, itoa is actually done recursively like this: divide by 100000 to get the top 5 digits and the bottom 5 digits, and do the conversion for each in parallel (relying on the superscalar nature of CPUs for parallelism, no SIMD). For longs, you divide by 100000 twice and run 4 streams.
String to integer conversion, however, is a lot harder to do in log n time, but each iteration is usually faster. The same trick doesn't work - you can't efficiently do math on the base 10 string, so the equivalent division by 2^16 is very hard. I think it has to be done in linear time, but this expands to O(n log n) for arbitrary word width due to math ops.
However, a lot of what we do for atoi/itoa assumes you have a finite length. Same with the FFTs: the algorithms rely on finite length. Arbitrary-precision bignum libraries have a huge cost on trivial things like this, and it's part of the cost of doing business in Python.
There is a very good chance that the bignum library used here is not optimized for things like atoi and itoa - most bignum libraries are written for cryptography and math where these are not done frequently.
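For the reverse direction (itoa on bignums), the same divide-and-conquer idea works with divmod by a mid-sized power of ten; a sketch with arbitrary cut-offs (the digit-count estimate uses log10(2) ≈ 3/10):

```python
def int_to_dec(n):
    """Divide-and-conquer integer-to-decimal-string conversion, n >= 0."""
    if n < 10 ** 16:
        return str(n)                      # small: the builtin is fine
    half = n.bit_length() * 3 // 10 // 2   # roughly half of n's decimal digit count
    hi, lo = divmod(n, 10 ** half)
    # the low half must be zero-padded back to its full width
    return int_to_dec(hi) + int_to_dec(lo).rjust(half, '0')

print(int_to_dec(12345678901234567890))   # 12345678901234567890
```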
What's next? A default socket timeout of X seconds for security reasons? What a joke and rather scary that apparently everyone or the majority on the internal side agrees with this change.
You would also need a limit on how large of a file can be written by python. Otherwise you could have a web server that takes an upload and stores it on disk which could fill the disk of the host machine. We can't expect developers to check for this, so Python must be patched to not write files larger than 2 kilobytes.
This. I don't really understand the CPython decision-making process, but it just seems like common sense that anybody who would find this a good idea surely must be a very junior developer who shouldn't be allowed to commit directly to the master branch of your local corporate project just yet... But basically breaking a perfectly logical behaviour just like that, in a language used by millions of people... To me it's absolutely shocking.
> Everyone auditing all existing code for this, adding length guards, and maintaining that practice everywhere is not feasible nor is it what we deem the vast majority of our users want to do.
It's hard not to read this as "we want to use untrusted input everywhere with no consequences". Seems like we'll be kicking as many issues under the rug as we're fixing with this change, right?
Did they consider doing tainting (like Perl)? Input strings are marked as tainted, as is anything derived from them, except via some specific operations that untaint strings. If you use a tainted string for a security-sensitive operation, it fails. http://perlmeme.org/howtos/secure_code/taint.html
I think the broken link refers to the same information found here [0].
One of the nicest AND most frustrating parts of taint-mode, for myself, is that once activated it remains on until the end of execution. It's not scoped, which removes some of the headache of using such a thing. Switch on, and assume always on.
Back when we were writing websites in Perl (cough, 20 years ago), we also had taint mode on everywhere. It was fairly unobtrusive, and likely stopped a few attacks.
The main thing you had to remember was that the parameters (all tainted, of course) had to be filtered through a regular expression before you could use them.
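The Perl idiom translates roughly into this hypothetical Python sketch (Tainted and untaint are invented names; Python has no built-in taint tracking):

```python
import re

class Tainted:
    """Untrusted text that must pass a whitelist pattern before use."""
    def __init__(self, value):
        self._value = value

    def untaint(self, pattern):
        """Return the plain string only if it fully matches the pattern."""
        m = re.fullmatch(pattern, self._value)
        if m is None:
            raise ValueError("tainted value failed validation")
        return m.group(0)

param = Tainted("4301")
print(int(param.untaint(r"\d{1,10}")))   # 4301
```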
I read it the other way round - untrusted input is used in various places where doing such inline checks is prohibitively tricky. The examples given are quite telling: json, xmlrpc, logging. First two are everywhere in APIs. The third is just ... everywhere.
Are you really going to use a JSON or XML stream parser first before feeding it to the stdlib module? And one that does not try to expand the read values to native types? As for logging, that is certainly the place where you are not only expected, but often required to use untrusted input.
The fix feels like a heuristic and a compromise. None of the [easily available] solutions are robust, solid or performant, so someone picked an arbitrary threshold that should never be hit in sane code.
The linked issue mentions that GMP remains fast even in face of absurdly big numbers. No surprise, the library is literally designed for it: MP stands for multi-precision (ie. big int and friends).
this would all make more sense if python was using a reasonably fast string to int routine, but the one they are using is asymptotically bad, and the limit they chose is roughly a million times lower than it should have been.
> It's hard not to read this as "we want to use untrusted input everywhere with no consequences". Seems like we'll be kicking as many issues under the rug as we're fixing with this change, right?
It reminds me of this old "feature" PHP had, no doubt everyone who was on the Internet at the time of its popularity saw its unintended effects:
This is way too low; I've used RSA keys in base 10 with half the size of this string. The limit corresponds to only ~14,000-bit numbers, and there are 8192-bit keys. I'm pretty sure this will break some CTF challenges. The limit should be in the millions at the very least.
However, you shouldn't be passing million-digit numbers around as (decimal) text. Even if you're not at risk of DOS attacks, there's still the issue that it's very, very slow:
$ python3 -m timeit -s "s='1'*1000000" "i=int(s)"
1 loop, best of 5: 5.77 sec per loop
A ValueError alerting you to that fact could be considered a service.
Contrast and compare:
$ python3 -m timeit -s "s='1'*1000000" "i=int(s,16)"
200 loops, best of 5: 1.45 msec per loop
> However, you shouldn't be passing million-digit numbers around as (decimal) text
This is about numbers that are thousands of digits long, not millions. Regardless, why not? What's the alternative that supports easy exchange? If you stick it in some hexified representation, you still have to parse text and put it into some non-machine-native number container. It's going to be slow no matter what.
No, it's not going to be slow no matter what. Didn't you see my example? The hexadecimal non-machine-native textual representation was 4000 times faster than the decimal ditto. On a number that was much larger, I might add.
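The asymmetry exists because 16 is a power of two: conversion to or from hex is a linear bit-shuffle, while base 10 needs repeated big-number division. A round-trip through hex stays fast (and, incidentally, is exempt from the new digit limit):

```python
n = 3 ** 10_000           # about 4771 decimal digits
h = format(n, 'x')        # int -> hex text: linear time
assert int(h, 16) == n    # hex text -> int: also linear, round-trips instantly
```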
Sure it is. It's a concern that's addressed by using efficient libraries written in Rust or C.
Mostly.
One time I implemented RSA in Python. I needed it to read some legacy data with the wrong padding, something which the libraries couldn't (wouldn't) do. It performed just fine. The ternary pow() that does the heavy lifting is not quite as fast as the dedicated crypto libraries, but it was close enough.
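The three-argument pow() really is the whole trick; a toy illustration with textbook primes (insecure by construction, for shape only):

```python
# textbook RSA with tiny primes -- never use numbers this small for real keys
p, q = 61, 53
n = p * q                   # public modulus
e = 17                      # public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)         # private exponent: modular inverse (Python 3.8+)

msg = 42
cipher = pow(msg, e, n)     # modular exponentiation runs in C, even for bignums
print(pow(cipher, d, n))    # 42
```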
In that case, you aren't doing cryptography in Python. You're using cryptography in Python. I'm not just saying that to be pedantic, I've seen some crypto implemented in Python and it is awfully slow.
This is the case for most Python code, and is why many attempts to compile Python to native code result in slower overall performance.
> In that case, you aren't doing cryptography in Python. You're using cryptography in Python.
One of the great things about the Python community is that we don't have these kinds of artificial hangups. You just solve the problem, and if part of the solution involves something that someone wrote in a different programming language, that's fine.
Earlier this year, I implemented ECIES in Python. Not a very complicated algorithm, but it still needed to be done. So there is C involved in the underlying EC and AES implementations, big deal, I don't see how that should disqualify my ECIES work from being "doing cryptography".
-X int_max_str_digits=number
limit the size of int<->str conversions.
This helps avoid denial of service attacks when parsing untrusted data.
The default is sys.int_info.default_max_str_digits. 0 disables.
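The same knob is exposed at runtime via sys.set_int_max_str_digits (present on interpreters that ship the limit), plus a PYTHONINTMAXSTRDIGITS environment variable:

```python
import sys

if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(30_000)   # raise the per-process cap
big = int("9" * 20_000)                  # parses instead of raising ValueError
print(len(str(big)))                     # 20000
```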
this should not be a runtime configuration setting, fix the sodding algorithm to not be quadratic
will we be getting PHP style magic quotes soon? that also protects developers against untrusted input (bonus! this could be configured too!)
or an inability to pass strings into the regular expression module? that can also cause DoS
I was surprised to see this in a bugfix release, since it seems like a breaking change. But from reading, it seems that this was considered a security vulnerability (specifically a DoS opportunity) given the CVE status, so I imagine that compatibility concerns were secondary here.

This seems in line with how other languages do things from what I've seen; semver is important, but in a sense not every change is equally "breaking" to users, and breaking code that's unlikely to be common, and potentially not behaving correctly in the first place, is not going to cause as much friction as most other types of breaking changes.

Put another way, if there's a valid security concern, breaking things loudly forces users to double-check their usage of this sort of code and ensure that nothing risky is going on. (I don't personally have enough domain knowledge here to know whether the security concern is actually valid, but the decision to make this change in a patch release seems like a reasonable conclusion for people who determine that it is.)
> As a reminder to everybody the Python Community Code Of Conduct applies here.
> Closing. This is fixed. We'll open new issues for any follow up work necessary.
The issue was marked closed, because the associated work was completed and the PR was merged. The same comment happened to mention the code of conduct, but the code of conduct wasn't why the issue was closed--it was just because the work was done.
I think the comment mentioned the CoC because the previous comment, "This is appalling" was a bit rude.
> I think the comment mentioned the CoC because the previous comment, "This is appalling" was a bit rude.
The previous comment was indeed a bit rude. I personally wouldn't think it was rude enough to invoke a code of conduct.
Even just referring to a code of conduct has, IMO, a rather strong vibe of policing and perhaps even an implication of wrongdoing, more so than merely a suggestion to keep it calm.
I don't know the culture or context of Python development (either the language or CPython), but I'm inclined to agree with gdf that it's a bit weird to start reminding people of a CoC because of a slightly rude sentence or two, especially since the rest of the comment was reasonable technical argumentation even if unapologetic.
Even if closing the issue were entirely because of other reasons and benign (someone did still reference the issue in a commit later, though), it's all too easy to see the issue-closing comment as shutting out dissenting opinions, either because of a somewhat unpleasantly expressed argument or simply because "this is fixed, no further discussion needed".
The "this is appalling" comment may have been a bit rude but the closing one wasn't exactly a triumph in communication either.
> Even just referring to a code of conduct has, IMO, a rather strong vibe of policing and perhaps even an implication of wrongdoing, more so than merely a suggestion to keep it calm.
I'd say the opposite. A suggestion to "keep it calm" is inappropriate, because it carries the implication that someone is not calm. This is inappropriate because it is a comment on a person's emotional state rather than on what they say or how they say it.
In fact, if someone on my team said to "keep it calm", I'd take that person aside and explain, in private, the reasons why not to say that.
> Even if closing the issue were entirely because of other reasons and benign (someone did still reference the issue in a commit later, though), it's all too easy to see the issue-closing comment as shutting out dissenting opinions, [...]
If somebody thought that closing the issue shut out dissenting opinions, then that person has forgotten how GitHub issues work or how bug trackers work in general. Closing an issue just means that someone thinks that the work on it is done; it does not stop discussion on the issue. I can see why someone might forget and not realize that the issue was closed and not the discussion, but I don't think that it's a problem that someone visiting the bug from HN would forget how GitHub issues work for a minute.
With any online community above a certain size, there's a certain amount of policing not just of what is said, but where people have discussions. Anyone who regularly uses a forum, Subreddit, Discord server, IRC, Slack, etc. will see this pattern of behavior everywhere. For example--the discussion about whether this is the right way to fix a bug is a discussion which should be held elsewhere, where people can see the context and interested parties can respond to it.
Which is why there is a comment at the bottom,
> Please redirect further discussion to discuss.python.org.
It's crystal clear to me that this is not about shutting out dissenting voices, but just saying that this GitHub issue is the wrong place for this discussion.
You can see that there is a related issue which was closed, but there was a lot of discussion afterwards--but because the discussion was on-topic, the issue was not locked.
> I'd say the opposite. A suggestion to "keep it calm" is inappropriate, because it carries the implication that someone is not calm.
Perhaps a suggestion to "keep it calm" wouldn't be the best. English isn't my first language and my verbal expression isn't always the greatest. But referring to a code of conduct does also carry the implication that someone isn't minding that code, and I don't see how that would necessarily be better.
In my view, suggesting that someone isn't calm is less of a reprimand than suggesting they might be in breach of a code of conduct which, among other things, includes rules against outright harassment and other clearly reprehensible behaviour. It's normal to not be calm at times; it's another thing if someone needs to be reminded of the rules of a community. Perhaps it's a cultural thing but to me the latter is stronger judgement.
There may well be reasons for not saying to keep it calm (it sometimes simply doesn't work), but I can equally well see how people might see a reference to a CoC as strong-armed.
> If somebody thought that closing the issue shut out dissenting opinions, then that person has forgotten how GitHub issues work or how bug trackers work in general. Closing an issue just means that someone thinks that the work on it is done; it does not stop discussion on the issue.
That's fair enough. Perhaps the intention is clear enough within the community that it would indeed be deemed as simply closing that rather specific GitHub issue without implying that the matter is closed.
Human communication isn't always quite that simple, though. People get impressions from the way things are expressed. "This is fixed." makes it feel that there is nothing to be discussed about that particular change and that it is final.
I don't know the particular community well enough to know how it would be interpreted, though.
> Which is why there is a comment at the bottom,
>> Please redirect further discussion to discuss.python.org.
That's after the comment that closed the issue. Had it been in the issue-closing comment, that would have left a different taste to the closing.
> Perhaps a suggestion to "keep it calm" wouldn't be the best. English isn't my first language and my verbal expression isn't always the greatest. But referring to a code of conduct does also carry the implication that someone isn't minding that code, and I don't see how that would necessarily be better.
Yes, a suggestion to "keep it calm" is definitely bad.
It would be nice if there were an easy way for people who are not good at English to respond to comments without having to figure out the right way to respond. You could create a "standardized response", and have a committee of people with different backgrounds review the content of that response to ensure that it is clear and conveys the right messages.
That "standardized response" is the code of conduct.
A code of conduct is a beautiful thing. You do not need to be a skilled English speaker to send someone a link to the code of conduct.
> [...] but I can equally well see how people might see a reference to a CoC as strong-armed.
It sounds like these people may be prejudiced against the code of conduct, or prejudiced against codes of conduct in general. I think if you have such people in your community, that the right thing to do is to expose them to the code of conduct so they get used to it and realize that the code of conduct is not such a bad thing or a scary thing.
If you try and shield people in the community from the code of conduct out of fear that they might react poorly to it, then I think you're doing a disservice to the community.
Have you personally read the Python code of conduct? Or are you just imagining what is written in the code of conduct?
> Human communication isn't always quite that simple, though. People get impressions from the way things are expressed.
Yes... which is why for important messages, we have teams of people that review the messages over and over again to make sure that those messages are expressed properly and they are easy to understand. Messages like the code of conduct.
You cannot expect the same level of clarity from a two-line comment in a GitHub issue. This is why referring to the code of conduct is such a good idea--it is much clearer and easier to understand, because it has been reviewed so thoroughly.
Reading that thread left a bad taste in my mouth. OP sounds very toxic. Stuff like that is why I stay away from open source “communities” and stick to fixing issues that personally affect me.
If you need to make integers this big from decimal representations, I guess you could still use gmpy2.mpz(), and then either leave the result as an mpz object (which is generally drop-in compatible with Python's int type, with the addition of some optimized assembly implementations of arithmetic operations and some additional methods), or convert it to a Python int by calling int() on it.
Who uses a process per request for serving Python apps? That must be very uncommon. Even if you use a worker pool that isn’t going to restart a whole process just because of an errant exception in a request handler.
Also as noted if your whole process crashes because of errant input to int() you are beyond fucked in other ways.
There are inputs that can slow it down by hours. Maybe they set the limit too low. Maybe they should have instead merged the PR that improves the speed by a huge amount. They didn't.
> Chosen such that this isn't wildly slow on modern hardware and so that everyone's existing deployed numpy test suite passes before https://github.com/numpy/numpy/issues/22098 is widely available.
Could they not have modified the `int` function to `int(thingy, i_really_want_to_do_this=False)`?
Edit: Looks like they added a python argument to increase the limit. So if you really need this, I suppose you can search around until you figure out why it's not working and pass the correct argument to the python bin.
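For what it's worth, here is a minimal sketch of the knob in question (assumes Python 3.10.7+, where `sys.set_int_max_str_digits` exists; the limit can also be set at startup via the `-X int_max_str_digits` option or the `PYTHONINTMAXSTRDIGITS` environment variable):

```python
import sys

s = "1" * 5000  # 5000 decimal digits, above the default 4300-digit limit

if hasattr(sys, "set_int_max_str_digits"):  # guard: Python 3.10.7+ only
    try:
        int(s)  # raises ValueError under the new default limit
    except ValueError as exc:
        print(exc)
    sys.set_int_max_str_digits(6000)  # raise the per-interpreter limit
    print(int(s) % 10)  # now parses; prints 1
```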
Yeah, we must prevent DoS at all costs. It seems Python should never have had arbitrary-size integers in the first place, for "performance" reasons. Aren't int32/int64/int128 nice? The number of operations is always bounded. We should stick to them.
This was Python's behavior until Python 2; `long`, the arbitrary-precision integer, was a separate type, and `int` arithmetic overflow raised an OverflowError. One of the big changes in Python 2 was to imitate the behavior of Smalltalk and (most) Lisps by transparently overflowing `int` arithmetic into `long` instead of requiring an explicit `long()` cast. Python 3 eliminated the separate `long` type altogether.
Having been bitten by the Smalltalk behavior, I am skeptical that the Python 2 change was a good idea.
I keep wondering if it was as well, given code I've had to wrangle that _wants_ two's-complement fixed-size math in Python, both signed and unsigned. But our language tries not to have a bazillion different basic types, and the ill-defined Python <= 2 `int` being whatever the platform's `C long` could hold was not great, so simplifying to a single integer type in 3 was still a net win AFAICT.
It'd be nice to have a twos-complement fixed-size type too, but I think it's probably better that Python 2's int isn't that.
The problem with transparently overflowing to Python `long` is that, most of the time, the overflow is unintended, and the resulting performance collapse is a bug that's harder to track down than a ValueError.
> It takes about 50ms to parse an int string with 100,000 digits and about 5sec for 1,000,000 digits. The float type, decimal type, int.from_bytes(), and int() for binary bases 2, 4, 8, 16, and 32 are not affected.
Sure seems strange to set the limit to 4300. 50ms is not a DoS.
This is easy for huge corporations who live and breathe automated-DDoS protection without blinking an eye, but a major challenge for all of the little applications and small hosts.
For the curious: I've checked the corresponding perl5 code, and it is not affected by such JSON/YAML DoS attacks. It bails out early on overlarge bignums already.
Ruby, no idea.
If I'm understanding this correctly: the only way to convert an extremely large base10 string to an integer using the standard library is to muck with global interpreter settings?
It seems short sighted to not provide some function that mimics legacy functionality exactly. Even if it is something like int.parse_string_unlimited(). Especially since a random library can just set the cap to 0 and side-step the problem entirely.
    try:
        value = int(value_to_parse)
    except ValueError:
        import sys
        old_limit = sys.get_int_max_str_digits()
        sys.set_int_max_str_digits(0)  # 0 disables the limit entirely
        try:
            value = int(value_to_parse)
        finally:
            sys.set_int_max_str_digits(old_limit)
Or maybe just this:
    import sys

    class UnboundedIntParsing:
        def __enter__(self):
            self._old_limit = sys.get_int_max_str_digits()
            sys.set_int_max_str_digits(0)  # actually lift the limit on entry
            return self

        def __exit__(self, *args):
            sys.set_int_max_str_digits(self._old_limit)

    with UnboundedIntParsing():
        value = int(str_value)
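The same save/disable/restore dance can be written more compactly with `contextlib` (a sketch assuming Python 3.10.7+, where these `sys` hooks exist; the helper name is made up):

```python
import contextlib
import sys

@contextlib.contextmanager
def unlimited_int_parsing():
    # Temporarily disable the str -> int digit limit; 0 means "no limit".
    old_limit = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(0)
    try:
        yield
    finally:
        sys.set_int_max_str_digits(old_limit)  # restore even if parsing raises
```

Usage would be `with unlimited_int_parsing(): value = int(str_value)`; the `finally` guarantees the limit is restored even when the body raises.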
> A huge integer will always consume a near-quadratic amount of CPU time in conversion to or from a base 10 (decimal) string with a large number of digits. No efficient algorithm exists to do otherwise.
This is pretty interesting in itself. Are there other software components that have flagged & fixed this vulnerability? Seems like there should be many.
The best algorithm isn't quadratic. It's M(n)·log(n), where M(n) is the cost of your integer multiply (M(n) can theoretically be as low as n·log(n), but in practice the best algorithms in use are n·log(n)·log(log(n))). Python just didn't bother to implement them.
No, because 10 is not a power of 2, so any digit in the source (base 10) can affect any digit in the result (base 2). Converting from e.g. base 16 to base 2 is linear, because 16 is a power of 2.
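A toy illustration of that digit-locality argument (names made up for the sketch): each hex digit expands to a fixed four-bit pattern regardless of its neighbours, so base-16 to base-2 conversion is a linear table lookup. This is why the power-of-two bases 2, 4, 8, 16, and 32 are exempt from the limit.

```python
# Each hex digit maps to exactly four bits, independent of its neighbours.
HEX_TO_BITS = {d: format(int(d, 16), "04b") for d in "0123456789abcdef"}

def hex_to_bin(s: str) -> str:
    # One table lookup per digit: O(n) overall.
    bits = "".join(HEX_TO_BITS[c] for c in s.lower())
    return bits.lstrip("0") or "0"

print(hex_to_bin("ff"))  # 11111111
```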
It's stated to be CVE-2020-10735, which is apparently about a denial of service by forcing Python to inefficiently convert a very large string to an integer, using a potentially ridiculous amount of CPU time.
The CVE hasn't been published, but for example there's an explanation at
str.__mul__ is just a conveniently short way to demonstrate the issue, the target is pretty much any parsing routine exposed to outside users e.g. any JSON API.
Apologies, my comment was snark. The algorithm in question can be made soft-linear and faster implementations exist, so this seems like an incredibly myopic fix. Just make a bigger JSON blob and it will take longer to parse.
Because there is no linear-time algorithm for decimal-to-binary conversion. If we are to expose the bignum-aware `int` function to untrusted input, there should be some limit anyway. I do think the current limit of 4301 digits seems too low, though---if it were something like 1 million digits I would be okay.
there isn't a linear time algorithm, but there is an algorithm in O(n*log(n)^2) http://maths-people.anu.edu.au/~brent/pd/rpb032.pdf which is pretty close. it also seems weird to have a CVE for "some algorithms don't run in linear time". should there be a 4000 element maximum for the size of list passed to sort?
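For illustration, the divide-and-conquer idea behind those subquadratic bounds fits in a few lines (this is not CPython's implementation; the function name and the 64-digit cutoff are arbitrary): split the digit string in half, convert each half recursively, and recombine with one big multiply.

```python
def parse_decimal(s: str) -> int:
    # Divide-and-conquer decimal -> int conversion. With a subquadratic
    # multiplication routine the total cost is O(M(n) * log n) rather than
    # the schoolbook Theta(n^2) of digit-by-digit accumulation.
    if len(s) <= 64:  # small base case: direct parse, well under any limit
        return int(s)
    mid = len(s) // 2
    high, low = s[:mid], s[mid:]
    return parse_decimal(high) * 10 ** len(low) + parse_decimal(low)
```

A real implementation would also cache the powers of ten reused at each level; and since CPython's own multiply is only Karatsuba, the asymptotic win there is smaller than with a library like GMP.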
> should there be a 4000 element maximum for the size of list passed to sort?
Technically speaking, yes, there should be some limit if you are accepting an untrusted input. But there is a good argument for making this limit built-in for integers but not lists: integers are expected to be atomic while lists are widely understood as aggregates, therefore large integers can more easily propagate throughout an unsuspecting code base than large lists.
(Or, if you are just saying that once you have sub-quadratic algorithms you don't need language-imposed limits anymore, maybe you are right.)
Is there something bad going on with Python's internal representation of big integers, too? I thought I might have understood Tim Peters to be saying that in the latter thread.
It does look like gmpy2.mpz() is like 100 times faster than int() or something. Is this just because it's doing it all in assembly rather than in Python bytecodes, or are the Python data structures here also not so hot?
> It does look like gmpy2.mpz() is like 100 times faster than int() or something. Is this just because it's doing it all in assembly rather than in Python bytecodes, or are the Python data structures here also not so hot?
It's not the data structures. The data structures are really more or less the same: you have some array of words, with a length and a sign. The only real differences are in the particular length of word that you choose, which is not a very interesting difference.
Assembly language optimizations do tend to matter here, because you're working with the carry bit for lots of these operations, and each architecture also has some different way of multiplying numbers. Multiplying numbers is "funny" because it produces two words of output for one word of input.
There are also sometimes some different algorithms in use, and GMP uses some different algorithms depending on the size. Here's a page describing the algorithms used by GMP:
IMO, I wouldn't expect my language's built-in bigint type to use the best, most cutting-edge algorithms and lots of hand-tuned assembly. GMP is a specialized library for doing special things.
There are d additions, so the addition is linear time.
Each multiplication is potentially quadratic, but it seems optimizable since it’s never multiplication of two large numbers—always one large and one small number.
One mistake is assuming that the additions take constant time, but really they take ~n time each because you're adding up n-bit integers. If you accumulate your results from left to right you're adding an n-bit integer with a (~n-3)-bit integer, which results in an n-bit integer, which you add with an (~n-6)-bit integer, and so on. This sums up to \Theta(n^2).
Another issue you have not accounted for is how you convert to base two. If you want the final product to be in base two, your additions have to be in base two, so the result of your multiplications have to be in base two, which means you will have to convert 1000, 100, 10, 1 into base two as well. Again this takes \Theta(n^2) time if you cache results (1000_2 = 100_2 *_2 10_2): you're doing n operations, and the cost of each operation is 1, 2, 3, 4, ..., n (times some constant. 10^(n-1) * 10 all in base two takes \Theta(n) time).
But it's not actually worse than O(n^2), my mistake.
They’re probably seeing real-world performance being pretty good with this approach for some other reason, then; perhaps it’s quadratic, but those additions and multiplications are fast in practice, so the O(n^2) badness doesn’t show up for a while.
Each addition is linear in d, but there are d additions, so it's already quadratic before you even consider the multiplications.
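The accumulation being discussed is just Horner's rule; a minimal sketch of where the quadratic cost comes from (function name made up):

```python
def horner_parse(s: str) -> int:
    # acc grows to ~n bits, and acc * 10 must touch every word of acc,
    # so the d-th step costs Theta(d) word operations: Theta(d^2) total.
    acc = 0
    for ch in s:
        acc = acc * 10 + (ord(ch) - ord("0"))
    return acc
```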
In a power-of-2 base, the result of the multiplication is a constant number of digits (because the multiplication is just a shift of a single digit), so the additions could each be constant time in that case.
That means every limb operation should be done modulo 10^k, which would be pretty expensive and only makes sense if you don't do much computation with them so the base conversion will dominate the computation.
No. Many bignum libraries store numbers as sequences (or linked lists) of "digits", where each digit is an 8 to 256-bit binary number. Which is why I'm skeptical of lifthrasiir's claim that a bignum cannot be parsed in O(n) since it is analogous to initializing a list.