
Be Careful with Python's New-Style String Format - BerislavLopac
http://lucumr.pocoo.org/2016/12/29/careful-with-str-format/
======
tedunangst
Maybe a more fully worked example is needed. You're making a blog hosting
service as a service service. Bloggers have different ideas about what page
titles should be.

    
    
        Post Title
        Blog Name: Post Title
        Blog Name - Post Title
        Post Title - Blog Name
        Blog Name ----embdash---- Post Title
        ~~~ xXx Post Title xXx ~~~
    

It's a little overwhelming to put every possibility in a dropdown, so you
allow the user to specify a format string.

    
    
        title = userformats.title.fmt(post)
    

This doesn't look so very dangerous. And then the user can say

    
    
        "{post.title} - {post.blog.title}"
        "{post.title}: Another fine post by {post.author}"
        "~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~"
    

And then oops.
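
To make the "oops" concrete, here is a minimal sketch of that scenario. The `Blog` and `Post` classes, the `_db_password` attribute, and `render_title` are hypothetical stand-ins, not code from any real blog engine. The point is only that the user-supplied format can walk from `post` to anything reachable through its attributes.

```python
class Blog:
    def __init__(self, title, db_password):
        self.title = title
        self._db_password = db_password  # hypothetical secret kept on the object


class Post:
    def __init__(self, title, author, blog):
        self.title = title
        self.author = author
        self.blog = blog


def render_title(user_format, post):
    # The format string comes from the user; only `post` is passed in.
    return user_format.format(post=post)


post = Post("Hello", "alice", Blog("My Blog", "s3cret"))

# What the feature was meant for.
benign = render_title("{post.title} - {post.blog.title}", post)

# What a hostile user can write instead.
leaked = render_title("~~~ xXx {post.blog._db_password} xXx ~~~", post)
```

Here `leaked` ends up containing the password, even though the calling code never mentions it.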

~~~
bubblesocks
Isn't ingesting user input directly considered a bad idea all around though?

~~~
raquo
Yes, but it's not obvious how to sanitize input in this case, or that it even
needs sanitizing. Formatting a string sounds pretty innocuous.

~~~
userbinator
I think "it's not obvious how to sanitize input" is the main point here ---
one advantage of the %-style format strings, they're easier to parse and
escape.

~~~
tedunangst
But we don't want the string escaped. We want it interpreted.

~~~
jgalt212
yes, basically eval is evil

~~~
masklinn
That's not even eval, the entire thing is a series of dynamic attribute
lookups[0], you can trivially implement that in pure Python without needing to
`eval` anything.

[0] it also supports mapping and sequence lookups IIRC but that's about the
same thing
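
As a rough illustration of that claim, the lookup part of a replacement field can be approximated with `getattr` and indexing alone. This is a simplified sketch, not how CPython implements it (the real parser is written in C), and `resolve_field` is a made-up name. It ignores conversions and format specs.

```python
import re
from types import SimpleNamespace

# One step of a field path: ".name" or "[key]".
_STEP = re.compile(r"\.(\w+)|\[([^\]]+)\]")


def resolve_field(field_name, kwargs):
    """Resolve a field such as 'post.blog.tags[0]' using only lookups."""
    first = re.match(r"\w+", field_name).group(0)
    obj = kwargs[first]
    for attr, key in _STEP.findall(field_name[len(first):]):
        if attr:
            obj = getattr(obj, attr)
        else:
            # Like str.format: all-digit keys become ints, others stay strings.
            obj = obj[int(key) if key.isdigit() else key]
    return obj


post = SimpleNamespace(
    blog=SimpleNamespace(tags=["python", "security"], meta={"lang": "en"})
)
second_tag = resolve_field("post.blog.tags[1]", {"post": post})
language = resolve_field("post.blog.meta[lang]", {"post": post})
```

No `eval` is involved, which is also why removing `eval` from a code base does nothing to close this hole.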

------
anderskaseorg
The proposed idea of relying on undocumented internals and blacklisting
attribute names to securely sandbox formatting strings is _really_
_dangerous_. Never do that in production code! Language expansions could
render your sandbox unsafe at any time.

You can write your own safe formatting engine in much less code.

    
    
        import re

        def safe_format(fmt, **args):
            # Plain key lookup only; unknown fields are left untouched.
            return re.sub(r'\{([^{}]*)\}',
                          lambda m: str(args.get(m.group(1), m.group(0))), fmt)

        safe_format('{foo} {bar}', foo='Hello', bar='world')  # Hello world
    

Add bells and whistles as desired.

~~~
tedunangst
The first bell added is likely to be field access.

~~~
cyphar
Which immediately becomes a security risk because classes can overload
__getattr__ and __getitem__. So now you're executing random code.

Of course, %s does the same thing (__str__) but at least %s doesn't take any
arguments.
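
A small sketch of the concern about overloaded lookups: formatting a field runs whatever code sits behind the attribute. The `Account` class and its audit log are hypothetical, chosen only to make the side effect visible.

```python
class Account:
    def __init__(self):
        self.audit_log = []

    @property
    def balance(self):
        # Hypothetical getter with a side effect.
        self.audit_log.append("balance read")
        return 100


account = Account()
text = "{a.balance}".format(a=account)  # runs the property getter
```

After the call, `account.audit_log` holds one entry, so the format string caused code on the object to execute. Whether that matters depends on what the getters of the objects you pass in actually do.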

~~~
sirclueless
The assumption behind string.format and friends is that people who override
__getattr__ and __getitem__ know what they're doing. Which is probably a
reasonably safe tradeoff in order to allow attribute access.

I'm not worried about executing arbitrary code, because it's not user-supplied
and it's normal for a templating library to access attributes of objects you
pass to it (if it was unsafe you shouldn't have done that). This is more of a
snafu with the number of sensitive and powerful attributes available by
default on Python objects.

~~~
anderskaseorg
Read the article. The danger is not __getattr__ and __getitem__ but rather
things like event.__init__.__globals__[CONFIG][SECRET_KEY]. There is no such
thing as an object where arbitrary attribute access is safe, because every
object has a constructor function and every function references the globals
dict.
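
The chain described above can be reproduced in a few lines. This is a sketch run as a plain script; `CONFIG`, `SECRET_KEY`, and `Event` are placeholder names modeled on the article's example, and the secret value is invented.

```python
CONFIG = {"SECRET_KEY": "hunter2"}  # hypothetical module-level configuration


class Event:
    def __init__(self, name):
        self.name = name


event = Event("launch")

# Untrusted format string; note that the keys are written without quotes.
user_format = "{event.name}: {event.__init__.__globals__[CONFIG][SECRET_KEY]}"
leaked = user_format.format(event=event)
```

`Event` defines no unusual attributes at all. The path goes through `__init__`, which is an ordinary function, and every function carries a reference to the globals of the module it was defined in.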

------
coldtea
Err, why would you allow for the user to enter arbitrary format strings in the
first place?

Might as well write "be careful about eval of arbitrary user provided
strings".

~~~
Animats
Internationalization, usually. Word order differs between languages, and this
kind of format allows reordering the inserted values.
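
For instance, named fields let each locale place the same values in a different order. A minimal sketch, with a made-up `DATE_FORMATS` catalog and `short_date` helper:

```python
# Hypothetical per-locale catalog: same values, different order.
DATE_FORMATS = {
    "en_US": "{month}/{day}/{year}",
    "de_DE": "{day}.{month}.{year}",
}


def short_date(locale, day, month, year):
    return DATE_FORMATS[locale].format(day=day, month=month, year=year)


us = short_date("en_US", 29, 12, 2016)
de = short_date("de_DE", 29, 12, 2016)
```

This is fine while the catalog is written by people you trust. The question in this thread is what happens once those strings come from someone else.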

~~~
coldtea
Yes, but that's meant for trusted translators of the UI, not arbitrary users,
right?

~~~
icebraining
In some applications, users can change their own translations, which are
stored in the DB, to allow for easier customization.

~~~
coldtea
Still, no reason to save them in the platform's native string format, with
free rein over native strings.

------
kazinator
Sane Lisp approach: provide easily analyzable target syntax.

    
    
      This is the TXR Lisp interactive listener of TXR 163.
      Use the :quit command or type Ctrl-D on empty line to  exit.
      1> (defvar foo 42)
      foo
      2> `@(list foo) ... @foo`
      "42 ... 42"
    

What is that backticked literal? Let's quote it:

    
    
      3> '`@(list foo) ... @foo`
      `@(list foo) ... @foo`
    

Hmm, prints back in same form. Probably syntactic sugar for a list; what is in
the car? cdr?

    
    
      4> (car '`@(list foo) ... @foo`)
      sys:quasi
      5> (cdr '`@(list foo) ... @foo`)
      ((list foo) " ... " @foo)
    

What if we type in this syntax ourselves:

    
    
      6> '(sys:quasi (list foo) " ... " (sys:var foo))
      `@(list foo) ... @foo`
    

Prints as the notation!

To sandbox this, we just have to walk a list and enforce some rule. For
instance, the rule might be that all elements after sys:quasi must be string
objects or else (sys:var sym) forms, where sym is a symbol on some allowed
list. Thus (list foo) would be banned.

A custom interpreter which calculates the output string while enforcing the
check is trivial to write as a one-liner.
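
A rough Python analogue of the same idea, for comparison: parse the format string into parts with the standard library's `string.Formatter().parse`, check every field against an allowlist, and build the output yourself. The `ALLOWED_FIELDS` set and `render_checked` are hypothetical names, and rejecting all format specs and conversions is a deliberately strict assumption (format specs can contain nested fields).

```python
from string import Formatter

ALLOWED_FIELDS = {"title", "blog", "author"}  # hypothetical allowlist


def render_checked(fmt, values):
    out = []
    for literal, field, spec, conversion in Formatter().parse(fmt):
        out.append(literal)
        if field is None:
            continue  # trailing literal text, no field attached
        # Only bare allowlisted names: no dots, indexes, specs or conversions.
        if field not in ALLOWED_FIELDS or spec or conversion:
            raise ValueError(f"field not allowed: {field!r}")
        out.append(str(values[field]))
    return "".join(out)


title = render_checked("{title} - {blog}", {"title": "Hi", "blog": "B"})
```

A field such as `{post.__init__.__globals__}` never reaches `getattr` here, because the whole field name is compared against the allowlist before anything is looked up.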

Of course, if your program just evals such an untrusted quasiliteral, it has
access to the dynamic/global environment:

    
    
       `Mouhaha! @(file-get-string "/etc/passwd")`
    

Very convenient for some attacker. :)

~~~
shakna
If your Lisp has first-class environments, then this becomes much easier,
and safer.

    
    
        (eval (read (get-some-user-string))
          (let [
                 (post-title post-title)
                 (post-date post-date)
                 (post-author post-author)
                 (fmt string-append)
               ]
               (null-environment)))
    

eval is only executed in the context of some environment, and the _only_
things it has access to are some strings, and a string concatenation
procedure.

It doesn't have access to lambda, quasiliterals or quotes, or anything of the
like.

I do have code like this in production, because eval only looks up within its
environment... Which is empty apart from the bound let. No closure access.

~~~
swift
Lua has this feature as well and it's fantastically useful for this kind of
use-case, or more generally anywhere where you want to allow scripting.

~~~
shakna
The more I use Lua, the more it seems like the sanest of the scripting
languages out there.

------
nhumrich
This article makes you think the new f-strings will be discussed. What is
really being discussed is 'string'.format(), which isn't new in any way.

~~~
eternauta3k
It's referred to in the documentation as 'New-style', with the old style being
the % operator.

~~~
tedmiston
Yes but quite confusing when we have another new style string format actively
being discussed in recent months vs this "new-style formatting" which has
existed since 2006.

It's much clearer to just refer to it as str.format.

~~~
halomru
It's what invariably happens when you incorporate words like "new", "super" or
"high definition" in the name of something. I'm still hoping that at some
point our industry learns that lesson.

------
Animats
No, Rust does not have the ability to access any variable in the program via a
format string. Rust has this:

    
    
       format!("{argument}", argument = "test");   // => "test"
    

That's just named arguments to the format. Also, that's a macro; it's expanded
at compile time.

Python's approach is lame. It should have used something with a limited list
of named arguments, or maybe a dict.

~~~
gizmo686
Neither does python.

The problem being discussed here is that the format string can access any
attribute of the object passed in. From PEP3101 [0]:

" Unlike some other programming languages, you cannot embed arbitrary
expressions in format strings. This is by design - the types of expressions
that you can use is deliberately limited. Only two operators are supported:
the '.' (getattr) operator, and the '[]' (getitem) operator. The reason for
allowing these operators is that they don't normally have side effects in non-
pathological code."

[0]
[https://www.python.org/dev/peps/pep-3101/](https://www.python.org/dev/peps/pep-3101/)
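
A short sketch of what that passage means in practice. The `Point` class and the sample data are invented for illustration.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(3, 4)
data = {"name": "origin"}

attr = "{p.x}".format(p=p)          # '.' performs getattr
item = "{d[name]}".format(d=data)   # '[]' performs getitem, key is unquoted

try:
    "{x+1}".format(x=1)             # not evaluated; 'x+1' is taken as a name
    expression_error = None
except KeyError as exc:
    expression_error = exc
```

So arbitrary expressions are indeed rejected, but the two operators that are allowed are enough to reach far beyond the object that was passed in.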

~~~
echelon
> _Neither does python._

I would argue that being able to access any global value is a sufficient
enough concern.

~~~
zardeh
The security issue here isn't with f strings, which oddly enough, don't have
this flaw (because they are literals only). This has to do with new style
string formatting, which does _not_ have access to globals.

------
agentgt
This is actually a problem with a lot of languages that allow runtime
template-like string interpolation.

For example, Groovy on the JVM has GStrings, with which one can do fairly
nasty things.

As well it actually is fairly hard to lock down most of the template languages
on the JVM for user templates. (If you are going to allow user templates I
recommend one of the Java Mustache-like implementations).

------
ryanmccullagh
This is the same principle for not passing user input to the first argument of
`printf(3)` in C. Coming from C, I would never allow the user control of the
format string if I were writing code.

~~~
cyphar
The reasoning why printf strings in C are unsafe is not applicable in Python
though. You cannot access memory directly, and %n doesn't exist (nor the other
fun POSIX requirements that write to random places on the stack). In
fact, Python's C-style formatting strings are immune to this problem (even the
extended form that allows you to name arguments).
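
To illustrate the named form: with `%`-formatting the text in parentheses is used as a single dictionary key, so there is no attribute traversal. A minimal sketch with a hypothetical `User` class:

```python
class User:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


values = {"user": User("alice")}
ok = "Hello, %(user)s" % values

try:
    "%(user.__init__)s" % values  # looked up as the literal key 'user.__init__'
    blocked = False
except KeyError:
    blocked = True
```

An untrusted `%`-format can still raise exceptions (missing keys, wrong types), so the caller needs to handle errors, but it cannot walk from `user` to other objects.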

~~~
jlarocco
That's pointless nitpicking. The point is that user specified format strings
allow the user to access things they shouldn't be able to access. It's the
same in every language that has format strings.

~~~
cyphar
> That's pointless nitpicking. The point is that user specified format strings
> allow the user to access things they shouldn't be able to access. It's the
> same in every language that has format strings.

It's not nitpicking. Go's format strings don't allow you to access stuff you
shouldn't be able to. In fact, Go's _templating language_ [1] doesn't allow
you to access anything that you haven't explicitly provided to the template
instance.

Allowing users to format text is not an impossible problem. It's the fault of
language creators that they don't make this easier.

[1]:
[https://golang.org/pkg/text/template/](https://golang.org/pkg/text/template/)

------
peterwaller
Reminds me of the turing complete interpreter lurking in libc's printf.

[https://github.com/HexHive/printbf](https://github.com/HexHive/printbf)

------
tzury
If sensitive data is available within the context that parses and executes
user input, then what does this have to do with _New-Style String Format_? I
mean,

    
    
        {event.__init__.__globals__[CONFIG][SECRET_KEY]}
    

Shall not be available to any function call, whatsoever!

------
innocentoldguy
While I agree this is a possible attack vector, I think it is extremely
unlikely, at least in the localization realm, for several reasons:

1. The localization company should never know what programming language
you're using.

2. You shouldn't give localization companies direct access to your internal
strings. More than hacking your code, they're almost guaranteed to screw the
formatting up.

3. Typically, translators are hired for their native language abilities, and
not for their technical prowess. I've met precious few who knew how to open a
text editor, let alone hack your product via its strings.

I worked with Python and I18N/L10N for about 15 years. The way I always
handled localization was to parse all our strings into a PostgreSQL database,
and then provide a web interface for translators to do their work. This
interface provided translators with the full-context of the strings they were
translating, which internal strings often don't, prevented the inclusion of
certain characters and keywords, and kept the translators from screwing up the
formatting. By doing it this way, we got much better translations, and our
internal strings were never out of our control.

~~~
the_mitsuhiko
Most internationalization these days goes through things like transifex.
Becoming a translator on a specific project is not completely unlikely if you
want to target it.

~~~
bubblesocks
I find it depends on the project and the localization company. I've used big
ones, like Lingotek and LION (or whatever they're called now), but there are
still a lot of small companies that do things manually, and for a lot less.

------
Manishearth
Yeah, I never liked Python's "new-style" format because of this. It didn't
occur to me that you could use field access to access globals (never done
enough Python metaprogramming to mess with the reflection stuff), but I was
afraid of arbitrary getters being invoked.

In general I'm very wary of runtime string formatting. Strings tend to be
untrusted input with a large degree of freedom. _format_ strings are almost
always known at compile time (and more trustworthy). If your interpolation
system is more than simply mapping keys to values or positions, you should
probably restrict it to compile time. Feel free to expose a harder-to-use
runtime API. Rust has compile-time format strings, for example. They're not as
powerful as `str.format`, but they could be without there being security
issues. JS has a different syntax for format literals. Regular strings cannot
be "formatted", you must specify a string literal with backticks and that gets
converted to a string value when the interpreter gets there. These literals
can execute arbitrary code, but since it's just literals there's no way for an
untrusted string to get in there.

One main use case for runtime string formatting is i18n. But that really
should use a different solution. Most string formatting APIs are geared for
programmer convenience -- the programmer is writing the code and the string.
The scales shift for translators, who are only writing the strings. They don't
need things like field access and stuff.

Besides, most string formatting APIs are inadequate for i18n, at least if you
want to handle stuff like pluralization
([http://mlocati.github.io/cldr-to-gettext-plural-rules/](http://mlocati.github.io/cldr-to-gettext-plural-rules/)).
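
As one example of a purpose-built i18n API, the standard library's `gettext.ngettext` picks a plural form based on a count. This sketch assumes no translation catalog is installed, in which case it falls back to the singular/plural pair given in the call. `files_removed` is a hypothetical helper.

```python
import gettext


def files_removed(n):
    # With no catalog installed, ngettext returns the first string for n == 1
    # and the second string otherwise.
    template = gettext.ngettext("%(n)d file removed", "%(n)d files removed", n)
    return template % {"n": n}


one = files_removed(1)
many = files_removed(3)
```

With a real catalog, the translation file supplies the plural rule for its language, which a bare format string cannot express.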

Another use case is template engines and stuff like that. In that case, field
access is useful, but you probably should exert more control on these things
(which is exactly what jinja2 seems to be doing here)

At one point I toyed with an idea for a super-type-safe template engine in
Rust. It would validate the templates at compile time, and additionally ensure
that the right types are in the right places. For example, it could ensure
that strings that get interpolated with the HTML are either "trusted",
escaped, or otherwise XSS-sanitized (using the type system to mark such
types). Similarly, url attributes (href, etc) can only have URLs that have
been checked for `javascript:`. Never got around to writing it, sadly.

------
unscaled
I was always puzzled by Python's insistence on forgoing string interpolation
until the latest version.

Runtime string formatting, even if done safely (e.g. .NET's String.Format()
which doesn't have property access AFAIK), can still cause unexpected
exceptions at the very least, and suffers from inferior performance.

------
yladiz
So is the issue that it's a problem to let the format string be controlled
arbitrarily? It's good to warn users about it because some may not know the
dangers, but in general you don't trust user input, so I don't really see this
as something you need to build something custom around, rather just be careful
and follow good practices. It should be understood not to pass user data
directly to internal code without sanitizing it. Proposing some custom code
that uses undocumented internal features is overkill and also dangerous since
things that aren't documented/internal can change suddenly.

~~~
geofft
> _It should be understood not to pass user data directly to internal code
> without sanitizing it._

But how do you sanitize it? If you hadn't read this article and were told to
sanitize a format string, how would you implement that?

Unless you know exactly what the problem is, you're basically reduced to
writing your own string formatter. (Though see 'anderskaseorg's example, it's
a lot less code if you do.)

~~~
tedmiston
I think asking how to sanitize it is really asking the wrong question. We
shouldn't sanitize a format string from a user in the first place. We should
get input and generate a format string rather than allow one to be supplied
"raw".

For example, using re.sub as another commenter pointed out.

An even more Pythonic solution would be to use the built-in string.Template
[1] which at a glance appears to be safe against this kind of attack (but I
haven't tested in code).

I'm not sure why Armin didn't mention string.Template in his post other than
that it is much more restricted than str.format, however that kind of seems
like the point to me.

[1]: [https://docs.python.org/3.4/library/string.html#template-
str...](https://docs.python.org/3.4/library/string.html#template-strings)
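
A quick check of that suggestion, as a sketch rather than a security review. `string.Template` only recognizes plain identifiers as placeholders, so an attribute path is not a valid placeholder. The sample titles are invented.

```python
from string import Template

user_template = Template("$title - $blog")
title = user_template.substitute(title="Hello", blog="My Blog")

# A braced attribute path is not a valid placeholder and is rejected.
try:
    Template("${post.__init__}").substitute(post=object())
    rejected = False
except ValueError:
    rejected = True

# Unbraced, only "$post" is substituted; ".title" stays literal text.
partial = Template("$post.title").safe_substitute(post="X")
```

The values are still converted with `str()`, so the objects passed in should be ones whose string form is safe to show.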

~~~
daveguy
Since 2.4:

[https://docs.python.org/2/library/string.html#template-
strin...](https://docs.python.org/2/library/string.html#template-strings)

------
angusp
It's a well-known C pattern that you should never trust a user-supplied format
string, e.g. printf(arg) vs printf("%s", arg). The same applies here.

------
Cyph0n
Scala's solution to string formatting is the best out there imo. You can put
basically any code into the format placeholder.

[http://docs.scala-lang.org/overviews/core/string-
interpolati...](http://docs.scala-lang.org/overviews/core/string-
interpolation.html)

~~~
UnoriginalGuy
I don't really understand what you mean. They're pointing out unintended
consequences of allowing arbitrary string formats. You're linking to one you
call "better" which would allow even more powerful misuse of arbitrary string
formats.

And while Scala's string formatting looks cool, it hardly seems like your
reply applies to the topic/thread. In fact ironically it seems to point out
that this would be an even bigger issue in Scala.

~~~
merb
> be an even bigger issue in Scala.

not with string interpolation in scala. it only works on compile time so it's
not possible to actually take user input strings (without runtime reflection
of course)

> it hardly seems like your reply applies to the topic/thread

yes string interpolation (scala) has less to do with it. it can't be compared
to the python one.

~~~
icebraining
Python's string interpolation only works on compile time as well¹, this isn't
interpolation, it's a runtime method in the string class, which Scala also
has, in fact, with the same name (.format).

¹ (compilation in CPython generally occurs during initialization of the
execution, though you can pre-compile files manually)

~~~
merb
thanks for correcting me. actually my statement that it has nothing to do with
`.format` is not wrong; I was just mistaken about the python code. I thought
it meant the new stuff (i.e. interpolation), but I actually skipped the first
code snippet.

------
babyrainbow
Discussion from when this PEP was accepted, with people handwaving and
downvoting arguments against it.

[https://www.reddit.com/r/Python/comments/3k6qi8/pep_498_appr...](https://www.reddit.com/r/Python/comments/3k6qi8/pep_498_approved/)

~~~
tedunangst
And again, not the feature under discussion.

~~~
babyrainbow
Not sure what you mean...

------
FeepingCreature
If you are making a new language, the easy way to fix this is to make
formatting a property of your _strings_, not a runtime function. I.e. instead
of having "foo {bar}".format(bar=bar), have "foo {bar}" be equivalent to
"foo "+bar.

This sidesteps the problem because only literal strings are formatted.
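
Python's f-strings (3.6+) behave this way. In the sketch below, `secret` and `user_input` are made-up values. Placeholders are only recognized in the literal itself, so braces inside an interpolated value are just text.

```python
secret = "hunter2"        # hypothetical sensitive value in scope
user_input = "{secret}"   # untrusted text that looks like a placeholder

# Only the literal is parsed for placeholders; the value is inserted as-is.
greeting = f"You typed: {user_input}"
```

`greeting` contains the braces verbatim. The untrusted string is never re-scanned for fields.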

------
libeclipse
I tend to use the old style ("%s" % "lel"). Just wondering, does this affect
that too?

~~~
bubblesocks
The nice thing about the new format is that you can reverse the order of the
variables inside the string, which is nice when you're localizing strings into
languages with a reversed sentence structure, like Japanese. I haven't done a
ton of Python, but the old style won't let you do that, will it?

~~~
Franciscouzo
Reversing the order is not possible, but you can use a dict instead:

    
    
        >>> "%(foo)s" % {"foo": "bar"}
        'bar'

~~~
dozzie
Funny. I thought Python implemented _all_ the quirks of the original printf()
formatting, including using arguments at specific positions. Apparently it
didn't (and arguably it's a good thing).

    
    
      $ printf '%2$s %2$s %1$s\n' aaa bbb
      bbb bbb aaa

------
Pxtl
Odd that they reference C#, since C# does not have this feature in its
traditional format strings - they don't allow you to access arbitrary members.

C# recently added string interpolation, which does allow arbitrary code, but
string interpolation itself is compiled C# code and can't be stored like a
format string.

Personally I use mustache when I need format-string-like behavior from semi-
trusted users.

------
foota
Wow, I had no idea you could access attributes from within format strings.

------
iamsk
This should be fixed built-in, like how SQL injection was fixed.

------
agumonkey
Once again strings and io formatting making fun of us.

------
bbcbasic
Pure functions ftw

------
atabdiwa
... Uncontrolled format string bugs? In 2016? Really? Someone would fall for
that? ..

~~~
the_mitsuhiko
Old style format strings are perfectly safe for arbitrary user input and
commonly used as such.

~~~
dom0
User-provided format strings seem rather rare to me; at least in scenarios
where the user doesn't have full access anyway (eg. controlling output
formatting of some command).

------
dlbucci
I do love writing python, but it's pretty shocking when I find out you can
write something like `event.__init__.__globals__[CONFIG][SECRET_KEY]`. That
language just does not care about privacy or information hiding at all, I
guess.

~~~
hakanderyal
Python's general mentality can be summed up as "We are all consenting adults
here". If you want to access some part of program or data, you generally can,
but the burden of not breaking anything while messing with internals is on
you.

This is a great power, but also can become an unlimited source of bugs.

------
ak217
Until now I thought the new features were confined to `f""` format strings.

Edit: As correctly pointed out, this feature has been around since the
introduction of str.format(). So this warning applies to all Python versions.

~~~
tedunangst
This is not a new feature.

------
stevebmark
Doesn't this apply to any language that has string interpolation: Ruby,
Python, JavaScript, Perl, etc.? And doesn't it not really matter, because it's
not realistic to use a dynamic string template in a program?

~~~
Manishearth
Ruby format strings are like old-style Python format strings. They only map to
names or positional arguments. Perl also just maps to positional arguments,
like printf. The most you can do is mess up the string by moving around
positional args.

JS format strings get evaluated at "compile time", as do Ruby interpolation
literals. You cannot have a string with format specifiers in it, you can have
a backtick-delimited string _literal_ with format specifiers in it. The
formatting is done when the interpreter encounters this literal; the user
can't input such a value since these literals aren't a different kind of
value.

> because it's not realistic

Done reasonably often for i18n. Not a good i18n solution, but for many it's
good enough :|

------
chrisdone
> So what do you do if you do need to let someone else provide format strings?
> You can use the somewhat undocumented internals to change the behavior.

Well, that's just sloppy and shame on you for exposing programming internals
to a user in a service. What did you expect? Write your own format string
parser and stop being lazy.

