Maybe a more fully worked example is needed. You're making a blog hosting service as a service service. Bloggers have different ideas about what page titles should be.
Post Title
Blog Name: Post Title
Blog Name - Post Title
Post Title - Blog Name
Blog Name ----embdash---- Post Title
~~~ xXx Post Title xXx ~~~
It's a little overwhelming to put every possibility in a dropdown, so you allow the user to specify a format string.
title = userformats.title.fmt(post)
This doesn't look so very dangerous. And then the user can say
"{post.title} - {post.blog.title}"
"{post.title}: Another fine post by "{post.author}"
"~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~"
I think "it's not obvious how to sanitize input" is the main point here --- one advantage of the %-style format strings, they're easier to parse and escape.
We want it interpreted, but we don't have to let the language handle it directly. A much safer way is to define our own syntax for this and interpret it in our code, with that code only having access to a limited set of safe substitutions.
For instance, suppose we want to allow user formatting of contact information. A contact entry has a name, address, phone number, and email address. The user supplies a template string using %_NAME_, %_ADDR_, %_PHONE_, and %_EMAIL_ where they would like the name, address, phone number, and email address substituted, respectively.
I'd probably be doing this in Perl, and I'd do it something like this:
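For readers who prefer Python, a rough sketch of the same idea follows. The `render_contact` name and the choice to leave unknown placeholders untouched are assumptions made for illustration; the point is only that the template is interpreted by our own code against a fixed dictionary.

```python
import re

# Matches placeholders such as %_NAME_ or %_EMAIL_ (upper-case letters only).
PLACEHOLDER = re.compile(r"%_([A-Z]+)_")

def render_contact(template, contact):
    """Expand %_KEY_ placeholders using only the keys present in `contact`."""
    def substitute(match):
        key = match.group(1).lower()
        # Unknown placeholders are left as literal text, never evaluated.
        return str(contact.get(key, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)

contact = {
    "name": "Ada",
    "addr": "12 Main St",
    "phone": "555-0100",
    "email": "ada@example.com",
}
print(render_contact("%_NAME_ <%_EMAIL_> / %_PHONE_", contact))
```

Nothing the user types is ever treated as an attribute name or an expression; a placeholder is just a dictionary key.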
That's not even eval, the entire thing is a series of dynamic attribute lookups[0], you can trivially implement that in pure Python without needing to `eval` anything.
[0] it also supports mapping and sequence lookups IIRC but that's about the same thing
Would sanitizing for double underscores be enough to capture the most dangerous cases?
    import re

    def sanitize(user_input):
        """Sanitize user input for str.format

        Usage:
            sanitize("{post.title} - {post.blog.title}")
            sanitize("{post.title}: Another fine post by {post.author}")
            sanitize("~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~")
        """
        # Drop every {...} field that contains a double underscore anywhere.
        return re.sub(r'{[^}]*__[^}]*}', '', user_input)
Even better, we could specify which variables to allow in user input:
    import re

    def sanitize(user_input, *allowed_variables):
        """Sanitize user input for str.format

        Usage:
            allowed_variables = ["post.blog.title", "post.title", "post.author"]
            sanitize("{post.title} - {post.blog.title}", *allowed_variables)
            sanitize("{post.title}: Another fine post by {post.author}", *allowed_variables)
            sanitize("~~~ xXx {post.blog.__init__.dbconnection.__keys__.password} xXx ~~~", *allowed_variables)
        """
        # finditer scans the original string; each field not on the list is removed.
        for match in re.finditer(r'{([^}]*)}', user_input):
            if match[1] not in allowed_variables:
                user_input = user_input.replace(match[0], '')
        return user_input
Yes, so someone might think to verify that it matches a seemingly safe pattern like "word characters separated by zero or more periods", which is still insufficient.
"Ingesting raw user input is good if you only use it to interface with other systems that provide a way to separate data from instructions or at least escape strings."
sql injection is commonly caused by combining your query with its related data parameters in unsafe ways. you are emitting raw user input you received to another program, the database, it's your responsibility to give this to the DB safely.
you still have to be careful, and when you follow all the right best practices you can safely ingest raw user input.
I've worked at a company that escaped user input before inserting into the DB. it's a horrible nightmare I don't think anyone should have to experience.
Even if you don't emit the input you can run commands on the server. You can open up a reverse shell or deduce data from timing based side-channels. IMHO working with raw user data is bad; it should be sanitized/canonicalized before doing anything.
if you're running commands on the server I would consider your program as emitting output directly to another program. it's up to you to make sure you call it correctly and not emit raw user data
Format strings in Python are not equivalent to eval. The syntax looks similar, but it's actually limited to attribute lookups and item lookups on the objects you pass in. Now, with Python's monkey-patchability this is worrying, but it's far from being eval.
The proposed idea of relying on undocumented internals and blacklisting attribute names to securely sandbox formatting strings is _really_ _dangerous_. Never do that in production code! Language expansions could render your sandbox unsafe at any time.
You can write your own safe formatting engine in much less code.
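As a rough illustration of how little code that can take, here is a sketch. The `safe_format` name and the flat dictionary of allowed fields are assumptions for the example, not something taken from the article.

```python
import re

# A field is lower-case names joined by dots, e.g. {post.blog.title}.
FIELD = re.compile(r"\{([a-z_]+(?:\.[a-z_]+)*)\}")

def safe_format(template, values):
    """Replace {dotted.name} fields by looking the whole name up in `values`.

    The dotted name is an opaque dictionary key: no getattr or getitem is
    ever performed on anything, so there is nothing to traverse.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"field not allowed: {name}")
        return str(values[name])
    return FIELD.sub(substitute, template)

values = {"post.title": "Hello", "post.blog.title": "My Blog"}
print(safe_format("{post.title} - {post.blog.title}", values))
```

Text that does not look like a field (for example `{post[0]}`) is simply left in the output as-is in this sketch.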
The assumption behind string.format and friends is that people who override __getattr__ and __getitem__ know what they're doing. Which is probably a reasonably safe tradeoff in order to allow attribute access.
I'm not worried about executing arbitrary code, because it's not user-supplied and it's normal for a templating library to access attributes of objects you pass to it (if it was unsafe you shouldn't have done that). This is more of a snafu with the number of sensitive and powerful attributes available by default on Python objects.
Read the article. The danger is not __getattr__ and __getitem__ but rather things like event.__init__.__globals__[CONFIG][SECRET_KEY]. There is no such thing as an object where arbitrary attribute access is safe, because every object has a constructor function and every function references the globals dict.
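A minimal reproduction, for anyone who wants to see it happen. `CONFIG`, `SECRET_KEY` and the `Event` class are stand-ins invented for this sketch; the only requirement is that the class defines its own `__init__` in a module that also holds the secret.

```python
# Hypothetical module-level configuration living next to the class.
CONFIG = {"SECRET_KEY": "not-so-secret"}

class Event:
    def __init__(self, name):
        self.name = name

# A "format string" supplied by the user instead of something like "{event.name}".
user_format = "{event.__init__.__globals__[CONFIG][SECRET_KEY]}"

# str.format follows the attribute and item lookups all the way to the globals.
leaked = user_format.format(event=Event("launch"))
print(leaked)
```

No `__getattr__` or `__getitem__` override is involved anywhere; ordinary attribute access is enough.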
Shouldn't that be based not on the built-in string capabilities, but on a template engine that must also take care of much narrower cases of abuse (e.g. SQL injection, or access to only a restricted number of helper functions within the template)?
While the vuln is there, it's a rather obscure way of using the engine in the first place, so in "good" templates it shouldn't be exploitable. See the example in the release: https://www.palletsprojects.com/blog/jinja-281-released/
Sure, but like tedunangst pointed out, there are valid reasons to allow users to edit those templates, in which case you can't assume they'll be "good".
Native vs. non-native is irrelevant. A non-native formatting system could have the features of the python formatter that are dangerous. Or, if Python's native formatter lacked these dangerous features, there'd be no reason not to use it.
>Native vs. non-native is irrelevant. A non-native formatting system could have the features of the python formatter that are dangerous.
Hardly irrelevant. The idea is that a non-native formatting system is built specifically for the templating engine, and so the developers control what it can do and give it non-dangerous features.
>Or, if Python's native formatter lacked these dangerous features, there'd be no reason not to use it
Even if it did lack them, it's not under the control of the template platform author, so they should not rely on it for such a crucial part of the template engine as string formatting.
(Which is exactly the case here: Python lacked those features before, but got them, and bit the template engine in the back).
> it's not under the control of the template platform author
This doesn't really matter. If it does what you want, it's not like they're going to change the formatter's behavior from under your feet if it's part of the language's default runtime.
A template engine has multiple use cases its author can not foresee, not just creating pages.
Here they talk of using it for translation, with arbitrary user provided strings for example. Another could use it to create an SQL helper (accounting themselves for injection) etc.
In any case, it should not contain dangerous helper methods or capabilities not fit for strictly templating, just because the native string class allows them.
Internationalization should never use arbitrary string or system level formatting.
Look at MessageFormat or L20n. Yeah, I know that it's more complex than what you think you need (I've heard the "let's just use JS template literals" so many times) but you actually do.
In practice, translators are often externally contracted; the amount of work required to translate any given language is not large enough to make these full-time positions. So they can only be trusted so far.
Although this only applies to format-string evaluation in Jinja2, which would look something like
{{ "Hello {user.name}".format(user=user) }}
which is really kinda obscure, especially since Jinja already has an explicit |format using the very safe / limited (but generally sufficient) alternative %(key)s syntax (also suitable for i18n, to some extent).
Not necessarily true. Type systems can ensure safety without too much effort. For example in Haskell, you might have some sort of format function with the type signature
-- Given appropriate Error and Context types
format :: Context -> Text -> Either Error Text
Which would be guaranteed to be safe (there is no way this function can execute arbitrary code or read from your file system, short of egregious abuses of unsafePerformIO in its implementation). Note that the signature I wrote above takes an explicit context (presumably some sort of mapping of available variables), which removes the need for supporting arbitrary attribute access in the first place, and is more explicit to boot.
There is no reason other languages or libraries could not implement a function with the same degree of safety, although their type systems might not be able to guarantee its safety. In fact I would wager that such pitfalls are only likely to be found in dynamic languages, which support runtime eval (which is, after all, the root of the described problem).
Sane Lisp approach: provide easily analyzable target syntax.
This is the TXR Lisp interactive listener of TXR 163.
Use the :quit command or type Ctrl-D on empty line to exit.
1> (defvar foo 42)
foo
2> `@(list foo) ... @foo`
"42 ... 42"
What is that backticked literal? Let's quote it:
3> '`@(list foo) ... @foo`
`@(list foo) ... @foo`
Hmm, prints back in same form. Probably syntactic sugar for a list; what is in the car? cdr?
To sandbox this, we just have to walk a list and enforce some rule. For instance, the rule might be that all elements after sys:quasi must be string object or else (sys:var sym) forms, where sym is a symbol on some allowed list. Thus (list foo) would be banned.
A custom interpreter which calculates the output string while enforcing the check is trivial to write as a one-liner.
Of course if your program just evals such an untrusted quasiliteral, it has access to the dynamic/global environment:
eval is only executed in the context of some environment, and the only things it has access to are some strings, and a string concatenation procedure.
It doesn't have access to lambda, quasiliterals or quotes, or anything of the like.
I do have code like this in production, because eval only looks up within its environment... Which is empty apart from the bound let. No closure access.
My Lisp has search through a first-class environment which falls back on a dynamic/global environment. A way could be provided to suppress the fallback, like a special environment node indicating "do not search for bindings past me, stop here".
If we have a completely empty environment, then we cannot eval the (sys:quasi ...) form itself; the sys:quasi symbol itself has no binding.
A nice idea might be to have a filter object in the environment stack which says "only if you're looking for one of these specific symbols, can you proceed to the global environment". Then we can easily arrange for select global bindings to be visible.
You don't want the sys:quasi form to evaluate: It can be used as an escape hatch. If you're binding to the null-environment, you can provide everything a user can use.
But they can't escape the sandbox and do anything else.
To make a list, they need access to list, or quote, otherwise they just have number and strings, and anything else you give them.
If you give them anything more than a null-environment, you begin to lose the safety of a safe evaluation, which might be something you want to do yourself, but not something that you want a user to have access to, or they'll find an escape hatch, like sys:quasi, and gain access to areas they shouldn't.
Which is as bad as any SQL injection, if not worse.
So, a safe eval, can be created, without the need for any further analysis, if eval is handed an environment, and only uses it.
If the eval then searches the global environment, then the eval is unsafe, and you should never hand it user input, unless you have first ensured it's safe... Which defeats all the niceness provided by first class environments.
Eval is only not evil, if the programmer controls all the bindings to eval's environment.
REPLs usually use an unsafe eval and environment, and that makes sense, as you want to be able to do anything you want in code.
But a templating language doesn't need, and shouldn't use, all the unsafeness of a Turing Complete language, that might have access to your system.
Yes but quite confusing when we have another new style string format actively being discussed in recent months vs this "new-style formatting" which has existed since 2006.
It's much clearer to just refer to it as str.format.
It's what invariably happens when you incorporate words like "new", "super" or "high definition" in the name of something. I'm still hoping that at some point our industry learns that lesson.
The problem being discussed here is that the format string can access any attribute of the object passed in. From PEP3101 [0]:
" Unlike some other programming languages, you cannot embed arbitrary
expressions in format strings. This is by design - the types of
expressions that you can use is deliberately limited. Only two operators
are supported: the '.' (getattr) operator, and the '[]' (getitem)
operator. The reason for allowing these operators is that they don't
normally have side effects in non-pathological code."
Which is still not safe enough, and in effect can be used to access any global variable as described in the OP article.
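One way to see exactly which lookups a format string would perform, without performing them, is the standard library's `string.Formatter().parse`. The sketch below uses it to check fields against an allowlist; `field_names` and `is_allowed` are names made up for this example, and the check is only as good as the allowlist you give it.

```python
from string import Formatter

def field_names(format_string):
    """Return the field names str.format would resolve, without formatting."""
    names = []
    for _literal, name, spec, _conversion in Formatter().parse(format_string):
        if name is not None:
            names.append(name)
        if spec:
            # A format spec may itself contain nested replacement fields.
            names.extend(field_names(spec))
    return names

def is_allowed(format_string, allowed):
    """True only if every field in the string is on the allowlist."""
    return all(name in allowed for name in field_names(format_string))

allowed = {"post.title", "post.blog.title"}
print(field_names("{post.title} - {post.blog.title}"))
print(is_allowed("{post.title.__class__}", allowed))
```

Note the recursion into the format spec: a field such as `{post.title:{post.__init__}}` hides a second lookup there.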
Rust formatting is done by macros - which expand at compile time and accept only string literals. This makes it impossible to pass user input to format!(), println!() and their ilk unless the end-user can access your compiler, in which case you have a much larger problem.
The example is not pathological. The property that the PEP states is that the expression should not have side effects, which is the case.
Having pathological code in this context only elevates the severity from a pure data leak to partial code execution. Of course partial code execution in an unexpected context often leads to arbitrary code execution.
In many webapps, there's a very easy way to go from a data leak to arbitrary code execution. For instance, if you have signed cookies that are pickles of data, the cookie signing key lets you execute arbitrary code (because pickle deserialization can execute arbitrary code).
... which is why it's a poor idea to do that in the first place, since it has literally a single layer of protection against remote, unauthorized, arbitrary code execution, that depends on keeping a static key secret.
Much better idea to use a stupid, non-ACE serialization there.
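A sketch of what that could look like with JSON and an HMAC from the standard library. The `sign` and `verify` helpers and the cookie layout (payload, a dot, then a hex tag) are assumptions for illustration, not how any particular framework does it.

```python
import hashlib
import hmac
import json

def sign(data, key):
    """Serialize `data` as JSON and append an HMAC-SHA256 tag."""
    payload = json.dumps(data, sort_keys=True)
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return payload + "." + tag

def verify(cookie, key):
    """Check the tag, then parse the payload as plain JSON."""
    payload, _, tag = cookie.rpartition(".")
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("bad signature")
    # json.loads only ever builds dicts, lists, strings, numbers, booleans, None.
    return json.loads(payload)

key = b"example-key"  # placeholder value for the sketch
cookie = sign({"user_id": 42}, key)
print(verify(cookie, key))
```

If the key leaks, an attacker can forge cookie contents, which is bad enough, but parsing the forged payload does not run code the way unpickling can.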
The security issue here isn't with f-strings, which, oddly enough, don't have this flaw (because they are literals only). This has to do with new-style string formatting, which has no direct access to globals but can still reach them through attribute lookups.
> It should have used something with a limited list of named arguments, or maybe a dict.
That's what "old-style" python interpolation did :)
You could do |"hi my name is %s and I live in %s" % (name, place)| or use named arguments with a dict (|"hi my name is %(name)s and I live in %(place)s" % {"name": name, "place": place}|).
I like JS format strings -- you can write arbitrary code in them, but they are compile time only (and use a different string syntax). So you can have |`Hello my name is ${name} and I come from ${place}. My profession is ${generate_random_profession()}`|. The backtick-string isn't a different type, and can't be moved around like a value. It's a different kind of way of specifying a string literal, and will be evaluated when specified. No way of doing injection there.
I suspect Python wanted to make it less verbose with new-style interpolation, and went a bit overboard with field access without realizing or caring about possible security issues like this.
Of course, python has a third kind of format string; interpolation string literals (|f"hello my name is {name} ..."|), which work like JS and Ruby. This is what you should be using IMO.
I meant that you cannot use a string value as a format string. It only exists in the form of a literal, and it gets evaluated when the interpreter evaluates that literal. You cannot store it and format it later.
Also, in C# the string interpolation works only on literal strings at compile time. So it's not possible to inject malicious code this way from outside of the source code.
This is actually a problem with a lot of languages that allow runtime, template-like string interpolation.
For example, Groovy on the JVM has GStrings, with which one can do fairly nasty things.
As well it actually is fairly hard to lock down most of the template languages on the JVM for user templates. (If you are going to allow user templates I recommend one of the Java Mustache-like implementations).
This is the same principle for not passing user input to the first argument of `printf(3)` in C. Coming from C, I would never allow the user control of the format string if I were writing code.
The reasoning why printf strings in C are unsafe is not applicable in Python though. You cannot access memory directly, and %n doesn't exist (nor other fun POSIX requirements that write to random places on the stack). In fact, Python's C-style formatting strings are immune to this problem (even the extended form that allows you to name arguments).
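A quick way to convince yourself: in the named form, whatever sits between the parentheses is used as one dictionary key, dots and underscores included. The `user` dictionary is just sample data for this sketch.

```python
user = {"name": "Ada"}

# Ordinary use: a plain key lookup in the mapping.
greeting = "Hello %(name)s" % user
print(greeting)

# An attempted attribute traversal is just a key that does not exist.
try:
    "%(name.__class__)s" % user
    traversal_rejected = False
except KeyError:
    traversal_rejected = True
print(traversal_rejected)
```

A hostile %-style format string can still raise exceptions or ask for absurdly wide padding, so it is not something to accept blindly either; it just cannot walk the object graph.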
That's pointless nitpicking. The point is that user specified format strings allow the user to access things they shouldn't be able to access. It's the same in every language that has format strings.
> That's pointless nitpicking. The point is that user specified format strings allow the user to access things they shouldn't be able to access. It's the same in every language that has format strings.
It's not nitpicking. Go's format strings don't allow you to access stuff you shouldn't be able to. In fact, Go's templating language[1] doesn't allow you to access anything that you haven't explicitly provided to the template instance.
Allowing users to format text is not an impossible problem. It's the fault of language creators that they don't make this easier.
While I agree this is a possible attack vector, I think it is extremely unlikely, at least in the localization realm, for several reasons:
1. The localization company should never know what programming language you're using.
2. You shouldn't give localization companies direct access to your internal strings. More than hacking your code, they're almost guaranteed to screw the formatting up.
3. Typically, translators are hired for their native language abilities, and not for their technical prowess. I've met precious few who knew how to open a text editor, let alone hack your product via its strings.
I worked with Python and I18N/L10N for about 15 years. The way I always handled localization was to parse all our strings into a PostgreSQL database, and then provide a web interface for translators to do their work. This interface provided translators with the full-context of the strings they were translating, which internal strings often don't, prevented the inclusion of certain characters and keywords, and kept the translators from screwing up the formatting. By doing it this way, we got much better translations, and our internal strings were never out of our control.
Most internationalization these days goes through things like transifex. Becoming a translator on a specific project is not completely unlikely if you want to target it.
I find it depends on the project and the localization company. I've used big ones, like Lingotek and LION (or whatever they're called now), but there are still a lot of small companies that do things manually, and for a lot less.
Yeah, I never liked Python's "new-style" format because of this. It didn't occur to me that you could use field access to access globals (never done enough Python metaprogramming to mess with the reflection stuff), but I was afraid of arbitrary getters being invoked.
In general I'm very wary of runtime string formatting. Strings tend to be untrusted input with a large degree of freedom. format strings are almost always known at compile time (and more trustworthy). If your interpolation system is more than simply mapping keys to values or positions, you should probably restrict it to compile time. Feel free to expose a harder-to-use runtime API. Rust has compile-time format strings, for example. They're not as powerful as `str.format`, but they could be without there being security issues. JS has a different syntax for format literals. Regular strings cannot be "formatted", you must specify a string literal with backticks and that gets converted to a string value when the interpreter gets there. These literals can execute arbitrary code, but since it's just literals there's no way for an untrusted string to get in there.
One main use case for runtime string formatting is i18n. But that really should use a different solution. Most string formatting APIs are geared for programmer convenience -- the programmer is writing the code and the string. The scales shift for translators, who are only writing the strings. They don't need things like field access and stuff.
Another use case is template engines and stuff like that. In that case, field access is useful, but you probably should exert more control on these things (which is exactly what jinja2 seems to be doing here)
At one point I toyed with an idea for a super-type-safe template engine in Rust. It would validate the templates at compile time, and additionally ensure that the right types are in the right places. For example, it could ensure that strings that get interpolated with the HTML are either "trusted", escaped, or otherwise XSS-sanitized (using the type system to mark such types). Similarly, url attributes (href, etc) can only have URLs that have been checked for `javascript:`. Never got around to writing it, sadly.
I was always puzzled by Python's insistence to forego string interpolation until the latest version.
Runtime string formatting, even if done safely (e.g. .NET's String.Format() which doesn't have property access AFAIK), can still cause unexpected exceptions at the very least, and suffers from inferior performance.
So is the issue that it's a problem to let the format string be controlled arbitrarily? It's good to warn users around it because some may not know the dangers, but in general you don't trust user input, so I don't really see this as something you need to build something custom around, rather just be careful and follow good practices. It should be understood not to pass user data directly to internal code without sanitizing it. Proposing some custom code that uses undocumented internal features is overkill and also dangerous since things that aren't documented/internal can change suddenly.
> It should be understood not to pass user data directly to internal code without sanitizing it.
But how do you sanitize it? If you hadn't read this article and were told to sanitize a format string, how would you implement that?
Unless you know exactly what the problem is, you're basically reduced to writing your own string formatter. (Though see 'anderskaseorg's example, it's a lot less code if you do.)
I think asking how to sanitize it is really asking the wrong question. We shouldn't sanitize a format string from a user in the first place. We should get input and generate a format string rather than allow one to be supplied "raw".
For example, using re.sub as another commenter pointed out.
An even more Pythonic solution would be to use the built-in string.Template [1] which at a glance appears to be safe against this kind of attack (but I haven't tested in code).
I'm not sure why Armin didn't mention string.Template in his post other than that it is much more restricted than str.format, however that kind of seems like the point to me.
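For what it's worth, here is a small check of how `string.Template` treats the same kind of input. The `values` dictionary is sample data; this is a sketch of the default behaviour, not a security audit.

```python
from string import Template

values = {"title": "Post Title", "blog": "Blog Name"}

# Normal use: $name placeholders are looked up as dictionary keys.
plain = Template("$blog: $title").substitute(values)
print(plain)

# A dot ends the placeholder name, so the rest is copied through as text.
dotted = Template("$title.__class__").substitute(values)
print(dotted)

# The braced form only accepts a bare identifier; anything else is an error.
try:
    Template("${title.__class__}").substitute(values)
    braced_rejected = False
except ValueError:
    braced_rejected = True
print(braced_rejected)
```

With the default pattern there is no syntax for attribute or item access at all, which is exactly the restriction being asked for.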
I don't really understand what you mean. They're pointing out unintended consequences of allowing arbitrary string formats. You're linking to one you call "better" which would allow even more powerful misuse of arbitrary string formats.
And while Scala's string formatting looks cool, it hardly seems like your reply applies to the topic/thread. In fact ironically it seems to point out that this would be an even bigger issue in Scala.
not with string interpolation in scala.
it only works at compile time so it's not possible to actually take user input strings (without runtime reflection of course)
> it hardly seems like your reply applies to the topic/thread
yes string interpolation (scala) has less to do with it. it can't be compared to the python one.
Python's string interpolation only works at compile time as well¹, this isn't interpolation, it's a runtime method in the string class, which Scala also has, in fact, with the same name (.format).
¹ (compilation in CPython generally occurs during initialization of the execution, though you can pre-compile files manually)
thanks for correcting me, actually my statement is not wrong that it has nothing to do with `.format` I was just mistaken about the python code, I thought it means the new stuff (i.e. interpolation), but I actually skipped the first code snippet.
What the Python code is doing is arguably a form of run-time reflection. Format strings contain embedded variables and expressions, and the format function evaluates them: hence reflection.
Format strings occurring as literals in Python source could be treated at compile time, like in Scala.
If you are making a new language, the easy way to fix this is to make formatting a property of your string literals, not a runtime function. I.e. instead of having "foo {bar}".format(bar=bar), have "foo {bar}" be equivalent to "foo " + bar.
This sidesteps the problem because only literal strings are formatted.
The nice thing about the new format is that you can reverse the order of the variables inside the string, which is nice when you're localizing strings into languages with reversed sentence structure, like Japanese. I haven't done a ton of Python, but the old style won't let you do that, will it?
Funny. I thought Python implemented all the quirks of the original printf()
formatting, including using arguments at specific positions. Apparently it
didn't (and arguably it's a good thing).
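To make that concrete, a short sketch with made-up sample strings: old-style formatting can reorder values when the placeholders are named, str.format can reorder by index, and the POSIX `%1$s` form is the one thing that is rejected.

```python
args = {"count": 3, "folder": "Inbox"}

# Old style: named placeholders may appear in any order.
old_forward = "%(count)d messages in %(folder)s" % args
old_reversed = "%(folder)s: %(count)d messages" % args

# New style: positional fields can be reordered by index.
new_reversed = "{1}: {0} messages".format(3, "Inbox")

# POSIX-style positional arguments are not supported by Python's % operator.
try:
    "%1$s" % ("Inbox",)
    posix_supported = True
except ValueError:
    posix_supported = False

print(old_forward, old_reversed, new_reversed, posix_supported, sep="\n")
```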
Odd they reference C# since C# does not have this feature in its traditional format strings - they don't allow you to access arbitrary members.
C# recently added string interpolation, which does allow arbitrary code, but string interpolation itself is compiled C# code and can't be stored like a format string.
Personally I use mustache when i need format-string-like-behavior from semi-trusted users.
User-provided format strings seem rather rare to me; at least in scenarios where the user doesn't have full access anyway (e.g. controlling output formatting of some command).
I do love writing python, but it's pretty shocking when I find out you can write something like `event.__init__.__globals__[CONFIG][SECRET_KEY]`. That language just does not care about privacy or information hiding at all, I guess.
Python's general mentality can be summed up as "We are all consenting adults here". If you want to access some part of program or data, you generally can, but the burden of not breaking anything while messing with internals is on you.
This is a great power, but also can become an unlimited source of bugs.
Format strings aren't a template language (Python even has Template Strings for that purpose), and being able to do this is quite useful: in e.g. debug logs it's shorter and easier to write "{foo.x} ({foo.__class__.__name__})".format(foo=something).
Conversely they don't prohibit expensive properties either (eg. ORM-based lookups), or generating misleading output or whatever.
Guys, this has nothing to do with interpolation. I like both Python and Ruby's string interpolation. I'm replying to the incredulous parent of my post who's griping about magic methods (for which, admittedly, Ruby likely takes the crown).
Arguably Ruby is worse because you can execute instance methods in interpolation, but I still don't see why that's germane to this discussion other than to pick a fight about languages.
Doesn't this apply to any language that has string interpolation: Ruby, Python, JavaScript, Perl, etc.? And doesn't it not really matter, because it's not realistic to use a dynamic string template in a program?
Ruby format strings are like old-style Python format strings. They only map to names or positional arguments. Perl also just maps to positional arguments, like printf. The most you can do is mess up the string by moving around positional args.
JS format strings get evaluated at "compile time", as do Ruby interpolation literals. You cannot have a string with format specifiers in it, you can have a backtick-delimited string literal with format specifiers in it. The formatting is done when the interpreter encounters this literal; the user can't input such a value since these literals aren't a different kind of value.
> because it's not realistic
Done reasonably often for i18n. Not a good i18n solution, but for many it's good enough :|
> So what do you do if you do need to let someone else provide format strings? You can use the somewhat undocumented internals to change the behavior.
Well, that's just sloppy and shame on you for exposing programming internals to a user in a service. What did you expect? Write your own format string parser and stop being lazy.