These two issues have been relevant for over 20 years, older than today's college grads. I find it fascinating that new blog posts are still being written on a regular basis explaining these two pitfalls. And there are probably a million blogs with these same two examples going back decades. This is an epic re-post in spirit.

I'm tempted to ask, "Why hasn't this been fixed yet?" Where "fixed" means, "Something new programmers just starting off their careers don't have to jam into their brains?"

I know the expected answer will be: it's an abstraction of a more complex problem of understanding data and how it is used, and we can talk about how JS and PHP have added native functions to construct custom code to address this problem (hack cough cough hack).

But these two cases in particular stick in my craw because there is no fundamental solution presenting yet.

So I ask again (and I've been asking this since 2004-ish):

Why do these two examples still persist? Why do the frameworks not eliminate them by construction? This is such a repeated pattern, why is it even there?



It boils down to two things. One is that library/tool design typically makes it too easy for user input to end up in-band with execution. The other is that most tutorials/guides only show you how to do things in-band.

An example of a tutorial for SQL:

SELECT first_name FROM users WHERE last_name = 'Smith'

They then have an exercise to hook this query to a text box in the program, where through omission, the programmer is guided to use string concatenation to build the query.

If every SQL statement a programmer saw, from the first to the last, were parameterized, it would be much harder for them to reach for string concatenation.
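A tutorial that starts out parameterized could look like the following minimal sketch. It assumes Python's built-in sqlite3 module and an in-memory database; the table contents and the helper name find_first_names are made up for illustration.

```python
import sqlite3

# Illustrative in-memory database; the rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Alice", "Smith"), ("Bob", "Jones")],
)


def find_first_names(conn, last_name):
    # The ? placeholder keeps last_name out of band: it is bound as a value
    # and is never parsed as part of the SQL text.
    rows = conn.execute(
        "SELECT first_name FROM users WHERE last_name = ?", (last_name,)
    )
    return [row[0] for row in rows]


print(find_first_names(conn, "Smith"))
# A classic injection string is just an unusual last name here.
print(find_first_names(conn, "' OR '1'='1"))
```

Hooking the text box to find_first_names never requires building SQL from strings, so the exercise does not nudge the learner toward concatenation.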

Most modern web development frameworks make it very hard to insert un-escaped text into the DOM. You have to go out of your way to introduce an XSS vulnerability in your web application with one, and most of the tutorials and documentation about the framework warn you about using the raw HTML functionality.

Another way to look at it is that the out-of-band way of doing things is typically perceived as lower in performance, harder to do, and/or less elegant (e.g., C-style strings vs Pascal strings).

I consider anything with user input that is done in-band (e.g., where escaping is the fix) to be doomed to fail. This is similar in idea to the cryptographic doom principle, where decrypting before authenticating the message is ultimately doomed to failure.


> If every SQL statement a programmer saw, from the first to the last, were parameterized, it would be much harder for them to reach for string concatenation.

I dunno -- I've been doing C programming for 30-ish years now, but just learned SQL about a year ago. Every man page I looked at, as well as every Stack Overflow question, emphasized the importance of using parametrized queries. And IIRC in Python, "only execute a single statement" is enabled by default; if you want to execute multiple statements, you have to use a different call. So even if you somehow manage to forget to parameterize your queries, you'll still be safe from Little Bobby Tables.

Do SQL injection attacks still actually happen? How is it possible?


SQL injection attacks still happen, a LOT.

Making sure your query only executes a single statement is not enough to prevent SQL injection (depending on how you concatenate the query). An attacker just has to use the provided context to get or set the data they need to escalate their permissions. E.g., if the query is SELECT stuff FROM table, they might be able to inject into it such that their input replaces stuff, and then they can select whatever the querying user has access to.
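Here is a minimal sketch of that point, assuming Python's sqlite3 module, whose execute() call runs one statement at a time. The tables, the payload, and the function name unsafe_search are invented for illustration; the vulnerable function is written that way on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# executescript is used only to set up the illustrative tables.
conn.executescript("""
    CREATE TABLE products (name TEXT, category TEXT);
    CREATE TABLE secrets (token TEXT);
    INSERT INTO products VALUES ('widget', 'tools');
    INSERT INTO secrets VALUES ('s3cr3t');
""")


def unsafe_search(conn, category):
    # Deliberately vulnerable: user input is concatenated into the SQL text.
    query = "SELECT name FROM products WHERE category = '" + category + "'"
    return [row[0] for row in conn.execute(query)]


# Still a single statement, but the UNION pulls rows from another table
# and the trailing -- comments out the leftover quote.
payload = "x' UNION SELECT token FROM secrets --"
leaked = unsafe_search(conn, payload)
print(leaked)
```

No second statement is ever executed, yet the query returns data from a table the developer never meant to expose.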

2019: For its "State of the Internet" report, Akamai analyzed data gathered from users of its Web application firewall technology between November 2017 and March 2019. The exercise shows that SQL injection (SQLi) now represents nearly two-thirds (65.1%) of all Web application attacks. That's up sharply from the 44% of Web application layer attacks that SQLi represented just two years ago.


> The exercise shows that SQL injection (SQLi) now represents nearly two-thirds (65.1%) of all Web application attacks.

Are you sure that you don't actually mean "weird strings sent to web applications" when you write "web application attacks"? Sending a weird string to a web application is not an attack in any particularly relevant sense unless there is an actual vulnerability--other than when Akamai wants to sell you snake oil against all the danger of weird strings.


A lot of websites and applications are built for a purpose and then forgotten. I'm thinking especially of people building them in their free time for fun, or of small company projects.


Many applications were written 5, 10, or even 25 years ago. They haven't been updated to current standards.


> The other is that most tutorials/guides only show you how to do things in-band.

Learning SQL is completely different from learning how to use SQL in whatever programming language/framework/library you may end up using. Learning how to safely interface with an RDBMS is going to be entirely specific to the stack you’re using for the rest of your application.


> Why do the frameworks not eliminate them by construction?

They do. Other than the pure PHP example, which simply predates modern approaches to web security and is somewhat of an intentional misuse, modern templating engines (which the article also mentions) default to escaped output. That still means that new devs have to be aware of the mechanism and not go out of their way to shoot themselves in the foot by bypassing it, which I guess explains the blog posts. I honestly don't think it's that bad; that's just part of any generation of developers learning the basics of their domain. To me, XSS still being in the OWASP top 10 was always more of an indication that we suck at training (for their basic stack and security-minded development) rather than some conceptual failure of the frameworks we use.

There are plenty of "fixes by construction" out there, but that doesn't stop new devs from not using them or experienced folk from making an error every once in a while.


Thanks for the concise response.

> that we suck at training

That makes me wonder what kind of training companies require. How many companies hire based on DIY examples in interviews, and think "ok, this new hire knows enough", rather than run the risk of essentially re-training 90% of what they already know, despite that 10% being critical knowledge?

I don't have a sense of what dev training looks like across the industry.


Company I'm at now requires basic security training every year. TBH it kinda sucks at showing solutions to these kinds of problems, but at least it makes people aware of the risks. I think that might be a PCI compliance thing, but I'm not sure.


I feel like that's a hard problem that goes beyond hiring these days; I would love an answer to that as well. My personal approach for junior positions has mostly been to hire rather selectively (when I can) to get people who at least recognize when they might lack knowledge in a certain area, team them up during the onboarding period, and apply somewhat strict code review policies, at least in the beginning.

Not mandating this training for all new hires is a symptom of my aversion to most classroom settings, though; I've had quite a few developers who enjoyed getting this style of training after they indicated they wanted it down the road. I personally wouldn't have enjoyed the 90% retraining scenario (the monetary loss that implies aside). I've found training on specific aspects with a bit of practical engagement to be more effective, e.g. there are great and engaging courses that teach basic web security. Not that these are always up to date or that trainees retain everything, but it gets them into the right mindset to be aware of issues.

But of course, even with an approach that works 100% of the time, these days that doesn't guarantee that all of your dependencies or outsourced code are up to the same standard.

tl;dr is "I don't know either" I guess but maybe you can take something away from it.


> (hack cough cough hack)... there is no fundamental solution presenting yet.

Because people think those hacks are fundamental solutions (see: this blog title).

But really, the fundamental solution is finally at long last treating programming as a form of engineering.

> I know the expected answer will be: it's an abstraction of a more complex problem of understanding data and how it is used... Why do the frameworks not eliminate them by construction?

Because in any non-trivial system there are always edge cases, and attackers will find the edge cases. This is why XSS persists even as template engines have taken over. "filter output" is not a panacea. Nothing can replace carefully thinking about the entire range of possible inputs and their related outputs.

But instead of educating programmers to think carefully about how to specify and design robust systems, the software industry repeats gang-of-four-style mantras like "escape output". Even while admitting those solutions don't work universally and offering "get security review" as some sort of universal fix.


It's interesting that single page apps actually have a benefit here. If you generate DOM with code, you can just assign anything you like to el.textContent and you'll not need to muck around with sanitization libraries and edge cases.

Basically the same principle as using parametrized SQL queries.


Part of the problem is the use of a general "String" data type in many languages. Libraries that deal with SQL or HTML or anything similar shouldn't use String in their APIs. Instead they ought to have more specific "EscapedString" and "UnescapedString" types so that there's no ambiguity about which is which.


While I agree about String, EscapedString conflates rendering/output with the data model, which is the core of the issue. Application developers should not have to touch escaping or text rendering on their own for formats like XML, JSON, SQL, etc.

XML should not be built by concatenation; instead, use a DOM plus a proper renderer/transformer/writer. Same for SQL: prepared statements plus bindings...
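As a rough illustration of building markup through a tree rather than by concatenation, here is a sketch using Python's standard xml.etree.ElementTree. The element names and the render_comment helper are invented for this example; the point is only that the serializer, not the application code, decides how text is escaped.

```python
import xml.etree.ElementTree as ET


def render_comment(author, body):
    # Build a tree; text is attached as data and never parsed as markup.
    comment = ET.Element("div", {"class": "comment"})
    name = ET.SubElement(comment, "b")
    name.text = author
    paragraph = ET.SubElement(comment, "p")
    paragraph.text = body
    # The serializer escapes &, < and > in text nodes on the way out.
    return ET.tostring(comment, encoding="unicode")


markup = render_comment("mallory", "<script>alert(1)</script>")
print(markup)
```

The application code never calls an escape function, which is the same division of labor as assigning to el.textContent in the browser or binding a parameter in SQL.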


Our programming languages suck at providing useful types.

"String" is a structural data type. "SQL query" and "HTML snippet" and "regular expression" and "user-entered text" are semantic types which can be stored in strings, but are all quite distinct in meaning and usage.

You shouldn't be allowed (by the language's type system) to pass user-entered text to a SQL query function, without perhaps first calling a function with a scary name like "convert_raw_unsafe_text_to_query". A string is not a string is not a string. Or make a DOM-for-SQL so we never have to touch syntactic strings.
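A rough sketch of that idea in Python follows. The class names UserText and SqlQuery and the function run_query are hypothetical, and the check happens at runtime here; a statically typed language or a type checker could reject the bad call before the program runs, and could also restrict how a SqlQuery may be constructed, which this sketch does not attempt.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class UserText:
    """Text that came from outside the program. It is data, never SQL."""
    value: str


@dataclass(frozen=True)
class SqlQuery:
    """SQL text written by the programmer, using ? placeholders."""
    text: str


def convert_raw_unsafe_text_to_query(raw: str) -> SqlQuery:
    # The deliberately scary escape hatch: easy to grep for in code review.
    return SqlQuery(raw)


def run_query(conn, query, *params):
    # Runtime stand-in for a compile-time check: a plain str is rejected.
    if not isinstance(query, SqlQuery):
        raise TypeError(f"expected SqlQuery, got {type(query).__name__}")
    values = [p.value if isinstance(p, UserText) else p for p in params]
    return conn.execute(query.text, values).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO users VALUES ('Alice', 'Smith')")

by_last_name = SqlQuery("SELECT first_name FROM users WHERE last_name = ?")
print(run_query(conn, by_last_name, UserText("Smith")))

try:
    # A concatenated plain string no longer type-checks as a query.
    run_query(conn, "SELECT first_name FROM users WHERE last_name = '" + "Smith" + "'")
except TypeError as err:
    print("rejected:", err)
```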

(It's exactly the same problem as units of measure. 5.0 feet is not the same type as 5.0 meters, and you shouldn't be able to add 5.0 + 5.0 if you didn't declare they have matching units, or define a way to convert as necessary. Numeric types in most languages don't have associated units, either, unfortunately.)
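The units analogy can be sketched the same way. The Feet and Meters classes below are hypothetical and checked at runtime; the only outside fact used is that one foot is 0.3048 meters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meters:
    value: float

    def __add__(self, other):
        # Only meters can be added to meters.
        if not isinstance(other, Meters):
            return NotImplemented
        return Meters(self.value + other.value)


@dataclass(frozen=True)
class Feet:
    value: float

    def __add__(self, other):
        # Only feet can be added to feet.
        if not isinstance(other, Feet):
            return NotImplemented
        return Feet(self.value + other.value)

    def to_meters(self):
        # Explicit, named conversion: 1 ft = 0.3048 m.
        return Meters(self.value * 0.3048)


total = Meters(5.0) + Feet(5.0).to_meters()
print(total)

try:
    Meters(5.0) + Feet(5.0)
except TypeError as err:
    print("rejected:", err)
```

Mixing the two without a conversion fails loudly instead of silently producing 10.0 of nothing in particular.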

Hungarian notation tried to partially solve this, by giving up on the built-in type system and using variable names to encode intent. That solution is ugly so it's been abandoned, and it's the wrong place to solve it, anyway.

Programming languages today don't provide appropriate abstract data types for strings, or make it easy to define your own. Popular libraries for SQL/HTML/regex/etc don't require special string types. Since there are no standard types, it'd be a pain for any users who need to use more than one library, too.

We need either one popular language to do this (which others might then copy), or two popular libraries (a coalition). It also needs a catchy name for this style of programming, to help shame old languages/libraries that don't support it.


React does this (at least the escaping of output part). Bunch of PHP frameworks do it too.

But I think it's just a natural part of the power of computers, except most people don't think in Lisp ("any" text could be turned into code), they think in Java (I have this static, rigid, compiled code, and that's the only thing that runs).


> Why do these two examples still persist? Why do the frameworks not eliminate them by construction? This is such a repeated pattern, why is it even there?

If I can type a query into a SQL prompt but your framework won't let me put it in there, I am first going to conclude that your framework is broken. No matter how good the reasoning is for why you did it.

Worse yet, it sometimes is broken. Smart databases understand that whether they should use a particular index depends on the value that is passed in. Using the index for the most common value is a huge performance penalty, and so is failing to use it for the rest. The only way to get good performance is to pass the value in as a literal for the case where the index must not be used. (You can parametrize the rest, at least in Oracle. But the special one has to be hard-coded in the string so that the optimizer sees it.) This is a rare case, but when it comes up, I really care.

If your framework won't let me fix a performance problem that I know how to fix, I'm going to switch frameworks.

And even worse, parsing SQL is more complex than you think. If your super safe framework doesn't agree with the database it can reject valid SQL or fail to provide the safety that it thought it did.

As an example, in PostgreSQL I can use $$ as a quote mark. This is super convenient for stored procedures. If your super-safe framework doesn't let me do that because it thinks it is a syntax error or recognizes it as unsafe, I will switch to something that can let me write stored procedures. If your super-safe framework doesn't recognize that it is a quote mark, then it isn't offering protection. If your super-safe framework tries to analyze it correctly, you're now attempting to analyze run-time strings that I am building inside of the database in a Turing complete language. Good luck with that. (Hint, Turing proved that it is an impossible task.)

Now I'm admittedly in the 0.1% of people using these tools. However others trust me to know what tools to recommend. So experts like me have an outsized impact.


Just recognize that you're holding SQL to an unfair standard here. You wouldn't reject an HTTP framework because you can't paste raw HTTP into it. You wouldn't reject an IMAP framework because you can't paste raw IMAP to it.

You're requiring that the maintenance hatch be the front door. It should be no surprise that such a design results in lots of people accidentally breaking things.

As you say, fewer than 1 in 1000 people have your needs. Why would you recommend a tool whose features are more dangerous than useful for them?


I am holding SQL to the same standard that I would hold, say, a web framework. Simple things should be simple.

Injecting stuff into a dynamic protocol is inherently harder than injecting text into a text document. A text framework that doesn't accept text is going to be a fail.

> As you say, fewer than 1 in 1000 people have your needs.

My needs 99% of the time are not that unusual. What puts me in the 0.1% among general developers is the level of knowledge that I have about weird edge cases and how databases work on the inside.

> Why would you recommend a tool whose features are more dangerous than useful for them?

Your question presumes the answer to a question that I think you are wrong on.

My very first point was that if I can type it into a SQL prompt, I need to be able to put it into my database.

For someone who is just learning, this convenience is essential. And any tool that complicates their life by forcing them to learn a bunch of stuff before they can do the very simplest thing is a barrier to learning. A barrier that they are likely to solve by finding a tool that makes the simple thing simple. They will only learn about the gotchas down the road.

Case in point. Back in the mid-90s someone wrote a bunch of CGI scripts to make personal home pages easy to write. In fact that is what it was called. Personal Home Page / Forms Interpreter. It accidentally turned into a language that, after several rewrites, is now known as PHP.

When I first encountered it in the early 2000s, every competent developer that I knew (myself included) said, "This is poorly designed crap that will cause a lot of problems." We were right. However it was poorly designed CONVENIENT crap. Convenience won.

Related, see https://www.dreamsongs.com/RiseOfWorseIsBetter.html.


Taking input and transforming it into output is the fundamental task of programming.

New developers often struggle with fundamentals and will usually only test the input they expect.

Someone else has to intentionally give you bad input before you realise that's a thing people will do and something you need to think about.

It doesn't help that most tutorials focus on getting output (yay results!) rather than focusing on how to get consistent transformation of input to output. The result is a lot of tutorials that focus on getting something done and forget / assume the fundamentals.


I haven’t used Elm, but I’ve read that it uses strong typing to distinguish between escaped and non-escaped data. That sounds like a good general solution to the problem, as the compiler will prevent you from using unescaped data in a dangerous context.


Any sort of language that allows you to define custom types (e.g., objects) and type-hint parameters allows you to do this. You can accomplish this same thing in PHP even (the type checking is at runtime, but same idea).

Types are not restricted to just a description of how the data is represented in the computer, otherwise we would need nothing but primitives.

When you perform calculations with physical measurements containing units you don't simply throw away all of the type information while performing calculations -- you perform the same operations on the units both as part of the answer and as an essential check that you've done the right thing. You should do the same thing with your data.

See, e.g., https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...


Even the Joel article makes what's arguably a mistake: he says that input from users is "unsafe" and must be escaped on output, while strings from elsewhere shouldn't. That may avoid security exploits, but it still results in incorrect output when a predefined value really does need to be escaped.

The issue isn't whether a value originated from the user. It's the units/data type, as you said, such as plain text vs. HTML.
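To make the plain-text-versus-HTML distinction concrete, here is a small sketch using Python's standard html.escape. The Html wrapper and the text and concat helpers are hypothetical names. Note that the value being escaped is a predefined constant, not user input.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """A fragment that is already valid HTML markup."""
    markup: str


def text(value: str) -> Html:
    # Plain text becomes HTML by escaping it, whatever its origin.
    return Html(html.escape(value))


def concat(*parts) -> Html:
    # Only Html fragments may be joined; a bare str is a type error.
    for part in parts:
        if not isinstance(part, Html):
            raise TypeError(f"expected Html, got {type(part).__name__}")
    return Html("".join(part.markup for part in parts))


company = "Johnson & Johnson"  # predefined value, not user input
page = concat(Html("<p>"), text(company), Html("</p>"))
print(page.markup)
```

A scheme that escapes only "unsafe" user input would emit the bare ampersand here, which is not an exploit but is still incorrect HTML.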


> You can accomplish this same thing in PHP even

Sure, you can, but the key question is, is it typically done in popular frameworks? (Maybe it is! I’m not a PHP user)

I should have distinguished between Elm the language and Elm the web framework; I guess it’s really the framework I’m talking about.

> (the type checking is at runtime, but same idea).

That’s not the same at all.


I've heard that some Haskell frameworks do this as well.

I am heavily using Java's Servlet framework and the blatant spraying of Strings everywhere is astounding in this age. I understand that backwards compatibility is an issue, but one could have provided another API alongside it for optional use and deprecated the current one.


I imagine all Haskell frameworks do; the ones I've tried surely do. Haskellers are accustomed to mixing string types, since the default String type is inefficient (a linked list of Char) and most import a library to provide immutable ~Pascal strings. And since this is such a common occurrence, the syntax has support for multiple string literal types. An application developer can literally create their own. It's also trivial to create a new type in Haskell that has minimal runtime cost, so it's pretty harmless.

I think if you don't have an easy way of creating string literals in the type you want, the developers will at some point reach for the deprecated api, and at that point you're just requiring good hygiene. Which is exactly what you're trying to stop. Language support is critical in being able to get away with this.


>the developers will at some point reach for the deprecated api

Agreed, however you can have organizational measures to prevent this (a build time check). And of course a change in the framework must be accompanied by decent conversion libraries (I don't think this is different in Haskell).


Have they been relevant? I haven't heard the ‘sanitise your inputs’ advice in years. In fact, the take in this blog post seems to be the predominant one.

Well, maybe it's still common in PHP, I don't know. I haven't touched that either in a while.


>These two issues have been relevant for over 20 years, older than today's college grads...

My very 1st reaction - wow it's 2020 already and the issue is still hot. And the truth is that they are quite common in practice.
