

PostgreSQL: Don't use CHAR or VARCHAR - Jonanin
http://blog.jonanin.com/2013/11/20/postgresql-char-varchar/

======
susi22
I have to disagree. You should _always_ put limits on everything, as said
nicely here [1]:

"Put a limit on everything. Everything that can happen repeatedly put a high
limit on it and raise or lower the limit as needed. Block users if the limit
is passed. This protects the service. Example is uploading files of logos for
subreddits. Users figured out they could upload really big files and harm the
system. Don’t accept huge text blobs either. Someone will figure out how to
send you 5GB of text."

Sure, you should ideally do this in your application code. But if there are
multiple interfaces to your database (such as a REST API), then you have to
remember to put the limits in place everywhere. I don't see a good reason to
make a username field TEXT instead of a generous VARCHAR(300). If somebody
wants to choose a longer username than that, he's probably malicious. It
protects you at zero cost and allows you to make some user input sanitization
mistakes (we're all human) in your application code.

[1] [http://highscalability.com/blog/2013/8/26/reddit-lessons-
lea...](http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-
mistakes-made-scaling-to-1-billi.html)

~~~
joevandyk
You can put check constraints on a TEXT field to prevent bad data.

~~~
Demiurge
Isn't that what varchar basically is?

~~~
radiowave
One difference is that changing the length of a varchar column in an existing
table can be a major pain if you have other database objects that must be the
same type. For example, any views which also carry this column will need to be
dropped and recreated (or otherwise modified), or else Postgres won't allow
you to make the change.

Working with the text datatype and using check constraints on length makes
this much easier.
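
A sketch of the difference, assuming a hypothetical "users" table with a
"name" column (all names here are made up): relaxing a length limit that
lives in a check constraint touches only the constraint, not the column
type, so dependent views are left alone.

      -- text column, with the length limit expressed as a constraint
      ALTER TABLE users DROP CONSTRAINT users_name_len;
      ALTER TABLE users ADD CONSTRAINT users_name_len
          CHECK (length(name) <= 300);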

------
rosser
FTFA:

 _If you want to change the max length to be larger, postgres will have to
rewrite the table, which can take a long time and requires an exclusive table
lock for the entirety of the operation._

As of (IIRC) 9.2, this is no longer true. If you alter a varchar column to be
_narrower_ than it currently is, you'll rewrite the table. If you make it
wider, or convert from varchar(n) to text, you won't.

EDIT: And if you're willing to hack the system catalogs instead of using an
ALTER TABLE, you can even get around the rewrite to narrow a varchar(n)
column. See:
[http://stackoverflow.com/a/7732880](http://stackoverflow.com/a/7732880)
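
For example (a sketch against a hypothetical "users" table, matching the
9.2+ behaviour described above):

      ALTER TABLE users ALTER COLUMN name TYPE varchar(500); -- widening: no rewrite
      ALTER TABLE users ALTER COLUMN name TYPE text;         -- varchar(n) to text: no rewrite
      ALTER TABLE users ALTER COLUMN name TYPE varchar(100); -- narrowing: rewrites the table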

------
dragonwriter
A few observations:

1\. While the linked blog post is new today, it's mostly a link back to a
different 2010 blog post.

2\. The linked blog post and the 2010 post basically discuss performance
considerations that have been documented clearly in the PostgreSQL
documentation for character data types since version 8.3 (and less completely
for several versions before that): CHAR(x) is worse than VARCHAR(x), which is
worse than VARCHAR and TEXT.

3\. The linked blog post says "don't use CHAR or VARCHAR", but really, it
should be "don't use CHAR(x) or VARCHAR(x)". VARCHAR is pretty much identical
to TEXT.

~~~
Jonanin
Regarding 2: even if it has been documented for a while, I don't think it's
very widely known. I still see a lot of

    
    
        models.CharField(max_length=x)
    

in Django code, where x is usually an arbitrary value only chosen because
CharField requires it.

~~~
tcdent
Probably a result of supporting multiple backends. AFAIR, MySQL _requires_ a
length to be passed on VARCHAR.

~~~
falcolas
Yes, because varchar is implemented as a fixed-length list of characters. Text
fields are implemented as blobs, and as such can grow large enough to have to
be stored off-page (with all of the performance hits that entails).

------
DharmaPolice
As someone who regularly has to write SQL in 7 different dialects I much
prefer VARCHAR to TEXT as the former is handled much more consistently across
platforms. A TEXT column in MS SQL Server is an entirely different proposition
to a VARCHAR column. VARCHAR on the other hand is treated with relative
similarity between most of the systems I regularly use.

Additionally, one of the key benefits of more explicit datatypes is
documentation. Knowing that a column is 30 characters wide is useful
information to have at hand (without having to inspect check constraints) and
often reflects a business rule. Where joins have to be performed on character
columns, it also helps to know whether both sides of the join are (say)
CHAR(8). If every text column is TEXT, that's much less clear. Check
constraints help, but you don't always know whether they've been applied to
all current data (some platforms allow constraints to ignore existing data).
If I know a column is VARCHAR(50), then I am 100% certain that there will be
no value longer than 50 in it.

------
DrJokepu
Couldn’t agree more. Database constraints should be thought of as the last
line of defence against madness rather than as a means to validate input.
Database constraints are not really suitable for defending against attackers.
Constraints might stop users from creating extremely large records, but they
won't stop users from creating an extremely large number of records, etc. You
need to sanitise your input thoroughly in the application layer.

~~~
epo
Silly example: who decides what is 'an extremely large number of records'?
Whoever has a view on this should monitor and police the limit. If you are
saying that you can't protect against every eventuality so you may as well
guard against none, then that is asinine.

Applications should enforce correct application behaviour regardless of user
behaviour. Databases should maintain database integrity regardless of
application behaviour.

If this requires duplicated effort then so be it.

~~~
DrJokepu
I think it would be difficult to defend an argument claiming that constraints
on data size help maintain data integrity.

------
hmottestad
From my database course I learnt that nothing is slow in a database until you
can't fit your join operation in memory. Having to do a join operation on
varchar(300) is predictable. Having to do one on text is unpredictable and can
slow your system to a crawl if the operation needs to be done on disk rather
than in memory.

AFAIK the article did not talk about joins at all. I would love it if someone
had a good article comparing what happens when you do a join on varchar vs
text.

EDIT: One question remains: how is the "text" stored when doing a join? For
varchar you just make an array, because all values are the same length. With
"text", do you first find the longest text and use that as the default array
element size? This may not seem important for in-memory operations, but
seeking on disk is considerably faster when all your elements have the same
size.

~~~
jeltz
The memory usage of a join has many more variables than the size of the joined
field (in some cases it might not matter at all as long as that field is not
included in the result set) so I would not say that joining on a short string
is that much more predictable than joining on a possibly long string. What
matters the most is what the query actually does, what the data looks like,
and what indexes you have.

Joins are usually implemented in one of these three ways, and the planner can
select which one to use depending on the query and table statistics:

1\. Hash join: Build a hash table of all the joined fields of set 1 and then
traverse the rows of set 2 looking them up in the hash table.

2\. Merge join: Sort both sets of rows and merge them. With the right indexes
you may not even need to sort anything, just traverse in index order.

3\. Nested loop join: For every row in set 1, find matching entries in set 2.
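The hash join in particular is easy to sketch. Here is a toy, in-memory
version in Python (real executors also batch and spill to disk; all names
here are invented):

```python
def hash_join(rows1, rows2, key1, key2):
    """Toy in-memory hash join of two lists of dicts."""
    # Build phase: hash table over set 1's join key.
    table = {}
    for r in rows1:
        table.setdefault(r[key1], []).append(r)
    # Probe phase: look up each row of set 2 in the hash table.
    out = []
    for r in rows2:
        for match in table.get(r[key2], []):
            out.append({**match, **r})
    return out
```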

------
zzzeek
use VARCHAR because constraints are a good thing, and use CHAR if you are
storing strings of a fixed length, because semantics are a good thing. The
point about padding being wasteful for variable-length data in CHAR is moot
because CHAR is for storing fixed-size strings like state codes.

As the PG docs say, there is virtually no performance difference at all
between all three, so stick with standard practices.

~~~
jeltz
> and use CHAR if you are storing strings of a fixed length, because semantics
> are a good thing

The CHAR type is not what most people think it is. It is a blank-padded
string, not a fixed-length one. Inserting 'a' into a CHAR(2) results in 'a '
being stored and retrieved from the database. You should always use VARCHAR
or TEXT in PostgreSQL and never CHAR (at least I cannot think of a case where
you would want it). CHAR is there for SQL standard compliance. Instead use one
of these:

    
    
      field VARCHAR(2) CHECK (length(field) = 2)
      field VARCHAR CHECK (length(field) = 2)
      field TEXT CHECK (length(field) = 2)
    

The rules for working with blank-padded strings are, in my personal
experience, a hell of non-obvious results when casting, sorting, comparing,
and performing other operations.

EDIT: I can leave you with this little example.

    
    
      $ SELECT length('ab  '::char(4) || 'cd  '::char(4));
       length 
      --------
            4
      (1 row)

~~~
netfeed
We're storing currency codes, and they're always 3 chars (EUR, USD and so on),
so it would just be stupid to actually use VARCHAR and not CHAR for that.

~~~
masklinn
Well, not _stupid_ per se, since the result's the same. But it would provide
less information to the reader.

~~~
jeltz
I do not see how either VARCHAR or CHAR would provide more or less information
since both are variable size strings. One is space padded and one is not.

If you care about semantics you should create a domain based on VARCHAR.
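
Something like this (a sketch; "currency_code" and "prices" are made-up
names):

      CREATE DOMAIN currency_code AS varchar(3)
          CHECK (VALUE ~ '^[A-Z]{3}$');
      
      CREATE TABLE prices (
          amount   numeric,
          currency currency_code
      );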

~~~
zzzeek
the fact that one sees "CHAR" in the schema definition is self-documenting
that this is a fixed length field (and CHAR is "fixed length", the size of the
string you get back will always be N), that's what I mean by "semantics".
E.g., "what does it mean that the developer used CHAR here and not a VARCHAR?"

The CHECK constraint you illustrated earlier can just as well be placed on a
CHAR (using trim() as well to adjust for padding). Then there's no chance of
any blank padding issues either.
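
E.g., in the style of the earlier examples (a sketch):

      field CHAR(2) CHECK (length(trim(field)) = 2)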

~~~
jhaywood
But the semantics of CHAR are not what most people expect and almost never
what you actually want. If you want an actual fixed length of non-blank data,
you need an additional check constraint to enforce the minimum length.

CHAR semantically represents the fixed-length text fields of old data file
formats, not "this data always has n (non-blank) characters".

If you do have different length then a VARCHAR is more appropriate. Also, a
lot of application frameworks that interact with the database only deal with
VARCHAR, so as soon as you use a CHAR you have to start worrying about
trimming your text data; one of the biggest database text-type gotchas is
accidentally comparing a VARCHAR and a CHAR improperly.

While I can see good reasons to include length checks, there is never a good
reason to use a CHAR unless you're trying to interoperate with COBOL programs
written in the 80's.

~~~
zzzeek
> But the semantics of CHAR are not what most people expect

how do you know that? did you take a survey? I've been working with DBAs for
years, in my world everyone knows that's how CHAR works. The padding behavior
is nothing new, and is intuitive - the value _must_ be N characters in length,
so if you stored less, you get back a right-padded string. This is exactly
what I'd expect.

> CHAR semantically represents fixed length text fields from old data file
> formats

and two- or three-letter character codes like country codes, state codes, etc.
are what we use CHAR for; these are fixed-length text fields, and they are
still in modern use today. Plus, lots of us still have to write apps that
actually read old files too - CHAR is appropriate for these as well, assuming
you are storing fields that aren't right-padded in the source datafile (such
as social security numbers, etc.).

Your app will of course work with a VARCHAR instead, but the point of CHAR is
that it's self-documenting as to the type of data to be stored in the field -
fixed length, as opposed to variable length.

> If you do have different length then a VARCHAR is more appropriate.

if you are storing variable length, then you should absolutely use VARCHAR.
That's why it's called "VAR", it means, "variable".

> Also a lot of application frameworks that interact with the database only
> deal with VARCHAR so then as soon as you use a CHAR you have to start
> worrying about trimming your text data

If you are using CHAR correctly, you don't have to trim anything, because you
are storing a string that is exactly the length of the CHAR type. I'm not
familiar with how an application framework would only know how to deal with
VARCHAR and not CHAR, database adapters return strings for both types. And if
there were such a framework, I'd not be using it.

> one of the biggest database text type gotchas is accidentally trying to
> compare a VARCHAR and a CHAR improperly.

all of which stems from the same, singular mistake - storing variable-length
data in a CHAR. Plus, if you are comparing VARCHAR to CHAR, that is also
usually doing it wrong, as an adequately normalized database wouldn't be
repurposing a fixed-length datatype as a VARCHAR of some kind elsewhere.

The aforementioned CHECK constraint is a good way to enforce that if the
developers/frameworks in question tend to be error-prone about this kind of
thing (it's not an error I've had much issue with, since I know how CHAR
behaves).

> While I can see good reasons to include length checks there is never a good
> reason to use a CHAR unless you're trying to interoperate with COBOL
> programs written in the 80's

as it turns out, a vast portion of the world economy is supported by
mainframes and old software that often spits out fixed-length datafiles;
there is even Python code in my current work project, written in the past six
months, which is tasked with parsing such files (they are actually quite
simple to parse, since you just pull out each field based on its exact
position). Not to mention that boring things like state codes, country codes,
and the like are often fixed-length fields.
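Parsing such a fixed-length record really is just slicing by position. A toy
Python sketch (the field names and offsets are invented):

```python
# Hypothetical fixed-width record layout: (field name, start, end) offsets.
FIELDS = [("ssn", 0, 9), ("state", 9, 11), ("country", 11, 13)]

def parse_record(line):
    # Pull each field out of the line by its exact position.
    return {name: line[start:end] for name, start, end in FIELDS}
```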

~~~
jhaywood
I say it's not what people expect because everyone emphasizes the "fixed
length" rather than the "blank padded" nature of CHAR. CHAR is only actually
fixed length if you ensure that yourself. That's possible, but then you're
just using CHAR as a placeholder for those semantics, not as something that
naturally enforces them.

If you actually, really do have fixed-length fields then yes, CHARs could be
appropriate. But for many things, even though you intend for them to be
static-length codes, things change when you start having to interoperate with
systems designed under different constraints (for example, after mergers or
acquiring a competitor). And I know that mainframes still exist, but they
aren't the use case in mind when many say "USE CHAR".

Also, the database adapter that handles CHAR poorly is none other than JDBC
on Oracle: [http://stackoverflow.com/questions/5332845/oracle-jdbc-
and-o...](http://stackoverflow.com/questions/5332845/oracle-jdbc-and-oracle-
char-data-type) (Yes, that is hilariously bad.) But the MySQL way of always
ignoring trailing whitespace is not standard across databases.

------
mariusz79
What if the performance changes? What if you decide to migrate to a different
db at a later time? What if your software determines field types and sizes in
the GUI based on the database schema? IMHO, always use the right field for the
job.

~~~
bratsche
If you're thinking of migrating to another database later, you're basically
ruling out using half of postgresql's more compelling features up front. I
kind of don't understand this line of thinking.

~~~
sixbrx
Sure and ruling them out is a _very_ reasonable thing to do, if you don't need
them. No reason to adopt gratuitously incompatible features. I've needed to
move databases between Oracle and Postgres quite a number of times, and I can
promise you it wasn't part of the initial "requirements".

------
steven2012
That's interesting. So can you put an index on a TEXT column in PG? From what
I know, you can't do that in SQL Server/Oracle, you can only use full text
searching (I think).

~~~
masklinn
> So can you put an index on a TEXT column in PG?

Yes, but the default index (btree) will generate an error if you try to insert
data above 8k. The hash index will work.

Not that it makes much sense to use an index (rather than FTS) on such an
amount of data.
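
E.g. (a sketch; "documents" and "body" are made-up names):

      CREATE INDEX documents_body_idx ON documents USING hash (body);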

------
nness
Uh, shouldn't you use the most appropriate type available to describe your
data, since that will simplify the process if you ever need to migrate to a
different DBMS?

~~~
bunderbunder
Nowadays, as often as not, the most appropriate type available to describe
your data is a non-standard type that's specific to the RDBMS you're using.
With SQL databases, you can generally only pick one of the following:

1\. Any kind of expectation of hassle-free migration to a different RDBMS.

2\. Benefiting from much of anything that's happened in any dialect of SQL
since about the time Windows 3.1 hit the market.

Personally, I generally prefer #2, because #1 is kind of a myth anyway. For
example, PostgreSQL's VARCHAR type has different semantics from Oracle's: one
supports Unicode and the other doesn't.

~~~
Pxtl
Microsoft follows Oracle's approach and uses NVARCHAR to describe their
infernal 2-byte format that might be UCS2-wrongendian or it might be UTF-16
depending on their mood and the tool you're using at the moment. Naked VARCHAR
means you have to pick an 8-bit character-set like a savage.

~~~
bunderbunder
Ya, or for Oracle you might be better off using VARCHAR2, which uses UTF-8.
That way you aren't mired in savagery, but don't have to pay the performance
hit of storing 2 bytes per character for text that's mostly in western
European languages. Whereas SQL Server users are stuck choosing between
doubling up on I/O and suffering codepages. Meanwhile in PostgreSQL you just
use regular VARCHAR and pick utf8 as your character set like a proper
subgenius.

Fun fact: In earlier versions of Portal, it was database portability that
GlaDOS promised to give you after the experiment.

------
kevrone
I've used PostgreSQL quite successfully for the past few years at rather
large scales and I can tell you, using TEXT everywhere is sooooooooo much
easier on everyone involved. Especially on large teams (hundreds of
developers) where migrations are a big deal. And especially when the business
teams are essentially dictating the use cases. Those people change their minds
ALL THE TIME ("Yeah, I know we agreed that we only need about 20 characters
for the description here, but we now think 25 will really make the whole thing
pop, ya know?").

And as far as the argument for keeping schemas in strict SQL so that some
future database switch can be made more smoothly...I mean c'mon. Even if you
actually do that (who does that?) you're going to have more annoying things to
do than replacing some TEXT columns to VARCHAR.

------
znowi
I wish more articles had _tl;dr_ s like this one. Very helpful. Often you have
to skim through preludes and side stories with jokes to piece the gist
together.

------
poissonpie
I'm of the opinion that your data structure should model your data. If your
piece of data is best represented by char or varchar, then use it. I'm not
super familiar with Postgres, but among other things, modelling your data
correctly helps when another developer has to step in and maintain your app.
They can easily get a sense of how the presentation layer should look if
you've done so.

------
rsynnott
Clearly, this is an evil plot to make peoples' schemas break entertainingly in
case they ever try to move to MySQL :)

~~~
effn
There is nothing evil in preventing people from migrating to MySQL.

------
etler
So is there actually any benefit to using text over varchar when the
constraint is actually 0 to X, or instead of char when your input actually
needs to be exactly X characters? It seems the real point of the article is
to make sure that these are really the constraints you want.

------
hobs
Does pg have the concept of a clustered index? If so, for frequent
inserts/updates it could actually matter.

~~~
effn
It has an index clustering operation, but this is a one-time operation that
will not affect further inserts/updates.

But it shouldn't matter, the implicit constraint on a VARCHAR(n) does not
affect indexing.

~~~
hobs
Interesting, I was just thinking about a variable-width item on disk
expanding or contracting: if rows had to be arranged in a serial physical
order, a change would (necessarily) force the other bits to move unless it
was inserted at the end.

