
Faker is a Python package that generates fake data for you - yaph
http://www.joke2k.net/faker/
======
kerkeslager
I made a tool like this for my company in Ruby (it wasn't nearly as mature as
this). The largest challenge I struggled with (and never really solved) is
that ultimately, there's no way to generate data as useful as real data. The
value of real data comes from the fact that it's messy. Real data is different
sizes than you expect[1], collides with your sentinel values[2], and comes in
with unexpected encodings[3]. And sometimes people will enter data that is
intended to break your system[4][5].

The value of testing with real data is that it doesn't conform to your
assumptions.

As far as I can tell, this benefit is impossible to fake with a system that
generates fake data algorithmically. Generated data conforms to the
assumptions of the system that generated it and therefore can only be used to
test that a system conforms to those assumptions.

Fake data is still useful. Volume is often important (does your database slow
down or crash when there are 10 billion records?). And if your fake data has
very few assumptions, you can use that to reduce the assumptions made by the
system you're testing.

Nevertheless, I'd really like to see a system like this which integrates data
from some sort of general-purpose real dataset. Ideally it would be
configurable so that people can document and choose a 99% use case they want
to support (for example, a US company might want to support long names, but
might not get a ton of value from supporting names with Chinese characters).

[1] [http://jalopnik.com/this-hawaiian-womans-name-is-too-long-
fo...](http://jalopnik.com/this-hawaiian-womans-name-is-too-long-for-a-
drivers-l-1313683178)

[2]
[http://www.snopes.com/autos/law/noplate.asp](http://www.snopes.com/autos/law/noplate.asp)

[3]
[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

[4]
[http://en.wikipedia.org/wiki/Buffer_overflow](http://en.wikipedia.org/wiki/Buffer_overflow)

[5] [http://xkcd.com/327/](http://xkcd.com/327/)
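
One partial workaround is to mix hand-curated "messy" values into whatever the generator produces. The sketch below is only an illustration of that idea: the value lists, the `sample_names` helper and the `edge_ratio` parameter are all made up for this example, and a curated list still only covers the surprises someone already thought of.

```python
import random

# Hand-picked values that tend to violate common assumptions.
# Illustrative only, not exhaustive.
EDGE_CASE_NAMES = [
    "",                                  # empty input
    "NULL",                              # collides with a common sentinel
    "N/A",                               # another sentinel-like value
    "x" * 300,                           # longer than many fixed-width fields
    "Zoë Müller",                        # non-ASCII Latin characters
    "李小龙",                             # non-Latin script
    "Robert'); DROP TABLE Students;--",  # injection attempt, as in [5]
    "  padded  ",                        # leading and trailing whitespace
]

ORDINARY_NAMES = ["Alice Smith", "Bob Jones", "Carol White"]


def sample_names(count, edge_ratio=0.2, seed=None):
    """Return `count` names, roughly `edge_ratio` of them edge cases."""
    rng = random.Random(seed)
    names = []
    for _ in range(count):
        pool = EDGE_CASE_NAMES if rng.random() < edge_ratio else ORDINARY_NAMES
        names.append(rng.choice(pool))
    return names
```

Passing a fixed `seed` makes a failing run reproducible, which matters once one of the odd values actually breaks something.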

~~~
sophacles
I'd suggest looking into fuzzers. Short version - tools designed to input
messy, non-conforming data to ensure the inputs don't cause problems, that
things are sanitized correctly, etc. At this point they are a mature
technology, with improvements constantly being researched. They are generally
thought of as security tools[1], but are very useful for basic development
too.

[1] The common use of fuzzers in a security context is to send malformed
packets to protocol parsers to see if they fall over, cause buffer overruns,
or otherwise do fun things in the context of exploiting a system. Another
common use is automatic SQL-injection discovery.
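
To make the idea concrete, here is a minimal mutation-fuzzer sketch using only the standard library. It is far simpler than any real fuzzer; `mutate`, `fuzz` and the deliberately buggy `parse_record` target are hypothetical names invented for this example.

```python
import random


def mutate(data: bytes, rng: random.Random, max_mutations: int = 4) -> bytes:
    """Return a copy of `data` with a few random byte-level mutations."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, max_mutations)):
        choice = rng.choice(("flip", "insert", "delete"))
        if choice == "flip" and buf:
            pos = rng.randrange(len(buf))
            buf[pos] ^= 1 << rng.randrange(8)    # flip one bit
        elif choice == "insert":
            pos = rng.randrange(len(buf) + 1)
            buf.insert(pos, rng.randrange(256))  # insert a random byte
        elif choice == "delete" and buf:
            del buf[rng.randrange(len(buf))]     # drop one byte
    return bytes(buf)


def fuzz(target, seed_input: bytes, iterations: int = 1000, seed: int = 0,
         expected=(ValueError,)):
    """Call `target` on mutated inputs and collect unexpected exceptions."""
    rng = random.Random(seed)
    failures = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            target(candidate)
        except expected:
            pass                      # a clean rejection is the desired outcome
        except Exception as exc:      # anything else is a finding
            failures.append((candidate, exc))
    return failures


def parse_record(data: bytes) -> bytes:
    """Hypothetical target: a length-prefixed record with a latent bug."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    checksum = data[1 + length]  # bug: IndexError when the input is truncated
    return payload
```

Calling `fuzz(parse_record, b"\x03abc\x00")` returns the mutated inputs that raised something other than `ValueError`, together with the exception each one triggered.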

~~~
Arnor
A quick search gave me this list:
[http://www.infosecinstitute.com/blog/2005/12/fuzzers-
ultimat...](http://www.infosecinstitute.com/blog/2005/12/fuzzers-ultimate-
list.html). Is there a notable fuzzer missing? It's a pretty long list. Does
anyone know which of these tools are really worth checking out?

~~~
juhanima
That's a pretty old list. Just to name one, I would recommend taking a look at
Radamsa

[https://www.ee.oulu.fi/research/ouspg/Radamsa](https://www.ee.oulu.fi/research/ouspg/Radamsa)

...from the University of Oulu. It's more like a framework for generating
intelligent fuzzers than a shrink-wrapped product, though.

The OUSPG guys are really good at fuzzing. There is also a commercial spin-
off, Codenomicon, whose tools are quite widely used.

------
makmanalp
I love faker! It pisses me off a bit that the library is named faker but the
package is fake-factory. It trips me up all the time.

I use it with factory_boy
([http://factoryboy.readthedocs.org/en/latest/](http://factoryboy.readthedocs.org/en/latest/))
to generate test fixtures that seem to make logical sense. Usernames are real
names, birthdays are real dates, etc.

I think it helps with experimentation when you're using the REPL and also
makes bugs stand out a bit more easily. Very neat for demoing purposes too.

Perhaps my favourite is faker.bs() which always gets a giggle when doing live
demos.

~~~
mpron
It's probably fake-factory to prevent confusion with the Ruby gem that's been
around for more than 6 years [http://www.rubyinside.com/faker-quick-fake-data-
generation-i...](http://www.rubyinside.com/faker-quick-fake-data-generation-
in-ruby-665.html)

It's a standard gem in the community and it's used for generating various
types of seed database data. I don't know if it does exactly the same things
that the Python faker does.

~~~
joke2k
I used fake-factory because faker was already taken on PyPI. Initially I asked
the author to release the name, but the response was negative.

------
tcdent
I'm not a huge fan of the 'factory' singleton pattern used here.

I can see how explicit execution of the startup code is a good thing, but
can't help thinking how much better the experience would be if it just lazy-
loaded the same code.

Am I missing something obvious that would prevent this? Bad magic?

------
johnwatson11218
Has anyone ever done a project like this that builds test databases from
larger production databases? I mean something that can look at a prod db and
model the data in the columns and the relationships between tables and produce
a much smaller test db that has the same statistical properties? For numeric
columns you could just fit a statistical distribution and sample from that.
For names I'm thinking you could look at the frequency charts for first,
second, third ... letters and sample according to that. I believe it is also
called Markov Chains.
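
A rough sketch of both ideas, using only the standard library. It uses a first-order character chain (each letter depends on the previous one), which is a close relative of the per-position frequencies described above, and it assumes a normal distribution for numeric columns, which real columns often do not follow. All function names here are invented for the example.

```python
import random
import statistics
from collections import defaultdict

START, END = "^", "$"  # markers assumed not to occur in real names


def train_name_model(names):
    """Record which character follows which, including start/end markers."""
    transitions = defaultdict(list)
    for name in names:
        chars = [START, *name.lower(), END]
        for current, following in zip(chars, chars[1:]):
            transitions[current].append(following)
    return transitions


def generate_name(transitions, rng, max_length=12):
    """Walk the chain from START until END or `max_length` characters."""
    current, letters = START, []
    while len(letters) < max_length:
        # Duplicates in the list make frequent transitions more likely.
        current = rng.choice(transitions[current])
        if current == END:
            break
        letters.append(current)
    return "".join(letters).capitalize()


def sample_numeric_column(values, count, rng):
    """Fit a normal distribution to `values` and draw `count` samples."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # needs at least two values
    return [rng.gauss(mean, stdev) for _ in range(count)]
```

Note that this models each column independently. Preserving relationships between columns and between tables is the harder part of the problem.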

In Andrew Ng's Machine Learning class he talked about taking labeled images
and expanding the set by inverting, shearing, flipping, distorting them etc.
He called the technique 'data synthesis'.

The test data problem has been hampering my team's ability to create
maintainable automated tests.

~~~
femto113
I've been working on such a project for a couple years as a solo/on-the-side
thing. It's not available as a turnkey product yet, but the technology is
working and can build arbitrary amounts of realistic test data (based either
on your private data sets, or public sources like census data). I'd love to
begin working with others on how exactly to integrate this into their test and
development workflows. Email me at ken.woodruff@gmail.com if you'd like to
discuss further.

------
Sami_Lehtinen
After a quick look, it seems that this is only a poor randomizer. A good
generator produces data that isn't too random, with internal correlations and
ranges that are right.

I used to maintain one over 15 years ago.

At a minimum, city, street address, postal code and telephone number had to be
internally consistent. Those are things that can be checked easily and
automatically, so the same constraints need to be enforced when generating
data. It's also silly to give a flat address in an area where there aren't any
flats. A 30th floor in the countryside? Oh yeah. A distance-based address
downtown is just as silly.
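
A minimal sketch of the linking idea: pick an area first, then derive every dependent field from it. The reference table below is entirely made up, and `fake_address` is a hypothetical helper, not any library's API.

```python
import random

# Made-up reference data: each area fixes the fields that must agree.
AREAS = [
    {"city": "Exampleton", "postal_prefix": "001", "area_code": "09",
     "streets": ["Main Street", "Harbour Road"], "max_floor": 30},
    {"city": "Sampleby", "postal_prefix": "457", "area_code": "017",
     "streets": ["Mill Lane", "Farm Road"], "max_floor": 0},  # rural: no flats
]


def fake_address(rng):
    """Build one record whose fields are all drawn from the same area."""
    area = rng.choice(AREAS)
    record = {
        "city": area["city"],
        "street": f"{rng.choice(area['streets'])} {rng.randint(1, 120)}",
        "postal_code": area["postal_prefix"] + f"{rng.randint(0, 99):02d}",
        "phone": f"{area['area_code']}-{rng.randint(0, 9_999_999):07d}",
        "floor": None,
    }
    # Only areas that actually have flats get a floor number.
    if area["max_floor"] > 0:
        record["floor"] = rng.randint(1, area["max_floor"])
    return record
```

Because the constraints live in the reference table, the same table can drive both generation and an automatic consistency check.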

------
asm89
The name looked familiar and it's indeed inspired by the Faker library for
PHP:
[https://github.com/fzaninotto/Faker](https://github.com/fzaninotto/Faker)

Another interesting library that is built on top of Faker is Alice. It allows
you to define complex fixtures in .yml:
[https://github.com/nelmio/alice](https://github.com/nelmio/alice)

~~~
davidw
Aren't they all kind of inspired by the original Perl implementation? I use
the Ruby one for testing.

Something kind of similar and worth thinking about is this:

[http://en.wikipedia.org/wiki/QuickCheck](http://en.wikipedia.org/wiki/QuickCheck)

~~~
xzel
I was just about to comment the same thing about QuickCheck. Arbitrary data
galore.

I've only done serious work with QuickCheck in Haskell, but here's the Python
implementation I've played with: [https://pypi.python.org/pypi/pytest-
quickcheck/](https://pypi.python.org/pypi/pytest-quickcheck/)

------
jaymon
I've got a semi-similar python library that I've been adding to for a few
years called testdata:
[https://github.com/Jaymon/testdata](https://github.com/Jaymon/testdata)

testdata has a lot of unicode and file system stuff I've found really useful,
it looks like between this and testdata I'll be in generated data heaven :)

------
tel
I've used the Ruby version of Faker to do fuzz/property/quickcheck-style
testing in Ruby. I believe this to be an incredibly important, under-
recognized form of testing. Faker is not the best tool for this as you really
need more sources of randomness than it provides, but it's not a bad start.

The best places to learn are from the canonical libraries, quickcheck in
Haskell, Quviq in Erlang, simple-check in Clojure, and there are others.

The challenge with all of these methods is that you want some notion of
referential transparency in order to make useful properties. You can at least
do that in certain contexts for certain expressions in Ruby and doing so will
improve code readability.

I'd love to hear from others with experience using these techniques in Ruby or
Python.
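
For anyone who hasn't seen the style, here is a hand-rolled sketch of a property check in Python using only the standard library. It is not QuickCheck: real implementations also shrink counterexamples and offer much richer generators. `random_text`, `check_property` and `normalize` are names invented for this example.

```python
import random
import string


def random_text(rng, max_length=20):
    """Generate an arbitrary string, including whitespace and non-ASCII."""
    alphabet = string.ascii_letters + string.digits + " \t\n" + "äßøλ中"
    length = rng.randint(0, max_length)
    return "".join(rng.choice(alphabet) for _ in range(length))


def check_property(prop, generator, runs=200, seed=0):
    """Return the first generated input that falsifies `prop`, else None."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = generator(rng)
        if not prop(value):
            return value
    return None


def normalize(text):
    """Function under test (hypothetical): collapse runs of whitespace."""
    return " ".join(text.split())


def idempotent(text):
    # Property: normalizing twice gives the same result as normalizing once.
    return normalize(normalize(text)) == normalize(text)
```

`check_property(idempotent, random_text)` returns `None` when no counterexample is found. The property only means something because `normalize` is a pure function, which is the referential transparency point made above.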

------
ronaldbradford
Test data serves multiple purposes. Boundary conditions (long fields, Unicode,
etc.) are important for invalidation testing (i.e. testing to break your code
and functionality).

Testing at scale is important for performance and predicting bottlenecks as
you grow (i.e. testing to break your system's capacity).

It can be difficult to generate good quality test data at scale, and data
based on your specific schema.

This is how [http://goodtestdata.com/](http://goodtestdata.com/) came about.
It has the building blocks of core data and new sources can be built on
request.

------
akavlie
Awesome! I needed something just like this a couple of years ago. A couple of
comments:

- I wanted fake CC numbers and SSNs/other national IDs at the time (don't
remember why). I see that Faker is missing those, so they might be useful
additions to the library.

- Method names should be snake_case rather than camelCase
([http://www.python.org/dev/peps/pep-0008/#method-names-and-
in...](http://www.python.org/dev/peps/pep-0008/#method-names-and-instance-
variables)).
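
On the card numbers: a fake number usually needs to pass the Luhn checksum to get through basic validation. Below is a small standalone sketch of generating such numbers. The function names are invented here, the default prefix and length are arbitrary choices, and the output is only meant for test data.

```python
import random


def luhn_check_digit(partial: str) -> int:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left. The rightmost digit of `partial` sits next to the
    # check digit, so it is doubled, then every second digit after it.
    for index, char in enumerate(reversed(partial)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_luhn_valid(number: str) -> bool:
    """Validate a full number, check digit included."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def fake_card_number(rng, prefix="4", length=16):
    """Return a random digit string of `length` that passes the Luhn check."""
    body = prefix + "".join(
        str(rng.randrange(10)) for _ in range(length - len(prefix) - 1)
    )
    return body + str(luhn_check_digit(body))
```

For example, `fake_card_number(random.Random(0))` yields a 16-digit string for which `is_luhn_valid` returns `True`.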

~~~
akavlie
Actually, it looks like this doc is out of date. The README on Github shows
that method naming has been corrected, and CC numbers have been added:
[https://github.com/joke2k/faker](https://github.com/joke2k/faker)

------
wiremine
I wrote a much simpler Python version a few years ago:

[https://pypi.python.org/pypi/Phony/0.5.0](https://pypi.python.org/pypi/Phony/0.5.0)

I don't recommend my version: it isn't maintained and isn't complete. However,
writing a faker-type library is a great way to learn a language: you learn
about how to organize code, how it handles different types, and how to package
it up for use.

------
slashdotdash
I created a .NET port[1] of the Ruby port of the original Perl Faker library.
It's also available on NuGet[2]

[1] [https://github.com/slashdotdash/faker-
cs](https://github.com/slashdotdash/faker-cs)

[2]
[http://www.nuget.org/packages/Faker.Net/](http://www.nuget.org/packages/Faker.Net/)

------
victorquinn
I wrote a similar package in JavaScript, for the browser or Node, called
Chance:

[http://chancejs.com/](http://chancejs.com/)

[https://github.com/victorquinn/chancejs](https://github.com/victorquinn/chancejs)

------
mindcrime
Yet another entry from the "I made something similar to this" category,
although the tool I wrote isn't as feature rich as this.

[https://github.com/mindcrime/dummydatagenerator](https://github.com/mindcrime/dummydatagenerator)

------
slajax
We use something very similar in node called Charlatan:
[https://github.com/nodeca/charlatan](https://github.com/nodeca/charlatan)

Very very useful in any build.

------
marktangotango
Thanks for posting this, I was unaware of the various Faker implementations. I
had often considered implementing a similar lib, but never invested the time.
Now I don't have to!

------
alexmic
Inspired by faker, I created [https://github.com/alexmic/mongoose-
fakery](https://github.com/alexmic/mongoose-fakery) for Mongoose.

------
blt
The amount of batteries included with this library is impressive.

------
super_mario
I wish it had support for more locales. Esp. more Asian locales (ja_JP would
be really useful).

------
notastartup
This sort of looks like Faker.js and I remember we used it to populate a table
with fake data so we could fill it up and demonstrate the app. It was very
useful.

