Best Practices for Working with Configuration in Python Applications (tech.preferred.jp)
209 points by tgp 12 months ago | 68 comments



Quick list of Python libraries that help with application configuration:

- Python application configuration -> https://github.com/edaniszewski/bison

- Configuration with env variables for Python -> https://github.com/hynek/environ_config

- Configuration library for python projects -> https://github.com/willkg/everett

- Strict separation of config from code -> https://github.com/henriquebastos/python-decouple

This is from my personal notes. See also: https://github.com/vinta/awesome-python#configuration

Anything else that we've missed?


Marshmallow. You can use its schema validation for any dict/JSON, which makes it a nice fit for validating JSON config files (and mitigates some of the JSON concerns from the article). Just run the json.loads output straight through a schema validation, and build some classes around it for different config files.

marshmallow.readthedocs.io/en/


I use a similar library called Schema (https://pypi.org/project/schema/). I love its expressive nature. Just this week I was validating both YAML and JSON configuration data with it.

One thing missing from this article: always use a proper external key-value store for everything but the stage (dev/test/prod etc.) and the connection details for the KV store itself. Config files on disk, and even environment variables, suck for anything but the most trivial platforms.


dataclass_json is also very useful for schema validation. It combines Python's native dataclass objects with marshmallow's schemas to provide additional functionality simply through a @dataclass_json decorator on your dataclass.

https://lidatong.github.io/dataclasses-json/


I made a similar library, dataclasses_serialization. It doesn't require a special decorator on your classes, and is extensible for custom classes and custom serialization methods (JSON and BSON provided by default).

https://github.com/madman-bob/python-dataclasses-serializati...


I do something similar with dataclasses using the `dacite` (https://github.com/konradhalas/dacite) library as a constructor for python dicts to dataclasses with runtime type-checking.

Works really well with marshmallow for additional validation.




The two big ones in machine learning at least are Google's Gin:

https://github.com/google/gin-config

and Facebook's Hydra:

https://hydra.cc/


The name "gin" seems to be quite popular:

https://github.com/gin-gonic/gin


There's also confuse[0], used by beets. I've found it to be a good way to handle configuration when the end-user is not a programmer.

[0] - https://github.com/beetbox/confuse


Dynaconf [1], though I think this article's approach in combination with pydantic is better.

[1] https://dynaconf.readthedocs.io/en/latest/



https://github.com/crdoconnor/strictyaml - typesafe YAML parser (solves the issues mentioned in 2 and 1).


Here's my tiny contribution to the field: https://github.com/berislavlopac/figga


And I also like the .env approach: https://github.com/theskumar/python-dotenv


Cerberus for schema validation.

https://docs.python-cerberus.org/en/stable/



This and the list of other libraries in the responses is a great reason to avoid Python. It's nearly as bad as JavaScript; how are you supposed to live with this?



While most points are valid, I feel some pieces are missing from this article.

What should I actually do, in the end? How do I put everything together without creating a hard-to-maintain mess of casting/parsing/configuration? Should I manually cast strings to integers (or other types) for each and every value I parse? Where do I keep default values? It's cumbersome to have them embedded in code at every get() call.

I usually want: a) a default configuration kept in a file; b) a way to override that config with other files, but only for certain parts (I don't want to rewrite the whole configuration every time); c) a way to override the config at launch time (e.g. from the command line).

The fact that Python is dynamically typed (with optional type hints) only makes configuration harder than in statically typed languages, where most configuration libraries can instantiate an adequate type-conversion function for each place a string goes.

I ultimately found that the last solution (parsing from JSON) is good enough for most use cases for points a) and b); since JSON is typed, a decent conversion can happen for 95% of the use cases (for the others, just use a string and manually parse).

A sidenote to the author if he/she's reading: datetime.date objects, just like Python's own naive datetime objects, are dangerous objects that can lead to unpredictable results when used with actual time-handling code. I wouldn't use them anywhere in my Python code.


Who is the targeted editor of a config file? If it's another programmer, just use config.py and make life easy on yourself.


No. Using a programming language for configuration is terrible. If code can be used, it will eventually be used, and it will be impossible to understand what's happening until runtime. Configuration should be clear and should not require modifications to the software. If somebody wants to use an external preprocessing tool, a standard format (JSON) allows that.


Configuration, by definition, modifies the software.

How much it modifies the software is determined by the software author. If there aren't limits imposed in your example (json) you're still going to get unexpected results. You'll have to limit what's accepted.

Also, what if I allow a config.py file then translate that to .json then back to python? That is no better than just allowing the consumption of a python source file and simply ignoring any disallowed code (again, limiting what is accepted).

I know it's in vogue to say configuration as code is bad; that simply has not been my experience.


I don't know whether it's in vogue or not; I think it's just a bad idea, because you can't restrict HOW MUCH it is modifying the software, and then people need to know the nuances of your programming language. What if I need some configuration to be user- (not admin-) configurable? What if I don't want to learn programming language X just to use a piece of software?

> If there aren't limits imposed in your example (json) you're still going to get unexpected results.

It's very unlikely for a JSON string to be parsed as a class, or to execute any code, unless you do eval() or do unsafe deserialization. A Python config could do anything.

> You'll have to limit what's accepted.

This is much, MUCH easier said than done. How would you do that in Python? Google App Engine had a sort of "restricted Python" idea, and it was hard to implement AFAIK. Same thing for Zope/Plone (there was some templating with restrictions; I don't remember the precise system). Then, you'd need to document such "restricted Python".

> Also, what if I allow a config.py file then translate that to .json then back to python?

I suppose the first config.py is under the control of the user, and the second is under control of a software author. This is ok, because the "second python version" can perform validation and object construction.

Python configs can be OK (but can get messy) if the software you're building is mostly internal, and few people modify it and they all know what they're doing. As soon as you've got a large enough team and/or userbase, IMHO executable configurations are painful.


I appreciate the thoughtful response and I respectfully disagree.

I agree with your point about non-technical users, I addressed that in the top level comment on this thread.

Limiting what is accepted isn't hard at all, here's how you do it:

In your config.py module, have a Config class that subclasses a superclass named, say, ConfigBase. ConfigBase implements __init_subclass__ and in there, you can dictate how a subclass is configured and raise descriptive errors if your rules are not followed. You can do the same thing with a metaclass.

With this approach the class does not have to be mainline executable. You can pluck out what you need and use those plucked items however you see fit.

You are essentially consuming a python script as a config file and not actually running it.
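A minimal sketch of that ConfigBase pattern (the setting names and rules here are hypothetical):

```python
# ConfigBase enforces rules at class-definition time, i.e. when the
# user's config.py is imported. Setting names below are hypothetical.
class ConfigBase:
    REQUIRED = ("host", "port")

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for name in cls.REQUIRED:
            if not hasattr(cls, name):
                raise TypeError(f"{cls.__name__} is missing required setting: {name!r}")
        if not isinstance(cls.port, int):
            raise TypeError("port must be an int")

# The user's config.py then just subclasses it:
class Config(ConfigBase):
    host = "localhost"
    port = 8080

# A rule violation raises a descriptive error as soon as the class is defined:
try:
    class BadConfig(ConfigBase):
        host = "localhost"   # port missing
except TypeError as err:
    print(err)
```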


What's naive and dangerous about Python's datetime objects?


Your question implies that you don't know about the nuances of the datetime library :-) (see https://docs.python.org/3/library/datetime.html — it's in the first paragraph!)

Python datetime objects, by design, can be naive or timezone-aware. Timezone-aware datetime objects are OK; they identify a certain instant in time.

Naive datetime objects are Python-only abstractions (AFAIK) that don't identify anything in the real world; they're highly error prone, because there's no "right" way to use them.

They sort of work properly only if used in a very limited scope - e.g. your own code only, for small sections - but they're risky because they're not a different type (with respect to tz-aware), and it's hard to tell what any code accepting a datetime does if passed a naive object. Some libraries like java.time DO have a similar concept (e.g. LocalTime, LocalDate) but they keep it well separate from the "real" concept (e.g. Instant or Date in Java) so you can't use them accidentally.

Example: you pass a naive datetime object to any library which must translate it to an instant, like an ISO string with a well defined timezone. What does the library do? Throw an exception? Associate an arbitrary timezone (e.g. UTC)? Associate the local, current timezone? There's no "correct" behaviour.
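The ambiguity is visible even in the serialized form. A small sketch:

```python
from datetime import datetime, timezone

naive = datetime(2020, 3, 23, 9, 50, 1)                        # no tzinfo
aware = datetime(2020, 3, 23, 9, 50, 1, tzinfo=timezone.utc)

# The naive value serializes without an offset: any consumer has to guess
# which instant (if any) it refers to.
assert naive.isoformat() == "2020-03-23T09:50:01"
assert aware.isoformat() == "2020-03-23T09:50:01+00:00"
assert naive.tzinfo is None
```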


I agree that Python datetime objects are problematic, but for the opposite reason. It is tzinfo that is the sneaky disaster; the plain datetimes are fine.

Transparent timezone awareness always fails, unless you are 100% certain that a tz-aware datetime object will remain unconverted from the very top to the very bottom of the stack and all the way up again, no matter who is reading it and what they are doing.

For long-term minimization of pain, bugs and effort, you convert datetimes to UTC as early as possible and take them back to some localized version as late as possible (in the frontend, for a webapp), so that the backend never needs to know there is such a thing as timezones (except for separate validation and correction routines, since timezone definitions always end up being incorrect to some degree when you use them at scale).

If the localization of the datetime is an essential aspect (such as the departure time of a ship leaving port), you store a UTC value together with a record of the location. Only at the latest possible moment of processing, should you do a lookup on the location data to make a local time.

Obviously, there will be exceptions to this rule. If you batch process billions of timestamps under a tight deadline and must do calculations in local time, it might make sense to have the values persisted localized.


Actually UTC + timezone is exactly the wrong thing for "wall clock times" (things like meetings or departures where the time at the location is relevant).

The conversion to UTC will lose the original local time so you cannot retrieve it once time zone data changes, unless you perform reconversions every time you detect such a change in tzdata. And countries changing time zones happens more often than we think (and also on short notice).

Thus it is important to distinguish between instants (e.g. for recording when exactly something happened after the fact) and wall clock time (e.g. for coordinating people and goods at a certain place, like meetings, concerts, departure times). For the former use UTC, for the latter use a localised time zone (e.g. Europe/Rome), not an offset time zone (e.g. not +0200).

For more information Jon Skeet has written about this multiple times.


I believe this "wall clock time" approach is broken by design as it pushes the burden of figuring out timezone details to those who are not located in that particular timezone.

A fair and therefore safer approach is to decide that by protocol the legally binding time is defined in UTC.

Your system will translate UTC times to and from any given local time using the IANA time zone database which is regularly updated. End users must be aware about the UTC time, that it is legally binding, and that the local time conversions are provided as-is.

This way the time of a meeting or deadline is protected from local governments messing around with timezone changes.

Additionally, dates are rendered in ISO8601 standard format with a proper footnote to help users learn about international standards.


I think whether UTC or wall clock time is binding is a problem in the legal and planning (so the business) domain and has to be treated as an external input to the software engineering problem.

Although you are of course free to advocate for UTC. I remember Swatch trying to establish something similar and it never took off: https://en.wikipedia.org/wiki/Swatch_Internet_Time


> The conversion to UTC will lose the original local time so you cannot retrieve it once time zone data changes, unless you perform reconversions every time you detect such a change in tzdata

I don't agree/disagree with your point, nor do I agree/disagree with GP on the topic, but why couldn't I retrieve the original time? If, for an event, I save UTC + the event's time zone, I can always get back to the original time (actually, it doesn't even matter whether the timestamp is UTC; it's enough for it to have an explicit offset, i.e. to be the representation of an instant). Why should I change the timezone on a saved record? What usually changes is the user's timezone, not the record's.

> Thus it is important to distinguish between instants and wall clock time

Yes.

> For more information Jon Skeet has written about this multiple times.

I have read many things on datetime; would you care to share a couple of relevant links?


Offset timezones (e.g. UTC+2) don't change, what changes are local timezones (e.g. Europe/Rome).

For example here Turkey decided to change daylight savings time: https://github.com/JodaOrg/joda-time/issues/403 (if you have a look at the tzdata database you will find more, this one I remember because Turkey went back and forth about this).

If your timezone database changes you cannot retrieve the original wall clock time, unless you have a temporal timezone database and remember the date of conversion to UTC.

And if you used offsets instead of local timezones to begin with, you cannot even infer which offsets to change unless you have location data saved as well.

Here is a blog post by Jon Skeet https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a... where he says:

> For me, the key difference between the options is that in option 3, we store and never change what the conference organizer entered. The organizer told us that the event would start at the given address in Amsterdam, at 9am on July 10th 2022. That’s what we stored, and that information never needs to change (unless the organizer wants to change it, of course). The UTC value is derived from that “golden” information, but can be re-derived if the context changes – such as when time zone rules change


I'm sorry for the late response; yes, you are right that for future events the "right way to do it" is saving the place + local time. I think we were speaking of slightly different things.


No problem. Yes this is for future events where we want to coordinate people at a certain place.

For recording when something happened use UTC or UTC plus fixed offset.

Do you have a third context of using time?


> I agree that python datetime objects are problematic, but for the opposite reason. It is tzinfo that is the sneaky disaster, the plain datetimes are fine.

Why should naive datetimes be fine? How are they fine? What do they represent?

> For longterm minimization of pain, bugs and effort, you convert datetimes to UTC

This COULD work if naive objects had an IMPLIED UTC in their contract, i.e. if naive objects were declared as ALWAYS UTC. Your argument fails as soon as you pass a naive datetime object to a library/framework and it gets accepted, and/or you try serializing it without augmenting it with a TZ. As I said, naive datetimes only work if you control 100% of their usage: no libraries, no external points of contact. And the reason tz-aware objects sometimes fail for the opposite reason (e.g. libraries assuming naive objects) is a fault of the API design (they're not distinct types), but the problem lies in the existence of the naive version, not vice versa.

For the record: in the backend I always use tz-aware datetime objects with a fixed UTC timezone. That's the best way, IMHO, not to go crazy with time problems in Python. So your points about datetime handling are all valid and correct (timezone is mostly a "UI problem" and should not leak into the backend), but they don't prove your "naive works better" argument.


> Why should naive datetimes be fine? How are they fine? What do they represent?

They represent the date/time wherever the user is (location-independent). If I want to take a pill every Monday and Thursday at 10am, I don't want to get a notification at 5am just because I moved from the UK to NY.


This is LocalDate/LocalTime in java.time/joda.time parlance. But it's a different beast; in fact, you're talking about a repeated action. But if I tell you that on "March 23rd, 2020, 9:50:01am" I did something, what does that mean to you? When did it happen? That's a naive datetime.

It's got its place: but the idea that the API and the usage should be similar to a precise representation of time, as if the two were interchangeable, is... dangerous, and it's the source of a lot of problems with datetimes in the Python world.


Or maybe you do want to take the pill at 5am, since your are only there for a few days and it is critical that you maintain an exact 24 hour interval between doses.

As an assembly worker in the timestamp-wrapper-class factory, I am not in a position to try being clever about it. :-)


There is indeed no contract that datetimes without tzinfo are UTC. But there is a contract that they don't have a built-in timezone or DST concept, and that you must handle that separately.


>What does the library do? Throw an exception? Associate an arbitrary timezone (e.g. UTC)? Associate the local, current timezone?

Naive datetime is what datetime.utcnow() returns. UTC is essentially a "default" timezone. I've always thought it made most sense in a library to assume it's UTC.



Your assumption is just a guess. Take a look at the python docs I linked just above here: "A naive object does not contain enough information to unambiguously locate itself relative to other date/time objects. Whether a naive object represents Coordinated Universal Time (UTC), local time, or time in some other timezone is purely up to the program"


I'd consider any datetimes that have an implicit timezone that isn't or might not be UTC a bug in any system.

This isn't restricted to python. Servers that spit out logs with timestamps, for instance, should be spitting out UTC.

It makes sense to build systems that deal with timezones at the very edges (and sometimes not even then) and use UTC for everything else. It's simpler that way.


Unfortunately, in Python, whenever you try to make a naive datetime "aware" using the provided conversion method (`astimezone()`), it will assume the naive datetime is in the local time zone, not UTC.

The datetime module provides `timezone.utc` to be used whenever you want datetimes _in UTC_, but it needs to be used explicitly by the programmer.
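A small sketch of the two behaviours:

```python
from datetime import datetime, timezone

naive = datetime(2020, 1, 1, 12, 0)

# astimezone() on a naive datetime assumes the *local* timezone and then
# converts; the result depends on the machine it runs on:
local_interpretation = naive.astimezone()

# To state "this naive value is UTC", attach timezone.utc explicitly:
as_utc = naive.replace(tzinfo=timezone.utc)
assert as_utc.utcoffset().total_seconds() == 0
```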


Basically everything.


As an ML engineer working in Python, I keep running into a problem:

If parameters are defined close to usage, and strongly typed, then it's hard to cleanly search for good configurations of the parameters. Especially for fancier search strategies, you want all parameter lookups to go through a single file.

On the other hand, there's a lot of code churn until an ML pipeline is finished. And errors from typos and type violations will often only show up after hours of training. So it's also painful to try to keep a separate, loosely-typed parameter file in sync.

So far, my compromise is to:

(1) on a first pass, define all parameters as global variables at the top of the files they are used in

(2) once mostly code-complete, pull them into a separate file that tracks initial values, current values, and search ranges. Make all usages go through a lookup where the key is an enum, but the value is untyped:

    def param(name: ParamName) -> Any:
        return params[name].current_value

Which is not ideal. Does anyone else keep running into this problem and have a better solution?


Alas, MyPy doesn't have the concept of "keyof" and mapped types like TypeScript.

So in place of that I would:

1. Define a variable called Params = Any;

2. Liberally use Params["foo"] and Params["bar"] anywhere.

3. Once you stabilize, reimplement Params as a TypedDict. You'll get failures if accessing any invalid key.

You can also use a NamedTuple if you prefer.

If you insist on passing a param name, then you'll have to create a big list of Literals with every key in your dict. So Param = Literal["foo", "bar"], etc.
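A small sketch of the loose-then-strict progression (parameter names are hypothetical):

```python
from typing import Any, TypedDict

# Phase 1, while iterating: keep params loosely typed.
Params = Any
params: Params = {"learning_rate": 1e-3, "batch_size": 32}

# Phase 3, once stable: lock the keys down so mypy flags invalid ones.
class StableParams(TypedDict):
    learning_rate: float
    batch_size: int

stable: StableParams = {"learning_rate": 1e-3, "batch_size": 32}
assert stable["batch_size"] == 32
# stable["batch_szie"] would now be rejected by mypy (typo in key).
```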


Thank you! It's surprising that mypy can typecheck based on string keys like that. Cool!


I think argparse covers most of the points mentioned as desirable in this article.

* validate at start (using the type keyword argument for add_argument)

* access by name as identifier, not string

What's more, default values are stored with the configuration, plus you add a help text telling everyone what each argument is for.

One downside is that your command-line call gets longer.
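A minimal sketch of those points with argparse (the arguments are hypothetical):

```python
import argparse

# Hypothetical arguments, for illustration.
parser = argparse.ArgumentParser(description="example app")
parser.add_argument("--port", type=int, default=8080,
                    help="port to listen on")
parser.add_argument("--debug", action="store_true",
                    help="enable debug mode")

# Validation and type conversion happen right here, at startup:
args = parser.parse_args(["--port", "9000"])
assert args.port == 9000
assert args.debug is False
```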


Pretty good guidelines. I like the idea of having a configuration close to the class (perhaps module-level depending on the project size) that uses it. With dataclass the class definition is fairly clean. In addition to that, I'd consider using https://docs.python.org/3/library/dataclasses.html#post-init... for business-specific validations.


Kensho has probably the best solution to this problem I've seen so far:

https://github.com/kensho-technologies/grift

Handles typing really well, as well as config defaults and fallbacks, giving you the ability to configure your app a few ways, and fall back on other configs if something isn't specified.


Shout out here for pydantic BaseSettings https://pydantic-docs.helpmanual.io/usage/settings/

That provides typed and validated auto-loading from env vars. I have been quite happy with that in conjunction with an optional .toml file, to do flexible config cleanly and simply like:

    import toml

    from myproj.conf.types import Settings  # a pydantic BaseSettings model


    try:
        _config = toml.load('myproj.toml')
    except FileNotFoundError:
        _config = {}


    settings = Settings(
        **{key.upper(): val for key, val in _config.items()}
    )


It's worth looking at Python's own configparser[1] before rolling your own.

[1] https://docs.python.org/3/library/configparser.html
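A minimal sketch (the section and keys are hypothetical; note the comment and the %(...)s interpolation):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[server]
; comments are supported
host = localhost
port = 8080
url = http://%(host)s:%(port)s/
""")

assert config.getint("server", "port") == 8080
assert config["server"]["url"] == "http://localhost:8080/"
```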


Isn't that basically the same end result as using json.loads, except a different format (one that has no actual spec)?


JSON does not support comments or string interpolation. Python's ConfigParser language does.


True, you do get string interpolation, but the comment support in ConfigParser isn't very good. Although they may have fixed some of that in Python 3, I'm still using workarounds.

To be clear, I am not suggesting using JSON for config; I think that would be my last choice. My point is that ConfigParser isn't really an alternative to rolling your own if you want decent validation etc. (those spec files are horrible to use). You very quickly need to start extending ConfigParser to the point where you've started rolling your own. And at that point you'd be better off with one of the other (tested) solutions already suggested.


What's wrong with the comment support?

You can't have comments at the end of a line, but that's sort of the nature of supporting arbitrary strings as values. I don't want my users to have to quote or escape special characters if they happen to want to use them. They're not programmers.

    # The note to display
    note = Our #1 customer!
rather than

    note = Our \#1 customer!  # The note to display.
or

    note = "Our #1 customer!"  # The note to display.


> What's wrong with the comment support?

Comments are simply ignored: you can't read them. You might want to read a commented config file in, make a change to a setting, and then write it back out. You can't do that. But you can write comments using the 'allow_no_value=True' hack, as long as you put them in a section.

> You can't have comments at the end of a line

You can. You need to use ';' for inline comments, and you must precede it with whitespace. Are your users ready for that?


> make a change to a setting and then write that out

Very good point.

> You can. You need to use ';' for inline comments

I have some bugs to fix.


As already written by others, the article does not go very deep and is missing many essentials. What I was mostly missing is more about keeping configuration parameters as simple as possible. Much more detailed best practices can be found here: https://www.libelektra.org/ftp/elektra/slides/cm/


Unless the end user is non-technical, use a .py file and force them to subclass your Configuration class, which has an __init_subclass__ method so you can enforce rules.

When you are ready to move to a more generic solution, your .config or .yml file can generate these.

The advantage here is both flexibility (it's Python) and control (allow/disallow whatever you want).

If you need nested items, use nested classes.


This.

Until the app reaches the level of advanced YAML config files for cloud deployments, it's really hard to beat a "config.py" that does a single read of all your ENV_VARS at startup.

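A minimal sketch of such a config.py (the setting names and defaults are hypothetical):

```python
# config.py -- one read of the environment at startup (hypothetical settings)
import os

DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/app")
DEBUG = os.environ.get("DEBUG", "0") == "1"
WORKERS = int(os.environ.get("WORKERS", "4"))
```

Elsewhere in the app you just `from config import DATABASE_URL` and every read after startup is a plain module attribute.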

I have been working on a minimalistic application config library for Python, aiming to consolidate config loading from files, environment variables, and command-line argument parsing. It's an alpha, so please feel free to provide feedback if app configuration has been a pain point for you.

https://github.com/okomestudio/resconfig


FYI, that sounds like exactly the same feature set as 'python-decouple'.


Indeed, "python-decouple" looks like it serves a similar niche. (I didn't know of the package; thanks for letting me know.) I think I'd like to target a smaller niche though: someone writing small applications, with a little more flexibility in things like YAML support and dynamic loading. Unless "decouple" eventually supports similar features, I want to keep experimenting.


Along these lines, and unsatisfied with current solutions, I started this project, "Turtle Config". It is format-agnostic and supports type checking as well:

https://github.com/mixmastamyk/tconf

I'll see if I can add any advice this article gives; feedback would be helpful.



