
Show HN: Appelpy – library for easier regression modelling in Python - mfarragher
https://github.com/mfarragher/appelpy
======
mfarragher
Hi there, I'm a data scientist and economist who uses the main Python stats
libraries regularly. I was frustrated by how long it takes to fit basic
regression models and diagnose models, so I began working on a package called
Appelpy (Applied Econometrics Library for Python).

The aim: Make regression modelling as easy as pie.

Now that I've tightened up my code coverage and fleshed out some documentation
– ReadTheDocs and notebook tutorials that can be viewed with Binder & Nbviewer
– I'm sharing the library more widely!

The library is built upon Statsmodels but I've tried to make a more cohesive
interface for regression modelling, with model diagnostics in mind especially.
Model diagnostics are the time-consuming and repetitive part of regression
modelling in Python, but with Appelpy they can be done with minimal
code.

This is the first project I've released on PyPI and I was working on it to
hone my software engineering skills, so I'm interested in tips and feedback.

\- Mark

~~~
mushufasa
Thanks. The lack of something like this has been keeping me drawn to R for
pure statistics work. I'll definitely give this a try.

------
Canadauni
I know you reference your introductory notebook throughout your docs but I
think it would be helpful to include some of the plots inline in your docs.

Seeing that this lib is built on top of statsmodels, my first response was that
I'll just keep using statsmodels. However, the simplicity of the diagnostic
plots actually seems like a really nice value add. Showcasing those plots right
in your docs might make it more attractive to people checking out your project
for the first time.

~~~
mfarragher
Yes, that's a good point on the plots. I'll try to make them more prominent in
future docs.

These are some things I've included in the library which aren't implemented in
Statsmodels:

\- Breusch-Pagan studentized test of heteroskedasticity (available in R)

\- Standardized / beta coefficients (still an open feature request:
https://github.com/statsmodels/statsmodels/issues/3857)

\- Leverage vs residuals squared plot (there's an influence plot but not
something similar to Stata's lvr2plot)
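
To illustrate the standardized (beta) coefficient item in the list above, here is a minimal sketch for the single-regressor case. It is not Appelpy's or Statsmodels' implementation; the function names are hypothetical and chosen only for this example. It uses the standard definition: the OLS slope rescaled by sd(x) / sd(y).

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def ols_slope(x, y):
    """Slope of the simple OLS regression of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def beta_coefficient(x, y):
    """Standardized coefficient: the slope rescaled by sd(x) / sd(y).

    Hypothetical helper for illustration only; not an Appelpy function.
    """
    return ols_slope(x, y) * sample_sd(x) / sample_sd(y)
```

Because the slope is rescaled by both standard deviations, the result does not change when x is measured in different units, which is the usual reason for reporting it.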

Even the most common metric I use for assessing models, root MSE, isn't
stored in the Statsmodels object summary. To assess an OLS model in
Statsmodels I find I write so much repetitive code, yet in Stata the commands
are fairly succinct.
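
As a rough illustration of the root MSE metric mentioned above, the sketch below computes it from a list of residuals. It assumes the usual definition, the square root of the sum of squared residuals divided by the residual degrees of freedom, and the function name `root_mse` is hypothetical rather than part of Appelpy or Statsmodels.

```python
import math


def root_mse(residuals, n_params):
    """Root mean squared error: sqrt(SSR / (n - k)).

    residuals: the model residuals (observed minus fitted values).
    n_params:  number of estimated parameters k, including the intercept.
    """
    df_resid = len(residuals) - n_params
    if df_resid <= 0:
        raise ValueError("need more observations than parameters")
    ssr = sum(e ** 2 for e in residuals)
    return math.sqrt(ssr / df_resid)


# Example: 4 residuals from a model with an intercept and one regressor.
print(root_mse([1.0, -1.0, 2.0, -2.0], n_params=2))
```

With a fitted Statsmodels OLS results object, the same quantity should be obtainable as the square root of its `mse_resid` attribute, so the point is less that it is hard to compute than that it is not shown in the summary.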

Other things I also added to make encoding of variables easier:

\- InteractionEncoder

\- DummyEncoder (to cover different ways of treating missing values)
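
To make the "different ways of treating missing values" point concrete, here is a small sketch of dummy encoding under three possible policies. This is not Appelpy's DummyEncoder; the function `dummy_encode` and the policy names `"zeros"`, `"indicator"` and `"propagate"` are assumptions made up for this example.

```python
def dummy_encode(values, missing="zeros"):
    """Dummy-encode a list of category labels, where None marks a missing value.

    missing="zeros":     a missing value gets 0 in every dummy column.
    missing="indicator": as "zeros", plus an extra "missing" column set to 1.
                         (Assumes no real category is itself named "missing".)
    missing="propagate": a missing value stays missing (None) in every column.
    """
    if missing not in ("zeros", "indicator", "propagate"):
        raise ValueError(f"unknown policy: {missing!r}")

    # One dummy column per observed category, in sorted order.
    categories = sorted({v for v in values if v is not None})

    rows = []
    for v in values:
        if v is None and missing == "propagate":
            row = {c: None for c in categories}
        else:
            # For a missing value, no category matches, so every dummy is 0.
            row = {c: int(v == c) for c in categories}
        if missing == "indicator":
            row["missing"] = int(v is None)
        rows.append(row)
    return rows


for row in dummy_encode(["a", None, "b"], missing="indicator"):
    print(row)
```

The choice matters for regression: treating missing values as all zeros silently folds them into the base category, while an explicit indicator column lets the model estimate a separate effect for them.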

The more I thought about these missing features, the more I thought they could
be wrapped up in a more coherent way. :-)

