Show HN: Self-Published Book – “Data Science in Production”
172 points by bweber on Jan 4, 2020 | 18 comments
Hi HN,

Over the past 6 months I've been working on a technical book focused on helping aspiring data scientists get hands-on experience with cloud computing environments using the Python ecosystem. The book is targeted at readers already familiar with libraries such as Pandas and scikit-learn who are looking to build out a portfolio of applied projects.

To author the book, I used the Leanpub platform to provide drafts of the text as I completed each chapter. To typeset the book, I used the R bookdown package by Yihui Xie to translate my markdown into PDF. I also used Google Docs to edit drafts and check for typos. One of the reasons I wanted to self-publish the book was to explore the different marketing platforms available for promoting texts and to get hands-on with some of the user acquisition tools that are commonly used in the mobile gaming industry.

Here are links to the book, with sample chapters and code listings:

- Paperback: https://www.amazon.com/dp/165206463X

- Digital (PDF): https://leanpub.com/ProductionDataScience

- Notebooks and Code: https://github.com/bgweber/DS_Production

- Sample Chapters: https://github.com/bgweber/DS_Production/raw/master/book_sam...

- Chapter Excerpts: https://medium.com/@bgweber/book-launch-data-science-in-prod...

Please feel free to ask any questions or provide feedback.

I'm noticing no mention of virtual environments/venv. This is something I see a lot of junior (Python) data scientists and data engineers struggle with. It's very important to set up environments properly (and follow best practices) to avoid version collisions, global scope pollution, etc.

Great work, though! I'm also using bookdown (I instantly recognized the template) for a book I've been working on and it's a pleasure to use. Would love to see a blog post on how you marketed the book and how your sales are doing once the ball gets rolling!

Given how fragmented virtual environments are in Python, which is the most popular currently? Last I looked there was venv, pyenv, pipenv, pipx, poetry and pipsi.

Haven't done any Python work in a while but am curious to know what people use. Starting right now, I'd probably just stick with venv, though I'm guessing the others do offer some extra benefits.

The virtual environment part isn't really very fragmented at all -- I believe all of those use virtualenv/venv under the hood for that and just add layers of additional features for package and/or interpreter management. (That part is where the fragmentation happens, especially packages.) The virtualenv/venv decision pretty much boils down to this: are you py3 only? Use the venv module in the standard library. Do you still support py2? Then use virtualenv.
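
For the py3-only case, a minimal sketch using the standard-library venv module might look like the following. The directory name demo-env is hypothetical, and with_pip=False is an assumption chosen so the example runs offline; pass with_pip=True to bootstrap pip into the environment.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location for the environment; a temp dir keeps the sketch self-contained.
env_dir = Path(tempfile.mkdtemp()) / "demo-env"

# with_pip=False is an assumption so this runs without ensurepip or network access.
venv.create(env_dir, with_pip=False)

# Every venv contains a pyvenv.cfg file that records the base interpreter's location.
cfg = (env_dir / "pyvenv.cfg").read_text()
print(cfg)
```

From a shell, `python3 -m venv demo-env` is the usual equivalent, and it installs pip by default.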

I see conda a lot more among new Python coders in the machine learning space; having a GUI helps newer programmers. Check out Anaconda/Miniconda. The code is as simple as:

  conda create -n myenv python=3.8
  conda activate myenv

Do all your things, then export the environment:

  conda env export > myenv.yml

On your coworker's machine:

  conda env create -f myenv.yml

conda can also manage packages that aren’t part of Python, such as R, which is why I mainly use it.

Virtual environments are useful when setting up a single machine, but many of the tools covered in the book do not directly support venv, such as Lambda functions, Cloud Dataflow, and Databricks. In general, the goal is to get readers to explore tools beyond Conda for setting up environments and dependencies.

Marketing will be a challenge. There's been great reception here, but I expect sales to taper quickly and then paid sponsorship will be necessary to continue generating sales. I'm currently testing out Amazon Advertising, but I don't seem to have bids high enough to get to my target budget.

Fantastic - the end result pdf looks great.

Please blog about the details of the production process (scripts you wrote, problems with paper sizes, or the Amazon print-on-paper process). I am 30,000 words into a book and would love to hear more.

Good luck!

For paper size, I decided to use 6x9 from the start. I also didn't consider the epub format until the end, and then used the "print replica" feature on Kindle Direct to create a Kindle version, which lacks text resizing and a few other features. Once I settled on a page size, I wrote each chapter independently and made sure to avoid any widowed text or code samples. I decided that orphaned text would be fine, given that the size of the page is relatively small.

I didn't really need to write any scripts, beyond using the sample bookdown project. I did use a custom book class and made some tweaks for the code formatting, but these were mostly LaTeX changes.

Sounds like there are a few requests for this, so I'll look into writing a post on it, and I'll also talk about my motivation for going the self-publishing route.

You can follow me on Medium for this update: https://medium.com/@bgweber

thank you - look forward to it

I really enjoy your articles on Towards Data Science and this seems to pull a lot from them. I bought the PDF copy. I have a full-stack background and really like it all from the data engineering perspective.


Thanks. I originally planned on covering more topics related to DevOps, such as CI/CD for model deployment, but felt that this might be a bit of a stretch for some readers, and it's an area where I have less experience. Glad to hear it's useful from a full-stack perspective.

Way to go! Would you consider a blog post about the self-publishing experience?

As user dvt mentioned, this is built using bookdown[1], an R library (with the help of Pandoc). You can see that the example chapter of this book looks exactly like the bookdown output[2]. The bookdown PDF explains in detail how to use Rmd+RStudio+R+Pandoc+Markdown to publish this.

[1]: https://bookdown.org/yihui/bookdown/

[2]: https://bookdown.org/yihui/bookdown/bookdown.pdf

Here's the complete source for the last text I authored using this pipeline: https://github.com/bgweber/StartupDataScience/tree/master/bo...

You can use the same tooling to create an epub output, but the formatting will be substantially different.

I might do one on the trade-offs between self-publishing and working with a publisher. With Kindle Direct Publishing, it's pretty straightforward. It's more about the tooling to produce the text and the process of using Kindle Direct.

Well done. Just bought.

Please leave a review if you purchased on Amazon.
