
Ask HN: Best way to stay organized as a data analyst? - elsherbini
I'm a graduate student doing a lot of data analysis. For a given project, I find myself using a mishmash of jupyter notebooks, shell and python scripts, and interactive ipython/sql sessions. If I switch projects for a couple days, it becomes really hard for me to piece together what exactly I've done on the old project.

Is there a simple to use system that would at least give me breadcrumbs to retrace my steps? I'd like to strike a balance between the time it takes to do the organization and its usefulness. It would be helpful if I could use the same system to plan future goals as well as document past work.
======
achompas
A recent episode of Not So Standard Deviations (a great podcast for anyone in
the sciences or applied statistics) covered reproducibility:

[https://soundcloud.com/nssd-podcast/episode-5-irl-roger-is-t...](https://soundcloud.com/nssd-podcast/episode-5-irl-roger-is-totally-with-it)

I'd suggest treating your code like a software project: convert repeated logic
into methods, collect methods into modules/libraries depending on their use,
write _lots_ of documentation, and use version control (GitHub with a nice
README.md in each project is a great start).

If you transition projects, take 5-10 minutes to update your docs (I keep a
"captain's log") with the latest details and a list of todo items. I like to
note my victories ("on 1/19 I produced this plot that sent me in a different
direction; on 1/21 I demoed my project and received such-and-such feedback")
and next steps in the log, as a way to retrace my progress over weeks or
months.

That podcast also mentions knitr
([http://yihui.name/knitr/](http://yihui.name/knitr/)), which looks great for
docs.

~~~
elsherbini
Hilary's talk on reproducibility[1] was very inspirational, thanks!

[1] [https://www.youtube.com/watch?v=7B3n-5atLxM](https://www.youtube.com/watch?v=7B3n-5atLxM)

------
floppydisk
At the abstract level, get a standardized process for your analysis that you
can use from project to project. Doesn't have to be complicated, but it needs
to be consistent and applied to each project.

Here's what I'd recommend -

Take 5-10 minutes after getting a project and think through it. Break it into
chunks and note what you need to discover (e.g. go find a data set) and what
you already know how to do. Create an ordered task list for yourself (1.
get data set, 2. set up project db, 3. create schema for db, etc.).
This gives you measurable milestones and helps you keep track of
what's next. You can calendar these out if you need to as well.

Keep a running notebook / text file for each project that's dated with a brief
explanation of what you did that day, what troubles you ran into, what needs
fixing, what you need to do next, and any other random thoughts you had
related to the project. It unloads your RAM and keeps it somewhere you can get
at easily. Write in this at the end of each day or when you switch projects.
It should take 3-5 minutes to update this. It's also the first thing you open
when you start working on a project again.
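
A running log like this can be kept with a few lines of Python. This is only a minimal sketch: the function name `append_log_entry`, the field labels, and the file layout are all made up for illustration, so adapt them to whatever you will actually keep using.

```python
from datetime import date
from pathlib import Path


def append_log_entry(log_path, did, troubles="", next_steps="", today=None):
    """Append one dated entry to a plain-text project log and return it.

    The entry format (date line, then Did/Troubles/Next) is an assumption,
    not a standard; change the labels to suit your own notes.
    """
    today = today or date.today()
    lines = [f"{today.isoformat()}", f"Did: {did}"]
    if troubles:
        lines.append(f"Troubles: {troubles}")
    if next_steps:
        lines.append(f"Next: {next_steps}")
    entry = "\n".join(lines) + "\n\n"
    # Append mode creates the file on first use and never overwrites history.
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

Calling something like `append_log_entry("notes/log.txt", "cleaned survey data", next_steps="fit baseline model")` at the end of a session leaves a dated breadcrumb to read first when you come back to the project.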

Get source control set up and give each project its own folder. Store notes
and source code in there and keep it up to date. Remember, process matters
right now more than specific implementations. It can be SVN or Git. You can
further sub-organize if you need to, e.g. a SQL folder for scripts/stored
procedures, BASH for shell scripts, PYTHON for Python scripts, etc.
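
One way to make that folder layout consistent from project to project is to scaffold it with a small script. The sketch below assumes lowercase folder names and a `notes/log.md` file, both of which are illustrative choices rather than anything prescribed above.

```python
from pathlib import Path

# Hypothetical layout based on the folders suggested above; rename freely.
SUBFOLDERS = ("sql", "bash", "python", "notes")


def scaffold_project(root):
    """Create the project folder, its subfolders, and empty README/log files."""
    root = Path(root)
    for name in SUBFOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "notes/log.md"):
        # touch() does not change the contents of files that already exist,
        # so rerunning the script on an existing project is safe.
        (root / filename).touch(exist_ok=True)
    # Return the relative paths that now exist, for a quick sanity check.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

After running it, `git init` inside the new folder (or the SVN equivalent) puts the whole thing under source control.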

Write a README for your project, that gets stored in the project folder, that
explains what pieces of software get run and in what order. That way you won't
forget order of execution.
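
The order of execution can also live in a tiny driver script next to the README, so the documented order and the order that actually runs cannot drift apart. This is a sketch under assumptions: the script paths in `STEPS` are hypothetical placeholders, and `run_steps` is just an illustrative helper name.

```python
import subprocess
import sys

# Hypothetical step list: replace with your own scripts, in execution order.
STEPS = [
    ["bash", "bash/01_download.sh"],
    [sys.executable, "python/02_clean.py"],
    [sys.executable, "python/03_plot.py"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    completed = []
    for cmd in steps:
        print("running:", " ".join(cmd))
        # check=True makes subprocess.run raise CalledProcessError
        # when a step exits with a non-zero status.
        runner(cmd, check=True)
        completed.append(cmd)
    return completed
```

The `runner` parameter only exists so the function can be exercised without launching real processes; in normal use you would call `run_steps(STEPS)`.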

Be systematic in your approach. The best process is the one you stick to and
actually use, not the one that's perfect on paper.

~~~
elsherbini
This is great advice and made me realize my biggest issue is mindfulness. I
just want to dive into the code right away, but I should be taking the time
to be intentional about my goals and organization.

I'm going to use git and my school's enterprise github for this since it'll
make it easy for me to share with my lab mates. Thanks for your thoughtful
response.

------
twunde
This is exactly why businesses have some sort of bug tracking/task management
software. You don't need Jira, but something lightweight like Trello would
work. The most important part is to make sure to comment on issues with your
latest work and findings. This can be as simple as dropping new scripts into a
comment or pointing to a source control commit/pull request/whatever. At the
end of the day, you want something centralized where you can track your latest
changes and findings.

------
edimaudo
You can take a simple approach and use a to-do list. Split the list into
to-do, doing and done. Each project you are working on should have a header in
each section as long as it is not complete. When it is complete, remove it
from the to-do and doing sections. E.g.

To-do
  Project 1
    - task 2

Doing
  Project 1
    - task 1

Done
  Project 1
  Project 2
    - task 1
    - task 2

The work you want to do on a given day should be moved to the doing section.
At the end of the day, if the work is completed, move it to the done list. If it
was not completed, move it back to the to-do section. You could add comments to
note where you stopped or any issues you came across.
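
If a text file feels too loose, the same three-section list is easy to sketch in Python. The `Board` class below is purely illustrative (the names are made up), and it simplifies one thing: a project header is only printed in a section where that project has tasks.

```python
from collections import defaultdict

SECTIONS = ("to-do", "doing", "done")


class Board:
    """Three-section task list, grouped by project inside each section."""

    def __init__(self):
        # section name -> project name -> list of task descriptions
        self.sections = {name: defaultdict(list) for name in SECTIONS}

    def add(self, project, task):
        """New tasks always start in the to-do section."""
        self.sections["to-do"][project].append(task)

    def move(self, project, task, src, dst):
        """Move a task between sections; raises ValueError if it is not in src."""
        self.sections[src][project].remove(task)
        self.sections[dst][project].append(task)

    def render(self):
        """Return the board as indented plain text."""
        lines = []
        for name in SECTIONS:
            lines.append(name)
            for project, tasks in self.sections[name].items():
                if tasks:  # skip projects with nothing in this section
                    lines.append(f"  {project}")
                    lines.extend(f"    - {t}" for t in tasks)
        return "\n".join(lines)
```

At the start of the day you would `move(...)` tasks from "to-do" to "doing", and at the end of the day either on to "done" or back to "to-do".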

------
krmmalik
I second the "treat it like software development" approach.

If you use version control through, say, Git, then you can write a descriptive
message for each commit. That gives you an easy way to track what you've been
doing and to roll things back as and when necessary.

------
dhogan
Source control?

