Ask HN: I am a data analyst and my code is a mess
5 points by elliott34 on Dec 3, 2014 | 11 comments

For a while I've been wondering whether I'd get laughed at for asking this question, but it's gotten to the point where I really need some guidance. I have a spaghetti code problem. I am a data scientist/analyst, and my day to day is entirely in Python/scikit-learn/pandas, data munging and running models. Right now my code is several hundred lines of data processing steps, filtering, lots and lots of joins and SQL queries, pickle dumps and loads, and print array.shape calls. I try to create as many functions as possible to help organize the code, and I put different parts of the project into different scripts. I use IPython Notebook on the cloud for the interactive portion of my analysis, and Sublime Text 2 for the fixed data processing scripts. Long story short, I have a physics background and was never taught how to properly structure my workflow for this type of coding. Should I be creating more classes and objects? Are there any resources out there on how to code and structure large machine learning projects like this? Or is it doomed to be spaghetti code?

The goal is to find data that needs to be grouped, and group it. Then find the functions that only use that grouped data, and stick them in classes.
For example, a query can be an object, e.g. a database connection (in Java).
Then your query-specific pre-processing code can be added directly into the query.

I don't know if this helps, but it's a suggestion.

P.S.: I've been spending my free nights for the past 2 weeks trying to throw together a JavaScript-based data processing engine in Java. It should be mostly workable by the weekend. I could throw it on a Show HN if you'd be interested.
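
For what it's worth, here is a minimal sketch of the "query as an object" idea from this comment, written in Python rather than Java because that is what the question is about. It is only one possible reading of the suggestion, not the commenter's actual code. The Query class, its then and run methods, the two step functions and the orders table are all hypothetical names made up for illustration, and it uses the standard-library sqlite3 module with an in-memory database purely so that it runs on its own.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable

Row = dict  # one result row, keyed by column name


@dataclass
class Query:
    """Hypothetical class: groups a SQL statement with its own clean-up steps."""
    sql: str
    params: tuple = ()
    steps: list[Callable[[list[Row]], list[Row]]] = field(default_factory=list)

    def then(self, step):
        """Register a preprocessing step; returns self so calls can be chained."""
        self.steps.append(step)
        return self

    def run(self, conn: sqlite3.Connection) -> list[Row]:
        """Execute the SQL, then apply each registered step in order."""
        cursor = conn.execute(self.sql, self.params)
        columns = [col[0] for col in cursor.description]
        rows = [dict(zip(columns, values)) for values in cursor.fetchall()]
        for step in self.steps:
            rows = step(rows)
        return rows


def drop_missing_amounts(rows):
    """Filter step: discard rows whose amount is NULL."""
    return [r for r in rows if r["amount"] is not None]


def add_amount_in_cents(rows):
    """Derive step: add a 'cents' column computed from 'amount'."""
    return [{**r, "cents": round(r["amount"] * 100)} for r in rows]


# Assumed toy data: an in-memory table, just to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.99), (2, None), (3, 20.0)],
)

# The SQL and the processing that belongs to it now live in one place.
orders = (
    Query("SELECT id, amount FROM orders WHERE id >= ? ORDER BY id", (1,))
    .then(drop_missing_amounts)
    .then(add_amount_in_cents)
)
result = orders.run(conn)
print(result)
```

The point of the design is that each query carries its own filtering and derived columns, so a script becomes a short list of named Query objects instead of several hundred lines of interleaved SQL and clean-up. The same shape should carry over to pandas by having each step take and return a DataFrame.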