Hacker News
Ask HN: I am a data analyst and my code is a mess
5 points by elliott34 on Dec 3, 2014 | 11 comments
I have been wondering for a while whether I'd get laughed at for asking this question, but it's gotten to the point where I really need some guidance.

I have a spaghetti code problem. I am a data scientist/analyst, and my day-to-day is entirely in Python/scikit-learn/pandas: data munging and running models. Right now my code is several hundred lines of data processing steps, filtering, lots and lots of joins and SQL queries, pickle dumps and loads, and print array.shape calls. I try to create as many functions as possible to help organize the code, and put different parts of the project into different scripts. I use IPython Notebook on the cloud for the interactive portion of my analysis, and Sublime Text 2 for the fixed data processing scripts.

Long story short, I have a physics background and was never taught how to properly structure my workflow for this type of coding. Should I be creating more classes and objects?

Are there any resources out there on how to code and structure large machine learning projects like this? Or is it doomed to be spaghetti code?

The rule of thumb that most people stick with when doing OOP is that duplicate code is bad.

The goal is to find data that needs to be grouped, and group it. Find functions that only use that grouped data, and stick them in classes.

For example, a query can be an object, e.g. a database connection (in Java):

       public class DBConnect {
               // Connection, mkConnection and execQuery stand in for your DB driver's API
               private Connection connection = null;

               public DBConnect(String ip, int port) {
                        this.connection = mkConnection(ip, port);
               }

               public Object query(String query) {
                        return this.connection.execQuery(query);
               }
       }
Then your query-specific pre-processing code can be added directly to the query method:

               public String query(String query, String regex) {
                        return this.connection.execQuery(query).toString().replaceAll(regex, "");
               }
Which results in code like:

               DBConnect db = new DBConnect(ip, 150); // ip: your database host
               String[] queries = { "yada", "yada", "yada" };
               for (String str : queries) {
                        String result = db.query(str, "\\s+");
               }
I don't know if this helps, but it's a suggestion.

P.S.: I've been spending my free nights for the past 2 weeks trying to throw together a JavaScript-based data processing engine in Java. It should be mostly workable by the weekend. I could post it as a Show HN if you'd be interested.

Sure! I'd love to take a look. I've actually written a mysqldb library in Python that makes "data objects".

For example:

    mydata = dataobject(query, connectionstring())
    mydata.query()        # get data
    mydata.data           # pandas data frame
    mydata.write_to_db()  # write to db method


Sounds similar to what you're doing here, although I don't know Java...
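
A minimal sketch of the "data object" idea described above, assuming the standard-library sqlite3 module in place of MySQL and a plain list of rows in place of a pandas data frame. The class name `DataObject`, its methods, and the table names are hypothetical and are not the poster's actual library.

```python
import sqlite3


class DataObject:
    """Bundle a query, its connection, and the fetched rows (hypothetical design)."""

    def __init__(self, query, connection):
        self.query_text = query
        self.connection = connection
        self.data = None  # filled in by query()

    def query(self):
        # Run the stored query and keep the rows on the object.
        self.data = self.connection.execute(self.query_text).fetchall()
        return self.data

    def write_to_db(self, table):
        # Copy the fetched rows into another table with the same column count.
        # The table name is interpolated, so it must come from trusted code.
        if not self.data:
            return 0
        placeholders = ", ".join("?" for _ in self.data[0])
        self.connection.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})", self.data
        )
        self.connection.commit()
        return len(self.data)


# Usage with an in-memory database (table and column names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (id INTEGER, value REAL)")
conn.execute("CREATE TABLE clean (id INTEGER, value REAL)")
conn.executemany("INSERT INTO raw VALUES (?, ?)", [(1, 0.5), (2, -1.0), (3, 2.5)])

mydata = DataObject("SELECT id, value FROM raw WHERE value > 0 ORDER BY id", conn)
mydata.query()
written = mydata.write_to_db("clean")
```

Keeping the query, the connection, and the result on one object means each dataset in a project has a single place where it is loaded and saved, which tends to cut down on scattered pickle dumps and ad hoc joins.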

We are building Sclera, an extensible SQL engine that enables you to push your analytics operations into a SQL query. The idea is to tame the code complexity through a declarative interface to analytics libraries. You can add your own libraries using the Sclera Extensions SDK. http://www.scleradb.com/doc/sdk/sdkintro

From the FAQ: http://www.scleradb.com/doc/info/faq#i-am-an-analytics-consu... why-do-i-need-sclera

> Specifically, Sclera separates the analytics logic from the processing and data access. The analytics logic is specified declaratively as SQL queries with Sclera’s analytics extensions. This is just a few lines of code, which can be changed easily. The analytics libraries, database systems and external data sources form their own modules and are separated from the analytics logic. The analytics queries are compiled by Sclera into optimized workflows that dynamically tie everything together.

Hey Elliott, it might be worth checking out Warewolf ESB. It's a visual programming platform built on flow-based programming principles. It's primarily a service bus, but for your needs it will really help you move away from the "spaghetti code" and into a more modular, visual application. It's open source and free:

Compiled version: http://warewolf.io
Source code from GitHub: https://github.com/Warewolf-ESB/Warewolf-ESB

not a python dev, but:

- python probably has a lib like underscore (reduce map filter etc), could help

- check out the quake source code, any version. it's huge and the entire thing is not only readable but possibly a work of art.

- have you tried lambdas? to some it's more readable.. ex:

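    # sieve out multiples of 2..7, leaving the primes below 50
    nums = list(range(2, 50))
    for i in range(2, 8):
        # list() forces evaluation now, while i still has this iteration's value
        nums = list(filter(lambda x: x == i or x % i, nums))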
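
For comparison, here is a sketch of the same filtering step written as a list comprehension, which some readers find easier to scan than filter with a lambda. The function name `sieve` and its parameters are made up for illustration.

```python
def sieve(limit, max_divisor):
    """Drop multiples of each divisor in 2..max_divisor, keeping the divisor itself."""
    nums = list(range(2, limit))
    for i in range(2, max_divisor + 1):
        # The comprehension is evaluated immediately, so i is never captured late.
        nums = [x for x in nums if x == i or x % i != 0]
    return nums


primes = sieve(50, 7)
```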
personally, when i have too complex a process i like to go more functional, ex:

that allows me to focus only on building one step and still have readable code.
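
One way to sketch this "one step at a time" style in Python, assuming each processing step is a small pure function and the pipeline is simply their composition. The step names (`drop_missing`, `to_floats`, `normalize`) and the helper `pipeline` are invented for illustration.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right into a single function."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def drop_missing(rows):
    # Remove entries that were never recorded.
    return [r for r in rows if r is not None]


def to_floats(rows):
    return [float(r) for r in rows]


def normalize(rows):
    # Scale values into [0, 1]; assumes at least two distinct values.
    lo, hi = min(rows), max(rows)
    return [(r - lo) / (hi - lo) for r in rows]


prepare = pipeline(drop_missing, to_floats, normalize)
result = prepare(["2", None, "4", "6"])
```

Each step can be written and tested on its own, and reordering or removing a step only changes the single line that builds `prepare`.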

many game-devs prefer breaking their project into many tiny files with a specific purpose instead of spaghetti, ex:

it's also a bit easier to nav around the project and make sense of it this way. you might want to check out rust or D or F or another lang also.

Definitely see what you're saying with the prepare_data1()...prepare_data(2). Maybe a more functional approach would be suitable. At the end of the day, data goes into a tunnel and comes out in a different form, into a machine learning model. So perhaps I should look at functional programming styles in python. Thanks!

Python has built-in map and filter. Reduce is in functools. Itertools adds lazy helpers such as filterfalse and takewhile.

Between builtins, itertools, and functools, you pretty much have it covered.
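
In Python 3, map and filter are built-ins, reduce lives in functools, and itertools supplies lazy helpers such as chain and islice. A short sketch, with arbitrary sample values:

```python
from functools import reduce
from itertools import chain, islice

values = [3, 8, 1, 6]

squares = list(map(lambda x: x * x, values))           # built-in map
evens = list(filter(lambda x: x % 2 == 0, values))     # built-in filter
total = reduce(lambda acc, x: acc + x, values, 0)      # functools.reduce
first_three = list(islice(chain(values, squares), 3))  # itertools helpers
```

map and filter return lazy iterators in Python 3, so they are wrapped in list() here to materialize the results.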

Excuse my ignorance here: so are these functional programming paradigms?

Thought so - python is a rly nice lang

Has somebody reviewed your code and called it spaghetti, or is it your own opinion?

If it's your own opinion, then it's possible you're being unduly harsh on your own work. Perhaps you can publish it, or a suitable equivalent, on GitHub and ask people here for code reviews.
