
Ask HN: I am a data analyst and my code is a mess - elliott34
I have been wondering for a while whether I'd get laughed at for asking this question, but it's gotten to the point where I really need some guidance.

I have a spaghetti code problem. I am a data scientist/analyst, and my day to day is entirely in Python/scikit-learn/pandas: data munging and running models. Right now my code is several hundred lines of data processing steps, filtering, lots and lots of joins and SQL queries, pickle dumps and loads, and print array.shape calls. I try to create as many functions as possible to help organize the code, and put different parts of the project into different scripts. I use IPython Notebook on the cloud for the interactive portion of my analysis, and Sublime Text 2 for the fixed data processing scripts.

Long story short, I have a physics background and was never taught how to properly structure my workflow for this type of coding. Should I be creating more classes and objects?

Are there any resources out there on how to code and structure large machine learning projects like this? Or is it doomed to be spaghetti code?
======
valarauca1
The rule of thumb that most people stick with when doing OOP is that duplicate
code is bad.

The goal is to find data that needs to be grouped, and group it. Find
functions that only use that grouped data, and stick them in classes.

For example, a query can be an object, e.g. a database connection (in Java):

    
    
           public class DBConnect
           {
                   // Connection and mkConnection are placeholders for your driver's API
                   private Connection connection = null;
    
                   public DBConnect(String ip, int port)
                   {
                            this.connection = mkConnection(ip, port);
                   }
    
                    public Object query(String query)
                    {
                              return this.connection.ExecQuery(query);
                    }
             }
    

Then your query-specific pre-processing code can be added directly into the
query method.

    
    
                     public String query(String query, String regex)
                     {
                             return this.connection.ExecQuery(query).replaceAll(regex, "");
                     }
    

Which results in code like

    
    
                      DBConnect db = new DBConnect("127.0.0.1", 150);
                      String[] queries = { "yada", "yada", "yada" };
                      for(String str: queries)
                      {
                             String result = db.query(str, "\\s+");
                             doDataScience(result);
                      }
    

I don't know if this helps, but it's a suggestion.
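Since the question is about Python, the same grouping idea can be sketched there as well. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module with an in-memory database as a stand-in for a real driver, and the `DBConnect` class, its `execute` and `query` methods, and the `readings` table are all made up for the example.

```python
import re
import sqlite3


class DBConnect:
    """Groups a connection with the functions that only use that connection."""

    def __init__(self, database=":memory:"):
        # sqlite3 stands in for whatever database driver you actually use
        self.connection = sqlite3.connect(database)

    def execute(self, sql, params=()):
        """Run a statement that returns no rows (CREATE, INSERT, ...)."""
        with self.connection:  # commits on success, rolls back on error
            self.connection.execute(sql, params)

    def query(self, sql, regex=None):
        """Run a SELECT; optionally strip text matching regex from string columns."""
        rows = self.connection.execute(sql).fetchall()
        if regex is None:
            return rows
        pattern = re.compile(regex)
        return [
            tuple(pattern.sub("", v) if isinstance(v, str) else v for v in row)
            for row in rows
        ]


# Toy data, invented for the example
db = DBConnect()
db.execute("CREATE TABLE readings (label TEXT, value REAL)")
db.execute("INSERT INTO readings VALUES (?, ?)", ("  run  a ", 1.5))
db.execute("INSERT INTO readings VALUES (?, ?)", ("run b", 2.5))

# The whitespace-stripping step lives next to the query, as in the Java version
cleaned = db.query("SELECT label, value FROM readings ORDER BY value", regex=r"\s+")
```

The point is the same as above: the cleanup that always goes with a query sits in one place, so the analysis script only sees `db.query(...)`.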

P.S.: I've been spending my free nights the past 2 weeks trying to throw
together a javascript based data processing engine in java. It should be
mostly workable by the weekend. I could throw it on a ShowHN if you'd be
interested.

~~~
elliott34
Sure! I'd love to take a look. I've actually written a mysqldb library in
python that makes "data objects"

For example:

    mydata = dataobject(query, connectionstring())
    mydata.query()         # get data
    mydata.data            # pandas data frame
    mydata.write_to_db()   # write to db method

etc.

Sounds similar to what you're doing here, although I don't know Java...

------
yorp
We are building Sclera, an extensible SQL engine that enables you to push your
analytics operations into a SQL query. The idea is to tame the code complexity
through a declarative interface to analytics libraries. You can add your own
libraries using the Sclera Extensions SDK.
[http://www.scleradb.com/doc/sdk/sdkintro](http://www.scleradb.com/doc/sdk/sdkintro)

From the FAQ:
[http://www.scleradb.com/doc/info/faq#i-am-an-analytics-consu...](http://www.scleradb.com/doc/info/faq#i-am-an-analytics-consultant-why-do-i-need-sclera)

> Specifically, Sclera separates the analytics logic from
the processing and data access. The analytics logic is specified declaratively
as SQL queries with Sclera’s analytics extensions. This is just a few lines of
code, which can be changed easily. The analytics libraries, database systems
and external data sources form their own modules and are separated from the
analytics logic. The analytics queries are compiled by Sclera into optimized
workflows that dynamically tie everything together.

------
Warewolf-ESB
Hey Elliott, it might be worth checking out Warewolf ESB - it's a visual
programming platform with flow-based programming principles. It's primarily a
service bus, but for your needs it will really help you move away from the
"spaghetti code" and into a more modular, visual application. It's open source
and free:

Compiled version: [http://warewolf.io](http://warewolf.io)

Source code from GitHub: [https://github.com/Warewolf-ESB/Warewolf-ESB](https://github.com/Warewolf-ESB/Warewolf-ESB)

------
mc_hammer
not a python dev, but:

\- python probably has a lib like underscore (reduce map filter etc), could
help

\- check out the quake source code, any version, it's huge and the entire thing
is not only readable but possibly a work of art.

\- have you tried lambdas? to some its more readable.. ex:

    
    
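        # sieve out the composites: what's left is the primes below 50
        nums = list(range(2, 50))
        for i in range(2, 8):
            # list() forces evaluation now, so each pass uses the current i
            nums = list(filter(lambda x: x == i or x % i, nums))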
    

personally when i have too complex process i like to go more functional, ex:

    
    
        def main():
            # each step is its own small function
            prepare_data1()
            prepare_data2()
            do_long_stuff()
            nextstep()
    

that allows me to focus only on building one step at a time and still have
readable code.
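a rough sketch of that step-by-step shape in Python, assuming toy data and made-up step names (`load_rows`, `drop_missing`, `parse_values`, `summarize` are not from any library). Each step takes data in and returns new data, so it can be built and tested on its own:

```python
def load_rows():
    # Hypothetical stand-in for a SQL query or file read
    return [
        {"name": "a", "value": "3"},
        {"name": "b", "value": None},
        {"name": "c", "value": "5"},
    ]


def drop_missing(rows):
    """Keep only rows that have a value."""
    return [row for row in rows if row["value"] is not None]


def parse_values(rows):
    """Convert the value field from text to float, without mutating the input."""
    return [{**row, "value": float(row["value"])} for row in rows]


def summarize(rows):
    """Reduce the cleaned rows to the numbers the analysis needs."""
    values = [row["value"] for row in rows]
    return {"count": len(values), "mean": sum(values) / len(values)}


def main():
    # The top level reads like a table of contents for the analysis
    rows = load_rows()
    rows = drop_missing(rows)
    rows = parse_values(rows)
    return summarize(rows)


if __name__ == "__main__":
    print(main())
```

because no step depends on hidden state, any one of them can be rerun in a notebook cell with a small hand-made input.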

many game-devs prefer breaking their project into many tiny files with a
specific purpose instead of spaghetti, ex:

    
    
        file.py
        parser.py
        display.py
        function1.py
        function2.py
    

it's also a bit easier to nav around the project and make sense of it this way.
you might want to check out rust or D or F or another lang also.

~~~
TheLoneWolfling
Python has built-in map and filter. Reduce is in functools. itertools adds
lazy variants such as filterfalse.

Between builtins, itertools, and functools, you pretty much have it covered.
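A small illustration, assuming Python 3 (where `map` and `filter` are builtins that return lazy iterators, and `reduce` is imported from functools). The `values` list is just toy data:

```python
from functools import reduce
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]

squares = list(map(lambda x: x * x, values))        # builtin map
evens = list(filter(lambda x: x % 2 == 0, values))  # builtin filter
total = reduce(lambda acc, x: acc + x, values, 0)   # functools.reduce
running = list(accumulate(values))                  # itertools: running totals
```

For simple cases like the first two, a list comprehension (`[x * x for x in values]`) is often considered more readable in Python.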

~~~
elliott34
Excuse my ignorance here: so are these functional programming paradigms?

~~~
TheLoneWolfling
Generally speaking, yes.

(Look here:
[https://en.wikipedia.org/wiki/Higher-order_function#General_...](https://en.wikipedia.org/wiki/Higher-order_function#General_examples) )

------
lovelearning
Has somebody reviewed your code and called it spaghetti, or is it your own
opinion?

If it's your own opinion, then it's possible you're being unduly harsh on your
own work. Perhaps you can publish it - or a suitable equivalent - on GitHub
and ask people here for code reviews.

