
Ask HN: What do I need to learn how to write a scalable app? - stonecharioteer
I'm a self-taught programmer. I mostly code in Python. I have built a couple of fairly well-received applications at my workplaces, using Flask and either React or Plotly's Dash framework for the UI. However, I am facing a problem: my work does not involve much data. I write tools that are used to automate tasks, and visitors number below 10 to 20 at any given time, even though my company is a large fintech company. But I want to learn. I want to build something that involves billions of rows of data. I want to know how to speed it up.

Right now, I am building an application that has 2 million rows of data. It is centred around a lot of analytics on team productivity. I would like to make it faster. I am already using Redis as a cache, my database is Postgres, I select only the columns I need using Flask-SQLAlchemy, and I have already cut down a lot of the processing time. But I am certain this can be faster. Where do I learn the design patterns for this? What do I use to figure out how to improve this application? How do I measure performance? How do I learn to build apps that achieve response times under 300ms?
======
pritambarhate
>> I am building an application that has 2 million rows of data.

It would also be helpful to know what kind of hardware you are using for the
database. How much RAM does the machine have? Does the whole dataset fit in
RAM? If not, at least the indexes should fit in RAM.

Disk type: HDD/SSD?

Are you using indexes?

Did you check "explain analyze"[1] to check your query plans to ensure that
the indexes are being used or not?
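
For instance, a minimal sketch of checking a plan from Python with SQLAlchemy
(the DSN, table, and column are hypothetical):

      from sqlalchemy import create_engine, text

      engine = create_engine("postgresql://localhost/analytics")  # hypothetical DSN

      with engine.connect() as conn:
          # EXPLAIN ANALYZE runs the query and reports the actual plan;
          # a "Seq Scan" on a big table is a hint that an index is
          # missing or not being used.
          plan = conn.execute(text(
              "EXPLAIN ANALYZE SELECT * FROM commits WHERE author_id = :aid"
          ), {"aid": 42})
          for row in plan:
              print(row[0])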

There are many things that affect database performance.

>> I want to build something that involves billions of rows of data. I want to
know how to speed it up.

The key to this is a "shared nothing" architecture along with a sharded
database. I have tried to explain this architecture here [2]. To understand
database sharding, I would recommend this DigitalOcean article [3].
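
As a toy illustration of the sharding idea (the connection strings are
hypothetical), each shard owns a disjoint slice of the data, chosen here by
hashing the key:

      import hashlib

      # Hypothetical connection strings, one per shard.
      SHARDS = [
          "postgresql://shard0/app",
          "postgresql://shard1/app",
          "postgresql://shard2/app",
      ]

      def shard_for(key: str) -> str:
          """Route a key to its shard; no node coordinates with the others."""
          digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
          return SHARDS[digest % len(SHARDS)]

      print(shard_for("user:42"))  # the same key always maps to the same shard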

To learn more about highly scalable architectures, I would suggest reading
the real-world architectures section [4] of the High Scalability blog.

[1] https://thoughtbot.com/blog/reading-an-explain-analyze-query-plan

[2] https://mobisoftinfotech.com/resources/mguide/shared-nothing-architecture-the-key-to-scale-your-app-to-millions-of-users/

[3] https://www.digitalocean.com/community/tutorials/understanding-database-sharding

[4] http://highscalability.com/blog/category/example

~~~
protonimitate
> Did you run "explain analyze"[1] on your queries to check the query plans
> and ensure that the indexes are actually being used?

Word of caution: `explain analyze` actually executes the query and then
returns the execution plan. Plain `explain` will give you the estimated plan
without executing anything (although it is usually less accurate than the
analyze counterpart).

Learned that one the hard way.
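
So when a query is expensive (or has side effects), a safer sketch is to ask
for the plan only (hypothetical DSN and table):

      from sqlalchemy import create_engine, text

      engine = create_engine("postgresql://localhost/analytics")  # hypothetical DSN

      with engine.connect() as conn:
          # Plain EXPLAIN only plans the query; nothing is executed.
          for row in conn.execute(text("EXPLAIN SELECT * FROM commits")):
              print(row[0])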

------
codegeek
One thing that I have learned with large datasets is to add pagination to
queries/APIs, and to always do it server side. Client-side pagination is not
good enough for large datasets.

Also, in general, always query less data rather than more, and then request
more as needed. For example, let's say you are loading a table of users. You
could have 2 million users in the database. Do you really want to query all 2
million at once, even if you have optimized the code and hardware for it? I
would say no. Query maybe a few hundred at once, then query more as needed.
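
A sketch of server-side pagination using the keyset approach with SQLAlchemy
(the `User` model is hypothetical):

      from sqlalchemy import Column, Integer, String, select
      from sqlalchemy.orm import declarative_base

      Base = declarative_base()

      class User(Base):  # hypothetical model
          __tablename__ = "users"
          id = Column(Integer, primary_key=True)
          name = Column(String)

      def page_of_users(session, after_id=0, page_size=200):
          """Keyset pagination: filter on the last id already served
          instead of using OFFSET, so the database never rescans rows
          from earlier pages."""
          stmt = (
              select(User)
              .where(User.id > after_id)
              .order_by(User.id)
              .limit(page_size)
          )
          return session.execute(stmt).scalars().all()

The client just passes back the last id it saw to get the next page.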

~~~
stonecharioteer
Thank you. I don't think pagination would help since I'm building a dashboard.

Perhaps if I explain what I'm trying to build it would help.

I am building a git analytics dashboard. I want to help my organization
analyse developer productivity from Bitbucket, so I am pulling in all the git
commit logs and the diffs for every project my org has within the company. We
have 1500 repos across 20 projects, and I am building dashboards on top of
this. The data will grow daily. I would like to showcase project health in
terms of commits, pull requests, and whether there is duplicated work
(surprisingly, we have this: actual repos being copied from here to there,
btw, insane shit). In doing this I realized I am struggling to scale it. The
UI takes 2 minutes to load. It is abysmal. The org is impressed but I am not,
and it is for my own learning.

~~~
Foober223
Dashboards with tons of charts are all about pre-computing. In the RDBMS
world you abandon fully normalized tables and use a denormalized star schema.
A star schema is basically a fact table plus dimension tables that store
pre-computed data.

Sometimes it's a completely separate database just for reporting, with new
daily data being fed in all the time and results pre-computed.

These systems tend to run queries against the past. It's accepted that you
may be a day behind, but that little concession to querying only the past
allows all sorts of great tricks for instantaneous reporting against huge
data sets, making it seem like your reports have been blessed by the tears of
the performance gods.
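
A minimal sketch of that pre-computing idea with a Postgres materialized
view, refreshed by a nightly job (the DSN and schema are hypothetical):

      from sqlalchemy import create_engine, text

      engine = create_engine("postgresql://localhost/analytics")  # hypothetical DSN

      with engine.begin() as conn:
          conn.execute(text("""
              CREATE MATERIALIZED VIEW IF NOT EXISTS daily_commit_stats AS
              SELECT repo_id,
                     date_trunc('day', committed_at) AS day,
                     count(*)                        AS commits,
                     count(DISTINCT author_id)       AS authors
              FROM commits
              GROUP BY repo_id, day
          """))
          # Run nightly: dashboards read the small pre-aggregated view
          # instead of scanning the raw commit log, at the cost of being
          # up to a day behind.
          conn.execute(text("REFRESH MATERIALIZED VIEW daily_commit_stats"))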

Even Google search is "cheating" by searching against the past, not the
present. Could you imagine how slow it would be if your Google search query
triggered a web crawler in real time to scour the internet? Yeah, you might
be waiting several days for your crawl to complete.

------
DevX101
Read this book:

[Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)

------
st1ck
I'm in no way a DBA, but if your main use case is analytics (OLAP) and
updates are infrequent, it's common to use a column-oriented DBMS. Postgres
has cstore_fdw, but you can also use others: DuckDB, MonetDB, and ClickHouse
among FOSS options, plus quite a few well-known proprietary ones.

That said, 2M rows is not really a lot of data (unless the rows are huge). If
you don't need to worry about updates, you can just load everything into
memory, e.g. into a Pandas DataFrame (large overhead, slow, many features) or
a more efficient implementation like `datatable` (lower overhead, faster,
fewer features).
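
For example, a sketch of the load-everything-into-memory approach (the DSN
and table name are hypothetical):

      import pandas as pd
      from sqlalchemy import create_engine

      engine = create_engine("postgresql://localhost/analytics")  # hypothetical DSN

      # A couple of million rows of commit metadata usually fit in RAM.
      df = pd.read_sql_table("commits", engine)

      # Aggregations over an in-memory frame typically take milliseconds.
      commits_per_repo = df.groupby("repo_id").size()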

I also recently discovered BI tools (or rather, realized that despite the
name they don't have to apply to business data). E.g. Metabase provides a
nice UI for uncomplicated analytical queries (like SELECT avg(...) GROUP BY
...). So if that fits 90% of your queries, maybe you get the frontend for
free and only need to work on the backend (and the remaining 10% of queries).

------
juangacovas
Where exactly is your bottleneck right now? Database queries? I mean, where
do those >300ms go: on each request, or just on some of them?

~~~
stonecharioteer
Definitely the database. How would I begin speeding that up?

~~~
juangacovas
Two million rows in a table is not a problem even for old databases (I'm
thinking of MySQL 5.1)... Are you confident in your indexes, normalization,
etc.?

~~~
stonecharioteer
I definitely will look at the indices; I have several joins. For context,
it's the same git analytics dashboard I described in my reply to codegeek
above: 1500 repos across 20 projects, daily-growing data, and a UI that takes
2 minutes to load.

~~~
LarryMade2
Optimise your queries as well: complex queries with multiple many-to-many
joins quickly add up to tons of rows to filter. Look at splitting queries up.
To keep it all in a single query, the DB may have to churn through tens of
thousands of rows, whereas if you do a first select and then a second based
on the result of the first, you can bring that down to just hundreds or tens,
with a quicker response than the single query.
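
A sketch of that two-step idea with SQLAlchemy (the DSN, tables, and filter
are hypothetical):

      from sqlalchemy import bindparam, create_engine, text

      engine = create_engine("postgresql://localhost/analytics")  # hypothetical DSN

      with engine.connect() as conn:
          # Step 1: a cheap query that narrows things down to a few repos.
          repo_ids = conn.execute(
              text("SELECT id FROM repos WHERE project = :p"), {"p": "payments"}
          ).scalars().all()

          # Step 2: fetch commits for just those repos, instead of one big
          # join that churns through the whole commits table.
          commits = conn.execute(
              text("SELECT * FROM commits WHERE repo_id IN :ids")
              .bindparams(bindparam("ids", expanding=True)),
              {"ids": repo_ids},
          ).all()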

------
Jugurtha
Since you mention you're using Flask: before diving into data-intensive
stuff, if need be, a lot can be done by profiling, refactoring, and improving
the actual code:

- Profiling and refactoring Python code in general:

Using yappi[^1] to profile and generate profile data, and KCachegrind[^2] to
visualize that data in the form of call graphs, number of cycles, etc., can
yield great results. You can find which functions in your code base are
taking too long, and this gives great pointers to where the bottlenecks are.
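
A minimal sketch of that workflow (the profiled function is a placeholder):

      import yappi

      def run_expensive_report():
          # Placeholder for the code path you want to inspect.
          sum(i * i for i in range(10_000_000))

      yappi.start()
      run_expensive_report()
      yappi.stop()

      # Save in callgrind format so KCachegrind can render call graphs.
      yappi.get_func_stats().save("callgrind.report", type="callgrind")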

Using _pyreverse_ [^3], now integrated into _pylint_ [^4], to generate, say,
a PNG image of the class hierarchy and "UML diagrams" is extremely helpful.
When I have used it and saw the arrows going all over the place, it helped me
eke out better abstractions, remove a lot of code, write cleaner interfaces,
and frankly write code I and others could actually read.

After installing _pylint_, run it at the package level. Say the package name
is foo and it follows the standard hierarchy with `foo/foo`:

      cd foo
      pyreverse -o png .
      # generates classes.png and packages.png
      # You can also use: pyreverse -o png foo

- Profiling in the context of Flask:

Using Werkzeug's ProfilerMiddleware[^5] helps you see what's going on with
each request: which functions are called, the number of calls, the total
time, the time per call, which line, etc.

If the example in the documentation does not work, try the following:

      ...
      try:
          from werkzeug.middleware.profiler import ProfilerMiddleware
      except ModuleNotFoundError:
          # Older Werkzeug versions
          from werkzeug.contrib.profiler import ProfilerMiddleware

      ...
      # Assuming you have an app object
      app.config['PROFILE'] = True
      app.wsgi_app = ProfilerMiddleware(app.wsgi_app, restrictions=[50])

General things: it is very helpful to extract as much code as possible from
the routes (see the sketch below). This helps make the functions usable
elsewhere, without relying on Flask's testing client, which can be pretty
frustrating when dealing with the app context, especially in test suites
involving database actions and weird connections in setUp and tearDown if
you're using unittest.
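
A sketch of what that extraction looks like (the names are hypothetical):

      from flask import Flask, jsonify

      app = Flask(__name__)

      def team_velocity(team_id):
          """Plain function holding the business logic: it can be called,
          tested, and profiled without a request or an app context."""
          return {"team": team_id, "commits_per_week": 42}  # stand-in logic

      @app.route("/teams/<int:team_id>/velocity")
      def velocity_view(team_id):
          # The route stays a thin shim over the plain function.
          return jsonify(team_velocity(team_id))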

As I said, this is general and not very specific to "big data" or "billions
of rows", but these small things lead to bigger things, in my opinion: making
the code easier to read and extend, easier to test and cover, and easier to
profile and improve compounds to a point where you may be able to postpone
more involved approaches.

[^1]: [https://github.com/sumerc/yappi](https://github.com/sumerc/yappi)

[^2]: [https://kcachegrind.github.io/](https://kcachegrind.github.io/)

[^3]: [https://www.logilab.org/blogentry/6883](https://www.logilab.org/blogentry/6883)

[^4]: [https://github.com/PyCQA/pylint](https://github.com/PyCQA/pylint)

[^5]: [https://werkzeug.palletsprojects.com/en/1.0.x/middleware/profiler/](https://werkzeug.palletsprojects.com/en/1.0.x/middleware/profiler/)

------
davismwfl
Your first question is how to make what you are working on now faster.
Follow a process to find the bottlenecks and the places where you can make
improvements. I always like to start at the data layer myself and work my way
up the stack.

- If using an ORM, dump all the queries it is creating on your behalf (with
SQLAlchemy, passing `echo=True` to `create_engine` logs every statement).
Check how they compare to what you would hand-write, and look at the
execution plans.

- Tune the queries, reduce joins where possible, add/remove indexes, etc.
Sometimes removing an index helps more than adding new ones, if
inserts/updates are a bottleneck. Don't forget to check these.

- In the application layer, check all your translations/transformations, and
measure and tune those. These can add a lot of time if called repeatedly for
many records.

- At the UI level, find places where there is a lot more going on than needs
to be. I very often find places where far more data is being requested than
needed; then you have to filter through it all, and if nothing else it adds
memory pressure, which can lead to hard-to-find performance issues.

- Caches are awesome, but sometimes they just add complexity and delays where
you don't need them. Small datasets usually do really well with just some
materialized views, data partitions, and query tuning. I have actually
removed caches and improved performance after a little query work, so don't
ignore this option. I wouldn't say that is the norm, but it does happen.

- Denormalize in places to remove joins where they are not strictly
necessary (see the sketch after this list). When working on a system that had
large marketing datasets, I found that by caching a few fields on a heavily
used table, we could eliminate many joins in the application that appeared on
numerous pages (common queries), in some cases giving us 25x improvements. Of
course, that adds a little extra time and consideration on inserts/updates,
but that tradeoff is usually worth it if you are read/query heavy.
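
A sketch of that denormalization trade-off (the DSN and schema are
hypothetical):

      from sqlalchemy import create_engine, text

      engine = create_engine("postgresql://localhost/analytics")  # hypothetical DSN

      with engine.begin() as conn:
          # Redundantly store the author's name on each commit row so the
          # hot read path can skip the commits -> users join entirely.
          conn.execute(text(
              "ALTER TABLE commits ADD COLUMN IF NOT EXISTS author_name text"
          ))
          conn.execute(text(
              "UPDATE commits SET author_name = u.name "
              "FROM users u WHERE u.id = commits.author_id"
          ))
      # Every insert/update must now also maintain author_name, which is
      # usually worth it on read-heavy workloads.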

As for learning to deal with large datasets: it starts with all of the above.
You need to really tune the data and queries at the database level first.
Postgres is awesome, and relational data is great for a ton of things, but
there may be a reason to store some data in non-relational ways (a cache is a
great example). This means you might have replicated data in places, which
adds a little complexity but aids performance. Always consider how the data
is going to be accessed and used by the application(s). Joins are an example
of a place where small changes can result in orders-of-magnitude better
performance.

Large data means small details are way more important, and what I have seen a
lot of the time is that people don't pay attention to the little details when
working with smaller datasets. That's pretty fair most of the time, because
it really doesn't affect performance on small datasets, but when you get to
large data, all those details matter. Even little things like picking the
right datatype become more critical with large datasets: you don't want to
store a number as a string and constantly convert it, for example, as that
adds a lot of wasted cycles. On small data you would rarely see a problem
with it, which often makes people complacent with their choices.

