Data Structures for Data-Intensive Applications [pdf] (2023)

pgwhalen · 2024-02-09T23:48:43 1707522523

I've only had a chance to skim so far, but this is an absolutely superb survey of an enormous space. It doesn't just list out a bunch of data structures, it actully helps you build a mental model of all the concerns to think about when building or using data structures in an application.

3abiton · 2024-02-10T00:51:44 1707526304

That's the recap I needed!

PH95VuimJjqBqy · 2024-02-10T14:20:15 1707574815

Easily one of the best technical books I've ever read.

petra · 2024-02-10T06:25:39 1707546339

One of the authors of this book runs a research lab in this field.

They have this cool tool that can help one design an optimal data structure:

http://daslab.seas.harvard.edu/datacalculator/

5Qn8mNbc2FNCiVV · 2024-02-10T07:10:15 1707549015

I can't seem to find where the actual tool is

skavi · 2024-02-10T11:33:54 1707564834

If you manage to click on the correct image, you’ll get to this page [0]. Then you can find a link to the “interactive demo” [1]. Weirdly well hidden.

[0]: http://daslab.seas.harvard.edu/projects/evolution/

[1]: http://sigmod.dyndns.org:8000/app/#

kinow · 2024-02-10T11:01:12 1707562872

Me too, some text, links, but no actual code or link to download.

hermitcrab · 2024-02-10T10:40:31 1707561631

Same here.

bbsz · 2024-02-10T12:27:59 1707568079

More recommendations on this topic? Paper's great. I also know about Martin Klepmann's Designing data intensive applications but it's less about data structures and more about databases.

phyrex · 2024-02-10T13:18:46 1707571126

I got a lot of value out of “Algorithms and Data Structures for Massive Datasets” (https://www.manning.com/books/algorithms-and-data-structures...)

thesz · 2024-02-10T18:56:36 1707591396

It does not mention Array of Structures and Structure of Arrays dichotomy, which is extremely important if you design a structure to hold data for analytics of any kind.

Jtsummers · 2024-02-11T00:32:47 1707611567

6.1 talks about tradeoffs and reasons for row vs column oriented storage. So they do discuss it, just not in the array of struct and struct of array terms.

Octokat · 2024-02-09T19:23:54 1707506634

Thank you for sharing this!

Ozzie_osman · 2024-02-10T05:43:35 1707543815

I'd love to buy a copy but it's $100 on Amazon.

willsmith72 · 2024-02-10T06:14:34 1707545674

I'm still waiting for someone to disrupt and un-amazon the book industry. Authors lose, readers lose. It's broken

eimrine · 2024-02-10T08:06:24 1707552384

Torrents can do this, also offline book shops who sell used books.

lgkk · 2024-02-10T01:06:03 1707527163

This is a good read

russfink · 2024-02-10T04:02:40 1707537760

Needs a TOC

csbartus · 2024-02-10T10:58:38 1707562718

Opening with Firefox gives a full TOC: https://imgur.com/a/cgdy0nY

albert_e · 2024-02-10T05:43:09 1707543789

I uploaded PDF and asked ChatGPT 4 for ToC ..it is really struggling to process this.

Even after asking it to ignore page headers and footers.

I thought the state of the art was much better?

082349872349872 · 2024-02-10T08:40:37 1707554437

By hand (5 min):

  I Introduction                                          2
  II Performance Metrics and Operational Tradeoffs        9
  III Dimensions of the Data Structure Design Space      21
  IV From Workloads to Data Structures                   67
  V Adaptivity: Evolving Data Structures to a Workload   93
  VI Data Structures for Specific Application Domains   109
  VII Challenging Design Considerations                 124
  VIII Summary                                          132

(along the way I saw some interesting figures, so, for me, this exercise was worth the time)

albert_e · 2024-02-10T12:37:18 1707568638

Yes I was going to do the same but wanted to learn a general way to do this task along the way too and get more details (second level sections as well).

======================

Step 1:

Use pdftk to extract metadata

> pdftk input.pdf dump_data output > raw.txt

Step 2:

copy relevant sections of metadata "Bookmarks" into a new tetx file.

Step 3:

Upload this text file and ask ChatGPT to format it as you want -- markdown / HTML / plain text.

Took maybe 10 minutes and this was also worth it for a different reason (learning one more bit about pdftk capabilities).

=======================

CONTENTS

    1. Introduction (2)

    1.1. Data Structures Are Foundational (2)
    1.2. Tradeoffs in Data Structure Design (4)
    1.3. Audience & Prerequisites (7)
    1.4. Learning Outcomes (8)
    1.5. Overview of the Book (8)

    2. Performance Metrics and Operational Tradeoffs (9)
    2.1. Memory Hierarchy (9)
    2.2. From Read/Update to RUM: Memory & Space Costs (10)
    2.3. RUM Performance Costs (11)
    2.4. From RUM to PyRUMID (14)
    2.5. Chapter Summary (17)
    2.6. Questions (18)
    2.7. Further Readings (19)

    3. Dimensions of the Data Structure Design Space (21)
    3.1. Global Data Organization (24)
    3.2. Global Search Algorithms When Not Using an Index (26)
    3.3. Search When Using an Index (32)
    3.4. Local Data Organization (46)
    3.5. Local Search (48)
    3.6. Modification Policy: In-place vs. Out-of-place (48)
    3.7. Buffering (53)
    3.8. Key-Value Representation (55)
    3.9. Summary of the Design Space Dimensions (57)
    3.10. Data Structure Design Expert Rules (58)
    3.11. Chapter Summary (61)
    3.12. Questions (61)
    3.13. Further Readings (64)

    4. From Workloads to Data Structures (67)
    4.1. Point and Range Queries, Modifications, but Rare Scans (Traditional B+-trees and Learned Tree Indexes) (68)
    4.2. Similar Workload With a Working Set That Fits in Memory (Fractal B+-trees) (70)
    4.3. Point and Range Queries, Rare Scans, More Modifications (Insert-Optimized Search Trees) (70)
    4.4. Mixed Workload With No Short Range Queries (Hybrid Range Trees) (75)
    4.5. Mixed Workload, With Ever Increasing Data Size (Radix Trees) (78)
    4.6. Point Queries, Inserts, and Some Modifications (Static Hashing with Overflow Pages) (79)
    4.7. Read-mostly With Long Range Queries (Scans with Zonemaps) (81)
    4.8. Modification-intensive With Point and Range Queries (LSM-tree) (82)
    4.9. Modification-intensive With Point Queries Only (LSM-hash) (84)
    4.10. When to Design Heterogeneous Data Structures (85)
    4.11. Data Structures in Practice (86)
    4.12. Chapter Summary (88)
    4.13. Questions (89)
    4.14. Further Readings (92)

    5. Adaptivity: Evolving Data Structures to a Workload (93)
    5.1. Design Dimension: Reorganization Aggressiveness (96)
    5.2. Adaptivity for Frequently Accessed Data (96)
    5.3. Adaptivity for Value-Organized Data (97)
    5.4. Aggressiveness of Adaptivity during Initialization (100)
    5.5. Partial Adaptive Indexing (101)
    5.6. Adaptive Modifications (102)
    5.7. Adaptivity and Concurrency (103)
    5.8. Adaptivity Metrics (104)
    5.9. Open Topics (105)
    5.10. Chapter Summary (106)
    5.11. Questions (107)

    6. Data Structures for Specific Application Domains (109)
    6.1. Data Structures in Relational Database Systems (109)
    6.2. File Systems & Memory Management (117)
    6.3. Data Structures in Machine Learning Pipelines (118)
    6.4. Cross-System Design Considerations and Tradeoffs (120)
    6.5. Chapter Summary (121)
    6.6. Questions (122)
    6.7. Further Readings (123)

    7. Challenging Design Considerations (124)
    7.1. Concurrency (125)
    7.2. Distributed Systems (127)
    7.3. Emerging Workload Types (127)
    7.4. Hardware Considerations in Data Structure Implementation (128)
    7.5.Chapter Summary (129)
    7.6. Questions (129)
    7.7. Further Readings (130)

    8. Summary (132)
    Acknowledgments (134)
    References (135)

nwareing · 2024-02-09T21:01:20 1707512480

DSDIA ;)

datadrivenangel · 2024-02-09T20:42:16 1707511336

Data structures across distributed systems are challenging, so this is a useful resource.

Still, better if you can keep your application on a single machine if that's feasible.

teraflop · 2024-02-09T21:00:34 1707512434

This paper does not cover distributed systems, except for a very brief mention at the end.

fithisux · 2024-02-10T12:40:08 1707568808

https://arxiv.org/abs/2001.04235

here you are