Hacker News new | past | comments | ask | show | jobs | submit login
Data Structures for Data-Intensive Applications [pdf] (2023) (bu.edu)
583 points by mfiguiere 11 months ago | hide | past | favorite | 26 comments



I've only had a chance to skim so far, but this is an absolutely superb survey of an enormous space. It doesn't just list out a bunch of data structures, it actully helps you build a mental model of all the concerns to think about when building or using data structures in an application.


That's the recap I needed!


Easily one of the best technical books I've ever read.


One of the authors of this book runs a research lab in this field.

They have this cool tool that can help one design an optimal data structure:

http://daslab.seas.harvard.edu/datacalculator/


I can't seem to find where the actual tool is


If you manage to click on the correct image, you’ll get to this page [0]. Then you can find a link to the “interactive demo” [1]. Weirdly well hidden.

[0]: http://daslab.seas.harvard.edu/projects/evolution/

[1]: http://sigmod.dyndns.org:8000/app/#


Me too, some text, links, but no actual code or link to download.


Same here.


More recommendations on this topic? Paper's great. I also know about Martin Klepmann's Designing data intensive applications but it's less about data structures and more about databases.


I got a lot of value out of “Algorithms and Data Structures for Massive Datasets” (https://www.manning.com/books/algorithms-and-data-structures...)


It does not mention Array of Structures and Structure of Arrays dichotomy, which is extremely important if you design a structure to hold data for analytics of any kind.


6.1 talks about tradeoffs and reasons for row vs column oriented storage. So they do discuss it, just not in the array of struct and struct of array terms.


Thank you for sharing this!


I'd love to buy a copy but it's $100 on Amazon.


I'm still waiting for someone to disrupt and un-amazon the book industry. Authors lose, readers lose. It's broken


Torrents can do this, also offline book shops who sell used books.


This is a good read


Needs a TOC


Opening with Firefox gives a full TOC: https://imgur.com/a/cgdy0nY


I uploaded PDF and asked ChatGPT 4 for ToC ..it is really struggling to process this.

Even after asking it to ignore page headers and footers.

I thought the state of the art was much better?


By hand (5 min):

  I Introduction                                          2
  II Performance Metrics and Operational Tradeoffs        9
  III Dimensions of the Data Structure Design Space      21
  IV From Workloads to Data Structures                   67
  V Adaptivity: Evolving Data Structures to a Workload   93
  VI Data Structures for Specific Application Domains   109
  VII Challenging Design Considerations                 124
  VIII Summary                                          132
(along the way I saw some interesting figures, so, for me, this exercise was worth the time)


Yes I was going to do the same but wanted to learn a general way to do this task along the way too and get more details (second level sections as well).

======================

Step 1:

Use pdftk to extract metadata

> pdftk input.pdf dump_data output > raw.txt

Step 2:

copy relevant sections of metadata "Bookmarks" into a new tetx file.

Step 3:

Upload this text file and ask ChatGPT to format it as you want -- markdown / HTML / plain text.

Took maybe 10 minutes and this was also worth it for a different reason (learning one more bit about pdftk capabilities).

=======================

CONTENTS

    1. Introduction (2)

    1.1. Data Structures Are Foundational (2)
    1.2. Tradeoffs in Data Structure Design (4)
    1.3. Audience & Prerequisites (7)
    1.4. Learning Outcomes (8)
    1.5. Overview of the Book (8)

    2. Performance Metrics and Operational Tradeoffs (9)
    2.1. Memory Hierarchy (9)
    2.2. From Read/Update to RUM: Memory & Space Costs (10)
    2.3. RUM Performance Costs (11)
    2.4. From RUM to PyRUMID (14)
    2.5. Chapter Summary (17)
    2.6. Questions (18)
    2.7. Further Readings (19)

    3. Dimensions of the Data Structure Design Space (21)
    3.1. Global Data Organization (24)
    3.2. Global Search Algorithms When Not Using an Index (26)
    3.3. Search When Using an Index (32)
    3.4. Local Data Organization (46)
    3.5. Local Search (48)
    3.6. Modification Policy: In-place vs. Out-of-place (48)
    3.7. Buffering (53)
    3.8. Key-Value Representation (55)
    3.9. Summary of the Design Space Dimensions (57)
    3.10. Data Structure Design Expert Rules (58)
    3.11. Chapter Summary (61)
    3.12. Questions (61)
    3.13. Further Readings (64)

    4. From Workloads to Data Structures (67)
    4.1. Point and Range Queries, Modifications, but Rare Scans (Traditional B+-trees and Learned Tree Indexes) (68)
    4.2. Similar Workload With a Working Set That Fits in Memory (Fractal B+-trees) (70)
    4.3. Point and Range Queries, Rare Scans, More Modifications (Insert-Optimized Search Trees) (70)
    4.4. Mixed Workload With No Short Range Queries (Hybrid Range Trees) (75)
    4.5. Mixed Workload, With Ever Increasing Data Size (Radix Trees) (78)
    4.6. Point Queries, Inserts, and Some Modifications (Static Hashing with Overflow Pages) (79)
    4.7. Read-mostly With Long Range Queries (Scans with Zonemaps) (81)
    4.8. Modification-intensive With Point and Range Queries (LSM-tree) (82)
    4.9. Modification-intensive With Point Queries Only (LSM-hash) (84)
    4.10. When to Design Heterogeneous Data Structures (85)
    4.11. Data Structures in Practice (86)
    4.12. Chapter Summary (88)
    4.13. Questions (89)
    4.14. Further Readings (92)

    5. Adaptivity: Evolving Data Structures to a Workload (93)
    5.1. Design Dimension: Reorganization Aggressiveness (96)
    5.2. Adaptivity for Frequently Accessed Data (96)
    5.3. Adaptivity for Value-Organized Data (97)
    5.4. Aggressiveness of Adaptivity during Initialization (100)
    5.5. Partial Adaptive Indexing (101)
    5.6. Adaptive Modifications (102)
    5.7. Adaptivity and Concurrency (103)
    5.8. Adaptivity Metrics (104)
    5.9. Open Topics (105)
    5.10. Chapter Summary (106)
    5.11. Questions (107)

    6. Data Structures for Specific Application Domains (109)
    6.1. Data Structures in Relational Database Systems (109)
    6.2. File Systems & Memory Management (117)
    6.3. Data Structures in Machine Learning Pipelines (118)
    6.4. Cross-System Design Considerations and Tradeoffs (120)
    6.5. Chapter Summary (121)
    6.6. Questions (122)
    6.7. Further Readings (123)

    7. Challenging Design Considerations (124)
    7.1. Concurrency (125)
    7.2. Distributed Systems (127)
    7.3. Emerging Workload Types (127)
    7.4. Hardware Considerations in Data Structure Implementation (128)
    7.5.Chapter Summary (129)
    7.6. Questions (129)
    7.7. Further Readings (130)

    8. Summary (132)
    Acknowledgments (134)
    References (135)


DSDIA ;)


Data structures across distributed systems are challenging, so this is a useful resource.

Still, better if you can keep your application on a single machine if that's feasible.


This paper does not cover distributed systems, except for a very brief mention at the end.





Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: