
> This is what the parent comment means when referring to pointer chasing -- XML documents are a big random access graph in memory, CPU cache and prefetch is close to useless in that environment, so when walking the DOM as part of some parsing task, much of the time is spent waiting on memory, with the execution units lying idle.

Bear in mind that this is only true if you parse with the DOM model. If you care about efficiency and it's at all possible, the SAX model is much faster: you won't be bound by pointer chasing, as there's very little in memory at once. IME the next big gain comes from eliminating string comparisons with hash values. By that point, XML parsing is entirely limited by how fast you can stream the documents.
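A minimal sketch of both ideas using Python's stdlib SAX parser. The element name "item" and the hash-based comparison are illustrative; a real implementation would also guard against hash collisions, which this sketch skips:

```python
import xml.sax

# Precompute the hash once so the hot path does an int compare instead of
# a string compare. (Sketch only: does not handle hash collisions.)
ITEM_HASH = hash("item")

class ItemCounter(xml.sax.ContentHandler):
    """Streams the document; only the handler's own state lives in memory."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if hash(name) == ITEM_HASH:  # one integer comparison per element
            self.count += 1

handler = ItemCounter()
xml.sax.parseString(b"<root><item/><item/><other/></root>", handler)
print(handler.count)  # 2
```

Since SAX never builds a tree, peak memory is bounded by the deepest open-element stack, not the document size.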




You can achieve a similar (though I guess not nearly as efficient) effect with DOM, without sacrificing convenience, given a suitable library. For example, the Python lxml library grants access to the tree as it is being constructed; if you are careful not to delete a node it will later modify, it's entirely safe to, e.g., parse one element at a time from a big serialized array, then delete each element from its parent container, so memory usage remains constant. By the end of the parse, you're left with a stub DOM describing an empty container.

The advantage is not losing access to lovely tooling like XPath for parsing.

(If anyone hasn't seen this trick before: the key to avoiding deleting elements out from under the parser is to keep a small history of elements to be deleted later. For an array, it's only necessary to save the node describing the previous array element.)
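A sketch of the trick described above, assuming lxml is installed; the tag name "item" and the in-memory document are illustrative. Each element is processed on its end event, cleared, and its already-parsed previous siblings are deleted, so only the current node and its ancestors stay live:

```python
from io import BytesIO
from lxml import etree

# A large "serialized array" stand-in; real use would stream from a file.
doc = b"<items>" + b"<item>x</item>" * 1000 + b"</items>"

seen = 0
for _, elem in etree.iterparse(BytesIO(doc), tag="item"):
    seen += 1                  # process the element (e.g. run XPath on it)
    elem.clear()               # drop its text and children
    # Delete previous siblings the parser has already finished with;
    # keeping the current node avoids deleting under the parser.
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(seen)  # 1000
```

Memory use stays roughly constant no matter how many `<item>` elements the document contains, and at the end only a stub `<items>` root remains.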



