
Parsing huge XML files with Go - dps
http://blog.davidsingleton.org/parsing-huge-xml-files-with-go/
======
willvarfar
Personally, I have a penchant for writing my own pull parsers. It's a
mind-expanding exercise.

The neat thing about Go is that parsers can return functions that consume the
next token. Rob Pike has an excellent video about this:
<http://www.youtube.com/watch?v=HxaD_trXwRE>
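
A minimal sketch of the "state function" idea, in case it helps: each state consumes some input and returns the function that should run next. The `lexer` type, the state names, and the token format below are made up for illustration; this is not the code from the talk.

```go
package main

import (
	"fmt"
	"unicode"
)

// stateFn is one lexer state: it consumes input and returns the next state.
// Returning nil stops the machine.
type stateFn func(*lexer) stateFn

// lexer is a hypothetical, deliberately tiny lexer over an ASCII string.
type lexer struct {
	input  string
	pos    int
	tokens []string
}

// lexSpace skips spaces, then picks the next state from the first character.
func lexSpace(l *lexer) stateFn {
	for l.pos < len(l.input) && l.input[l.pos] == ' ' {
		l.pos++
	}
	if l.pos >= len(l.input) {
		return nil
	}
	if unicode.IsDigit(rune(l.input[l.pos])) {
		return lexNumber
	}
	return lexWord
}

// lexNumber consumes a run of digits and emits a NUM token.
func lexNumber(l *lexer) stateFn {
	start := l.pos
	for l.pos < len(l.input) && unicode.IsDigit(rune(l.input[l.pos])) {
		l.pos++
	}
	l.tokens = append(l.tokens, "NUM:"+l.input[start:l.pos])
	return lexSpace
}

// lexWord consumes everything up to the next space and emits a WORD token.
func lexWord(l *lexer) stateFn {
	start := l.pos
	for l.pos < len(l.input) && l.input[l.pos] != ' ' {
		l.pos++
	}
	l.tokens = append(l.tokens, "WORD:"+l.input[start:l.pos])
	return lexSpace
}

// lex runs the state machine until a state returns nil.
func lex(input string) []string {
	l := &lexer{input: input}
	for state := stateFn(lexSpace); state != nil; {
		state = state(l)
	}
	return l.tokens
}

func main() {
	fmt.Println(lex("go 2012 xml")) // prints [WORD:go NUM:2012 WORD:xml]
}
```

The driver loop never needs a switch on the current state; the state itself says what comes next.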

~~~
signa11
> Rob Pike has an excellent video about this:
> <http://www.youtube.com/watch?v=HxaD_trXwRE>

Thank you! This is an _excellent_ talk. Having concurrent implementations of
the lexer and parser as coroutines communicating over channels is very,
very cool.

------
fleitz
Streaming parsers are key when dealing with XML files this big. I used to have
a C# parser that would parse about 1 TB of XML per day; the biggest files were
over 200 GB.

It was impossible without rewriting everything to use a SAX style parser.

~~~
barrkel
SAX style (parsing library callbacks) is not your only option; you can also
use an iterator style (i.e. something like XmlReader in C#).

~~~
fleitz
Oops, I mistook SAX for iterator; I really prefer XmlReader to SAX style.

------
duaneb
As much as I like hearing about Go, SAX parsers are not exactly new.

~~~
iand
I think showing the convenience of parsing into tagged structs makes this a
cut above the usual SAX parsing examples.

~~~
duaneb
Hm, good point. Negativity redacted.

------
exim
In the first place, why should you have huge XML files? (Except those
Wikipedia dump files :))

~~~
human_error
It happens sometimes. I had written a multithreaded parser in C++ to parse
around 800MB per day so another team could build up the rest of the project
based on the data. Someone had thought it'd be a better idea to store all
fetched data in XML.

------
pradeepprabakar
I had to do a similar task: parsing the huge Wikipedia dump and rewriting
the Wikipedia XML (I had to add a couple of other tags to the main "page" tag).
I used a SAX parser in Python and rewrote the dump. I found SAX parsers very
simple to use with huge XML streams.

