Hacker News
Where are command line tools to extract data from huge XML files?
5 points by reirob on Nov 22, 2010 | 16 comments

I work in a support team. We sometimes receive huge XML files (tens of gigabytes) from which we need to extract some data fields. I do most of this work with Unix tools (grep, awk, bash), but for huge XML files I have not found a tool that does not eat up all the memory on the machine. I used xmlstarlet, which is exactly what I need, except that it does not work on huge files.

So my question is: is there a command-line tool that is fast, processes XML as a stream, and allows extracting elements in a concise way (preferably XPath)?

I just cannot believe that after more than ten years of XML there are no reliable tools around; I would actually have expected such tools to find their way into the standard Unix toolset.

Note 1: Please don't blame me for using huge XML files; it's not my design decision. The reality is that there are cases where we have to work with huge XML files. I know all the inconveniences of XML, and I know that its overhead is huge, but that is not the point here.

Note 2: Please don't tell me I should use programming language X or Z to read XML. I really want something that is available at the command line and that, if necessary, can be combined with other commands in a shell script.

Thanks in advance for all your suggestions.


The problem with piping a large XML file is that you pretty much have to load the entire thing into memory before you can do anything with it. What if the field you've requested spans 5 GB of text? You need to load 5 GB to get the value.

The usual solution for dealing with large files like these is a memory-mapped file. I'm not sure it would help with the issue I mentioned above, though...

You don't need to build the entire document tree in memory. You just have to use a sequential parser like SAX or StAX.
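To illustrate the idea, here is a minimal sketch of the SAX approach in Python; the tag name "price" is a hypothetical placeholder for whatever field the real file contains. The handler only keeps the current element's text in memory, not the document tree:

```python
import xml.sax

class FieldExtractor(xml.sax.ContentHandler):
    """Collect the text of every <price>-style element (tag name is hypothetical)."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.chunks = []
        self.results = []

    def startElement(self, name, attrs):
        if name == self.tag:
            self.inside = True
            self.chunks = []

    def characters(self, content):
        # characters() may fire several times per element, so accumulate chunks
        if self.inside:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.tag:
            self.results.append("".join(self.chunks))
            self.inside = False

# Demo on a small in-memory document; on a real file you would call
# xml.sax.parse("huge.xml", handler) and memory stays bounded.
sample = b"<items><item><price>1.50</price></item><item><price>2.75</price></item></items>"
handler = FieldExtractor("price")
xml.sax.parseString(sample, handler)
print(handler.results)  # → ['1.50', '2.75']
```

Because the parser never builds a tree, this works the same way on a 10 MB file and a 50 GB file.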

How would this work, exactly? Usually stream parsing requires handler functions, which would imply passing functions on the command line. I can't think of how you would do this with pipes.

You could write a wrapper that outputs SAX events to predetermined files or named pipes, selectable via command-line arguments; for instance, the startElement event could be written to startElement.txt. Your script could then monitor these files and do the processing from there.

I'm sure there are better ways to do it, though.
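One simpler variant of that wrapper idea: instead of one file per event type, emit every SAX event as a tab-separated line on stdout, so ordinary shell filters can consume the stream. This is only a sketch under that assumed line format:

```python
import io
import sys
import xml.sax

class EventPrinter(xml.sax.ContentHandler):
    """Write one tab-separated line per SAX event, so grep/awk can filter them."""
    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        self.out.write(f"start\t{name}\n")

    def endElement(self, name):
        self.out.write(f"end\t{name}\n")

    def characters(self, content):
        text = content.strip()
        if text:
            self.out.write(f"text\t{text}\n")

# Demo on a tiny in-memory document; in a real pipeline the input would be
# sys.stdin.buffer, e.g.:  python sax_events.py < huge.xml | awk -F'\t' '$1=="text"'
xml.sax.parse(io.BytesIO(b"<r><name>foo</name></r>"), EventPrinter(sys.stdout))
```

The downstream awk/grep stage then plays the role of the "handler", which keeps the parser itself generic.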

You're definitely right, that would work. I guess my point is: the OP seems to want a dynamic way to handle XML at the command line, so that he can handle new types of events on the fly by piping a parser together with other Unix filters. Doing things the SAX way implies writing hard-coded (and specific) handlers for each type of event, which means anticipating those events in advance.

I'm sure some handlers could be generalized a bit (for example, scanning for regular expressions). But I don't think the SAX/command-line solution is much better than the SAX/program solution.

Am I missing something?

Hello again,

I tried it out. On the positive side, it is an awesome tool, providing a mixture of SQL query syntax and XPath for parsing XML files, and it supports quite a large set of input formats (not only XML). On the negative side, for parsing XML it is a memory hog. I tried it on a 1.7 GB XML file, just to extract the data of one element repeated through the records, and after many minutes it crashed with a "not enough memory" message, and this on a 4 GB machine.

Unfortunately this is not the right tool.

Anyway thanks for the proposal.

Many thanks!

This is definitely worth looking into! Any similar tool for a Unix environment?

Thanks again! This is exactly what I wanted. I'll give LogParser and the tools mentioned on Stack Overflow a try and will post here which one works best for my needs.

Unfortunately the proposed tools handle only plain-text log files. None of them supports XML :(

Waiting for more/other propositions.

You might want to have a look at this: http://tibleiz.net/asm-xml/index.html

It's a fast (200MB/sec) XML parser/decoder in pure x86 assembler.

This is a library, against which I would have to develop my own code. What I need is a ready-to-use command-line tool that preferably accepts XPath expressions to select the data elements I want to extract (like xmlstarlet). I am not interested in programming.

Thanks anyway.

Write a small script in language X or Z using an XML stream parser.

Then for each task I will end up writing a small script in language X or Z, whereas all I need is a command line to extract data as flat text. Have a look at xmlstarlet (http://xmlstar.sourceforge.net/): this is exactly what I need, if only it were able to process large XML files, which unfortunately it is not.

You can write a small Python script that does the flat-text thing you're talking about.

Python scripts can run from the command line.
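For instance, a minimal sketch using Python's xml.etree.ElementTree.iterparse; the tag names "rec" and "id" are hypothetical placeholders for whatever the real file contains:

```python
import io
import xml.etree.ElementTree as ET

def extract(stream, record_tag, field_tag):
    """Yield the text of <field_tag> inside each <record_tag>, clearing each
    finished record so memory stays roughly constant on huge files."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            field = elem.find(field_tag)
            yield field.text if field is not None else ""
            elem.clear()  # drop the record's subtree once we are done with it

# Demo on a small in-memory document; a real run would pass open("huge.xml", "rb").
sample = io.BytesIO(b"<db><rec><id>1</id></rec><rec><id>2</id></rec></db>")
print(list(extract(sample, "rec", "id")))  # → ['1', '2']
```

Saved as a script that prints one value per line, this behaves like any other Unix filter and can be piped into grep, awk, sort, and so on.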

