
Where are command line tools to extract data from huge XML files? - reirob
Hello,<p>I work in a support team. Sometimes we receive huge XML files (tens of gigabytes) that we need to test, which mostly means extracting some data fields. I do most of my work with Unix tools (grep, awk, bash), but for huge XML files I have not found a tool that does not eat up all the memory of the computer. I used xmlstarlet, which is exactly what I need, except that xmlstarlet does not work for huge files.<p>So my question is: is there a command-line tool that is fast, processes XML as a stream, and allows extracting elements in a concise way (preferably XPath)?<p>I just cannot believe that after more than 10 years of XML there are no reliable tools around - I would actually expect such tools to have found their way into the UNIX standard.<p>Note 1: Please don't blame me for using huge XML files, it's not my design decision - the reality is that there are cases where we have to work with huge XML files. I know all the inconveniences of XML and I know that the overhead is huge, but that is not the point here.<p>Note 2: Please don't tell me that I should use programming language X or Z to read XML. I really want something that is available at the command line and, if really necessary, can be combined with other commands in a shell script.<p>Thanks in advance for all your suggestions.<p>reirob
======
rcfox
The problem with piping a large XML file is that you pretty much have to load
the entire thing into memory before you can do stuff with it. What if the
field you've requested goes on for 5GB-worth of text? You need to load 5GB to
get the value.

The usual solution to dealing with large files such as these is to use a
memory-mapped file. I'm not sure if it would help with the issue I mentioned
above though...

~~~
iwr
You don't need to build the entire document tree in memory. You just have to
use a sequential parser like SAX or StAX.
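
To illustrate the sequential idea, here is a minimal sketch in Python. It uses the standard library's `xml.etree.ElementTree.iterparse` (an incremental parser, not SAX or StAX proper). The function name `iter_field_text` and the sample tags are made up for illustration, and it assumes the document is a root element holding many repeated child records.

```python
import io
import xml.etree.ElementTree as ET


def iter_field_text(source, tag):
    """Yield the text of every <tag> element without keeping the whole tree."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first "start" event is the root element
            depth += 1
        else:
            depth -= 1
            if elem.tag == tag:
                # The text is complete once the "end" event has fired.
                yield (elem.text or "").strip()
            if depth == 1:
                # A direct child of the root just closed: drop it so that
                # at most one top-level record is held in memory at a time.
                root.clear()


if __name__ == "__main__":
    # Tiny in-memory sample; a real run would pass a file name instead.
    sample = io.BytesIO(b"<orders><order><total>9.50</total></order></orders>")
    for value in iter_field_text(sample, "total"):
        print(value)
```

Memory use is then bounded by the largest single top-level record rather than by the file size. A single record that is itself gigabytes long would still have to be held in full.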

~~~
djacobs
How would this work, exactly? Usually stream parsing requires handler
functions, which would imply passing functions at the commandline. I can't
think of how you would do this with pipes.

~~~
iwr
You could write a wrapper that outputs SAX events to predetermined files or
named pipes, say, selectable as command line arguments; for instance, the
startElement event could be written to startElement.txt. Then your script
can monitor these files and do the processing from there.

I'm sure there are better ways to do it, though.
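
A simpler variant of the wrapper idea is to write every event as one line on a single stream, so that grep and awk can consume it directly. The sketch below does that with Python's standard `xml.sax` module. The class name `EventPrinter`, the function `dump_events` and the tab-separated line format are assumptions made up for this example, not an existing tool.

```python
import sys
import xml.sax


class EventPrinter(xml.sax.ContentHandler):
    """Write one tab-separated line per SAX event."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        self.out.write(f"start\t{name}\n")
        for key, value in attrs.items():
            self.out.write(f"attr\t{key}\t{value}\n")

    def endElement(self, name):
        self.out.write(f"end\t{name}\n")

    def characters(self, content):
        # The parser may deliver one text node in several chunks, so a long
        # text can end up on more than one "text" line. Whitespace-only
        # chunks are skipped; tabs inside the text are not escaped.
        text = content.strip()
        if text:
            self.out.write(f"text\t{text}\n")


def dump_events(source, out):
    """Parse `source` (file name or binary stream) and print its events."""
    xml.sax.parse(source, EventPrinter(out))


if __name__ == "__main__" and len(sys.argv) == 2:
    dump_events(sys.argv[1], sys.stdout)
```

Saved under a hypothetical name such as `sax_events.py`, it could be used as `python sax_events.py huge.xml | awk -F'\t' '$1 == "text"'`. The handlers are generic, so nothing about the document has to be known in advance.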

~~~
djacobs
You're definitely right, that will work. I guess my point is: the OP seems to
want a dynamic way to handle XML at the command line, so that he can
handle new types of events on the fly by piping together a parser with other
Unix filters, etc. Doing things the SAX way implies creating hard and fast
(and specific) handlers for each type of event, which means you're
anticipating these events in advance.

I'm sure some handlers could be generalized a bit (for example, scanning for
regular expressions). But I don't think that the SAX/commandline solution is
much better than the SAX/program solution.

Am I missing something?

------
cotsog
Microsoft Log Parser:
[http://www.microsoft.com/downloads/en/details.aspx?FamilyID=...](http://www.microsoft.com/downloads/en/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en)

~~~
reirob
Many thanks!

This is definitely worth looking into! Is there a similar tool for Unix
environments?

~~~
cotsog
See here: [http://stackoverflow.com/questions/185665/logparser-microsof...](http://stackoverflow.com/questions/185665/logparser-microsofts-one-or-similar-for-unix)

~~~
reirob
Thanks again! This is exactly what I wanted. I'll try LogParser and the
tools mentioned on Stack Overflow, and will post here which one works best
for my needs.

------
fbnt
You might want to have a look at this: <http://tibleiz.net/asm-xml/index.html>

It's a fast (200MB/sec) XML parser/decoder in pure x86 assembler.

~~~
reirob
This is a library, so I would have to develop my own tool on top of it. What
I need is a ready-to-use command-line tool that preferably accepts XPath
expressions to select the data elements I want to extract (as xmlstarlet
does). I am not interested in programming.

Thanks anyway.

------
tetsuharu
Write a small script in language X or Z using an XML stream parser.

~~~
reirob
Then I will end up writing a small script in language X or Z for every task
I have, whereas all I need is a command-line tool that extracts data as flat
text. You can have a look at the tool xmlstarlet
(<http://xmlstar.sourceforge.net/>) - this is exactly what I need, if only it
were able to process large XML files, which unfortunately is not the case.

~~~
hasenj
You can write a small python script that does the flat text thing you're
talking about.

Python scripts can run from the command line.
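
As a rough sketch of such a script: the version below turns each record element into one tab-separated line, so a single script can serve many tasks instead of one script per task. The function name `rows` and the file name `xml2tsv.py` are hypothetical, and it assumes the records are direct children of the root element. Field names are passed to ElementTree's `findtext`, which accepts simple child paths such as `customer/name`, only a small subset of XPath.

```python
import sys
import xml.etree.ElementTree as ET


def rows(source, record_tag, fields):
    """Yield one list of field values per <record_tag> element."""
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # remember the root so finished records can be dropped
        elif elem.tag == record_tag:
            # A missing field becomes an empty string.
            yield [(elem.findtext(f) or "").strip() for f in fields]
            # Assumes records are direct children of the root: discard the
            # finished record so memory stays bounded.
            root.clear()


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Usage: python xml2tsv.py FILE RECORD_TAG FIELD [FIELD ...]
    path, record_tag, *fields = sys.argv[1:]
    for row in rows(path, record_tag, fields):
        print("\t".join(row))
```

For example, `python xml2tsv.py huge.xml order id total | sort -k2 -n` would feed the extracted fields to the usual Unix filters.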

