Hello,
I am working in a support team. Sometimes we receive huge XML files (tens of gigabytes) from which we need to inspect, mostly to extract, some data fields. I do most of this work with Unix tools (grep, awk, bash), but for huge XML files I have not found a tool that does not eat up all the memory of the machine. I used xmlstarlet, which is exactly what I need, except that xmlstarlet does not work for huge files.
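To give an idea of what I am doing, this is roughly the kind of xmlstarlet command I use (the file name and the XPath are just made-up placeholders, not our real data):

    # extract the value of every /orders/order/id element, one value per line
    xmlstarlet sel -t -v "/orders/order/id" -n big-file.xml

On small files this is exactly what I need, but on the multi-gigabyte files it eats up all the memory, as described above.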
So my question is: is there a command-line tool that is fast, processes XML as a stream, and allows extracting elements in a concise way (preferably via XPath)?
I just cannot believe that after more than 10 years of XML there are no reliable tools around; I would actually expect such tools to have found their way into the standard UNIX toolset.
Note 1: Please don't blame me for using huge XML files; it's not my design decision. The reality is that there are cases where we have to work with huge XML files. I know all the inconveniences of XML, and I know that the overhead is huge, but that is not the point here.
Note 2: Please don't tell me that I should use programming language X or Z to read XML. I really want something that is available at the command line and that, if necessary, can be combined with other commands in a shell script.
Thanks in advance for all your suggestions.
reirob
The usual solution for dealing with large files like these is to use a memory-mapped file. I'm not sure whether it would help with the issue I mentioned above, though...