

Dartmouth researchers working on variants of diff and grep - dhruvbird
http://www.computerworld.com/s/article/9222509/Usenix_Dartmouth_updating_diff_grep_Unix_tools

======
tuukkah
_"You wonder why it hasn't been done before," one said._

To cite some previous work:

sgrep (structured grep): <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>

xmlstarlet (including xpath): <http://xmlstar.sourceforge.net/>

xmldiff: <http://www.logilab.org/859>

jsawk (including jsonquery): <https://github.com/micha/jsawk>

~~~
tuukkah
One might expect the research to cite previous work, but no:
<http://www.cs.dartmouth.edu/reports/TR2011-705.pdf>

_Related work: To the best of our knowledge, we are the first to reconsider
traditional UNIX tools given the increase in higher-level languages that
include non-regular constructs (such as blocks nested at arbitrary depth)._
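
To make "blocks nested at arbitrary depth" concrete, here is a minimal Python sketch of why this needs more than a regular expression: the matcher has to count depth. It is not taken from the tech report. The helper name `extract_blocks` and the brace-delimited syntax are assumptions for illustration only.

```python
def extract_blocks(text, name):
    """Return each top-level block that starts with `name` and is wrapped in braces.

    Hypothetical helper. Assumes braces are balanced and never appear
    inside strings or comments.
    """
    blocks = []
    start = text.find(name)
    while start != -1:
        depth = 0
        end = None
        # Walk forward, tracking nesting depth until the block closes.
        for i in range(start, len(text)):
            ch = text[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced input: stop rather than guess
        blocks.append(text[start:end])
        start = text.find(name, end)
    return blocks


config = (
    "interface eth0 { vlan 10 { tag 1 } }\n"
    "router { id 1 }\n"
    "interface eth1 { }"
)
for block in extract_blocks(config, "interface"):
    print(block)
```

The depth counter is the part a classical regular expression cannot express, since it would need a separate pattern for every nesting level.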

------
DiabloD3
I think someone needs to tell Dartmouth about Perl; that's all the data
analysis on non-file data stores I ever seem to need.

~~~
pjscott
If you had better tools, perhaps you would find yourself needing them more
often.

------
KaeseEs
I would absolutely love to read this article, however there seems to be a
login barrier that precludes my doing so. Can anyone paste the article text?

~~~
crgt
Copy & Paste:

"With some funding from Google and the U.S. Energy Department, a pair of
computer scientists at Dartmouth College are updating the venerable grep
and diff Unix command line utilities to handle more complex types of data.
Such updates are needed because "we now tend to have more model-based
configuration languages that have meaningful constructs spanning more than one
line," said Gabriel Weaver, a Dartmouth graduate student who, along with
Dartmouth computer science professor Sean Smith, is creating the variants of
grep and diff. Weaver presented the new utilities at a poster session at the
Usenix Large Installation System Administration (LISA) conference, being held
this week in Boston. The new programs will allow administrators to extract
meaningful data from configuration files, log files and other sources of
operational data, the researchers maintain. Grep and diff are command line-
based text analysis tools available in all Linux and Unix distributions. Both
are designed to parse documents on a line-by-line basis. Grep offers the
ability to search through multiple text files and folders for a specific chunk
of text or regular expression. Diff compares two documents and highlights the
differences between them. As with most Unix utilities, the output from either
of these programs can be linked, or piped, to other utilities, so they can be
incorporated into scripts that automate routine system administration tasks.
The new programs, called Context-Free Grep and Hierarchical Diff, will provide
the ability to parse blocks of data rather than single lines. For each new
type of data structure, a vendor would provide a pattern library identifying
the basic structure of the data, which the software would then use to "extract
the constructs of interest from the document," Weaver said. Such utilities
could provide administrators the ability to work with more complex forms of
data now being generated by network equipment and infrastructure software. For
instance, Cisco's IOS (Internetwork Operating System), which is the company's
operating system for its routers and switches, will provide operational data
in block-like data structures. With this data, a tool such as diff "can be too
low-level," Weaver said. "Diff doesn't really pay attention to the structure
of the language you are trying to tell differences between." He has seen cases
where diff reports that 10 changes have been made to a file, when in fact only
two changes have been made, and the remaining data has simply been shifted
around. Grep has issues with data blocks as well. "With regular expressions,
you don't really have the ability to extract things that are nested
arbitrarily deep," Weaver said. Context-Free Grep is still in the design
stage, but should be completed within the next few months. A prototype of
Hierarchical Diff has been completed, though the researchers have not posted
the code yet.

Google's interest in this technology springs from the company's efforts in
cloud computing, where it must automate operations across a wide range of
networking gear, Weaver said. The DOE foresees that this sort of software
could play a vital role in smart grids, in which millions of energy consuming
end-devices would have connectivity of some sort. The software would help
"make sense of all the log files and the configurations of the power control
networks," Weaver said. In addition to system administration duties, the
utilities could also be used with non-technical languages. They
could be used to parse legal documents, for instance, Weaver suggested. A
number of Usenix attendees praised the idea for its potential usefulness. "You
wonder why it hasn't been done before," one said. Another commented that such
tools could also be really handy for code repositories such as Git."

Joab Jackson covers enterprise software and general technology breaking news
for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's
e-mail address is Joab_Jackson@idg.com
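
For anyone wondering what "shifted around" versus "changed" looks like in practice, here is a rough Python sketch of a block-aware comparison. It is not the researchers' Hierarchical Diff (the article says that code has not been posted). The function names `parse_blocks` and `block_diff`, and the assumption that an unindented line opens a block whose indented lines belong to it, are mine.

```python
def parse_blocks(text):
    """Group an indentation-structured config into {header: [child lines]}.

    Assumption: an unindented line starts a block; indented lines that
    follow belong to that block. Duplicate headers are merged.
    """
    blocks = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace():
            if current is not None:
                blocks[current].append(line.strip())
        else:
            current = line.strip()
            blocks.setdefault(current, [])
    return blocks


def block_diff(old, new):
    """Report added, removed, and changed blocks, ignoring block order."""
    a, b = parse_blocks(old), parse_blocks(new)
    changes = []
    for header in sorted(a.keys() | b.keys()):
        if header not in b:
            changes.append(("removed", header))
        elif header not in a:
            changes.append(("added", header))
        elif a[header] != b[header]:
            changes.append(("changed", header))
    return changes


old = "interface eth0\n ip address 10.0.0.1\ninterface eth1\n ip address 10.0.0.2\n"
new = "interface eth1\n ip address 10.0.0.2\ninterface eth0\n ip address 10.0.0.9\n"
print(block_diff(old, new))
```

Because blocks are compared by header rather than by line position, swapping the two interface stanzas is not reported at all; only the block whose contents differ shows up. A line-based diff of the same two inputs would flag the moved lines as well.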

