
TXR – A Programming Language for Convenient Data Munging - joshumax
http://www.nongnu.org/txr/
======
kazinator
Author here. Currently working on a debugger. (Threw the old crappy one out.)
Backtraces are working. Some of the remaining work is going to require long,
uninterrupted concentration that is hard to come by due to taking care of a
six-month-old baby.

I have over 50 unreleased patches. There are some bugfixes, including a
compiler fix involving dynamically scoped variables used as optional
parameters:

    
    
          (defvar v)
          (defun f (: (v v)))
          (call (compile 'f)) ;; blows up in virtual machine with "frame level mismatch"
    

Patch for that:

    
    
      diff --git a/share/txr/stdlib/compiler.tl b/share/txr/stdlib/compiler.tl
      index e76849db..ccdbee83 100644
      --- a/share/txr/stdlib/compiler.tl
      +++ b/share/txr/stdlib/compiler.tl
      @@ -868,7 +868,7 @@
                                           ,*(whenlet ((spec-sub [find have-sym specials : cdr]))
                                               (set specials [remq have-sym specials cdr])
                                               ^((bindv ,have-bind.loc ,me.(get-dreg (car spec-sub))))))))))
      -                 (benv (if specials (new env up nenv co me) nenv))
      +                 (benv (if need-dframe (new env up nenv co me) nenv))
                        (btreg me.(alloc-treg))
                        (bfrag me.(comp-progn btreg benv body))
                        (boreg (if env.(out-of-scope bfrag.oreg) btreg bfrag.oreg))
    

There is now support in the printer for limiting the depth and length of printed objects.
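Something along these lines (untested sketch; the *print-level* / *print-length* names here follow the CL convention and are my assumption, not necessarily the exact TXR names):

      ;; assumed CL-style printer control variables: deeply nested and
      ;; overly long parts of the object are elided from the output
      (let ((*print-level* 2)
            (*print-length* 3))
        (prinl '(1 (2 (3 (4 5))) 6 7 8 9)))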

I added a derived hook to the OOP system: a struct type can be notified that
it is being inherited from.
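Something like this (rough, untested sketch; the hook name, clause type and argument order shown here are illustrative guesses, not the documented interface):

      ;; hypothetical sketch: the supertype exposes a static function
      ;; named derived which is called when a subtype is defined
      (defstruct base nil
        (:function derived (super sub)
          (prinl (list 'derived super sub))))

      (defstruct child base)  ;; the hook on base would fire here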

------
otoburb
_" TXR Lisp programs are shorter and clearer than those written in some
mainstream languages "du jour" like Python, Ruby, Clojure, Javascript or
Racket. If you find that this isn't the case, the TXR project wants to hear
from you; give a shout to the mailing list. If a program is significantly
clearer and shorter in another language, that is considered a bug in TXR."_

That section made me chuckle. Admirable if true.

~~~
auvrw
i agree that the general-purpose programming language space is fairly crowded
... the lisp dialect/user ratio especially so.

DSLs, otoh, are in short supply. while awk or plain sed are great for shell
programming, this is the only (open source) DSL i'm aware of targeting certain
types of NLP-esque "munging". this space is mostly full of statistical
approaches, which, while conceptually pure, don't allow the kind of
flexibility that would be useful in many applications.

i wonder if, eventually, the DSL portion of TXR could be sheared off (possibly
via metacircular evaluation of the TXR lisp?) into something that's portable
across lisps or at least to semi-standardized scheme implementations?

~~~
kazinator
N. Westbury has been cloning it in Java:

[https://github.com/westbury/txr-java](https://github.com/westbury/txr-java)

------
notafraudster
This seemed interesting, but when I went through the "Accepted Stack Overflow"
links on the main page, I thought "how would I do this in an R tidyverse
stack?" My goal was that my responses should be shorter, clearer, or ideally
both, and that I would favour clearer answers over code golf. One caveat: when
posting to HN I collapse the code into a single line, whereas in R there would
be line breaks at each semicolon or after each pipe operator (%>%). Here are
three examples:

"Customized sort based on multiple columns of CSV". In R, something like this:
`library(tidyverse); read_delim("file.tsv", delim = "@") %>% arrange(.[[2]])
%>% group_by(.[[2]]) %>% arrange(match(.[[3]], c("arch.", "var." "ver.",
"anci.", "fam.")), .[[3]]) %>% group_by(.[[2]], .[[3]]) %>% mutate(n = n())
%>% arrange(desc(n)) %>% ungroup() %>% select(1:4)`

"Extract text from HTML table". In R, something like this would suffice:
`library(rvest); library(tidyverse); read_html(URL_GOES_HERE) %>%
html_nodes("div.scoreTableArea") %>% html_table() %>% write_delim("out.csv",
delim = "\t")`

"Get n-th Field of Each Create Referring to Another File". In R:
`library(tidyverse); file1 = read_delim("file1.txt", delim = " ", col_names =
FALSE); chunks = readChar("file2.txt", 999999) %>% str_split(";") %>% unlist()
%>% map(function(x) { matches = str_match(str_trim(x), '^create table "(. _)
"([^(]_)\\\\(((.|\n)*)\\\\)$'); title = matches[, 2]; fields = matches[, 4]
%>% str_split(",") %>% unlist() %>% str_trim(); return(tibble(table_name =
rep(title, length(fields)), n = 1:length(fields), field = fields)) }) %>%
bind_rows(); file1 %>% left_join(chunks, by = c("X1" = "table_name", "X2" =
"n"))`

The third example trades off a little clarity for a little robustness by
adding a regex instead of assuming the SQL table definition is one field per
line.

~~~
kazinator
There is no HTML parsing library in TXR, yet the code still looks good.

TXR Lisp has support for that type of functional transformation of structured
data, with fairly tidy syntax. If a need for a full blown HTML parsing library
arises, someone will come up with one; maybe me. It could end up integrated
into the TXR flex/Yacc parser, which would make it fast.

In the "Get n-th Field" task, what we can do is snarf the data as a string,
then remove all the commas and semicolons. It then parses as a TXR Lisp with
the _lisp-parse_ function, resulting in this:

    
    
      (create table (qref "def" something)
       (f01 char (10) f02 char (10) f03 char (10) f04 date)
       create table (qref "abc" something)
       (x01 char (10) x02 char (1) x03 char (10))
       create table (qref "ghi" something)
       (z01 char (10) z02 intr (10) z03 double (10) z04 char (10) z05 char (10)))
    

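The snarfing step itself is short. Something like this (untested sketch; I'm
reaching for file-get-string and regsub from memory, and the quasiliteral just
wraps the text in parentheses so that it reads back as a single object):

      ;; read file2.txt, strip commas and semicolons, and parse the
      ;; parenthesized remainder back as one big Lisp form
      (let* ((text  (file-get-string "file2.txt"))
             (clean (regsub #/[,;]/ "" text)))
        (lisp-parse `(@{clean})`))
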
That seems to open an avenue to a solution. E.g. we can now partition it into
pieces that start with the _create_ symbol:

    
    
      28> (partition *26 (op where (op eq 'create)))
      ((create table (qref "def" something) (f01 char (10) f02 char (10) f03 char (10) f04 date))
       (create table (qref "abc" something) (x01 char (10) x02 char (1) x03 char (10)))
       (create table (qref "ghi" something) (z01 char (10) z02 intr (10) z03 double (10) z04 char (10) z05
                                             char (10))))
    

Now the (qref "def" something) parts are in fixed positions, followed by
fixed-shape triplets.
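Pulling those apart could then look something like this (untested sketch;
tree-bind just names the positions, and pieces stands for the partitioned list
above):

      ;; for each piece, bind the fixed positions and keep the quoted
      ;; table name together with its field list
      (collect-each ((piece pieces))
        (tree-bind (create table (qref name sym) fields) piece
          (list name fields)))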

The only problem with this type of solution is that it takes the example data
too literally. The user's actual data might not cleanly parse this way.

------
anentropic
> The PDF rendition of the reference manual, which takes the form of a large
> Unix man page, is over 600 pages long, with no index or table of contents.
> There are many ways to solve a given data processing problem with TXR.

"Good luck, you're on your own!"

~~~
kazinator
The "no index or TOC" isn't being touted as a feature, just that the page
count is that without these (in documents like these, these features can
contribute dozens to the page count). An index would be nice; patches welcome!

The HTML version that most people would be using has a TOC with two-way
navigation to the section headings and is hyperlinked. Of course, man page
reading allows easy searching.

~~~
Jach
I guess threads like this remind me why it's nice to have professional doc
writers review my customer-facing text at work. ;) Congrats on your project
getting some more attention! If you'll indulge a bit of bikeshedding, this
particular miscommunication could probably be avoided in the future by
changing the sentence to the short "The PDF rendition of the reference manual
is over 600 pages long." Even if you add extra things to the PDF later the
statement won't be incorrect and so you won't have to deal with nitpickers
coming by next time with a comment like "But if you remove the index it's only
597 pages!"

Another edit preserving more of the original would be to replace the final
"with no" with something like "even excluding any"...

~~~
kazinator
Thanks; I fixed that.

------
js8
It would be interesting to have a DSL for data munging, but I am afraid TXR is
not it. My requirements would be that the language should be functional and
total.

Most transformations that we do on data do not require Turing completeness or
recursion. I think it would be useful to write these down in a language with
semantics that is easy to analyze.

~~~
kazinator
The funny thing is, I originally didn't intend the TXR pattern language to be
recursive. It needed functional decomposition (pattern functions) to break up
a big pattern match into simpler units. When those were implemented, I
realized after the fact, hey we have a push-down automaton that can now grok
recursive grammars.

I don't see why we would want to rule out a pattern function invoking itself
(directly, or through intermediaries); if that hurts, then just don't do that.

(Though I understand that there are languages deliberately designed without
unbounded loops or recursion, for justifiable reasons.)

~~~
js8
I have found in practice that arbitrary recursion depth is very rarely needed
(even in languages with a formally recursive grammar). And where it is needed,
it can probably be covered by a primitive in the language (map a total
function over all the nodes) that does a similar thing.

------
cstross
From where I'm standing this looks like someone put a _lot_ of effort into re-
inventing Perl, minus the documentation and user community.

~~~
TuringTest
I've not studied this language yet, but if its syntax is in any way saner,
that would still be a net gain.

------
usgroup
I ashamedly had never heard of this before. Could anyone add any colour RE:

1. Parsimony.

2. Performance vs awk and friends.

3. Multi threading.

4. Ideal use cases.

~~~
nn3
4. My use case was a somewhat fuzzy parsing problem that is harder than a
single regexp and needs backtracking, followed by generating a report from the
result.

For these things TXR is great.

If you want multi threading or the best performance, it's probably not the
thing to use.

------
uptownfunk
We already have this: it is R with tidyverse. What we need is a fully baked
transpiler from R/tidyverse to SQL.

~~~
crispyambulance
Yep. Seriously. R w/tidyverse is a ridiculously powerful data wrangling tool,
especially when dealing with text files.

I tend to use Notepad++ when starting out on a data-wrangling adventure. It
has an uncanny ability, unlike any other editor, to open hundreds of files at
the same time and perform regex operations on all of them without dropping
dead. I use Notepad++ for initial manual exploration to get the lay of the
problem, and then switch to R for the actual analysis.

~~~
flavio81
>I tend to use Notepad++

I assume, then, that your file sizes are not so big. N++ is not good with big
(>25% of your RAM) files, refusing to open them.

Is R/tidyverse also limited in the size of file it can handle? In my job I
routinely work with files of up to 100GB.

~~~
crispyambulance
I guess it depends on what your definition of "big" is, I've never had to deal
with 100GB files!

------
mcguire
Confusingly, there's another language called TXL
([https://en.wikipedia.org/wiki/TXL_(programming_language)](https://en.wikipedia.org/wiki/TXL_\(programming_language\)))
that's both obscure and neat.

------
theon144
Well, this looks great, but I'm not about to start digesting the self-admitted
600-page tome just to see if it's worth learning for the tasks I encounter -
surely there's a "tutorial" somewhere?

~~~
TuringTest
This page is quite explanatory:

[http://www.nongnu.org/txr/txr-pattern-language.html](http://www.nongnu.org/txr/txr-pattern-language.html)

~~~
tux1968
Way off topic, but as someone who has recently switched to using a non-
standard background color in my browser... that page is horrendous to read:

[https://i.imgur.com/pvCnmSa.png](https://i.imgur.com/pvCnmSa.png)

I can accept that doing something non-standard leads to some rough edges like
this, but I'm not sure how many web developers know this is an issue. At the
very least, it has surprised me how many websites assume the default
background color is bright white.

~~~
kazinator
Hi; try it now! I GIMP-ed the image such that the non-transparent pixels are
pure red, and only slightly opaque, instead of 100% opaque pinkish white. It
looks about the same on a white background. Thanks, again.

I tested it with a light grey background, as well as a heavy gray one.

This little experiment really made me notice HN's hard-coded light grey
background box, BTW.

~~~
tux1968
Yeah, looks great here. Really didn't expect you to dig into it at all let
alone so quickly. Thanks very much :-)

------
mark_l_watson
Interesting lisp’y language. Off topic, but I find the domain name nongnu.org
to be amusing for a GNU/FSF web site. “nongnu” to me reads as “not gnu”

~~~
buckminster
Exactly. It's GNU hosting for non-GNU projects.

~~~
mark_l_watson
Oh, that makes sense. Thanks!

~~~
kazinator
Of course, that means non-GNU projects that are licensed in such a way that
they could be GNU projects.

From the registration page, the kind of software project that can be hosted on
Savannah is _[a] free software package that can run on a completely free
operating system, without depending on any nonfree software. You can only
provide versions for nonfree operating systems if you also provide free
operating systems versions with the same or more functionalities. Large
software distributions are not allowed; they should be split into separate
projects._

------
jdmoreira
Very interesting. I'm wondering why they didn't implement the Lisp version on
top of CL with macros

~~~
kazinator
I can summarize this as follows. TXR is my research platform for various
topics, including many Lisp topics. It contains numerous innovations. As a
whole, that requires working at the implementation level, from the ground up.

~~~
flavio81
Thanks Kaz! I had the same question.

------
vcdimension
Has anyone run any benchmarks of TXR against awk, R, python, or miller?

