
How to Quickly and Correctly Generate a Git Log in HTML - foob
http://www.oilshell.org/blog/2017/09/19.html
======
jacobparker
Reinventing escaping over and over again (which bash scripts in particular
seem to encourage) is a sucker's game. It's difficult to get right and if
you're constantly redoing it you're eventually going to make a mistake. I've
worked in web security and it's sad to see how likely it is for people with
good intentions to mess this up. I'm glad the author basically came to this
conclusion.

The winning strategy is to use a library/framework/whatever for embedding
user-provided content into HTML. Sane HTML template libraries will do this.
That library has had more time to get it right. Furthermore a well designed
API will clearly indicate what is trusted vs. untrusted data and all untrusted
data is properly encoded before being embedded. See the "Security Model"
section of Go's HTML templates below.
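
A minimal Python sketch of the trusted vs. untrusted split described above. This is not the Go `html/template` API or any real library: `Trusted` and `render` are hypothetical names, and the only real call is the standard library's `html.escape`. Everything not explicitly marked as trusted is escaped before it is embedded.

```python
import html


class Trusted(str):
    """Hypothetical marker: markup the programmer wrote, embedded verbatim."""


def render(template: str, **fields: str) -> str:
    """Fill `template`, HTML-escaping every field that is not `Trusted`."""
    safe = {
        name: value if isinstance(value, Trusted) else html.escape(value)
        for name, value in fields.items()
    }
    return template.format(**safe)


row = render(
    "<tr><td>{link}</td><td>{subject}</td></tr>",
    link=Trusted('<a href="/commit/abc123">abc123</a>'),  # written by us
    subject='Fix <script> & "quotes"',                    # from a commit message
)
print(row)
```

The point of the design is that forgetting to mark something fails safe: an unmarked string gets escaped rather than injected.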

An alternative to using the git tools which is appropriate for serious work
(shell pipelines are great for prototyping) is libgit2. It has bindings for
many languages. It's very easy to use (sometimes (not always) easier than the
CLI) and often much higher performance vs. big shell pipelines (operating on
text gets slow pretty fast, and often you end up using xargs...)

An example set of tools:
[https://golang.org/pkg/html/template/](https://golang.org/pkg/html/template/)
+ [https://github.com/libgit2/git2go](https://github.com/libgit2/git2go) .

It's not as succinct as a bash script but it's easier to build something
that's correct. Use the shell to prototype, build it right in a saner
environment.

~~~
falsedan
> _An alternative to using the git tools which is appropriate for serious
> work_

This belittles their and anyone else's bash scripts for git as 'toys'. That's
unfair, and completely unnecessary for your point.

> _big shell pipelines (operating on text gets slow pretty fast, and often you
> end up using xargs...)_

Big shell pipelines are actually blazingly fast, since each command in the
pipeline runs in parallel.

> _It's not as succinct as a bash script but it's easier to build something
> that's correct. Use the shell to prototype, build it right in a saner
> environment._

I would say the ease depends on the relative difficulty and functionality of
the languages. Casting bash as the insane bad language doesn't win you any
points: we all know bash is a hot mess of a garbage fire, now let's use it to
do useful good work.

I would struggle to justify the work I'd need to do to write the equivalent in
C (or python, where it would be slower) for one command that's only run in the
release process.

~~~
jacobparker
> This belittles their and anyone else's bash scripts for git as 'toys'.

They often make great toys in the sense that toys are fun but it's harder to
build robust things in bash scripts.

Robustness may not matter (e.g. I have tonnes of scripts I use for doing work)
but if you're generating HTML for a website and are worried about escaping and
security then I think there are more suitable tools.

> Big shell pipelines are actually blazingly fast, since each command in the
> pipeline runs in parallel.

It's also possible to do that in programming languages (maybe not as
succinctly.)

But where bash scripts fall down is the constant re-parsing of text, data has
to be passed through the kernel, and extra processes (when you use xargs,
which tends to happen for complicated tasks.) This adds up super quick.

I've replaced reasonable (i.e. weren't badly mishandling the data) shell
scripts that processed and formatted the output of git utilities (on large
repos) and in one case shrunk a background job from 5 minutes to a few
seconds. This is common.

> Casting bash as the insane bad language

I never said that. I said it was great for prototyping. It is bad for the task
FTA, though.

> I would struggle to justify the work I'd need to do to write the equivalent
> in C

I didn't recommend C (I did link to some Go stuff, though.) Unless your
company is all bash scripts (heaven forbid) you're probably already using some
language. C# at my work. PHP, Ruby, JavaScript, whatever.

~~~
falsedan
> _They often make great toys in the sense that toys are fun but it's harder
> to build robust things in bash scripts._

Programming is hard, regardless. You're being very dismissive towards shell
scripters, and it comes off as elitist.

> _But where bash scripts fall down is the constant re-parsing of text, data
> has to be passed through the kernel, and extra processes (when you use
> xargs, which tends to happen for complicated tasks.) This adds up super
> quick._

What do you mean by parsing? Copying, reading? The buffers sit in kernel
memory, sure… but that's pretty fast if every line in your program is
concurrent.

> _xargs_

Did you have a bad experience with xargs?

> _extra processes_

Did you have a bad experience with… running out of… PIDs?

> _> Casting bash as the insane bad language_

> _I never said that_

You implied it by calling a non-bash language 'sane'; thus, bash must be
insane.

> _I didn't recommend C_

Your blanket recommendation of 'anything but bash' did. I feel that sometimes
you need to have more context of a situation before dictating a course of
action that must be followed.

~~~
jacobparker
> Programming is hard, regardless. You're being very dismissive towards shell
> scripters, and it comes off as elitist.

I think you're taking this too personally. I write _a lot_ of bash scripts and
like bash (as I've already indicated.) It's not always the right tool for the
right job (no tool is...)

> What do you mean by parsing? Copying, reading? The buffers sit in kernel
> memory, sure… but that's pretty fast if every line in your program is
> concurrent.

The buffers don't just sit in the kernel, they are copied between processes
(via write/read.) This involves many many context switches. By contrast (but
this is just an example), libgit2 will hopefully mmap your packfiles, and once
the data is mapped in you don't have to leave your process.

Parsing is sometimes obvious, for example multiple passes over the data with
grep and sed to shape it the way you want. It's true that components of the
pipeline can run in parallel but the problem is they do a lot of unnecessary
work. Parallelizing unnecessary work doesn't remove it, and it stops helping
once you hit the limit on the number of cores (which is very relevant when
generating HTML, something often done by webservers with many concurrent
requests where you don't have cores to spare on parallelizing inefficient
algorithms.)

Parsing is sometimes not as obvious, for example steps in the pipeline doing
redundant Unicode validation. This is an artifact of squeezing things through
read/write with reusable components. It's convenient (which makes it great for
prototyping, or for things that "don't matter") but this adds up to poor
performance.

> Did you have a bad experience with xargs?

> Did you have a bad experience with… running out of… PIDs?

I think you're being very uncivil. xargs is useful but it (with the common
options like -n and -P) can spawn many processes which has a tremendous
overhead vs an alternative like a function call. It's purely a performance
thing.
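
A rough Python sketch of the overhead being described: the same trivial work done once as a plain function call and once by spawning a fresh process per item, which is roughly what `xargs -n1` does. The function names are hypothetical, and a Python interpreter is heavier to start than a small C utility, so treat the timings as an illustration rather than a benchmark.

```python
import subprocess
import sys
import time


def shout(word: str) -> str:
    """The work itself: a plain in-process function call."""
    return word.upper()


def shout_in_process(word: str) -> str:
    """Do the same work in a freshly spawned interpreter, one process per item."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", word],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


words = ["alpha", "beta", "gamma", "delta"]

start = time.perf_counter()
in_process = [shout(w) for w in words]
call_seconds = time.perf_counter() - start

start = time.perf_counter()
spawned = [shout_in_process(w) for w in words]
spawn_seconds = time.perf_counter() - start

print(f"function calls: {call_seconds:.6f}s, one process each: {spawn_seconds:.6f}s")
```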

(I probably should be using parallel but I'm used to xargs :( )

~~~
falsedan
> _I think you're being very uncivil._

I'm sorry. I see how my questions come off as questioning your experience, and
I didn't mean to do that.

> _I think you're taking this too personally._

I think your reluctance to admit that you needlessly insulted some class of
devs is exclusionary. Casual readers of these comments may think it's bad to
write bash, or that it's ok to bash people who do.

> _The buffers don't just sit in the kernel, they are copied between
> processes (via write/read.)_

Yes, that's right. I know how pipes work, thanks.

What you call 'parsing' I would call processing. Parsing has a well-defined
meaning in the context of computer science: turning tokens into a data
structure (like a syntax tree).

> _parallelizing unnecessary work doesn't remove it_

Is unnecessary work a problem? If you're being charged by the second, perhaps
it may be cost-effective to eliminate it… but most of the time, developer
productivity is a bigger cost.

I see that you are mindful of the performance implication of your code. I
usually have to think of developer throughput, so I'm very tolerant of
inefficient use of hardware if it means someone gets their job done faster.

~~~
hnbroseph
> class of devs

wait... what? is this marxist class struggle framing?

there is no 'class of devs' in this sense. there is no 'social justice' here.

bash is not a person, bash is not an identity, none of this applies.

~~~
falsedan
Class in the OO/taxonomy sense. I feel like you are projecting some of your
existing issues onto this discussion.

~~~
hnbroseph
it's more related to the nature of your wording and framing. no group of
humans is being abused or discriminated against.

~~~
falsedan
Then I should have said, type of developer. I see how 'class' has weighty
connotations.

> _no group of humans is being abused or discriminated against_

I disagree, you make it abundantly clear that 'real' tools cannot be written
in bash, and that people who write in bash are not producing as high-quality
work as those who use other languages (like python).

------
falsedan

        git log --pretty=format:"%H%x00%s" | sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g; s@\(.*\)\x0\(.*\)@<tr><th>\1</th><td>\2</td></tr>@'
    

You could do the dumb html entifying in a real language. The article's
solution is a straw man, since it's promoting their personal language.

Why did they see \x01 & \x02 as possible sentinels but not nulls? python is
fine with nulls…
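
A minimal Python sketch of the same idea, assuming the input comes from the `git log --pretty=format:"%H%x00%s"` command above, so each line is a hash, a NUL byte, and a subject. `rows_from_log` is a hypothetical helper name. It escapes each field separately with the standard library's `html.escape` and only then wraps them in markup.

```python
import html
from typing import List


def rows_from_log(log_text: str) -> List[str]:
    """Turn lines of the form '<hash>\\x00<subject>' into escaped table rows."""
    rows = []
    # Split on "\n" only; str.splitlines() would also split on other control
    # characters that could appear inside a commit subject.
    for line in log_text.split("\n"):
        if not line:
            continue
        commit_hash, subject = line.split("\x00", 1)
        rows.append(
            "<tr><th>{}</th><td>{}</td></tr>".format(
                html.escape(commit_hash), html.escape(subject)
            )
        )
    return rows


sample = 'c0c3150f5\x00Improve <hsv> && things\n9fceb02d0\x00Say "hi"\n'
for row in rows_from_log(sample):
    print(row)
```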

~~~
chubot
My solution is easier with respect to multiple fields that contain spaces. The
real example has both the description and committer name. And you can add and
remove fields without changing the second part of the pipeline -- you just
have to add 0x00 and 0x01 in the format string.

Similar question here:

[https://www.reddit.com/r/commandline/comments/719pm2/how_to_...](https://www.reddit.com/r/commandline/comments/719pm2/how_to_quickly_and_correctly_generate_a_git_log/dn9klp1/)

Also, I've used that kind of sed, and it looks horrible. It makes shell look
bad.

~~~
falsedan
> _You could do the dumb html entifying in a real language_

and xargs -0 -n2 printf '<tr><th>%s</th><td>%s</td></tr>'

------
tzs
The underlying problem with the first, simple, approach is that the template
it is using to get things from git,

    
    
      "<tr> <td>%H</td> <td>%s</td> <tr>"
    

interpolates values that need to be escaped, but includes literal text that
must not be escaped. (My guess is that the author meant "</tr>" for the last
element, but the article says "<tr>" so I'm going with that).

The author's approach to deal with that is to mark the places in the template
where escaping will be needed, and then make and use an escaping tool that
recognizes those marks and just escapes the marked segments.
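
A minimal Python sketch of that marking approach, not the article's actual tool. It assumes `\x01` opens and `\x02` closes each untrusted segment (the sentinel bytes mentioned elsewhere in this thread), and `escape_marked` is a hypothetical name.

```python
import html
import re

# Assumed sentinels: \x01 opens an untrusted segment, \x02 closes it.
MARKED = re.compile("\x01(.*?)\x02", re.DOTALL)


def escape_marked(text: str) -> str:
    """Escape only the text between sentinels and drop the sentinels."""
    return MARKED.sub(lambda m: html.escape(m.group(1)), text)


line = "<tr> <td>\x01c0c3150f5\x02</td> <td>\x01a < b && c\x02</td> </tr>"
print(escape_marked(line))
```

The literal markup outside the sentinels passes through untouched, so the escaper does not need to know anything about the table layout.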

A simpler approach is to eliminate the underlying problem. For getting the
data out of git use a template where the literal text is safe to escape, such
as this:

    
    
      "%H,%s"
    

The escaping can then be done by a tool that escapes its entire input. That
will leave the comma from the template alone, and will not introduce any new
commas. The interpolation of %s might have introduced commas, but they will
all be after the literal comma from the template. The interpolation of %H will
not introduce commas.

The output from the escaper can then be transformed into the final output by
replacing the first "," with "</td> <td>", prepending "<tr> <td>", and
appending "</td> <tr>". All of these are simple in a shell pipeline using sed.
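
The same steps can be sketched in Python instead of sed. This assumes input lines produced with the `"%H,%s"` template, uses a hypothetical `row_from_line` helper, and closes the row with `</tr>` on the guess stated above that this is what the article intended.

```python
import html


def row_from_line(line: str) -> str:
    """Escape a whole '<hash>,<subject>' line, then split on the first comma."""
    # html.escape never introduces a comma, so the first comma is still the
    # literal one from the template.
    escaped = html.escape(line)
    commit_hash, subject = escaped.split(",", 1)
    return "<tr> <td>{}</td> <td>{}</td> </tr>".format(commit_hash, subject)


print(row_from_line("c0c3150f5,Fix a < b, then c & d"))
```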

~~~
chubot
Several people brought up alternative solutions like this, and I addressed
them:
[https://news.ycombinator.com/item?id=15295556](https://news.ycombinator.com/item?id=15295556)

Summary: I may have oversimplified the problem in my example. I think my
solution is nicer for the real problem. It has fewer assumptions and will
"scale up" to more fields with arbitrary text. I want to write the escaping
ONCE, not modify it every time I change the format of the output table.

------
dahart
You can skip having to escape any characters or worry if the content is
correct, if you put an unformatted git log into a script tag, and then line
split and set the content of each element via a JS call.

I just tried it, and it works beautifully, no problems with illegal
characters.

What's wrong with this? It'd be super easy to extend if you want columns or
colors or links...

    
    
        <script id='gitlog' type='text'>
          c0c3150f5 09 - 15 dahart Color widget!, #1 improving < hsv > && things [Finishes #8736345] \m/ '",.;:%$#@*
        </script>
        <div id='lines'></div>
        
        $('#gitlog').html().split('\n').forEach(line => {
          $('#lines').append($('<div class="line"/>').text(line))
        })

------
pixelbeat__
Also consider
[https://www.pixelbeat.org/scripts/ansi2html.sh](https://www.pixelbeat.org/scripts/ansi2html.sh)
for the general case of (colored) output to html conversion

------
no_protocol
`gitweb` is a server that comes with your git install.

The `gitweb` web interface includes both a log and shortlog view for
repositories. You can probably use those to some benefit.

This seems to be the source of the shortlog command:

[https://github.com/git/git/blob/master/gitweb/gitweb.perl#L5...](https://github.com/git/git/blob/master/gitweb/gitweb.perl#L5889)

------
masukomi
why do people insist upon reinventing the wheel badly:

    git log --color=always <whatever funky coloring, options, etc you want> | aha > git_log.html

side note: aha is not installed by default on macOS but homebrew will fix that
for you. Also, it has many color and styling options.

~~~
chubot
It doesn't look like it has hyperlinks:

[https://github.com/theZiz/aha](https://github.com/theZiz/aha)

Compare:

[http://www.oilshell.org/release/0.1.0/changelog.html](http://www.oilshell.org/release/0.1.0/changelog.html)

Hyperlinks are the whole point of HTML :)

------
stephenr
How is it that git still doesn't have machine readable output built in?

~~~
sevensor
In what respect is --pretty=format not producing machine-readable output? Is
putting null bytes between the fields not machine-readable enough for you? You
want to use 0x1d instead? You could do that.

~~~
stephenr
In the way that you have to futz around like this with null byte separators to
try and avoid escaping issues.

Svn had a flag to output responses in xml specifically for machine consumption.

No ambiguity about what it would provide, or if it was escaped or how to
handle things like new lines etc.
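
A small Python sketch of what consuming that kind of output looks like. The sample document is hand-written to illustrate the shape of `svn log --xml` output, not captured from a real repository, and `entries` is a hypothetical helper. The XML parser handles escaping and embedded newlines, so no separator bytes are needed.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; a real run of `svn log --xml` would supply this text.
SAMPLE = """<log>
<logentry revision="42">
<author>alice</author>
<date>2017-09-19T12:00:00.000000Z</date>
<msg>Fix a &lt; b &amp;&amp; c
second line</msg>
</logentry>
</log>
"""


def entries(xml_text):
    """Return (revision, author, message) tuples from an XML log document."""
    root = ET.fromstring(xml_text)
    return [
        (e.get("revision"), e.findtext("author"), e.findtext("msg"))
        for e in root.iter("logentry")
    ]


for revision, author, msg in entries(SAMPLE):
    print(revision, author, repr(msg))
```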

------
Sir_Cmpwn
I don't see what was wrong with the first solution. Keep it simple!

~~~
qznc
If you use < or & in your text, it is not HTML5. Nor is it XHTML. However, it is
probably HTML4, isn't it?

~~~
Sir_Cmpwn
Probably. In any case, cross that bridge when you get to a release that has
one of those symbols in the changelog.

~~~
yorwba
Did you overlook the part where <& appears in commit messages?

~~~
Sir_Cmpwn
No?

~~~
yorwba
That's what was wrong with the first solution. The bridge was crossed just
when it should have been.

~~~
Sir_Cmpwn
Ah, I misread the article.

------
mattacular
> Some programmers might stop here and say, Let's switch to a real programming
> language. Do it the right way.

Isn't using Python switching to a real programming language?

~~~
hyperpape
It's a little misleading how it's written. He said "use a real language", but
the real distinction was using a library for the git api. He's only using the
Python stdlib for text manipulation as part of a pipeline.

------
whipoodle
I don't want to use an API and do it the right way. That's too complicated,
poindexter! (50 lines of garbage script follow)

