For example, gtk-doc generates documentation from comments in C code. I'm guessing there is some Perl code to parse the C code, which I would replace with the great 'pycparser' library. With this method you would copy the existing code, fix it, then throw it away.
The code that generates HTML from the comments is likely sizeable. Sure, you could copy whatever the current Perl code does... Or you could replace a lot of it with Jinja2.
Perhaps these examples are incorrect in the context of this specific project, but I don't see the point in copying the lines one for one. Copy the meaning of the code into idiomatic Python from the get-go, and test the output against the known-good Perl code. I doubt Perl code copied one for one is ever going to be idiomatic Python code, so why bother, especially if it takes so long?
Edit: Yeah, tests are good. No tests are bad. Having a complete understanding of the code doesn't require translating it line-for-line. Re-writing a project in a different language is a breaking change, but in the context of this project your initial tests could be (for every bit of C code you can possibly find that has gtk-doc comments):
> gtk-doc.py > a
> gtk-doc.pl > b
> diff a b
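A minimal Python sketch of that comparison, assuming both tools can be run as commands that write the generated documentation to stdout. The `diff_outputs` helper and the example command lists are hypothetical; the real gtk-doc tools may need different arguments or write to files instead.

```python
import difflib
import subprocess


def diff_outputs(old_cmd, new_cmd):
    """Run both commands and return a unified diff of their stdout.

    old_cmd and new_cmd are argument lists (hypothetical here), e.g.
    ["perl", "gtk-doc.pl", "foo.c"] and ["python", "gtk-doc.py", "foo.c"].
    An empty list means the two implementations agree on this input.
    """
    old = subprocess.run(old_cmd, capture_output=True, text=True, check=True)
    new = subprocess.run(new_cmd, capture_output=True, text=True, check=True)
    return list(
        difflib.unified_diff(
            old.stdout.splitlines(),
            new.stdout.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Looping this over every C file with gtk-doc comments gives a regression suite without needing to understand the Perl line by line; any non-empty diff points at a behaviour the port has not reproduced yet.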
My point is that spending 1000000000 hours hand-converting Perl into Python, then rewriting it again into idiomatic Python, just to catch any theoretical small Perl-specific edge case in what is effectively a complete rewrite and a major version change, is IMO a bit pointless. That is especially so when converting Frankenstein Python-Perl into idiomatic Python may mean deleting large parts of that slow-to-convert code.
This is always the worst kind of Perl code imaginable to try to tinker with.
It will be fiddly. It will have exceptional cases piled upon exceptional cases. It will have hideous numbers of undocumented corners.
All because it isn't a real, actual, fscking parser.
Because we don't really need a parser for this, right now, do we, really?
Yeah, this is one of those cases where "Do the simplest thing" breaks down horribly and very rarely does someone have the wherewithal to apply 2x4 cluebats to the people who persist in writing parsers with regular expressions.
Yes, I have done so (in Python actually). However, my code has a little "counter" in the comments along the lines of "You have debugged a regular expression in this code: <n> times. Use a real parser you idiot."
It is depressing how large I have let that number get before I actually pull my head out of my ass.
Getting there though takes a bit of practice, from recursive descent parsers to parser generators.
You have to do the simplest thing that could actually, possibly work!
I did this once - converting a few kloc of Java to C# - I'm sure it helped that these are very similar languages in many respects. I actually copied the Java source code into C# source files, and then tweaked them until they compiled. The initial commit even had e.g. a "polyfill" for Java's ArrayList<T> class.
https://github.com/MaulingMonkey/poly2tri-cs
By virtue of being such a mechanical translation, it went very fast. I intentionally avoided translating into C# idioms - or even naming conventions - in the initial commit. I wanted to avoid refactoring the entire codebase simultaneously - my experience has been that multi kloc refactors are a great way to bite off more than I can chew, and spend a lot of time in refactoring or debugging hell, and to create an unreviewable mess, all of which is very slow.
By saving all the refactoring into C# idioms etc. for followup changelists, the translation went so smoothly that I still haven't learned the math behind constrained Delaunay triangulation. A non-1:1 translation I'm sure would've forced me to learn it, if only for debugging purposes. ArrayList<T> is now gone - replaced with standard List<T> and IEnumerable<T> from .NET's standard libraries, foo.setX(value) replaced with foo.X = value;, etc.
This isn't an approach I'd recommend for all such projects, but I'm surprised just how well it worked for that conversion. The initial translation took maybe a few hours (if that), and the subsequent cleanup a few more.
For instance, using refactoring, when possible, provides far better guarantees of correctness than tests do.
But now you have a new problem: a Chesterton's Fence (https://www.chesterton.org/taking-a-fence-down/) problem. The old logic is in the form it is (knotty and complicated, most probably) because previous developers made it that way in response to problems they were facing at the time -- problems you may not be aware of, because their old changes made those problems go away. Those knots represent places where those previous developers learned that the problem they were trying to solve wasn't as simple as it first appeared.
You can definitely simplify the porting job if you just ignore all that and write new logic that does The Simplest Thing That Could Possibly Work, but by doing so you can set yourself up for future heartburn as all those forgotten edge cases come roaring back. You end up re-learning the hard way lessons your predecessors learned the first time around.
Also you probably still have to understand every line of the old code, because within will be buried the many undocumented special case workarounds and bug fixes that code accumulates over time. See the classic Spolsky "Things You Should Never Do"
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
I'd argue that Perl made TDD mainstream, though we did not call it that 20 years ago. Well, 30, but I did not join in until the mid 90s. Perl's test protocol TAP has been around since at least 1988[1], and back when I was coding Perl for a living, I rarely encountered "known good perl code" that did not have tests (notable exception: usemod wiki).
[1] https://testanything.org/history.html
because wtf else are you going to do when you develop using lines like
next if (/^(#|$)/);
I just use this example because it ends in what is line noise. I'm pretty sure it means "don't process this line (skip to the next one) if it's a line that either starts with a #, or is an empty line". But how do I really know what it does? By testing it, of course!
When good code is line noise, how else do you know that it works? Of course you're going to test it. :D
-
EDIT: Downvoters aren't disagreeing with me. You can't downvote me just because you don't like the truth. Perl coders have a heavy culture of testing. It's the only way anyone writes Perl code - period. This is in contrast with C++ programmers, for example, who historically were more content to reason about their code. Historical fact - sorry.
This line is simple.
More complicated lines are...more complicated. Testing lets you know that code is working. Perl is the first major project that had amazing, obscene amounts of test coverage. I would say it is due to the grammar and difficulty.
I agree with you that it is clear and idiomatic Perl! Clear and idiomatic Perl is also easy to misparse using just your brain. Test code is what tells you what you wrote is doing what it is supposed to be doing.
Nobody writes Perl code, reads it carefully and reasons about it, and figures that if it compiles it is probably correct. Further, nobody has ever written code in Perl that way. Instead Perl has a heavy culture of testing.
You can argue with it but it's simply true. Contributors to CPAN have had more test coverage than the average project, and this has been true for a very long time. There's no real way to debate this.
my $comment_or_blank = qr/
    ^       # beginning of line
    \s*     # bugfix - any amount of whitespace
    (\#|$)  # comment char or blank line
/x;         # x modifier allows commented regexen.
next if /$comment_or_blank/;
if (line == "" || line[0] == '#') continue;
Such a thing may look cryptic when not used to regular expressions, however once used, it's actually easier to understand and less bug prone compared to imperative implementation (as there is less code, which means less chance to have a bug somewhere).
if (line == "" || line[0] == '#') continue;
next if (/^(#|$)/);
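For comparison, a small Python sketch of the two forms side by side. Python's `re` module is an assumption here, standing in for the Perl and C-like versions above, and the function names are made up for illustration. The two agree on lines whose newline has already been stripped, but not on a bare "\n", because `$` also matches just before a trailing newline:

```python
import re

# Same pattern as the Perl one-liner: line starts with '#' or is empty.
SKIP = re.compile(r"^(#|$)")


def skip_regex(line: str) -> bool:
    """Regex form: True if the line should be skipped."""
    return SKIP.match(line) is not None


def skip_imperative(line: str) -> bool:
    """Imperative form: empty string, or first character is '#'."""
    return line == "" or line[0] == "#"


if __name__ == "__main__":
    for sample in ["", "# comment", "key = value", "  # indented", "\n"]:
        print(repr(sample), skip_regex(sample), skip_imperative(sample))
```

That one disagreement is the kind of corner a test finds faster than reading either version does.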
I am making a simple statement about actual practices, not some kind of larger point. TDD came to Perl earlier. Simple fact.
Yes, as much as possible!
> This seems pretty wasteful.
No, it's not! The more 1 to 1 the rewrite is, the more of it can be done automatically.
Assuming your language gives you enough expressive power, it's often faster to write a library of shims for semantics and translate syntax automatically... in the worst case with a bunch of regexes - even that works better than manually writing the code from scratch, even if you read and drew inspiration from another implementation.
As an example, I started porting Python code, the PyParsing library, to Io[1]. I stripped docstrings for the moment, but if you open it side-by-side with the Python code, you'll see that it's almost line-to-line identical, except for the syntax. To do this I first wrote a library emulating some of Python features in Io[2], then passed the original code to a series of around 20 regexes, which resulted in not-working-but-close Io, which I manually corrected.
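As a rough illustration of the "series of regexes" idea, here is a minimal Python sketch of an ordered rewrite pipeline. The two rules are hypothetical and purely illustrative: they mimic the Java-to-C# cleanups mentioned earlier in the thread (setter calls and `ArrayList<T>`), not the actual Python-to-Io regexes, which are not shown here.

```python
import re

# Hypothetical rewrite rules, applied in order: (compiled pattern, replacement).
RULES = [
    # foo.setX(value);  ->  foo.X = value;
    (re.compile(r"\b(\w+)\.set([A-Z]\w*)\(([^()]*)\);"), r"\1.\2 = \3;"),
    # ArrayList<T>  ->  List<T>
    (re.compile(r"\bArrayList<"), "List<"),
]


def translate(source: str) -> str:
    """Apply every rule to the whole source text, one rule after another."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source


if __name__ == "__main__":
    print(translate("foo.setX(value);"))
    print(translate("ArrayList<Point> pts = new ArrayList<Point>();"))
```

Rules like these only get you to close-but-not-working output; the setter rule, for instance, deliberately skips arguments that contain nested parentheses, and those are left for manual correction.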
Now, writing 1.5k lines of a library (that's about half of PyParsing for now, it's not finished), not to mention 3 times as many lines of docstrings (which you can reuse if you keep your translation close to the original), wouldn't normally take just two days, even if you already understood such a library quite well. Yet that is exactly how long it took me using this approach.
Of course, how beneficial this method may be to you in practice depends on the target language and how well it matches features of the original or how easy it is to extend it. Io and Racket are two examples of languages extremely well suited to this approach, but you can do the same with JavaScript, Python, Ruby, Perl and many other high-level, meta-programmable languages, just with a bit more effort.
Anyway, my two cents, as I recently had such an experience :)
[1] https://github.com/piotrklibert/ioparsing/blob/master/src/pa...
[2] https://github.com/piotrklibert/ioparsing/blob/master/src/sy... and str.io
* you can reuse test-cases
* you have something to check the program flow against (in worst case with print statements)
* by the time you are done you really should know the code and step #3 is a breeze
PS: Your example of "just use jinja2" - have you verified that a rewrite is non-trivial? Have you checked that there are test cases you can run against your code, or are you risking having to verify manually that your new version does the same thing?
Unfortunately strict determinism and test coverage often go hand in hand. Either the developer gets it or (s)he doesn't.
Fortunately you often do get at least a moderate level of determinism even from the most awful software.
It's not that bad. I've converted many programs from one language to another. For example, in the 1980s I converted the FutureNet DASH schematic editor from 16 bit x86 assembler to C (so it could be ported to other platforms). I converted my game Empire from BASIC to Fortran to PDP11 assembler to C. I've converted parts of Optlink from assembler to C. I've converted a lot of the DMD compiler from C++ to D.
The conversions I've done have all produced valuable results.
It's actually rather enjoyable work, but I like obsessive detail work.
When the codebase has evolved for a really long time, and you can see the different coding styles of different generations of developers, you get to feel like an archeologist or a geologist who's digging through multiple layers of rock.
Some of the best parts:
1. Trying something totally new only to realize that one of your predecessors has attempted to do it before. Bonus points if you can solve it this time around.
2. Digging around and realizing that your predecessor was in fact yourself.
3. Randomly encountering one of your predecessors in real life and becoming instant friends while reminiscing about old times. ;-)
That is true. There are so many things I can work on, I try to pick only the ones with the highest ROI of my time.
> feature parity
The initial conversion is feature parity. This is deliberate, as conversion only works if you ruthlessly avoid any attempt at fixing, improving, or refactoring code. Just translate.
But once the translation is done, and the new program is working exactly like the old one, then the benefits start accruing as you can start removing the technical debt, and take advantage of what the new language offers.
Well there is some greater context here, namely that they want to put all their tools on one platform. That's not at all meaningless, though I too doubt the importance here. The poor deployment story of scripting languages likely does not play a big role here.
"Note that GTK-Doc wasn't originally intended to be a general-purpose
documentation tool, so it can be a bit awkward to setup and use."
So pure 'rewriting' is not usually what you want. You want the same functionality but in an easier to use and maintain package. The approach I usually take is finding a well supported project that is already solving the same problem and extending it to solve mine.
In this case, a documentation generator seems a suspect thing to rewrite as there are already many of those out there. I'd look at extending Doxygen to support GTK-Doc's syntax (or automating a GTK-Doc to Doxygen translator).
Anyway, this is an approach I've used with some success before. It can't always work: sometimes the problem you're solving is fairly unique, or the available open source projects don't have much better code bases than the one you're trying to replace.
As for "Chesterton's Fences", I say tear 'em down (but instrument and be prepared to roll back). Sometimes the only way to tell why something exists is to remove it. This is effectively just paying the price of technical debt - a cost of doing business. The higher cost is to live in a world littered with rusting fences.
On the other hand, you can’t expect to keep your entire software world standing still. At some point, the operating system or even the hardware will make it really hard to keep an old code base going, and you’ll have to do very creative things just to keep it all working. And no, you typically don’t have the freedom to force people to use some ancient system for your benefit; you’re not in a vacuum, your users are doing lots of other things and their other apps are going to keep moving even if you don’t.
You will reach a point of real regret if you don’t spend at least some time to move code to more modern concepts/languages/libraries. It doesn’t have to be 100% at once, nor does it have to be an outright replacement; leverage multi-language bindings, testing frameworks, etc. and beta users to make progress. And whatever you do, never release a “MyApp 5” that is a completely different code base than “MyApp 4”; this just aggravates people when nothing works quite right. You need MyApp 4.1, 4.2, 4.3, 4.4, etc.
You do not have to when you choose a more modern safer language that can export functions with C linkage and can trivially call C functions. You decide to use a new language and write new functionality or functions that have to be rewritten anyway.
For instance, gcc first switched to g++ as the default compiler and started allowing a subset of C++. Firefox started using Rust here and there where/when it makes sense. There are countless other examples.
I think very few folks would actually propose converting a 100KLOC or an MLOC project to another language. (Go did it with their runtime and compiler :), though that's quite a special case.)
Should I rewrite X in Y?
/ \
/ \
/ \
| |
| |
Am I doing this just Is my team full of
for the features in Y? experts in Y?
| |
Yes ___/ \___
\ | |
\ No Yes
\ | |__________
\ | |
\ Are there any experts |
\ on my team in Y? |
\ | | |
\ No | |
\ | Yes |
\ | _____/ |
\ | | |
\ | Did they |
\ | propose it? |
\ | | \ |
\ | Yes | |
\ | | No |
Don't rewrite. | |
| | |
| Were you going to
| rewrite it anyway?
| No |
|______________| Yes
|
\
\
\
\
\
\
\
\
\
\
|
Think about it.
[0] http://roscidus.com/blog/blog/2014/06/06/python-to-ocaml-ret...
That is why, when I run an engineering team, I constantly push back against "clever" or "fancy" stuff. Engineers love that stuff, eat it up. I get it, it's fun. But when they move on to something else and a mere mortal has to maintain, fix, or convert it, the cost of that clever stuff becomes apparent. I've had to be that mere mortal and I can assure you it is a miserable experience.
I've often thought that one of the most useful things about me is that I'm not smart enough to be that clever, I like simple, obvious code. It's a little bit of a lie, I'm smart enough, sort of, to be that clever but it's a lot more work. And I know that any code that is more than 6 months old, doesn't matter if I wrote it or someone else wrote it, it always feels like someone else wrote it. And man, do I love it when that someone else went for straightforward rather than clever.
But unless you can point at benchmarks, nobody will listen to you, and the better choice may even be described as blub.
> Manually converting code from one format to another is the most boring, draining and soul-crushing work you can imagine.
> we can estimate that a sustained rate of conversion one person can maintain is around 100 lines of code per hour
> This gives us a clear answer on why people don't just convert their projects from one language to another: There is no such thing as "just rewrite it in X"
If there's some indication that the request is from someone qualified and willing to help, maybe you give it more consideration.
I'll ask someone to write a one page summary of the project goals. Similar result.
Or are you implying that, because you don't know a measurement that has a causal relationship with it, that you can't be improving it?
Three projects come to mind. 1) Rewriting a set of internal libraries from Python 2 to 3, because it's better to do early on, before compatibility problems arise. 2) Rewriting an ancient CLI tool written in C++ into Go (nobody here knows C++, some speak Go). It has worked for years untouched, but you never know. 3) Rewriting a Fortran modelling tool into Go, because while people know Fortran, its maintenance costs are annoying and adding functionality is difficult or impossible.
In all cases, no functionality was gained or removed by the rewrite, but future pain was spared thanks to these endeavours.
The "1 hour per 100 lines" rough number from the article is probably optimistic, at least in some settings. As one example, a coworker of mine manually converted a complex 500-line file, and it took two days to convert, one day to go through code review, and introduced two bugs.
Probably any reasonable strategy needs to find a way to work incrementally and focus on the most valuable parts first. For example, if you focus your efforts on converting files that are the most active, then you may be able to culturally move your team to the new language even if there's still a lot of legacy code in the old language.
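One way to find "the most active" files is to count how often each path shows up in version-control history. A minimal Python sketch, assuming the project is in git and that the log text was produced with `git log --name-only --pretty=format:`; the `rank_by_activity` helper is hypothetical and commit count is only one possible proxy for activity.

```python
from collections import Counter


def rank_by_activity(log_text: str, top: int = 10):
    """Count how often each path appears in `git log --name-only` output.

    log_text is assumed to come from:
        git log --name-only --pretty=format:
    which prints only the touched file paths, one per line, with blank
    lines between commits. Returns (path, commit_count) pairs, busiest first.
    """
    counts = Counter(
        line.strip() for line in log_text.splitlines() if line.strip()
    )
    return counts.most_common(top)


if __name__ == "__main__":
    sample = "a.coffee\nb.coffee\n\na.coffee\n\na.coffee\nc.coffee\n"
    print(rank_by_activity(sample, top=2))
```

Converting from the top of that list down means the files people touch every week move to the new language first, while rarely touched legacy code can wait.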
In terms of the three phases from the article, I'm hoping that decaffeinate can become stable enough that it completely automates step 1 and avoids the need for step 2, but step 3 will always take time. In my case, my plan is to do a fully-automated conversion over the ~150k lines of code (broken up into maybe 10 chunks), call the style issues tech debt, and slowly clean those up as we work through each part of the code.
[1] https://github.com/decaffeinate/decaffeinate
[2] https://github.com/decaffeinate/decaffeinate/blob/master/doc...
[3] https://github.com/codecombat/codecombat/issues/4276#issueco...
People do it all the time, every day. Code has side effects and features you don't necessarily need. NIH is usually, "don't need most of that here" or "language X would do it cleaner/with-less-bugs(tm)", which is different wording for the exact same reasoning. So I'm not sure why the focus on this line-by-line conversion that nobody would choose to do.
On the other hand, rewriting my Java chat server in Erlang (in 2008) went from 5000 lines to about 500. That was a point-by-point functionality conversion of my own project, for existing Java clients.
Take for example a rewrite of coreutils in Rust. The LS utility in original C (https://github.com/goj/coreutils/blob/rm-d/src/ls.c) and its rewrite in Rust (https://github.com/uutils/coreutils/blob/master/src/ls/ls.rs) are quite different. The rewrite doesn't implement half of the original options. It's reasonable to assume that not many people even use the rewritten LS, so it contains bugs that the original widely used version doesn't have.
The bottom line is that if you want to achieve comparable quality and features, the time and effort invested in rewriting some code is almost the same as the time and effort invested in creating that code from scratch.
If it's reasonably complex but also reasonably well designed, it can be completely but incrementally rewritten by module, without the problems you describe.
OTOH, those are usually the cases where the existing system causes less of the pain that drives a desire to rewrite. Chances are if you want to rewrite it, it's going to be a nightmare.
Every time I've been handed a perl script to 'fix', if I think I'm ever going to have to touch it again, I rewrite it in python. Our perl code is a migraine-inducing mess of unmaintainable punctuation vomit.
More on the conversion here: https://lpeer.blogspot.co.uk/2010/04/switching-from-c-to-jav...
* working on the codebase full time
* intimately familiar with the quirks
* relying on the transpiling to be mostly automated
IMO as far as rewrites go, that's a pretty tall order.
I'm personally contemplating a rewrite of an old project (mostly due to GTK having gone downhill in past years), and harbour no misconceptions about the scale of the effort. It's only about 10k lines of python, across two different projects, but if it takes me less than 2 months (of active spare time work), I will be very much surprised.
Out of curiosity, what's the app, and what UI toolkit would you port to? (And would you stick with python?)
I am tempted to move to Qt instead, and for a number of reasons will probably try Go instead.
1) As much as I have traditionally disliked Go's dependency management, with the vendoring support it can be made reasonable
2) I never got python's threading to work properly, which made the UI unresponsive when dealing with database operations; and threading in python is a hack anyway
3) I believe (cough hubris!) that I can simplify the internals by further splitting the architecture
One of the best ideas I ever had was to forego all attempts at subprocess management and just go for DBus and its process autolauncher instead. I mean, have you ever tried to make a fork/exec work from within GTK software without causing X hiccups?
https://blog.forrestthewoods.com/the-eighth-dirty-word-just-...
Only allow coding that lets the company change its mind.
If you choose a language which has conservative, well established features, devoid of parsing quirks, and you enforce very strong coding standards per-team, then it is possible for every program to be composed of per-team idioms which can then be automatically translated to idiomatic code in a different language through isomorphic changes.
You will still need to have a project to port your program, but instead of trying to outstrip the rate at which production code changes, the effort is instead going towards perfecting a software machine which translates the program in an instant. By doing it this way, you can always upgrade your technology without slowing down the production team, or without trying to outstrip a production team which is already trying to move competitively fast.
I know this trick works. I've done it. What makes this challenging:
- Management will be afraid of this, because it's unconventional
- Teams may not want to or be able to usefully adhere to such standards
(Think of it this way: Do you think it's easier for a very capable team to outstrip your ability to keep up with features, or your ability to keep up with new programming idioms? If your team is competent, they will be quickly producing new features, but seldom producing new coding idioms!)
