Hacker News
What Lua scripting means for Wikimedia and open source (wikimedia.org)
118 points by kibwen on March 18, 2013 | 73 comments



This is a big deal: Wikipedia is allowing contributors to extend the functionality of Wikipedia with new code (written in a powerful language), with minimal or no supervision. It's hard to predict what kinds of improvements we will see over time. Every Wikipedia article is now, effectively, an interactive application.

As far as I know, something like this has not been tried by anyone else at this scale before.

Quoting from the original post: "Anyone can write a chunk of code to be included in an article that will be seen by millions of people, often without much review. We are taking our 'anyone can edit' maxim one big step forward."


It's sort of been the case on Wikipedia already, but until now via a pretty crufty, home-grown templating system. Most people thought of it as a mere templating system, but it had #if/#else constructs for conditional inclusion, and some similar features, so you could program in it, if you tried hard enough (and some people did).


The good thing about Lua is that you can sandbox it entirely and set hard limits on memory/CPU time, after which the process exits. You can just give an error message if RAM grows over a limit or if it takes longer than a few secs to execute. It's pretty safe.


Is it possible to run a program in 'steps,' between which the state is paused?


The debug lib lets you hook behaviors on line changes, entering a function and leaving it, but nothing for single bytecodes iirc.

The debug lib, which allows taking control of the whole program, is disabled in most sandboxed environments, including Wikipedia's.

However, bytecode instructions and line changes are both poor metrics for measuring time, CPU, or any other kind of resource usage. Measuring process time is relatively easy and much more accurate.


I don't think it's possible with the stock interpreter, but I'm guessing it's pretty simple to modify the interpreter to do it. Why do you ask?


Simple time sharing between low-priority, potentially long-running programs submitted by untrusted users.


Run a process per script and let the OS handle it, probably.
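
A minimal sketch of the process-per-script idea, in Python rather than Lua: each submitted script runs in its own child process and is killed once a wall-clock limit passes. The helper name `run_untrusted` and the 2-second default are made up for illustration. This only limits elapsed time; it is not a sandbox and does not cap memory.

```python
import subprocess
import sys


def run_untrusted(code, timeout_s=2.0):
    """Run `code` in a separate Python process.

    Returns the child's stdout as a string, or None if the child was
    killed for exceeding the wall-clock limit. Illustrative only: a real
    deployment would also need memory limits and an actual sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout


quick = run_untrusted("print(6 * 7)")
stuck = run_untrusted("while True: pass", timeout_s=0.5)
```

Here `quick` holds the script's output, while `stuck` is None because the looping script was terminated by the parent. The OS does the time sharing between children; the parent only enforces the deadline.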


You don't even need to do that, you can run Lua in a multithreaded program, as long as no lua_State is accessed from multiple threads. You can spawn independent interpreter states, have them do their job and terminate. The whole Lua codebase as such is completely stateless (apart from the state that you keep for your code). The shorter the scripts actually run, the smaller overhead you get by this.


That isn't what I asked about, and it isn't really helpful.


Don't read it, I guess? I don't know any other way to do it, unfortunately. I'm sorry I wasted your time.


Why not? Isn't that the whole purpose of coroutines in Lua?


No, quite the contrary: coroutines only yield explicitly, due to a call to coroutine.yield(). They let you control your scheduling deterministically (although multitasking is far from being their only use)
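
To make the "explicit yield" point concrete, here is a small sketch using Python generators as a stand-in for Lua coroutines (an analogy, not Lua's API; `worker` and `round_robin` are illustrative names). Each task is paused only at its own `yield`, so the scheduler's interleaving is fully deterministic, and a task that never yielded would never give up control.

```python
from collections import deque


def worker(name, steps):
    """A cooperative task: it pauses only where it explicitly yields."""
    for i in range(steps):
        # Explicit yield point, comparable to coroutine.yield() in Lua.
        yield f"{name}:{i}"


def round_robin(tasks):
    """Resume each task in turn until all have finished; return the trace."""
    queue = deque(tasks)
    trace = []
    while queue:
        task = queue.popleft()
        try:
            trace.append(next(task))  # run the task up to its next yield
        except StopIteration:
            continue  # task finished, so drop it
        queue.append(task)  # otherwise reschedule it at the back
    return trace


trace = round_robin([worker("a", 2), worker("b", 3)])
print(trace)  # ['a:0', 'b:0', 'a:1', 'b:1', 'b:2']
```

Nothing here preempts a task, which is why this model alone does not give you time sharing between untrusted programs: a script that loops without yielding blocks everyone else.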


Note that GP didn't mention the means by which the yielding was to be invoked.


You can't yield if there is a C function lower down in the stack so I don't think you can always abuse the coroutine system to implement OS threads like that.


Are you sure about that? I thought that yields across C functions were added in Lua 5.2. Not to mention that in a controlled environment like the runtime that Wikipedia exposes to you, you can take care of avoiding exposing APIs that would allow people to code themselves into a corner like that (if the C calls are leaves, you're OK even in 5.1).


You are right that they could use a carefully crafted API to work around the problems. It's also true that yields across C were added in 5.2, but you need to code the C functions in continuation-passing style using a special API to take advantage of that.

http://www.lua.org/manual/5.2/manual.html#4.7


Well, two other options for you are using the coco library with stock Lua 5.1.x, or simply using LuaJIT.


> Every Wikipedia article is now, effectively, an interactive application.

I don't think that's quite true. MediaWiki templates are now more powerful, but I'm not sure even that counts as interactive. This makes editing easier, but the reader isn't going to interact with a Wikipedia page any more than they did last week.


I'm not quite sure if this applies, but can't anyone currently write a Lua script for World of Warcraft? In which case the script could possibly be connecting to and interacting with the WoW servers?

If one thing can be said about Lua, it is that embedding Lua is a solved problem, and not something dangerous to be worried about.


It's similar, I suppose. Like the Wikimedia example, the Lua you write for WoW is tightly limited -- there's an API available [1], and you can request some extra data from the server sometimes. However, there's absolutely no loading of external modules, contacting any sort of outside server, or any direct file access. There are also large swaths of "what the player can do" which are blocked off from the API, in the name of not allowing people to automate core game elements.

[1]: http://wowprogramming.com/docs/api


While more efficient templating sounds like a good thing and is clearly a pain point, I'm not convinced Wikimedia is really thinking this one through.

1. Every page is now a "real" program. Will separation of code and content result from this? The article hints that structured data is on the way, but I have a hard time imagining the average contributor will restrain themselves given this awesome new power. Instead I foresee bigger and badder messes.

2. Does this get us any closer to a parseable format for Wikipedia content? This is the world's knowledge. The fact that it's currently trapped in an ad hoc, PHP-inspired template "language" (there is no formal grammar for it, it's whatever some PHP code accepts as input) should be very concerning.

3. Why Lua? Javascript seemed like the obvious choice. Rendering templates server-side at Wikipedia's scale sounds incredibly challenging; it boggles the mind really. Surely most page views are from js-enabled browsers -- why not offload all that rendering to the client?


Lua was chosen over JavaScript for its ease of embedding. It's easy to limit how much memory the Lua interpreter uses to prevent users from DoSing Wikimedia servers with templates. V8 and SpiderMonkey do not make this easy. More details here:

https://www.mediawiki.org/wiki/User:Sumanah/Lua_vs_Javascrip...

Wikipedia pages are massively cached: rendered pages are cached in Squid, and I believe parsed page contents are stored in memcached. Templates are only re-rendered when pages are edited, if all goes well.
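
As a toy illustration of render-once-then-cache (a Python sketch, not MediaWiki's actual code; `PageStore` and its methods are hypothetical names): views are served from the cache, and the expensive render runs again only after an edit invalidates the entry. A real setup would also have to invalidate pages that transclude an edited template, which this sketch ignores.

```python
class PageStore:
    """Toy page store that re-renders a page only after it is edited."""

    def __init__(self, render):
        self._render = render   # expensive wikitext -> HTML function
        self._source = {}       # title -> current wikitext
        self._cache = {}        # title -> rendered output
        self.render_count = 0   # how many real renders happened

    def edit(self, title, wikitext):
        """Save new wikitext and drop any cached rendering of the page."""
        self._source[title] = wikitext
        self._cache.pop(title, None)

    def view(self, title):
        """Return the rendered page, rendering only on a cache miss."""
        if title not in self._cache:
            self._cache[title] = self._render(self._source[title])
            self.render_count += 1
        return self._cache[title]


# str.upper stands in for the real (slow) template renderer.
store = PageStore(render=str.upper)
store.edit("Lua", "a scripting language")
first = store.view("Lua")
second = store.view("Lua")           # served from cache, no re-render
renders_before_edit = store.render_count
store.edit("Lua", "an embeddable scripting language")
third = store.view("Lua")            # edit invalidated the cache
renders_after_edit = store.render_count
```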


It would be impossible to move rendering to the client.

Transforming wikitext into HTML is surprisingly IO-bound, since most Wikipedia pages transclude many other pages for templating and logic. It's very slow now, and you need to be close to the database for this to be remotely practical.

Lua gives the community a way to replace certain horrible templates which perverted transclusion and macro-expansion to do logic. But most Wikipedia pages will still transclude a lot of other pages.

Also, web browsers are not the only consumer of Wikipedia pages; you can also download PDFs and so on.


"Transforming wikitext into HTML is surprisingly IO-bound, since most Wikipedia pages transclude many other pages for templating and logic."

Sounds like a case for in memory caching. Or is it that any random Wikipedia page requires logic from a random subset of a large pool of other pages? Somehow I doubt that, with Zipf's law coming to my mind.

Also remember that an HTML5 front-end would be able to cache a substantial amount of this (a few megs at least) and Lua can be quite effectively compiled into JS.


MediaWiki uses lots of caching on the server, but the question is how hard it would be to move rendering to the client.

For a sense of how hard this would be, try using the Special:Export page on Wikipedia.

http://en.wikipedia.org/wiki/Special:Export

If you download transcluded templates, the article for Barack Obama is 781K.

My experience of Special:Export is that it has some flaws that cause it to miss some things it needs to export, so the real total may be much larger.

And that's just the data - one would also have to download a lot of related code, which might balloon that up to a megabyte or more.

Wikitext is particularly ornery (because it's just based on grinding regular expressions against each line, it is not easy to describe with regular grammars) so you'd have to download a very large parser, with various plugins as each MediaWiki install uses them to warp how Wikitext is processed. This is assuming some optimistic scenario where MediaWiki's rendering, and all related plugins, are entirely ported to JavaScript compatible with all desired browsers.

I'm not denying that, if you wanted to create a new Wikipedia from scratch today, based on JavaScript, you could probably move a lot of rendering to the browser. You would choose more browser-friendly formats, like JSON or XML, rather than making up some random text-based format, just because it was easy to type into a textarea. You would make transformational operations work in JavaScript, or be exportable to JavaScript. You could definitely get it to the point where it would be practical for quick previews while editing.

For the Wikipedia that we have today, it's really hard.


> It would be impossible to move rendering to the client.

?! I can play Quake in Javascript.


You could also play Quake over the internet in the 90s, which should show you that Quake is not IO-bound.


Why Lua?

I have been running Node.js and Lua at fairly large scale on EC2 recently.

I like the V8/Lua/Luajit platforms just fine. Each has some nice strengths.

With Lua, I see significantly less memory use, and moderately increased performance.

As long as RAM is the expensive bottleneck for hosted servers, it looks like I will be able to get better performance (in operations per dollar) by using a smaller runtime.

Why not client-side rendering?

Accessibility? Cross-platform uniformity? (better answer from 'capnrefsmmat: render once, then cache massively)


Also of note is that (AFAICT) the ultimate decision came down to a choice between Lua and Javascript:

http://thread.gmane.org/gmane.science.linguistics.wikipedia....


This makes me both happy and sad. I'm happy that Wikimedia is adding a code path. I'm sad that it's not JavaScript.

Wikimedia needs to move beyond 1990s internet and allow interactive articles. Articles on 3D that have interactive 3D. Articles on Physics that have interactive simulations. If you asked me in 1980 what an encyclopedia would look like in 2013 I would not have guessed static text + static pictures.

I know it's not an "either or" thing. They can have both. But I think once they get around to adding JavaScript they'll regret having added Lua now and having 2 code bases with 2 sets of libraries and code to synchronize the 2 when they need to communicate with each other.


I completely disagree. Lua is SUCH a cleaner, faster, more consistent language than JavaScript, ESPECIALLY when considering how easy it is to embed in another app (like binding it to PHP), that it's a no-brainer to use Lua. They were looking for speed, and LuaJIT is FAST; approaching C speed under the right circumstances.

What I'm sad about is that JavaScript is so locked-in on the browser that I can't use Lua there as well.


I would agree that Lua is cleaner. But it is not perfect either, for example indexing starting at 1 is definitely an oddity.

As for fast, I would love to see more comparisons on real-world code, there are too few on the web.

It is easier to embed Lua than JavaScript, however, that's say a week of work, then you're done, so it isn't a reason for something like Wikipedia to prefer Lua.

With all that said, Lua and JavaScript are both good languages, adding either one to Wikipedia is going to be a huge improvement.

edit: looks like my phrasing has offended mikemike, who I have the utmost respect for, so I removed some stuff he disliked.


>for example indexing starting at 1 is definitely an oddity.

Only to programmers raised in a C-based world. Normal people, mathematicians, and Lua start counting at 1.

I remember when I was new to programming I was horribly annoyed by the fact that C started counting at 0 - that was so unnatural! (the BASIC I had used up until that point used 1-based indexing too)

0-based indexing is good for low-level code, for memory and screen addresses. For high level concepts starting at 1 is the natural choice.

However, I admit after many years of C/C++ I was quite irritated by everything starting at 1 by default. My brain adapted after a few days, though.

I think the Lua devs made the right choice with the 1-based indexing. Just like 0-based indexing was the right choice for C. As I said both ways are natural but in different scenarios: 0-based for low-level machine code, 1-based for high level human concepts.


> Only to programmers raised in a C-based world. Normal people, mathematicians, and Lua start counting at 1.

I mostly agree, except for mathematicians ;) , where it depends on the context. For example aleph numbers start at 0,

http://en.wikipedia.org/wiki/Aleph_number

and even the natural numbers can start at 0 or 1,

http://en.wikipedia.org/wiki/Natural_number


I think the key here is that people ALWAYS complain about Lua indexing from one. Aside from that, not having a huge standard library is the second complaint. There are debates that come up all the time on the Lua list about what would be best, but I think those are really the main complaints people have.

Large standard libraries would mean huge potential for security holes in this case; they've locked features down to only support things that they could make secure.

And starting from 1 you can get used to. LuaJIT doesn't even REQUIRE you start from 1; it'll optimize arrays you start from zero. And even vanilla Lua will accept 0 as a table index; it just won't treat it as an array (for ipairs() or length).

But if you start asking people for complaints about JavaScript, you will find people complaining about operators acting wonky, auto-type-conversion problems from hell, and even odder things like [1].

Lua is just a cleaner design. I am stuck using JavaScript when I want to work on the web, but knowing what an elegant language CAN be makes me enjoy it a lot less. And don't get me started on how ugly NodeJS is because JavaScript doesn't have coroutines...

[1] https://news.ycombinator.com/item?id=3401074


LuaJIT is being used in various commercial games. You may not see that real world code, but it is out there.

Watch the YouTube video of the Wikipedia presentation linked elsewhere in this discussion.

From the video, they picked Lua also because they could have multiple concurrent Lua states (no globals), easy to override the default memory allocators for security, Lua VM instances are very cheap and fast to create and launch, and the code base was small enough that their security guy could actually audit all the code (and they even submitted a patch).

All these properties are hard to find with Javascript, but with Javascript, you have to go further and first decide which Javascript implementation to use. The ECMA spec says nothing about C interfaces or embedding so every Javascript engine (e.g. V8, JSCore, SpiderMonkey) is completely different.

They also said they are using the standard Lua interpreter and not LuaJIT because stock Lua happened to already be really fast and fast enough for their purposes right now.


> LuaJIT is being used in various commercial games. You may not see that real world code, but it is out there.

I didn't say otherwise, I know LuaJIT is used. What I did say was I have not seen speed comparisons on such real-world code - but I would love to see a link if you have one.

> and the code base was small enough that their security guy could actually audit all the code

A small codebase is indeed good, but JavaScript VMs have very large teams of people working on hardening them, both reading and reviewing the code and applying methods like fuzzing to look for security problems. The fact is that JS VMs are likely the most hardened of anything out there, simply because they are used in web browsers.

> with Javascript, you have to go further and first decide which Javascript implementation to use.

Lua has at least two implementations as well, mainstream Lua and LuaJIT.

> They also said they are using the standard Lua interpreter and not LuaJIT because stock Lua happened to already be really fast and fast enough for their purposes right now.

That's cool.


> Lua has at least two implementations as well

The difference is that the Lua C API is standardized meaning that C code can work with either just fine. The only big incompatibility problem is that LuaJIT provides an extra, non-portable, FFI library in addition to the standard one.


and luajit isn't 5.2 :(


The Lua index from 1 complaint is becoming the new Python whitespace gripe.


And just like the Python whitespace thing, in practice it doesn't seem to be a problem after the first day or so. You get used to it quickly.


Not if your brain is used to doing modular arithmetic since childhood. :/ Python whitespace and Lisp parens seem trite in comparison to that. I just get 1-based indices wrong more often than I get 0-based indices wrong.

(When doing things like "given an interval of indices a-to-b, partition it into n subintervals so that every element of the original interval is in precisely one of the sub-intervals, and the lengths of the intervals are either the same or as close to each other as possible", I find the "count from one, bounds are inclusive" model horrible. How do you express an interval from a with the length of zero? It's just <a,a) with half-open intervals. What, <a, a-1>? So many special cases with the <a, b> model...)
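
For what it's worth, here is roughly how that partitioning looks with 0-based, half-open intervals, sketched in Python (the function name `partition` is just for illustration). All boundaries come from one integer formula, and empty intervals fall out naturally with no special case.

```python
def partition(a, b, n):
    """Split the half-open interval [a, b) into n contiguous half-open
    parts whose lengths differ by at most one."""
    if n <= 0:
        raise ValueError("n must be positive")
    if b < a:
        raise ValueError("b must not be less than a")
    length = b - a
    # Boundary i sits at a + floor(length * i / n); consecutive
    # boundaries are therefore floor(length / n) or ceil(length / n) apart.
    bounds = [a + (length * i) // n for i in range(n + 1)]
    return list(zip(bounds, bounds[1:]))


print(partition(0, 10, 3))  # [(0, 3), (3, 6), (6, 10)]
print(partition(5, 5, 2))   # [(5, 5), (5, 5)]  (two empty intervals)
```

An empty interval starting at a is simply (a, a), and every element of the original range lands in exactly one part because each part's end is the next part's start.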


[deleted]


By all means please correct me when I am wrong. Is that limitation no longer correct?

I can find that link to the v8 blogpost, if you want?

edit: While I didn't mean them the way you interpreted them, I edited out the parts of my previous comment that you took offense to.


    LUAJIT_ENABLE_CHECKHOOK
edit: ditto.


Thanks, I am going to read up about that now.


> If you asked me in 1980 what an encyclopedia would look like in 2013 I would not have guessed static text + static pictures.

Static text + pictures is exactly the way I like my encyclopedia to be. It is simple, lightweight, works everywhere, printable, usable on e-ink displays, and does the job just fine. You can always link to interactive toys, but please keep the main articles this way.


They should certainly be useable for disabled/blind users, but it seems silly to retard the growth of human knowledge in this manner, just to please conservative elements.


Does it really gain something for human knowledge to add 1990s-style "multimedia" to everything? I had a version of Encarta with interactive doodads when growing up, but I don't think they added much to the encyclopedia.


A fair number of the articles I have read have animated gifs to explain some concepts that are difficult to explain with just text. You can certainly make 'doodads' if you so desire but that is like saying that there is no use for books, because you can write a trivial sex novel.


I'd say the biggest issue of going with not-JavaScript is the fact that if they used a JS library for template compiling, they could use it both server-side (like they do now) and client-side, eg. for previewing article/code changes with no round trip to the server and back. I figure it could potentially save quite a few CPU cycles, especially when it comes to more complex code creations which will require a lot of iteration and testing.


Good point (that js could be evaluated on client side as well). Although someone else said the template scripts need access to the database, and Lua can run in the browser as well (via emscripten for instance).


Via emscripten you have to download 2mb of javascript and it's slow as molasses. Via the other "best" solution, which compiles lua to javascript, you don't get the whole language, or access to any native libraries you might be linking to.


I think the use of the Lua language doesn't prevent Wikipedia from providing these features in the future.

They are mostly addressing server performance issues. Client-side processing is not affected.


What would JavaScript bring that Lua does not?


As pointed out, you can't run Lua in the browser. It would be nice to have articles like this in Wikipedia.

http://tests.web2py.com/physics2d/default/code/10

That points out that arguably Wikipedia will eventually have to allow JavaScript in pages it serves if it wants to progress.

When that happens there will now be 2 code bases. The Lua code that generates pages and the JavaScript code that does interactive examples for articles.

So the question is, why have 2 languages, which requires twice as many libraries, twice as much knowledge, twice as much expertise, and various glue to get them to interact with each other, vs having 1 language used for both?


> So the question is, why have 2 languages

One perspective: Because choosing a language because it is conveniently in place should not be the primary reason. Many people are comfortable with JavaScript, but there are many server side code bases in other languages because they are preferred for various reasons.

There is a large amount of PHP code deployed, but there does not seem to be much drive or concern that it does not run in browsers. Both JS and PHP earn a large amount of negativity because of their shortcomings (which we don't need to go into here), and they are obviously not the tool for every job.

I would simply make the point that choosing a language to avoid having to learn more than one is not an approach for improvement. Otherwise as I said in another post, we would all be using IE with ActiveX controls.


Many companies limit the number of languages used internally. There are both positives and negatives of course. The biggest positive though is that it makes it easier for programmers to understand code written by other programmers. It also means code written by one programmer will be usable by other programmers.

If the code is in a different language then both of those are often false. The 2 programmers don't speak the same language and even if they both happen to understand both languages they can't share code.


A user base, for one thing.


Which is the case for all technologies at one time. With that line of thinking, we would all be using Internet Explorer.

JavaScript was not born with a fully formed user base.


But the fact is that JavaScript does have a fully formed user base now.

The whole idea behind this effort is to facilitate contributions to Wikimedia from a diverse set of (presumably busy) contributors, correct? If so, what's the point in making them learn a new language in order to contribute, when an existing well-known language is a perfectly reasonable fit?

Sounds like someone had a Lua hammer, so everything looked like a Lua nail.


> Sounds like someone had a Lua hammer, so everything looked like a Lua nail.

Same could be said for most of the JS proponents in the thread.


Performance change after some Wikipedia templates were rewritten in Lua:

https://en.wikipedia.org/wiki/User:Dragons_flight/Lua_perfor...


Templates were already pretty complex a few years ago when I was a casual contributor. I am so glad they did this!

Now, if we could only go back in time and replace Vimscript with Lua or Lisp or anything...


Video presentation and demo: http://www.youtube.com/watch?v=PrhzAtC8fCc


Some of the examples were shocking. The inefficiency of str len astounds me.


To people grousing about it not being in JS: The hell is wrong with you? JS being the only language your browser will run is the problem, not people choosing to use something besides JS for something they do on the web.


Can anyone link to a good example of a simple Lua template that's live on Wikipedia right now? The linked article wasn't great for examples.



"But, because we’d never planned for wikitext to become a programming language, these templates were terribly inefficient and hacky — they didn’t even have recursion or loops — and were terrible for performance."

That sounds like the life story of PHP! And many other similar solutions.


Finally, I hope this makes Lua much more widespread.


This news made me look more closely at Lua. I discovered that it does not actually provide any support for Unicode. I am wondering now how this will impact the goals for Lua as a scripting language. Or is there a special module for Unicode support that will be used?


From the comments:

"People who have programming skills should enjoy better productivity and lower frustration while editing Wikipedia. Perhaps most exciting is how many people will have a gentle and practical introduction to programming because of this."

A lot of people on HN get down on Wikipedia because of deletionists etc. and support a forked version but the fact is Wikipedia is the most salient expression of the spirit of the internet. Google is great but is a business with all the downsides that entails, project abandonment etc. Maybe I am gushing but Wikipedia is without peer.



