Wikipedia starts work on $2.5M internet search engine project to rival Google [pdf] (wikimediafoundation.org)
437 points by e15ctr0n on Feb 14, 2016 | 184 comments

This was/is actually an extremely controversial project. The corporation (basically the Executive Director) pursued the grant and the idea without soliciting input or really disclosing it to the community of editors, and eventually one of the community-elected trustees was removed for questioning the lack of transparency. The community has a long list of software improvements that they'd like to see to the core platform.

A recent employee survey showed only 10% of WMF staff approved of the Executive Director, probably in large part due to things like this.

A critical take on the project as it has been handled: http://permalink.gmane.org/gmane.org.wikimedia.foundation/82...

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2... is a pretty good (but dated) overview from Wikipedia's weekly newspaper, but there's a few others in the Signpost and a few blog posts across the web.

> and eventually one of the community-elected trustees was removed for questioning the lack of transparency

This was a hypothesis about his removal when he was initially removed, and it has since been refuted by multiple sources.

[citation needed]

There's a new Signpost edition, covering this grant document and also three other internal documents that were leaked.


A bit more detail is shared in this comment below: https://news.ycombinator.com/user?id=chris_wot

Wikipedia needs to solve its management issues before it starts developing new software...

I've vouched for this comment.

The biggest problem is the lack of data about what people are searching. It's a catch-22 that's very hard to break in the face of Google's search dominance and ubiquity.

Because Google is the best, it only gets better, which creates a huge barrier to entry for competitors. It used to be possible to know what people were searching for to end up on a given Wikipedia article, but that data is now only available asynchronously (and in limited form) through Webmaster Tools[1]

In my mind, the most interesting aspect of the announcement should not be how much money they have to spend, but how they plan on solving this paradox.

[1] http://webmasters.stackexchange.com/a/60350

I think Facebook should be able to build a search engine. I don't know why they don't have one yet.

Facebook wants a single platform to provide the Web through. If they built a search engine, it would only search Facebook.

They already have a Facebook search engine. I've noticed that when I start typing a friend's name in the search bar, my friend's matches appear below pages I haven't liked or discussions, and if I search for an artist I've liked, other results appear above it in the quick-search dropdown. I haven't noticed sponsored search results, but I'm sure they're working on it.

Example of a search query URL. Verified pages appear to get more weight. https://www.facebook.com/search/top/?q=president

Facebook search is broken, twitter search works much better.

It's intentionally broken.

Despite many real (though also some exaggerated) counter-examples, Facebook does have features to protect privacy.

One of those things is you generally can't get very useful results from search unless you're friends with someone, or a friend of a friend (depending on the user's privacy settings). You can't see general trends, or even search for every person with a given first and last name in an area, for example.

It used to be more open, but they've heavily restricted the breadth of data returned from searches within the past few years.

It can't even find anything in my own posts. If it's a permission problem, that's a pretty fuckin' serious permission problem right there.

On top of which, Graph Search is still disabled in many regions, for reasons that were never explained.

> It can't even find anything in my own posts.

Ditto. I have seen this as a crippling deficiency in the Facebook platform. Not being able to search properly in my own posts or in the groups I'm a member of really sucks. For all the engineering prowess, open sourced tools, etc., shown by Facebook, the lack of a working search makes the company seem incompetent from the top down.

Some time ago I started storing important information (links from others, my own comments, etc.) outside of Facebook, where I can find it easily. I also started a Facebook page and added Notes to it to make things easier to document, find, and share. It seems ridiculous that I have to do this just to keep access to my own information, but that's been the sad state of things for years.

FB search is terrible because there is no incentive for an engineering team to work on it; in fact, they may be quietly discouraged from doing so. FB makes money from the news feed, so any feature that distracts users from scrolling their feed will come up losing in an A/B test whose ultimate metric is ad views and clicks.

It's only broken because most people share with their friends only and search cannot expose their posts. This makes real time search way less efficient than twitter.

That's what security trimming is for. You index everything, including the permissions for each item and provide the searcher's permissions (whether it's an acl, a claim, or some other form) as part of the query. Anonymous or unauthenticated users get a "public" claim that only returns results available to everyone.
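The mechanism can be sketched in a few lines of Python. This is a toy illustration of the idea, not any real platform's implementation; the documents, claim names, and matching logic are all made up:

```python
# Hypothetical sketch of "security trimming": every document is indexed
# together with the set of principals allowed to see it, and the searcher's
# claims are applied as a filter at query time.

INDEX = [
    {"id": 1, "text": "public product announcement", "acl": {"public"}},
    {"id": 2, "text": "friends-only vacation photos", "acl": {"friend:alice"}},
    {"id": 3, "text": "group discussion thread",      "acl": {"group:hiking"}},
]

def search(query, claims):
    """Return only documents whose ACL intersects the searcher's claims."""
    claims = set(claims) | {"public"}  # everyone holds the "public" claim
    return [doc["id"] for doc in INDEX
            if query in doc["text"] and doc["acl"] & claims]

# An anonymous user sees only public content; a friend of alice sees more.
print(search("photos", []))                # []
print(search("photos", ["friend:alice"]))  # [2]
```

The key point is that trimming happens against the index, so the engine can still rank over everything it has crawled while never leaking a result the searcher couldn't open.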

Also, one might suppose, by being the best they may have become lazy in their core offering. That's what theory tells you about monopolies anyhow.

Keyword Planner tool tells you what people search for.

Google uses a highly sophisticated set of data and algorithms to determine what to present to whom, when they search for what, from where. The algorithms are obviously secret, but any user paying a bit of attention will notice the results are clearly not generated from the text in the text field alone.

Summary of the approach (p10):

"1) Public curation mechanisms for quality;

2) Transparency, telling users exactly how the information originated;

3) Open data access to metadata, giving users the exact date source of the information;

4) Protected user privacy, with their searching protected by strict privacy controls;

5) No advertising, which assures the free flow of information and a complete separation from commercial interests;

6) Internalization, which emphasizes community building and the sharing of information instead of a top-down approach."

My first thought: How will transparency impact SEO? Will spammers be able to better game the algorithm when they know its internals?

However I am excited at the prospect of a wikipedia-like public curation system for the entire web. I admit I'm flabbergasted that the whole thing ever worked, but it does.

1) Public curation mechanisms for quality;

The Mozilla / open directory project tried this. Curation doesn't scale and often assumes a single unifying ontology. This is particularly problematic in a cross-cultural context. Besides, 'quality' is not a unidimensional metric in a result set: consider timeliness, authority, notability, uniqueness, comprehensibility, etc.

2) Transparency, telling users exactly how the information originated;

Most search engines already include a URL. I can see a [crawldate] button, like the [cache] or [translate] buttons on each hit, adding some information, but it would be of dubious additional utility for most searches.

3) Open data access to metadata, giving users the exact date source of the information;

As above.

4) Protected user privacy, with their searching protected by strict privacy controls;

We have DuckDuckGo already; friendly competitors are welcome, but it's hardly a unique offering, nor a trustworthy one given Snowden's revelations about the scale of systematic Five Eyes traffic monitoring/recording.

5) No advertising, which assures the free flow of information and a complete separation from commercial interests;

DDG or Google or Bing with plugins can supply this. Not ground breaking.

6) Internalization, which emphasizes community building and the sharing of information instead of a top-down approach.

This is so amorphous as to be a non-point.

So out of six points, 2 things (33%) are only useful in edge cases, 1 thing (16%) is too vague to be useful, and the other 3 things (50%) are currently implemented by others and have been tried before.

I would like to see the input of the former Blekko guys on this, https://news.ycombinator.com/user?id=ChuckMcM + https://news.ycombinator.com/user?id=greglindahl

> Curation doesn't scale and often assumes a single unifying ontology

Wikipedia is a pretty big exception to that assertion. Perhaps DMOZ (a clone of Yahoo circa 1996) is not the only way to do curation. Perhaps Wikipedia could apply what has worked for Wikipedia, i.e. develop a set of POV-neutral criteria for organizing collections of links and then invite everyone to participate.

It's really easy to be negative. But that's something that might at least be an interesting research project for the #1 open-curation system in the world.

You make a fair point. I'm not rubbishing Wikipedia, just questioning the supposed USP. I would also point out in response to your argument that a Wikipedia article and a set of search results are apples and oranges.

The article is written once then modified or evolved occasionally by (almost exclusively) humans, but very frequently read. It is intended to be intelligible, being structured and based in natural language. It has a very well defined scope within a flat namespace, and often clear relations to multiple formal ontologies. It is structured to be consumed in part or in whole, and may contain rich media and strong supporting contextual information (related pages).

By contrast a search result summarizes a set of potential information sources that may answer a search query in whole or in part, to various definitions of "answer". It is generally written once, by a computer, and thrown away after some period of caching. It is intended to be concise. Each component result has relatively poor context, relying upon the searcher to interpret timeliness, authority, notability, uniqueness, comprehensibility, etc. with the limited information presented, typically a very short content excerpt. It is structured to be scanned, classically in a ranked fashion from "best hit" to "worst hit", and is generally a wall of text.

Wikipedia successfully attracts people to contribute to the former, but the latter - where the information product is generated on the fly and lasting impact is amorphous (nothing particularly concrete for contributors to point to and say "I did that! Warm and fuzzies!") - is a very different beast.

I too believe there is room for innovation ... there are potentially low hanging fruit like inter-linguistic semantic queries (not keyword search) ... but there are no such key problem areas identified in the paper's summary.

The other big problem is that curating search results is inherently about prioritising a position rather than establishing a sourced and reasonably neutral version of the truth.

I'm imagining the edit wars and debates that take place on contentious wordings or facts in some parts of Wikipedia, but on a much wider scale involving hundreds of SEO consultants each aware that changing a particular criterion will have a quantifiable impact on their clients' bottom line. It doesn't sound like it would be fun to police.

Wikipedia already curates links to some extent on every page under "External Links". So there is a seed there.

And even the page text is not immune from the problem you describe. Grading and prioritizing sources is a fundamental part of producing a "reasonably neutral version of the truth." It's what determines what gets cited and how prominently it influences the article.

So while I wouldn't equate text and links in terms of the difficulty of managing POV-neutrality, I would say they sit on a spectrum.

There was a remark recently that most Jeopardy answers are Wikipedia titles. Consider Wikipedia as an ontology, with Wikipedia titles as the vocabulary. A search engine could associate articles with relevant Wikipedia titles, and try to do the same with queries. The first step of search is then relatively straightforward.
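A toy sketch of that first step, in Python: treat Wikipedia titles as a controlled vocabulary and tag both documents and queries with the titles they mention. The title list and the substring matcher here are purely illustrative; a real system would need entity linking, disambiguation, and redirects:

```python
# Treat Wikipedia titles as an ontology vocabulary and tag text with them.

TITLES = ["Albert Einstein", "General relativity", "Jeopardy!"]

def tag_with_titles(text, titles=TITLES):
    """Return the known titles that appear (case-insensitively) in the text."""
    low = text.lower()
    return [t for t in titles if t.lower() in low]

query_tags = tag_with_titles("who proposed general relativity")
doc_tags = tag_with_titles("Albert Einstein published general relativity in 1915")

# Retrieval step: documents sharing a title with the query are candidates.
print(set(query_tags) & set(doc_tags))  # {'General relativity'}
```

Ranking within the candidate set is still the hard part, but anchoring both sides to a shared vocabulary sidesteps a lot of keyword ambiguity.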

> We have duckduckgo already

DuckDuckGo is a meta-search engine! It relies mainly on the Yahoo BOSS API, which uses Bing search (for most countries). The BOSS API went from free to expensive in early 2015, and the future of Yahoo (the tech company, not the Alibaba stock) is uncertain.

We definitely need more search engines; only 6-7 exist with broad international coverage. Search HN for the list; we've had this discussion before.

I agree with you, but this quote was in the context of the claim of a privacy USP. You have taken it out of context.

BOSS is being discontinued on March 31st, so I imagine that DDG has some sort of idea what they're going to be doing after that date.

Any insight?

They could use the Bing API directly or just stick with Yandex.


They are using Yandex now with a bit of their own crawling.

Curation scales for some topics. It was hard to build a curated list of Linux sites, they come and go. But there are only a couple of new good, comprehensive health websites per year.

I was Blekko's founder/cto. And it's worth noting that our founding team was the Open Directory Project's founding team. Blekko's curation data was even better than dmoz in its day. Check it out: https://github.com/blekko/slashtag-data

Duckduckgo is ad free? I never knew this. How do they make money?

My mistake. You are right, they do have ads, I just always use a blocker. I've updated the text.

Ddg has an option to disable ads, they just ask that you help promote them.

> 4) Protected user privacy, with their searching protected by strict privacy controls;

They'll have to keep the servers outside the USA then. It's illegal for European organisations to transfer personal data to the USA now that Safe Harbour is invalid.

"Public Curation" doesn't make quality. It makes a mob-rule system where only the most popular ideas flourish.

Nonsense. It means content will be filtered through the lens of one or more individuals, and the results vary dramatically. Mob rule is one possibility. Yahoo Directory used to be a great example of mid-level quality: it gave you a nice start and surfaced obscure stuff people overlooked. On the high end, the link below shows the Stanford Encyclopedia of Philosophy setting a pretty awesome precedent for high-quality curation:


I'm pretty sure that the spammer argument is just an excuse used by Google to allow them to keep their business practices out of public scrutiny. Google search results are biased in favour of content produced by those who have money and power.

Google ranks everything based on popularity - Not based on quality. Popularity and quality are two independent concepts and not necessarily related. That's something which Wikimedia understands but which Google doesn't.

Google does take quality into account. That's the whole point behind static ranking algorithms. However, quality isn't some universal concept. I'd say the inevitable paper on gravity wave detection is highest quality on it, but it certainly isn't popular because it's impenetrable to the lay masses, unlike say a Wikipedia article, which falls into look-at-me-i'm-oh-so-smart territory when it gets into double gradients and other math formulas with more letters than digits.

If you think "spam" isn't the defining problem of web search then you've never tried building a search engine. It's 90% of the problem.

Google does take plenty of quality features into account[1]. PageRank is one of course, but that isn't some corporate conspiracy, it's that it is a good feature.

[1] https://moz.com/search-ranking-factors/correlations

I'd ask you to cite your claims, but we both know you can't. It's a pity your issues with Google cause you to pollute discussions with BS.

Do a Google search for anything even slightly obscure, and you're likely to find the first page or so of results filled with highly-SEO'd sites that offer little in the way of deep, detailed content. The smaller sites which do have that content, but just haven't been SEO'd much, have been eclipsed. They're still there, but rendered nearly inaccessible.

Interesting discussion on a search engine that does sort of the opposite of Google: https://news.ycombinator.com/item?id=3910304

I hear this a lot, but I'd love to see an example (especially including the sites that should rank).

This may or may not be what you're looking for, but my freelance site just can't hit the front page in my industry when dozens of highly funded agencies can dominate it. I've published a 50k word industry-specific book on my site, have SEO'd as much as possible and have an older domain than those better funded. Won't link to it here, but it seems to be a real issue to me.

If a search engine let you differentiate and sort between content match vs pagerank match vs Adwords spend we might be able to mitigate the issue somewhat.

btw, your site's redirect to the HTTPS version doesn't seem to work correctly in Firefox, Safari, or IE. After reading your comment, I was curious to learn more. When I typed in just your domain name plus CMD+ENTER (which adds "www." and ".com" to the address bar text in Firefox), I got a 404 page, not the 301 redirect to the HTTPS site. When I add "http://", the redirect seems to work.

Thanks for letting me know, and for the more detailed info - the redirect worked for the basic permutations I tested for (www.domain.com, and http://domain.com etc.) but I am redirecting non-www and www traffic to https://www. in Nginx. (Solution found, see edit below.)

I'd not heard of the CMD+ENTER method before, so thanks for the heads up. Still not entirely sure what Firefox is submitting in that case. Will test.

I wasn't referring to this site in the parent comment, but to my freelance site. I'll put that site in my profile for 24 hours just in case anyone wants to take a look.

EDIT: Fixed, as noted in reply to nl's comments. A recent change led to a redirect line being mistakenly commented out.


I don't work in this area, but I'd say that 99% of the time I hear someone complaining about how Google is favoring sites that pay for advertising over them I find that they are making these incredibly basic errors.

For me, http://www.linguaquote.com/ gives a 404. It's only when I go to https://www.linguaquote.com/ that it works.

That's all well and good, and thanks for checking, but I wasn't referring to that site in the parent comment. The other site does have SSL enabled, but only recently and via Let's Encrypt. The issue is much longer standing than this.

So I'm afraid it's not quite as simple as you make out in this case.

In other news, I've just pinpointed the missing line in the recently changed nginx config for Linguaquote; the http:// block had its redirect commented out. Still, for this site in particular, Google Webmaster Tools is set up for the https version, where no errors have been reported, and SSL Labs gave an A+ for the stapling, PFS, Heartbleed, etc. efforts I went to. I don't think this redirect was having an adverse effect on ranking, but I don't expect this site to hit the front page just yet - much more content to add before aiming for that.

Is it bad form to quote myself?

I'd love to see an example (especially including the sites that should rank).

I put the link to the freelance site in my profile, which I only mentioned in my reply to cpeterso above - apologies for not making it clearer.

Will leave it there a bit longer in case you do come back to this thread as still genuinely interested in your opinion on the matter.

One example I've experienced is searching for iphone jailbreak related stuff. Perhaps that's to be expected.

If you've ever read about the original PageRank algorithm, the parent post is a pretty reasonable way to describe it.

I have no idea what the current algorithm looks like but I'd be shocked if it somehow switched to evaluating the 'quality' of content, however one might do that with an algorithm.

Well, they do have some quality metrics, like duplication with other content, words used, and so on. I suspect more, e.g. writing-style measurements correlated with other things that are found useful. There is a lot that could be done without actually understanding the content, although of course it can be gamed.

The primary measurement used for Google's PageRank algorithm is the number and "value" of backlinks a page has. The "value" of a backlink is determined by the cumulative number of descendant sub-backlinks that it has. This is common knowledge among SEO professionals.

Basically, when judging quality, Google is making assumptions like: "This page has a lot of backlinks, and those backlinks themselves have a lot of backlinks... Therefore this page is of high quality." This approach puts all the power in the hands of content providers (bloggers) who are funded by big companies (or well-funded startups) and who serve the interests of those companies.

Google wrongly assumes that content-providers serve the interest of consumers and that they can be trusted - Which is not the case.
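For anyone who hasn't seen it, the backlink mechanism being described is easy to sketch as a power iteration in Python. The graph, damping factor, and iteration count here are toy values for illustration, not Google's actual parameters:

```python
# Minimal PageRank power iteration over a tiny link graph.

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns a score per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share, plus a damped share of the rank
        # of each page that links to it.
        new = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            share = rank[page] / len(outs) if outs else 0
            for target in outs:
                new[target] += damping * share
        rank = new
    return rank

# "c" is linked to by both other pages, so it accumulates the most value.
graph = {"a": ["c"], "b": ["a", "c"], "c": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c
```

This makes the criticism concrete: the scores depend entirely on who links to whom, so whoever can manufacture the most well-connected backlinks wins, regardless of what the page actually says.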

This is a very, very, very small amount of money if you want to build a search engine, let alone one "to rival Google" (source?). Looks like the goals are realistic, though - look how wikipedia search could be extended beyond results from wikipedia.org, build some test sets. And get a better idea what it really is that is supposed to be built.

I run a knowledge engine project that actively mines facts from web-based sources and third-party data dumps. It was featured on the front page of HN a while ago, and has a total of $10 in funding (from a single donation; not a typo). I have, however, put a ton of time into it, and it's something I'm very passionate about. I'm fairly confident Wikipedia can have success making initial headway on their grant objectives with $250,000.

If anyone's interested, here's a demo of my own project: https://tuvalie.com/fae/?q=Albert%20Einstein

Over 15 years ago, for my undergraduate thesis I set up a "Hypermedia Textbook" on the history of my field. I had to manually collate all the info, manually scan in every photo, and type in every last bit of text and html. The end result was a couple hundred pages that looked very, very similar to what your Einstein page looks like! At the time, I knew a better way would emerge, but didn't know how or when. It's moments like this that I (a) feel old :( and (b) am amazed by the times we live in and the speed at which things are happening! :) Thank you for providing such a wonderful, if unintentional, moment of self-reflection!

Thank you for checking it out! And if you have any ideas for how things can be improved, I'd love to hear them :)

So apparently Einstein has his own, official website (einstein.biz). Who knew? It has got a proper shop and everything!

Treat yourself to this relatively expensive USB stick:


Cuil spent about $30M before they went bust. By the time they went under, due to a total lack of a revenue model, they had a halfway decent search engine that did its own web crawl. So that's a data point on how much it costs.

If only Cuil came about after the Snowden revelations, they might have been able to make a real name for themselves.

That's been a real boost to DuckDuckGo.

I run a search engine that indexes an entire western country for a beer's worth of VPS money. It took me a few months to build, though. Given the money they have, I don't see scaling it to the whole world as hard.

Building a basic search engine is relatively easy. Building one that rivals Google is extremely difficult, and not just because they're so big and convincing people to switch is hard. It's much easier to have good results when you know that the websites you're indexing don't care about you at all. Once you get popular enough to rival Google everyone and their mother will be trying to game you and that changes the problem significantly.

The original implementation of the Google search engine would get obliterated today, though I guess you have to start somewhere.

There's another incumbent advantage here: I imagine it's much easier to provide good results when you also have data on which results thousands or millions of people clicked on for millions of search terms.

Google started with less, and so did almost every project on the internet.

The basics of search are pretty well established today. Just because it initially cost Google a ton of money doesn't mean it would cost nearly as much today.

This holds true for nearly every human endeavor.

It is a small amount of money, and the database has to be huge to index a lot of websites. So the servers will need large storage and enough capacity to serve users without outages.

I imagine they will build a proof of concept and get more money for it later. Have volunteers work on building it to save money. Open source the project and have others look into fixing the issues with it.

I think "this is a very, very, very small amount of money if you want to build an encyclopedia" was probably said by a bunch of people in the early Wikipedia days.

Weird; they have annual drives to raise money to keep the site running; I would not expect that they'd have 2.5m lying around to do pet projects like this...

If you look at the papers in the link, a private foundation is donating 850k solely for this purpose. I didn't read all the legalese in the paper, not sure where the rest is coming from.

So apparently it's not money of the Wikimedia foundation funding this project.

Most non-profits, especially the ones that are always asking, usually have a lot of funds.

Before I give, I go to GuideStar, hit the free preview (they try to trick you into a paid membership), download the last few years of 990s, and see if everything looks copacetic. I look at who is making the most money; there is usually one person making a very good living. California non-profits are much easier to scrutinize than Delaware non-profits.

They are required to give you the 990 form if you email them usually.

In any case, https://wikimediafoundation.org/wiki/Financial_reports

And the internet archive is a lot more deserving.

I believe in the Internet Archive's mission even if I don't use the site that often, but I use Wikipedia too much; I can't justify not donating to them when they need it.

FWIW the archive has a lot of 990s archived at [1]. We love Wikipedia, why not donate to both? :-)

[1] https://archive.org/details/IRS990

the same thing I was thinking.

I support it 100%. I love Google but they have too much power and I'm sure they'll start taking advantage of that soon (like they did with Google+ and Youtube).

I can't see it being much better. I find Wikipedia's impartiality lacking; it's very US-centric.

For example something like the history of the Alaskan panhandle as seen from the US perspective is totally different when seen from a Canadian perspective.

I never use Wikipedia as a primary source of info; even for the linked sources, I try to use at least three independent sources.

I would certainly like to see "Reliable and trustworthy information" but who do I trust who is reliable?

I don't know if you can trust anyone, I mean bias is everywhere. You can pick a side.

For example, the Portuguese Wikipedia, shared by Brazil, Portugal (and other countries), has many controversial matters around the last 500 years of colonization, where both sides have academic work supporting their contradictory views. Which views prevail?

An example of things being done differently is some ex-Yugoslav countries (Serbia, Croatia, Bosnia, Montenegro, etc.), whose languages are more similar than Portuguese and Brazilian Portuguese, and each of which has its own Wikipedia, with different articles on the same subject depending on their point of view. Lately, I've been seeing more of the Serbo-Croatian Wikipedia, which I think aims to unite the others.

I don't know which way is better, I'm just a user.

Another reason you can't trust anyone, and this is general to the Internet, is that shilling, commercial and political interests aiming to change perception are everywhere. On reddit or facebook, with or without sources. It's the worst aspect of the internet for me these days.

Google has been the dominant search engine for a solid 12 years. They have a practical monopoly. They're under constant anti-trust scrutiny - of one form or another - in all of their major markets. And you think they're going to start taking advantage of their position (as in particularly egregious behavior) soon?

No need. It's generating $23 billion per year in operating income and growing. There are no serious challengers. It's far more likely their search engine will be constrained under piles of government oversight in the coming years. About the time governments start worrying about products like this in tech, is about the time they're just beginning to become less central. The exact same thing happened to both IBM and Microsoft.

You do? How much of that $2.5M is your money?

The title is misleading. It's 250k, not 2.5m and the goal is a knowledge engine, not a search engine.

Not only is it a search engine, but it is a grant application that has had WMF staff leaving in droves, and has greatly upset many, many others - who will quite likely also leave.

It's very, very sad. And it's also a shameful moment for the WMF.

edit: and don't just think it's me saying it. The WMF has had a mass exodus of staff in the last week or so. If you speak to any WMF non-executive staff members directly, you'll quickly find out that morale is at an all time low, and confidence in the WMF Board is sitting at something like 12%.

Can you say more/explain? What about the application is so upsetting? What is shameful about this?

The Knight Foundation is about as upstanding as you can get, so it can't be that (full disclosure, I've received funding from them, so I'm definitely not unbiased on that point). So, what exactly is it that's so shameful here?

To be clear: my issue (and in fact, most peoples' issues) are not with the Knight Foundation. In fact, they appear to have been above board in every way in this whole debacle. It is the WMF board who are the problem here.

See my comment here for just a few comments on this issue:


Frankly, there's a lot more - to understand the issue better you might want to read Liam Wyatt's blog posts:


and here:


Thanks – I appreciate the references

That's OK, ironically it was Wikipedia that taught me to always back up my statements with references :-)

As an aside, I think the internet is simultaneously great at spreading absolute bs and disinformation and pushing people to have citations handy... paradoxically, both seem to be getting more frequent. (It all depends on where you browse, clearly).

Yeah, nobody knows this more than myself. I created [citation needed] and I've watched it be misused for years. I am glad I came up with the idea, but I'm resigned to the fact that it's human nature to misuse a valuable idea.

What's wrong with this project? It seems wildly ambitious, but maybe the blockchain makes it possible to succeed where Wikia failed previously.

Ironically, it's not the project itself that is the problem. It is the way it was done. It is causing massive, massive problems internally within the WMF, and frankly it's spilling over into the wider Wikipedia community. I can't speak for the other projects, though; I don't know enough about them to say what the general feeling is around there.

It's not nice to be the only one here on HN pointing out that there are some absolutely massive problems going on at the WMF at the moment, but I'm an outsider who was once an insider and I still know enough influential people through Facebook and other mechanisms to see enough to know that there is a crisis happening right now within the WMF.

edit: I should note that, as an outsider who doesn't ever really want to be hugely involved in Wikipedia-related matters again (for various personal reasons not necessarily related to Wikipedia or the WMF), I don't really have any fear in stating what I see - nobody can really come back at me so I have no fear of any reprisals.

Care to explain? Do you have some links/sources?

Sure. I'll start off with the following email from Liam Wyatt:


The grant application you are looking at was only revealed due to a MASSIVE amount of controversy and pressure within the Wikimedia Foundation.

The community representative (James Heilman) on the board was let go the other day, in part because of concerns around this grant. You might want to look at the Wikipedia Signpost article he wrote about this:


Many people have questioned this. Lila, their Executive Director, seems to have conjured this up out of thin air, without consulting any WMF staff members or anyone in the various communities. Even highly influential, well-respected people like Tim Starling appear to have been blindsided by this.

Here is what Lila Tretikov wrote about the search engine:

It was my mistake to not initiate this ideation on-wiki. Quite honestly, I really wish I could start this discussion over in a more collaborative way, knowing what I know today. Of course, that’s retrospecting with a firmer understanding of what the ideas are, and what is worthy of actually discussing. In the staff June Metrics meeting in 2015, the ideation was beginning to form in my mind from what I was learning through various conversations with staff. I had begun visualizing open knowledge existing in the shape of a universe. I saw the Wikimedia movement as the most motivated and sincere group of beings, united in their mission to build a rocket to explore Universal Free Knowledge. The words “search” and “discovery” and “knowledge” swam around in my mind with some rocket to navigate it. However, “rocket” didn’t seem to work, but in my mind, the rocket was really just an engine, or a portal, a TARDIS, that transports people on their journey through Universal Free Knowledge.

From the perspective I had in June, however, I was unprepared for the impact uttering the words “Knowledge Engine” would have. Can we all just take a moment and mercifully admit: it’s a catchy name. Perhaps not a great one or entirely appropriate in our context (hence we don’t use it any more). I was motivated. I didn’t yet know exactly what we needed to build, or how we would end up building it. I could’ve really used your insight and guidance to help shape the ideas, and model the improvements, and test and verify the impacts.

However, I was too afraid of engaging the community early on.

Why do you think that was?

I have a few thoughts, and would like to share them with you separately, as a wider topic. Either way, this was a mistake I have learned enormously from.

(this can be found here: https://meta.wikimedia.org/wiki/User_talk:LilaTretikov_(WMF)...)

That's a very, very real problem. An executive director of the Wikimedia Foundation should never have felt too afraid of engaging with the wider community on an issue as fundamental as this one.

It's even more concerning that a half-thought through idea didn't get discussed and yet a grant application was made. All those who say that the application is only for $250,000 are entirely missing the point - the entire project would be $2.5 million, this is just the first, initial stage.

It's even worse when Jimmy Wales states that:

"To make this very clear, no one in top positions has proposed or is proposing that WMF should get into the general 'searching' or 'try to be Google'." [1]

Yet that is precisely what is being done here.

The WMF appear to have known about this, because they seem to have made a large number of hires dedicated to search - which, I hear through contacts, was questioned at the time, as it seemed an odd way to allocate WMF resources.

There have been, in the last week, I believe 5 or 6 influential staff leave the WMF. In fact, they appear to be haemorrhaging staff currently, with no real sign of any abatement.

None of this is at all satisfying to me. I was very, very involved in Wikipedia years ago. I started their Admin Noticeboard, and I did lots of article work, and helped kick off some key things, one of which was the [citation needed] tag which I have to admit I have some mixed feelings about. But for such an important project, it saddens me greatly to say that as an outsider now, it looks like things are being badly mismanaged.

I hope for everyone's sake (and not just the folks at the WMF) that this can be resolved. It's not like governance issues can't be addressed - when Sue Gardner was in charge of the WMF, things not only ran like clockwork, but she ensured maximum transparency, and we all trusted her implicitly because she earned that trust. I can't say the same for the current Executive team.

1. http://www.theregister.co.uk/2016/02/12/wikipedia_grant_buil...

It may be of little importance vs your excellent references and what they show but... does anyone else notice how the way she worded that message is just... so... weird? The wording comes off like a combination of academia, PR, and email scams to me. Just straight BS that no normal, caring person in a mission-oriented organization should ever say.

I mean, there are certainly styles I'm unfamiliar with. I'm always open to new experiences. Could be the case here. Hers just instantly set off red flags in my intuition. I hope she didn't always write like that, as it might mean whoever brought her in either fell for a con or was part of it.

Yeah, it's super weird. I'm usually pretty understanding of corporate communication, especially when it's going to be public, but those paragraphs are just bizarrely phrased. "Ideation," "retrospecting", "beings," "universe," frequent references to a "rocket." It reads like one of those crazy-people websites from the 90s with a dozen different fonts.

Her native language is Russian, and she was born in Russia. I always assumed that was the reason the language used is like this.

I've been critical of her "rocket" imagery, but I like to think I'm understanding about the odd use of English in the rest of her comments. Especially as I'm monolingual, heaven only knows how I would sound if I tried to learn and speak Russian amongst Russians...

Thanks, I did wonder if she is a native English speaker or not. It didn't have any of the telltale grammatical or sentence construction errors, just really weird word choice.

That makes sense. Could be. I'm with you on heaven knows how retarded I'd probably sound trying a foreign language haha.

Indeed, and morale at the WMF office is pretty damn low. https://www.facebook.com/photo.php?fbid=10154689170123475&se... for an example.

250k grant, matched by WMF to 2.5M.

http://i.imgur.com/w89dQ4i.png sounds like a search engine to me...

It is a search engine called the knowledge engine. That is what I read of it.

This is exciting. Recently I started to work on an answer engine/search engine. It still sucks but it's a good project to work on when bored


In a few weeks I'll publish the source code and do a Show HN.

I wish a lot of luck to




too. DuckDuckGo also started to crawl the web with its own bot (right now they're using Yandex's API).

We need more competition from different countries. Just think about the censorship done by Baidu or how Google never plays by its own rules.

It's also interesting to think about a way to monetize a search engine. For kairos.xyz I was thinking about paid accounts (1 euro per month) providing more features, like the ability to search from the command line. For example you write "kairos Richard Stallman" and it prints basic information about Richard Stallman on your terminal.
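The command-line idea is easy to sketch. Below is a toy Python version that uses Wikipedia's public page-summary endpoint as a stand-in backend (the endpoint is real; the `kairos` framing, the function names, and everything else here are invented for illustration, since kairos.xyz's own API isn't described in this thread):

```python
#!/usr/bin/env python3
"""Toy sketch of the command-line idea: `kairos Richard Stallman`.

Wikipedia's public page-summary endpoint stands in for a backend;
a real kairos client would call its own API instead.
"""
import sys
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/api/rest_v1/page/summary/"

def summary_url(query):
    # Wikipedia titles use underscores; percent-encode the rest.
    return API + urllib.parse.quote(query.strip().replace(" ", "_"))

def lookup(query):
    # Fetch the summary JSON and return its plain-text extract.
    with urllib.request.urlopen(summary_url(query)) as resp:
        return json.load(resp).get("extract", "(no summary found)")

if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(lookup(" ".join(sys.argv[1:])))   # needs network access
    else:
        print(summary_url("Richard Stallman"))  # offline demo of the URL
```

Wrapping something like this in a one-line shell alias is all the "search from the terminal" feature would take on the client side.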

(Default nginx page showing up on your website)

It works for me. Care to show me a screenshot?

The host (assuming the same host) is responding with a different website when accessed via IPv4 vs IPv6.

  $ curl -4s http://kairos.xyz/ | grep title

  $ curl -6s http://kairos.xyz/ | grep title
  <title>Welcome to nginx on Debian!</title>

  $ host kairos.xyz
  kairos.xyz has address
  kairos.xyz has IPv6 address 2604:180:0:a54::24d9

Thanks, I found the problem. I thought that with Nginx, IPv6 would just work, but I had to add

    listen 80;
    listen [::]:80;
to my server block.
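For anyone hitting the same thing, a minimal dual-stack server block looks roughly like this (a sketch with placeholder names and ports, not the actual kairos.xyz config):

```nginx
server {
    # One socket per address family. Since nginx 1.3.4 the IPv6
    # listener defaults to ipv6only=on, so the IPv4 listen is
    # still required alongside it.
    listen 80;
    listen [::]:80;

    server_name example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;  # placeholder backend
    }
}
```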

"Welcome to nginx on Debian!

If you see this page, the nginx web server is successfully installed and working on Debian. Further configuration is required.

For online documentation and support please refer to nginx.org

Please use the reportbug tool to report bugs in the nginx package with Debian. However, check existing bug reports before reporting a new bug.

Thank you for using debian and nginx."

I meant here: http://kairos.xyz/

When I make a request from another IP, it gives me the right page.

" (...) Try...

Who is Richard Stallman? define bravado 10 USD to EUR RFC 2460 generate username help - about"

This is so strange. If I fetch the page from http://archive.is I get the same default Nginx page


The only thing I can think of right now is that the HTTP "Host" header field is not sent. I have several sites on the same server; Nginx is used as a reverse proxy and uses the Host field to route traffic to different ports.

You should really consider using multiple server blocks instead of relying on the Host field.

I do. Nginx does the matching using the Host field http://nginx.org/en/docs/http/ngx_http_core_module.html#serv...

Hmm, I could've sworn that server_name doesn't rely on the Host header. My mistake.

I also get the Nginx default page when I click on your link from my iPhone. I'd take a screenshot but not sure how to link that from here.

Link to Wikimedia's wiki page on the project includes a decent FAQ: https://meta.wikimedia.org/wiki/Knowledge_Engine

Anyone want to hazard a guess at the technology they plan to implement to get this started?

Surely this is not designed to be written from scratch, so..

- Are they using known lexical & semantic scanners?
- Is it focused on the English language first?
- What crawlers will scan content?
- I'll assume it's an open platform, but what license for contributors?
- What database architecture will hold the graph?
- How does it know the mark of authority, and is this primarily based on human input or machine learning?

I'm sure $2.5M won't touch the sides, but maybe if it's a well directed project, with healthy user contribution, based on interesting technologies, they might develop a good backbone architecture. Ambitious for sure.

Computing is so cheap now that Google isn't going to be so dominant in text search for long. Their money is needed for video, pictures, and audio, but the text internet can now be cached whole by small entities.

Maybe Wikipedia should launch a video encyclopedia to try to provide a 5 minute video of every article, for people who like videos more than reading.

Doesn't it say 250k in the letter?

The grant amount is $250,000

Correct. As per page 9, most of the budget allocated for the project comes from Wikimedia themselves, totaling $2,445,873 for the fiscal year 2015-2016.

do these funds come from the donations they solicit on wikipedia.org?

Yes, it's the main (only?) source of income for the Wikimedia Foundation.

There are substantial donations from other foundations and company match programs (Huuuge page):


I guess the foundation donations probably aren't a response to the Jimmy banners; the company matches probably are.

Misleading title. I don't think they want the result of the grant to rival Google

I am financing a $10 project to rival Tesla.

I'm not kidding when I say that if they want to know where to spend the $2.5M, I would start with cleaning up their core codebase. IMO the MediaWiki open source code is a disaster.

EDIT: Not because it's written in PHP. Because it's architected poorly.

It's funny you should mention that. That was a point that apparently a number of WMF staff expressed, and it was apparently ignored.

There used to be a team dedicated to making MW Core better / cleaner, but that was lost in a re-org earlier in 2015.

That's a darned shame. I'm aware that there are a lot of areas that people want to fix on MW Core.

I get really concerned when I hear that the person who holds the vision and direction for the Wikimedia Foundation didn't really participate in it beforehand, and I get even more concerned when I see that she branches off into proposals for search technology that appear to be far outside the scope of Wikimedia projects.

Nobody has ever thought search in Wikipedia or the various projects was particularly effective. However, bringing everything together doesn't just involve searching, and frankly there are a number of more pressing governance and community issues that need to be managed.

Perhaps I'm being a bit unfair here, but she was profiled when she first joined the WMF Board, and the following was said about her:

At the meeting, she described the impact on friends and family of the Chernobyl nuclear disaster, and the difficulty of getting reliable information in the face of “so much secrecy.”

Yet we see that this is precisely what happened with this grant proposal. A major grant was applied for and awarded and not even WMF staffers knew about it. You can see on the mailing list that it was a total shock when it was finally revealed.

I'm watching this train wreck from afar, but closer than others because some of my friends are deeply involved in Wikipedia and the WMF. I'm always amazed that a leadership change can completely kill an organisation. I've seen it in the corporate world, and I see it all the time in the volunteer world as well. The Wikimedia Foundation seems to be yet another victim of the appointment of a clueless leader, with no experience in the area or with the group they are meant to be leading, thrashing around, making changes without really understanding how systems work or the history of the organisation, and without relying on the experience and sage advice of the many expert and dedicated people around them - ultimately leading to a great deal of unnecessary turmoil, ill-will, and frankly destruction in their wake.

If nothing else, I hope it improves the currently abysmal search features for Wikipedia today.

Turns out that is the only thing this project is about. There is no web crawler and no external content. The grant and the money the WMF is spending are going toward improving internal search at Wikipedia.

Good luck. We definitely need more search engines. (Google's announcement that it will lower the ranking of non-HTTPS sites is a clear indicator that they are about to cross the line into monopoly territory. And no, DDG and most others are "just" meta search engines that rely on Yahoo BOSS ($$$), whose future is uncertain and which itself relies on Bing.)

There was "Wikia Search" by Wikipedia founder Jimmy Wales:

"Wikia Search was a short-lived free and open-source Web search engine launched by Wikia, a for-profit wiki-hosting company founded in late 2004 by Jimmy Wales and Angela Beesley.

Wikia Search followed other experiments by Wikia into search engine technology and officially launched as a "public alpha" on January 7, 2008. The roll-out version of the search interface was widely criticized by reviewers in mainstream media. After failing to attract an audience, the site closed by 2009."


I used Wikia Search back then, it was good enough (like Bing in comparison to Google back then).

It was based on Apache Nutch and Solr/Lucene (and possibly Hadoop)...

Maybe you can rely on Lucene or SphinxSearch projects to kick-start.

Google doesn't want to lower the PageRank of HTTP sites; it just wants to use HTTP vs HTTPS as a feature in ranking. That isn't particularly surprising - I would be willing to bet Google already uses hundreds of such features (one of which might be PageRank).
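"One feature among many" can be pictured as a weighted combination of signals, along the lines of this toy score (the feature names and weights are invented for illustration; real search ranking uses learned models over far more signals, none of which are public):

```python
# Toy illustration: a ranking score as a weighted sum of features.
# Feature names and weights are made up; the point is only that
# HTTPS nudges the score rather than hard-penalizing HTTP.

WEIGHTS = {
    "link_authority": 5.0,  # something PageRank-like
    "text_match": 3.0,      # query/document relevance
    "is_https": 0.5,        # a small nudge, not a veto
}

def score(features):
    return sum(WEIGHTS[name] * float(value)
               for name, value in features.items()
               if name in WEIGHTS)

page = {"link_authority": 0.8, "text_match": 0.9}
boost = score({**page, "is_https": 1}) - score({**page, "is_https": 0})
print(round(boost, 9))  # the HTTPS boost is small and bounded
```

With dozens or hundreds of such features, any single one - HTTPS included - can only shift a result a little, which is why "use as a feature" and "lower the PageRank" are different claims.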

Given the terrible state of advertising, I would welcome a search engine that penalizes pages with popovers, animated ads, auto playing audio, and so on. Google would never build this, given its business model.

I hope Wikipedia brings some innovation to search, untethered from advertising revenue.

Google does do this AFAIK. They don't get any money when people turn on adblock, and they have always leaned towards simple ads, as well as filtering out sites with crummy UX in search.

Google does actually penalize this. It's also not that easy to catch given the layers involved.

As a former NLP engineer and former wikiHow engineer, I have some perspective on this. Google has included more and more information from Wikipedia. Furthermore, Google includes snippets of external websites in the knowledge box on more and more pages.

How long will it be until Google can algorithmically generate its own Wikipedia articles? Wikipedia relies upon people coming to its site for contributions and donations. Without search, Wikipedia risks being subsumed by Google. They are in the difficult position of thinking about the future without pissing off Google.

Computers are getting more and more powerful. Wikipedia needs to act to stay relevant. I think this is the right decision.

I don't know how I should feel about all those donation campaigns they usually run, after this.

Looks like this is a grant specifically for this work (at least in part.)

I think this is an incredibly good use of their money. Google is the world's biggest surveillance machine, I hope that Wikipedia can do to them what they've already done to Encarta.

I've read a lot of the inside-Wikimedia links here, and I'm confused about all the talk of gnashing of teeth and rending of cloth. This is controversial because some want to pay down technical debt rather than have a small team do knowledge graph search?


It's not a small team compared to the size of the organisation. And you mischaracterise the situation: this is a problem with engagement, transparency and openness.

Yeah, the relative merits of the initiative seem to be beside the point. If you have a toxic environment, even a proposal to cure all disease for everyone for free will attract derision.

It's hard to get worked up over some other team's morale. If it's such a crappy place, just quit. They could probably literally go across the street and get a new job. I don't really care. It's all way too inside baseball for me.

You cared enough to comment. This wasn't just a place of work for me, I volunteered my time because I believed in what they were doing.

You might never have contributed in a significant fashion to Wikipedia and other WMF projects, but I did. Sure, I didn't get employed, but then again I know a lot of people I met and have continued to be friends with who are still deeply involved. You may not care about your friends' morale, and you might think it's easy for people to "go across the street and get a new job", but then you seem like a pretty thoughtless person.

Of course, you've not understood at all what the larger issues are. You must have a bit of a comprehension issue, because I supplied quite a few links that you apparently read that explained the underlying problems.

Just remember though: you don't care :-)

0) Stop with the personal attacks. It's rather unbecoming, and I don't appreciate it.

1) Being genuinely confused about what the big deal is isn't the same thing as caring. It's asking for confirmation of a conclusion.

2) If you're in a situation where you're unhappy, then you have a responsibility to make yourself happy. Staying around in a crappy situation and whining about it doesn't help, and neither does insulting people.

3) Wikimedia is in San Francisco. If I had to take a guess, I would say there are literally a hundred other tech organizations in that city alone, including nonprofit organizations with a societal purpose. 18F comes to mind. Again, see 2.

0) you don't seem to realise how you come across

1) you literally wrote "I don't really care".

2) they aren't whining. Saying so is pretty much a personal attack. It's certainly insulting. I don't appreciate it, and I'd say neither do they. Funny how that works both ways.

But interestingly enough, as has been pointed out already - people ARE leaving in droves.

I'm no longer involved in Wikipedia, but I can still be unhappy with the direction they are taking.

3) if you think that just leaving a non-profit you have emotionally invested in is an easy decision, then you really haven't thought things through. If you think it's elementary to just step out of one job and into another, that's also thoughtless.

That's how much 5 qualified software engineers would cost to employ for a year (gross, including compensation, benefits, payroll taxes, office space, hardware, etc, and that's on the low end of the range). Good luck with that.

They plan to hire 8 engineers, 2 data analysts, and 4 team leads. Relevant details on page 8.


Why do they need 4 team leads for 8 engineers?

WMF projects are usually open to volunteer help

So? I've found 1:10 to be a good ratio between leads and grunts. 1:5 if lead is technical and also contributes work and not just leadership/management. And that's when those 10 (or 5) people are working full time.

It costs half a million to employ one software engineer for a year?

A software engineer qualified to work on this kind of thing is worth about $350K in combined compensation on the market right now. Typically half of that is base, while the other half is stock and other taxable benefits. The number can be higher. This is the cost of just compensation to the company, excluding the payroll tax. You can, of course, find someone a lot cheaper, but then you'd be a fool to expect the result to be anywhere near as good as what Google can pull off, because if that someone could do what Google can, why would she work for half the compensation instead of applying to Google or FB or whoever pays competitive salaries these days.

Cost per employee is much greater than just their salary.

As a former Wikia employee, I am somewhat of a MediaWiki insider. I sped Wikia's search engine up by several orders of magnitude and then went on to pilot a number of NLP/machine learning initiatives in the company.

Jimmy Wales already tried to make a "Google Killer" ten years ago. It was tilting at windmills, to say the least. Letting individuals help manage algorithmic search results was harder than you could imagine. Let's not even get into the difficulty of building an effective crawler.

One of Wikia's former CEOs, Gil Penchina, notoriously undervalued search as a result of this very public gaffe. By the time I came in, it took over five seconds to do a simple on-wiki search. Searching across wikis took so long they actually just sent the search to Google and had you abandon the site. I personally fixed a lot of these problems, and that part was pretty cool.

So now let's get to the subject at hand, which is a search feature based on an authoritative knowledge graph. Something like this should adequately surface factual information in an intuitive manner -- optimally based on natural language. Wikia already tried this, too. They brought on a very seasoned advisor who played a crucial role in the semantic web movement far back into the early oughts. I remember going to semantic web meetups in Austin when I was in grad school quite some time ago now to hear this guy talk.

This guy was essentially the SF-based manager or lead for a small team located in Poland whose job it was to take some of the "structured data" at Wikia and attempt to build some kind of knowledge graph on top of it. This project was unsuccessful.

So why did it fail? We'll start with a lack of product direction. Wikia had and probably still has a very junior product organization that is mostly interested in the site's UI and (recently) a focus on "fandom" (yuck). The team allocated to the project was based in Poland (Poznan, to be exact), and primarily kids coming out of a technical school on their first job. Your assumption about communication being a problem would be correct. However, the subject matter expert was so entrenched in his area of specialization, the problem was even more compounded on the native English-speaker side. There was too much getting in the weeds, and not enough focus on incremental progress.

To make things worse, they tried using a proprietary, not-ready-for-primetime data store because it most closely matched the SME's preconceptions on how the data should be structured. There was absolutely not an existing business use case for this data store, and problems getting it to work turned even building a simple demo into a death march.

Either way, what I'm saying is, $250,000 is not enough to solve this problem. We have attempted to solve this problem before in the MediaWiki world. It's not going to magically get better. To make something like this work, you need:

1) Best-in-class UX people who would know how a knowledge graph provides a significant improvement over existing solutions

2) Leadership that can bridge the gap between SMEs and implementers

3) Very skilled engineering resources with backgrounds in less conventional technologies

This is a massive investment that no one is willing to spend on what is essentially a media play.

About six months later, I had built a proof-of-concept that sucked data out of MediaWiki Infobox templates into Neo4j, a well supported graph database. I was able to answer questions like, "Which cartoon characters are rabbits", and "What movie won the most Oscars in 1968" using the Cypher query language.
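The kind of query this enables can be illustrated with a toy in-memory graph (the actual proof of concept used Neo4j and Cypher; the entity names, predicates, and the Cypher fragment in the comment below are invented stand-ins, not the real schema):

```python
# Toy in-memory stand-in for the infobox-to-graph idea.  The real
# prototype loaded MediaWiki infobox fields into Neo4j and queried
# them with Cypher; here a list of (subject, predicate, object)
# triples plays the part of the graph.

TRIPLES = [
    ("Bugs Bunny", "is_a", "cartoon character"),
    ("Bugs Bunny", "species", "rabbit"),
    ("Daffy Duck", "is_a", "cartoon character"),
    ("Daffy Duck", "species", "duck"),
    ("Roger Rabbit", "is_a", "cartoon character"),
    ("Roger Rabbit", "species", "rabbit"),
]

def objects(subject, predicate):
    # All objects linked from `subject` via `predicate`.
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def cartoon_rabbits():
    # Roughly: MATCH (c:Character {species: "rabbit"}) RETURN c.name
    return sorted(
        s for s, p, o in TRIPLES
        if p == "is_a" and o == "cartoon character"
        and "rabbit" in objects(s, "species")
    )

print(cartoon_rabbits())  # ['Bugs Bunny', 'Roger Rabbit']
```

The hard part, of course, is not the query but getting clean triples out of millions of inconsistently filled-in infobox templates in the first place.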

At that point in time, Wikia had decided they were tired of investing in structured data, and wanted to re-skin the site for a third time in as many years to make it look more like BuzzFeed.

Structured data is cool. In many cases, unsupervised learning may be what you're actually looking for. But in the end it has to satisfy a real user's needs.

Wikipedia has five million English articles. Wikia has over 20 million. As far as capitalizing on this wealth of knowledge, the devil is truly in the details. But it's a real shame that all of that information isn't put to better use than to encourage the socially maladjusted to take quizzes over which anime character they're more like.

How did you arrive at 20 million? This sounds like one of those "technically true" facts that are cooked up for investors. http://wikis.wikia.com/wiki/List_of_Wikia_wikis puts the combined total of the top 1,000 wikis (in all languages) at 12.4m.

20 million pages, not wikis -- sorry if I mistyped?

There aren't 20 million pages. Read my comment again.

There are over 300,000 wikis. I usually worked with the top ten thousand English wikis, which had over 15 million pages.

Or I'm just making the number up. Doesn't really matter to me.

Google's advantage isn't just that they were first, or that their algorithm is the best - it's the CPU resources they have available to keep their data updated faster.

Search for any news item and you'll have all articles published more than 2 minutes ago included in your results, all blog posts, everything. They consume it all, and offer the output in near-real-time.

Wikimedia don't have the resources to do that. And they especially won't without advertising to pay for it.

Wikipedia's main asset seems to be based around user contributions and human interactions, not programming or hard algorithms; this seems quite a leap into another field without much money.

Bing cost MS $5.5 billion, in their own field of expertise:


This is like competing with Intel in the CPU server market, where it has 98% market share.

So they are trying to compete, with $2.5 million, against software backed by multiple billions of dollars, hundreds of thousands of servers, tons of data, thousands of developers, ML integration, etc.?

Good luck with that. Many have tried, backed by many times the resources of this $2.5 million; unfortunately, all failed.

I am guessing this has a different focus than their previous attempt at making a search engine, wikia search, which they abandoned fairly quickly https://en.wikipedia.org/wiki/Wikia_Search

Wikia is not part of Wikimedia.



(Edit: Who decided that enter is not equal to enter?)

This is great. It feels like we live in an information overload era opposite of North Korea.

Search "Are cookies really bad for me" and find an answer that supports what you want to hear.

"Live a little" Sponsored by Nesthouse Cookies INC

Comparison point: it's Bing's budget for about 5 hours.

The link headline is highly editorialized, there is no mention of "google" or "rivalry" in the pdf in the link.

So this is why they've been asking for donations? Made it seem like they're on the ropes.

Most of my Google searches include a Wikipedia result on the first page. I would estimate this could reduce Google's web search revenue by upwards of 40% worldwide.

Google doesn't make that much money on research queries. It's mostly your other queries that enable them to sell ads.

Well, nearly everything one might conceivably google has a Wikipedia page...

A JV might be a better idea... $2.5M isn't that much money, and I doubt it will even come close to being useful relative to the other search engines.

Surprised no one has been mentioning qwant.com yet.

$250000 != $2.5M

Part of a 2.5 million dollar grant, this is the first stage.

I support it 100%, but still think they should use search advertising to cover costs and further development instead of asking for donations every year.

especially if they can make something that actually does rival google... other companies have spent billions and not gotten very close.

Google already is the search engine for Wikipedia. And Wikipedia is the content provider for Google. Why mess up such a beautiful arrangement? http://newslines.org/blog/google-and-wikipedia-best-friends-...
