
Wikipedia starts work on $2.5M internet search engine project to rival Google [pdf] - e15ctr0n
https://m.wikimediafoundation.org/wiki/File:Knowledge_engine_grant_agreement.pdf
======
jcrben
This was/is actually an extremely controversial project. The corporation
(basically the Executive Director) pursued the grant and the idea without
soliciting input or really disclosing it to the community of editors, and
eventually one of the community-elected trustees was removed for questioning
the lack of transparency. The community has a long list of software
improvements that they'd like to see to the core platform.

A recent employee survey showed only 10% of WMF staff approved of the
Executive Director, probably in large part due to things like this.

A critical take on the project as it has been handled:
[http://permalink.gmane.org/gmane.org.wikimedia.foundation/82...](http://permalink.gmane.org/gmane.org.wikimedia.foundation/82112)

[https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2...](https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2016-01-13/In_focus)
is a pretty good (but dated) overview from Wikipedia's weekly newspaper, but
there are a few others in the Signpost and a few blog posts across the web.

~~~
araneae
> and eventually one of the community-elected trustees was removed for
> questioning the lack of transparency

This was a hypothesis about his removal when he was initially removed, and it
has since been refuted by multiple sources.

~~~
uhlume
[citation needed]

------
Rauchg
The biggest problem is the lack of data about what people are searching. It's
a catch-22 that's very hard to break in the face of Google's search dominance
and ubiquity.

Because Google is the best, it only becomes better, which introduces a huge
barrier to entry for competitors. It used to be possible to know what people
were searching for to end up at a given Wikipedia article, but that data is
now only available asynchronously (and in limited form) through Webmaster
Tools.[1]

In my mind, the most interesting aspect of the announcement should not be how
much money they have to spend, but how they plan on solving this paradox.

[1]
[http://webmasters.stackexchange.com/a/60350](http://webmasters.stackexchange.com/a/60350)

~~~
rgovind
I think Facebook should be able to build a search engine. I don't know why
they don't have one yet.

~~~
dexwiz
Facebook wants to be the single platform through which people access the Web.
If they built a search engine, it would only search Facebook.

~~~
elmar
Facebook search is broken; Twitter search works much better.

~~~
meowface
It's intentionally broken.

Despite many real (though also some exaggerated) counter-examples, Facebook
does have features to protect privacy.

One of those things is you generally can't get very useful results from search
unless you're friends with someone, or a friend of a friend (depending on the
user's privacy settings). You can't see general trends, or even search for
every person with a given first and last name in an area, for example.

It used to be more open, but they've heavily restricted the breadth of data
returned from searches within the past few years.

~~~
jarcane
It can't even find anything _in my own posts_. If it's a permission problem,
that's a pretty fuckin' serious permission problem right there.

On top of which, Graph Search is _still_ disabled in many regions, for reasons
which were never explained.

~~~
newscracker
> It can't even find anything in my own posts.

Ditto. I have seen this as a crippling deficiency in the Facebook platform.
Not being able to search properly in my own posts or in the groups I'm a
member of really sucks. For all the engineering prowess, open sourced tools,
etc., shown by Facebook, the lack of a working search makes the company seem
incompetent from the top down.

Some time ago I started storing important information (like others' links, my
own comments, etc.) outside of Facebook, where I can find it easily. I also
started a Facebook page and added Notes into it to make it easier to document,
find and share things. It seems ridiculous that I'd have to do this just to
have access to information, but that's been the sad state for years.

~~~
boomzilla
FB search is terrible because there is no incentive for an engineering team to
work on it; in fact, they may be quietly discouraged from doing so. FB makes
money from the news feed, so any feature that distracts users from scrolling
down their feed will come up losing in an A/B test where the ultimate metric
is ad views and clicks.
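
The incentive can be sketched with toy numbers (all figures here are invented
purely for illustration):

```python
# A search improvement that helps users find things faster can reduce time
# spent scrolling the feed, and therefore ad impressions.
control = {"sessions": 1000, "ad_views_per_session": 12.0}  # status quo
variant = {"sessions": 1000, "ad_views_per_session": 11.4}  # better search

def total_ad_views(arm):
    """The 'ultimate metric' the A/B test optimizes for."""
    return arm["sessions"] * arm["ad_views_per_session"]

# The variant loses on the only metric being measured, even if users are
# objectively better served by working search.
print(total_ad_views(control) > total_ad_views(variant))  # True
```

Any benefit of better search that doesn't show up as ad views is simply
invisible to a test scored this way.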

------
abalone
Summary of the approach (p10):

"1) Public curation mechanisms for quality;

2) Transparency, telling users exactly how the information originated;

3) Open data access to metadata, giving users the exact date source of the
information;

4) Protected user privacy, with their searching protected by strict privacy
controls;

5) No advertising, which assures the free flow of information and a complete
separation from commercial interests;

6) Internalization, which emphasizes community building and the sharing of
information instead of a top-down approach."

My first thought: How will transparency impact SEO? Will spammers be able to
better game the algorithm when they know its internals?

However I am excited at the prospect of a wikipedia-like public curation
system for the entire web. I admit I'm flabbergasted that the whole thing ever
worked, but it does.

~~~
contingencies
_1) Public curation mechanisms for quality;_

The Mozilla / open directory project tried this. Curation doesn't scale and
often assumes a single unifying ontology. This is particularly problematic in
a cross-cultural context. Besides, 'quality' is not a unidimensional metric in
a result set: consider timeliness, authority, notability, uniqueness,
comprehensibility, etc.

 _2) Transparency, telling users exactly how the information originated;_

Most search engines already include a URL. I can see a [crawldate] button,
like the [cache] or [translate] buttons on each hit, adding some information,
but it will be of dubious additional utility for most searches.

 _3) Open data access to metadata, giving users the exact date source of the
information;_

As above.

 _4) Protected user privacy, with their searching protected by strict privacy
controls;_

We have DuckDuckGo already; more are welcome, but it's hardly a unique
offering, nor a trustworthy one given Snowden's revelations regarding the
scale of systematic Five Eyes traffic monitoring/recording.

 _5) No advertising, which assures the free flow of information and a complete
separation from commercial interests;_

DDG, or Google or Bing with plugins, can supply this. Not groundbreaking.

 _6) Internalization, which emphasizes community building and the sharing of
information instead of a top-down approach._

This is so amorphous as to be a non-point.

So out of six points, 2 things (33%) are only useful in edge cases, 1 thing
(16%) is too vague to be useful, and the other 3 things (50%) are currently
implemented by others and have been tried before.

I would like to see the input of the former Blekko guys on this,
[https://news.ycombinator.com/user?id=ChuckMcM](https://news.ycombinator.com/user?id=ChuckMcM)
\+
[https://news.ycombinator.com/user?id=greglindahl](https://news.ycombinator.com/user?id=greglindahl)

~~~
abalone
_> Curation doesn't scale and often assumes a single unifying ontology_

Wikipedia is a pretty big exception to that assertion. Perhaps DMOZ (a clone
of Yahoo circa 1996) is not the only way to do curation. Perhaps Wikipedia
could apply what has worked for Wikipedia, i.e. develop a set of POV-neutral
criteria for organizing collections of links and then invite everyone to
participate.

It's really easy to be negative. But that's something that might at least be
an interesting research project for the #1 open-curation system in the world.

~~~
contingencies
You make a fair point. I'm not rubbishing Wikipedia, just questioning the
supposed USP. I would also point out in response to your argument that a
Wikipedia article and a set of search results are apples and oranges.

The article is written once then modified or evolved occasionally by (almost
exclusively) humans, but very frequently read. It is intended to be
intelligible, being structured and based in natural language. It has a very
well defined scope within a flat namespace, and often clear relations to
multiple formal ontologies. It is structured to be consumed in part or in
whole, and may contain rich media and strong supporting contextual information
(related pages).

By contrast a search result summarizes a set of potential information sources
that may answer a search query in whole or in part, to various definitions of
"answer". It is generally written once, by a computer, and thrown away after
some period of caching. It is intended to be concise. Each component result
has relatively poor context, relying upon the searcher to interpret
timeliness, authority, notability, uniqueness, comprehensibility, etc. with
the limited information presented, typically a very short content excerpt. It
is structured to be scanned, classically in a ranked fashion from "best hit"
to "worst hit", and is generally a wall of text.

Wikipedia successfully attracts people to contribute to the former, but the
latter - where the information product is generated on the fly and lasting
impact is amorphous (nothing particularly concrete for contributors to point
to and say "I did that! Warm and fuzzies!") - is a very different beast.

I too believe there is room for innovation ... there are potentially low
hanging fruit like inter-linguistic semantic queries (not keyword search) ...
but there are no such key problem areas identified in the paper's summary.

~~~
notahacker
The other big problem is that curating search results is inherently about
prioritising a _position_ rather than establishing a sourced and reasonably
neutral version of the truth.

I'm imagining the edit wars and debates that take place on contentious
wordings or facts in some parts of Wikipedia, but on a much wider scale
involving hundreds of SEO consultants each aware that changing a particular
criterion will have a quantifiable impact on their clients' bottom line. It
doesn't sound like it would be fun to police.

~~~
abalone
Wikipedia already curates links to some extent on every page under "External
Links". So there is a seed there.

And even the page text is not immune from the problem you describe. Grading
and prioritizing sources is a fundamental part of producing a "reasonably
neutral version of the truth." It's what determines what gets cited and how
prominently it influences the article.

So while I wouldn't equate text and links in terms of the difficulty of
managing POV-neutrality, I would say they sit on a spectrum.

------
cmarschner
This is a very, very, very small amount of money if you want to build a search
engine, let alone one "to rival Google" (source?). The goals look realistic,
though: see how Wikipedia search could be extended beyond results from
wikipedia.org, build some test sets, and get a better idea of what it really
is that's supposed to be built.

~~~
tuvalie
I run a knowledge engine project that actively mines facts from web-based
sources and third-party data dumps. It was featured on the front page of HN a
while ago, and has a total of $10 in funding (from a single donation; not a
typo). I have, however, put a ton of time into it, and it's something I'm very
passionate about. I'm fairly confident Wikipedia can have success making
initial headway on their grant objectives with $250,000.

If anyone's interested, here's a demo of my own project:
[https://tuvalie.com/fae/?q=Albert%20Einstein](https://tuvalie.com/fae/?q=Albert%20Einstein)

~~~
metasean
Over 15 years ago, for my undergraduate thesis I set up a "Hypermedia
Textbook" on the history of my field. I had to manually collate all the info,
manually scan in every photo, and type in every last bit of text and html. The
end result was a couple hundred pages that looked very, very similar to what
your Einstein page looks like! At the time, I knew a better way would emerge,
but didn't know how or when. It's moments like this that I (a) feel old :( and
(b) am amazed by the times we live in and the speed at which things are
happening! :) Thank you for providing such a wonderful, if unintentional,
moment of self-reflection!

~~~
tuvalie
Thank you for checking it out! And if you have any ideas for how things can be
improved, I'd love to hear them :)

------
NKCSS
Weird; they have annual drives to raise money to keep the site running. I
would not expect that they'd have $2.5M lying around to do pet projects like
this...

~~~
marincounty
Most non-profits, especially the ones that are always asking, usually have a
lot of funds.

Before I give, I go to GuideStar, hit free preview (they try to trick you into
a paying membership), download the last few years of 990s, and see if
everything looks copacetic. I look at who is making the most money. There is
usually one person making a very good living. California non-profits are much
easier to scrutinize than Delaware non-profits.

~~~
ryacko
They are usually required to give you the 990 form if you email them.

In any case,
[https://wikimediafoundation.org/wiki/Financial_reports](https://wikimediafoundation.org/wiki/Financial_reports)

And the internet archive is a lot more deserving.

~~~
rpgmaker
I believe in the Internet Archive's mission even if I don't use the site that
often, but I use Wikipedia too much; I can't justify not donating to them when
they need it.

------
seven-dev
I support it 100%. I love Google but they have too much power and I'm sure
they'll start taking advantage of that soon (like they did with Google+ and
Youtube).

~~~
dghughes
I can't see it being much better. I find Wikipedia's impartiality lacking;
it's very US-centric.

For example something like the history of the Alaskan panhandle as seen from
the US perspective is totally different when seen from a Canadian perspective.

I never use Wikipedia as a primary source of info; even for the linked
sources, I try to use at least three independent sources.

I would certainly like to see "reliable and trustworthy information", but whom
do I trust, and who is reliable?

~~~
ino
I don't know if you can trust anyone; I mean, bias is everywhere. You can pick
a side.

Take, for example, the Portuguese Wikipedia, shared by Brazil, Portugal, and
other countries: many controversial matters concern colonization over the last
500 years, and both sides have academic work to support their contradictory
views. Which views prevail?

An example of things being done differently is some ex-Yugoslav countries
(Serbia, Croatia, Bosnia, Montenegro, etc.), whose languages are more similar
to each other than European and Brazilian Portuguese are; each one has its own
Wikipedia, with different articles on the same subject depending on their
point of view. Lately, I've been seeing more of the Serbo-Croatian Wikipedia,
which I think aims to unite more of the others.

I don't know which way is better, I'm just a user.

Another reason you can't trust anyone (and this is general to the Internet) is
that shilling and commercial and political interests aiming to change
perception are everywhere: on Reddit or Facebook, with or without sources.
It's the worst aspect of the internet for me these days.

------
nitrix
The title is misleading. It's $250K, not $2.5M, and the goal is a knowledge
engine, not a search engine.

~~~
chris_wot
Not only is it a search engine, but it is a grant application that has had WMF
staff leaving in droves, and has greatly upset many, many others - who will
quite likely also leave.

It's very, very sad. And it's also a shameful moment for the WMF.

 _edit:_ and don't just think it's me saying it. The WMF has had a mass exodus
of staff in the last week or so. If you speak to any WMF non-executive staff
members directly, you'll quickly find out that morale is at an all time low,
and confidence in the WMF Board is sitting at something like 12%.

~~~
incongruity
Can you say more/explain? What about the application is so upsetting? What is
shameful about this?

The Knight Foundation is about as upstanding as you can get, so it can't be
that (full disclosure, I've received funding from them, so I'm definitely not
unbiased on that point). So, what exactly is it that's so shameful here?

~~~
chris_wot
To be clear: my issue (and in fact, most people's issues) is not with the
Knight Foundation. In fact, they appear to have been above board in every way
in this whole debacle. It is the WMF board who are the problem here.

See my comment here for just a few comments on this issue:

[https://news.ycombinator.com/item?id=11101262](https://news.ycombinator.com/item?id=11101262)

Frankly, there's a lot more - to understand the issue better you might want to
read Liam Wyatt's blog posts:

[http://wittylama.com/2016/01/08/strategy-and-controversy/](http://wittylama.com/2016/01/08/strategy-and-controversy/)

and here:

[http://wittylama.com/2016/01/30/strategy-controversy-part-2/](http://wittylama.com/2016/01/30/strategy-controversy-part-2/)

~~~
incongruity
Thanks – I appreciate the references

~~~
chris_wot
That's OK, ironically it was Wikipedia that taught me to always back up my
statements with references :-)

~~~
incongruity
As an aside, I think the internet is simultaneously great at spreading
absolute bs and disinformation and pushing people to have citations handy...
paradoxically, both seem to be getting more frequent. (It all depends on where
you browse, clearly).

~~~
chris_wot
Yeah, nobody knows this more than myself. I created [citation needed] and I've
watched it be misused for years. I _am_ glad I came up with the idea, but I'm
resigned to the fact that it's human nature to misuse a valuable idea.

------
xojoc
This is exciting. Recently I started working on an answer engine/search
engine. It still sucks, but it's a good project to work on when bored:

[http://kairos.xyz/](http://kairos.xyz/)

In a few weeks I'll publish the source code and do a Show HN.

I wish a lot of luck to

[https://www.mojeek.com/](https://www.mojeek.com/)

and

[http://www.lexxe.com/](http://www.lexxe.com/)

too. DuckDuckGo has also started to crawl the web with its own bot (right now
they're using Yandex's API).

We need more competition from different countries. Just think about the
censorship done by Baidu or how Google never plays by its own rules.

It's also interesting to think about a way to monetize a search engine. For
kairos.xyz I was thinking about paid accounts (1 euro per month) providing
more features, like the ability to search from the command line. For example
you write "kairos Richard Stallman" and it prints basic information about
Richard Stallman on your terminal.
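
A minimal sketch of how that CLI flow could be wired up. The endpoint path and
query parameter here are invented for illustration; kairos.xyz's real API (if
any) may differ:

```python
# Hypothetical command-line client: turn CLI arguments into a request URL
# for an assumed JSON search endpoint.
import urllib.parse

def build_query_url(terms, base="https://kairos.xyz/api/search"):
    """Encode CLI arguments like ['Richard', 'Stallman'] into a GET URL."""
    return base + "?" + urllib.parse.urlencode({"q": " ".join(terms)})

# e.g. the shell command `kairos Richard Stallman` would fetch:
print(build_query_url(["Richard", "Stallman"]))
# https://kairos.xyz/api/search?q=Richard+Stallman
```

From there the client would just fetch the URL and pretty-print the JSON
response to the terminal.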

~~~
seven-dev
(Default nginx page showing up on your website)

~~~
xojoc
It works for me. Care to show me a screenshot?

~~~
bratch
The host (assuming the same host) is responding with a different website when
accessed via IPv4 vs IPv6.

    
    
      $ curl -4s http://kairos.xyz/ | grep title
          <title>Kairos</title>
    
      $ curl -6s http://kairos.xyz/ | grep title
      <title>Welcome to nginx on Debian!</title>
    
      $ host kairos.xyz
      kairos.xyz has address 107.161.29.121
      kairos.xyz has IPv6 address 2604:180:0:a54::24d9

~~~
xojoc
Thanks, I found the problem. I thought that with Nginx, IPv6 would just work,
but I had to add

    
    
        listen 80;
        listen [::]:80;
    

to my server block.

------
timClicks
Link to Wikimedia's wiki page on the project includes a decent FAQ:
[https://meta.wikimedia.org/wiki/Knowledge_Engine](https://meta.wikimedia.org/wiki/Knowledge_Engine)

------
inaudible
Anyone want to hazard a guess at the technology they plan to implement to get
this started?

Surely this is not designed to be written from scratch, so..

\- Are they using known lexical & semantic scanners?

\- Is it focused on the English language first?

\- What crawlers will scan content?

\- I'll assume it's an open platform, but what license for contributors?

\- What database architecture will hold the graph?

\- How does it know the mark of authority, and is this primarily based on
human input learning or machine learning?

I'm sure $2.5M won't touch the sides, but maybe if it's a well-directed
project, with healthy user contribution, based on interesting technologies,
they might develop a good backbone architecture. Ambitious, for sure.

------
kristianp
This is a link to the actual pdf:

[https://upload.wikimedia.org/wikipedia/foundation/a/a7/Knowl...](https://upload.wikimedia.org/wikipedia/foundation/a/a7/Knowledge_engine_grant_agreement.pdf)

------
naveen99
Computing is so cheap now that Google isn't going to be dominant in text
search for long. Their money is needed for video, pictures, and audio, but the
text internet can be cached whole by small entities now.

Maybe Wikipedia should launch a video encyclopedia to try to provide a 5
minute video of every article, for people who like videos more than reading.

------
pavanlimo
Doesn't it say 250k in the letter?

------
udkl
The grant amount is $250,000

~~~
olh
Correct. As per page 9, most of the budget allocated for the project comes
from Wikimedia itself, totaling USD 2,445,873.00 for the fiscal year
2015-2016.

~~~
mh-
do these funds come from the donations they solicit on wikipedia.org?

~~~
TazeTSchnitzel
Yes, it's the main (only?) source of income for the Wikimedia Foundation.

~~~
maxerickson
There are substantial donations from other foundations and company match
programs (Huuuge page):

[https://annual.wikimedia.org/2014/#s-5](https://annual.wikimedia.org/2014/#s-5)

I guess the foundation grants aren't reasonably a response to the Jimmy
banners; the company matches probably are.

------
jayadeeptp
Misleading title. I don't think they want the result of the grant to rival
Google

------
qaq
I am financing a $10 project to rival Tesla.

------
datashovel
I'm not kidding when I say that if they want to know where to spend the $2.5m
I would start with cleaning up their core codebase. IMO Mediawiki open source
code is a disaster.

EDIT: Not because it's written in PHP. Because it's architected poorly.

~~~
chris_wot
It's funny you should mention that. That was a point that a number of WMF
staff apparently expressed, and it was ignored.

~~~
yuvipanda
There used to be a team dedicated to making MW Core better / cleaner, but that
was lost in a re-org earlier in 2015.

~~~
chris_wot
That's a darned shame. I'm aware that there are a _lot_ of areas that people
want to fix on MW Core.

I get really concerned when I hear that the person who holds the vision and
direction for the Wikimedia Foundation didn't really participate in it
beforehand, and I get even more concerned when I see that she branches off
into proposals for search technology that appear to be far outside the scope
of Wikimedia projects.

Nobody has ever thought search in Wikipedia or the various projects was
particularly effective. However, bringing everything together doesn't just
involve searching, and frankly there are a number of more pressing governance
and community issues that need to be managed.

Perhaps I'm being a bit unfair here, but she was profiled when she first
joined the WMF Board, and the following was said about her:

 _At the meeting, she described the impact on friends and family of the
Chernobyl nuclear disaster, and the difficulty of getting reliable information
in the face of “so much secrecy.”_

Yet we see that this is _precisely_ what happened with this grant proposal. A
major grant was applied for and awarded and _not even WMF staffers_ knew about
it. You can see on the mailing list that it was a total shock when it was
finally revealed.

I'm watching this train wreck from afar, but closer than others because some
of my friends are deeply involved in Wikipedia and the WMF. I'm always amazed
that a leadership change can completely kill an organisation. I've seen it in
the corporate world, and I see it all the time in the volunteer world as well.
The Wikimedia Foundation seems to be yet another victim of the appointment of
a clueless leader, with no experience in the area or with the group they are
meant to be leading, thrashing around, making changes without really
understanding how systems work or the history of the organisation, and without
relying on the experience and sage advice of the many expert and dedicated
people around them, ultimately leaving a great deal of unnecessary turmoil,
ill-will and, frankly, destruction in their wake.

------
doyoulikeworms
If nothing else, I hope it improves Wikipedia's currently abysmal search
features.

~~~
lacksconfidence
Turns out that is the only thing this project is about. There is no web
crawler, there is no external content. The grant and the money WMF is spending
are going to improve internal search at Wikipedia.

------
frik
Good luck. We definitely need more search engines. (That Google announced it
will lower the PageRank(R)/site score for non-HTTPS sites is a clear indicator
that they are about to cross the line (monopoly). And no, DDG and most others
are "just" _meta search engines_ that rely on Yahoo BOSS ($$$), whose future
is uncertain and relies on Bing.)

There was "Wikia Search" by Wikipedia founder Jimmy Wales:

" _Wikia Search was a short-lived free and open-source Web search engine
launched by Wikia, a for-profit wiki-hosting company founded in late 2004 by
Jimmy Wales and Angela Beesley.

Wikia Search followed other experiments by Wikia into search engine technology
and officially launched as a "public alpha" on January 7, 2008. The roll-out
version of the search interface was widely criticized by reviewers in
mainstream media. After failing to attract an audience, the site closed by
2009._"

[https://en.wikipedia.org/wiki/Wikia_Search](https://en.wikipedia.org/wiki/Wikia_Search)

I used Wikia Search back then, it was good enough (like Bing in comparison to
Google back then).

It was based on Apache Nutch and Solr/Hadoop(?)/Lucene...

Maybe you can rely on Lucene or SphinxSearch projects to kick-start.

~~~
grapehut
Google doesn't want to lower the PageRank of HTTP sites; it just wants to use
HTTP vs. HTTPS as a feature in ranking. That isn't particularly surprising; I
would be willing to bet Google already uses hundreds of such features (one of
which might be PageRank).
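
One way to picture "one feature among hundreds" is a weighted combination
where HTTPS contributes a small term. The feature names and weights below are
invented; Google's actual model is not public:

```python
# Toy ranking function where HTTPS is just one small feature among several.
WEIGHTS = {"relevance": 0.55, "pagerank": 0.40, "https": 0.05}

def score(page):
    """Weighted sum of whatever features we have for the page."""
    return sum(WEIGHTS[name] * page.get(name, 0.0) for name in WEIGHTS)

http_page  = {"relevance": 0.8, "pagerank": 0.9, "https": 0.0}
https_page = {"relevance": 0.8, "pagerank": 0.9, "https": 1.0}

# HTTPS acts as a small nudge, not a wholesale demotion of HTTP sites:
# relevance and authority still dominate the score.
print(score(https_page) > score(http_page))  # True
```

Under weights like these, a clearly more relevant HTTP page still outranks a
less relevant HTTPS one.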

------
wanderingstan
Given the terrible state of advertising, I would welcome a search engine that
penalizes pages with popovers, animated ads, auto playing audio, and so on.
Google would never build this, given its business model.

I hope Wikipedia brings some innovation to search, untethered from advertising
revenue.

~~~
aabbccddee
Google does do this, AFAIK. They don't get any money when people turn on
adblock, and they have always leaned towards simple ads, as well as filtering
out sites with crummy UX in search.

------
gersh
As a former NLP engineer and former wikiHow engineer, I have some perspective
on this. Google has included more and more information from Wikipedia.
Furthermore, Google includes snippets of external websites in the knowledge
box on more and more pages.

How long will it be until Google can algorithmically generate its own
Wikipedia articles? Wikipedia relies upon people coming to its site for
contributions and donations. Without search, Wikipedia risks being subsumed by
Google. They are in a difficult position of thinking about the future without
pissing off Google.

Computers are getting more and more powerful. Wikipedia needs to act to stay
relevant. I think this is the right decision.

------
talles
I don't know how I should feel about all those donation campaigns they usually
run, after this.

~~~
Mizza
Looks like this is a grant specifically for this work (at least in part.)

I think this is an incredibly good use of their money. Google is the world's
biggest surveillance machine; I hope that Wikipedia can do to them what
they've already done to Encarta.

------
jonathankoren
I've read a lot of the inside-Wikimedia links here, and I'm confused about all
the talk of gnashing of teeth and rending of cloth. This is controversial
because some want to pay down technical debt rather than have a small team do
knowledge graph search?

Okay...

~~~
chris_wot
It's not a small team compared to the size of the organisation. And you
mischaracterise the situation: this is a problem with engagement, transparency
and openness.

~~~
jonathankoren
Yeah, the relative merits of the initiative seem to be beside the point. If
you have a toxic environment, even a proposal to cure all disease for everyone
for free will attract derision.

It's hard to get worked up over some other team's morale. If it's such a
crappy place, just quit. They could probably literally go across the street
and get a new job. I don't really care. It's all way too inside baseball for
me.

~~~
chris_wot
You cared enough to comment. This wasn't just a place of work for me, I
volunteered my time because I believed in what they were doing.

 _You_ might never have contributed in a significant fashion to Wikipedia and
other WMF projects, but I did. Sure, I didn't get employed, but then again I
know a lot of people I met and have continued to be friends with who are still
deeply involved. _You_ may not care about your friends' morale, and you might
think it's easy for people to "go across the street and get a new job", but
then you seem like a pretty thoughtless person.

Of course, you've not understood at all what the larger issues are. You must
have a bit of a comprehension issue, because I supplied quite a few links that
you apparently read that explained the underlying problems.

Just remember though: you don't care :-)

~~~
jonathankoren
0) Stop with the personal attacks. It's rather unbecoming, and I don't
appreciate it.

1) Being genuinely confused about what the big deal is isn't the same thing as
caring. It's asking for confirmation of a conclusion.

2) If you're in a situation where you're unhappy, then you have a
responsibility to make yourself happy. Staying around in a crappy situation
and whining about it doesn't help, and neither does insulting people.

3) Wikimedia is in San Francisco. If I had to take a guess, I would say there
are literally a hundred other tech organizations in that city alone, including
nonprofit organizations with a societal purpose. 18F comes to mind. Again, see
2.

~~~
chris_wot
0) you don't seem to realise how you come across

1) you literally wrote "I don't really care".

2) they aren't whining. Saying so is pretty much a personal attack. It's
certainly insulting. I don't appreciate it, and I'd say neither do they. Funny
how that works both ways.

But interestingly enough, as has been pointed out already - people ARE leaving
in droves.

I'm no longer involved in Wikipedia, but I can still be unhappy with the
direction they are taking.

3) if you think that just leaving a non-profit you have emotionally invested
in is an easy decision, then you really haven't thought things through. If you
think it's elementary to just step out of one job and into another, that's
also thoughtless.

------
melted
That's how much 5 qualified software engineers would cost to employ for a year
(gross, including compensation, benefits, payroll taxes, office space,
hardware, etc, and that's on the low end of the range). Good luck with that.

~~~
treve
It costs half a million to employ one software engineer for a year?

~~~
melted
A software engineer qualified to work on this kind of thing is worth about
$350K in combined compensation on the market right now. Typically half of that
is base, while the other half is stock and other taxable benefits. The number
can be higher. This is the cost of just compensation to the company, excluding
the payroll tax. You can, of course, find someone a lot cheaper, but then
you'd be a fool to expect the result to be anywhere near as good as what
Google can pull off; if that someone could do what Google can, why would she
work for half the compensation instead of applying to Google or FB, or
whoever pays competitive salaries these days?

------
idibidiart
Remember this?

[https://evolvingtrends.wordpress.com/2006/06/26/wikipedia-30...](https://evolvingtrends.wordpress.com/2006/06/26/wikipedia-30-the-end-of-google/)

------
languagehacker
As a former Wikia employee, I am somewhat of a MediaWiki insider. I sped
Wikia's search engine up by several orders of magnitude and then went on to
pilot a number of NLP/machine learning initiatives in the company.

Jimmy Wales already tried to make a "Google Killer" ten years ago. It was
tilting at windmills to say the least. Letting individuals help manage
algorithmic search results was harder than you could imagine. Let's not even
get into the difficulty of building an effective crawler.

One of Wikia's former CEOs, Gil Penchina, notoriously undervalued search as a
result of this very public gaffe. By the time I came in, it took over five
seconds to do a simple on-wiki search. Searching across wikis took so long
they actually just sent the search to Google and had you abandon the site. I
personally fixed a lot of these problems, and that part was pretty cool.

So now let's get to the subject at hand, which is a search feature based on an
authoritative knowledge graph. Something like this should adequately surface
factual information in an intuitive manner -- optimally based on natural
language. Wikia already tried this, too. They brought on a very seasoned
advisor who played a crucial role in the semantic web movement going back to
the early aughts. I remember going to semantic web meetups in Austin, back
when I was in grad school, to hear this guy talk.

This guy was essentially the SF-based manager or lead for a small team located
in Poland whose job it was to take some of the "structured data" at Wikia and
attempt to build some kind of knowledge graph on top of it. This project was
unsuccessful.

So why did it fail? We'll start with a lack of product direction. Wikia had
and probably still has a very junior product organization that is mostly
interested in the site's UI and (recently) a focus on "fandom" (yuck). The
team allocated to the project was based in Poland (Poznan, to be exact) and
consisted primarily of kids fresh out of a technical school, on their first
job. Your assumption that communication was a problem would be correct.
Worse, the subject matter expert was so entrenched in his area of
specialization that the problem was compounded on the native-English-speaker
side as well. There was too much getting into the weeds, and not enough focus
on incremental progress.

To make things worse, they tried using a proprietary, not-ready-for-primetime
data store because it most closely matched the SME's preconceptions on how the
data should be structured. There was absolutely not an existing business use
case for this data store, and problems getting it to work turned even building
a simple demo into a death march.

Either way, what I'm saying is, $250,000 is not enough to solve this problem.
We have attempted to solve this problem before in the MediaWiki world. It's
not going to magically get better. To make something like this work, you need:

1) Best-in-class UX people who know how a knowledge graph provides a
significant improvement over existing solutions

2) Leadership that can bridge the gap between SMEs and implementers

3) Very skilled engineering resources with backgrounds in less conventional
technologies

This is a massive investment that no one is willing to spend on what is
essentially a media play.

About six months later, I had built a proof-of-concept that sucked data out of
MediaWiki Infobox templates into Neo4j, a well supported graph database. I was
able to answer questions like "Which cartoon characters are rabbits?" and
"What movie won the most Oscars in 1968?" using the Cypher query language.
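The kind of query described here can be illustrated without Neo4j. Below is a minimal sketch of the idea: a hand-made in-memory triple store stands in for data scraped from infobox templates. The data, predicate names, and `query` helper are made up for illustration, not Wikia's actual schema:

```python
# Toy knowledge graph as (subject, predicate, object) triples, the kind of
# structure you might extract from MediaWiki Infobox templates.
# All data and predicate names here are hypothetical.
TRIPLES = [
    ("Bugs Bunny", "species", "rabbit"),
    ("Roger Rabbit", "species", "rabbit"),
    ("Mickey Mouse", "species", "mouse"),
    ("Bugs Bunny", "type", "cartoon character"),
    ("Roger Rabbit", "type", "cartoon character"),
    ("Mickey Mouse", "type", "cartoon character"),
]

def query(predicate: str, obj: str) -> set:
    """Return every subject that has a matching (predicate, object) pair."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}

# "Which cartoon characters are rabbits?" -- intersect the two constraints.
rabbits = query("species", "rabbit") & query("type", "cartoon character")
print(sorted(rabbits))  # ['Bugs Bunny', 'Roger Rabbit']
```

In a real graph database the same question becomes a pattern match rather than a set intersection; in Cypher it would look something like `MATCH (c {type: 'cartoon character', species: 'rabbit'}) RETURN c` (schema hypothetical).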

At that point in time, Wikia had decided they were tired of investing in
structured data, and wanted to re-skin the site for a third time in as many
years to make it look more like BuzzFeed.

Structured data is cool. In many cases, unsupervised learning may be what
you're actually looking for. But in the end it has to satisfy a real user's
needs.

Wikipedia has five million English articles. Wikia has over 20 million. As far
as capitalizing on this wealth of knowledge, the devil is truly in the
details. But it's a real shame that all of that information isn't put to
better use than encouraging the socially maladjusted to take quizzes about
which anime character they're most like.

~~~
atdt
How did you arrive at 20 million? This sounds like one of those "technically
true" facts that are cooked up for investors.
[http://wikis.wikia.com/wiki/List_of_Wikia_wikis](http://wikis.wikia.com/wiki/List_of_Wikia_wikis)
puts the combined total of the top 1,000 wikis (in all languages) at 12.4m.

~~~
languagehacker
20 million pages, not wikis -- sorry if I mistyped?

~~~
atdt
There aren't 20 million pages. Read my comment again.

~~~
languagehacker
There are over 300,000 wikis. I usually worked with the top ten thousand
English wikis, which had over 15 million pages.

Or I'm just making the number up. Doesn't really matter to me.

------
mabbo
Google's advantage isn't just that they were first, or that their algorithm is
the best- it's the CPU resources they have available to keep their data
updated faster.

Search for any news item and you'll have all articles published more than 2
minutes ago included in your results, all blog posts, everything. They consume
it all, and offer the output in near-real-time.

Wikimedia doesn't have the resources to do that. And it especially won't
without advertising to pay for it.

------
aaron695
Wikipedia's main asset seems to be user contributions and human interaction,
not programming or hard algorithms; this seems like quite a leap into another
field, with not much money.

Bing has cost MS $5.5 billion, and that was in their own field of expertise.

[http://www.geek.com/news/bing-has-cost-microsoft-5-5-billion...](http://www.geek.com/news/bing-has-cost-microsoft-5-5-billion-since-launch-1423117/)

------
lossolo
This is like competing with Intel in the server CPU market, where it has 98%
market share.

So they are trying to compete, with $2.5 million, against software backed by
billions of dollars, hundreds of thousands of servers, tons of data,
thousands of developers, ML integration, etc.?

Good luck with that. Many have tried, backed by many times the resources of
this $2.5 million, and unfortunately all have failed.

------
jccalhoun
I am guessing this has a different focus than their previous attempt at making
a search engine, Wikia Search, which they abandoned fairly quickly:
[https://en.wikipedia.org/wiki/Wikia_Search](https://en.wikipedia.org/wiki/Wikia_Search)

~~~
Washuu
Wikia is not part of Wikimedia.

[https://en.wikipedia.org/wiki/Wikia#Relationship_with_Wikipe...](https://en.wikipedia.org/wiki/Wikia#Relationship_with_Wikipedia)

[http://community.wikia.com/wiki/Help:Wikimedia](http://community.wikia.com/wiki/Help:Wikimedia)

(Edit: Who decided that enter is not equal to enter?)

------
JayHost
This is great. It feels like we live in an information-overload era, the
opposite of North Korea.

Search "Are cookies really bad for me" and find an answer that supports what
you want to hear.

"Live a little" Sponsored by Nesthouse Cookies INC

------
Aissen
Comparison point: it's Bing's budget for about 5 hours.

------
exDM69
The headline is highly editorialized; there is no mention of "Google" or
"rivalry" in the linked PDF.

------
chrisra
So this is why they've been asking for donations? Made it seem like they're on
the ropes.

------
grandalf
Most of my Google searches include a Wikipedia result on the first page. I
would estimate this could reduce Google's web search revenue by upwards of
40% worldwide.

~~~
xiphias
Google doesn't make that much money on research queries. It's mostly your
other queries that enable them to sell ads.

~~~
grandalf
Well, nearly everything one might conceivably google has a Wikipedia page...

------
veritas213
A JV might be a better idea... $2.5M isn't that much money, and I doubt it
will even come close to being useful relative to the other search engines.

------
bato
Surprised no one has been mentioning qwant.com yet.

------
franky303
$250000 != $2.5M

~~~
chris_wot
It's part of a $2.5 million grant; this is the first stage.

------
blairanderson
I support it 100%, but still think they should use search advertising to cover
costs and further development instead of asking for donations every year.

Especially if they can make something that actually does rival Google...
other companies have spent billions and not gotten very close.

------
sparkzilla
Google already _is_ the search engine for Wikipedia. And Wikipedia is the
content provider for Google. Why mess up such a beautiful arrangement?
[http://newslines.org/blog/google-and-wikipedia-best-friends-...](http://newslines.org/blog/google-and-wikipedia-best-friends-forever/)

