Not summary, just sentence ranking and extraction. Still cool but not anything new. Sweet side project though! For anyone wondering how this was done, I have a similar project up at http://bookshrink.com (source code at https://github.com/peterldowns/bookshrink), although I don't fetch article text.
Really nice. For my Final Year Project I made a similar tool for "rich text" that took HTML tags such as h1, bold, etc. into account, and combined a few analyses, including TF-IDF. I'll have a look at the source when I get some time (:
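For anyone curious what "sentence ranking and extraction" with TF-IDF can look like, here is a minimal Python sketch. It is not the code behind the posted demo or bookshrink. It is an illustration that assumes each sentence is treated as its own "document", uses a naive regex sentence splitter, and sums TF-IDF weights per sentence. The names `split_sentences`, `tokenize` and `rank_sentences` are made up for this example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase words only; punctuation and digits are dropped.
    return re.findall(r"[a-z']+", sentence.lower())


def rank_sentences(text, top_n=3):
    """Return the top_n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    token_lists = [tokenize(s) for s in sentences]
    n = len(sentences)

    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for tokens in token_lists for word in set(tokens))

    scored = []
    for index, tokens in enumerate(token_lists):
        if not tokens:
            continue
        tf = Counter(tokens)
        # Sum of TF-IDF weights, divided by sentence length so that
        # long sentences do not win just by being long.
        total = sum(count * math.log(n / df[word]) for word, count in tf.items())
        scored.append((total / len(tokens), index))

    # Highest score first; ties are broken by position in the text.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_n]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```

Note that this particular weighting favours sentences with words that are rare across the text, which is only one of several plausible choices. A real tool would likely also need stop-word handling and a better sentence splitter.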
My knee-jerk reaction was to ask myself "Why is this any better than any other schema?" (I was not convinced that "API discovery" was, by itself, a good enough case).
Then I read the very practical first sentence of the jsonapi page: "If you've ever argued with your team about the way your JSON responses should be formatted, JSON API can be your anti-bikeshedding tool." That alone is probably huge. May not mean much for individual projects, but it's good enough for me to bookmark for the future.
Not sure if standards like this can prevent bikeshedding. You can always bikeshed about the need to stick to any particular standard. One counterexample to what I'm saying may be Go's single standard formatting, enforced by gofmt, but that was introduced very early on and became a part of the culture. It's too late for that with JSON APIs.
I'd like to point out that this formatting convention is not a widespread standard. "You should" is biased towards making life simpler for a small number of people who have used the format before, while complicating and bloating your API responses - for both the developer(s) and consumers of your API.
Consumers are now expected to add a full library to their project to parse/understand the JSON responses. Also, implementation overload for many languages: http://jsonapi.org/implementations/
JSON plus a loose adherence to REST won out over SOAP/XML-RPC/WSDL/etc. because of its simplicity. Service discovery seems to be a solution in search of a problem.
Even the simple idea of embedded links, with apologies to Dr. Fielding, seems to be a less critical component of REST than was initially believed, since very few modern REST APIs actually provide them.
Very few modern "REST" APIs have even a remote resemblance to the REST architecture described by Fielding. It's mostly just RPC over HTTP with data in a specialized subset of JSON or XML that is specified per endpoint out-of-band rather than indicated by media type.
I really wish people would stop calling it REST, since that robs the term of its meaning.
To some point, I agree! I also used to be a REST purist, but I've become more pragmatic in recent years.
Some crucial points that are often preserved even in today's APIs and that distinguish them from RPC:
- Any REST API will have the concept of resources that are acted upon by HTTP verbs (methods), instead of RPC-style calling a method named in the URI.
- statelessness (no session state assumed on the server)
- use of HTTP status codes
- resource path generally indicates hierarchy or at least a specific 'one path to this representation'
- broad use of existing HTTP headers for metadata instead of a separate "envelope" in the body as in SOAP
- use of common HTTP headers such as Authorization rather than cookies or other carriers of state
Many of the other compromises are not made out of ignorance, but in order to be broadly useful in the most common cases.
I don't disagree with your point about not calling it REST, because this trend does diverge from Dr. Fielding's dissertation in several important areas (for example, content negotiation, as you point out). That's why I prefer the term "loose REST".
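To make the "resources acted upon by HTTP verbs" and "use of HTTP status codes" points concrete, here is a rough Python sketch. It is not taken from any real API or framework: the `/articles/<id>` resource, the in-memory `ARTICLES` store and the `handle` function are all hypothetical. The handler is chosen by the (method, path) pair rather than by a method name embedded in the URI, and no session state is kept between calls.

```python
from http import HTTPStatus

# In-memory store standing in for a database (hypothetical data).
ARTICLES = {1: {"title": "Hello"}}


def handle(method, path, body=None):
    """Dispatch on (HTTP method, resource path); return (status, payload)."""
    parts = [p for p in path.split("/") if p]
    # Only one resource shape is known here: /articles/<numeric id>
    if len(parts) != 2 or parts[0] != "articles" or not parts[1].isdigit():
        return HTTPStatus.NOT_FOUND, None
    article_id = int(parts[1])

    if method == "GET":
        if article_id not in ARTICLES:
            return HTTPStatus.NOT_FOUND, None
        return HTTPStatus.OK, ARTICLES[article_id]
    if method == "PUT":
        created = article_id not in ARTICLES
        ARTICLES[article_id] = body
        return (HTTPStatus.CREATED if created else HTTPStatus.OK), body
    if method == "DELETE":
        if article_id not in ARTICLES:
            return HTTPStatus.NOT_FOUND, None
        del ARTICLES[article_id]
        return HTTPStatus.NO_CONTENT, None
    # The resource exists, but this verb is not supported on it.
    return HTTPStatus.METHOD_NOT_ALLOWED, None
```

An RPC-style design would instead expose something like a `deleteArticle` method named in the URI and report success or failure inside the response body.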
I'm not sure I'm seeing the same wow-factor results from the service that everyone else is raving about.
I submitted this link [0], which was on the HN homepage a couple of days ago, and the results I got back were either the least important bits or, in some ways, implied the opposite of the article. So either the writing was really bad or the algorithm needs some work.
Submitting a "simpler" less-ranty article [1] was even less successful, leading to paraphrases of less-important sentences as the results.
Then I submitted the BBC article from this morning about Philae [3] and received much, much better results. I think it works best on articles that have single sentences that clearly sum up the gist of the post as a single, hard fact and doesn't work with anything that works towards logical conclusions or tries to build an argument. Which makes sense, because this isn't an AI and can't actually deduce anything.
> Most of the comments seem neutral to negative to me?
Shameless plug, thanks to HackerMoods [1] I can quantify that statement: 0.85 neutral, 0.08 positive, 0.07 negative. The average Show HN is 0.17 positive and 0.04 negative, so your assessment is in line with the numbers.
Sorry about that. Design isn't my wheelhouse, but I updated it to what Google tells me is a colorblind-friendly palette. I would definitely appreciate it if you could take a look and let me know how it is.
I'd venture that it works by assigning an information weight to each word, based on the number of non-repeating words and the way they are related, and then filtering from the top down.
{"1":"nor could anyone for the day had long since passed zee prime knew when any man had any part of the making of a universal ac","2":"zee prime's mentality was guided into the dim sea of galaxies and one in particular enlarged into stars","3":"he gave no further thought to dee sub wun whose body might be waiting on a galaxy a trillion light-years away or on the star next to zee prime's own","4":"the universal ac said man's original star has gone nova","5":"the universal ac interrupted zee prime's wandering thoughts not with words but with guidance"}
{
"1":"betterthreads provides an enhanced replacement for the an enhanced replacement for the python this isn't actually a true thread instead it uses gevent to",
"2":"the widely-accepted solution is to set a timeout on our blocking functions so we can periodically check a which we set from the main thread to indicate we want the child thread to stop",
"3":"if the thread is still alive the when the *timeout* argument is not present or ``none`` the operation will block until the thread terminates",
"4":"`runtimeerror` if an attempt is made to join the current thread as that would cause a deadlock",
"5":"`join` a thread before it has been started and attempts to do so raises the same exception"
}
That said, I think summarising that particular text would require a certain level of domain expertise, something which a general bot couldn't provide.
{"1":"for various reasons i also spend a lot of weekends in new york and make more friends with people working on data and journalism","2":"my friend jean-baptiste who reads it asks why my blog is so good but my paper drafts are so bad","3":"at the beginning of this year i start telling people that i wish i had more female friends since i realize that there are many fewer women around me than before","4":"to keep myself from thinking about my uncertain future all the time i start a cybersecurity accelerator cybersecurity factory with my friend frank wang with the goal of helping research-minded people start companies","5":"i am too lazy to make many friends so i spend my free time reading cooking doing yoga and running"}
Thinking about why someone might make the choice to use text: text is more in keeping with *nix philosophy. Not that I'm saying it's better, but grep is pretty lightweight and a lot of people use the command line and/or languages other than JavaScript. YMMV.
I absolutely love the fact that the OP did not get a domain name for this demo. This is an interesting "technique" I haven't seen for quite a while. People tend to own and renew tens of domain names, which just sit there for a "just in case" moment. This is a great example of how things can really be simplified: spin up an instance, make a demo, shut the instance down.
Good luck with the link still being usable in a month or a year. There's a reason we use domain names for sharing. Not only because they are friendly to read and remember, but also because IPs are typically far more transient than domain names.
The IPs behind my projects have changed dozens of times over the years (new server, changing hosting provider, adding a load balancer, etc.). A simple DNS change allows the same domain name to follow the project.
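A small Python sketch of that indirection, using only the standard library's `socket.getaddrinfo`. The helper name `resolve_ipv4` is made up for this example. The point is that a shared link only needs the hostname, and whatever addresses DNS returns at lookup time are used, so the server can move without breaking the link.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_ipv4` on a project's domain before and after a hosting move would return different addresses, while the name people bookmarked stays the same. A raw IP literal just resolves to itself, which is why a link built on one breaks as soon as the server moves.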
I'm actually surprised HN permits links to IP addresses. While links posted here are not guaranteed to point to the same content in the future anyway, it is more likely that an IP address will change before the project is taken down entirely. Search engine posterity and all.