Everything in the example given is encapsulated by schema.org, which would describe it unambiguously.
I know that schema.org has been dismissed because "The problem is that it's too complicated for a non-developer", but I would say that for a non-developer this file is also too complicated. Most non-developer small business owners can barely use FTP or the WordPress admin. These people don't have a robots.txt and won't create a business.txt.
I would argue that making business.txt schema.org-formatted and then using a simple generator wizard to produce it would be more accessible to the small business owner than giving them a text file to edit.
I'm of the same mind that implementing a common interface for Schema.org would be very useful. In many cases, the business.txt file is duplicating information that should be on your site anyway.
My experience with small business clients is that they have no idea what they are doing when it comes to the web. They tend to know exactly what they want, and are usually right. But, they don't know how to translate that to their sites. Usually, it just takes a bit of guidance.
Creating a proper about page for a site that implements schema.org standards makes sense for both the developer and content creator.
I completely agree with you that an interface in WordPress would be much easier for a business owner than editing the text file directly.
Having said that, I also believe that a txt file is way easier than schema.org for the rest of the people who can edit a file. The txt file is done in minutes; the schema markup is done in hours :P
What I feel is necessary are two things:
- Having a separate file
- Doing something that the big websites support so we can "sell" it to local businesses.
So if more people believe we should extend schema instead of going with txt, we can work on it.
That format is perfectly valid YAML, easily parsed, and just as easily created. It needn't be any more complicated than it is. I could make a WP plugin to support it within a few hours.
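For what it's worth, a minimal sketch of the consuming side in Ruby, assuming the key: value layout shown in the proposal (the sample content and field names are illustrative, not taken from the spec):

    require 'yaml'

    # An illustrative business.txt; the field names here are assumptions, not the spec
    sample = <<~TXT
      name: Example Café
      phone: "+34 912 345 678"
      address: Calle Mayor 1, Madrid
    TXT

    business = YAML.safe_load(sample)
    puts business['phone']   # => +34 912 345 678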
The robot will very likely hit my frontpage anyways, so I'd rather have them look (and not find) for a meta tag there than to produce an additional 404 error which will, depending on the content of said page, waste a considerable amount of bandwidth.
Besides, my frontpage is either heavily fragment- or just page-cached anyway - especially for anonymous users - so it can be served directly from RAM for all intents and purposes; a hit to the frontpage is really, really cheap.
> an additional 404 error which will, depending on the content of said page, waste a considerable amount of bandwidth.
Because your 404 should be one of the heaviest pages on your site, full of graphics and surprises. Why not make a full-featured game specifically for your 404s?
That will cause you to miss some behavior if internal URLs that users bookmarked or remember ever change.
As an example, a TV station site I follow had a /fullepisodes/ basename page which wasn't updating in sync with their show pages, apparently pulling from a different data store. After about six months of declining updates to that page, they eventually revamped it with a more correct data source. I like to think it happened because I and others kept browsing directly to the easily memorable URL.
If you plan to implement a "standard" please try to review the RFCs that have covered this ground before. There's probably already a standard which may fit. If not, there's probably one that's close you could propose a change to. And if you're trying something genuinely new, you'll at least be on the right foundation.
The core problem is definitely that many small business sites, especially restaurant ones, are really terribly outdated, and really not run by someone who would understand the concept of "upload a text file to the server".
So perversely, storing the information centrally would be easier, but who would you trust with it? The temptation to create a walled garden and "monetize" all that juicy local business data would be very strong for the maintainers. And then everything falls apart into small localized non-interoperating fiefdoms again, and we're back where we started...
I think this usecase is already covered very well by microformats and the various metadata standards that already exist and are supported by Google, Facebook et al.
I'm not sure the argument that these are too complicated for non-developers really works here, after all uploading a file to the root of a web directory is likely also too complicated...
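To make that concrete, a local business site could mark up the contact block it already has with schema.org microdata; a rough sketch (the property names come from the LocalBusiness and PostalAddress types, the data is made up):

    <div itemscope itemtype="http://schema.org/LocalBusiness">
      <span itemprop="name">Example Café</span>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="streetAddress">Calle Mayor 1</span>,
        <span itemprop="addressLocality">Madrid</span>
      </div>
      Phone: <span itemprop="telephone">+34 912 345 678</span>
      <time itemprop="openingHours" datetime="Mo-Fr 09:00-18:00">Mon-Fri 9:00-18:00</time>
    </div>

Crawlers that already understand schema.org pick this up from the page itself, with no extra request.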
Apart from it being really complicated for non-developers, another thing I don't like about schema is that it seems targeted at websites like Yelp or Foursquare, but not at the original local business website.
The owner of the local business is the one who knows the correct address, phone, opening hours, menu, etc. All the other websites (Yelp, Foursquare, Google Places) can be wrong. We just need the local business to tell the rest of the world what the correct data is.
I'm not sure I follow your logic here; micro-formats are part of the HTML of the local business's website, and search engines and other crawlers can collect and parse this data. This is just like the local business owner telling the rest of the world what the correct data is.
If updating the HTML is too complicated for the website owner (fair enough!) then this should be taken care of in the respective CMS.
I don't mean to sound negative, I know it takes some thought and time to put out a proposal like this :-) Just in this case, this seems to be a problem that was solved a long time ago...
"Without business.txt he would have to go to all the websites like Yelp and Foursquare and..."
No. No no no. This is not how the Internet is supposed to work. I search for a restaurant online hoping they have a website with this information on it. If it's a chain, I can find the local location and know the information is correct. If it's a local place with a website, the information is probably outdated anyway because they don't edit the site when their menu and hours change ... which means they're not going to edit business.txt to reflect the changes. So I'm really trying to find a phone number to speak to a human or listen to the answering thingy so that I can verify their hours.
This proposal is to help automate updates on other sites when the restaurant changes their menu or their hours. The only way this is going to work is if the computers that help manage the restaurant also provide information to the website. Need to change the menu? Great, the changes are also pushed to the website. Changing the hours employees can clock in? Comes with a requisite change to operating hours and is reflected on the website.
I like this idea very much and its simplicity, but it seems inevitable to me that going down this path will just recreate RDF[0] and RDF Schema. A sort of a semantic web version of Greenspun's tenth rule[1].
For those of you who want to quickly get up to speed on RDF/Schema, "A Semantic Web Primer for Object-Oriented Software Developers"[2] was to me a very good introduction.
[0] From the W3C primer on RDF: "The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web."
I'm an RDF kool-aid drinker, but removing friction for adding somewhat structured content is always OK for me. I'd rather have a standard way of converting from the business.txt format to RDF than not have the data at all.
The schema is rather US-centric. For instance, many countries don't have "states". They may have other divisions, in the 0..N level range, with other names. It would be better to research and use a current, established format for international addresses.
A "Falsehoods programmers believe about addresses" [1][2] is long overdue. “There is a current, established format for international addresses” is probably one of them.
Line 1, Line 2, Line 3, Country should cover just about all cases.
Ask yourself if you really need to break out the specific components of the address. In this case you don't. As long as the user knows the correct local format, it's fine.
The txt file isn't just about the user though, it's to aid in indexing useful information from the site, so having some sort of breakdown into nested administrative divisions is something that makes sense (after all, I don't just say "I'm looking for a steakhouse in the USA" when I'm trying to decide where to have dinner). Of course, administrative divisions introduce their own problems and work against the whole human readable / human writeable nature of what they're trying to achieve.
On the flipside, even Line 1, 2, 3, country isn't sufficient for all addresses. If you have an addressee, additional delivery information (eg a department), need to include a rural route identifier (eg for Canada), or need to store/use bilingual addresses (again for Canada) then you need more than 3 lines. If you want to talk edge cases, having a country code means that places like the Haskell Free Library and Opera House (http://en.wikipedia.org/wiki/Haskell_Free_Library_and_Opera_...) can't be correctly addressed.
Hell, even the "Falsehoods programmers believe about time" doesn't come close to capturing the intricacies of lunar/lunisolar calendars (eg those with 13 months in a year, or a variable number of months in a year, etc).
I guess that my point is that you need to find the balance of utility and complexity. If you need to be able to store every format of everything you wind up with either a hugely complex schema, or a single field that contains everything (and may even not capture everything completely), but that's not useful for anything except end-user display when the user is able to parse (or make a good guess at) the data.
Unfortunately there isn't a single winning approach - so unless you draw an arbitrary line your specification can't encompass every edge case while maintaining simplicity and achieving what it's set out to.
How well does the business.txt standard hold up against malicious behavior? For example, what happens if I want to defame Restaurant X, so I make restaurantXsucks.com and put a business.txt file in my root directory with the same address and contact information? Currently, Google Places (the service that puts stuff on Google Maps) mails a PIN to the address and requires verification before listing to mitigate this problem -- how would business.txt mitigate the problem?
More simply, you're saying that this solves the updating issue but not the trust problem. This is true, but currently it's no better than what is being proposed.
Google Places is good, but I doubt Yelp does anything, for example, and the same goes for a ton of other sites. And most people don't even know to trust Google Places more anyway; they'll just trust the top result on Google.
Regardless of the format of the content, I think that it'd be nice if files like this would be placed in /.well-known/[standard] in accordance with http://tools.ietf.org/html/rfc5785
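That would also give crawlers exactly one place to look, something like the following (the "business.txt" name here is just the proposal's; a real registration could pick a different token):

    http://example.com/.well-known/business.txt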
There are already too many magical files cluttering up the root.
It's supposed to be used by newer specifications, not change existing ones like robots.txt. That would create too many problems with existing applications.
Why do business people try to push business standards as technical solutions? That's not what standards are for; they are for technical problems. It looks like DRM to me: a technical solution to a social problem or a broken business model.
TL;DR: there are existing solutions, micro-formats for example.
But then how will fesja be able to tell everyone "Hey, I created a web standard! /flex"
You're right. No research was done. Author just threw information that he thought was important into a text file and called it a day. RFC 5785 says to put the file into the .well-known folder, the author only thought about United States addresses when making this, and as you stated the problem has already been addressed.
I don't know what the correct word for this is, or even whether it will go anywhere.
The only thing I know is that there is a problem local businesses and website providers have, and there isn't an efficient solution yet. I've proposed a solution so we can discuss it and see if it makes sense. That's where we are now.
About going international: I'm from Spain, so of course I will be the first one interested in having an international "standard". People are already giving suggestions on GitHub!
This – in some form – is probably a good idea. Recently, I've worked on a few Business Improvement Area projects, and one of the hassles for BIAs is keeping up-to-date information for each business (i.e., hours of operation, services, description, etc.). So, this type of implementation would be great.
I think what I really get from this is that each business needs some form of public API.
You want to trust the businesses being reviewed instead? This also wouldn't stop these sites choosing which reviews to show. The best solution is to find a site you trust.
I think he's suggesting you would host this on your personal domain, reviewing other services. Yelp/etc would then act as an aggregation of these reviews, rather than the hosting company for them.
Yes, that's it, though it probably shouldn't require a particular position in the URL, so that you could e.g. put it on Dropbox, make it public and link it from somewhere.
As long as it was called reviews.txt and had that particular format, it'd be valid.
Does anyone have contacts for people on the "data harmonization" teams at Google, Facebook, Foursquare, Yelp, etc.? Could you share this idea with them to see if we can discuss it further?
Please don't take this the wrong way, since I think it's great that you are thinking about these problems, but:
It's not clear what advantage your format offers.
On the other hand, it has some pretty clear disadvantages, including generating massive amounts of possibly useless web traffic, not just on the server side, but on the crawling side, since now getting business info takes two requests, instead of one (when it is embedded in a schema.org format on the page).
Additionally, without some tag that tells you whether business.txt would exist, you get to check for every website.
This will slow down crawlers.
Given that at least most of the companies on the crawling side of this want to support, or already support, the schema.org markup version, ISTM you would be better off spending your time making simple generators for it or adding support for it to WordPress et al.
FWIW: I have no comment on whether text formats are better than schema.org or anything like that, but to a large degree, it's irrelevant, because getting a large number of folks to support something they believe is already a solved problem is very very difficult.
I worked at SAP Research on that exact subject. Many comments here were really spot on and sprung to my mind as well. Instead of responding to each one individually and as you asked for it I'll try to organize my feedback here, FWIW:
1. I wrote in another comment here (ctrl+f) but there's overlap with other standards. One of our main multi-million EU projects was about seamless integration of different standards for representing information about resources (i.e. details of a factory) when integrating between very large entities (i.e. BMW and Honda). It was more on the mathematical/computer-science side than technical (i.e. not "Is some path for business.txt better than others"). When two IT departments/armies of consultants use different standards for everything from Address to TaxReceiptCode and you try to integrate it's ugly. When it's a large N of such, you need multi-million EU-sanctioned research projects. Moral of the story: please use standards. That said:
2. There's definitely a Turing tarpit[1] situation with RDFs and RDF Schema(s). As you said on GitHub, "Other local businesses may think on adding to their website some metadata (using http://schema.org/). The problem is that it's too complicated for a non-developer". The best standard is the one being used instead of being forgotten in a hundred-page design document. If you find yourself thinking about the problems RDF (and its schemas) tries to solve, please take into consideration a clear and standard one-to-one conversion between your specification and some other more established and expressive specification. For example, define an RDF Schema and a clear conversion to and from business.txt for RDF using that schema (a rough Turtle sketch follows at the end of this comment). The main benefit for the project, if you take such a conversion into consideration while designing the document, is that it promises a clear way for future spiders to interpret the data regardless of expressiveness, and may give the more knowledgeable authors of such documents more expressiveness where needed. The default should be clear and easy. Case in point:
3. People here commented on how the address format is very American. They are right. But for ease of use, maybe the default should be American, with an option to explicitly express the address differently (i.e. a different administrative entity than states), and let there be a canonical conversion to something like this, including a way to explicitly express the address using something like the Freebase schema for addresses[2]. Notice how complex the types for the different fields are, like State/Province. This is because encompassing an Address entity globally is a complex problem. Heck, in some countries the use of place descriptors ("fourth junction after the main entrance to town") is still common. I once read some research paper on it, but it eludes me at the moment.
EDIT: 4. Some people here ask what the use is. I'm sure you can address this better than myself, but the main "selling point" for me is the ease of use. The focus should be on very easy defaults. Properly defining namespace URIs or a microdata itemtype, for example, is already error-prone/requires too much thinking for the general user. I do think there may be room for this project.
That's it for now. Too long of an HN break as it is :-)
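Addendum, to make point 2 above concrete: a canonical mapping could turn each business.txt into a handful of triples, e.g. in Turtle (the biz: prefix and the property choices are invented here purely for illustration):

    @prefix schema: <http://schema.org/> .
    @prefix biz:    <http://example.org/business-txt#> .

    <http://restaurant.example.com/>
        a schema:LocalBusiness ;
        schema:name      "Example Café" ;
        schema:telephone "+34 912 345 678" ;
        biz:source       <http://restaurant.example.com/business.txt> .

The business owner never sees this; a spider that already speaks RDF gets it for free from the simple file.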
Have you ever written a SAX parser? XML parsing is not "easier". Screw the libraries that "do it for you". You still have to understand the tiered data structure, attributes vs content, namespaces... the list goes on. Understanding the schema and all that jazz is WAY more difficult than key: value.
edit: Parsing is just as easy. Here is a one-liner in ruby:
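    # one plausible version: turn "key: value" lines into a hash, skipping malformed entries
    Hash[File.readlines('business.txt').map { |l| l.strip.split(': ', 2) }.select { |kv| kv.length == 2 }]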
How does your parser handle malformed entries? How do you handle text encoding? What are the valid attributes, what constitutes valid text?
XML is unsexy, but it's only complex because it addresses these issues up front. A clever one-liner doesn't make these issues go away, it just postpones them to an inopportune time later.
The original post doesn't address these issues, and I'm not sure it was ever designed to. If someone wanted to create a globally accessed, multi-language, all-encodings-supported, fully validated business information XML-RPC protocol, they certainly wouldn't have arrived at "business.txt".
The main problem it was trying to solve is to prevent business owners from having to update their information everywhere it ends up - not to ensure proper encoding (99% of applications would be happy with UTF-8) and formatting. In almost all use cases simply copying what was there and plopping it into a string would be fine.
Which brings me back to my point about XML. A lot of the time, XML produces complexity that simply isn't necessary. Following the 80/20 rule, designing it for your specific use case will be multitudes faster and work fine for your target audience, no need to build an enterprise system and standard global protocol out of the gate.
If the entry is malformed the parser skips it...no harm. Text encoding handled like any other text file. Attribute and content validity issue exists with XML, too.
> If the entry is malformed the parser skips it...no harm.
How is the poor non-technical person who made the file to know there was a problem? Run it through some kind of validator? Or just wait a couple days and see if Google has picked up the file properly? The former is what you get with XML, the latter is what you get with DNS (and which necessitates tools such as DNS Report). If you dislike the complexity of the first option, you must be saying you prefer the second, which is ludicrous.
> Text encoding handled like any other text file.
So not, in other words. Or we can implicitly include HTTP in our non-specification, and now our user has to be sure the server is going to issue the file with the correct encoding header. Which again is not something our poor benighted user is going to have the chops to do.
> Attribute and content validity issue exists with XML, too
I didn't say XML magically makes these problems go away. I said XML forces you to deal with them up front rather than later on.
> txt: 1, XML: 0
You've done nothing but push food around on your plate and dodge responsibility for technical problems you created. If this constitutes "proof" of anything but an inability to see long-term consequences of short-term "let's throw some code at it" thinking, we're all doomed.
> How is the poor non-technical person who made the file to know there was a problem? Run it through some kind of validator? Or just wait a couple days and see if Google has picked up the file properly? The former is what you get with XML, the latter is what you get with DNS (and which necessitates tools such as DNS Report). If you dislike the complexity of the first option, you must be saying you prefer the second, which is ludicrous.
Why wouldn't the user run the parser himself, probably using some kind of frontend (web, possibly)?
Going from "the parser skips it" to "you need to wait for Google to index the file" doesn't make sense unless you for some reason assume that Google owns the one single parser in existence and, unlike with all the structured formats they support, they don't offer an online tool for showing how it'll read the data.
> So not, in other words. Or we can implicitly include HTTP in our non-specification, and now our user has to be sure the server is going to issue the file with the correct encoding header. Which again is not something our poor benighted user is going to have the chops to do.
You're right that encoding needs to be solved, but the solution is to just make UTF-8 mandatory and be done with it. There's no real reason to support every encoding under the sun nowadays.
If the user is not technically able to make sure he saves the file in the right format, he can just use a tool. It's not like XML doesn't require tools anyway.
Bloating the file format is a poor solution to that problem.
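And the "UTF-8 only" rule costs the consuming side almost nothing; a sketch in Ruby (the file name is assumed):

    # Read strictly as UTF-8 and reject anything else, rather than guessing encodings
    text = File.read('business.txt', encoding: 'UTF-8')
    raise 'business.txt must be valid UTF-8' unless text.valid_encoding?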
If they're using a front-end tool, you're committing to making additional software--you might as well have the front-end generate the file. What's the gain of using one file format over another if it's the structured output of your program?
> You're right that encoding needs to be solved, but the solution is to just make UTF-8 mandatory and be done with it. There's no real reason to support every encoding under the sun nowadays.
I agree--but that's the kind of decision that needs to be made and documented up-front. And you'll still have issues, because users will be creating text files on their home computers, and who knows what their home computer's encoding is set to? It's not guaranteed to be Unicode. You'll either wind up bloating your spider by guessing encodings, or you'll have made the format more strict.
> If the user is not technically able to make sure he saves the file in the right format, he can just use a tool. It's not like XML doesn't require tools anyway.
If I wait long enough, my point makes itself.
> Bloating the file format is a poor solution to that problem.
You don't have to love XML to be able to admit that it sometimes is the right solution. By not using it, you're admitting you'll deal with all these problems yourself instead. It may be that you can handle them with less effort than using the solution XML provides, but you'll be making everyone else go to that same effort as well, which isn't the case with XML.
At the end of the day, this is the fifth proposal to solve this problem and it's obviously the worst of the lot. The real world will go on using microformats, RDF and Google's AI.
> And you'll still have issues, because users will be creating text files on their home computers, and who knows what their home computer's encoding is set to? It's not guaranteed to be Unicode. You'll either wind up bloating your spider by guessing encodings, or you'll have made the format more strict.
It should be strict. Supporting multiple encodings is a bad solution in any case.
> If I wait long enough, my point makes itself.
I'm not sure I follow you. Even if this format always required a tool - which it doesn't, only if you can't choose "UTF-8" when saving in your text editor - how would that make XML a better choice?
> You don't have to love XML to be able to admit that it sometimes is the right solution.
There may be cases where XML is the right solution. I don't believe this is one. Even if this format is not right either, there are still less bad solutions.
> By not using it, you're admitting you'll deal with all these problems yourself instead.
Which is often a trade worth making.
> It may be that you can handle them with less effort than using the solution XML provides, but you'll be making everyone else go to that same effort as well, which isn't the case with XML.
What effort does this solution impose that XML doesn't?
> At the end of the day, this is the fifth proposal to solve this problem and it's obviously the worst of the lot. The real world will go on using microformats, RDF and Google's AI.
Certainly, no disagreements there! I'm a fan of both microformats, since they have the big advantage of not duplicating effort and data, and of RDF - I publish mine as Turtle[1], which by the way happens to not support multiple encodings either, it's all UTF-8, thankfully.
We want it to be a human-friendly file. If we had a JSON or XML file, it would be too complicated for a non-developer to write or read. This way, I think it doesn't matter if the website is done in WordPress, Drupal, static files, Flash, etc. It's just a simple file.
Also, we are following the same pattern as the robots.txt file.
I would strongly suggest e.g. JSON or XML - if you can upload a website, the chances are that you could also create something like this. There would of course also be a template where absolute noobs could fill in the details and get correct output. With our current tools, implementing a free-form plain-text format seems like a pretty stupid idea - everyone would have to implement their own parsing. With e.g. JSON, there are libraries for every language and it has validation.
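For comparison, the same kind of data as JSON (field names invented for illustration):

    {
      "name": "Example Café",
      "phone": "+34 912 345 678",
      "address": "Calle Mayor 1, Madrid",
      "opening_hours": "Mon-Fri 09:00-18:00"
    }

Any language can parse and validate this out of the box; the cost is the brackets and quoting that the business owner has to get exactly right.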
YAML, or the format used by Python's configParser or PHP's INI files (don't know what it's called), might be easiest. Less syntax and formatting.
I'm not so sure about the proposal itself, though. For local, small businesses, how much can we assume about technical ability? If they have to pay the people who did their website to keep it up to date, how can we be sure they'll use it?
Yeah, I have to second this. No "non-techie" person will be updating this on their own, for the simple fact that they need to upload it after they're done or edit it via FTP... etc. A WordPress plugin would help, but it still lacks great visibility to the business owner.
The main issue I see with this philosophy is that you are pushing the responsibility for understanding the format upstream, similar to all the problems people used to have with malformed RSS feeds.
If you want to have a feed of some kind that is machine-readable then you need to have a spec that is unambiguous. Using JSON or XML helps. Having a clear understanding of what could be in each field, including currencies, timezones, etc, is required.
If everything follows the format (name: value) I can't see a reason why machine parsing would be difficult, I would guess the problem with XML (or any structure beyond plain text) is that it can be very daunting for technology inexperienced people.
I would say that nowadays there is no reason not to use UTF-8, but then I found out that Notepad still saves using ANSI as the default encoding. Damn you, Microsoft!
Yext (http://www.yext.com/) offers a paid solution for exactly this issue - they sync local business info across 35+ different sites (Bing, Yahoo, Yelp, etc.)
Disclaimer: My significant other works there. But I wouldn't recommend it if it weren't useful/relevant/awesome.
This is why I like the business.txt idea. Microformats or microdata, however appropriate and simple, are still one level of abstraction above this idea.
I came to this post 45 seconds ago, looked at GH, and could explain what this is for and how it works. That's saying something.
My browser is failing to render the address properly - the accented character in "Poissonnière" comes out garbled. You probably need to double-check how you serve the file, and make sure the charset and encoding match.
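For example, if the file is saved as UTF-8, the response header should say so explicitly instead of letting the browser guess:

    Content-Type: text/plain; charset=utf-8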
I like the idea of having a standard place to find this information. It's still for robots though isn't it? Why not include this information in robots.txt?
Have you looked at http://schema.org/docs/schemas.html and the examples there?