

Ask HN: How to implement a successful semantic tagging strategy? - hcarvalhoalves

Does anyone know of good resources (papers, articles, open source projects) that can serve as best practices or prior art on semantic tagging inside a CMS context (aka triple tags or machine tags)? Unfortunately Google doesn't return anything insightful (most results only talk about RDF) and interest in this subject seems rare. Even better if someone from HN has actual experience implementing a semantic tagging system and is willing to toss in their $0.02.
======
hmgauna
I'm really interested in this topic, and unfortunately there seems to be, as
you say, a lack of interest in it. As far as I know, the latest effort toward
a standard for semantic markup is <http://schema.org/>. Apparently the major
search engines support this standard by now (read
<http://googleblog.blogspot.com.ar/> ). Judging by what I read in your reply
to keefe, I think this is what you may be looking for. Example for a movie:
<http://schema.org/Movie>. Hope it helps. I'm not always fully satisfied with
how it ends up classifying reality, but that is where we are for now.

------
traxtech
What you're looking for, I believe, is called "semantic clustering": grouping
textual content by related concepts.

You may be interested in the algorithms used in Apache Mahout (
[https://cwiki.apache.org/confluence/display/MAHOUT/Algorithm...](https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms)
)

------
dougk7
Maybe Carrot2 <http://project.carrot2.org/> is worth looking at

------
maxdemarzi
Maybe something on <http://tm.durusau.net/?cat=151>

------
keefe
RDFa and microformats are often used, but I'd need to know more about your
particular use case.

~~~
hcarvalhoalves
More specifically, I'm interested in the best practices for semantically
tagging content in a CMS. E.g., how to tag an article with a movie title and
an artist name, and then persist this information in such a way that it is
retrievable and searchable. I've seen quite a few tagging solutions and
models, but there doesn't seem to be agreement on the technologies involved,
and even less on the theory behind them.
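
To make the movie/artist example concrete, here is a minimal sketch of
machine tags in the common namespace:predicate=value form, parsed and kept in
an in-memory inverted index. Everything named here (`MachineTag`, `TagIndex`,
the `movie` and `person` namespaces, the article ids) is hypothetical and only
illustrates the shape of the problem, not any particular CMS or standard.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


# namespace:predicate=value, e.g. movie:title=Blade Runner
_TAG_RE = re.compile(r"^([a-z][a-z0-9_]*):([a-z][a-z0-9_]*)=(.+)$", re.IGNORECASE)


def parse_machine_tag(raw: str) -> Optional[MachineTag]:
    """Return a MachineTag, or None if `raw` is a plain (non-semantic) tag."""
    match = _TAG_RE.match(raw.strip())
    if match is None:
        return None
    namespace, predicate, value = match.groups()
    return MachineTag(namespace.lower(), predicate.lower(), value.strip())


class TagIndex:
    """In-memory inverted index: machine tag -> set of article ids."""

    def __init__(self) -> None:
        self._by_tag = defaultdict(set)

    def add(self, article_id: str, raw_tag: str) -> None:
        tag = parse_machine_tag(raw_tag)
        if tag is None:
            raise ValueError(f"not a machine tag: {raw_tag!r}")
        self._by_tag[tag].add(article_id)

    def find(self, namespace, predicate=None, value=None):
        """Match on namespace, optionally narrowing by predicate and value."""
        hits = set()
        for tag, article_ids in self._by_tag.items():
            if tag.namespace != namespace:
                continue
            if predicate is not None and tag.predicate != predicate:
                continue
            if value is not None and tag.value.lower() != value.lower():
                continue
            hits |= article_ids
        return hits


# Hypothetical articles and tags, for illustration only.
index = TagIndex()
index.add("review-1", "movie:title=Blade Runner")
index.add("review-1", "person:name=Harrison Ford")
index.add("review-2", "movie:title=Alien")
```

With this, `index.find("movie")` returns every article tagged with any movie,
while `index.find("movie", "title", "blade runner")` narrows to one title. A
real system would persist the three tag fields as columns or triples instead
of a dict.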

~~~
keefe
Are you rolling your own CMS? Yahoo was at one point using RDFa, so I think
that is the closest thing to an agreed standard. Entity extraction to
generate tags from natural language is a different and more difficult issue.
Some reading on parametric search would probably be helpful.

~~~
hcarvalhoalves
Indeed, I'm developing a CMS. The problem is that all our material (articles,
reviews, etc.) requires a high degree of semantics, so just tagging posts
won't cut it. Thank you very much, I'm looking into RDFa.

~~~
keefe
So in some sense you can look at tagging as a core operation. If you want a
quick hack solution, restrict the vocabulary you allow for tagging such that
each unique term is an instance in some ontology; this allows you to build
subsumption hierarchies etc. I'd just read the regular RDF material on
rdfabout.com and understand that really deeply, then define a namespace like
<http://yoursite/tags> and write an OWL or RDFS ontology in Protege, which is
free and open source. It evolved into a tool I worked on for a different
company, and they have a free version called TopBraid Composer. I'd highly
recommend you move off of the microformat/RDFa solution and hack in your first
draft using simple tags that you then describe in an ontology. Referential
semantics is actually complicated, so don't get discouraged, and work up
incrementally.
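
A rough sketch of the restricted-vocabulary idea, using a plain Python dict
in place of a real RDFS/OWL toolchain. The terms, the single-parent
simplification, and the trailing `#` on the namespace are all assumptions made
for illustration; the parent links stand in for what an ontology would express
as a subsumption (broader/narrower) relation.

```python
# Hypothetical controlled vocabulary: each allowed tag maps to its broader
# concept (None for a root). Only these terms may be used as tags.
NAMESPACE = "http://yoursite/tags#"

PARENT = {
    "Work": None,
    "Movie": "Work",
    "SciFiMovie": "Movie",
    "Person": None,
    "Artist": "Person",
    "Actor": "Artist",
}


def tag_uri(term):
    """Reject free-form tags; only vocabulary terms get a URI."""
    if term not in PARENT:
        raise KeyError(f"{term!r} is not in the controlled vocabulary")
    return NAMESPACE + term


def ancestors(term):
    """All broader concepts of `term`, nearest first."""
    chain = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(term, concept):
    """True if `term` is `concept` or is subsumed by it."""
    return term == concept or concept in ancestors(term)


def search(articles, concept):
    """Sorted ids of articles carrying any tag subsumed by `concept`."""
    return sorted(
        article_id
        for article_id, tags in articles.items()
        if any(is_a(tag, concept) for tag in tags)
    )


# Hypothetical tagged content, for illustration only.
articles = {
    "review-1": ["SciFiMovie", "Actor"],
    "review-2": ["Movie"],
    "interview-1": ["Artist"],
}
```

Searching for `"Movie"` then also finds articles tagged with the narrower
`"SciFiMovie"`, which is the payoff of the hierarchy over flat tags. Once this
shape feels right, the same terms and parent links can be moved into an
ontology file.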

