

Ask HN: Workflow Issue: How to "deformat" text without losing links? - brandnewlow

What's the best way to get text from Word into raw HTML without cutting out URLs, bold tags, italics and underlining?

I run a Drupal-powered news site. We've got about 30 contributors. They post their own work. Sometimes I post it for them.

These people always write in Word. No matter how many times I ask them not to, they do. Pasting into our site from Word leads to formatting problems. Now, I've got a TinyMCE plugin that will strip out all formatting from the text and just give me plain text, but then I lose useful tags like <a>, <b>, and <i> and have to add them in again. I don't have time to add in 30 links again. And I don't have time to strip out all the awful <span style="font-family:cambria">-type tags that Word puts into the text.

Solutions?
======
babyshake
Put a good rich text editor on your site. Here are a few good ones:

[http://bulletproofbox.com/web-based-rich-text-editors-
compar...](http://bulletproofbox.com/web-based-rich-text-editors-compared/)

------
ivank
It's a solved problem.

<http://www.google.com/search?q=word+cleaner+html>

------
noodle
I like the JavaScript rich text editor idea.

You could also go with a basic regex removal of everything that isn't
'useful': strip out all tags except a/i/b/u, and strip out all
'style="*"' attributes.

Since it's Drupal, it's in PHP, so this is perhaps in order:
<http://us2.php.net/strip-tags>
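
One caveat on that approach: PHP's strip_tags takes a list of allowed tags, but it leaves the attributes on those tags alone, so a style attribute on a kept tag survives. Below is a minimal sketch of the whitelist idea that handles both, written in Python for illustration rather than PHP and using a parser instead of a regex. The names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `WordCleaner` and `clean_word_html` are hypothetical, and keeping only `href` on links is an assumption about what counts as useful.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the only markup worth keeping is a/b/i/u, and only href on links.
ALLOWED_TAGS = {"a", "b", "i", "u"}
ALLOWED_ATTRS = {"a": {"href"}}
# Tags whose text content should be dropped entirely.
SKIP_CONTENT = {"style", "script"}


class WordCleaner(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop span, p, font, etc. but keep their text
        keep = ALLOWED_ATTRS.get(tag, set())
        attr_text = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in keep  # style, class, lang... are discarded
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def clean_word_html(markup):
    """Return markup with everything outside the whitelist removed."""
    parser = WordCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_word_html` on pasted Word markup keeps the links and bold/italic/underline tags and drops the span and style noise. Note that this sketch also drops paragraph tags, so paragraph breaks would need separate handling, and it does not check where an href points.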

