<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Henrik Falck&#039;s blog &#187; language analyzer</title>
	<atom:link href="http://henrikfalck.com/blog/tag/language-analyzer/feed" rel="self" type="application/rss+xml" />
	<link>http://henrikfalck.com/blog</link>
	<description>reinventing web 3.0</description>
	<lastBuildDate>Mon, 12 Apr 2010 00:33:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Localization support for language identifier</title>
		<link>http://henrikfalck.com/blog/2010/04/localization-support-for-language-identifier.html</link>
		<comments>http://henrikfalck.com/blog/2010/04/localization-support-for-language-identifier.html#comments</comments>
		<pubDate>Sun, 11 Apr 2010 03:18:40 +0000</pubDate>
		<dc:creator>Henrik Falck</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[improvements]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[language analyzer]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[web apps]]></category>

		<guid isPermaLink="false">http://henrikfalck.com/blog/?p=370</guid>
		<description><![CDATA[Something&#8217;s wrong when a language identifier doesn&#8217;t have localization support. So I cooked up a little localization code for What Language Is This?, which proved to be not as easy as one might guess. That&#8217;s because some of the textual content of the web app is in HTML, other is generated by PHP, and yet [...]]]></description>
			<content:encoded><![CDATA[<p>Something&#8217;s wrong when a language identifier doesn&#8217;t have localization support. So I cooked up a little localization code for <a href="http://whatlanguageisthis.com/" title="What Language Is This? Online language identifier"  target="_blank">What Language Is This?</a>, which proved to be not as easy as one might guess. That&#8217;s because some of the textual content of the web app is in HTML, some is generated by PHP, and the rest is generated in JavaScript. I wanted a single source of localized strings for all three output paths, to make it simpler to review, translate, change, and add strings in the web app.</p>
<p>I&#8217;m not sure if there&#8217;s any good solution for this out there, but I cooked up my own. Each language translation has its strings in a text file formatted like an ini file with id keys and localized strings separated by an equals sign. You can view the <a href="http://whatlanguageisthis.com/strings-en.txt"  target="_blank">English</a> and <a href="http://whatlanguageisthis.com/strings-ja.txt"  target="_blank">Japanese</a> raw text files if you like. These are read into a PHP array (i.e. dictionary), after first looking at what language is specified by the URL (/en for English, /ja for Japanese or any other code), and if that is not specified then looking at what languages the browser is set to prefer via the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4"  target="_blank">Accept-Language</a> HTTP header. If the requested language is not available then default to English.</p>
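<p>The lookup order described above can be sketched roughly as follows. This is only an illustration in JavaScript, not the PHP the site runs: the function names (parseStrings, preferredLanguages, pickLanguage) are made up for this example, and it assumes that only the primary language code of each Accept-Language entry matters.</p>

```javascript
// Sketch only: the site does this in PHP; all names here are hypothetical.

// Parse "id = localized string" lines into a dictionary object.
function parseStrings(text) {
  const dict = {};
  for (const line of text.split(/\r?\n/)) {
    const eq = line.indexOf("=");
    if (eq === -1) continue; // blank line or no separator
    const id = line.slice(0, eq).trim();
    if (id === "") continue;
    dict[id] = line.slice(eq + 1).trim();
  }
  return dict;
}

// Turn an Accept-Language header into primary language codes, best first.
function preferredLanguages(header) {
  return (header || "")
    .split(",")
    .map((part) => {
      const [tag, ...params] = part.split(";").map((s) => s.trim());
      const q = params.find((p) => p.startsWith("q="));
      return {
        lang: tag.toLowerCase().split("-")[0],
        weight: q === undefined ? 1 : parseFloat(q.slice(2)),
      };
    })
    .filter((entry) => entry.lang !== "")
    .filter((entry) => entry.weight > 0)
    .sort((a, b) => b.weight - a.weight)
    .map((entry) => entry.lang);
}

// URL code first, then browser preferences, then English as the default.
function pickLanguage(urlPath, acceptLanguage, available) {
  const fromUrl = urlPath.replace(/^\/+/, "").split("/")[0].toLowerCase();
  const wanted = fromUrl !== "" ? [fromUrl] : preferredLanguages(acceptLanguage);
  const match = wanted.find((lang) => available.includes(lang));
  return match === undefined ? "en" : match;
}
```

<p>With English and Japanese available, a request for /ja picks Japanese regardless of the browser settings, while a request for an untranslated code such as /de falls back to English.</p>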
<p>To get the HTML output localized, the PHP script that reads through and configures the app (the plain HTML file itself is set up to run offline for debugging purposes only) looks for string ids enclosed in percent signs, like %string id%. These are then replaced with the localized strings from the dictionary. The PHP-generated content is trivially changed to look up strings from the dictionary. On the JavaScript side, I wanted access to the same string dictionary that I had on the PHP side, so it is inserted into a &lt;script&gt; block of the generated HTML output as a JavaScript object (i.e. dictionary). String id lookups can then be done on this object from the JavaScript code just like on the PHP side. In other words, the PHP string dictionary is converted into JSON, which is used from the JavaScript side.</p>
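<p>As a rough sketch of those two steps, again in JavaScript rather than the site&#8217;s PHP: the names localizeHtml, stringsAssignment, and STRINGS are hypothetical, and the pattern assumes that string ids consist of letters, digits, underscores, spaces, and hyphens.</p>

```javascript
// Sketch only: hypothetical names; the site performs these steps in PHP.

// Replace %string id% placeholders with entries from the dictionary.
// Unknown ids are left untouched, so a missing translation stays visible.
function localizeHtml(html, dict) {
  return html.replace(/%(\w[\w -]*)%/g, (placeholder, id) =>
    Object.prototype.hasOwnProperty.call(dict, id) ? dict[id] : placeholder
  );
}

// Build the statement that goes inside the generated script block.
// \x3c is the less-than sign; escaping it keeps a translated string
// from closing the script block early.
function stringsAssignment(dict) {
  const json = JSON.stringify(dict).replace(/\x3c/g, "\\u003c");
  return "var STRINGS = " + json + ";";
}

// On the JavaScript side, a lookup is then a plain property access.
function lookup(strings, id) {
  return Object.prototype.hasOwnProperty.call(strings, id) ? strings[id] : id;
}
```

<p>Falling back to the raw id in lookup mirrors what the offline version shows when no dictionary is available.</p>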
<div id="attachment_371" class="wp-caption aligncenter" style="width: 310px"><a href="http://henrikfalck.com/blog/wp-content/uploads/2010/04/wlit-japanese.jpg" ><img class="size-medium wp-image-371" title="あれ何語？ What Language Is This? in 日本語" src="http://henrikfalck.com/blog/wp-content/uploads/2010/04/wlit-japanese-300x186.jpg" alt="あれ何語？ What Language Is This? in 日本語" width="300" height="186" /></a><p class="wp-caption-text">あれ何語？ What Language Is This? in 日本語</p></div>
<p>It all works pretty well and meets my goals. The only downside is that it relies on the server to do some processing, so when I develop on the offline version the strings aren&#8217;t available. Instead I get to see the raw string ids, which can be useful too, but you have to rely on imagination to envision the end result. Isn&#8217;t programming always like that anyway, though?</p>
<p>The first translated version of What Language Is This? is of course <a href="http://whatlanguageisthis.com/ja" title="ウェブ上言語識別サービス"  target="_blank">Japanese</a>, done by myself and my wife (初めての共同作業? lol). That&#8217;s not just because it&#8217;s easy for me to do: according to the <a href="http://addthis.com/"  target="_blank">AddThis</a> stats, Japan is the top ranking country, and since, as you know, the average English skills in Japan are pretty bad, I suspect there is a demand for a Japanese translation. Looking at the access stats, and discounting countries with good English skills (India, Netherlands, Scandinavia, for example), next in line would most likely be Spanish, French, and German, in that order. Anyone feel like helping? Please drop me a comment in that case. I can offer proper credit and a link back from the site in return.</p>
]]></content:encoded>
			<wfw:commentRss>http://henrikfalck.com/blog/2010/04/localization-support-for-language-identifier.html/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>More Dravidian language identification</title>
		<link>http://henrikfalck.com/blog/2009/06/more-dravidian-language-identification.html</link>
		<comments>http://henrikfalck.com/blog/2009/06/more-dravidian-language-identification.html#comments</comments>
		<pubDate>Tue, 16 Jun 2009 12:27:00 +0000</pubDate>
		<dc:creator>Henrik Falck</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[improvements]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[language analyzer]]></category>

		<guid isPermaLink="false">http://henrikfalck.com/blog2/2009/06/more-dravidian-language-identification.html</guid>
		<description><![CDATA[Lately, What Language Is This?, the web-based language identification tool I&#8217;m running, has been getting many hits from Tamil-language sources, probably as a result of being covered in two seemingly popular blogs, techintamil.blogspot.com, and tamilnenjam.com. As another blogger pointed out,
Also this service is very good at identifying indic languages (where as many other services fail [...]]]></description>
			<content:encoded><![CDATA[<p>Lately, <a href="http://whatlanguageisthis.com/" >What Language Is This?</a>, the web-based language identification tool I&#8217;m running, has been getting many hits from <span style="font-weight: bold;">Tamil</span>-language sources, probably as a result of being covered in two seemingly popular blogs, <a rel="nofollow" href="http://techintamil.blogspot.com/2009/06/foreign-language-detection-tools.html" >techintamil.blogspot.com</a>, and <a href="http://www.tamilnenjam.org/2009/06/blog-post_11.html" >tamilnenjam.com</a>. <a href="http://www.techdreams.org/tips-tricks/how-to-identify-language-of-unknown-text/2739-20090609" >As another blogger pointed</a> out,</p>
<blockquote style="font-style: italic;"><p>Also this service is very good at identifying indic languages (where as many other services fail to understand).</p></blockquote>
<p>Well, thanks. And yes, I have been making sure that the languages of the Indian subcontinent and its surrounding areas are thoroughly supported for identification.</p>
<p><span style="font-weight: bold; font-style: italic;">But two notable languages have been missing</span>, and I finally got around to adding them. Namely the two Dravidian languages <span style="font-weight: bold;">Malayalam</span> (not to be confused with Malay, to which it is unrelated) and <span style="font-weight: bold;">Kannada</span> (not to be confused with Canada, to which it is unrelated).</p>
<p>Together with the already supported <span style="font-weight: bold;">Tamil</span> and <span style="font-weight: bold;">Telugu</span>, this means that <span style="font-weight: bold; font-style: italic;">all four literary Dravidian languages are supported now!</span> I hope this will be of use to many, and I&#8217;d like to thank the Dravidian-speaking bloggers for their support in the form of writing about the site.</p>
]]></content:encoded>
			<wfw:commentRss>http://henrikfalck.com/blog/2009/06/more-dravidian-language-identification.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Feedback Feature for What Language Is This?</title>
		<link>http://henrikfalck.com/blog/2009/02/new-feedback-feature-for-what-language.html</link>
		<comments>http://henrikfalck.com/blog/2009/02/new-feedback-feature-for-what-language.html#comments</comments>
		<pubDate>Sun, 01 Feb 2009 13:03:00 +0000</pubDate>
		<dc:creator>Henrik Falck</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[improvements]]></category>
		<category><![CDATA[language analyzer]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[web apps]]></category>

		<guid isPermaLink="false">http://henrikfalck.com/blog2/2009/02/new-feedback-feature-for-what-language-is-this.html</guid>
		<description><![CDATA[I got around to implementing a feature I&#8217;ve been planning for What Language Is This? today: feedback. Not the comments &#8211; that&#8217;s been there from the start &#8211; but a way of sending immediate feedback on specific results. So that if you disagree with the result, or you know the correct language but it&#8217;s not [...]]]></description>
			<content:encoded><![CDATA[<p>I got around to implementing a feature I&#8217;ve been planning for <a href="http://whatlanguageisthis.com/" style="font-weight: bold; font-style: italic;" >What Language Is This?</a> today: <span style="font-weight: bold;">feedback</span>. Not the comments &#8211; those have been there from the start &#8211; but a way of sending immediate feedback on specific results. If you disagree with the result, or you know the correct language but it&#8217;s not yet supported, just click on the <span style="font-style: italic;">&#8220;send feedback&#8221;</span> link that appears with each result, and a simple form pops up where you can indicate what the problem with that result is.</p>
<p><a href="http://henrikfalck.com/blog/uploaded_images/language-identifier-feedback-782126.jpg" ><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 261px;" src="http://henrikfalck.com/blog/uploaded_images/language-identifier-feedback-782122.jpg" alt="" border="0" /></a><br />The entered text can also be sent with the feedback, allowing me to gather more sample texts to use as material for the statistical analysis used as a basis when identifying the language, and for testing. There&#8217;s an automatic test feature built into <a href="http://whatlanguageisthis.com/" >What Language Is This?</a>: just run <span style="font-family: courier new;">selftest()</span> from a JavaScript console on the page and it&#8217;ll test all supported languages to check for regressions &#8211; very handy when updating the database, since it&#8217;s easy to accidentally break some of the fine tuning.</p>
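<p>To illustrate the idea behind such a regression check, here is a minimal sketch. It is not the site&#8217;s actual selftest(), which takes no arguments: runSelfTest, toyIdentify, and the sample format are all assumptions made for this example.</p>

```javascript
// Sketch only: a stand-in for the idea, not the real selftest().

// Run the identifier over one known sample per language and collect
// every case where the detected language differs from the expected one.
function runSelfTest(identify, samples) {
  const failures = [];
  for (const [expected, text] of Object.entries(samples)) {
    const actual = identify(text);
    if (actual !== expected) failures.push({ expected, actual });
  }
  return failures;
}

// Toy identifier used only to exercise the check: kana means Japanese,
// anything else is reported as English.
function toyIdentify(text) {
  return /[\u3040-\u30ff]/.test(text) ? "ja" : "en";
}
```

<p>An empty result means no regressions; after retuning the data, any entry in the returned list points at a language that used to work and no longer does.</p>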
<p>Anyway, I think it&#8217;ll be useful, and I hope everyone will use it a lot since it&#8217;ll help me improve the site. I&#8217;m already getting a lot of useful and encouraging comments so it&#8217;s really fun to keep on developing it. For the next update I&#8217;ll probably add more languages.</p>
]]></content:encoded>
			<wfw:commentRss>http://henrikfalck.com/blog/2009/02/new-feedback-feature-for-what-language.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What Language Is This? Dot Com!</title>
		<link>http://henrikfalck.com/blog/2008/07/what-language-is-this-dot-com.html</link>
		<comments>http://henrikfalck.com/blog/2008/07/what-language-is-this-dot-com.html#comments</comments>
		<pubDate>Sat, 05 Jul 2008 07:40:00 +0000</pubDate>
		<dc:creator>Henrik Falck</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[language analyzer]]></category>
		<category><![CDATA[web apps]]></category>

		<guid isPermaLink="false">http://henrikfalck.com/blog2/2008/07/what-language-is-this-dot-com.html</guid>
		<description><![CDATA[http://whatlanguageisthis.com/
Since the language analyzer is becoming one of the most used web services that I run, the other day I was thinking that it would be cool to get it its own domain (and a .com domain costs just 50 SEK (around 850 yen in normal times) anyway). So I was thinking about what domain name [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-size:130%;"><a href="http://whatlanguageisthis.com/" >http://whatlanguageisthis.com/</a></span></p>
<p>Since the <a href="http://henrikfalck.com/languageanalyzer/" style="font-weight: bold;" >language analyzer</a> is becoming <span style="font-weight: bold; font-style: italic;">one of the most used web services</span> that I run, the other day I was thinking that <span style="font-style: italic;">it would be cool to get it its own domain</span> (and a .com domain costs just 50 SEK (around 850 yen in normal times) anyway). So I was thinking about what domain name to get &#8211; one that isn&#8217;t already taken &#8211; and well, one of the most common search phrases people use to find the language analyzer is &#8220;what language is this webpage/blog/text/whatever&#8221; and luckily <a href="http://whatlanguageisthis.com/" style="font-weight: bold;" >whatlanguageisthis.com</a> was available, <span style="font-style: italic; font-weight: bold;">so there it is!</span> I think it&#8217;s quite easy to remember and very easy to tell people. 4 stars out of 5, perhaps? Pretty good.</p>
<p><a href="http://www.spamula.net/blog/i17/babel1.jpg" ><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px;" src="http://www.spamula.net/blog/i17/babel1.jpg" alt="" border="0" /></a><br />Setting up the new site was pretty easy; it&#8217;s essentially just a php script that chdirs into the language analyzer directory and continues from there as before.</p>
<p>I also did another nice update: the data file that the app uses to identify the language is now downloaded after the page and all the application JavaScript files have loaded. That means the page should load much faster, and the user can start reading the instructions or entering text while the data is being downloaded in the background. If the user clicks &#8220;Go&#8221; before the data file is downloaded, the app waits for the download to finish, while displaying a typical web 2.0-ish loading indicator.</p>
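<p>One way to sketch that behaviour is with a promise that is created once and shared. This is an illustration under assumptions, not the site&#8217;s code: createDataLoader, fetchData, identify, and the ui object with showLoading and hideLoading are hypothetical names.</p>

```javascript
// Sketch only: hypothetical names, promise-based for brevity.
function createDataLoader(fetchData) {
  let pending = null; // the single shared download
  let loaded = false;

  // Call once the page and scripts have loaded; repeated calls reuse
  // the same download instead of starting a new one.
  function start() {
    if (pending === null) {
      pending = fetchData().then((data) => {
        loaded = true;
        return data;
      });
    }
    return pending;
  }

  // Handler for the "Go" button: show the loading indicator only if
  // the data has not arrived yet, then identify once it is available.
  async function onGo(text, identify, ui) {
    if (!loaded) ui.showLoading();
    try {
      const data = await start();
      return identify(text, data);
    } finally {
      ui.hideLoading();
    }
  }

  return { start, onGo };
}
```

<p>Because onGo awaits the same promise that start created, a click that arrives early simply waits for the download already in progress.</p>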
<p>I&#8217;m planning to add support for more languages soon, and improve identification of similar-looking languages even further. Anyway, here&#8217;s the url for the new site again:<br /><span style="font-size:130%;"><a href="http://whatlanguageisthis.com/" >http://whatlanguageisthis.com/</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://henrikfalck.com/blog/2008/07/what-language-is-this-dot-com.html/feed</wfw:commentRss>
		<slash:comments>89</slash:comments>
		</item>
		<item>
		<title>Updated the Language Identifier with ranking of most popular languages right now</title>
		<link>http://henrikfalck.com/blog/2008/05/updated-language-identifier-with.html</link>
		<comments>http://henrikfalck.com/blog/2008/05/updated-language-identifier-with.html#comments</comments>
		<pubDate>Sat, 10 May 2008 04:07:00 +0000</pubDate>
		<dc:creator>Henrik Falck</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[language analyzer]]></category>
		<category><![CDATA[web apps]]></category>

		<guid isPermaLink="false">http://henrikfalck.com/blog2/2008/05/updated-the-language-identifier-with-ranking-of-most-popular-languages-right-now.html</guid>
		<description><![CDATA[Over time I&#8217;ve been making some smaller changes to the language analyzer (my language identification web app), like manually tuning it to better distinguish between hard-to-distinguish languages, like the Scandinavian languages, Serbian-Bosnian-Croatian-Slovenian, Afrikaans and Dutch, and Czech and Slovak.
But I&#8217;ve been wondering what languages people use it for, so yesterday evening, while drinking shochu (in [...]]]></description>
			<content:encoded><![CDATA[<p>Over time I&#8217;ve been making some smaller changes to the <a href="http://henrikfalck.com/languageanalyzer/" >language analyzer</a> (my language identification web app), like manually tuning it to better distinguish between hard-to-distinguish languages, like the Scandinavian languages, Serbian-Bosnian-Croatian-Slovenian, Afrikaans and Dutch, and Czech and Slovak.</p>
<p><span style="font-weight: bold; font-style: italic;">But I&#8217;ve been wondering what languages people use it for</span>, so yesterday evening, while drinking <span style="font-weight: bold;">shochu</span> (in spite of which I could only find one bug today! But I did write a processing and database-intensive function, n00b style, which I replaced with a single SQL query today&#8230;), I added <span style="font-weight: bold;">logging of the results</span>. A result is logged only when the language identification certainty is reasonably high, and only the result is sent; the actual text entered is not. This, of course, happens in the background. A language is only logged once per client, and results from clicking the &#8220;example&#8221; button (Tower of Babel extracts &#8211; I like that story) are not logged.</p>
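<p>Those logging rules can be sketched as a small filter in front of whatever sends the data to the server. The names here (createResultLogger, send, fromExample) and the certainty threshold are assumptions for illustration; the post does not say what the real cutoff is.</p>

```javascript
// Sketch only: hypothetical names and an assumed certainty threshold.
function createResultLogger(send, minCertainty) {
  const alreadyLogged = new Set(); // languages this client has reported

  return function logResult(result) {
    if (result.fromExample) return false; // "example" button results are skipped
    if (!(result.certainty >= minCertainty)) return false; // too uncertain
    if (alreadyLogged.has(result.language)) return false; // once per client
    alreadyLogged.add(result.language);
    send({ language: result.language }); // only the result, never the text
    return true;
  };
}
```

<p>Keeping the set on the client is one simple way to get the once-per-client rule; the server could enforce it as well.</p>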
<p>This morning I added the <span style="font-weight: bold;">top ranking</span> to the page. It&#8217;s generated on the server side in order for the search engines to see it. The top 5 languages for the past seven days are printed. At this time, i.e. about 15 hours after the result logging started, these are <span style="font-weight: bold;">Spanish</span>, <span style="font-weight: bold;">Korean</span>, <span style="font-weight: bold;">Portuguese</span>, and <span style="font-weight: bold;">Thai</span>.</p>
<p>You can see the languages currently entered most often, live: <a href="http://henrikfalck.com/languageanalyzer/" >http://henrikfalck.com/languageanalyzer/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://henrikfalck.com/blog/2008/05/updated-language-identifier-with.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Amazing Language Analyzer Web Application</title>
		<link>http://henrikfalck.com/blog/2008/01/amazing-language-analyzer-web.html</link>
		<comments>http://henrikfalck.com/blog/2008/01/amazing-language-analyzer-web.html#comments</comments>
		<pubDate>Sun, 27 Jan 2008 06:52:00 +0000</pubDate>
		<dc:creator>Henrik Falck</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[language analyzer]]></category>
		<category><![CDATA[web apps]]></category>
		<category><![CDATA[widgets]]></category>

		<guid isPermaLink="false">http://henrikfalck.com/blog2/2008/01/the-amazing-language-analyzer-web-application.html</guid>
		<description><![CDATA[&#8220;Have you ever wondered what language a blog entry you glanced at might be in?&#8221; was the question I set out to work on more than two years ago, if memory serves me right. I always get curious when I see a blog post in an unknown language. I mean not just a language I [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-weight: bold; font-style: italic;">&#8220;Have you ever wondered what language a blog entry you glanced at might be in?&#8221;</span> was the question I set out to work on more than two years ago, if memory serves me right. I always get curious when I see a blog post in an unknown language. I mean not just a language I don&#8217;t speak &#8211; <span style="font-style: italic;">a language I can&#8217;t identify</span>.</p>
<p><a href="http://henrikfalck.com/blog/uploaded_images/languageanalyzer-screenshot-740993.png" ><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://henrikfalck.com/blog/uploaded_images/languageanalyzer-screenshot-740989.png" alt="" border="0" /></a><br />I thought it would be a really hard problem to solve &#8211; writing a piece of software that could figure that out. It turned out not to be so hard though. Just hours of programming, and probably a lot of luck. Because my initial hunches on how to tune the algorithms proved to be pretty right, and I was, and still am, really startled at how good the software became.</p>
<p>I released it as the <a href="http://widgets.opera.com/widget/detail/5619/" >Wørd &#8211; Language Analyzer</a> Opera widget. Unfortunately the target audience for Opera widgets is quite small, so I always thought of making it into a web page. I don&#8217;t know why it took so long, but <a href="http://henrikfalck.com/languageanalyzer/" >here it is</a>!</p>
<p>The web page version has some new, cool improvements. It will try to detect the language as you&#8217;re typing, for instance. It also has improved support for Swedish, Serbian, and Afrikaans. And the UI is in my opinion better than the widget version.</p>
<p>So please try it yourself and see how it works. It&#8217;s pretty fun to just copy-paste any piece of text you can find on the Internet into it, or just type something in a language you know yourself and see when it gets it right. Here&#8217;s the address again:</p>
<p><a href="http://henrikfalck.com/languageanalyzer/" ><span style="font-size:130%;">http://henrikfalck.com/languageanalyzer/</span></a></p>
]]></content:encoded>
			<wfw:commentRss>http://henrikfalck.com/blog/2008/01/amazing-language-analyzer-web.html/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

