Over time I’ve been making some smaller changes to the language analyzer (my language identification web app), mostly manual tuning to help it tell apart languages that are hard to distinguish: the Scandinavian languages, Serbian-Bosnian-Croatian-Slovenian, Afrikaans and Dutch, and Czech and Slovak.
But I’ve been wondering what languages people use it for, so yesterday evening, while drinking shochu, I added logging of the results. (In spite of the shochu I could only find one bug today! I did, however, write a processing- and database-intensive function, n00b style, which I replaced with a single SQL query today…) A result is logged only when the language identification certainty is reasonably high, and only the result itself is logged; the actual text entered is not sent. This, of course, happens in the background. A language is only logged once per client, and results from clicking the “example” button (Tower of Babel extracts; I like that story) are not logged.
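The logging rules above can be sketched roughly like this. It is only an illustration in Python, not the app's actual code: the function name `should_log`, the `seen` set, and the `MIN_CERTAINTY` value of 0.9 are all assumptions (the post only says the certainty must be "reasonably high").

```python
# Hypothetical threshold; the real value used by the app is not stated.
MIN_CERTAINTY = 0.9


def should_log(language, certainty, client_id, is_example, seen):
    """Decide whether an identification result should be logged.

    `seen` is a set of (client_id, language) pairs already logged;
    it is updated in place when the result is accepted.
    """
    # Results from the "example" button are never logged.
    if is_example:
        return False
    # Only reasonably certain identifications are logged.
    if certainty < MIN_CERTAINTY:
        return False
    # Each language is logged at most once per client.
    key = (client_id, language)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Note that only the language name and a client identifier appear here; the input text never needs to reach the logging code.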
This morning I added a top ranking to the page. It’s generated on the server side so that search engines can see it. It shows the top 5 languages for the past seven days. At this time, i.e. about 15 hours after the result logging started, it lists Spanish, Korean, Portuguese, and Thai.
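A ranking like this fits in a single SQL query. Here is a minimal sketch using Python's built-in `sqlite3` module; the table `results`, its columns `language` and `logged_at` (an ISO 8601 timestamp string), and the function name `top_languages` are assumptions for illustration, not the app's real schema or database.

```python
import sqlite3
from datetime import datetime, timedelta


def top_languages(conn, now, days=7, limit=5):
    """Return (language, hits) pairs for the most logged languages
    in the `days` days before `now`, most frequent first."""
    # ISO 8601 strings in the same format sort chronologically,
    # so a plain string comparison is enough for the cutoff.
    cutoff = (now - timedelta(days=days)).isoformat(timespec="seconds")
    return conn.execute(
        """
        SELECT language, COUNT(*) AS hits
        FROM results
        WHERE logged_at >= ?
        GROUP BY language
        ORDER BY hits DESC, language ASC
        LIMIT ?
        """,
        (cutoff, limit),
    ).fetchall()
```

Ties are broken alphabetically here so the output is stable; that is a choice made for this sketch.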
You can see the currently most frequently entered languages live: http://henrikfalck.com/languageanalyzer/