+++
date = "2007-08-21"
title = "Using Iconv to convert UTF-8 to ASCII (on Linux)"
tags = ["General", "RubyOnRails", "Features", "Ruby"]
slug = "using-iconv-to-convert-utf-8-to-ascii-on-linux"
description = "Text encoding is a mess. This will help you convert UTF-8 to ASCII on Linux using iconv."
+++
There are situations where you want to strip all the UTF-8 goodness from a string,
mostly because of the legacy systems you're working with. This is rather easy to do.
For example, `çéß` should be converted to `cess`. On my Mac, I can simply use the
following snippet to convert the string:
``` ruby
require 'iconv' # bundled with Ruby 1.8/1.9; a separate gem since Ruby 2.0

s = "çéß"
# Iconv.iconv returns an array of strings, so join it back into one string
s = Iconv.iconv('ascii//translit', 'utf-8', s).join # returns "c'ess"
s.gsub(/\W/, '') # returns "cess"
```
Very nice and all, but when I deploy to my Debian 4.0 Linux system, I get an error
telling me that invalid characters were present. Why? Because the Mac has Unicode goodness built in.
Linux, in most cases, does not.

So, how do you go about solving this? Easy! Get Unicode support!
``` shell
sudo apt-get install unicode
```
Now, try again.
## Bonus
If you want to convert a sentence (or anything else with spaces in it), you'll notice that the `gsub` call removes the spaces. I solve this by first splitting the string into words, converting each word, and then joining the words together again.
``` ruby
words = s.split(" ")
words = words.collect do |word|
  # Transliterate each word on its own, then strip leftover non-word characters
  word = Iconv.iconv('ascii//translit', 'utf-8', word).join
  word.gsub(/\W/, '')
end
words.join(" ")
```
Like this? Why not write a mix-in for String?
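
Here is one possible shape for such a mix-in, as a rough sketch rather than a finished implementation. Because Iconv is no longer bundled with Ruby 2.0 and later, this version swaps in `String#unicode_normalize` from the standard library (Ruby 2.2+), which is a different technique from the Iconv snippets in this post: it decomposes accented letters and then drops everything that is not an ASCII word character. The module name `AsciiTransliterable`, the method name `to_ascii`, and the `EXTRA_MAPPINGS` table are all made up for illustration.

```ruby
# Hypothetical mix-in; names are illustrative, not from any library.
module AsciiTransliterable
  # Characters that NFD normalization does not decompose. Extend as needed.
  EXTRA_MAPPINGS = { "ß" => "ss" }.freeze

  def to_ascii
    split(" ").map do |word|
      # Replace characters that need an explicit mapping
      word = word.gsub(Regexp.union(EXTRA_MAPPINGS.keys), EXTRA_MAPPINGS)
      # Decompose accented letters into base letter + combining mark
      word = word.unicode_normalize(:nfd)
      # Drop everything that is not an ASCII word character
      word.gsub(/[^A-Za-z0-9_]/, '')
    end.join(" ")
  end
end

class String
  include AsciiTransliterable
end

puts "çéß".to_ascii
puts "Ça va très bien".to_ascii
```

Unlike `ascii//translit`, this approach only knows about characters that decompose under NFD plus whatever you put in the mapping table, so anything else is silently dropped.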