General Question

How do I make sure form input in websites is valid UTF-8?

Asked by Vincentt (8074points) June 15th, 2008

I want to turn the title someone enters in a form on my website into part of the URL, and for that I want to convert all special UTF-8 characters to ASCII equivalents. However, for this the input has to be valid UTF-8. How can I make sure it is?

9 Answers

Breefield's avatar

I’m not particularly sure what UTF-8 is, but I assume you’re using PHP. So, I’d use this.

Vincentt's avatar

I’ve looked at that, but it doesn’t quite help :).

By the way, I’m using phputf8 and my problem is that utf8_to_ascii converts every character to the “unknown” character, so I end up with a line of dashes… :(
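A minimal sketch of the validation step being asked about, assuming a reasonably modern PHP (7 or later) with the mbstring extension available. The helper name `is_valid_utf8` is made up for illustration and is not part of phputf8; the idea is simply to check the submitted title before handing it to any transliteration routine.

```php
<?php
// Hypothetical helper (not part of phputf8): true when $input is well-formed UTF-8.
function is_valid_utf8(string $input): bool
{
    if (function_exists('mb_check_encoding')) {
        // mbstring validates the byte sequence against the named encoding.
        return mb_check_encoding($input, 'UTF-8');
    }
    // Fallback: with the /u modifier, PCRE refuses malformed UTF-8
    // and preg_match() returns false instead of 1.
    return preg_match('//u', $input) === 1;
}

// Usage: reject (or re-encode) the title before converting it.
var_dump(is_valid_utf8("Z\u{00FC}rich")); // valid UTF-8
var_dump(is_valid_utf8("Z\xFCrich"));     // a lone Latin-1 byte, not valid UTF-8
```

If the check fails, the form was most likely submitted in another charset (often ISO-8859-1 or Windows-1252), which would be consistent with every character coming out as “unknown”.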

richardhenry's avatar

I think you’re overcomplicating things for yourself by dwelling on the encoding. PHP will handle conversions for you if you use the special chars or HTML chars functions on the post data.

paulc's avatar

Browsers will generally respect the Content-type meta tag, so you can start there:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

If all you’re doing is taking form input and storing it, then richardhenry is right: you’re probably overcomplicating it by using a library to handle UTF-8 explicitly. If your database charset is UTF-8, then PHP should pass the correct data to it, so long as the browser knows which charset to submit in the first place (see first paragraph).

In my experience, the display of odd characters in a web browser is usually the browser trying to interpret the characters in the wrong set.

Vincentt's avatar

The idea is that when someone enters a special UTF-8 character, it will be converted to an ASCII equivalent instead of being stripped away, so it can then be used in a URL. For example, say I used the title “Hey, I’m Vincent”, that would end up in the URL as However, when the title consists only of special UTF-8 characters, and those are stripped out, that would mean a URL of So I’m using the library to convert them to ASCII equivalents, but that library needs valid UTF-8.
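To make the problem concrete, here is a rough sketch of the naive “strip everything that is not ASCII” approach, plus a fallback for the case where nothing is left. The function name `slugify` and the fallback value are assumptions for illustration only, not what the site actually uses.

```php
<?php
// Hypothetical slug builder: keeps only a-z and 0-9, joins the rest with dashes.
function slugify(string $title, string $fallback): string
{
    $slug = strtolower($title);
    // Every run of characters outside a-z / 0-9 collapses into a single dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug) ?? '';
    $slug = trim($slug, '-');
    // A title made only of non-ASCII characters leaves nothing behind,
    // so fall back to something usable (for example the post ID).
    return $slug === '' ? $fallback : $slug;
}

echo slugify("Hey, I\u{2019}m Vincent", 'untitled'), "\n";      // hey-i-m-vincent
echo slugify("\u{65E5}\u{672C}\u{8A9E}", 'post-42'), "\n";      // post-42
```

Converting to ASCII equivalents first, instead of stripping, is what keeps the second kind of title from collapsing to the fallback.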

I still might be overcomplicating things, but as I’m not at home with encodings, I’m a bit at a loss as to what else to do.

I’ll toy a bit with the content type.

paulc's avatar

Ah I see now, you want to convert like ü => u and è => e? I think you’ll need to make the mappings yourself. I did, however, find this piece of code (the convert_high_ascii method) that maps most of the higher alpha characters (after you’ve converted from UTF8 => ASCII).
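A hand-made mapping of that kind could look roughly like the sketch below. The function name `fold_to_ascii` and the handful of entries are illustrative assumptions; a real table would need many more characters, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical mapping helper; the table is deliberately tiny.
function fold_to_ascii(string $text): string
{
    // Keys are UTF-8 byte sequences, so this file must itself be UTF-8.
    $map = [
        'ü' => 'u', 'Ü' => 'U',
        'è' => 'e', 'é' => 'e',
        'ö' => 'o', 'ñ' => 'n',
        'ß' => 'ss',
    ];
    // strtr() with an array replaces the longest matching key first
    // and leaves everything it does not know untouched.
    return strtr($text, $map);
}

echo fold_to_ascii('über café'), "\n"; // uber cafe
```

Characters missing from the table pass through unchanged, so the result still has to be checked or stripped before it goes into a URL.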

Vincentt's avatar

@paulc, yep, that was the idea. The utf8_to_ascii method of the phputf8 library was supposed to do that but chokes on invalid UTF-8. I have now set the content type but I’ve got a few other problems to deal with before I can test that…

Are extended ASCII characters also not allowed in URLs? If so, that piece of code won’t do the trick as I need it to be open source ;-)

paulc's avatar

Well according to the URI RFC, which I just skimmed quickly, extended ASCII is not allowed. The list of unreserved characters is:

a-z A-Z 0-9 - _ . ! ~ * ' ( )

Note that the character after 0-9 is a plain hyphen and the quote is a plain ASCII apostrophe (fluther’s markdown tends to turn them into typographic dashes and quotes).
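As a quick sanity check, a finished slug can be tested against that list. This is only a sketch: the function name `is_unreserved_only` is made up, and the character class mirrors the list quoted above (the newer RFC 3986 narrows “unreserved” further, to letters, digits and - . _ ~).

```php
<?php
// Hypothetical check: true when $slug uses only the unreserved characters listed above.
function is_unreserved_only(string $slug): bool
{
    // The hyphen sits last in the class so it is taken literally.
    return preg_match('/^[A-Za-z0-9_.!~*\'()-]+$/', $slug) === 1;
}

var_dump(is_unreserved_only('hey-im-vincent')); // true
var_dump(is_unreserved_only("\u{00FC}ber"));    // false: ü is outside ASCII
var_dump(is_unreserved_only('a b'));            // false: space is not in the list
```

Anything that fails the check would need to be converted, stripped, or percent-encoded (PHP’s `rawurlencode()` does the latter) before being used in the URL.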

Vincentt's avatar

Right, thanks :)
