Discussion:
How to get HTTP charset? I wanna do charset conversion(or maybe libcurl already has this feature)
kartwall
2010-06-10 07:53:11 UTC
Hi, curl-library:
I am new to libcurl. I have written some programs based on the libcurl easy interface; it works and it's cool. :) But I currently have a question:
I want to convert all HTTP responses to UTF-8 because, you know, not all web pages are written in UTF-8. I skimmed the "curl_easy_setopt" manual, and it seems "CURLOPT_CONV_TO_NETWORK_FUNCTION" and "CURLOPT_CONV_FROM_NETWORK_FUNCTION" might help. But here is the big question: how can I know the charset of the HTML file I received? My understanding is that I first need to know the charset of the HTML I received, and only then can I start converting it to UTF-8.
So, any suggestions? Thanks for any comments.
Eric Zhang
Daniel Stenberg
2010-06-10 10:15:22 UTC
Post by kartwall
I wanna convert all http responses to UTF-8 because, you know, not all
web pages are written in UTF-8. I skimmed the manual of "curl_easy_setopt",
seems "CURLOPT_CONV_TO_NETWORK_FUNCTION",
"CURLOPT_CONV_FROM_NETWORK_FUNCTION" do helps.
Not really. The purpose of that functionality is for platforms that do not
speak ASCII natively to provide a way to make the protocols we use that are
ascii-based to still work fine.
Post by kartwall
But here is a big question: How can I know the charset of html file which I
received?
HTML is contents that libcurl may deliver. How to deal with that data is
beyond what libcurl knows or cares about. You would need to read up on how
HTML works to figure this out. Of course, there may be HTTP headers in some or
many cases that help you out.
--
/ daniel.haxx.se
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette: http://curl.haxx.se/mail/etiquette.html
kartwall
2010-06-10 10:30:36 UTC
Post by Daniel Stenberg
Post by kartwall
I wanna convert all http responses to UTF-8 because, you know, not all
web pages are written in UTF-8. I skimmed the manual of "curl_easy_setopt",
seems "CURLOPT_CONV_TO_NETWORK_FUNCTION",
"CURLOPT_CONV_FROM_NETWORK_FUNCTION" do helps.
Not really. The purpose of that functionality is for platforms that do not
speak ASCII natively to provide a way to make the protocols we use that are
ascii-based to still work fine.
Thanks a lot. But I don't understand what a non-ASCII platform is. A Chinese or Japanese PC that uses Chinese or Japanese as the default language? If so, why does libcurl need to convert strings? Almost all protocols (such as HTTP and FTP) are ASCII-based text protocols. Maybe I am misunderstanding something, so could you give me a code example for these two options, or something else to help me out?
Post by Daniel Stenberg
Post by kartwall
But here is a big question: How can I know the charset of html file which I
received?
HTML is contents that libcurl may deliver. How to deal with that data is
beyond what libcurl knows or cares about. You would need to read up on how
HTML works to figure this out. Of course, there may be HTTP headers in some or
many cases that help you out.
Yeah, I got it. I found that the HTTP response header "Content-Type: text/html; charset=UTF-8" is what I want. I can read the charset from there and then use iconv to convert between charsets.

Thanks again.
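
The approach described above (read the charset from the Content-Type header, then convert with iconv) could look roughly like the following. This is only a minimal sketch: charset_from_content_type() and to_utf8() are hypothetical helper names, the parameter scan is naive, and the 4x output buffer is an assumption rather than a guarantee.

```c
#include <assert.h>
#include <ctype.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: copy the charset parameter of a Content-Type value
 * into out (at most outlen-1 characters). Returns 1 if found, else 0.
 * Naive scan: it looks for the first "charset=" anywhere in the value. */
int charset_from_content_type(const char *ctype, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    if (ctype == NULL)
        return 0;

    for (const char *p = ctype; *p != '\0'; p++) {
        if (strncasecmp(p, "charset=", 8) != 0)
            continue;
        p += 8;
        if (*p == '"')              /* the value may be quoted */
            p++;
        size_t n = 0;
        while (*p != '\0' && *p != '"' && *p != ';' &&
               !isspace((unsigned char)*p) && n + 1 < outlen)
            out[n++] = *p++;
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}

/* Hypothetical helper: convert len bytes from charset `from` to UTF-8.
 * Returns a malloc'ed NUL-terminated string, or NULL on failure. */
char *to_utf8(const char *from, const char *in, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;                /* charset name not known to iconv */

    /* Assumption: 4x the input size is enough room. If it is not,
     * iconv fails with E2BIG and we return NULL. */
    size_t cap = len * 4 + 1;
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;         /* iconv takes a non-const char ** */
    char *dst = out;
    size_t inleft = len, outleft = cap - 1;
    size_t rc = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1) {         /* invalid or incomplete input */
        free(out);
        return NULL;
    }
    *dst = '\0';
    return out;
}
```

With libcurl, the Content-Type string can be read after the transfer with curl_easy_getinfo() and CURLINFO_CONTENT_TYPE, then passed to the first helper; the body collected by the write callback goes to the second. If the header carries no charset, the caller has to fall back to something else.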
Daniel Stenberg
2010-06-10 10:34:25 UTC
Post by kartwall
Post by Daniel Stenberg
Not really. The purpose of that functionality is for platforms that do not
speak ASCII natively to provide a way to make the protocols we use that are
ascii-based to still work fine.
Thanks a lot. But I don't understand what is non-ASCII platform?
Primarily in our cases, those are EBCDIC ones:
http://en.wikipedia.org/wiki/Extended_Binary_Coded_Decimal_Interchange_Code
--
/ daniel.haxx.se
kartwall
2010-06-10 10:40:41 UTC
Thank you, Daniel. Now I understand what a non-ASCII platform means.
-----Original Message-----
Sent: Thursday, June 10, 2010
Subject: Re:Re: How to get HTTP charset? I wanna do charset conversion(or maybe libcurl already has this feature)
Post by kartwall
Post by Daniel Stenberg
Not really. The purpose of that functionality is for platforms that do not
speak ASCII natively to provide a way to make the protocols we use that are
ascii-based to still work fine.
Thanks a lot. But I don't understand what is non-ASCII platform?
http://en.wikipedia.org/wiki/Extended_Binary_Coded_Decimal_Interchange_Code
--
/ daniel.haxx.se
Michael Wood
2010-06-10 15:04:35 UTC
Hi

2010/6/10 kartwall <***@126.com>:
[...]
Post by kartwall
text/html; charset=UTF-8" is what I want. I can check out the charset here.
The problem is that you will find many pages that claim to be utf-8,
but are actually iso-8859-1 or something else.
--
Michael Wood <***@gmail.com>
Alexandre Morgaut
2010-06-10 15:17:30 UTC
Some authors can't change the Content-Type HTTP header, so they specify the charset in a meta element in the HTML content instead:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...


But still...
Since some editors insert this meta tag by default from templates, even this information isn't always accurate.

It's your choice: either add byte-based charset detection, or accept that badly formatted documents won't be well supported.
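
One very small form of byte-based detection is to check whether a body that claims to be UTF-8 is actually well-formed UTF-8. The sketch below is only a heuristic under that assumption, and looks_like_utf8() is a hypothetical name: pure ASCII always passes, and a failed check says the claim is wrong but not which charset the page really uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;                /* continuation bytes expected */
        unsigned long cp;           /* decoded code point */

        if (c < 0x80) {             /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07;
        } else {
            return 0;               /* stray continuation or invalid lead byte */
        }

        if (len - i <= need)
            return 0;               /* sequence is cut off at the end */

        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;           /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;

        i += need + 1;
    }
    return 1;
}
```

ISO-8859-1 text containing accented letters will usually fail this check, because a lone high byte such as 0xE9 is not a complete UTF-8 sequence. That makes the check useful for catching pages that claim UTF-8 but are not.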
Post by Michael Wood
Hi
[...]
Post by kartwall
text/html; charset=UTF-8" is what I want. I can check out the charset here.
The problem is that you will find many pages that claim to be utf-8,
but are actually iso-8859-1 or something else.