Discussion:
Determining Text File Encoding
Paul H. Tarver
2018-08-01 18:00:05 UTC
Ok, this may be a dumb question, but is there a reliable and easy way to
detect and determine the file encoding on simple text files?



I have a client sending me files with UTF-16 Little Endian encoding. I have
some code in place to try to determine if a file is UNICODE based on the
first two or four characters once the file is loaded to memory and then
convert it using STRCONV, but I'm concerned that although it works, it is a
bit of a hack and maybe there is a better way.



Any thoughts?



Paul






_______________________________________________
Post Messages to: ***@leafe.com
Subscription Maintenance: http://mail.leafe.com/mailman/listinfo/profox
OT-free version of this list: http://mail.leafe.com/mailman/listinfo/profoxtech
Searchable Archive: http://leafe.com/archives/search/profox
This message: http://leafe.com/archives/byMID/profox/008601d429c1$7ab842b0$7028c810$@tpcqpc.com
** All postings, unless explicitly stated otherwise, are the opinions of the author, and do not constitute legal or medical advice. This statement is added to the messages for those lawyers who are too stupid to see the obvious.
Fernando D. Bozzo
2018-08-01 18:32:20 UTC
AFAIK there is no way to determine the exact encoding of a file. You can
use a "best effort" algorithm to try to identify it, but even Notepad++
sometimes fails to show the correct encoding.
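
To illustrate why only a best effort is possible, here is a minimal Python sketch. It is not from this thread: the function name guess_encoding and the cp1252 fallback are assumptions chosen for illustration. It relies on UTF-8 having strict byte rules, so data that decodes cleanly as UTF-8 is probably UTF-8, whereas almost any byte sequence is valid in a single-byte code page, so the fallback is only a guess.

```python
def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Best-effort guess for text without a byte-order mark.

    The fallback code page is an assumption; the data cannot confirm it.
    """
    try:
        data.decode("utf-8")  # strict by default: raises on malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


# The same byte is valid text in several single-byte code pages,
# so nothing in the data itself says which one the sender meant.
ambiguous = b"\xe9"
as_western = ambiguous.decode("cp1252")   # Latin small e with acute
as_cyrillic = ambiguous.decode("cp1251")  # Cyrillic small short i
```

This sketch deliberately ignores byte-order marks, and UTF-16 data without one would be misreported, which is one reason an agreed encoding is safer than guessing.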

That's why XML, HTML and some other markup languages use
[encoding="utf-8"] or [charset="utf-8"] or similar: the encoding must be
explicitly indicated so the contents are not misinterpreted.

In a similar way, when delivering text files to someone, the encoding must be
explicitly defined and agreed between the parties so the contents are not
misinterpreted.

UTF-16 is a little strange to me and I have never dealt with it. Isn't it used
for double-byte characters, like Chinese or similar?

One idea that comes to mind is to ask for a header indicating the
encoding (like XML does), or even for a predefined string (always the
same, like "Test header - áàä", with some special characters) which you can
compare to your own. If the source string does not match your copy of the
string encoded in UTF-16, you can assume it's UTF-8, or
re-check by comparing with the same string encoded in UTF-8.
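
The predefined-string idea could look roughly like the Python sketch below. It assumes the sender writes the agreed string as the very first text in the file. The names SENTINEL, CANDIDATES and encoding_from_sentinel, and the shortlist of encodings, are hypothetical choices for illustration, not something agreed in this thread.

```python
from typing import Optional

SENTINEL = "Test header - áàä"  # the example string suggested above
CANDIDATES = ("utf-16-le", "utf-16-be", "utf-8", "cp1252")  # assumed shortlist


def encoding_from_sentinel(data: bytes) -> Optional[str]:
    """Return the first candidate whose encoding of SENTINEL starts the file."""
    # Skip a byte-order mark if the sender's tool wrote one.
    for bom in (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff"):
        if data.startswith(bom):
            data = data[len(bom):]
            break
    for name in CANDIDATES:
        if data.startswith(SENTINEL.encode(name)):
            return name
    return None  # no candidate matched: reject the file or ask the sender
```

The accented characters are what make this work: the plain ASCII part of the string is byte-for-byte identical in UTF-8 and in single-byte code pages, so only "áàä" tells them apart.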


Regards.-
Paul H. Tarver
2018-08-01 19:55:03 UTC
Currently I check the first two bytes, and if they are 255 and 254 respectively (FF FE, the UTF-16 Little Endian byte-order mark), I run STRCONV(textdata,6) on the file contents and save the result to a temp file. That seems to do the trick: I can then import the tab-delimited data from the temp file with no problem.
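
For comparison, the same byte-order-mark check can be sketched in Python. This is an illustration only, not the STRCONV code described above: the names BOMS and decode_with_bom and the cp1252 default are assumptions. One detail worth copying into any version of this check is the ordering, because the UTF-32 Little Endian mark (FF FE 00 00) begins with the same two bytes as the UTF-16 Little Endian mark.

```python
import codecs

# Longest marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # bytes 255, 254
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # bytes 254, 255
)


def decode_with_bom(data: bytes, default: str = "cp1252") -> str:
    """Decode raw file bytes using the byte-order mark, if there is one.

    Without a mark, the default code page is an assumption, not a detection.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name)  # drop the mark, then decode
    return data.decode(default)
```

A file with no mark at all still falls through to the default, so this only confirms an encoding when the sender's software wrote a mark.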

I've only run into this a few times before and my method has worked pretty well so far, but I thought I would run it by the group.

Thanks!

Paul

Alan Bourke
2018-08-02 09:25:26 UTC
Post by Paul H. Tarver
Ok, this may be a dumb question, but is there a reliable and easy way to
detect and determine the file encoding on simple text files?

I use the code at the link below, which seems to work OK.

https://pastebin.com/1wzftPUg
--
Alan Bourke
alanpbourke (at) fastmail (dot) fm

Paul H. Tarver
2018-08-03 13:51:57 UTC
Thanks Alan! I'll give this a try.

BTW, I like to note the original source in my comments, so do I get to
credit you for this code?

Paul

Alan Bourke
2018-08-03 14:12:50 UTC
To be perfectly honest I can't remember. I may have taken it from somewhere and formatted it to the way I do things.
--
Alan Bourke
alanpbourke (at) fastmail (dot) fm