The inputenc package – Clemens' LaTeX Corner

Have you ever wondered how the inputenc package works? If so, you should read JLDiaz’ wonderful answer on TeX.sx:

TeX does not know about Unicode. For TeX, a character is simply 1 byte in the input. But Unicode is multibyte. `\usepackage[utf8]{inputenc}` is an elaborate hack to fool TeX into accepting those multibyte chars.

When you write a file using `utf8` encoding, each character in the file can be coded into 1, 2 or 3 bytes (or even 4 bytes for very exotic alphabets). If the character is in the ASCII standard, it takes only 1 byte, and everything is OK for TeX. Those characters have a binary code of the form `0xxxxxxx`, i.e. the first bit is zero (because the ASCII standard comprises codes only up to 127).

The Unicode char ẟ that you used in your input is coded in UTF-8 as three bytes, with binary values `11100001`, `10111010` and `10011111`. Note that all of those begin with a bit of value `1`, which is the “mark” UTF-8 uses to denote that they are not ASCII, but part of a multibyte char. However, for TeX those three bytes are simply three chars, with codes `"E1`, `"BA` and `"9F` respectively (`"` is the hexadecimal prefix for TeX).

What `inputenc` basically does is to make the character with code `"E1` an active char, and define the command associated with that character in such a way that if the characters `"BA` and `"9F` come after it, then the TeX command `\delta` is issued. I guess that you can understand now why you can’t alter the catcodes of Unicode characters.

XeTeX or LuaTeX, on the other hand, use a TeX engine capable of accepting “characters” 32 bits wide, and the input phase “translates” UTF-8 to the appropriate Unicode code point, which is what TeX's “eyes” see.
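
To make the mechanism from the quoted answer a bit more tangible, here is a minimal plain TeX sketch of the idea. It is only an illustration and not the code `inputenc` actually uses: it assumes an 8-bit engine (run it with `pdftex`, not `xetex` or `luatex`), a file saved as UTF-8, and it handles exactly one character, ẟ. Any other byte pair following `"E1` is silently dropped.

```latex
% Illustration only, NOT inputenc's real implementation.
% Assumption: plain TeX, compiled with 8-bit pdfTeX; file saved as UTF-8.
\catcode"E1=\active            % byte "E1 becomes an active character
\def^^e1#1#2{%                 % the macro grabs the next two bytes
  \ifnum`#1="BA \ifnum`#2="9F  % are they "BA and "9F?
    $\delta$%                  % then the three bytes were U+1E9F
  \fi\fi}
Input: ẟ
\bye
```

The real package has to cover every lead byte and look up arbitrary continuation bytes, so its definitions are considerably more general. In an actual LaTeX document you would not write such macros yourself but register the mapping through the documented interface, for example `\DeclareUnicodeCharacter{1E9F}{\ensuremath{\delta}}`.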

(One reason for me posting this here is to remember the answer without having to search for it on TeX.sx which can be a tedious task even if you know exactly what you’re looking for.)

On April 6, 2013 By Clemens
