| ||||
| RFC 2045 Excerpt (谁可以翻成中文或者有中文文档,可以贴在后面。有偏差,以英文文档为准。) 6.8. Base64 Content-Transfer-Encoding The Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable. The encoding and decoding algorithms are simple, but the encoded data are consistently only about 33 percent larger than the unencoded data. This encoding is virtually identical to the one used in Privacy Enhanced Mail (PEM) applications, as defined in RFC 1421. A 65-character subset of US-ASCII is used, enabling 6 bits to be represented per printable character. (The extra 65th character, "=", is used to signify a special processing function.) NOTE: This subset has the important property that it is represented identically in all versions of ISO 646, including US-ASCII, and all characters in the subset are also represented identically in all versions of EBCDIC. Other popular encodings, such as the encoding used by the uuencode utility, Macintosh binhex 4.0 [RFC-1741], and the base85 encoding specified as part of Level 2 PostScript, do not share these properties, and thus do not fulfill the portability requirements a binary transport encoding for mail must meet. The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating 3 8bit input groups. These 24 bits are then treated as 4 concatenated 6-bit groups, each of which is translated into a single digit in the base64 alphabet. When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first. That is, the first bit in the stream will be the high-order bit in the first 8bit byte, and the eighth bit will be the low-order bit in the first 8bit byte, and so on. Each 6-bit group is used as an index into an array of 64 printable characters. The character referenced by the index is placed in the output string. These characters, identified in Table 1, below, are selected so as to be universally representable, and the set excludes characters with particular significance to SMTP (e.g., ".", CR, LF) and to the multipart boundary delimiters defined in RFC 2046 (e.g., "-"). Table 1: The Base64 Alphabet Value Encoding Value Encoding Value Encoding Value Encoding 0 A 17 R 34 i 51 z 1 B 18 S 35 j 52 0 2 C 19 T 36 k 53 1 3 D 20 U 37 l 54 2 4 E 21 V 38 m 55 3 5 F 22 W 39 n 56 4 6 G 23 X 40 o 57 5 7 H 24 Y 41 p 58 6 8 I 25 Z 42 q 59 7 9 J 26 a 43 r 60 8 10 K 27 b 44 s 61 9 11 L 28 c 45 t 62 + 12 M 29 d 46 u 63 / 13 N 30 e 47 v 14 O 31 f 48 w (pad) = 15 P 32 g 49 x 16 Q 33 h 50 y The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances. Special processing is performed if fewer than 24 bits are available at the end of the data being encoded. A full encoding quantum is always completed at the end of a body. When fewer than 24 input bits are available in an input group, zero bits are added (on the right) to form an integral number of 6-bit groups. Padding at the end of the data is performed using the "=" character. Since all base64 input is an integral number of octets, only the following cases can arise: (1) the final quantum of encoding input is an integral multiple of 24 bits; here, the final unit of encoded output will be an integral multiple of 4 characters with no "=" padding, (2) the final quantum of encoding input is exactly 8 bits; here, the final unit of encoded output will be two characters followed by two "=" padding characters, or (3) the final quantum of encoding input is exactly 16 bits; here, the final unit of encoded output will be three characters followed by one "=" padding character. Because it is used only for padding at the end of the data, the occurrence of any "=" characters may be taken as evidence that the end of the data has been reached (without truncation in transit). No such assurance is possible, however, when the number of octets transmitted was a multiple of three and no "=" characters are present. Any characters outside of the base64 alphabet are to be ignored in base64-encoded data. Care must be taken to use the proper octets for line breaks if base64 encoding is applied directly to text material that has not been converted to canonical form. In particular, text line breaks must be converted into CRLF sequences prior to base64 encoding. The important thing to note is that this may be done directly by the encoder rather than in a prior canonicalization step in some implementations. NOTE: There is no need to worry about quoting potential boundary delimiters within base64-encoded bodies within multipart entities because no hyphen characters are used in the base64 encoding.
__________________ farewell ........................... |
| ||||
| 一个生成随机测试用例的工具 代码:
|
| ||||
| 参考实现(相信我,我已经努力“优化”让它很慢了。) 代码:
|
| |||
| 发信人: bluesea (蓝海), 信区: Internet 标 题: 乱码大全 ──MIME/BASE64介绍 发信站: BBS 水木清华站 (Tue Feb 3 01:02:17 199 乱码大全 ──MIME/BASE64介绍 “乱码大全”,作者:bluesea,水木清华 BBS 成员。欢迎在 BBS 中转载,帮助计算机初学者解决使用软件过程中遇到的实际问题。本文原载于水木清华 BBS 的 Internet讨论区。地址是: telnet://bbs.tsinghua.edu.cn ,WWW访问的地址是 http://bbs.tsinghua.edu.cn 。当下面的条件全部满足时,转载本文可以不经过作者允许:(1) 转载水木清华 BBS 的信头;(2)不修改原文;(3) 转载仅限于各种 BBS 和非商业性质的个人网点。严禁各种形式的抄袭,严禁非作者将 本文或局部用于任何正式出版的刊物。请所有转载文章的网友注意阅读本文的第一段,遵守网络的惯例、尊重作者的劳动。本自然段是全文的一部分。 MIME: Multipurpose Internet Mail Extensions 英国帝国大学计算机在线字典 FOLDOC 对 MIME 的解释为:“多部分( multi- part )、多媒体电子邮件和 WWW 超文本的一种编码标准,用于传送诸如图形、声音和传真等非文本数据。MIME 定义于 RFC 1341,用 MIMENCODE 的方法将二进制数据转换成为一种被称为 BASE64 的 ASCII 子集的字符的组合。” Internet 上有专门讨论 MIME 的新闻组:comp.mail.mime。该新闻组的 FAQ 可以从下面的网点获得: http://www.cis.ohio-state.edu/hypert...mime0/faq.html MIMENCODE 最早称为 MMENCODE,提出用 MIMENCODE 代替 UUENCODE,是因为 UUENCODE 使用了一些字符在一些邮件网关(特别是那些转换 ASCII 和 EBCDIC 码的网关)中造成传输障碍,(还有一些软件不能对所有 UUENCODE 的算法进行正确解码而导致邮件的阅读困难),因此 MIME 被设计用于替代 UUENCODE,但是结果是这些协议共存。 MIME/BASE64 的算法很简单,它将字符流顺序放入一个 24 位的缓冲区,缺字符的地方补零。然后将缓冲区截断成为 4 个部分,高位在先,每个部分 6 位,用下面的 64 个字符重新表示:“ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnop qrstuvwxyz0123456789+/”。如果输入只有一个或两个字节,那么输出将用等号“=”补足。这可以隔断附加的信息造成编码的混乱。这就是BASE64。 UEsDBBQAAgAIAEapPiS/mrkHoQEAAEECAAAMAAAAYWNhZDd+dWUudHh0dVFRb5swGHyPlP9Q5Iet TUICzCUJASTiFuIQhwbHRJnUqmRsndqmVQO0CsK/fXbRVvVh+AV9vrvv7txunYiPV8rAW3n33w5B R9FAN3ld34Axr9OHIjtkt7wC3bmKHD+zznjteTGvjBnVFGCdxz265XW7dfI+ZVFyRjegO3hix8nl fGcdjeLqy/rGTp0Sj+Jp8Djho9oIWRSX0IYIwwmWbF4RHKYK0ByCaI9u4u3HukYZIvE32+fZyz7L 含有 MIME/BASE64 编码的邮件,你查看它的源码时 一般都含有:“This is a multi-part message in MIME format.”这样的句子。也可以被绝大多数的 email 程序进行解码,包括 Netscape、MS Mail、Eudora等。这些程序可以正确识别邮件的正文,恢复 MIME/BASE64 编码的部分为正确的文字或夹带的二进制文件。 如果这些文件不能正确被恢复,可以将邮件原文存成文本文件,改文件名后缀为 .UUE,让 Winzip 自动识别并恢复。推荐使用 Winzip 6.3 SR-1 或更高的版本。也可以将文件后缀改为 .EML , 由 Microsoft Mail 或 OutLook Express 打开,该软件也可以自动解码。另外很多网点,如 http://www.shareware.com 、 http://www.download.com 、 http://www.hotfiles.com 、 http://www.alberts.com 等都可以通过查询 MIME 关键字得到大量的小型应用程序支持 MIME 的转换。 |
| |||
| 提个意见。 这个接口: 代码:
为什么不能是输出到unsigned char*?用vector的一个好处是可以动态分配内存。但对于象base64这种要求高效率的算法,当然是预先分配好固定的内存最好。反正输出缓冲区的大小是可以先计算出来的。 当然,你可以先reserve, 但是,一则,这样一来没有利用vector的优点,二则,我难道每次只能push_back一个char? 我不能把24bit先转换成一个int,然后再把int写进缓冲区?写一个int当然比写四个char快多了。 输入vector < unsigned char >也似乎改为const char*和size_t的好,否则,如果我有一个const char*(比如说,从argv得到的), 难道还要把它拷贝到一个vector中去才能调用这个encode? 太浪费了吧? 而如果输入是const char*, 它可以从vector < unsigned char > 用O(1)时间调用front() 转换得到。所以,既可以给const char*用,也可以给vector < unsigned char > 用。 很显然,const char* 有更大的兼容性和灵活性。 |
| ||||
| 引用:
2.vector就是这样用的,就连标准中也刻意的规定vector的内存是连续的没有空隙的。这个就是之所以说vector可以在大多数地方替换动态分配的数组的原因。只要你不要写越界,不会破坏这个vector的内部状态的。 |
| |||
| 最新的vc.net也是o(n)的。 而且,既然各种实现不同,那就不是标准的。既然不是标准的,就得假设最坏的情况,对吗? 而且,对竞赛来讲,也是不公平的。 有的人可能不resize, 只reserve+push_back. 而有的人可能resize+直接赋值,而不push_back. 怎么比呢? 最好的情况是直接赋值,是n/4 (写入n/4个整数。如果不用resize的话) 而resize+赋值可能是n+n/4. reserve+push_back 可能是n. 也就是说,本来直接赋值比push_back快的。但因为resize的关系,反而比push_back慢了。 |