String Encoding, Decoding, and Garbled Text in C#

Author: nex3z 2016-04-08

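Recently, while using a StringBuilder in C# to receive a string returned by a DLL call, Chinese text came back garbled: the original string "hello 你好" turned into "hello 浣犲ソ" once it was read through the StringBuilder. The call was declared as: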

[DllImport("user32")]
public static extern IntPtr SendMessage(IntPtr hWnd, NppMsg Msg, int wParam, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder lParam);

My first attempt was to add a CharSet to the attribute, as in [DllImport("user32", CharSet = CharSet.Auto)] and [DllImport("user32", CharSet = CharSet.Unicode)]. Neither made any difference.

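Next I considered re-encoding the string. Judging by the symptoms (the English text is intact and only the Chinese is garbled), this looked like a multi-byte character encoding problem. .NET identifies each encoding by an int value called the CodePage. I came across a rather brute-force approach: loop over every CodePage, encode the garbled string into bytes with one and decode those bytes with another, and look for a combination that produces the correct text: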

using System.IO;
using System.Text;

class Program
{
	static void Main(string[] args)
	{
		StringBuilder sb = new StringBuilder();
		string source = "hello 浣犲ソ";  // the garbled string

		// Try every pair of code pages: encode with e1, decode with e2.
		foreach (var e1 in Encoding.GetEncodings())
		{
			// The bytes depend only on e1, so compute them once per outer iteration.
			byte[] unknown = Encoding.GetEncoding(e1.CodePage).GetBytes(source);
			foreach (var e2 in Encoding.GetEncodings())
			{
				string result = Encoding.GetEncoding(e2.CodePage).GetString(unknown);
				sb.AppendLine(string.Format("{0} => {1} : {2}", e1.CodePage, e2.CodePage, result));
			}
		}
		File.WriteAllText("test.txt", sb.ToString());
	}
}

The results are written to test.txt. Searching that file for "hello 你好" turns up:

Line 3503: 936 => 65001 : hello 你好
Line 17319: 50227 => 65001 : hello 你好
Line 17599: 51936 => 65001 : hello 你好
Line 18051: 54936 => 65001 : hello 你好
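Searching a large output file by hand can be avoided by comparing each result against the expected text directly. The following is only a sketch of that variation: the class CodePageSearch, the method FindPairs, and the explicit candidate list are hypothetical names introduced for illustration, and the RegisterProvider call assumes a .NET Core / .NET 5+ runtime (it is assumed to be unnecessary on .NET Framework, where code pages such as 936 are built in).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CodePageSearch
{
    // Hypothetical helper: returns every (encode, decode) code page pair
    // that turns `garbled` back into `expected`.
    public static List<(int EncodeWith, int DecodeWith)> FindPairs(
        string garbled, string expected, int[] candidates)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 936 must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var matches = new List<(int EncodeWith, int DecodeWith)>();
        foreach (int e1 in candidates)
        {
            // Turn the garbled string back into bytes using code page e1.
            byte[] bytes = Encoding.GetEncoding(e1).GetBytes(garbled);
            foreach (int e2 in candidates)
            {
                // Reinterpret those bytes using code page e2.
                string result = Encoding.GetEncoding(e2).GetString(bytes);
                if (result == expected)
                {
                    matches.Add((e1, e2));
                }
            }
        }
        return matches;
    }
}
```

Calling CodePageSearch.FindPairs("hello 浣犲ソ", "hello 你好", new[] { 936, 54936, 932, 1252, 65001 }) should report the pair 936 => 65001, in line with the file search results. The candidate list is an arbitrary choice; any code pages available on the machine could be used instead.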

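The four matches show that encoding the garbled string into bytes with any of the code pages 936, 50227, 51936, or 54936, and then decoding those bytes with code page 65001, yields the correct result. Looking up the code page numbers, 936, 50227, 51936, and 54936 all correspond to Simplified Chinese, which is my system language, and 65001 corresponds to Unicode (UTF-8), which is the encoding I want. This suggests that the DLL returned UTF-8 bytes, and that those bytes were decoded with the system's Simplified Chinese code page instead.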
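To make the mechanism concrete, the sketch below reproduces the garbling in one direction and undoes it in the other. MojibakeDemo, Garble, and Ungarble are hypothetical names, not part of the original code, and the RegisterProvider call again assumes a .NET Core / .NET 5+ runtime.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Reproduces the problem: the UTF-8 bytes of `text` are decoded
    // with the wrong code page (for example 936).
    public static string Garble(string text, int wrongCodePage)
    {
        // Assumption: .NET Core / .NET 5+, where code page 936 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(wrongCodePage).GetString(utf8Bytes);
    }

    // Undoes it: encode with the wrong code page to recover the original
    // bytes, then decode those bytes as UTF-8.
    public static string Ungarble(string garbled, int wrongCodePage)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] originalBytes = Encoding.GetEncoding(wrongCodePage).GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

With code page 936, Garble("hello 你好", 936) should give "hello 浣犲ソ" and Ungarble("hello 浣犲ソ", 936) should give back "hello 你好". The round trip is lossless only when every byte sequence happened to map to a valid character in the wrong code page; bytes that did not are replaced during the first decoding and cannot be recovered.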

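The final fix is therefore to get the CodePage for the system language, encode the garbled string back into bytes with it, and then decode those bytes as UTF-8.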

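// Class-level fields. On .NET Framework, Encoding.Default is the system ANSI code page.
readonly int CURRENT_CODE_PAGE = Encoding.Default.CodePage;
readonly int TARGET_CODE_PAGE = Encoding.UTF8.CodePage;
// Inside the method that receives `text`: recover the raw bytes, then decode them as UTF-8.
byte[] raw = Encoding.GetEncoding(CURRENT_CODE_PAGE).GetBytes(text);
string newText = Encoding.GetEncoding(TARGET_CODE_PAGE).GetString(raw);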
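One caveat with this fix is that applying it to a string that was never garbled would corrupt it. A guarded version might look like the following sketch, which uses strict encodings that throw instead of silently substituting characters, and returns the input unchanged when the round trip fails. MojibakeRepair and Repair are hypothetical names, and the RegisterProvider call assumes a .NET Core / .NET 5+ runtime.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Hypothetical helper: tries to undo "UTF-8 bytes decoded with the wrong
    // code page". Returns `text` unchanged if it does not look garbled.
    public static string Repair(string text, int wrongCodePage)
    {
        // Assumption: .NET Core / .NET 5+, where code page 936 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Strict encodings: throw on unmappable characters or invalid bytes.
        Encoding strictWrong = Encoding.GetEncoding(
            wrongCodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            byte[] raw = strictWrong.GetBytes(text);
            return strictUtf8.GetString(raw);
        }
        catch (EncoderFallbackException)
        {
            // The text contains characters the wrong code page cannot represent.
            return text;
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not valid UTF-8, so the text was not garbled this way.
            return text;
        }
    }
}
```

For example, Repair("hello 浣犲ソ", 936) should return "hello 你好", while Repair("hello 你好", 936) should return its input unchanged, because the code page 936 bytes of "你好" are not valid UTF-8. The check is a heuristic: some correct strings may still happen to round-trip into valid UTF-8.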
