Unicode functions
-----------------

When working with extended character sets that contain accented characters, strings must be
converted to UTF-8 so that they can be used without conversion problems.

    >>> from pyams_utils import unicode

'translate_string' is a utility function which can be used, for example, to generate an object's id
without spaces and with accented characters converted to their unaccented versions:

    >>> sample = 'Mon titre accentué'
    >>> unicode.translate_string(sample)
    'mon titre accentue'

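The accent folding shown above can be approximated with the standard library alone. The following
is a minimal sketch, not the implementation used by 'translate_string'; the 'strip_accents' helper
is a hypothetical name introduced only for this illustration:

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD decomposition splits 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only the base characters and drop the combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('Mon titre accentué').lower())  # mon titre accentue
```

Characters that have no decomposition (such as 'ø' or 'ß') pass through this sketch unchanged, so
a real implementation may need an explicit translation table for them.
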
Results are lower-cased by default; this can be avoided by setting the 'force_lower' argument
to False:

    >>> unicode.translate_string(sample, force_lower=False)
    'Mon titre accentue'
    >>> unicode.translate_string(sample, force_lower=True, spaces='-')
    'mon-titre-accentue'

    >>> sample = 'Texte accentué avec "ponctuation" !'
    >>> unicode.translate_string(sample, force_lower=True, spaces=' ')
    'texte accentue avec ponctuation'
    >>> unicode.translate_string(sample, force_lower=True, remove_punctuation=False, spaces=' ')
    'texte accentue avec "ponctuation" !'
    >>> unicode.translate_string(sample, force_lower=True, remove_punctuation=False, spaces='-')
    'texte-accentue-avec-"ponctuation"-!'
    >>> unicode.translate_string(sample, force_lower=True, remove_punctuation=True, spaces='-')
    'texte-accentue-avec-ponctuation'
    >>> unicode.translate_string(sample, force_lower=True, remove_punctuation=True, spaces=' ', keep_chars='!')
    'texte accentue avec ponctuation !'

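To make the interplay of these options easier to follow, here is a rough standard-library sketch
of a function with similar parameters. The 'slug_sketch' name, its defaults and its internal order
of operations are assumptions made for this illustration; they are not taken from the
'pyams_utils' source:

```python
import string
import unicodedata


def slug_sketch(text, force_lower=True, spaces=' ', remove_punctuation=True, keep_chars=''):
    """Illustrative approximation only; NOT the pyams_utils implementation."""
    # Fold accents: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    result = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    if remove_punctuation:
        # Drop ASCII punctuation, except the characters explicitly kept.
        dropped = set(string.punctuation) - set(keep_chars)
        result = ''.join(ch for ch in result if ch not in dropped)
    if force_lower:
        result = result.lower()
    # Collapse runs of whitespace and substitute the separator as the last step.
    return spaces.join(result.split())


sample = 'Texte accentué avec "ponctuation" !'
print(slug_sketch(sample))                                        # texte accentue avec ponctuation
print(slug_sketch(sample, remove_punctuation=False, spaces='-'))  # texte-accentue-avec-"ponctuation"-!
print(slug_sketch(sample, keep_chars='!'))                        # texte accentue avec ponctuation !
```

With its empty 'keep_chars' default, this sketch also strips dots, so it does not reproduce every
result of 'translate_string' (file extensions, for instance).
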
If the input string can contain slashes (/) or backslashes (\), they are normally removed;
when the 'escape_slashes' parameter is set, the input string is split on them and only the last
element is returned, which is handy for handling file names on Windows platforms:

    >>> sample = 'Autre / chaîne / accentuée'
    >>> unicode.translate_string(sample)
    'autre chaine accentuee'
    >>> unicode.translate_string(sample, escape_slashes=True)
    'accentuee'
    >>> sample = 'C:\\Program Files\\My Application\\test.txt'
    >>> unicode.translate_string(sample)
    'cprogram filesmy applicationtest.txt'
    >>> unicode.translate_string(sample, escape_slashes=True)
    'test.txt'

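The "keep only the last element" behaviour can be sketched with a regular expression that splits
on both separators. This is only an illustration under the assumption that surrounding whitespace
is trimmed; 'last_path_segment' is a hypothetical helper, not part of 'pyams_utils':

```python
import re


def last_path_segment(text):
    """Return the part after the last slash or backslash (hypothetical helper)."""
    # The character class matches either '/' or '\', so both path styles are handled.
    return re.split(r'[\\/]', text)[-1].strip()


print(last_path_segment('C:\\Program Files\\My Application\\test.txt'))  # test.txt
print(last_path_segment('Autre / chaîne / accentuée'))                   # accentuée
```
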
To remove remaining spaces or convert them to another character, you can use the "spaces" parameter,
which can contain any string to be used in place of the initial spaces:

    >>> sample = 'C:\\Program Files\\My Application\\test.txt'
    >>> unicode.translate_string(sample, spaces=' ')
    'cprogram filesmy applicationtest.txt'
    >>> unicode.translate_string(sample, spaces='-')
    'cprogram-filesmy-applicationtest.txt'

Space replacement is done in the last step, so using it with the "escape_slashes" parameter only
affects the final result:

    >>> unicode.translate_string(sample, escape_slashes=True, spaces='-')
    'test.txt'

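Because 'test.txt' contains no space, the example above cannot show the separator being applied.
The ordering (keep the last segment first, replace spaces afterwards) can be illustrated with a
small standalone sketch; the helper name and the file name used here are hypothetical and the code
does not call 'pyams_utils':

```python
import re


def last_segment_then_spaces(text, spaces='-'):
    """Keep the last path segment, then replace its spaces (hypothetical helper)."""
    # Step 1: discard everything up to the last slash or backslash.
    segment = re.split(r'[\\/]', text)[-1]
    # Step 2: only now substitute the separator for whitespace runs.
    return spaces.join(segment.split())


print(last_segment_then_spaces('C:\\Program Files\\My App\\release notes.txt'))  # release-notes.txt
```

Spaces in the discarded directory names never reach the replacement step, which is why only the
final element is affected.
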
The unicode module also provides encoding and decoding functions:

    >>> var = b'Cha\xeene accentu\xe9e'
    >>> unicode.decode(var, 'latin1')
    'Chaîne accentuée'
    >>> unicode.encode(unicode.decode(var, 'latin1'), 'latin1') == var
    True

    >>> utf = 'Chaîne accentuée'
    >>> unicode.encode(utf, 'latin1')
    b'Cha\xeene accentu\xe9e'
    >>> unicode.decode(unicode.encode(utf, 'latin1'), 'latin1') == utf
    True
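
For comparison, the same round trip can be written with Python's built-in 'bytes.decode' and
'str.encode' methods. This sketch only shows the underlying standard-library behaviour and makes no
claim about how the module's 'decode' and 'encode' functions are implemented:

```python
raw = b'Cha\xeene accentu\xe9e'

# Decode Latin-1 bytes into a text string.
text = raw.decode('latin1')
print(text)  # Chaîne accentuée

# Encoding back with the same codec restores the original bytes.
assert text.encode('latin1') == raw

# In UTF-8, each accented character takes two bytes instead of one.
utf8 = text.encode('utf-8')
print(len(raw), len(utf8))  # 16 18
```
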