You can accidentally store a mixture of strings and non-strings in an `object` dtype array. It's better to have a dedicated dtype.
`object` dtype breaks dtype-specific operations like `DataFrame.select_dtypes()`. There isn't a clear way to select just text while excluding non-text but still object-dtype columns.

When reading code, the contents of an `object` dtype array are less clear than `'string'`.

Currently, the performance of `object` dtype arrays of strings and `arrays.StringArray` is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of `StringArray`.

Warning: `StringArray` is currently considered experimental. The implementation and parts of the API may change without warning.
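The selection problem described above can be sketched as follows. This is a minimal illustration, assuming a reasonably recent pandas; the frame and its column names are hypothetical. It shows that selecting `object` columns returns text and non-text data alike, while a dedicated `string` column can be selected on its own.

```python
import pandas as pd

# Hypothetical frame: two object-dtype columns (one pure text, one mixed)
# and one column that uses the dedicated string dtype.
df = pd.DataFrame(
    {
        "text_obj": pd.Series(["a", "b", "c"], dtype="object"),
        "mixed_obj": pd.Series(["a", 1, 2.5], dtype="object"),
        "text_str": pd.Series(["a", "b", "c"], dtype="string"),
    }
)

# Selecting "object" cannot tell text apart from arbitrary Python objects.
object_cols = list(df.select_dtypes(include="object").columns)

# Selecting "string" returns only the column with the dedicated dtype.
string_cols = list(df.select_dtypes(include="string").columns)

print(object_cols)
print(string_cols)
```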
For backwards-compatibility, `object` dtype remains the default type we infer a list of strings to:

```
In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0    a
1    b
2    c
dtype: object
```

To explicitly request `string` dtype, specify the `dtype`:

```
In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string
```

Or use `astype` after the `Series` or `DataFrame` is created:

```
In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string
```

Changed in version 1.1.0.
You can also use `StringDtype`/`"string"` as the dtype on non-string data and it will be converted to `string` dtype:

```
In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]:
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str
```

or convert from existing pandas data:

```
In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]:
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]:
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str
```

Behavior differences

These are places where the behavior of `StringDtype` objects differs from `object` dtype.
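Two such differences can be sketched as follows, assuming a recent pandas. For `string` dtype, accessor methods with numeric output return nullable `Int64` and methods with boolean output return nullable `boolean`. For `object` dtype, the numeric result becomes `float64` as soon as a missing value is present. The variable names are illustrative only.

```python
import pandas as pd

s_string = pd.Series(["a", None, "b"], dtype="string")
s_object = pd.Series(["a", None, "b"], dtype="object")

# string dtype: the result dtype does not depend on whether NA is present.
counts_string = s_string.str.count("a")            # nullable Int64
counts_string_no_na = s_string.dropna().str.count("a")

# object dtype: NaN forces float64; without NaN the result is int64.
counts_object = s_object.str.count("a")
counts_object_no_na = s_object.dropna().str.count("a")

# Boolean-returning methods give the nullable boolean dtype for string dtype.
flags_string = s_string.str.isdigit()

print(counts_string.dtype, counts_string_no_na.dtype)
print(counts_object.dtype, counts_object_no_na.dtype)
print(flags_string.dtype)
```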
Everything else that follows in the rest of this document applies equally to `string` and `object` dtype.

String methods

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the `str` attribute and generally have names matching the equivalent (scalar) built-in string methods:

```
In [24]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....:

In [25]: s.str.lower()
Out[25]:
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]:
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]:
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
```

The string methods on Index are especially useful for cleaning up or transforming DataFrame
columns. For instance, you may have columns with leading or trailing whitespace. Since `df.columns` is an Index object, we can use the `.str` accessor on it.

These string methods can then be used to clean up the columns as needed, for example by removing leading and trailing whitespace, lowercasing all names, and replacing any remaining whitespace with underscores.

Note: If you have a `Series` where lots of elements are repeated (i.e. the number of unique elements in the `Series` is a lot smaller than the length of the `Series`), it can be faster to convert the original `Series` to one of type `category` and then use `.str.<method>` or `.dt.<property>` on that. The performance difference comes from the fact that, for a `Series` of type `category`, the string operations are done on the `.categories` and not on each element of the `Series`.

Please note that a `Series` of type `category` with string `.categories` has some limitations in comparison to a `Series` of type string (e.g. you can't add strings to each other: `s + " " + s` won't work if `s` is a `Series` of type `category`).
Also, `.str` methods which operate on elements of type `list` are not available on such a `Series`.

Warning: Before v.0.25.0, the `.str` accessor did only the most rudimentary type checks. Starting with v.0.25.0, the type of the Series is inferred and the allowed types (i.e. strings) are enforced more rigorously.

Generally speaking, the `.str` accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point.
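The column clean-up and the `category` note from earlier in this section can be sketched as follows. This is a minimal illustration with a hypothetical frame and hypothetical column names. `regex=False` is passed explicitly so that the single-space pattern is treated literally regardless of the pandas version.

```python
import pandas as pd

# Hypothetical frame whose column labels carry stray whitespace.
df = pd.DataFrame([[1, 2], [3, 4]], columns=[" Column A ", " Column B "])

# df.columns is an Index, so the .str accessor is available on it.
df.columns = (
    df.columns.str.strip()                        # drop leading/trailing whitespace
    .str.lower()                                  # lowercase every label
    .str.replace(" ", "_", regex=False)           # remaining spaces -> underscores
)
print(list(df.columns))

# For heavily repeated values, a categorical Series still exposes .str;
# the operation is applied to the categories rather than to every element.
pets = pd.Series(["dog", "cat", "dog", "cat", "dog"], dtype="category")
upper = pets.str.upper()
print(upper.tolist())
```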
Splitting and replacing strings

Methods like `split` return a Series of lists. Elements in the split lists can be accessed using `get` or `[]` notation. It is easy to expand this to return a DataFrame using `expand`. When the original `Series` has `StringDtype`, the output columns will all be `StringDtype` as well.
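A minimal sketch of the splitting behavior described above, using an illustrative `Series` (the sample values and variable names are assumptions made for the example):

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

# split returns one list per element; missing values stay missing.
lists = s2.str.split("_")

# Element access on the lists: .str.get(1) and .str[1] are equivalent.
second_get = s2.str.split("_").str.get(1)
second_idx = s2.str.split("_").str[1]

# expand=True spreads the pieces over the columns of a DataFrame.
wide = s2.str.split("_", expand=True)

print(lists)
print(second_idx)
print(wide)
```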
It is also possible to limit the number of splits with the `n` argument. `rsplit` is similar to `split` except it works in the reverse direction, i.e., from the end of the string to the beginning of the string. `replace` optionally uses regular expressions.

Warning: Some caution must be taken when dealing with regular expressions! The current behavior is to treat single character patterns as literal strings, even when `regex` is set to `True`. This behavior is deprecated and will be removed in a future version so that the `regex` keyword is always respected.

Changed in version 1.2.0.

If you want literal replacement of a string (equivalent to `str.replace()`), you can set the optional `regex` parameter to `False`, rather than escaping each character. In this case both `pat` and `repl` must be strings.

The `replace` method can also take a callable as replacement. It is called on every `pat` using `re.sub()`. The callable should expect one positional argument (a regex object) and return a string.
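The options described above can be sketched as follows. The sample data and the helper `reverse_match` are illustrative assumptions, and `regex` is always passed explicitly so the example does not depend on the version-specific default.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

# n=1 limits the number of splits; rsplit starts from the end of the string.
left_first = s2.str.split("_", expand=True, n=1)
right_first = s2.str.rsplit("_", expand=True, n=1)

s3 = pd.Series(["A", "B", "Aaba", "Baca", np.nan, "dog"], dtype="string")

# Regular-expression replacement, case-insensitive.
regex_replaced = s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)

# Literal replacement: "$" is not treated as an end-of-string anchor.
dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")
literal = dollars.str.replace("-$", "-", regex=False)


def reverse_match(match):
    """Hypothetical callable: receives a regex match object, returns a string."""
    return match.group(0)[::-1]


words = pd.Series(["foo 123", "bar baz", np.nan], dtype="string")
reversed_words = words.str.replace(r"[a-z]+", reverse_match, regex=True)

print(left_first)
print(right_first)
print(regex_replaced)
print(literal)
print(reversed_words)
```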
The `replace` method also accepts a compiled regular expression object from `re.compile()` as a pattern. All flags should be included in the compiled regular expression object.

Including a `flags` argument when calling `replace` with a compiled regular expression object will raise a `ValueError`.
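A minimal sketch of both cases, assuming illustrative sample data: the flags live inside the compiled pattern, and passing `flags` again alongside it is rejected.

```python
import re

import numpy as np
import pandas as pd

s3 = pd.Series(["A", "B", "Aaba", "Baca", np.nan, "dog"], dtype="string")

# All flags are baked into the compiled pattern.
regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
replaced = s3.str.replace(regex_pat, "XX-XX ", regex=True)
print(replaced)

# Supplying flags in addition to a compiled pattern raises ValueError.
try:
    s3.str.replace(regex_pat, "XX-XX ", flags=re.IGNORECASE, regex=True)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```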
`removeprefix` and `removesuffix` have the same effect as `str.removeprefix` and `str.removesuffix`, added in Python 3.9.

New in version 1.4.0.

Concatenation

There are several ways to concatenate a `Series` or `Index`, either with itself or others, all based on `cat()` (respectively `Index.str.cat`).
Concatenating a single Series into a string

The content of a `Series` (or `Index`) can be concatenated. If not specified, the keyword `sep` for the separator defaults to the empty string, `sep=''`. By default, missing values are ignored.
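A minimal sketch of joining a single `Series` into one string, using illustrative data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")

joined_commas = s.str.cat(sep=",")   # explicit separator
joined_plain = s.str.cat()           # default separator is the empty string

# Missing values are skipped by default.
t = pd.Series(["a", "b", np.nan, "d"], dtype="string")
skipped = t.str.cat(sep=",")

print(joined_commas)
print(joined_plain)
print(skipped)
```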
Using `na_rep`, they can be given a representation.

Concatenating a Series and something list-like into a Series

The first argument to `cat()` can be a list-like object, provided that it matches the length of the calling `Series` (or `Index`). Missing values on either side will result in missing values in the result as well, unless `na_rep` is specified.

Concatenating a Series and something array-like into a Series

The parameter `others` can also be two-dimensional.
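The cases covered so far (`na_rep`, a list-like `others`, and a two-dimensional `others`) can be sketched together. The data is illustrative, and the two-dimensional object here is a `DataFrame` that shares the index of the calling `Series`.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

# na_rep gives missing values a visible representation.
with_placeholder = t.str.cat(sep=",", na_rep="-")

# Element-wise concatenation with a list of matching length.
pairwise = s.str.cat(["A", "B", "C", "D"])

# A missing value on either side propagates unless na_rep is given.
propagated = s.str.cat(t)
filled = s.str.cat(t, na_rep="-")

# Two-dimensional others: every column is appended row by row.
d = pd.concat([t, s], axis=1)
from_frame = s.str.cat(d, na_rep="-")

print(with_placeholder)
print(pairwise.tolist())
print(filled.tolist())
print(from_frame.tolist())
```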
In this case, the number of rows must match the length of the calling `Series` (or `Index`).

Concatenating a Series and an indexed object into a Series, with alignment

For concatenation with a `Series` or `DataFrame`, it is possible to align the indexes before concatenation by setting the `join` keyword.

Warning: If the `join` keyword is not passed, the method `cat()` will currently fall back to the behavior before version 0.23.0 (i.e. no alignment), but a `FutureWarning` will be raised if any of the involved indexes differ, since this default will change to `join='left'` in a future version.

The usual options are available for `join` (one of `'left', 'outer', 'inner', 'right'`). In particular, alignment also means that the different lengths do not need to coincide anymore. The same alignment can be used when `others` is a `DataFrame`.

Concatenating a Series and many objects into a Series

Several array-like items (specifically: `Series`, `Index`, and 1-dimensional variants of `np.ndarray`) can be combined in a list-like container (including iterators, `dict`-views, etc.).

All elements without an index (e.g. `np.ndarray`) within the passed list-like must match in length to the calling `Series` (or `Index`), but `Series` and `Index` may have arbitrary length (as long as alignment is not disabled with `join=None`).

If using `join='right'` on a list-like of `others` that contains different indexes, the union of these indexes will be used as the basis for the final concatenation.

Indexing with .str

You can use `[]` notation to directly index by position locations. If you index past the end of the string, the result will be a `NaN`.
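A minimal sketch of positional indexing through `.str`, using illustrative data. With `string` dtype the missing marker is displayed as `<NA>`, so the example checks for missing values with `isna()` rather than comparing against a specific marker.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)

first = s.str[0]    # first character of every element
second = s.str[1]   # one-character strings have no position 1 -> missing

print(first)
print(second)
```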
Extracting substrings

Extract first match in each subject (extract)

Warning: Before version 0.23, argument expand of the extract method defaulted to False. When expand=False, the method returns a Series, Index, or DataFrame, depending on the subject and regular expression pattern.
When expand=True, it always returns a DataFrame, which is more consistent and less confusing from the perspective of a user. expand=True has been the default since version 0.23.0.

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.

Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects.
The dtype of the result is always object, even if no match is found and the result only contains NaN.

Named groups, written (?P<name>...), and optional groups, written (...)?, can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True. It returns a Series if expand=False.
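The following sketch illustrates the extract behaviour described above on a Series. The data and the patterns are illustrative assumptions chosen to exercise numbered, named, and optional groups.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # hypothetical data; "c3" does not start with a or b

# More than one group: a DataFrame with one column per group, even with expand=False.
two_groups = s.str.extract(r"([ab])(\d)", expand=False)

# Named groups become the column names.
named = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)", expand=False)

# An optional group may be missing while the rest of the pattern still matches.
optional = s.str.extract(r"([ab])?(\d)", expand=False)

# One group: a one-column DataFrame with expand=True, a Series with expand=False.
one_group_frame = s.str.extract(r"[ab](\d)", expand=True)
one_group_series = s.str.extract(r"[ab](\d)", expand=False)

print(two_groups)
print(named)
print(optional)
print(type(one_group_frame).__name__, type(one_group_series).__name__)
```

Rows for subjects that do not match are filled with missing values instead of being dropped, so the result keeps the index of the original Series.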
Calling extract on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True. It returns an Index if expand=False.
Calling extract on an Index with a regex with more than one capture group returns a DataFrame if expand=True. It raises ValueError if expand=False.
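As a small sketch of the Index cases above, the snippet below runs extract on a made-up Index with one and with two capture groups. The labels and group names are assumptions for illustration only.

```python
import pandas as pd

idx = pd.Index(["A11", "B22", "C33"])  # hypothetical labels

one_group = r"[AB](?P<digit>\d)"
two_groups = r"(?P<letter>[AB])(?P<digit>\d)"

one_expanded = idx.str.extract(one_group, expand=True)   # DataFrame with one column
one_flat = idx.str.extract(one_group, expand=False)      # Index
two_expanded = idx.str.extract(two_groups, expand=True)  # DataFrame with two columns

# More than one group cannot be represented as an Index, so this call raises.
try:
    idx.str.extract(two_groups, expand=False)
    raised_value_error = False
except ValueError:
    raised_value_error = True

print(type(one_expanded).__name__, type(one_flat).__name__)
print(two_expanded)
print(raised_value_error)
```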
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row):

          1 group   >1 group
Index     Index     ValueError
Series    Series    DataFrame

Extract all matches in each subject (extractall)

Unlike extract (which returns only the first match), the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows.
The last level of the MultiIndex is named match and indicates the order in the subject.

When each subject string in the Series has exactly one match, extractall(pat).xs(0, level='match') gives the same result as extract(pat).

Index also supports .str.extractall.
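The extractall behaviour described above can be sketched as follows. The Series contents, index labels, and patterns are illustrative assumptions.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # hypothetical data
pattern = r"[ab](?P<digit>\d)"

# One row per match; the last index level, named "match", numbers matches per subject.
all_matches = s.str.extractall(pattern)

# When every subject has exactly one match, taking match 0 reproduces extract.
single = pd.Series(["a3", "b3", "c2"])
pair = r"(?P<letter>[a-z])(?P<digit>\d)"
first_only = single.str.extract(pair, expand=True)
via_extractall = single.str.extractall(pair).xs(0, level="match")

# The same call on an Index uses a default integer first level.
index_matches = pd.Index(["a1a2", "b1", "c1"]).str.extractall(pattern)

print(all_matches)
print(via_extractall)
print(index_matches)
```

Subjects without any match (here the entry labelled "C") contribute no rows to the extractall result.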
Index.str.extractall returns a DataFrame with the same result as Series.str.extractall on a Series with a default index (starting from 0).

Testing for strings that match or contain a pattern

You can check whether elements contain a pattern with contains, or whether elements match a pattern with match.

New in version 1.1.0: fullmatch.
Note: The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.
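To make the three levels of strictness concrete, the sketch below applies the same pattern with contains, match, and fullmatch, and repeats the check on one string with the re functions named above. The sample strings are assumptions chosen so that the three methods disagree.

```python
import re

import pandas as pd

s = pd.Series(["1", "2", "3a", "3b", "03c", "4dx"])  # hypothetical data
pattern = r"[0-9][a-z]"

anywhere = s.str.contains(pattern)  # digit followed by a letter at any position
at_start = s.str.match(pattern)     # the match must begin at the first character
whole = s.str.fullmatch(pattern)    # the entire string must match

# The same three modes on a single string, using the re package directly.
re_flags = (
    bool(re.search(pattern, "03c")),
    bool(re.match(pattern, "03c")),
    bool(re.fullmatch(pattern, "03c")),
)

print(pd.DataFrame({"contains": anywhere, "match": at_start, "fullmatch": whole}))
print(re_flags)
```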
Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False.

Creating indicator variables

You can extract dummy variables from string columns.
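Both the na argument and the dummy-variable extraction mentioned above can be sketched briefly. The data is made up, and the '|' separator used for the tags is an assumption of this example.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Aaba", None, "dog"], dtype="string")  # hypothetical data

# Without na, the missing entry stays missing in the result.
default_result = names.str.contains("A")
# With na=False, the missing entry is treated as "does not contain".
filled_result = names.str.contains("A", na=False)

# Each string holds tags joined by "|"; get_dummies returns one 0/1 column per tag.
tags = pd.Series(["a", "a|b", np.nan, "a|c"])
dummies = tags.str.get_dummies(sep="|")

print(default_result)
print(filled_result)
print(dummies)
```

A missing entry in tags produces a row of zeros, since it carries none of the tags.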
For example, if they are separated by a '|', pass that separator to str.get_dummies(sep='|').

String Index also supports get_dummies, which returns a MultiIndex.

See also pandas.get_dummies().

Method summary

Method            Description
cat()             Concatenate strings
split()           Split strings on delimiter
rsplit()          Split strings on delimiter working from the end of the string
get()             Index into each element (retrieve i-th element)
join()            Join strings in each element of the Series with passed separator
get_dummies()     Split strings on the delimiter returning DataFrame of dummy variables
contains()        Return boolean array if each string contains pattern/regex
replace()         Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
removeprefix()    Remove prefix from string, i.e. only remove if string starts with prefix
removesuffix()    Remove suffix from string, i.e. only remove if string ends with suffix
repeat()          Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()             Add whitespace to left, right, or both sides of strings
center()          Equivalent to str.center
ljust()           Equivalent to str.ljust
rjust()           Equivalent to str.rjust
zfill()           Equivalent to str.zfill
wrap()            Split long strings into lines with length less than a given width
slice()           Slice each string in the Series
slice_replace()   Replace slice in each string with passed value
count()           Count occurrences of pattern
startswith()      Equivalent to str.startswith(pat) for each element
endswith()        Equivalent to str.endswith(pat) for each element
findall()         Compute list of all occurrences of pattern/regex for each string
match()           Call re.match on each element, returning a boolean indicating whether it matched
extract()         Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()      Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group

How do I convert a string to a DataFrame in Python?

Method 1: Create a pandas DataFrame from a string using StringIO(). StringIO() wraps the string in a file-like object, so the data can be read with pd.read_csv().
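A minimal sketch of the StringIO approach, assuming the string holds comma-separated text with a header row. The column names and values are made up.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text held in a plain Python string.
csv_text = "name,score\nalice,1\nbob,2\n"

# StringIO exposes the string through a file-like interface that read_csv accepts.
df = pd.read_csv(StringIO(csv_text))

print(df)
```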
How to convert object data type to string in Python?

Everything in Python is an object, and objects can be converted to strings directly with str() and repr(). str() returns a readable string for an object, including all built-in types. repr() also returns a string, intended as an unambiguous representation of the object.
How can we convert a Python series object into a DataFrame?

The to_frame() function converts the given Series object to a DataFrame. Parameter: name: the passed name substitutes for the Series name (if it has one).
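As a short sketch of to_frame(), with a made-up Series and hypothetical names:

```python
import pandas as pd

letters = pd.Series(["a", "b", "c"], name="letters")  # hypothetical data

# Without arguments, the Series name becomes the column name.
frame = letters.to_frame()

# Passing name overrides the Series name for the resulting column.
renamed = letters.to_frame(name="symbol")

print(frame)
print(renamed)
```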
How to convert JSON string to DataFrame in Python?

Load the JSON string with json.loads(), then pass the resulting object to json_normalize(), which returns a pandas DataFrame.
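The two steps can be sketched as follows. The JSON payload is an invented example with one nested field.

```python
import json

import pandas as pd

# Hypothetical JSON text: a list of records, each with a nested "info" object.
json_text = '[{"id": 1, "info": {"city": "Oslo"}}, {"id": 2, "info": {"city": "Lima"}}]'

# json.loads parses the text into Python lists and dicts.
records = json.loads(json_text)

# json_normalize flattens nested keys into dotted column names.
df = pd.json_normalize(records)

print(df)
```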