Python convert string object to DataFrame

  • You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

  • object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

  • When reading code, the contents of an object dtype array are less clear than 'string'.

  • Currently, the performance of object dtype arrays of strings and arrays.StringArray is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.

    Warning

    StringArray is currently considered experimental. The implementation and parts of the API may change without warning.

    For backwards-compatibility, object dtype remains the default type inferred for a list of strings:

    In [1]: pd.Series(["a", "b", "c"])
    Out[1]: 
    0    a
    1    b
    2    c
    dtype: object
    

    To explicitly request string dtype, specify the dtype:

    In [2]: pd.Series(["a", "b", "c"], dtype="string")
    Out[2]: 
    0    a
    1    b
    2    c
    dtype: string
    
    In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
    Out[3]: 
    0    a
    1    b
    2    c
    dtype: string
    

    Or use astype after the Series or DataFrame is created:

    In [4]: s = pd.Series(["a", "b", "c"])
    
    In [5]: s
    Out[5]: 
    0    a
    1    b
    2    c
    dtype: object
    
    In [6]: s.astype("string")
    Out[6]: 
    0    a
    1    b
    2    c
    dtype: string
    

    Changed in version 1.1.0.

    You can also use StringDtype/"string" as the dtype on non-string data and it will be converted to string dtype:

    In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")
    
    In [8]: s
    Out[8]: 
    0       a
    1       2
    2    <NA>
    dtype: string
    
    In [9]: type(s[1])
    Out[9]: str
    

    or convert from existing pandas data:

    In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")
    
    In [11]: s1
    Out[11]: 
    0       1
    1       2
    2    <NA>
    dtype: Int64
    
    In [12]: s2 = s1.astype("string")
    
    In [13]: s2
    Out[13]: 
    0       1
    1       2
    2    <NA>
    dtype: string
    
    In [14]: type(s2[0])
    Out[14]: str
    

    Behavior differences

    These are places where the behavior of StringDtype objects differs from object dtype:

    1. For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

      In [15]: s = pd.Series(["a", None, "b"], dtype="string")
      
      In [16]: s
      Out[16]: 
      0       a
      1    <NA>
      2       b
      dtype: string
      
      In [17]: s.str.count("a")
      Out[17]: 
      0       1
      1    <NA>
      2       0
      dtype: Int64
      
      In [18]: s.dropna().str.count("a")
      Out[18]: 
      0    1
      2    0
      dtype: Int64
      

      Both outputs are Int64 dtype. Compare that with object-dtype:

      In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")
      
      In [20]: s2.str.count("a")
      Out[20]: 
      0    1.0
      1    NaN
      2    0.0
      dtype: float64
      
      In [21]: s2.dropna().str.count("a")
      Out[21]: 
      0    1
      2    0
      dtype: int64
      

      When NA values are present, the output dtype is float64. Similarly for methods returning boolean values.

      In [22]: s.str.isdigit()
      Out[22]: 
      0    False
      1     <NA>
      2    False
      dtype: boolean
      
      In [23]: s.str.match("a")
      Out[23]: 
      0     True
      1     <NA>
      2    False
      dtype: boolean
      

    2. Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.

    3. In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.

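The difference in comparison behavior can be sketched as follows. This is a minimal illustration, assuming a recent pandas release; the variable names are made up for the example.

```python
import pandas as pd

# The same values stored with the dedicated string dtype and with object dtype.
s_string = pd.Series(["a", None, "b"], dtype="string")
s_object = pd.Series(["a", None, "b"], dtype="object")

# On the string dtype the result is a nullable boolean Series and the
# missing value propagates as <NA>.
eq_string = s_string == "a"

# On object dtype the result is a plain bool Series and the missing
# value simply compares unequal.
eq_object = s_object == "a"

print(eq_string)
print(eq_object)
```

If you need a plain bool result from the string-dtype comparison, one option is to fill the missing entries first, for example with eq_string.fillna(False).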
    Everything else that follows in the rest of this document applies equally to string and object dtype.

    String methods

    Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

    In [24]: s = pd.Series(
       ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
       ....: )
       ....: 
    
    In [25]: s.str.lower()
    Out[25]: 
    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5    <NA>
    6    caba
    7     dog
    8     cat
    dtype: string
    
    In [26]: s.str.upper()
    Out[26]: 
    0       A
    1       B
    2       C
    3    AABA
    4    BACA
    5    <NA>
    6    CABA
    7     DOG
    8     CAT
    dtype: string
    
    In [27]: s.str.len()
    Out[27]: 
    0       1
    1       1
    2       1
    3       4
    4       4
    5    <NA>
    6       4
    7       3
    8       3
    dtype: Int64
    

    In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
    
    In [29]: idx.str.strip()
    Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
    
    In [30]: idx.str.lstrip()
    Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
    
    In [31]: idx.str.rstrip()
    Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
    

    The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:


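As a small illustration of such column labels, here is a sketch with a hypothetical frame; the values and column names are invented for the example.

```python
import pandas as pd

# Hypothetical frame whose column labels carry stray whitespace.
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    columns=[" Column A ", " Column B "],
)

# The labels are stored exactly as given, whitespace included,
# so df["Column A"] would raise a KeyError here.
print(list(df.columns))
```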
    Since a DataFrame’s columns attribute (df.columns) is an Index object, we can use the .str accessor on it.

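A minimal sketch of the accessor on column labels, again using a hypothetical frame with invented names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=[" Column A ", " Column B "])

# df.columns is an Index, so the .str accessor is available on it directly.
stripped = df.columns.str.strip()
lowered = df.columns.str.lower()

print(stripped)
print(lowered)
```

Neither call modifies df; each returns a new Index that can be assigned back to df.columns if desired.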
    These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespaces, lower casing all names, and replacing any remaining whitespaces with underscores:


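The cleanup described above might look like the following sketch. The frame is hypothetical, and regex=False is passed explicitly so the space is treated as a literal regardless of the pandas version's default.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=[" Column A ", " Column B "])

# Strip surrounding whitespace, lower-case the names, then replace the
# remaining inner spaces with underscores.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```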
    Note

    If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for Series of type category, the string operations are done on the .categories and not on each element of the Series.

    Please note that a Series of type category with string .categories has some limitations in comparison to Series of type string (e.g. you can’t add strings to each other: s + " " + s won’t work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.

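A short sketch of the category approach follows. The data is invented, and no timing is shown, since any speed-up depends on the data and the pandas version.

```python
import pandas as pd

# Many rows, few distinct values: a typical candidate for category dtype.
s = pd.Series(["apple", "pear", "apple", "pear", "apple"] * 1000)
as_category = s.astype("category")

# The string method is applied to the categories and the result is
# mapped back onto every row.
upper = as_category.str.upper()

print(as_category.cat.categories)
print(upper.head())
```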
    Warning

    Before v.0.25.0, the .str-accessor did only the most rudimentary type checks. Starting with v.0.25.0, the type of the Series is inferred and the allowed types (i.e. strings) are enforced more rigorously.

    Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point.

    Splitting and replacing strings

    Methods like split return a Series of lists.

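For instance, splitting on an underscore could look like this sketch; the sample values are chosen for illustration.

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# Each non-missing element becomes a list of its pieces;
# the missing element stays missing.
parts = s2.str.split("_")

print(parts)
```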
    Elements in the split lists can be accessed using get or [] notation.

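Both access styles are sketched below on illustrative data; they are expected to give the same result.

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")
parts = s2.str.split("_")

# Pick the second piece of every list, once with get and once with [].
via_get = parts.str.get(1)
via_brackets = parts.str[1]

print(via_get)
print(via_brackets)
```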
    It is easy to expand this to return a DataFrame using expand.

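A minimal sketch of expand=True, using the same kind of illustrative data:

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# expand=True spreads the pieces across columns instead of returning lists.
expanded = s2.str.split("_", expand=True)

print(expanded)
```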
    When the original Series has StringDtype, the output columns will all be StringDtype as well.

    It is also possible to limit the number of splits:


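Limiting the number of splits with the n keyword can be sketched like this, on illustrative data:

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# n=1 performs at most one split, counted from the left.
first_split = s2.str.split("_", expand=True, n=1)

print(first_split)
```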
    rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string.

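The reverse direction is easiest to see together with n=1, as in this sketch on illustrative data:

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# With n=1, rsplit cuts at the last underscore rather than the first.
last_split = s2.str.rsplit("_", expand=True, n=1)

print(last_split)
```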
    replace optionally uses regular expressions.

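A sketch of a regular-expression replacement follows. The data and pattern are illustrative, and regex=True is passed explicitly because the default has differed between pandas versions.

```python
import pandas as pd

s3 = pd.Series(["Aaba", "Baca", "", None, "CABA", "dog", "cat"], dtype="string")

# Replace either "any character followed by a" at the start of the string,
# or the word "dog", ignoring case.
replaced = s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)

print(replaced)
```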
    Warning

    Some caution must be taken when dealing with regular expressions! The current behavior is to treat single character patterns as literal strings, even when regex is set to True. This behavior is deprecated and will be removed in a future version so that the regex keyword is always respected.

    Changed in version 1.2.0.

    If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings.

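Literal replacement can be sketched as below. The dollar sign would otherwise be a regex anchor, which makes the difference visible; the data is invented for the example.

```python
import pandas as pd

dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# With regex=False the pattern "-$" is matched character for character,
# so no escaping of the dollar sign is needed.
cleaned = dollars.str.replace("-$", "-", regex=False)

print(cleaned)
```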
    The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.

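A callable replacement might look like this sketch. The helper name reverse_match and the sample data are made up for illustration.

```python
import pandas as pd


def reverse_match(match):
    """Return the matched text reversed; match is a regex match object."""
    return match.group(0)[::-1]


s = pd.Series(["foo 123", "bar baz", None], dtype="string")

# Every run of lowercase letters is passed to reverse_match.
reversed_words = s.str.replace(r"[a-z]+", reverse_match, regex=True)

print(reversed_words)
```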
    The `replace` method also accepts a compiled regular expression object from `re.compile()` as a pattern. All flags should be included in the compiled regular expression object.
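
The following sketch passes a compiled pattern to `replace`, with the case-insensitivity flag baked into the pattern. The data and the name `regex_pat` are assumptions made for illustration.

```python
import re

import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["A", "B", "C", "Aaba", "Baca"], dtype="string")

# Flags belong in the compiled pattern, not in the replace() call.
regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

replaced = s.str.replace(regex_pat, "XX-XX ", regex=True)
print(replaced.tolist())
```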

    Including a `flags` argument when calling `replace` with a compiled regular expression object will raise a `ValueError`.
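
To illustrate the error case, this sketch passes `flags` together with a compiled pattern and captures the resulting exception. Variable names are illustrative only.

```python
import re

import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["Aaba", "Baca"], dtype="string")
regex_pat = re.compile(r"^.a", flags=re.IGNORECASE)

error = None
try:
    # Passing flags alongside a compiled pattern is rejected.
    s.str.replace(regex_pat, "x", flags=re.IGNORECASE, regex=True)
except ValueError as exc:
    error = exc

print(type(error).__name__)
```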

    `removeprefix` and `removesuffix` have the same effect as `str.removeprefix` and `str.removesuffix`, added in Python 3.9.

    New in version 1.4.0.
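
A brief sketch of both methods, assuming pandas 1.4 or later; the sample strings are made up for illustration.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
with_prefix = pd.Series(["str_foo", "str_bar", "no_prefix"], dtype="string")
with_suffix = pd.Series(["foo_str", "bar_str", "no_suffix"], dtype="string")

# Strings without the given prefix or suffix are returned unchanged.
no_prefix = with_prefix.str.removeprefix("str_")
no_suffix = with_suffix.str.removesuffix("_str")

print(no_prefix.tolist())
print(no_suffix.tolist())
```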

    Concatenation

    There are several ways to concatenate a `Series` or `Index`, either with itself or with others, all based on `cat()` or, for an `Index`, `Index.str.cat`.

    Concatenating a single Series into a string

    The content of a `Series` (or `Index`) can be concatenated into a single string.

    If not specified, the keyword `sep` for the separator defaults to the empty string, `sep=''`.
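
As a small sketch, the calls below collapse a Series into one string, first with the default separator and then with an explicit one. The data is illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")

joined_default = s.str.cat()        # sep defaults to the empty string
joined_commas = s.str.cat(sep=",")  # explicit separator

print(joined_default)
print(joined_commas)
```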

    By default, missing values are ignored. Using `na_rep`, they can be given a representation.
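
A minimal sketch of the effect of `na_rep` when joining a single Series; the data and the placeholder `"-"` are illustrative choices.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
t = pd.Series(["a", "b", None, "d"], dtype="string")

skipped = t.str.cat(sep=",")             # the missing value is dropped
shown = t.str.cat(sep=",", na_rep="-")   # the missing value becomes "-"

print(skipped)
print(shown)
```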

    Concatenating a Series and something list-like into a Series

    The first argument to `cat()` can be a list-like object, provided that it matches the length of the calling `Series` (or `Index`).
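
A short sketch of element-wise concatenation with a plain list of matching length; the values are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")

# The list has the same length as s, so elements are paired by position.
paired = s.str.cat(["A", "B", "C", "D"])
print(paired.tolist())
```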

    Missing values on either side will result in missing values in the result as well, unless `na_rep` is specified.
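
The sketch below shows how a missing value propagates and how `na_rep` changes that. Names and data are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
t = pd.Series(["a", "b", None, "d"], dtype="string")

propagated = s.str.cat(t)          # row 2 becomes missing
filled = s.str.cat(t, na_rep="-")  # row 2 uses "-" for the missing part

print(propagated)
print(filled)
```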

    Concatenating a Series and something array-like into a Series

    The parameter `others` can also be two-dimensional. In this case, the number of rows must match the length of the calling `Series` (or `Index`).
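
As a sketch of a two-dimensional `others`, the example below builds a small DataFrame from two Series and concatenates it column by column. The DataFrame `d` is an illustrative construction.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
t = pd.Series(["a", "b", None, "d"], dtype="string")

# Two columns, four rows: the row count matches len(s).
d = pd.concat([t, s], axis=1)

combined = s.str.cat(d, na_rep="-")
print(combined.tolist())
```

Each result element is the caller's value followed by the values of every column in that row.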

    Concatenating a Series and an indexed object into a Series, with alignment

    For concatenation with a `Series` or `DataFrame`, it is possible to align the indexes before concatenation by setting the `join` keyword.
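
A sketch contrasting index alignment with positional pairing. The Series `u` and its shuffled index are illustrative; passing a plain list is used here only to show what happens when no index is available.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

# With join="left", u is reordered to match the index of s first.
aligned = s.str.cat(u, join="left")

# A plain list carries no index, so elements are paired by position.
positional = s.str.cat(u.tolist())

print(aligned.tolist())
print(positional.tolist())
```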

    Warning

    If the `join` keyword is not passed, the `cat()` method will currently fall back to the behavior before version 0.23.0 (i.e. no alignment), but a `FutureWarning` will be raised if any of the involved indexes differ, since this default will change to `join='left'` in a future version.

    The usual options are available for `join` (one of `'left'`, `'outer'`, `'inner'`, `'right'`). In particular, alignment also means that the different lengths do not need to coincide anymore.
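
The sketch below tries three `join` options with a Series of a different length. The Series `v` and its index are illustrative assumptions.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

left = s.str.cat(v, join="left", na_rep="-")    # keeps the index of s
outer = s.str.cat(v, join="outer", na_rep="-")  # union of both indexes
inner = s.str.cat(v, join="inner")              # labels present in both

print(left.tolist())
print(outer)
print(inner.tolist())
```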

    The same alignment can be used when `others` is a `DataFrame`.
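
A minimal sketch of aligning a DataFrame whose rows are in a different order. The column names `x` and `y` are hypothetical.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
f = pd.DataFrame(
    {"x": ["D", "C", "B", "A"], "y": ["4", "3", "2", "1"]},
    index=[3, 2, 1, 0],  # reversed relative to s
)

# Rows of f are matched to s by index label, not by position.
aligned = s.str.cat(f, join="left")
print(aligned.tolist())
```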

    Concatenating a Series and many objects into a Series

    Several array-like items (specifically: `Series`, `Index`, and 1-dimensional variants of `np.ndarray`) can be combined in a list-like container (including iterators, `dict`-views, etc.).
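
As a sketch, the list below mixes a Series and a one-dimensional NumPy array as `others`. All names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
numbers = pd.Series(["1", "2", "3", "4"], dtype="string")
letters = np.array(["w", "x", "y", "z"], dtype=object)

# Each item in the list contributes one piece per row, in list order.
combined = s.str.cat([numbers, letters], sep="-")
print(combined.tolist())
```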

    All elements without an index (e.g. `np.ndarray`) within the passed list-like must match in length to the calling `Series` (or `Index`), but `Series` and `Index` may have arbitrary length (as long as alignment is not disabled with `join=None`).
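
The sketch below combines an indexed Series of a different length with an array that matches the caller's length. Names and data are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")
arr = np.array(["w", "x", "y", "z"], dtype=object)  # same length as s

# v is aligned on its index; arr is paired with s by position.
mixed = s.str.cat([v, arr], join="left", na_rep="-")
print(mixed.tolist())
```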

    If using `join='right'` on a list-like of `others` that contains different indexes, the union of these indexes will be used as the basis for the final concatenation.
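
A sketch of the `join='right'` case with two small Series that have disjoint indexes. The names `u_part` and `v_part` are hypothetical; the result is compared by label because row order is not the point here.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a", "b", "c", "d"], dtype="string")
u_part = pd.Series(["d"], index=[3], dtype="string")
v_part = pd.Series(["z", "a"], index=[-1, 0], dtype="string")

# The result is built on the union of the indexes of u_part and v_part.
right = s.str.cat([u_part, v_part], join="right", na_rep="-")
print(right)
```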

    Indexing with `.str`

    You can use `[]` notation to directly index by position. If you index past the end of the string, the result will be a `NaN`.
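
A small sketch of positional indexing through `.str`; the strings are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["A", "Aaba", "dog", None], dtype="string")

first_chars = s.str[0]   # first character of each string
second_chars = s.str[1]  # "A" has no second character, so it is missing

print(first_chars)
print(second_chars)
```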

    Extracting substrings

    Extract first match in each subject (extract)

    Warning

    Before version 0.23, argument `expand` of the `extract` method defaulted to `False`. When `expand=False`, `extract` returns a `Series`, `Index`, or `DataFrame`, depending on the subject and regular expression pattern. When `expand=True`, it always returns a `DataFrame`, which is more consistent and less confusing from the perspective of a user. `expand=True` has been the default since version 0.23.0.

    The `extract` method accepts a regular expression with at least one capture group.

    Extracting a regular expression with more than one group returns a DataFrame with one column per group.
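
As a sketch, the pattern below has two capture groups, so the result has two columns. The data and pattern are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a1", "b2", "c3"], dtype="string")

# Two capture groups give two columns, labelled 0 and 1.
parts = s.str.extract(r"([ab])(\d)")
print(parts)
```

The third row does not match the pattern, so both of its columns are missing.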

    Elements that do not match return a row filled with `NaN`. Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating `get()` to access tuples or `re.match` objects. The dtype of the result is always object, even if no match is found and the result only contains `NaN`.

    Named groups (written with the `(?P<name>...)` syntax) and optional groups (a group followed by `?`) can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
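
A sketch of named and optional groups in `extract`. The group names `letter` and `digit` and the data are illustrative choices.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a1", "b2", "c3"], dtype="string")

# Named groups become column names.
named = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

# An optional group may be missing even when the rest of the pattern matches.
optional = s.str.extract(r"([ab])?(\d)")

print(named)
print(optional)
```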

    Extracting a regular expression with one group returns a `DataFrame` with one column if `expand=True`.

    It returns a Series if `expand=False`.
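
The sketch below runs the same single-group pattern with both values of `expand`. Data and names are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
s = pd.Series(["a1", "b2", "c3"], dtype="string")

as_frame = s.str.extract(r"[ab](\d)", expand=True)    # one-column DataFrame
as_series = s.str.extract(r"[ab](\d)", expand=False)  # Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__)
```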

    Calling on an `Index` with a regex with exactly one capture group returns a `DataFrame` with one column if `expand=True`.

    It returns an `Index` if `expand=False`.
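
A sketch of the single-group case on an `Index`. The labels and the group name `letter` are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
idx = pd.Index(["A11", "B22", "C33"])

as_frame = idx.str.extract(r"(?P<letter>[a-zA-Z])", expand=True)
as_index = idx.str.extract(r"(?P<letter>[a-zA-Z])", expand=False)

print(as_frame)
print(as_index)
```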

    Calling on an `Index` with a regex with more than one capture group returns a `DataFrame` if `expand=True`.

    It raises `ValueError` if `expand=False`.
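
To illustrate the multi-group case on an `Index`, this sketch captures the exception raised with `expand=False`. The data is illustrative.

```python
import pandas as pd

# Illustrative data, not taken from the original text.
idx = pd.Index(["A11", "B22", "C33"])
pattern = r"(?P<letter>[a-zA-Z])([0-9]+)"

as_frame = idx.str.extract(pattern, expand=True)  # DataFrame, two columns

error = None
try:
    # An Index cannot hold two columns, so this is rejected.
    idx.str.extract(pattern, expand=False)
except ValueError as exc:
    error = exc

print(as_frame)
print(type(error).__name__)
```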

    The table below summarizes the behavior of `extract(expand=False)` (input subject in first column, number of groups in regex in first row)

              1 group    >1 group
    Index     Index      ValueError
    Series    Series     DataFrame

    Extract all matches in each subject (extractall)

    Unlike extract (which returns only the first match), the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
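
A minimal sketch of the difference between extract and extractall. The sample Series and the pattern named two_groups are illustrative assumptions, not data from the original page.

```python
import pandas as pd

# Illustrative data: subject "A" contains two matches, the others one each.
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

# extract keeps only the first match: one row per subject.
first_only = s.str.extract(two_groups, expand=True)

# extractall keeps every match: one row per match, indexed by a MultiIndex
# whose last level is named "match".
every_match = s.str.extractall(two_groups)

print(first_only)
print(every_match)
print(every_match.index.names)
```

The second match of subject "A" can be selected with every_match.loc[("A", 1)].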

    When each subject string in the Series has exactly one match, extractall(pat).xs(0, level='match') gives the same result as extract(pat).
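
The equivalence can be checked with a small sketch. The Series and pattern here are assumptions chosen so that every subject has exactly one match.

```python
import pandas as pd

# Illustrative data: every subject contains exactly one match.
s = pd.Series(["a3", "b3", "c2"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

extract_result = s.str.extract(two_groups, expand=True)
extractall_result = s.str.extractall(two_groups)

# Take match number 0 of every subject, which drops the "match" level.
first_match = extractall_result.xs(0, level="match")

print(extract_result)
print(first_match)
```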

    Index also supports .str.extractall. It returns a DataFrame which has the same result as a Series.str.extractall with a default index (starts from 0).
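
As a sketch of this behaviour, the same strings can be passed through an Index and through a Series with a default index. The data and pattern are illustrative assumptions.

```python
import pandas as pd

two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"
values = ["a1a2", "b1", "c1"]

# Calling extractall on an Index returns a DataFrame, not an Index.
from_index = pd.Index(values).str.extractall(two_groups)

# A Series with the default integer index should give the same rows.
from_series = pd.Series(values, dtype="string").str.extractall(two_groups)

print(from_index)
print(from_series)
```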

    Testing for strings that match or contain a pattern

    You can check whether elements contain a pattern with str.contains, or whether elements match a pattern with str.match. To require that the entire string matches, use str.fullmatch (new in version 1.1.0).
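
A short sketch of the three checks side by side. The pattern and sample strings are assumptions picked so that the three methods disagree on some elements.

```python
import pandas as pd

pattern = r"[0-9][a-z]"
s = pd.Series(["1", "2", "3a", "3b", "03c", "4dx"], dtype="string")

contains = s.str.contains(pattern)    # pattern found anywhere in the string
match = s.str.match(pattern)          # pattern found starting at position 0
fullmatch = s.str.fullmatch(pattern)  # pattern covers the whole string

print(pd.DataFrame({"contains": contains, "match": match, "fullmatch": fullmatch}))
```

"03c" contains the pattern but does not start with it, and "4dx" starts with the pattern but has a trailing character, so the three columns differ on those rows.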

    Note

    The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

    The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.
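
The three re functions can be compared directly on plain Python strings. This is a sketch with assumed sample inputs; it uses only the standard library.

```python
import re

pattern = r"[0-9][a-z]"
subjects = ["3a", "4dx", "03c"]

# re.fullmatch: the whole string must match.
whole = [re.fullmatch(pattern, x) is not None for x in subjects]
# re.match: a match must begin at the first character.
at_start = [re.match(pattern, x) is not None for x in subjects]
# re.search: a match may begin anywhere.
anywhere = [re.search(pattern, x) is not None for x in subjects]

print(whole, at_start, anywhere)
```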

    Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False.
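
A sketch of the na argument, assuming a small Series with one missing value. Without na the missing element stays missing in the result; with na=False it is reported as False.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value at position 5.
s4 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"],
    dtype="string",
)

default_result = s4.str.contains("A")        # missing value propagates
na_as_false = s4.str.contains("A", na=False)  # missing value treated as False

print(default_result)
print(na_as_false)
```

Filling the missing value this way makes the result safe to use as a boolean mask.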

    Creating indicator variables

    You can extract dummy variables from string columns with str.get_dummies, for example when the values in each element are separated by a '|'.
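
A minimal sketch of str.get_dummies on '|'-separated values. The sample Series is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Each element lists zero or more labels separated by "|".
s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

# One indicator column per distinct label; a missing element gives a row of zeros.
dummies = s.str.get_dummies(sep="|")

print(dummies)
```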

    String Index also supports get_dummies which returns a MultiIndex.
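
The same call on an Index can be sketched as follows, again with assumed sample values. The indicator columns become the levels of the returned MultiIndex.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["a", "a|b", np.nan, "a|c"])

# The result is a MultiIndex with one level per distinct label.
dummies_index = idx.str.get_dummies(sep="|")

print(dummies_index)
print(dummies_index.names)
```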

    See also get_dummies().

    Method summary

    Method             Description

    cat()              Concatenate strings
    split()            Split strings on delimiter
    rsplit()           Split strings on delimiter working from the end of the string
    get()              Index into each element (retrieve i-th element)
    join()             Join strings in each element of the Series with passed separator
    get_dummies()      Split strings on the delimiter returning DataFrame of dummy variables
    contains()         Return boolean array if each string contains pattern/regex
    replace()          Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
    removeprefix()     Remove prefix from string, i.e. only remove if string starts with prefix.
    removesuffix()     Remove suffix from string, i.e. only remove if string ends with suffix.
    repeat()           Duplicate values (s.str.repeat(3) equivalent to x * 3)
    pad()              Add whitespace to left, right, or both sides of strings
    center()           Equivalent to str.center
    ljust()            Equivalent to str.ljust
    rjust()            Equivalent to str.rjust
    zfill()            Equivalent to str.zfill
    wrap()             Split long strings into lines with length less than a given width
    slice()            Slice each string in the Series
    slice_replace()    Replace slice in each string with passed value
    count()            Count occurrences of pattern
    startswith()       Equivalent to str.startswith(pat) for each element
    endswith()         Equivalent to str.endswith(pat) for each element
    findall()          Compute list of all occurrences of pattern/regex for each string
    match()            Call re.match on each element, returning matched groups as list
    extract()          Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
    extractall()       Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
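
A few of the summarized methods in use, as a sketch on an assumed two-element Series. Each call applies the corresponding Python string operation to every element.

```python
import pandas as pd

s = pd.Series(["a", "bc"], dtype="string")

repeated = s.str.repeat(3)             # like x * 3 for each element
zero_filled = s.str.zfill(3)           # like x.zfill(3) for each element
starts_with_a = s.str.startswith("a")  # like x.startswith("a") for each element

print(repeated.tolist(), zero_filled.tolist(), starts_with_a.tolist())
```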

    How do I convert a string to a DataFrame in Python?

    Method 1: create a pandas DataFrame from a string using StringIO(). StringIO wraps the string in a file-like object, so the data can be read with the pd.read_csv() function.
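
A minimal sketch of this approach, assuming the string holds comma-separated text with a header row. The column names and values are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content held in a plain Python string.
csv_text = "name,score\nalice,90\nbob,85\n"

# StringIO makes the string look like an open file to read_csv.
df = pd.read_csv(StringIO(csv_text))

print(df)
```

Other delimiters can be handled by passing sep to pd.read_csv().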

    How to convert object data type to string in Python?

    Everything in Python is an object, and objects can be converted to strings with str() and repr(). str() returns a readable string for any built-in object. repr() also returns a string, intended as an unambiguous representation of the object rather than a display form.
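
A short sketch of both conversions, followed by the pandas counterpart for a whole object-dtype column. The sample date and Series are assumptions for illustration.

```python
from datetime import date

import pandas as pd

d = date(2024, 1, 31)
as_str = str(d)    # readable form
as_repr = repr(d)  # unambiguous form

# For a pandas column of mixed objects, astype("string") converts each element.
mixed = pd.Series([1, "a", 2.5], dtype="object")
as_string = mixed.astype("string")

print(as_str)
print(as_repr)
print(as_string.dtype)
```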

    How can we convert a Python series object into a DataFrame?

    The Series.to_frame() method converts a Series object to a DataFrame. Parameter: name: if passed, it is used as the column name in place of the Series name (if the Series has one).
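
A minimal sketch of to_frame() with and without the name parameter. The Series contents and names are illustrative assumptions.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], name="letters")

# Without arguments the Series name becomes the column name.
df = s.to_frame()

# Passing name overrides the Series name.
renamed = s.to_frame(name="chars")

print(df)
print(renamed)
```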

    How to convert JSON string to DataFrame in Python?

    Load the JSON string with the json.loads() function, then pass the resulting object to json_normalize(), which returns a pandas DataFrame.
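
The two steps can be sketched as follows. The JSON payload is a made-up example with one nested field, which json_normalize flattens into a dotted column name.

```python
import json

import pandas as pd

# Hypothetical JSON text: a list of records with a nested "user" object.
json_text = '[{"id": 1, "user": {"name": "alice"}}, {"id": 2, "user": {"name": "bob"}}]'

records = json.loads(json_text)   # JSON string -> list of dicts
df = pd.json_normalize(records)   # nested dicts -> flat columns

print(df)
```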