How to use regex to find punctuation in Python

Text preprocessing is one of the most important tasks in Natural Language Processing (NLP). For instance, you may want to remove all punctuation marks from text documents before using them for text classification. Similarly, you may want to extract numbers from a text string. Writing manual scripts for such preprocessing tasks takes a lot of effort and is prone to errors. Regular expressions (regex) are available in most programming languages and make these preprocessing tasks much easier.

A Regular Expression is a text string that describes a search pattern which can be used to match or replace patterns inside a string with a minimal amount of code. In this tutorial, we will implement different types of regular expressions in the Python language.

To implement regular expressions, Python's re package can be used. Import it with the following command:

import re

Searching Patterns in a String

One of the most common NLP tasks is to search if a string contains a certain pattern or not. For instance, you may want to perform an operation on the string based on the condition that the string contains a number.

To search for a pattern within a string, the match and findall functions of the re package are used.

The match Function

Initialize a variable text with a text string as follows:

text = "The film Titanic was released in 1998"

Let's write a regex expression that matches a string of any length and any character:

result = re.match(r".*", text)

The first parameter of the match function is the regex pattern that you want to search for. The pattern is usually written as a raw string, i.e. the letter r followed by the pattern, so that backslashes in the pattern are not interpreted as Python escape sequences. The pattern is enclosed in single or double quotes like any other string.

The above regex expression will match the text string, since we are trying to match a string of any length and any character. If a match is found, the match function returns a match object (_sre.SRE_Match in older Python versions, re.Match in Python 3.7 and later), as shown below:

type(result)

Output:

_sre.SRE_Match

Now to find the matched string, you can use the following command:

result.group(0)

Output:

'The film Titanic was released in 1998'

If the match function finds no match, it returns None.
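Because match returns None when nothing matches, calling group on the result without a check can raise an AttributeError. A minimal sketch of a guard follows; first_match is a hypothetical helper name introduced here for illustration, not part of the re package.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches."""
    # Hypothetical helper for illustration; not part of the re package.
    result = re.match(pattern, text)
    if result is None:
        return None
    return result.group(0)

text = "The film Titanic was released in 1998"
print(first_match(r".*", text))   # The film Titanic was released in 1998
print(first_match(r"\d+", text))  # None, the string does not start with a digit
```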

The previous regex expression matches a string of any length and any character, so it also matches an empty string of length zero. To test this, set the text variable to an empty string:

text = ""

Now, if you again execute the following regex expression, a match will be found:

result = re.match(r".*", text)

Since we specified to match the string with any length and any character, even an empty string is being matched.

To match a string with a length of at least 1, the following regex expression is used:

result = re.match(r".+", text)

Here the plus sign specifies that the string should have at least one character.
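The difference between the two quantifiers can be checked side by side on an empty string. This is only a small sketch; the variable names are chosen here for illustration.

```python
import re

empty = ""

star = re.match(r".*", empty)  # zero or more characters
plus = re.match(r".+", empty)  # one or more characters

print(star is not None)  # True: ".*" accepts the empty string
print(plus is not None)  # False: ".+" needs at least one character
```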

Searching for Letters

The match function can be used to find letters within a string. Let's initialize the text variable with the following text:

text = "The film Titanic was released in 1998"

Now, to find all the letters, both uppercase and lowercase, we can use the following regex expression:

result = re.match(r"[a-zA-Z]+", text)

This regex expression matches any letter from lowercase a to lowercase z or from uppercase A to uppercase Z. The plus sign specifies that the match should contain at least one character. Let's print the match found by the above expression:

print(result.group(0))

Output:

The

In the output, you can see that only the first word, i.e. The, is returned. This is because the match function matches only at the beginning of the string and stops at the first character that does not fit the pattern. The pattern accepts lowercase and uppercase letters from a to z. After the word The there is a space, which is not a letter, so the matching stopped there and the expression returned just The.

However, there is a problem with this. If a string starts with a number instead of a letter, the match function returns None even if there are letters after the number. Let's see this in action:

text = "1998 was the year when the film titanic was released"
result = re.match(r"[a-zA-Z]+", text)
type(result)

Output:

NoneType

In the above script, we have updated the text variable so that it starts with a digit. We then used the match function to search for letters in the string. Though the text string contains letters, None is returned, since the match function only tries to match at the beginning of the string.

To solve this problem we can use the search function.

The search Function

The search function is similar to the match function, i.e. it tries to match the specified pattern. However, unlike the match function, it scans the whole string instead of matching only at the beginning. Therefore, the search function will return a match even if the string doesn't start with a letter but contains one elsewhere, as shown below:

text = "1998 was the year when the film titanic was released"
result = re.search(r"[a-zA-Z]+", text)
print(result.group(0))

Output:

was

The search function returns "was" since this is the first match found in the text string.
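The contrast between the two functions can be seen by running the same pattern through both. The sample sentence below is an assumption made for this sketch; span is a method of the match object that reports where the match starts and ends.

```python
import re

# Sample sentence chosen for illustration; it starts with digits on purpose.
text = "42 is the answer"
pattern = r"[a-zA-Z]+"

at_start = re.match(pattern, text)   # only tries position 0
anywhere = re.search(pattern, text)  # scans forward for the first match

print(at_start)           # None, because the string starts with a digit
print(anywhere.group(0))  # is
print(anywhere.span())    # (3, 5)
```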

Matching String from the Start

To check if a string starts with a specific word, you can use the caret, i.e. ^, followed by the word to match, with the search function as shown below. Suppose we have the following string:

text = "XYZ 1998 was the year when the film titanic was released"

If we want to find out whether the string starts with "1998", we can use the search function as follows:

result = re.search(r"^1998", text)
type(result)

The output is NoneType, i.e. search returned None, since the text string doesn't contain "1998" directly at the start.

Now let's change the content of the text variable so that "1998" is at the beginning, and then check whether "1998" is found at the beginning or not. Execute the following script:

text = "1998 was the year when the film titanic was released"
if re.search(r"^1998", text):
    print("Match found")
else:
    print("Match not found")

Output:

Match found

Matching Strings from the End

To check whether a string ends with a specific word or not, we can use the word in the regular expression, followed by the dollar sign. The dollar sign marks the end of the string. Take a look at the following example:

text = "1998 was the year when the film titanic was released"
if re.search(r"1998$", text):
    print("Match found")
else:
    print("Match not found")

In the above script, we tried to find out whether the text string ends with "1998", which is not the case.

Output:

Match not found

Now if we update the string and add "1998" at the end of the text string, the above script will return "Match found" as shown below:

text = "1998 was the year when the film titanic was released 1998"
if re.search(r"1998$", text):
    print("Match found")
else:
    print("Match not found")

Output:

Match found
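When the word to look for comes from user input, it may contain characters such as a dot or a dollar sign that have a special meaning in a pattern. One possible sketch wraps the two anchors in small helpers and passes the word through re.escape first. The helper names starts_with and ends_with and the sample sentence are assumptions made for this example.

```python
import re

def starts_with(word, text):
    # re.escape makes characters such as "." or "$" in `word` match literally.
    return re.search(r"^" + re.escape(word), text) is not None

def ends_with(word, text):
    return re.search(re.escape(word) + r"$", text) is not None

text = "1998 was the year when the film titanic was released"
print(starts_with("1998", text))  # True
print(ends_with("1998", text))    # False
```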

Substituting text in a String

So far we have been using regex to find out whether a pattern exists in a string. Let's move on to another regex function, i.e. substituting text in a string. The sub function is used for this purpose.

Let's take a simple example of the substitute function. Suppose we have the following string:

text = "The film Pulp Fiction was released in year 1994"

To replace the string "Pulp Fiction" with "Forrest Gump" (another movie released in 1994) we can use the sub function as follows:

result = re.sub(r"Pulp Fiction", "Forrest Gump", text)

The first parameter to the sub function is the regular expression that finds the pattern to substitute. The second parameter is the new text that you want as a replacement for the old text, and the third parameter is the text string on which the substitute operation will be performed.

If you print the result variable, you will see the new string.

Now let's substitute all the letters in our string with the character "X". Execute the following script:

text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "X", text)
print(result)

Output:

TXX XXXX PXXX FXXXXXX XXX XXXXXXXX XX XXXX 1994

It can be seen from the output that all the letters have been replaced except the capital ones. This is because we specified a-z only and not A-Z. There are two ways to solve this problem. You can either specify A-Z in the regular expression along with a-z as follows:

result = re.sub(r"[a-zA-Z]", "X", text)

Or you can pass the additional parameter flags to the sub function and set its value to re.I, which stands for case-insensitive matching, as follows:

result = re.sub(r"[a-z]", "X", text, flags=re.I)

More details about the different flags can be found in the official Python regex documentation.
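To confirm that the two approaches are interchangeable for plain English text, the results can be compared directly. This is a small sketch; the sample sentence is assumed for illustration.

```python
import re

# Sample sentence assumed for illustration.
text = "The film Pulp Fiction was released in year 1994"

explicit = re.sub(r"[a-zA-Z]", "X", text)            # both ranges spelled out
with_flag = re.sub(r"[a-z]", "X", text, flags=re.I)  # case-insensitive flag

print(explicit == with_flag)  # True
print(with_flag)              # XXX XXXX XXXX XXXXXXX XXX XXXXXXXX XX XXXX 1994
```

re.I is a short alias for re.IGNORECASE, so either spelling can be used.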

Shorthand Character Classes

There are different types of shorthand character classes that can be used to perform a variety of different string manipulation functions without having to write complex logic. In this section we will discuss some of them:

Removing Digits from a String

The regex expression to find digits in a string is \d. This pattern can be used to remove digits from a string by replacing them with an empty string of length zero as shown below:

text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"\d", "", text)
print(result)

Output:

The film Pulp Fiction was released in year

Removing Alphabet Letters from a String

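Letters can be removed in the same way, by replacing every match of a letter range with an empty string:

text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text, flags=re.I)
print(result)

Output:

        1994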

Removing Word Characters

If you want to remove all the word characters (letters, digits, and the underscore) from a string and keep the remaining characters, you can use the \w pattern in your regex and replace it with an empty string of length zero, as shown below:

text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."
result = re.sub(r"\w", "", text)
print(result)

Output:

 , '@ '  ?   % $  .

The output shows that all the numbers and letters have been removed.

Removing Non-Word Characters

To remove all the non-word characters, the \W pattern can be used as follows:

text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."
result = re.sub(r"\W", "", text)
print(result)

Output:

ThefilmPulpFictionwasreleasedinyear1994

From the output, you can see that everything has been removed (even spaces), except the numbers and letters.
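One detail worth checking is how the two shorthand classes treat the underscore, which counts as a word character. A short sketch on a sample string made up for this example:

```python
import re

# Made-up sample containing an underscore, a hyphen, a space and "!".
sample = "user_name-42 !"

without_word = re.sub(r"\w", "", sample)  # drop letters, digits and "_"
only_word = re.sub(r"\W", "", sample)     # drop everything else

print(repr(without_word))  # '- !'
print(repr(only_word))     # 'user_name42'
```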

Grouping Multiple Patterns

You can group multiple characters to match or substitute in a string using square brackets. In fact, we did this when we matched capital and small letters. Let's group multiple punctuation marks and remove them from a string:

text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."
result = re.sub(r"[,@\'?\.$%_]", "", text)
print(result)

Output:

The film Pulp Fiction was  released in   year 1994

You can see that the string in the text variable had multiple punctuation marks, and we grouped all of them in the regex expression using square brackets. The dot and the single quote are written with a backslash escape. Outside square brackets this matters for the dot, which otherwise matches any character. Inside square brackets the dot is already literal, so these escapes are harmless but optional.
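Listing punctuation marks by hand is easy to get wrong. One possible alternative, sketched below, builds the character class from string.punctuation in the standard library, which covers the ASCII punctuation marks only. The name punct_pattern and the sample sentence are assumptions made for this example.

```python
import re
import string

# Build a character class from every ASCII punctuation mark.
# re.escape protects characters such as "]", "\" and "-" that are
# special inside square brackets.
punct_pattern = re.compile("[" + re.escape(string.punctuation) + "]")

text = "Hello, world! Isn't regex (sometimes) fun?"

found = punct_pattern.findall(text)    # every punctuation mark, in order
cleaned = punct_pattern.sub("", text)  # the text with punctuation removed

print(found)    # [',', '!', "'", '(', ')', '?']
print(cleaned)  # Hello world Isnt regex sometimes fun
```

Non-ASCII punctuation such as curly quotes is not part of string.punctuation and would need to be added to the class separately.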

Removing Multiple Spaces

Sometimes, multiple spaces appear between words as a result of removing words or punctuation. For instance, in the output of the last example, there are multiple spaces between in and year. These spaces can be removed using the \s pattern, which matches a single whitespace character.

text = "The film Pulp Fiction was  released in   year 1994"
result = re.sub(r"\s+", " ", text)
print(result)

Output:

The film Pulp Fiction was released in year 1994

In the script above we used the expression \s+, which matches one or more whitespace characters.
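The whitespace shorthand covers more than the space character: tabs and newlines are matched as well. A brief sketch on a sample string made up for this example:

```python
import re

# Made-up sample mixing a tab, a run of spaces and a newline.
text = "The film\tPulp   Fiction\nwas released"

collapsed = re.sub(r"\s+", " ", text)
print(collapsed)  # The film Pulp Fiction was released
```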

Removing Spaces from Start and End

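Sometimes we have a sentence that starts or ends with a space, which is often not desirable. The following script removes spaces from the beginning of a sentence:

text = "         The film Pulp Fiction was released in year 1994"
result = re.sub(r"^\s+", "", text)
print(result)

Output:

The film Pulp Fiction was released in year 1994

Similarly, to remove spaces at the end of the string, the following script can be used:

text = "The film Pulp Fiction was released in year 1994      "
result = re.sub(r"\s+$", "", text)
print(result)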
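Both ends can also be handled in one call by joining the two patterns with the alternation operator. The sketch below uses a sample sentence assumed for illustration and compares the result with the built-in str.strip method.

```python
import re

# Sample sentence assumed for illustration, padded on both sides.
text = "   The film Pulp Fiction was released in year 1994   "

# "^\s+" matches leading whitespace, "\s+$" matches trailing whitespace;
# the "|" alternation lets a single call handle both.
trimmed = re.sub(r"^\s+|\s+$", "", text)

print(repr(trimmed))            # 'The film Pulp Fiction was released in year 1994'
print(trimmed == text.strip())  # True
```

For plain trimming, str.strip is simpler. The regex version is mainly useful when it is combined with other patterns.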

Removing a Single Character

Sometimes removing punctuation marks, such as an apostrophe, leaves behind a single character that has no meaning. For instance, if you remove the apostrophe from the word Jacob's and replace it with a space, the resulting string is Jacob s. Here the s makes no sense. Such single characters can be removed using regex as shown below:

text = "The film Pulp Fiction     s was released in year 1994"
result = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
print(result)

Output:

The film Pulp Fiction was released in year 1994

The script replaces any single lowercase or uppercase letter that has one or more spaces on each side with a single space.

Splitting a String

String splitting is another very important function. Strings can be split using the split function from the re package. The split function returns a list of split tokens. Let's split a string of words wherever one or more space characters are found, as shown below:

text = "The film      Pulp   Fiction was released in year 1994"
result = re.split(r"\s+", text)
print(result)

Output:

['The', 'film', 'Pulp', 'Fiction', 'was', 'released', 'in', 'year', '1994']

Similarly, you can use other regex expressions to split a string with the split function. For instance, the following split call splits a string of words wherever a comma is found:

text = "The film, Pulp Fiction, was released in year 1994"
result = re.split(r",", text)
print(result)

Output:

['The film', ' Pulp Fiction', ' was released in year 1994']
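A single pattern can also cover several delimiters and absorb the whitespace around them, which avoids the leading spaces that appear in tokens when splitting on a bare comma. A minimal sketch on a sample string made up for this example:

```python
import re

# Made-up sample with inconsistent delimiters and spacing.
text = "apples, oranges;bananas ,  grapes"

# Split on a comma or semicolon, swallowing any whitespace around it.
tokens = re.split(r"\s*[,;]\s*", text)

print(tokens)  # ['apples', 'oranges', 'bananas', 'grapes']
```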

Finding All Instances

The match function tries to match only at the beginning of the string, while the search function scans the whole string and returns the first match it finds.

For instance, if we have the following string:

text = "I want to buy a mobile between 200 and 400 euros"

We want to find all the numbers in this string. If we use the search function, only the first occurrence, i.e. 200, will be returned as shown below:

result = re.search(r"\d+", text)
print(result.group(0))

Output:

200

On the other hand, the findall function returns a list that contains all the matches, as shown below:

text = "I want to buy a mobile between 200 and 400 euros"
result = re.findall(r"\d+", text)
print(result)

Output:

['200', '400']

You can see from the output that both "200" and "400" are returned by the findall function.
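When the position of each match matters as well, the finditer function of the re package yields match objects instead of plain strings. The sketch below assumes a sample sentence containing the two numbers mentioned above.

```python
import re

# Sample sentence assumed for illustration.
text = "I want to buy a mobile between 200 and 400 euros"

# Each item yielded by finditer is a match object, so the position
# of every number is available alongside its text.
matches = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
print(matches)  # [('200', 31, 34), ('400', 39, 42)]

# findall returns strings; convert them when numeric values are needed.
numbers = [int(n) for n in re.findall(r"\d+", text)]
print(numbers)  # [200, 400]
```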

Conclusion

In this article we studied some of the most commonly used regex functions in Python. Regular expressions are extremely useful for preprocessing text that can then be used for a variety of applications, such as topic modeling, text classification, sentiment analysis, and text summarization.
