Introducing Python’s Parse: The Ultimate Alternative to Regular Expressions
Use best practices and real-world examples to demonstrate the powerful text parser library
This article introduces a Python library called parse for quickly and conveniently parsing and extracting data from text, serving as a great alternative to Python regular expressions. It also covers best practices with the parse library and a real-world example of parsing nginx log text.
Introduction
I have a colleague named Wang. One day, he came to me with a worried expression, saying he encountered a complex problem: his boss wanted him to analyze the server logs from the past month and provide statistics on visitor traffic.
I told him it was simple: just use regular expressions. For example, to analyze nginx logs, you can use the following regular expression.
content:
192.168.0.2 - - [04/Jan/2019:16:06:38 +0800] "GET http://example.aliyundoc.com/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932
regular expression:
(?P<ip>\d+\.\d+\.\d+\.\d+)( - - \[)(?P<datetime>[\s\S]+)(?P<t1>\][\s"]+)(?P<request>[A-Z]+) (?P<url>[\S]*) (?P<protocol>[\S]+)["] (?P<code>\d+) (?P<sendbytes>\d+)
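To see this in action, here is a minimal sketch using Python's re module; the variable names are mine, not from any original log-analysis script:

import re

log = ('192.168.0.2 - - [04/Jan/2019:16:06:38 +0800] '
       '"GET http://example.aliyundoc.com/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932')
pattern = (r'(?P<ip>\d+\.\d+\.\d+\.\d+)( - - \[)(?P<datetime>[\s\S]+)(?P<t1>\][\s"]+)'
           r'(?P<request>[A-Z]+) (?P<url>[\S]*) (?P<protocol>[\S]+)["] (?P<code>\d+) (?P<sendbytes>\d+)')

match = re.match(pattern, log)
if match:
    print(match.group('ip'))    # 192.168.0.2
    print(match.group('code'))  # 200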
But Wang was still worried, saying that learning regular expressions is too tricky. Although there are many ready-made examples online to learn from, he would still struggle whenever he ran into an uncommon text format.
Moreover, even if he could solve the problem this time, what if his boss asked for changes in the parsing rules when he submitted the analysis? Wouldn’t he need to fumble around for a long time again?
Is there a simpler and more convenient method?
I thought about it and said, of course, there is. Let’s introduce our protagonist today: the Python parse library.
Installation & Setup
As described on the parse GitHub page, it uses Python’s format() syntax to parse text, essentially serving as a reverse operation of Python f-strings.
Before starting to use parse, let’s see how to install the library.
Direct installation with pip:
python -m pip install parse
Installation with conda can be more troublesome, as parse is not in the default conda channel and needs to be installed through conda-forge:
conda install -c conda-forge parse
After installation, you can use from parse import * in your code to use the library’s methods directly.
Features & Usage
The parse API is similar to that of Python Regular Expressions, mainly consisting of the parse, search, and findall methods. Basic usage can be learned from the parse documentation.
Pattern format
The parse format is very similar to the Python format syntax. You can capture matched text using {} or {field_name}.
For example, in the following text, if I want to get the profile URL and username, I can write it like this:
content:
Hello everyone, my Medium profile url is https://qtalen.medium.com,
and my username is @qtalen.
parse pattern:
Hello everyone, my Medium profile url is {profile},
and my username is {username}.
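Running this pattern against the content is then a one-liner. A minimal sketch; the two strings mirror the content and pattern above:

from parse import parse

content = ("Hello everyone, my Medium profile url is https://qtalen.medium.com,\n"
           "and my username is @qtalen.")
pattern = ("Hello everyone, my Medium profile url is {profile},\n"
           "and my username is {username}.")

result = parse(pattern, content)
print(result["profile"])   # https://qtalen.medium.com
print(result["username"])  # @qtalen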
Or suppose you want to extract multiple phone numbers. The numbers are preceded by country codes in different formats, but each phone number itself has a fixed length of 11 digits. You can write it like this:
from parse import Parser

compiler = Parser("{country_code}{phone:11.11},")
content = "0085212345678901, +85212345678902, (852)12345678903,"

results = compiler.findall(content)
for result in results:
    print(result)
Or if you need to process a piece of text in an HTML tag, but the text is preceded and followed by an indefinite length of whitespace, you can write it like this:
content:
<div> Hello World </div>
pattern:
<div>{:^}</div>
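A quick sketch of this pattern; the ^ alignment flag tells parse to allow and strip the surrounding whitespace:

from parse import parse

result = parse("<div>{:^}</div>", "<div> Hello World </div>")
print(result[0])  # Hello World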
In the code above, {:11} refers to the width, which means capturing at least 11 characters, equivalent to the regular expression (.{11,})?. {:.11} refers to the precision, which means capturing at most 11 characters, equivalent to (.{,11})?. So when combined, {:11.11} captures exactly 11 characters, equivalent to (.{11,11})?. As a result, each match puts exactly 11 digits into the phone field, and whatever precedes them ends up in country_code.
The most powerful feature of parse is its handling of time text, which can be directly parsed into Python datetime objects. For example, if we want to parse the time in an HTTP log:
content:
[04/Jan/2019:16:06:38 +0800]
pattern:
[{:th}]
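A short sketch of this: the th format spec parses the timestamp into a timezone-aware datetime object directly, with no manual strptime call needed:

from parse import parse

result = parse("[{:th}]", "[04/Jan/2019:16:06:38 +0800]")
print(result[0])  # 2019-01-04 16:06:38+08:00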
Retrieving results
There are two ways to retrieve the results:

- For captures that use {} without a field name, you can directly use result.fixed to get the results as a tuple.
- For captures that use {field_name}, you can use result.named to get the results as a dictionary.
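For example:

from parse import parse

anonymous = parse("{} - {}", "hello - world")
print(anonymous.fixed)  # ('hello', 'world')

named = parse("{greeting} - {name}", "hello - world")
print(named.named)  # {'greeting': 'hello', 'name': 'world'}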
Custom Type Conversions
Although using {field_name} is already quite simple, the source code reveals that {field_name} is internally converted to the regular expression (?P<field_name>.+?). So parse still uses regular expressions for matching under the hood, where .+? matches one or more arbitrary characters in non-greedy mode.
However, we often want to match more precisely. For example, given the text “my email is xxx@xxx.com”, the pattern “my email is {email}” can capture the email. But sometimes we get dirty data, for example, “my email is xxxx@xxxx”, and we don’t want to capture it.
Is there a way to use regular expressions for more accurate matching?
That’s when the with_pattern decorator comes in handy.
For example, for capturing email addresses, we can write it like this:
from parse import Parser, with_pattern


@with_pattern(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
def email(text: str) -> str:
    return text


compiler = Parser("my email address is {email:Email}", dict(Email=email))

legal_result = compiler.parse("my email address is xx@xxx.com")  # legal email
illegal_result = compiler.parse("my email address is xx@xx")  # illegal email
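A quick check of the two results, continuing the snippet above: the legal address is captured, while the dirty one simply fails to parse.

print(legal_result["email"])  # xx@xxx.com
print(illegal_result)         # None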
Using the with_pattern decorator, we can define a custom field type, in this case Email, which will match the email address in the text. We can also use this approach to match other complicated patterns.
A Real-world Example: Parsing Nginx Log
After understanding the basic usage of parse, let’s return to Wang’s troubles from the beginning of the article and see how to parse the server log files from the past month.
Note: We chose NASA’s HTTP log dataset for this experiment, which is free to use.
The text fragments to be parsed are standard HTTP access log lines that look like this:

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
First, we precompile the parse expression. This way, when parsing a large file, we don’t have to compile the underlying regular expression for each line of text, which improves performance.
from parse import Parser, with_pattern
import pandas as pd
# https://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
FILE_NAME = "../../data/access_log_Jul95_min"
compiler = Parser('{source} - - [{timestamp:th}] "{method} {path} {version}" {status_code} {length}\n')
Next, the process_line method is the core of this example. It uses the precompiled expression to parse the text, returning the named fields if there is a match and an empty dictionary if not.
def process_line(text: str) -> dict:
    parse_result = compiler.parse(text)
    return parse_result.named if parse_result else {}
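For instance, feeding it a single sample line yields a dictionary of the named fields (the output in the comment is abridged):

print(process_line('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
                   '"GET /history/apollo/ HTTP/1.0" 200 6245\n'))
# {'source': '199.72.81.55', 'timestamp': datetime.datetime(1995, 7, 1, 0, 0, 1, ...), 'method': 'GET', ...}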
Then, the read_file method processes the text line by line, relying on the file object’s lazy iteration to keep memory usage low. However, due to the disk’s small (typically 4 KB) block-size reads, this line-by-line approach may not guarantee the best throughput.
def read_file(name: str) -> list[dict]:
    result = []
    with open(name, 'r') as f:
        for line in f:
            obj: dict = process_line(line)
            result.append(obj)
    return result
Since we need to perform statistics on the log files, we use the from_records method to construct a DataFrame from the matched results.
def build_dataframe(records: list[dict]) -> pd.DataFrame:
    result: pd.DataFrame = pd.DataFrame.from_records(records, index='timestamp')
    return result
Finally, in the main method, we put all the methods together and count the occurrences of each status_code:
def main():
    records: list[dict] = read_file(FILE_NAME)
    dataframe = build_dataframe(records)
    print(dataframe.groupby('status_code').count())


if __name__ == "__main__":
    main()
That’s it. Wang’s troubles have been easily solved.
Best Practices with parse Library
Although the parse library is simple enough that there is only a little to write about, there are still some best practices to follow, just as with regular expressions.
Readability and maintainability
To capture text efficiently and keep expressions maintainable, it is recommended to always use {field_name} instead of {}. This way, you can directly use result.named to obtain key-value results.
Using Parser(pattern) to precompile the expression is recommended, rather than calling parse(pattern, text) directly.
On the one hand, this improves performance. On the other hand, when using Custom Type Conversions, it keeps the pattern and the extra_types argument together, making them easier to maintain.
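Here is a minimal sketch of this practice; the pattern and the sample lines are invented for illustration:

from parse import Parser

# Compile the pattern once, then reuse it for every line
compiler = Parser("{name}: {value:d}")

for line in ["cpu: 90", "mem: 70"]:
    print(compiler.parse(line).named)  # {'name': 'cpu', 'value': 90} ...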
Optimizing performance for large datasets
If you look at the source code, you can see that {} and {field_name} use the regular expressions (.+?) and (?P<field_name>.+?) for capturing, respectively. Both expressions use non-greedy mode. So when you use with_pattern to write your own expressions, also try to use non-greedy mode.
At the same time, when writing with_pattern, if you use () for capture grouping, use regex_group_count to specify the number of groups, like this: @with_pattern(r'((\d+))', regex_group_count=2).
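Here is a runnable sketch of that decorator usage; the Number type name and the converter function are invented for illustration:

from parse import Parser, with_pattern


@with_pattern(r'((\d+))', regex_group_count=2)
def number(text: str) -> int:
    return int(text)


compiler = Parser("total: {:Number}", dict(Number=number))
print(compiler.parse("total: 42")[0])  # 42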
Finally, if a group is not needed in with_pattern, use a non-capturing group (?:x) instead. For example, @with_pattern(r'(?:<input.*?>)(.*?)(?:</input>)', regex_group_count=1) means you want to capture the content between input tags; the input tags themselves will not be captured.
Conclusion
In this article, I changed my usual way of writing lengthy pieces. By solving a colleague’s problem, I briefly introduced the usage of the parse library. I hope you like this style.
This article does not repeat the detailed usage documentation from the official website. Instead, it introduces some best practices and performance optimization tips based on my experience.
At the same time, I explained in detail the use of the parse library to parse nginx logs with a practical example.
As the new series title suggests, besides improving code execution speed and performance, using various tools to improve work efficiency is also a performance enhancement.
This article helps data scientists simplify text parsing and spend time on more critical tasks. If you have any thoughts on this article, feel free to leave a comment and discuss.