In this blog post I will show you that the “Log Out” button does
not actually log the user out.
The Log Out Button Does Not Log You Out
If you click on “Sair”, which translates to “Log Out” you’re not
actually logged out. It doesn’t matter if you try to login straight away,
try to login in a new tab or try to login in a new window. The only
way to actually log out is to close the browser and open it again.
The video below demonstrates the issue in action.
I’ve tested this in Google Chrome on Android and the problem was present
there too. This is an issue. Let’s say that the user logs on a device that’s
not his. After he’s finished with his tasks he clicks on “Log Out”,
thinking that his session has been terminated. This means that the
attacker can simply open the website, click on “Log In” and
obtain access to the victim’s account.
A little bit of exploring reveled the problem. The only think that
clicking “Log Out” does is redirecting the user’s browser to
https://www.autenticacao.gov.pt/c/portal/logout. This page, in turn,
redirects the user back to the home page and sets some
cookies to an empty string, namely COMPANY_ID, ID, LOGIN, PASSWORD and
REMEMBER ME. I’m not sure what they are, since I haven’t seen them
used anywhere else on the site. I assume it’s some code that was ported
for somewhere, but was never adjusted to a different authentication
mechanism (I just hope they weren’t storing the password in plain text
Figure 1: Cookies Set By The Log Out Page
What the page fails to do, is clean the ASPXAUTH and ASP.Net_SessionId
cookies. In ASP.NETASPXAUTH is a cookie used to identity if the user is
authenticated and the ASP.Net_SessionId is a cookie used to identify the
user’s session on the server.
Those cookie’s expiration date is not set, which means that they will have
the life time of a session cookie, meaning that they will get removed once
the client is shut down. This explains why closing and opening the browser
again actually logs the user out.
The solution to this problem is to simply clear the ASPXAUTH and ASP.Net_SessionId on the log out page.
In this blog post I demonstrated that the “Log Out” button does not actually
log the user out, since it fails to clean the authentication cookie.
This opens a window to unauthorized account access.
This blog post ends the series of blog posts related to the security
weaknesses and vulnerabilities found in the Portuguese government’s
authentication system autenticacao.gov.pt and Chave Móvel Digital.
The issues described throughout these blog posts were not a result of a
thorough analysis, but rather what I found in about 30 minutes.
This leads me to believe that a more detailed study of the system’s
security is likely to reveal many more problems.
I hope that the described issues will be fixed very soon,
since they put millions of Portuguese citizens at risk. The agency
responsible for these systems should take security more seriously
and not as an afterthought.
In this blog post I will be exploring the information leakage regarding
if a phone number is registered or not. The system has a protection
measure which can be easily bypassed. I will also provide proof-of-concept
Determining If A Phone Number Is Registered With An Account
If you enter a wrong phone number and PIN combination, you will be
presented with the following message:
Figure 1: The System Ties To Hide Information
The error message reads:
O número de telemóvel ou o PIN estão errados ou registo inexistente
Which translates to:
Either the phone number or the PIN are incorrect or the account
is not registered.
This might seem like a good thing: the system tries to give the
attacker the least information possible. The problem is that
this is useless, since it can be easily bypassed.
The issue is that if the phone number is registered, after 3 failed
attempts, your account will be temporary locked. If the phone number is not
registered, no lock in will occur.
Figure 2: Error Message When The Number Is Registered
The good news is that if you enter the wrong PIN is entered again,
the temporary lockout period will be extended. Sometimes, you might be
asked to solve a captcha. My personal experience, however, suggests that
some simple machine learning can be used to bypass it. A better solution
here would be to use a more sophisticated solution, like Google’s ReCaptcha.
Figure 3: A Simple Captcha Is Used
Once you know that a phone number is registered in the system, you
could apply social engineering attacks to obtain access to the user’s account.
For example, in Portugal, you can send SMS messages with any Sender ID.
Services like Amazon SNS are cheap
and easy to use. When you receive your second-factor authentication code,
it comes from the sender CMD (it stands for Chave Móvel Digital, or
Digital Mobile Key, in English). What an attacker could do is send the
user an SMS message with the sender id CMD saying that there is an issue
with the user’s account and that he needs to follow the provided link
to solve it. The link could perhaps exploit the XSS vulnerability described in the previous post
to get the user’s PIN.
While not being a critical issue, it renders the existing protection system
useless. What they could instead do is treat all of the numbers the same:
lock the access to all of them, regardless of the fact if they are
registered or not. This should, however, be done with some care.
If you follow the naive approach of inserting every entered phone
number into the database and counting the number of login attempts,
the attacker might be able to flood the database with numbers. Careful
system design is needed here to prevent DoS attacks.
The same trick works with e-mail and Twitter authentication. These two
authentication methods are, however, not enabled by default.
A quick reverse-engineering of the authentication process showed that in order to access the login page, you need to go thought a specific sequence of steps, you can’t just go to the login page directly.
In that sequence of steps, your browser exchanges cookies with the server.
Among those, there are 3 different session ids that you get through this
With every POST request a humanCheck value is sent. This value is
set on document load:
I initially thought that that was a unique
value generated on every request, acting as a nonce. Further analysis
showed that this value is fixed. I’ve tried
with different IP’s, different browsers and the value remained the
As demonstrated in this blog post, while the system tries to be ambiguous by not
exposing if the phone number is registered with an account or not, it can be
In this blog post, I will be exploring the Reflected Cross-Site-Scripting (XSS) vulnerability present in the Portuguese government’s authentication
system’s website www.autenticacao.gov.pt.
The XSS vulnerability is present in the Chave Móvel Digital (CMD) login page.
The XSS Vulnerability
At the moment of writing this post, there are 3 ways that you can authenticate
the second-factor code is sent to your mobile phone number
the second-factor code is sent to your e-mail
the second-factor code is sent to your as a private Twitter message
Figure 1: Chave Movel Digital Authentication Options
I’m going to choose the phone number authentication method, since
this is the only one that’s required. The other two are not activated by
default. The XSS vulnerability seems to only work on the
mobile phone number and Twitter authentication methods. E-mail authentication
seems to be doing input sanitization correctly. There are two separate places where you can login:
the “Personal Area”, which only allows mobile phone number login, and the
regular one, which allows the 3 authentication methods described above.
The XSS vulnerability works in both places.
This is how the main login page looks like:
Figure 2: Digital Mobile Key login page
To login to the portal, you need to enter your mobile phone number and a
numeric PIN. The Mobile phone number field is the one that will be used
to exploit the XSS vulnerability. Throughout this section I will be using
1234 as the PIN number.
Now let’s see what happens when I try to enter a phone number that consists
only of letters (Hello, World"):
Figure 3: Digital Mobile Key login page
And now let’s click on “Authenticate”:
Figure 4: User input gets copied without validation
Even though the page says The phone field only allows numbers, at the bottom, it
still copies the input that I provided verbatim to the same field.
At this point I was pretty sure that the page is vulnerable to XSS.
So I looked at the source code to see how it can be exploited:
Figure 5: The input is copied to the value attribute
Okay, so the input is you provided initially is copied to the value
attribute of the inputMobile element. So let’s try to show a very simple
XSS that would inject a new span element into the page and would comment
out some of the HTML. For this, in the Mobile phone number field, we’ll
input the following text: " <span> Hello, World </span> <!--.
Figure 1: A simple XSS input
And here is the result:
Figure 6: Successful result of an XSS injection
As you can see, our custom HTMLwas successfully injected into the page.
The rest of the page wasn’t commented out only because there is another
closing HTML comment tag (-->) present in the page’s source.
Figure 1: Resulting HTML from the XSS attack
Below is a video demonstration of how this vulnerability can be exploited
Whenever the text in the “Mobile phone number” field is
changed, the ShowPrint() function is called. The only thing that it does is
copies the value from the inputMobile element to the MainContent_hiddenMobile attribute:
Why this is done is not obvious, since both of the fields get POSTed to the
the backend when you click on “Authenticate”:
Figure 1: Both of the attribute values get sent
However, if we manually change the value of the MainContent_hiddenMobile
field to Hello, Reader and leave Hello, World in the inputMobile
field, we can see that the value that that’s placed in the “Mobile phone number” box in the failed login page is the one from MainContent_hiddenMobile.
Figure 7: Manually changing the value of the hidden input
Figure 8: The hidden input value is the one that gets echoed back
In this blog post I showed you the POST XSS vulnerability present
in the Portuguese government’s autenticacao.gov.ptChave Móvel Digital
In the next blog post I will show you how you can find out if a
mobile phone number is registered with Chave Móvel Digital or not,
making the existing ambiguous “Either the mobile number or the PIN are
incorrect” message useless and opening the door for various social
Back in January 2018, after about half an hour of exploring
the “secure” authentication system provided by the Portuguese government
I found various security issues and vulnerabilities. I will briefly
explore them in a series of blog posts.
In this first blog post I will explain what is autenticacao.gov.pt and
Chave Móvel Digital, as well as the sensitive information that they
are supposed to be restricting the access too. I will also explore the
overall weak security that the system has.
This whole series of blog posts is a result of my findings in a short
period of time. It’s a compilation of the notes that I took, as a draft
to be submitted to responsible agency, as part of responsible disclosure.
A more thorough security audit is likely to reveal a lot
more problems. If about 30 minutes is all to took to find those problems,
what more would a malicious, motivated attacker would find.
I’m hoping that the issues will be fixed as soon as possible.
On January 21st I emailed firstname.lastname@example.org (email of the responsible agency),
describing the vulnerabilities and security issues, as well as offering
a more detailed description draft and Proof-Of-Concept code, but
obtained no answer.
What Is Autenticacao.gov.pt and Chave Movel Digital (CMD)?
O sítio oficial dos meios de identificação eletrónica,
assinatura digital e autenticação segura do Estado.
which translates to:
The official site of the means of electronic identification, digital signature
and state's secure authentication.
Basically, it’s an authentication system that allows the Portuguese citizens
to authenticate to administrative public services. Once authenticated, you can, among other things:
register a birth of a child
register a vehicle
register a company
request a birth certificate
obtain the criminal record of an individual or a company
submit/consult revenue of an individual or a company
New functionalities are constantly added. As you can see, it is a very
useful system, since it allows you to save a lot of time. It allows you
to do many things from the comfort of your home and in minutes, instead
of spending hours at the pubic administrative services.
Despite being very useful, it does expose some sensitive personal
information and allows you to do some sensitive actions. If this
gets into the wrong hands, it can lead to many problems, such as
For this reason, it is very important to have a secure authentication system.
This is what Chave Móvel Digital (CMD), which translates to Digital
Mobile Key, supposedly does. The main problem with it is that its security
Bad Security From The Ground Up
So now let’s briefly talk about the security of the Chave Móvel Digital authentication system.
In December of 2017 me and my girlfriend went to a Christmas fair. Among other
things, there were various company booths. One of them was VR-related, so
we decided to go in and check it out. After playing around with the Microsoft’s
HoloLens for a little bit, we were approached by a lady there that asked us
if we have already activated the Chave Móvel Digital system. After
explaining to me what that was, she told me that it soon would be mandatory
to have it activated. I could either do it there now or go to the public
administrative organs, wait in queue for hours and do it there.
Motivated by the interest in the security of this new system and the potential
future time savings (bureaucracy is a big issue in Portugal), I decided
to activate it there.
The Authentication Process
Let’s look at the authentication process. To login using the Chave Móvel Digital system you have to enter your
mobile phone number and your PIN. If the entered details are correct, a
code is sent to your mobile number, which you have to enter to login.
The issues here are the following:
PIN can only be 4-8 numeric digits
SMS as the only second factor authentication option
The first one is the most serious one. PIN is essentially your password.
The issue here is that it only be numeric its length is limited to
be between 4 and 8 characters.
Let’s look at the numbers. 8 digit numeric-only passwords means that
there are 10^8 = 100000000 possible combinations. This is nothing for
modern computer. For example, let’s see how long will it take to generate
all of the possible combinations on my machine. Assuming that the file
gen.py contains the following code:
Here is what time python gen.py gives us:
$ time python gen.py
python gen.py 3.20s user 0.01s system 99% cpu 3.208 total
While this is not the most accurate way of measuring the process execution
time, it’s good enough to transmit the idea of how quick modern computers are
at generating the entire key space. This is to give you an intuition of
how small the number of possible PINs is. It’s 2018 and the Portuguese government
seems to think that 4 to 8 digit passwords are secure.
Let’s compare this to an 8 character long password that allows
numbers (10 different: 0-9), letters (52 different: a-z and A-Z) and special characters (32 different). For every position we would have
10 + 52 + 32 = 94 possible values. This means that there would be a total of
94^8=6095689385000000 possible combinations. If we use the rule of three
to estimate how long it would take my computer to generate the entire key
space, we would see that it would come out to more than 6 years.
As an interesting note, when I was signing up with the system, the lady
told me that from her experience people usually choose 4 digit PINs, so
that they’re the same as their phone PINs. This seems like a reasonable
assumption to take. In this case, the simple task is made even simpler, since
the number of possible combinations is lowered to 10^4 = 10000:
Here is what time python gen.py gives us:
$ time python gen.py
python gen.py 0.01s user 0.01s system 98% cpu 0.015 total
I presented the time measurements just to give you an idea of how
small the possible key space is. Of course, we can’t conclude that the PIN
will be bruteforced in 3 seconds. The system limits the number of
attempts that you can perform. However, if you think that this makes bruteforcing
impossible, you’re wrong. I will now demonstrate why.
Let’s assume that the number of login attempts is limited to 3 (in reality,
autenticacao.gov.ptallows more than 3 attempts). Let’s also assume that a lot of people will in fact
use a 4 digit long PIN. This means that for each individual user there will
be a 3/10000 chance of guessing his PIN. Pretty unlikely, right?
Well, this number also means that for every 10000 accounts the attacker
will guess the PIN of 3 of them.
To make the things worse, the numbers above are assuming a totally random
distribution of PINs, which is not true. A data analysis of almost 3.4 million four digit passwords showed that 1234, 0000 and 1111
make up almost 20% of used passwords. Considering this, the probability
of guessing the PIN of a single account would be about 2000/10000.
Therefore, for every 10000 accounts the attacker would guess the PIN of 2000 of them.
Why doesn’t the system allow passwords that contain numbers, letters and special
characters? Maybe it comes down to a legacy issue, but it is definitely possible to write
middleware that would take care of that. The system is supposed to protect
very sensitive data, after all.
As an example, let me present to you a very simple solution that would make
the existing system more secure. It would do it so by allowing the users to use
a password that consists of numbers, letters and special characters, while
keeping the core legacy backend unchanged. I’m going to take the following
the backend requires a 4-8 digit PIN
all of the PINs are stored in plain text
A very simple solution would be to do the following:
When the user is registering/changing his password: hash the password provided by the user with the phone number and store it in the database. Generate an 8 digit PIN and store it. This PIN will never be revealed to the user.
When the users tries to log in: hash the provided password with the phone
number and compare this result to the one that you have stored in the
if they are equal, send an “OK” to the backend, which will then
obtain the PIN from the database and use it as needed
if they are not equal, fail the authentication
I am in no way suggesting that this is the best solution, because it is not.
It is, however, a very simple one, that would require very little changes
to the overall system and keeping the core legacy backend unchanged.
All you would have to do is add a new table to the database (this way you
will keep your existing tables unchanged) for the password hash and
change the authentication code in the middlware between the client and the
backend system. Even if the system is different from the one in my assumptions,
you can adopt the solution with relative ease.
The second point relates to only allowing SMS as the second-factor
authentication. While being a lot better than not using two-factor
authentication at all, it’s not the most secure option.
Let me give you two examples. By using some social engineering, if an attacker
can get access to your personal information, they can contact your phone
company and move your phone number to a new SIM. By exploiting the vulnerabilities in the
SS7 protocol the attacker can
read all of your text messages.
Allowing the use of other authentication options, such as Google Authenticator
would make the second-factor authentication a lot more secure. In general,
I have a feeling that the engineers assumed that there was no need for
strong passwords, since two-factor authentication is required and
only the user has the access to his mobile phone number. This assumption
is, of course, incorrect. While it is true, that if you’re using a good
second-factor authentication, it’s safe to weaken your password,
this does not apply to making your password a trivial 4-8 digit one.
Now, this just smelled bad security. I was certain that I could more problems. I was correct. The security issues that I’m going to be describing in this series of blog posts were found after about 30 minutes of
There is now an option to receive your second-factor code
as a private message on Twitter or be sent to your e-mail address. Those
are not enabled by default and you have to do it yourself. I am not
sure when those options were added. In any case,
when logging in into your “Personal Area”, you only have the option of
using your mobile phone number.
In this first post of the four-part series I described why the security
of the authentication system is flawed. This is also what motivated me
to explore further: I was sure that it wouldn’t be hard to find more issues,
since the “security” of this “secure authentication system” did seem
like an afterthought.
In the next blog post, I will be describing the multiple XSS vulnerabilities
present in the authentication page.
A while back I was researching the most efficient way to check if a number is
prime. This lead me to find the following piece of code:
I was intrigued. While this might not be the most efficient way, it’s
certainly one of the less obvious ones, so my curiosity kicked in.
How on Earth could a match for the .?|(..+?)\1+ regular expression tell that
a number is not prime (once it’s converted to its unary representation)?
If you’re interested, read on, I’ll try to dissect this regular expression and
explain what’s really going on. The explanation will be programming language
of the Java code above and explain why they are slightly different.
I will explain how the regular expression ^.?$|^(..+?)\1+$ can filter out
any prime numbers. Why this one and not .?|(..+?)\1+ (the one used in Java code example above)?
Well, this has to do with the way String.matches() works, which I’ll explain later.
While there are some blog posts on this topic, I found them to not go deep enough
and give just a high level overview, not explaining some of the important details well enough.
Here, I’ll try to lay it out with enough detail so that anyone can follow and understand.
The goal is to make it simple to understand for any one - whether you are a regular expression
guru or this is the first time you’ve heard about them, anyone should be able to follow along.
1. Prime Numbers and Regular Expressions - The Theory
Let’s start at a higher level. But wait, first, let’s get every one on the same
page and begin with some definitions. If you know know what a prime number is
and are familiar with regular expression, feel free to skip this section.
I will try to explain how every bit of the regular expression works, so that even
people who are new or unfamiliar with them can follow along.
First, a prime number is any natural number greater than 1that is only divisible by 1 and the number itself, without leaving a remainder.
Here’s a list of the fist 8 prime numbers: 2, 3, 5, 7, 11, 13, 17, 19.
For example, 5 is prime because you can only divide it by 1 and 5
without leaving a remainder. Sure we can divide it by 2, but that
would leave a remainder of 1, since 5 = 2*2 + 1.
The number 4, on the other hand, is not prime, since we can divide it
by 1, 2 and 4 without leaving a remainder.
Okay, now let’s get to the regular expression (A.K.A. regex) syntax. Now, there are quite a few regex
flavors, I’m not going to focus on any specific one, since that is not the
point of this post. The concepts described here
work in a similar manner in all of the most common flavors, so don’t worry
about it. If you want to learn more about regular expressions, check out
Regular-Expressions.info, it’s a great
resource to learn regex and later use it as a reference.
Here’s a cheatsheet with the concepts that will be needed for the explanation
^ - matches the position before the first character in the string
$ - matches the position right after the last character in the string
. - matches any character, except line break characters (for example, it does not match \n)
| - matches everything that’s either to the left or the right of it. You can think of it as an or operator.
( and ) delimit a capturing group. By placing a part of a regular expression between parentheses, you’re grouping that part of the regular expression together. This allows you to apply quantifiers (like +)
to the entire group or restrict alternation (i.e. “or”: |) to part of the
regular expression. Besides that, parentheses also create a numbered capturing group, which you can refer to later with backreferences (more on that below)
\<number_here> - backreferences match the same text as previously matched by a capturing group. The <number_here> is the group number (remember the discussion above? The one that says that parentheses create a numbered capturing group? That’s where it comes in). I’ll give an example to clarify things in a little bit, so if you’re confused, hang on!
+ - matches the preceding token (for example, it can be a character or a group of characters, if the preceding token is a capturing group) one or more times
* - matches the preceding token zero or more times
if ? is used after + or * quantifiers, it makes that quantifier non-greedy (more on that below)
Capturing Groups and Backreferences
As promised, let’s clarify how capturing groups and backreferences work together.
As I mentioned, parentheses create numbered capturing groups. What do I
mean by that? Well, that means that when you use parentheses, you create a
group that matches some characters and you can refer to those matched characters later on. The numbers are given to the groups in the order they
appear in the regular expression, beginning with 1. For example, let’s say
you have the following regular expression: ^aa(bb)cc(dd)$. Note, that in this
case, we have 2 groups. They are numbered as follows:
This means that we can refer to the characters matched by them later using
backreferences. If we want to refer to what is matched by (bb), we use
\1 (we use 1 because we’re referring to the capturing group #1). To refer to the characters matched by (dd) we use \2. Putting that together,
the the regular expression ^aa(bb)cc(dd)\1$ matches the string aabbccddbb.
Note how we used \1 to refer to the last bb. \1 refers to what was matched by the group (bb), which in this case, was the sting bb.
Now note that I emphasize on what was matched. I really mean the
characters that were matched and not ones that can be matched.
This means, that the regular expression ^aa(.+)cc(dd)\1$does match the sting aaHELLOccddHELLO, but does not match the sting
aaHELLOccddGOODBYE, since it cannot find what was matched by the group #1 (in this case it’s the character sequence HELLO) after the character sequence dd (it finds GOODBYE there).
Greedy and Non-Greedy Quantifiers
If you remember correctly, in the cheatseheet above, I mentioned that ?can be used to make the preceding quantifier non-greedy.
Well, okay, but what does that actually mean? + is greedy quantifier, this
means that it will try to repeat the preceding token as many times as possible, i.e. it will try to consume as much input as it can. The same
is true for the * quantifier.
For example, let’s say we have the string <p>The Documentary</p> (2005) and
the regular expression <.+>. Now, you might think that it will match
<p>, but that’s not true. The matched string will actually be <p>The Documentary</p>. Why is that? Well, that has to do with the fact mentioned
above: the +will try to consume as much input as it can, so that means
that it will not stop at the first >, but rather at the last one.
Now how do we go about making a quantifier non-greedy? Well, you might
be already tired of hearing that (since I’ve already mentioned it twice),
but in order to make a greedy quantifier non-greedy, you put a question mark (?) in front of it.
It’s really as simple as that. In case you’re still confused, don’t worry,
let’s see an example.
Suppose we have the same string: <p>The Documentary</p> (2005), but this
time, we only want to match what is between the first < and >.
How would we go about that? Well, all we have to do is add ? in front of
the +. This will lead us to the <.+?> regex. “Uhhh, okay…”, you might wonder,
“But what does that actually do?”. Well, it will make the + quantifier
non-greedy. This means that it will make the quantifier consume as little input as possible.
Well, in our case, the “as little as possible” is <p>, which is
exactly what we want! To be precise, it will match both of the p’s: <p> and
</p>, but we can easily get what we want by asking for the fist match (<p>).
A Little Note On ^ and $
Since we’re on it, I’ll take a moment to quickly explain what the ^ and
$ actually do. If you remember correctly, ^ matches the position right
before the first character in the string and $ matches the position right after the last character in the string. Note how in both of the regular
expressions above (<.+> and <.+?>) we did not use them. What does that
mean? Well, that means that the match does not have to begin at the start of the string and end at the end of the string.
Taking the second, non-greedy, regex (<.+?>) and the sting The Game - <p>The Documentary</p> (2005), we would still obtain our expected matches (<p> and </p>),
since we’re not forcing it to begin at the beginning of the string and end
at the end of the string.
2. The Regular Expression That Tells If A Number Is Prime
Phew, so we’re finally done with the theoretical introduction and now, since we’ve
already have everything we need under the belt, we’re ready to dive into
the analysis of how the ^.?$|^(..+?)\1+$ regular expression can match
non-prime numbers (in their unary form).
You can ignore the ? in the regular expression, it’s there for performance reasons (explained below) -
it makes the + non-greedy. If it confuses you, just ignore it and consider
that the regex is actually ^.?$|^(..+)\1+$, it works as well, but it’s slower (with some exceptions, like when the number is prime, where the ? makes no difference whatsoever).
After explaining how this regular expression works, I’ll also explain what that ? does there,
you shouldn’t have any trouble understanding it after you understand the inner workings of this
All of the discussion below assumes that we have the number represented in
its unary form (or base-1, if you prefer). It doesn’t actually have to be represented as
a sequence of 1s, it can be a sequence of any characters that are
matched by .. This means that 5 does not have to be represented as
11111, it might as well be represented as fffff or BBBBB. As long
as there are five characters, we’re good to go. Please note, that the
characters have to be the same, no mixtures of characters are allowed,
this means that we cannot represent 5 as ffffB, since here we have a mixture
of two different characters.
High Level Overview
Let’s begin with a high level overview and then dive into the details.
Our ^.?$|^(..+?)\1+$ regular expression consists of two parts: ^.?$ and
As a heads-up, I just want to say that I’m lying a little in the explanation
in the paragraph about the ^(..+?)\1+$ regex. The lie has to do with the order in which the regex engine
checks for multiples, it actually starts with the highest number and goes to the lowest,
and not how I explain it here. But feel free to ignore that distinction here,
since the regular expression still matches the same thing, it just does it in more
steps (so I’ll actually be explaining how ^.?$|^(..+?)\1+?$ works: notice the extra ? after the +.
I’m doing this because I believe this explanation is less verbose and easier to understand.
And don’t worry, I explain how I lied and reveal the shocking truth
later on, so keep on reading. Well, maybe it’s not really that shocking, but
I wanna keep you engaged, so I’ll stick to that naming.
The regex engine will first try to match ^.?$, then, if it fails, it will try to
match ^(..+?)\1+$. Note that the number of characters matched corresponds to the matched number,
i.e. if 3 characters are matched, that means that number 3 was matched, if
26 characters are matched, that means that the number 26 was matched.
^.?$ matches strings with zero or one characters (corresponds to the numbers
0 and 1, respectively).
^(..+?)\1+$ first tries to match 2 characters (corresponds to the number 2),
then 4 characters (corresponds to the number 4), then 6 characters, then 8
characters and so on. Basically it will try to match multiples of 2.
If that fails, it will try to first match 3 characters (corresponds to the number 3),
then 6 characters (corresponds to the number 6), then 9 characters, then 12 characters
and so on. This means that it will try to match multiples of 3. If that fails,
it proceeds to try match multiples of 4, then if that fails it will try
to match multiples of 5 and so on, until the number whose multiple it tries to match is
the length of the string (failure case) or there is a successful match (success case).
Note, that both of parts of the regular expression begin with a ^ symbol and end with a $ symbol, this
forces to what’s in between those symbols (.? in the first case and (..+)\1+ in the second case)
to start at the beginning of the string and end at the end of the string.
In our case that string is the unary representation of the number.
Both of the parts are separated separated by an alternation operator,
this means that either only one of them will be matched or neither will.
If the number is prime, a match will not occur. If the number is not prime a
match will occur. To summarize, we concluded that:
either ^.?$ or ^(..+?)\1+$ will be matched
the match has to be on the whole string, i.e.start at the beginning of the string and end at the end of the string
Okay, but what does each one those parts matches? Keep in mind that if a match occurs, it means that the number is not prime.
How The ^.?$ Regular Expression Works
^.?$ will match 0 or 1 characters. This match will be successful if:
the string contains only 1 character - this means that we’re dealing with
number 1 and, by definition, 1is not prime.
the string contains 0 characters - this means that we’re dealing with
number 0, and 0 is certainly not prime, since we can divide 0 by anything
we want, except for 0 itself, of course.
If we’re given the sting 1, ^.?$ will match it, since we have only one
character in our string (1). The match will also occur if we provide an empty
string, since, as explained before, ^.?$ will match either an empty string (0
characters) or a string with only 1 character.
Okay, so far so so good, we certainly want our regex to recognize 0 and 1 as
non-primes. But that’s not enough, since there are numbers other than
0 and 1 that are not prime. This is where the second part of the regular
expression comes in.
How The ^(..+?)\1+$ Regular Expression Works
^(..+?)\1+$ will first try to match multiples of 2, then multiples of 3, then
multiples of 4, then multiples 5, then multiples of 6 and so on, until the multiple
of the number it tries to match is the length of the string or there is a successful match.
But how does it actually work? Well, let’s dissect it!
Let’s focus on the parentheses now, here we have (..+?) (remember, ? just
makes this expression non-greedy). Notice
that we have a + here, which means “one or more of the preceding token”.
This regex will first try to match (..) (2 characters), then (...) (3 characters),
then (....) (4 characters), and so on, until the length of the string we’re
matching against is reached or there is a successful match.
After matching for some number of characters (let’s call that number x, the regular expression will try to
see if the string’s length is multiple of x. How does it do that? Well, there’s
a backreference. This takes us to the second
part of the regex: \1+. Now, as explained before
this will try to repeat the match in capturing group #1 one or more times (actually it’s more “more or one times, I’m lying a little bit”)
This means that first, it will try to match x * 2 characters in the string,
then x * 3, then x * 4, and so on. If it succeeds in any of those matches,
it returns it (and this means that the number is not prime). If it fails (it will fail when x * <number> exceeds the length of the string we’re matching against),
it will try the same thing, but with x+1 characters, i.e, first (x+1) * 2,
then (x+1) * 3, then (x+1) * 4 and so on (because now the \1+ backreference
refers to x+1 characters). If the number of characters matched by (..+?) reaches
the length of the string we’re matching against, the regex matching process will stop and return a failure. If there is a successful match, it will be returned.
Now, I’ll sketch some examples to make sure you got everything. I will provide
one example where a regular expression succeeds to match and one where it
fails to match. Again, I’m lying in the order of sub-steps (the nested ones, i.e the ones that have a ., like 2.1, 3.2, etc), just a little.
As an example of where a match succeeds, let’s consider the string 111111. The length of the string we’re matching against is 6.
Now, 6 is not a prime number, so we expect the regex to succeed with the match. Let’s see
a sketch of how it will work:
1. It will try to match ^.?$. No luck. The left side of | returns a failure2. It try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:
2.1 The backreference \1+ will try to match 11 twice (i.e1111). No luck.
2.2 The backreference \1+ will try to match 11 trice (i.e111111). Success!. Right side of | returns success
Woah, that was fast! Since the right side of |succeeded,
our regular expression succeeds with the match, which means our number is not prime.
As an example of where a match fails, let’s consider the string 11111. The length of the string we’re matching against is 5.
Now, 5 is a prime number, so we expect the regex to fail to match anything. Let’s see
a sketch of how it will work:
1. It will try to match ^.?$. No luck. The left side of | returns a failure2. It try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:
2.1 The backreference \1+ will try to match 11 twice (i.e1111). No luck.
2.2 The backreference \1+ will try to match 11 trice (i.e111111). No luck. Length of string exceeded (6 > 5). Backreference returns a failure.
3.(..+?) now matches 111:
3.1 The backreference \1+ will try to match 111 twice (i.e111111). No luck. Length of string exceeded (6 > 5). Backreference returns a failure.
4.(..+?) now matches 1111:
4.1 The backreference \1+ will try to match 1111 twice (i.e11111111). No luck. Length of string exceeded (8 > 5). Backreference returns a failure.
5.(..+?) now matches 11111:
5.1 The backreference \1+ will try to match 11111 twice (i.e1111111111). No luck. Length of string exceeded (10 > 5). Backreference returns a failure.
5.(..+?) will try to match 1111111. No luck. Length of string exceeded (6 > 5). (..+?) returns a failure. The right side of | returns a failure
Now since both sides of | failed to match anything, the regular expression fails
to match anything, which means our number is prime.
What About The ?
Well, I mentioned that you can ignore the ? symbol in the regular expression,
since it’s there only for performance reasons, and that’s true, but there is no
need to keep its purpose a mystery, so I’ll explain what it actually does there.
As mentioned before, ? makes the preceding +non-greedy. What does it mean in practice?
Let’s say our string is 111111111111111 (corresponds to the number 15). Let’s call
L the length of the string. In our case, L=15.
With the ? present there, + will try to match its preceding token (in this case .) as few times as possible. This means that first (..+?) will
try to match .., then ..., then .... and then .....,
after which our whole regex (^.?$|^(..+?)\1+$) would succeed. So first, we’ll be testing the divisibility
by 2, then by 3, then by 4 and then by 5, after which we would have a match.
Notice that the number of steps in (..+?) was 4 (first it matches 2, then 3, then 4 and then 5).
If we omitted the ?, i.e if we had (..+), then it would go the other way around: first it would try to
match ............... (the number 15, which is our L), then .............. (the number 14, i.eL-1),
and so on until ....., after which the whole regex would succeed. Notice
that even though the result was the same as in (..+?), in (..+) the number of steps was 11
instead of 4. By definition, any divisor of L must be no greater than L/2, so that means that
means that 8 steps were absolutely wasted computation, since first we tested
the divisibility by 15, then 14, then 13, and so on until 5 (we could only
hope for a match from number 7 and downwards, since L/2 = 15/2 = 7.5
and the first integer smaller than 7.5 is 7).
The Shocking Lie
As I mentioned before, I actually lied in the explanation of how the multiples of a number
are matched. Let’s say we have the string 111111111111111 (number 15).
The way I explained it before was that the regular expression would begin
to test for divisibility by 2. It would do so by first trying to match
2*2 characters, then 2*3, then 2*4, then 2*5, then 2*6, then 2*7,
after which it would fail to match 2*8, so it would try its luck with testing
for divisibility by 3, by first trying to match for 3*2 characters, then for
3*3 characters, then for 3*4 and then for 3*5, where it would succeed.
This is actually what would happen if the regular expression was ^.?$|^(..+?)\1+?$
(notice the ? at the end), i.e., if the +following the backreference was
What actually happens is the opposite. It would still try to test
for the divisibility by 2, first, but instead of trying to match for 2*2 characters,
it would begin with trying to match for 2*7, then for 2*6, then for 2*5,
then for 2*4, then for 2*3 and then for 2*2, after which it would fail and, once again,
try its luck with divisibility by 3, by first trying to match for 3*5 characters,
where it would succeed right away.
Notice, that in the second case, which is what happens in reality,
less steps are required: 11 in the first case vs 7 in the second (in reality, both of the cases would require more steps than presented here,
the goal of this explanation is not count them all, but to transmit the idea of what’s happening in both cases, it’s just a sketch of what’s going on under the hood).
While both versions are equivalent, the one explained in this blog post, is more efficient.
3. The Java Case
Here’s the piece of Java code that started all of this:
If you remember correctly, I said that due to the peculiarities of the way
String.matches works in Java, the regular expression that matches
non-prime numbers is not the one in the code example above (.?|(..+?)\1+),
but it’s actually ^.?$|^(..+?)\1+$. Why? Well, turns out String.matches()matches on the whole string, not on any substring of the string.
Basically, it “automatically inserts” all of the ^ and $ present in the regex
I explained in this post.
Other than that, it’s pretty much self explanatory: if the match succeeds, then
the number is not prime. In case of a successful match, String.matches()
returns true (number is not prime), otherwise, it return false (number is prime),
so to obtain the desired functionality we negate what the method returns.
new String(new char[n]) returns a Stringof n null characters (the . in our regex matches them).
4. Code Examples
Now, as promised, it’s time for some code examples!
Although I already presented this code example twice in this post,
I’ll do it here again, just to keep it organized.
That’s a joke I like. I didn’t come up with it and I don’t really know its
first source, so I don’t know whom to credit. Anyways, I’m actually going to
give you two versions here, one which works in ES6
and one that works in previous versions.
If you gotta use previous versions of ES, you can always fall back to
Note, however, that we’re passing n+1 to join(), since it actually
places those characters in between array elements. So if we have, let’s say,
10 array elements, there are only 9 “in-betweens”. Here’s the version
that will work in versions prior to ECMAScript 6:
Last, but not least, it’s time for Perl. I’m including this here because the
regular expression we’ve been exploring in this blog post has been popularized
by Perl. I’m talking about the one-liner perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' <number>
(replace <number> with an actual number).
Also, since I haven’t played around with Perl before, this seemed like a good
opportunity to do so. So here we go:
Since Perl isn’t the most popular language right now, it might happen that
you’re not familiar with its syntax. Now, I’ve had about 15 mins with it, so
I’m pretty much an expert, so I’ll take the liberty to briefly explain the
sub - defines a new subroutine (function)
$_ - we’re accessing the first parameter passed in to our subroutine
1x<number> - here we’re using the repetition operator x, this will
basically repeat the number 1<number> of times and return the result as a string.
This is similar to what '1'*<number> would do in Python or '1'.repeat(<number>)
=~ is the match test operator, it will return true if the regular expression
(its right-hand side) has a match on the string (its left-hand side).
! is the negation operator
I included this brief explanation, because, I myself, don’t like being left
in mystery about what a certain passage of code does and the explanation didn’t
take up much space anyways.
That’s all folks! Hopefully, you’re now demystified about how a regular expression
can check if a number is prime. Keep in mind, that this is far from efficient,
there are a lot more efficient algorithms for this task,
but it is, nonetheless, a fun and interesting thing.
I encourage you to go to a website like regex101 and
play around, specially if you’re still not 100% clear about how everything
explained here works. One of the cool things about this website is that
it includes an explanation of the regular expression (column on the right),
as well as the number of steps the regex engine had to make (rectangle right above
the modifiers box) - it’s a good way to see the performance differences (through the number of steps taken)
in the greedy and non-greedy cases.
I didn’t want to get into the topic of regular/non-regular languages and related, since
it’s theory that isn’t crucial for the topic of this post, but as lanzaa
pointed out, there is a difference between “regex” and “regular expression”.
What was covered in this blog post wasn’t a regular expression, but rather a regex.
In the “real world”, however (outside of academia), those terms are used