Delta Thoughts

First Blog Post

Adi Mittal

Hello, world!

Hello, World! I’m Adi Mittal, a student at Terman Middle School. I enjoy food, math, running, martial arts, and music of memes.

My main intent in starting this blog is to share my thoughts, ideas, and outlooks on cool math. My primary goal is to create content that's interesting and to share my thoughts on the world around me. I hope my ideas and thoughts can appeal to you, just as they did to me, and show you how interesting math can be!

UPDATE (AUGUST 1ST, 2021)

As the years have gone by, I have moved on from my lowly middle school self and am now a rising senior at Henry M. Gunn High School who is still obsessing over math, but is also actively participating in track and field, the concert and jazz bands, Model United Nations, and acting as a key organizer in my school's TEDx organization. My first year of high school was primarily spent situating myself in the new social dynamics and with all of these great clubs, so blogging was put on an indefinite hiatus as it floated to the back of my mind. A couple of years ago, in my sophomore year, I took a research class (which you can read about here) where we were asked to summarize our work in a blog post, and it seemed like a great time to restart the site and continue writing. Since then, I've been trying to write during school vacations, posting a couple of times a month at my most efficient, but I try to post at least once every couple of months.

The Math of Coins

Adi Mittal

Spinning Coins

Coins! As some of you may relate, I love to take a coin and just spin it on a flat surface or table. It's just satisfying, though it's being put to shame by the so-called "Fidget Spinner". I was spinning a coin a few days ago, and was thoroughly bored at the time, so I decided to ask myself a simple question: What information can be taken from a spinning coin? This then turned into: is there a correlation, or ratio, between the rate at which the coin rotates and the rate at which it "wobbles"? With a goal in mind, I picked up my pencil and started to work things out.

First things first, defining and finding all of our givens:

This diagram represents what the coin might look like at a given instance.

What we know about the coin:

The radius of our coin $= R$
The circumference of our coin $C = 2 \pi R$

Now for the circle our coin rotates and wobbles upon:

The radius of this circle is entirely dependent on the angle of the coin to the horizontal (the table or flat surface), which we will define as $\theta$. Using the diagram, we can find that $r = R \cos (\theta)$
The circumference is $c = 2 \pi R \cos (\theta)$

With all the givens that we need out of the way, on to the application.

A thing to note is that as the coin completes a full rotation around the smaller circle, the original placement of the coin moves by a certain amount. You can easily demonstrate this by drawing an arrow on a quarter, and guiding it through a rotation on a circle smaller than the quarter.

This was the quarter I used when getting ideas on how the coin behaved.

This extra distance it covers works out to be $C - c$.

$C - c = 2 \pi R - 2 \pi R \cos (\theta) = 2 \pi R (1 - \cos(\theta))$

$2 \pi R (1 - \cos(\theta)) =$ the distance our coin rotates by per revolution around the circle.

The distance per revolution our coin covers while "wobbling" is the same as the circumference of the circle our coin moves around, which we know is $2 \pi R \cos(\theta)$

Putting these two together as a ratio of the rate of rotation to the rate of "wobbling", we get:

$\frac{\textrm{Rate of Spinning}}{\textrm{Rate of Wobbling}} = \frac{2 \pi R (1 - \cos (\theta))}{2 \pi R \cos (\theta)}$

$\large= \frac{1}{\cos (\theta)} - 1$

This expression says that at any given moment, the ratio between how fast the coin is spinning and how fast the coin is "wobbling" (which can be seen as the frequency, in hertz, produced by the coin) will be $\frac{1}{\cos (\theta)} - 1$. This also means that if you multiply the frequency of "wobbling" by this expression, it will output how fast the coin should be spinning at the given value of theta. For example, let's say the coin is wobbling at a frequency of $5\,hertz$ at an angle of $\frac{\pi}{4}\,radians$ (because radians are cool). The coin would have to be rotating at about $2.07\,revolutions\,a\,second$ to maintain that angle to the horizontal at that "wobbling" frequency (because hertz measure $cycles\,per\,second$, the cycles translate into revolutions for the output).
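Here is a small Python sketch of that calculation. The function names `spin_to_wobble_ratio` and `spin_rate` are made up for illustration, and the angle has to be passed in radians because that is what `math.cos` expects.

```python
import math


def spin_to_wobble_ratio(theta: float) -> float:
    """Ratio of spin rate to wobble rate for a tilt angle theta (in radians)."""
    return 1.0 / math.cos(theta) - 1.0


def spin_rate(wobble_hz: float, theta: float) -> float:
    """Spin rate in revolutions per second implied by the ratio above."""
    return wobble_hz * spin_to_wobble_ratio(theta)


# Example values from the text: 5 hertz of wobble at an angle of pi/4 radians.
wobble_hz = 5.0
theta = math.pi / 4

print(f"ratio     = {spin_to_wobble_ratio(theta):.4f}")
print(f"spin rate = {spin_rate(wobble_hz, theta):.2f} revolutions per second")
```

A coin lying flat ($\theta = 0$) gives a ratio of zero, and the ratio grows without bound as the coin approaches standing on its edge.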

Of course, this is all theoretical. In practice, the coin may slip. Wind may change the local air pressure, thus changing the air resistance. Everything needs to stay constant, with no disturbances or changes occurring during the coin's movement. But this is still a neat thing if you were to ask me!

Just to recap, we took the basics and givens of our coin and its environment, and used those to get the generalized rates of the coin spinning and wobbling. We then used those to calculate the ratio between the two at a given instant. Not bad!

If you have any questions or comments, send me an email or leave a comment!

Trial and Eventual-Success

Adi Mittal

This triangle though...

How do you... How do you even... This is when you know the problem you are about to be shown will be annoying. When a friend of mine first introduced this problem, I thought it would be very, VERY simple to solve. Use some angle properties, use the given similar triangles, and soon enough, a solution will be found. Of course this didn't work. I tried a few other things, same result. Showed it to my family, not much help was gained. They tried what I did. Then it hit me. The goal is to find a specific measure of the triangles. Trigonometry $=$ The Measurement of Triangles. The solution quickly followed using a specific property, and I was like, "Meh. That was quite obvious." Enough ranting, it's time you got a look at the problem itself.

In the diagram below, $\angle ABC = \angle ACB = \angle DEC = \angle CDE$, $\,\overline {BC} = 8$, and $\,\overline {DB} = 2$. Find $\,\overline {AB}$

When drawing everything out...

SPOILER ALERT
Before you continue reading, I highly encourage you to attempt this geometry problem. It's an interesting problem, and once you've found the concept you need to use to solve it, it's all an easy ride down from there. Following this warning will be the full solution and my thoughts on how I solved this myself. I know I already talked about my thoughts and how I solved this a little in the beginning of this post, but from here on will be everything I thought about.
You have been warned...

The property I thought of (after my 45 minutes of trial-and-error) that we can use to solve for $\,\overline {AB}$ is the Law of Sines, which states: $\large \frac{a}{\sin A} = \frac{b}{\sin B} = \frac{c}{\sin C}$

Where $A$ is the angle opposite $a$, $\,B$ is the angle opposite $b$, and $C$ is the angle opposite $c$.

We can rewrite this using the diagram:

$\large \frac{\overline{BC}}{\sin \angle BAC} = \frac{\overline{AC}}{\sin \angle ABC} = \frac{\overline{AB}}{\sin \angle ACB}$

So now that we have this written out, we can start solving for $\,\overline {AB}$. For convenience, I'm going to refer to the angles equivalent to $\,\angle ABC$ as $\,\theta$.

Using the Angle Sum Theorem, $\,\angle BAC = 180 - 2 \theta$. Using this, we can find an expression equal to $\,\overline {AB}$.

$\large \frac{\overline {BC}}{\sin \angle BAC} = \frac{\overline {AB}}{\sin \angle ACB}$

Doing some substitution...

$\large \frac{8}{\sin 2 \theta} = \frac{\overline {AB}}{\sin \theta}$

For where the $\sin 2 \theta$ came from, $\sin (180 - 2 \theta)$, when evaluated, is the same as $\sin 2 \theta$. Now for some expansion and evaluation...

$\large \frac{8}{2 \sin \theta \cos \theta} = \frac{\overline {AB}}{\sin \theta}$

$\overline {AB} \sin \theta \cos \theta = 4 \sin \theta$

$\overline {AB} = \large \frac{4}{\cos \theta}$

Now that we have an expression for $\overline {AB}$, we just need to find a value of $\cos \theta$, and that will give us the length of $\overline {AB}$! So now, what can we do? What I first thought (based on the information we were given) was that if we find two expressions representing the value of the same side length, we can set those two expressions equal to one another to find a value that makes that equation true. That equation will most likely output a value of a function of an angle, as we know very few side lengths, and know no angles (we're hoping it would output a value of $\cos \theta$). Again, this is only what I was thinking when solving this problem at the time. The only reason I thought this is that I noticed two triangles, similar to $\triangle ABC$, contained within $\triangle ABC$.

We have similar triangles $\triangle DEC$ and $\triangle BEC$. And we know that they are similar as both triangles share the same angles ($\theta$, $\theta$, and $180 - 2 \theta$) as the original triangle $\triangle ABC$. And remember that side length I mentioned earlier that we could find two expressions for, and use those to solve for its length (that's a mouthful)? That length is $\overline {CE}$! It is a side of both $\triangle DEC$ and $\triangle BEC$, and we can find our two expressions by solving for the length of $\overline {CE}$ once in $\triangle DEC$, and again in $\triangle BEC$. Again, we're hoping for a value of $\cos \theta$. Starting by solving for $\triangle BEC$...

We are given the length of $\overline {BC} = 8$, which simplifies our job quite a bit. We can do the same thing we did to find an expression for $\overline {AB}$: Use the Law of Sines!

$\large \frac{8}{\sin \theta} = \frac{\overline {CE}}{\sin 2 \theta}$

$\large \frac{8}{\sin \theta} = \frac{\overline {CE}}{2 \sin \theta \cos \theta}$

$\overline {CE} \sin \theta = 16 \sin \theta \cos \theta$

$\overline {CE} = 16 \cos \theta$

We now have a value of $\overline {CE}$ from $\triangle BEC$, time to solve $\overline {CE}$ for $\triangle DEC$...

First off, although it's not stated, we know the length of $\overline {DE}$. $\triangle BEC$ is isosceles, where $\angle BEC = \angle ECB$, which also means $\overline {BC} = \overline {BE}$. As $\overline {BC} = 8$, therefore $\overline {BE} = 8$. Since we were told $\overline {DB} = 2$, we can solve $\overline {BE} - \overline {BD} = \overline {DE} = 6$. Now back to the almighty Law of Sines...

$\large \frac{\overline {DE}}{\sin 2 \theta} = \frac{\overline {CE}}{\sin \theta}$

Substitution and expansion...

$\large \frac{3}{\sin \theta \cos \theta} = \frac{\overline {CE}}{\sin \theta}$

$\overline {CE} \sin \theta \cos \theta = 3 \sin \theta$

$\overline {CE} = \large \frac{3}{\cos \theta}$

Great! We're lucky that it came out as a value of $\cos \theta$, but anyways, we have our two expressions, now just to set them equal to one another...

$16 \cos \theta = \large \frac{3}{\cos \theta}$

$16 \cos^2 \theta = 3$

$\cos^2 \theta = \large \frac{3}{16}$

$\cos \theta = \large \frac{\sqrt{3}}{4}$

Now that we have our value of $\cos \theta$, we can just substitute this into our original expression for $\overline {AB}$...

$\overline {AB} = \large \frac{4}{\cos \theta}$

$= \large \frac{4}{(\frac{\sqrt{3}}{4})}$

$ = \large \frac{16}{\sqrt{3}}$

And there it would be, our solution! Although it might have seemed quite lengthy to get to $\frac{16}{\sqrt{3}}$, it all just revolved around the one concept of the Law of Sines, so not too bad.
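As a numeric sanity check on the steps above, here is a short Python sketch. The variable names are chosen for illustration only; the relations it uses ($\overline {CE} = 16 \cos \theta$, $\overline {CE} = \frac{3}{\cos \theta}$, and $\overline {AB} = \frac{4}{\cos \theta}$) are the ones derived in the text, written in terms of the given lengths.

```python
import math

BC = 8.0        # given
DB = 2.0        # given
DE = BC - DB    # BE = BC because triangle BEC is isosceles, so DE = 8 - 2 = 6

# Triangle BEC gives CE = 2 * BC * cos(theta)     (the "16 cos(theta)" above).
# Triangle DEC gives CE = DE / (2 * cos(theta))   (the "3 / cos(theta)" above).
# Setting them equal: cos^2(theta) = DE / (4 * BC).
cos_theta = math.sqrt(DE / (4.0 * BC))

ce_from_bec = 2.0 * BC * cos_theta
ce_from_dec = DE / (2.0 * cos_theta)

# AB = 4 / cos(theta), i.e. half of BC divided by cos(theta).
AB = (BC / 2.0) / cos_theta

# Cross-check with the Law of Sines in triangle ABC.
theta = math.acos(cos_theta)
lhs = BC / math.sin(math.pi - 2.0 * theta)
rhs = AB / math.sin(theta)

print(f"cos(theta) = {cos_theta:.6f}")
print(f"CE         = {ce_from_bec:.6f} and {ce_from_dec:.6f}")
print(f"AB         = {AB:.6f}")
print(f"Law of Sines check: {lhs:.6f} vs {rhs:.6f}")
```

Both expressions for $\overline {CE}$ should print the same number, and the two sides of the Law of Sines check should agree.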

Although this is one way to obtain the solution, I'm sure there are other ways to tackle this problem, and I found another way which completely negates our first step, to find an expression for $\overline {AB}$, but adds an extra step to the end.

With our value of $\cos \theta = \frac{\sqrt{3}}{4}$, we can draw a right triangle with this as one of our angles with a bit of moving around.

$\theta = \arccos \large \frac {\sqrt{3}}{4}$

We can do this, because as we stated earlier $\theta = any\,angle\,equivalent\,to\, \angle ABC$ (and that's the exact angle we're working with). We also bisected $\overline {BC}$ at $F$ to form the 2 right triangles within our isosceles triangle, so the length of $\overline {BF} = 4$. We can then use some basic trigonometry and evaluation to solve for $\overline {AB}$.

$\theta = \arccos \large \frac {\sqrt{3}}{4}$

$\cos \theta = \large \frac {4}{\overline {AB}}$

$\cos (\arccos \large \frac {\sqrt{3}}{4}) = \large \frac {4}{\overline {AB}}$

$\large \frac {\sqrt{3}}{4} = \large \frac {4}{\overline {AB}}$

${\large \frac {\sqrt{3}}{4}} \overline {AB} = 4$

$\overline {AB} = \large \frac{16}{\sqrt{3}}$
Just another simple way of getting to the exact same answer.

If you have any questions or comments, send me an email or leave a comment!

EDIT (JULY 3RD, 2017): RECURSIVE DIVISIONS SOLUTION
This specific solution is one of my favorites that I have seen. One of my initial attempts was to use the dimensions of the similar triangles and find the common ratio between the side length and the base of the triangle. I knew it could be done, but never put my finger on it. However, when a friend of mine took a look at this problem, after a bit of thought, he managed to come up with this. It's really quite a spectacular solution, and this is credited entirely to him (no use of name for privacy reasons). Oh, and I'll be speaking in first person, just so I don't cause any confusion, or make it seem like I'm taking it as mine. Just to be clear.
So the first step is to take the three triangles we know to be similar to one another ($\triangle ABC$, $\triangle BEC$, and $\triangle CED$). We know that they are similar due to the fact they all share two common angles, which forces them to have a common ratio between the base and a leg of the triangle (this will be important to remember later). We will $0$-index them from the original triangle through the following divisions within one another. I will also now be referring to the triangles by their respective index numbers.
Now using the fact that every triangle is similar, and that each progressive triangle was formed by using the base length of the previous triangle to form the leg of the next triangle, we can find a ratio between a dimension (say, the base) of a triangle, and its previous/next triangle, and use that to find the length of $\overline{AB}$. I know that is kind of confusing right now, but trust me, it will make more sense the more I go on.
So we know the lengths of the bases of two triangles ($\triangle 0$ and $\triangle 2$). Since we know that they should share a common ratio, we can write them as a ratio between one another, and hence find said ratio.
$\large \frac{\overline{BC}}{\overline{DE}} = \frac{8}{6}$

$\large = \frac{4}{3}$

So we have a ratio, but the problem with this ratio is that it's for two divisions. It's for going between $\triangle 0$ and $\triangle 2$. We want one between $\triangle 0$ and $\triangle 1$, or $\triangle 1$ and $\triangle 2$. But this is easy, since each division in this case scales the previous triangle by the same factor. This means if we take some dimension _a_ of a triangle and scale it by our ratio once, we will obtain the dimension _a_ of the next division's triangle. For example, if we have triangle-base $\overline{BC}$ and scale it by our ratio, we should get the length of triangle-base $\overline{EC}$. Take a look at the diagram if that helps. Essentially, if we scale the base length of $\triangle 0$ by some ratio, we will get the base length of $\triangle 1$, and if we do that again, we will get the base length of $\triangle 2$. Now if you see, we had to scale twice to get from $\triangle 0$ to $\triangle 2$. A.K.A., take the square of the ratio. To undo a square, you take the square root. So we can undo our two-division ratio by taking its square root, to get our one-division ratio.

$\sqrt{\large{\frac{4}{3}}} = \large{\frac{2}{\sqrt{3}}}$

So that's our ratio between a one triangle division. So now we need to find the length of $\overline{AB}$. So we can do what we did originally with the base lengths, only with the legs of the triangle. Larger triangle, over the divided triangle. In this case, $\triangle 0$ over $\triangle 1$.

$\large \frac{\overline{AB}}{\overline{BC}} = \frac{2}{\sqrt{3}}$

$\large \frac{\overline{AB}}{8} = \frac{2}{\sqrt{3}}$

$\overline{AB} = \large \frac{2 \cdot 8}{\sqrt{3}}$

$\overline{AB} = \large \frac{16}{\sqrt{3}}$

And there it is! The answer we had before. I cannot tell you how cool of a solution this is. Simple in concept, but well executed. So major props to my friend for carrying out such a clean solution. I tried this, completely missed the obvious, and he did not. Major props to him.
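The ratio argument is short enough to check in a few lines of Python. This is just a sketch with illustrative variable names; it assumes, as the solution does, that each triangle's base becomes the leg of the next triangle.

```python
import math

base_0 = 8.0   # BC, the base of triangle 0
base_2 = 6.0   # DE, the base of triangle 2

two_step_ratio = base_0 / base_2            # 4/3, spans two divisions
one_step_ratio = math.sqrt(two_step_ratio)  # 2/sqrt(3), spans one division

# The base of triangle 1 (CE) sits one division below the base of triangle 0.
base_1 = base_0 / one_step_ratio

# The leg of triangle 1 is BC, so the leg of triangle 0 (AB) is one division above it.
leg_0 = base_0 * one_step_ratio

print(f"one-division ratio = {one_step_ratio:.6f}")
print(f"CE = {base_1:.6f}")
print(f"AB = {leg_0:.6f}")
```

Dividing `base_1` by the one-division ratio again lands back on $\overline{DE} = 6$, which is a nice consistency check on the whole chain.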

How fast can I travel? Part 1

Adi Mittal

Answer: Too fast

The Earth has a diameter of approximately 12742000 meters. Most people of course wouldn't travel that far, but what if you did? How fast can you get across with nothing but yourself? That's essentially what people have asked in the form of the question: How long will it take to fall through the center of the Earth?

SPOILER WARNING:
THIS WARNING IS TO INDICATE ANY WORD FOLLOWING THIS MESSAGE WILL BE A PART OF THE SOLUTION TO THIS PROBLEM. HIGHLY ENCOURAGE THE ATTEMPT TO SOLVE THIS PROBLEM.
YOU HAVE BEEN WARNED...

Following our standard procedure, let's list all the givens:
The radius of the Earth is $R = 6{,}371{,}000$ meters

The force of gravity is $F = \large \frac{G m M}{r^2}$

Where...
$G =$ the gravitational constant
$m =$ the mass of object 1 (in this case, us)
$M =$ the mass of object 2 (in this case, Earth)
$r =$ the distance between $m$ and $M$.

So now we are just trying to find as many values or expressions as we can for the variables within that equation of force. We can leave $m$ as is, because that's the mass of our human/us. So what we really need is $r$ and $M$.

One thing we have to worry about though, is that as we fall $r$ will change. As we fall we will get closer to Earth's center of mass, eventually pass it, and then get farther from it. So we will call our current distance relative to Earth's center of mass $x$. And what's great about this is that if we are any distance into our fall, we can just ignore any mass above us. Using the diagram as an example, if we are $R-x$ deep into our fall, we can ignore any mass of Earth contained between $R$ and $x$. Some of you may think, "But wait! Wouldn't the mass above us have its own force of gravity acting upon us, and therefore slow us down as we fall?" The answer is technically yes, but that pull is balanced out by the rest of that outer shell of mass, on the far side of the Earth and to the sides of us. All these forces cancel out, so the outer shell does not affect us at all. So, all we really care about is the amount of mass below us, and the distance between us and the Earth's center of mass (which would be the radius $x$ as we have been discussing). So we have one variable filled.

$F = \large \frac{G m M}{x^2}$

Now we need $M$. The formula for mass is $M = volume \times density$. The volume of the part of the Earth below us $= \frac{4 \pi x^3}{3}$ (we are using $x$ again as the mass affecting us changes over our fall). And we can represent density with $\rho$. So $M$ equals:

$\large \frac{4 \pi \rho x^3}{3}$

Putting this all together, the force of gravity acting upon us during this fall equals:

$F = \large \frac{4 \pi G m \rho x^3}{3 x^2}$

$ = \large \frac{4 \pi G m \rho}{3} x$

If we let $\frac{4 \pi G m \rho}{3} =$ say, $k$, we get $F = -k x$. It's negative because the force always pulls us back toward the center, opposite to our displacement $x$. This is actually an oscillating system. To represent this, I've made a mock graph to show how gravity affects us over time starting from the top of "Earth". The graph is just a representation.

If the x-axis is time, and we fell from the top of Earth (and there is NO air resistance), as you can see, we would just continuously bounce back and forth between the top and bottom of the Earth. Now we need to find the period of our oscillating system. The period is the time it takes for one cycle to be completed. To be more precise, we need half of the period. That is because one cycle (in this case) is falling all the way down, and coming all the way back. We only want the time it takes to fall down, so that's why the half.

The equation for the period of a simple oscillating system (also called simple harmonic motion) is:

$P = 2 \pi \sqrt {\large \frac {m}{k}}$

Here, $k$ is the constant that multiplies $x$ in our force equation (the $\frac{4 \pi G m \rho}{3}$ from before), and $m$ is our mass. But since we want half of the period, the time to fall through the Earth is...

$Time = \pi \sqrt {\large \frac {m}{k}}$

Doing some substitution...

$Time = \pi \sqrt {\large \frac {m}{\large \frac{4 \pi G m \rho}{3}}}$

$ = \pi \sqrt {\large \frac {3 m }{4 \pi G m \rho}}$

$ = \sqrt {\large \frac {3 m \pi^2}{4 \pi G m \rho}}$

$ = \sqrt {\large \frac {3 \pi}{4 G \rho}}$
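One way to build some trust in this formula without any calculus is to simulate the fall step by step and time it. The sketch below does that with a simple velocity Verlet integrator. The rounded density `RHO`, the time step `dt`, and the function name `simulate_fall_time` are all assumptions made for illustration, not values from the derivation above.

```python
import math

G = 6.67408e-11    # gravitational constant, m^3 kg^-1 s^-2
RHO = 5513.0       # assumed (rounded) mean density of Earth, kg/m^3
R = 6_371_000.0    # radius of Earth, m


def simulate_fall_time(radius: float, rho: float, dt: float = 0.1) -> float:
    """Time to fall from one side of the planet to the other, by stepping the motion.

    The acceleration is a = -(k/m) * x with k/m = 4*pi*G*rho/3, as in the text.
    """
    k_over_m = 4.0 * math.pi * G * rho / 3.0
    x, v, t = radius, 0.0, 0.0
    a = -k_over_m * x
    while True:
        # One velocity Verlet step.
        x_new = x + v * dt + 0.5 * a * dt * dt
        a_new = -k_over_m * x_new
        v_new = v + 0.5 * (a + a_new) * dt
        t += dt
        if v_new >= 0.0:
            # The velocity has stopped being negative: we reached the far side.
            return t
        x, v, a = x_new, v_new, a_new


simulated = simulate_fall_time(R, RHO)
predicted = math.sqrt(3.0 * math.pi / (4.0 * G * RHO))

print(f"simulated fall time: {simulated:.1f} s")
print(f"formula fall time:   {predicted:.1f} s")
```

The two numbers should agree to within roughly one time step, which is about as good as a simulation with this `dt` can do.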

Now all we need to do is put in $G$ as the Gravitational Constant, and $\rho$ as the density ($\rho = \frac{mass}{volume}$) of Earth (I did some Googling...)!

$Time = \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot m^3 \cdot kg^{-1} \cdot s^{-2} \cdot \frac{5.972 \cdot 10^{24}\, kg}{\frac{4 \pi \cdot 6371000^3}{3} \, m^3} }}$

This looks bad. Let's clean it up.

$= \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot m^3 \cdot kg^{-1} \cdot s^{-2} \cdot \frac{5.972 \cdot 10^{24} \cdot 3 \, kg}{4 \pi \cdot 6371000^3 \, m^3} }}$

$= \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot s^{-2} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ = \sqrt {\large \frac {3 \pi \cdot s^{2}}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ = s \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

So, I don't know about you, but when I have something like this, I just straight up put it into Wolfram Alpha, or a similar calculator, as I am just lazy and it's a pain to evaluate. So, letting it be computed by the calculator...

$Time = s \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ \large = 2530.5\,seconds$

This, funnily enough is also the answer to the universe and all of its questions. $2530.5\,seconds = 42\,minutes\,(+10.5\,seconds)$. Quite a coincidence if I say so!
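If you would rather not trust an online calculator, the same evaluation is only a few lines of Python. This sketch plugs in the same constants used above; the variable names are just illustrative.

```python
import math

G = 6.67408e-11         # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24      # mass of Earth, kg
R_EARTH = 6_371_000.0   # radius of Earth, m

volume = 4.0 * math.pi * R_EARTH**3 / 3.0
rho = M_EARTH / volume                       # mean density, kg/m^3

fall_time = math.sqrt(3.0 * math.pi / (4.0 * G * rho))
minutes, seconds = divmod(fall_time, 60)

print(f"density   = {rho:.1f} kg/m^3")
print(f"fall time = {fall_time:.1f} s = {int(minutes)} min {seconds:.1f} s")
```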

Now what's great about the equation we used ($Time = \sqrt {\frac {3 \pi}{4 G \rho}}$) is that it's quite easy to apply to other objects, as most of it is constant! 3 is, well, a constant. So is 4. $\pi$ has been universally agreed upon for its value. And as far as we can tell, the Gravitational Constant holds throughout the universe. The only thing that determines the fall time is the density. So you could have two planets, one with $x$ as its radius, and the other with $100 x$. If they are just as dense as one another, you will fall through them (across the diameter) in the same time.
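To see the radius drop out, here is a sketch that computes the fall time from a planet's mass and radius, then repeats it for a made-up planet 100 times wider with the same density (so $100^3$ times the mass). The helper name `fall_time_from_mass_radius` and the second planet are purely hypothetical.

```python
import math

G = 6.67408e-11   # gravitational constant, m^3 kg^-1 s^-2


def fall_time_from_mass_radius(mass: float, radius: float) -> float:
    """Fall time through a uniform planet, using Time = sqrt(3*pi / (4*G*rho))."""
    volume = 4.0 * math.pi * radius**3 / 3.0
    rho = mass / volume
    return math.sqrt(3.0 * math.pi / (4.0 * G * rho))


earth_like = fall_time_from_mass_radius(5.972e24, 6_371_000.0)

# Hypothetical planet: 100 times the radius, same density, so 100^3 times the mass.
giant = fall_time_from_mass_radius(5.972e24 * 100**3, 6_371_000.0 * 100)

print(f"Earth-sized planet: {earth_like:.1f} s")
print(f"100x wider planet:  {giant:.1f} s")
```

Both lines should print the same time, since only the density enters the formula.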

Now, just as a random fact that I thought was amusing, let's find the top speed you would attain. We know that acceleration due to gravity on Earth is $\frac{9.807\,m}{s^2}$. The top speed would be when you reach the center of the Earth, which is 6,371,000 meters from the surface (aka, the radius of Earth). Using this, we can calculate the speed at which we would be at in meters per second at the center. Just to be sure, we can calculate acceleration due to gravity, using our original formula, where we ignore our mass: $g = \frac{G \cdot M_{Earth}}{R^2}$

$g = \large \frac{G \cdot M_{Earth} }{R^2}$

Substituting everything we have...

$g = \frac{6.67408 \cdot 10^{-11} \cdot m^3 \cdot kg^{-1} \cdot s^{-2} \cdot 5.972 \cdot 10^{24}\, kg }{6371000^2\,m^2}$

$g = \frac{6.67408 \cdot 5.972 \cdot 10^{13} \cdot m}{6371000^2 \cdot s^2}$

Thanks to a calculator...

$\approx \large\frac{9.82m}{s^2}$

Of course this is not the same as what others have put on the internet; values will differ from here to there. I trust the value of $\frac{9.807m}{s^2}$, as I think the values they used to calculate it would be more accurate. Back to the top speed now.

$g = \large \frac{9.807\,m}{s^2}$

$v^2 = g R = \large \frac{9.807 \cdot 6371000\,m^2}{s^2}$

$v = \large \frac{\sqrt{9.807 \cdot 6371000}\,m}{s}$

$\approx \large \frac{7904.454251 m}{s}$
Which is equivalent to...

$\approx \large \frac{28456.035304\,km}{hour}$

$\approx \large \frac{17681.760583\,miles}{hour}$

That's about 23.23 times the speed of sound! This literally means you can't yell during this fall, as you would be going literally faster than the vibrations could travel through the air around you. It will be a silent fall. That is, if there was air, and the terminal velocity of a human wasn't $\frac{53m}{s}$.
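The numbers in this section can be reproduced with a short Python sketch. It relies on the fact that, for this kind of oscillation, the top speed at the center satisfies $v^2 = g R$. The speed of sound used here (340.29 m/s, a common sea-level figure) is an assumption, since the post does not say which value it used.

```python
import math

G = 6.67408e-11           # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24        # mass of Earth, kg
R_EARTH = 6_371_000.0     # radius of Earth, m
G_SURFACE = 9.807         # accepted surface gravity, m/s^2
SPEED_OF_SOUND = 340.29   # assumed sea-level speed of sound, m/s
KM_PER_MILE = 1.609344

g_computed = G * M_EARTH / R_EARTH**2    # surface gravity from the force formula

v_max = math.sqrt(G_SURFACE * R_EARTH)   # top speed at the center, m/s
v_kmh = v_max * 3.6
v_mph = v_kmh / KM_PER_MILE
mach = v_max / SPEED_OF_SOUND

print(f"g from G, M, R = {g_computed:.2f} m/s^2")
print(f"top speed      = {v_max:.1f} m/s")
print(f"               = {v_kmh:.1f} km/h")
print(f"               = {v_mph:.1f} miles/hour")
print(f"               = {mach:.2f} times the speed of sound")
```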

So that would be it for this post! We found out we can cross Earth in under 45 minutes, and break the sound barrier 23 times over!

I plan on following it up with another post showing how you can use integration to find the time to fall through Earth (and that equation to find the period of an oscillating system/simple harmonic motion that kind of came out of nowhere. The $2 \pi \sqrt{\frac{m}{k}}$), and to show some other cool properties and interesting things about falling, pendulums, and oscillating systems in general.

Now here's an extra challenge for you: How long will it take to fall through Earth, 500 kilometers above the surface?

If you have any questions or comments, send me an email or leave a comment!

To equal NP or to not equal NP. That is the question.

Adi Mittal

I tried...

There is just no introduction needed here. The problem at hand is probably one of the hardest, most controversial topics in computer science:

Prove that $P = NP$, or otherwise

In case this is not clear (or you have never heard of this problem before), the task is to show that every $NP$ problem is also a $P$ problem, or to show that the two classes are not equal. An $NP$ problem is one whose proposed solution can be checked in polynomial time, even if no polynomial-time way of finding a solution is known, and the $NP-hard$ problems are those at least as hard as every problem in $NP$ ($NP$ stands for non-deterministic polynomial-time, and $P$ just stands for polynomial-time). Polynomial time is time that can be represented as a function of the size of the input (input being whatever you are given to solve the problem), where the function is a simple polynomial function. For example, the following function representing the time it takes to solve some problem,

$f(x) = p(x^k)$, where $k$ is constant
..., this would classify the problem as a $P$ problem, as we can represent the time it takes to solve the problem as a simple polynomial function. An example for how long some $NP-hard$ problem might take to solve with the best known methods would be such as...
$f(x) = p(k^x)$, where $k$ is constant
This is bad for computation time: since $x$ is our input, our values would explode the greater the amount of input we have. That's why $NP-hard$ problems are considered so difficult. We would essentially have to brute force our way through, checking every possible scenario (within the allotted values for our problem) to solve it.
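To get a feel for how differently those two kinds of functions grow, here is a tiny Python sketch. The choice of $k = 3$ for the polynomial and $k = 2$ for the exponential is arbitrary, and the function names are made up for illustration.

```python
def polynomial_steps(x: int, k: int = 3) -> int:
    """Steps needed if the running time grows like x^k."""
    return x ** k


def exponential_steps(x: int, k: int = 2) -> int:
    """Steps needed if the running time grows like k^x."""
    return k ** x


for x in (5, 10, 20, 40, 80):
    print(f"x = {x:>2}   x^3 = {polynomial_steps(x):>7}   2^x = {exponential_steps(x)}")
```

For small inputs the polynomial can actually be the larger of the two, but the exponential overtakes it quickly and never looks back.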
So now the reason why this problem is so controversial: if we can show that $P = NP$ is true, we can then theoretically solve ANY problem in $NP$ with an algorithm that runs in polynomial time. It would cut so much time off of solving all the crazy hard problems that currently take far too long to compute.
And I know, some of you may be thinking, "But, hey! Wouldn't most problems need a completely different approach to solve than another problem?" Well, my response to this would be yes, but there is a special group of $NP-hard$ problems that are themselves in $NP$. These are called $NP-complete$ problems. The reason why these are important (especially when answering the above question) is because every problem in $NP$ can be converted, in polynomial time, into an $NP-complete$ problem, so a common way to tackle a hard problem is to boil it down to a similar $NP-complete$ problem and then attack it the same way as that $NP-complete$ problem. Here is a list of $NP-complete$ problems.
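To make the "brute force" idea concrete, here is a sketch using subset sum, a standard $NP-complete$ problem: given a list of numbers, is there a subset that adds up to a target? Checking a proposed answer is quick, while the naive search may have to try every subset. The function names and the sample numbers are made up for illustration.

```python
from itertools import combinations


def verify(numbers, subset, target):
    """Quick check: is `subset` really drawn from `numbers`, and does it hit the target?"""
    pool = list(numbers)
    for value in subset:
        if value not in pool:
            return False
        pool.remove(value)
    return sum(subset) == target


def brute_force(numbers, target):
    """Try every non-empty subset: up to 2^n - 1 candidates for n numbers."""
    checked = 0
    for size in range(1, len(numbers) + 1):
        for candidate in combinations(numbers, size):
            checked += 1
            if sum(candidate) == target:
                return candidate, checked
    return None, checked


numbers = [3, 34, 4, 12, 5, 2]

solution, checked = brute_force(numbers, 9)
print(f"found {solution} after checking {checked} subsets")
print(f"verified: {verify(numbers, solution, 9)}")

missing, checked_all = brute_force(numbers, 1000)
print(f"no subset sums to 1000 (checked all {checked_all} subsets)")
```

With only six numbers the full search is 63 subsets, but every extra number doubles that count, which is exactly the exponential blow-up described above.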

This here shows how you can work from one $NP-hard$ problem to another, smaller, but related problem.

Okay, now with that all out of the way, the reason why I started discussing $NP-problems$: the $P$ versus $NP$ problem. This problem bothers me so much, for a few reasons. ONE: This seems a lot easier to solve than it actually is, and this just intuitively bothers me more than other problems do. It seems like such a simple statement to show, but it's just not. TWO: The way people are approaching this problem, it seems all too awkward and incorrect to me. It seems that they are overcomplicating this quite a bit. But this is computer science, so I don't have much say. And the person who proved Fermat's Last Theorem did so in more or less 150 pages (I think), so this could very well be so as well.

My attempts haven't been as successful (well, if they were successful, I would be too excited to write this up), but I do have a few thoughts on the matter. My first attempt was rather bleak. Take a generalized form of the time it takes to solve an $NP-hard$ problem, and just try to work it down to some representation of polynomial time. This, obviously, did not work. What ended up happening was that I was trying to put the wrong variable into a polynomial-time representation, and couldn't find a way to expand onto the variable that I needed to express. So, that idea was gone. The second idea would be a bit more practical. Take some $NP-complete$ problem, look at how its time is in its NP form, then try to find some algorithm that results in the same solution, but is in polynomial time. The reason why I would do this is because you can link almost any $NP-problem$ to one of the $NP-complete$ problems. Using this, we can create a map, linking every $NP-complete$ problem to another. That way, if we can solve for one, we have then technically shown it for every $NP-problem$. We can do that, or somehow generalize our $NP-complete$ problem, and show it from there. My last idea on the matter is to think of the consequences of this statement ($P=NP$) being true, or false. If this is true, I feel that this would create a paradox. Because finding the polynomial-time function of an $NP-problem$ is $NP-hard$ in itself. But that cannot happen, as we said that $P=NP$, so we have a contradiction in itself. So you would then have to show that finding a P function of an NP function is in P. But that is also $NP-hard$. Then you would have to show that is also in P. But that's also $NP-hard$, so we have to show that it's in P, etc., etc.
So we end up having to continuously prove that something that is in NP is in P, to show that the smaller $NP-hard$ problem of P versus NP (showing the conversion of NP to P in a given $NP-problem$) is also in P (that was a bit long and a mouthful. Essentially you get a recursive $NP-problem$, and each iteration of this recursive problem is slightly different than the last iteration, but with the same goal of showing that that iteration of an $NP-problem$ takes P time to actually solve). If it was false, we would stay where we are computationally, and nothing would have changed. Personally, based on what I have done so far, I think $P \neq NP$. But don't think that is my final decision. People have found polynomial-time algorithms for problems that once looked just as hard, so based on my second idea of mapping $NP-complete$ problems, there is still some possibility. Expect some updates, and future posts, as this is one of many other problems (I'll just say them: The Millennium Problems) that have gotten me thinking in almost no way I have done before (that's probably because they are not all math-based, and I'm a math-based guy, so math-based + not-math-based-problem = new type of thinking). Actually, don't expect future updates and posts, just know there will be future updates and posts.

If you have any questions or comments, send me an email or leave a comment!

(Un)Expected Value

Adi Mittal

Okay, you don't make that much...

Not much for an introduction this post. Found this problem when looking for interesting problems for myself. Shoutout to Harvard's Problem of the Week (from 2002 to 2004). The problem at hand is:

Consider the following game: You flip a coin until you get tails, and the amount of money you win is equal to the number of flips you end up making (i.e. if you flip a coin and immediately get tails, you win one dollar. If it takes 2 flips to get tails, you get 2 dollars. 3 flips = 3 dollars. So on, and so on).

(a) What is the expected value of your winnings when playing the game?

(b) Play the same game, except let your earnings be $2^{n-1}$, where $n$ is the number of flips. What do you expect to win now? Does it make sense?

SPOILER WARNING: SOLUTION INCOMING

(a): Expected value is found by taking each amount you can win, multiplying it by the probability of it occurring, and adding up the results over all the possible outcomes.
You have a 50% chance to win 1 dollar. 25% chance to win 2 dollars. 12.5% chance to win 3 dollars...

$\large \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + ...$

$= \large \sum _{n=1}^{\infty }\: \frac{n}{2^n}$

$= \large 2$

$\large \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + ...$

$\large =(\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...) + (\frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...) + (\frac{1}{8} + \frac{1}{16} + ...) + ...$

$= (1) + \large (\frac{1}{2}) + (\frac{1}{4}) + (\frac{1}{8}) + (\frac{1}{16}) +...$

$\large = 2$

So you can expect to win 2 dollars every time you play this game.
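
As a quick sanity check on that sum, here is a minimal Python sketch (not part of the original solution; the function name is made up for illustration). It adds up the first few terms of $\sum n/2^n$ exactly, and the partial sums close in on 2.

```python
from fractions import Fraction

def expected_winnings_partial(terms):
    """Exact partial sum of n / 2**n for n = 1..terms."""
    return sum(Fraction(n, 2**n) for n in range(1, terms + 1))

for terms in (1, 2, 5, 10, 20):
    partial = expected_winnings_partial(terms)
    # Print the partial sum and how far it still is from 2.
    print(terms, float(partial), float(2 - partial))
```

The gap to 2 after $N$ terms works out to $\frac{N+2}{2^N}$, which is why only a handful of terms are needed to get very close.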

This is where the fun is at.

(b): We have 50% chance to win 1 dollar. We have a 25% chance to win 2 dollars. We have a 12.5% chance to win 4 dollars...

$\large \frac{1}{2} + \frac{2}{4} + \frac{4}{8} + \frac{8}{16} +...$

If you don't mind, since I like to write things in sigma notation, I would like to write the simplified version of this sum in sigma notation.

$=\large \sum _{n=1}^{\infty}\: \frac{1}{2}$

$\large = \infty$

This is why I picked this problem. The first part is quite simple, but this part creates quite a dilemma. What can we do now? How should we interpret this for the expected value of our game? No one would ever put up a game in which the player is expected to win an infinite amount of money, since no one has an infinite amount of money!

The following explanation is a jumble between what I thought, and Harvard's. I recommend looking at what they said specifically.

The solution is that our game (the experiment, in this scenario) doesn't agree with the exact definition of expected value. Expected value is defined as an average over an infinite number of trials (or at least as the limit as the number of trials goes to infinity). The thing is, you'll never be able to play an infinite number of games. Essentially, our experiment (the game) doesn't agree with our calculated expected value, because a finite run of games has nothing to do with the precise definition of expected value. Just as an example, if you were to (somehow) play an infinite number of games, your earnings would indeed average an infinite amount. This whole idea of expecting to win an infinite amount, and it not working or making sense, arises when we try to make expected value something it isn't.

Okay, I like math, but from this point onward I didn't have much. And what I did wasn't cohesive, as 25% was written down, the other 75% was in my head. The problem is, that 75% was in my head. I would try to go through and get my complete explanation, but I feel that Harvard's solution is already quite nice. So the rest is all Harvard's explanation. Only credit I get here is for the fact I formatted it for this page. Here you go.

"This might not be a very satisfying explanation, so let us get a better feeling for the problem by looking at a situation where someone plays $N = 2^n$ games. How much money would a “reasonable” person be willing to put up front for the opportunity to play these $N$ games? Well, in about $2^{n−1}$ games he will win one dollar; in about $2^{n−2}$ he will win two dollars; in about $2^{n−3}$ games he will win four dollars; etc., until in about one game he will win $2^{n−1}$ dollars. In addition, there are the “fractional” numbers of games where he wins much larger quantities of money (for example, in half a game he will win $2^n$ dollars, etc.), and this is indeed where the infinite expectation value comes from, in the calculation above. But let us forget about these for the moment, in order to just get a lower bound on what a reasonable person should put on the table. Adding up the above cases gives the total winnings as: $2^{n−1}(1) + 2^{n−2}(2) + 2^{n−3}(4) + \cdots + 1(2^{n−1}) = 2^{n−1}n$. The average value of these winnings in the $N = 2^n$ games is therefore $\frac{2^{n−1}n}{2^n} = \frac{n}{2} = \frac{(\log_2 N)}{2}$. A reasonable person should therefore expect to win at least $\frac{(\log_2 N)}{2}$ dollars per game. (By “expect”, we mean that if the player plays a very large number of sets of $N$ games, and then takes an average over these sets, he will win at least $2^{n−1}n$ dollars per set.) This clearly increases with $N$, and goes to infinity as $N$ goes to infinity. It is nice to see that we can obtain this infinite limit without having to worry about what happens in the infinite number of “fractional” games. Remember, though, that this quantity, $\frac{(\log_2 N)}{2}$, has nothing to do with a true expectation value, which is only defined for $N → ∞$. Someone may still not be satisfied and want to ask, “But what if I play only $N$ games? I will never ever play another game. 
How much money do I expect to win?” The proper answer is that the question has no meaning. It is not possible to define how much one expects to win, if one is not willing to take an average over a arbitrarily large number of trials."
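
The counting argument in the quote can be checked with a short Python sketch. This is only an illustration of the idealized tally described above (the function names are hypothetical, and the "fractional" games are ignored, exactly as in the quote), not a simulation of real coin flips.

```python
import math

def idealized_total(n):
    """Total winnings over N = 2**n games, ignoring the 'fractional' games.

    About 2**(n - k) games end on flip k and pay 2**(k - 1) dollars, k = 1..n.
    """
    return sum(2**(n - k) * 2**(k - 1) for k in range(1, n + 1))

def idealized_average(n):
    """Average winnings per game over N = 2**n games."""
    return idealized_total(n) / 2**n

for n in (1, 5, 10, 20):
    N = 2**n
    # Last two columns should agree: n / 2 and log2(N) / 2.
    print(N, idealized_total(n), idealized_average(n), math.log2(N) / 2)
```

Every term in the sum equals $2^{n-1}$, and there are $n$ of them, which is where the total $2^{n-1}n$ comes from.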

Neat little problem, if I do say so myself. Some of my work, some of Harvard's; I hope it was cohesive and clear who was writing what and when. I wish I could have gotten my last piece of explanation down, but it would have taken a bit too long for something I need to redo. Moral of the story: Take complete notes.

If you have any questions or comments, send me an email or leave a comment!

Persuasion is Hard

Adi Mittal

System 1 and System 2 Persuasion Tactics and Their Impacts on Secondary School Absenteeism

This was done as part of the Advanced Authentic Research program during the 2019-2020 academic school year.

Over 7 million students across the United States missed 15+ days of school in the 2015-16 school year (US Department of Education, 2012). These chronically absent students, who miss 10% of their academic year, cause over 40 billion minutes of instruction to go to waste. Even more jarring, the same report cited that inconsistent student attendance is a better indicator than test scores of whether a student will drop out of school or not. In case it wasn't clear enough, student absenteeism is a major issue haunting the education system, and the need to find a solution to it is greater than ever. Over the past 8 months, I have conducted a small series of experiments to try and alter the state of this issue in one public high school in Palo Alto, California.

Introduction

In the case of Palo Alto's Henry M. Gunn High School, only 5% of the school's 2000 students are in this category of the chronically absent. That means on average, 1 student is missing in every class on campus. What makes this statistic concerning though is when one considers the demographic of Palo Alto. Here are two maps of the United States: one of the median household income (New York Public Radio), and another of chronic absentee rates by district (US Department of Education, 2012).

Although Palo Alto is in one of the most affluent neighborhoods in the nation, its chronic absentee rate is equal to that of some areas with not even half its median household income. This led to the idea that absenteeism may be fueled by different motivators across the country. For example, in some less fortunate neighborhoods, kids may be active contributors to their family income, leading to conflicts with their academic commitment. Similarly, because of Palo Alto's wealth and greater access to resources, academic competitiveness may fuel absenteeism. This may seem counterintuitive at first: if someone wants to do well in class, why would they skip it? These strategic absentees skip class not because they want to, but rather as a necessary evil: they feel the need to skip a class as a means to prepare for another one. These are the students that can most affect the absentee rate, as not only are they motivated to go to class, but their absences are likely more sporadic than those of the standard chronically absent student, meaning that they are more likely to be in class to be influenced by a teacher or administrator, which nicely leads into the next section.

Nudges

Students can't be forced to go to class. Especially as Henry M. Gunn High School sports an open-campus policy, it is impossible to force a substantial number of absentees to go to class beyond the already instated measures. So, instead, we tried to persuade the students: if they agree for themselves that attending class is the right thing to do, they are more likely to act on it. To do so, we used specially engineered social measures, known as nudges, to try and convince the students as best as possible.

Nudges, at their core, are suggestions. They don't affect one's ability to choose, but they utilize the person's experience to guide them to pick an option. The most commonly used nudge utilizes comparisons: let's say a restaurant has a dish that doesn't sell super well, and is ultimately costing them money by being listed on the menu. To boost its sales, they can list a slightly cheaper dish that has no intention of actually being sold in large quantities. This "fake" dish provides a reference frame for the buyer, making what originally seemed overpriced suddenly look like a great deal. This type of persuasion was first formalized in a book of the same name: Nudge: Improving Decisions About Health, Wealth, and Happiness, written by Cass Sunstein and Richard Thaler (a highly recommended read). It should be reiterated that this doesn't change one's ability to pick what food item they want, and that's what makes a nudge so effective. It allows the person to convince themselves without needing to feel as if the choice was imposed on them (i.e. by removing everything on the menu except one dish). This is a form of indirect persuasion: we're not explicitly saying what we want the affected person to do, but we are adding information to guide one choice over another. The counterpart to this would be direct persuasion: explicitly stating which option we want chosen.

Experimental Model

This project employed 2 types of persuasion tactics that were each disseminated via 2 types of mediums. The two types of mediums you are already familiar with: direct and indirect persuasion. This was examined by having the teacher give information by describing a set of data (see below) for direct persuasion. Indirect persuasion was tested by having the environment give the information, meaning instead of having the teacher give information on data, the students take in the information for themselves by noticing the data as a poster in the classroom. Now, I have been purposely vague about what data was shown to the students as there were actually two different sets of data that were shown to different classes, but they communicated the same idea.

If these curves look the same, that's because they are; they are the same set of data, but one is presented with a positive connotation (attendance is good) and the other with a negative connotation (absenteeism is bad). These are the two sets of data presented, to see whether the way one presents data affects one's ability to influence.

So, in total, there are 5 classes: $\mathrm{i})$ teacher presents positive connotation data; $\mathrm{ii})$ teacher presents negative connotation data; $\mathrm{iii})$ classroom presents positive connotation data; $\mathrm{iv})$ classroom presents negative connotation data; and lastly, all compared to a $\mathrm{v})$ control with no intervention. (Presenting the same data with a positive or a negative connotation is formally known as framing.)

Results

As you can see, there are 5 lines shown on the graph: 4 lines, 2 blue and 2 red, to represent data collected and a green line that represents the aforementioned Gunn's 5% chronic absentee rate. The blue lines represent tardy and absentee data from January, to collect pre-experimental data. Red lines represent data collected in February, the month in which the experiment was set in motion.

This graph may be a bit intimidating to read, but it helps to realize that the x-axis is not a representation of time like in most line graphs, but rather lists the individual class models. This allows for easy comparison between months: if there was, say, a virus moving throughout the classes that caused a 3% increase in absences in all classes, it would look as if one line was shifted upwards, with more or less the same structure as the other.

Analyses

These results are really surprising, as there was no improvement at all from any implementation of our stimulus. If anything, it got marginally worse, with a slight increase in tardies in all classes ($\approx$3-4% increase). This makes some sense when you consider that whenever someone commands you to do something or says you're doing something wrong, the first instinct is to disagree and defend your actions. This feeling is known as reactance, and it is likely what caused this mild increase in tardies.

What is extremely concerning, however, is the massive increase in absences seen in the "-/teacher" class (negative connotation data presented by teacher), which observed an astonishing 85% increase in absences. That is equivalent to another 90 students absent at Gunn, and an additional 560 across all of the Palo Alto Unified School District. This incites an interesting thought: subconscious effects, such as reactance, can be amplified by other effects as well. In this case, it was the negativity bias (the idea that negative connotations tend to have a greater impact than positive ones) that amplified the reactance. For instance, say you are an avid fan of apples. If someone says oranges are better than apples, reactance will be incurred and you'll likely disagree. If someone says apples are worse than oranges, however, now there is this feeling of losing an opinion as well, amplifying the disagreement. Here, it is the difference between saying attending class is better than skipping it, and saying skipping class is worse than attending it.

Conclusion

These results are highly specific. Before you try to go and apply these ideas beyond the realm of this project, you should consider who was subject to this experiment. In fact, I witnessed this exact concept during my trials: the experimental model was inspired by Moore (2004), who ran an almost identical experiment and found it to be effective for university students, while mine proved ineffective for high schoolers. Further testing is needed, but the greatest takeaway from this experiment is that communication and persuasion can only be achieved when they are very specifically tailored to a specific audience.

If you want to read to a higher degree of depth on the matter, everything cited here and more can be found in my original paper.

Process

Originally, this project was never supposed to be about attendance. Originally, I was intending on studying voting theory as I was (and still am) super interested in how individual behavior affects the collective, and how that can be leveraged. I was looking at networks and graph theory, synchronization, and other related topics, but I was especially looking at behavioral economics and prospect theory. Realizing I was about a year too early to be able to study any recent elections or voting processes, I honed in on the behavioral economics aspect of the research, and started to look for a new problem to address. I talked to Gunn High School's principal on school issues that could be examined, and attendance was a recurring theme. This was only corroborated by the California Healthy Kids Survey, a questionnaire that surveyed 9th and 11th graders each year, and it reported that 7% of freshmen and 11% of juniors cut class to prepare for another class alone in the month the survey was distributed (California Department of Education, 2017-18). So, I started directing my attention to different studies and papers that were already conducted to learn what ideas have been tried and tested, such as what Moore (2004) and Self (2012) did.

Very early on, however, I realized I was probably going to have the same issue that I was going to have with voting theory: gathering original experimental data. Not that there is anything wrong with using pre-existing data, but I personally wanted to collect my own original data to analyze, especially as I wasn't sure if something like attendance would be internalized the same way in a high school population as in the more frequently studied college population (and as stated previously, there was in fact a discrepancy between Moore's and my findings due to the different populations studied). So, I began reaching out to different teachers to see who would be willing to help run the experiment in their classes. Doing so proved to be very difficult, as I needed a teacher who taught at least 5 classes of 20+ students that had some absentees in each class, who also had the class time to be able to explain the necessary information to 2 classes. I was able to properly contact 2 teachers, one of whom I was able to collect data for, and the other I was not due to the sudden COVID-19 outbreak.

Another thing that made this project difficult was that I had an incredible workload set out for this year. Between the 5 required academic classes, 2 electives, an after-school elective, club commitments, and a sport for two-thirds of the year, I just didn't have the time in my schedule to devote another 2+ hours of class time, which didn't even include time needed outside of class to research and write. Not to mention any extracurriculars I had in place as well. I had to schedule almost all of my meetings at 7am or earlier, or worse, do them entirely across email threads, which made communication ambiguous and difficult at times.

Regardless, this whole experience taught me so much about academia beyond the scope of what any high school classroom could have, and taught me that no matter how simple a question one has, being willing to ask it can lead to incredible results.

About AAR

As stated at the top, this was done as part of the Advanced Authentic Research program that PAUSD provides to its two high schools as a means to introduce students to formal research and academic writing, well beyond what a standard English class essay or chemistry lab write-up teaches. Providing students with community mentors, experts, and connections, it fosters student growth via the students' own motivation to learn, creating an environment where projects, such as the former, can be created out of curiosity, and not by seeking a letter on a report card.

Golden Quartics

Adi Mittal

The hidden relationship of quartics and phi

This post looks to describe an interesting property intrinsic to any and all quartic functions, and it has to do with the relationship between the functions' inflection points. Below is a Desmos graph of $f(x)=Ax^4+Bx^3+Cx^2+Dx+E$, the general quartic equation, with its 2 inflection points $P$ and $Q$ labeled. A third point $R$ is labeled as well, which is a further point where the line through $P$ and $Q$ intersects $f(x)$. What we are interested in today is the ratio $\frac{PR}{PQ}$. Play with the graph below to vary $f(x)$ and see what happens to that ratio.

Quickly you will notice that for (most) non-zero $A$ values, $\frac{PR}{PQ}$ always remains at the rather famous constant, $\varphi=1.61803...$, the golden ratio. This may seem coincidental, but there is a rather nice way of proving that this ratio is exactly equal to the golden ratio.

The Golden Ratio

The classic definition of $\varphi$ comes from a specific geometric construction of a rectangle.

In this golden rectangle, there are two rectangles to focus on: the large one with aspect ratio $\frac{a+b}{a}$, and the smaller red one with aspect ratio $\frac{a}{b}$. The golden ratio is given by $\frac{a}{b}$ when the small red rectangle has the same aspect ratio as the larger rectangle (made up of the blue square and red rectangle). Letting $\varphi=\frac{a}{b}$, and setting the ratios equal to each other nets us:

$\begin{align} \frac{a+b}{a} & = \frac{a}{b} \ \newline 1+\frac{b}{a} & = \frac{a}{b} \ \newline 1+\frac{1}{\varphi} & = \varphi \ \newline \varphi+1 & = \varphi^2 \ \newline \varphi^2-\varphi-1 & = 0 \ \newline \end{align}$

$\large{\varphi = \large{\frac{1 \pm \sqrt{5}}{2}}}$

The positive solution to this quadratic is the more well known value $\varphi$. Taking some variations of the previous equations can net other interesting relationships that $\varphi$ pertains to. For example, taking the third line from the derivation of $\varphi$ nets a recursive, cyclic definition of the golden ratio. Expanding out the relation gives another famous definition of $\varphi$.

$\varphi = 1 + \large{ \frac{1}{1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{\ddots}}}} }$

An infinite descending fraction solely containing 1s. Taking a variation of the fourth line also gives an interesting appearance of $\varphi$.

$\varphi = \sqrt{1 + \sqrt{1 + \sqrt{1 + \sqrt{1 + \ldots}}}}$

An infinitely nested radical solely containing 1s. Notice, however, that the solution to the golden ratio has a negative counterpart as well: $1-\varphi=-.61803...$. Although it may seem nonsensical to assign a negative value to many of the expressions we used in defining $\varphi$, this value shares many of the same properties that $\varphi$ has, and the reason we don't see it as often has to do with the volatility of the value in these iterated scenarios, but that's for another time.
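
Both infinite expressions can be approximated by cutting them off at a finite depth. The Python sketch below is only an illustration (the function names and the starting value of 1 are assumptions of this sketch): it repeatedly applies $x \mapsto 1 + \frac{1}{x}$ for the fraction and $x \mapsto \sqrt{1+x}$ for the radical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def continued_fraction(depth, start=1.0):
    """Apply x -> 1 + 1/x `depth` times, mimicking the descending fraction."""
    x = start
    for _ in range(depth):
        x = 1 + 1 / x
    return x

def nested_radical(depth, start=1.0):
    """Apply x -> sqrt(1 + x) `depth` times, mimicking the nested radical."""
    x = start
    for _ in range(depth):
        x = math.sqrt(1 + x)
    return x

for depth in (1, 5, 10, 40):
    # Both columns drift toward PHI as the depth grows.
    print(depth, continued_fraction(depth), nested_radical(depth), PHI)
```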

The Proof

First, let's look at how to find the inflection points of a quartic. Inflection points are the points along a function where its concavity changes. I.e., if you look at the tangent lines along a curve as you vary the input $x$, the tangent lines' slopes will change. The inflection points are found where the slopes switch between increasing and decreasing. Take the function $x^3$, for example.

Notice how as we let our value $a$ increase, the slope of our tangent line — the first derivative of $f(x)$ — decreases from $-1.5$ to $0$. But from $0$ to $1.5$, the slope begins to increase. This is all visualized in our graph $f'(x)$ which plots every point $x$ and the value of its slope at $f(x)$. One can clearly see $f'(x)$ tends in a downward manner initially, before rising again. And for $f'(x)$ to have a slope that's first negative (decreasing) then positive (increasing), it must have a slope of zero in between. So, our point where our concavity changes is when the slope of $f'(x)$ equals 0. In other words, when the second derivative $f''(x)=0$. Here we can see it clearly visualized at the solution $x=0$, which confirms all of our previous observations. Doing so for any general quartic nets us:

$\begin{align} f(x) & = Ax^4+Bx^3+Cx^2+Dx+E \ \newline f'(x) & = 4Ax^3+3Bx^2+2Cx+D \ \newline f''(x) & = 12Ax^2+6Bx+2C = 0 \end{align}$

As this is a degree 2 polynomial, the quadratic formula quickly gives our two solutions for $x$ in general, which we will call $p$ and $q$.

$p = \large{ \frac{-3B-\sqrt{9B^2-24AC}}{12A} }$ $q = \large{ \frac{-3B+\sqrt{9B^2-24AC}}{12A} }$

This also explains why only most values of our constants have inflection points, as if the $9B^2-24AC$ term is negative, it results in an imaginary solution, meaning no inflection point is found within the real plane. With valid constants giving us solutions for our inflection points $P$ and $Q$ respectively, the line through them can quickly be written as:

$g(x)=\frac{f(q)-f(p)}{q-p}(x-p)+f(p)$

The intersection point $R$ can be found solving for when $f(x)=g(x)$, or in other words, when $f(x)-g(x)=0$

$f(x)-g(x)=0$
$Ax^4+Bx^3+Cx^2+Dx+E-\frac{f(q)-f(p)}{q-p}(x-p)-f(p) = 0$

One can try to factor and work this out, but there is a much nicer approach that avoids working with this messy equation.

Transformations

If we limit our transformations to purely scaling and translating our graph, all of our ratios will remain equivalent. So if we can find a set of transformations to make our work easier, we will still be able to prove our initial proposition, but in a much easier way. To (re)start, we're going to define a new function $h(x)$ that takes $f(x)$, and scales and moves it around as follows:

$h(x) = \frac{1}{q-p}(f([q-p]x+p)-f(p))$

This may seem arbitrary, but keeping in mind what $p$ and $q$ mean, this transformation alters the graph in a rather specific and useful way. First notice these two key components in our transformation:

$h(x) = \frac{1}{q-p}(f([q-p]x+{\bf p})-{\bf f(p)})$

This results in shifting the graph over to the left $p$ units and down $f(p)$ units. Or more clearly, it takes our first inflection point $(p,f(p)) \rightarrow (0,0)$, the origin. We'll refer to the origin as $P'$. Now, let's look at the remaining components of the transformation:

$h(x) = {\bf \frac{1}{q-p} }(f({\bf [q-p]}x+p)-f(p))$

Multiplying $x$ by $q-p$ results in compressing the $x$-axis by a factor of $q-p$. So, the x coordinate distance between our inflection points is condensed from a length of $q-p$ to a length $\frac{q-p}{q-p} = 1$. Just to keep our scaling consistent throughout $f(x)$, we also scale the $y$-axis down by a factor $q-p$, so we add an extra factor of $\frac{1}{q-p}$. This factor is almost purely for aesthetic purposes, as you will see it will preserve the structure of our graphs and make it easier to see our scaled copy of $f(x)$ in $h(x)$. So, as the difference in $x$ coordinates between $P'$ and $Q'$ is 1, $Q'$ will be at $(1,h(1))$. $R'$ will retain its same definition as $R$, differing only in that it is on our newly transformed function.
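
To see the transformation do what is claimed, here is a small Python sketch under stated assumptions: the coefficients in the example are arbitrary, and the helper names are made up for this illustration. It builds $h(x)$ from $f(x)$ and confirms that $h(0)=0$ and that $h''$ vanishes at $x=0$ and $x=1$, using the chain rule $h''(x) = (q-p)\,f''([q-p]x+p)$.

```python
import math

def inflection_xs(A, B, C):
    """x-coordinates p, q where f''(x) = 12Ax^2 + 6Bx + 2C is zero."""
    disc = 9 * B**2 - 24 * A * C
    if A == 0 or disc <= 0:
        raise ValueError("no two distinct real inflection points")
    root = math.sqrt(disc)
    return (-3 * B - root) / (12 * A), (-3 * B + root) / (12 * A)

def make_h(A, B, C, D, E):
    """Return h and its second derivative for the quartic with these constants."""
    def f(x):
        return A * x**4 + B * x**3 + C * x**2 + D * x + E

    def f2(x):
        return 12 * A * x**2 + 6 * B * x + 2 * C

    p, q = inflection_xs(A, B, C)

    def h(x):
        return (f((q - p) * x + p) - f(p)) / (q - p)

    def h2(x):
        # Two chain-rule factors of (q - p), divided by the outer (q - p).
        return (q - p) * f2((q - p) * x + p)

    return h, h2

h, h2 = make_h(1, -2, -3, 4, 5)  # arbitrary example coefficients
print(h(0), h2(0), h2(1))        # all three should be (numerically) zero
```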

Notice how the two inflection lines are parallel. That is due to that extra factor of $\frac{1}{q-p}$ in $h(x)$, but note that the math that follows is not dependent on it.

It's worth noting that we don't actually know any of the constants that shape our new quartic $h(x)=ax^4+bx^3+cx^2+dx+e$, as they are not simply the old constants multiplied by our scaling factors (notice the change in capitalization; these new constants for $h(x)$ are separate from those of $f(x)$). However, we do know the solutions to $h''(x)=0$. Instead of using our function to find its second derivative like we did in our original approach, we are working backwards from our second derivative to narrow in on our function. Since we know where our inflection points are, we can rewrite our $h''(x)$ as a product of factors.

$h''(x)=0 \rightarrow x=P',Q' \rightarrow x=0,1 \rightarrow h''(x)=12ax(x-1)$

The factor of $12a$ comes from the leading term when taking the second derivative of any general quartic, as we saw in the original attempt to prove this. Expanding this expression and integrating twice gives us:

$\begin{align} h''(x) & = 12ax(x-1)=12ax^2-12ax \ \newline h'(x) & = 4ax^3-6ax^2+b \ \newline h(x) & = ax^4-2ax^3+bx \end{align}$

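Notice I didn't add a new constant after the second integration, as that is equivalent to the $y$-intercept, which we know to be $0$ since the graph passes through $(0,0)$. (The $b$ here is the constant from the first integration, so it plays the role of the linear coefficient of $h(x)$.) Now that we have $h(x)$ written in terms of its own constants, separated from $f(x)$, we can easily find $h(1)$ and with it the coordinates of $Q'$.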

$h(1)=a(1)^4-2a(1)^3+b(1)=b-a \rightarrow Q':(1,b-a)$

Now we can create a new secant line $g(x)$ to pass through our two inflection points, $P':(0,0)$ and $Q':(1,b-a)$.

$g(x)=\frac{b-a-0}{1-0}(x-0)+0=(b-a)x$

Now we can continue using our original method, which is to find all solutions to $h(x)-g(x)=0$. Only this time, our transformations should net a cleaner equation.

$\begin{align} h(x)-g(x) & = 0 \ \newline ax^4-2ax^3+bx-(b-a)x & = 0 \ \newline ax^4-2ax^3+bx-bx+ax & = 0 \ \newline ax^4-2ax^3+ax & = 0 \ \newline ax(x^3-2x^2+1) & = 0 \end{align}$

That $ax$ we factored out is our solution at $x=0$, or $P'$, which we used to construct the line in the first place. Similarly, because we used $Q'$ to construct the line as well at $x=1$, we can factor out an $x-1$ as well.

$\begin{align} ax(x^3-2x^2+1) & = 0 \ \newline ax(x-1)(x^2-x-1) & = 0 \end{align}$
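
The factorization is easy to double-check by multiplying the two factors back together. A minimal Python sketch (the helper `poly_mul` is written just for this check):

```python
def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists, highest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# (x - 1)(x^2 - x - 1), which should match x^3 - 2x^2 + 0x + 1
product = poly_mul([1, -1], [1, -1, -1])
print(product)
```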

That last factor is the exact quadratic that we derived to define the golden ratio. Knowing that, we now have all of our solutions to the intersection points between our quartic and secant line.

$x=0,1,\large{ \frac{1 \pm \sqrt{5}}{2} }$

The negative solution to the golden ratio here is the fourth point of intersection at $S:(s,h(s))$ with $s<0$. Now the last thing to note is that our 3 points of interest, $P'$, $Q'$, and $R'$, are all collinear. So, they can be thought of as a projection of the $x$-axis to a sloped line that scales how far they are spaced apart. However, since this is multiplicative, the ratios will be the same, so we only need to look at the ratios between their $x$ coordinates.

$\large{ \frac{PR}{PQ} = \large{ \frac{P'R'}{P'Q'} } = \frac{\frac{1 + \sqrt{5}}{2} - 0}{1 - 0} = \varphi }$

You can also quickly find other ratios of different lengths and find other interesting connections. Take $\frac{PQ}{QR}$, for example.

$\large{ \frac{PQ}{QR} = \large{ \frac{P'Q'}{Q'R'} } = \frac{1 - 0}{\frac{1 + \sqrt{5}}{2} - 1} = \frac{1}{\varphi-1} }$

If you look at our defining quadratic $\varphi^2-\varphi-1=0$, it can be rewritten as $\varphi(\varphi-1)=1 \rightarrow \varphi=\frac{1}{\varphi-1}$. Completing our expression gives us:

$\large{ \frac{PQ}{QR} = \large{ \frac{P'Q'}{Q'R'} } = \frac{1 - 0}{\frac{1 + \sqrt{5}}{2} - 1} = \frac{1}{\varphi-1} = \varphi}$
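
For anyone who would rather test this numerically than trust the algebra, here is a hedged Python sketch. The coefficients are arbitrary examples and the function name is hypothetical. It places $R$ at $x = p + \varphi(q-p)$, checks that the quartic and the line through $P$ and $Q$ really do meet there, and then measures $\frac{PR}{PQ}$ from the actual coordinates.

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def golden_check(A, B, C, D, E):
    """Return (f(r) - g(r), PR / PQ) for the quartic with these constants."""
    def f(x):
        return A * x**4 + B * x**3 + C * x**2 + D * x + E

    disc = 9 * B**2 - 24 * A * C
    if A == 0 or disc <= 0:
        raise ValueError("no two distinct real inflection points")
    root = math.sqrt(disc)
    p = (-3 * B - root) / (12 * A)
    q = (-3 * B + root) / (12 * A)

    def g(x):
        # Line through the inflection points P and Q.
        return (f(q) - f(p)) / (q - p) * (x - p) + f(p)

    r = p + PHI * (q - p)           # claimed x-coordinate of R
    residual = f(r) - g(r)          # zero if R lies on both curve and line
    pr = math.hypot(r - p, f(r) - f(p))
    pq = math.hypot(q - p, f(q) - f(p))
    return residual, pr / pq

print(golden_check(1, -2, -3, 4, 5))
print(golden_check(-0.5, 1, 2, -7, 3))
```

In both cases the first number should be zero up to rounding error, and the second should match $\varphi$.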

Just as our golden rectangle previously foretold.

Retroreflectors

Adi Mittal

The utility of right angles

If you've ever seen a bike at night, you've likely noticed the bright reflector many people use to ensure they're visible while riding. Why are they so effective at creating such visibility? The answer lies in the construction of the reflector itself. Looking closely at a bicycle reflector, you will notice that they aren't just plain mirrored facets; they have an almost pixelated, grid-like look to them.

They are surfaces not covered in flat mirrors, but rather are tessellated with the corners of cubes that are mirrored. Why is that? To find out, we first need to talk about Fermat's principle, and $90^\circ$ angles.

Fermat's Principle

Fermat's principle, or the principle of least time, was an idea coined in 1662 by the mathematician of the same name, and it states that the path taken by any given ray of light is always the quickest one. Although this may seem obvious, it allows for many properties of light and optics to be derived from it. The one that it helps demonstrate for us is the Law of Reflection: the angle at which light approaches a surface is the same angle at which it reflects.

Let's say we have a light source $S$, and we're reflecting it off a mirror (black) at point $R$, to have our ray reach an end point $E$. To show that the angle of incidence must equal the angle of reflection, we are going to create a mirrored copy of our end point, $E'$ (points $P$ and $Q$ are exclusively reference points). As $E'$ is a reflection of $E$ across the mirror, they are both equidistant from $R$, so we end up with two orange lines of equal length, $\overline{RE}$ and $\overline{RE'}$. Because $\overline{RE} = \overline{RE'}$, our original path of reflection $SRE$ can be modeled with the new path $SRE'$. Note that the speed of the light isn't changing throughout our model, so the quickest path is the shortest one, and we only need to find the shortest path from $S$ to $E'$. That shortest path is clearly just a straight line (blue). We already knew that $\angle{ERQ} = \angle{E'RQ}$ by definition of the reflection $E \rightarrow E'$, and now that we know $\overline{SE'}$ is a straight line, we also have $\angle{SRP} = \angle{E'RQ}$, as they are vertical angles. Combining these two equalities nets us $\angle{SRP} = \angle{E'RQ} = \angle{ERQ}$, which was what we wanted to show.
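
The same conclusion can be reached by brute force. The Python sketch below is an illustration under assumptions of its own: the mirror is the $x$-axis, the points $S=(0,1)$ and $E=(4,3)$ are arbitrary, and the function names are made up. It searches for the reflection point $R=(x,0)$ that minimizes the total path length, then compares the two angles the ray makes with the mirror.

```python
import math

def path_length(x, S, E):
    """Length of the path S -> (x, 0) -> E, with the mirror on the x-axis."""
    return math.hypot(x - S[0], S[1]) + math.hypot(E[0] - x, E[1])

def best_reflection_point(S, E, iterations=200):
    """Ternary search for the x that minimizes the path length."""
    lo, hi = min(S[0], E[0]), max(S[0], E[0])
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if path_length(m1, S, E) < path_length(m2, S, E):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def mirror_angles(x, S, E):
    """Angles (in degrees) between the mirror and each leg of the path."""
    incoming = math.degrees(math.atan2(S[1], abs(x - S[0])))
    outgoing = math.degrees(math.atan2(E[1], abs(E[0] - x)))
    return incoming, outgoing

S, E = (0.0, 1.0), (4.0, 3.0)
x = best_reflection_point(S, E)
print(x, mirror_angles(x, S, E))
```

Ternary search works here because the path length is a convex function of $x$; any other one-dimensional minimizer would do just as well.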

Although this seems like an obvious fact, knowing why it is true helps to understand how we will apply it to our bike reflector and corner cubes.

Right Angles

To understand why corner cubes are chosen as the structure of bike reflectors, it helps to look at simpler cases first. Instead of looking at corners of cubes to see how light interacts with them, we can first work from the corner of a square and see what happens.

Notice how regardless of the angle at which the light hits the corner, the light reflected from the corner is always parallel to the ray entering it. We can prove this remains true for any angle $\alpha$ quite simply using some basic geometry.

We want to show that ray $\overrightarrow{M}$ is parallel to $\overrightarrow{N}$, given that $\overrightarrow{M}$ meets the corner at an angle $\alpha$ and that we have a true square corner, that is, a right angle. Filling in the givens, the rest follows nicely. The Law of Reflection gives the angle congruent to the initial $\alpha$, and the fact that every triangle's angles sum to $180^\circ$ gives the $90-\alpha$. The trick in finishing the proof is to add an auxiliary line.

We add another line parallel to one of the sides of our corner. This creates another right angle. Since we know that $90-\alpha$ makes up part of the right angle, we know that $\alpha$ must make up the rest of it, as $90-\alpha+\alpha=90$. By the Law of Reflection we then know that there is a symmetrical angle of measure $\alpha$. Now since $\overrightarrow{M}$ and $\overrightarrow{N}$ both meet parallel lines at congruent angles, the only way that can happen is if $\overrightarrow{M}$ is parallel to $\overrightarrow{N}$ as well. Hence, a ray $\overrightarrow{M}$ has a reflected path $\overrightarrow{N}$ that exits parallel to its ray of incidence.

Moreover, we can show this only holds true for right angles using very similar logic. Setting our once right angle to $\theta$...

From our diagram, it's clear that for $\overrightarrow{M}$ to be parallel to $\overrightarrow{N}$, $\alpha=\alpha+\theta-90$, which when solving for $\theta$ gives $\theta=90$, our previous right angle.
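To sanity check both claims (a right-angle corner always sends the ray back parallel, and other corner angles don't), here is a small Python sketch that only tracks the direction of the ray. The setup is an assumption of mine, not something from the diagrams: the first mirror lies along the $x$-axis, the second is tilted $\theta$ from it, the ray is assumed to strike each mirror exactly once, and `reflect` and `alignment_after_corner` are illustrative names. It uses the standard vector form of the Law of Reflection, $d' = d - 2(d \cdot n)n$ for a unit normal $n$.

```python
import math

def reflect(d, n):
    """Reflect direction d off a mirror with unit normal n: d - 2(d.n)n."""
    dot = d[0] * n[0] + d[1] * n[1]
    return (d[0] - 2 * dot * n[0], d[1] - 2 * dot * n[1])

def alignment_after_corner(alpha_deg, theta_deg):
    """Dot product of the incoming and outgoing unit directions (-1 means sent straight back)."""
    a, t = math.radians(alpha_deg), math.radians(theta_deg)
    incoming = (-math.cos(a), -math.sin(a))  # meets the first mirror (the x-axis) at angle alpha
    n1 = (0.0, 1.0)                          # unit normal of the first mirror
    n2 = (-math.sin(t), math.cos(t))         # unit normal of the second mirror, tilted theta from the first
    outgoing = reflect(reflect(incoming, n1), n2)
    return incoming[0] * outgoing[0] + incoming[1] * outgoing[1]

for alpha in (10, 35, 60, 80):
    print(alpha, alignment_after_corner(alpha, 90), alignment_after_corner(alpha, 80))
```

With $\theta = 90$ the dot product comes out as $-1$ (up to rounding) for every $\alpha$, meaning the exit ray is parallel to the entry ray and pointing back the way it came. With $\theta = 80$ it is about $-0.94$ no matter what $\alpha$ is: the two reflections together rotate the ray by $2\theta$, so the alignment is $\cos 2\theta$, which only equals $-1$ at a right angle.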

All of the previous arguments can be applied to the 3-dimensional case by decomposing the ray of light into two other rays and showing that, since each of those exits parallel to how it entered, the composite ray does as well. With all of this together, it makes perfect sense why bike reflectors are corners of cubes: they send light back to its source. If you had a standard mirror, no light would return to where it came from unless it hit the mirror perfectly perpendicularly.
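Another way to convince yourself of the 3-dimensional case is with vectors rather than the decomposition argument. This is a sketch under my own assumptions: the three mirrored faces of the corner are taken to be the coordinate planes, the ray is assumed to hit each face exactly once, and the names `reflect`, `corner_cube_exit`, and `flat_mirror_exit` are made up for illustration. Each face flips exactly one component of the ray's direction, so after all three the direction is fully reversed.

```python
def reflect(d, n):
    """Reflect direction d off a mirror with unit normal n: d - 2(d.n)n."""
    dot = sum(di * ni for di, ni in zip(d, n))
    return tuple(di - 2 * dot * ni for di, ni in zip(d, n))

# The three mirrored faces of a cube corner, described by their unit normals.
CORNER_NORMALS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

def corner_cube_exit(d):
    """Direction of a ray after bouncing once off each face of the corner."""
    out = d
    for n in CORNER_NORMALS:
        out = reflect(out, n)
    return out

def flat_mirror_exit(d):
    """Direction of the same ray after bouncing off a single flat mirror facing +z."""
    return reflect(d, (0.0, 0.0, 1.0))

ray = (-0.3, -0.5, -0.8)  # hypothetical incoming direction, heading into the corner
print(corner_cube_exit(ray))
print(flat_mirror_exit(ray))
```

The corner cube returns the exact opposite of `ray`, while the flat mirror only flips the component perpendicular to it, so its reflected light heads off somewhere else unless the ray came in straight along the normal.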

If no light goes back to its source, to say, a car's headlights, no light will hit the driver's eye to indicate that there is a bike up ahead (for exactly this reason, most reflectors actually have angles slightly larger than $90^\circ$ so that most light returns to its source while some scatters to an observer slightly above, below, left, or right of the source). These reflectors actually have a specific name: they're known as retroreflectors, literally meaning to reflect backwards. This concept has been leveraged to aid satellites, and indirectly the military. There's a reason stealth-based aerial technology avoids right angles: designers want to avoid creating an accidental retroreflector that returns radio waves to their source.

Hopefully this gave insight into a seemingly arbitrary design choice in one of the most common bike accessories used today.

Mad Max: Fury Road

Adi Mittal

Yes, the flamethrowing guitar was necessary

This post is a collection of essays analyzing scenes, themes, dialogue, and scoring of George Miller's 2015 120-minute action-packed car chase of a movie, Mad Max: Fury Road. It is truly one of my favorite raw action movies I've seen in a long time, as it knows exactly what it is: an action movie. It delivers in spades what every action enthusiast craves, with some of the wildest practical effects seen to date, yet it still manages to have tender moments and to let characters develop through the action rather than through corny moments or explosions for the sake of explosions. If you have not seen it, do what you can to do so; even if you're not a fan of action movies, Fury Road transcends the category and delivers so much more.

You should also watch it as these essays assume: a) that you know the movie so you don't walk into thousands of words of spoilers, and b) that you are familiar with some of the terminology introduced in the film as well.

To start, let's look at the end.

The End…?

The ending could be a sort of allusion to the western genre as a whole (after all, the film is commonly referred to as a “western on wheels”). Having found his name and fulfilled his moral duties to the extent he feels is needed, Max leaves as the lone ranger he started as, moving on to his next wandering adventure. This fits with how Max blends into the crowd, illustrating how he is just like anyone in that crowd: seeking a purpose. Having now fulfilled that purpose, he walks in the opposite direction from everyone else, no longer looking to the Citadel or Furiosa for hope, but walking away as he has already been given hope by them. This is furthered by the final quote presented by The First History Man at the film’s conclusion: “Where must we go, we who wander this wasteland, in search of our better selves?” This implies that the place to seek to better oneself doesn’t exist, or is unknown at the very least, which ties into one of the central themes of the movie: redemption is a self-realizing process, not one induced by a place or material object, but rather conducted through places, objects, and people. They can be mediums, but not the source of redemption. Nux, having abandoned his technological faith in Immortan Joe, is a key example of the process, as he takes the phrase “Witness me!” to be one not of sacrificing himself, but rather of remembrance, asking the Wives he’s befriended to remember him for the person he was rather than for his death on their behalf.

Returning to Max’s extraneous departure from the Citadel: having found redemption and reacquainted himself with his past, Max feels no obligation to stay. While the Citadel, the Wives, Furiosa, and Nux were his mediums of redemption, they weren’t a part of what he was redeeming; his quest was purely for himself and his past, ideas and memories unassociated with the movie’s setting. It’s the parallels that he inevitably sees in the other characters that draw him to help. The other characters, such as the aforementioned Nux, do have their own redemption arcs, but these tie in directly to the immediate setting, which gives them motive beyond personal reasons to stay; they’re not only redeeming themselves, but the land itself (note that these should be treated as separate actions; intertwined as they are, for Nux, Furiosa, and the Wives, redeeming the Citadel is more of a byproduct of personal growth that was conducted through their reclaiming of the stronghold). While Max and the rest of the cast are all given hope from the present and their actions through the movie, what differentiates them is how they leverage this hope to recontextualize their lives: Max feels he has atoned and repented for his guilt-ridden past, while the rest have their futures recontextualized, now knowing that they no longer lead a life of forced repression. Their repression likely wasn’t what induced their guilt; rather, knowing that they had little hope of a better life at all is likely what made them feel powerless, and guilty that they hadn’t attempted to advocate for a more moral society.
But, because they only ever had the thought and never the physical opportunity, there isn’t anything in their previous lives that they would be able to perceive differently or acknowledge: they did what they could, and they only have the future to look forward to. We see this time and time again in the intermittent sequences in which the cast breaks from the intensity of the chase scenes: Furiosa is seeking to return to her dearly beloved home, the Green Place; the Wives are motivated, almost haunted, by the prospect of a haven away from Immortan Joe’s abuse (I phrase the Green Place like this as they have no concrete image of it, and are seeking the concept, not necessarily a specific place); Nux is seeking technological salvation, and to make his half-life existence meaningful (similarly, Valhalla is phrased like this, as he’s motivated by the concept, not the place; he’s only ever had it described to him and has never seen it). These are all motives that they envision in the future for themselves, not something they are gripping on to from the past. It just so happens that in the narrative, each character’s sanctum sanctorum converges on the Citadel over the course of their character growth, as they realize that they can only find comfort in reforming their personas; so while the direct action they take switches from materialistic to impersonal, the motivation remains the same: a prospect of the future. This is why everyone but Max staying is so crucial, as it maintains the consistency of the messaging and the individual character plots we’ve been presented with throughout the course of the movie. The film’s ending is merely a means to help individualize Max’s personal journey from the thematic and character development that underpins the work’s message and structure.

My last thought on the end is that it could potentially hint that the events of this film weren’t necessarily completely true, but rather a myth or legend passed on. The idea of there being a first history man implies there are others, too, all of whom pass on and carry stories such as the one told in Fury Road, similar to how in Mad Max 2: The Road Warrior, the narrator of the film wasn’t Max at all, but rather a survivor who was helped by him. So perhaps the ending quote not only explains why Max left, but also how everyone interpreted his leaving after all they had endured: Max is a legend looking for his next outpost to wander into, to better another aspect of his life that has yet to be revealed to us. Just as old cliché westerns return time and time again to the wandering, lone hero who’s been mythologized and solidified in glorious memory, perhaps Fury Road intended for Max to represent just that: a memory to inspire. This would still maintain all of the previously discussed messaging, as whether he diegetically existed as the movie presents or was just a fable, his story of redemption still maintains the inspirational quality that he is remembered for anyway. It’s only the number of people it reaches that changes.

Feminism and the Vuvalini

There is definitely a feminist connotation and tone throughout the film, but feminism is not the complete term based on its etymology. To be concrete from the start, according to Merriam-Webster, the term “feminism” is used to describe “the theory of political, economic, and social equality of the sexes; organizing activity on behalf of women’s rights and interests.” Even though the first definition is gender neutral, based on the word’s etymology and its popular association, this essay will be referring to the second definition of feminism throughout, and it’s the gender-specific terminology that prevents classifying this film as a feminist one. While it’s perfectly reasonable to say that this film exhibits feminist messaging, it’s more justified to argue that this is a more specific case, tailored to our own social context, that is being related to the film, rather than the broader, more powerful message at hand. Given our, the audience’s, perspective on our own society’s development and history with gender inequality, a male lead antagonist opposed by extremely powerful women protagonists makes for prime conditions to further a feminist message. The issue, however, is that these exact messages function almost equivalently with the genders swapped: if Immortan Joe were female and the Wives and other protagonists were primarily male, the film could still be classified as feminist, which may seem contradictory until acknowledging that the innate messaging surrounds not just women, but humanity as a whole.
Take the standard woman in Immortan Joe’s world: women are treated essentially as rape victims, being sexually abused solely for their fertility; they’re reduced to an object with the single purpose of forcibly bringing life to Immortan’s subjects and the Citadel (this association of fertility and women is supported by the Vuvalini’s own Keeper of the Seeds, who seeks to plant and sprout a new life for the Green Place). So, the contrast with the extremely powerful women of the nomadic Vuvalini, the Wives, and Furiosa, who end up conquering Immortan Joe in the end, coupled with one of the final shots of the women releasing the “aqua-cola” to the people, makes for a very strong feminist message, and in that sense it is understandable that some critics characterize this film as feminist. However, it’s hard to ignore the parallels in the male characters’ portrayals throughout the film. Even before the title card has played, Max is similarly bound and exploited for his benefit to society, which instead of fertility is a healthy supply of O- universally donatable blood for Immortan’s War Boys, who themselves are exploited in their own ways. The War Boys are exploited not for their ability to give life but for their ability to take it and execute Immortan’s visions of destruction. More important than the roles themselves, however, is their relation to the characters, as these roles even reverse for Furiosa and Max by the end: Max becomes inducted into the Vuvalini and even heals Furiosa, performing a blood transfusion on her of his own volition, while Furiosa is the one to ultimately kill Immortan Joe.
This intermixing of the movie’s established gender roles further blurs the lines as to which gender is being empowered more, essentially removing the need for gender to exist as a category for the characters at all; what defined women and their exploitation, and hence their motive to revolt in an empowered fashion, gets reassociated with the opposite gender, making it hard to define as a distinct trait, and hence more apt to view as role-specific retaliations and revolts that are part of a greater collective body of both genders. Similar parallels can be found between major sacrifices in the movie: Nux, from the War Boys, sacrifices himself specifically to give the women a chance to lead a longer, more fulfilling life (his care for Capable implies this mattered more to him than ending the lives of Immortan’s soldiers), while many of the Vuvalini give their lives to end Immortan’s army, as it stands diametrically opposed to their own morals of community (see the scene where Max convinces the group to redirect themselves towards attacking the Citadel). There’s a constant defying of expectations to mirror and reverse supposed gender-specific roles within the film to unify the genders as one. So while there is in fact a large component that empowers women, due to the parallels that the men’s stories provide, it’s more accurate to say that this is a movie about empowering humanity as a whole above the tyranny and unethical practices that Immortan Joe embodies (it’s not surprising that Immortan Joe’s name echoes not only the word “immortal” but also “immoral”). Even the name of the stronghold the protagonists reclaim, the Citadel, deeply echoes not just the despot they’re overthrowing, but the entire symbol of oppression: it’s not just a fortress dominating people, it is an empire dominating the humanity of the people.

One fact that might make the message a bit more women-centric, though, is the naming of the Wives: Angharad (a Welsh name meaning “much loved one”), The Dag (Australian slang for “funny/amusing”), Toast the Knowing, Cheedo the Fragile, and Capable, each seemingly named after a core value they embody: charismatic and leader-like, comedic, wise, fragile, and confident, respectively. With each embodying an extremely distinct human quality, it’s hard not to see how they personify humanity as a group. Combined with their distinctly light, (relatively) elegant clothing, the group is also clearly hope incarnate, which can be seen most distinctly during the nighttime Green Place car chase scene with the Bullet Farmer, where they are the ones holding the only light source among the crew of the War Rig, contrasting the deep, unsettling emptiness of the exhausted, corrupted Green Place. However, when interpreted like this, it’s worth noting how the other characters perceive the Wives: they are the ultimate prize. Immortan Joe seeks them to bear healthy children, so in a literal sense, the Wives are the ultimate material possession for him. Max, Furiosa, Nux, and the other aiding protagonists, however, are not drawn to the Wives for their physical traits, but rather because of what they represent; they want to protect the Wives as they are the characteristics of humanity they all seek to restore in themselves, which furthers the lack of gender as a needed category of the theme. When perceived as symbols, the Wives take on a much greater presence in the film as the “MacGuffin” that everyone seeks to find and protect as a medium to try and redeem some humanity within themselves; the core values they represent end up being a universally sought-after set of appreciable qualities.

Furiosa's Story

Furiosa was born into the Green Place as the daughter of Mary Jabassa of the tribe Swaddle Dog of the Vuvalini, being taught and trained by her “initiate mother” K.T. Concannon. There, in her matriarchal society, she was taught to value her relationships and who she was in this tribe of mothers. However, she was abducted – stolen – from her home by Immortan Joe and taken to the Citadel along with her mother, who died within 3 days of captivity. Immortan Joe took in Furiosa as one of his new wives, seeking for her to bear him a healthy son. Unable to successfully impregnate her with a possible heir, Immortan Joe had no use for her in his vault as a breeder. Unwilling to watch a possible “resource” go underutilized, Immortan Joe gave Furiosa to one of his Imperators, a high-ranking military officer who commands one of his invaluable War Rigs. Constantly exposed to war and automotive technology, Furiosa became an experienced and newly indispensable asset, going from one unable to give life to others to one able to quickly and efficiently take it. But experience can only provide so much benefit without repercussions: she lost an arm in combat, forcing her to create a prosthetic extension to be able to continue serving. Once her half-life mentor had passed, she took his title and claimed Imperator for herself as a leader in Immortan Joe’s ranks, becoming one of his most trusted commanding officers and couriers. Due to his trust and surplus of willing warriors, Immortan Joe assigned Furiosa to watch over the most prized possessions that she was once almost inducted into: his breeders, the Five Wives. Relating to their physical abuse, Furiosa found in the Wives the first people she had connected with since her capture. The abuse and effort she had to commit to, though, took a toll on Furiosa across her 7000 days of imprisonment.
On many occasions, she considered defecting and escaping in search of her once lost Green Place and taking refuge with her family. Holding as much power as she did with her War Rig, she saw an opportunity, a clear one no less, to retrace her path, run from Immortan’s grasp, and find her lost home among the barren deserts of a once fruitful land. She tries to leave, but not alone: Furiosa smuggles the Five Wives with her, knowing they need the Green Place just as much as she does.

The first bit of basic information is given directly through her identity speech upon regrouping with the Vuvalini for the first time. We learn about her compassion and love of her people through this introduction: she never refers to herself by her name, but rather by what that name was associated with, and she does so with very specific tenses. She was once part of Swaddle Dog, but is one of the Vuvalini. Her initiate mother was K.T. Concannon, but she still is the daughter of Mary Jabassa. She talks as if she has outgrown her childhood culture of Swaddle Dog – she had to for the sake of survival – but she still talks as one of the Many Mothers, and as if she still wishes to be associated with and accepted into this group she still cares for (hence the present-tense phrases). Seeking reaffirmation, she hopes to show that she is still selfless in the cause of the group and the people within it, because without them, Furiosa’s name means nothing to her. As for her combat experience, one could only imagine that her inability to bear children is why she was selected to become the eventual driver of the War Rig, and how she came to lose her arm. Fury Road, within the first 5 minutes, before the title card, makes sure to establish the norms and social constructs that govern Immortan Joe’s Citadel, and from the very start, women have been boiled down to a single purpose: fertility. Her being assigned under the command of an existing Imperator would connect a lot of the scenes: losing an arm in battle, driving the War Rig in the first place, and being so prepared for combat.
Take the scene where Max is still muzzled and chained to Nux: Furiosa is able to take down Max, be more than dominant at close-quarters combat with a wrench, disarm Max of his shotgun, and pull out a hidden handgun on Max, all after preemptively setting the kill switches on the War Rig so that even if Max proved victorious in their small skirmish, he couldn’t steal the resources. Not to mention the number of firearms Furiosa stashes in the War Rig that Max uncovers immediately after their scuffle, and her experience with a sniper rifle to take out the Bullet Farmer later during the night chase. There’s no way in her 20 years she could have advanced nearly as far as she did without already being valued by a highly ranked member of society, as we see by the number of War Boys forced to conform to their initial ranking and die in battle, or grow only to the extent of the military’s drum corps. It is difficult to connect the Wives and Furiosa, other than the fact that Furiosa was at one point almost inducted into the group of prized breeders. We know Immortan Joe trusted Furiosa immensely with executing his water, bullet, and resource runs that extended beyond the Citadel, hauling upwards of 3000 gallons of the prized resource “guzzoline” plus a surplus of water at a time. For how much of the Citadel he constantly monitored, to send someone beyond his immediate control, and even reallocate some power and influence to someone else, speaks immensely of the trust he bestowed upon Furiosa. So, the fact that she’s a powerful commanding officer with a non-insignificant connection to the Wives suggests that is how she got in touch with them, and how she communicated her plan to smuggle them out as well. After she escapes the Citadel, we are well into the film’s plot, which concludes the biography.

While the above looked at three specific presences within the film, the three below are a series of more opinionated pieces discussing some of my favorite sticking points.

Favorite Shot

My favorite shot in Mad Max: Fury Road occurs for only a few frames during the Rock Riders chase scene after Furiosa’s exchange with them goes south (this shot is one that I don’t think I consciously took in until my 2nd or 3rd viewing of the movie). They are well into the chase at this point, with the Rock Riders’ signature bikers attacking the War Rig from all angles: explosives from the side, bikes jumping over the War Rig, armed bikers firing rounds whenever they can, all while Immortan Joe in his Dodge Fargo 1940 “BigFoot” monster truck is catching up to them from the rear. In a moment of panic and without a weapon, Furiosa lunges into her arsenal and grabs some kind of pistol from the bag, and in a single movement, lines up her shot alongside Max at a Rock Rider angling themselves along the side of their vehicle. This shot is only a few frames in length, but communicates so much about the character development of both Max and Furiosa and the relationship between them. We can actually map out their entire relationship visually from strangers, to foes, to forced allies, to tight-knit friends. You can see that they are strangers when they never share a frame; for the first part of the movie, there is usually a cut or a shift in lens focus to direct the audience’s attention to either Max or Furiosa, never both at the same time. They become foes at their first interaction after the sandstorm, and this is clearly indicated via their exchange of weapons. Furiosa attempts to shoot Max with his own shotgun, and moments later, Max threatens to shoot Furiosa with her own pistol. This helps emphasize the turbulent power dynamic between the two: they are both just as capable, but perceive each other as threats, so they continue to try and disarm each other, and inevitably turn one’s own weapon against them. This continues in the following scenes where they become reluctant partners, driving the War Rig together.
Since Furiosa is the only one who can drive the War Rig, she seems to be in control of the situation. But the first thing Max does before the War Rig departs is take and hold hostage every last gun compartmentalized throughout the vehicle, bringing the power dynamic back into his favor. It’s similar to the back and forth they had as enemies, but instead of being a constant duel of competitors, it’s now more of a dance of adversaries: they want to accomplish different goals and have different ideas in mind, but both are tense, as each potentially acts as a threat to the other’s success if they choose to act on it, almost like mutually assured destruction – one party stopping the other also forces them to stop themselves. This is a change from the previous relationship, where they both saw each other as direct impediments competing for the same goal; they now realize that their ideas aren’t mutually exclusive. So, as unlikely companions, Max doesn’t kill Furiosa, but completely disarms her. This tension eventually gets alleviated as the two are both placed in precarious situations, and Max ultimately returns a weapon to her to help defend the crew, visually showing the trust building. Now, the shot I picked as my favorite is where the two are finally visually cued as equals to one another: capability-wise, trust-wise, and as allies. In those few frames, showing the two aim at a single target together in the same shot is all it takes for this film, with minimal dialogue between the characters, to graphically establish their relationship. I also really like it because of how short it is: it forces us, the audience, to passively take in the shot, which helps us understand that the two characters are now strongly bonded without the film forcing their ever-changing connection on us and feeding it to us directly.
These cues allow us to take in a lot of information very quickly, and this is one of the best examples the film offers for how it does so with such a nuanced topic like character interaction.

Favorite Line

My favorite line in Mad Max: Fury Road is one of Furiosa’s final lines of the film, during the final action sequence in which the protagonists end Immortan Joe’s corrupt rule. As Furiosa forces her way up to the driver’s side window to finally kill Immortan Joe, she delivers a final send-off before he lets out one last scream: “Remember me.” It twists the War Boys’ fabled motto, “Witness me!”, which they call out before attempting a suicidal act in an effort to be permitted into Valhalla under Immortan Joe’s servitude by giving their life in a noble act of war. This phrase, “Witness me!”, evokes a tone of acknowledgement of the action; it asks those watching, who know the speaker is already as good as dead, to acknowledge the act. It holds a connotation that what is being seen is merely recognized, but not appreciated, and that’s due in part to what the purpose of “witness” is: it’s an act to better oneself into the salvation of Valhalla, in which they celebrate and normalize the expendability of their current life to carry on into an uncertain next. Furiosa’s spin on the phrase reverses that tone completely by saying “remember” instead of “witness”, which instead of asking for acknowledgement and self-betterment, is asking for gratitude. While Furiosa isn’t the one who physically dies, and Immortan Joe doesn’t physically survive, it makes sense to view it symbolically as if that was the way the scene was framed. Perceived this way, it turns the messaging around by asking “Immortal” Joe to live on forever with full knowledge that she is the one who restored the world, with all of the sacrifices she had to endure; she’s asking not to “die” in vain, but to live in the memory of those she has worked so hard to help.
While it has the exact same physical outcome that the old, willing sacrifices that were “witnessed” had, it changes the mentality and respect surrounding those who may not have had a choice in their death, truly internalizing the impact it had on those who had borne the trauma of the death in question.

This expression also immediately sets the ideology she seeks to replace Immortan Joe’s society with: people aren’t objects to watch be expended, but intelligent beings who deserve to live on forever through their accomplishments and the compassion they have shared with others beyond their self-interest. What makes this specific sentiment especially powerful is that the only memory Immortan Joe would have to remember Furiosa by is her own exploitation that he leveraged, so executing him with that memory is an extremely clear indicator that she wants those memories and experiences to die with him, so that no other person has to suffer through them, and that she is putting her own abuse behind her. It’s Furiosa’s climactic developmental moment: just as Max gave his name to Furiosa to accept his past, Furiosa buries her past to accept her new future. She wants Immortan Joe to explicitly know that he was the one who “killed the world”, and by ending his life she destroys the very world he has built upon abuse and denigration. The final component that makes this line so inspiring is its coupling with the next important line, which Nux so emotionally delivers: “Witness me.” The very phrase that Furiosa has reformed dies gently with Nux; the transitory character who was once a War Boy, and is now Vuvalini, takes the cursed connotation of the phrase “witness” and completely transforms it to accommodate his developed moral enlightenment. Nux, for his whole life, has been told that the world doesn’t care for him, but he cares for the world; the Citadel will continue to run with the powerful machines that the War Boys worship, and they are merely the small, insignificant cogs in a social machine. Even though they are aware they are “kamakrazee”, they know that their life has no more meaning than what Immortan Joe provides.
But by the end, after all he’s undergone and tolerated, he pours all of the newfound emotion Furiosa has modelled throughout their adventure into his delivery, completely dropping the trochaic-accented meter of the War Boys’ chants, reminiscent of the famous Gregorian chant of death, Dies irae, before burying the last of the phrase along with the remaining scraps of Immortan Joe’s convoy, influence, and power. He knew he would die, and that there was no Valhalla for him to seek, but he hopes that his accepting a willing death for those he loves will be remembered by the group as more than a glorified suicide: as the act of a friend. That last utterance of “witness” gives the very contrast needed to emphasize the importance of Furiosa’s choice of wording.

“Remember me!” is such a powerful line because it turns the very sentiment that Immortan Joe so terribly exploited back onto him. Within just those two words, it also builds a new tone and a new society, breaking down the phrase that we, the audience, have been conditioned to hear so frequently that it normalized death, and asking us instead to respect death's consequences and implications beyond a single use. Furiosa, in just a few syllables, was able to take an entire society, destroy it, rebuild it, and accept her sacrifices for it.

Theme

Mad Max: Fury Road contains many broad, strongly supported topics of discussion, one of which includes a central thematic subject that revolves around how redemption is a self-realizing process that cannot be induced forcibly by a person, place, or material object, but can be conducted through one as a medium to enlighten oneself beyond past perception.

Similar to the analysis of the ending above, every major character follows an arc in which their persona is fleshed out such that they are able to newly perceive their life and how they fit into the events that they have been witness to. Continuously, the film, almost forcefully, imposes redemption as a subject matter. The first instance where redemption appears as a focus is within some of Immortan Joe’s first lines: “I am your redeemer! It is by my hand you will rise from the ashes of this world!” Many of his War Boys buy into this false ideology and into the notion that their salvation will be presented as an action that Immortan Joe provides directly to them. Max and Nux learn this first hand: literally being abandoned in the aftermath of a sandstorm, and symbolically “rising from the ashes” that Immortan Joe has incited with his chase of Furiosa. They then go on through their redemption arcs (Nux’s arc won’t enter a phase of redemption until further into the film), but this scene helps to recontextualize Immortan Joe’s previous dialogue to fit the theme described. It wasn’t Immortan Joe who presented them with their redemption, but it was his hand that presented the opportunity to redeem themselves; he was merely the objective that awarded redemption, not salvation himself. Another clear instance of the film’s portrayal of redemption is via Furiosa, in her deeply intimate and revealing conversation with Max, moments before her reunion with the Vuvalini. It’s here where she unveils why she’s seeking the Green Place, not only for herself but for the others as well:

“And [the Wives]?”
“They’re looking for hope.”
“And you?”
“...Redemption.”

This distinction is important as it aids in differentiating the purpose of the Green Place between the Wives and Furiosa. The Wives specifically are holding onto seeking something material — something tangible that directly impacts their life to separate them from Immortan Joe’s exploitation. They need that promise to believe in a life worth living. Furiosa, on the other hand, has a much deeper connection to the Green Place. It was her home that she was extracted from, enslaved at the hands of the very force she was trained to avoid: men. Knowing the trauma her people had to endure along with the acts she deeply laments serving Immortan Joe, Furiosa desires to repent and atone for, which she tries to do through helping the Wives and restoring herself to the Vuvalini. In the end, Furiosa does find acceptance and redemption, it doesn’t come without cost: nearly all of the Vuvalini die in her and the Wives’ names, reiterating how there isn’t manifestations of redemption that one can possess or know to enter a new state, but there are certain people and objects that can help guide one through their self-solace, because if that wasn’t true, the ending of Fury Road would have almost no impact; Furiosa’s story wouldn’t have a chance to resolve with the lack of the Vuvalini present, and Max’s leaving would make even less sense, as he is then literally abandoning his redemption, which is his most overarching plot motivator that has guided him through the film.

This is even reiterated in the music scoring for the film. The track labelled “Redemption” plays during the previously described scene of Furiosa’s personal conversation with Max on their way to the Vuvalini and the Green Place. In it, a certain leitmotif is established along with a very distinct tone. The following scene’s track, “Many Mothers”, takes the redemption leitmotif that “Redemption” established and emboldens it with a fuller orchestration, connecting the theme to the scene. Similarly, during the blood transfusion scene where Max desperately tries to save Furiosa from dying of blood loss, his associated track, “My Name is Max”, also contains the same leitmotif, but this time emphasizes rests and silence to allow room for the music to breathe and react to Max’s dialogue, especially the key titular line the track has been named for. If we were to take the connection in musical theme and apply it to the idea that redemption is something that can be contained within a person or group of people, like Max and the Vuvalini (characters who either left or died), then it would only reiterate the aforementioned point that Fury Road along with many of its characters would remain unresolved, and the conclusion would have none of the significance or impact that the movie clearly tried to convey. It’s more reasonable that the connection between “Redemption”, “Many Mothers”, and “My Name is Max” is used to represent the action that transpired through or from the relevant characters, permeating long after their presence with the others. They conducted and induced their redemption, albeit via different mediums; they are the ones who incited the redemption themselves, instead of being given it or attaining it via a sudden acquisition of someone or something.

There are many other examples, especially with Nux and his relationships with Capable and Max, that exemplify the versatility and omnipresent nature of this theme in the film, but these are some of the most distinct ones that elucidate Miller’s conveyance of the nature of redemption as an intrinsically self-realizing and growing process.

Fixed Points and Fantastic Plots

Adi Mittal

Stability amidst the chaos

Introduction

Let me propose a question to start. Try to solve the following:

$\large{x^{x^{x^{x^{.^{\hspace{.07cm}.^{\hspace{.07cm}.}}}}}} = 2}$

An infinite power tower which supposedly equals 2? Seems unlikely, but those familiar with these infinite-operation type problems likely know the strategy to solve this. Notice how there's a copy of our equation stacked on top of itself.

$\large{x^\fbox{${x^{x^{x^{.^{\hspace{.07cm}.^{\hspace{.07cm}.}}}}}}$} = 2}$

Since we know that equation in the box is equal to 2 because it's a duplicate of our original equation, we can easily reduce the problem down to something much more manageable.

$\large{x^2 = 2} \rightarrow x = \sqrt{2}$

So, raising $\sqrt{2}$ to itself over and over again equals 2. What other equations can we solve? Let's try this one.

$\large{x^{x^{x^{x^{.^{\hspace{.07cm}.^{\hspace{.07cm}.}}}}}} = 4}$

Using the same strategy as before, this one is trivial.

$\large{x^4 = 4} \rightarrow x = \sqrt[4]{4} = \sqrt{2}$

Which is… the same answer as before? How can $f(x) = \sqrt{2}^x$ iterated over itself equal both 2 and 4 at the same time? When in doubt, we can ask our calculator for some confirmation.

Estimation with Python

With some simple Python, we can get a pretty good approximation quickly.

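import math
def f(x):
    # x is the seed value sitting at the very top of the tower
    temp = x
    # apply sqrt(2)**(...) 1000 times, working down from the top
    for i in range(1000):
        temp = math.sqrt(2)**temp
    return temp
print(f(1))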

The above code creates and evaluates a power tower 1000 numbers tall, giving us an approximation of 2.0000000000000004, which is pretty close to 2. So, is 4 anywhere to be seen? Actually, yeah; our solution wasn't completely false. Notice that at the end of the script it says f(1). That 1 is our seed value. Since our power tower can't be infinite if we want a calculable approximation, we need to cut it off after some amount (in this case, 1000 numbers high). In order to do that, though, there has to be some number there at the top of that power tower. In this case it was 1, but it can be anything: since we constantly plug our output back into our input, that seed value should be negligible for an infinitely stacked power tower. Let's see what happens if that is changed to f(4).

print(f(4))

Due to rounding, our script actually blows up towards infinity with f(4) (in Python, the run ends with an OverflowError once the numbers get too large), but we can reason this out by hand. If we start with 4, then the output of our first iteration will be $\sqrt{2}^4 = 4$. Since 4 is our output, that's our new input. But since 4 was also our seed value, it'll just constantly output 4 at every iteration. So 4 is a convergent value (as we can only calculate finite approximations) of the infinite power tower of $\sqrt{2}$, but only when 4 itself is the seed value. To better understand this, we can use a tool known as a cobweb plot.
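Before moving on to the picture, here is a rough numerical check of that reasoning. It's only a sketch: the helper name `tower` and the cutoff `cap` are my own choices for illustration, and the cutoff exists purely so that the loop stops before Python overflows.

```python
import math

def tower(seed, height=200, cap=1000):
    """Iterate x -> sqrt(2)**x from the seed, stopping early if it escapes."""
    x = seed
    for _ in range(height):
        if x > cap:
            # the iteration has clearly run away, so report it as divergent
            return math.inf
        x = math.sqrt(2) ** x
    return x

print(tower(3.9))  # a seed slightly below 4 slides down towards 2
print(tower(4.1))  # a seed slightly above 4 escapes
```

Seeds on either side of 4 get pushed away from it, which is the behavior the rest of this post tries to explain.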

Cobwebs

Cobweb plots are a simple, elegant method to model iterative functions in the Cartesian plane by utilizing a seemingly mundane auxiliary function: $y = x$. What is probably one of the first graphs people are taught in elementary school turns out to be one of the most helpful in modeling these complicated and otherwise impossible to view functions. Here's how to make a cobweb plot: 1) Plot the function to be iterated on (in this case, $f(x) = \sqrt{2}^x$) and $y = x$ together. 2) Pick a seed value to start iterating on. 3) Alternately draw vertical and horizontal lines between the two graphs for as many iterations as one needs. Steps 1 and 2 should be clear enough as they're fairly similar to what we did above, but Step 3 might need a visual to go along with it.

Here's the first step's resulting plot:

Nothing too crazy. The green graph is our $f(x) = \sqrt{2}^x$, while the red graph is our $y = x$. For Step 2 we'll pick $x = 1$ as our seed value as we did before. This is where the magic of Step 3 comes in: from $x = 1$, we'll draw a vertical line from the red graph until it intersects at the green graph.

Now we have a line segment with points $(1,1)\rightarrow(1,f(1))$. This step is equivalent to plugging 1 into the top of our power tower, geometrically doing the operation of $f(x)$. Since we just drew a vertical line, we now draw a horizontal one from the green graph $f(x)$ until it intersects the red one $y = x$.

Now we have a new line segment from $(1,f(1))\rightarrow(f(1),f(1))$. You can probably see where this is going. Now that we have a new point at $x = f(1)$, we can draw a new vertical line until it hits the green graph, geometrically finding the value of $f(f(1))$, performing our repeated operation! We can do this series of horizontal to vertical lines as many times as we want to get as many iterations of our repeated function as we want!
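If you'd rather have the computer trace these segments, here is a minimal sketch of the same three steps. The function name `cobweb_vertices` is made up for this example; it only computes the corner points of the path and leaves the drawing to whatever plotting tool you like.

```python
import math

def cobweb_vertices(f, seed, iterations):
    """Return the corner points of a cobweb path that starts on the line y = x."""
    x = seed
    points = [(x, x)]
    for _ in range(iterations):
        y = f(x)
        points.append((x, y))  # vertical move: from y = x to the curve f
        points.append((y, y))  # horizontal move: from the curve back to y = x
        x = y
    return points

def f(x):
    return math.sqrt(2) ** x

path = cobweb_vertices(f, 1, 3)
for px, py in path:
    print(round(px, 4), round(py, 4))
```

Passing the x and y coordinates of `path` to a line plot on top of the two graphs should reproduce the zig-zag described above.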

Now you can probably see why this is called a cobweb plot, as we weave back and forth creating a net-like shape between the graphs (and it only gets more wild looking with different iterative functions!). Even in the previous graph where I set the seed value to be $x=-1$, our graph still quickly homes in on $x = 2$ for the $\sqrt{2}$ power tower, which happens to be exactly the intersection of our two plots. This is a pretty narrow scope of our graph, though; let's zoom out and see more of this plot.

There's also an intersection at $x=4$! Even with all of this, I don't think it would be wrong to feel that $x=4$ should not be a solution to some extent. Even though it clearly shows a lot of the same characteristics that $x=2$ does, it still feels weird for this to be considered an answer, or at least to the same extent that $x=2$ is. For any seed $x<4$, our iteration converges to $x=2$, and for any $x>4$, it diverges. Only at $x=4$ does our repeated power tower equal 4. To properly understand this, we'll need to utilize derivatives.

Derivatives and Sensitivity

The classic definition of the derivative $f'(x)$ is a function that returns the slope of $f(x)$ at every point $x$. While this definition of the derivative isn't wrong, it is fairly limiting when only considered in the contexts of slopes. We can reframe the idea of a derivative not to be the slope of a function at a point $(a,f(a))$ but rather how sensitive the function is at the point $(a,f(a))$. This will be more apparent if we plot our $f(x)=\sqrt{2}^x$ in a new way.

You can generate the above plot with the following Python:

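import numpy as np
import matplotlib.pyplot as plt
def f(x):
  return np.sqrt(2)**x
# 40 evenly spaced inputs and their images under f
inp = np.linspace(-5,5,40)
out = [f(n) for n in inp]
# vertical distance between the input and output number lines
d = 10
fig = plt.figure(figsize=(20,4))
axes = plt.gca()
axes.set_xlim([-5.3,5.3])
axes.set_ylim([-6,6])
# inputs go on the top number line, outputs on the bottom one
plt.scatter(inp, [d/2 for n in range(len(inp))])
plt.scatter(out, [-d/2 for n in range(len(out))])
# connect each input to its output
for n in range(len(inp)):
  plt.plot([inp[n], out[n]], [d/2, -d/2], color='green')
plt.show()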

This basically just took the $y$-axis of our Cartesian graph and rotated it $90^\circ$. The blue dots represent the preimage of points $x$, while the orange dots represent their associated transformations under $f(x)$ with green lines connecting them. Just looking at it, it's consistent with our Cartesian graph as $f(x)$ never goes below 0, which makes sense as an exponential is always positive. The reason why we want this graph is that it guides the intuition behind this idea of sensitivity and the derivative.

Notice the dots around $x=-3$ in the preimage (blue) points. They all get mapped and squished down near $.354$ under $f(x)$; they get tightly pressed together. But just how tightly pressed together are they? That's exactly what the derivative tells us! For a small change $dx$, we want to know how much that changes the output $df$. In this case, $f(x)=\sqrt{2}^x \rightarrow f'(x)=\sqrt{2}^x\cdot\ln{\sqrt{2}}$. Plugging in $x=-3$ gives $f'(-3)\approx.1225$. This means that around $x=-3$, the ratio between how much the points around it change under $f(x)$ and how much they changed to begin with is $.1225$; in other words, the area around $x=-3$ appears to have shrunk inward by a factor of $.1225$. In the contexts of slopes, this ratio would be the slope of our tangent line, telling us how tall $df$ would be relative to $dx$. Since the derivative $f'(-3)$ is small, we can say that $f(x)$ is not very sensitive around $x=-3$, as a small change in input from $-3$ will still evaluate to about the same value.

Now let's look on the right half of the graph. Trying $f'(4.5)=1.6486$ would imply under our previous logic, that we'd expect points to stretch away from $x=4.5$ by a factor of $1.6486$. Just by looking at our plot, that's not so hard to believe. This means that our $f(x)$ is sort of sensitive around $x=4.5$, as a small difference in input from $4.5$ can lead to a big difference in evaluating $f(x)$.

So now we know that for a given $a$, if $|f'(a)| < 1$, it's a shrink, and if $|f'(a)| > 1$, it's a stretch (a negative derivative implies there's also a flip occurring, but we care only about magnitude). You can now kind of imagine what effects these have when we iterate over $f(x)$ for a long time: points will gravitate towards numbers that shrink the area around them, and be repelled away from numbers that stretch them. Now, relating this back to our original Cartesian plot, let's highlight the areas in which $|f'(a)| > 1$.

Well, look at that! Our $x=4$ solution is in our blue $|f'(x)|>1$ region, while our $x=2$ solution is not!
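If you want to double check the numbers behind the shaded region, a few lines of Python are enough. This is just a sketch: `f_prime` and `classify` are names I picked for the example, and the derivative formula is the one worked out above.

```python
import math

def f_prime(x):
    """Derivative of f(x) = sqrt(2)**x, which is sqrt(2)**x * ln(sqrt(2))."""
    return math.sqrt(2) ** x * math.log(math.sqrt(2))

def classify(x):
    """Label a point by whether f shrinks or stretches its neighborhood."""
    return "shrink" if abs(f_prime(x)) < 1 else "stretch"

for x in (-3, 2, 4, 4.5):
    print(x, round(f_prime(x), 4), classify(x))
```

The points $x=-3$ and $x=2$ land on the shrink side, while $x=4$ and $x=4.5$ land on the stretch side.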

Connecting this all together now, we had two solutions to an iterative function, but only one of them was appearing in practically every case. When graphing its respective cobweb plot, we see that one solution lies in a non-sensitive region ($f'(2) = .6931$), while the other lies in a sensitive one ($f'(4) = 1.3863$). So what can we say about either solution? Since we know $f(x)$ is not sensitive to small changes around $x=2$ and moreover shrinks space around it, we know that $x=2$ is a stable fixed point of the iterative function $f(x) = \sqrt{2}^x$. It's stable in the sense that, because it isn't sensitive to small changes in its neighborhood of points, each iteration we take maps points closer and closer to $x=2$ due to the squishing effect of its derivative. But for $x=4$, which is sensitive, each iteration tends to stretch and repel points away from $x=4$, even though it too is an intersection in our cobweb plot and analytically solves the equation. Hence, we call $x=4$ an unstable fixed point of the system. Just like we've described, while $x=4$ is valid for its own seed value, the slightest discrepancy in value pushes numbers away from it to either start approaching $x=2$, or diverge to infinity (like in our rounding error in the Python script before!). If we quickly go back to our graph style with 2 number lines and perform the function iteratively there, we can really see what these pulls and pushes of numbers look like. Here's what the first 10 iterations of $\sqrt{2}^x$ look like:

You can really see how tight the points coil around $x=2$, and split away from $x=4$. Even with an initial value that starts so close to $x=4$, you can still see it slightly drift away from it at each iteration. This is why thinking of derivatives as measures of sensitivity is so important: the value of the derivative tells you how strong of a pull or push certain numbers have. Consistent with our findings, $x=2$ has a pulling effect around it with a small derivative, while $x=4$ has a pushing effect with its large derivative.

This is why we were also able to use cobweb plots: they were the geometric algorithm to solve when $f(x)=x$, which makes sense as if something is a fixed point, no matter how many times we apply a function to it, it should remain the same. So when solving $\sqrt{2}^x = x$, you'll get the intersections we found earlier at $x=2,4$ (if you want to try and actually solve this equation, it requires the clever use of the Lambert W-function). That's why we were able to analytically solve for two different solutions, but only one kept popping up everywhere. This isn't limited to just power towers, though.

Variations

This type of relationship between stable and unstable fixed points is everywhere. Take the well-known infinite fraction below:

$1 + \large{\frac{1}{1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{\ddots}}}} }$

By setting this equal to $x$, we can solve it just like we did before with the power towers.

$1 + \large{\frac{1}{ \fbox{$1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{\ddots}}} $} }} = x$
$1 + \frac{1}{x} = x$
$x^2-x-1=0$

Using the quadratic formula, we once again get two solutions:

$\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618$ and $1-\varphi = \frac{1-\sqrt{5}}{2} \approx -.618$

The famous Golden ratio $\varphi$ and its underrated second solution. Still, it raises the question: how can a completely positive infinite fraction equate to something negative? Illustrating this with our cobweb and sensitivity regions will make this clear once again. Setting $f(x)=1+\frac{1}{x}$, we get…

A lot like $x=4$ when iterating $\sqrt{2}^x$, $1-\varphi$ is the unstable fixed point in the sensitive region, with numbers getting pushed away at every iteration, while $\varphi$ is the stable one which we quickly spiral down towards. We can quickly verify that $1-\varphi$ is a "valid" solution by plugging it into $1+\frac{1}{x}$ just like we did with $x=4$ into $\sqrt{2}^x$.

$1+\frac{1}{1-\varphi} = 1-\varphi$
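In case that identity isn't obvious at a glance, here is one way to see it, using only the fact that $\varphi$ solves $x^2-x-1=0$, so $\varphi^2 = \varphi + 1$:

$(1-\varphi)(-\varphi) = \varphi^2 - \varphi = 1$
$\frac{1}{1-\varphi} = -\varphi$
$1+\frac{1}{1-\varphi} = 1-\varphi$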

For its own seed value, $1-\varphi$ is valid, but I guess that's up to you if you want to equate a negative value to a positive infinite fraction.

For those who are interested, try setting your seed value to a number in the form of $-\frac{F_n}{F_{n+1}}$ where $F_n$ represents the nth Fibonacci number. The Golden ratio is closely tied to the Fibonacci numbers, so it may be a bit unsurprising why they may relate here. If you try to iterate over any number in this form, you'll eventually hit a point where evaluating the function becomes undefined. Try plugging in a few and watch the strange cascading effect happen.
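Here's a small sketch of that cascade, using Python's `fractions.Fraction` so that every step stays exact. The names `step` and `cascade` are just for this example, and I'm assuming the indexing $F_1 = F_2 = 1$, so that $-\frac{F_4}{F_5} = -\frac{3}{5}$.

```python
from fractions import Fraction

def step(x):
    """One application of f(x) = 1 + 1/x on an exact fraction."""
    return 1 + 1 / x

def cascade(seed, max_steps=50):
    """Iterate from the seed until the value hits 0, where f is undefined."""
    x = Fraction(seed)
    path = [x]
    for _ in range(max_steps):
        if x == 0:
            break  # the next step would divide by zero
        x = step(x)
        path.append(x)
    return path

print([str(v) for v in cascade(Fraction(-3, 5))])
# ['-3/5', '-2/3', '-1/2', '-1', '0']
```

Each step walks one place down the Fibonacci sequence until the value reaches 0, at which point $1+\frac{1}{x}$ can no longer be evaluated.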

There are a whole host of functions that have interesting iterations as well. Let's try $f(x) = \cos(x)$
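A quick way to watch this iteration settle is to run it directly. The sketch below uses a hypothetical helper `iterate` that just applies a function over and over; the seed of 1.0 and the 200 steps are arbitrary choices.

```python
import math

def iterate(f, seed, n):
    """Apply f to the seed n times and return the final value."""
    x = seed
    for _ in range(n):
        x = f(x)
    return x

fixed = iterate(math.cos, 1.0, 200)
print(fixed)                   # where the iteration settles
print(math.cos(fixed) - fixed) # how far it is from being a true fixed point
```

It's the same experiment as mashing the cosine button on a calculator set to radians.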

Since $f'(x) = -\sin(x)$, $|f'(x)|$ is always less than or equal to 1, so all fixed points it has will not diverge. In this case, we get a solution of $\approx .73909$, sometimes referred to as the Dottie number, which has its own set of interesting properties (for one, it's a transcendental number of the likes of $\pi$ and $e$!). If you are interested in a bit of why this has a fixed point, allow me to point you towards the Banach Fixed-Point Theorem for an interesting perspective that guarantees this fixed point. Let's try another function. What happens if we scale $f(x)$? Let's try $5f(x) = 5\cos(x)$

We have not one, not two, but three different intersection points where $5\cos(x) = x$. But notice, all three of them lie within the sensitive region where $|f'(x)| > 1$; they're all unstable. You can probably tell just by looking at it that it's a very chaotic diagram. This might not be unexpected for some of you though. If it doesn't converge to anything, but also doesn't diverge, why wouldn't it just randomly jump around ad infinitum? Well, let me just present another function to explain why. Let's make a cobweb plot for $f(x) = 3.2x(1-x)$

Here we have 2 intersection points, both of which are in the sensitive region where points should not converge (excluding the fixed point's own value), and that's exactly what we see with no definite attraction to any one fixed point. Yet, it's not like our iterations are randomly moving. In fact, just looking at the diagram, it's quite predictably going in a cycle between two $x$-values of $\approx .513$ and $\approx .8$. The difference between $5\cos(x)$ and $3.2x(1-x)$ is how each interacts with our seed value. The former has a quality known as sensitive dependence on initial conditions, more commonly referred to as the Butterfly effect: a small change in the seed value can produce wildly different outputs in iteration in the long run, just like how a butterfly's wings can produce a hurricane years later halfway across the globe. This is a common property of what is aptly deemed chaotic behavior. The latter function, while it may not have a convergent value, exhibits neither Butterfly effect-esque behavior nor chaos under iteration, and instead settles into this cycle. As a kickstarter for those interested, $3.2$ in the latter function was not an arbitrary choice: it comes from a family of iterative functions of the form $rx(1-x)$ known as the logistic map. There's so much to talk about there, it likely will be its own post later, but that's for another day.
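To see that cycle without a diagram, here's a rough sketch that iterates $3.2x(1-x)$ for a while and then reads off the two values it bounces between. The function name `logistic`, the seed of 0.3, and the 1000 warm-up steps are all arbitrary choices for illustration.

```python
def logistic(x, r=3.2):
    """One step of the map x -> r * x * (1 - x)."""
    return r * x * (1 - x)

x = 0.3
for _ in range(1000):  # let the iteration settle into its long-run behavior
    x = logistic(x)

a, b = x, logistic(x)  # two consecutive values of the settled iteration
lo, hi = sorted((a, b))
print(round(lo, 4), round(hi, 4))
print(abs(logistic(logistic(a)) - a))  # two steps later we are back where we started
```

The two values come out to about 0.513 and 0.799, and applying the function twice returns the starting value, which is what a cycle of length two means.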

I want to go back to the Golden ratio problem, as there's a neat extension I want to share: a more general iterative approximation technique that is more applicable to problem solving. It is known as the Newton-Raphson Method, and it can (usually) home in on roots of a polynomial quite efficiently.

Newton-Raphson Method

The idea is fairly similar to what we did before, but since it's catered to finding roots of polynomials, its iterations have a modified step as we're looking for intersections with the $x$-axis instead of the line $y=x$. Here's the basic idea: 1) Pick an initial seed value $x_0$. 2) Draw a vertical line (like we did with the cobweb) until we hit the function $f(x)$. 3) Draw the tangent line of $f(x)$ at $x_0$, and see where it hits the $x$-axis. Call this new point $x_1$. 4) Repeat the process as many times as you'd like for as accurate an approximation as you'd like up to some $x_n$. Here's an example geometric interpretation for this method with $f(x) = x^2 - 13$.

I had to zoom in extremely close for this graph because, as you can see, just two iterations from a seed value $x_0=5$ find a really accurate approximation of one of the roots of $f(x)$, and you wouldn't be able to see those lines unless magnified by this much. Let's work out a general iterative formula for this method. We first start with some $f(x)$. Just by using derivatives and the definition of a line passing through the point $(x_n,f(x_n))$ for our tangent, we can solve the equation

$f'(x_n)(x-x_n) + f(x_n) = 0$

to find the next point $x_{n+1}$ to continue iterating on (as it should be the $x$-intercept of that line like the instructions describe). Doing some basic algebra shows that:

$f'(x_n)(x-x_n) + f(x_n) = 0$
$f'(x_n)(x-x_n) = -f(x_n)$
$x = x_n - \frac{f(x_n)}{f'(x_n)}$

So, tidying things up, for a given (continuous and differentiable) function $f(x)$, we can approximate its roots by iterating over with some initial $x_0$:

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
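As a rough sketch, that recurrence fits in a few lines of Python. The name `newton_raphson` and its parameters are my own for this example; the derivative is passed in by hand instead of being computed automatically, and the number of steps is fixed instead of checking for convergence.

```python
def newton_raphson(f, f_prime, x0, steps=10):
    """Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) starting from x0."""
    x = x0
    for _ in range(steps):
        slope = f_prime(x)
        if slope == 0:
            # a horizontal tangent never reaches the x-axis
            raise ZeroDivisionError("horizontal tangent; try another seed")
        x = x - f(x) / slope
    return x

root = newton_raphson(lambda x: x**2 - 13, lambda x: 2 * x, x0=5)
print(root, root**2)
```

Starting from a negative seed such as `x0=-5` lands on the other root instead.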

Trying this out with our $f(x) = x^2 - 13$, our recurrence relation after some simplifying becomes

$x_{n+1} = \frac{1}{2}(x_n + \frac{13}{x_n})$
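For anyone who wants the simplifying written out, here it is with $f(x_n) = x_n^2 - 13$ and $f'(x_n) = 2x_n$:

$x_{n+1} = x_n - \frac{x_n^2 - 13}{2x_n}$
$x_{n+1} = \frac{2x_n^2 - x_n^2 + 13}{2x_n}$
$x_{n+1} = \frac{x_n^2 + 13}{2x_n} = \frac{1}{2}(x_n + \frac{13}{x_n})$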

Or if you liked our previous notation, we can rewrite this as a function and iterate over

$g(x) = \frac{1}{2}(x + \frac{13}{x})$

Since this is in function form, we can use our old friend the cobweb to solve this for us.

It nicely finds $\sqrt{13}$ as a solution, just as we would expect. However, notice that there are two intersection points that lie outside of the sensitive region. One we found at $x=\sqrt{13}$, and the other is actually the second solution to $x^2-13=0$ at $x=-\sqrt{13}$. Our seed value significantly matters more in this case, as now depending on which zero of $f(x)$ is closer, our iteration will target only the closest solution, and this only becomes more important the more zeroes our function contains.

Even with all those caveats, notice what we just made! Our iterative function $g(x)$ is essentially a square root estimator, but with no exponents! While it's nice and convenient just to use exact answers, having decimal approximations is just as useful, especially for computers, which don't have unlimited memory to store exact answers. For any number $n$, we can calculate $\sqrt{n}$ as accurately as we'd like by iterating over the function

$g(x) = \frac{1}{2}(x + \frac{n}{x})$

as many times as we want. There are some exceptions where certain seeds can infinitely cycle or actually result in no subsequent $x_{n+1}$ (imagine a horizontal tangent line), but this method is incredibly useful, as this doesn't just extend to square roots, but to any function you want to approximate using the aforementioned formula

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Here are a few other iterative functions for other roots of $n$:

$\sqrt{n} \rightarrow \frac{1}{2}(x + \frac{n}{x})$
$\sqrt[3]{n} \rightarrow \frac{1}{3}(2x+\frac{n}{x^2})$
$\sqrt[4]{n} \rightarrow \frac{1}{4}(3x+\frac{n}{x^3})$
$\sqrt[p]{n} \rightarrow \frac{1}{p}((p-1)x+\frac{n}{x^{p-1}})$
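The general formula in the last line can be tried out with a short sketch like the one below. The function name `nth_root`, the default seed of 1.0, and the fixed 60 steps are assumptions made for this example, and it's only meant for positive $n$.

```python
def nth_root(n, p, seed=1.0, steps=60):
    """Approximate the p-th root of n by iterating ((p-1)x + n / x**(p-1)) / p."""
    x = seed
    for _ in range(steps):
        x = ((p - 1) * x + n / x ** (p - 1)) / p
    return x

print(nth_root(13, 2))  # square root of 13
print(nth_root(27, 3))  # cube root of 27
print(nth_root(16, 4))  # fourth root of 16
```

Only multiplication, division, and addition happen inside the loop (the integer power is just repeated multiplication), which is the appeal of the method.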

Going back to our Golden ratio iteration, we can rewrite it under the fixed point formula $f(x)=x\rightarrow 1+\frac{1}{x}=x$. If you multiply that through by $x$ and rearrange, we get a quadratic $x^2-x-1=0$. That's a quadratic we can solve for with the Newton-Raphson Method! Plugging it into the formula, we get a function to iterate over as

$g(x) = \frac{x^2+1}{2x-1}$

And sure enough, it works! The advantage of using the Newton-Raphson Method in this case, is that we no longer have to worry about unstable fixed points, as all of our solutions lie outside the sensitivity region. So even if we lose some insight into the nature of each solution, we consistently find each solution of $\varphi$ and $1-\varphi$ to an accurate decimal expansion with the right seed.
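Here's a minimal sketch of that claim, iterating the $g(x)$ above from two different seeds. The helper name `iterate`, the seeds of 2.0 and -2.0, and the 20 steps are arbitrary choices; the only seed to avoid outright is $x = \frac{1}{2}$, where the denominator vanishes.

```python
def g(x):
    """Newton-Raphson step for x**2 - x - 1 = 0."""
    return (x * x + 1) / (2 * x - 1)

def iterate(seed, steps=20):
    """Apply g repeatedly, starting from the seed."""
    x = seed
    for _ in range(steps):
        x = g(x)
    return x

print(iterate(2.0))   # a positive seed lands near phi
print(iterate(-2.0))  # a negative seed lands near 1 - phi
```

Unlike iterating $1+\frac{1}{x}$, the negative solution is now something the iteration actually settles on.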

What Else?

Iteration and fixed points are some of the prime topics for dynamical systems and for describing much of the world around us. We discussed the Newton-Raphson Method of root finding, but there are many other recurrence relations for approximating roots of functions, each catered to its own purpose with different convergence rates and fail cases. Moreover, this is just a single use of the Newton-Raphson Method, for it is better known as an alternative to gradient descent. Solving systems of differential equations comes down to finding the equivalent of a higher-dimensional fixed point, or in other words, an eigenvector: a vector (which is just an object that can encode more than one number and hence dimension) which doesn't change direction under the transformation describing the system of equations. Markov chains are another extremely important occurrence of fixed points over iteration: after a long series of transitions between states, we can make an overarching statement about the system as a whole reaching an equilibrium state where transition probabilities are expected to remain the same (going back to that idea of eigenvectors!). Synchronization is a prime example of a fixed point under iteration: even if a group of fireflies begin out of phase with one another, their coupling over time will reduce them into a single large group with one cyclic, uniform behavior. The Mandelbrot set (and all of the Julia sets, for that matter) arise out of the fact that some complex numbers remain bounded under long iteration of functions $f(z)=z^n+c$ (sometimes being bounded to multiple values at once!). There are even entire studies dedicated to this. Invariant theory studies mathematical groups and polynomials to see how they remain unchanged under transformations.
Almost all of chaos theory is about stability (or the lack thereof) over long periods of time (Nicky Case has a great introduction to attractors), and especially when what should be simple, predictable equations are not (we already talked about the logistic map, but see it illustrated in the Bifurcation diagram. It is particularly interesting for it appears in the most unlikely of places). We saw some chaotic behavior earlier, and the way I deduced it was chaotic was with a quantifier all iterative functions and maps have known as the Lyapunov exponent, and this itself is so interesting to look at for how functions change in behavior along with its Lyapunov exponent. For fixed points alone, there are hundreds of theorems dedicated to analyzing them (most notable of them being Brouwer's Fixed-Point Theorem).

If you are interested in anything covered here, the popular math YouTube channel 3Blue1Brown made not one but two videos discussing this idea of derivatives and infinitely stacked operations with the exact puzzle I posed at the start of this post. Their first video is what originally inspired me to look into these objects more when I first saw it a couple years back. Their animations do wonders compared to what any text post can do, so please do check them out if you want a more visual approach to these processes along with some additional justification for solutions to iterative processes.

Fixed points appear everywhere, and I hope this shared a few insights into how they can appear, deceive, and approximate even the most out there of expressions.

Metropolis-Hastings and MCMC, Briefly

Adi Mittal

How to guess mathematically

Today, I want to talk about a really powerful tool in math and statistics. On its own it may seem very niche, but the concept behind it is something really (and I mean really) powerful, and it is how many other discoveries and tools are made and immortalized. In particular, I want to talk about the Metropolis-Hastings algorithm and Markov chain Monte Carlo methods. If you want to skip the basics, here's a short table of contents:

Markov Chains
Monte Carlo Simulations
The Metropolis-Hastings Algorithm

Brief summaries are at the bottom of each section if you want a quick refresher for anything above, but first, some review.

This is also all written more formally with other examples in this paper.

Markov Chains

Markov chains, in essence, are a way to model a process that randomly jumps between different outputs, where each output is said to have some probability to jump to other outputs. They're sort of like rolling dice, but the likelihood you roll any number is only dependent on the number you rolled last. It might help to describe this with an example. Let's say you want to know what the weather will be in 5 days: will it be sunny or rainy? Fortunately, the weather doesn't vary too much, so if it's sunny one day, it's likely to be sunny again the next day with 80% chance. If it's rainy, it will likely be rainy again too, with, say, 60% chance. This can be shown quite succinctly in a little diagram:

This is our actual Markov chain, showing the two transition states, S(unny) and R(ainy) with their associated transition probabilities. However, we can't actually do much with just a picture alone. So, we can rewrite these probabilities and encode them in a matrix:

$ M = \begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix} $

You can think of each row as a different state for current weather, and the columns as probabilities for different states of tomorrow's weather. In this case, I have written row 1 and column 1 to indicate sunny days, and row 2 and column 2 to be rainy days. That's why entry $a_{1,1}$ in row 1, column 1 shows 80%, because if it is sunny today (row 1), we expect an 80% chance for it to be sunny tomorrow (column 1). Similarly $a_{2,2}=.6$, as if it's rainy today, we expect a 60% chance for rain again. $a_{1,2}=.2$ means that if today is sunny, then there is a 20% chance of rain tomorrow, and for completeness sake, $a_{2,1}=.4$ indicates a 40% chance for it to be sunny given today is rainy.

What we've built here is known as a transition matrix, as, well, it's a matrix that shows transition probabilities; it's a matrix that shows how likely we are to jump from one state to another. In this case, our states are the different weathers: sunny or rainy. So, how does this help us answer our original question of what the weather will be in 5 days? Well, let's first try to find the weather 2 days from now. We know how to model 1 day from now, and since these are probabilities, wouldn't it make sense just to multiply our matrix by itself?

$ M^2 = \begin{bmatrix} .72 & .28 \\ .56 & .44 \end{bmatrix} $

Our probabilities have changed a little bit. Now it's saying, if today is sunny, there is a 72% chance it will be sunny 2 days from now. Multiplying our matrix by itself gives this result because the mechanics of matrix multiplication essentially ask: "What is the probability of getting from one state to another in two steps?" If you work out the multiplication itself, it might be clearer, but the way I like to think about it is in terms of transformations of space. For those familiar with a bit of linear algebra, we can think of our matrix $M$ as a collection of basis vectors that scale space (where our vectors in space can be thought of as a collection of starting states, i.e. the initial observed proportion of sunny days to rainy days). So applying $M$ once transforms space, and we can then take that as a new "default" or "unit". If we apply $M$ again to our basis vectors, it has the effect of transforming space once again. This can be thought of as our standard, independent probability multiplication, but instead of changing a singular probability (i.e. dice value), we are changing two (likelihood of sunny and likelihood of rainy days).
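
To see the "two steps" reading in action, here is one entry of $M^2$ worked out by hand (a worked example added for illustration, using the $a_{i,j}$ entries of $M$ from above). To go from sunny today to sunny two days from now, tomorrow is either sunny or rainy, and the two routes add:

$ (M^2)_{1,1} = a_{1,1}a_{1,1} + a_{1,2}a_{2,1} = (.8)(.8) + (.2)(.4) = .72 $

That is exactly the row-times-column sum that matrix multiplication performs for the top-left entry.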

With this in mind, our question is easy. It boils down to what $M^5$ is.

$ M^5 = \begin{bmatrix} .67008 & .32992 \\ .65984 & .34016 \end{bmatrix} $

So if today is sunny, we look at row 1 and can expect a 67.008% chance of sunny weather in 5 days, and if it's rainy, row 2 shows a 65.984% chance for sunny weather. Nice! But you might look at that matrix and notice that row 1 and row 2 are almost the same. Watch what happens if, instead of checking just 5 days into the future, we look towards an infinite number of days ahead:

$ \lim\limits_{n\to\infty} M^n = \begin{bmatrix} .\overline{666} & .\overline{333} \\ .\overline{666} & .\overline{333} \end{bmatrix} $

The rows do become the same. So, if we were to pick a random day far, far into the future, we can expect it to be twice as likely to be sunny as rainy regardless of today's weather. There are two important interpretations of this fact. 1) Going back to our transformation of space idea, this equilibrium state is an eigenvector (specifically for $\lambda=1$) of our transition matrix $M$. Meaning, it is the solution to the matrix equation $vM = v$ where $v$ is a row vector (here, $v=\begin{bmatrix} .\overline{666} & .\overline{333} \end{bmatrix}$). 2) The second, and more important, way to think of this equilibrium state is that it is the final, or stationary, distribution of sunny and rainy days. That is, if you took the fraction of $\frac{\textrm{Sunny Days}}{\textrm{Total Days}}$, you'd expect it to approach $\frac{2}{3}$ as time went on, and $\frac{\textrm{Rainy Days}}{\textrm{Total Days}}$ to likewise approach $\frac{1}{3}$.
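
The matrix powers above are easy to reproduce numerically. Here is a minimal sketch using NumPy (my choice of library; the variable names, and the use of 100 steps as a stand-in for "infinitely far ahead", are assumptions for illustration). It also recovers the stationary distribution as the $\lambda=1$ eigenvector: since $vM=v$ is written for a row vector, that is an ordinary eigenvector of the transpose of $M$.

```python
import numpy as np

# Transition matrix from the weather example: row = today, column = tomorrow.
M = np.array([[0.8, 0.2],
              [0.4, 0.6]])

M2 = np.linalg.matrix_power(M, 2)       # two days ahead
M5 = np.linalg.matrix_power(M, 5)       # five days ahead
M_far = np.linalg.matrix_power(M, 100)  # assumed stand-in for "infinitely" far ahead

# v M = v for a row vector v means v is an eigenvector of M transposed.
eigenvalues, eigenvectors = np.linalg.eig(M.T)
closest_to_one = np.argmin(np.abs(eigenvalues - 1.0))
stationary = np.real(eigenvectors[:, closest_to_one])
stationary = stationary / stationary.sum()  # rescale so the entries sum to 1

print(M2)
print(M5)
print(M_far)
print(stationary)
```

Dividing by the sum matters because an eigenvector is only defined up to scale (and sign), while a probability distribution has to add up to 1.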

To summarize, here are a few important concepts about Markov chains:

A Markov chain is a random process that describes the ability to switch between multiple states.
A Markov chain's probability for any future state depends only on the current state (this is also known as the Markov property).
Each row of a Markov chain's transition matrix must sum to 1 (something has to occur at each time step for each state, even if that means not changing states).
Well-behaved Markov chains (those that, like our weather example, can reach every state from every other state and don't get stuck cycling with a fixed period) will eventually reach an equilibrium state that describes the final distribution of states over a long time.
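
The long-run fraction of sunny days can also be checked by actually walking the chain, rather than by multiplying matrices. This is a small sketch under my own assumptions (the function name, the seed, the rainy starting day, and the 100,000-day horizon are all arbitrary choices):

```python
import numpy as np

def simulate_weather(days, seed=0):
    """Walk the sunny/rainy chain for `days` steps; return the fraction of sunny days."""
    rng = np.random.default_rng(seed)
    stay_probability = {"S": 0.8, "R": 0.6}  # chance tomorrow matches today
    other = {"S": "R", "R": "S"}
    today = "R"  # arbitrary starting state
    sunny_days = 0
    for _ in range(days):
        # Only today's state is consulted: this is the Markov property.
        if rng.random() >= stay_probability[today]:
            today = other[today]
        if today == "S":
            sunny_days += 1
    return sunny_days / days

sunny_fraction = simulate_weather(100_000)
print(sunny_fraction)  # should land close to 2/3
```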

Markov chains are extremely powerful tools to model dynamics with multiple states due to their above properties, but some of their uses from chaos to disease modeling deserve their own post another day.

If you understood this so far, you've got the hardest part of Markov chain Monte Carlo methods under your belt. That being said, we are still missing the second MC of MCMC.

Monte Carlo Simulations

Monte Carlo simulations are probably the closest you'll ever get to the scientific version of guess-and-check. The idea is that if something is too hard to calculate, you do a bunch of mini, random experiments to obtain data that can give you numerical approximations. It's very akin to Bayesian thinking: the more data you give to your approximation, the better you can "update" your approximation to be more accurate and confident. As with all things, let's do a quick example.

If I hand you a coin, you probably would assume it's a fair coin: 50/50 chance for either heads or tails. But how could you verify that it is indeed a fair coin? Well you could flip it and see what it turns up as. Heads! "It must be an unfair coin as it flips heads 100% of the time!" said no one ever. Of course a single data point isn't nearly enough to draw any conclusions, so you need to flip it again. Heads again! Definitely weighted, right? Even if you get only heads twice in a row, that still isn't conclusive. You need to flip the coin a lot of times. By a lot, upwards of hundreds for a reasonable guess at the balance of the coin, and upwards of thousands for an ideal approximation. For all you know, those first 2 heads could be in a much larger sequence of flips you have yet to unfold:

H-H-T-H-T-T-H-T-T-T-H-T-H

Just like that, our coin gets significantly closer to that 50/50 split within just a few additional flips.

Each one of our data points was a flip in this case, and we call those data points samples. The important part to note, though, is that there is a sense of randomness in each sample. The idea behind a Monte Carlo simulation is that even if our sampling method is random, the samples will average out to the true value as we take more of them (think the Law of Large Numbers). This is why the more samples we take, the more accurate our estimations become. This is a lot like unbiased sampling in research studies: you can't reasonably survey everyone in a population, so you take a smaller, random sample in the hopes that it will be representative enough to make reasonable conclusions about the larger population.
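
Here is a rough sketch of that idea in Python. The first part tracks the fraction of heads along the 13 flips written out above; the second simulates a fair coin with NumPy (the seed and the sample sizes are arbitrary choices of mine) to show the estimate settling down as the number of samples grows.

```python
import numpy as np

def running_heads_fraction(flips):
    """Fraction of heads seen after each flip in a string like 'HHT'."""
    heads = 0
    fractions = []
    for count, flip in enumerate(flips, start=1):
        heads += flip == "H"
        fractions.append(heads / count)
    return fractions

# The 13 flips written out above.
by_hand = running_heads_fraction("HHTHTTHTTTHTH")
print(by_hand[1], by_hand[-1])  # after 2 flips, then after all 13

# Simulated flips of a fair coin (seed chosen arbitrarily for repeatability).
rng = np.random.default_rng(0)
for n in (10, 100, 1_000, 100_000):
    flips = rng.random(n) < 0.5  # True counts as heads
    print(n, flips.mean())
```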

Again, just to summarize a few details:

Monte Carlo simulations use random sampling to get numerical estimations for results that are otherwise hard to calculate.
The more samples/trials we take, the more accurate our results.
While taking more samples is more accurate, it also becomes less efficient to compute and gather results, so you have to strike a balance between more accurate results and quicker results.

With all that out of the way, let's put it all together into one cool algorithm.

Metropolis-Hastings and MCMC

So far, we've sampled from things that are relatively easy to run trials on. Flipping a coin and rolling a die are nice to run trials on, as they can both be modelled by a nice uniform distribution (even for weighted dice/coins, by partitioning the uniformness). This is due to the niceness of a discrete distribution, where there is only a finite number of results our black box can output. Often, though, we have a continuous distribution where we don't have probabilities for individual results, but rather for a range of results. To get the gist of it, take the uniform probability distribution on $[0,1]$. What's the probability that you pick $0.235326…$? Obviously, out of an infinite number of possibilities, a single, specific number has probability 0 of being picked. BUT, the probability of picking a number in $[.25,.75]$ is exactly $.5$, as we're picking from half of our total range. This is the idea of probability density. So, you can imagine that more complicated distributions (especially those taken from real life data) can be a lot more difficult to get samples from, or to properly know the densities of regions for. Here's where our MCMC comes in.
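
The point-versus-range distinction is easy to poke at numerically. A quick sketch, assuming NumPy's uniform generator as the "black box" (the seed, the sample count, and the truncated value 0.235326 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
draws = rng.random(1_000_000)   # uniform samples on [0, 1)

# Share of draws that land in the middle half of the interval.
in_middle_half = np.mean((draws >= 0.25) & (draws <= 0.75))
# Share of draws that hit one specific value exactly.
hits_exact_value = np.mean(draws == 0.235326)

print(in_middle_half)    # close to 0.5
print(hits_exact_value)
```

A range collects a healthy share of the samples, while a single exact value essentially never comes up, which is why densities over regions are the right thing to talk about.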

Markov chain Monte Carlo methods combine two important aspects of the two concepts the name implies: a Markov chain's equilibrium distribution and Monte Carlo simulation's random sampling. Here, we make a Markov chain whose stationary distribution is equal to our hard-to-model probability distribution by doing a random walk around the distribution (for the sake of notation, we'll call our "target" distribution we're trying to model $\pi(x)$). In this case, we do so with the Metropolis-Hastings algorithm, which is extremely simple:

Pick a starting point $x_0 \rightarrow$ this is the start of our "walk". An initial sample, if you will, that we provide ($x_t$ means our current sample at time $t$).
Now pick a new, random point $y$. Call $y$ the "proposed state" for $x_{t+1}$.
See how "good" $y$ is compared to $x_t$.

i. If $y$ is "better", we let $x_{t+1}=y$

ii. If $y$ is "worse", we might let $x_{t+1}=y$, but not always.
For $t=1,2,3,…$, repeat steps 2 and 3.
Profit.

This is extremely vague, but I intentionally left it as such, because oftentimes the formulas can obscure the language. In essence, this is what Metropolis-Hastings does to generate samples. We take a sample $x_t$ at a time $t$ that "traces" our distribution, and as $t$ gets larger, our "trace" of the curve we walk around gets more accurate. Let's put some of the formulas back into the instructions above and go at it one step at a time.

Step 1 is easy enough: we give any number for our algorithm to start with. Literally anything. You can give smart guesses that speed up the process, but that will be clear in a second.

Step 2 we don't actually perform, but rather design. Unlike Step 1, where we gave some determined number of our choosing, in Step 2 we implement a transition kernel to pick a step for us. This kernel is a function $Q$ that takes a current spot $x$ and with some probability outputs a new spot $y$. That is, $Q$ is a distribution that randomly generates a new point $y$ given a current one $x$, which we will write $Q(y|x)$. This is how we make our "proposed state" and how we actually implement our walk. You may be wondering though, "What actually is $Q$?" Well, that's up to you to decide! Since $Q$ itself is a distribution around our current state $x$, you can shape $Q$ in whatever way you want! In general, though, the exact choice is not too important, but spending time to design a specific kernel can optimize and speed up the process.

Step 3 is our "goodness" check. Once we have a proposed state generated by $Q$, we need to see if this proposed state is in a more "likely" or dense spot on our distribution $\pi(x)$. The idea is we want to generate samples representative of $\pi(x)$, so it should be obvious that we should visit the probabilistically more dense spots, a.k.a. visit the spots the distribution says are more likely. Geometrically, this is a point higher on our distribution curve.

But remember, just because $y$ is not better doesn't mean that we outright reject it. We instead accept it with a probability that shrinks in proportion to how much worse it is. If $y$ is half as high as our current spot $x$, we flip a coin and accept it with 50% probability. If $y$ was a third as high as $x$, we flip a weighted coin and accept it with probability $\frac{1}{3}$. In other words, we can write our acceptance probability $A=\min(1, \frac{\pi(y)}{\pi(x)})$. If $y$ is higher than $x$, or $\pi(y)>\pi(x)$, then $\frac{\pi(y)}{\pi(x)} > 1$ and we accept it outright. If $\frac{\pi(y)}{\pi(x)} < 1$, then we accept it with probability of that fraction.

This acceptance probability is also what makes this algorithm so good: we only need to know our target distribution $\pi(x)$ up to a constant! If $\pi(x) = c\cdot P(x)$, then our acceptance probability would be $A=\min(1, \frac{c\cdot P(y)}{c\cdot P(x)})$ which simplifies to $\min(1, \frac{P(y)}{P(x)})$, making the constant irrelevant. This is ideal for real life experiments as perfectly measuring constants from observation can be very difficult.
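
To make the "up to a constant" point concrete, here is a small sketch. The bump shape $e^{-|x|}$, the function names, and the test points are assumptions I picked for illustration; the only thing taken from the text is the formula $A=\min(1, \frac{\pi(y)}{\pi(x)})$.

```python
import math

def laplace_shape(x):
    """Unnormalized example target: the shape exp(-|x|) with no constant in front."""
    return math.exp(-abs(x))

def laplace_density(x):
    """The same shape with a constant c = 1/2 in front."""
    return 0.5 * math.exp(-abs(x))

def acceptance_probability(target, current, proposal):
    """Metropolis acceptance probability A = min(1, target(y) / target(x))."""
    return min(1.0, target(proposal) / target(current))

# Moving to a higher point is always accepted.
uphill = acceptance_probability(laplace_shape, 2.0, 1.0)
# At x = ln 2 the curve is half as high as at x = 0, so A is one half.
downhill = acceptance_probability(laplace_shape, 0.0, math.log(2))
# Including the constant changes nothing, because it cancels in the ratio.
with_constant = acceptance_probability(laplace_density, 0.0, math.log(2))

print(uphill, downhill, with_constant)
```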

Steps 4 and 5 are pretty self-explanatory, so just to rewrite it more formally, here is the whole algorithm one more time:

Pick a starting point $x_0$.
Sample a new proposal state $y$ with probability $Q(y|x_t)$
Compute $A=\min(1, \frac{\pi(y)}{\pi(x_t)})$.

i. With probability $A$, accept our proposed state and let $x_{t+1}=y$

ii. Otherwise, reject it and stay put, letting $x_{t+1}=x_t$
For $t=1,2,3,…$, repeat steps 2 and 3.
Profit.

However, I must admit, I did lie to you, but only a little bit. The acceptance probability I gave is actually for the Metropolis algorithm, not the Metropolis-Hastings algorithm. The acceptance probability for the Metropolis-Hastings algorithm is $A=\min(1, \frac{\pi(y)Q(x_t|y)}{\pi(x_t)Q(y|x_t)})$. This is because the Metropolis algorithm only works when $Q$ is a symmetric distribution, meaning that $Q(y|x_t)=Q(x_t|y)$, which returns us to our familiar fraction from before. MH allows asymmetric kernels to speed up the algorithm, but otherwise the concept is the same.
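
Here is a sketch of what an asymmetric kernel can look like in practice. Everything specific in it is an assumption chosen for illustration, not something from the algorithm itself: the target shape $xe^{-x}$ on $x>0$ (proportional to a Gamma distribution with mean 2), the log-normal proposal that multiplies the current point by a random positive factor, the step size, and the seed. The proposal is not symmetric, so the $Q$ terms in the acceptance ratio no longer cancel.

```python
import math
import numpy as np

def target_shape(x):
    """Assumed example target on x > 0, proportional to a Gamma(2, 1) density."""
    return x * math.exp(-x)

def proposal_density(to, frm, sigma):
    """Q(to | frm): log-normal density centred (in log space) on the point `frm`."""
    z = (math.log(to) - math.log(frm)) / sigma
    return math.exp(-0.5 * z * z) / (to * sigma * math.sqrt(2.0 * math.pi))

def metropolis_hastings(iterations, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    current = 1.0
    samples = []
    for _ in range(iterations):
        samples.append(current)
        # Multiplicative step: always positive, but not symmetric in x and y.
        proposal = current * math.exp(sigma * rng.normal())
        # Full Metropolis-Hastings ratio: pi(y) Q(x|y) / (pi(x) Q(y|x)).
        ratio = (target_shape(proposal) * proposal_density(current, proposal, sigma)) / (
            target_shape(current) * proposal_density(proposal, current, sigma))
        if rng.random() < min(1.0, ratio):
            current = proposal
    return np.array(samples)

samples = metropolis_hastings(50_000)
print(samples.mean())  # the Gamma(2, 1) mean is 2, so this should land near 2
```

A multiplicative proposal like this never steps outside $x>0$, which is one reason you might prefer an asymmetric kernel for a target that only lives on positive numbers.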

With 5 very simple steps, we are able to take samples from continuous distributions just like that! The Monte Carlo aspect is pretty obvious in the random "proposal states" $y$ generated in Step 2. The Markov chain might be a bit more concealed, as we never actually explicitly define it. But, look at Step 3 again, as that resembles something very close to our transition probabilities before. Step 3 is actually our Markov chain implicitly defined! Since there are an infinite number of states/values to pick and another infinite number of states to transition to, we can't define an infinitely sized transition matrix. So, instead, we define transition probabilities as needed with our kernel $Q$. And notice, our kernel maintains the Markov property, as each proposed state only relies on the current one. This is because we sort of reversed the way we defined our Markov chain! In our weather example with sunny and rainy days from above, we defined transition states and the stationary distribution followed suit, almost like a property or characteristic of our Markov chain. Here, our Markov chain is instead defined by the fact we want our stationary distribution to mimic $\pi(x)$. This is why we don't outright reject states that are less "good" in our acceptance probability, but rather accept them in proportion to how much less "good" they are, as that will reflect our distribution's shape.

But just like in our original Markov chain example, it's not perfect immediately. Notice in our original weather example with sunny and rainy days, 2 iterations with $M^2$ was nowhere near our stationary distribution, and while 5 iterations at $M^5$ was closer, it still was not ideal. You have to "burn in" some states, discarding the early ones, before proper, accurate samples can be generated.
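
For the weather chain, the need for burn-in can be put into numbers by measuring how far $M^n$ still is from the stationary rows. This is a sketch with NumPy; the helper name and the particular step counts are my own choices. For this particular matrix the gap shrinks by a factor of $0.4$ per step, which is the other eigenvalue of $M$.

```python
import numpy as np

M = np.array([[0.8, 0.2],
              [0.4, 0.6]])
stationary = np.array([2 / 3, 1 / 3])

def gap_after(steps):
    """Largest entry-wise distance between M^steps and the stationary rows."""
    return np.max(np.abs(np.linalg.matrix_power(M, steps) - stationary))

for steps in (1, 2, 5, 10, 20):
    print(steps, gap_after(steps))
```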

Here's some short Python to implement the Metropolis-Hastings algorithm to estimate the following Laplace distribution:

$\large{\pi(x)=\frac{1}{2}\exp(-|x|)}$

Here it is in only 20 lines of code:

import numpy as np
import matplotlib.pyplot as plt
def target(x):
  return .5 * np.exp(-abs(x)) # Target distribution π(x)
def accept(p):
  flip = np.random.uniform(0,1)
  return p >= flip
def metropolis(iterations):
  states = [] # Samples generated by the algorithm
  # Step 1 --> initialize an x0
  current = 1 
  for i in range(iterations):
    states.append(current)
    # Step 2 --> Q generates a proposal (normal distribution)
    proposal = np.random.normal(current, 1) 
    # Step 3 --> Check how good our proposal is
    goodness = min(1, target(proposal)/target(current))
    if accept(goodness): 
      current = proposal # If we like the proposal state, we jump there!
  return states

Here is the scatter plot of our algorithm walking all around $\pi(x)$ across 10000 iterations...

...and here is the corresponding histogram that fits almost too perfectly to our target distribution.

We can now generate discrete samples proportional to our continuous distribution!
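
If you'd rather check the fit with numbers than by eye, here is one way to do it. The sampler is restated so the snippet runs on its own; the seeded generator, the 60,000 iterations, and the burn-in of 1,000 discarded states are assumptions of mine, not part of the listing above. For this Laplace distribution the mean is 0, and exactly half of the probability sits in $|x| < \ln 2$.

```python
import numpy as np

def target(x):
    return 0.5 * np.exp(-abs(x))  # Laplace target π(x)

def metropolis(iterations, rng):
    states = []
    current = 1.0
    for _ in range(iterations):
        states.append(current)
        proposal = rng.normal(current, 1.0)  # symmetric normal proposal
        if rng.random() <= min(1.0, target(proposal) / target(current)):
            current = proposal
    return np.array(states)

rng = np.random.default_rng(0)             # arbitrary seed, for repeatability
samples = metropolis(60_000, rng)[1_000:]  # drop an assumed burn-in of 1000 steps

print(samples.mean())                        # should be near 0
print(np.mean(np.abs(samples) < np.log(2)))  # should be near 0.5
```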

The algorithm aside, an extremely important concept is shown here: reframing questions and objects and asking them from a different perspective can lead to extremely powerful tools and thoughts. We take a Markov chain, and instead of letting its equilibrium state arise as a property, we turn our definition inside out and use the equilibrium state itself to define the Markov chain. This pattern of rethinking concepts has always been useful, be it for building intuition while learning or for defining tools in all of math. From connecting the Mandelbrot set to its cardioid and cycloids, to what encoding parameters in 4-dimensional space means, to even Fourier rebuilding functions from sine waves, the most impactful question one can ask is usually in the form of, "What if?"

Fruitful Fractions and Delightful Dice

Adi Mittal

A powerful tool to reimagine counting

Try typing the fraction $\frac{1}{98}$ into your calculator and see what you get. Don't have one on hand? Here's a calculator ready and waiting for you.

Next try $\frac{100}{9899}$. See if anything stands out to you. Even with the small number of decimals this displays, you might notice some patterns appearing. $\frac{1}{98}$ expanded as a decimal appears to contain the powers of 2! The second fraction might require a more robust calculator, but with enough decimals it's clear that it too has a hidden sequence: the Fibonacci numbers are in its decimal expansion!

You can try and guess at other fractions with unique expansions, but there is a systematic way to generate these fractions to show not just simple sequences like this, but any sequence you want! It's all a byproduct of one of the most powerful tools in discrete math and combinatorics: the generating function.

Generating Functions and Recurrence Relations

First, some terminology and context. A generating function may look complicated, but its essence is actually very simple. If you have some sequence of numbers, say, $A = \{ a_0, a_1, a_2, a_3, \cdots \}$, its corresponding generating function is the power series $A(x) = a_0 + a_1x^1 + a_2x^2 + a_3x^3 + \cdots$. That's all a generating function is! If you like fancy math notation, we can write this more concisely as $A(x) = \sum_{n=0}^{\infty} a_nx^n$. Something important to note, though, is that the powers of $x$ in the series don't actually mean anything. We only really care about the coefficients, and we happen to be using the series to encode our sequence $A$. Herbert S. Wilf put this best in his aptly named book, generatingfunctionology: "A generating function is a clothesline on which we hang up a sequence of numbers for display." Basically, our generating function is purely a convenient way to place all of our sequence terms into a singular object. That's why it's not just any power series, but a formal power series: it extends towards an infinite number of terms, and we don't really care about convergence, but rather just the representation itself. What's also great about generating functions is that they turn questions about sequences and integers into questions about functions, and over the course of centuries, we have learned to do a lot with functions. You'll see quickly why we use a power series specifically, as exponent properties play very nicely into the types of problems and tricks generating functions can help us out with. Knowing this, you shouldn't let the notation of a generating function ever scare you! They're truly a simple object obscured by harsh notation, so always focus on what they represent instead of how they are written.

So, for our powers of 2, the sequence would be $P=\{ 2^n \}_{n=0}^{\infty}$ and the corresponding generating function would be $P(x) = 1+2x+4x^2+8x^3+\cdots$. If you're familiar with your series, this is a geometric series and we can condense it into the following formula: $P(x) = 1+(2x)^1+(2x)^2+(2x)^3 + \cdots = \frac{1}{1-2x}$. Remember how I said the powers of $x$ don't really mean anything? This is a case where we can actually leverage the fact that our generating function is in fact a "function" (this is a specific use case as we normally don't treat them as standard functions). Consider the general generating function $A(x)=a_0+a_1x+a_2x^2+a_3x^3+\cdots$. Watch what happens if we plug in $x=.1$: $A(.1) = a_0+a_1(.1)+a_2(.01)+a_3(.001)+\cdots$. This may not look like much, but since we use a base 10 counting system, multiplying by $.1$ is the same thing as moving a decimal point to the left one spot. So, we can rewrite that infinite sum as the nice float $a_0.a_1a_2a_3\ldots$ Each number in our sequence becomes a digit in our final number!

But, this can become a problem if a number in our sequence $a_n$ is more than one digit long, so we can change the value we plug in to get more precise decimals with more numbers from our sequence: $A(.01) = a_0.0a_10a_20a_3\ldots$ and just like that we have buffer 0s in between numbers. So, doing this for our generating function for the powers of 2, we get that $P(.01) = \frac{1}{1-2(.01)} = \frac{1}{.98} = 1.0204081632\ldots$ Just for aesthetic pleasure, I like to multiply the final fraction by the value of $x$ we plugged in to shift that initial $a_0$ after the decimal point, and get a nicer looking fraction at the end: $\frac{1}{.98}\cdot .01 = \frac{1}{98}$, giving the familiar fraction from the start and the nice decimal of $.010204081632\ldots$
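
If your calculator runs out of digits, exact integer arithmetic in Python can print as many as you like. The helper names below are my own, and the second fraction, $\frac{1}{998}$, is an extra example I'm adding: it is what the same recipe gives when you plug in $x=.001$ to get three-digit blocks instead of two-digit ones.

```python
def decimal_digits(numerator, denominator, count):
    """First `count` digits after the decimal point of a fraction below 1 (truncated)."""
    scaled = numerator * 10**count // denominator  # exact integer arithmetic
    return str(scaled).zfill(count)

def in_blocks(digits, width):
    """Split a digit string into chunks of `width` digits."""
    return [digits[i:i + width] for i in range(0, len(digits), width)]

# x = .01 gives two-digit blocks: 1/98
print(in_blocks(decimal_digits(1, 98, 12), 2))
# x = .001 gives three-digit blocks: 1/998
print(in_blocks(decimal_digits(1, 998, 27), 3))
```

The wider the blocks, the longer the pattern survives before a term grows too big for its block and spills a carry into its neighbor.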

As cool as this may be, this relied on the fact we recognized what kind of series the generating function was (for the powers of 2, it was geometric). Let's take a look at a slightly more complicated sequence: the Fibonacci sequence. Unlike the powers of 2 where we knew a nice closed formula off the bat for each term, we don't have one (shhh) for the Fibonacci numbers. Instead, we can define the sequence by relating it to other terms. We'll call the Fibonacci sequence $F = \{ f_0, f_1, f_2, f_3, \cdots \}$ and its associated generating function $F(x) = \sum_{n=0}^{\infty} f_nx^n$ where $f_n$ is the nth Fibonacci number. By definition of the Fibonacci numbers, we also know that

$f_{n+2} = f_{n+1} + f_n \hspace{.3cm} (f_0 = 0, f_1 = 1)$

This equation is known as a recurrence relation, as, well, it's a recursive relationship; any given term in the sequence can be expressed in some form related to other terms. What's useful about having an equation like this is that we can relate this to our generating function! If we can solve for the generating function, we might be able to get a function that can get us our cool fraction with the sequence embedded in the decimals again! If we multiply through by $x^n$, we get…

$f_{n+2}x^n = f_{n+1}x^n + f_nx^n$

…and if we then sum over $n$ from $0$ to $\infty$, we end up with…

$\sum_{n=0}^{\infty} f_{n+2}x^n = \sum_{n=0}^{\infty} f_{n+1}x^n + \sum_{n=0}^{\infty} f_nx^n$

We're now starting to have a set of terms that look an awful lot like our generating function $F(x)$. Let's look at the left-hand side and see if we can make any sense of it. Just writing it out can tell us a lot, so let's do that.

$\sum_{n=0}^{\infty} f_{n+2}x^n = f_2 + f_3x + f_4x^2 + f_5x^3 + \cdots$

It looks like our original generating function, but offset! Remember, we want the subscript of the term coefficient to equal the power of the $x$ it is attached to. We can multiply through by $x^2$ to easily fix that.

$x^2 \cdot \sum_{n=0}^{\infty} f_{n+2}x^{n} = f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots$

However, we don't want to actually change the value of our recurrence and add extra factors to both sides. To counter the effects of the multiplication, we just put a factor of $\frac{1}{x^2}$ in front of it, since $\frac{1}{x^2} \cdot x^2 = 1$, negating the effects of our multiplication.

$\sum_{n=0}^{\infty} f_{n+2}x^{n} = \frac{1}{x^2}(f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots)$

Now look at that right-hand side: it's our generating function $F(x)$ missing the first two terms, $f_0$ and $f_1x$!

$F(x) = \color{red}{f_0 + f_1x} + f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots$
$F(x) \color{red}{- f_0 - f_1x} = f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots$

Finally, after plugging it all back in, we end up with

$\sum_{n=0}^{\infty} f_{n+2}x^{n} = \frac{1}{x^2}(F(x)-f_0-f_1x)$

You can do a similar process with the other terms on the right-hand side of our original equation to finally get an expression in terms of the generating function, instead of the recurrences.

$\frac{1}{x^2}(F(x)-f_0-f_1x) = \frac{1}{x}(F(x)-f_0) + F(x)$

Now we just need to turn the wheel and solve for $F(x)$!

$\begin{align} \frac{1}{x^2}(F(x)-f_0-f_1x) & = \frac{1}{x}(F(x)-f_0) + F(x) \newline F(x)-f_0-f_1x & = F(x)x-f_0x + F(x)x^2 \newline F(x)-F(x)x-F(x)x^2 & = f_0 - f_0x + f_1x \newline F(x)(1-x-x^2) & = f_0 - f_0x + f_1x \newline F(x) & = \small{\frac{f_0 - f_0x + f_1x}{1-x-x^2}} \end{align}$

Remember, we had initial values $f_0=0$ and $f_1=1$, so we can plug those in to further simplify our fraction.

$F(x) = \large{\frac{x}{1-x-x^2}}$

And sure enough, $F(.01) = \frac{100}{9899} = 0.0101020305081321 \ldots$
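
As a sanity check on the algebra, we can expand $\frac{x}{1-x-x^2}$ back into a power series and see whether the Fibonacci numbers really fall out. Below is a small sketch that does long division of power series with exact fractions; the function name and the list-of-coefficients representation are my own choices for illustration.

```python
from fractions import Fraction

def series_coefficients(numerator, denominator, terms):
    """Power-series coefficients of numerator(x) / denominator(x).

    Polynomials are lists of coefficients, lowest power first, and
    denominator[0] must be nonzero. This is long division of power series.
    """
    coefficients = []
    remainder = [Fraction(c) for c in numerator] + [Fraction(0)] * terms
    for n in range(terms):
        c = remainder[n] / denominator[0]
        coefficients.append(c)
        # Subtract c * x^n * denominator(x) from what is left.
        for k, d in enumerate(denominator):
            if n + k < len(remainder):
                remainder[n + k] -= c * d
    return coefficients

# F(x) = x / (1 - x - x^2)
fibonacci = series_coefficients([0, 1], [1, -1, -1], 12)
print([int(f) for f in fibonacci])

# Plugging in x = 1/100 exactly.
x = Fraction(1, 100)
print(x / (1 - x - x**2))
```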

But why stop there? Although we just solved that $F(x) = \frac{x}{1-x-x^2}$, don't forget our original definition that $F(x) = \sum_{n=0}^{\infty}f_nx^n$. Together, these imply that if we can expand $\frac{x}{1-x-x^2}$ as a power series in some other way, matching its coefficients against $f_n$ should give us a closed form for the nth Fibonacci number!

First, we need to decompose our function into its partial fractions. Let $\phi = \frac{1+\sqrt{5}}{2}$ and $\varphi = \frac{1-\sqrt{5}}{2}$.

$\normalsize{ \frac{x}{1-x-x^2} = \frac{x}{(1-\phi x)(1-\varphi x)} = \frac{1}{\phi-\varphi}(\frac{1}{1-\phi x} - \frac{1}{1-\varphi x}) = \frac{1}{\sqrt{5}}(\frac{1}{1-\phi x} - \frac{1}{1-\varphi x}) }$

Note our final result mimics the closed form of two different geometric series!

$\frac{1}{\sqrt{5}}(\frac{1}{1-\phi x} - \frac{1}{1-\varphi x}) = \frac{1}{\sqrt{5}} ( \sum_{n=0}^{\infty}(\phi x)^n - \sum_{n=0}^{\infty}(\varphi x)^n) = \sum_{n=0}^{\infty}\frac{\phi^n-\varphi^n}{\sqrt{5}}x^n$

So, to wrap it all up:

$F(x) = \sum_{n=0}^{\infty}f_nx^n = \sum_{n=0}^{\infty}\frac{\phi^n-\varphi^n}{\sqrt{5}}x^n \rightarrow$

$\large{f_n = \frac{\phi^n-\varphi^n}{\sqrt{5}}}$

Just like that, we've found a formula for the nth Fibonacci number (this is known as Binet's formula)! This is only a sliver of the power of generating functions: being able to turn a recurrence relation into a closed form solution, barely even interacting with the sequence at all!
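
A quick numerical check of the formula against the recurrence, as a sketch. The constant and function names are mine, and the rounding is there only because floating-point arithmetic is not exact, so this comparison is only trustworthy for moderately sized $n$.

```python
import math

SQRT5 = math.sqrt(5)
PHI = (1 + SQRT5) / 2        # the post's phi
PHI_CONJ = (1 - SQRT5) / 2   # the post's varphi

def fibonacci_binet(n):
    """Binet's formula, rounded to absorb floating-point error."""
    return round((PHI**n - PHI_CONJ**n) / SQRT5)

def fibonacci_iterative(n):
    """f_0 = 0, f_1 = 1, f_{n+2} = f_{n+1} + f_n."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fibonacci_binet(n) for n in range(12)])
print(all(fibonacci_binet(n) == fibonacci_iterative(n) for n in range(40)))
```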

Now, let's try a different type of problem generating functions can help us out with.

Counting Birds with Generating Functions

Say you're visiting an aviary with some friends. Well-respected, the aviary has a vast number of birds, but they've noticed some interesting patterns in the behavior of their avifauna: their hummingbirds always fly solo; blue jays tend to nest in threes; toucans perch in pairs; and cassowaries chill in fives. How many ways can you see a total of 20 birds?

This may seem like an odd spot for generating functions, but we'll see a nice property of exponents that allows us to use them. Here, a generating function $A(x) = \sum_{n=0}^{\infty}a_nx^n$ is an encoding such that each term $a_n$ denotes how many ways there are to see $n$ birds. Let's write the generating function for hummingbirds:

$H(x) = 1+x+x^2+x^3+x^4+\cdots$

So, if we want to see any number of birds, there is exactly one way we can see that many birds by seeing only hummingbirds. That makes sense! What about blue jays?

$B(x) = 1+x^3+x^6+x^9+x^{12}+\cdots$

Jays come in groups of 3, so it would make sense we could only see total birds in multiples of 3. If we want to see a group of 6 birds with only jays, there is one way we can do that (that is by seeing two groups of jays), but 0 ways to see 5 birds of only jays. Similar generating functions can be written for the other birds.

$\begin{align} H(x) & = 1+x+x^2+x^3+x^4+\cdots \newline B(x) & = 1+x^3+x^6+x^9+x^{12}+\cdots \newline T(x) & = 1+x^2+x^4+x^6+x^8+\cdots \newline C(x) & = 1+x^5+x^{10}+x^{15}+x^{20}+\cdots \end{align}$

The surprising thing is now, if we want to see the number of ways to see $n$ birds through a combination of different birds, all we have to do is multiply the generating functions together! But why would this ever work? Well, let's think of what our exponents mean in each function: they are the total number of birds we see from a group. So, if our giant product results in a term of, say, $x^{14}$, we know that is one way to see 14 birds. Why? Because exponents turn multiplication into addition: $x^a \cdot x^b = x^{a+b}$. So, if we get multiple copies of $x^{14}$, they'll all accumulate in the coefficient of that term, giving us the different ways to see a total of 14 birds! This is why using a power series specifically for generating functions is so helpful: not only do the exponents have a clear meaning when applied, they also carry over the nice exponent properties we can leverage in counting. In general, to count the number of ways to see $n$ birds, we look for the coefficient in front of $x^n$.

$H(x)B(x)T(x)C(x) = $
$(1+x+x^2+\cdots)(1+x^3+x^6+\cdots)(1+x^2+x^4+\cdots)(1+x^5+x^{10}+\cdots)$

Expanding that out seems like a terrible idea, so we won't… We'll let Python do it instead! It's totally doable to do this by hand to systematically extract the coefficient of $x^{20}$ (especially with the series we've selected, involving many binomial coefficients with its partial fraction decomposition), but the algebra along with it can get annoyingly tedious. I'm sure there are clever ways to go about keeping track of which terms you're multiplying, but that's out of the scope of this post.

$H(x)B(x)T(x)C(x) = 1+1x+2x^2+3x^3+ \cdots + 80x^{19} + 91x^{20} + 102x^{21}+\cdots$

So, if you go to the aviary, we know there are 91 different ways to see a total of 20 birds. If you're interested in seeing the entire mathematical crank turn, great YouTuber Mathologer made an excellent video answering a similar question, counting the number of ways to make change for a dollar, in which he spends much more time going into detail on the algebra needed to solve such a problem analytically. Regardless, I hope this gave insight into how great generating functions are as a combinatorial tool for counting, in addition to their utility as a discrete tool.

Before we end, I want to show you one more cool use case of generating functions, one that involves probability distributions.
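
For the curious, here is one way the expansion could be handed to Python. This is a sketch and not necessarily the script actually used for the numbers above; the function names and the cutoff at degree 21 are my own choices. Each series is truncated, since any power beyond the cutoff can never contribute to a coefficient at or below it.

```python
def multiply_truncated(a, b, max_degree):
    """Multiply two coefficient lists, discarding powers above max_degree."""
    product = [0] * (max_degree + 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            if i + j <= max_degree:
                product[i + j] += a_i * b_j
    return product

def group_series(group_size, max_degree):
    """1 + x^g + x^(2g) + ... truncated at max_degree."""
    return [1 if n % group_size == 0 else 0 for n in range(max_degree + 1)]

MAX_DEGREE = 21
ways = [1] + [0] * MAX_DEGREE  # start from the constant series 1
for group_size in (1, 3, 2, 5):  # group sizes used in H(x), B(x), T(x), C(x)
    ways = multiply_truncated(ways, group_series(group_size, MAX_DEGREE), MAX_DEGREE)

print(ways[:4])   # coefficients of 1, x, x^2, x^3
print(ways[20])   # number of ways to see 20 birds
```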

Delightful Dice and Probability Distributions

A staple of tabletop gaming has always been the pair of six-sided dice. Notably, it's respected for being a considerably fair distribution, with the most likely outcome being the middle value of 7 at $\frac{1}{6}$, and the two extreme values of 2 and 12 being the least likely, both at probability $\frac{1}{36}$. This makes it great for board games, with extraordinarily good, high values being just as likely as their low, unlucky counterparts. But, these dice are boring: for centuries our dice have remained a simple numbering from 1–6, but is there a different numbering that we can use to maintain our fair play?

If we use our familiar friend the generating function, we can find out with little thinking required! We can represent our die as a generating function $P(x) = \sum_{n=0}^{\infty}p_nx^n$ where $p_n$ is the probability of rolling a value $n$. So, for a standard die its generating function would be

$\normalsize{\frac{1}{6}x^1+\frac{1}{6}x^2+\frac{1}{6}x^3+\frac{1}{6}x^4+\frac{1}{6}x^5+\frac{1}{6}x^6}$

as you have an equal chance of rolling any number 1–6, and no chance of rolling anything else (so all the other coefficients are 0 and those terms drop out). If we had a die with sides 1,2,2,7,7,7, its generating function would look like

$\normalsize{\frac{1}{6}x^1+\frac{1}{3}x^2+\frac{1}{2}x^7}$

So, the generating function of the sum of two dice is just the product of their generating functions, like last time. To see why this is true, think about what the product means: when two terms multiply, their exponents add, so the coefficient of $x^k$ in the product collects every way of rolling a total of $k$ with the two dice. The two factors of $\frac{1}{6}$ multiply into the $\frac{1}{36}$ that normalizes these counts into probabilities. So, if we had two new dice (we'll call them die A and die B) with new generating functions $A(x)$ and $B(x)$, their product should equal the product of the normal dice!

$A(x)B(x) = \frac{1}{36}(x^1+x^2+x^3+x^4+x^5+x^6)^2$

So, now what? The right-hand side is currently packaged as two factors: two copies of the normal die's generating function. That means that if we can find a way to re-factor that right-hand side into two new generating functions, we should get the labelling for two new dice that are still just as fair as our ordinary dice!

$\begin{align} \frac{1}{36}(x^1+x^2+x^3+x^4+x^5+x^6)^2 & = \frac{1}{36}x^2(1+x+x^2+x^3+x^4+x^5)^2 \ \newline & = \frac{1}{36}x^2(1+x+x^2)^2(1+x^3)^2 \newline & = \frac{1}{36}x^2(1+x+x^2)^2(1+x)^2(1-x+x^2)^2 \end{align}$

Now, we just need to figure out how to repackage this into 2 terms and we should have our dice! Some things to note: 1) All the coefficients in $A(x)$ and $B(x)$ need to be nonnegative multiples of $\frac{1}{6}$, as a probability can't be negative and each die has 6 sides. 2) $A(x)$ and $B(x)$ both need at least one factor of $x$, as otherwise we might end up with 0s on our dice (which can make for some very boring dice). So, right now we have $A(x) = \frac{1}{6}x$ and $B(x) = \frac{1}{6}x$. Now we need to distribute the remaining factors $(1+x+x^2)^2(1+x)^2(1-x+x^2)^2$. Since we only have six sides on our dice, it follows that, once the $\frac{1}{6}$ is pulled out front, the coefficients of both $A(x)$ and $B(x)$ must sum to 6 (how can we put 7 numbers on a six-sided die?). Since the factors $(1+x+x^2)$, $(1+x)$, and $(1-x+x^2)$ have coefficient sums of 3, 2, and 1 respectively, and the coefficient sum of a product is the product of the coefficient sums (see the footnote), it follows that both $A(x)$ and $B(x)$ need exactly one factor of $(1+x+x^2)$ and one factor of $(1+x)$. So, what do we do with the two factors of $(1-x+x^2)$? We can either give both to die A (or B, whichever you like thanks to symmetry), or one to A and one to B. If we do the latter, we get:

$\begin{align} A(x) & = \frac{1}{6}x(1+x+x^2)(1+x)(1-x+x^2) \ \newline & = \frac{1}{6}(x^1+x^2+x^3+x^4+x^5+x^6) \ \newline B(x) & = \frac{1}{6}x(1+x+x^2)(1+x)(1-x+x^2) \ \newline & = \frac{1}{6}(x^1+x^2+x^3+x^4+x^5+x^6) \end{align}$

Which is just our normal dice from before, labelling both dice 1,2,3,4,5,6. But if we try the former option...

$\begin{align} A(x) & = \frac{1}{6}x(1+x+x^2)(1+x)(1-x+x^2)^2 \ \newline & = \frac{1}{6}(x+x^3+x^4+x^5+x^6+x^8) \ \newline B(x) & = \frac{1}{6}x(1+x+x^2)(1+x) \ \newline & = \frac{1}{6}(x+2x^2+2x^3+x^4) \end{align}$

Now we get two genuinely new dice: label die A 1,3,4,5,6,8 and die B 1,2,2,3,3,4. Of course, multiplying these two generating functions together will verify their fairness, as we didn't actually change any of the factors that go into the product, but you can also draw these dice's summation table and verify that each of the numbers 2–12 appears as often as it should. If you want to mess with your friends a bit, making a pair of these dice for your next occasion is definitely an easy project to do in a day.
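If you'd rather not multiply the polynomials by hand, here is a small Python sketch that expands both factorizations and compares the result against two ordinary dice. The helper names (`multiply`, `expand`) and the factor labels are my own for illustration; polynomials are stored as coefficient lists with the lowest power first, and the common $\frac{1}{6}$ is left out since it scales both sides equally.

```python
def multiply(p, q):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    product = [0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            product[i + j] += p_i * q_j
    return product


def expand(factors):
    """Multiply a whole list of polynomial factors together."""
    result = [1]
    for factor in factors:
        result = multiply(result, factor)
    return result


X = [0, 1]                 # x
ONE_X_X2 = [1, 1, 1]       # 1 + x + x^2
ONE_X = [1, 1]             # 1 + x
ONE_MINUS_X_X2 = [1, -1, 1]  # 1 - x + x^2

standard = expand([X, ONE_X_X2, ONE_X, ONE_MINUS_X_X2])
die_a = expand([X, ONE_X_X2, ONE_X, ONE_MINUS_X_X2, ONE_MINUS_X_X2])
die_b = expand([X, ONE_X_X2, ONE_X])

print(die_a)  # index = face value, entry = how many sides show it
print(die_b)
print(multiply(die_a, die_b) == multiply(standard, standard))
```

Each entry of `multiply(die_a, die_b)` counts the number of ways to roll that total, so comparing the two products is the same as comparing the two summation tables.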

Conclusion

Hopefully this has shown you just how powerful generating functions are and how wide an application they have in discrete problems. From sequences, to counting, to probability, these are just a fraction of the potential generating functions have, and they should always be kept in the back of your mind as not just a tool, but really as a symbol of an ongoing theme in problem solving: always look for out-of-the-box perspectives. I've spoken about duality a bit before (and it definitely warrants its own post), but just how powerful alternative representations can be can't be overstated. Generating functions turned seemingly impossible questions about discrete sequences and indistinguishable counting into questions about functions and series, and required at most a bit of high school algebra to manipulate some of the equations.

Footnote

While I left links to resources for relevant techniques and tools that I didn't explain, I do want to talk briefly about how I decided how to distribute the factors in the dice problem, as it's not completely obvious if you haven't seen it before. In the dice problem, I said that we need each final polynomial's coefficients to sum to 6. To ensure they summed to 6, I said that each must be the product of a polynomial with a coefficient sum of 3 and a polynomial with a coefficient sum of 2. This is because of a nice property: the product of two polynomials' coefficient sums is equal to the coefficient sum of their product. In other words, let $C$ be a function that takes a polynomial $f(x)$ as an argument and returns the sum of the coefficients of $f(x)$. I want to show you that $C(f(x)) \cdot C(g(x)) = C(f(x)g(x))$.

Let's first do an example. Let $f(x) = x^2 - 3x + 2$ and $g(x) = 2x - 4$. The coefficient sum of $f(x)$ is $C(f(x)) = 1-3+2 = 0$. Similarly, $C(g(x)) = 2-4 = -2$. Therefore, $C(f(x))\cdot C(g(x)) = 0\cdot (-2) = 0$. So, we'd then expect $C(f(x)g(x)) = 0$ as well.

Now, let's see what the product of the two functions is and what its coefficient sum is. Let $h(x) = f(x)g(x) = 2x^3 - 10x^2 + 16x - 8$. Then, $C(h(x)) = 2-10+16-8 = 0$, just as we foresaw.

Why is this true? It comes down to a clever way of viewing the coefficient sums. Note that for any polynomial $f(x)$, $C(f(x)) = f(1)$. This is because plugging in 1 to any polynomial completely removes the powers of $x$, as $1^n = 1$ and $m\cdot 1 = m$, leaving us only with the coefficients. This allows us to rewrite $C(f(x)) \cdot C(g(x)) = f(1)\cdot g(1)$. Now, what about $C(f(x)g(x))$? Well, remember we defined $h(x) = f(x)g(x)$, so that means $C(f(x)g(x)) = C(h(x)) = h(1) = f(1)\cdot g(1)$, which is exactly what we got before! So this means that whenever a polynomial with coefficient sum $n$ is written as the product of two polynomials with coefficient sums $a$ and $b$, we must have $ab = n$. That's how I knew in the dice problem that each die's generating function needed one factor with coefficient sum 3 and one with coefficient sum 2, since $3\cdot 2 = 6$.
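Here's a short Python sketch of that property using the example from this footnote. The names `multiply` and `coefficient_sum` are just illustrative helpers, with polynomials stored as coefficient lists, lowest power first.

```python
def multiply(p, q):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    product = [0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            product[i + j] += p_i * q_j
    return product


def coefficient_sum(p):
    """C(p): the sum of the coefficients, which is the same as evaluating p at x = 1."""
    return sum(p)


f = [2, -3, 1]  # x^2 - 3x + 2
g = [-4, 2]     # 2x - 4
h = multiply(f, g)

print(h)  # coefficients of f(x) * g(x), constant term first
print(coefficient_sum(f) * coefficient_sum(g), coefficient_sum(h))

# The same check on two of the dice factors: (1 + x + x^2) and (1 + x).
dice_factor = multiply([1, 1, 1], [1, 1])
print(coefficient_sum([1, 1, 1]) * coefficient_sum([1, 1]), coefficient_sum(dice_factor))
```

One example doesn't prove anything, of course, but it's a quick way to convince yourself before reading the $f(1)$ argument.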

Steiner's Porism and Incredible Inversions

Adi Mittal

Time to flip circles inside out

Today I want to talk about a type of geometry I think is grossly overlooked, especially when compared to the popularity of its Euclidean brother. In a world where translations, rotations, and dilations are the norm, sometimes it's hard to see anything but them as the workhorse geometric tools. However, there is an additional transformation that takes us away from these familiar, line-preserving transforms to a kind of circular transform that may seem novel at first, but even extends into complex analysis. Today I want to talk about inversive geometry. Inversive geometry takes the standard plane we know and quite literally flips it inside out. By the end of this post, you will be familiar with not only what in the world an inversion is, but also a very cool theorem relating tangent circles to one another, which produces the animation above. But before we can get there, we first need to learn how to flip our world inside out.

Plane Inversions

As you can imagine, inversive geometry is geometry that relies on something called inversions. You can think of an inversion as a function that takes a point $P$ and spits out a transformed point $P'$. But, what exactly is our function? It's not a standard $f(x)$ as we're giving two coordinates not one. So maybe it's a 2-by-2 matrix, as we're giving a 2D vector and outputting another 2D vector? Not a bad idea, but it will quickly become clear why we don't want to do that. So, what is our functional object? It's actually a circle. As weird as that sounds, just hear it out. Given a circle $Ø$ with center $O$ and radius $r$, the point $P$ is inverted to $P'$ based on the following equation:

$\large{|OP| \cdot |OP'| = r^2}$

...where $P'$ lies on the ray $\overrightarrow{OP}$. Try dragging the points below to get a handle on this idea.

Here we have the green circle $Ø$ which we are inverting the point $P$ over. Try dragging $P$, $O$, and $R$ around to see where its image $P'$ goes under inversion.

The numbers above each point represent their distance from $O$, so you can verify the distances satisfy the inversion equation. As a nice little double entendre, this mapping is called an inversion for both algebraic and geometric reasons. The equation itself is an inverse relationship between $|OP|$ and $|OP'|$ (this is why we can't use matrices in the standard sense to represent the transform), but better yet, I'm sure that within just a few seconds of playing around you can see the geometric reason much more intuitively: every point $P$ on the inside of the circle gets mapped to the outside of the circle, and every point outside the circle gets mapped to the inside (and every point on the circle stays exactly where it is; we say the circle itself is invariant under inversion). We're taking our plane and flipping it inside out, centered around the circle.

As such, this specific inversion is known as a circle inversion or plane inversion.

However, there might be a glaring issue to some of you: what if the point $P$ we're inverting is the center of the circle $O$ itself? Then we get $|OP|=0$, and how can 0 times anything equal anything but 0? To get around this issue, we have to formally introduce a point at infinity. That way, if we try to invert the center of our inverting circle, we have a place for it to go.
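To make the inversion equation concrete, here is a minimal Python sketch of inverting a single point. The function name `invert_point` is my own, and returning `None` for the center is just a stand-in I chose for the point at infinity; it is not a standard convention.

```python
import math


def invert_point(point, center, radius):
    """Invert `point` over the circle with the given center and radius.

    Returns None when `point` is the center itself, standing in for the
    point at infinity (a convention chosen only for this sketch).
    """
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dist_sq = dx * dx + dy * dy
    if dist_sq == 0:
        return None
    # Scaling the offset by r^2 / |OP|^2 keeps P' on the ray OP
    # and makes |OP| * |OP'| equal r^2.
    scale = radius ** 2 / dist_sq
    return (center[0] + scale * dx, center[1] + scale * dy)


O = (0.0, 0.0)
r = 2.0
P = (1.0, 0.0)
P_prime = invert_point(P, O, r)

print(P_prime)
print(math.dist(O, P) * math.dist(O, P_prime), r ** 2)  # the two should agree
print(invert_point(O, O, r))  # the center has nowhere finite to go
```

Here `P` sits inside the circle and its image lands outside, just like in the widget.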

Circles Inverting Circles

Now that we can invert points, we can now easily invert shapes. All we have to do is invert the collection of points individually, and remember the order to connect them. We could try basic polygons like squares and triangles, but the one that is most interesting (and will be most helpful) is inverting other circles. Below, we'll again invert over the green circle with center $O$, but now instead of a point, we'll invert the blue circle with center $C$ to the red circle.

Again, we have the green circle $Ø$ as our inversion circle, but instead of just a point $P$ that we'll invert, we'll invert the entire blue circle with center $C$. Drag the different points around to see where our blue circle inverts to along with its center $C$ to $C'$.

A lot of our rules with inverting points can easily give us an intuition for how our circles might invert. Points on a circle that are inside of our green inversion circle get flipped to be outside of it, and vice versa, and points on the inversion circle stay on the inversion circle (see how the red circle passes through the intersections of the green and blue circles). But, since we are looking at a group of points, some discrepancies between points obviously exist. For instance: distance. I have drawn both the center of our blue circle $C$ as well as its inversion $C'$. But just by looking at it, it's obvious that there's no way $C'$ can be the center of the blue circle's inversion! That's due to one key aspect of inversions: they do not preserve distances.

That should be apparent from the inversion equation $|OP| \cdot |OP'| = r^2$ itself. This inverse relationship between the lengths of $OP$ and $OP'$ is what really exaggerates inversions for very small or very big values of $|OP|$. In fact, this is why inverting a circle is so interesting: even though the distances get all messed up, a circle will still always invert into another circle. Try inverting a square and you'll get something that doesn't look like a square (even if you align the center of the square with the center of the inversion circle, the four corners land on the corners of a new square, but the straight sides bow into circular arcs; just try drawing it out and it will all fall into place). Even when the inverted circle doesn't look like a circle, it really is one! Try dragging the blue circle such that it passes through the center of our green inverting circle. You'll get something that looks like a line. While it acts as a line, we formally say that this is a really, really big circle. Specifically, a circle with an infinitely large radius. It's a lot like how in calculus they say that if you zoom in super far on a curve it looks like a line: a super big curve locally looks like a line from our perspective.
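Both claims (circles go to circles, but centers don't go to centers) can be checked numerically. The sketch below is an illustration under my own assumptions: the function names and the example circles are made up, and `invert_circle` uses the standard closed form for the image of a circle that doesn't pass through the center of inversion.

```python
import math


def invert_point(point, center, radius):
    """Invert a point (assumed not to be the center) over a circle."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    scale = radius ** 2 / (dx * dx + dy * dy)
    return (center[0] + scale * dx, center[1] + scale * dy)


def invert_circle(circle_center, circle_radius, center, radius):
    """Center and radius of a circle's image under inversion.

    Assumes the circle does not pass through `center`; in that case
    the image would be a line instead.
    """
    dx, dy = circle_center[0] - center[0], circle_center[1] - center[1]
    power = dx * dx + dy * dy - circle_radius ** 2
    scale = radius ** 2 / power
    image_center = (center[0] + scale * dx, center[1] + scale * dy)
    return image_center, abs(scale) * circle_radius


O, r = (0.0, 0.0), 1.0  # green inversion circle
C, s = (3.0, 0.0), 1.0  # blue circle
image_center, image_radius = invert_circle(C, s, O, r)

# Invert a ring of points on the blue circle and measure how far each
# image lands from the predicted center of the red circle.
distances = []
for k in range(12):
    angle = 2 * math.pi * k / 12
    on_blue = (C[0] + s * math.cos(angle), C[1] + s * math.sin(angle))
    distances.append(math.dist(invert_point(on_blue, O, r), image_center))

print(image_center, image_radius)
print(min(distances), max(distances))  # all (nearly) equal to image_radius
print(invert_point(C, O, r))           # C' is not the new center
```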

Recap

Ok, that was a lot, but what does this tell us? Well, just experimenting with inverting one circle tells us much about inversions and some properties they have. Let's write them out.

Inversions do not preserve distances. We saw this with how a circle's center may not invert to the center of the inverted circle.
Every point $P$ has a unique inversion $P'$ for any given circle of inversion $Ø$ with center $O$ and radius $r$. This may seem obvious, but it's important to be aware of as it leads to the next very important characteristic...
Intersections and tangency between two or more shapes are preserved during an inversion. This fact is the one you want to hold onto the most for the upcoming sections. This should make sense as if two or more shapes share some point in common such as an intersection or tangent point $P$, that singular point has only one unique inverse $P'$ which they must also share. And if all the points need to be connected to that point after the inversion, then we should expect to see that intersection/tangent point remain after an inversion.
Lastly, just as a neat fact, performing the same inversion twice results in nothing changing (the identity). You can think of this sort of like what happens when turning a shirt inside out twice: the first time the seams come out, but the second time it just goes back to how it started. Just going back to our formula $OP \cdot OP' = r^2$, we know that $OP'$ has some length that corresponds with $OP$ to keep that product the same. So, if we let $P \rightarrow P'$ representing our first inversion, $P'$ needs to go back to $P$ to keep the formula the same for the second inversion. This makes the function of an inversion a special function called an involution.

With the basics of inversion down, we are now ready to explore that animation from the very top of this post.

Steiner's Porism

Steiner's Porism can sound a lot more complicated than it is, but I promise you the animation at the top explains everything. Let's break it down step by step. First, draw two non-intersecting circles with one inside the other. Second, draw a third circle that is tangent to both the inside and outside circles. Third, draw as many circles as you can, each tangent to the inside circle, the outside circle, and the last circle you drew, until no more can fit (we call this chain of circles a Steiner chain). Steiner's Porism states that if the last circle is tangent to the first circle you drew, then the chain closes up no matter where you place that first circle, so there are infinitely many closed chains (and if the last circle is not tangent to the first, then no starting position will make the chain close). So in the GIF above, since the chain of black circles closes up, the black circles are free to rotate like ball bearings in between the outer blue and inner red bounding circles, and the chain will always link up to make one of the infinite configurations. It's pretty interesting, but how would someone ever prove this? That's where our friend inversion comes in.

Before we go any further though, let's quickly see if there's an easier version of this scenario. One of the main issues I first had when looking at this was the fact that the black circles rolling around didn't have to be the same size. Fortunately, there's an obvious case where we don't have to worry about that: when the two bounding circles are concentric.

Try dragging the red point to change the red circle's radius, and try rotating the black point to see the symmetry in the chains of circles.

When the circles share the same center, then Steiner's Porism becomes obvious: our set up now becomes symmetrical, so you can think of any starting point as a rotation of the original chain of circles. Since this is an easy fact to see, we can now use a key property of inversions to prove the general case for any pair of bounding circles:

Intersections and tangency between two or more shapes are preserved during an inversion

This is great for us, as then if we can find an inversion that turns our non-concentric circles into concentric circles, we can then use the fact that our tangents of the chain circles are preserved and use the obvious concentric circles case to close out the theorem. It may sound complicated, but think of it as a way to work backwards: if we can show we can turn any non-concentric circles into concentric ones, then the reverse is also true where there is some corresponding pair of concentric circles that invert to our original, non-concentric ones. Since the rules of tangency remain true between inversions, the rules for our circle chains remain as well (since they are only governed by tangents).

To find our desired circle of inversion, I'll present it as a series of steps that might not make sense immediately, but will definitely make sense retroactively. So, for now, I ask that you follow along with the steps and we'll discuss it at the end.

Step 1: Find the radical axis

The radical axis of a pair of circles is the line (or axis, I guess) made up of every point $P$ whose tangent segments to the two circles have the same length. This sounds more complicated than it is and is much easier to see with a picture. Fortunately, it's not too hard to find with some simple geometry. We'll draw the radical axis in green.

Although we can drag the green point anywhere, it always allows us to find a purple orthogonal circle.

Of course, the point $P$ in question has to be outside the two circles to be able to draw tangents, but that's only a worry for intersecting circles (which we don't care about). I drew a purple circle centered at $P$ to show that the tangent segments are in fact equal in length. This purple circle, however, has one notable property due to the 4 tangent segments it has as its radii: it is orthogonal to the red and blue circles, meaning that it intersects the red and blue circles at right angles. This is a result of the fact that a circle's radius is perpendicular to its tangent. Hold up a corner of a piece of paper and you'll see the right angles clearly. Speaking of orthogonal circles, that brings us to our next step.
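Before moving on, here is a quick numerical check of the equal-tangent property, as a sketch with made-up example circles. I've placed both centers on the $x$-axis so the radical axis is a vertical line; its position `axis_x` comes from setting the two squared tangent lengths equal and solving, which is my own shortcut rather than the compass construction in the widget.

```python
import math


def tangent_length(point, center, radius):
    """Length of the tangent segment from an outside point to a circle."""
    return math.sqrt(math.dist(point, center) ** 2 - radius ** 2)


# Example circles (assumed values): blue is the outer one, red the inner one.
blue_center, blue_radius = (0.0, 0.0), 5.0
red_center, red_radius = (1.0, 0.0), 2.0

# Both centers lie on the x-axis, so the radical axis is the line x = axis_x.
d = red_center[0] - blue_center[0]
axis_x = blue_center[0] + (d ** 2 + blue_radius ** 2 - red_radius ** 2) / (2 * d)

pairs = []
for y in (-3.0, 0.0, 4.0):
    P = (axis_x, y)  # a point on the radical axis
    to_blue = tangent_length(P, blue_center, blue_radius)
    to_red = tangent_length(P, red_center, red_radius)
    pairs.append((to_blue, to_red))
    print(P, to_blue, to_red)
```

The shared tangent length at each `P` is exactly the radius of the purple circle centered there, which is why that circle meets both the red and blue circles at right angles.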

Step 2: Draw two orthogonal circles and find their intersection

This step is easy enough since we've basically already done half of it. We just need to draw another purple orthogonal circle as we've done before, and then find their intersection. Our space will start to get cluttered quickly, so I'll remove the purple tangency lines, but just know that those are what determine our purple circles. We'll draw this intersection point in black.

For any pair of red and blue circles, their orthogonal circles always intersect in the same two locations.

Here I've selected the outermost intersection for clarity, but we'll see in just a second that either of the two intersections works just fine. First, it's worth noting that for a given configuration of outer blue and inner red bounding circles, the intersection points remain constant. No matter how you may slide those green points, the intersection point doesn't change. That should help clue you in to its importance.

As a separate, interesting fact (that I haven't looked into enough), the centers of the red and blue circles are collinear with the two intersection points of the purple circles. Quirk aside, though, we can move onto the third and final step of this inversion circle finding process.

Step 3: Draw any circle at the chosen intersection point

Notice how I said "any" circle. A circle with any radius will suffice as our desired circle. This will be our circle to invert over! We're going to invert a total of four circles: of course, the red and blue ones, but we'll also invert the two purple helper circles. The diagram might look a bit busy, but just remember that this is building off the same diagrams from before; look for what's new in the graphic, and it will be less overwhelming.

Finally, we are able to invert our red and blue circles into concentric ones based on the black intersection point.

And just like that, we've obtained our concentric circles just as we desired! Just to reiterate, because tangencies are preserved through our inversion, we can then draw our chain of tangent circles in the original blue and red bounding circles and know for a fact that they'll remain tangent after our inversion as well. Moreover, since the inversion turns our circles into concentric ones which is the nice symmetric case from before, Steiner's Porism is nicely proved as we know, once again, tangencies are preserved during an inversion.
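The whole recipe can be sketched numerically too. One caveat: instead of literally intersecting two purple circles, the code below uses a shortcut that I'm assuming here, namely that every orthogonal circle centered on the radical axis crosses the line through the red and blue centers at the same two points, a distance of one tangent length from the radical axis. The function names and example circles are illustrative, not taken from the widgets.

```python
import math


def invert_circle(circle_center, circle_radius, center, radius):
    """Center and radius of a circle's image under inversion.

    Assumes the circle does not pass through `center`.
    """
    dx, dy = circle_center[0] - center[0], circle_center[1] - center[1]
    power = dx * dx + dy * dy - circle_radius ** 2
    scale = radius ** 2 / power
    image_center = (center[0] + scale * dx, center[1] + scale * dy)
    return image_center, abs(scale) * circle_radius


def inversion_centers(c1, r1, c2, r2):
    """The two candidate black points for a pair of circles.

    Assumes the circles are non-concentric and do not intersect.
    """
    d = math.dist(c1, c2)
    ux, uy = (c2[0] - c1[0]) / d, (c2[1] - c1[1]) / d  # unit vector c1 -> c2
    t = (d ** 2 + r1 ** 2 - r2 ** 2) / (2 * d)  # where the radical axis crosses
    offset = math.sqrt(t ** 2 - r1 ** 2)        # tangent length at that crossing
    return [
        (c1[0] + (t + sign * offset) * ux, c1[1] + (t + sign * offset) * uy)
        for sign in (-1, 1)
    ]


blue = ((0.0, 0.0), 5.0)  # outer bounding circle (example values)
red = ((1.0, 0.0), 2.0)   # inner bounding circle (example values)

results = []
for black_point in inversion_centers(*blue, *red):
    # Any radius works for the inversion circle; 1.0 is an arbitrary choice.
    blue_image = invert_circle(*blue, black_point, 1.0)
    red_image = invert_circle(*red, black_point, 1.0)
    results.append((blue_image, red_image))
    print(black_point, blue_image, red_image)
```

For each of the two black points, the inverted blue and red circles should report the same center but different radii, which is exactly the concentric picture above.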

Analysis

Ok, but why does this even work? I mean, yeah, it produces concentric circles, but our steps seemed so arbitrary. Why should it work? It has to do with our purple orthogonal circles. Remember, these circles are orthogonal, meaning that they intersect our blue and red circles at right angles. Moreover, remember that by our construction in Step 2, these orthogonal circles pass through the center of our black inversion circle. As we saw before, a circle passing through the center of the inversion circle inverts into a circle of infinitely large radius (or a line, if you prefer). Lastly, we also know that, in addition to tangencies, intersections are preserved during an inversion, and so are the angles at which curves cross. So, not only do we know that our inverted purple lines must intersect, but the right-angle crossings between the purple circles and the red and blue circles are also maintained.

So, we have two intersecting lines that need to be orthogonal to two circles. What configuration allows this? The only way a line can be orthogonal to a circle is if it passes through the circle's center, that is, if it's a radial line of the circle! So, the center of each circle must sit at the intersection of the two lines, and by definition of sharing a center, the circles must be concentric! Isn't that neat?

Also, this explains why our black intersection points of the orthogonal circles are invariant: regardless of what pair of orthogonal circles we use, there is precisely one center of inversion that maps both circles to have the same center (two, technically, but that just flips what circle is on the outside).

One thing worth noting, though, is that we get a solution even when the two circles are not contained within one another. If the two circles are non-intersecting and are completely separated from one another, we can still follow our procedure from before: we can find a radical axis of the two circles, which leads to our two purple orthogonal circles, that finally intersect at the center of our inversion circle. However, we now get a reversed solution with the red circle becoming the outer concentric circle instead of the inner one (this only happens as a result of the choice of intersection point of our orthogonal circles).

Conclusion

Inversive geometry has all sorts of interesting quirks and facts to explore, and should be better known than it is. Maybe one day I'll touch on its connection to polar curves. But anyways, this post wouldn't be complete if you couldn't build a Steiner chain of circles of your own, so below there is one last widget to experiment with tangent circles. I have left the special black inversion circle on the canvas just so you can see how all of our work to get concentric circles relates to any pair of nested, bounding circles. There's so much I had to gloss over to keep this short, such as the hidden conics in the path of the tangent circles, so I highly recommend skimming other articles such as Wolfram MathWorld's and even Wikipedia's discussions on Steiner chains. As with all things in math, this story is never over: Steiner's Porism has a projective geometry cousin known as Poncelet's Porism, but that deserves its own post entirely some day. Inversive geometry is a simple yet powerful tool, and even just knowing the concept alone is useful to keep in your back pocket, as you never know when you may come across something that has an uncanny resemblance to it. Nevertheless, I hope you can at least leave this page with not just an appreciation of a cool bit of math, but a nice animation as well. As before, don't forget to try separating the circles to be outside of one another to get some strange, but special solutions to Steiner's Porism (if you're having trouble seeing the animation clearly, try reducing the radius of the black inversion circle).

Rotating Regulars and Greedy Grids

Adi Mittal

Why grids love squares so much

We are all familiar with the idea of a grid. From making up the small pixels on our screen, to the compact city maps of New York, grids pop up everywhere due to the kind nature of the innate squares built into them; grids are extremely space efficient, packing squares above and below each other while still maintaining a sense of order. But why do grids love squares so much? Today, we'll look at a nice proof for why the square is the only regular polygon that can fit in a grid.

Tilted Thoughts and Grid Properties

First, let's define what a grid is for us. A grid is a set of lattice points whose cardinal neighbors (up, down, left, right) are all equidistant from the given point. That's a lot more complicated than it sounds, but all you need to think of is a generic, square grid like you would find on a piece of graph paper.

No need to worry about any triangular or hexagonal grids (thank you organic chemistry). Obviously, squares fit in our grid, but how can any other regular polygon possibly fit? Well, remember, we don't necessarily need to only draw horizontal or vertical lines: we can easily draw tilted squares too.

Now that you know about tilted squares, here's a nice puzzle to think about: given an $n \times n$ square grid, how many different squares can you draw? Check the footnote below if you want a solution, but just drawing it out will likely give you the intuition you need. Anyways, this tilted square reveals an important property of grids: rotating a lattice point by 90° around another lattice point gives you a new, different lattice point. You can see this nicely with complex numbers. If you have a lattice point $a+bi$, a 90° rotation about the origin is equivalent to multiplying our number by $i$: $i \cdot (a+bi) = -b + ai$. The coefficients remain integers, so if $(a,b)$ is a lattice point, so is $(-b,a)$.

Another (less relevant for us) property is that if you know a line segment defined by 2 lattice points, and you are given a 3rd lattice point, you can find a 4th one by drawing a second, identical line segment from your 3rd point (think of it like vector addition: if we know a vector and a point, we can find a new point by adding that vector to the point). For the purpose of this post, though, just remember the former property.
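The rotation property is easy to play with in code. Here's a tiny Python sketch (the function name `rotate_90_about` is just something I picked) that rotates one lattice point about another using only integer additions and subtractions, so the result can't help but be a lattice point.

```python
def rotate_90_about(point, pivot):
    """Rotate `point` by 90 degrees counterclockwise about `pivot`."""
    dx = point[0] - pivot[0]
    dy = point[1] - pivot[1]
    # (dx, dy) -> (-dy, dx) is the same as multiplying dx + dy*i by i.
    return (pivot[0] - dy, pivot[1] + dx)


pivot = (1, 1)
point = (3, 1)
rotated = rotate_90_about(point, pivot)
print(rotated)

# Rotating four times in a row should bring us right back to the start.
current = point
for _ in range(4):
    current = rotate_90_about(current, pivot)
print(current == point)

# The complex-number version of the same rotation about the origin.
print(1j * complex(2, 1))
```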

Rotating Regular Polygons

The proof that the only regular polygon a grid can define is a square is very simple, but very clever. As an example, we'll use a pentagon for demonstrative purposes. Let's assume that our regular $p$-gon (in our case, pentagon; I use $p$-gon due to poor variable naming later) exists in the grid.

If these 5 points that define our pentagon exist in the grid, then we should be able to generate 5 new, totally valid grid points by rotating each of them 90° around its neighbor.

Notice, though, that we just made another, smaller regular pentagon! ...Or did we? We can prove this quite simply geometrically (trust me, drawing it out and symmetry will guide you all the way through), but I don't want to draw anything right now so instead I'll show you a much more needlessly complicated, linear algebra approach to it (this will, though, give us specific numbers at the end of it). If you can accept this red pentagon is in fact a regular pentagon, just skip ahead, but for now I'll present the proof.
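Before the algebra, a quick numerical sanity check doesn't hurt. This sketch assumes the same setup as the proof below: the black pentagon sits on the unit circle, and each vertex is rotated 90° counterclockwise about the previous vertex. It uses complex numbers instead of matrices, and it only checks the geometry, not anything about lattice points.

```python
import cmath
import math

SIDES = 5

# Black pentagon: vertices on the unit circle at angles 2*pi*n/5.
black = [cmath.exp(2j * math.pi * n / SIDES) for n in range(SIDES)]

# Rotate each vertex v about its neighbor t (the previous vertex):
# translate by -t, multiply by i (a 90 degree turn), translate back by +t.
red = [black[n - 1] + 1j * (black[n] - black[n - 1]) for n in range(SIDES)]

radii = [abs(v) for v in red]
side_lengths = [abs(red[n] - red[n - 1]) for n in range(SIDES)]

print(radii)         # all equal if the red points share a circle about the origin
print(side_lengths)  # all equal if the red points are evenly spaced
```

If every radius matches, every side length matches, and the radius comes out smaller than 1, that's consistent with the red shape being a smaller regular pentagon.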

If we can show that the new red pentagon lies on a parametric circle, we can then show that the 5 angles that generate the original black pentagon also generate the new red pentagon. The way we generated our red pentagon was by taking a black point $v$ and rotating it around its neighbor $t$ by 90° to land at $v'$, as seen above. We can write this transformation as a product of 3 matrices: translating by $-t$, rotating 90°, then translating back by $+t$ (in a linear transformation, the origin remains fixed, so the translations are our way to rotate about any point we want; the extra third row and column are what let a matrix encode a translation). If $v$ is a point of the form $(\cos\frac{2\pi n}{5}, \sin\frac{2\pi n}{5})$, then $t$ is the point $(\cos\frac{2\pi (n-1)}{5}, \sin\frac{2\pi (n-1)}{5})$, just by definition of the pentagon being on the unit circle and $v$ and $t$ being neighboring points. So, our matrix equation for going from $v\rightarrow v'$ is

$ \begin{bmatrix} 1 & 0 & \cos\frac{2\pi (n-1)}{5} \\ 0 & 1 & \sin\frac{2\pi (n-1)}{5} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & -\cos\frac{2\pi (n-1)}{5} \\ 0 & 1 & -\sin\frac{2\pi (n-1)}{5} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\frac{2\pi n}{5} \\ \sin\frac{2\pi n}{5} \\ 1 \end{bmatrix} = v' $

$ \begin{bmatrix} 0 & -1 & \sin\frac{2\pi (n-1)}{5} + \cos\frac{2\pi (n-1)}{5} \\ 1 & 0 & \sin\frac{2\pi (n-1)}{5} - \cos\frac{2\pi (n-1)}{5} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\frac{2\pi n}{5} \\ \sin\frac{2\pi n}{5} \\ 1 \end{bmatrix} = v' $

Before we really dig into the matrix computations, take a look at the final column of that last matrix: it has something that looks like $\sin(x) + \cos(x)$ and $\sin(x) - \cos(x)$. These seem too nice not to have a formula for this sum and difference. So, before we go on, it will be worthwhile to see if we can condense those into nicer formulas. In fact, when you plot these functions, you do get what looks like nice sine waves.

Let's say we think it is some type of cosine curve.

$\sin(x) + \cos(x) = A\cos(x + \phi)$

$A$ is the amplitude of this new curve, and $\phi$ is the phase offset. Now, we fortunately have a well known angle addition formula for cosine.

$\color{red}{1}\sin(x) + \color{red}{1}\cos(x) = A\cos(x + \phi) = \color{red}{A\cos(\phi)}\cos(x) \color{red}{-A\sin(\phi)}\sin(x)$

This may seem hard to solve, but all we need to do now is match our coefficients (highlighted in red). For the right hand side to equal the left hand side

$A\cos(\phi) = 1$
$-A\sin(\phi) = 1$

That way the $\cos(x)$ and $\sin(x)$ terms will be equal on either side. We can now square both equations and add them together to get

$ \begin{array}{cccc} & (A\cos(\phi))^2 & = & 1^2 \\ + & (-A\sin(\phi))^2 & = & 1^2 \\ \hline & A^2(\cos^2(\phi) + \sin^2(\phi)) & = & 2 \end{array} $

Remembering that $\cos^2(\phi) + \sin^2(\phi) = 1$, we get that $A = \sqrt{2}$. Now we can solve for $\phi$ fairly quickly, too. Though, we'll have to be careful about range restrictions on $\cos^{-1}(x)$ and $\sin^{-1}(x)$, so we should corroborate them to make sure we get a value that satisfies both equations.

$\begin{align} A\cos(\phi) = 1 & \rightarrow \phi = \frac{\pi}{4}, \color{red}{-\frac{\pi}{4}} \ \newline -A\sin(\phi) = 1 & \rightarrow \phi = \color{red}{-\frac{\pi}{4}}, \frac{5\pi}{4} \end{align}$

The only (reduced) angle that satisfies both equations is $\phi = -\frac{\pi}{4}$. Putting it all together now, we can see that

$\sin(x) + \cos(x) = \sqrt{2} \cos(x - \frac{\pi}{4})$

In a similar manner,

$\sin(x) - \cos(x) = \sqrt{2} \cos(x - \frac{3\pi}{4})$

but we can make this more akin to our previous equation recalling that $\cos(x) = \sin(\frac{\pi}{2} - x)$.

$\sin(x) - \cos(x) = \sqrt{2} \sin(\frac{\pi}{2} - (x - \frac{3\pi}{4})) = \sqrt{2} \sin(-x + \frac{5\pi}{4}) = \sqrt{2} \sin(x - \frac{\pi}{4})$

Working with one term instead of the sum of two will make our life easier moving forward.
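If you'd like a quick sanity check of both identities before relying on them, here is a minimal numerical sketch. The function names (`sum_form`, `difference_form`, `max_identity_error`) are made up for this illustration; it simply compares the two-term and one-term forms at many sample angles.

```python
import math

def sum_form(x):
    """sin(x) + cos(x) written as a single cosine."""
    return math.sqrt(2) * math.cos(x - math.pi / 4)

def difference_form(x):
    """sin(x) - cos(x) written as a single sine."""
    return math.sqrt(2) * math.sin(x - math.pi / 4)

def max_identity_error(samples=1000):
    """Largest gap between the two-term and one-term forms over [0, 2*pi]."""
    worst = 0.0
    for k in range(samples + 1):
        x = 2 * math.pi * k / samples
        worst = max(
            worst,
            abs(math.sin(x) + math.cos(x) - sum_form(x)),
            abs(math.sin(x) - math.cos(x) - difference_form(x)),
        )
    return worst

# Should be on the order of floating-point round-off
print(max_identity_error())
```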

Unfortunately, this transformation matrix alone can't show us that our net transform is purely a rotation and scaling, due to that third column (which indicates a translation). So, we will have to look at the individual components of $v'$.

$ \begin{bmatrix} 0 & -1 & \sqrt{2}\cos(\frac{2\pi (n-1)}{5} - \frac{\pi}{4}) \\ 1 & 0 & \sqrt{2}\sin(\frac{2\pi (n-1)}{5} - \frac{\pi}{4}) \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\frac{2\pi n}{5} \\ \sin\frac{2\pi n}{5} \\ 1 \end{bmatrix} = v' $

$ \begin{bmatrix} -\sin\frac{2\pi n}{5} + \sqrt{2}\cos(\frac{2\pi (n-1)}{5} - \frac{\pi}{4}) \\ \cos\frac{2\pi n}{5} + \sqrt{2}\sin(\frac{2\pi (n-1)}{5} - \frac{\pi}{4}) \\ 1 \end{bmatrix} = v' $

If $v'$ truly is just a rotated and scaled version of a vertex of our original pentagon, then it follows that any $v'$ should lie on a circle, just as $v$ does. So, we can use the Pythagorean identity that $(r\cos\theta)^2 + (r\sin\theta)^2 = r^2$ which implies that if we square the $x$ and $y$ components of $v'$ and add them together, we should get a constant. For simplicity in writing, we'll use $\alpha = \frac{2\pi (n-1)}{5} - \frac{\pi}{4}$.

$ (-\sin\frac{2\pi n}{5} + \sqrt{2}\cos \alpha)^2 + (\cos\frac{2\pi n}{5} + \sqrt{2}\sin \alpha)^2 = $
$ \cos^2 \frac{2\pi n}{5} + \sin^2 \frac{2\pi n}{5} + 2\cos^2 \alpha + 2\sin^2 \alpha + 2\sqrt{2}(\cos\frac{2\pi n}{5}\sin \alpha - \sin\frac{2\pi n}{5}\cos \alpha) $

This may look like a pain to work with, but grouping like terms and applying a couple of trigonometric identities cleans this up real fast.

$\begin{align} 1 + 2 + 2\sqrt{2}\sin(\alpha - \frac{2\pi n}{5}) & = \ \newline 3 + 2\sqrt{2}\sin(-\frac{2\pi}{5} - \frac{\pi}{4}) & = \ \newline 3 - 2\sqrt{2}\sin(\frac{2\pi}{5} + \frac{\pi}{4}) & \approx 0.479852979 \end{align}$

So it does simplify to a constant! This constant represents the radius$^2$ of the new circle the red pentagon lies on (based on the black pentagon's unit circle). Meaning, the radius of the circle the red pentagon lies on is $\sqrt{0.479852979} \approx 0.692714211$. To find the angle it is rotated by, we just find how far the first vertex ($n=0$) is rotated:

$\theta = \tan^{-1}(\frac{y}{x}) = \tan^{-1}(\frac{1 - \sqrt{2}\sin(\frac{2\pi}{5} + \frac{\pi}{4})}{\sqrt{2}\cos(\frac{2\pi}{5} + \frac{\pi}{4})}) + \pi \approx 3.526465492$

What's great about our linear algebra approach too is that it quickly generalizes for any regular $p$-gon! The only part of the result impacted by our choice of a pentagon is any appearance of $\frac{2\pi}{5}$. So if you wanted to do it for any regular $p$-gon, all you do is replace $\frac{2\pi}{5}$ with $\frac{2\pi}{p}$.
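Here is a small sketch that redoes the computation numerically, in case you want to check the numbers yourself. It uses complex numbers as a stand-in for the matrices (multiplying by `1j` is a 90° counterclockwise rotation), and the names `rotated_vertex` and `radius_squared` are just labels chosen for this example.

```python
import cmath
import math

def rotated_vertex(n, p=5):
    """Rotate vertex n of a unit regular p-gon by 90 degrees about vertex n-1."""
    v = cmath.exp(2j * math.pi * n / p)        # vertex being moved
    c = cmath.exp(2j * math.pi * (n - 1) / p)  # neighbor acting as the pivot
    return c + 1j * (v - c)                    # translate, rotate, translate back

def radius_squared(p=5):
    """Closed form for the squared radius derived in the text."""
    return 3 - 2 * math.sqrt(2) * math.sin(2 * math.pi / p + math.pi / 4)

# Every rotated vertex should sit at the same squared distance from the origin
for n in range(5):
    print(n, abs(rotated_vertex(n)) ** 2)
print("closed form:", radius_squared())  # compare with 0.479852979 above

# Angle of the first rotated vertex, reduced to [0, 2*pi)
first = rotated_vertex(0)
print("angle:", math.atan2(first.imag, first.real) % (2 * math.pi))
```

Passing a different `p` to both functions is the "replace $\frac{2\pi}{5}$ with $\frac{2\pi}{p}$" step in code form.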

To get back to the original point though, we have shown that a vertex $v$ on the unit circle under our specific transformation maps to a vertex $v'$ on a scaled, rotated circle (and because it was all linear transformations, the scalings and rotations are uniform around the origin), which implies that our black pentagon maps to another regular pentagon in red. Those vertices in red are valid points in the grid, since we found them with 90° rotations of other, valid lattice points. So, we can find yet another set of 5 valid lattice points by doing the same operation again of rotating each vertex 90° around its neighbor.

With our previous logic, we know that this too is a regular pentagon, which again, allows us to find 5 new lattice points in our grid by doing the same operation of rotating around a neighbor by 90°... And again... And again... For as many new lattice points and pentagons as we want.

Here are 15 nested pentagons; try dragging any of the vertices to zoom in and you can see they are just smaller, rotated regular pentagons.

Ok, but why do we care? Remember, our grid is made up of a set of discrete points with some distance between each point. But, since we can always find 5 new lattice points that make a pentagon as small as we want by repeating the 90° rotation, that means that we can find a pentagon whose radius/sides are smaller than our grid point distances themselves. How do you draw a pentagon smaller than the space between points itself? It would be like trying to draw between the pixels of the screen you're reading this on, which is obviously impossible. So by contradiction, our initial assumption that a pentagon can exist in the grid must have been false. $\blacksquare$

While that shows it's impossible to have a pentagon in the grid, we still want to show that only the square can exist in the grid. Recall our formula for the radius of the bounding circle of our pentagon (if you skipped ahead originally go back to see where this comes from):

$r(p) = \sqrt{3 - 2\sqrt{2}\sin(\frac{2\pi}{p} + \frac{\pi}{4})}$

Since we were working with the example of a pentagon, originally $p=5$ for us. The reason why our pentagons kept shrinking is because $r < 1$ for $p=5$. We need to show that all regular polygons except the square result in a shrinking radius. In other words, we need to show that $r<1$ for all $p>4$. First, let's note the value at $p=4$ to start:

$\begin{align} r(4) & = \sqrt{3 - 2\sqrt{2}\sin(\frac{2\pi}{4} + \frac{\pi}{4})} \\ & = \sqrt{3 - 2\sqrt{2}\cdot\frac{\sqrt{2}}{2}} \\ & = \sqrt{3 - 2} \\ & = 1 \end{align}$

This is why the square works out: it never shrinks, as a 90° rotation around a neighbor just maps one vertex onto another, already existing one.

Now, let's examine the term $\sin(\frac{2\pi}{p} + \frac{\pi}{4})$. As $p$ gets larger than 4, the angle we're taking the $\sin$ of stays strictly between $\frac{\pi}{4}$ and $\frac{3\pi}{4}$, getting smaller and closer to $\frac{\pi}{4}$ but never quite equal to it. This results in a $\sin$ value strictly larger than $\frac{\sqrt{2}}{2}$, so $2\sqrt{2}\sin(\frac{2\pi}{p} + \frac{\pi}{4}) > 2$, which makes the difference $3-2\sqrt{2}\sin(\frac{2\pi}{p} + \frac{\pi}{4}) < 1$. Here's the graph of $r(p)$ as well as the bound of $r=1$.

Some might look at this graph and see an obvious flaw: what about the case of the equilateral triangle though ($p=3$)? Yes, technically it has an $r>1$ and this does actually result in it expanding out.

If the points expand out, we can't really say much about it existing in the grid or not since we only have a minimum bound on the distance between lattice points. But, there was a second property I glossed over regarding grids:

If you know a line segment defined by 2 lattice points, and you are given a 3rd lattice point, you can find a 4th one by drawing a second line segment, with the same length and direction as the first, from your 3rd point.

This property allows the equilateral triangle to be turned into the equivalent case of a hexagon, leading it to a case where $p>4$ and hence an $r<1$, indicating that an equilateral triangle, too, is impossible to draw within the grid.

Another neat little fact is the bend at $p=8$; an octagon has the smallest bounding radius for the 90° neighbor rotations (since that makes the $\sin$ term go to 1). But after that, $r(p)$ starts to increase. How can we know it won't equal or exceed 1? Well, no matter how big $p$ is, $\frac{2\pi}{p} + \frac{\pi}{4}$ will always be greater than $\frac{\pi}{4}$. Just thinking about it though leads to an interesting thought: what about at the limit of $p\to\infty$?

$\lim\limits_{p\to\infty} \sin(\frac{2\pi}{p} + \frac{\pi}{4}) = \sin(0 + \frac{\pi}{4}) = \frac{\sqrt{2}}{2}$

Which as we saw, would lead to an $r$ value of 1. So, according to this limit, an infinite sided $p$-gon, better known as the circle, is possible... at the limit. You'll get better and better approximations of the circle the more sides you add, but this essentially turns our grid into lattice points that are infinitely close together, which ruins the point of the grid in my opinion. So, it's up to you if you think a circle can exist in a grid, but an interesting thought nonetheless.
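To see these claims about $r(p)$ side by side, here is a short sketch that tabulates the formula for a few values of $p$. The helper name `bounding_radius` is just a label for this example.

```python
import math

def bounding_radius(p):
    """r(p) for 90 degree neighbor rotations of a unit regular p-gon."""
    return math.sqrt(3 - 2 * math.sqrt(2) * math.sin(2 * math.pi / p + math.pi / 4))

# p = 3 grows, p = 4 stays put, everything larger shrinks,
# and very large p creeps back up toward 1 without reaching it
for p in (3, 4, 5, 6, 8, 12, 100, 10_000):
    print(p, round(bounding_radius(p), 6))

# The smallest radius among p > 4 should be the octagon
smallest = min(range(5, 1000), key=bounding_radius)
print("smallest r(p) for p > 4 is at p =", smallest)
```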

Other Grids?

Earlier I mentioned that we didn't need to worry about triangular or hexagonal grids, but what if we did?

The triangular (left) and hexagonal (right) grids are less obvious for what other regular polygons they can fit.

Fortunately, this just requires tweaking our matrix equation from before a little bit: instead of rotating by 90°, we now rotate by 60° and 120° for the triangular and hexagonal grids respectively. In general, if we want to rotate by $\theta$ radians (easy conversion from degrees) around a neighboring point for a regular $p$-gon, the vector for $v'$ in terms of $v$ is

$ \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \cos(\theta + \frac{2\pi n}{p}) + 2\sin(\frac{2\pi (n-1)}{p} + \frac{\theta}{2})\sin(\frac{\theta}{2}) \\ \sin(\theta + \frac{2\pi n}{p}) - 2\cos(\frac{2\pi (n-1)}{p} + \frac{\theta}{2})\sin(\frac{\theta}{2}) \\ 1 \end{bmatrix} = v' $

Finding $r=\sqrt{x^2 + y^2}$ again reveals that

$r_\theta (p) = \sqrt{3+2\cos(\theta + \frac{2\pi}{p}) - 2\cos(\theta) - 2\cos(\frac{2\pi}{p})}$

...which is in fact a constant: it depends on $\theta$ and $p$, but not on which vertex $n$ we picked. So rotations of any angle around neighboring points output more regular polygons.
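As a hedged check of the general formula, the sketch below rotates every vertex directly (again with complex numbers standing in for the matrices) and compares the result with $r_\theta(p)$. The names `rotated_radius` and `r_theta` are invented for this example.

```python
import cmath
import math

def rotated_radius(n, p, theta):
    """Distance from the origin after rotating vertex n by theta about vertex n-1."""
    v = cmath.exp(2j * math.pi * n / p)
    c = cmath.exp(2j * math.pi * (n - 1) / p)
    return abs(c + cmath.exp(1j * theta) * (v - c))

def r_theta(p, theta):
    """Closed form for the bounding radius from the text."""
    inside = (3 + 2 * math.cos(theta + 2 * math.pi / p)
              - 2 * math.cos(theta) - 2 * math.cos(2 * math.pi / p))
    return math.sqrt(max(inside, 0.0))  # clamp tiny negative round-off

# 60, 90, and 120 degree rotations: triangular, square, and hexagonal grids
for theta in (math.pi / 3, math.pi / 2, 2 * math.pi / 3):
    for p in (3, 4, 5, 6):
        direct = [rotated_radius(n, p, theta) for n in range(p)]
        print(round(math.degrees(theta)), p,
              round(min(direct), 6), round(max(direct), 6),
              round(r_theta(p, theta), 6))
```

If the formula holds, the minimum and maximum of the directly rotated radii agree with each other and with the closed form in every row.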

Try rotating one of the red dots to watch the regular pentagon grow and shrink according to the angle you rotate around its neighbor.

As before, we can plot this function to see what regular polygons have a bounding radius less than one for each grid:

It would appear that no hexagons can be made in the triangular grid, as they have a bounding radius $r_{\frac{\pi}{3}}(6)=0$. However, the point they collapse into is actually a valid vertex of the triangular grid. This is because that's one way to define the triangular grid: draw a hexagonal grid and place extra points in the center of each hexagon. So again, the only shapes that can be contained in the triangular and hexagonal grids unfortunately appear to be none other than the equilateral triangle and regular hexagon themselves.

This post was inspired by a Mathologer video discussing an application of shrinking polygons and this out-of-the-box thinking that is so cool. Part way through the video, he glosses over the reasoning behind why the shrinking polygons are similar, so this is my own take on that portion of the video.

Solution to squares in a grid puzzle

As I said, drawing it out is your best bet. Let's start with how many $1 \times 1$ squares there can be in an $n \times n$ grid. Well, it's just $(n-1)^2$ by definition of the size of the grid (remember, $n$ refers to the number of dots, so there are only $(n-1) \times (n-1)$ squares). How about $2 \times 2$ squares? Well, we have eliminated a possible row and column from which we can place the square, so there are $(n-2)^2$ total $2 \times 2$ squares. You might be tempted to say that there are $\sum_{i=1}^{n-1} (n-i)^2$ total squares (where we are individually counting every $i\times i$ square up to $n-1$), but you can't forget that there are tilted squares too. The trick here is that now that you have all the non-tilted squares, you just need to find how many possible tilted squares can be contained in a non-tilted square. In a $1 \times 1$ square, there is no room to tilt a square in it, so we move on. In a $2 \times 2$ square, there is exactly one extra lattice point to tilt on, so we add a factor of $2$ to our count of $2\times 2$ squares $\rightarrow 2(n-2)^2$. Similarly for the $3 \times 3$ squares, we have 2 new lattice points we can tilt to, tripling our count $\rightarrow 3(n-3)^2$ squares. So finally, we can write it as a final sum of $\sum_{i=1}^{n-1} i(n-i)^2 = \frac{1}{12}(n^2)(n^2-1)$ total squares for an $n \times n$ grid.
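If you don't trust the counting argument, a brute-force count is a nice cross-check. The sketch below is one way to do it (not necessarily the fastest): treat every ordered pair of dots as one directed side of a square, build the other two corners with a 90° rotation, and keep the square if both corners are still in the grid. All function names are made up for this example.

```python
def count_squares_brute_force(n):
    """Count every square, tilted or not, with corners on an n x n grid of dots."""
    count = 0
    for ax in range(n):
        for ay in range(n):
            for bx in range(n):
                for by in range(n):
                    dx, dy = bx - ax, by - ay
                    if dx == 0 and dy == 0:
                        continue
                    # Rotating the side (dx, dy) by 90 degrees gives (-dy, dx)
                    cx, cy = bx - dy, by + dx
                    ex, ey = ax - dy, ay + dx
                    if 0 <= cx < n and 0 <= cy < n and 0 <= ex < n and 0 <= ey < n:
                        count += 1
    # Each square is found once for each of its 4 directed sides
    return count // 4

def count_squares_sum(n):
    """The sum from the text: i tilted/untilted squares per i x i bounding box."""
    return sum(i * (n - i) ** 2 for i in range(1, n))

def count_squares_closed_form(n):
    """The closed form n^2 (n^2 - 1) / 12."""
    return n * n * (n * n - 1) // 12

for n in range(1, 8):
    print(n, count_squares_brute_force(n), count_squares_sum(n), count_squares_closed_form(n))
```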

Quarantine Quandaries and Variable Vaccines

Adi Mittal

Modelling the past 17 months in 17 minutes

COVID-19 is one of those events that will likely define not just the way people interact with each other, but the way entire societies do. I wouldn't even be surprised to see this pop up in an AP US History textbook in a decade just for how long the pandemic has been drawn out. So it should be no surprise that from the first month of the pandemic, a vaccine was the only thing on people's minds. I mean, just look at this graph depicting the number of total coronavirus cases in the U.S. alone.

Data as reported by the CDC (updated as of July 15th)

It took around 2 months to hit a million cases in the U.S., and another 3 months to reach 5 million cases. That's the issue with pandemics: they explode at a rate faster than we can realize. So, today, I want to talk a little bit about different types of disease models, and how these can educate ourselves on the right preventative courses of action.

The SIR Model 2 Ways

The Susceptible Infected Recovered Model of disease spread groups individuals into 3 boxes and relates them to show how each box grows or diminishes over time: susceptible means that you are currently healthy but are vulnerable to the infection; infected means just that and indicates a current illness; and while recovered doesn't necessarily mean you overcame the disease, it means you are unable to spread the disease anymore (be it through proper recovery and developed immunity, or the unfortunate case of death, since in both cases you are no longer a disease vector). This sounds like the perfect use for a Markov chain (here's a refresher if you need one)! Our Markov chain will have 3 states, being the aforementioned susceptible, infected, and recovered, and will follow a transition model as such:

The SIR Markov chain model

Let's say everyone starts out as susceptible; we can write our initial population as $N$, and the initial distribution of those $N$ people as $P = \begin{bmatrix} N & 0 & 0 \end{bmatrix}$ for $N$ susceptible, 0 infected, and 0 recovered.

With that out of the way, let's think about the above Markov chain. If you are currently susceptible, each day there's a chance you might become infected. Let's call this probability $\beta$ as the average chance of getting infected. At the same time though, if you are smart and act carefully, you might not become infected, and this will be $1-\beta$. Similarly if you're infected, there's a chance you might (positively or otherwise) overcome the disease! We'll call this probability $\gamma$, and the chance of not recovering $1-\gamma$ (you can think of $\gamma$ by taking its inverse: $\frac{1}{\gamma}$ is the average number of days for a recovery to occur). Finally, if you're recovered, that's the end of your journey, as once recovered you're always recovered. This can be condensed into the simple transition matrix below.

$ M = \begin{bmatrix} 1-\beta & \beta & 0 \\ 0 & 1-\gamma & \gamma \\ 0 & 0 & 1 \end{bmatrix} $

With that, let's run a trial with infectivity $\beta=.4$ and $\gamma=\frac{1}{5}$ (we expect an infection to take 5 days to recover) in a population of $N=1000$. So on Day 0, everyone is healthy and (ironically) susceptible to our disease. The following Day 1, people are interacting and enjoying themselves, unbeknown to them that they are spreading a new contagion. Day 1 results in $PM = \begin{bmatrix} 600 & 400 & 0 \end{bmatrix}$, which makes sense as we expected 40% to become infected. We can track each state of S, I, and R and plot them accordingly across the span of a month.

Evolution of the SIR Markov chain as percent of the population

For such a simple model, it's not bad, but there are some obvious flaws. For starters, the infected population doesn't really infect others; all infections stem from random appearances in the susceptible population. This is why by the end of the 30 days, we have 0 susceptible and only recovered, since infections don't require infected to be present which is kind of odd (notice how our initial population was only susceptible and yet infected pop up). We can do better.
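Here is a minimal sketch of that Markov chain in code, assuming a population of $N=1000$ (borrowed from the differential equation example later on) along with $\beta=.4$ and $\gamma=\frac{1}{5}$. The function names are made up for this example; the only real content is repeatedly multiplying the row vector $P$ by $M$.

```python
def step(distribution, matrix):
    """One day of the chain: row vector times transition matrix."""
    size = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(size))
            for j in range(size)]

def simulate_markov_sir(population, beta, gamma, days):
    """Return the [S, I, R] counts for day 0 through `days`."""
    matrix = [
        [1 - beta, beta, 0.0],      # susceptible: stay healthy or get infected
        [0.0, 1 - gamma, gamma],    # infected: stay sick or recover
        [0.0, 0.0, 1.0],            # recovered: stays recovered
    ]
    history = [[float(population), 0.0, 0.0]]  # everyone starts susceptible
    for _ in range(days):
        history.append(step(history[-1], matrix))
    return history

# N = 1000 is an assumption for this sketch
history = simulate_markov_sir(population=1000, beta=0.4, gamma=0.2, days=30)
for day in (0, 1, 5, 10, 30):
    s, i, r = history[day]
    print(day, round(s, 1), round(i, 1), round(r, 1))
```

Notice that nothing in `matrix` depends on how many people are currently infected, which is exactly the flaw described above.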

While Markov chains were appealing due to the nature of having 3 states, what we really want to focus on is the relationships between the 3 states: if there are lots of infected people, that should increase the rate at which others get infected. Since we're looking at how the value of one box (infected people) affects the rate at which the other boxes change (how fast susceptible and recovered decrease/increase), perhaps we should try a system of differential equations. Instead of states, we'll now have 3 different functions: $S(t)$, $I(t)$, and $R(t)$, which return the number of susceptible, infected, and recovered at a time $t$. If you're not familiar with differential equations, don't worry, this section will be brief. The idea behind differential equations is that sometimes it's hard to exactly quantify a function or value, but we know how the function changes relative to another value. See the first few minutes of this video if this idea intrigues you and for some nice opening examples. Besides, you already know more or less what the equations should look like.

$ \begin{align} \frac{dS}{dt} & = -\textrm{Number of new infections} \ \newline \frac{dI}{dt} & = \textrm{Number of new infections} - \textrm{Number of new recovered} \ \newline \frac{dR}{dt} & = \textrm{Number of new recovered} \end{align} $

That's really all there is to them. The specific math behind it looks more complicated than it is, but keeping the above in mind will make it much clearer.

$ \begin{align} \frac{dS}{dt} & = -\frac{\beta S}{N} I \ \newline \frac{dI}{dt} & = \frac{\beta S}{N} I - \gamma I\ \newline \frac{dR}{dt} & = \gamma I \end{align} $

I like to think of it this way: the fraction of vulnerable people $\frac{S}{N}$ has a $\beta$ chance of being infected, which is scaled up by the number of infected people $I$. That makes sense for the number of new infected people. Similarly, if the average chance of recovery is $\gamma$, then you should expect a proportion $\gamma$ of infected people to recover ($\gamma I$). The last important fact to note is what happens when we add these equations together:

$\frac{dS}{dt}+\frac{dI}{dt}+\frac{dR}{dt}=0$

No matter how the 3 different categories of people evolve over time, the net change between all of them will be 0, meaning our total population remains the same over time (which is good since we don't want people appearing and disappearing out of nowhere). Let's watch the scenario unfold once more with $N=1000$ with a distribution of $P = \begin{bmatrix} 999 & 1 & 0 \end{bmatrix}$ (someone has to start the pandemic), $\beta=.4$, and $\gamma=\frac{1}{5}$.

Evolution of the SIR differential equations as percent of the population

That's more like that famous curve. It may not be as drastic as the Markov chain model we tried earlier, where nearly everyone was infected within 10 days, but it's definitely much more realistic. A single person was able to infect about 800 people across only a couple of months, which is still scary fast.
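For the curious, here is a rough sketch of how such a simulation could be run. It uses plain forward-Euler steps (a simple choice for illustration, not necessarily how the plot above was made), and the function name and the step size `dt` are assumptions of this example.

```python
def simulate_sir(susceptible, infected, recovered, beta, gamma, days, dt=0.01):
    """Forward-Euler integration of the SIR differential equations."""
    population = susceptible + infected + recovered
    s, i, r = float(susceptible), float(infected), float(recovered)
    history = [(0.0, s, i, r)]
    for k in range(1, round(days / dt) + 1):
        new_infections = beta * s / population * i * dt  # leaves S, enters I
        new_recoveries = gamma * i * dt                  # leaves I, enters R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history

history = simulate_sir(999, 1, 0, beta=0.4, gamma=0.2, days=200)

# Day with the most simultaneous infections
peak_day, _, peak_infected, _ = max(history, key=lambda row: row[2])
print("peak of", round(peak_infected), "infected around day", round(peak_day))

# Totals at the end of the run; S + I + R should still add up to N
_, s, i, r = history[-1]
print("final S, I, R:", round(s), round(i), round(r), "total:", round(s + i + r))
```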

$R$ and Disease Containment

Just as famous as this curve are $R$ and $R_0$, which you have likely heard of too. $R$ is the reproduction number of a disease/virus that tells you, on average, how many people an infected person will spread the disease to. If $R>1$, then you get the epidemic issue, where the disease is spreading exponentially as each person is giving more people than just themselves the disease. When $R=1$, you have an endemic where the disease is neither spreading nor being contained. When $R<1$, then you have a contained virus that is decaying throughout a community. $R_0$ is just the $R$ value at the start of the outbreak, and can be found with the formula $R_0 = \frac{\beta}{\gamma}$. In our simulation, $R_0 = \frac{.4}{.2} = 2$. So if we wanted to contain this disease, we want to find how many infections we need to contain for $R<1$. Since $R_0 \cdot .5 = 1$, we only need to contain 50% of infections to contain the outbreak!

You can see that as soon as there is less than 50% of the population left to infect, infections start to decline: not end, but decline. You can also reduce infectivity by taking preventative measures (e.g. wearing masks) or getting the appropriate immunization against the disease (e.g. vaccines). If you want to read more about this model and more of its intricacies, Nicky Case wrote a very nice interactive article about it.
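A short worked step, using only the differential equations from before (same $S$, $I$, $N$, $\beta$, and $\gamma$), hints at where that 50% figure comes from. Infections decline exactly when $\frac{dI}{dt} < 0$, and as long as $I > 0$:

$\frac{dI}{dt} = \left(\frac{\beta S}{N} - \gamma\right) I < 0 \iff \frac{S}{N} < \frac{\gamma}{\beta} = \frac{1}{R_0} = \frac{.2}{.4} = .5$

So with our $\beta=.4$ and $\gamma=\frac{1}{5}$, the number of infected people starts falling once fewer than half the population is still susceptible.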

Small Scale Vaccinations and Graph Theory

While we did find a nice model to represent disease spread, the SIR model only is really nice for analyzing very big communities. With the SIR model, we assume everyone interacts with enough people so that infected people can always target a susceptible person if they need to in this very dense network of people.

Most people, though, only regularly interact and really care about those immediately in their social circles. In this small scale, such as say, a neighborhood or even a friend group, people do not interact with others equally. Some are more introverted and interact with only a few people, and some are extroverted beacons that interact with most of the group. This changes the dynamics of disease spread a lot as obviously those who are more outgoing and meet more people will be more at risk of being infected than someone who only talks to a few people. As such, we'll need a different approach than the SIR model.

In order to model connections between people we'll use a graph. A graph in this case is not the standard parabola of $f(x) = x^2$, but rather it is a visual tool consisting of two parts: there are nodes which are dots (to represent people here), and edges that are lines that connect nodes (to represent interactions or friendships).

Example of a graph that may represent the individual friendships in a clique

As in most groups, we have a couple of people at the center connected to quite a few people, and some on the edge of the circle regularly interacting with as few as 2 people. This is a really helpful representation as now not only can we watch where the virus spreads, but we also have a direct way to implement small scale prevention tactics as we can see specifically who is infected. Ideally, we would vaccinate everyone in this group and make sure no (consequential) disease spread can occur, but that might not happen. So, if we could only vaccinate, say, 25% of this group, who would you vaccinate to minimize a disease vector? There are some obvious candidates such as the nodes with the most connections, but what is the best way? Let's leave our susceptible people in green and infected in red, but we'll add a new purple node to indicate vaccinated.

Randomly vaccinating 5 of the 20 people in the network. Can we do better?

Before we talk strategies, let's be clear about some of the assumptions made for simplicity's sake.

Vaccinations are 100% effective both ways, completely negating the possibility of infection of and transmission from the vaccinated person as if fully immune.
Edge lengths do not matter; if someone is connected to another person, we assume that that edge will act the exact same way as all other interactions and edges. This ensures a constant $\beta$.
Dying and recovering from the disease will be again treated the same under an overarching "Recovered" state, in which the affected person becomes fully immune.
An infected person "recovers" after only a single day ($\gamma = 1$) since a) it will ensure our simulation can spit out some numbers at the end as it's hard to completely protect an entire network of people, and b) our program wasn't written the most efficiently.

As before, we want to find ways to reduce $R_0$. But, what is $R_0$ in this case? Well, remember $R_0$ is just the average number of transmissions a single infected person will instigate. So, we can approximate this with the average number of edges an infected node has, multiplied by the infectivity. Since we can't change the infectivity rate ($\beta$), the only way we can lower $R_0$ is by removing possible edges for infected people to transmit across. So, let's look at different strategies that are both really good and really bad at removing edges from our graph.

Random: Realistically, vaccinating people randomly seems like a bad idea. In a simulation, though, it's never a bad starting point.
Random then neighbors: Here, we start with a single random person to vaccinate, then only vaccinate people connected to an already vaccinated person. If your friend got vaccinated, why shouldn't you?
Most connected: This is the obvious solution. If we want to reduce edges efficiently, why don't we vaccinate the nodes with the most edges connected to the most people?
Most connected non-neighbors: This is a spin on the last one. We want to vaccinate the most connected people, but why not spread them out a bit? We pick the highest connected node to vaccinate, and then proceed to vaccinate the next highest node that is not connected to a vaccinated person.
Least connected: This one is more of a why not scenario. It's like the previous two strategies but reversed: find the loners of the group and vaccinate them.

Again, we'll vaccinate $\approx$25% of our population (50/100, 100/200, 200/400, and 400/800 people/total connections) according to our strategies above. We'll also infect 10% of our population with the contagion of your choosing with the same infectivity at $\beta = .4$ as before. We'll do 10 trials per population size, and average them to get a rough estimate of how each strategy fared.

Unsurprisingly, vaccinating the most connected nodes without restriction fared the best; closing off as many routes of infection as fast as possible led to only about 10% of non-vaccinated people getting infected. Following close behind is the most connected non-neighbors strategy. We tried spreading out vaccinations since the edge between two vaccinated people doesn't require two doses of the same vaccine, but by nature of being a popular node, it is likely connected to other popular nodes too. So in reality, we are removing fewer edges than we could be by just vaccinating the most popular nodes. It's for this reason that our orange strategy of randomly vaccinating one person then its neighbors was such a bad strategy: not only do we not know if we're vaccinating a well connected person, but we're wasting vaccines by doubly protecting edges between neighbors. Since reducing the edge count is our only form of reducing $R_0$, this is a very bad strategy. What is surprising, though, is that vaccinating the least connected people was almost as good as random. Since unpopular people can only be connected to so many people, they tend to be pretty spread out, removing a fair number of edges between all the vaccinated people.
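The experiment can be sketched in a few dozen lines. To be clear, this is not the original project code: it is a simplified reconstruction under the assumptions listed earlier, it only implements the random and most connected strategies, it seeds infections among non-vaccinated people only, and it counts those seeds as infected, so its numbers won't match the chart exactly. All names are invented for this sketch.

```python
import random

def random_graph(n_nodes, n_edges, rng):
    """Place n_edges distinct edges uniformly at random between n_nodes nodes."""
    neighbors = {v: set() for v in range(n_nodes)}
    placed = 0
    while placed < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        if b not in neighbors[a]:
            neighbors[a].add(b)
            neighbors[b].add(a)
            placed += 1
    return neighbors

def pick_random(neighbors, budget, rng):
    """Strategy: vaccinate `budget` people chosen uniformly at random."""
    return set(rng.sample(sorted(neighbors), budget))

def pick_most_connected(neighbors, budget, rng):
    """Strategy: vaccinate the `budget` people with the most edges (rng unused)."""
    ranked = sorted(neighbors, key=lambda v: len(neighbors[v]), reverse=True)
    return set(ranked[:budget])

def outbreak(neighbors, vaccinated, beta, seed_fraction, rng):
    """Fraction of non-vaccinated people who are ever infected (seeds included)."""
    candidates = [v for v in neighbors if v not in vaccinated]
    n_seeds = max(1, round(seed_fraction * len(neighbors)))
    infected = set(rng.sample(candidates, n_seeds))
    ever_infected = set(infected)
    while infected:
        newly = set()
        for v in infected:
            for u in neighbors[v]:
                if u in vaccinated or u in ever_infected:
                    continue
                if rng.random() < beta:  # each edge transmits with probability beta
                    newly.add(u)
        ever_infected |= newly
        infected = newly  # gamma = 1: yesterday's infected have all "recovered"
    return len(ever_infected) / len(candidates)

def average_outbreak(strategy, n_nodes=200, n_edges=400, trials=10, beta=0.4, seed=0):
    """Average outbreak size over several random graphs for one strategy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        graph = random_graph(n_nodes, n_edges, rng)
        vaccinated = strategy(graph, n_nodes // 4, rng)  # vaccinate 25%
        total += outbreak(graph, vaccinated, beta, 0.10, rng)
    return total / trials

print("random:        ", round(average_outbreak(pick_random), 3))
print("most connected:", round(average_outbreak(pick_most_connected), 3))
```

The other three strategies would slot in as extra `pick_...` functions with the same signature.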

What next?

Honestly, our graph model isn't all that bad considering how simple it is, but let's look at some nice upgrades we can hand over to it.

Graph Generation

This was originally made as a school project, so we generated our graph by taking some number of edges and placing them randomly between nodes for simplicity in presentation. However, these obviously aren't the only types of networks. A prominent one for modelling interactions between people is the Erdös-Rényi graph. For $n$ nodes, there are $n \choose 2$ possible pairings of nodes for edges. In an ER graph, we flip a weighted coin to accept each individual edge, and the final result is just the collection of edges we added. If our acceptance probability is $p$, then we would expect the graph to have a total of ${n \choose 2}p$ edges, and each node to have on average $p(n-1)$ connections. In this network, people are approximately as popular as each other (since our coin flip method leads to a binomial distribution with few super popular or super unpopular people).

An Erdös-Rényi graph with an average connection of 4.8. Look how nice and even this graph is.
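The coin flipping recipe translates almost word for word into code. This is a minimal sketch with made-up names (`erdos_renyi`, `edge_count`); the choice of $n=200$ nodes is arbitrary, and $p$ is picked so that the expected number of connections $p(n-1)$ comes out to 4.8 like in the picture.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Flip a weighted coin for each of the n-choose-2 possible edges."""
    neighbors = {v: set() for v in range(n)}
    for a, b in combinations(range(n), 2):
        if rng.random() < p:  # accept this edge with probability p
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def edge_count(neighbors):
    """Every edge is stored twice, once per endpoint."""
    return sum(len(adj) for adj in neighbors.values()) // 2

rng = random.Random(1)
n = 200
p = 4.8 / (n - 1)
graph = erdos_renyi(n, p, rng)

print("edges:", edge_count(graph), "expected about", round(n * (n - 1) / 2 * p))
print("average connections:", 2 * edge_count(graph) / n)
print("most popular person has", max(len(adj) for adj in graph.values()), "connections")
```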

Another type of graph is the Barabási-Albert type graph. The idea with this network is that people tend to be connected to a distinct, central "hub" of a few people that make up most of a network. You start with a small network of $m$ people/nodes. At each step, you add a new node and connect it to an existing person $i$ with probability $p_i = \frac{E(i)}{\sum_jE(j)}$ where $E(x)$ is the number of edges for node $x$ and the sum over $j$ runs over all existing nodes. The idea is that if $E(i)$ is really big (i.e. a super popular person), then you'll have a much bigger probability to connect to that person over someone who only has a few connections. Wouldn't you want to hang out with the cool kids too? You can give each new node as many connections as you want, so long as it's no more than $m$ (think about the first new node: how can someone have 10 different friends in a group of 5?).

A Barabási-Albert graph where each new node gets 2 connections to the existing network. Can you see the clusters of the graph?
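Here is one possible sketch of that growth process. Two details are assumptions of this example rather than part of the description above: the starting network of $m$ people is taken to be fully connected (so everyone starts with at least one edge), and each new node picks `k` distinct people. The trick of keeping a list with one entry per edge endpoint makes "pick proportional to number of edges" a plain uniform choice.

```python
import random

def barabasi_albert(n, m, k, rng):
    """Grow to n nodes from a complete graph on m nodes; each new node gets k edges."""
    if not (2 <= m <= n and 1 <= k <= m):
        raise ValueError("need 2 <= m <= n and 1 <= k <= m")
    neighbors = {v: set() for v in range(n)}
    endpoints = []  # node v appears once per edge it has
    for a in range(m):
        for b in range(a + 1, m):
            neighbors[a].add(b)
            neighbors[b].add(a)
            endpoints += [a, b]
    for new in range(m, n):
        targets = set()
        while len(targets) < k:
            # Uniform choice over endpoints = choice proportional to edge count
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            neighbors[new].add(t)
            neighbors[t].add(new)
            endpoints += [new, t]
    return neighbors

graph = barabasi_albert(n=300, m=3, k=2, rng=random.Random(0))
degrees = sorted((len(adj) for adj in graph.values()), reverse=True)
print("five most popular people:", degrees[:5])
print("average connections:", sum(degrees) / len(degrees))
```

The handful of very large numbers at the top of the list are the hubs.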

If you want to explore this more, the inspiring article that led to this project analyzes these 2 exact graphs in the exact problem we discussed above, going a bit more in depth with the idea of the counterintuitive friendship paradox.

New Vaccination Schemes

We looked at a very specific set of vaccination parameters, namely relying on all edges being treated equally and on the number of edges a node has being constant. In reality, people don't interact with each other equally, so instead of having a generic edge, we could implement an edge weighting. This would add a number between 0 and 1 to each edge to represent how "friendly" two people are with each other. We could then simulate the difference between best friends and mere colleagues interacting, who may have more or less of a chance to spread a virus. The other aspect to consider is updating edge counts. Ideally we would only want to count a vaccine candidate's number of edges to vulnerable people, not to vaccinated or infected people too, with counts updating after every vaccination. Then we could focus on protecting people specifically.

The final problem I wanted to talk about is what the goal of vaccines is in a network. As we said, it's about removing edges efficiently, but some edges are definitely worth more than others. Imagine you have two friend circles, with exactly one mutual person between them. Even if that mutual friend has only two edges, vaccinating him would close that bridge between the two friend circles, isolating the virus into only one of them.

Vaccinating the literal middle-man isolates the two groups
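To make the middle-man idea concrete, here is a toy sketch. The graph, the node names, and the helper functions are all invented for this example: two fully connected friend circles, one mutual friend with exactly two edges, and a breadth-first search to see who the virus could possibly reach.

```python
from collections import deque

def reachable(neighbors, start, blocked):
    """Everyone the virus could reach from `start` without passing through blocked nodes."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in neighbors[v]:
            if u not in seen and u not in blocked:
                seen.add(u)
                queue.append(u)
    return seen

def two_circles(size):
    """Two fully connected friend circles joined only through one mutual friend."""
    left = [f"L{i}" for i in range(size)]
    right = [f"R{i}" for i in range(size)]
    neighbors = {name: set() for name in left + right + ["mutual"]}
    for circle in (left, right):
        for a in circle:
            for b in circle:
                if a != b:
                    neighbors[a].add(b)
    # The mutual friend knows exactly one person in each circle
    for friend in (left[0], right[0]):
        neighbors["mutual"].add(friend)
        neighbors[friend].add("mutual")
    return neighbors

graph = two_circles(4)
print("no vaccines:", len(reachable(graph, "L1", blocked=set())), "people at risk")
print("mutual friend vaccinated:", sorted(reachable(graph, "L1", blocked={"mutual"})))
```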

This search for a way to split our graph into two parts with the fewest edges removed (relative to the size of the parts) is called the sparsest cut problem. If we can find the (usually approximate) sparsest cut, we can divide our graph in two and attempt to isolate the virus in only a singular bubble. Before, we were simply trying to remove as many edges as possible, so that the virus has fewer paths from any one node to another. Here, we are trying to isolate the virus into a smaller bubble.

Sparsest cut of a random ER graph that approximately cuts the graph in half

So here we would want to vaccinate everyone on the border of the yellow-purple divide to make the sparsest cut a reality. Of course, this only works if you have enough vaccines AND if your sparsest cut splits the graph in half evenly. This strategy can be used recursively on the two sub-graphs, so the more vaccines you have, the better this strategy works.

Conclusion

I hope this brought some light to the epidemic math that goes on behind the scenes of so many news articles, but more importantly showed you the power of modelling situations not just in different ways, but creative ones as well. Graph theory in particular has such widespread applications that often just thinking of something simply in terms of connections can allow you to borrow from its variety of tools and techniques.

Treeline Riddles and Triangular Rends

Adi Mittal

A puzzle to test in the forest

If you break a stick at 2 uniformly random points into 3 segments, what's the probability you can form a triangle out of those 3 segments?

As with all puzzles, drawing something always helps.

Label the ends of the stick to be 0 and 1, and we'll make the first break at point $x$.

The key to this puzzle is to use the triangle inequality: no side of a triangle can be longer than the sum of the other two.

The triangle inequality visualized.

So, that means that no segment can be longer than half the length of our stick. Now, we can start solving the puzzle. Without loss of generality, make the first break at point $x$ between 0 and .5 to ensure our first break is on the left side (if it's not within this range, it'll be some symmetric case on the right half of the stick). Our left piece of the stick is already shorter than .5, so that's good, but that means our second break must be on the right piece, as otherwise that piece will be a leg longer than .5, and that's no good. The probability our second break lands on the right piece is equal to the length of that segment over the total length of the stick (since it's uniformly random): $\frac{1-x}{1}$. Let's analyze that longer leg now.

The feasibility region in red shows the location of a valid cut to ensure it does not make a segment longer than .5 units.

Measuring from the left end of this longer leg, we can't make a cut at or further than the .5 mark for the obvious reason: it would make a leg longer than .5, breaking the triangle inequality. For a similar reason, we can't make a cut at $1-x-.5$ or earlier, as that will make a leg on the right side longer than .5 too. We need a cut in between those two boundaries. Just like we found the probability for making a cut on this longer leg to begin with, if we can find the length of that feasible region and divide that by the length of the leg, that will give us the probability for making a good cut here. The length of the red feasible region is $|.5-(1-x-.5)| = |1-1+x| = x$, and the total length is just $1-x$. So the probability our second cut is valid is $\frac{x}{1-x}$.

Putting it all together, we get the probability that our second cut is valid given a first cut at $x$ is $\frac{1-x}{1} \cdot \frac{x}{1-x} = x$. Kind of neat that the probability is exactly equal to the length of the first piece. But of course, $x$ isn't constant. It too is a random variable, so we can average it to get an expected probability across a large number of trials.

$\frac{1}{\frac{1}{2} - 0} \int_0^{\frac{1}{2}} x \,dx = 2 \cdot \frac{x^2}{2} \Big]_0^{\frac{1}{2}} = (\frac{1}{2})^2 = \frac{1}{4}$

As a gut check, this should make sense as .25 is exactly the midpoint between 0 and .5, the boundaries we set at the start of the solution.

So, if you break a stick at 2 uniformly random points into 3 segments, the probability those 3 segments can form a triangle is exactly $\frac{1}{4}$.
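If you'd rather not trust the integral, a quick simulation makes a decent sanity check. This is only a sketch: the function names, the trial count, and the fixed seed are my own choices, and it relies on the fact from above that a triangle exists exactly when no piece is half the stick or longer.

```python
import random

def can_form_triangle(a, b):
    """Break a unit stick at points a and b, then test the triangle inequality."""
    left, right = min(a, b), max(a, b)
    pieces = (left, right - left, 1.0 - right)
    # A triangle exists exactly when every piece is shorter than half the stick.
    return max(pieces) < 0.5

def estimate_probability(trials, seed=0):
    """Estimate the success probability from `trials` random pairs of breaks."""
    rng = random.Random(seed)
    hits = sum(can_form_triangle(rng.random(), rng.random()) for _ in range(trials))
    return hits / trials

estimate = estimate_probability(200_000)
print(estimate)  # should land close to 0.25
```

Note that the simulation doesn't use the "first break on the left half" trick at all, so it checks the symmetry argument as well as the integral.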

Linear Programming and Laudable Polytopes

Adi Mittal

Monopolize the world with mx+b

You're a business tycoon. You have ideas in mind but no products on hand. You need to make a quick buck now and all you have is some needlework under your belt to carry you. If you wanted to maximize profit between making hats and shirts, what combination of the two should you make? Well, let's look at what our goal is. If we make \$15 per shirt, and \$10 per hat, we can have a simple goal of

$\max(15s + 10h)$

Obviously, we can't just make infinitely many of each wearable. We have some constraints between available cloth, printing, and effort. Say you have 35 sheets of cloth, where each shirt requires $\frac{1}{2}$ of a sheet and each hat only requires $\frac{1}{5}$ of one. For printing, maybe you only have enough ink for 100 prints total. Finally, hats are hard to make, so you cap yourself at making 70 and no more. We have a list of constraints which can be nicely written as a set of inequalities:

$\max(15s + 10h)$
$\begin{align} \frac{1}{2}s + \frac{1}{5}h & \leq 35 \ \newline s + h & \leq 100 \ \newline h & \leq 70 \end{align}$

This isn't an easy system to solve. We're not solving anything in the traditional sense by plugging equations into one another, nor is it a standard maximization problem where we can take some form of a gradient: every equation here is a line, so we would only get constants that give us no new information.

This type of problem, where we want to maximize a linear function that is also bound by a set of linear constraints, is called a linear program. The word "program" is used here in the older sense of a plan or schedule rather than computer code, but as you'll see, while we have analytic ways to find solutions, finding the best solution quickly requires some well-crafted algorithms. Before we can understand the 2-variable, and later the general $n$-variable, setup, let's try an easier case.

1D Linear Program

As with all problems, let's try and cut back on degrees of freedom so we can really focus on what's going on here. Instead of maximizing a function of 2 variables, let's do a function of 1 variable.

$\max(3x)$
$\begin{align} 5x & \leq 50 \ \newline x & \geq 6 \end{align}$

This isn't all that interesting, as I'm sure the answer is popping out almost immediately: $\max(3x)=30$, since we can only have at most 10 of variable $x$ from the first constraint. What will become an important concept later is how we visualize these solutions to linear programs, so let's graph our inequalities on a number line to represent the quantity of variable $x$.

Feasibility region graphed for our 1-variable linear program.

In blue, we have our first constraint, showing all values of $x$ that satisfy $5x \leq 50$, and in red we have our second constraint, showing all $x$ that satisfy $x \geq 6$. Where those two regions overlap in the purple-ish area is the feasibility region, as well, it's the area where all of our constraints are satisfied and where feasible values of $x$ lie. The feasibility region defines a polytope, a generalization of the 2D polygon and 3D polyhedron. The key takeaway is that our solution of $x=10$ is an edge case: it lies on the furthest possible boundary of our feasibility region.

This makes sense for two reasons. 1) We want to always be looking to maximize or minimize our objective function ($3x$, in this case) so if we can add more or less of $x$ to our solution, we want to do that depending on our goal. This is a more intuitive way to show reason 2) our function is linear, so as long as we go in one direction, it will always be increasing or decreasing. Let's add the $y$-axis into this graph and let $y=3x$ to see the value of the function we're maximizing at different values of $x$.

Objective function $y=3x$ plotted with an extra dimension.

Since our objective function is linear, it is directly proportional to $x$, so as $x$ increases or decreases, $y$ also has to only increase or decrease with no chance of weird curves or bends in its function. So if we want to maximize our function, we just want to walk in the direction of $x$ that does so until we hit the edge of our feasibility region. Similarly, if we wanted to minimize it, we'd walk the other way along $x$ that decreases $y$.

With this in mind, we can now go back to our original 2 variable case.

2D Linear Program

Recalling what our original setup was

$\max(15s + 10h)$
$\begin{align} \frac{1}{2}s + \frac{1}{5}h & \leq 35 \ \newline s + h & \leq 100 \ \newline h & \leq 70 \end{align}$

Let's plot our feasibility region like we did before. Only now, it'll be a 2-dimensional region instead of just the number line. We'll put $s$ for shirts on the $x$-axis and $h$ for hats on the $y$-axis.

5 vertices map out our 2D linear program's feasible region, meaning there are 5 possible points that can be our optimal solution.

Just as in the 1-variable case, our solution for maximizing our objective function should lie at the edge of our feasible region. I've marked the feasible region's defining vertices in purple. It's worth pointing out that I have added an additional vertex not defined by our original constraints, and that's the vertex $(0,0)$. It comes from the aptly named non-negativity constraints which, well, constrain our variables to be non-negative (shocking, right). These tend to be a requirement for some minimization cases just so that we don't end up running into a pair of negatively infinite numbers as our solution, which obviously doesn't make sense.

Although what I said about our solution being on the edge of the region is still true, I can tell you more about our maximum solution: it will (usually) be one of the vertices of our feasible region. If you want a more math-based explanation, check this out, but there's a very intuitive way to think of this, especially with a 2-variable linear program.

Remember how in the 1D case, our objective function $y=3x$ could be represented as a line? We can do a similar idea and add a third $z$-axis to our 2D problem and get $z=15s+10h$. This equation gives us a plane in 3 dimensions as we encode the number of shirts, hats, and the amount of money they profit. Our constraints are also planes, acting as curtain-like walls extending forever vertically in the $z$-axis (as all values of $z$ will satisfy whatever values of $s$ and $h$ that lie on the line). Now, imagine our objective function plane of $z=15s+10h$ to be like a tilted table, and we place a ball on it to roll, what does the ball do? Well, the ball will just roll down the incline to the lowest point gravity can pull it to. Normally, the ball would roll down forever and ever as our table extends infinitely in all directions, but we have our constraining walls to stop the ball. As soon as the ball hits a wall, it'll continue to slide down that wall as long as the ground tilts downward. It'll only stop if another wall pushes it into a corner, which is where our constraints form a vertex. Isn't that neat? The only time this won't be true is if one of our constraints forms a wall that is exactly perpendicular to the direction the ball is rolling, in which case you will have a line segment worth of solutions (which includes two vertices at the end of it). This specific logic applies to finding minimums of our objective function, but you can also think of the reverse for maximums with a tilted ceiling and a balloon instead of a ball. It's basically a streamlined version of gradient descent since the gradient of the plane is always constant.

So if you manually check all 5 vertices, you'll find that a combination of $(50 \textrm{ shirts}, 50 \textrm{ hats})$ gives the optimal combination of \$1250 profit. Not bad.
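The manual check can also be handed to a few lines of code. The sketch below is one way to do it, not the only way: it intersects every pair of constraint boundaries (including the two non-negativity constraints), throws away the intersections that break some other constraint, and evaluates the profit at what's left. The helper names are my own, and I'm using exact fractions to avoid rounding surprises.

```python
from fractions import Fraction as F
from itertools import combinations

# Each constraint is (a, b, c), meaning a*s + b*h <= c.
constraints = [
    (F(1, 2), F(1, 5), F(35)),  # cloth
    (F(1), F(1), F(100)),       # ink
    (F(0), F(1), F(70)),        # hat cap
    (F(-1), F(0), F(0)),        # s >= 0
    (F(0), F(-1), F(0)),        # h >= 0
]

def intersection(c1, c2):
    """Solve the 2x2 system where both constraints hold with equality (Cramer's rule)."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # parallel boundaries never meet
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(point):
    """True when the point satisfies every constraint."""
    s, h = point
    return all(a * s + b * h <= c for a, b, c in constraints)

def profit(point):
    s, h = point
    return 15 * s + 10 * h

vertices = set()
for c1, c2 in combinations(constraints, 2):
    p = intersection(c1, c2)
    if p is not None and feasible(p):
        vertices.add(p)

best = max(vertices, key=profit)
for v in sorted(vertices):
    print(tuple(str(coord) for coord in v), profit(v))
print("best:", tuple(str(coord) for coord in best), profit(best))
```

This brute-force approach checks every pair of constraints, which is fine for 5 of them but grows quickly as constraints and variables are added.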

The Simplex Algorithm

Not to dismiss the wonders of guess-and-check, but it only worked well for us due to the small number of variables and constraints we had. As we tackle more complex and intricate problems, we want to be able to solve linear programs much quicker and more efficiently. There are many algorithms that have been developed for LP optimization with varying motives, but the first one was George Dantzig's simplex algorithm.

The simplex algorithm is essentially a systematic way for us to narrow our guess-and-check quickly. We start at some vertex, and then travel along the edges of the feasible region between vertices until we land on the optimal solution. The idea is that by turning our constraints into a matrix, we can use Gaussian elimination to move between possible candidate solutions until no improvement to our objective function can be made. Let's use our original 2-variable problem to try it out.

$\max(15x + 10y)$
$\begin{align} \frac{1}{2}x + \frac{1}{5}y & \leq 35 \ \newline x + y & \leq 100 \ \newline y & \leq 70 \end{align}$

To start, we're going to turn these inequalities into equations by adding slack variables. To avoid confusing variable names later, I have replaced $s$ with $x$ and $h$ with $y$. The idea is that if the left-hand side of an inequality is anything less than the right-hand side, we should be able to cover that gap by adding an extra variable to make the two sides equal. Rewriting all of our inequalities (including the objective function with a new objective variable $z$) into equations, we get

$\begin{align} -15x - 10y + z & = 0 \ \newline \frac{1}{2}x + \frac{1}{5}y + s_1 & = 35 \ \newline x + y + s_2 & = 100 \ \newline y + s_3 & = 70 \end{align}$

This is a system of linear equations we can encode as an augmented matrix!

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{1}{2} & \frac{1}{5} & 1 & 0 & 0 & 0 & 35 \\ 1 & 1 & 0 & 1 & 0 & 0 & 100 \\ 0 & 1 & 0 & 0 & 1 & 0 & 70 \\ \hline -15 & -10 & 0 & 0 & 0 & 1 & 0 \\ \end{array} $

Here are some things to identify in our simplex tableau above. The first row is more a convenience than anything, keeping track of which column corresponds to each variable. The second, third, and fourth rows all correspond to some constraint we have rewritten as equations with slack. The fifth and final row is our objective function which we are trying to maximize. So, by default, we assume we start at the point $(0,0)$ in our feasibility region. Yes, it is a vertex and therefore a candidate solution, but obviously it's not the maximum solution we want, giving us a whopping $z=0$. How do we find the next candidate point? We first want to identify which of our variables would increase $z$ the quickest. Notice that $x$ has a coefficient of $-15$ in the objective function's final row, while $y$ has only a value of $-10$ (we call these numbers indicators). This means for every unit of $x$, we gain an additional \$15, compared with an additional \$10 from a single unit of $y$. Since $x$ increases $z$ faster than $y$ does, let's focus on column 1.

If we're going to increase $x$ to increase $z$, how do we know how much to increase it by? We can use our handy constraints to tell us exactly that. What we do is we take each value in the $x$ column and divide its associated row's constraint by that value. So, for the $x$ variable we have

$\begin{align} 35 \div \frac{1}{2} & = 70 \ \newline 100 \div 1 & = 100 \ \newline 70 \div 0 & = \textrm{undefined} \end{align}$

Now why is this helpful? These divisions tell us the maximum amount of $x$ we can have according to each constraint (as we assume $y$ can be 0). So, if $\frac{1}{2}x + \frac{1}{5}y \leq 35$, then $x$ can be at most 70 without breaking that constraint. Similarly, $x$ can be at most 100 for the second constraint, and there's no limit for $x$ on the third constraint since it doesn't impact that inequality. So, this tells us we should use the first row as our guiding row! This is because if we used the second row with a maximum of 100, we would be violating our first constraint that $x \leq 70$. So, we call $\frac{1}{2}$ our pivot as we use that value to shift our focus from one vertex solution to another. We use our pivot to do Gaussian elimination, and create 0s in that column's other rows to land us at a new vertex.

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \color{red}{\frac{1}{2}} & \frac{1}{5} & 1 & 0 & 0 & 0 & 35 \\ 1 & 1 & 0 & 1 & 0 & 0 & 100 \\ 0 & 1 & 0 & 0 & 1 & 0 & 70 \\ \hline -15 & -10 & 0 & 0 & 0 & 1 & 0 \\ \end{array} $
$\downarrow$
$R_2 - 2R_1$
$R_3 - 0R_1$
$R_4 + 30R_1$
$\downarrow$
$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{1}{2} & \frac{1}{5} & 1 & 0 & 0 & 0 & 35 \\ 0 & \frac{3}{5} & -2 & 1 & 0 & 0 & 30 \\ 0 & 1 & 0 & 0 & 1 & 0 & 70 \\ \hline 0 & -4 & 30 & 0 & 0 & 1 & 1050 \\ \end{array} $

This new vertex we have arrived at is exactly the solution you get only focusing on achieving a maximum $x$ at the point $(70,0)$, which is in fact, the solution with the most units of $x$.

The first step in our simplex algorithm visualized as we walk the edge between $(0,0)$ to $(70,0)$. This first step is definitely a better solution than before, but how can we tell if it is the best one?

\$1050 is definitely much better than netting \$0, but how do we know if we can make more? Going back to our objective function in the last row of our tableau, there's a $-4$ in the $y$ column, which, as before, should mean there's a potential to increase profit by adding some $y$ component to our solution. Before we do that though, notice how some of our constraints have changed. $R_2$ used to denote $x+y+s_2=100$, but now it shows $\frac{3}{5}y - 2s_1 + s_2 = 30$, or, as an inequality along the first constraint's boundary where $s_1 = 0$, $\frac{3}{5}y \leq 30$. What caused our constraint to change? It has to do with the fact we added rows together. Let's analyze the two original constraints in question of $R_1$ and $R_2$.

$\begin{align} \frac{1}{2}x + \frac{1}{5}y & \leq 35 \ \newline x + y & \leq 100 \ \end{align}$

We can then treat these as a system of equations like we did with Gaussian elimination, and perform the same operation as we did before with $R_2 - 2R_1$.

$\color{blue}{x + y} - (\color{red}{x + \frac{2}{5}y}) \leq 100 - 70$

The important part to notice is that we're subtracting from the blue region $\color{blue}{R_2}$ and expecting a positive result (since both $x$ and $y$ must be greater than 0). This can only happen for our feasible region where the blue area $\color{blue}{R_2}$ is strictly greater than the red area $\color{red}{R_1}$. This appears only when $\color{red}{R_1}$ is contained within $\color{blue}{R_2}$. The $y$-value we solved for of $\frac{3}{5}y \leq 30$ is that maximum $y$ value of which that holds true. Anything above $(50,50)$ and suddenly the blue region is contained by the red, but anything below is fair game.

Our constraints haven't actually changed; they have just been narrowed down using the new information about which boundaries our current candidate solution sits on.

With that justified, we can now move on to repeating our pivoting process like before. In our new tableau, the $y$ column has a value of $-4$ in the indicators, meaning we have a possibility to increase $z$ by increasing $y$. Going through our process all over again to find and use the pivot:

$\begin{align} 35 \div \frac{1}{5} & = 175 \ \newline \color{red}{30 \div \frac{3}{5}} & \color{red}{= 50} \ \newline 70 \div 1 & = 70 \end{align}$

With a new pivot found...

$\begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{1}{2} & \frac{1}{5} & 1 & 0 & 0 & 0 & 35 \\ 0 & \color{red}{\frac{3}{5}} & -2 & 1 & 0 & 0 & 30 \\ 0 & 1 & 0 & 0 & 1 & 0 & 70 \\ \hline 0 & -4 & 30 & 0 & 0 & 1 & 1050 \\ \end{array} $
$\downarrow$
$3R_1 - R_2$
$3R_3 - 5R_2$
$3R_4 + 20R_2$
$\downarrow$
$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{3}{2} & 0 & 5 & -1 & 0 & 0 & 75 \\ 0 & \frac{3}{5} & -2 & 1 & 0 & 0 & 30 \\ 0 & 0 & 10 & -5 & 3 & 0 & 60 \\ \hline 0 & 0 & 50 & 20 & 0 & 3 & 3750 \\ \end{array} $

There are no more negative numbers in the objective function row's indicators, so this should be our optimal solution! This is because of the actual equation that row represents:

$50s_1 + 20s_2 + 3z = 3750$
$3z = 3750 - 50s_1 - 20s_2$

Since every variable, including our slack variables $s_1$ and $s_2$, is non-negative, the maximum our objective function can be is $3z=3750$ with both $s_1$ and $s_2$ equal to 0. So, while we may be done computing our solution, we should simplify our tableau to make it more readable. Let's make $x$, $y$, $z$, and $s_3$ (our non-zero variables) have coefficients of 1 so we can easily find their values.

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline 1 & 0 & \frac{10}{3} & \frac{-2}{3} & 0 & 0 & 50 \\ 0 & 1 & \frac{-10}{3} & \frac{5}{3} & 0 & 0 & 50 \\ 0 & 0 & \frac{10}{3} & \frac{-5}{3} & 1 & 0 & 20 \\ \hline 0 & 0 & \frac{50}{3} & \frac{20}{3} & 0 & 1 & 1250 \\ \end{array} $

Our second step in the simplex algorithm brought us to our optimal solution of $x=50$, $y=50$, $s_1=0$, $s_2=0$, $s_3=20$, and $z=1250$.

The second and final step in the simplex algorithm takes us from our last candidate solution $(70,0)$ to the optimal vertex of $(50,50)$.

Here's a summary of the simplex algorithm to solve linear programs:

1. Rewrite the objective function and constraints into equations with slack variables.
2. Create the initial simplex tableau using the newly written equations.
3. Identify the most negative indicator to find the pivot variable.
4. Calculate quotients to find upper bounds on the pivot variable, and select the smallest quotient; this is the pivot for this iteration.
5. Use Gaussian elimination and row operations to turn all other values in the column to 0 using the pivot.
6. If any negative indicators remain after all row operations, repeat steps 3–5.
7. If no negative indicators remain, we are done and at the optimal solution$^*$!

While we only looked at 1- and 2-dimensional linear programs, remember that this can work for as many variables as you'd like, which is nice since I don't want to deal with imagining a ball rolling down a 12-dimensional tabletop to find my function minimum. There is a small asterisk though, since sometimes the simplex algorithm can "stall" or "cycle", resulting in no net improvements of the objective function. Fortunately, other algorithms based on other concepts have been built not just to be quicker, but also to avoid the degenerate cases of stalling and cycling.
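The summarized steps translate fairly directly into code. What follows is a bare-bones sketch rather than a production solver: the `simplex` function is my own naming, it assumes a maximization problem that starts feasible at the origin, and it has no safeguards against the stalling and cycling mentioned above. One small difference from the hand computation is that it scales each pivot row so the pivot becomes 1 right away, instead of cleaning up the coefficients at the end.

```python
from fractions import Fraction as F

def simplex(tableau):
    """Run the tableau simplex method in place and return the tableau.

    Rows are constraints with the objective row last; the final column holds
    the right-hand sides. Assumes the starting tableau is feasible.
    """
    while True:
        objective = tableau[-1]
        # Most negative indicator picks the pivot column.
        col = min(range(len(objective) - 1), key=lambda j: objective[j])
        if objective[col] >= 0:
            return tableau  # no negative indicators left, so we are optimal
        # Smallest quotient among rows with a positive entry picks the pivot row.
        candidates = [i for i in range(len(tableau) - 1) if tableau[i][col] > 0]
        if not candidates:
            raise ValueError("problem is unbounded")
        row = min(candidates, key=lambda i: tableau[i][-1] / tableau[i][col])
        # Scale the pivot row so the pivot is 1, then clear the rest of the column.
        pivot = tableau[row][col]
        tableau[row] = [v / pivot for v in tableau[row]]
        for i in range(len(tableau)):
            if i != row:
                factor = tableau[i][col]
                tableau[i] = [v - factor * p for v, p in zip(tableau[i], tableau[row])]

# Columns: x, y, s1, s2, s3, z | right-hand side
tableau = [
    [F(1, 2), F(1, 5), F(1), F(0), F(0), F(0), F(35)],
    [F(1),    F(1),    F(0), F(1), F(0), F(0), F(100)],
    [F(0),    F(1),    F(0), F(0), F(1), F(0), F(70)],
    [F(-15),  F(-10),  F(0), F(0), F(0), F(1), F(0)],
]
simplex(tableau)
for r in tableau:
    print([str(v) for v in r])
```

Running it on our hats-and-shirts tableau takes the same two pivots we did by hand, first on the $x$ column and then on the $y$ column, and ends at the simplified final tableau.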

Sensitivity and Slack Analysis

What the values of $x$, $y$, and $z$ mean should be clear, as those correspond to the point on the feasible region and the maximum of the function, but what do the values of $s_1$, $s_2$, and $s_3$ mean? Recall that these are our slack variables: the variables that turned our constraining inequalities into equations by accounting for any slack in the constraints themselves. So, if we have 0 slack for one of our constraints, it means that we are using up as much of that constraint as we can; there is no slack to account for, and the corresponding slack variable is 0. If there is slack, it means we are not using up a constraint to its fullest potential. Recall our third constraint was

$y \leq 70 \rightarrow y + s_3 = 70$

Also remember that our optimal solution included that $y=50$. We set a cap of $y=70$, but we are only using 50 of those possible 70 units. $s_3$ tells us that in the third constraint, we have an excess of 20 unused constraining units. In that same sense, $s_1$ and $s_2$ tell us we have 0 wasted resources for the first and second constraints. This also tells us that we are precisely at the vertex of where the first and second constraints graphically meet, since our solution is on the edge of both inequalities.

That's not the only information our slack variables tell us, though. You can also find how sensitive our result is in the final row of our tableau. The $\frac{50}{3}$ and $\frac{20}{3}$ tell us that for every additional unit we add to the constraints of $R_1$ or $R_2$, we will make an additional \$ $\frac{50}{3}$ or \$ $\frac{20}{3}$ respectively, since $\frac{50}{3}(s_1 + 1) = \frac{50}{3}s_1 + \frac{50}{3}$. These are called the shadow prices of our constraints. If you want to read more about shadow prices and other marginal analysis such as reduced costs, MIT OpenCourseWare has you covered, but for now there are a few cool extensions of linear programming I want to cover.
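One way to convince yourself of those shadow prices is to nudge a constraint by one unit and recompute the profit. The sketch below does that under an assumption worth stating loudly: it only looks at the corner where the cloth and ink constraints are both tight, so it's only valid for changes small enough that the optimal corner doesn't jump to a different pair of constraints. The function name is my own.

```python
from fractions import Fraction as F

def profit_at_corner(cloth, ink):
    """Profit at the corner where the cloth and ink constraints are both tight."""
    # Solve x/2 + y/5 = cloth and x + y = ink by substituting y = ink - x.
    x = (cloth - F(ink) / 5) / F(3, 10)
    y = ink - x
    return 15 * x + 10 * y

base = profit_at_corner(35, 100)
print(base)                              # 1250
print(profit_at_corner(36, 100) - base)  # 50/3
print(profit_at_corner(35, 101) - base)  # 20/3
```

One extra sheet of cloth is worth \$ $\frac{50}{3}$ and one extra print is worth \$ $\frac{20}{3}$, matching the final row of the tableau.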

Duality and Dual Problems

Even though we have a systematic way to find our ideal solution to a linear program, we could have quickly found some facts about our solution before we started. Again, here is our previous, 2-variable linear program.

$\max(15x + 10y)$
$\begin{align} \frac{1}{2}x + \frac{1}{5}y & \leq 35 \ \newline x + y & \leq 100 \ \newline y & \leq 70 \end{align}$

In our 1st constraint, we could have equivalently rewritten it as

$30x + 12y \leq 2100$

by multiplying both sides by 60. Furthermore, since both $x$ and $y$ are non-negative, we can also compare it to our objective function.

$15x + 10y \leq 30x + 12y \leq 2100$

So we now have an upper bound on our objective function: we know for sure it can be at most 2100. We can be even smarter about this and do a similar process with our second constraint. By multiplying both sides of the second constraint by 15, we can say

$15x + 10y \leq 15x + 15y \leq 1500$

which gives us an even tighter upper bound on our optimal solution. This is the heart of duality: instead of trying to directly solve our maximization problem, we can turn it into an equivalent minimization problem to find the lowest upper bound of our objective function, indirectly solving it.

We can generalize this by multiplying each of our constraints by its own non-negative scalar $a_i$ (non-negative so the inequalities don't flip direction) and adding them all together.

$\begin{align} \frac{1}{2}a_{1}x + \frac{1}{5}a_{1}y & \leq 35a_1 \ \newline a_{2}x + a_{2}y & \leq 100a_2 \ \newline a_{3}y & \leq 70a_3 \end{align}$

Adding these all together, we get a unifying inequality of

$\color{red}{15x + 10y} \leq (\frac{1}{2}a_{1} + a_{2})x + (\frac{1}{5}a_{1} + a_{2} + a_{3})y \leq 35a_1 + 100a_2 + 70a_3$

Lastly, remember this is supposed to be an upper bound on our objective function, so we set this all greater than our objective function which I've highlighted in red. In our first few tries to bound the problem, we had $(a_1, a_2, a_3)$ equal $(60,0,0)$ and $(0,15,0)$ respectively. We can summarize our goal of minimizing the right-hand side while maintaining all these inequalities as true as

$\min(35a_1 + 100a_2 + 70a_3)$
$\begin{align} \frac{1}{2}a_1 + a_{2} & \geq 15 \ \newline \frac{1}{5}a_1 + a_2 + a_3 & \geq 10 \ \end{align}$

This looks like another linear program! We originally started with a primal problem with 2 variables and 3 constraints and used it to formulate its dual problem with 3 variables and 2 constraints!

But, what even is the purpose of this dual formulation? We already have a way to solve for an optimal solution, so why needlessly complicate it with an extra intermediary step? The dual problem is useful as it allows us to readily access a lot of the sensitivity analysis. Each variable $a_i$ we're solving for in the dual problem corresponds to the optimal shadow price (marginal utility of a resource) of the $i^\textrm{th}$ constraint in the primal problem.

Let's take a quick look back at our primal problem's solution. Recall that it had shadow prices of \$ $\frac{50}{3}$ for the first constraint of $\frac{1}{2}x + \frac{1}{5}y \leq \color{blue}{35}$, a shadow price of \$ $\frac{20}{3}$ for the second constraint of $x + y \leq \color{blue}{100}$, and a final shadow price of $0$ for the third constraint of $y \leq \color{blue}{70}$. If we value our total resources (highlighted in blue) at their marginal costs, we get that

$\color{blue}{35} \cdot \frac{50}{3} + \color{blue}{100} \cdot \frac{20}{3} + \color{blue}{70} \cdot 0 = 1250$

which is precisely the optimal profit we got originally from the primal problem. The leading principle of a dual problem is that if we can solve for the optimal marginal profits of each resource, then we know implicitly how much of that item we should buy given the total selling price of each primal variable.

$\begin{array}{c|c} \min(35a_1 + 100a_2 + 70a_3) & \textrm{Minimize total cost of resources} \\ \newline \frac{1}{2}a_1 + a_{2} \geq 15 & \textrm{Marginal profit for variable $x$ must be at least 15} \\ \newline \frac{1}{5}a_1 + a_2 + a_3 \geq 10 & \textrm{Marginal profit for variable $y$ must be at least 10} \\ \end{array}$

We try to minimize the cost of our primal constraints, while trying to ensure our dual constraints satisfy the coefficients of the primal objective function. In the case of shirt and hat selling, we want our marginal profits to at least equal the price we want to sell our products at. Even more interestingly, just like how our dual variables are equal to the primal shadow prices, the dual shadow prices are equal to the primal variables! Everything has been switched around!
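We can check all of this numerically with the multipliers we've already met. The sketch below (with function names of my own choosing, and with the assumption that the multipliers must be non-negative) tests whether a choice of $(a_1, a_2, a_3)$ satisfies the dual constraints and reports the upper bound it gives.

```python
from fractions import Fraction as F

def dual_feasible(a1, a2, a3):
    """Check the two dual constraints for non-negative multipliers a1, a2, a3."""
    return (min(a1, a2, a3) >= 0
            and F(1, 2) * a1 + a2 >= 15
            and F(1, 5) * a1 + a2 + a3 >= 10)

def dual_objective(a1, a2, a3):
    """The upper bound on profit implied by these multipliers."""
    return 35 * a1 + 100 * a2 + 70 * a3

# Our two early guesses, followed by the shadow prices from the primal tableau.
for a in [(60, 0, 0), (0, 15, 0), (F(50, 3), F(20, 3), 0)]:
    print([str(v) for v in a], dual_feasible(*a), dual_objective(*a))
```

All three choices are feasible, giving bounds of 2100, 1500, and 1250. The last one meets both dual constraints with exact equality and lands precisely on the primal optimum, so no tighter bound is possible.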

Curiosities aside, we still haven't talked about many of the reasons why analyzing dual problems is useful. Here's a rundown of some of the benefits of duality:

Sometimes it's just easier: As you just saw, we turned a problem of 2 variables and 3 constraints into one of 3 variables and 2 constraints, and turned it from a question of maximization to one of minimization. For problems with few variables and lots of constraints, it's usually much easier to turn it into a problem with lots of variables and fewer constraints as every constraint in the problem will add some number of extra vertices for algorithms like the simplex to check, making it less efficient to check every case.
Feasibility and boundedness: Sometimes a linear program will have no solutions whatsoever to check (imagine constraints like $x \geq 1$ and $x \leq 0$; can't be both at the same time). If the primal is unbounded (think no non-negativity constraints), then the dual is infeasible, and likewise if the dual is unbounded, then the primal is infeasible.
Specialized algorithms and theorems: Beyond optimizing business plans, duality has found its way into combinatorics, graph theory, and even into the field of game theory as a means to prove the minimax theorem for zero-sum games. Many more results like these tend to stem from the Weak and Strong Duality Theorems (Weak Duality says that any feasible dual solution sets an upper bound on the primal objective, and Strong Duality says that when an optimal solution exists, the best such bound exactly equals the optimal primal value).

Duality is a powerful concept not just in linear programming; it frequently pops up in other areas of math, and recognizing when you can represent one problem with another is a great tool to have in your back pocket.

Conclusion

Linear programming is only one small niche of optimization study, but the depth and applicability of its simple premises is wildly effective. Starting out in the 40s with Dantzig's original simplex algorithm, it has grown to affect much of computer science and math to systematically solve otherwise impossibly long computations. Even now, variations of the original linear programming formulation are still being researched, with new methods not only to solve them, but also to add new fundamental restrictions to the problem. If you were a car manufacturer that wanted to know the optimal distribution of models to produce, you can't just make 41237.7963 cars; you would only care about integer solutions, and thus integer linear programming (ILP) was born. Linear programming's innate utility in optimization has lent itself kindly to modern applications, but from ILP, to the even crazier mixed-integer linear programming with a combination of integer and non-integer variables, as well as duality, LP has found itself touching every corner of math from combinatorics to graph theory as a simple multidimensional geometric encoding of a constraint function.

Billiard Balls and Secure Squares

Adi Mittal

The ultimate secret to win at pool and laser tag

Today's post is one that's been months in the making. It originally started as one that only covers a single problem, but quickly branched off as I delved deep into dozens of papers and videos, just with more and more questions coming up. We're going to be discussing one of the oldest mixed studies of algebra and geometry: dynamical systems. Today's post is going to be a long one, but should have a fair number of fun visuals to keep it worth the scroll. This is really two, maybe three posts in one, so I recommend reading this with breaks at each header as to make it less overwhelming.

To begin, let's look at a type of problem you might have experienced before, rather than have read formally: billiard problems.

For the best experience, avoid reading this in Safari; most other browsers should work and load the visuals correctly, but Safari breaks rendering a few of them.

Ricochets and Rebounds

If you have ever played pool, or even laser tag, you might already be familiar with billiard problems. If you have a billiard ball you want to land in a pocket, but there are other balls in your way, where on the side of the pool table do you want to bounce your ball to land in the pocket? Alternatively, if you have a laser gun and an opponent by a mirror, where do you want to aim your laser to hit your opponent? To reduce this problem further (and for future reference), if you have a laser, a target, and a wall, where do you want to aim the laser to hit the target?

Can you make the light hit the target? The light automatically follows your cursor, but you can press the 1 key on your keyboard to lock the angle. Try dragging the points for different problem setups.

We covered how to solve this exact problem in a previous post, which also shows how light reflects and bounces (which is what we'll be using today). To recap that post:

Our laser/billiards bounces must follow the Law of Reflection: the angle the laser strikes the mirror is the same angle it reflects.
To solve for where to bounce the light off the wall, we create a "mirror world" to find the reflection point.

The idea is to reflect our target over the wall, draw a straight line from our laser to the reflected target, and the intersection point is the point of reflection. This seems arbitrary, but there's a good reason for it: the angle that our straight line creates in the "mirror world" is equal to the incident angle, and therefore ensures the angle the light bounces off at in the normal world is equal. I recommend following through with the previous post for more on this, but the following demonstration should suffice.

We reflect our target over the wall, and draw a line between the light and that reflection. Can you see why this finds us our desired reflection point?
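If you'd rather see the mirror-world trick as a computation, here's a minimal sketch. It assumes the wall is the x-axis and that the laser and target both sit above it; the function name `reflection_point` is just something made up for illustration.

```python
def reflection_point(laser, target):
    """Where on the wall y = 0 to aim so a shot from `laser` bounces to `target`.

    Assumes both points have y > 0 (same side of the wall).
    """
    (x1, y1), (x2, y2) = laser, target
    mirrored = (x2, -y2)                # the target in the "mirror world"
    # Walk along the straight line laser -> mirrored target until y hits 0.
    t = y1 / (y1 - mirrored[1])
    return (x1 + t * (mirrored[0] - x1), 0.0)


print(reflection_point((0.0, 1.0), (4.0, 3.0)))
```

With the laser at $(0,1)$ and the target at $(4,3)$, the bounce lands at $(1,0)$: the incoming ray heads along $(1,-1)$ and the outgoing ray along $(1,1)$, so both make a 45° angle with the wall.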

Simple enough, but this reflection technique is an invaluable tool for solving these types of problems, so put a pin in that for later. But now, let's look at a problem that throws that very easy technique out of the window.

Alhazen's Circular Mirror

Dynamical systems have been studied since forever at this point. One of the oldest (hard) billiard problems comes from Ptolemy around 150 AD:

If you have a candle and a circular mirror, where on the wall do you have to aim to hit a target?

This problem plagued the Greeks for centuries, primarily since they tried to solve this with their typical ruler-and-compass constructions. Dr. Peter Neumann proved that a general ruler-and-compass construction is impossible, but other solutions and proofs have arisen over the years by some of the most famous mathematicians including Huygens and l'Hôpital. The problem was named after Abu Ali al Hassan ibn al Hassan ibn Alhaitham, later retroactively given the mononym Alhazen, who discussed this in his Book of Optics in the early 11th century.

Now, with the same light and target, can you hit the target? Drag and drop points for different problem setups; if you want to lock the light's direction, hit the 2 key.

First, let's clear some details up.

To make sense of a curved mirror, the reflection acts according to the tangent line at the point of reflection. So, you can think of the curved mirror as made up of infinitely many infinitesimal straight-walled mirrors.
The light is a pure ray like a laser. No point source or beams to get cheap answers like, "If I stand about here, some of the light will reflect on the target." This is a laser beam that can only blind us if it directly hits us.
In a similar manner to the light assumption, the target has been reduced to a 0-dimensional point. The light must hit the target exactly in our solution.
And, importantly, we assume a solution exists. If the light and target are on opposite sides of the circle, clearly no reflection will make it. So, to simplify, we'll just assume that they are in positions with a possible reflection point.

Before I show you my solution to the problem, I suggest you try this problem for yourself. This is one of the few, very simple-to-state problems that has caused me a lot of trouble deciphering a clean solution for it. Mathematicians have developed quite the dictionary to solve this one problem which we'll discuss a bit at the end, but only one way has appeared the most elegant (in the most stretched definition) to me.

Clutch Complex Contours

The way I solved this problem was with the magnificence of complex numbers. Let's center our circular mirror as the unit circle $Ø$ at the origin $O$, and let's call the light and target $\color{red}{a}$ and $\color{green}{b}$ (if your circular mirror is not of radius 1, just scale all coordinates appropriately to make it so). We want to solve for the point $\color{purple}{z}$ on the mirror such that $\angle azO = \angle Ozb$.

The path our light takes can be modelled in two segments: from the candle $a$ to the mirror $z$ as $a-z$, and then from the mirror to the target as $b-z$ (you'll see why we pick these directions later). To make our lives easier, let's rotate our whole setup so that the mirror reflection point occurs at $z=1+0i$ by dividing our whole setup by $z$. So right now, we have two vectors representing our reflected ray of light as $\frac{a-z}{z}$ and $\frac{b-z}{z}$ (if you're not completely familiar with the geometry of complex numbers and why this division tactic works, here's a good introductory video).

Here we have the light at $\color{red}{a}$ and the target at $\color{green}{b}$ at some arbitrary spots, with a theoretical solution at $\color{purple}{z}$. Since we don't want to work with some random tilted axis, we divide everything by $\color{purple}{z}$ so that our setup is centered around the real axis with $\color{purple}{z} = 1$.

Since our vectors are now symmetrical about the real axis, we know that $\arg(\frac{a-z}{z}) = -\arg(\frac{b-z}{z})$ as the angle of reflection must be equal to the angle of incidence (see this previous post for a proof).

Ok, so why is any of this helpful? Just as dividing two complex numbers subtracts their angles, multiplying them adds their angles together. If we multiply $(\frac{a-z}{z})(\frac{b-z}{z})$, their angles sum to 0 since their arguments are opposite! If the argument of the product is 0, that means that it lies on the real axis, and therefore is a real number! We can then extract that

$\large{(\frac{a-z}{z})(\frac{b-z}{z}) \in \mathbb{R} \rightarrow \operatorname{Im}((\frac{a-z}{z})(\frac{b-z}{z})) = 0}$

where $\operatorname{Im}(c)$ denotes the imaginary part of a complex number $c$. Noting that $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$ we can further simplify this equation.

$\begin{align} \operatorname{Im}((\frac{a-z}{z})(\frac{b-z}{z})) & = 0 \ \newline \operatorname{Im}(\frac{ab - (a+b)z + z^2}{z^2} \cdot \frac{\overline{z}^2}{\overline{z}^2}) & = 0 \ \newline \operatorname{Im}((ab)\overline{z}^2 - (a+b)\overline{z} + 1) & = 0 \ \newline \operatorname{Im}((ab)\overline{z}^2) - \operatorname{Im}((a+b)\overline{z}) + \operatorname{Im}(1) & = 0 \end{align}$

$\large{\operatorname{Im}((ab)\overline{z}^2) = \operatorname{Im}((a+b)\overline{z})}$

This might not seem like much, but this describes all possible points $\color{purple}{z}$ our reflection point can lie on! Let $\color{purple}{z} = x + yi$, $\color{red}{a} \color{green}{b} = p + qi$, and $\color{red}{a} + \color{green}{b} = r + si$, and we can rewrite our equation in terms of Cartesian coordinates $(x,y)$.

$\begin{align} \operatorname{Im}((ab)\overline{z}^2) & = \operatorname{Im}((a+b)\overline{z}) \ \newline \operatorname{Im}((p+qi)(x-yi)^2) & = \operatorname{Im}((r+si)(x-yi)) \ \newline \operatorname{Im}(px^2 \color{red}{-2pxyi}-py^2 + \color{red}{qx^2i} + 2qxy \color{red}{- qy^2i}) & = \operatorname{Im}(rx \color{red}{-ryi} + \color{red}{sxi} + sy) \ \end{align}$

$\large{q(x^2 - y^2) - 2pxy = sx - ry}$

This is an equation for a hyperbola! And since we want $\color{purple}{z}$ to lie on the circumference of our circular mirror, the point $(x,y)$ must also satisfy $x^2+y^2 = 1$, the equation of the unit circle. To find where our desired $\color{purple}{z}$ is, we just need to find the intersection between this hyperbola and the unit circle.

Given our previous light and target positions, we get this specific hyperbola which intersects our mirror in 4 locations, giving 4 possible reflection points. How can we compute their coordinates, and find the correct point?

$ \begin{cases} q(x^2 - y^2) - 2pxy = sx - ry \ \newline x^2 + y^2 = 1 \end{cases} $

It's no coincidence that this problem involves finding the intersection between a circle and a hyperbola. While yes, all of our complex number algebra does the job, there's a purely geometrical way of coming to the same conclusion involving isogonal conjugates. I wasn't very familiar with them, so I decided to present the complex number approach instead.

You could try to solve this system of equations, but the nature of the conic sections makes it a pretty tedious and gross task. Fortunately, we can actually reduce this system of 2 simultaneous equations to a single polynomial! Going back to one of our previous equations:

$\operatorname{Im}((ab)\overline{z}^2 - (a+b)\overline{z} + 1) = \operatorname{Im}((ab)\overline{z}^2 - (a+b)\overline{z}) = 0$

The important thing to note is that we have a complex number whose imaginary component is equal to 0. This means that this expression is equal to its conjugate, since there is no imaginary component to flip the sign of: $x + 0i = x - 0i = x$.

$\begin{align} (ab)\overline{z}^2 - (a+b)\overline{z} & = \overline{(ab)\overline{z}^2 - (a+b)\overline{z}} \ \newline (ab)\overline{z}^2 - (a+b)\overline{z} & = (\overline{ab})z^2 - (\overline{a+b})z \end{align}$

Again noting that $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$, we can multiply both sides by $\color{purple}{z}^2$ to get that

$\large{(\overline{ab})z^4 - (\overline{a+b})z^3 + (a+b)z - ab = 0}$

All we need are the complex solutions $\color{purple}{z}$, and since this is a quartic equation, we technically have a closed form solution. Once we have the coordinates of our reflection points, all we do is graph $(\operatorname{Re}(\color{purple}{z}), \operatorname{Im}(\color{purple}{z}))$, and we are done.

Our complex quartic generates 4 possible solutions for $\color{purple}{z}$, all on the unit circle.
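As a rough numerical check (not the closed-form quartic formula), here's a sketch that hunts for reflection candidates directly on the unit circle. It scans the angle $\theta$ of $z = e^{i\theta}$ for sign changes of $\operatorname{Im}((ab)\overline{z}^2 - (a+b)\overline{z})$, refines each one with bisection, and then plugs the result back into the quartic. The function names, the sample count, the small grid offset, and the example positions $a = 2+i$, $b = 2-i$ are all assumptions for illustration. Note that this approach only finds candidates that actually lie on the circle.

```python
import cmath
import math


def reflection_candidates(a, b, samples=720):
    """Points z on the unit circle where Im(ab*conj(z)^2 - (a+b)*conj(z)) = 0."""
    def f(theta):
        zc = cmath.exp(-1j * theta)          # conj(z) for z = e^{i*theta}
        return (a * b * zc * zc - (a + b) * zc).imag

    offset = 1e-3                            # shift the grid off "nice" angles like 0
    step = 2 * math.pi / samples
    found = []
    for k in range(samples):
        lo, hi = offset + k * step, offset + (k + 1) * step
        f_lo = f(lo)
        if f_lo * f(hi) > 0:
            continue                         # no sign change in this slice
        for _ in range(60):                  # bisection refinement
            mid = 0.5 * (lo + hi)
            f_mid = f(mid)
            if f_lo * f_mid <= 0:
                hi = mid
            else:
                lo, f_lo = mid, f_mid
        found.append(cmath.exp(1j * 0.5 * (lo + hi)))
    return found


def quartic(a, b, z):
    """conj(ab) z^4 - conj(a+b) z^3 + (a+b) z - ab, the quartic from the post."""
    ab, s = a * b, a + b
    return ab.conjugate() * z**4 - s.conjugate() * z**3 + s * z - ab


a, b = 2 + 1j, 2 - 1j
for z in reflection_candidates(a, b):
    print(z, abs(quartic(a, b, z)))
```

For this symmetric example the quartic is $5z^4 - 4z^3 + 4z - 5 = (z^2-1)(5z^2-4z+5)$, so the four candidates are $z = \pm 1$ and $z = 0.4 \pm i\sqrt{0.84}$, all of which sit on the unit circle.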

If these points look familiar, they should: they are precisely the points our hyperbola predicted before!

While not ideal to compute, our hyperbola did in fact find the same potential solutions.

Also, notice how we only used information about where the supposed solution $\color{purple}{z}$ lies to find our quartic; we never specified any conditions for where the light $\color{red}{a}$ or target $\color{green}{b}$ had to be! This means we can have our light and target on the inside of our circular mirror and find points that satisfy the Law of Reflection.

A valid bounce inside a circular mirror.

The best part is, all 4 of the possible solutions our quartic and hyperbola find are valid! No need to worry about the laser clipping through the mirror randomly; it all works out.

While we solved the problem, there are a few details to address that are not completely obvious about this approach using complex numbers.

The Core 4

Why are there 4 "solutions" according to our polynomial? Being a quartic equation, 4 complex solutions isn't unexpected, but they don't seem to have any physical significance for our bouncing laser. Obviously, only one looks like it can reflect our points correctly; how can this be considered a viable spot to aim your laser?

A supposed "solution" our quartic generates, despite the fact the laser has to phase through the mirror on its way to the target.

No mirror bounces like that, let alone allow the laser to move straight through it. Moreover, our Law of Reflection looks completely broken, too. What's happening here?

It lies in the direction of our vectors. Watch what happens as I extend the line segment from the target to the "solution".

If we extend the ray from the target to the "solution", we can get a "mirrored target", kind of like what we did for the straight wall case.

If we extend the ray, then it's clear that the Law of Reflection is satisfied, and this is true for the 3 other supposed "solutions": every "bad" reflection point is correct if the rays are extended far enough. So the 4 "solutions" correspond with how our vectors are lined up, since if we change the direction of our light bouncing we can get different points where the Law of Reflection is satisfied.

That still doesn't tell us, though, which point to pick as the "correct" reflection point. If you read the previous post on retroreflectors, then you know light takes the fastest and (only in this case) shortest path. So, we just pick the point where the total distance of the light's path is minimized: $\min(|\color{purple}{z} - \color{red}{a}| + |\color{purple}{z} - \color{green}{b}|)$
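Here's a small sketch of that selection step. The light and target positions ($a = 2+i$, $b = 2-i$) are an example I'm assuming, and the hard-coded candidates are the four roots of the quartic for that setup, which factors as $(z^2-1)(5z^2-4z+5)$. The helper name `pick_reflection_point` is hypothetical.

```python
import math


def pick_reflection_point(a, b, candidates):
    """Return the candidate z with the shortest total path |z - a| + |z - b|."""
    return min(candidates, key=lambda z: abs(z - a) + abs(z - b))


a, b = 2 + 1j, 2 - 1j
candidates = [
    1 + 0j,
    -1 + 0j,
    complex(0.4, math.sqrt(0.84)),
    complex(0.4, -math.sqrt(0.84)),
]
best = pick_reflection_point(a, b, candidates)
print(best, abs(best - a) + abs(best - b))
```

The winner here is $z = 1$, with a total path of $2\sqrt{2}$, which matches what the symmetry of the setup suggests.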

To be (on the circle), or not to be (on the circle)

Some of you might be wondering why this should produce any solutions on the unit circle, let alone 4 at that. That is in part due to the property we have mentioned a few times: $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$. If the geometry of this isn't obvious to you with the rotations and scalings, we can turn this into Cartesian coordinates by setting $\color{purple}{z}=x+yi$, which gives

$\begin{align} z \cdot \overline{z} & = 1 \ \newline (x+yi) \cdot (x-yi) & = 1 \ \newline x^2 - xyi + xyi + y^2 & = 1 \ \end{align}$

$\large{x^2 + y^2 = 1}$

Which is precisely the equation for the unit circle. The property that $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$ forces $z$ to be on the unit circle.

…for the most part. The proof I've highlighted is adapted from this paper. As it shows, if you have the light at $\color{red}{a}=.5+.5i$ and a target at $\color{green}{b}=.5+0i$, two of the supposed solutions for $\color{purple}{z}$ are completely off the unit circle.

When $\color{red}{a}=.5+.5i$ and $\color{green}{b}=.5+0i$, one supposed reflection point is on the inside of the circle, and the other is so far outside of the circle it's offscreen.

I'm not totally sure why this happens, but the previously linked paper proves that at least two of the generated solutions must be on the unit circle.

If you want to look at this problem more, there is also an algebraic solution, here there are ideas involving tangent ellipses in this paper, and there's even an approach discussed in Dorrie's 100 Great Problems of Elementary Mathematics. These would be my recommended starting points. For even more depth, this is a solution involving origami (yes, the paper folding) and here's the same problem in hyperbolic space.

Now that we've seen where billiard problems started, let's see how far they've come with the main focus of today's post.

Get Down Mr. President!

You and an assassin are trapped in a square room. With a single bullet, the assassin wants to do everything he can to take you out without wasting his shot. You, however, came prepared and hired a bodyguard to prevent direct line of sight between you and the assassin. But remember, you're trapped in a room. Without hesitation, the assassin flicks a shot to the side and ricochets off the wall and grazes your arm, avoiding the bodyguard completely. You might have been lucky this time, but who knows what happens next.

If you (target) and an assassin are placed in a square room, can you hire a finite number of bodyguards to prevent any shot from hitting you (including ricochets)?

At first glance, this might seem absurd. There are an infinite number of ways for the assassin to line up and bounce his shot, so how can anything less than an infinite number of bodyguards suffice? As one might anticipate, this wouldn't be a blog post if it didn't have an incredible answer.

Can you hit the target with the assassin's shot despite the bodyguards? Drag the assassin and target points to move them, and press the 3 key to lock the angle.

It's no coincidence I placed the bodyguards where I did in the above widget; not only does a finite number of bodyguards make do, you can prove you only need to hire 16 to ensure 100% protection!

I first found this problem through Tai-Danae Bradley's video and post, where she writes up the proof very well on her own. It was this problem that inspired me to look for other problems to extend this post, and moreover it was a fun programming challenge. Here, I want to outline the proof with the key insights Bradley utilizes, as well as pose a few other questions of my own.

Just as we did before, let's clarify some problem details:

Just like before, the assassin's bullet is a pure ray, and the target has been reduced to a 0-dimensional point.
Now, though, we have bodyguards, which are 0-dimensional points as well; if they are going to protect the target, they have to fully take the hit.
Lastly, just to make it clear, this is a perfectly square room and the bullet ricochets at exact angles off the walls, so the assassin's shot can bounce forever if needed to hit its mark.

The Tiling Torus

Let's look at an easy case: what if the assassin can only reflect off the left wall? Since this is a square room with straight walls, we can use our reflection trick from before. Now, though, we have to analyze the target's position in space relative to the room.

By reflecting the room, we create a "mirrored world" to track our reflection.

I've color-coded the left, right, top, and bottom walls to be yellow, gray, magenta, and cyan respectively. The reason is that if we want to track the target's location within the room even after the reflection, we have to reflect the room itself too, creating the actual "mirror world" that I referred to earlier.

Ok, so that isn't too different than what we've already been doing, so what's the point? The magic lies in modelling multiple bounces. Remember, we reflected over the yellow wall to say we wanted our bounce to be off that wall, but we can chain these reflections to give multiple instructions to our bounces. If we first reflect over the yellow wall and then the magenta wall, we create a doubly mirrored world with our straight line showing the path of what 2 bounces looks like.

Reflecting over a second wall gives us another straight-line intersection to find our first and therefore second reflection point too.

If you want to convince yourself this trick works for multiple bounces, I recommend finding the congruent angles within the mirrored world's straight line and the actual bounces within the room in question.

An important part of finding this mirrored world's straight line though is the fact that it intersects the colored walls in the order the assassin's bullet bounces. In the initial setup provided, the straight line hits the yellow wall first before intersecting the magenta wall, just as the beam's bounce path reflects off the yellow then magenta walls in that order. If you move the assassin and target, you can see this idea holds for a magenta then yellow wall bounce too.

So, if we wanted to model the assassin's hitting the target in more bounces, we just reflect our room more times and draw the straight line between the assassin and mirrored target.

Even with many more reflections of the room, our bullet still bounces off the colored walls in the order our straight line intersects them. Try dragging the assassin, target, and mirrored target to see how the paths change. Note these are only paths that result in the assassin successfully hitting the target.

Moreover, since squares can tile the plane and maintain the same "silhouette" under reflection, we can infinitely tile the plane with reflected copies of our room. Since a line through this plane can represent any bounce shot from the assassin in the original room, we have successfully simplified our problem setup. Why? With straight lines, we can now use coordinate geometry to place our bodyguards and not worry about annoying reflected light patterns within our square.
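To make the tiling concrete, here's a minimal sketch of how a point in the tiled plane maps back into the original room. I'm assuming the room is the unit square $[0,1] \times [0,1]$, so the pattern of rooms repeats every 2 units in each direction; `fold` and `fold_point` are names I've made up.

```python
def fold(c):
    """Fold one coordinate of the tiled plane back into the room [0, 1]."""
    c = c % 2.0                       # a room plus its mirror image spans 2 units
    return 2.0 - c if c > 1.0 else c  # the second unit is the mirrored copy


def fold_point(point):
    """Map a point of the tiled plane to where it really is inside the room."""
    return (fold(point[0]), fold(point[1]))


# A bullet that has travelled in a straight line to (1.3, -0.3) in the tiled
# plane has bounced once off the right wall and once off the bottom wall.
print(fold_point((1.3, -0.3)))
```

Folding every point of a straight line in the plane this way should trace out the zig-zag path of the bullet inside the room.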

In more math-y terms, we have turned our original room into what is known as a flat torus (yes, the thing that's equal to a coffee cup). Essentially, all this means is that our problem sort of exists in a world similar to the game of Asteroids: as you exit the top or left of one flat torus, you enter through the bottom or right of another one (and vice versa). This fact is what allows us to tile the plane consistently with our problem setup. If you look back at the 2-by-2 grid setup from before modelling the 2-bounce paths, you'll see that our top/bottom edges are both cyan and our left/right edges are both gray, showing that exact relationship we'd expect in a flat torus.

Connecting opposite edges of a square turns it into the equivalent of a torus.

This is the first key insight to solving this problem: turning our bouncing shots into straight lines in an infinitely tiled plane. Working with straight lines makes life so much easier than bent ones. With that, we can move on to the second epiphany to prove our result.

The (Different) Core 4

Even though we tile the plane infinitely, there aren't infinitely many types of rooms. Just looking at the 2-by-2 grid that makes our flat torus shows us everything we need to know: exactly 4 types of rooms build our tiling, namely the original one, the one reflected over the yellow wall, the one reflected over the magenta wall, and the one reflected over both walls. This regularity is clear visually: watch what happens if I reflect the target into every mirrored room.

Having 4 "unique" rooms generates 4 unique lattices of mirrored targets.

Each one of the reflected rooms generates a lattice of that reflected target! I've colored the 4 different lattices in green, yellow, magenta, and cyan. Now, since each dot represents a way of hitting our target, we just need to block every line from the assassin to any one of these colored dots.

This is the second critical idea to finish out this proof: every reflected target falls into 1 of 4 possible lattices (each represented by a color). Dividing the mirrored targets into lattices is nice since the dots within a given lattice are evenly spaced, each a fixed distance from its neighbors.
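Here's a sketch of how those 4 lattices could be generated. It assumes the room is the unit square $[0,1] \times [0,1]$, so every mirrored target has coordinates $(2m \pm t_x, 2n \pm t_y)$ for integers $m, n$, and the choice of the two signs picks the lattice. The function names and the `reach` parameter are hypothetical.

```python
def fold(c):
    """Fold one coordinate of the tiled plane back into the room [0, 1]."""
    c = c % 2.0
    return 2.0 - c if c > 1.0 else c


def mirrored_targets(target, reach=2):
    """Group mirrored copies of `target` into 4 lattices keyed by (sign_x, sign_y)."""
    tx, ty = target
    lattices = {}
    for sx in (1, -1):
        for sy in (1, -1):
            lattices[(sx, sy)] = [
                (2 * m + sx * tx, 2 * n + sy * ty)
                for m in range(-reach, reach + 1)
                for n in range(-reach, reach + 1)
            ]
    return lattices


lattices = mirrored_targets((0.25, 0.5))
for signs, points in lattices.items():
    print(signs, points[:3])
```

Within each lattice, neighboring dots are exactly 2 units apart horizontally and vertically, and every dot folds back onto the real target.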

At this point, you have everything you need to finish this proof using the flat torus tiling and the 4 lattices. If you want to try to finish it yourself, I recommend doing so, as it has some pretty satisfying reasoning throughout. If you just want to keep reading, I'd recommend visiting Bradley's post, where she completes the proof.

Once you reach the end of the proof, you'll find that you need exactly 4 bodyguards to protect any given lattice, and since there are 4 lattices, we need $4 \cdot 4 = 16$ bodyguards total to completely protect the target.
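Spoiler warning if you're still working through the proof yourself. One standard way to place the guards is at the midpoints of the straight lines from the assassin to each mirrored target, folded back into the room. Below is a sketch under that assumption, again taking the room to be the unit square; the names `fold`, `key`, and `guards` are made up for illustration. Since the midpoint of the line to $(2m \pm t_x, 2n \pm t_y)$ only depends on whether $m$ and $n$ are even or odd once it's folded, each lattice needs just 4 guards.

```python
from itertools import product


def fold(c):
    """Fold one coordinate of the tiled plane back into the room [0, 1]."""
    c = c % 2.0
    return 2.0 - c if c > 1.0 else c


def key(x, y):
    """Folded position, rounded so floating point noise doesn't split guards."""
    return (round(fold(x), 9), round(fold(y), 9))


def guards(assassin, target):
    """Folded midpoints between the assassin and every mirrored target."""
    ax, ay = assassin
    tx, ty = target
    spots = set()
    # sx, sy choose the lattice; m, n only matter through their parity.
    for sx, sy, m, n in product((1, -1), (1, -1), (0, 1), (0, 1)):
        spots.add(key((ax + sx * tx) / 2 + m, (ay + sy * ty) / 2 + n))
    return spots


A, T = (0.2, 0.3), (0.7, 0.6)
g = guards(A, T)
print(len(g), sorted(g))
```

For generic positions this produces 16 distinct spots; for special positions some of them can coincide, so fewer guards may do.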

No matter where the assassin shoots, the target remains safe and sound. Try moving either of them around, and watch the bodyguards adapt and reduce the assassin's efforts to nil.

Since it is possible to protect the target from the assassin with a finite number of bodyguards, we can say that the square is a secure polygon.

What Next?

This is one of the most surprising facts I've come across in a long time. But, there are more places to take these billiard problems and dynamical systems. One of the first extensions I thought of upon seeing this was other grids. As we know, there are also hexagonal and triangular grids in addition to the square one we analyzed today.

Examples of triangular (left) and hexagonal (right) grids. Are they secure polygons?

The regular triangle and hexagon seemed like the natural progression since they play nicely with reflections and tiling the plane, so they seemed like good candidates to explore next. But anyone who's dabbled in linear algebra and transformations knows that reflections are only commutative if the axes are perpendicular. That's why squares worked well for tiling the plane: no matter what order you reflected the target over the walls in, you get exactly one unique "mirrored target" per reflected room. For triangles and hexagons, though, you don't, and that's an issue (superficially, at least; if anything, annoying).
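If you want to see the commutativity issue in numbers, here's a quick sketch using 2-by-2 reflection matrices for mirror lines through the origin. The helper names are my own, and I'm only comparing two reflections at a time.

```python
import math


def reflection(phi):
    """Matrix reflecting across the line through the origin at angle phi."""
    c, s = math.cos(2 * phi), math.sin(2 * phi)
    return [[c, s], [s, -c]]


def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def commute(phi1, phi2, tol=1e-12):
    """True if reflecting across phi1 then phi2 equals the reverse order."""
    A, B = reflection(phi1), reflection(phi2)
    AB, BA = matmul(A, B), matmul(B, A)
    return all(abs(AB[i][j] - BA[i][j]) < tol
               for i in range(2) for j in range(2))


print(commute(0.0, math.pi / 2))  # perpendicular walls, like a square's
print(commute(0.0, math.pi / 3))  # walls 60 degrees apart, like a triangle's
```

Composing two reflections gives a rotation by twice the angle between the mirrors, and swapping the order reverses the rotation, so the two orders only agree when that rotation is a half turn (or the mirrors coincide).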

Not to mention, every other regular polygon doesn't tile the plane, so modelling their bouncing paths will be even more difficult. What about non-regular polygons? Or concave ones? This is definitely something I'll revisit in the future and try to find the conditions for a polygon to be secure, but until then, we have just scraped the surface of billiard problems and dynamical systems.

A few other related problems are worth considering. While we only looked at rays for light sources, others have considered other types of light sources. In an Illumination Problem, we consider point sources (i.e. a light source that produces light in every direction instead of one). Actually, we've already looked at one type of Illumination Problem: the secure square! It can be rephrased as the following: if a light bulb is placed in a mirrored room, is it possible to place a finite number of pillars such that a given spot is never illuminated? It's identical to our secure polygon question. One of my favorite Illumination Problems is the Art Gallery Problem, which is not only a readily applicable problem, but also has a wonderfully elegant proof that Steve Fisk conjured (it speaks volumes how nice this proof is for it to be in Martin Aigner's Proofs from THE BOOK). Here's a great, in-depth paper I found discussing this famous problem along with some proofs and extensions as well.

Even outside of classic billiard problems, even just knowing of the simple reflection technique to model bounces is invaluable. Grant Sanderson of 3blue1brown fame used bouncing light as an analogy to solve a kinematics problem and bring in circles almost magically. I've said it before, and I'll say it again: duality and different perspectives are some of the most powerful problem solving tools you can have. This small reflection technique, or the complex number algebra with Alhazen's problem, might not mean much to you now, but it's another tool to stow away in your back pocket. Despite only seeing this ability to turn dynamics into geometry, I've seen them enough to know that these techniques and ideas are more than just an intriguing fact. You'll never know when you might be able to use such a tool, but when you do, who knows the new worlds that a new paradigm can unlock for you.

If you're interested in learning a bit behind today's graphics and widgets, see the follow up I wrote up detailing some seemingly innocuous math with some high-budget applications and cool patterns.

Circle Computations and Raytracing Remedies

Adi Mittal

Basically, Pixar should hire me

Last post, we looked at different types of billiard problems, a class of math problems analyzing how light bounces with different setups of mirrors. Notably, we saw how straight lines make for very simple, easy to compute mirrors, while others, like circular ones, can be incredibly frustrating.

A large portion of last post's content, though, was made up of interactive graphics. While I went over much of the math that goes into solving these types of problems, we skipped over a large part of the math that goes into simulating them. Math is very nice in that many problems can be solved with nothing more than a pen, paper, and your mind, but oftentimes, that's only helpful if you are confident in how to approach the problem. What computers can do is help build our intuition to solve a problem by calculating, drawing, and modelling scenarios with precision and speed we can only wish to achieve.

So, today, we'll look at some of the clever math that goes into computer graphics (that we'll later extend), and to introduce such a topic, we'll look at a simple, fundamental problem in graphics: how do you find the intersection between a line and a circle?

Languishing Lines and Confounding Circles

Before we can even attempt this problem, we're going to have to start from scratch, since we have one slight issue: a computer has no idea what a line or a circle is! So before we can do anything, let's teach our computer how to draw a line.

Perfect Parameterizations

At its core, computer graphics is displaying a set of pixels with certain colors. If we want to visualize anything on a computer screen, we just need to find all the relevant pixels (coordinates) to light up and color. Because we want to compute these individual coordinates of, say, a line or circle very quickly and easily, almost always we will use vectors. These can be typical column or row vectors you see in linear algebra, or it can even take the form of complex numbers. The reason why these tend to be helpful is that they give very easy ways to compute coordinates for lines, circles, and other shapes.

If we want to draw a line with slope, say 2, we need to ensure that it is constructed by a vector of slope 2. An easy one to find is the vector $v=\small{\begin{bmatrix} 1 \\ 2 \end{bmatrix}}$ since we know that will pass through the point $(1,2)$. So, to get other points beyond this vector, we can scale $v$ by a factor of $t$ to get other vectors (i.e. points) with the same slope. If $t=2$, we get the point $(2,4)$. If $t=1.5$, we get the point $(1.5,3)$. If $t=239470$, we get the point $(239470,478940)$. Whatever you choose $t$ to be, our vector $v$ will give us a point on the line $y=2x$.

However, this isn't super helpful, since we are still only restricted to lines that go through the origin at $(0,0)$. So, we can add a starting point $\color{red}{p}$ to our vector equation to offset the line by $\color{red}{p}$, guaranteeing our line goes through the point $\color{red}{p}$ (since that's the coordinate generated by $t=0$).

$\large{l = \color{red}{p} + tv}$

Now we just plot every point for $t \in (-\infty, \infty)$, and we get a line with $v$ dictating the slope of our line (negative $t$ values give us coordinates behind $\color{red}{p}$)!

Our parametric line $l$ going through point $\color{red}{p}$. Drag the point to adjust its position.
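In code, the parameterization $l = p + tv$ is about as short as it gets. This is just a sketch with plain tuples standing in for vectors; `point_on_line` is a name I've made up.

```python
def point_on_line(p, v, t):
    """Point reached by starting at p and moving t copies of the vector v."""
    return (p[0] + t * v[0], p[1] + t * v[1])


v = (1, 2)                                   # direction with slope 2
print(point_on_line((0, 0), v, 1.5))         # a point on y = 2x
print(point_on_line((3, -1), v, 0))          # t = 0 gives back the start p
print([point_on_line((3, -1), v, t) for t in (-1, 0, 1, 2)])
```

To actually draw the line, you'd sample a range of $t$ values like in the last call and light up the pixels nearest to each point.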

We can do a similar process for a circle. To parameterize a circle, we'll have to pull from trigonometry. We know that a circle is defined by $x^2 + y^2 = r^2$. The Pythagorean identity tells us that $\cos^2(\theta) + \sin^2(\theta) = 1$, so we can quickly make the connection that $x=r\cos(\theta)$ and $y=r\sin(\theta)$ (which the geometry justifies). This precisely defines $x$ and $y$ in terms of the parameter $\theta$! Again, though, this is centered at the origin, so we can center the circle around a point $\color{blue}{q}$ by adding it to our parameterization.

$\large{c = \color{blue}{q} + r\begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix}}$

where $r$ is some real number for the radius of the circle, and $\theta \in [0, 2\pi)$. We can now easily draw both lines and circles!

Now we also have a circle centered at $\color{blue}{q}$ too. Drag the center point to change its position, and the radial point its radius.
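The circle works the same way, with $\theta$ playing the role of the parameter. Here's a minimal sketch that samples evenly spaced angles; the function name and the default of 360 samples are arbitrary choices of mine.

```python
import math


def circle_points(q, r, n=360):
    """Sample n points of the circle with center q and radius r."""
    points = []
    for k in range(n):
        theta = 2 * math.pi * k / n          # theta sweeps [0, 2*pi)
        points.append((q[0] + r * math.cos(theta),
                       q[1] + r * math.sin(theta)))
    return points


pts = circle_points((2.0, 1.0), 3.0)
print(pts[0], pts[90])
```

Every sampled point sits a distance $r$ from the center $q$ (up to floating point error), which is a quick sanity check that the parameterization really does describe the circle.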

Collisions and Intersections

Now that we have defined our line and circle for our computer to interpret, we can start thinking about how to detect collisions between a line and a circle.

Discerning Distances

A good place to start is by looking at how far away the line $l$ is from the center of the circle $\color{blue}{q}$. For reference, the distance from a point to a line is the shortest (i.e. perpendicular) distance from the point to the line. If $l$ is more than a distance of $r$ away from $\color{blue}{q}$, then we know that it's outside the circle and doesn't intersect, and if $l$ is less than a distance $r$ away from $\color{blue}{q}$, then we know it's inside the circle and does intersect.

$l_1$ is a distance less than $r$ away from the center, and clearly intersects the circle. $l_2$ is a distance greater than $r$ away, and clearly does not intersect the circle. $l_3$ is exactly a distance $r$ away, making it tangent to the circle (1 intersection point instead of 2).

Let's look at an individual line and see if we can draw any useful conclusions about this distance.

From a given point $\color{red}{p}$ on our line $l$, we can find a new vector between $\color{red}{p}$ and the circle's center $\color{blue}{q}$ as $\overrightarrow{\color{blue}{q} - \color{red}{p}}$. This will form some angle $\theta$ with $l$, more specifically its vector $v$. Recalling that $\color{green}{d}$ is the perpendicular distance between $\color{blue}{q}$ and $l$, we have a right triangle that gives us that $\color{green}{d} = |\overrightarrow{\color{blue}{q} - \color{red}{p}}| \sin \theta$.

If you're familiar with your linear algebra, this almost looks like the formula for the magnitude of the cross product: $|v \times u| = |v||u|\sin \theta$. So, writing our two relevant vectors and rearranging we can see that…

$|\overrightarrow{\color{blue}{q} - \color{red}{p}}||v| \sin \theta = |\overrightarrow{\color{blue}{q} - \color{red}{p}} \times v|$
$|\overrightarrow{\color{blue}{q} - \color{red}{p}}| \sin \theta = |\overrightarrow{\color{blue}{q} - \color{red}{p}} \times \frac{v}{|v|}|$

So all we need to do to see if our line intersects our circle is check whether the magnitude of that cross product (taken with the unit vector $\frac{v}{|v|}$) is less than or equal to the radius of our circle (if you're concerned about the dimensionality of our vectors, since cross products only exist in dimensions 3 and 7, we can treat them as 3D vectors with z-component 0, which makes the calculation easier and equivalent to a $2 \times 2$ determinant).

If it isn't totally apparent why this is true, it has to do with the geometric interpretation of the cross product: we're finding the area of the parallelogram that the two vectors span, and since the area of a parallelogram is $A=\textrm{base}\cdot\textrm{height}$, we're essentially finding the height of that parallelogram by dividing by its base.
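Here is a rough sketch of that check in Python, assuming 2D points and vectors stored as `(x, y)` tuples and a nonzero direction $v$. The name `line_hits_circle` is my own, and the cross product is reduced to its z-component (the 2D determinant):

```python
import math

def line_hits_circle(p, v, q, r):
    """True if the infinite line p + t*v passes within distance r of q."""
    wx, wy = q[0] - p[0], q[1] - p[1]           # the vector q - p
    cross_z = wx * v[1] - wy * v[0]             # z-component of (q - p) x v
    d = abs(cross_z) / math.hypot(v[0], v[1])   # divide by |v| to get the height
    return d <= r
```

With floating-point inputs, the exactly-tangent case ($d = r$) can land on either side of the comparison, so a small tolerance may be worth adding in practice.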

Using the closest distance between the circle and line, we can successfully identify when the line intersects our circle.

We have a working condition! Using the cross product, we can identify line-circle intersections with a single line of computation. However, this simple solution does have its limitations. Mainly, this is only a boolean condition; this method only tells us whether or not an intersection occurs, but nothing else. We don't know where on the line it intersects, nor how many times. Sometimes this doesn't really matter, like when you want to approximate lines intersecting points (since then you can treat points as small circles). But for more complex tasks and graphics like raytracing, this won't cut it.

Fancy Vector Operations

If we have a point $x$ on our circle, then the distance between $x$ and the center of the circle $\color{blue}{q}$ should be equal to the radius $r$. As an equation, the magnitude of the vector from $x$ to $\color{blue}{q}$ equals $r$.

$|x - \color{blue}{q}| = r$

Moreover, we want this point $x$ on our circle to also be on our line $l$. So, $x = \color{red}{p} + tv$ for some value of $t$. With this in mind, we can substitute $x$ in our previous equation.

$|\color{red}{p} + tv - \color{blue}{q}| = r$

Now, let's square both sides.

$|\color{red}{p} + tv - \color{blue}{q}|^2 = r^2$

This may seem pointless, but it helps us rewrite that left side of the equation. Generally, working with the magnitude of a vector as an operator isn't super helpful, but we can quickly rewrite the square of the magnitude in terms of the dot product, since for any vector $v \cdot v = |v|^2$.

$(\color{red}{p} + tv - \color{blue}{q}) \cdot (\color{red}{p} + tv - \color{blue}{q}) = r^2$

Expanding this out and collecting like terms gives us…

$(\color{red}{p} + tv - \color{blue}{q}) \cdot (\color{red}{p} + tv - \color{blue}{q}) = r^2$
$t^2(v \cdot v) + 2t(v \cdot (\color{red}{p} - \color{blue}{q})) + (\color{red}{p} - \color{blue}{q}) \cdot (\color{red}{p} - \color{blue}{q}) - r^2 = 0$

Which is just a quadratic equation in $t$! With coefficients…

$\begin{align} a & = v \cdot v \ \newline b & = 2(v \cdot (\color{red}{p} - \color{blue}{q})) \ \newline c & = (\color{red}{p} - \color{blue}{q}) \cdot (\color{red}{p} - \color{blue}{q}) - r^2 \end{align}$

…we can solve for $t$ using our trusted quadratic formula (note that $a$, $b$, and $c$ are all outputs of dot products, ensuring they are valid scalars to plug in).

$\large{t = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}}$

Remember, $t$ is the scalar that tells us where on our line we are, so if there are real solutions to $t$, then we will have the exact intersection points for our line and the circle!

Our quadratic formula now not only tells us when the line intersects the circle, but also where they intersect.

We can analyze this quadratic like any other to give us insight into our intersection points. Specifically, using the discriminant. When $b^2 - 4ac > 0$, then we get two solutions/intersection points. If $b^2 - 4ac < 0$, then we get no real solutions and therefore no intersection points. Finally, if $b^2 - 4ac = 0$, then we have exactly one intersection point, and can conclude our line is tangent to the circle.
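To make the three discriminant cases concrete, here is a small sketch (not the code behind the demo above; `dot` and `line_circle_intersections` are names I made up). It builds $a$, $b$, and $c$ from dot products exactly as derived, so it works for tuples of any dimension:

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def line_circle_intersections(p, v, q, r):
    """Points where the line p + t*v meets the circle/sphere (q, r)."""
    w = [pi - qi for pi, qi in zip(p, q)]   # p - q
    a = dot(v, v)
    b = 2 * dot(v, w)
    c = dot(w, w) - r * r
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                           # no real solutions: a miss
    root = math.sqrt(disc)
    # A set collapses the tangent case (root == 0) into a single value of t.
    ts = sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})
    return [tuple(pi + t * vi for pi, vi in zip(p, v)) for t in ts]
```

The returned list has zero, one, or two points, matching the sign of the discriminant. As with any floating-point test, an exactly-zero discriminant is rare outside of hand-picked inputs.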

Also, this quadratic can straight up replace our closest-distance method from before, since the point at which our line is closest to the circle's center corresponds to the vertex of the parabola at $t=\frac{-b}{2a}$.

Not to mention, notice how everything we did here was independent of the fact our line and circle exist in two dimensions; we can easily use this for 3D graphics, and even higher dimensions as well to find the intersections between lines and hyperspheres! Below is a raytraced scene I drew of 3 balls using this exact quadratic to compute lighting with shadows and reflections (a.k.a. my formal application to Pixar).

This raytraced scene is just thousands of uses of the quadratic formula.

And to think that we'd never use the quadratic formula in real life.

I Don't Know Where Else to Put This

Before I end off this post, I want to include some other interesting circle facts since I don't know where else to put them.

Squaring the Circle (Bounces)

If you have a ray of light start from the circumference of the circle, after a total of $n$ reflections within the circle, the sum of all the angles of reflection will be $n^2$ times the initial angle.

Between this and the Basel problem, circles and squares are just weirdly intertwined. The reason this particular statement is true is because of how much the angle with the horizontal increases after a single bounce. If your light starts at an angle $\alpha$, we can show that every additional bounce will add $2\alpha$ to the angle with respect to the horizontal.

With the help of some auxiliary lines, I hope the above picture makes this clear. Then, by the symmetry of the circle, we can see that each subsequent bounce will also add $2\alpha$ to the angle. Moreover, since our initial angle itself is $\alpha$, the angles at each bounce will just be the odd multiples of $\alpha$ (since odd numbers can be thought of as a multiple of 2 plus 1, which is precisely what our angle bounces mimic)! So, for a series of $n$ bounces, the sum of the angles of each reflection is equal to

$\begin{align} \alpha + 3\alpha + 5\alpha + 7\alpha + … + (2n-1)\alpha & = \ \newline \alpha(1+3+5+7+…+(2n-1))& = \ \newline \alpha(n\cdot 1 + 2(0+1+2+3+…+(n-1)))& = \ \newline \alpha(n + 2\cdot\frac{(n-1)(n)}{2}) & = \ \newline \alpha(n + n^2 - n) & = \alpha n^2 \end{align}$

(Yes, I am aware there is a formula for an arithmetic sequence with any initial term, but this is how I remember to solve them, okay?) I didn't know how to fit this fact into the last post alongside its mention of circular mirrors, but here seems like a good spot for it.
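If you'd rather let a computer add up the odd multiples, here's a tiny sketch (the function name `total_bounce_angle` is my own) that sums the angles directly so they can be compared against $\alpha n^2$:

```python
def total_bounce_angle(alpha, n):
    """Sum of the angles alpha, 3*alpha, ..., (2n - 1)*alpha over n bounces."""
    return sum((2 * k - 1) * alpha for k in range(1, n + 1))
```

Using a whole number of degrees for `alpha` keeps the arithmetic exact, so `total_bounce_angle(alpha, n) == alpha * n**2` can be tested with plain equality.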

Perpendicular Parabolas

The intersection points of two orthogonal parabolas lie on a common circle.

To show this is true, we just need to crank out the algebra. To find our intersection points, we need to solve the system of equations

$\begin{cases} (x - \color{red}{x_1})^2 = y - \color{red}{y_1} \ \newline (y - \color{blue}{y_2})^2 = x - \color{blue}{x_2} \end{cases}$

If these individual equations are true for our intersection points, then so is their sum.

$(x - \color{red}{x_1})^2 + (y - \color{blue}{y_2})^2 = y - \color{red}{y_1} + x - \color{blue}{x_2}$
$x^2 - x(2\color{red}{x_1}) + \color{red}{x_1}^2 + y^2 - y(2\color{blue}{y_2}) + \color{blue}{y_2}^2 = y - \color{red}{y_1} + x - \color{blue}{x_2}$
$x^2 - x(2\color{red}{x_1} + 1) + y^2 - y(2\color{blue}{y_2} + 1) = -\color{red}{y_1} - \color{red}{x_1}^2 - \color{blue}{x_2} - \color{blue}{y_2}^2$
$(x - (\color{red}{x_1} + \frac{1}{2}))^2 - (\color{red}{x_1} + \frac{1}{2})^2 + (y - (\color{blue}{y_2} + \frac{1}{2}))^2 - (\color{blue}{y_2} + \frac{1}{2})^2 = -\color{red}{y_1} - \color{red}{x_1}^2 - \color{blue}{x_2} - \color{blue}{y_2}^2$
$(x - (\color{red}{x_1} + \frac{1}{2}))^2 + (y - (\color{blue}{y_2} + \frac{1}{2}))^2 = (\color{red}{x_1} + \frac{1}{2})^2 + (\color{blue}{y_2} + \frac{1}{2})^2 -\color{red}{y_1} - \color{red}{x_1}^2 - \color{blue}{x_2} - \color{blue}{y_2}^2$

While that last line may seem a bit unruly, note that $\color{red}{x_1}$, $\color{red}{y_1}$, $\color{blue}{x_2}$, and $\color{blue}{y_2}$ are all constants, so the right-hand side of that last equation can be summarized as one big constant.

$(x - (\color{red}{x_1} + \frac{1}{2}))^2 + (y - (\color{blue}{y_2} + \frac{1}{2}))^2 = C$

That's precisely the equation of a circle with a center at $(\color{red}{x_1} + \frac{1}{2}, \color{blue}{y_2} + \frac{1}{2})$ and radius $\sqrt{C}$, and that's exactly what is plotted above.
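As a sanity check, here's a sketch with one hand-picked pair of parabolas, $y = x^2 - 2$ and $x = y^2 - 2$. These values are my own example, not the ones in the plot, and the four intersection points were worked out by hand from $x = y$ and $x + y = -1$:

```python
import math

# Hypothetical example: vertices of y = x^2 - 2 and x = y^2 - 2
x1, y1 = 0.0, -2.0
x2, y2 = -2.0, 0.0

s5 = math.sqrt(5)
points = [
    (2.0, 2.0),
    (-1.0, -1.0),
    ((-1 + s5) / 2, (-1 - s5) / 2),
    ((-1 - s5) / 2, (-1 + s5) / 2),
]

# Center and constant C read straight off the final equation above
center = (x1 + 0.5, y2 + 0.5)
C = (x1 + 0.5) ** 2 + (y2 + 0.5) ** 2 - y1 - x1 ** 2 - x2 - y2 ** 2

def on_both_parabolas(x, y, tol=1e-9):
    first = abs((x - x1) ** 2 - (y - y1)) < tol
    second = abs((y - y2) ** 2 - (x - x2)) < tol
    return first and second

def squared_distance_to_center(x, y):
    return (x - center[0]) ** 2 + (y - center[1]) ** 2
```

For this pair, $C$ works out to $4.5$, and each of the four points should sit at squared distance $C$ from the center $(\frac{1}{2}, \frac{1}{2})$.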

I have a few more circle tidbits to share, but they have more to expand on in their own posts for another day.

Until then, hopefully you found this slight detour into the world of graphics interesting. There is (as you could imagine) a lot more to graphics I want to share, from image homography, to video textures, to even a more in-depth look into raytracing and rasterization, but we'll save those for later.

Incredible Integrals and Tactful Techniques

Adi Mittal

Flexible? Maybe. Flex-able. Definitely

This will be a slightly more theoretical, conceptual post than the others, but these tricks have that mathematical and problem-solving elegance that is too good not to share. We'll go over a shortcut for some certain integration by parts problems, and one that allows us to make educated guesses of antiderivatives to find an answer.

First, let's take a look at the integration by parts shortcut.

IBP, but You Forgot +C

This is by far my new favorite trick to pull out of my back pocket whenever I can. It leverages the products built into integration by parts and the nature of antiderivatives. First, though, I'm sure some of us could use a refresher on integration by parts.

To Undo the Product Rule

When given the product of two functions $f(x)g(x)$, the standard formula to compute its derivative is

$\frac{\mathrm{d}}{\mathrm{d}x} \ f(x)g(x) = f'(x)g(x) + f(x)g'(x)$

Now if you integrate both sides and rearrange a little bit we can conclude that

$\int \frac{\mathrm{d}}{\mathrm{d}x} \ f(x)g(x) \ dx = \int f'(x)g(x) \ dx + \int f(x)g'(x) \ dx$
$\int f(x)g'(x) \ dx = f(x)g(x) - \int f'(x)g(x) \ dx$

Let $u = f(x)$ and $v = g(x)$ (so that $du = f'(x) \ dx$ and $dv = g'(x) \ dx$) to get that

$\boxed{ \int u \ dv = uv - \int v \ du \hspace{.2cm} }$

…which is precisely the integration by parts formula we've come to know. It really just is the opposite of the product rule. The main reason why it's such a useful technique is because if you have a function that's really hard or you don't know how to integrate, you can use it as your function $u$ and express its integral purely in terms of its derivative. Let's try this with an example:

$\int \ln x \ dx$

This doesn't seem like a particularly product-y integral that can leverage integration by parts, but it really is!

$\int \ln x \ dx = \int 1 \cdot \ln x \ dx$

Since we don't want to deal with integrating $\ln x$ (besides, that's what we're trying to find anyway), we can set $u = \ln x$ and $dv = 1 \ dx$. Then working it through we get that

$\int 1 \cdot \ln x \ dx = x \ln x - \int x \cdot \frac{1}{x} \ dx = x \ln x - x + C$

That's pretty neat! We were able to reduce a relatively hard integral into one that was much simpler by thinking of it in terms of what product of functions when differentiated would include our original integral.

However, this doesn't always work by itself, and this is where our integral trick comes into play.

Now, Remember Your +C

Let's try a very similar integral from before:

$\int \ln(x+1) \ dx$

This doesn't look too bad, right? It's basically the same as before. Let's try the same choice of $u = \ln(x+1)$ and $dv = 1 \ dx$ again and see where it takes us.

$\int 1 \cdot \ln(x+1) \ dx = x \ln(x+1) - \int x \cdot \frac{1}{x+1} \ dx$

Aaand there's our problem. Our supposed simplified integral of $\int v \ du$ ended up with something also annoying: $\int x \cdot \frac{1}{x+1} \ dx$. This isn't too bad if you're okay with polynomial division (with this one being relatively easy, too), but it isn't necessarily trivial. Since we want to avoid doing more work, we can do much better by realizing an overlooked aspect of integration by parts.

We rewrote our original integral in the form of $\int u \ dv$, and later found an antiderivative $v$ from that differential. In our case, we let $dv = 1 \ dx$ and deduced that $v = x$ by undoing the power rule of differentiation. This isn't wrong, per se, but it is incomplete. The antiderivative of $1\ dx$ isn't $x$, but $x \ \mathbf{+ \ C}$. Since, remember, the derivative of any constant goes to 0, we can add whatever constant we want to the end of our antiderivative and it'll still remain valid.

So how can this help us? Well, let's do the same integral with the same choice of $u$ and $dv$, but instead of letting $v = x$, let's make $v = x+1$.

$\int 1 \cdot \ln(x+1) \ dx = (x+1) \ln(x+1) - \int (x+1) \cdot \frac{1}{x+1} \ dx$

Look at that! That last, previously annoying integral, has become much simpler! Instead of getting two polynomials dividing each other, our new choice of $v$ reduces it to $\int (x+1) \cdot \frac{1}{x+1} \ dx = \int 1 \ dx = x + C$. So, finally, we can conclude that

$\int \ln(x+1) \ dx = (x+1) \ln(x+1) - x + C$

In fact, for any choice of a constant $\alpha$, we can see that

$\int \ln(x+\alpha) \ dx = (x+\alpha) \ln(x+\alpha) - x + C$
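If you'd like to double-check that general formula, differentiating the right-hand side should hand back the integrand. A quick sanity check using the product rule (assuming $x + \alpha > 0$ so the logarithm is defined):

$\frac{\mathrm{d}}{\mathrm{d}x} \left[ (x+\alpha) \ln(x+\alpha) - x + C \right] = \ln(x+\alpha) + (x+\alpha) \cdot \frac{1}{x+\alpha} - 1 = \ln(x+\alpha)$

The $+1$ from the product rule cancels the $-1$ from differentiating $-x$, which is exactly the cancellation that picking $v = x + \alpha$ set up.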

It's such a simple trick, but an important reminder to remember the basics and fundamentals when attempting a problem. Besides, $+C$ being more than a formality is at least a little bit satisfying.

Polynomial Predictions

The last integral trick is a bit niche: it relies on that second integral when doing IBP being a quotient of two polynomials of matching degrees. This next shortcut, though, uses this idea of polynomial degree a bit more cleverly, but does not always work. When it does, though, it's certainly satisfying.

Let's find the following antiderivative:

$\int \large{\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}}}$ $dx$

This doesn't look particularly friendly, but we can make some observations about this function. The numerator of our function is a cubic, or a polynomial of degree 3. Similarly, our denominator is the square root of a quadratic, or a polynomial of degree 2. In the very loose sense of "degree" we can say that asymptotically, the denominator is closer to a degree 1 or linear polynomial (yes, I know that $\sqrt{x^2} = |x| \neq x$, but just play along for now). So, if we were to carry out all of the polynomial long division, we'd expect our original function to behave like a degree $3 - \frac{2}{2} = 2$ polynomial.

Ok, so what? With basic integration, antidifferentiating a polynomial increases its degree by one. This is just the power rule.

$\int x^{\color{red}{n}} \ dx = \frac{1}{n+1}x^{\color{red}{n+1}}$

This fact implies that if our polynomial is loosely of "degree" 2, then integrating it should give us a function of degree $2+1=3$. So, let's make a guess at what our integral might look like.

$\int \large{\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}}}$ $dx = (ax^2 + bx + c)\sqrt{x^2+3}$

This guess should look somewhat reasonable, since we have a quadratic multiplied by the square root of another quadratic, which we loosely said was degree 1. And a polynomial of degree 2 multiplied by a polynomial of degree 1 gives us a polynomial of degree 3, which we wanted. However, you might wonder why we even wrote this as a product; why not just write this out as a pure cubic of $ax^3 + bx^2 + cx + d$? The main reason is that we expect the chain rule to show up in some form. When composing functions, the derivative (and therefore the integral) tends to include the structure of these compositions, so it's not unreasonable to make a guess with the denominator in the result.

Now here's the trick: let's differentiate both sides.

$\large{\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}}}$ $= (2ax + b)\sqrt{x^2+3} + (ax^2 + bx + c) \cdot \large{\frac{x}{\sqrt{x^2 + 3}}}$

If we simplify this expression and expand the right side…

$\large{\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}}}$ $= (2ax + b)\sqrt{x^2+3} + (ax^2 + bx + c) \cdot \large{\frac{x}{\sqrt{x^2 + 3}}}$
$x^3 + 2x^2 + 3 = (2ax + b)(x^2+3) + (ax^2 + bx + c) \cdot x$
$x^3 + 2x^2 + 3 = 3ax^3 + 2bx^2 + (6a + c)x + 3b$

For this last equation to hold, we need the coefficients to match.

$\color{red}{x^3} + \color{blue}{2x^2} + \color{green}{0x} + \color{purple}{3} = \color{red}{3ax^3} + \color{blue}{2bx^2} + \color{green}{(6a + c)x} + \color{purple}{3b}$

Therefore,

$\begin{align} a & = \frac{1}{3} \ \newline b & = 1 \ \newline c & = -2 \end{align}$

Putting it all together, we can go back to our original guess of the antiderivative and find that

$\int \large{\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}}}$ $dx = \boxed{\ (\frac{1}{3}x^2 + x - 2)\sqrt{x^2+3} + C \ }$

There we go! A successful antiderivative found.
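Since this whole method started from a guess, it doesn't hurt to check it numerically. Here's a minimal sketch (all names are my own) that compares a central-difference derivative of our boxed answer against the original integrand:

```python
import math

def integrand(x):
    return (x**3 + 2 * x**2 + 3) / math.sqrt(x**2 + 3)

def antiderivative(x):
    # The boxed result with a = 1/3, b = 1, c = -2 (dropping the +C)
    return (x**2 / 3 + x - 2) * math.sqrt(x**2 + 3)

def central_difference(f, x, h=1e-5):
    """Numerical estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

For any sample point `x`, `central_difference(antiderivative, x)` and `integrand(x)` should agree to several decimal places; this is only a numerical spot check, not a proof.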

This trick is a great first attempt at integrands like this one, but it is also extremely sensitive to minute changes in the integrand. For example, if we change our integral to

$\int \large{\frac{x^3 + 2x^2 + \color{red}{3}}{\sqrt{x^2 + 3}}}$ $dx \rightarrow \int \large{\frac{x^3 + 2x^2 + \color{red}{2}}{\sqrt{x^2 + 3}}}$ $dx$

our algorithm breaks. It's the cost associated with what makes this algorithm so convenient: we don't touch the numerator of the integrand at all. Our antiderivative guess only depended on the denominator, and as a result, the coefficients we tried to match at the end had no intrinsic tie to the numerator and thus polynomial we were matching.
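To see the breakage concretely, we can rerun the coefficient matching with the new numerator. The right-hand side is unchanged, since it only ever depended on our guess:

$x^3 + 2x^2 + 0x + 2 = 3ax^3 + 2bx^2 + (6a + c)x + 3b$

$2b = 2 \implies b = 1 \qquad \textrm{but} \qquad 3b = 2 \implies b = \frac{2}{3}$

The $x^2$ terms and the constant terms now demand two different values of $b$, so no choice of $a$, $b$, and $c$ makes this guess work.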

This integral shortcut's convenience is definitely a double-edged sword, but the method behind making these educated guesses is a useful idea in its own right to take away. For more on this type of integration, I recommend reading up on the Risch algorithm, a standard in computing indefinite integrals. Here's also a very thorough synopsis on evaluating integrals on the Wolfram Blog.

Feynman's At It Again

This last integration stratagem comes from none other than the celebrated Richard Feynman of physics fame, and thus has been aptly coined as Feynman's Integral Trick.

…as it has been popularized. What I'm about to show has historically been known as the Leibniz Integral Rule, or differentiation under the integral sign. Not quite the same ring to it, but nonetheless good to know for accuracy.

Let's try the following integral:

$\int_{0}^{1} \large{\frac{x^2 - 1}{\ln x}}$ $dx$

What we're about to do might seem insane, but it will be immensely helpful in a second. What we're going to do is generalize this integral. Let

$f(\color{red}{t}) = \int_{0}^{1} \large{\frac{x^\color{red}{t} - 1}{\ln x}}$ $dx$

We've replaced the exponent of 2 with a $t$. So, in our new, generalized problem, we want to find $f(2)$. A useful fact to note is that we also know some values of $f(t)$. For example, we know that $f(0)=0$. How does this help? Well, now we can do what the name of this trick alludes to (or, more so, outright says): we'll differentiate under the integral sign. Let's take the derivative of $f(t)$ with respect to $t$.

$\large{\frac{\partial f}{\partial t}}$ $= \large{\frac{\partial}{\partial t}}$ $\int_{0}^{1} \large{\frac{x^{t} - 1}{\ln x}}$ $dx = \int_{0}^{1} \large{\frac{\partial}{\partial t} \frac{x^{t} - 1}{\ln x}}$ $dx = \int_{0}^{1} x^t \ dx$

That last integral is super easy, only reversing the power rule to calculate.

$\large{\frac{\partial f}{\partial t}}$ $= \int_{0}^{1} x^t \ dx = \large{\frac{x^{t+1}}{t+1}}$ $|_{0}^{1} = \large{\frac{1}{t+1}}$

Now that we know the derivative of $f(t)$, we can now integrate this simpler function in terms of known values and use the Fundamental Theorem of Calculus to find $f(2)$.

$\int_{0}^{2} \partial f = f(2) - f(0) = f(2) = \int_{0}^{2} \large{\frac{1}{t+1}}$ $\partial t = \ln3 - \ln 1 = \boxed{\ln 3 }$

Note how I used the Fundamental Theorem of Calculus with clever bounds for our integral. You could instead solve the differential equation generally, but the FTC skips a few steps. So, after all of that, we can conclude

$\int_{0}^{1} \large{\frac{x^2 - 1}{\ln x}}$ $dx = \ln 3$

As counterintuitive as it may seem, solving a general problem can sometimes actually be easier than solving its individual cases. The best part about this technique, though, is that we haven't just solved one integral, but a whole family of integrals. For any exponent $\alpha > -1$ (so that the integral converges), we can conclude that

$\int_{0}^{1} \large{\frac{x^\alpha - 1}{\ln x}}$ $dx = \ln (\alpha + 1)$
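A result this clean deserves a numerical spot check. Below is a rough sketch using a plain midpoint rule, which conveniently never evaluates the integrand at $x = 0$ or $x = 1$, where $\frac{x^\alpha - 1}{\ln x}$ is not directly computable. The names `midpoint_integral` and `family` are my own:

```python
import math

def midpoint_integral(f, lo, hi, n=20000):
    """Midpoint-rule estimate of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

def family(alpha):
    """Numerical estimate of the integral of (x^alpha - 1) / ln(x) over [0, 1]."""
    return midpoint_integral(lambda x: (x**alpha - 1) / math.log(x), 0.0, 1.0)
```

Comparing `family(alpha)` against `math.log(alpha + 1)` for a few values of `alpha` should show agreement to a few decimal places; the midpoint rule is only an approximation, so don't expect an exact match.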

You might have noticed something different with this integral compared to our previous approaches: this applies to definite integrals as opposed to indefinite integrals (or antiderivatives). Namely, because of the fact we have to integrate not once but twice in this method. So, at the following step,

$\large{\frac{\partial f}{\partial t}}$ $= \int_{0}^{1} x^t \ dx = \large{\frac{x^{t+1}}{t+1}}$ $|_{0}^{1} = \large{\frac{1}{t+1}}$

if this was not a definite integral, we would end up with a $+C$ attached to the end of it that we would not be able to solve for.

Here's another application of Feynman's trick:

$\int_{0}^{1} \large{\frac{\ln (x+1)}{x^2 + 1}}$ $dx$

Knowing Feynman's trick wins you the battle, but knowing how to use it wins you the war. Many times, you have to be creative in your choice of parameter when wanting to differentiate under the integral sign, so don't be discouraged if it doesn't work the first time. For this particular integral, we'll want to consider

$f(t) = \int_{0}^{1} \large{\frac{\ln (tx+1)}{x^2 + 1}}$ $dx$

Now, we want to find $f(1)$, and we know that $f(0)=0$. Now let's differentiate both sides with respect to $t$.

$\large{\frac{\partial f}{\partial t}}$ $= \int_{0}^{1} \large{\frac{\partial}{\partial t} \frac{\ln (tx+1)}{x^2 + 1}}$ $dx = \int_{0}^{1} \large{\frac{x}{(tx+1)(x^2 + 1)}}$ $dx$

Decomposing that last integral into its partial fractions yields

$\large{\frac{\partial f}{\partial t}}$ $= \int_{0}^{1} \large{\frac{x}{(tx+1)(x^2 + 1)}}$ $dx = \large{\frac{1}{t^2 + 1}}$ $\int_{0}^{1} \large{\frac{-t}{tx+1}}$ $+ \large{\frac{x}{x^2+1}}$ $+ \large{\frac{t}{x^2 + 1}}$ $dx$

Now, with more elementary calculus, we can evaluate that integral.

$\int_{0}^{1} \large{\frac{-t}{tx+1}}$ $+ \large{\frac{x}{x^2+1}}$ $+ \large{\frac{t}{x^2 + 1}}$ $dx = -\ln(tx+1) + \frac{1}{2}\ln(x^2+1) + t\tan^{-1}(x) |_{0}^{1}$

$\large{\frac{\partial f}{\partial t}}$ $=\large{\frac{-4\ln(t+1) + 2\ln(2) + t\pi}{4(t^2 + 1)}}$

Now, we want $f(1)$, and know that $f(0)=0$, so let's integrate this function of $t$ from 0 to 1.

$f(1) = \int_{0}^{1} \large{\frac{-4\ln(t+1) + 2\ln(2) + t\pi}{4(t^2 + 1)}}$ $\partial t = \int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $\partial t + \int_{0}^{1} \large{\frac{-4\ln(t+1)}{4(t^2 + 1)}}$ $\partial t$

$f(1) = \int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $\partial t - \int_{0}^{1} \large{\frac{\ln(t+1)}{t^2 + 1}}$ $\partial t = \int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $\partial t - f(1)$

$f(1) = \large{\frac{1}{2}}$ $\int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $\partial t = \large{\frac{\pi \ln 2}{8}}$

Again, almost magically, by generalizing a hard integral, it became a much easier one to tackle.

$\int_{0}^{1} \large{\frac{\ln (x+1)}{x^2 + 1}}$ $dx = \large{\frac{\pi \ln 2}{8}}$

To give you an idea how powerful this technique is, the above integral comes from the 2005 Putnam Exam. Not only does it come from one of the most difficult math tests, it's also the 5th problem of the first set of problems (with problem 1 being the "easiest" and 6 being near impossible). And, in only a few lines, Feynman and Leibniz had it beat.

Conclusion

These are the three most recent integration techniques I have picked up and tucked away in my problem solving toolbox, but if you're interested in more advanced integral shortcuts and tricks, take a look at this MathStackExchange post I came across while doing this write-up. There are some genuinely mesmerizing ideas showcased there that just are out of the scope of my ability to explain, so do browse the forum if you're interested.

The Calculus of Variations

Adi Mittal

The end of an era

It's been a little while since I last posted. As part of my calculus class, we end the year with an exploration into a calculus-related topic that we present to the rest of the class. My partner and I chose to explore the origins behind the brachistochrone: the curve of fastest descent for a rolling ball. Below is the related write-up I did as part of this project, which I thought might make for a good post. It helped refine my LaTeX skills, and I think it is a good introduction to the field and motivations of the calculus of variations. Now that summer is right around the corner, hopefully I'll be able to write more in the coming months. For now, I hope this will do.

Quake III's Smartest Quasi-Square Root

Adi Mittal

See? Video games ARE useful

A couple months back, we covered a little bit about some random circle computations and facts I had collected over the months leading into that post. In it, we highlighted and rederived the basic raytracing equation for circles and spheres. In a few words, we used properties of vectors to be able to reduce the problem of where a line intersects a sphere into a quadratic equation. And with all quadratics, we were then able to use the quadratic formula to quickly find those points of intersection. In that post, I go on to say that I was able to generate this incredible raytraced scene in a mere matter of tens of minutes in even as simple a programming language as Python:

This raytraced scene is just millions of uses of the quadratic formula.

But I have to admit, I sort of lied to you. While, yes, that image does use the quadratic formula millions of times, it doesn't only do that. To render shadows and reflections, the scene also had to compute lighting and the physics you'd expect with mirror-like objects. Without any of this, our scene would just look like, well, uh, this:

Now this is peak graphical performance. In a word: art.

Whichever one you think is better looking is in the eye of the beholder, but what can't be argued is that the second image is much cheaper to render; as I'm sure you could guess, dropping shadows and reflections lets the scene render in a fifth of the time. A fifth.

Quizzing Quotients

Intuitively, more stuff to compute should take a computer a longer amount of time to go through, but can we pinpoint this bottleneck? Let's quickly look at what it takes to compute some of these reflections. When light bounces off, say, a mirror, these calculations become much easier when we use the mirror or surface's normal vector: the vector perpendicular to the surface (or the point at the surface) in question.

How to reflect a ray over a normal vector.

The above formula for reflecting a ray works in general for reflecting any ray $\vec{R}$ over another vector $\vec{N}$ (even if they're not normal)… Under a small assumption: the vector $\vec{N}$ is normalized (yes, the naming scheme isn't ideal), or of unit length (denoted by a little hat $\hat{N}$). We can do this by just scaling the vector down by its own length:

$$\hat{N} = \large{ \frac{\vec{N}}{\lVert N \rVert} }$$

Recalling that the length of a 3D vector is $\lVert N \rVert = \lVert \langle x, y, z \rangle \rVert = \sqrt{x^2 + y^2 + z^2}$, we end up with

$$\hat{N} = \large{ \frac{\vec{N}}{\sqrt{x^2 + y^2 + z^2}} }$$
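Written out as code, the normalization step looks something like the sketch below. The function name `normalize` is my own, and it assumes a nonzero vector given as a tuple of components:

```python
import math

def normalize(n):
    """Scale the vector n down to unit length."""
    length = math.sqrt(sum(c * c for c in n))   # the square root...
    return tuple(c / length for c in n)         # ...followed by a division per component
```

Every normal vector in the scene goes through one square root and a handful of divisions, which is worth keeping in mind for what comes next.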

And here lies our bottleneck. While we, as humans, treat division not too differently from multiplication in theory, computers can't work with "just in theory"; computers have to actually compute this arithmetic somehow. It turns out, while multiplication is a bit more complicated than addition, we've had algorithms for decades that accelerate the computation. Division, on the other hand, has been so difficult to bring up to the speed of other operations that major companies like Intel have lots of research dedicated to this alone.

So, what do we do?

The Fast Inverse Square Root

Under pressure, people can do some amazing things. You can imagine that if someone was making anything that required lots of lighting calculations, say, a video game, then calculating $\frac{1}{\sqrt{x}}$ millions of times (and therefore also computing millions of divisions) won't really cut it.

The developers of the video game Quake III, an incredibly fast-paced shooter that definitely needed these optimizations, used a now infamous algorithm aptly called the fast inverse square root, because, well, it computes the inverse square root $\frac{1}{\sqrt{x}}$ fast, and avoids the dreaded division. The history of the algorithm has been found to predate the game that made it so infamous, but pop culture assigns value to whatever it latches onto first. Without further ado, the original source code (along with all the original comments and annotations) for Quake III was released in 2005, and the program is right there for us to learn from:

float Q_rsqrt( float number )
{
      long i;
      float x2, y;
      const float threehalfs = 1.5F;

      x2 = number * 0.5F;
      y  = number;
      i  = * ( long * ) &y;                     // evil floating point bit level hacking
      i  = 0x5f3759df - ( i >> 1 );             // what the fuck? 
      y  = * ( float * ) &i;
      y  = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
//    y  = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

      return y;
}

It's not a long algorithm by any means, but I think the comments themselves explain just how crazy this is; even the developers who USED it are impressed, but let's break it down line by line.

      long i;
      float x2, y;
      const float threehalfs = 1.5F;

Here we define four different numbers: i, x2, y, and threehalfs. But not all of these numbers are treated the same.
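If you want to poke at the algorithm yourself before we dig into how these numbers are stored, here's a rough Python port. This is my own sketch, not part of the Quake III source: it assumes the input is a positive number that fits in a 32-bit float, uses the `struct` module to stand in for the pointer casts, and does the final refinement step in Python's 64-bit floats, so the last few digits can differ from the C version.

```python
import struct

def q_rsqrt(number):
    """Approximate 1 / sqrt(number) for a positive float, mimicking Q_rsqrt."""
    x2 = number * 0.5
    # Reinterpret the 32-bit float's bits as an unsigned integer
    # (the "* ( long * ) &y" line in the C code).
    i = struct.unpack("<I", struct.pack("<f", number))[0]
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret those integer bits as a 32-bit float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # The "1st iteration" refinement line from the original.
    y = y * (1.5 - x2 * y * y)
    return y
```

Calling `q_rsqrt(4.0)` should land close to, but not exactly on, `0.5`; the result is an approximation by design.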

Binary and Floating Point

In our day-to-day routine, we (at least, most of us) use base 10, decimal, to represent our numbers. What this means is that each digit in a number corresponds to some power of 10 we add together. For example, the number 1409, can be grouped as

$ \begin{array}{c|c|c|c} 1 & 4 & 0 & 9 \\ \hline 10^3 & 10^2 & 10^1 & 10^0 \\ \end{array} $

with 1 thousand, 4 hundreds, 0 tens, and 9 ones. You add these all together to get $1(10^3) + 4(10^2) + 0(10^1) + 9(10^0) = 1409$. This may seem obvious, but this is a really important idea in how we write numbers. Each digit represents a condensed shorthand for how many of a specific power of 10 are in our number. Computers do it similarly, but instead of base 10, they use base 2, or binary. If we wanted to represent 1409 in binary, we'd have

$ \begin{array}{c|c|c|c|c|c|c|c|c|c|c} 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \hline 2^{10} & 2^9 & 2^8 & 2^7 & 2^6 & 2^5 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \end{array} $

Now if you went and added these together you could verify that $2^{10} + 2^{8} + 2^{7} + 2^{0} = 1409$. Now each digit—or decimal-igit—represents a power of 2. We call these binary-igits bits. That first line of code that defines a long i means we define a number with 32 bits that looks like

00000000 00000000 00000000 00000000

With 32 bits, we can write any number from 0 to $2^{32} - 1 = 4294967295$. Let's notice some nice properties with this format. In decimal, if we wanted to add a 0 to the end of our number like $1409 \rightarrow 14090$, this is the same as multiplying our number by 10, because now every digit has moved up into one bucket higher than before.

$ \begin{array}{c|c|c|c|c} 1 & 4 & 0 & 9 & 0\\ \hline 10^4 & 10^3 & 10^2 & 10^1 & 10^0 \\ \end{array} $

In the same way, we can remove zeroes on the right $14090 \rightarrow 1409$ by dividing by 10 since every digit will be shifted into one power lower. Binary works the same. If we want to add a 0 to the right end of our number, we now multiply our number by 2, but if we wanted to remove a 0, we divide by 2. This is known as bit shifting, and serves as one of the nice workarounds of division: if you want to divide specifically by a power of 2, just bit shift the number in binary by however many zeroes you need.
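To make this concrete, here is a minimal C sketch of the two shifts. The helper names shift_left and shift_right are made up for this post, and the sketch uses an unsigned 32-bit integer rather than a long to keep the behavior simple.

```c
#include <stdint.h>

/* Appending n zero bits on the right multiplies by 2^n,
   as long as no set bits fall off the left end. n must be less than 32. */
uint32_t shift_left(uint32_t value, unsigned n) {
    return value << n;
}

/* Dropping the n rightmost bits divides by 2^n and throws away
   any remainder. n must be less than 32. */
uint32_t shift_right(uint32_t value, unsigned n) {
    return value >> n;
}
```

For example, shift_left(1409, 1) gives 2818 and shift_right(2818, 1) gives back 1409. Note that shift_right(1409, 1) gives 704: the trailing 1 simply falls off the end.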

But that leads to an issue: only even numbers can be wholly divided by 2, so what do we do if we want to divide an odd number? How would we write a decimal like .5? How would we write any rational number? We currently can only represent 32-bit integers, since we have no way of writing fractional parts. If we wanted to add decimals, why don't we just throw in a decimal point then?

00000000 00000000 . 00000000 00000000

(Remember, this decimal point is just for our convenience; the computer doesn't actually see anything here but the 32 bits.) If we use the left 16 bits for the integer part and the right 16 bits for the fractional part, we can now in fact write rational numbers. The downside is that the range of numbers we can represent has shrunk immensely, topping out at about 65535.99998; the upside is that we can now represent fractions in steps as small as $2^{-16}$. This is okay, but we can do better.
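As a quick worked example of this layout (5.75 is just a number picked for illustration), we can split it into powers of 2 on both sides of the point:

$5.75 = 4 + 1 + \frac{1}{2} + \frac{1}{4} = 2^2 + 2^0 + 2^{-1} + 2^{-2} = 101.11_2$

which would be stored as

00000000 00000101 . 11000000 00000000

The largest value this format can hold has every bit set to 1, which is $(2^{16} - 1) + \frac{2^{16} - 1}{2^{16}} \approx 65535.99998$.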

If you have ever taken a chemistry or physics class, you're probably all too familiar with scientific notation. We can write really big numbers and really small numbers in a much more condensed way by turning everything into a product of a power of ten:

$\begin{align} 4230987 & \rightarrow 4.230987 \cdot 10^6 \ \newline 0.000154 & \rightarrow 1.54 \cdot 10^{-4} \end{align}$

This is how our calculators can do arithmetic beyond just the digits shown on the screen. By slapping on a power of 10, we can now represent a much wider range of values using the same number of digits as before. Binary works just the same, except with powers of 2.

$\begin{align} 11010111 & \rightarrow 1.1010111 \cdot 2^7 \ \newline 0.00101 & \rightarrow 1.01 \cdot 2^{-3} \end{align}$

So, let's do just that. Let's allocate 8 bits of our previous 32 to representing the exponent, and another 23 to our actual number.

00000000 0.0000000000000000000000

With 8 bits for the exponent, we can represent anything from 0 to 255, but we also want negative exponents, so we just subtract 127 to get the new range of exponents from -127 to 128. With our fractional number and its 1 whole bit and 22 fractional bits, we can represent numbers from 0 to 1.999999761581. We call this part of the number the mantissa. However, this is actually not the full extent of the potential precision we can get. In all of our examples of scientific notation, there was always a non-zero number before the decimal point, since if there was a leading 0, that's another power of our exponent we could factor out. In binary, there are only two values: 0 and 1. If we know the first digit is non-zero, then we know it has to be a 1! So we can actually shift our decimal point over and gain an extra bit.

00000000 .00000000000000000000000

Now all we have to do is affix a leading 1 and we're good to go. So, the number

11001010 .01110100000000000000000

would represent $1\color{blue}{.453125} \cdot \color{green}{2^{202-127}}$. In general, if we're given a 23 bit number representing the mantissa and an 8 bit number representing the exponent, we can write our number we're expressing as:

$\large{(1 + \color{blue}{\frac{M}{2^{23}}}) \cdot \color{green}{2^{E-127}}}$

We divide our mantissa by $2^{23}$ so that it is only the fractional part of our number like we want. However, doesn't part of this just feel wrong? Like, when we were defining an integer as a 32 bit long, we used every bit to denote a new power of 2. Here, we really are writing two different binary numbers side-by-side. If we wrote our number as

.01110100000000000000000 11001010

would it really make a difference? So, just for the sake of consistency, we'll put the exponent first, and we can then represent our number's bit representation as a sum:

$\begin{align} \color{green}{00000000} 00000000000000000000000 & \ \newline \color{blue}{00000000000000000000000} & \ \newline \hline \color{green}{00000000}\color{blue}{00000000000000000000000} \end{align}$

We just added 23 filler zeroes to our exponent to make sure it landed where we wanted it to in the final bit representation. That sounds like a bit shift! We can thus multiply our exponent by $2^{23}$ to give us our 23 extra zeroes. So, this final sum (our exponent and mantissa together) can be written as

$\large{ \color{green}{E}\cdot 2^{23} + \color{blue}{M} }$

Now, there are some flaws that do need to be addressed. If we assume our leading bit is non-zero, how _do_ we represent 0? That actually doesn't matter in the grand scheme of our intended use in lighting (we only call the fast inverse square root when we actually need it); everywhere else, 0 is a single edge case that can be handled separately later. Yeah, I'll admit, it's not ideal, but it gives us more precision where we need it. The other issue is we haven't used all of our 32 bits! $8+23=31$, so where did our last bit go? As we've set things up, we can only represent positive numbers. We can attach an additional sign bit at the front, where if it's a 0, we say the number is positive, and if it's a 1, the number is negative.

0 11001010 .01110100000000000000000

However, we want to know how to find positive square roots, not enter the complex world, so we always just assume it's positive. So if we don't use the bit, why don't we repurpose it? Conventions. The standard for binary fractional-part representation and arithmetic is known as IEEE 754, and for that reason we just have to abide by it.

I've been using terms like "decimal points" and "fractional parts", but a decimal point seems wrong when we're working in binary. The type of number we've just formatted is called a floating point number, or a float, as we see in lines 2 and 3 of the declarations. While floats give us nice ways to represent a lot of numbers, they are a bit annoying compared to longs, in the sense that we can't bit shift or otherwise manipulate a float directly, since its bits represent multiple, different parts of the number in question, namely the exponent and the mantissa.
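If you'd like to poke at these fields yourself, here is a minimal C sketch. The names FloatFields, float_bits, split_float, and join_float are made up for this post, and it assumes float is a 32-bit IEEE 754 number, which holds on essentially every machine you're likely to try it on. It copies bits with memcpy rather than the pointer cast from the original code.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a 32-bit IEEE 754 float. */
typedef struct {
    uint32_t sign;      /* 1 bit: 0 is positive, 1 is negative */
    uint32_t exponent;  /* 8 bits: E, stored with the +127 offset */
    uint32_t mantissa;  /* 23 bits: M, the fraction after the implied leading 1 */
} FloatFields;

/* Copy the raw bits of a float into a 32-bit integer, unchanged. */
uint32_t float_bits(float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Pull the sign, exponent, and mantissa out of the bit pattern. */
FloatFields split_float(float value) {
    uint32_t bits = float_bits(value);
    FloatFields fields;
    fields.sign = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFFu;
    fields.mantissa = bits & 0x7FFFFFu;
    return fields;
}

/* Build the bit pattern E * 2^23 + M (sign bit left at 0)
   and reinterpret it as a float. */
float join_float(uint32_t exponent, uint32_t mantissa) {
    uint32_t bits = exponent * (1u << 23) + mantissa;
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For instance, split_float(1.0f) gives an exponent of 127 and a mantissa of 0, since $1 = (1 + 0) \cdot 2^{127-127}$, and join_float(127, 0x400000) gives 1.5f because a mantissa of $2^{22}$ is a fractional part of one half.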

Bit Approximations… With Themselves?

This may seem like a gross, unnecessary dive into how computers understand numbers, but understanding what binary, bits, and floats are will help us greatly in understanding the ingenuity behind the fast inverse square root. To recap, we've found that we can represent a binary number as a float with two parts, an exponent and a mantissa, as if we were using scientific notation. To find the actual number our float represents, we use the formula

$\large{(1 + \color{blue}{\frac{M}{2^{23}}}) \cdot \color{green}{2^{E-127}}}$

Since we're working with two different binary numbers together, we combine them into one sequence of bits, which serves as a shorthand for our float, using the formula

$\large{ \color{green}{E}\cdot 2^{23} + \color{blue}{M} }$

We can now perform some mathematical magic. Let's take the logarithm of the actual number of our float (note by $\log(x)$, we assume it to be $\log_2(x)$ since we're working in binary).

$\large{ \log((1 + \color{blue}{\frac{M}{2^{23}}}) \cdot \color{green}{2^{E-127}}) = \log(1 + \color{blue}{\frac{M}{2^{23}}}) + \color{green}{E-127}}$

This may not seem that useful, but there's an important detail here: we're looking to optimize a program, not get exact results. So, a useful fact to note is that for $x$ between 0 and 1, $\log_2(1+x)\approx x$.

We can simplify $\log_2(1+x)$ by approximating it as $x$.

We can get an even better approximation by slightly offsetting our estimate: $\log_2(1+x)\approx x + \delta$ is a better approximation than $\log_2(1+x)\approx x$.

We can approximate $\log_2(1+x)$ better with small shift.

It turns out the best value is $\delta \approx 0.0430357$ (best in the sense of minimizing the worst-case error over the interval). By definition, the fractional part $\frac{M}{2^{23}}$ of our mantissa is between 0 and 1, so we can use this approximation ourselves.

$\large{ \log(1 + \color{blue}{\frac{M}{2^{23}}}) + \color{green}{E-127} \approx \color{blue}{\frac{M}{2^{23}}} + \delta + \color{green}{E-127}}$

If we rearrange this a bit,

$\large{\color{blue}{\frac{M}{2^{23}}} + \delta + \color{green}{E-127} = \color{blue}{\frac{1}{2^{23}}}(\color{green}{E} \cdot 2^{23} + \color{blue}{M}) + \delta \color{green}{- 127} }$

Okay, why did we do any of this? This definitely is kinda random to not only take the $\log$ of our float, but also do all these approximations to then get rid of that $\log$ too? Why?

Look inside the parentheses in the above equation.

$\large{\color{blue}{\frac{1}{2^{23}}}(\boxed{ \color{green}{E} \cdot 2^{23} + \color{blue}{M} }) + \delta \color{green}{- 127} }$

That's precisely the bit representation of our float! So, in a way, the $\log$ of our number is equal to the bit representation of our float, up to some scaling and shifting.

$\large{\log(\textrm{number}) \approx C(\texttt{bits}) + K}$
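We can sanity check this relationship with a small C sketch. The function name approx_log2_from_bits is made up here; it takes $C = \frac{1}{2^{23}}$ and $K = \delta - 127$ from the derivation above, copies the bits out with memcpy (my own stand-in for the pointer cast in the original code), and only makes sense for positive, normal floats.

```c
#include <stdint.h>
#include <string.h>

/* The offset delta quoted in the text. */
#define LOG_DELTA 0.0430357

/* Estimate log2(value) straight from the float's bit pattern:
   log2(value) ~ bits / 2^23 - 127 + delta. */
double approx_log2_from_bits(float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return (double)bits / 8388608.0 - 127.0 + LOG_DELTA;  /* 8388608 = 2^23 */
}
```

For 8.0f this returns 3.0430357 against a true $\log_2(8) = 3$, and for 3.0f it returns about 1.543 against a true value of about 1.585. Not exact, but close, and no logarithm was ever actually computed.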

With this under our belt, we can finally start looking at the steps of the fast inverse square root algorithm.

Evil Floating Point Bit Level Hacking

First, we assign our number we want to find the inverse square root of into a float (a.k.a. scientific notation-type decimal number).

      y  = number;

Now, recall that a float isn't that compatible with bit shifting or that many operations, so here's the first clever part of the algorithm.

      i  = * ( long * ) &y;

What this does is take the exact bits of our number as a float and copy them into a long. That's it. Under the hood, it reads the memory at the address of y as if it held a long and transfers those bits over to i unchanged. This will make our life easier in the next step.

Since i now holds the bit representation of y, the number whose inverse square root we're trying to find, we have effectively stored an approximation of $\log({y})$ (up to scaling and shifting) in i.

What the F#@k?

The fabled step that makes this algorithm so smart.

      i  = 0x5f3759df - ( i >> 1 );

Remember, at the end of all of this, we want to find a number, $\alpha = \frac{1}{\sqrt{{y}}}$, but we have been working almost exclusively in logarithms. So, let's take the $\log$ of both sides.

$\large{ \log(\alpha) = \log(\frac{1}{\sqrt{{y}}}) = \log({y}^{-\frac{1}{2}}) = -\frac{1}{2}\log({y}) \approx -\frac{1}{2}\texttt{i}}$

Wait, but we have a division in there! Not to worry: it's a division by 2, and since i is a long, we can just bit shift to the right by 1 to divide by 2! That's precisely what i >> 1 does: it bit shifts i once to the right.

But what is the deal with that 0x5f3759df? Well, remember that i is only an approximation for the $\log(y)$ up to some constants. So, we have to account for those constants somehow. Let's go back to $\alpha$. We know that

$\large{ \log(\alpha) = -\frac{1}{2}\log({y})}$

In terms of floats…

$\large{ \log({(1 + \frac{\color{red}{M_{\alpha}}}{2^{23}}) \cdot 2^{\color{red}{E_{\alpha}}-127}}) = -\frac{1}{2}\log({(1 + \frac{\color{blue}{M_{y}}}{2^{23}}) \cdot 2^{\color{blue}{E_{y}}-127}})}$

Fortunately we already know how to expand this from before.

$\frac{1}{2^{23}}(\color{red}{E_\alpha \cdot 2^{23}} + \color{red}{M_\alpha}) + \delta - 127 = -\frac{1}{2}[\frac{1}{2^{23}}(\color{blue}{E_y \cdot 2^{23}} + \color{blue}{M_y}) + \delta - 127]$

This looks pretty bad, but after some simplifying and rearranging…

$\color{red}{E_\alpha \cdot 2^{23}} + \color{red}{M_\alpha} = \frac{3}{2}2^{23}(127 - \delta) - \frac{1}{2}(\color{blue}{E_y \cdot 2^{23}} + \color{blue}{M_y})$

We know that anything of the form $E\cdot 2^{23} + M$ is just the bit representation of the number, and we know the bits of $y$ is just i, so

$\color{red}{\alpha}_{\texttt{bits}} = \frac{3}{2}2^{23}(127 - \delta) - \frac{1}{2}\texttt{i}$

That magic constant 0x5f3759df is approximately that constant $\frac{3}{2}2^{23}(127 - \delta)$, written in hexadecimal (not totally sure why there are so many changes of base); the value in the code works out to a $\delta$ of about 0.0450 rather than exactly 0.0430. So what we do in this line of code is bit shift i once to the right to halve it, then subtract that result from 0x5f3759df to correct for all the constants that came with our approximations of $\log(y)$. Not totally sure why the developers felt the need to write a variable for threehalfs and not this number, but what can we do.
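To see how the magic number and $\delta$ relate, here is a short C sketch of the formula in both directions. The function names magic_from_delta and delta_from_magic are made up for this post, and the arithmetic is done in doubles purely for convenience.

```c
#include <stdint.h>

/* (3/2) * 2^23 * (127 - delta), rounded to the nearest integer. */
uint32_t magic_from_delta(double delta) {
    return (uint32_t)(1.5 * 8388608.0 * (127.0 - delta) + 0.5);
}

/* The same formula solved for delta: which offset does a
   given magic constant correspond to? */
double delta_from_magic(uint32_t magic) {
    return 127.0 - (double)magic / (1.5 * 8388608.0);
}
```

Running delta_from_magic(0x5f3759df) gives roughly 0.04505, while plugging $\delta = 0.0430357$ into magic_from_delta lands on a nearby constant that differs from 0x5f3759df only in its lower digits. So the constant in the code sits very close to the one our derivation predicts, just not exactly on it.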

But now note we are storing this value in i. So, from here on i no longer refers to the bits of $y$, but the bits of $\alpha$, our desired number. The bits, though, aren't particularly helpful since we want the float and decimal representation of $\alpha$, so we do just that:

      y  = * ( float * ) &i;

Just like how we cast the bits of a float $y$ into a long i, we now do the reverse and cast the bits of i into a float $y$.

At this point, we're technically done: $y$ currently stores an approximation of $\frac{1}{\sqrt{\texttt{number}}}$, using 0 steps of slow division! But we can do better for a marginal amount of extra computation.
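Here is a minimal C sketch of just the steps so far, with no refinement afterwards. The function name rsqrt_initial_guess is made up for this post. It also swaps two things from the original: it uses uint32_t instead of long, since a long is 64 bits on many modern systems, and memcpy instead of the pointer cast, since modern C compilers are allowed to treat that cast as undefined behavior.

```c
#include <stdint.h>
#include <string.h>

/* Rough estimate of 1 / sqrt(number) from bit manipulation alone.
   Intended for positive, normal floats. */
float rsqrt_initial_guess(float number) {
    uint32_t i;
    float y;

    memcpy(&i, &number, sizeof i);  /* the bits of the float, as an integer */
    i = 0x5f3759dfu - (i >> 1);     /* magic constant minus half the bits */
    memcpy(&y, &i, sizeof y);       /* back to a float */

    return y;
}
```

For an input of 4.0f this returns roughly 0.483, a few percent below the true answer of 0.5.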

1st Iteration

Say we wanted to solve for the zeroes of the function

$\large{f(y) = \frac{1}{y^2} - C}$

where $C$ is any arbitrary constant. Solving for $y$…

$\large{0 = \frac{1}{y^2} - C \rightarrow y = \frac{1}{\sqrt{C}}}$

If we could find a way to approximate the roots of this function, we'd then in turn have a way to approximate the inverse square root of any number!

In a previous post, we discussed a technique to precisely do that: the Newton-Raphson Method (sometimes just called Newton's Method).

Let's say we have a random function $g(x)$. To find a solution, what can we do? Well, not a good idea, but an idea, we could just guess a random number $x_0$ as a solution. If $x_0$ is a solution, then obviously $g(x_0)=0$.

A pretty bad first guess.

As you'd imagine, the chances of guessing a root of $g(x)$ immediately are slim. So, the next step in Newton's Method tells us to draw the tangent line at our first guess $(x_0, g(x_0))$ to get our next guess $x_1$.

A better, but still not ideal, approximation.

Now we're getting pretty close. That's the whole premise of the Newton-Raphson Method:

1. Pick an initial guess $x_n$
2. Draw the tangent line at $(x_n, g(x_n))$ and find where it intersects the $x$-axis
3. Use that as your new guess $x_{n+1}$
4. Repeat steps 1–3 as needed
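The steps above can be sketched in C as follows. The tangent line at $(x_n, g(x_n))$ crosses the $x$-axis at $x_n - \frac{g(x_n)}{g'(x_n)}$, so that is the update used here. All of the names (RealFunction, newton_solve, example_g, example_dg) are made up for this post, and the example function $g(x) = x^2 - 2$ is just an illustration.

```c
/* A function from real numbers to real numbers. */
typedef double (*RealFunction)(double);

/* Run a fixed number of Newton-Raphson steps on g, starting from x0.
   dg must be the derivative of g. */
double newton_solve(RealFunction g, RealFunction dg, double x0, int iterations) {
    double x = x0;
    for (int n = 0; n < iterations; n++) {
        double slope = dg(x);
        if (slope == 0.0) {
            break;  /* flat tangent line: it never reaches the x-axis */
        }
        x = x - g(x) / slope;  /* x-intercept of the tangent line */
    }
    return x;
}

/* Example: g(x) = x^2 - 2, whose positive root is sqrt(2). */
double example_g(double x) { return x * x - 2.0; }
double example_dg(double x) { return 2.0 * x; }
```

Calling newton_solve(example_g, example_dg, 1.0, 6) starts from a guess of 1 and ends up at roughly 1.41421, which is $\sqrt{2}$ to several decimal places.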

So, if we do another iteration of our example above…

Now we're getting to a reasonable estimation.

There are some edge cases though where this obviously won't work, such as if our guess happens to hit an extremum.

In this case, there's no additional guess since our tangent line is parallel to the axis.

We could even get loops where we just continuously bounce back and forth between two guesses. Fortunately, we don't have to worry about that. If our first guess is already really accurate and near the actual solution, then our graph $g(x)$ begins to look like this:

Up close, smooth, continuous graphs look linear.

$g(x)$ starts to look like a line! And when a function locally looks like a line, it also locally looks like its tangent line.

Can't really beat that now.

This is important to us since we already have a good estimate from all of our bit manipulation from earlier, so we do one iteration of Newton's method to get an even better approximation.

To put this in terms of some equations to compute, we want to estimate the root of

$\large{f(y) = \frac{1}{y^2} - C}$

Given an initial guess $y_n$, our next guess $y_{n+1}$ is the solution to

$\large{f'(y_n)(y-y_n) + f(y_n) = 0}$

since this describes where our tangent line generates our next solution. Solving for $y$ we get that

$\large{y = y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}}$

Now it's just a matter of plugging everything in.

$\begin{align} y_{n+1} & = y_n - \frac{f(y_n)}{f'(y_n)} \ \newline & = y_n - \frac{\frac{1}{y_{n}^2} - C}{-\frac{2}{y_{n}^3}} \ \newline & = \frac{3y_n - Cy_{n}^3}{2} \ \newline & = y_n(\frac{3}{2} - \frac{C}{2}y_{n}^2) \end{align}$

With a small substitution of $x_2 = \frac{C}{2}$,

$\large{ y_{n+1} = y_n(\frac{3}{2} - x_{2}y_{n}^2) }$

If we look at the line of code that entails this "1st iteration",

      y  = y * ( threehalfs - ( x2 * y * y ) );

That's precisely the formula they have right there. You might wonder if that $\frac{3}{2}$ poses an issue at all in terms of division, but it is of no concern since we know its decimal expansion to be 1.5 so we can just use floating point arithmetic from the start; division becomes an increasingly hard problem when we don't know what the decimal representation of the quotient in question is.
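Pulled out on its own, that update can be sketched as a tiny C function. The name rsqrt_newton_step is made up for this post; the body mirrors the line from the original code, with number playing the role of $C$.

```c
/* One Newton-Raphson step toward 1 / sqrt(number),
   starting from the current estimate y. */
float rsqrt_newton_step(float y, float number) {
    const float threehalfs = 1.5F;
    float x2 = number * 0.5F;  /* x2 = C / 2 */
    return y * (threehalfs - (x2 * y * y));
}
```

Starting from an estimate of 0.48 for $\frac{1}{\sqrt{4}}$, one step gives about 0.4988, noticeably closer to 0.5. And if the estimate is already exactly 0.5, the step leaves it alone, which is what we'd hope for from a root.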

Conclusion

Let's quickly recap what we've learned about the fast inverse square root algorithm and how it works:

float Q_rsqrt( float number )
{
      long i;
      float x2, y;
      const float threehalfs = 1.5F;

      x2 = number * 0.5F;
      y  = number;
      i  = * ( long * ) &y;                     // evil floating point bit level hacking
      i  = 0x5f3759df - ( i >> 1 );             // what the fuck? 
      y  = * ( float * ) &i;
      y  = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
//    y  = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

      return y;
}

1. We first take our number as a float $y$ and store its bits in a long i.
2. Noting that $\log_2(y) \approx \texttt{i}$, we approximate $\frac{1}{\sqrt{y}}$ using bit shifting and the magic number 0x5f3759df.
3. We then perform one iteration of Newton's method to get an even better approximation.

These three steps required a fair bit of knowledge to properly unpack, but it's incredibly insightful when thinking about the lengths programmers go to in order to optimize their code. Remember: this is all to avoid using any division! What we consider such a simple operation is almost never so simple for a computer, and when it comes to teaching computers to do these things, they are as blind as a bat. But it's this difficulty and the blank slate of a circuit board that lets computers teach us just as much as we teach them.

Pleasing Panoramas and Matrix Multiplication

Adi Mittal

Capture the world as it (almost) is

Out of all the apps on my phone, the camera is one that I can't see myself without anymore. Between being out with friends, or travelling with family, the camera rarely remains idle as I capture memories forever. Though, there is one particular feature of mobile photography I've come to especially love: the panorama.

Even this struggles to capture how well the Grand Canyon lives up to its name.

Something about the extreme wide angle capturing more than what your eye can behold all at once makes it truly magical. And, if you have an Android phone, you can get even more incredible types of otherwise impossible photos with Google's staple Photo Spheres:

An "unwrapped" Photo Sphere; normally this would be an interactive literal sphere of the environment.

While these are cool, it does raise the question: how are these made? It's not like your phone is able to take them in a single snapshot; usually you have to spin or take multiple photos. How is your phone able to tie multiple images together into a single cohesive one? We'll explore a little bit of linear algebra to manipulate our photos exactly how we want to, and indulge in stylistic photos others can only dream of.

The Problem

Before anything else, let's first define what a panorama is. In computer graphics, a panorama is a type of mosaic, that is, a unification of two or more images. A panorama in particular, though, is a mosaic in which the photos to be stitched together are all taken from the same camera position. When you take a panorama on your phone, all you're doing is spinning in place, so that's what we want to recreate.

So, if we're given two images from different angles,

how do we combine them into a single unified image? Thinking about it this way is not super helpful, since, well, that's what we already know about panoramas. What do we want to get out of a panorama? If our panorama is done well, then objects in one image should overlap properly with the same objects in the other image.

For example, take a look at the top left of my computer monitor:

In our panorama, these two points should overlap, since we obviously recognize these to be the same object in the real world. But to the two pictures, they are wildly different! In the first picture, the corner of my monitor is closer to the left side of the frame, while in the second picture, it is almost on the top-right edge of the image. So, if we try to manually align these two corners so that they overlap, we get:

While our single corner of the monitor is aligned between the two images, I don't think I have to try very hard to convince you that this isn't a great panorama. I mean, just look at the rest of the overlap.

The skew angles of the resting laptop and the monitor itself don't align at all, and while my cable management is bad, it's not that bad. This is the real challenge at the heart of making panoramas: images are flat, while the motion of a panorama is cylindrical. Ideally we would take a "cylindrical" photo and unwrap that into a rectangle, but we can't. Undoubtedly, we will have to warp our images somehow to align.

How do we find the right way to warp our image? The first thing we will need to do is get more data! Having one point line up between the photos is not great, but, say, 10 different data points might not be bad.

In our final panorama, we would like the same-numbered red and blue points to overlap. With data to use, the second thing we will need is a way to actually warp our image; how do we actually make our points between photos line up? For that, we turn to linear algebra.

Rethinking Coordinates

If you had a random ordinary point, how might you describe its position to someone?

A lonely, solitary point living in the plane.

A common choice we're all familiar with in some way is by using a coordinate system. That is, we define a place to be $(0,0)$ and locate every point relative to that origin in terms of its $x$ and $y$ coordinates $(x,y)$.

A still lonely, solitary point living in the plane, but with more lines.

In the above choice of coordinate space, we might say the point is at $(3,2)$. But what exactly do we mean by the point being at $(3,2)$? What this really implies is that the point is 3 steps to the right of the origin, and 2 steps above the origin.

So, instead of thinking of this point in terms of separate coordinates, we can think of it in terms of these two basis vectors. Let's use $\color{blue}{i}$ to represent the blue, horizontal vector, and $\color{red}{j}$ to represent the red, vertical vector. So, our point is really the combination of $\color{blue}{3i}$ and $\color{red}{2j}$, or simply, $\color{blue}{3i} + \color{red}{2j}$, which itself represents another vector (the one pointing from the origin to the point $(3,2)$).

This might seem extra and unnecessary, since we just rewrote a vector as the sum of its horizontal and vertical components, which is what coordinates literally do in the first place. But the useful insight here is that there is nothing that says our basis vectors have to be in the unit directions! We can now rewrite points in multiple ways depending on our choice of $\color{blue}{i}$ and $\color{red}{j}$.

A new, quirky choice of basis vectors.

With a new choice of basis vectors, the vector $\color{blue}{3i} + \color{red}{2j}$ has a totally new position as that now encodes the coordinate $(5,3)$ since neither $\color{blue}{i}$ nor $\color{red}{j}$ represents horizontal or vertical steps anymore, but rather skew, diagonal steps.

But look at that! We've basically accomplished our goal of warping points! We've managed to transform the point $(3,2) \rightarrow (5,3)$ by manipulating $\color{blue}{i}$ and $\color{red}{j}$; both points are technically at $\color{blue}{3i} + \color{red}{2j}$, just for different basis vectors.

This is what linear algebra and matrices encode geometrically. If we write our basis vectors $\color{blue}{i}$ and $\color{red}{j}$ in a matrix and multiply that by the vector representing our initial point, we will get a new point representing our transformation (a.k.a., our warp). How do we write our basis vectors in a matrix? Each vector implicitly has coordinates associated with it! In the above picture, $\color{blue}{i}$ points at $\color{blue}{(1,-1)}$, since from its tail to its tip it moves one step to the right and one step down. $\color{red}{j}$, on the other hand, points at $\color{red}{(1,3)}$, and these are precisely the vectors we see in our matrix.
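As a small sketch of what that multiplication does, here it is written out in C. The names Vec2 and apply_basis are made up for this post; the function simply takes "coords.x steps along $i$ plus coords.y steps along $j$", which is the same thing as multiplying by the matrix whose columns are $i$ and $j$.

```c
/* A 2D vector or point. */
typedef struct {
    double x;
    double y;
} Vec2;

/* Compute coords.x * i + coords.y * j, i.e. multiply the matrix
   with columns i and j by the vector coords. */
Vec2 apply_basis(Vec2 i, Vec2 j, Vec2 coords) {
    Vec2 result;
    result.x = coords.x * i.x + coords.y * j.x;
    result.y = coords.x * i.y + coords.y * j.y;
    return result;
}
```

With $i = (1,-1)$ and $j = (1,3)$, the coordinates $(3,2)$ come out as $(3 + 2, -3 + 6) = (5,3)$, matching the picture. With the usual unit vectors $(1,0)$ and $(0,1)$, the point comes back unchanged.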

For the unit basis vectors that point at $(1,0)$ and $(0,1)$, to differentiate them from any old pair of basis vectors, we call them $\color{blue}{\hat{\imath}}$ and $\color{red}{\hat{\jmath}}$ wearing a little hat, and their respective matrix the identity matrix, since it leaves vectors unchanged after multiplication (since that's what we used to define coordinates in the first place).

The underlying idea of linear transformations.

These $2 \times 2$ matrices represent linear transformations. They're transformations in the way that they transform points from one coordinate to another (well, most of them at least), and they are linear in the sense that they keep all grid lines parallel, evenly spaced and, well, linear after the transformation. This is best seen through video and not stills. For this post you don't need to understand the mechanics of matrix-vector multiplication, but just understand that it represents some transformation on a point.

Warping Images

So why should we care? Why is this helpful in any way? If we think of each pixel on our images as a coordinate, we can just apply our transformation to all pixels on that image, find where they land and color them, and generate a new image. Let's take this picture as an example.

We can scale it, rotate it, or even shear it by applying the same transformation matrix to every pixel by changing $\color{blue}{i}$ and $\color{red}{j}$ like before.

An example transformation acting on our image.

But we have a big problem here: the typical linear transformation does not allow for translations. By the properties of linear transformations, the origin cannot move, therefore forcing the bottom left corner of our images to always overlap! That's pretty restrictive in terms of the panoramas we can make and, for practical purposes, a complete nonstarter. If we want to continue through with making a panorama, we'll need to find a way to make translations possible.

Homogeneous Coordinates and Affine Transformations

There's a very sneaky workaround to being confined to the origin. To make translations possible, we'll need to do something that might seem a bit weird. Let's rewrite our 2D points with a 3rd coordinate. For a given $(x,y)$, let's rewrite that with a $z$-coordinate as $(x,y,1)$. If $z \neq 1$, then we can just divide all the other coordinates by $z$ to make it equal to 1: $(x,y,w) \rightarrow (\frac{x}{w}, \frac{y}{w}, 1)$ (we generally use $w$ to represent the $z$-coordinate to indicate that there is no "real" $z$ value since everything is projected into 2D; we use $w$ as a "weight" to say how much we scale our projections down by). This means we have multiple coordinates representing the same point. In this way, $(2,5,1)$ and $(4,10,2)$ and $(-3,-7.5,-1.5)$ and $(2w,5w,w)$ all represent the same point (we don't include points where $z=0$, as they represent points at infinity).

This might seem arbitrary, but what we're doing here is not too different than our original, 2-coordinate system. When we look at a cross-section of the $xyz$ coordinate space, it looks exactly like the $xy$ plane. What we are doing here is projecting all of $xyz$ space onto the plane $z=1$.

The geometry of projecting points onto $z=1$ is equivalent to drawing a line through the origin and the point, and finding where it intersects that plane.

In fact, many of you are already familiar with homogeneous coordinates (representing 2D points with a 3rd scalar coordinate) and projective planes! When you take a photo on your phone, how does the camera know what's drawn in its frame? How does it take a 3-dimensional world and put it into a 2-dimensional picture? The many light rays that enter the camera lens (the origin) intersect a plane ($z=1$) determined by its focal length, and each pixel is colored based on that projection.

A photo is homogeneous coordinates in disguise. While there's many sides to the building, our camera only cares about what it sees in front of it (a.k.a., what gets projected onto the frame). Our worldview is contained to a small projection.

Using this analogy with photos, translations should clearly be possible! If you've seen any cat videos on the internet, you know the cat can enter and exit the frame freely without the camera necessarily moving. That is possible precisely because the origin $(0,0,0)$, which all of our basis vectors have to stem out of, is not contained in our projective plane $z=1$! (For those interested, a translation matrix is equivalent to a shear along the $z$-axis.)
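To see a translation written out, here is a short worked example. The symbols $t_x$ and $t_y$ are introduced here for the horizontal and vertical shift we want; they don't appear elsewhere in the post.

$ \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ 1 \end{bmatrix} $

Every point on the plane $z=1$ slides over by $(t_x, t_y)$, including the 2D origin $(0,0,1)$, which lands on $(t_x, t_y, 1)$. The true 3D origin $(0,0,0)$ still doesn't move, so this is a perfectly legal linear transformation in 3D.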

Think about what we're doing here: we're using a linear transformation in 3 dimensions to create special non-linear affine transformations in 2 dimensions! When I first learned this geometrically, awe couldn't encapsulate the total shock I felt. So, if we're given a point $p$, we can transform it with a matrix $M$ to get its image $p'$.

$ Mp = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \\ \end{bmatrix} = \begin{bmatrix} x' \\ y' \\ 1 \\ \end{bmatrix} $

(Notice how $M$ is now a $3 \times 3$ matrix, as we are working with 3D coordinates now.) If our matrix-vector product results in a $w$ value not equal to 1, then we just divide everything by $w$ to make it so, and get our coordinate in terms of our 2D plane.
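Here is a minimal C sketch of that computation. The names Point2 and apply_homography are made up for this post, and the matrix is passed as a plain 3 by 3 array of doubles.

```c
/* A point on our 2D plane. */
typedef struct {
    double x;
    double y;
} Point2;

/* Multiply m by (p.x, p.y, 1), then divide by w to land back on z = 1.
   Returns 0 on success, or -1 if w is 0 (a point at infinity). */
int apply_homography(double m[3][3], Point2 p, Point2 *out) {
    double xh = m[0][0] * p.x + m[0][1] * p.y + m[0][2];
    double yh = m[1][0] * p.x + m[1][1] * p.y + m[1][2];
    double w  = m[2][0] * p.x + m[2][1] * p.y + m[2][2];

    if (w == 0.0) {
        return -1;
    }
    out->x = xh / w;
    out->y = yh / w;
    return 0;
}
```

Using a matrix with rows $(1,0,3)$, $(0,1,4)$, $(0,0,1)$, the point $(2,5)$ is sent to $(5,9)$. Doubling every entry of the matrix gives the same answer, since the extra factor of 2 shows up in $w$ and gets divided right back out.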

These projections with homogeneous coordinates are known as homographies: we take one picture and reproject it according to a matrix like this while keeping the same camera center (like the origin). Again, like homogeneous coordinates, people have been leveraging homographies for a while now. You know how on roads, whenever there's text written like, "STOP AHEAD", it's always a bit vertically stretched? That's because when you read it at the angle of a driver in a car, it looks more regular and readable. At least, that's my theory.

More creatively, that weird, perspective street art you might have seen before? That's the most manual you can get to using homographies—literally warping images with the angle you look at them to make them appear at a normal proportion.

Street artists have used the power of perspective for a long time.

What we're doing is computing a homography to build a mosaic. Just like the decorative tile art of the same name, we are taking tiles of photos that we transform to overlap, and then stitching them all together into one, broader image.

Moreover, our homographies have a really funny interpretation to them. Since we are reprojecting pictures, what it geometrically looks like is that we're taking two photos which should be rotated in space (as you would spin taking the panorama), and taking a photo of the two photos. Photo-ception.

If you take a photo of two existing photos, you get one photo that unites the two together. If we can find out the right way to take the photo such that the overlap is correct, we get a panorama.

There are other ways to reproject images to make other mosaics with their own benefits and downsides, but this is what we'll use for now. Benefits of this type of mosaic? They are (relatively) easy and fast(er) to compute. Downsides? Since we are projecting onto a plane like this, we can only take panoramas up to 180° wide (can you see why?).

One Slight Issue…

While it's great that we are able to transform points with matrices, let me remind us what our goal is.

We have these two photos, where we want to transform one image's points to overlap with the other's. In terms of our matrix arithmetic from before, we have $p$ and $p'$, but no matrix $M$... Up to this point we have been finding our image points using our own matrices, but how do we find that intermediate matrix given a point and its image?

Regressions and 4-Dimensional Lines

As with homogeneous coordinates, many of you will already be familiar with the idea behind solving for the intermediate matrix given a $p$ and $p'$. Let's do it with a simpler example.

If you had the points $(1,2)$ and $(6,4)$, and I needed you to find the line $y=mx+b$ that went through them, most of you would be able to do that. We'd set up a system of linear equations

$\begin{align} 2 & = m(1) + b \ \newline 4 & = m(6) + b \end{align}$

and solve for $m$ and $b$ respectively. In this case, $m=\frac{2}{5}$ and $b=\frac{8}{5}$. Simple algebra with little to worry about here. What is important to note here is that we could solve for a unique $m$ and $b$, since two points define one unique line.

An ordinary line going between two points.

But what if I introduced a third point $(p_3,p_3')$? Or even a 4th point $(p_4,p_4')$? How do we draw a line through those 4 points? There might be a line that goes through all 4 points, but it's highly unlikely.

While there's no one line through all 4 points, what's the closest to a line we can get?

We may not have exact values for $m$ and $b$, but what's the best value for both to get the closest solution to this system of equations?

$\begin{align} p_1' & = m(p_1) + b \ \newline p_2' & = m(p_2) + b \ \newline p_3' & = m(p_3) + b \ \newline p_4' & = m(p_4) + b \end{align}$

That's a task as simple as plugging it into a spreadsheet and doing a linear regression. More specifically, we can use the common least-squares regression where we want to minimize not the sum of the errors, but the sum of the square of the errors (as the name would suggest). For those a little more comfortable working with matrices and linear algebra, here's a more in-depth explanation of what we're doing with our data when finding a regression.

To many, this might seem like an obvious thing to do; everyone from middle schoolers to office workers has been finding trend lines forever. But what we did here is pretty useful when we think more abstractly: given a system of linear equations that relates independent data $p$ to dependent data $p'$, we were able to find the coefficients that come closest to solving that system (in the previous case, $m$ and $b$). Finding a line was a nice byproduct, but what we're really doing here is solving that system of linear equations.
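For those who would rather see it than plug it into a spreadsheet, here is a small sketch of that regression using NumPy's `np.linalg.lstsq`. The four points are made up purely for illustration.

```python
import numpy as np

# Four made-up points (p, p') that do not all lie on one line.
p = np.array([1.0, 2.0, 4.0, 6.0])
p_prime = np.array([2.0, 2.5, 3.0, 4.0])

# Each row of A is [p_i, 1], so A @ [m, b] stacks the values m * p_i + b.
A = np.column_stack([p, np.ones_like(p)])

# lstsq returns the [m, b] that minimizes the sum of the squared errors.
(m, b), *_ = np.linalg.lstsq(A, p_prime, rcond=None)

residuals = p_prime - (m * p + b)
print("m =", m, "b =", b, "sum of squared errors =", np.sum(residuals**2))
```

No line hits all four points, so the errors are not zero, but no other choice of $m$ and $b$ gives a smaller sum of squared errors.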

Now I promise this will be helpful. Let's look at our original expanded matrix equation of $Mp = p'$.

$ \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \\ \end{bmatrix} $

Remember, we're working in homogeneous coordinates, so $p'$ might not land on the plane $z=1$, and we account for that with $w$ here. I also set $i=1$, since a) scaling the whole matrix by a nonzero constant gives the same transformation in the land of homogeneous coordinates, so $M$ is not unique and we are free to fix its scale, and as a result, more importantly, b) it gives us one less variable to solve for.

Here, we will need to actually do the matrix-vector multiplication, and carrying it out nets a system of linear equations! (I know I said you won't need to know the mechanics of these operations, but it's hard to avoid it now. If you can accept this fact, that's great, but I'd recommend looking here if you are unfamiliar.)

$\begin{align} ax + by + c & = wx' \ \newline dx + ey + f & = wy' \ \newline gx + hy + 1 & = w \end{align}$

Using the third equation in tandem with the first two…

$\begin{align} ax + by + c & = (gx + hy + 1)x' \ \newline dx + ey + f & = (gx + hy + 1)y' \end{align}$

Just like before, we can solve for $a$, $b$, $c$, $d$, $e$, $f$, $g$, and $h$ with a least-squares regression, since these equations are still linear in those unknowns (the $x$, $y$, $x'$, and $y'$ are all known numbers)! Since we have 8 variables, at minimum we need 8 equations, or 4 pairs of $p$ and $p'$ (since each pair gives two equations: one for $x'$ and one for $y'$). Though, as with the 10 points I use later, it is generally better to have more data and an overdetermined system than the bare minimum (we'd rather have a good overall fit than have just 4 points be exactly where we want them to be). It's weird to think of this geometrically, since what we're doing here is not finding the line between one independent variable and one dependent variable, but rather relating two independent variables $(x,y)$ to two corresponding dependent variables $(x',y')$; our regression exists in 4 dimensions!
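As a rough sketch of how this could look in code (an illustration under my own assumptions, not necessarily how the linked notebook does it), each pair of points contributes the two equations above, rearranged so that all of the unknowns sit on the left. The names `fit_homography` and `warp_points`, along with the test matrix `M_true`, are hypothetical.

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares estimate of M (with i fixed to 1) from 4 or more point pairs."""
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # ax + by + c - g*x*x' - h*y*x' = x'
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y])
        # dx + ey + f - g*x*y' - h*y*y' = y'
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y])
        rhs.extend([xp, yp])
    A = np.array(rows, dtype=float)
    b = np.array(rhs, dtype=float)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)  # [a, b, c, d, e, f, g, h]
    return np.append(params, 1.0).reshape(3, 3)

def warp_points(M, pts):
    """Apply M to an array of 2D points, dividing by w at the end."""
    pts = np.asarray(pts, dtype=float)
    homog = np.column_stack([pts, np.ones(len(pts))]) @ M.T
    return homog[:, :2] / homog[:, 2:3]

# Sanity check: warp some points with a known (made-up) matrix, then recover it.
M_true = np.array([[1.2, 0.1, 5.0],
                   [-0.2, 0.9, 3.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [50, 40]], dtype=float)
dst = warp_points(M_true, src)
M_est = fit_homography(src, dst)
print(np.round(M_est, 3))
```

With exact, noise-free correspondences like these, the recovered matrix should match `M_true` up to rounding error. With hand-labelled points the fit will only be approximate, which is exactly why extra pairs help.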

Putting It All Together

Let's quickly reflect on what we've covered thus far.

We've redefined coordinates purely with vectors, allowing us to nicely compact our image-warping transformations in matrices.
Our original definition of coordinates failed to include translations—a key transformation. We described 2-dimensional points in 3-dimensions with homogeneous coordinates, resolving our worries.
We then ran into ANOTHER problem in that while we knew how to warp images given the transformation matrix, we really wanted to be able to find the matrix given a starting point and an end point to map to.
Using a least-squares regression, we were able to turn our unknown matrix equation into a system of linear equations that were much easier to work with to compute our homography (sort of, see the aside below).

Let's use this first photo to give our list of points $p$.

And we'll try to match those red points to these blue points on the second photo: our list of $p'$.

Once the computer has computed the transformation matrix, we take that matrix and multiply it against every pixel's position (remember, treating them as coordinates/vectors), warping the first image. Then, we can overlay them to see how close our points line up! If our points were well selected, and our computed homography (with the least-squares regression) has minimal error, we should get a pretty decent attempt at a panorama.

Sure, the blending isn't great, and it didn't completely fix the overlap issue, but the seams and photo stitching definitely are much nicer! And honestly, it's pretty cool seeing how the image was transformed and finding the outline of the images cross like that.

With some simple masking and basic filtering (basically averaging every pixel's color with the pixels around it), suddenly it really begins to look clean.

While this is cool, it does reveal another unfortunate downside of our choice of mosaic: if we want a uniform picture, we have to sacrifice a lot of data.

Even so, it doesn't look that bad. All in all, not a bad first attempt at building a panorama.

What Next?

While we have a working prototype, we can do significantly better. For one, I used only 10 labelled points to compute our homography, but if you use even more, it's not hard to get a better, closer fit. With algorithms like LoFTR, finding lots of corresponding labelled points between multiple images is quick and easy.

Some really smart people made an algorithm specifically for finding high-quality matches between multiple photos. Credit: LoFTR Team

Also, since we are manually constructing our panorama, we can stitch and blend photos that have no right being together in a panorama.

Going from a well-lit to a dark photo makes for some artsy renditions (even more if you blend it a little nicer).

In a similar manner, we only conjoined two photos together, but we can easily extend this to as many photos as we want (but I can't say how well the photos towards the end will necessarily stretch).

We never really touched on our homographies, either. When we decided 10 initial points $p$ and 10 warped points $p'$, our $p'$ was decided as a result of lining up 2 photos. What if we didn't want to line up multiple photos, but rather just creatively warp a single photo?

Something not quite lined up? A simple homography can fix that for us.

This is known as rectification, as it is a means to correct for mistakes we might have had in our photo.

Finally, the last improvement we can make to our mosaics is trying new projections and warpings. If we want something even as simple as just wider, up to 360°, full views, we'll need to find something more robust than our previous approach. Or what if we wanted to make something akin to a full photosphere like from before?

What we did today was simply planar projection, or just reprojection onto a plane. We did that with homogeneous coordinates. For wider, more complete mosaics, we'll need either cylindrical or spherical projection, which is exactly what it sounds like. These have their own benefits like a wider field of view, but because of the nature of projecting onto a curved surface, the images being stitched together do tend to, well, curve. The type of mosaic one uses comes down to preference and artistic need.

And lastly, there are many optimizations and polishing details we could add to make our panoramas cleaner, and run faster. For instance, we never mentioned the discontinuities that could be present in warping our images with matrix multiplication. While our transformations keep lines before the warp as lines afterwards as well, that's only helpful if our line is continuous. Pictures are not continuous! They are discrete points! So, forward warping with our matrix multiplication and finding where pixels land can sometimes create (albeit, usually imperceptible) holes in our images, but they are there nonetheless. Instead, we can reverse warp by applying the inverse of our transformation matrix to every pixel of the output image, and find which coordinates of our original image land there! Not to mention different blending and masking techniques, or even just algorithmic improvements to make the code run faster. Check out the Python notebook below for more details.
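Here is one way the reverse warp could be sketched with NumPy. This is an illustration only: the name `inverse_warp` is made up, it uses crude nearest-neighbour sampling rather than any blending, and it assumes $w$ is never zero for the pixels involved.

```python
import numpy as np

def inverse_warp(image, M, out_shape):
    """Warp `image` by M, filling each output pixel by looking up its source pixel."""
    out_h, out_w = out_shape
    M_inv = np.linalg.inv(M)

    # Homogeneous coordinates (x, y, 1) of every pixel in the output image.
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])

    # Map each output pixel back into the original image, then divide by w.
    src = M_inv @ coords
    src_x = np.rint(src[0] / src[2]).astype(int)  # nearest-neighbour sampling
    src_y = np.rint(src[1] / src[2]).astype(int)

    # Only copy pixels whose source location falls inside the original image.
    h, w = image.shape[:2]
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    flat = np.zeros((out_h * out_w,) + image.shape[2:], dtype=image.dtype)
    flat[valid] = image[src_y[valid], src_x[valid]]
    return flat.reshape((out_h, out_w) + image.shape[2:])

# A tiny 3x3 "image" stretched by a factor of 2 onto a 5x5 canvas.
image = np.arange(1, 10).reshape(3, 3)
M = np.diag([2.0, 2.0, 1.0])
big = inverse_warp(image, M, (5, 5))
print(big)
```

A forward warp of this 3x3 image could write to at most 9 of the 25 output pixels, leaving holes in between. Going backwards, every output pixel asks where it came from, so every pixel that has a valid source gets filled.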

For more like this and additional resources, I recommend reading these slides from UC Berkeley's introductory computer vision and computational photography class.

I hope this gave an interesting peek at the intersection of linear algebra and photography, and moreover, I hope this gave you an appreciation for the math your phone goes through every time you take a panorama.

If you're interested, here's a link to a Python notebook where you can see some of my experiments during my struggle and exploration with panoramas and homographies.

Aside: Least-Squares with Linear Algebra

Okay, this previous section is really hard to describe without already knowing a fair amount of linear algebra, and it felt a little flat without having a more methodical procedure for solving a least-squares regression. I wasn't planning on including this section, but it felt incomplete otherwise. For those interested, feel free to look it over, but this is not necessary within the scope of this post; all you need to understand is what our regression is accomplishing, thinking of that "line of best fit" idea giving rise to optimal coefficients in an overdetermined system of linear equations.

Let's go back to when we were trying to find a line between two points. If you have 2 points, $(p_1, p_1')$ and $(p_2, p_2')$ being fit to the line $y=mx+b$, we have a system of linear equations like before.

$\begin{align} p_1' & = m(p_1) + b \ \newline p_2' & = m(p_2) + b \end{align}$

We can solve this just like we did before to find $m$ and $b$, but there's another, sly way we can approach this. If we look carefully at the structure of these equations, there's actually a secret matrix relationship embedded into this system.

$ \begin{bmatrix} p_1' \\ p_2' \end{bmatrix} = \begin{bmatrix} p_1 & 1 \\ p_2 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix}$

In a sense, that's what a matrix is: a system of linear equations, and you can freely go between either a system of linear equations or a matrix via matrix multiplication. (I know I said you won't need to know the mechanics of these operations, but it's hard to avoid it now. If you can accept this fact, that's great, but I'd recommend looking here for more details.)

If we write this in general terms, we are basically solving the equation

$b = Ax$

where $A$ is a matrix, and $b$ and $x$ are vectors (this $b$ is the vector of $p'$ values, not the intercept from before), and we are solving for the latter. It might seem pointless to rewrite it, but what we're actually solving is

$Ax - b = 0$

Since $Ax$ is exactly equal to $b$ in the 2-point case, we can solve this matrix equation fairly directly; when there's a unique, perfect solution, $Ax$ is the same vector as $b$. We were able to find a unique line with $m$ and $b$ through them, no? Just as we were able to solve the system of linear equations before, we can easily solve this with matrix inverses:

$A^{-1}b = x$
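As a quick check, here is the earlier example of the line through $(1,2)$ and $(6,4)$ solved this way in NumPy. I use `np.linalg.solve`, which solves $Ax = b$ without explicitly forming $A^{-1}$; the variable name `intercept` is mine, to avoid clashing with the vector $b$.

```python
import numpy as np

# Rows are [p_i, 1] for the points (1, 2) and (6, 4).
A = np.array([[1.0, 1.0],
              [6.0, 1.0]])
b = np.array([2.0, 4.0])  # the p' values

m, intercept = np.linalg.solve(A, b)
print(m, intercept)
```

This recovers $m = \frac{2}{5}$ and an intercept of $\frac{8}{5}$ (up to floating-point rounding), the same values we found by hand.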

Now, let's add more points.

$\begin{align} p_1' & = m(p_1) + b \ \newline p_2' & = m(p_2) + b \ \newline p_3' & = m(p_3) + b \ \newline p_4' & = m(p_4) + b \end{align}$

Now we turn this into a matrix equation like before.

$ \begin{bmatrix} p_1' \\ p_2' \\ p_3' \\ p_4' \end{bmatrix} = \begin{bmatrix} p_1 & 1 \\ p_2 & 1 \\ p_3 & 1 \\ p_4 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix}$

We know that there's a good chance our four points don't all lie on the same line. So it's unlikely that $Ax - b = 0$. Moreover, now that our matrix $A$ isn't square, we can't just use inverses to solve for $x$. So instead, we want the line that gets $Ax - b$ as close to 0 as possible (0 being a perfect fit). So our goal is to

$\min||Ax-b||^2_2$

Here, the $||\cdot||_2$ means we're looking at the Euclidean distance (a.k.a. straight line distance) as our error for our line of best fit, and we're squaring it to get a tighter fit since small errors are kept relatively small, while large errors are weighted more heavily. We know $A$ and $b$ with $x$ as our unknown; this sort of looks like a parabola-y equation! When we minimize a single variable function, we do so with the derivative. We can do the same thing here except with the multivariable equivalent: the gradient. So, we know the minimum occurs where the gradient of this function is 0.

$\nabla_{x}||Ax-b||^2_2 = 0$

Even if you're not familiar with multivariable calculus, much of the following should still look vaguely familiar to the chain and power rules of single-variable calculus.

$\begin{align} \nabla_{x}||Ax-b||^2_2 & = 0 \ \newline 2A^T(Ax-b) & = 0 \ \newline 2A^TAx - 2A^Tb & = 0 \ \newline A^TAx & = A^Tb \end{align}$

All finally simplifying to the very nice formula of

$x = (A^TA)^{-1}A^Tb$

I like this approach for its intuitive roots in the geometry of single-variable calculus, but if you want a more strictly linear algebra approach, here's this excerpt from Georgia Tech that explains another proof for the same formula:

Theorem. Let $A$ be a $m \times n$ matrix and let $b$ be a vector in $\mathbb{R}^m$. The following are equivalent:

$Ax=b$ has a unique least-squares solution.
The columns of $A$ are linearly independent.
$A^TA$ is invertible.

In this case, the least-squares solution is

$x = (A^{T}A)^{-1}A^{T}b$

Proof. The set of least-squares solutions of $Ax = b$ is the solution set of the consistent equation $A^TAx = A^Tb$, which is a translate of the solution set of the homogeneous equation $A^TAx = 0$. Since $A^TA$ is a square matrix, the equivalence of [facts] 1 and 3 follows from the invertible matrix theorem. The set of least-squares solutions is also the solution set of the consistent equation $Ax=b_{\textrm{Col}(A)}$, which has a unique solution if and only if the columns of $A$ are linearly independent.

Basically, it says that if the columns of our matrix $A$ are linearly independent (i.e. no column can be built as a combination of the others), we can turn our non-square matrix $A$ into a square, invertible one by multiplying by its transpose $A^T$, and solve our least squares the way we solved it before with inverses. In other words, if our matrix follows the criteria listed above, our minimizing solution comes from creating an equivalent equation with an invertible matrix:

$\begin{align} Ax & = b \ \newline A^TAx & = A^Tb \ \newline (A^TA)^{-1}A^TAx & = (A^TA)^{-1}A^Tb \ \newline x & = (A^TA)^{-1}A^Tb \end{align}$

Netting precisely the same formula as before.
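If you'd like to see the formula hold numerically, here is a small sketch with four made-up points. It solves the normal equations $A^TAx = A^Tb$ directly and compares the result against NumPy's built-in least-squares routine.

```python
import numpy as np

# Four made-up points; the rows of A are [p_i, 1] and b holds the p' values.
p = np.array([1.0, 2.0, 4.0, 6.0])
b = np.array([2.0, 2.5, 3.0, 4.0])
A = np.column_stack([p, np.ones_like(p)])

# Solve A^T A x = A^T b (same answer as (A^T A)^{-1} A^T b, minus the explicit inverse).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's own least-squares solver, for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# The gradient 2 A^T (A x - b) should vanish at the minimizer.
gradient = 2 * A.T @ (A @ x_normal - b)
print(x_normal, x_lstsq, gradient)
```

Both approaches should agree up to rounding, and the gradient should come out as (numerically) zero. In practice, routines like `lstsq` are usually preferred over forming $A^TA$ yourself, since they tend to behave better when the columns of $A$ are nearly dependent.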

Now, let's recall our matrix equation from before of the homography we wanted to solve.

$ \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \\ \end{bmatrix} $

Then, we expanded this into 3 linear equations, and further simplified them to the following two:

$\begin{align} ax + by + c & = (gx + hy + 1)x' \ \newline dx + ey + f & = (gx + hy + 1)y' \end{align}$

This can be rewritten as another, secret matrix equation:

$ \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x' \cdot x & -x' \cdot y \\ 0 & 0 & 0 & x & y & 1 & -y' \cdot x & -y' \cdot y \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix} $

Wait, we turned our original matrix equation into another one? As awful as that may look, this is much more useful than our original equation since now, all of our unknowns are in a vector instead of a matrix; it really is no different than our previous least-squares examples, and we're still solving for the vector $x$ in

$Ax = b$

So, we can still solve it like before finding

$x = (A^{T}A)^{-1}A^{T}b$
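To spell out what $A$ and $b$ look like with more than one pair of points: each pair contributes its own two rows, stacked on top of each other. Writing the $k$-th pair as $(x_k, y_k)$ and $(x_k', y_k')$ (the subscripts are my own notation here), $n$ pairs give a $2n \times 8$ system:

$ \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1' \cdot x_1 & -x_1' \cdot y_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -y_1' \cdot x_1 & -y_1' \cdot y_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & 1 & 0 & 0 & 0 & -x_n' \cdot x_n & -x_n' \cdot y_n \\ 0 & 0 & 0 & x_n & y_n & 1 & -y_n' \cdot x_n & -y_n' \cdot y_n \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x_1' \\ y_1' \\ \vdots \\ x_n' \\ y_n' \end{bmatrix} $

With $n = 4$ the matrix is square ($8 \times 8$), and with $n = 10$ it is $20 \times 8$, which is where the least-squares formula earns its keep.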

And with that, we now have also gone through what our program is doing under the hood, and have gone through some of the tedium of justifying what a least-squares regression is from a linear algebra perspective.

Constructive Proofs and Careful Practices

Adi Mittal

Is this really math anymore?

Many who have dipped their toes in math for even a little bit will know that $\sqrt{2}$ is irrational. Like many other well known mathematical constants like $\pi$ and $e$, $\sqrt{2}$ can't be written out as a fraction, and its decimal expansion goes on forever without repeating. It's a neat fact, and comes with a fairly simple proof too.

Claim: $\sqrt{2}$ is irrational.

Proof: Suppose that $\sqrt{2}$ is not irrational. Therefore it is rational, and can be written as $\sqrt{2} = \frac{p}{q}$ where $p$ and $q$ are coprime integers (i.e. they don't share any factors in common; this is a way of saying that there is a unique way of writing $\sqrt{2}$ as a fraction in lowest terms). Squaring both sides and moving terms around, this equation becomes $2q^2 = p^2$. Therefore, $p^2$ is even, since it will always have a factor of 2 according to the lefthand-side of the equation. If $p^2$ is even, then $p$ is also even (if you're unsure of this fact, try some test cases and then prove it!). Since $p$ is even, let's rewrite it as $p=2n$. Plugging this into our equation yields: $2q^2 = p^2 = (2n)^2 = 4n^2$. Dividing both sides by 2 gives us $q^2 = 2n^2$. Similar to before, the equation implies that $q^2$ and hence $q$ is even due to that factor of 2 on the righthand-side. But this is a contradiction! $p$ and $q$ cannot simultaneously both be coprime and even (share a factor of 2). Hence, our initial assumption that $\sqrt{2}$ is rational is wrong, which allows us to conclude that $\sqrt{2}$ is irrational. $\blacksquare$

It's not a long proof at all, and in the grand scheme of the history of math, pretty important, too. Supposedly, this was the first instance of irrationality being proven as a possible quality. More specifically, legend has it that Hippasus of Metapontum was credited with showing that the $\sqrt{2}$ is not rational (note the word choice) in the 5th century BCE while at sea, and upon sharing his amazing discovery, was instantly thrown overboard and left to drown.

Arguably of more significance, the proof highlights a general technique we use not just in math, but everyday logic and reasoning: proof by contradiction.

Proof by Contradiction

I'm sure you have all heard someone say, "For the sake of argument…," at some point in your life before. The idea is to disprove any hypothetical argument by showing it leads to an impossibility.

Let's say someone was arguing that Euclid was still alive today. Instead of remaining in utter shock at such a claim, here's a simple reply to show otherwise. For the sake of argument, let's assume Euclid is still alive. It is generally agreed that Euclid lived around 300 BCE, so if he is alive now, that would make him around 2300 years old, which is crazy! Only a handful of people live past 100, let alone 1000 years. So it is safe to say, it is not the case that Euclid is still alive today.

In fact, the Latin phrase that describes proof by contradiction, "Reductio ad absurdum," literally means, "Reduction to the absurd".

Even if you haven't realized it before, you've probably used this same principle a lot in school. When taking a test, process of elimination can be an immensely helpful tactic to find the right answer to a question: if you can eliminate what you definitely know to be false, the rest of the options must include the truth.

At its core, we're using double negatives to prove a "positive" statement; we assume something to be false, show that it can't be, and therefore conclude, it's true since something that is not false must be true. If a number is not rational, it's irrational. If someone is not alive, they're dead. If someone's not guilty, they're innocent.

That last one feels weird, right? Just because there's no evidence to show someone is involved in a crime does not mean we always feel them to have been completely removed, right? In the previous examples, there's a strong sense of complements. "Rational" means "not irrational", so "not not irrational" is the same as "irrational". "Alive" and "not dead" are the same, so "not alive" is "not not dead" is "dead". But is "not guilty" really the same as "innocent"? Semantically, that's what it means. If you Google it, that's what you get.

Of course, Google never lies.

From a casual, conversational point, it might, but from a legal standpoint, there's a reason why we make a distinction. We don't live in a binary world. So, it's not totally clear that we can say "not guilty" is the same as "innocent", even if that's what their definitions may mean. Even if two things are definitionally polar opposites, we, as the humans behind the words' significance to begin with, add subtlety and nuance.

"Not guilty" still insinuates that there's a feeling that there's something missing; we think they are guilty, but are unable to prove it. So even if we define "guilty" to be "not innocent", we might not want to say "not guilty" is the same as "innocent".

Which should come with a weird realization: if "guilty" is "not innocent", then "not guilty" is "not not innocent", which we said to not be equal to "innocent". Double negatives don't always cancel out.

Maybe our original definitions are wrong then? But it would be hard to define either "guilty" or "innocent" external to each other.

This may just seem like nitpicking, but it's important to consider this, since it puts into doubt such an intuitive, clear cut idea. If we can't use double negatives in one place, who's to say where else we can't use it? It might seem obvious at first that "not alive" must mean "dead", but have we really shown that? Just because it's absurd to think about Euclid being alive right now, does our proof really guarantee that? Is there something more to rational and irrational numbers that we might be missing? Just like with differentiating "not guilty" and "innocent", just because we have not been able to show someone guilty does not mean that they aren't.

Yet, this is precisely the concept that proof by contradiction relies on. We assume the opposite of what we actually want to prove, show it is impossible, then negate it to get to our original statement. But we don't actually show anything directly about our claim, only that it is impossible for it to be false. If that step isn't valid, as is the case between "guilty" and "innocent", this is no longer a valid proof technique!

Above, we've only demonstrated that $\sqrt{2}$ is not rational, when we wanted to show that it is irrational. Did we really do that?

The Law of Excluded Middle

The easy answer to that is yes. We state a strict definition, and we can pretty clearly figure out when something is binary and when something has wiggle room. We can define a number to either strictly be rational or irrational. If it's not one, it has to be the other.

Technically, fine, yeah, we can do that. But that's kind of unsatisfying. But, there's a more fundamental question we should ask, since we're not just concerned with a number being rational or not, or someone's mortality, but proofs in general; we want to know if we can use proof by contradiction consistently. In general, we're trying to prove the truth of claims, so what we really care about is the following question: if something is not true, is it false (and vice versa)?

Again, obviously yeah, right? What else could it be? All declarative sentences (sentences that make a claim about something, i.e. not questions) have to be true or false. It's that simple. It either is or it isn't. "Cats are animals," is true, and, "The Earth is flat," is false. Easy.

This is known as the Law of Excluded Middle, for, well, we exclude any type of middle value in between true or false. For those who like their logical formalities, if $\Phi$ is our claim about the world, then we say $(\Phi \lor \neg \Phi)$ is a tautology (that is, something that is always true). In natural language, this literally means "either our claim of the world is true or our claim of the world is not true", which just makes sense.

This forms one of the, and I can't make this up, Three Laws of Thought upon which "reasonable arguments" are based. These principles are:

The Law of Excluded Middle: every claim or its negation is true.
The Law of Non-Contradiction: nothing can simultaneously be true and false.
The Law of Identity: everything is identical to itself.

These three are usually taken as "common sense" axioms, and along with a few other assumptions build the basis of inference and deduction. Some might note that the first two laws might mirror each other through De Morgan's laws, but remember these don't actually explicitly state De Morgan's laws and we have no way of inferring that with these three principles alone. These three laws come up in many contexts, and have been reformalized many times over across history. Even the last Law of Identity, which seems almost pointless, has been taken by none other than Leibniz (yes, the same one from calculus) and been turned into two laws. That's for another day, though, and we will continue to only focus on the first law.

Ok, then how about the following claim?

This sentence is false.

Is this true or false? If it's true, then it declares itself to be false. If it's false, then it must be giving the wrong information, and therefore it actually is declaring itself to be true. And if it's true…

Suddenly binary truth values don't seem to be as reliable as they first seemed. This sentence doesn't neatly fit into one category. If you consider one side, it flips to the other. It's been aptly named the Liar's paradox for its deceptive nature, and there are many attempted solutions, one of my favorites being the fuzzy logic answer for its clarity and funny name. The number of attempts at resolving such an annoying sentence, though, should highlight that binary truth values are not always the right approach.

At this point, it's time we might have to reconsider the Law of Excluded Middle. Maybe this is just a dumb paradox that should be ignored as an edge case, but if you leave it as it is, then are we really going to be happy with what we have, knowing that there are some problems out there we just can't address?

Intuitionism

Perhaps, we don't have to reject the Law of Excluded Middle, but instead come up with a stronger type of reasoning. As we saw, the Law of Excluded Middle lines up pretty well with most things we observe in general, so we shouldn't discount it for that. Though, we might find a way that can avoid this issue we ran into with "not guilty" versus "innocent" altogether, and that would be foolproof from the start.

Our proof by contradiction is known as a non-constructive proof, in that it reasons about a claim without ever demonstrating the claim itself; there's no actual evidence presented for the thing we want to show true, since we try to make it appear "obvious" that there is no other way for such a claim to exist.

Our proof by contradiction for $\sqrt{2}$ being irrational is one example, but here's another one of my favorites.

Claim: There exist irrational numbers $a$ and $b$ such that $a^b$ is rational.

Proof: We know (and will show again later) that $\sqrt{2}$ is irrational, so consider the number $\sqrt{2}^{\sqrt{2}}$. If this is rational, we're done: take $a = b = \sqrt{2}$. If this is irrational, then take $a = \sqrt{2}^{\sqrt{2}}$ and $b = \sqrt{2}$, and consider the number $(\sqrt{2}^{\sqrt{2}})^{\sqrt{2}} = \sqrt{2}^{\sqrt{2} \cdot \sqrt{2}} = \sqrt{2}^2 = 2$, which is rational, and hence we are done.

Never once did we actually deduce the rationality of $\sqrt{2}^\sqrt{2}$, but by exhausting the possible cases, it doesn't matter since every possibility leads to a conclusion that proves our claim.

If we can find a constructive proof and actually manifest our claim in some way, then we don't have to worry about double negatives or anything like that since we directly showed what we wanted to.

We're now dabbling along the lines of intuitionistic logic where, ironically, we are breaking our natural sense of double negation and relying solely on these constructive proofs to validate truths. The thinking is that, in a way, non-constructive proofs are just left in abstraction and are no more than a construct of our mind. Intuitionism tries to separate ourselves from the proof by showing an objective way of realizing a truth.

Let's go back to our original proof by contradiction above:

Claim: $\sqrt{2}$ is irrational.

Remember, we don't trust double negatives, so we want to avoid saying "irrational" is the same as "not rational". What does it really mean for a number to be irrational?

Well, even though we don't want to say "irrational" is "not rational", we can draw inspiration from the latter to find a better definition. We say a number $x$ is rational if it can be written in the form $x=\frac{m}{n}$ where $m$ and $n$ are integers with $n \neq 0$. Or, equivalently, $x$ is rational if there exist integers $m$ and $n \neq 0$ such that $x - \frac{m}{n} = 0$. Since we're using subtraction here, there's a natural way to think of rationality in terms of distance; a number is rational if it is distance 0 away from some integer fraction. This gives us a nice way to think of irrationality.

Definition: A number $x$ is irrational if for all integers $m$ and $n$ with $n \neq 0$, $|x-\frac{m}{n}| \neq 0$. In other words, a number $x$ is irrational if it is some distance away from every rational number.

We add the absolute value signs since we're quantifying distance, and we don't care if a number is less than or greater than any rational number, just that it is not exactly equal to one of them.

Some of you may be skeptical of what I've just laid out, since if you're familiar with mathematical quantifiers, it does look like some implicit negation work has been done under the cover of my words. Even so, the point is we have arrived at a distinct, specific definition of the property we set out to prove, allowing us to try and prove a positive claim as opposed to a negative one (like in proof by contradiction).

The New Proof

We can rewrite our claim now as such:

Claim: $\forall m,n \in \mathbb{Z}, \ |\sqrt{2} - \frac{m}{n}| \neq 0$

That first part just means $m$ and $n$ are integers. We can now do our proof fairly quickly.

Proof: Note that $1<2<\frac{9}{4}$, therefore $1<\sqrt{2}<\frac{3}{2}$. So the only rationals we care about comparing $\sqrt{2}$ to are the ones in between $1$ and $\frac{3}{2}$, since it's apparent that any number outside this range will have a non-zero distance away from $\sqrt{2}$. Therefore, let's take our integers $m$ and $n$ such that they are positive, and $1<\frac{m}{n}<\frac{3}{2}$. Let's also take a second to note we can add these inequalities, giving us that $\sqrt{2} + \frac{m}{n} < 3$, and so $n\sqrt{2} + m < 3n$.

Now we can do some simple algebra on our claim to show it true.

$\large{\bigg |\sqrt{2} - \frac{m}{n} \bigg | = \frac{|n\sqrt{2} - m|}{n} \cdot \frac{n\sqrt{2} + m}{n\sqrt{2} + m} = \bigg |\frac{2n^2 - m^2}{n(n\sqrt{2} + m)} \bigg |}$

Let's examine the numerator. Every positive integer $x$ has some number of factors of 2 in its prime factorization, i.e. $x = 2^\alpha \cdot C$ where $C$ is the (odd) product of the rest of its prime factors. When you square that integer, the number of 2s in its prime factorization doubles: $x^2 = 2^{2\alpha}\cdot C^2$. So, we can write $2n^2 = 2 \cdot 2^{2\alpha} \cdot C_1^2 = 2^{2\alpha + 1} \cdot C_1^2$. Similarly, $m^2 = 2^{2\beta} \cdot C_2^2$. The key detail here is that $2n^2$ has an odd number of factors of 2, while $m^2$ has an even number of factors of 2, and therefore they cannot be the same number! And since they are integers, $|2n^2 - m^2| \geq 1$.* Continuing on,

$\large{\bigg |\sqrt{2} - \frac{m}{n} \bigg | = \bigg |\frac{2n^2 - m^2}{n(n\sqrt{2} + m)} \bigg | \geq \frac{1}{n(3n)} = \frac{1}{3n^2} }$

We simplified the denominator with the inequality we found at the start. But look at what we have here!

$\large{\bigg |\sqrt{2} - \frac{m}{n} \bigg | \geq \frac{1}{3n^2} }$

This means for any rational number with denominator $n$, $\sqrt{2}$ is at least a distance $\frac{1}{3n^2}$ away from that rational! Therefore, $|\sqrt{2} - \frac{m}{n}|$ can never equal 0, and we have thus proved our claim and that $\sqrt{2}$ is irrational.
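As a sanity check (not a substitute for the proof), here is a small Python sketch that tests the two facts used above for small denominators: that $2n^2$ has an odd number of factors of 2 while $m^2$ has an even number, and that the bound $\frac{1}{3n^2}$ holds. The function names and the cutoff of 300 are my own choices for illustration, and the distance is computed in floating point, which is accurate enough at this scale.

```python
from math import sqrt

def twos(x: int) -> int:
    """Number of factors of 2 in the prime factorization of a positive integer."""
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

def bound_holds(m: int, n: int) -> bool:
    """Check |sqrt(2) - m/n| >= 1/(3 n^2) in floating point."""
    return abs(sqrt(2) - m / n) >= 1 / (3 * n * n)

def check(max_n: int) -> bool:
    """Test every fraction with 1 < m/n < 3/2 and denominator up to max_n."""
    for n in range(1, max_n + 1):
        for m in range(n + 1, 2 * n):
            if 2 * m >= 3 * n:  # keep only m/n < 3/2
                break
            # 2n^2 should have an odd number of 2s, m^2 an even number
            if twos(2 * n * n) % 2 != 1 or twos(m * m) % 2 != 0:
                return False
            if not bound_holds(m, n):
                return False
    return True

print(check(300))
```

The fractions that come closest to the bound are ones like $\frac{7}{5}$ and $\frac{17}{12}$, where $|2n^2 - m^2|$ is exactly 1.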

Conclusion

A lot of this can seem like nitpicking definitions and being overly pedantic, and to a certain extent I agree. I mean, even in the proof above we sort of implicitly assume $\sqrt{2}$ is irrational at (*)! When we said $2n^2 \neq m^2$, you could rearrange that to get $\sqrt{2} \neq \frac{m}{n}$. In our definition above, this would only count as being "not rational" as opposed to "irrational", but we are ever so subtly tweaking our interpretation of this fact to make it useful. When differences between proofs are this minute in terms of what we are saying, it seems hard to take any of what we talked about today as practical. We even turned to the definition of rationality to come up with a "better" definition of irrationality, so what's even the point in trying to discern the minutiae of our constructs?

Even so, it's a good exercise in thinking about how we arrive at "facts" or "knowledge". When can we prove something by contradiction, and even if we can, is there a more convincing argument? Are there possible cases we might be leaving out and forgetting? Are there some things that we can only prove directly or indirectly (answer is yes)?

These questions are what underpin so much that we naturally gravitate towards, and there's a reason why they've been examined so harshly. These questions literally uprooted math and led to some of the most famous, mind-blowing results the field has ever seen. So always ask yourself, "What can I assume, and what do I know?"

Perimeters and Parameterizations

Adi Mittal

How bees outsmart us all

What shape has the largest area to perimeter ratio?

Some might have a guess, or an intuition for what the answer should be, but it's a surprisingly difficult question to pin down as a proof. In fact, a rigorous proof wasn't given until around 1840 (J. Steiner)!

Some Observations

There are some facts about our ideal shape we can observe quickly. First, it has to be convex. For if our shape was concave, we could always add more area with simple reflections.

We can get free area by reflecting over the blue line.

This can also be seen to work for not just polygons but curves as well, where instead of reflecting over vertices we can reflect over tangents.

Concavity is wasted. Credit: Wikipedia

Further, in a similar argument, there has to be a type of perimeter-symmetry to our shape. For any way of dividing our shape into two equal perimeters, the areas enclosed by this dividing line must be equal. If they were not, one half would be bigger than the other, and we could just reflect that bigger half over the line to get a shape with the same perimeter but greater area (as we would have two copies of the bigger half glued together, as opposed to the bigger half attached to the smaller half).

If area 1 was bigger than area 2, we can just replace area 2 with a reflection of area 1. We can then smooth out any concavities the same way we did before.

This then also gives us a way to halve our problem. Literally: instead of looking for the curve that encloses the most area for the same perimeter, we can instead look for the arc that encloses the most area with its endpoints connected by a fixed straight line. Then we can just reflect over it and we're good. So let's look at an arc.

We can divide the area of our arc into 3 sections with an inscribed triangle.

Is there any way we can increase the area of this arc? With the above diagram, we have divided the area into 3 sections. Since we want to keep the arc's perimeter fixed, we can't really affect areas 1 and 2. But 3 we can maximize easily. For a triangle with 2 fixed side-lengths $a$ and $b$ and the angle $\theta$ in between them, the area of the triangle is $\frac{1}{2}ab\sin\theta$, which is maximized when $\theta$ is a right angle. So, we can increase the arc's overall area by picking a point along the curve, and having it such that it always forms a right triangle with the base of the arc.
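To see the right-angle claim numerically, here is a quick sketch that scans whole-degree angles for a triangle with two fixed sides, using the standard formula $\frac{1}{2}ab\sin\theta$. The side lengths 3 and 4 are arbitrary values picked for illustration.

```python
from math import sin, radians

def triangle_area(a: float, b: float, theta_deg: float) -> float:
    """Area of a triangle with sides a, b and included angle theta (in degrees)."""
    return 0.5 * a * b * sin(radians(theta_deg))

a, b = 3.0, 4.0
# Try every whole-degree angle strictly between 0 and 180.
best = max(range(1, 180), key=lambda deg: triangle_area(a, b, deg))
print(best, triangle_area(a, b, best))  # 90 6.0
```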

So, to maximize the area of the arc, we want this to be true for any point picked along the curve. What curve is defined by forming a right triangle with its base at every point? It's a semi-circle! By fixing the base-length as well as the right angle restriction, it happens to also fix all radial lines from the mid point of the base. So finally reflecting back over the line as our second observation suggests, the shape that encloses the most area for a given perimeter is a circle.

…that is if such a shape exists. What we have really shown is that IF there is a shape that encloses the most area for a given arc length, THEN it is a circle. We have not necessarily shown that there is a shape. Intuitively, it seems a bit frivolous to worry about it since it just makes sense that there is only so much you can do to spread a curve out, but it is worth taking a second to do so. For, if the problem was, "What shape encloses the least area for a given perimeter?", there is no such shape we can give. But this is a nit-picking detail that can be shown with not much difficulty using some calculus and limiting processes, but is worth keeping in mind.

Some might know the answer off the top of their head, and some might be able to guess, but the answer is unsurprisingly a circle. And we see that this is realized all throughout nature. Bubbles are spherical since for a fixed amount of gas (i.e. volume), the sphere requires the least surface area to accommodate it (and therefore puts the least amount of stress on the bubble). Rain drops, too, would be perfectly spherical if not for gravity.

The Analytic Approach

With more robust tools, we can prove this more directly, albeit in a more overkill way. Our problem of finding the greatest area per perimeter is a byproduct of the isoperimetric inequality: if a closed curve of length $L$ encloses an area $A$, then $4 \pi A \leq L^2$. The proof isn't long, but requires some less accessible calculus machinery. The below is adapted from Erhard Schmidt's 1938 proof.

PROOF: Let's parameterize a general simple closed curve $C$ (i.e. one that does not cross itself and ends where it starts), say $c(t) = (x(t),y(t))$. Since we care about arc length, we will parameterize $c(t)$ by arc length, i.e. $\forall t \ |c'(t)| = 1$, so that $c(L)$ is our end point of the curve. Now as it is a simple closed curve, $x(t)$ must be bounded, as continuous functions on a closed, bounded interval are bounded. So we can enclose our curve between two parallel vertical lines, say, $l$ and $l'$, that touch the curve $c(t)$ at, say, $t=0$ and $t=t_{l}$.

Then, the area bounded by our curve $C$ is $A = \int_C x \ dy = \int_0^L x(t)y'(t) \ dt$ by Green's theorem.

Now we can construct an "approximate" circle to our curve, that is parameterized by $r(t) = (x(t), z(t))$ and is also tangent to $l$ and $l'$ (note that they are parameterized by the same x-vector). Call the radius of the circle $R$ and let its center be the origin of our coordinate system. Using our previous parameters, we will further fix this circle by

$$ z(t) = \begin{cases} +\sqrt{R^2 - x(t)^2}, \ t \in [0,t_l] \\ -\sqrt{R^2 - x(t)^2}, \ t \in (t_l, L] \end{cases} $$

to ensure it lies between $l$ and $l'$ and is, well, a circle. Then, the area bounded by our circle is $\pi R^2 = - \int_\textrm{circle} y \ dx = - \int_0^L z(t)x'(t) \ dt$.

Adding our areas together, we get,

$2R\sqrt{A\pi} \leq A + \pi R^2 = \int_0^L (xy' - zx') \ dt \leq \int_0^L \sqrt{(x^2 + z^2)(y'^2 + x'^2)} \ dt = LR$

as $x^2 + z^2 = R^2$ and $y'^2 + x'^2 = 1$. The first inequality is by the arithmetic-geometric mean inequality (AM-GM), and the second one is by the Cauchy-Schwarz inequality applied to $(x, z)$ and $(y',-x')$. So putting it all together, $2R\sqrt{A\pi} \leq LR$, or $4\pi A \leq L^2$.

$\blacksquare$

To get equality, we need equality in both inequalities. To get equality in AM-GM, we need $A = \pi R^2$ and thus $L = 2\pi R$. To get equality in Cauchy-Schwarz, we want our vectors to be parallel, i.e. $\frac{z}{x} = \frac{-x'}{y'} \rightarrow zy' = -xx'$. Substituting this into

$(xy' - zx')^2 = (x^2 + z^2)(y'^2 + x'^2)$

from above nets that $x = \pm Ry'$ and $y = \pm Rx'$, or in other words by squaring and summing both equations, $x^2 + y^2 = R^2$, which is precisely the definition of a circle.
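To get a feel for the inequality $4\pi A \leq L^2$, here is a rough sketch that computes the ratio $\frac{4\pi A}{L^2}$ for regular polygons. It assumes the standard area formula for a regular $n$-gon with perimeter $L$, namely $A = \frac{L^2}{4n\tan(\pi/n)}$; the function name is mine. The ratio stays below 1 and creeps up towards it as the polygon gets rounder.

```python
from math import pi, tan

def isoperimetric_quotient(n_sides: int) -> float:
    """4*pi*A / L^2 for a regular polygon with n_sides sides.

    With perimeter L, the area is A = L^2 / (4 n tan(pi/n)),
    so the quotient simplifies to pi / (n tan(pi/n)).
    """
    return pi / (n_sides * tan(pi / n_sides))

for n in (3, 4, 6, 12, 100, 10_000):
    print(n, round(isoperimetric_quotient(n), 6))
```

The square, for instance, comes out at $\frac{\pi}{4} \approx 0.785$, so it encloses only about 79% of the area a circle with the same perimeter would.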

So?

This problem generalizes to higher dimensions, with higher-dimensional balls being the solution. But what I find most interesting is how quickly this problem changes with a slight tweak. We now know for sure circles are the best shape for a single container; an orange grows in the way that gives me the most fruit in that single orange. But what if I had multiple oranges? Circles are awful to pack together as unlike, say, squares that nestle nicely together, circles leave so much extra room.

Even when optimally placed here, still about 10% of area is left uncovered.

So, if we wanted to create the optimal shape for stacking together using the least materials, what shape would that be? Clearly the answer is not a circle since it doesn't stack efficiently, and we have at the very least a square as our starting point. It turns out the answer, as nature has found through bees, is the hexagon. Perhaps another day we will go through the proof, but Thomas Hales's original proof of this Honeycomb conjecture is dense enough to leave alone for now.

What is even more interesting is that, unlike the isoperimetric problem from above, the 3D Honeycomb conjecture is still unsolved: what is the shape with the best volume to surface area ratio that tessellates space? The Weaire-Phelan structure gives a non-obvious arrangement, built from two kinds of polyhedral cells, that's the current record holder for known shapes, but it has not been proven to be the best.

The study of minimal surfaces tends to lead to problems like the ones above, and just how deceptively difficult they are is what makes them so interesting. Inspiration from physics and observed phenomena in nature is so helpful in guiding us towards intuitions, yet the rigorous proof math demands is so often buried by our own self-defeating intuitions. If you're interested in more like this, I'd suggest skimming over one of our previous discussions on the calculus of variations.
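This is nowhere near a proof of the Honeycomb conjecture, but a small sketch can at least compare the three regular polygons that tile the plane by themselves. It assumes the standard formula for the perimeter of a regular $n$-gon of area 1, $2\sqrt{n\tan(\pi/n)}$, and the function name is my own. In an actual honeycomb each wall is shared by two cells, but that halves every number equally, so the ranking doesn't change.

```python
from math import pi, sqrt, tan

def perimeter_for_unit_area(n_sides: int) -> float:
    """Perimeter of a regular n-gon whose area is 1."""
    return 2 * sqrt(n_sides * tan(pi / n_sides))

# The only regular polygons that tile the plane on their own.
for name, n in (("triangle", 3), ("square", 4), ("hexagon", 6)):
    print(name, round(perimeter_for_unit_area(n), 4))

# For comparison: a circle of area 1, which does not tile.
print("circle", round(2 * sqrt(pi), 4))
```

The hexagon needs the least perimeter of the three (about 3.72 against 4 for the square), and sits closest to the circle's roughly 3.54.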

The Language of Logic and Metatheoretic Magic

Adi Mittal

Why even mathematicians should care about history

Pythagorean Theorem: the sum of the squares of the length of the legs of a right triangle is equal to the square of the hypotenuse.

The Pythagorean theorem is perhaps the most famous statement in all of mathematics. $a^2 + b^2 = c^2$ is practically drilled into the minds of everyone that went through any form of secondary school education.

But let me ask you: how do you know it is true? It's not like you've looked at every right triangle and checked it follows the theorem. There are many, many proofs of the Pythagorean theorem, but what makes you know that all that fancy symbol manipulation is really legit? This might seem like an inane question, for math follows strict, rigorous, logical argument. But what makes this logical argument valid? Why do we trust logic so readily? There are many ideas that are intuitive that are false, and unintuitive that are true, yet logic seems to get a pass from everyone. Why?

Over the past year, I took my university's introductory sequence for logic, which were probably the most enlightening classes I've ever taken. It forced me to reflect on not only what mathematics meant to me, but also what knowledge and deduction as a whole do too. This post is a summary collecting the highlights of the courses, to act as a checkpoint to refer back to from future posts when I explore some of the most powerful ideas that have eluded me in what feels like a forgotten field of study.

As this will be quite a bit longer than my usual posts, the table of contents below lists the sections in the order they should be read, and all context needed for one section will be given in the sections before it. I'll be following the conventions and terminology as per V. Halbach's The Logic Manual and A. Eagle's Elements of Deductive Logic. Brief examples will be given, but for more examples and exercises, see the texts above. For the best viewing experience, read this on a computer or wider-screen device for the equations to format as they were intended to be.

The Language of Logic
Introduction to Metatheory
Soundness and Completeness
- The Soundness Theorem
- The Completeness Theorem
What's Next?
Conclusion

Part 1: The Language of Logic

What Even is Logic?

English is bad. Notoriously so. Try interpreting the following sentence: "The old man the boat." This might not even read as a correct sentence to some of you. The lexical ambiguity in that "man" is being used as a verb, even if it is more commonly a noun, is what makes this sentence lead one astray. Or, what about, "The professor said there would be an exam on Monday." Does the professor mean there will be an exam to be taken on the upcoming Monday, or was it that this past Monday he claimed there would be an exam in the indeterminate future? The structural ambiguity inherent to what "on Monday" qualifies makes the sentence on its own unclear.

English, and natural language as a whole, relies on expectations being met by both the speaker and listener for effective communication. We'll touch on this more later, but the inherent vagueness of our language, while it gives it the nuance that we praise in art and literature, makes it really hard to analyze the truth of what people claim when they talk, especially in writing. Inflection and tone may give cues to what is precisely being said, but sometimes we want the content of our statement to be independent of any necessary human interpretation or judgement.

Take the following claim: "If Socrates is a man, then he is mortal. Socrates is a man. Therefore, Socrates is mortal." Few would dispute this argument, but there is very little content we needed to be okay with this argument. We don't care who Socrates is, or what it means to be a mortal or a man; there is something embedded into the nature of the argument that makes it good. "The Earth is flat if 2+2=5. 2+2=5. Thus the Earth is flat," is a perfectly good argument assuming that 2+2=5, so while it might not be true, it certainly is still a good inference given those facts. "If $P$, then $Q$. $P$. Therefore, $Q$," seems to always be a sound claim.

Logic, in this way, is the study of valid arguments. Given a set of premises, what conclusions can we deduce?

$\mathcal{L}_1$: A First Attempt

Above we already were able to begin generalizing a little bit. We came up with the general form, "If $P$, then $Q$. $P$. Therefore, $Q$." $P$ and $Q$ could be almost any sentence, and that argument would always hold. Almost any sentence, in that our sentences have to be truth evaluable; parsing questions or exclamations does not elicit any new information for our claims, so it does not really make sense to consider them. Even if we do not know if $P$ and $Q$ are actually true or false, we can still consider the different cases to check our argument structure.

Here we have the beginning of our first formal language: the logic of $\mathcal{L}_1$. We have already been able to characterize sentences in the form of sentence letters, i.e. $P$ and $Q$, and if we need more basic sentences we can add $R$, $P_2$, $Q_{3984}$, and others (by convention, only $P,Q,R$ are the allowable sentence letters, with additional ones being subscripted). But this is a bit primitive.

Connectives

What makes arguments in English interesting is not the basic sentences, but how we are able to link them together via "If… then…", "…and…", "…or…", and other conjunctions. Can we formalize these in any way?

Let's think about a simple one: "…and…". Just like before, we can only care about sentences that are true or false, so a natural question will be when is " $P$ and $Q$ " true or false? "…and…" unifies two disparate sentences, so if the whole sentence was true, we would expect whatever is on either side of the "and" to also be true. To say that, "Grass is green and the Earth is flat," is a true sentence seems to ignore what the sentence is claiming. "…and…" states both subsentences "Grass is green" and "The Earth is flat" together at once, so if we think it is true, then both necessarily seem to have to be true, too. We can represent this as a truth table.

$ \begin{array}{c|c|c} P & Q & (P \wedge Q) \\ \hline T & T & T \\ T & F & F \\ F & T & F \\ F & F & F \\ \end{array} $

where $T$ means the sentence is true and $F$ false. Whenever $P$ and $Q$ are not both true, their conjunction is not either. As seen in the table above, we don't write "…and…" in $\mathcal{L}_1$, but rather use the symbol $\wedge$.

We call $\wedge$ a connective, for it connects two sentence letters together. Below are some other commonly used English connectives formalized in $\mathcal{L}_1$ with their corresponding truth tables.

$ \begin{array}{c|c|c} \textbf{English} & \mathcal{L}_1 & \textbf{Truth Table} \\ \hline \textrm{…and…} & \wedge &
\begin{array}{c|c|c} P & Q & (P \wedge Q) \\ \hline T & T & T \\ T & F & F \\ F & T & F \\ F & F & F \\ \end{array} \\ \hline \textrm{…or…} & \ \vee \ \ &
\begin{array}{c|c|c} P & Q & (P \vee Q) \\ \hline T & T & T \\ T & F & T \\ F & T & T \\ F & F & F \\ \end{array} \\ \hline \textrm{It is not the case that…} & \neg &
\begin{array}{c|c} P & \neg P \\ \hline T & F \\ F & T \\ \end{array} \\ \hline \textrm{If…then…} & \rightarrow &
\begin{array}{c|c|c} P & Q & (P \rightarrow Q) \\ \hline T & T & T \\ T & F & F \\ F & T & T \\ F & F & T \\ \end{array} \\ \hline \textrm{…if and only if…} & \leftrightarrow &
\begin{array}{c|c|c} P & Q & (P \leftrightarrow Q) \\ \hline T & T & T \\ T & F & F \\ F & T & F \\ F & F & T \\ \end{array} \\ \end{array} $

It's worth taking a second to make sure you can get behind that these formalizations are reasonable. I won't go through all of them as we did with "…and…", but there are some things to take note of that do differ a bit in English.

$\vee$ is specifically the truth table of inclusive or. In English, if one says "I'll either buy a pair of sneakers or sandals," we expect that person to usually just buy one option or the other, not both. $\vee$ allows the option for both, but does not require it to be true.
$\rightarrow$ is technically different from "If…then…", even though I put them in the same row in the table. When we say, "If a butterfly lands in North America, then there will be an earthquake in Australia tomorrow", we can definitely say that is false if the butterfly does land in North America and no earthquake occurs, since they are definitely not causally related then. But if the butterfly does NOT land in North America, then we are in no position to say whether or not there is a relationship then since we have nothing to observe. Similarly, if an earthquake happens, we can't say if that was because of the butterfly or not regardless of what happens since correlation and causation are different. So really, the truth table for "If…then…" is $ \begin{array}{c|c|c} P & Q & \textrm{If} \ P \ \textrm{then} \ Q \\ \hline T & T & ? \\ T & F & F \\ F & T & ? \\ F & F & ? \\ \end{array} $ We say it is not truth functional for those question marks there since we cannot evaluate whether that sentence is true or not, but for the sake of convenience, we use $\rightarrow$ with the completed table.
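Since each connective is nothing more than a function from truth values to a truth value, the tables above can be reproduced in a few lines of Python. This is only a sketch: the dictionary keys and the `truth_table` helper are names I made up, with `True` and `False` standing in for $T$ and $F$.

```python
from itertools import product

# Each binary connective is a function from two truth values to one.
CONNECTIVES = {
    "and": lambda p, q: p and q,
    "or": lambda p, q: p or q,              # inclusive or
    "if-then": lambda p, q: (not p) or q,   # the completed table for ->
    "iff": lambda p, q: p == q,
}

def truth_table(name: str) -> list[tuple[bool, bool, bool]]:
    """Rows (P, Q, result) in the order TT, TF, FT, FF."""
    f = CONNECTIVES[name]
    return [(p, q, f(p, q)) for p, q in product([True, False], repeat=2)]

for name in CONNECTIVES:
    print(name, ["T" if r else "F" for _, _, r in truth_table(name)])
```

Reading down each printed list gives the last column of the corresponding table, e.g. T, F, T, T for the conditional.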

The Syntax and Semantics of $\mathcal{L}_1$

Now, we can upgrade from sentence letters to complex sentences in our logic. We will define a sentence in $\mathcal{L}_1$ as follows:

All sentence letters $P, Q, R, P_1,…$ are sentences.
If $\Phi$ and $\Psi$ are sentences, then $\neg\Phi$, $(\Phi \wedge \Psi)$, $(\Phi \vee \Psi)$, $(\Phi \rightarrow \Psi)$, and $(\Phi \leftrightarrow \Psi)$ are sentences.
Nothing else is a sentence.

Note the use of parentheses to bracket the sentences. This is our way to remove all structural ambiguity by clearly stating the scope of the connectives; there is no question that $(P \wedge (Q \vee R))$ is different from $((P \wedge Q) \vee R)$ as the parentheses explicitly denote which subsentences each connective is linking together. There are conventions to remove these brackets for convenience, which I may default to later, but as a whole, they are not important.
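The recursive definition translates almost word for word into code. In the sketch below I encode sentences as nested tuples such as `("and", "P", ("or", "Q", "R"))`, so the nesting plays the role of the brackets. The encoding, the connective names, and the function `is_sentence` are all assumptions of mine for illustration, not anything from the textbooks.

```python
import re

BINARY = {"and", "or", "if", "iff"}
LETTER = re.compile(r"[PQR](_\d+)?")  # P, Q, R, optionally subscripted

def is_sentence(s) -> bool:
    """Mirror the three clauses of the definition on a nested-tuple encoding."""
    if isinstance(s, str):                               # sentence letters
        return LETTER.fullmatch(s) is not None
    if isinstance(s, tuple) and len(s) == 2 and s[0] == "not":
        return is_sentence(s[1])                         # negation
    if isinstance(s, tuple) and len(s) == 3 and s[0] in BINARY:
        return is_sentence(s[1]) and is_sentence(s[2])   # binary connectives
    return False                                         # nothing else

print(is_sentence(("and", "P", ("or", "Q", "R"))))  # True
print(is_sentence(("and", "P")))                    # False
```

The two bracketings from the paragraph above become two visibly different tuples, `("and", "P", ("or", "Q", "R"))` and `("or", ("and", "P", "Q"), "R")`.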

The last detail we are missing are interpretations. Before, we were talking about how logic does not care about human judgement, but we can consider the judgement of the universe. What is the truth value of the sentence $\neg(P \wedge Q \rightarrow P \vee R)$? We could make a truth table, but clearly the truth of the sentence depends on the truth values of $P$, $Q$, and $R$. So under an $\mathcal{L}_1$-interpretation or $\mathcal{L}_1$-structure that assigns a truth value to every sentence letter, we can assign truth values to sentences. For example, if we let a structure $A$ be such that $|P|_A = T$, $|Q|_A = T$, and $|R|_A = F$, then $|\neg(P \wedge Q \rightarrow P \vee R)|_A = F$ by its truth table. The idea of a structure is that it is a model of a universe, where we designate certain sentences/facts to be true or false, and see how the rest of the other complex facts (that are linked by connectives) change in accordance to that possible universe.

Notation: $|\cdot|_A$ will stand for the meaning or semantics under a structure $A$.
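To make the idea of evaluating a sentence under a structure concrete, here is a minimal sketch. A structure is modelled as a Python dictionary from sentence letters to booleans, and sentences are nested tuples like `("and", "P", "Q")`; both encodings and the name `evaluate` are my own choices.

```python
def evaluate(s, structure: dict) -> bool:
    """Truth value |s|_A, where structure maps sentence letters to booleans."""
    if isinstance(s, str):  # a sentence letter: look it up
        return structure[s]
    op, *args = s
    vals = [evaluate(a, structure) for a in args]
    if op == "not":
        return not vals[0]
    if op == "and":
        return vals[0] and vals[1]
    if op == "or":
        return vals[0] or vals[1]
    if op == "if":
        return (not vals[0]) or vals[1]
    if op == "iff":
        return vals[0] == vals[1]
    raise ValueError(f"unknown connective: {op}")

# The structure A from the example: P and Q true, R false.
A = {"P": True, "Q": True, "R": False}
sentence = ("not", ("if", ("and", "P", "Q"), ("or", "P", "R")))
print(evaluate(sentence, A))  # False
```

Changing the dictionary is the same as moving to a different possible universe and re-reading the same row of the truth table.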

Now most logical sentences one encounters will be true sometimes, and false sometimes depending on the structure it's evaluated in. But consider the sentence $(P \vee \neg P)$. Looking at its truth table,

$ \begin{array}{c|c} P & P \vee \neg P \\ \hline T & T \\ F & T \\ \end{array} $

Every structure assigns a truth value to every sentence letter, so $P$ must be true or false in every structure, but regardless, $(P \vee \neg P)$ is always true. Sentences like this, which are true in every structure, are called tautologies or logical truths. They are, in a sense, facts of nature, for no matter what universe we are looking at, "Either the sky is blue or it is not" will be true. At least, usually.

On the other hand, contradictions like $(P \wedge \neg P)$ are false in every structure.

For future reference, the following notation will be helpful. For a given structure $A$ and sentences $\Phi$ and $\Psi$ :

If $|\Phi|_A = T$, we say $A$ satisfies $\Phi$, written $A \vDash \Phi$
If $|\Phi|_A = F$, we say $A$ does not satisfy $\Phi$, written $A \nvDash \Phi$
If $\Phi$ is a tautology, i.e. $A \vDash \Phi$ for all structures $A$, we just write $\vDash \Phi$
If $\Phi$ is a contradiction, i.e. $A \nvDash \Phi$ for all structures $A$, we just write $\Phi \vDash$
If $|\Phi|_A = |\Psi|_A$ for all structures $A$, we say $\Phi$ and $\Psi$ are logically equivalent, written $\Phi \equiv \Psi$

The last bullet point is just to say that there are lots of sentences that express the exact same truth table despite being written differently. For example, $P \rightarrow Q \equiv \neg P \vee Q$, so it's useful to say when two sentences are actually essentially the same.
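All three of these notions (tautology, contradiction, equivalence) quantify over every structure, and for finitely many sentence letters we can just enumerate them. A rough sketch, where sentences are modelled as Python functions from a structure (a dictionary) to a boolean; every name here is hypothetical.

```python
from itertools import product

def structures(letters):
    """Yield every assignment of truth values to the given sentence letters."""
    for values in product([True, False], repeat=len(letters)):
        yield dict(zip(letters, values))

def is_tautology(sentence, letters) -> bool:
    return all(sentence(A) for A in structures(letters))

def is_contradiction(sentence, letters) -> bool:
    return not any(sentence(A) for A in structures(letters))

def equivalent(s1, s2, letters) -> bool:
    return all(s1(A) == s2(A) for A in structures(letters))

# The truth table for -> written out row by row, keyed by (P, Q).
ARROW = {(True, True): True, (True, False): False,
         (False, True): True, (False, False): True}

def excluded_middle(A):
    return A["P"] or not A["P"]

def absurdity(A):
    return A["P"] and not A["P"]

def p_arrow_q(A):
    return ARROW[A["P"], A["Q"]]

def not_p_or_q(A):
    return (not A["P"]) or A["Q"]

print(is_tautology(excluded_middle, ["P"]))           # True
print(is_contradiction(absurdity, ["P"]))             # True
print(equivalent(p_arrow_q, not_p_or_q, ["P", "Q"]))  # True
```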

Sets of Sentences

In addition to the semantics of individual sentences, we can discuss sets of sentences too. Consider $\Gamma = \{\Phi, \Psi,…\}$.

A structure $A$ satisfies $\Gamma$ if and only if $|\gamma|_A = T \ \ \ \forall \gamma \in \Gamma$, written $A \vDash \Gamma$
We say $\Gamma$ is consistent or satisfiable if and only if there is a structure $A$ such that $A \vDash \Gamma$

Validity of an Argument

We are finally in a position to start talking concretely and specifically about what makes a good argument.

When someone has argued a claim (conclusion) from a series of facts and assumptions (premises), we want there to be a substantive connection between their premises and the conclusion. Just like with our truth table for "If…then…", we know there is no connection if there is a scenario in which our premises are true and our conclusion isn't. So if an argument is valid, then whenever the premises are true, the conclusion must also be true.

We can formalize this a little bit with the notion of semantic entailment:

Definition: Given a set of sentences (premises) $\Gamma = \{\Phi, \Psi,…\}$ and a single sentence $\varphi$ (conclusion), we say that $\Gamma$ semantically entails $\varphi$ if and only if for all structures $A$, if $A \vDash \Gamma$, then $A \vDash \varphi$.

In other words, whenever all of our premises are true, our conclusion is true.

When $\Gamma$ semantically entails a sentence $\varphi$, denoted $\Gamma \vDash \varphi$, then $\varphi$ is a valid conclusion from premises $\Gamma$. It's called semantic entailment since we are working with the semantics of the individual sentences, namely their truth values. In the language of consistency, we can also say that $\Gamma \vDash \varphi$ if and only if $\Gamma \cup \{\neg \varphi\}$ is inconsistent.

Claim: $\Gamma \vDash \varphi$ if and only if $\Gamma \cup \{\neg \varphi\}$ is inconsistent.

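Proof: Suppose that $\Gamma \vDash \varphi$. Say that $A$ is a structure that satisfies $\Gamma$ i.e. $|\gamma|_A = T \ \ \ \forall \gamma \in \Gamma$. Then by our assumption, $|\varphi|_A = T$. By the rules for $\neg$, then $|\neg \varphi|_A = F$. For any structure $B$ that does not satisfy $\Gamma$, $\exists \gamma \in \Gamma \ \ \ |\gamma|_B = F$. So, for all structures, at least one sentence in $\Gamma \cup \{\neg \varphi\}$ is false, so it is inconsistent. Conversely, suppose $\Gamma \cup \{\neg \varphi\}$ is inconsistent, and let $A$ be any structure that satisfies $\Gamma$. Since $A$ cannot satisfy all of $\Gamma \cup \{\neg \varphi\}$, we must have $|\neg \varphi|_A = F$, so $|\varphi|_A = T$. Hence $\Gamma \vDash \varphi$. $ \ \blacksquare$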

It's worth pointing out that we have two uses for the symbol $\vDash$, one relating structures to sentences, and another relating sentences to other sentences. The context will specify which use we mean, but it is useful to keep in mind that we might double up on symbols when their related meanings are so similar.

Let's look at the example we've been holding this whole time: "If $P$, then $Q$. $P$. Therefore, $Q$." If this is a valid argument, then the formalization would say that $P \rightarrow Q, P \vDash Q$ is correct. We can verify this just by truth tables:

$ \begin{array}{c|c||c} (P \rightarrow Q) & P & \vDash & Q \\ \hline \color{blue}{T} & \color{blue}{T} & & \color{blue}{T} \\ F & T & & F \\ T & F & & T \\ T & F & & F \\ \end{array} $

Whenever the premises are true, so is the conclusion. While technically an okay way to check, it is inefficient: every additional sentence letter in our premises doubles the number of rows in our truth table to check, so imagine an argument with not 2 but 3, 4, or more sentence letters. Instead, we can show it just as easily with a proof by contradiction. Suppose our argument is invalid i.e. there is a structure where our premises are true and our conclusion is false.

$ \begin{array}{c|c||c} (P \rightarrow Q) & P & \vDash & Q \\ \hline T & T & & F \\ \end{array} $

But, if $P$ is true and $Q$ is false, then by the rules for $\rightarrow$, we have $P\rightarrow Q$ is false.

$ \begin{array}{c|c||c} (P \rightarrow Q) & P & \vDash & Q \\ \hline T & T & & F \\ F & & & \end{array} $

That contradicts our assumption that $P\rightarrow Q$ was true as a premise. So, it can't be the case that our argument is invalid.
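The brute-force truth table check is easy to automate, which also makes the doubling cost visible: the loop below visits $2^k$ structures for $k$ sentence letters. This is a sketch under my own encoding (sentences as functions from a structure to a boolean, with hypothetical helper names), and it also checks the consistency formulation from the claim above.

```python
from itertools import product

def structures(letters):
    """Yield every assignment of truth values to the given sentence letters."""
    for values in product([True, False], repeat=len(letters)):
        yield dict(zip(letters, values))

def entails(premises, conclusion, letters) -> bool:
    """Gamma |= phi: every structure satisfying all premises satisfies phi."""
    return all(conclusion(A) for A in structures(letters)
               if all(g(A) for g in premises))

def satisfiable(sentences, letters) -> bool:
    """True if some structure makes every sentence in the set true."""
    return any(all(s(A) for s in sentences) for A in structures(letters))

def p(A):
    return A["P"]

def q(A):
    return A["Q"]

def p_implies_q(A):
    return (not A["P"]) or A["Q"]

def not_q(A):
    return not A["Q"]

print(entails([p_implies_q, p], q, ["P", "Q"]))          # True
print(satisfiable([p_implies_q, p, not_q], ["P", "Q"]))  # False
```

Swapping the roles of $P$ and $Q$ in the second premise and the conclusion (affirming the consequent) makes `entails` return `False`, since the structure with $P$ false and $Q$ true is a counterexample.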

Quirks of Validity

Keep in mind our definition of a valid argument: $\Gamma \vDash \varphi$ if and only if whenever $\Gamma$ is satisfied, so is $\varphi$. Now consider the argument $P \wedge \neg P \vDash Q$. Is this valid? Even though the premises have nothing to do with the conclusion, surprisingly, this is valid. $P \wedge \neg P$ technically satisfies the condition that whenever it is true (never), so is $Q$; it never has the chance to fail our definition. For this we can write $P \wedge \neg P \vDash$ since it will entail everything by our definition: anything can go on the right-hand side of the turnstile, and this is why we shorten all contradictions accordingly.

Similarly, for tautologies we write $\vDash P \vee \neg P$: since they are always true, certainly whenever whatever is on the left is true, so is the tautology.

So we can have weird technically valid arguments like, "The sky is blue and not blue, therefore the Earth is flat," or similarly that "Squares have 5 corners, so either it will rain today or it won't." These both read as absurd, but they are technically valid with our definition. These might seem like a flaw for our definition, but for the convenience of all the other arguments, I think it is worth accepting as an oddity of the system. You can also think of them as just sort of fact-of-the-matter statements; tautologies are entailed by everything since they don't care what makes you think they are true, they just are. Similarly with contradictions, your starting set of premises are just inconsistent, and it is just a bad argument; if you're willing to accept bad premises, you can deduce bad conclusions.
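The vacuous case is the same quirk programmers meet with "for all" over an empty collection. A tiny sketch for $P \wedge \neg P \vDash Q$ (variable names are mine): no structure satisfies the premise, so the check over the satisfying structures has nothing to fail on.

```python
from itertools import product

def premise(p: bool) -> bool:
    return p and not p   # P ∧ ¬P

def conclusion(q: bool) -> bool:
    return q             # Q

rows = list(product([True, False], repeat=2))         # all (P, Q) structures
satisfying = [(p, q) for p, q in rows if premise(p)]  # structures where the premise holds

print(satisfying)                                 # []
print(all(conclusion(q) for _, q in satisfying))  # True
```

Python's `all` returns `True` on an empty list for exactly the reason the argument counts as valid: there is no counterexample to find.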

Proof Systems and Natural Deduction

We have established quite a bit already in $\mathcal{L}_1$, but it is still not that elegant of a system. We now have a way to formalize English arguments, but our ways to validate them are still unwieldy. The only two ways we really have to verify statements are, in essence, brute force checking $\mathcal{L}_1$-structures to see if the argument fails anywhere. This is no better than checking every right-angled triangle to see if it satisfies the Pythagorean theorem. We want a systematic way to go about our arguments.

We want a proof system: given a set of premises $\Gamma$, we want to be able to manipulate them in some way to conclude $\varphi$.

For example, consider the argument $P \wedge Q, R \vDash P \wedge R$. Without checking any structures and evaluating truth tables, this argument should make sense. If we know $P \wedge Q$, i.e. $P$ and $Q$, we certainly know both $P$ and $Q$ are true individually, so we know $P$. Since we also know $R$ is true, then again it seems obvious we know $P$ and $R$ is also true together i.e. $P \wedge R$. We did not touch any semantics at all of our premises or conclusion (I used the word "true" as a means for accepting a premise), but rather just moved our sentence letters around in a sensible way to get the conclusion.

We just found two rules that make sense to have in our proof system: $\wedge$-Introduction and $\wedge$-Elimination.

$ \begin{array}{ccc} & \wedge\textrm{-Intro} & \\ \Phi & & \Psi \\ \hline & \Phi \wedge \Psi & \\ \end{array} \ \ \ \ \begin{array}{c} \wedge\textrm{-Elim1} \\ \Phi \wedge \Psi \\ \hline \Phi \\ \end{array} \ \ \ \ \begin{array}{c} \wedge\textrm{-Elim2} \\ \Phi \wedge \Psi \\ \hline \Psi \\ \end{array} $

So our above proof of $P \wedge Q, R \vdash P \wedge R$ would look like

$ \begin{array}{ccc} \underline{P \wedge Q} & & \\ P & & R \\ \hline & P \wedge R & \\ \end{array} $

Thinking through what the other connectives mean can give some intuitive rules for them too:

$ \begin{array}{c|c|c} \textbf{Connective} & \textbf{Introduction} & \textbf{Elimination} \\ \hline \wedge & \begin{array}{c} \begin{array}{cc} \Phi & \Psi \\ \end{array} \\ \hline \Phi \wedge \Psi \end{array}
& \begin{array}{c} \Phi \wedge \Psi \\ \hline \Phi \\ \end{array} \ \ \ \ \begin{array}{c} \Phi \wedge \Psi \\ \hline \Psi \\ \end{array} \\ \hline \vee & \begin{array}{c} \Phi \\ \hline \Phi \vee \Psi \\ \end{array} \ \ \ \ \begin{array}{c} \Psi \\ \hline \Phi \vee \Psi \\ \end{array} & \begin{array}{c} \begin{array}{ccc} & [\Phi] & [\Psi] \\ & \vdots & \vdots \\ \Phi \vee \Psi & \Delta & \Delta \\ \end{array} \\ \hline \Delta \end{array} \\ \hline \neg & \begin{array}{c} \begin{array}{cc} [\Phi] & [\Phi] \\ \vdots & \vdots \\ \Psi & \neg\Psi \\ \end{array} \\ \hline \neg \Phi \end{array}
& \begin{array}{c} \begin{array}{cc} [\neg\Phi] & [\neg\Phi] \\ \vdots & \vdots \\ \Psi & \neg\Psi \\ \end{array} \\ \hline \Phi \end{array} \\ \hline \rightarrow & \begin{array}{c} [\Phi] \\ \vdots \\ \Psi \\ \hline \Phi \rightarrow \Psi \\ \end{array}
& \begin{array}{c} \begin{array}{cc} \Phi & \Phi\rightarrow\Psi \\ \end{array} \\ \hline \Psi \end{array} \\ \hline \leftrightarrow & \begin{array}{c} \begin{array}{cc} [\Psi] & [\Phi] \\ \vdots & \vdots \\ \Phi & \Psi \\ \end{array} \\ \hline \Phi\leftrightarrow\Psi \end{array}
& \begin{array}{c} \begin{array}{cc} \Phi & \Phi\leftrightarrow\Psi \\ \end{array} \\ \hline \Psi \end{array}
\ \ \ \ \begin{array}{c} \begin{array}{cc} \Psi & \Phi\leftrightarrow\Psi \\ \end{array} \\ \hline \Phi \end{array} \\ \end{array} $

The square brackets are assumed premises. They are not explicitly given, but, for the sake of argument, you can take a "fake" premise to demonstrate a rule, and discard that assumption after you've used your rule. Some other comments on the rules:

$\vee$-Intro basically says if you know one sentence, appending any other sentence via $\vee$ is certainly fine since you already have a known fact. $\vee$-Elim says if you know one or the other (or both) of $\Phi$ and $\Psi$ is true, but do not know which, you can deduce a new sentence $\Delta$ if it follows from either of the sentences.
$\neg$-Rules are basically proofs by contradiction. If $\Phi$ results in both $\Psi$ and $\neg\Psi$, $\Phi$ can't be held true since it would prove contradictory results. Similarly with assuming $\neg\Phi$.
$\rightarrow$-Intro follows in line with for-the-sake-of-argument. If we assume $\Phi$, and can deduce $\Psi$, then regardless if we know $\Phi$ or not, we can say that it would logically imply $\Psi$ as well.

This specific proof system is known as natural deduction. These proof trees give us a direct way to interact with our premises in a way that we might normally do in English. However, some of these proofs do become huge quite quickly. Consider the proof of $P\rightarrow (Q\leftrightarrow R) \vdash P \wedge Q \leftrightarrow P \wedge R$:

Some of you might have noticed a small change in notation; I've been using $\vdash$ as opposed to $\vDash$. In proofs, since we are not considering structures and the truth values of our sentences, semantic entailment does not seem like the right way to describe the relationship between our premises and the conclusion. We now say $\Gamma \vdash \varphi$ if $\Gamma$ syntactically entails, or simply proves via our proof system, $\varphi$, since we are strictly only concerned with the symbols and the actual writing of the sentence as opposed to its truth.

But we should not forget our original goal of logic: to demonstrate arguments are valid, which involves $\Gamma \vDash \varphi$, which seems to have a very different meaning to $\Gamma \vdash \varphi$. It happens that our choice of rules above was not random. As we will see later, natural deduction is a sound system:

Soundness Theorem: If $\Gamma \vdash \varphi$, then $\Gamma \vDash \varphi$.

Soundness can be thought of as the quality of preserving truth; we can rely on our proofs to deduce meaningful, accurate conclusions in our formal language. Given our rules, we would hope that these are sound, since they are based on how people typically argue in real life, too. You might come across other proof systems, but I can guarantee you, they are designed to be sound and useful.

It's worth mentioning at this point that we have a notion of consistency in a proof-based sense, just as we have in a semantic sense. We say a set of sentences $\Gamma$ is ND-consistent (ND for natural deduction) if there exists a sentence $\Phi$ that $\Gamma$ does not prove; i.e. $\exists \Phi$ s.t. $\Gamma \nvdash_{ND} \Phi$. In general, we say $\Gamma$ is D-consistent if there is a sentence $\Phi$ such that $\Gamma \nvdash_{D} \Phi$ in the deductive system $D$. A good way of thinking about this is that if we have bad premises, we can deduce anything, since our assumptions were contradictory in some way. For example, $\{P, \neg P\}$ is ND-inconsistent since, by $\neg$-Elim, it can prove anything.

Fortunately enough for us, ND-consistency is equivalent to semantic consistency, but we'll see that later.

This concludes our discussion of $\mathcal{L}_1$. The system and rules we've outlined above are sometimes also referred to as propositional logic.

$\mathcal{L}_2$: The Logical Quantifiers

Let's go back to our very first argument we discussed: "If Socrates is a man, then he is mortal. Socrates is a man. Therefore, Socrates is mortal." We were able to show it is valid in $\mathcal{L}_1$ by formalizing it, and either checking its truth table or using our proof system. Let's look at a very similar argument. "All men are mortal. Socrates is a man. Therefore, Socrates is mortal." Perfectly reasonable. If we formalize this with our current logic, though, we end up with $P, Q \vDash R$, where $P$ is "All men are mortal," $Q$ is "Socrates is a man", and $R$ is "Socrates is mortal". Clearly this is not a valid argument in $\mathcal{L}_1$, since we can just specify a structure where $|P|_A = |Q|_A = T$ and $|R|_A = F$, as they are separate sentence letters.

Predicates and Constants

$\mathcal{L}_1$ is good for keeping sentences simple, but it lacks the nuance of the predicates English has; we will never be able to show the relation of men and mortality, as we can't express "…are mortal" nicely; relations between simpler sentences are covered by connectives, but relations between objects are not. We need something to convey properties somehow. Ideally, we want to be able to say Socrates is mortal, by first saying he is a man, and to further show that anything that has the property of being a man also has the property of being mortal.

To formalize "Socrates is a man", we can introduce predicates kind of like connectives. Let's have $P^1$ stand for "…is a man", and for convenience, we'll let the constant $a$ stand for Socrates, i.e. $|a|_A = \textrm{Socrates}$ (remember, $|\cdot|_A$ denotes meaning). Then $P^1a$ is a nice compact way to write out "Socrates is a man".

Now how do we formalize that this is a true sentence? We don't want to have to explicitly say $|P^1a|_A = T$, since then we would have to specify that for everything that is a man, there is a constant associated with that human and that when put together with $P^1$ forms a true statement.

Let's take a step back for a moment. If you had to explain to an alien precisely what a human is, how would you do it? You might describe some key traits for them, but the easiest way, in a universal sense, is just to show them by example. Give them a list of humans and have them extrapolate on their own. So perhaps a good semantics for predicates is to do exactly that: we will let $|P^1|_A = \{\textrm{Socrates}, \textrm{Feynman}, \textrm{Noether},…\}$.

Then a natural way of defining truth would be that $|P^1a|_A = T$ if and only if $|a|_A \in |P^1|_A$. That sentence makes sense, since $|a|_A$ is an object/noun, and $|P^1|_A$ is a set of objects, so we can evaluate that, and since $|P^1|_A$ explicitly states all the objects that have that property, seeing if something has that property (like being a man) is a matter of being in the list of objects said to have it. "Socrates is a man" is true iff "Socrates" has the property "is a man" iff Socrates is in the set of all men.

I've been putting a small superscript in our predicate $P^{\color{red}{1}}$. This is the arity index, and basically says how many objects the predicate takes as an input. So, naturally we can have indices greater than 1 as well.

The predicate "…likes…" is a perfectly good predicate, but it now relates two objects to each other. So if Alice likes Bob, we can write this as $P^2ab$ with the relevant meanings for $P^2$ as "…likes…" and the constants referring to Alice and Bob respectively. But note here, order matters. Just because Alice likes Bob, that does not necessarily mean Bob likes Alice, and so for some structures it might be the case that $|P^2ab|_A \neq |P^2ba|_A$.

Just as arity 1 predicates had a set of objects as its semantics, arity 2 has a set of ordered pairs i.e. $|P^2|_A = \{\langle \textrm{Alice}, \textrm{Bob}\rangle, \langle \textrm{Davis}, \textrm{Edward} \rangle, \langle \textrm{Jack}, \textrm{Jack} \rangle \}$ would be a reasonable way to denote the predicate. And similarly, we say $|P^2ab|_A = T$ if and only if $\langle |a|_A, |b|_A \rangle \in |P^2|_A$. So 2-ary predicates define binary relations.

$ \begin{array}{c|c|c} \textbf{Expression} & \textbf{Formalization} & \textbf{Semantic Value} \\ \hline \textrm{constants} & a,b,c,… & \textrm{objects} \\ \textrm{sentence letter} & P,Q,R,… & \textrm{truth value} \ (T,F) \\ \textrm{unary predicate} & P^1,Q^1,R^1,… & \textrm{set of objects} \\ \textrm{binary predicate} & P^2,Q^2,R^2,… & \textrm{set of ordered pairs} \\ \textrm{ternary predicate} & P^3,Q^3,R^3,… & \textrm{set of ordered triples} \\ \textrm{n-ary predicate} & P^n,Q^n,R^n,… & \textrm{set of n-tuples} \\ \end{array} $
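The table translates quite directly into code. Here is a small Python sketch of a toy structure, where the dictionary layout and the names `constants`, `predicates`, and `holds` are my own assumptions for illustration: constants name objects, predicates name sets of tuples, and truth is just membership.

```python
# A toy structure: constants denote objects, predicates denote sets of tuples.
constants = {"a": "Alice", "b": "Bob", "s": "Socrates"}
predicates = {
    ("Man", 1): {("Socrates",), ("Feynman",)},                                   # set of objects
    ("Likes", 2): {("Alice", "Bob"), ("Davis", "Edward"), ("Jack", "Jack")},  # set of ordered pairs
}

def holds(name, *terms):
    """|P t1 ... tn| = T iff the tuple of denoted objects is in the predicate's set."""
    objects = tuple(constants[t] for t in terms)
    # The arity is read off from the number of terms, as in the text.
    return objects in predicates[(name, len(terms))]

print(holds("Man", "s"))         # True
print(holds("Likes", "a", "b"))  # True
print(holds("Likes", "b", "a"))  # False: order matters
```

Keying each predicate by its name together with its arity mirrors the convention that the arity index can be dropped when the number of attached constants makes it clear.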

In practice, the arity index usually doesn't need to be written, as it will be implied by the number of constants attached to the predicate i.e. $Pabc$ is clearly a ternary predicate, as if it wasn't, it would either be incomplete ("a hates b and c, but likes ?"), or it would accept too many arguments ("If a then b"; where does c go?). So in sentences like $Pa \wedge Pab$, each $P$ stands for a different predicate as implied by their arity.

This is the start of the more robust logic $\mathcal{L}_2$. Just for the formalities, here are some things we need to keep in mind over $\mathcal{L}_1$.

Sentence letters are the same as 0-ary predicates; they accept no arguments.
In an $\mathcal{L}_2$-structure, we assign semantics to every predicate (truth values to sentence letters, sets to the other predicates), as well as an object to every constant
Connectives work the same as they have done before
Truth for predicates is equivalent to membership of a set

Quantifiers

So we now have a good way of storing properties of objects via the predicates they satisfy, so we have a good way of saying "Socrates is a man". But now we need a way of saying, "All men are mortal". It's that "all" that makes this difficult, since it does not refer to a specific man, but a generic one.

Let's generalize the sentence a bit more. What it is really saying is, "If one is a man, then they are mortal." Now we have this nice "one" acting as a placeholder name for our generic man. Actually, for a generic anything. If we are talking about a turtle, this sentence tells us nothing about whether it is mortal, since a turtle is not a man and so the claim does not apply to it.

In an even more mathematical way of looking at the sentence, it could read as, "For all $x$: if $x$ is a man, then $x$ is mortal." That "for all" appears often enough in arguments that we formalize it as $\forall$, the universal quantifier. You might have seen this appear in math, and if not, I've already been abusing its convenience in some previous proofs in this post.

So our sentence would look like $\forall x(Px \rightarrow Qx)$ where $P$ and $Q$ are predicates for being a man and being mortal respectively.

$x$ is clearly not a constant, since, well, it is not constant; it does not have a fixed meaning. We then aptly call it a variable. But clearly, variables can take on certain meanings, so we also have variable assignments $\alpha, \beta,…$ that allow us to arbitrarily assign variables meanings in a formal way. For example, if we let $x$ mean Socrates in our structures, i.e. $|x|_A^\alpha = \textrm{Socrates} = |a|_A$, then our argument works by the truth rules for $\rightarrow$.

The "opposite" of the universal quantifier is the existential quantifier $\exists$, which instead of saying our sentence applies to all possible objects, $\exists x \Phi$ only claims that there is at least one object that satisfies the sentence $\Phi$.

The duality of the quantifiers can be found even in natural language. "Everything is not a cow," as in $\forall x \neg Px$, is the same as saying, "There does not exist a cow", or $\neg \exists x Px$. Similarly, "There is something that is not a cow," or $\exists x \neg Px$, is equivalent to saying "Not everything is a cow" or $\neg \forall x Px$.
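The two dualities, $\forall x \neg Px \equiv \neg \exists x Px$ and $\exists x \neg Px \equiv \neg \forall x Px$, can be sanity-checked on a small finite domain. The Python sketch below is only an illustration under my own assumptions (the function name `dualities_hold` and the example domain are made up): it tries every possible meaning of $P$ over the domain, reading $\forall$ as `all` and $\exists$ as `any`.

```python
from itertools import product

def dualities_hold(domain):
    """Check both quantifier dualities for every possible extension of a unary predicate P."""
    for bits in product([True, False], repeat=len(domain)):
        P = {obj for obj, b in zip(domain, bits) if b}  # one candidate meaning for P
        # "Everything is not P" vs "there does not exist a P"
        if all(x not in P for x in domain) != (not any(x in P for x in domain)):
            return False
        # "Something is not P" vs "not everything is P"
        if any(x not in P for x in domain) != (not all(x in P for x in domain)):
            return False
    return True

print(dualities_hold(["cow", "horse", "rock"]))  # True
```

A check over one finite domain is of course not a proof, but it is a quick way to catch a misplaced negation.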

However, not every formula with variables and quantifiers makes a sentence. Consider $\forall x \exists y(Px \vee \neg Qxy \rightarrow Rz)$. This doesn't make that much sense since $z$ is just hanging out as a free variable, as there is no quantifier claiming anything about $z$. And since $z$ is a variable, it's not like it has any semantic meaning, either. So we wouldn't call this a sentence, but a formula. Variables that are not free are bound.

Definition: A formula is a sentence if and only if it has no free variables.

Similarly, $\forall y Qy \rightarrow Py$ wouldn't be called a sentence either, as despite how it looks, there is a free variable in $Py$ even though there is a $\forall y$ in the formula. You usually need parentheses to specify what the quantifiers bind. $\forall y(Qy \rightarrow Py)$ is a sentence, while the previous expression is not.

Domains

The final extra detail we need is to specify what meanings our variables are allowed to be assigned. So in an $\mathcal{L}_2$-structure, we also specify a domain that specifies all the possible objects in the "universe" that we are discussing. All constants are assigned meanings from the domain, and our quantifiers use variable assignments that range over the domain.

Satisfaction and Truth in $\mathcal{L}_2$

Suppose $\Phi$ and $\Psi$ are $\mathcal{L}_2$ formulas, $v$ is a variable, $A$ is a structure, and $\alpha$ is a variable assignment. We say $\alpha$ satisfies a formula $\varphi$ under $A$, written $|\varphi|_A^\alpha = T$, according to the following rules:

$|\Phi v_1v_2…v_n|_A^\alpha = T$ iff $\langle |v_1|_A^\alpha, |v_2|_A^\alpha,…, |v_n|_A^\alpha \rangle \in |\Phi|_A^\alpha$ where $\Phi$ is an n-ary predicate and each $v_i$ is a variable or constant
$|\neg\Phi|_A^\alpha = T$ iff $|\Phi|_A^\alpha = F$
$|\Phi \wedge \Psi|_A^\alpha = T$ iff $|\Phi|_A^\alpha = T$ and $|\Psi|_A^\alpha = T$
$|\Phi \vee \Psi|_A^\alpha = T$ iff $|\Phi|_A^\alpha = T$ or $|\Psi|_A^\alpha = T$
$|\Phi \rightarrow \Psi|_A^\alpha = T$ iff $|\Phi|_A^\alpha = F$ or $|\Psi|_A^\alpha = T$
$|\Phi \leftrightarrow \Psi|_A^\alpha = T$ iff $|\Phi|_A^\alpha = |\Psi|_A^\alpha$
$|\forall v \Phi|_A^\alpha = T$ iff for every variable assignment $\beta$ that differs from $\alpha$ in at most $v$, $|\Phi|_A^\beta = T$; in other words, $|\forall v \Phi|_A^\alpha = T$ iff $\Phi$ is satisfied for every value of $v$ while everything except $v$ is held fixed
$|\exists v \Phi|_A^\alpha = T$ iff there is at least one variable assignment $\beta$ that differs from $\alpha$ in at most $v$ where $|\Phi|_A^\beta = T$; in other words, $|\exists v \Phi|_A^\alpha = T$ iff $\Phi$ is satisfied for some value of $v$

The satisfaction rules for the connectives come from the standard truth tables we did before, and the quantifiers are exactly what you would imagine them to mean, just formalized in the new terminology we have to make it precise.
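The satisfaction rules above are essentially a recursive algorithm, at least when the domain is finite. Here is a minimal Python sketch of them. Everything about the encoding is my own assumption for illustration: formulas are nested tuples such as `("forall", "x", body)`, a structure is a dictionary with a domain, constants, and predicates, and variable names are assumed to be distinct from constant names.

```python
def term_value(t, structure, alpha):
    # Variables are looked up in the assignment alpha, constants in the structure.
    return alpha[t] if t in alpha else structure["constants"][t]

def satisfies(phi, structure, alpha):
    """Return True iff the assignment alpha satisfies phi in the given structure."""
    op = phi[0]
    if op == "pred":  # ("pred", "P", "x", "a", ...)
        _, name, *terms = phi
        objs = tuple(term_value(t, structure, alpha) for t in terms)
        return objs in structure["predicates"][name]
    if op == "not":
        return not satisfies(phi[1], structure, alpha)
    if op == "and":
        return satisfies(phi[1], structure, alpha) and satisfies(phi[2], structure, alpha)
    if op == "or":
        return satisfies(phi[1], structure, alpha) or satisfies(phi[2], structure, alpha)
    if op == "imp":
        return (not satisfies(phi[1], structure, alpha)) or satisfies(phi[2], structure, alpha)
    if op == "iff":
        return satisfies(phi[1], structure, alpha) == satisfies(phi[2], structure, alpha)
    if op in ("forall", "exists"):  # ("forall", "x", body)
        _, v, body = phi
        # Each {**alpha, v: d} is an assignment beta differing from alpha in at most v.
        results = (satisfies(body, structure, {**alpha, v: d}) for d in structure["domain"])
        return all(results) if op == "forall" else any(results)
    raise ValueError(f"unknown formula: {phi!r}")

A = {
    "domain": ["Socrates", "Plato", "Zeus"],
    "constants": {"a": "Socrates"},
    "predicates": {
        "P": {("Socrates",), ("Plato",)},  # ...is a man
        "Q": {("Socrates",), ("Plato",)},  # ...is mortal
    },
}
all_men_mortal = ("forall", "x", ("imp", ("pred", "P", "x"), ("pred", "Q", "x")))
print(satisfies(all_men_mortal, A, {}))                                # True
print(satisfies(("pred", "Q", "a"), A, {}))                            # True
print(satisfies(("exists", "x", ("not", ("pred", "Q", "x"))), A, {}))  # True (Zeus)
```

Starting from the empty assignment is enough for sentences, since every variable gets a value from its quantifier before it is looked up.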

The truth of an $\mathcal{L}_2$ sentence is now easy given the above satisfaction rules:

Definition: An $\mathcal{L}_2$ sentence $\varphi$ is true in $A$, written $|\varphi|_A = T$, iff $|\varphi|_A^\alpha = T$ for every variable assignment $\alpha$ over $A$.

Validity remains the same in $\mathcal{L}_2$, just with our updated definitions for structures and truth:

Definition: Given a set of $\mathcal{L}_2$ sentences $\Gamma = \{\Phi, \Psi,…\}$ and a single sentence $\varphi$, we say that $\Gamma$ semantically entails $\varphi$ if and only if for all $\mathcal{L}_2$-structures $A$, if $A \vDash \Gamma$, then $A \vDash \varphi$.

Natural Deduction 2.0

Rules for our original connectives $\wedge, \vee, \neg, \rightarrow, \leftrightarrow$ remain the same. We only need to add the rules for $\forall$ and $\exists$. It will be easier to write them out and then explain them afterwards.

$ \begin{array}{c|c|c} \textbf{} & \textbf{Introduction} & \textbf{Elimination} \\ \hline \forall & \begin{array}{c} \Phi[t/v] \\ \hline \forall v \Phi \\ \end{array} \ { \textrm{for generic constant} \ t} & \begin{array}{c} \forall v \Phi \\ \hline \Phi[t/v] \\ \end{array} \\ \hline \exists & \begin{array}{c} \Phi[t/v] \\ \hline \exists v \Phi \\ \end{array} & \begin{array}{c} \begin{array}{cc} & [\Phi[t/v]] \\ & \vdots \\ \exists v \Phi & \Psi \\ \end{array} \\ \hline \Psi \end{array} \ { \textrm{for generic constant} \ t} \\ \end{array} $

The notation $\Phi[t/v]$ highlights substitution: $\Phi[t/v]$ is the formula $\Phi$ with all free occurrences of $v$ in it replaced with $t$. For example, if $\Phi = Px \wedge \forall y (Qxy \rightarrow Rx)$, then $\Phi[t/x] = Pt \wedge \forall y (Qty \rightarrow Rt)$.
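Substitution is mechanical enough to sketch in a few lines of Python. The nested-tuple encoding of formulas and the function name `substitute` are my own assumptions for illustration. The sketch replaces only free occurrences of $v$, and it assumes $t$ is a constant, as in the rules above, so it does not worry about $t$ being captured by a quantifier.

```python
def substitute(phi, t, v):
    """Return phi[t/v]: phi with the free occurrences of variable v replaced by t."""
    op = phi[0]
    if op == "pred":  # ("pred", "Q", "x", "y")
        _, name, *terms = phi
        return ("pred", name, *[t if s == v else s for s in terms])
    if op == "not":
        return ("not", substitute(phi[1], t, v))
    if op in ("and", "or", "imp", "iff"):
        return (op, substitute(phi[1], t, v), substitute(phi[2], t, v))
    if op in ("forall", "exists"):  # ("forall", "y", body)
        _, bound, body = phi
        if bound == v:
            return phi  # v is bound by this quantifier, so nothing below is free
        return (op, bound, substitute(body, t, v))
    raise ValueError(f"unknown formula: {phi!r}")

# Phi = Px ∧ ∀y(Qxy → Rx)
phi = ("and", ("pred", "P", "x"),
       ("forall", "y", ("imp", ("pred", "Q", "x", "y"), ("pred", "R", "x"))))
print(substitute(phi, "t", "x"))
```

The printed result encodes $Pt \wedge \forall y (Qty \rightarrow Rt)$, matching the example above.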

Here is the intuition behind the new rules:

$\forall$-Intro basically says if $t$ is a constant that does not appear anywhere else in our proof, it is in practice a "dummy variable", and allows us to introduce a proper variable bound by $\forall$.
$\forall$-Elim says if we know $\Phi$ is satisfied by every possible $v$, it is certainly true if $v$ is replaced by a specific $t$; if all men are mortal, certainly Socrates is mortal too
$\exists$-Intro is the formal way of saying if we see an occurrence of $\Phi$, then there is some $v$ that satisfies $\Phi$; if you see a cow, you know there is at least one animal
$\exists$-Elim is a for-the-sake-of-argument style proof. While we may not know exactly what satisfies $\exists v \Phi$, if we can deduce some claim independent of the specific thing that satisfies it, we know that claim must also be true as it only relies on the existence of something. If we know there is at least one plant-eating animal—call it Bob—we can conclude that there are plants, since Bob eats plants, and that's true for all "Bob"s (a.k.a. plant-eating animals).

With these added rules, we can now prove a lot of the things we already knew, such as:

The mortality argument of $\forall x(Px \rightarrow Qx), Pa \vdash Qa$
The quantifier interplay $\neg \forall x Px \dashv\vdash \exists x \neg Px$
Standard tautologies and contradictions; $\vdash \forall x Px \vee \neg \forall x Px$, and $\forall x Px \wedge \neg \forall x Px \vdash$
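As a small worked case, here is one way the mortality argument might be laid out as a proof tree. It takes $t = a$ in $\forall$-Elim to get $Pa \rightarrow Qa$, and then applies $\rightarrow$-Elim with the premise $Pa$:

$ \begin{array}{ccc} & & \underline{\forall x(Px \rightarrow Qx)} \\ Pa & & Pa \rightarrow Qa \\ \hline & Qa & \\ \end{array} $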

Here is an example worked proof from the additional exercises of The Logic Manual of $\forall y \exists x(Ryx \vee Qyx) \vdash \forall y (\exists x Ryx \vee \exists x Qyx)$:

Similarly, we will see that these rules are also sound.

$\mathcal{L}_=$: Identity and Definite Descriptions

$\mathcal{L}_2$ gets us most of the way for what we want but there's one slight inconvenience: we don't have a notion of identity. Say we have a constant $a$ with semantic value $|a|_A = \textrm{Socrates}$. Then, if we have another constant $b$ such that $|b|_A = \textrm{Socrates}$, we have no good way of showing these constants are really "the same"; they might look different, but they function in the same way as both standing in for Socrates.

And in a philosophical sense, identity is an important concept in ontology and metaphysics, so it's worthwhile having a formal notion that we can refer to when picking apart arguments.

We could just specify the predicate, "…is identical to…" as a binary predicate letter in $\mathcal{L}_2$, but that means that the predicate might receive arbitrary meanings in different structures, and identity seems like a pretty unshakeable concept. It also might mean that we use different predicate letters in different formalizations, and we want this to be consistent for it to be useful.

This is what $\mathcal{L}_=$ is for. $\mathcal{L}_=$ is exactly like $\mathcal{L}_2$ in terms of structures and truth, but it has the one difference of having an extra predicate of $=$ to refer to identity. Some quick notes:

As $=$ is a binary predicate, it relates two terms that are constants or variables (not formulas or sentences!)
By convention, we write $a=b$ as per the norm of math; we don't write $=ab$ as we would for a normal predicate/predicate letter
For variables/constants $s$ and $t$, $|s=t|_A^\alpha = T$ iff $|s|_A^\alpha=|t|_A^\alpha$ i.e. they share the same semantic value and represent the same object

Natural Deduction 2.5

We don't have much else to add alongside $=$, and the new rules we use are among the most intuitive of the lot.

$ \begin{array}{c|c|c} \textbf{} & \textbf{Introduction} & \textbf{Elimination} \\ \hline = & [a = a] & \begin{array}{c} \begin{array}{cc}\Phi[s/v] & s=t \end{array} \\ \hline \Phi[t/v] \\ \end{array} \\ \end{array} $

For all constants $a$, it is clear that $a$ is identical to itself, so the introduction rule allows us to say that whenever we want for any constant
If $s$ and $t$ are identical, they can be swapped freely in formulas without hurting the underlying semantics, so the elimination rule also makes sense

Using these rules alongside our normal natural deduction, try to prove that $=$ is an equivalence relation, that is

It is reflexive: $\vdash \forall x (x=x)$
It is symmetric: $\vdash \forall x \forall y (x=y \rightarrow y=x)$
It is transitive: $\vdash \forall x \forall y \forall z (x = y \wedge y = z \rightarrow x = z)$

just as we would expect of normal identity.

Numerical Quantifiers and Definite Descriptions

With identity in place, we can now establish a surprising number of additional, useful claims regarding the specificity of the objects we are talking about.

Numerical Claims

If we wanted to say, "There is at least 1 chicken", we can already do that with $\exists$ by letting $|P| = \textrm{"…is a chicken"}$ and the formalization $\exists x Px$. This is by definition of what it takes to satisfy $\exists$: it requires only one variable assignment, i.e. object in the domain, to be in the set of chickens that $P$ expresses.

If we want to say there are at least two chickens, then we can write that there are two things that are chickens, i.e. $\exists x \exists y(Px \wedge Py)$. But this alone does not guarantee there are two chickens, since if there is only one chicken, the variable assignment where $x$ and $y$ both refer to that same chicken would satisfy the sentence. So we add the clause that they are distinct: $\exists x \exists y(Px \wedge Py \wedge \neg x = y)$; the $\neg x = y$ rules out variable assignments where $x$ and $y$ refer to the same object.

At least 3 chickens would be like $\exists x \exists y \exists z(Px \wedge Py \wedge Pz \wedge \neg x = y \wedge \neg x = z \wedge \neg y = z)$. At least $n$ chickens, would be $\exists x_1 \exists x_2 \cdots \exists x_n(\bigwedge_{i=1}^n Px_i \wedge\bigwedge_{1 \leq i < j \leq n} \neg x_i = x_j)$ where $x_i$ are variables and $\bigwedge$ is like $\sum$ where each term instead of being added is linked by the $\wedge$ connective.

Similarly, if we want at most 1 chicken, that is the same as saying the opposite of at least 2 chickens; the complement of $\geq 2$ is $<2$ or $\leq 1$. So we can just negate our sentence from before: $\neg \exists x \exists y(Px \wedge Py \wedge \neg x = y)$. A cleaner (and more intuitive) way to formalize this, in my opinion, is the following: $\forall x \forall y(Px \wedge Py \rightarrow x = y)$; any 2 supposedly different chickens are actually the same.

At most 2 chickens would be $\forall x \forall y \forall z(Px \wedge Py \wedge Pz \rightarrow x = y \vee x = z \vee y = z)$; if there are 3 supposedly different chickens, at least 2 of them are actually the same. For at most $n$ chickens: $\forall x_1 \forall x_2 \cdots \forall x_{n+1}(\bigwedge_{i=1}^{n+1} Px_i \rightarrow \bigvee_{1\leq i < j \leq n+1} x_i = x_j)$
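To convince yourself that these formalizations really count chickens, here is a small Python sketch that evaluates the "at least $n$" and "at most $n$" sentences literally over a finite domain and compares them against ordinary counting. The names `at_least`, `at_most`, and the example domain are my own assumptions; each $\exists$ becomes an `any` and each $\forall$ an `all` over tuples of domain objects.

```python
from itertools import combinations, product

def at_least(n, domain, P):
    """∃x1...∃xn: every xi is in P and the xi are pairwise distinct."""
    return any(
        all(x in P for x in xs) and all(a != b for a, b in combinations(xs, 2))
        for xs in product(domain, repeat=n)
    )

def at_most(n, domain, P):
    """∀x1...∀x(n+1): if every xi is in P, then some pair xi, xj is identical."""
    return all(
        (not all(x in P for x in xs)) or any(a == b for a, b in combinations(xs, 2))
        for xs in product(domain, repeat=n + 1)
    )

domain = ["c1", "c2", "c3", "dog"]
chickens = {"c1", "c2", "c3"}
for n in range(5):
    assert at_least(n, domain, chickens) == (len(chickens) >= n)
    assert at_most(n, domain, chickens) == (len(chickens) <= n)
print("the formalizations agree with counting for n = 0..4")
```

Note how quickly the number of tuples grows with $n$; the sentences are correct, but they are not an efficient way to count.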

Definite Descriptions

We can set lower bounds on how many chickens there are, and upper bounds on how many chickens there are, so we are now in place to formalize when there is exactly one chicken. For there to be only one supreme chicken, we just need to specify there is at least one and at most one chicken: $\color{blue}{\exists x Px} \wedge \color{red}{\forall x \forall y(Px \wedge Py \rightarrow x = y)}$. An even nicer expression is $\exists x (Px \wedge \forall y (Py \rightarrow y = x))$; there is something that is a chicken, and all other chickens are really the same as that original, defining chicken.

With this, we can now formalize definite descriptions. When we mention "The president of the United States", we are referring to a general position i.e. many people were president from George Washington to Joe Biden. However, there is only one current president of the U.S., and we are now able to differentiate the two with our formalizations above. "The president is a democrat" would be written as $\exists x (Px \wedge \forall y (Py \rightarrow y = x) \wedge Qx)$ where $P$ expresses "…is the president of the U.S." and $Q$ for "…is a democrat". In general to express, "The $\Phi$ has the property $\Psi$," we could use the following formula:

Russell's theory of definite descriptions: $\exists x (\Phi \wedge \forall y (\Phi[y/x] \rightarrow y = x) \wedge \Psi)$

With definite descriptions in hand, we can now formally prove arguments that naturally make sense in English. "Tim's car is red. Therefore there is a red car." Easy enough to see in English, but requires a bit more work to express in $\mathcal{L}_=$:

$\exists x(Px \wedge Qax \wedge \forall y (Py \wedge Qay \rightarrow y = x) \wedge Rx) \vdash \exists x (Px \wedge Rx)$

$\begin{align} a: \ & \textrm{Tim} \ \newline P: \ & \textrm{…is a car} \ \newline Q: \ & \textrm{…owns…} \ \newline R: \ & \textrm{…is red} \ \end{align}$

Generalized Numerical Quantifiers

We were able to specify exactly 1 thing satisfies a property, but we can generalize to exactly $n$ with a simple recursive definition. To express there are exactly $n$ things satisfying a formula, we will denote it with $\exists!_n$, given by the following definition:

$\exists!_0 v \Phi = \neg \exists v \Phi$
$\exists!_{n+1} v \Phi = \exists u (\Phi[u/v] \wedge \exists!_n v(\Phi \wedge \neg u = v))$

Exactly 0 things satisfy the formula $\Phi$ when there does not exist anything that satisfies it. Exactly $n+1$ things satisfy $\Phi$ when there is something that satisfies $\Phi$ and exactly $n$ distinct, other things that satisfy $\Phi$ as well. As a gut check, we can test to see if this agrees with our definition from definite descriptions from before.

$\begin{align} \exists!_1 x Px & = \exists x(Px \wedge \exists!_0 y (Py \wedge \neg y = x)) \ \newline & \equiv \exists x(Px \wedge \neg \exists y (Py \wedge \neg y = x)) \ \newline & \equiv \exists x(Px \wedge \forall y \neg (Py \wedge \neg y = x)) \ \newline & \equiv \exists x(Px \wedge \forall y (\neg Py \vee y = x)) \ \newline & \equiv \exists x(Px \wedge \forall y (Py \rightarrow y = x)) \ \end{align}$

By a series of logical equivalences, we can see that we get precisely what we expect.
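The recursive definition can also be run directly on a finite domain. Below is a minimal Python sketch, where the name `exactly` and the encoding of a formula $\Phi$ as a Python function on one object are my own assumptions for illustration. The two branches mirror the two clauses of the definition of $\exists!_n$.

```python
def exactly(n, domain, phi):
    """Evaluate 'exactly n things satisfy phi' by following the recursive definition."""
    if n == 0:
        # Base clause: nothing satisfies phi.
        return not any(phi(v) for v in domain)
    # Recursive clause: some u satisfies phi, and exactly n-1 things other than u do too.
    return any(
        phi(u) and exactly(n - 1, domain, lambda v, u=u: phi(v) and v != u)
        for u in domain
    )

domain = range(6)
is_even = lambda x: x % 2 == 0  # satisfied by 0, 2 and 4

print(exactly(3, domain, is_even))  # True
print(exactly(2, domain, is_even))  # False
print(exactly(4, domain, is_even))  # False
```

The inner `lambda` plays the role of the formula $\Phi \wedge \neg u = v$: it strengthens the property so that the chosen witness $u$ cannot be counted again.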

A Quick Aside on Identity

We've been using identity loosely so far, since what we mean by "identical" actually varies, whether you realize it or not.

If we say my friend and I have identical cars, we don't mean that my friend and I literally own the same car, but rather the make, the model, the color, etc. all look identical. This is an example of (approximate) qualitative identity, where two objects share the same properties. However, if the police tell you your car is the same one that was seen at a crime scene, then they mean it is literally one and the same car. This is numerical identity.

We have to be careful about this since these two types of identities clearly seem disparate. While numerical identity seems to imply qualitative identity (one thing will always share the same properties as itself), it is debated if qualitative identity really implies numerical identity. If two things really share all the same properties, it becomes unclear whether or not that really implies we are talking of the same object.

The language of $\mathcal{L}_=$ focuses on the latter numerical identity, even though the former might be the more convenient one in everyday speech.

Summary of our Logics

We've covered a lot, so let's quickly recap some of the key ideas covered so far:

Logic studies valid arguments; what conclusions can we draw by nature of the structure of our argument?

For $\mathcal{L}_1$:

We had basic sentence letters, which we made more complex with the connectives $\{\wedge, \vee, \neg, \rightarrow, \leftrightarrow\}$, each with their own set of truth rules
We modelled the universe in the manner of structures, saying which sentences we said to be true and which to be false
Alongside it came its own proof system, natural deduction, that allowed us to validate arguments in a more systematic way beyond brute force checking

For $\mathcal{L}_2$:

We lacked the nuance of relating objects to each other in $\mathcal{L}_1$, so we introduced constants to formalize those objects with properties stored as predicates
We updated our model of the universe with specifying constants and predicates (unary, binary, ternary, and n-ary relations)
The universal $\forall$ and existential $\exists$ quantifiers were a nice shortcut to state claims of how many things in the domain satisfy a formula
We updated our proof system to account for these quantifiers

For $\mathcal{L}_=$:

The notion of numerical identity leads to obvious truths that would be convenient to introduce
We expanded $\mathcal{L}_2$ with the new binary predicate $=$ to show when two constants are essentially the same despite being written with different letters
We included the final set of rules for natural deduction to accommodate $=$
We could now make numerical claims and formalize definite descriptions

This completes everything you need to know about first-order logic (more on this later), sometimes referred to as predicate calculus.

Part 2: Introduction to Metatheory

What we've covered before gives us a pretty reliable way to analyze English arguments, which philosophers really want to make sure they aren't tripping over themselves when discussing and debating ideas. But there are a whole host of patterns and results within logic itself that arguably have a bigger impact on the rest of science and deductive reasoning.

For example, consider the following theorem:

Deduction Theorem: $\Gamma, \Psi \vDash \Phi$ if and only if $\Gamma \vDash \Psi \rightarrow \Phi$.

By thinking through the definitions and terms we are working with, we can treat this almost as we would a mathematical theorem, and prove it analytically.

Proof: If $\Gamma, \Psi \vDash \Phi$, then by consistency (and definition of entailment), $\Gamma, \Psi, \neg \Phi \vDash $. Note that the set $\{\Psi, \neg \Phi\}$ is satisfied exactly when $\Psi \wedge \neg \Phi$ is. So we can rewrite our sequent as the equivalent $\Gamma, \Psi \wedge \neg \Phi \vDash$. By logical equivalence, then $\Gamma, \neg(\Psi \rightarrow \Phi) \vDash $. Then by consistency again, we arrive at $\Gamma \vDash \Psi \rightarrow \Phi$. Every step is reversible, so the converse direction follows by running the argument backwards. $\blacksquare$
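If you'd like to see the theorem in action, here is a minimal brute-force sketch in Python. It is only a sanity check on one hand-picked argument, not a proof. The sentences chosen for `gamma`, `psi`, and `phi`, and the helper names `structures`, `entails`, and `implies`, are my own assumptions for illustration.

```python
from itertools import product

LETTERS = ["P", "Q", "R"]

def structures():
    """Yield every assignment of truth values to the sentence letters."""
    for values in product([True, False], repeat=len(LETTERS)):
        yield dict(zip(LETTERS, values))

def entails(premises, conclusion):
    """Gamma |= Phi: no structure makes every premise true and the conclusion false."""
    return all(conclusion(A) for A in structures()
               if all(p(A) for p in premises))

def implies(psi, phi):
    """Build the sentence psi -> phi from two sentences."""
    return lambda A: (not psi(A)) or phi(A)

# Hypothetical example sentences (not from the text above)
gamma = [lambda A: A["P"] or A["Q"]]   # Gamma = {P v Q}
psi = lambda A: not A["P"]             # Psi = not P
phi = lambda A: A["Q"]                 # Phi = Q

left = entails(gamma + [psi], phi)          # Gamma, Psi |= Phi
right = entails(gamma, implies(psi, phi))   # Gamma |= Psi -> Phi
print(left, right)
```

Both sides of the "if and only if" come out the same for this argument, and swapping in an invalid argument (say $\Psi = P$) makes both sides fail together.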

We are able to determine a result about arguments in general. We don't know what our premises and conclusion are, yet we can still make substantive claims regarding logical arguments of that form.

Here is another simple example that is commonly used all throughout science and math:

Contraposition: $\Phi \vDash \Psi$ if and only if $\neg \Psi \vDash \neg \Phi$.

"If it is Monday, then I have math class. So if I don't have math class, it certainly cannot be Monday." That is the general idea of argument by contraposition: if we know that a premise entails a conclusion, and we don't observe the conclusion, the premise must not be fulfilled either.

We've already seen two useful theorems! Now, we will develop and practice using more tools to argue with logic as its own field of study rather than just an auxiliary field to other sciences.

Principle of Mathematical Induction

The major tool we will need is induction. For a hypothesis $\Phi$, the Principle of Mathematical Induction (PMI) is usually stated as:

$ \begin{align} & \Phi(0) \newline & \forall n (\Phi(n) \rightarrow \Phi(n+1)) \newline \hline & \forall n \Phi(n) \end{align} $

The common analogy is with dominoes: if you knock over a starting domino (show the hypothesis $\Phi$ holds for a starting integer like 0), and can prove that each domino will knock over the next domino (i.e. $\Phi(n)$ holding implies $\Phi(n+1)$ holds), then you can infer that all the dominoes will fall over eventually (the hypothesis holds for all natural numbers, $\forall n \Phi(n)$).

Here is a classic example: we will show the formula for the sum of the first $n$ positive integers is $\sum_{k=1}^n k = \frac{n(n+1)}{2}$.

Base Case: The formula holds for $n=1$ since $\sum_{k=1}^1 k = 1 = \frac{1(1+1)}{2}$.

Inductive Hypothesis: Let's assume our formula works for the first $n$ integers. We want to show that it holds for case $n+1$: $\sum_{k=1}^{n+1} k = \sum_{k=1}^{n} k + (n+1)$. By our assumption, we can reduce this to $\frac{n(n+1)}{2} + n+1$. Simplifying further,

$ \begin{align} \frac{n(n+1)}{2} + n+1 & = \frac{n(n+1) + 2(n+1)}{2} \newline & = \frac{(n+1)(n+2)}{2} \end{align} $

Which is precisely the formula our hypothesis predicted. So by the PMI, for all $n$, $\sum_{k=1}^n k = \frac{n(n+1)}{2}$.
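A numerical spot check is no substitute for the induction, but it is a nice way to catch a wrong hypothesis before trying to prove it. The sketch below compares the direct sum with the closed form and also checks the algebra of the inductive step. The function names and the cutoff of 200 are arbitrary choices of mine.

```python
def sum_first(n):
    """Add up 1 + 2 + ... + n directly."""
    return sum(range(1, n + 1))

def closed_form(n):
    """The conjectured formula n(n+1)/2 (always an integer)."""
    return n * (n + 1) // 2

# The formula agrees with the direct sum on every n we try
formula_ok = all(sum_first(n) == closed_form(n) for n in range(1, 201))

# The inductive step: adding (n+1) to the formula for n gives the formula for n+1
step_ok = all(closed_form(n) + (n + 1) == closed_form(n + 1) for n in range(1, 201))

print(formula_ok, step_ok)
```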

Those who are more familiar with math will have seen examples of induction everywhere, and it is quite good for proving theorems and claims that naturally divide into these cases-by-integers.

One drawback of induction is that it does not tell you how you get your hypothesis to begin with. That usually requires trial-and-error with a strong intuition, but once you have it, induction gives a relatively straightforward way of proving your claim (if it is right).

The formulation above is sometimes referred to as weak induction. The alternative is strong induction:

$ \begin{align} & \Phi(0) \newline & \forall n (\forall k \leq n \ \Phi(k) \rightarrow \Phi(n+1)) \newline \hline & \forall n \Phi(n) \end{align} $

Sometimes you need more cases besides just $\Phi(n)$ to prove $\Phi(n+1)$, so strong induction is a way to justify that. Despite the names, the weak and strong forms of induction are equivalent, so we can use either whenever.

With induction in hand, we can now start proving results within logic.

Example: Any $\mathcal{L}_1$ sentence $\Phi$ that only uses the connective $\leftrightarrow$ is not a contradiction i.e. there's a structure $A$ in which $|\Phi|_A = T$.

Proof: We will show this by giving a specific structure in which $\Phi$ is true, and proving it holds via induction. I propose the structure $A$ where all sentence letters in $\Phi$ are assigned true. That is $\forall \alpha \in \textrm{SenLett}(\Phi) \ |\alpha|_A = T$.

Base Case: The simplest possible sentence using just $\leftrightarrow$ is the one not using it at all: a sentence letter. So if $\Phi$ is a sentence letter, by construction of our structure, $|\Phi|_A = T$.

Inductive Hypothesis: We'll induct on the complexity of $\Phi$. So let's build up $\Phi$ from "simpler" sentences; $\Phi = \Psi_1 \leftrightarrow \Psi_2$ where $\Psi_1, \Psi_2$ are both subsentences only containing $\leftrightarrow$. Remember, we have defined our structure $A$ such that $\forall \alpha \in \textrm{SenLett}(\Phi) \ |\alpha|_A = T$. Since $\Psi_1,\Psi_2$ build up $\Phi$, it should be clear that $\textrm{SenLett}(\Psi_{1,2}) \subseteq \textrm{SenLett}(\Phi)$. Then, by our structure, $\forall \alpha \in \textrm{SenLett}(\Psi_{1,2}) \ |\alpha|_A = T$. By our inductive hypothesis then, $|\Psi_1|_A = |\Psi_2|_A = T$, since they are $\mathcal{L}_1$ sentences that only use $\leftrightarrow$. By the truth table for $\leftrightarrow$, $|\Phi|_A = T$. $\blacksquare$

So for all sentences that only use $\leftrightarrow$, there is a structure in which it is true, and hence it is not a contradiction.
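To make the claim concrete, here is a small sketch that generates every $\leftrightarrow$-only sentence up to a given complexity and evaluates it in the all-true structure. Representing sentences as nested tuples, the two letters `P` and `Q`, and the complexity cutoff of 3 are all assumptions made for this illustration.

```python
def evaluate(sentence, A):
    """Truth value of a <->-only sentence in structure A (a dict of letters)."""
    if isinstance(sentence, str):      # a sentence letter
        return A[sentence]
    _, left, right = sentence          # ("iff", left, right)
    return evaluate(left, A) == evaluate(right, A)

def iff_sentences(letters, complexity):
    """All sentences with exactly `complexity` occurrences of <->."""
    if complexity == 0:
        return list(letters)
    result = []
    # Split the remaining connectives between the two sides of the main <->
    for k in range(complexity):
        for left in iff_sentences(letters, k):
            for right in iff_sentences(letters, complexity - 1 - k):
                result.append(("iff", left, right))
    return result

letters = ["P", "Q"]
all_true = {letter: True for letter in letters}   # the structure A from the proof

results = [evaluate(s, all_true)
           for n in range(4)
           for s in iff_sentences(letters, n)]
print(all(results))
```

The recursion in `iff_sentences` mirrors the strong induction in the proof: a sentence of complexity $n+1$ is built from two subsentences whose complexities we do not know individually, only that they are smaller.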

Induction naturally lends itself to proofs that are divided by integer cases, and we leveraged that by ascribing a natural number to a sentence via its complexity. Also note we used the strong form of induction here, since we don't actually know the complexity of $\Psi_1$ or $\Psi_2$, only that each is less complex than $\Phi$. Here are some more formal definitions of what was used above, and others that will be helpful later:

$\textrm{NConn}(\Phi)$: The number of occurrences of connectives in $\Phi$. We also define this as the complexity of $\Phi$, as it characterizes how "deep" subsentences go.
$\textrm{Conn}(\Phi)$: The set of connectives used in $\Phi$; $\textrm{NConn}$ counts tokens, while $\textrm{Conn}$ counts types.
$\textrm{SenLett}(\Phi)$: The set of sentence letters in $\Phi$.
$\textrm{Atoms}(\Phi)$: The set of atomic sentences of our language that occur in $\Phi$.

In general, $\textrm{Atoms}(\Phi) = \textrm{SenLett}(\Phi)$ for $\mathcal{L}_1$, but sometimes we will consider convenient extensions of $\mathcal{L}_1$. For one, it's sometimes useful to have a symbol for tautologies $\top$ and contradictions $\bot$. Just for clarity, for all structures $A$, $|\top|_A = T$ and $|\bot|_A = F$. These are considered useful atoms in the extended language $\mathcal{L}_1^+$.

Here's another example of how induction can be applied to logic:

De Morgan's Laws: For sentences $\varphi_1, \varphi_2, \cdots, \varphi_n$,

$\neg (\varphi_1 \wedge \varphi_2 \wedge \cdots \wedge \varphi_n) \equiv \neg \varphi_1 \vee \neg\varphi_2 \vee \cdots \vee \neg\varphi_n$

$\neg (\varphi_1 \vee \varphi_2 \vee \cdots \vee \varphi_n) \equiv \neg \varphi_1 \wedge \neg\varphi_2 \wedge \cdots \wedge \neg\varphi_n$

For those who are familiar with some set theory, these are identical to how set intersections (equivalent to $\wedge$) and unions (equivalent to $\vee$) relate to each other with complements ($\neg$).

Base Case: It can easily be seen by truth tables, or verified with natural deduction that $\neg(\varphi_1 \wedge \varphi_2) \equiv \neg\varphi_1 \vee \neg\varphi_2$ (even in English, it can be seen to be fairly reasonable).

Inductive Hypothesis: Assume this holds for $n$ sentences. Then:

$ \begin{align} \neg (\varphi_1 \wedge \cdots \wedge \varphi_n \wedge \varphi_{n+1}) = \neg ((\varphi_1 \wedge \cdots \wedge \varphi_n) \wedge \varphi_{n+1}) & \equiv \neg(\varphi_1 \wedge \cdots \wedge \varphi_n) \vee \neg \varphi_{n+1} \\ & \equiv \neg \varphi_1 \vee \cdots \vee \neg \varphi_n \vee \neg \varphi_{n+1} \end{align} $

The first equivalence is given by the case for 2 sentences, and the last is given by the case for $n$ sentences by the inductive hypothesis. The second law follows similarly. $\blacksquare$
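As a quick cross-check of the generalized laws, the sketch below brute-forces every combination of truth values for a handful of values of $n$. It treats each $\varphi_i$ as a bare truth value, which is all that matters for logical equivalence. The function name and the range of $n$ are my own choices.

```python
from itertools import product

def de_morgan_holds(n):
    """Check both generalized De Morgan's Laws over all 2^n rows."""
    for values in product([True, False], repeat=n):
        # not (phi_1 and ... and phi_n)  vs  (not phi_1) or ... or (not phi_n)
        first = (not all(values)) == any(not v for v in values)
        # not (phi_1 or ... or phi_n)  vs  (not phi_1) and ... and (not phi_n)
        second = (not any(values)) == all(not v for v in values)
        if not (first and second):
            return False
    return True

checks = [de_morgan_holds(n) for n in range(2, 8)]
print(checks)
```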

The last immediate example we'll look at is a simple, yet important lemma.

Relevance Lemma: Suppose for two structures $A$ and $B$, $\forall \alpha \in \ \textrm{Atoms}(\Phi)$ we have it such that $|\alpha|_A = |\alpha|_B$. Then $|\Phi|_A = |\Phi|_B$.

The idea is that the only information pertinent to a logical sentence is what it looks like at the "lowest level"; only the sentence letters in $\Phi$ are what affect its truth value, and nothing else is necessary.

Proof: We'll proceed as before by inducting on the complexity of $\Phi$.

Base Case: The simplest sentence $\Phi$ can be is a sentence letter. Then $\Phi$ is its own only atom, so by our assumption that $A$ and $B$ agree on the atoms of $\Phi$, $|\Phi|_A = |\Phi|_B$. In the case of $\mathcal{L}_1^+$, $\Phi$ can be $\top$ or $\bot$, but since their semantics are fixed for any structure, the claim still holds.

Inductive Hypothesis: Say this holds for sentences of complexity less than or equal to $n$. Now let $\Phi$ be a sentence of complexity $n+1$ i.e. $\textrm{NConn}(\Phi) = n+1$. Then $\Phi$ is of the form $\neg \Psi_1$, $\Psi_1 \wedge \Psi_2$, $\Psi_1 \vee \Psi_2$, $\Psi_1 \rightarrow \Psi_2$, or $\Psi_1 \leftrightarrow \Psi_2$ where $\textrm{NConn}(\Psi_{1,2}) \leq n$ (in the two-place cases, $\textrm{NConn}(\Psi_1) + \textrm{NConn}(\Psi_2) = n$, since the main connective accounts for the remaining occurrence). If structures $A$ and $B$ agree on the values of the atoms of $\Phi$, they agree on the atoms of $\Psi_1$ and $\Psi_2$. By the inductive hypothesis, $|\Psi_1|_A = |\Psi_1|_B$ and $|\Psi_2|_A = |\Psi_2|_B$. Since the truth value of $\Phi$ is fixed by the truth value of $\Psi_{1,2}$, then in all the cases, by their respective truth tables, $|\Phi|_A = |\Phi|_B$. $\blacksquare$

The Relevance Lemma essentially gives us a way of justifying finite truth tables; the reason why we know the sentence $P \wedge Q$ only depends on the truth values for $P$ and $Q$ is precisely because of the Relevance Lemma; so long as two structures agree on the value of $P$ and $Q$, they will agree on $P \wedge Q$ regardless of what the other sentence letters are assigned. It seems like an obvious fact, but is one that needs to be proven anyway.
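Here is a small sketch of the lemma at work. It compares every pair of structures over four sentence letters and confirms that whenever two structures agree on the letters occurring in $\Phi$, they agree on $\Phi$. The tuple representation of sentences, the connective names, and the example sentence are assumptions made purely for illustration.

```python
from itertools import product

def evaluate(s, A):
    """Truth value of sentence s (nested tuples) in structure A (a dict)."""
    if isinstance(s, str):
        return A[s]
    op, *args = s
    vals = [evaluate(a, A) for a in args]
    if op == "not":
        return not vals[0]
    if op == "and":
        return vals[0] and vals[1]
    if op == "or":
        return vals[0] or vals[1]
    if op == "imp":
        return (not vals[0]) or vals[1]
    if op == "iff":
        return vals[0] == vals[1]
    raise ValueError(f"unknown connective: {op}")

def sen_lett(s):
    """The set of sentence letters occurring in s."""
    if isinstance(s, str):
        return {s}
    letters = set()
    for sub in s[1:]:
        letters |= sen_lett(sub)
    return letters

LETTERS = ["P", "Q", "R", "S"]
phi = ("imp", ("and", "P", "Q"), ("not", "P"))   # hypothetical example using only P and Q

def relevance_holds(sentence):
    """Structures agreeing on the letters of `sentence` must agree on `sentence`."""
    all_structures = [dict(zip(LETTERS, v))
                      for v in product([True, False], repeat=len(LETTERS))]
    for A in all_structures:
        for B in all_structures:
            if all(A[x] == B[x] for x in sen_lett(sentence)):
                if evaluate(sentence, A) != evaluate(sentence, B):
                    return False
    return True

print(relevance_holds(phi))
```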

Induction, especially on the complexity of sentences, will be a key tool in future results.

Truth Functions

For future convenience, we'll introduce the idea of a truth function. Just as you'd imagine, truth functions are like the typical function in math that takes some amount of inputs and spits out a single output. We thus define a truth function as a function $f: \{T,F\}^n \rightarrow \{T,F\}$. We have already seen some simple truth functions in the form of connectives. For example, $\wedge$ has the truth function such that $f_\wedge(T,F) = F$; if you connect a true sentence letter with a false sentence letter by $\wedge$, we get a more complex sentence that evaluates to false.

This leads to the natural idea of sentences expressing truth functions. We say a sentence $\Phi$ expresses a truth function $f$ iff the sentence letters $P_1,\cdots,P_n$ occur in $\Phi$, and for every structure $A$, $f(|P_1|_A, \cdots, |P_n|_A) = |\Phi|_A$.

Substitution

Earlier we discussed substitution briefly in the context of constants and variables with $\forall$ and $\exists$ and their proof rules. We can generalize substitution even further:

Definition: For a set of sentences $\Gamma$, $\Gamma[\varphi/S]$ is the uniform substitution of the sentence $\varphi$ for the sentence letter $S$ in every sentence in $\Gamma$. If $S$ does not occur at all in $\Gamma$, then $\Gamma[\varphi/S] = \Gamma$.

For example, if we have the sentence $\Phi = P \rightarrow (Q \vee R) \wedge P$, then with substitution, $\Phi[(R \leftrightarrow \neg Q) / P] = (R \leftrightarrow \neg Q) \rightarrow (Q \vee R) \wedge (R \leftrightarrow \neg Q)$. This applies for sets of sentences as described above.

We then have the following theorem:

Substitution Theorem: If $\Gamma \vDash \Phi$, then $\Gamma[\varphi/S] \vDash \Phi[\varphi/S]$.

This should sort of make sense; we've been focusing on the validity of an argument as something intrinsic to the structure of the argument, as opposed to having anything to do with the specific sentences. So exchanging sentences with another in an argument should be no issue, even if the sentences are more complicated. We'll first prove a helpful lemma:

Substitution Lemma: For a sentence $\Psi$, sentence letter $X$, and a structure $A$, we define a new substitution structure $A_{\Psi/X}$ as follows:

$$ A_{\Psi/X}(\gamma) = \begin{cases} |\Psi|_A \ \textrm{if} \ \gamma = X \\ |\gamma|_A \ \textrm{if} \ \gamma \neq X \end{cases} $$

That is, the structure keeps all sentence letters fixed, UNLESS it is $X$, which we assign the value of $\Psi$. Then for all sentences $\Phi$, we have $|\Phi[\Psi/X]|_A = |\Phi|_{A_{\Psi/X}}$.

Proof: We'll do this by induction on the complexity of $\Phi$ as usual.

Base Case: Simplest sentence is a sentence letter. Then by construction of $A_{\Psi/X}$, $|\Phi|_{A_{\Psi/X}} = |\Phi[\Psi/X]|_A$.

Inductive Hypothesis: Suppose this holds for sentences of complexity at most $n$. Let $\Phi$ be of complexity $n+1$. Then $\Phi$ is of the form $\neg \delta_1$, $\delta_1 \wedge \delta_2$, $\delta_1 \vee \delta_2$, $\delta_1 \rightarrow \delta_2$, or $\delta_1 \leftrightarrow \delta_2$ where $\textrm{NConn}(\delta_{1,2}) \leq n$ (in the two-place cases, $\textrm{NConn}(\delta_1) + \textrm{NConn}(\delta_2) = n$, since the main connective accounts for the remaining occurrence). We can then check this by each case.

$\underline{\Phi = \neg \delta}$: $|\Phi|_{A_{\Psi/X}} = |\neg \delta|_{A_{\Psi/X}} = \color{red}{f_{\neg}(|\delta|_{A_{\Psi/X}}) = f_{\neg}(|\delta[\Psi/X]|_A)} = |\neg \delta[\Psi/X]|_A = |\Phi[\Psi/X]|_A$

$\underline{\Phi = \delta_1 \wedge \delta_2}$: $ \begin{align} |\Phi|_{A_{\Psi/X}} = |\delta_1 \wedge \delta_2|_{A_{\Psi/X}} = \color{red}{f_{\wedge}(|\delta_1|_{A_{\Psi/X}}, |\delta_2|_{A_{\Psi/X}})} \ & \color{red}{= f_{\wedge}(|\delta_1[\Psi/X]|_{A}, |\delta_2[\Psi/X]|_{A})} \\ & = |\delta_1[\Psi/X] \wedge \delta_2[\Psi/X]|_A \\ & = |(\delta_1 \wedge \delta_2)[\Psi/X]|_A \\ & = |\Phi[\Psi/X]|_A \end{align} $

The equality in red is given by the inductive hypothesis, and the rest by definitions of the connectives' truth functions and substitution. The other cases are all identical. $\blacksquare$
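The lemma is easy to test mechanically. The sketch below implements uniform substitution and the substitution structure $A_{\Psi/X}$, then checks $|\Phi[\Psi/X]|_A = |\Phi|_{A_{\Psi/X}}$ in every structure. The tuple encoding and function names are my own, and the example reads $P \rightarrow (Q \vee R) \wedge P$ as $P \rightarrow ((Q \vee R) \wedge P)$, which is an assumption about the intended bracketing.

```python
from itertools import product

def evaluate(s, A):
    """Truth value of sentence s (nested tuples) in structure A (a dict)."""
    if isinstance(s, str):
        return A[s]
    op, *args = s
    vals = [evaluate(a, A) for a in args]
    if op == "not":
        return not vals[0]
    if op == "and":
        return vals[0] and vals[1]
    if op == "or":
        return vals[0] or vals[1]
    if op == "imp":
        return (not vals[0]) or vals[1]
    if op == "iff":
        return vals[0] == vals[1]
    raise ValueError(f"unknown connective: {op}")

def substitute(s, psi, X):
    """Phi[Psi/X]: replace every occurrence of the letter X in s by psi."""
    if isinstance(s, str):
        return psi if s == X else s
    return (s[0],) + tuple(substitute(a, psi, X) for a in s[1:])

def substitution_structure(A, psi, X):
    """A_{Psi/X}: like A, except X receives the value of psi in A."""
    B = dict(A)
    B[X] = evaluate(psi, A)
    return B

phi = ("imp", "P", ("and", ("or", "Q", "R"), "P"))
psi = ("iff", "R", ("not", "Q"))
letters = ["P", "Q", "R"]

lemma_holds = all(
    evaluate(substitute(phi, psi, "P"), A)
    == evaluate(phi, substitution_structure(A, psi, "P"))
    for A in (dict(zip(letters, v))
              for v in product([True, False], repeat=len(letters)))
)
print(lemma_holds)
```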

We can now prove the Substitution Theorem.

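Proof: Suppose for contradiction that $\Gamma \vDash \Phi$ while $\Gamma[\varphi/S] \nvDash \Phi[\varphi/S]$. That is, there is a structure $A$ where $\forall \gamma \in \Gamma \ |\gamma[\varphi/S]|_A = T$ while $|\Phi[\varphi/S]|_A = F$. By the Substitution Lemma, $|\gamma|_{A_{\varphi/S}} = |\gamma[\varphi/S]|_A = T$ for every $\gamma \in \Gamma$, and $|\Phi|_{A_{\varphi/S}} = |\Phi[\varphi/S]|_A = F$. So the structure $A_{\varphi/S}$ satisfies $\Gamma$ but not $\Phi$, contradicting $\Gamma \vDash \Phi$. $\blacksquare$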

Disjunctive Normal Form and Expressive Adequacy

While we have the connectives $\wedge,\vee,\neg,\rightarrow,\leftrightarrow$, there's a natural question as to whether or not we need all of them.

Disjunctive Normal Form

Consider the following sentence: $\neg((P \wedge Q \rightarrow R) \leftrightarrow \neg Q)$. Let's write out the truth table for this sentence.

$ \begin{array}{c|c|c|c} P & Q & R & \neg((P \wedge Q \rightarrow R) \leftrightarrow \neg Q) \\ \hline \color{red}{T} & \color{red}{T} & \color{red}{T} & \color{red}{T} \\ T & T & F & F \\ T & F & T & F \\ T & F & F & F \\ \color{blue}{F} & \color{blue}{T} & \color{blue}{T} & \color{blue}{T} \\ \color{green}{F} & \color{green}{T} & \color{green}{F} & \color{green}{T} \\ F & F & T & F \\ F & F & F & F \\ \end{array} $

This sentence is fairly compact, but also not very readable. We can instead rewrite it as something slightly longer but logically equivalent, which makes it much easier to see what the sentence is saying.

Notice that this sentence is true if and only if one of the following three types of structures is in use:

$P,Q,R$ are true
$P$ is false (or $\neg P$ is true), and $Q,R$ are true
$P$ is false (or $\neg P$ is true), $Q$ is true, and $R$ is false (or $\neg R$ is true)

These correspond to the highlighted rows in the truth table. We can express these truth conditions of the sentence easily then by reading off each condition:

$\color{red}{(P \wedge Q \wedge R)} \vee \color{blue}{(\neg P \wedge Q \wedge R)} \vee \color{green}{(\neg P \wedge Q \wedge \neg R)}$

Whenever the red structure is in use, then our original sentence is true. Likewise, the first disjunct in the above sentence is also true, making the whole sentence true by the truth table for $\vee$. Similarly, when the blue or green structures are in use, our original sentence is true and so is our new sentence made of just $\neg$, $\wedge$, and $\vee$. And when none of those structures are used, then our original sentence is false, and since our new sentence is made from only the conditions that make our original sentence true, it too would be false. Since these two sentences are true in the same structures and false in all the same structures, these sentences are logically equivalent.

We call this second sentence the disjunctive normal form of the original sentence.

Definition: A sentence is in disjunctive normal form (DNF) if it only uses connectives from the set $\{\neg, \wedge, \vee \}$, $\vee$ never occurs in the scope of $\neg$ or $\wedge$, and $\wedge$ never occurs in the scope of $\neg$. Informally, a sentence is in DNF if it is a disjunction of conjunctions of sentence letters or their negations.

Our above method gives a constructive way to convert sentences into DNF using truth tables. A more formal version of the proof specifying structures is given in Eagle's Elements of Deductive Logic, but the idea remains the same even if obscured by notation. A sentence $\Phi$ can be written in DNF by writing

$(\textrm{Case 1} \ \Phi \ \textrm{is true}) \vee (\textrm{Case 2} \ \Phi \ \textrm{is true}) \vee \cdots \vee (\textrm{Case} \ n \ \ \Phi \ \textrm{is true})$

where each Case is the listing of the truth conditions needed given by the sentence letters.
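The truth-table method above is mechanical enough to write as a short program. This sketch reads a DNF sentence off any truth function by keeping one conjunction per true row. The function names, and the choice to return the sentence as a string, are assumptions for illustration.

```python
from itertools import product

def dnf(f, letters):
    """Read a DNF sentence (as a string) off the truth table of f."""
    disjuncts = []
    for row in product([True, False], repeat=len(letters)):
        if f(*row):  # keep only the rows where f is true
            literals = [p if value else "¬" + p
                        for p, value in zip(letters, row)]
            disjuncts.append("(" + " ∧ ".join(literals) + ")")
    # An empty result means f is never true; this sketch does not handle that case.
    return " ∨ ".join(disjuncts)

def example(p, q, r):
    """Truth function of ¬((P ∧ Q → R) ↔ ¬Q)."""
    conditional = (not (p and q)) or r
    return not (conditional == (not q))

result = dnf(example, ["P", "Q", "R"])
print(result)
```

Run on the example sentence, this prints `(P ∧ Q ∧ R) ∨ (¬P ∧ Q ∧ R) ∨ (¬P ∧ Q ∧ ¬R)`, matching the three highlighted rows. Note that `dnf` never looks at a sentence, only at a truth function.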

The alternative to DNF is the conjunctive normal form (CNF), where you write a logically equivalent sentence as a conjunction of disjunctions of sentence letters or their negations. The proof that any sentence can be written in CNF can be done similarly to the proof for DNF. The idea is that a sentence $\Phi$ is true if and only if none of the structures that make it false are in use.

But we also can use De Morgan's Laws and the existence of DNF to prove the existence of CNF much quicker. First, write $\neg \Phi$ in DNF: $\neg \Phi \equiv \bigvee_{n=1}^m (\bigwedge_{i=1}^k \mathscr{P}_i)$ where $\mathscr{P}_i$ is a sentence letter or its negation. This would say that there are $k$ sentence letters in $\neg \Phi$, and that you only need $m$ disjuncts to express it. But, since $\neg \neg \Phi \equiv \Phi$, we can negate $\neg \Phi$ in DNF to get $\Phi$ in CNF:

$\Phi \equiv \neg (\neg \Phi) \equiv \neg \bigvee_{n=1}^m (\bigwedge_{i=1}^k \mathscr{P}_i) \equiv \bigwedge_{n=1}^m \neg (\bigwedge_{i=1}^k \mathscr{P}_i) \equiv \bigwedge_{n=1}^m (\bigvee_{i=1}^k \neg \mathscr{P}_i)$

which is precisely in CNF, with $m$ conjuncts and all the sentence letters and their negations negated.

Expressive Adequacy

We have seen above that we can convert any sentence into a logically equivalent one using only $\neg$, $\wedge$, and $\vee$. However, note the only thing we really needed was the truth table of a sentence, not the sentence itself. So if we outline any truth function via a truth table, we can find a sentence in DNF that expresses that truth function.

So we say that the set of connectives $\{\neg, \wedge, \vee \}$ is expressively adequate: any truth function can be expressed by a sentence using exclusively the connectives $\{\neg, \wedge, \vee \}$. In other words, the truth functions of $\{\neg, \wedge, \vee \}$ can make any other truth function under composition with each other. We proved that with the existence of DNF.

Then, in a way, we have some redundant connectives in $\mathcal{L}_1$: $\rightarrow$ and $\leftrightarrow$ can be expressed using $\{\neg, \wedge, \vee \}$.

$ \begin{align} \Phi \rightarrow \Psi & \equiv \neg \Phi \vee \Psi \\ \Phi \leftrightarrow \Psi & \equiv (\neg \Phi \vee \Psi) \wedge (\neg \Psi \vee \Phi) \end{align} $

Of course, it's much more convenient to use $\rightarrow$ and $\leftrightarrow$ than their equivalent forms, but in a way, they are just that—a convenience.

But we can do better. We know from De Morgan's Laws:

$ \begin{align} \Phi \wedge \Psi & \equiv \neg(\neg \Phi \vee \neg \Psi) \\ \Phi \vee \Psi & \equiv \neg(\neg \Phi \wedge \neg \Psi) \end{align} $

So even one of $\wedge$ and $\vee$ is redundant; you only need one of them with negation to express the other.

So we have two small expressively adequate sets already with $\{\neg, \vee\}$ and $\{\neg, \wedge\}$, but there are many others, such as $\{\neg, \rightarrow \}$ and $\{ \rightarrow, \bot \}$. There are even single connectives that are expressively adequate. Consider the connectives with the following truth tables:

$ \begin{array}{c|c|c} P & Q & (P \uparrow Q) \\ \hline T & T & F \\ T & F & T \\ F & T & T \\ F & F & T \\ \end{array} \ \ \ \ \begin{array}{c|c|c} P & Q & (P \downarrow Q) \\ \hline T & T & F \\ T & F & F \\ F & T & F \\ F & F & T \\ \end{array} $

$\uparrow$ is sometimes called nand (as in not/negated-and) while $\downarrow$ is nor (not-or). We know $\{\neg, \wedge \}$ is expressively adequate, and that

$ \begin{align} \neg \Phi & \equiv \Phi \uparrow \Phi \\ \Phi \wedge \Psi & \equiv (\Phi \uparrow \Psi) \uparrow (\Phi \uparrow \Psi) \end{align} $

so $\{\uparrow \}$ can express the truth functions of the expressively adequate connective set $\{\neg, \wedge \}$. Thus, $\{\uparrow \} $ is expressively adequate.
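The two equivalences are easy to confirm by brute force. In the sketch below, every connective is built out of `nand` alone. The definition of `or_` is an extra that the text above does not state, included only to show the pattern continues.

```python
from itertools import product

def nand(p, q):
    """The truth function of the connective written as an up arrow."""
    return not (p and q)

def not_(p):
    return nand(p, p)                        # ¬Φ ≡ Φ ↑ Φ

def and_(p, q):
    return nand(nand(p, q), nand(p, q))      # Φ ∧ Ψ ≡ (Φ ↑ Ψ) ↑ (Φ ↑ Ψ)

def or_(p, q):
    return nand(nand(p, p), nand(q, q))      # extra: Φ ∨ Ψ ≡ (Φ ↑ Φ) ↑ (Ψ ↑ Ψ)

rows = list(product([True, False], repeat=2))
nand_expresses_not = all(not_(p) == (not p) for p in [True, False])
nand_expresses_and = all(and_(p, q) == (p and q) for p, q in rows)
nand_expresses_or = all(or_(p, q) == (p or q) for p, q in rows)
print(nand_expresses_not, nand_expresses_and, nand_expresses_or)
```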

Here are some good exercises to try and fully wrap your head around expressive adequacy:

Show that $\{\downarrow \}$ is expressively adequate as well (hint: try to write another expressively adequate set in terms of just $\downarrow$).
Show $\uparrow$ and $\downarrow$ are the only 2-place connectives that are expressively adequate by themselves.
Consider the 3-place connective $\square(\Phi, \Psi, \Delta)$ which is true iff $\Phi,\Psi,\Delta$ are all false. Show $\{\square\}$ is expressively adequate.
Show $\{\neg\}$ is not expressively adequate (hint: how many inputs can a sentence only using $\neg$ have?)
Show $\{\vee, \wedge, \rightarrow, \leftrightarrow\}$ is not expressively adequate (hint: find a specific truth function it cannot express).
Consider the new connective $\leftrightarrow^*$ with the following truth table: $ \begin{array}{c|c|c} P & Q & (P \leftrightarrow^* Q) \\ \hline T & T & F \\ T & F & T \\ F & T & T \\ F & F & F \\ \end{array} $ Show $\{\leftrightarrow, \leftrightarrow^*\}$ is not expressively adequate.

Duality

We have seen many times that there's a natural interplay between $\wedge$ and $\vee$ with patterns of negation in De Morgan's Laws, as well as between $\forall$ and $\exists$. Even in English, we see this mixing of "internal" and "external" negations; "already outside" means the same thing as "not still inside", so "already" and "still" act in this "dual" manner, similar to $\wedge$ and $\vee$.

We can formalize logical duality, but we first will introduce a new, convenient extension to $\mathcal{L}_1^+$, appropriately called $\mathcal{L}_1^{++}$. This language is exactly like $\mathcal{L}_1^+$, but with the addition of generalized connectives; for each $n$-place truth function $f$, there is a unique $n$-place connective $c$ that is associated with $f$ i.e. that would express it with $n$ sentence letters.

Something else worth noting that I've avoided for consistency is that sometimes it's useful to represent $T$ with 1 and $F$ with 0, so we can do basic arithmetic to simplify our manipulation of truth values. For example, $|\neg P|_A = 1 - |P|_A$, or with $\wedge$, $|P \wedge Q|_A = |P|_A \cdot |Q|_A$.

Definition (Dual of a Connective): Let $c$ be an $n$-place connective with associated truth function $f_c$. The dual connective $c^*$ is the connective with the associated truth function $f_{c^*}$ defined as:

$f_{c^*}(t_1,t_2,\cdots,t_n) = 1 - f_c(1 - t_1,1-t_2,\cdots,1-t_n)$

for all truth values $t_1,t_2,\cdots,t_n$.

In essence, the dual connective is defined by this relation of internal and external negations. In our discussion of expressive adequacy, the final problem I wrote at the end used $\leftrightarrow^{*}$, the dual of $\leftrightarrow$. For a more visual definition, the dual of a connective $c^*$ is obtained by flipping every occurrence of $T$ to $F$ and $F$ to $T$ in the truth table for $c$. If we look at $\wedge$, then

$ \begin{array}{c|c|c} P & Q & (P \wedge Q) \\ \hline T & T & T \\ T & F & F \\ F & T & F \\ F & F & F \\ \end{array} \Longrightarrow \ \begin{array}{c|c|c} P & Q & (P \wedge^* Q) \\ \hline F & F & F \\ F & T & T \\ T & F & T \\ T & T & T \\ \end{array} $

We see then that $\wedge^*$ has the exact same truth table as $\vee$, and conclude they are duals. If we do the same for $\vee$, we would see that $\vee^*$ has the same truth table as $\wedge$, and that leads us to our first observation.

Claim: For all $\mathcal{L}_1^{++}$ connectives, $c^{**} = c$.

Proof:

$\begin{align} f_{c^{**}}(t_1,t_2,\cdots,t_n) & = 1 - f_{c^{*}}(1 - t_1,1-t_2,\cdots,1-t_n) \\ & = 1 - (1 - f_c( 1 - (1 - t_1),1-(1-t_2),\cdots,1-(1-t_n))) \\ & = f_{c}(t_1,t_2,\cdots,t_n) \end{align}$

Since these associated truth functions are unique to each connective, we've shown $c^{**} = c$. $\blacksquare$
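Using the 0/1 convention, the definition of the dual translates almost symbol for symbol into code. The sketch below builds $f_{c^*}$ from $f_c$ and checks a few of the facts discussed here. The helper names `dual` and `same_function` are my own.

```python
from itertools import product

def dual(f):
    """Return the truth function of the dual connective: 1 - f(1 - t_1, ..., 1 - t_n)."""
    def f_star(*t):
        return 1 - f(*(1 - x for x in t))
    return f_star

def conj(p, q):
    return p * q          # |P ∧ Q| = |P| · |Q|

def disj(p, q):
    return max(p, q)      # |P ∨ Q| is 1 when either input is 1

def neg(p):
    return 1 - p          # |¬P| = 1 - |P|

def same_function(f, g, n):
    """Two n-place truth functions are equal if they agree on every input."""
    return all(f(*t) == g(*t) for t in product([0, 1], repeat=n))

and_dual_is_or = same_function(dual(conj), disj, 2)
double_dual_is_identity = same_function(dual(dual(conj)), conj, 2)
neg_is_self_dual = same_function(dual(neg), neg, 1)
print(and_dual_is_or, double_dual_is_identity, neg_is_self_dual)
```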

Connective duals are the heart of this topic, so with that solidified, we are now in a position to define other forms of duality:

Definition (Generalized Duality):

The dual of a sentence is given recursively: if $\Phi$ is a sentence letter, then $\Phi^{*} = \Phi$, and if $\Phi = c(\varphi_1, \cdots, \varphi_n)$, then $\Phi^{*} = (c(\varphi_1, \cdots, \varphi_n))^{*} = c^{*}(\varphi_1^{*}, \cdots, \varphi_n^{*})$
Let $T^* = F$ and $F^* = T$. The dual of a structure $A^{*}$ is defined by flipping the truth values of all sentence letters in $A$, i.e. $|\alpha|_{A^*} = (|\alpha|_A)^* = 1 - |\alpha|_A$ for all sentence letters $\alpha$

We can now prove another adjacent lemma as before.

Claim: For all $\mathcal{L}_1^{++}$ sentences, $\Phi^{**} = \Phi$.

Proof: Induct on the complexity of $\Phi$.

Base Case: If $\Phi$ is a sentence letter, then $\Phi^{**} = (\Phi^*)^* = \Phi^* = \Phi$.

Inductive Hypothesis: Suppose this holds for sentences of complexity at most $k$, and let $\Phi$ have complexity $k+1$. Let $c$ be the highest scope connective of $\Phi$ i.e. $\Phi = c(\varphi_1, \varphi_2, \cdots, \varphi_n)$ with $\textrm{NConn}(\varphi_i) \leq k$. Then

$ \begin{align} \Phi^{**} = c(\varphi_1, \varphi_2, \cdots, \varphi_n)^{**} = c^*(\varphi_1^*, \varphi_2^*, \cdots, \varphi_n^*)^* & = c^{**}(\varphi_1^{**}, \varphi_2^{**}, \cdots, \varphi_n^{**}) \\ & = c(\varphi_1, \varphi_2, \cdots, \varphi_n) \\ & = \Phi \end{align} $ $\blacksquare$

It's worth mentioning that outside of $\mathcal{L}_1^{++}$, it might not be the case that $\Phi^{**} = \Phi$ since we don't have this nice general connective $c$ to work with. But, it will still be logically equivalent; they may not look the same in other languages (as in use the exact same characters in the exact same order), but certainly $\Phi^{**} \equiv \Phi$.

With all of this in place, we now have our first major result:

Duality Lemma: For all sentences, $|\Phi^*|_A + |\Phi|_{A^*} = 1$. In other words, $|\Phi^*|_A = |\neg\Phi|_{A^*}$

Proof: Induct on the complexity of $\Phi$.

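Base Case: If $\Phi$ is a sentence letter, then $|\Phi^{*}|_A = |\Phi|_A = 1 - |\Phi|_{A^*}$

Inductive Hypothesis: Suppose the lemma holds for all sentences less complex than $\Phi$, and let $\Phi = c(\varphi_1,\varphi_2,\cdots,\varphi_n)$. Then

$ \begin{align} |\Phi^*|_A = |c^*(\varphi_1^*,\varphi_2^*,\cdots,\varphi_n^*)|_A & = f_{c^*}(|\varphi_1^*|_A, |\varphi_2^*|_A, \cdots, |\varphi_n^*|_A) \\ & = f_{c^*}(1 - |\varphi_1|_{A^*}, 1 - |\varphi_2|_{A^*}, \cdots, 1 - |\varphi_n|_{A^*}) \\ & = 1 - f_{c}(|\varphi_1|_{A^*}, |\varphi_2|_{A^*}, \cdots, |\varphi_n|_{A^*}) \\ & = 1 - |c(\varphi_1,\varphi_2,\cdots,\varphi_n)|_{A^*} \\ & = 1 - |\Phi|_{A^*} \end{align} $

The second equality is the inductive hypothesis applied to each $\varphi_i$, and the third is the definition of the dual connective. $\blacksquare$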

This lemma comes with a very nice consequence:

Duality Theorem: If $\Phi \vDash \Psi$, then $\Psi^* \vDash \Phi^*$.

Proof: Say $|\Psi^*|_A = 1$. Then by the Duality Lemma, $|\Psi|_{A^*} = 0$. Since $\Phi \vDash \Psi$, then $|\Phi|_{A^*} = 0$. By the Duality Lemma again, $|\Phi^*|_A = 1$. So in any structure where $|\Psi^*|_A = 1$, $|\Phi^*|_A = 1$. In other words, $\Psi^* \vDash \Phi^*$.
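Both the Duality Lemma and the Duality Theorem can be spot-checked on small sentences. The sketch below restricts itself to $\neg$, $\wedge$, and $\vee$ (so that each dual connective is again in the set), uses 0/1 truth values, and tests one hand-picked entailment. The encoding and the example sentences are assumptions for illustration only.

```python
from itertools import product

DUAL = {"and": "or", "or": "and", "not": "not"}   # and* = or, or* = and, not* = not

def evaluate(s, A):
    """0/1 truth value of sentence s (nested tuples) in structure A."""
    if isinstance(s, str):
        return A[s]
    op, *args = s
    vals = [evaluate(a, A) for a in args]
    if op == "not":
        return 1 - vals[0]
    if op == "and":
        return vals[0] * vals[1]
    if op == "or":
        return max(vals)
    raise ValueError(f"unknown connective: {op}")

def dual_sentence(s):
    """Letters are their own duals; otherwise dualize the connective and recurse."""
    if isinstance(s, str):
        return s
    return (DUAL[s[0]],) + tuple(dual_sentence(a) for a in s[1:])

def dual_structure(A):
    """Flip the truth value of every sentence letter."""
    return {letter: 1 - value for letter, value in A.items()}

LETTERS = ["P", "Q", "R"]
STRUCTURES = [dict(zip(LETTERS, v)) for v in product([0, 1], repeat=len(LETTERS))]

def entails(phi, psi):
    """Single-premise entailment, checked over every structure."""
    return all(evaluate(psi, A) == 1 for A in STRUCTURES if evaluate(phi, A) == 1)

phi = ("and", "P", ("not", "Q"))   # hypothetical example: P ∧ ¬Q
psi = ("or", "P", "R")             # hypothetical example: P ∨ R

lemma_holds = all(
    evaluate(dual_sentence(phi), A) + evaluate(phi, dual_structure(A)) == 1
    for A in STRUCTURES
)
theorem_holds = entails(phi, psi) and entails(dual_sentence(psi), dual_sentence(phi))
print(lemma_holds, theorem_holds)
```

Here $\Phi \vDash \Psi$ is $P \wedge \neg Q \vDash P \vee R$, and the dual entailment is $P \wedge R \vDash P \vee \neg Q$, with premise and conclusion swapping roles just as the theorem says.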

Some other interesting ideas to consider:

We proved earlier that $c^{**} = c$, but that does not rule out the possibility that there are connectives such that $c^* = c$. For example, $\neg$ is self-dual as $\neg^* = \neg$. Can you find any other self-dual connectives?
Having shown $\wedge$ and $\vee$ as duals, this gives us a nice informal way to explain why $\forall$ and $\exists$ are duals too (remember, $\forall x Px \equiv \neg \exists x \neg Px$). Informally, $\forall x Px$ sort of expresses $\bigwedge_{\tau \in C} P\tau$ which ranges over all constants $C$ (while accurate, it is not a strictly good way of defining $\forall x Px$ as sentences cannot be of infinite length). Similarly, $\exists x Px$ is sort of equivalent to $\bigvee_{\tau\in C} P\tau$. So given the duality of $\wedge$ and $\vee$, perhaps it is not that surprising that $\forall$ and $\exists$ are duals of each other too.
The Duality Theorem applies to only single sentences entailing other sentences, but what about sets of sentences? We can make a modified version for entailment between sets: we say $\Gamma \vDash \Sigma$ if when all sentences of $\Gamma$ are true, at least one sentence of $\Sigma$ is true. We can then also define $\Gamma^* = \{\gamma^* \ | \ \gamma \in \Gamma \}$ i.e. the dual of a set of sentences is the set of all dual sentences in the set. Then we can also prove in a similar manner if $\Gamma \vDash \Sigma$, then $\Sigma^* \vDash \Gamma^*$.

Part 3: Soundness and Completeness

The previous section was an exploration into how we are able to analyze logic in a very robust and formal way akin to the way we do so in math. And to be fair, it did start to feel like we were doing math after a certain point; we were manipulating sentences and structures in a fairly abstract way that, while interesting, did not always make clear why we cared about those topics beyond curiosity. We now return to some more concrete motivations, as we now have the methods to finally break down the two key theorems that make logic useful.

Soundness

As mentioned earlier, soundness is basically what makes proof systems and provability useful. As a reminder,

Soundness Theorem: If $\Gamma \vdash \Phi$, then $\Gamma \vDash \Phi$.

That is, if a given proof system is sound, then it "preserves truth"; you wouldn't be able to prove something that is a nonsensical, bad argument. Soundness is arguably one of, if not the, most important qualities we can have in a logical system. It's what allows us to know that for any right triangle the squares of the two shorter sides sum to the square of the hypotenuse without having to check every right triangle. It's what allows us to know that $\sqrt{2}$ is irrational without having to check every possible fraction.

And the way you prove soundness is surprisingly easy.

Proof: We will prove this with induction, but instead of looking at the complexity of sentences, we will induct on the complexity of proofs; we will ascribe an integer to each proof according to how many natural deduction rules it uses.

It's also worth noting here that our proof of soundness below only applies to natural deduction; we have to prove soundness for each proof system. We are working with Natural Deduction, so we will show those rules are sound, but if we use another proof system, say, Frege's propositional calculus, we would have to re-prove soundness (since we could just make up arbitrary proof rules that are not sound if we wanted).

Base Case: The simplest proof, of "complexity 0", of $\Phi$ is from $\Phi$ itself, $\Phi \vdash \Phi$; if you know $\Phi$, you certainly know $\Phi$ without needing to use any proof rules. Clearly $\Phi \vDash \Phi$, as any structure $A$ with $|\Phi|_A = T$ trivially assigns $|\Phi|_A = T$, establishing the base case.

Inductive Step: Now we assume that every proof from premises in $\Gamma$ is sound up until its last step. We then check, for each of our Natural Deduction rules, that applying it as the last step keeps the proof sound.

$\underline{\wedge\textrm{-Intro:}}$ Suppose we have a proof of $\Psi_1$ from premises $\Pi_1$ (that is, $\Pi_1 \vdash \Psi_1$) and $\Psi_2$ from premises $\Pi_2$ (similarly $\Pi_2 \vdash \Psi_2$), we obtain a proof $\Psi_1 \wedge \Psi_2$ from an application of $\wedge\textrm{-Intro}$ (where $\Pi_1, \Pi_2 \subseteq \Gamma \ $). By our inductive hypothesis, $\Pi_1 \vDash \Psi_1$ and $\Pi_2 \vDash \Psi_2$. So, for any structure that satisfies $\Gamma$, that structure clearly satisfies $\Pi_1$ and $\Pi_2$ (as they come from $\Gamma \ $), and therefore also satisfies $\Psi_1$ and $\Psi_2$ (by the entailment derived from our inductive hypothesis). So $\Gamma \vDash \Psi_1$ and $\Gamma \vDash \Psi_2$. By the truth table for $\wedge$, we can see that $\Gamma \vDash \Psi_1 \wedge \Psi_2$.

$\underline{\wedge\textrm{-Elim:}}$ From a proof of $\Psi_1 \wedge \Psi_2$ on premises $\Gamma$ ($ \ \Gamma \vdash \Psi_1 \wedge \Psi_2$), we obtain a proof of $\Psi_1$ (or $\Psi_2$) using $\wedge\textrm{-Elim}$. By our inductive hypothesis, $\Gamma \vDash \Psi_1 \wedge \Psi_2$. By the truth table for $\wedge$, $\Psi_1 \wedge \Psi_2$ is true iff $\Psi_1$ and $\Psi_2$ are true, so we can conclude $\Gamma \vDash \Psi_1$ and $\Gamma \vDash \Psi_2$.

This is the basic idea of proving soundness: use our knowledge of semantics and definitions to show our final entailment. Sometimes it is just a matter of writing out what feels obvious, but some might require some more background knowledge or tricks.

$\underline{\textrm{→-Intro:}}$ From a proof of $\Phi$ on premises $\Gamma \cup \{\Psi\}$, get a proof $\Gamma \vdash \Psi \rightarrow \Phi$ with $\textrm{→-Intro}$ (discharging $\Psi$ at the end). From the inductive hypothesis $\Gamma, \Psi \vDash \Phi$. By the Deduction Theorem, $\Gamma \vDash \Psi \rightarrow \Phi$.

$\underline{\textrm{→-Elim:}}$ From a proof of $\Psi$ on premises $\Pi_1$ and another proof of $\Psi \rightarrow \Phi$ from premises $\Pi_2$, obtain a proof of $\Phi$ via $\textrm{→-Elim}$ (where $\Pi_1, \Pi_2 \subseteq \Gamma \ $). By the inductive hypothesis, $\Pi_1 \vDash \Psi$ and $\Pi_2 \vDash \Psi \rightarrow \Phi$, and since both premise sets come from $\Gamma$, also $\Gamma \vDash \Psi$ and $\Gamma \vDash \Psi \rightarrow \Phi$. By the truth table for $\rightarrow$, any structure that satisfies both $\Psi$ and $\Psi \rightarrow \Phi$ must satisfy $\Phi$, so whenever $\Gamma$ is satisfied, $\Phi$ must also be satisfied. Thus, $\Gamma \vDash \Phi$.

The other cases for $\neg$, $\vee$, and $\leftrightarrow$ are all very similar (but do know they are in fact sound). $ \ \blacksquare$
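The truth-table steps in the cases above can also be checked mechanically. Below is a minimal sketch, not part of the original proof: formulas are represented as nested tuples and `entails` brute-forces every valuation. The representation and the names `letters`, `holds`, and `entails` are my own assumptions for illustration, and checking the pattern on the letters $P$ and $Q$ only illustrates the schematic claim about arbitrary $\Phi$ and $\Psi$.

```python
from itertools import product

def letters(f):
    """Collect the sentence letters occurring in a formula."""
    if isinstance(f, str):
        return {f}
    return set().union(*(letters(sub) for sub in f[1:]))

def holds(f, v):
    """Truth value of formula f under valuation v (dict: letter -> bool)."""
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == "not":
        return not holds(f[1], v)
    if op == "and":
        return holds(f[1], v) and holds(f[2], v)
    if op == "or":
        return holds(f[1], v) or holds(f[2], v)
    if op == "imp":
        return (not holds(f[1], v)) or holds(f[2], v)
    raise ValueError(f"unknown connective: {op!r}")

def entails(premises, conclusion):
    """True iff every valuation satisfying all premises satisfies the conclusion."""
    names = sorted(letters(conclusion).union(*(letters(p) for p in premises)))
    for values in product([True, False], repeat=len(names)):
        v = dict(zip(names, values))
        if all(holds(p, v) for p in premises) and not holds(conclusion, v):
            return False
    return True

P, Q = "P", "Q"
print(entails([P, Q], ("and", P, Q)))      # pattern behind and-Intro
print(entails([("and", P, Q)], P))         # pattern behind and-Elim
print(entails([P, ("imp", P, Q)], Q))      # pattern behind imp-Elim
print(entails([("imp", P, Q), Q], P))      # affirming the consequent
```

The first three checks print `True` and the last prints `False`, which is the sense in which a made-up rule like "from $\Psi \rightarrow \Phi$ and $\Phi$, conclude $\Psi$" would break soundness.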

Soundness of $\mathcal{L}_2$

The above proof shows the natural deduction proof system is sound in $\mathcal{L}_1$, but we've moved on past to bigger and better things. To show natural deduction is sound in $\mathcal{L}_2$, we only need to show the 4 new rules are sound: $\forall\textrm{-Intro}$, $\forall\textrm{-Elim}$, $\exists\textrm{-Intro}$, and $\exists\textrm{-Elim}$.

I won't go through them in detail (because this post is long enough and I don't want to type them out), but the proof idea is exactly the same as it was for $\mathcal{L}_1$. Here's a rough sketch of the results one could show to deduce soundness of $\mathcal{L}_2$ (starter proofs can be found in Elements of Deductive Logic):

Substitution of Co-Designating Terms: Let $\tau_1$ and $\tau_2$ be constants/variables, and let $\varphi[\tau_2/\tau_1]$ be the formula obtained by replacing free occurrences of $\tau_1$ by $\tau_2$, where none of these occurrences fall under $\forall \tau_2$ or $\exists \tau_2$ (constants occur freely vacuously). Then for any structure $A$ and variable assignment $\alpha$ where $|\tau_1|_A^\alpha = |\tau_2|_A^\alpha$, we have $|\varphi|_A^\alpha = |\varphi[\tau_2/\tau_1]|_A^\alpha$ (show with induction on the length of formulae).
- This basically says that if two constants/variables stand for the same thing, they are interchangeable in that variable assignment.
With substitution of co-designating terms and the satisfaction rules for $\forall$ and $\exists$, show the following sequents are correct:
- $\Phi[t/v] \vDash \exists v \Phi$
- $\forall v \Phi \vDash \Phi[t/v]$
- If $\Gamma, \Phi[t/v] \vDash \Psi$, then $\Gamma, \exists v \Phi \vDash \Psi$ where the constant $t$ does not occur in $\Gamma, \Phi, \Psi$
- If $\Gamma \vDash \Phi[t/v]$, then $\Gamma \vDash \forall v \Phi$ where constant $t$ does not occur in $\Gamma, \Phi$
Use those sequents to complete the Soundness Theorem for $\forall$ and $\exists$ in the inductive steps of the proof.
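To make the substitution $\varphi[\tau_2/\tau_1]$ from the sketch above a bit more concrete, here is a small illustration. The tuple encoding of formulas and the names `occurs_free` and `subst` are assumptions of mine, not notation from Elements of Deductive Logic. The function replaces only free occurrences, and refuses a substitution that would put the new term under its own quantifier (the side condition in the lemma).

```python
BINARY = ("and", "or", "imp", "iff")
QUANTIFIERS = ("forall", "exists")

def occurs_free(f, term):
    """Does `term` occur freely in formula f?"""
    op = f[0]
    if op == "pred":                 # ("pred", name, (t1, ..., tn))
        return term in f[2]
    if op == "not":
        return occurs_free(f[1], term)
    if op in BINARY:
        return occurs_free(f[1], term) or occurs_free(f[2], term)
    if op in QUANTIFIERS:            # ("forall", v, body) or ("exists", v, body)
        return f[1] != term and occurs_free(f[2], term)
    raise ValueError(f"unknown formula shape: {op!r}")

def subst(f, old, new):
    """Return f[new/old]: replace the free occurrences of `old` by `new`."""
    op = f[0]
    if op == "pred":
        return ("pred", f[1], tuple(new if t == old else t for t in f[2]))
    if op == "not":
        return ("not", subst(f[1], old, new))
    if op in BINARY:
        return (op, subst(f[1], old, new), subst(f[2], old, new))
    if op in QUANTIFIERS:
        var, body = f[1], f[2]
        if var == old:
            return f                 # `old` is bound here, so nothing below is free
        if var == new and occurs_free(body, old):
            raise ValueError("the new term would fall under its own quantifier")
        return (op, var, subst(body, old, new))
    raise ValueError(f"unknown formula shape: {op!r}")

phi = ("exists", "y", ("pred", "R", ("x", "y")))   # there is a y with R(x, y)
print(subst(phi, "x", "a"))   # x is free, so it gets replaced by a
print(subst(phi, "y", "a"))   # y is bound, so the formula is unchanged
```

Calling `subst(phi, "x", "y")` raises an error in this sketch, since the replaced $x$ would land under $\exists y$ and change the meaning of the formula.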

Completeness

We have shown soundness, meaning anything that we do prove is a good argument. But we haven't really given a method of showing how to prove things; proofs in natural deduction—in math even—often rely on having a good intuition of the problem at hand that can be formalized into rigorous logic. For example, look at the following claim and proof:

Claim: There exist irrational numbers $a$ and $b$ such that $a^b$ is rational.

Proof: Consider the number $\sqrt{2}^{\sqrt{2}}$.

If this is rational, we're done (take $a = b = \sqrt{2}$).
If this is irrational, then consider the number $(\sqrt{2}^{\sqrt{2}})^{\sqrt{2}} = \sqrt{2}^{\sqrt{2} \cdot \sqrt{2}} = \sqrt{2}^2 = 2$, which is rational, and we are done (take $a = \sqrt{2}^{\sqrt{2}}$ and $b = \sqrt{2}$).

Never once did we actually determine whether $\sqrt{2}^{\sqrt{2}}$ is rational, but by exhausting the possible cases, it doesn't matter since every possibility leads to a conclusion that proves our claim. $ \ \blacksquare$
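As a quick numerical sanity check of the exponent arithmetic in the second case (and nothing more: floating point can't tell us whether a number is rational), one can evaluate the tower directly. The variable names here are just for illustration.

```python
import math

root2 = math.sqrt(2)
a = root2 ** root2      # the number whose rationality the proof never settles
value = a ** root2      # (sqrt(2)^sqrt(2))^sqrt(2), which should equal 2

print(value)
print(math.isclose(value, 2.0))
```

The comparison uses `math.isclose` rather than `==` because the floating-point result is only guaranteed to be near 2, not exactly 2.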

To independently think of the above proof, one would have to be comfortable and familiar with exponents, since it really just exploits how exponents combine to create a number that satisfies the claim. Moreover, the above proof used the implicit, extra premise that all real numbers are either rational or irrational. This might be a tautology that can be proven under what one assumes about the real numbers, but that would still need to be shown (since not everyone necessarily takes this for granted). Altogether, this proof requires a lot of creativity to not only think of the individual, relevant facts, but to string them together too.

What am I getting at, though? Since there is no algorithm or method to go about proofs, we have no good way to determine if we can prove something until we have proved it. When something has not been proven, though, we don't just give up; mathematicians and scientists keep searching for new techniques and evidence in the hopes of proving or disproving a theorem. But there's a key word there: "hope". We don't actually know if what we are trying to prove is provable at all! We don't know if we are wasting our time away on a Sisyphean quest with no end.

Ideally, we want to be working in a complete logical system, that is:

Completeness Theorem: If $\Gamma \vDash \Phi$, then $\Gamma \vdash \Phi$.

This is the converse of soundness, and would be a very convenient property for our logic to have, since it would mean there are no "unjustified" truths; every valid argument would have a chain of reasoning in our deductive system that allows us to show its validity.

Some of you may have heard of completeness more than soundness for a couple of reasons: 1) soundness is a fairly intuitive property many of us just trust, as much of logic can readily be thought through and seen to make sense; but more likely 2) Gödel's incompleteness theorems are among the most celebrated and frustrating results of modern math, and have made their way into the highest ranks of science legend. They are the culmination of many decades of development in logic, and they put to rest arguably the most fundamental question in math: can we know everything? The ultimate answer is no, leaving us with many statements in math that might be true, but just unprovable from our axioms. As of now, over 10 trillion non-trivial zeroes of the Riemann zeta function have been verified to lie on the critical line, yet a concrete, irrefutable proof of the 160-year-old Riemann hypothesis still eludes us, and incompleteness forces us to wonder if there might not be one at all (but if the Riemann hypothesis is false, then it is provably false according to Robin's theorem!).

But that's going a bit too far off the beaten path. Fortunately for us, $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_=$ are all complete logics, and the rest of this section will be to show that, starting with $\mathcal{L}_1$.

Having a complete system is desirable for a few reasons. Having all theorems be provable is of course nice, but completeness also allows us to connect a lot of dots between semantics and proofs as we would expect. Earlier we mentioned the idea of ND-consistency. In a sound and complete system (which ND is), it turns out syntactic consistency is equivalent to semantic consistency (or satisfiability):

Claim: $\Gamma$ is ND-consistent iff it is satisfiable.

Proof:

$(\Rightarrow)$ Suppose $\Gamma$ is ND-consistent. Then there is a sentence $\Phi$ such that $\Gamma \nvdash \Phi$. By completeness, $\Gamma \nvDash \Phi$. The only way an argument or sequent is invalid is if there is a structure $A$ such that $A \vDash \Gamma$ but $A \nvDash \Phi$. But then we have a structure such that $A \vDash \Gamma$, thus showing $\Gamma$ is satisfiable.
$(\Leftarrow)$ We'll prove the contrapositive i.e. we'll show that if $\Gamma$ is ND-inconsistent, then it is unsatisfiable. So say $\Gamma$ is ND-inconsistent, that is, $\Gamma \vdash \Phi$ for all sentences $\Phi$. By soundness, we have $\Gamma \vDash \Phi$ for all sentences $\Phi$. This includes contradictions, such as $\Phi = P \wedge \neg P$, and the only way $\Gamma$ could entail all sentences, including contradictions, would be if it was unsatisfiable.
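The semantic half of this claim is easy to experiment with for small sets of $\mathcal{L}_1$ sentences. The sketch below is my own illustration, with a made-up tuple encoding and hypothetical helper names: it brute-forces satisfiability and shows that an unsatisfiable set like $\{P, \neg P\}$ entails an unrelated sentence, matching the "entails everything" step in the $(\Leftarrow)$ direction.

```python
from itertools import product

def letters(f):
    """Collect the sentence letters occurring in a formula."""
    if isinstance(f, str):
        return {f}
    return set().union(*(letters(sub) for sub in f[1:]))

def holds(f, v):
    """Truth value of formula f under valuation v (dict: letter -> bool)."""
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == "not":
        return not holds(f[1], v)
    if op == "and":
        return holds(f[1], v) and holds(f[2], v)
    if op == "or":
        return holds(f[1], v) or holds(f[2], v)
    if op == "imp":
        return (not holds(f[1], v)) or holds(f[2], v)
    raise ValueError(f"unknown connective: {op!r}")

def satisfiable(gamma):
    """True iff some valuation makes every sentence in gamma true."""
    names = sorted(set().union(*(letters(g) for g in gamma)))
    return any(
        all(holds(g, dict(zip(names, values))) for g in gamma)
        for values in product([True, False], repeat=len(names))
    )

def entails(gamma, phi):
    """Gamma entails Phi iff Gamma together with not-Phi is unsatisfiable."""
    return not satisfiable(list(gamma) + [("not", phi)])

P, Q = "P", "Q"
print(satisfiable([P, ("imp", P, Q)]))   # a satisfiable set
print(satisfiable([P, ("not", P)]))      # an unsatisfiable set
print(entails([P, ("not", P)], Q))       # an unsatisfiable set entails anything
```

This only tests the semantic side; the link to ND-consistency is exactly what soundness and completeness provide.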

From here on, I'll use consistent only to mean ND-consistent unless I specify otherwise, and use satisfiable for semantic consistency.

Deductive Completeness

Our ultimate goal is to prove completeness of a logical language. But first, we'll need to talk about the completeness of a set of sentences:

Definition (Semantic Completeness): $\Gamma$ is semantically complete when for all sentences $\Phi$, either $\Gamma \vDash \Phi$ or $\Gamma \vDash \neg \Phi$ (or both)
Definition (Deductive Completeness): $\Gamma$ is deductively complete when for all sentences $\Phi$, either $\Gamma \vdash \Phi$ or $\Gamma \vdash \neg \Phi$ (or both)

Semantically complete sets are like structures, ascribing truth values to all sentences whenever they are satisfied. Deductively complete sets act likewise but with proofs.

Combining ND-consistency and ND-completeness allows us to prove a very intuitive result about proofs:

Consistency and Completeness Lemma: Suppose $\Gamma \subseteq \textrm{Sen}(\mathcal{L}_1)$ is ND-consistent and -complete. Then for all sentences $\Phi,\Psi$:

(i) $\Gamma \vdash \neg \Phi$ iff $\Gamma \nvdash \Phi$
(ii) $\Gamma \vdash \Phi \wedge \Psi$ iff $\Gamma \vdash \Phi$ and $\Gamma \vdash \Psi$
(iii) $\Gamma \vdash \Phi \vee \Psi$ iff $\Gamma \vdash \Phi$ or $\Gamma \vdash \Psi$ (or both)
(iv) $\Gamma \vdash \Phi \rightarrow \Psi$ iff $\Gamma \nvdash \Phi$ or $\Gamma \vdash \Psi$ (or both)
(v) $\Gamma \vdash \Phi \leftrightarrow \Psi$ iff $\Gamma \vdash \Phi$ and $\Gamma \vdash \Psi$, or $\Gamma \nvdash \Phi$ and $\Gamma \nvdash \Psi$

This lemma essentially allows us to formalize that proofs act how we would expect them to; it gives us a bridge, showing that consistency and completeness are enough to make proofs and derivability act like truth in a structure.

Proof: We'll just verify each case individually.

Case (i):

$(\Rightarrow)$ Say $\Gamma \vdash \neg \Phi$. By its ND-consistency, $\Gamma \nvdash \Phi$.
$(\Leftarrow)$ Say $\Gamma \nvdash \Phi$. By its ND-completeness, $\Gamma \vdash \neg \Phi$.

Case (ii):

$(\Rightarrow)$ If $\Gamma \vdash \Phi \wedge \Psi$, then by $\wedge$-Elim, $\Gamma \vdash \Phi$ and $\Gamma \vdash \Psi$.
$(\Leftarrow)$ If $\Gamma \vdash \Phi$ and $\Gamma \vdash \Psi$, we can combine these two proofs into one bigger proof by applying $\wedge$-Intro, netting us a proof of $\Gamma \vdash \Phi \wedge \Psi$.

Cases (iii), (iv), and (v) are all similar: look at the given proof we assume, and extend it via natural deduction to get the conclusion we want. $ \ \blacksquare$

Maximally Consistent Sets

The central idea that we'll use to prove the completeness of $\mathcal{L}_1$ is maximally consistent sets.

Definition (Maximally Consistent Sets): A set $\Gamma$ is maximally consistent (or maximally D-consistent with respect to a proof system $D$) when $\Gamma$ is ND-consistent, and if $\Gamma \cup \{\Phi \}$ is ND-consistent, then $\Phi \in \Gamma$.

We can think of maximally consistent sets as model universes, completely filled with as many rules (sentences) that determine the laws and facts of this universe without contradicting itself. One more sentence and $\Gamma$ would no longer be consistent.

Conveniently enough, maximally consistent sets hold the two properties we just discussed.

Maximally Consistent Sets Are Complete: If $\Gamma \subseteq \textrm{Sen}(\mathcal{L}_1)$ is maximally consistent, then it is ND-consistent and ND-complete.

Proof: By definition of maximal consistency, $\Gamma$ is ND-consistent.

Now suppose for a contradiction that $\Gamma$ was ND-incomplete, i.e. there's a sentence $\Phi$ such that $\Gamma \nvdash \Phi$ and $\Gamma \nvdash \neg \Phi$. I claim that then both $\Phi \in \Gamma$ and $\neg \Phi \in \Gamma$, making $\Gamma$ inconsistent and showing our assumption of its incompleteness incorrect. To show a sentence $\Psi \in \Gamma$, we can leverage the definition of maximal consistency: it suffices to show that $\Gamma \cup \{\Psi \}$ is consistent.

Consider for contradiction that $\Gamma \cup \{\neg \Phi \}$ is inconsistent. Then, it would prove everything, and in particular, it would also prove $\Gamma \cup \{\neg \Phi \} \vdash \Phi$. By $\neg$-Elim, we can see that also $\Gamma \vdash \Phi$ since $\neg \Phi$ can be discarded in that proof. But we already know that $\Gamma \nvdash \Phi$, so it must be that $\Gamma \cup \{\neg \Phi \}$ is consistent. By maximal consistency then, $\neg \Phi \in \Gamma$.
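Same argument as before, but assuming $\Gamma \cup \{\Phi \}$ is inconsistent: it would then prove $\neg \Phi$, and by $\neg$-Intro we could discharge $\Phi$ to get $\Gamma \vdash \neg \Phi$, contradicting $\Gamma \nvdash \neg \Phi$. So $\Gamma \cup \{\Phi \}$ is consistent, and we can similarly conclude $\Phi \in \Gamma$.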

Since $\{\Phi, \neg \Phi\} \subseteq \Gamma$, we have $\Gamma \vdash \Phi$ and $\Gamma \vdash \neg \Phi$ trivially, and thus with $\neg$-Elim, $\Gamma$ proves everything. Hence it is inconsistent, which contradicts that it is maximally consistent. Hence $\Gamma$ must be ND-complete. $ \ \blacksquare$

Another nice property of maximally consistent sets is that provability and membership are equivalent in $\Gamma$.

Membership Lemma: If $\Gamma$ is maximally consistent, then $\Gamma \vdash \Phi$ iff $\Phi \in \Gamma$.

Proof:

$(\Leftarrow)$ Clearly if $\Phi \in \Gamma$, then $\Gamma \vdash \Phi$ trivially.
$(\Rightarrow)$ Since $\Gamma \vdash \Phi$, by the consistency of $\Gamma$, we also have $\Gamma \nvdash \neg \Phi$ (since if it proved both, then we would be able to prove any sentence with $\neg$-Elim). We want to show $\Phi \in \Gamma$, which would be true if $\Gamma \cup \{\Phi\}$ was consistent (by definition of a maximally consistent set). Say for a contradiction that $\Gamma \cup \{\Phi\}$ was inconsistent, that is, it proves everything. If it proves everything, in particular, it also proves $\Gamma \cup \{\Phi\} \vdash \neg \Phi$. But then also, $\Gamma \vdash \neg \Phi$, since by $\neg$-Intro, we can discard the premise $\Phi$ as an assumed premise that we do not actually need concretely. This contradicts the fact that $\Gamma \nvdash \neg \Phi$, so our assumption must be wrong. Therefore, $\Gamma \cup \{\Phi\}$ is consistent, and hence $\Phi \in \Gamma$ by definition of a maximally consistent set.

$\blacksquare$

With these two lemmas in mind, we can now generalize membership for $\Gamma$ even further:

Generalized Membership Lemma: Suppose $\Gamma \subseteq \textrm{Sen}(\mathcal{L}_1)$ is maximally consistent. Then for all sentences $\Phi,\Psi$:

$\neg\Phi \in \Gamma$ iff $\Phi \notin \Gamma$
$\Phi \wedge \Psi \in \Gamma$ iff $\Phi \in \Gamma$ and $\Psi \in \Gamma$
$\Phi \vee \Psi \in \Gamma$ iff $\Phi \in \Gamma$ or $\Psi \in \Gamma$ (or both)
$\Phi \rightarrow \Psi \in \Gamma$ iff $\Phi \notin \Gamma$ or $\Psi \in \Gamma$ (or both)
$\Phi \leftrightarrow \Psi \in \Gamma$ iff $\Phi \in \Gamma$ and $\Psi \in \Gamma$, or $\Phi \notin \Gamma$ and $\Psi \notin \Gamma$

Proof: We've already done all the work in the previous lemmas. $\Gamma$ is maximally consistent, so it's ND-consistent and ND-complete. By the Consistency and Completeness Lemma, we have all those cases about provability true in $\Gamma$. By the Membership Lemma, we can replace all those proofs with clauses of membership. $ \ \blacksquare$

Now compare each of these cases of the Generalized Membership Lemma to what satisfaction looks like in a structure. The relations are identical! This is the key property of maximally consistent sets: membership in $\Gamma$ acts like truth in an $\mathcal{L}_1$-structure. Likewise, we are able to not only make a maximally consistent set $\Gamma$, but also create a structure specifically tailored to satisfy $\Gamma$.

Constructing a Maximally Consistent Set

Maximally consistent sets are filled with as many sentences as possible while remaining consistent. So, if we wanted to make a maximally consistent set, we could just look at every sentence and see if we can add it to our bucket while remaining consistent.

Completeness Lemma 1: Suppose we have an ND-consistent set $\Gamma$. Then there is a maximally consistent set $\Gamma^{+}$ with $\Gamma \subseteq \Gamma^{+}$.

Proof: The idea is that there are countably infinitely many sentence letters in $\mathcal{L}_1$, and therefore countably infinitely many sentences in $\mathcal{L}_1$ too (each sentence is a finite string over a countable alphabet, and there are only countably many such strings). So we just go through all the possible sentences, add each one to $\Gamma$ if doing so preserves consistency, and at the end we obtain not just a consistent, but a maximally consistent set.

Let $\textrm{Sen}(\mathcal{L}_1) = \{\Phi_0, \Phi_1, \Phi_2, \cdots\}$, and let $\Gamma_0 = \Gamma$. We then define the recursion

$$ \Gamma_{n+1} = \begin{cases} \Gamma_n \cup \{\Phi_n\} \ \textrm{if it is ND-consistent} \\ \Gamma_n \ \textrm{otherwise} \end{cases} $$

Then let $\Gamma^{+} = \bigcup_{n} \Gamma_n$. By construction, $\Gamma_n$ is consistent and $\Gamma_n \subseteq \Gamma^+$ for all $n$. Importantly, including $n=0$, so $\Gamma = \Gamma_0 \subseteq \Gamma^+$.

First we show that $\Gamma^+$ is ND-consistent. Say it wasn't. Then $\Gamma^+$ proves every sentence, so $\Gamma^+ \vdash \varphi$ and $\Gamma^+ \vdash \neg \varphi$ for some sentence $\varphi$. Proofs must be finite (what would an infinite proof look like?), so we only need finitely many premises to prove these claims:
- $\{\gamma_1, \gamma_2, \cdots, \gamma_m \} \vdash \varphi$ where $\{\gamma_1, \gamma_2, \cdots, \gamma_m \} \subseteq \Gamma^+$
- $\{\delta_1, \delta_2, \cdots, \delta_n \} \vdash \neg\varphi$ where $\{\delta_1, \delta_2, \cdots, \delta_n \} \subseteq \Gamma^+$

Here $m$ and $n$ are some finite numbers. Each of the finitely many sentences in $\{\gamma_1, \gamma_2, \cdots, \gamma_m \} \cup \{\delta_1, \delta_2, \cdots, \delta_n \}$ belongs to $\Gamma^+ = \bigcup_{n} \Gamma_n$, so each one lies in some stage of the construction; since the stages only ever grow, all of them lie in $\Gamma_k$ for some large enough $k$. But then $\Gamma_k$ would be inconsistent as $\Gamma_k \vdash \varphi$ and $\Gamma_k \vdash \neg\varphi$, which cannot happen by construction. So $\Gamma^{+}$ is ND-consistent.

Now we need to show $\Gamma^{+}$ is maximally consistent. Say $\Gamma^{+} \cup \{\varphi\}$ is consistent. $\varphi$ must have appeared in our enumeration of $\textrm{Sen}(\mathcal{L}_1)$ at some point, i.e. $\varphi = \Phi_k$ for some $k$. $\Gamma^{+} \cup \{\Phi_k\}$ is consistent, and $\Gamma_k \subseteq \Gamma^{+}$, so it follows that $\Gamma_k \cup \{\Phi_k\}$ is consistent (subsets of consistent sets are consistent). So by our recursive construction of $\Gamma_{k+1}$, we have $\varphi = \Phi_k \in \Gamma_{k+1} \subseteq \Gamma^{+}$, so $\varphi \in \Gamma^{+}$. Hence $\Gamma^{+}$ is maximally consistent.

$\blacksquare$
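The recursion in this proof is concrete enough to run on a toy example. The sketch below makes two loud simplifications that are mine, not the proof's: it only enumerates a finite fragment of $\textrm{Sen}(\mathcal{L}_1)$ over two sentence letters, and it uses brute-force satisfiability as a stand-in for ND-consistency (which the earlier claim justifies for a sound and complete system). All function names are hypothetical.

```python
from itertools import product

def holds(f, v):
    """Truth value of formula f under valuation v (dict: letter -> bool)."""
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == "not":
        return not holds(f[1], v)
    if op == "and":
        return holds(f[1], v) and holds(f[2], v)
    if op == "or":
        return holds(f[1], v) or holds(f[2], v)
    if op == "imp":
        return (not holds(f[1], v)) or holds(f[2], v)
    raise ValueError(f"unknown connective: {op!r}")

def satisfiable(gamma, names):
    """Stand-in for ND-consistency: is there a valuation of `names` satisfying gamma?"""
    return any(
        all(holds(g, dict(zip(names, values))) for g in gamma)
        for values in product([True, False], repeat=len(names))
    )

def enumerate_sentences(names, depth):
    """A finite fragment Phi_0, Phi_1, ...: sentences with connective nesting <= depth."""
    level = list(names)
    for _ in range(depth):
        bigger = [("not", f) for f in level]
        bigger += [(op, f, g) for op in ("and", "or", "imp") for f in level for g in level]
        level += [f for f in bigger if f not in level]
    return level

def lindenbaum(gamma, enumeration, names):
    """Follow the recursion: add Phi_n to Gamma_n whenever the result stays consistent."""
    current = list(gamma)                      # Gamma_0 = Gamma
    for phi in enumeration:                    # phi plays the role of Phi_n
        if phi not in current and satisfiable(current + [phi], names):
            current.append(phi)                # Gamma_{n+1} = Gamma_n plus Phi_n
    return current                             # the union of all the stages

names = ["P", "Q"]
gamma = [("or", "P", "Q"), ("not", "P")]
fragment = enumerate_sentences(names, 1)
gamma_plus = lindenbaum(gamma, fragment, names)

print("Q" in gamma_plus, "P" in gamma_plus)
```

On this example the extension picks up $Q$ and rejects $P$, and no sentence of the fragment outside `gamma_plus` can be added without breaking satisfiability, which is the finite analogue of maximal consistency.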

So now we have a way of making a maximally consistent set.

Satisfying a Maximally Consistent Set

We showed earlier maximally consistent sets treat membership almost identically to how sentences are true in a structure. As such, it shouldn't be surprising that we can create a structure that exclusively revolves around a maximally consistent set.

Completeness Lemma 2: If $\Gamma$ is a maximally consistent set, then there's an $\mathcal{L}_1$ structure $A_{\Gamma}$ such that $A_{\Gamma} \vDash \Phi$ iff $\Phi \in \Gamma$.

Proof: We prove this by giving an explicit structure: for a sentence letter $\alpha$, let $A_{\Gamma} \vDash \alpha$ iff $\alpha \in \Gamma$. We now show with induction on the complexity of sentences, that for all sentences $\Phi$ we still have $A_{\Gamma} \vDash \Phi$ iff $\Phi \in \Gamma$.

Base Case: We defined $A_{\Gamma}$ precisely so that our condition holds for sentence letters, so the base case holds.

Inductive Step: Say this holds up to complexity $n$, and $\Phi$ is of complexity $n+1$. As with so many proofs before, there are 5 cases to consider, one for each connective in $\mathcal{L}_1$. The key tool we'll use is the Generalized Membership Lemma for maximally consistent sets.

$\underline{\Phi = \neg \Psi}:$ $\Gamma$ is maximally consistent by assumption, so by the Generalized Membership Lemma, $\Phi = \neg \Psi \in \Gamma$ iff $\Psi \notin \Gamma$. By the inductive hypothesis, $A_{\Gamma} \vDash \Psi$ iff $\Psi \in \Gamma$, so $\Psi \notin \Gamma$ iff $A_{\Gamma} \nvDash \Psi$. Finally, by the semantic rule for $\neg$, $A_{\Gamma} \nvDash \Psi$ iff $A_{\Gamma} \vDash \neg\Psi$. Chaining these together, $\Phi \in \Gamma$ iff $A_{\Gamma} \vDash \Phi$.

$\underline{\Phi = \Psi_1 \wedge \Psi_2}:$ By the Generalized Membership Lemma, $\Phi = \Psi_1 \wedge \Psi_2 \in \Gamma$ iff $\Psi_1 \in \Gamma$ and $\Psi_2 \in \Gamma$. By the inductive hypothesis, this holds iff $A_{\Gamma} \vDash \Psi_1$ and $A_{\Gamma} \vDash \Psi_2$. By the truth rules for $\wedge$, this in turn holds iff $A_{\Gamma} \vDash \Psi_1 \wedge \Psi_2$, and so $\Phi \in \Gamma$ iff $A_{\Gamma} \vDash \Phi$.

The other connectives are more of the same of what we've done before. Hence we can create a structure that exactly and only satisfies members of a maximally consistent set.

$\blacksquare$
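Here is a small sketch of the structure $A_{\Gamma}$ from this lemma, under the same kind of simplifying assumptions as any finite experiment: a tiny fragment of sentences over two letters, and satisfiability standing in for ND-consistency when building the maximal set. The names `structure_from`, `fragment`, and `gamma_plus` are hypothetical.

```python
from itertools import product

def holds(f, v):
    """Truth value of formula f under valuation v (dict: letter -> bool)."""
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == "not":
        return not holds(f[1], v)
    if op == "and":
        return holds(f[1], v) and holds(f[2], v)
    if op == "or":
        return holds(f[1], v) or holds(f[2], v)
    raise ValueError(f"unknown connective: {op!r}")

def satisfiable(gamma, names):
    """Stand-in for ND-consistency over a fixed list of sentence letters."""
    return any(
        all(holds(g, dict(zip(names, values))) for g in gamma)
        for values in product([True, False], repeat=len(names))
    )

names = ["P", "Q"]
# A small fragment: letters first, then negations, conjunctions, and disjunctions of letters.
fragment = list(names) + [("not", p) for p in names]
fragment += [(op, p, q) for op in ("and", "or") for p in names for q in names]

# Greedily extend a starting set to one that is maximal within the fragment.
gamma_plus = [("or", "P", "Q"), ("not", "P")]
for phi in fragment:
    if phi not in gamma_plus and satisfiable(gamma_plus + [phi], names):
        gamma_plus.append(phi)

def structure_from(gamma, names):
    """A_Gamma: a sentence letter is true exactly when it is a member of gamma."""
    return {p: (p in gamma) for p in names}

A = structure_from(gamma_plus, names)
agree = all(holds(phi, A) == (phi in gamma_plus) for phi in fragment)
print(A, agree)
```

The check `agree` compares truth in `A` with membership in `gamma_plus` for every sentence of the fragment, which is the finite shadow of "$A_{\Gamma} \vDash \Phi$ iff $\Phi \in \Gamma$".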

The Proof of Completeness

We've proven a lot about maximally consistent sets, and how they share a lot of properties with structures, and in doing so we've actually done pretty much all the work we need to fully prove the completeness of $\mathcal{L}_1$.

Completeness Theorem: If $\Gamma \vDash \Phi$, then $\Gamma \vdash \Phi$.

Proof: We'll prove the contrapositive: if $\Gamma \nvdash \Phi$, then $\Gamma \nvDash \Phi$. So suppose that $\Gamma \nvdash \Phi$. As we've seen many times before, if $\Gamma \nvdash \Phi$, then $\Gamma \cup \{\neg \Phi\}$ is consistent. By Completeness Lemma 1, we can make a maximally consistent set $\Gamma^{+}$ out of $\Gamma \cup \{\neg\Phi\}$ i.e. $\Gamma \cup \{\neg\Phi\} \subseteq \Gamma^{+}$. By Completeness Lemma 2, we can find a structure $A$ such that $A \vDash \varphi$ iff $\varphi \in \Gamma^{+}$. Since $\Gamma \cup \{\neg\Phi\} \subseteq \Gamma^{+}$, we have $A \vDash \Gamma$ and $A \vDash \neg\Phi$. By the semantic rule for $\neg$, we also have $A \nvDash \Phi$. Thus by the definition of entailment $\Gamma \nvDash \Phi$.

$\blacksquare$
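The contrapositive has a very hands-on reading: when an argument is not valid, there is a concrete structure witnessing that. For small propositional examples we can find that structure by brute force instead of going through maximally consistent sets. This is only an illustration with hypothetical names (`countermodel` in particular), not the construction used in the proof.

```python
from itertools import product

def letters(f):
    """Collect the sentence letters occurring in a formula."""
    if isinstance(f, str):
        return {f}
    return set().union(*(letters(sub) for sub in f[1:]))

def holds(f, v):
    """Truth value of formula f under valuation v (dict: letter -> bool)."""
    if isinstance(f, str):
        return v[f]
    op = f[0]
    if op == "not":
        return not holds(f[1], v)
    if op == "and":
        return holds(f[1], v) and holds(f[2], v)
    if op == "or":
        return holds(f[1], v) or holds(f[2], v)
    if op == "imp":
        return (not holds(f[1], v)) or holds(f[2], v)
    raise ValueError(f"unknown connective: {op!r}")

def countermodel(gamma, phi):
    """Return a valuation satisfying gamma and not-phi, or None if gamma entails phi."""
    names = sorted(letters(phi).union(*(letters(g) for g in gamma)))
    for values in product([True, False], repeat=len(names)):
        v = dict(zip(names, values))
        if all(holds(g, v) for g in gamma) and not holds(phi, v):
            return v
    return None

P, Q = "P", "Q"
print(countermodel([("imp", P, Q), Q], P))   # an invalid argument has a countermodel
print(countermodel([P, ("imp", P, Q)], Q))   # a valid argument has none
```

The first call returns a valuation with $P$ false and $Q$ true, and the second returns `None`.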

It's a relatively short proof, but that's mostly because we shoved a lot of the proof into the Completeness Lemmas 1 and 2. Not to mention, I don't know about you, but it is definitely not obvious to me that maximally consistent sets would be the tool that paved the path forward to prove completeness, let alone even have the intuition that $\mathcal{L}_1$ is complete. To realize that satisfiability in a structure is like membership in a maximally consistent set is already impressive, but to then think to actually link the two for this proof is very slick.

As a final aside, we can now deem $\mathcal{L}_1$ as having special status:

Adequacy Theorem: $\Gamma \vdash \Phi$ if and only if $\Gamma \vDash \Phi$.

Completeness of Other Logics

As before, I won't go into full detail, but $\mathcal{L}_2$ and $\mathcal{L}_=$ are likewise both complete logics. Extending the proof of completeness is very similar to how we would extend soundness: we just need to account for the quantifiers $\forall$ and $\exists$.

First we would need to extend the Generalized Membership Lemma:
- $\exists v \Phi \in \Gamma$ iff $\Phi[t/v] \in \Gamma$ for some constant $t$
- $\forall v \Phi \in \Gamma$ iff $\Phi[t/v] \in \Gamma$ for all constants $t$
Next would be to show that Completeness Lemma 2 still holds, that is, every maximally consistent set of $\mathcal{L}_2$ sentences is satisfiable
Then the proof for completeness remains the same as before with $\mathcal{L}_1$

Part 4: What's Next?

This is where my logic sequence ended this year. But of course that does not mean that this is where logic as a field stops, too.

More To Be Studied!

Even before considering other logics or flaws within our current one, there are still many extremely important results to be studied within $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_=$.

Compactness

One particularly strong result that deserves its own post is the Compactness Theorem:

Compactness Theorem: If every finite subset of a set of sentences $\Gamma$ is satisfiable, then the whole set $\Gamma$ is satisfiable.

We've considered finite sets of sentences a lot already (as that's what proofs required of us), and the Compactness Theorem trivially holds for them since if $\Gamma$ is finite, then $\Gamma \subseteq \Gamma$ is a finite subset of itself. But for infinite sets of sentences, it's not entirely clear when they are satisfiable or not, and Compactness gives us a more tractable way of evaluating satisfiability.

The contrapositive gives us the equivalent formulation:

Compactness Theorem: If $\Gamma$ is unsatisfiable, then there is a finite subset $\Gamma^{\textrm{fin}} \subseteq \Gamma$ that is unsatisfiable.

For finite sets this is trivial, since the unsatisfiable finite subset can be $\Gamma$ itself; in fact it sometimes has to be, as with the unsatisfiable set $\{P, \neg P\}$, whose proper subsets are all satisfiable. The real content of the theorem is that it is guaranteed to hold for infinite sets too! This has the immediate consequence that every valid argument $\Gamma \vDash \Phi$ can be captured by an argument with finitely many premises.

Alternate Form of Compactness: If $\Gamma \vDash \Phi$, then there is a finite subset $\Gamma^{\textrm{fin}} \subseteq \Gamma$ such that $\Gamma^{\textrm{fin}} \vDash \Phi$.

Proof: If $\Gamma$ is finite already, then the claim holds as it is. So consider $\Gamma$ to be infinite. Recall that $\Gamma \vDash \Phi$ iff $\Gamma \cup \{\neg \Phi\}$ is unsatisfiable. Then by Compactness, there is a finite subset $\Delta \subseteq \Gamma \cup \{\neg\Phi \}$ that is unsatisfiable. Let $\Gamma^{\textrm{fin}}$ be $\Delta$ with $\neg\Phi$ removed (if it was there at all), so that $\Gamma^{\textrm{fin}}$ is a finite subset of $\Gamma$. Since $\Delta \subseteq \Gamma^{\textrm{fin}} \cup \{\neg\Phi \}$ and any superset of an unsatisfiable set is unsatisfiable, $\Gamma^{\textrm{fin}} \cup \{\neg\Phi \}$ is unsatisfiable, and so $\Gamma^{\textrm{fin}} \vDash \Phi$. $ \ \blacksquare$

Finally, it's worth noting that Compactness quickly follows for any logic that is both sound and complete:

Proof of Compactness:

$\begin{array}{c|c} \Gamma \vDash \Phi & \textrm{Assumption} \\ \Gamma \vdash \Phi & \textrm{Completeness} \\ \Gamma^{\textrm{fin}} \vdash \Phi & \textrm{Proofs are finite} \\ \Gamma^{\textrm{fin}} \vDash \Phi & \textrm{Soundness} \\ \end{array}$

There are more ways to prove and apply Compactness, but that's for another day.

Löwenheim-Skolem Theorem

Löwenheim-Skolem Theorem: If $\Gamma$ is a set of $\mathcal{L}_=$ sentences satisfied by an infinite structure of cardinality $\omega$, then

(Upward) $\Gamma$ is satisfied by a structure of every cardinality $\omega' > \omega$
(Downward) $\Gamma$ is satisfied by a countable structure

The proofs can be found in Elements of Deductive Logic (and they only hold in first-order logic!), but there's an interesting paradox that comes along with the theorem:

Skolem's Paradox: The Löwenheim-Skolem Theorem says that no first-order theory can limit the size of the infinite structures that satisfy it. Set theory is a first-order theory, so the theorem would say that there is a countable structure that satisfies set theory. But set theory entails that there are uncountable sets as well. How can a structure with only countably many elements satisfy a theory that says uncountable sets exist?

I'll leave you to think about that, but the resolution to this dilemma (also in Eagle) leads to what is now known as the non-absoluteness of set theory; a set may be uncountable relative to one structure, but countable to another.

Lindström's Theorem

We've looked at different ways logic can be limited in some math-y ways, like with expressive adequacy. The previous two theorems give some more implicit restrictions on what a first-order logic can do. Compactness says that $\mathcal{L}_=$ can't discern finite sets from "pseudo-finite" ones (infinite sets whose finite subsets are satisfiable), and Löwenheim-Skolem says that it can't differentiate between the cardinalities of infinite structures. However, Lindström's theorem tells us that first-order logic is actually the strongest logic that satisfies both Compactness and (specifically the downward) Löwenheim-Skolem. Different definitions can yield different results, but analyzing the relative strength of logics is an area still to explore. Lindström uses the idea of overlapping "good structures" between two logics to determine their relative strengths.

Natural Language and Grice's Maxims

Remember, we started building logic with the intention to formalize English and natural language arguments. But obviously our formalizations above can only go so far, and one of the first, early signs of the "weirdness" logic might bring with it is the case of expressing, "If… then…" If you recall, we had the following truth tables:

$ \begin{array}{c|c|c} P & Q & \textrm{If} \ P \ \textrm{then} \ Q \\ \hline T & T & ? \\ T & F & F \\ F & T & ? \\ F & F & ? \\ \end{array} \ \ \ \ \begin{array}{c|c|c} P & Q & P \rightarrow Q \\ \hline T & T & T \\ T & F & F \\ F & T & T \\ F & F & T \\ \end{array} $

To get rid of those question marks, we decided to use $\rightarrow$, since treating the undetermined rows as (vacuously) true was easier than anything else. So the only real information $\rightarrow$ carries is when it evaluates to false, telling us when one fact definitely does not result in another fact. For this reason, we call $\rightarrow$ the material implication as it only really cares about the current state of the actual world: if something $\Phi$ is true in our world and another thing $\Psi$ is false, it cannot be the case that in our world $\Phi \rightarrow \Psi$.
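For reference, the material implication is simple enough to tabulate in a couple of lines. This is just a sketch of the truth table above, with `implies` being a name I made up for the standard "not $P$, or $Q$" reading of $\rightarrow$.

```python
from itertools import product

def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

# Rows in the same order as the truth table: TT, TF, FT, FF.
rows = [(p, q, implies(p, q)) for p, q in product([True, False], repeat=2)]
for p, q, value in rows:
    print(p, q, value)
```

The only row that comes out `False` is the one with a true antecedent and a false consequent; every row with a false antecedent is `True` no matter what the consequent says.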

But that's not really how we usually use "If… then…" When someone uses "if", it does not necessarily mean that they know that fact is true, but they're supposing that it's true; "if" can denote a hypothetical. Compare the following conditionals:

If Apple merges with Google, then Samsung does not have a monopoly on smartphones.
If Apple merged with Google, then Samsung would not have a monopoly on smartphones.

The use of "would" in the second case indicates something that hasn't actually happened, but describes a perfectly understandable situation that we can still parse information from. In the first sentence, $\rightarrow$ would deem that a true sentence since our antecedent is false i.e. Apple and Google have not merged in the actual world. But in the second case, that sentence is false, since it is conceivable that even if Apple and Google merged in an alternate universe, Samsung could still be the number one seller of smartphones.

The counterfactual implication, denoted by the fancy symbol $\square \!\! \rightarrow$, captures these hypothetical situations we use all the time that our material implication just fails to render accordingly.

Grice's Maxims

Note what we introduced above with $\square \!\! \rightarrow$ had no concrete semantics attached to it; our idea of hypotheticals, and "would" only arose out of what we understand in natural language, instead of something problematic from within the logic (or even English) itself. As such, this doesn't really give us a new truth table or anything, and unfortunately for logic, there are quite a few of these tacit rules to English.

For example, if I say, "Either I will pass my job interview, or the other candidate will get in," there's an additional piece of information that's communicated beyond just the two possibilities: I don't know which case will occur. And according to our semantic rules for $\vee$ that formalize "…or…", I could entail a disjunction by knowing a disjunct. That is, if I know $\Phi$, I certainly also know $\Phi \vee \Psi$ according to the truth table for $\vee$; $\Phi \vDash \Phi \vee \Psi$. Strictly speaking then, saying "I will pass my job interview" just contains more information as not only does it entail that "or" sentence I said at the beginning, but also an additional sentence (itself).

So if someone were to use a statement involving "…or…", we would expect them to be giving us exactly as much information as they have: namely, that they don't know which side of the "…or…" statement holds, since if they did, they would just say it.

Paul Grice (1975) formalized these cooperative principles of communication into what are now known as Grice's maxims, which describe these underlying, hidden aspects of language that we always use, but never explicitly write out or explain (since they are just expected). In no particular order, they are:

Quantity: contribute as much information as is required without excess or lack of details.
Quality: contribute only what one believes to be true and has evidence for.
Relation: contribute to the subject matter at hand; only be relevant.
Manner: be direct, clear, and avoid obscurity and esoteric constructions and words.

Notably the maxim of quantity is what poses a difficulty for our logical language, as seen with "…or…" and $\vee$. Incorporating these maxims, though, is a possible direction one could look to enhance or strengthen their logic's utility and applicability.

Extensions of Logic

Our logic above has done pretty well for what it needs to do concerning truth, and has found its way into computer science and discrete math as well by its very nature. But there are some arguments that go beyond truth alone.

Higher-Order Logics

A natural (and actually well-practiced) addition to logic is higher-order types. What we've been working with is known as first-order logic, as we can only quantify over objects. That is, $\forall$ and $\exists$ only range over single objects that satisfy a formula. But there are many things we can say beyond just objects. For example, in math we are often able to make claims about sets of numbers. The least-upper-bound property of the real numbers states that every nonempty subset of the real numbers that is bounded above has a least upper bound (i.e. the supremum exists).

Or even simpler, for any property $P$ and any object $x$, it must be the case that $Px \vee \neg Px$.

Second-order logic is the upgrade we need that allows us to start quantifying over sets of objects. In first-order logic, it was predicates that took on the semantic value of sets of objects, so naturally second-order quantification looks like quantifying over predicates. So for our above example, it would look like: $\forall P \forall x(Px \vee \neg Px)$ where $P$ is now a variable standing in place for a predicate.

Quantifying over predicates/properties/sets of objects allows us to say quite a bit more as you'd expect. Earlier we mentioned the difference between numerical identity and qualitative identity, but now we are able to formalize how they relate to each other:

The Indiscernibility of Identicals: $\forall x \forall y (x = y \rightarrow \forall P(Px \leftrightarrow Py))$

If two things are numerically identical, then they certainly have all the same properties

The Identity of Indiscernibles: $\forall x \forall y (\forall P(Px \leftrightarrow Py) \rightarrow x = y)$

If two things have all the same properties, then they should be the same object

These are known as Leibniz's laws, and as discussed, the first is usually uncontested, while the second is more controversial. Whatever the case may be, second-order logic has given us a way to more clearly pick apart these claims.
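On a small finite domain we can actually run the second-order quantifier $\forall P$ by brute force. The sketch below is only an illustration under my own assumptions: predicates are modeled as subsets of a three-element domain, and the names `all_predicates` and `indiscernible` are made up for this example.

```python
from itertools import combinations

def all_predicates(domain):
    """Every predicate on a finite domain, stored as the set of objects it holds of."""
    items = list(domain)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def indiscernible(x, y, predicates):
    """True when x and y agree on every predicate: forall P (Px <-> Py)."""
    return all((x in P) == (y in P) for P in predicates)

domain = {"a", "b", "c"}
predicates = all_predicates(domain)

# Indiscernibility of identicals: x = y implies agreement on every P.
identicals_ok = all(indiscernible(x, x, predicates) for x in domain)

# Identity of indiscernibles: agreement on every P implies x = y.
indiscernibles_ok = all(
    x == y for x in domain for y in domain if indiscernible(x, y, predicates)
)

# With a poorer stock of properties, distinct objects can become indiscernible.
few_predicates = [frozenset({"a", "b"})]
a_b_look_identical = indiscernible("a", "b", few_predicates)
```

When every subset counts as a property, the singleton $\{x\}$ separates $x$ from everything else, so the second law holds automatically here. With only the one property in `few_predicates`, the distinct objects `a` and `b` agree on everything, which hints at why the second law depends on what we allow as a property.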

But why stop there? First-order logic allowed us to quantify over objects; second-order logic allowed us to quantify over sets of objects (predicates). Third-order logic allows us to quantify over sets of sets of objects, and the pattern continues. Second-order logic in particular is the most widely "used" from mathematics to computer science, but it should not be held to be necessarily "better" than first-order logic: second-order logic is actually incomplete, meaning that under its standard semantics, no sound and effective proof system derives every valid sentence.

Higher-order logic has its pulls and drawbacks, but it is a natural step from what we've looked at today, especially if you're interested in the foundations of mathematics and set theory.

"Universal" Logics

We already saw with the counterfactual conditional $\square \!\! \rightarrow$ that there are some obvious cases our logic lacks. Not to mention, the ideas of possibility and necessity pop up so often in philosophy and arguments that it's a little disconcerting that we can't already formalize them.

Modal logic extends our logic precisely in this way, with two new operators: We express, "It is possibly the case that P," as $\lozenge P$, and, "It is necessarily the case that P," as $\square P$. The semantics of $\lozenge$ and $\square$ take our ideas of a structure as a model of the universe a little too literally. We say $\lozenge P$ is true if there is a possible world in which $P$ is true, and we say $\square P$ is true if in all possible worlds $P$ is true. The notion of a possible world is by nature vague, but it is an interesting one to consider. Ignoring semantics, we can already deduce a lot with these operators:

$\neg \square \Phi \equiv \lozenge \neg \Phi$
- If it is not necessary that $\Phi$ occurs, there is a possible world in which $\Phi$ does not occur (i.e. $\neg \Phi$ occurs)
$\neg \lozenge \Phi \equiv \square \neg \Phi$
- If it is not possible that $\Phi$ can occur, then it is necessarily the case that $\Phi$ does not occur (i.e. $\neg \Phi$ occurs)

As might be expected by the way we discussed their semantics, $\lozenge$ and $\square$ act a lot like $\exists$ and $\forall$ respectively.
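The two dualities can be checked on a toy collection of possible worlds. This is a simplified sketch under my own assumptions: each world is just the set of sentence letters true there, every world counts as possible from every other, and the function names `possibly`, `necessarily`, and `negate` are invented for the example.

```python
# Hypothetical toy model: each possible world is the set of sentence letters true there.
worlds = [{"P", "Q"}, {"P"}, {"Q"}]

def possibly(claim, worlds):
    """True when the claim holds in at least one possible world."""
    return any(claim(w) for w in worlds)

def necessarily(claim, worlds):
    """True when the claim holds in every possible world."""
    return all(claim(w) for w in worlds)

def negate(claim):
    """The claim that holds in exactly the worlds where the original fails."""
    return lambda w: not claim(w)

p = lambda w: "P" in w
p_or_q = lambda w: "P" in w or "Q" in w

for claim in (p, p_or_q):
    # "Not necessarily" matches "possibly not".
    assert (not necessarily(claim, worlds)) == possibly(negate(claim), worlds)
    # "Not possibly" matches "necessarily not".
    assert (not possibly(claim, worlds)) == necessarily(negate(claim), worlds)
```

Since `possibly` is just `any` and `necessarily` is just `all`, the resemblance to "there exists" and "for all" is visible directly in the code.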

Other extensions include temporal logics, which allow us to formalize time-sensitivity as well. The statement, "It is raining right now," might have a constant meaning, but whether it is true or not changes based on time, thus motivating the formalization of operators like "eventually", "always", "until", and others that describe time.

Free logics allow us to abandon the requirement that constants actually refer to anything, or even that the domain be non-empty. Just because we can talk about an imaginary, non-real thing, like a unicorn (i.e. has the body of a horse, has one horn, etc.), does not necessarily mean that a unicorn exists. Yet, if we allow unicorns into the domain of a structure to logically discuss them, it would appear that we would be claiming the existence of a mythical creature. Free logic avoids this commitment, which becomes quite useful when discussing metaphysics and any ontology or description of existence (in a way, free logic is all about specifying the conditions of existence).

Multivalue and Fuzzy Logics

We have only ever considered two possible ways to evaluate a sentence: it is either true or false. But is that really the case? Sounds kind of insane to even try and conceive of something else a sentence could be, but consider the following sentence:

This sentence is false.

If it is true, then it declares itself false. If it is false, then it declaring that it is false is wrong, so it is true. The cycle continues forever and ever, without ever resolving itself to one of our prescribed truth values.

So some propose a solution that includes a third, deviant truth value that allows us to ascribe some sentences as neither true nor false in a 3-value logic. Another unintuitive solution is to allow degrees of truth, i.e. the above sentence might be considered .5 true. Giving truth values between 0 and 1 is the cornerstone of fuzzy logic, which is usually used to describe situations where information is inexact. However, don't confuse this degree of truth with probability; it's not that this sentence is true only half the time, but rather that it's considered only half true since it is vague. Probability usually is associated with ignorance (lack of information; i.e. "the coin will land heads with probability .5" is said since we don't have knowledge of the future) as opposed to a lack of clarity. This notion is similar to the idea of degrees of membership in fuzzy set theory.
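Here is a small sketch of how degrees of truth can be computed. It assumes one common choice of fuzzy connectives (minimum for "and", maximum for "or", and $1 - a$ for "not"); other choices exist, and the function names are my own.

```python
def f_not(a: float) -> float:
    """Fuzzy negation: the degree to which a claim is not true."""
    return 1.0 - a

def f_and(a: float, b: float) -> float:
    """Fuzzy conjunction: only as true as the weaker conjunct."""
    return min(a, b)

def f_or(a: float, b: float) -> float:
    """Fuzzy disjunction: as true as the stronger disjunct."""
    return max(a, b)

# The liar sentence claims its own falsity, so its degree v should satisfy v = 1 - v.
liar = 0.5
assert f_not(liar) == liar

# "P or not P" is no longer guaranteed to be fully true.
excluded_middle = f_or(liar, f_not(liar))
print(excluded_middle)
```

Under these connectives the liar sentence settles at degree .5 instead of cycling forever, at the cost of "P or not P" only being half true for it.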

Altering the Axioms

The tautology $P \vee \neg P$ is quite a natural one. "Either it is or it is not." What other option could there be? Even in math we use double negatives multiplying to a positive, so positive and negative numbers seem to be the only cases possible. The law of excluded middle just seems to be a fact of the universe, one that falls perfectly in line with our proofs and semantics for the classical logic we've been looking at.

But I've alluded throughout the post that some people do not always accept this as a given. Intuitionistic, or constructive, logic rejects the law of excluded middle and argues that $\neg \neg P \nvDash P$, even though both excluded middle and double negation elimination come as standard inference rules in classical logic. The name comes from the fact that this doctrine encourages literal demonstrations of proof: if you have a proof of $P \wedge Q$, you literally have a proof of both $P$ and $Q$; a proof of $P \vee Q$ means one has a proof of $P$, or has a proof of $Q$.

Even more (ironically) unintuitive, we lose the ability to use proof by contradiction, as that relies on the ability to claim that if an assumption is wrong, its negation is true. So for classic proofs like the irrationality of $\sqrt{2}$, intuitionists would claim that all that's been proven is that $\sqrt{2}$ is not rational, not necessarily that it is irrational.

Picking and choosing what axioms, entailments, or laws one accepts of course changes what meaning and truth will look like to them, and can lead to some interesting, tenable perspectives that hold real weight and become the focus of entire areas of philosophy.

Conclusion

This post does not replace the two textbooks outlined above. If anything, it is only a simple supplement that gives a high-level overview of all the major concepts to newcomers of formal logic, and a quick review source for those who have already studied it. At the very least, it'll be a quick reference for me, and future posts, as logic really has forced me to reconsider what I thought to be "staple truths", not just in math, but life in general. We'll definitely revisit logic in the future, whether it be from a computer science perspective with decidability and algorithms, certain shortcuts in math, or just more theory in logic itself. The abstract way logic has developed lets it be applied almost anywhere; it's just a matter of thinking about how.

The Hidden Universes of the Compactness Theorem

Adi Mittal

No introduction. Just cool.

Last post, I summarized the basics of formal logic, starting from a philosophical context before building up to more broad and involved results, coming closer to what just looks like math by the end of it. One of those results which was briefly thrown into the conclusion was the Compactness Theorem. You won't need to know a lot of logic to appreciate the power of this theorem, but it would help to either skim that last blog post or brush up on how to read mathematical logic and its symbols. If the following talk of abstract logic seems confusing, don't worry, skip ahead to the Applications of Compactness to see why this theorem is so much more than just abstract logic.

Here is a quick review of some concepts from propositional logic (that is, predicate calculus without quantifiers and predicates) we'll be using and discussing:

Sentence: a string of logical characters, built from sentence letters $\{P, Q, R, \cdots \}$ that are either true or false, and connectives $\{\wedge, \vee, \neg, \rightarrow, \leftrightarrow\}$
Structure: a "model universe"; a total function that assigns every sentence letter a truth value. Complex sentences are assigned truth values according to the rules of the connectives they use.
Satisfiability: a set of sentences $\Gamma$ is satisfiable if and only if there is a structure that makes every sentence in $\Gamma$ true
Finite Satisfiability: a set $\Gamma$ is finitely satisfiable if and only if every finite subset of $\Gamma$ is satisfiable
Entailment: We say a set $\Gamma$ entails a sentence $\Phi$, written $\Gamma \vDash \Phi$, if whenever a structure satisfies $\Gamma$, $\Phi$ is also true in that structure

Compactness Theorem: If $\Gamma$ is finitely satisfiable, then the whole set $\Gamma$ is satisfiable.

Roughly speaking, this says that you can claim your set of logical statements $\Gamma$ is not contradictory if every finite subset of those statements is not contradictory itself (we call the property of being non-contradictory satisfiability). If $\Gamma$ is already a finite collection of sentences, then Compactness trivially holds since $\Gamma \subseteq \Gamma$ is a finite subset of itself.

This becomes interesting when $\Gamma$ is infinite, since then this gives us a sufficient method for deducing that a set of logical sentences is satisfiable. The converse direction should sort of make sense: if you can find a "problem set" within $\Gamma$ that is not satisfiable, then of course $\Gamma$ is not satisfiable, since any attempt to make every sentence in $\Gamma$ true at once would also have to make the problem set true. Compactness says such a problem set can always be found and can be taken to be finite, which is essentially the contrapositive of the above formulation:

Compactness Theorem: If $\Gamma$ is unsatisfiable, then there exists a finite subset of $\Gamma$ that is unsatisfiable.

To see the strength of compactness, we can use this final formulation along with the definition of entailment to get one more that shows that compact logics are able to reduce arguments to only some number of "key" premises:

Compactness Theorem: If $\Gamma \vDash \Phi$, then $\Gamma^{\textrm{fin}} \vDash \Phi$ for some finite subset $\Gamma^{\textrm{fin}} \subseteq \Gamma$.

This follows from an equivalent statement of entailment: $\Gamma \vDash \Phi$ if and only if $\Gamma \cup \{\neg\Phi\}$ is inconsistent (i.e. contradictory/unsatisfiable).
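For finite sets of sentences, both sides of this equivalence can be checked by brute force over every structure. The sketch below is only illustrative: sentences are modeled as Python functions from a structure (a dictionary of truth values) to a truth value, and `LETTERS`, `structures`, `satisfiable`, and `entails` are names I made up for the example.

```python
from itertools import product

LETTERS = ["P", "Q", "R"]

def structures(letters=LETTERS):
    """Yield every structure: a total assignment of truth values to the letters."""
    for values in product([True, False], repeat=len(letters)):
        yield dict(zip(letters, values))

def satisfiable(gamma):
    """A finite set of sentences is satisfiable if some structure makes all of them true."""
    return any(all(s(a) for s in gamma) for a in structures())

def entails(gamma, phi):
    """gamma entails phi if phi is true in every structure that satisfies gamma."""
    return all(phi(a) for a in structures() if all(s(a) for s in gamma))

# Example sentences, written as functions of a structure.
p_implies_q = lambda a: (not a["P"]) or a["Q"]
q_implies_r = lambda a: (not a["Q"]) or a["R"]
p = lambda a: a["P"]
r = lambda a: a["R"]
not_r = lambda a: not a["R"]

gamma = [p_implies_q, q_implies_r, p]
direct = entails(gamma, r)                       # entailment checked from the definition
via_unsat = not satisfiable(gamma + [not_r])     # the same fact via unsatisfiability
print(direct, via_unsat)
```

Of course, this kind of exhaustive search only works when there are finitely many sentence letters in play; Compactness is what lets us say something when $\Gamma$ is infinite.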

Even if an argument appears to need infinite premises $\Gamma$, there's an equivalent formulation that only requires a finite subset $\Gamma^{\textrm{fin}} \subseteq \Gamma$ that is what really matters to our argument. This relation between finite satisfiability and actual satisfiability, in a sense, states that there are certain "pseudo-finite" sets of sentences; they may look infinite, but for what you can deduce from them, they are no better than being a finite set of sentences.

We'll first cover some examples of why I think this theorem deserves to be better known, then where it gets its name from, and finally a proof of the Compactness Theorem for propositional logic.

Applications of Compactness

I've been going on a bit about this random theorem in a very abstract way—as logic tends to do. So, hopefully some examples will show just how useful this idea is; it gives us a way to extend results to infinity, which is not something that is all that common in math. Take proof by induction, for example: that gives us a way of showing a result holds for an arbitrary natural number, even as it approaches infinity. But to say the result holds at infinity is usually meaningless, even though we describe infinity or approaching infinity as that of the concept of an arbitrarily large number or similar.

Graph Theory

Since graphs are readily characterized by a set of vertices and edges, it is surprisingly easy to formalize them and their properties in logic, making results fall out much quicker than you might expect.

Graph Coloring

Let's start with a concrete example: graph coloring. We say a graph $G$ of vertices $V$ and edges $E$ (we define a graph as the object $G=(V,E)$ to clearly define its vertices and edges) is $4$-colorable if every vertex can be assigned one of four colors (say, red, blue, green, and purple) where no two vertices of the same color are neighbors (that is, share an edge). We say a graph is $k$-colorable if we can assign every vertex one of $k$ colors with no two of the same color neighboring each other.

We assign every region a vertex, and connect two vertices if the two regions they represent share a border. Despite looking as complicated as it does, the above map is still 4-colorable. Credit: Wikipedia

Theorem: A graph is $k$-colorable if and only if all of its finite subgraphs are $k$-colorable.

Proof: $(\Rightarrow)$ Clearly if the whole graph is $k$-colorable, then all of its finite subgraphs are $k$-colorable.

$(\Leftarrow)$ For the other direction, we can write some logical sentences that describe the behavior of our graph. For a (undirected) graph $G=(V,E)$ with

$V = \{1,2,\cdots\}$ as a (possibly infinite) set of vertices (we'll just label with numbers for convenience),
$E \subseteq V \times V$ as a set of edges (that is, an irreflexive and symmetric relation),
- I.e. if the pair $(1,2) \in E$, we mean there is an edge connecting vertices 1 and 2
$C = \{1,2,\cdots, k\}$ as a set of colors,

we'll define a set of sentence letters $P_{v,c}$ to represent $\textrm{"vertex} \ v \ \textrm{has color} \ c \textrm{"}$. With these simple sentences, we can write more complex descriptions of our graphs.

Vertex $v$ has at least one color: $F_v := \bigvee_{c=1}^{k} P_{v,c}$
- Either vertex $v$ has color 1, or color 2, or…, or color $k$
Vertex $v$ has at most one color: $G_v := \bigwedge_{c_1 = 1}^{k} \bigwedge_{c_2 = c_1 + 1}^{k} \neg(P_{v,c_1} \wedge P_{v,c_2})$
- For any two different colors $c_1$ and $c_2$, it is not the case that vertex $v$ has both colors
Vertices $u$ and $v$ don't have the same color: $H_{u,v} := \bigwedge_{c = 1}^{k} \neg(P_{u,c} \wedge P_{v,c})$

Now let $\Gamma = \{F_v \ | \ v \in V\} \cup \{G_v \ | \ v \in V\} \cup \{H_{u,v} \ | \ (u,v) \in E\}$. Now it should be fairly clear that $\Gamma$ is satisfiable and not contradictory if and only if $G$ is $k$-colorable.

The first two sets of sentences $\{F_v \ | \ v \in V\} \cup \{G_v \ | \ v \in V\}$ say that every vertex has at least one color and at most one color, that is, every vertex has exactly one color.
The last set $\{H_{u,v} \ | \ (u,v) \in E\}$ says that no neighboring vertices share the same color. If $G$ is $k$-colorable, then by definition, no two adjacent vertices have the same color.

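So if $G$ is $k$-colorable, an easy structure $A$ that satisfies $\Gamma$ comes from a $k$-coloring itself: let $A(P_{v,c}) = T$ if and only if vertex $v$ has color $c$ (exactly as it denotes). Conversely, any structure satisfying $\Gamma$ can be read off as a $k$-coloring in the same way.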

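By assumption, every finite subgraph of $G$ is $k$-colorable. Any finite subset of $\Gamma$ mentions only finitely many vertices, and the finite subgraph on those vertices has a $k$-coloring, which gives a structure satisfying that subset. So every finite subset of $\Gamma$ is satisfiable. By the Compactness Theorem, $\Gamma$ must then be satisfiable, and hence all of $G$ is $k$-colorable.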

$\blacksquare$
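For a finite graph, the encoding in the proof can be written out and tested directly. The following is a rough sketch, not an efficient algorithm: it builds the sentences $F_v$, $G_v$, and $H_{u,v}$ as Python checks and searches every structure by brute force. The function names and the triangle example are my own.

```python
from itertools import product

def coloring_sentences(vertices, edges, k):
    """Build Gamma as checks over a structure a[(v, c)], read as 'vertex v has color c'."""
    colors = range(1, k + 1)
    gamma = []
    for v in vertices:
        # F_v: vertex v has at least one color.
        gamma.append(lambda a, v=v: any(a[(v, c)] for c in colors))
        # G_v: vertex v has at most one color.
        gamma.append(lambda a, v=v: all(
            not (a[(v, c1)] and a[(v, c2)])
            for c1 in colors for c2 in colors if c1 < c2))
    for u, v in edges:
        # H_{u,v}: neighbors u and v never share a color.
        gamma.append(lambda a, u=u, v=v: all(
            not (a[(u, c)] and a[(v, c)]) for c in colors))
    return gamma

def satisfiable(vertices, edges, k):
    """Brute-force search over every structure on the letters P_{v,c}."""
    letters = [(v, c) for v in vertices for c in range(1, k + 1)]
    gamma = coloring_sentences(vertices, edges, k)
    for values in product([True, False], repeat=len(letters)):
        a = dict(zip(letters, values))
        if all(sentence(a) for sentence in gamma):
            return True
    return False

# A triangle can be colored with three colors, but not with two.
triangle_vertices = [1, 2, 3]
triangle_edges = [(1, 2), (2, 3), (1, 3)]
print(satisfiable(triangle_vertices, triangle_edges, 3))
print(satisfiable(triangle_vertices, triangle_edges, 2))
```

The search is exponential in the number of sentence letters, so this is only meant to show that satisfiability of $\Gamma$ and $k$-colorability really do line up on small cases.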

We can then combine this with the famous Four Color Theorem:

Four Color Theorem: Every finite planar graph is 4-colorable.

And by the above, we can extend this to the

Infinite Four Color Theorem: Every infinite planar graph is 4-colorable.

Proof: Let $G$ be an infinite planar graph. Every finite subgraph of $G$ is also planar, and so is 4-colorable by the Four Color Theorem. By the above theorem, $G$ itself must also be 4-colorable. $\ \blacksquare$

The above method of formalizing and encoding behavior and relationships in math (and graph theory especially) alongside the Compactness Theorem gives us a nice, relatively straightforward strategy for extending finite results to the infinite.

Kőnig's Lemma

Here's another result from graph theory that has its own handful of applications.

Kőnig's Lemma: Every locally-finite infinite tree contains an infinite branch.

Just to be clear, here are some definitions:

A tree is a special type of (undirected) graph characterized by being connected (there's a path between any two vertices) and that it is acyclic (there is no path from a vertex to itself without passing through a vertex more than once)
A locally-finite graph is one where every vertex only has a finite number of edges/neighbors (known as the degree of the vertex)
An infinite graph is a graph with an infinite number of vertices

So a tree can look something like

Doesn't that sort of look like a tree growing downward? Here are some useful terms to describe trees:

The very top vertex is called the root
A vertex is at level $n$ if it takes $n$ edges to get to that vertex from the root. So the root can be defined by being the vertex at level 0.
A vertex $x$ is the parent of another vertex $y$ (the child) if they share an edge, and $x$ is one level closer to the root than $y$
$x$ is the ancestor of $y$ if there is a path (i.e. a chain of parents) connecting $x$ and $y$, which we denote $x \preceq y$

So Kőnig's Lemma says every infinite tree either has an infinite-degree vertex or an infinite path. This should seem somewhat obvious, but as with anything involving infinity, it needs to be dealt with carefully, since we can imagine building a tree where we carefully terminate every path to be finite. Sure, there may be arbitrarily long paths, but showing that there is an endless path, as opposed to just very long finite paths, is a careful distinction. Moreover, Kőnig's Lemma has ties to the axiom of choice, which should never be treated carelessly.

Proof of Kőnig's Lemma: As before, we'll do this with the Compactness Theorem. For a tree $T$, we'll define a series of sentence letters $P_{x}$ to denote "we select vertex $x$ in our path". The idea, as the sentence suggests, is to find a structure that picks out specific $P_{x}$ at each level to tell us which vertices make up our path. The more complex sentences we will use and satisfy are:

We use at least one vertex at level $n$: $F_n := \bigvee_{i=1}^k P_{n_i}$ where $\{n_1, \cdots, n_k \}$ is the set of vertices at level $n$ (a finite set, since the tree is locally finite)
- At level $n$, it must be the case we select either $n_1$, or $n_2$, or…, or $n_k$.
We pick at most one vertex at level $n$: $G_n := \bigwedge_{i = 1}^{k} \bigwedge_{j = i + 1}^{k} \neg(P_{n_i} \wedge P_{n_j})$
- For two different vertices $n_i$ and $n_j$ on level $n$, we do not pick both of them in our path
If we pick a vertex at some level, then we have to pick its ancestors: $H_{xy} := (P_{y} \rightarrow P_{x})$ where $x \preceq y$
- Since every vertex (except the root) has one parent, there's only one path from the root to that vertex, so naturally a line of ancestors forms our path

Now let $\Gamma = \{F_n \ | \ n \in \mathbb{N} \} \cup \{G_n \ | \ n \in \mathbb{N} \} \cup \{H_{xy} \ | \ x \preceq y \ \forall x,y \in T\}$. Now say there is a structure $A$ that satisfies $\Gamma$. Then it should be clear that the sequence of vertices $B = \{x \in T \ | \ A(P_x) = T\}$ would form an infinite path through our tree $T$:

The set $\{F_n \ | \ n \in \mathbb{N} \} \cup \{G_n \ | \ n \in \mathbb{N} \}$ specifies we use exactly one vertex at each level, so there's this natural descent we can follow in our tree
The last set $\{H_{xy} \ | \ x \preceq y \ \forall x,y \in T\}$ says that if we pick one vertex in our tree to form our path, we also pick its ancestors, so we know there is this path of edges that we can actually follow

To show that there is such a structure, we'll use Compactness. Take any finite subset $\Gamma^\textrm{fin} \subseteq \Gamma$. Let $\{x_1, x_2, \cdots, x_k\}$ be the set of vertices whose corresponding sentence letter $P_{x_i}$ occurs in some element of $\Gamma^\textrm{fin}$. Since $\Gamma^\textrm{fin}$ only talks about a finite number of sentence letters and thus a finite number of vertices, it is also talking about a finite tree of height $n = \max\{\textrm{Lev}(x_1), \cdots, \textrm{Lev}(x_k)\}$ where $\textrm{Lev}(x_i)$ gives the level of vertex $x_i$. Because $T$ is infinite but every level contains only finitely many vertices, no level can be empty, so we can pick some vertex at level $n$ and call it $\alpha$. Now the structure

$$ A(P_x) = \begin{cases} T & \textrm{if} \ x \preceq \alpha \\ F & \textrm{otherwise} \end{cases} $$

should satisfy $\Gamma^\textrm{fin}$. Since every vertex has exactly one ancestor on each level above it (i.e. its parent, the parent of the parent, etc.), this structure selects exactly one vertex on each level up to $n$ as we'd want, and by construction, it selects a chain of ancestors that forms a path. Even if the vertices mentioned in $\Gamma^\textrm{fin}$ are disconnected from each other, this structure would still satisfy $\Gamma^\textrm{fin}$: if a vertex $y$ is not selected, then $H_{xy}$ is vacuously satisfied.

Since $\Gamma^\textrm{fin}$ was an arbitrary finite subset of $\Gamma$ and was satisfiable, by the Compactness Theorem, $\Gamma$ is satisfiable, and so for any locally-finite infinite tree, there is an infinite path through it.

$\blacksquare$
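The finite step of the proof can be sanity-checked on a small example. The tree below is a made-up finite piece of a tree stored as a child-to-parent dictionary, and the helper names are my own; the sketch builds the structure "select $x$ exactly when $x \preceq \alpha$" and checks the three kinds of sentences against it.

```python
# Hypothetical finite piece of a tree, stored as child -> parent (the root has parent None).
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b", "f": "c"}

def ancestors(x):
    """Return x together with all of its ancestors, walking up to the root."""
    chain = []
    while x is not None:
        chain.append(x)
        x = parent[x]
    return chain

def level(x):
    """Number of edges between the root and x."""
    return len(ancestors(x)) - 1

n = max(level(x) for x in parent)                 # height of this finite piece
alpha = next(x for x in parent if level(x) == n)  # some vertex on the deepest level
selected = set(ancestors(alpha))                  # A(P_x) = T exactly for alpha and its ancestors

# F_m and G_m: exactly one selected vertex on each level 0..n.
exactly_one_per_level = all(
    sum(1 for x in selected if level(x) == m) == 1 for m in range(n + 1)
)

# H_xy: every selected vertex has all of its ancestors selected too.
closed_under_ancestors = all(set(ancestors(y)) <= selected for y in selected)

print(sorted(selected, key=level))
```

The selected vertices, sorted by level, trace a single branch from the root down to `alpha`, which is the finite shadow of the infinite path that Compactness hands us.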

There are direct proofs, too, but Kőnig's Lemma itself can be seen as a weaker version of Compactness, so I think proving it both ways is informative.

Order-Extension Principle

As the previous two examples with graphs have shown, results with Compactness are all about extending results in some way. Here's another one that doesn't necessarily rely on infinity like our previous ones, but first we'll need some definitions.

A partial order on a set $X$ is a relation $\preceq$ satisfying:

Reflexivity: $\forall x \in X \ x \preceq x$
Antisymmetry: $\forall x,y \in X \ (x \preceq y \wedge y \preceq x \Rightarrow x = y)$
Transitivity: $\forall x,y,z \in X \ (x \preceq y \wedge y \preceq z \Rightarrow x \preceq z)$

A total order is a partial order with the additional restriction of

Connectedness: $\forall x,y \in X \ (x \preceq y \vee y \preceq x)$

So a total order is something in which all elements are able to be compared with one another, while in a partial order some may not be. So all total orders are partial orders, but not all partial orders are total orders. We denote a partially ordered set as a pair of a set and a partial order defined on the set: $(X, \preceq)$.

An easy example of a total order is $\leq$ on the set of integers $\mathbb{Z}$: for every pair of integers, one is less than or equal to the other. An example of a partial order that is not total would be the relation "divides" on the positive integers: 2 does not divide 3, and 3 does not divide 2, so under the relation of divides, 2 and 3 would be incomparable.
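The four conditions are easy to test by brute force on a small finite set. This is a minimal sketch using the positive integers 1 through 12 as an arbitrary sample; the function names are my own.

```python
def is_partial_order(elements, leq):
    """Check reflexivity, antisymmetry, and transitivity by brute force."""
    reflexive = all(leq(x, x) for x in elements)
    antisymmetric = all(
        x == y for x in elements for y in elements if leq(x, y) and leq(y, x))
    transitive = all(
        leq(x, z)
        for x in elements for y in elements for z in elements
        if leq(x, y) and leq(y, z))
    return reflexive and antisymmetric and transitive

def is_total_order(elements, leq):
    """A total order is a partial order in which every pair is comparable."""
    connected = all(leq(x, y) or leq(y, x) for x in elements for y in elements)
    return is_partial_order(elements, leq) and connected

numbers = range(1, 13)
divides = lambda x, y: y % x == 0   # x divides y
at_most = lambda x, y: x <= y       # the usual order

print(is_partial_order(numbers, divides), is_total_order(numbers, divides))
print(is_partial_order(numbers, at_most), is_total_order(numbers, at_most))
```

Divisibility passes the partial order check but fails connectedness because of pairs like 2 and 3, while the usual $\leq$ passes both.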

It's worth mentioning that there are also strict partial/total orders that remove the reflexivity requirement (think the difference between $<$ and $\leq$ on the integers).

Now there is a natural question of whether or not we can remedy partial orders: is there a way we can maintain the structure of a partial order, but find a way to "fix" the incomparable elements? It turns out we can do just that.

Order-Extension Principle: Any partial order may be extended to a total order.

For any partial order $\preceq$ on a set $X$, we can find a total order $\leq$ on $X$ such that $\forall x,y \in X \ (x \preceq y \Rightarrow x \leq y)$; we can preserve all the original relationships from the partial order, while adding all the missing relations to make a total order without contradicting the criteria above. This isn't completely obvious to me, specifically for the transitivity rule, since it is totally possible to imagine accidentally creating a chain that circles on itself, coming to the conclusion that $x \preceq y \preceq z \preceq \cdots \preceq x$, getting us that distinct elements should be equal to each other by antisymmetry.

Until now, we've been using the Compactness Theorem for propositional logic. For the following proof, we will use the Compactness Theorem for first-order logic, as this will allow us to talk about binary relations—which partial and total orders are. A relation $R$ on a set $X$ is a subset of $X \times X$, so we say that $xRy$ (or $x \preceq y$ in our case) is true iff the ordered pair $\langle x, y \rangle \in R$. The theorem is the exact same statement, just applies to a broader logic.

Proof: Let $X$ be a set, and $\preceq$ be a partial order on $X$. We add a binary relation symbol $\leq$ to our language to represent the order we are building, and for every $x \in X$, we add a constant $c_x$ to denote it. The following set of sentences will make up our $\Gamma$ that we'll show satisfiable:

1. $\leq$ is a partial order
- This just specifies the content of $\leq$ so that it behaves how we want it to
2. Whenever $p \preceq q$, we add the sentence $c_p \leq c_q$
- Similar to sentences of form (1), this keeps the structure between our original partially ordered set and our formalization
3. For every pair $p, q \in X$, we add the sentence $c_p \leq c_q \vee c_q \leq c_p$
- This is how we'll extend our partial order to a total order by ensuring every pair of elements is compared

Now we'll show that $\Gamma$ is satisfiable. Take some finite subset $\Gamma^\textrm{fin} \subseteq \Gamma$. If $\Gamma^\textrm{fin}$ only contains sentences from rules (1) and (2), then we know $\Gamma^{\textrm{fin}}$ is satisfiable, since $\preceq$ on $X$—which $\leq$ is based on—is a non-contradictory object (I mean, it exists), so that gives us a valid guide to make our structure. So $\Gamma^\textrm{fin}$ is satisfiable in this case.

Otherwise, $\Gamma^\textrm{fin}$ has a finite number of sentences from rule (3). To satisfy these sentences, we will extend our partial order $\preceq$ to another one $\preceq^{*}$. We can construct $\preceq^{*}$ by induction:

If there is only 1 sentence from rule (3) which isn't already satisfied by $\preceq$, then, say, $p$ and $q$ are incomparable. So let's just declare that $p$ comes before $q$, and add this to our relation. So we let $\preceq^{*} \, = \,\, \preceq \cup \, \{\langle p', q' \rangle \ | \ p' \preceq p \wedge q \preceq q' \ \forall p', q' \in X \}$. We can't just add $\langle p, q \rangle$ alone to our relation, since we need to ensure transitivity: everything below $p$ must now also sit below everything above $q$. That's why we add this set of ordered pairs as opposed to just the incomparable pair.
Now we do this exact same extension, one sentence at a time, for the finite number of sentences from rule (3), and we're done (we have no issue with the Axiom of Choice since we only need to consider finite numbers of sentences).

So $\preceq^{*}$ gives us the precise model to show that $\Gamma^{\textrm{fin}}$ is satisfiable in this case. So in all cases, $\Gamma^{\textrm{fin}}$ is satisfiable, and by Compactness, $\Gamma$ is satisfiable too. (Note $\preceq^{*}$ isn't our total order, since it only handles the finitely many pairs mentioned in $\Gamma^{\textrm{fin}}$, whereas we want our total order to handle all of $\Gamma$; $\leq$ is our total order)

So there is a structure $A$ in which $\Gamma$ is satisfied, which tells us that there is a binary relation interpreting $\leq$ that acts like a total order on $X$. We can define this total order $\leq^{*}$ on $X$ specifically by: $p \leq^{*} q \Leftrightarrow A \vDash c_p \leq c_q$ (we have to distinguish $\leq$ and $\leq^{*}$ because one is a binary relation symbol in our language, and the other acts on our set).

$\blacksquare$
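For a finite set, the step-by-step extension can be carried out directly without Compactness. The sketch below is my own illustration, not necessarily the exact construction written in the proof: to place $p$ before $q$, it relates everything at or below $p$ to everything at or above $q$, which is one standard way to keep transitivity. The names `extend_once` and `extend_to_total` are made up.

```python
def extend_once(order, p, q):
    """Add p <= q, plus every pair forced by transitivity: (a, b) with a <= p and q <= b."""
    below_p = {a for (a, b) in order if b == p}
    above_q = {b for (a, b) in order if a == q}
    return order | {(a, b) for a in below_p for b in above_q}

def extend_to_total(elements, order):
    """Repeat the one-pair extension until no incomparable pair is left."""
    order = set(order)
    for p in elements:
        for q in elements:
            if (p, q) not in order and (q, p) not in order:
                order = extend_once(order, p, q)
    return order

# Divisibility on a few positive integers, stored as a set of ordered pairs.
elements = [1, 2, 3, 4, 6]
divides = {(x, y) for x in elements for y in elements if y % x == 0}
total = extend_to_total(elements, divides)

print(sorted(elements, key=lambda x: sum(1 for y in elements if (y, x) in total)))
```

Each call to `extend_once` starts from a partial order and returns a partial order, so the final relation keeps every original comparison while making every pair comparable.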

Logical Results

I hope the above gives some more concrete examples of how the Compactness Theorem applies to useful mathematics, despite it being a purely logical result. That being said, it being from logic should make it no surprise there are a number of purely theoretic results that are just interesting to consider.

Weakness of Infinity

First-order logic can readily express the size of a model/structure with quantification. The simple sentence $\varphi_{\geq 1} = \exists x (x=x)$ is satisfied if and only if the structure it is evaluated in contains at least 1 element. $\varphi_{\geq 2} = \exists x \exists y (\neg x = y)$ is satisfied iff the structure contains at least 2 distinct elements. In general, $\varphi_{\geq n} = \exists x_1 \exists x_2 \cdots \exists x_n (\bigwedge_{1 \leq i < j \leq n} \neg x_i = x_j)$ expresses that a structure has at least $n$ elements.

Similarly, we can come up with upper bounds on the size of a structure with $\forall$: $\varphi_{\leq n} = \forall x_1 \forall x_2 \cdots \forall x_{n+1}(\bigvee_{1\leq i < j \leq n+1} x_i = x_j)$.

Combining these can then get us exact sizes of structures: $\varphi_{= n} = \varphi_{\leq n} \wedge \varphi_{\geq n}$.
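To make these size sentences a bit more tangible, here is a minimal Python sketch that evaluates them by brute force on a finite domain. The function names are my own and purely illustrative; each function mirrors the quantifier structure of the corresponding sentence ($\exists$ becomes `any`, $\forall$ becomes `all`).

```python
from itertools import combinations, product

def satisfies_at_least(domain, n):
    """phi_{>=n}: there exist x_1..x_n that are pairwise distinct."""
    return any(
        all(xs[i] != xs[j] for i, j in combinations(range(n), 2))
        for xs in product(domain, repeat=n)
    )

def satisfies_at_most(domain, n):
    """phi_{<=n}: among any x_1..x_{n+1}, two must be equal."""
    return all(
        any(xs[i] == xs[j] for i, j in combinations(range(n + 1), 2))
        for xs in product(domain, repeat=n + 1)
    )

def satisfies_exactly(domain, n):
    """phi_{=n}: the conjunction of the two bounds."""
    return satisfies_at_least(domain, n) and satisfies_at_most(domain, n)

three_elements = range(3)
print(satisfies_exactly(three_elements, 3))
print(satisfies_exactly(three_elements, 2))
```

On a three-element domain, only $\varphi_{=3}$ should come out true. Note that each call only ever tests one specific $n$, which is exactly the limitation discussed next.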

However, note that these only allow us to express specific finite sizes of structures. As it turns out, it is impossible to express in general the finitude of a model.

Claim: There is no sentence $\Phi$ that is satisfied in exactly the finite models.

Proof: Suppose for a contradiction there was such a $\Phi$. Now consider the following set of sentences:

$\Delta_{\infty} = \{\varphi_{\geq 1}, \varphi_{\geq 2}, \varphi_{\geq 3}, \cdots\} = \{\varphi_{\geq n} \ | \ n \in \mathbb{N} \}$

It should be clear that $\Delta_{\infty}$ is only satisfied in an infinite model: if the model were finite, say with $m$ elements, then the sentence $\varphi_{\geq m+1} \in \Delta_{\infty}$ would fail in it, since it demands more elements than the model has.

Now take $\{\Phi \} \cup \Delta_{\infty}$. This set of sentences must be unsatisfiable, since $\Phi$ is only satisfied in finite models by assumption, while $\Delta_{\infty}$ is only satisfied in infinite ones. By the Compactness Theorem, there is a finite subset $\Gamma^{\textrm{fin}} \subseteq \{\Phi \} \cup \Delta_{\infty}$ that is unsatisfiable. If $\Gamma^{\textrm{fin}} \subseteq \Delta_{\infty}$, then let $n$ denote the greatest index such that $\varphi_{\geq n} \in \Gamma^{\textrm{fin}}$. Any model of size equal to or greater than $n$ clearly then satisfies $\Gamma^{\textrm{fin}}$. But since $\Gamma^{\textrm{fin}}$ is unsatisfiable, it must be the case that $\Phi \in \Gamma^{\textrm{fin}}$, as otherwise it would be satisfiable for the reason just listed.

But then take any finite structure with at least $n$ elements, where $n$ is the greatest index of a $\varphi_{\geq n}$ appearing in $\Gamma^{\textrm{fin}}$. It satisfies the part of $\Gamma^{\textrm{fin}}$ that comes from $\Delta_{\infty}$, so $\Phi$ must fail in it, as otherwise $\Gamma^{\textrm{fin}}$ would be satisfied. But then we have found a finite structure in which $\Phi$ is not satisfied, contradicting our assumption. $\ \blacksquare$

As it turns out, it is also impossible to express the infinitude of a structure.

Claim: There is no sentence $\Phi$ that is satisfied in exactly the infinite models.

Proof: Suppose for contradiction there was such a $\Phi$. Then for all finite structures $A$, it is the case that $A \nvDash \Phi$. Thus also in all such structures, $A \vDash \neg \Phi$. Now take the set of sentences $\Gamma = \{\neg \Phi\} \cup \Delta_{\infty}$ (same $\Delta_{\infty}$ as before). Any finite subset of $\Gamma$ is satisfiable, since any finite subset of $\Delta_{\infty}$ is satisfied in every sufficiently large finite structure, and $\neg \Phi$ is satisfied in all finite structures. By the Compactness Theorem, $\Gamma$ is satisfiable, so there is some structure $A^+$ such that $A^+ \vDash \Gamma$, i.e. $A^+ \vDash \neg \Phi$ and $A^+ \vDash \Delta_{\infty}$. Since $A^+ \vDash \Delta_{\infty}$, we must have $A^+$ be an infinite structure. But by assumption, $A^+ \vDash \neg \Phi$ if and only if it is finite, which is a contradiction. Thus there cannot be any such sentence $\Phi$ that is satisfied in and only in infinite structures. $\ \blacksquare$

Propositional logic had the weakness of not being able to use quantifiers, which we remedied by strengthening it to first-order logic. In the same way, this inability to express certain statements like the above might be seen as a flaw in first-order logic that needs some fix as well.

Extending Arithmetic and the Hyperreals

The above proofs on the inexpressibility of infinity are not all that interesting on their own, but they highlight a proof strategy: use the specific set $\Delta_{\infty}$ to force certain properties to arise out of satisfying $\Gamma$. By its simple construction, $\Delta_{\infty}$ does not usually affect finite satisfiability (and hence, by Compactness, satisfiability), but it forces any model of our set $\Gamma$ to behave in a certain way. Here are some weirder consequences these types of proofs can result in.

We all know the standard model of arithmetic: that's just the natural numbers $\mathbb{N}$ with our normal understanding of addition, multiplication, etc. No more than the basic math we learned in elementary school. More formally, this is the standard model of the Peano axioms, as $\mathbb{N}$ satisfies the axioms in the most "obvious" way that we are most familiar with.

But there are also non-standard models containing other numbers that are less commonly seen.

Claim: There is a non-standard model of arithmetic.

Proof: Let $P$ denote the Peano axioms. Now consider the set of sentences

$\Gamma = P \cup \{x > 0, x > 1, \cdots\} = P \cup \{x > n \ | \ n \in \mathbb{N}\}$

for some new symbol $x$. If we can show $\Gamma$ is satisfiable, then we'll have shown that there is a model that not only satisfies the Peano axioms, but also includes a "number" $x$ that is greater than all other natural numbers.

If we take any finite subset $\Gamma^{\textrm{fin}} \subseteq \Gamma$, then it is satisfiable in the standard model of arithmetic (as that satisfies the Peano axioms), where we interpret $x$ as a number greater than any number mentioned in $\Gamma^{\textrm{fin}}$ (which is possible since it'll only have finitely many sentences of the form $x > n$). By Compactness, since all finite subsets $\Gamma^{\textrm{fin}}$ are satisfiable, $\Gamma$ is satisfiable and has a model.
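The finite-satisfiability step is simple enough to sketch in a few lines of Python. This is only an illustration under my own naming (`witness_for` is hypothetical): given the finitely many bounds $n$ appearing in sentences $x > n$ of some $\Gamma^{\textrm{fin}}$, it returns an ordinary natural number that can play the role of $x$.

```python
def witness_for(bounds):
    """Return a natural number exceeding every n in a finite collection of bounds.

    `bounds` stands for the n's from the finitely many sentences 'x > n'
    in a finite subset of Gamma; with no such sentences, 0 works.
    """
    return max(bounds, default=-1) + 1

finite_subset_bounds = {0, 7, 42}
x = witness_for(finite_subset_bounds)
print(x, all(x > n for n in finite_subset_bounds))
```

No single natural number works for all of $\Gamma$ at once, which is why the model Compactness hands us has to contain something new.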

Since a model of $\Gamma$ is a model of $P$ (as $P$ is just a subset of $\Gamma$), it will be some model of arithmetic. But also, in any model of $\Gamma$, the element that corresponds to $x$ cannot be any typical natural number, since $\Gamma$ states that it is greater than every natural number. So there is a non-standard model of arithmetic. $ \ \blacksquare$

In other words, there is a way to interpret our standard rules and axioms for finite numbers, and somehow apply them to infinite quantities. I don't know about you, but I was taught never to treat infinity like a number, but always as a concept or a process. Yet clearly it's not always contradictory or even bad logic to use them just like normal numbers.

Other non-standard models can have more complicated properties, like certain theorems failing that would be true in the standard model, but nonetheless it is interesting that Compactness implies that even such a model can exist and justifies our use of otherwise strange concepts.

For example, we can also show there is a non-standard model of analysis.

Claim: There is a non-standard model of (real) analysis.

Proof: Real analysis concerns itself with the ordered field of real numbers, so let's have $T$ be the set of sentences that all hold in this field. Now let's have

$\Gamma = T \cup \{0 < \epsilon < 1, 0 < \epsilon < \frac{1}{2}, \cdots\} = T \cup \{0 < \epsilon < \frac{1}{n} \ | \ n \in \mathbb{N}_{>0}\}$

Like before, if we can show that $\Gamma$ is satisfiable, we'll find a model that satisfies everything we would expect of the real numbers (from $T$) while also containing a new positive number $\epsilon$ that is smaller than every positive real number. Again like before, any finite subset $\Gamma^{\textrm{fin}}$ is satisfiable, since any finite subset of $T$ holds in the original model of the real numbers, and for any finite subset of $\{0 < \epsilon < \frac{1}{n} \ | \ n \in \mathbb{N}_{>0}\}$, we just let $\epsilon$ be a positive real number smaller than $\frac{1}{n}$ for the largest $n$ that appears in that set. Since any finite subset $\Gamma^{\textrm{fin}}$ is satisfiable and has a model, by Compactness, $\Gamma$ also is satisfiable and has a model. But clearly this model is non-standard, for $\epsilon$ is a positive number that is not identical to any positive real number, as it is smaller than all of them. $\ \blacksquare \ $ (This also shows the existence of a non-Archimedean ordered field.)

We are often reminded in math that we cannot treat the infinitely large and the infinitely small however we want, but there is a rigorous sense in which we can treat them familiarly as with all the other numbers. There is a sense in which you can do calculus and functional analysis without the need for limits, and just use these hyperreal or non-standard real numbers. If $h$ is one of these nonzero infinitesimal hyperreal numbers (like $\epsilon$ from above), then we can define the derivative of a function $f$ at a point $x$ as

$\large{f'(x) = \textrm{st}(\frac{f(x + h) - f(x)}{h})}$

This looks just like the standard definition of the derivative, but instead of having a limit attached to it, we have this new function $\textrm{st}(\cdot)$ that acts as a "rounding function", of sorts, that turns our hyperreal fraction into a real number we can work with. If this interests you, here are some nice notes on the topic.
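
As a quick worked example (my own illustration, not taken from the notes mentioned above), take $f(x) = x^2$ with $x$ real and $h$ a nonzero infinitesimal. Assuming $\textrm{st}(\cdot)$ simply discards the infinitesimal part of a finite hyperreal, we get

$\large{f'(x) = \textrm{st}\left(\frac{(x + h)^2 - x^2}{h}\right) = \textrm{st}\left(\frac{2xh + h^2}{h}\right) = \textrm{st}(2x + h) = 2x}$

which matches the derivative from the usual limit definition. The division by $h$ is legitimate here because $h$ is a genuine nonzero number in the hyperreal field, not a quantity "tending to" zero.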

Topology and the Axiom of Choice

If it isn't clear from all the proofs we've covered, the Compactness Theorem can lend its hand in many places of mathematics. To see just how powerful this is, we will need to explore its connections to other ideas we already briefly touched on. But, it's worth already pointing out, the name is a bit misleading, isn't it?

Topological Compactness

Compactness, as it stands as a property, is more often found in topology than anywhere else, as a way of generalizing the notion of "having no holes". For example, the open interval $(0,1)$ is not compact as it is missing its endpoints 0 and 1. But the closed interval $[0,1]$ is compact. The set of rational numbers $\mathbb{Q}$ is not compact as it has infinitely many holes at the irrational numbers: you can get as close as you want to any irrational number $r$ with an approximation in $\mathbb{Q}$, but clearly $r \notin \mathbb{Q}$, creating a "hole" of sorts.

Definition 1: A subset $S$ of Euclidean space $\mathbb{R}^n$ is compact if $S$ is closed and bounded.

Closed, meaning that $S$ contains all of its limit points (like the irrationals in the case of $\mathbb{Q}$: you can always get as close as you want to any irrational while staying inside $\mathbb{Q}$, but since the irrationals are not in $\mathbb{Q}$, it is not closed)
Bounded, meaning that there is a single finite distance $D$ such that any two points in $S$ are at most $D$ apart (so even $\mathbb{R} = (-\infty, \infty)$ is not compact as it is unbounded)

In a way, compactness is the next best generalization of being finite; lots of the properties we'd expect of finite sets generalize to compact ones, even though compact sets can be infinite like $[0,1]$. In fact, a finite set is just a discrete and compact set. For example, if a set $S$ is finite, then any continuous function $f: S \rightarrow \mathbb{R}$ is bounded. This trivially holds since if $S$ is finite, then $f(S)$ is also finite and we can just find its maximum and minimum to bound it. However, if $S$ is compact, this still remains true (Boundedness Theorem).

This should ring a bell, as this sort of sounds like the talk of pseudo-finite sets of sentences from the start:

Compactness Theorem: If $\Gamma \vDash \Phi$, then $\Gamma^{\textrm{fin}} \vDash \Phi$ for some finite subset $\Gamma^{\textrm{fin}} \subseteq \Gamma$.

If a logic is compact, all infinite sets of sentences mimic the behavior of finite subsets of themselves. If a space is compact, then a set from that space acts a lot like a finite set. As it turns out, there is a more general definition of topological compactness that even more closely resembles this idea:

Definition 2: A set $S$ is compact if every open cover of $S$ has a finite subcover.

An open cover of $S$ is a collection of open sets $C$ such that $S \subseteq \bigcup_{X \in C} X$; that is, every element of $S$ is in at least one of the open sets. For example, the collection of open intervals $\{(-n,n) \ | \ n \in \mathbb{N}\}$ is an open cover of $\mathbb{R}$.
According to the Heine-Borel theorem, Definitions 1 and 2 are equivalent for subsets of $\mathbb{R}^n$.
We usually say topologies/metric spaces are compact, as opposed to generic sets.
This definition applies to general topologies, not just Euclidean space $\mathbb{R}^n$.
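
For a concrete feel for Definition 2, here is a standard example (added as an illustration) of an open cover with no finite subcover. Consider the open interval $(0,1)$ together with the collection

$C = \{(\frac{1}{n}, 1) \ | \ n \in \mathbb{N}, n \geq 2\}$

Every $x \in (0,1)$ lies in $(\frac{1}{n}, 1)$ once $n > \frac{1}{x}$, so $C$ really is an open cover of $(0,1)$. But any finite subcollection of $C$ has a largest index $N$, and its union is just $(\frac{1}{N}, 1)$, which misses every point of $(0, \frac{1}{N}]$. So $C$ has no finite subcover and $(0,1)$ is not compact, in agreement with Definition 1 (it is bounded but not closed).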

But this isn't the "real" idea of the Compactness Theorem; in all of our proofs, the version of Compactness we really cared about was with finite satisfiability:

Compactness Theorem: If $\Gamma$ is finitely satisfiable, then the whole set $\Gamma$ is satisfiable.

What we really liked was the ability to turn a problem of an infinite set into a more tractable, arbitrary finite one; we could take a local property and make it a global property. Topological compactness has a similar property:

Finite Intersection Property: Given a collection $X$ of closed sets in a compact space, if every finite subcollection of $X$ has a non-empty intersection, then the whole collection has a non-empty intersection.

If we wanted to show a collection of sets shared a common element, the finite intersection property would give us a way to demonstrate that, much like the Compactness Theorem does for satisfiability.

Referencing the Boundedness Theorem from above, we can think of it as saying that if a function is locally bounded on a compact set, then it is also globally bounded.

There is, however, an even tighter connection between topology and propositional logic.

Claim: The Compactness Theorem (of propositional logic) is equivalent to the claim that its associated valuation space is compact.

A valuation is an assignment of True or False to every sentence letter (equivalent to a structure for propositional logic)

Just as a reminder, here are the characteristics that determine a topology over a set $X$.

Definition: A topological space $(X, \mathcal{T})$ consists of a non-empty set $X$ and a family $\mathcal{T}$ of subsets of $X$ with the following properties:

$X, \emptyset \in \mathcal{T}$
$U,V \in \mathcal{T} \Rightarrow U \cap V \in \mathcal{T}$
If $U_i \in \mathcal{T}$ for every $i$ in some index set $I$ (possibly infinite), then $\bigcup_{i \in I} U_i \in \mathcal{T}$

Proof: Let $V$ be the set of all valuations of our propositional logic. For every sentence $\phi$, we will assign it a set of valuations in which it's true: $U(\phi) = \{v \in V \ | \ v(\phi) = 1\}$. Now notice that for sentences $\phi_1$ and $\phi_2$,

$V = \bigcup U(\phi)$ ranging over all sentences $\phi$
- This is most easily seen from $V = U(P) \cup U(\neg P)$ for a sentence letter $P$
- So the set $\{U(\phi) \ | \ \phi \textrm{ is a sentence}\}$ is a cover of $V$
$U(\phi_1) \cap U(\phi_2) = \{v \in V \ | \ v(\phi_1) = 1\} \cap \{v \in V \ | \ v(\phi_2) = 1\} = U(\phi_1 \wedge \phi_2)$
- This is by definition of the truth table for $\wedge$
- So $U(\cdot)$ is stable under intersection (we got another $U(\cdot)$ after intersection)

These two properties imply that $\{U(\phi) \ | \ \phi \textrm{ is a sentence}\}$ forms a basis of a topology over $V$.

A basis of a topology $\mathcal{B}$ is a family of open sets such that for a topology $\mathcal{T}$, every set in $\mathcal{T}$ can be expressed as a union of sets from $\mathcal{B}$
The set $\{(a,b) \ | \ a,b \in \mathbb{R}\}$ is a basis for a (the standard) topology on $\mathbb{R}$
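
The two properties of $U(\cdot)$ can be sanity-checked on a small fragment. The sketch below is an illustration only: it assumes a logic with just three sentence letters (so $V$ has 8 valuations), models sentences as Python functions from a valuation to a truth value, and uses names of my own choosing.

```python
from itertools import product

LETTERS = ("P", "Q", "R")  # hypothetical finite set of sentence letters

# V: every assignment of True/False to the sentence letters.
V = [dict(zip(LETTERS, bits)) for bits in product([True, False], repeat=len(LETTERS))]

def U(phi):
    """Indices of the valuations in V that make the sentence phi true."""
    return frozenset(i for i, v in enumerate(V) if phi(v))

def P(v):
    return v["P"]

def not_P(v):
    return not v["P"]

def phi1(v):  # P or Q
    return v["P"] or v["Q"]

def phi2(v):  # Q implies R
    return (not v["Q"]) or v["R"]

def phi1_and_phi2(v):
    return phi1(v) and phi2(v)

covers_V = (U(P) | U(not_P)) == frozenset(range(len(V)))
stable_under_intersection = (U(phi1) & U(phi2)) == U(phi1_and_phi2)
print(covers_V, stable_under_intersection)
```

Both checks should come out true, matching the two bullet-pointed properties above. With infinitely many sentence letters $V$ is uncountable, so this kind of enumeration is only a toy.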

So we are interested in the topological space $(V, \mathcal{T})$ where $\mathcal{T}$ is the topology generated by the basis $\{U(\phi) \ | \ \phi \textrm{ is a sentence}\}$.

Now let's consider some set of sentences $\Gamma$. $\Gamma$ is unsatisfiable if and only if for all valuations, at least one sentence in $\Gamma$ is false. Or equivalently, $\Gamma$ is unsatisfiable if and only if for all valuations, the negation $\neg \varphi$ of at least one sentence $\varphi \in \Gamma$ is true. In terms of our $U(\cdot)$ function, we can also say $\Gamma$ is unsatisfiable if and only if each valuation is in one of the open sets (i.e. members of the topology) of the form $U(\neg \varphi)$ for $\varphi \in \Gamma$ (since $U(\neg\varphi)$ is precisely the set of valuations that make $\neg\varphi$ true). Equivalently, this also means that $\Gamma$ is unsatisfiable if and only if $\{U(\neg\varphi) \ | \ \varphi \in \Gamma\}$ is an open cover of our topological space $V$ (recall what an open cover is from above if this isn't clear). So this shows the equivalence that

$\Gamma \textrm{ is unsatisfiable} \Leftrightarrow \{U(\neg\varphi) \ | \ \varphi \in \Gamma\} \textrm{ is an open cover of } V$

Now recall the Compactness Theorem: if $\Gamma$ is unsatisfiable, then there exists a finite subset $\Gamma^{\textrm{fin}}$ of $\Gamma$ that is unsatisfiable. So this gives us the equivalence:

$\Gamma \textrm{ is unsatisfiable} \Leftrightarrow \exists \Gamma^{\textrm{fin}} \subseteq \Gamma \textrm{ that is unsatisfiable}$

We can make this an if and only if since the Compactness Theorem gives us the $(\Rightarrow)$ direction, and the $(\Leftarrow)$ direction comes trivially by definition of unsatisfiability: if there is an unsatisfiable subset of $\Gamma$, then of course all of $\Gamma$ is unsatisfiable.

So, given the Compactness of propositional logic, we can combine these biconditionals to deduce that the Compactness Theorem is equivalent to the claim:

If $\{U(\neg\varphi) \ | \ \varphi \in \Gamma\}$ is an open cover of the valuation space $V$, then there is a finite subset $\Gamma^{\textrm{fin}}$ of $\Gamma$ such that $\{U(\neg\varphi) \ | \ \varphi \in \Gamma^{\textrm{fin}}\}$ is an open cover of $V$.

And that is precisely the definition of a compact space (see Definition 2 above).

$\blacksquare$
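
The key equivalence in the proof (unsatisfiable if and only if the sets $U(\neg\varphi)$ cover $V$) can be checked by brute force on a toy example. This is a sketch under loud assumptions: only two sentence letters, a four-sentence $\Gamma$ that I made up, and sentences modeled as Python functions. Since everything here is finite, it illustrates the translation between the two languages rather than compactness itself.

```python
from itertools import combinations, product

LETTERS = ("P", "Q")  # hypothetical sentence letters
V = [dict(zip(LETTERS, bits)) for bits in product([True, False], repeat=len(LETTERS))]

def unsatisfiable(sentences):
    """No valuation makes every sentence true."""
    return not any(all(phi(v) for phi in sentences) for v in V)

def negations_cover(sentences):
    """Every valuation lies in some U(not phi), i.e. falsifies some phi."""
    return all(any(not phi(v) for phi in sentences) for v in V)

# A made-up unsatisfiable set Gamma.
GAMMA = {
    "P": lambda v: v["P"],
    "P -> Q": lambda v: (not v["P"]) or v["Q"],
    "not Q": lambda v: not v["Q"],
    "P or Q": lambda v: v["P"] or v["Q"],
}

names = list(GAMMA)
subsets = [c for r in range(len(names) + 1) for c in combinations(names, r)]

# The two notions agree on every subset of Gamma.
agree = all(
    unsatisfiable([GAMMA[n] for n in s]) == negations_cover([GAMMA[n] for n in s])
    for s in subsets
)

# Smallest subsets whose negation-sets already cover V ("finite subcovers").
smallest = min(len(s) for s in subsets if negations_cover([GAMMA[n] for n in s]))
print(agree, smallest)
```

Here three of the four sentences already suffice to cover $V$, which is the finite-subcover phenomenon in miniature.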

We were working with normal propositional logic above, but note that we didn't really assume anything about our logic in the proof. If we are working with an arbitrary logic with its own rules, symbols, connectives, etc., we could show that the Compactness Theorem holds if $V$ is compact. If you're interested in reading further on the connection between logic and topology, Tychonoff's theorem is actually the broader theorem that has the consequence of implying the Compactness Theorem (of propositional logic).

The Axiom of Choice

The above shows, in my opinion, a fairly strong and fundamental connection between logical compactness and some more concrete math; it not only gives reason to its name, but also puts it into the context of its broader origins.

There is an arguably EVEN MORE FUNDAMENTAL connection to math that the Compactness Theorem appeals to: the Axiom of Choice. We unwittingly saw this earlier when we proved König's Lemma, a weaker form of the Axiom of Choice.

To see this, it's worth considering a non-Compactness proof of König's lemma. Here's a quick sketch:

Alternative Sketch of König's Lemma: Let $v_0$ be the root of our infinite, finitely-branching tree $T$ with children $\{c_1, \cdots, c_k \}$. Now let's define $S_i = \{v \in T \ | \ c_i \textrm{ is an ancestor of } v \}$. These sets encapsulate the vertices we can reach from a given child of $v_0$. Since each vertex has precisely one parent, it follows that $T \ \backslash \{v_0\} = S_1 \cup \cdots \cup S_k$ (we trivially take each vertex to be one of its own ancestors; you are related to yourself, no?). Since $T$ is an infinite tree, at least one of the $S_i$ has to be infinite, too (if they were all finite, then there would only be a finite number of vertices in our tree; think Pigeonhole Principle). So we pick a child $c_i$ with an infinite associated $S_i$, and let $v_1 = c_i$. Now we list the children of $v_1$ and similarly construct their associated descendant sets. At least one of the children of $v_1$ must have an infinite set of descendants, so we select that child and let it be $v_2$. We repeat this process over and over again: at each step, we look at the children of our previously selected vertex, see which one has an infinite set of descendants, and let that be the next step in our infinite path $v_0, v_1, v_2, \cdots$ and so on. $\ \blacksquare$
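
The step-by-step selection in this sketch can be written as a short loop. A big caveat: deciding whether a vertex has infinitely many descendants is not something a program can do for an arbitrary tree, so the Python sketch below assumes it is handed an oracle `has_infinite_subtree` for that question. The example tree, the oracle, and all function names are hypothetical and chosen only so the code runs.

```python
def konig_path(root, children, has_infinite_subtree, steps):
    """Return the root followed by `steps` further vertices of an infinite path."""
    path = [root]
    for _ in range(steps):
        # By the pigeonhole argument some child has infinitely many descendants.
        # Taking the FIRST such child is exactly where a choice gets made.
        nxt = next(c for c in children(path[-1]) if has_infinite_subtree(c))
        path.append(nxt)
    return path

# Hypothetical example tree: vertices are tuples of bits. Once a vertex
# contains a 1 and has length 3, it has no children, so every subtree
# below a 1 is finite, while the all-zero branch goes on forever.
def children(v):
    if 1 in v and len(v) >= 3:
        return []
    return [v + (1,), v + (0,)]

def has_infinite_subtree(v):
    # Oracle for this specific tree: only all-zero vertices qualify.
    return 1 not in v

print(konig_path((), children, has_infinite_subtree, 4))
```

In this toy tree there is only ever one eligible child, so no real choice is needed; the interesting case is when several children qualify at infinitely many steps.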

There's a certain level of choice necessary here. At each step, we're picking a vertex with a certain property (infinite descendants), which is fine since we were able to show that there is at least one vertex at each step with that property. However, if there is more than one such vertex, then which vertex do you choose? The axiom of dependent choice, while weaker than the full axiom of choice, is nonetheless a choice principle that this proof of König's lemma depends on. In turn, you can also use König's lemma to prove Compactness, giving the connection we want.

Spelled out a bit more, the idea is that the Compactness Theorem can be thought of as finding an infinite path in a tree of sentence letters and their negations: say $\Gamma = \{\Phi_1, \Phi_2, \cdots \}$ using sentence letters $\{P_1, P_2, \cdots\}$, and we let our tree be a binary tree, with each vertex having exactly two children. We will then find a path through the tree, where at each step at level $n$, we can either take $P_n$ or $\neg P_n$ to be true (this can be seen as going down either the left or right branch of the tree). Any finite path of this tree defines a valuation of the first few sentence letters, and in combination with the finite satisfiability of $\Gamma$ and König's lemma, you can find an infinite path that determines a structure that satisfies all of $\Gamma$.

But again, there's a level of choice necessary in picking a sentence letter or its negation. The Compactness Theorem can be seen as that specific choice principle.

Some other choice principles that Compactness is equivalent to are the ultrafilter lemma and the Completeness Theorem, among others, all of which are strictly weaker than the full axiom of choice. But again, this should not be that surprising given what we've proven with Compactness, and the general nature of infinity itself. The Order-Extension Principle (which strongly resembles the ultrafilter lemma in form), while weaker than the axiom of choice, cannot be proven without some form of choice. I sort of even led you astray earlier, as Tychonoff's theorem from topology (which implies the Compactness Theorem) is actually equivalent to the full axiom of choice.

Something so controversial and so important (it literally makes up an entire letter in ZFC: Zermelo-Fraenkel set theory with the axiom of Choice) is inextricably tied to the Compactness Theorem.

Proof of the Compactness Theorem

This last section is to offer a proof of the Compactness Theorem for propositional logic to close. Since this will be a proof of a logical result, this section would definitely benefit from knowing some basic logic, just at the very least to more easily read notation.

Remember, propositional logic does not include the first-order quantifiers $\forall$ and $\exists$. To show Compactness holds in first-order logic (like we used in the Order-Extension Principle), we'll have to amend our proof.

We'll prove the following form of the Compactness Theorem:

Compactness Theorem: If $\Gamma$ is finitely satisfiable, then the whole set $\Gamma$ is satisfiable.

Proof of Compactness: The idea of the proof is we'll extend $\Gamma$ by squeezing in sentence letters in such a way that maintains finite satisfiability. Doing so will allow us to determine a structure that will satisfy $\Gamma$ by essentially finding what sentence letters we have to make true from our extended $\Gamma$, allowing us to carry out all the finite "comparisons" that force a structure to converge to one that will satisfy $\Gamma$. We'll break it down step-by-step.

1) Definition of $\Gamma_i$: By our hypothesis, $\Gamma$ is finitely satisfiable. The set of sentence letters occurring in $\Gamma$ is enumerable, so let's enumerate them as $P_1, P_2, \cdots$. Now let's define the following sequence of supersets. Let $\Gamma_0 = \Gamma$ and

$$ \Gamma_{n+1} = \begin{cases} \Gamma_n \cup \{P_{n+1}\} \ \textrm{if it is finitely satisfiable} \\ \Gamma_n \cup \{\neg P_{n+1}\} \ \textrm{otherwise} \end{cases} $$

2) Each $\Gamma_i$ is finitely satisfiable: We'll prove this by induction. $\Gamma_0 = \Gamma$ is finitely satisfiable by assumption. Now suppose $\Gamma_i$ is finitely satisfiable for $i \leq n$. For the sake of contradiction, assume $\Gamma_{n+1}$ is not, i.e. neither $\Gamma_n \cup \{P_{n+1}\}$ nor $\Gamma_n \cup \{\neg P_{n+1}\}$ is finitely satisfiable. So there are finite subsets $\Delta$ and $\Sigma$ of $\Gamma_{n}$ such that $\Delta \cup \{P_{n+1}\}$ and $\Sigma \cup \{\neg P_{n+1}\}$ are unsatisfiable. The unsatisfiable finite subsets must include $P_{n+1}$ and $\neg P_{n+1}$ respectively, since $\Delta, \Sigma \subseteq \Gamma_{n}$, which is finitely satisfiable, so every finite subset of $\Gamma_{n}$ alone is satisfiable.

Further, if $\Delta \cup \{P_{n+1}\}$ is unsatisfiable, clearly $\Delta \cup \Sigma \cup \{P_{n+1}\}$ is unsatisfiable (adding elements to a contradictory set won't make it less contradictory). Similarly, if $\Sigma \cup \{\neg P_{n+1}\}$ is unsatisfiable, then so is $\Delta \cup \Sigma \cup \{\neg P_{n+1}\}$. Since $\Delta \cup \Sigma \subseteq \Gamma_{n}$ is a finite subset of a finitely satisfiable set, $\Delta \cup \Sigma$ is satisfiable. So in any structure/valuation in which $\Delta \cup \Sigma$ is true, it must be the case that either the sentence letter $P_{n+1}$ or its negation $\neg P_{n+1}$ is true, which makes one of $\Delta \cup \Sigma \cup \{P_{n+1}\}$ or $\Delta \cup \Sigma \cup \{\neg P_{n+1}\}$ satisfiable. But we just said that both of these sets are unsatisfiable.

This is a contradiction, so our initial assumption must be wrong, and $\Gamma_{n+1}$ must be finitely satisfiable.

3) Defining a structure from the $\Gamma_i \textrm{s}:$ We will define a structure $A$ by assigning a truth value to every sentence letter $P_i$ as follows:

$$ A(P_i) = \begin{cases} T \ \textrm{if} \ P_i \in \Gamma_i \\ F \ \textrm{otherwise; i.e.} \ \neg P_i \in \Gamma_i \end{cases} $$

In step 2, we showed that for all $n$, $\Gamma_n$ is finitely satisfiable. Since

$\Phi_n = \{P_i \in \Gamma_n \ | \ i \leq n\} \cup \{\neg P_i \in \Gamma_n \ | \ i \leq n\} \subseteq \Gamma_n$

is a finite subset of $\Gamma_n$, the set $\Phi_n$ (just the set of sentence letters or their negations added to $\Gamma_n$) is also satisfiable. By construction of our structure $A$, for every $n$, every sentence letter or its negation $\alpha \in \Phi_n$ is true in $A$. In particular, though, $\Phi_n$ is a finite subset, so we only need finitely many sentence letters to be assigned specific truth values to satisfy $\Phi_n$. So any structure which agrees with $A$ on (i.e. assigns the same truth values to) the sentence letters in $\Phi_n$ will also satisfy $\Phi_n$ (this might seem obvious, since giving those sentence letters the same values that satisfy $\Phi_n$ will of course satisfy $\Phi_n$, but this will be important in defining a full structure later).

4) Showing $A$ satisfies $\Gamma$: Suppose for contradiction $A$ does not satisfy $\Gamma$. That is, for some sentence $\psi \in \Gamma$, $|\psi|_A = F$. By definition of a logical sentence, $\psi$ must be of finite length, and hence only have finitely many sentence letters occurring in it. Let $P_k$ be the highest numbered sentence letter (from our previous ordering in step 1) that occurs in $\psi$. As we discussed in step 3, every $\alpha \in \Phi_k$ is true in any structure that agrees with $A$ on $P_1, \cdots, P_k$. Hence, $\psi$ must be false in every such structure too (since such a structure would fully determine the truth value of $\psi$ in the exact same way as $A$ did).

So we can conclude that $\Phi_k \cup \{\psi \}$ must be unsatisfiable. But $\Phi_k \cup \{\psi \} \subseteq \Gamma_k$ by construction of $\Gamma_k$. So by the finite satisfiability of $\Gamma_k$, it must be that $\Phi_k \cup \{\psi \}$ is also satisfiable. Contradiction. Therefore, our structure $A$ must satisfy $\Gamma$.

Since we found a structure that satisfies $\Gamma$, obviously it is satisfiable. So we've shown that if $\Gamma$ is finitely satisfiable, then $\Gamma$ itself is satisfiable.

$\blacksquare$
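
The construction in steps 1 and 3 can be run on a toy example. The Python sketch below makes several simplifying assumptions that are mine, not part of the proof: $\Gamma$ is finite (so "finitely satisfiable" collapses to plain "satisfiable", checked by brute force), there are only three sentence letters, and sentences are modeled as functions from a valuation to a truth value.

```python
from itertools import product

LETTERS = ("P1", "P2", "P3")  # hypothetical enumeration of sentence letters

def satisfiable(sentences):
    """Brute-force search for a valuation making every sentence true."""
    return any(
        all(phi(dict(zip(LETTERS, bits))) for phi in sentences)
        for bits in product([True, False], repeat=len(LETTERS))
    )

def build_structure(gamma):
    """Follow steps 1 and 3: add P_n if that stays satisfiable, else add not-P_n."""
    current = list(gamma)  # Gamma_0
    A = {}
    for letter in LETTERS:
        positive = lambda v, l=letter: v[l]
        negative = lambda v, l=letter: not v[l]
        if satisfiable(current + [positive]):
            current.append(positive)  # Gamma_{n+1} = Gamma_n with P_{n+1}
            A[letter] = True
        else:
            current.append(negative)  # Gamma_{n+1} = Gamma_n with not-P_{n+1}
            A[letter] = False
    return A

# A made-up Gamma: P1 or P2, not P1, P2 implies P3.
GAMMA = [
    lambda v: v["P1"] or v["P2"],
    lambda v: not v["P1"],
    lambda v: (not v["P2"]) or v["P3"],
]

A = build_structure(GAMMA)
print(A, all(phi(A) for phi in GAMMA))
```

For this $\Gamma$ the construction is forced to take $\neg P_1$ at the first step and then takes $P_2$ and $P_3$, and the resulting $A$ satisfies every sentence of $\Gamma$, as step 4 promises.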

Conclusion

In a way, Compactness should sort of seem obvious. This idea of finite satisfiability basically allows us to make a bunch of finite comparisons of a property within a set, slowly but surely forcing the entire set to also conform to that property. In this case, it was satisfiability, but, say, if the property was that every finite collection of numbers from a set consists only of primes, you'd expect the whole set to consist only of primes. Yet, obviously this is still a theorem, and in fact is not true for all logics: second-order logic is not a compact logic. But when it is true, it is an immensely powerful tool, allowing us to extend seemingly local properties to global ones. Compactness has made a name for itself in analysis and topology, but logic has given us a method to take that beyond simple set theory to graphs, computability, and all the way to the foundational axioms that make up modern math.

Your Induction to Induction

Adi Mittal

Everything you need for discrete and continuous induction.

I've been on a kick for mathematical foundations recently. Maybe it's all that logic I've been looking at. One of those key ideas that finds its way into math is the Principle of Mathematical Induction. It's likely even if you've only taken high school math classes, you've encountered induction in one way or another. The importance of induction stems all the way back to the all-important, characterizing Peano axioms of arithmetic. In fact, whether a formal system can perform induction is a critical indicator of not just the strength, but also the flaws of the system, as induction is the "simplest" cause of a logical system being incomplete (as in Gödel's incompleteness theorems).

Principle of Mathematical Induction

The usual idea of induction is proof by "climbing a ladder", or "knocking over mathematical dominoes". Let's have $\Phi(n)$ stand for "the theorem $\Phi$ for the case of natural number $n$". Then if you can show that

(PMI 1) $\Phi(0)$ is true
(PMI 2) $\forall k \geq 0$, if $\Phi(k)$ is true, then $\Phi(k+1)$ is true

Then you can conclude that $\Phi(n)$ is true for all natural numbers $n \geq 0$. The idea is that in property (PMI 2), the theorem $\Phi$ is self-fulfilling; there is an inherent quality to our theorem $\Phi$ that has it prove many cases for itself. If we know $\Phi(0)$ is true, then by (PMI 2) we know $\Phi(1)$ is true. Similarly, from $\Phi(1)$ we then know $\Phi(2)$ is true, and then from $\Phi(2)$ we know $\Phi(3)$ and so on. We climb the mathematical ladder starting with a simple case $\Phi(0)$ we know for a fact is true.

This is best seen with an example.

We will show the formula for the sum of the first $n$ integers is $\sum_{k=1}^n k = \frac{n(n+1)}{2}$.

Base Case (PMI 1): The formula holds for $n=1$ since $\sum_{k=1}^1 k = 1 = \frac{1(1+1)}{2}$.

Inductive Hypothesis (PMI 2): Let's assume our formula works for the first $n$ integers. We want to show that it holds for case $n+1$: $\sum_{k=1}^{n+1} k = \frac{(n+1)(n+2)}{2}$. We will show this by first rewriting our sum: $\sum_{k=1}^{n+1} k = \sum_{k=1}^{n} k + (n+1)$. By our assumption, we can reduce this to $\frac{n(n+1)}{2} + n+1$. Simplifying further,

$ \begin{align} \frac{n(n+1)}{2} + n+1 & = \frac{n(n+1) + 2(n+1)}{2} \newline & = \frac{(n+1)(n+2)}{2} \end{align} $

Which is precisely the formula our hypothesis predicted. So by the PMI, for all $n$, $\sum_{k=1}^n k = \frac{n(n+1)}{2}$.

$\blacksquare$
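
If you'd like to see the formula and the inductive step in action, here is a small Python check (function names are my own). It is only a sanity check on finitely many cases and is no substitute for the proof, which is the whole point of induction.

```python
def sum_formula_holds(n):
    """Compare the sum 1 + 2 + ... + n against the closed form n(n+1)/2."""
    return sum(range(1, n + 1)) == n * (n + 1) // 2

def inductive_step_holds(n):
    """The algebra behind (PMI 2): closed form for n, plus (n+1),
    equals the closed form for n+1."""
    return n * (n + 1) // 2 + (n + 1) == (n + 1) * (n + 2) // 2

print(all(sum_formula_holds(n) for n in range(1, 201)))
print(all(inductive_step_holds(n) for n in range(1, 201)))
```

Integer division is safe here because $n(n+1)$ is always even.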

Here, we took a statement that can naturally be indexed by the natural numbers—in this case, a formula—and using the method of induction, proved our claim that the statement held. Note here that we didn't show $\Phi(0)$ is true, but rather $\Phi(1)$. This minor change isn't a big deal, since what we really want to show is a base case to build our inductive claim off of; we don't care where our ladder starts, so long as we can show there is some ground for it to stand on.

Induction is a very strong tool, as it can usually be applied to any claim that can be indexed by the natural numbers; if there is a natural way to "count" the cases of your hypothesis, odds are induction could be useful.

However, note that induction didn't give us the arithmetic sum formula to begin with, but rather only a method to verify it. Some amount of intuition or case work needs to be done beforehand to get a hypothesis to follow through with induction.

Let's do a few more examples. Here's a simple one from set theory.

De Morgan's Laws: For sets $A_1, A_2, \cdots, A_n$, we have

$(A_1 \cap A_2 \cap \cdots \cap A_n)^c = A_1^c \cup A_2^c \cup \cdots \cup A_n^c$ $(A_1 \cup A_2 \cup \cdots \cup A_n)^c = A_1^c \cap A_2^c \cap \cdots \cap A_n^c$

Just as a reminder, for sets $A$ and $B$,

$A \cup B = \{x \ | \ x \in A \textrm{ or } x \in B\}$
$A \cap B = \{x \ | \ x \in A \textrm{ and } x \in B\}$
$A^c = \{x \ | \ x \notin A\}$

Notice how we're not working with integers directly, but rather integer amounts of sets. So, we'll then induct on the number of sets as our natural division of cases.

Base Case: We'll show this for $n = 2$. Let $x \in (A \cap B)^c$ be an arbitrary element. Then $x \notin A \cap B$, so we have either $x \notin A$ or $x \notin B$ by the rules for $\cap$. Equivalently, $x \in A^c$ or $x \in B^c$, so regardless of which set $x$ is in, it must be the case that $x \in A^c \cup B^c$. Hence, every element of $(A \cap B)^c$ is an element of $A^c \cup B^c$, so $(A \cap B)^c \subseteq A^c \cup B^c$. To summarize,

$\begin{array}{ccc|cc} & x \in (A \cap B)^c & & & \textrm{Assumption} \ \newline & x \notin A \cap B & & & \textrm{Definition of set complement} \ \newline & x \notin A \textrm{ or } x \notin B & & & \textrm{Rules for } \cap \ \newline & x \in A^c \textrm{ or } x \in B^c & & & \textrm{Definition of set complement} \ \newline & x \in A^c \cup B^c & & & \textrm{Rules for } \cup \ \newline \hline \therefore & (A \cap B)^c \subseteq A^c \cup B^c & & & \textrm{Definition of subset} \end{array}$

Going the other way, suppose we have an arbitrary element $x \in A^c \cup B^c$. For the sake of contradiction, assume $x \notin (A \cap B)^c$. The rest of the proof is very similar to above.

$\begin{array}{ccc|cc} & x \in A^c \cup B^c & & & \textrm{Assumption} \ \newline & x \notin (A \cap B)^c & & & \textrm{Assumption} \ \newline & x \in A \cap B & & & \textrm{Definition of set complement} \ \newline & x \in A \textrm{ and } x \in B & & & \textrm{Rules for } \cap \ \newline & x \notin A^c \textrm{ and } x \notin B^c & & & \textrm{Definition of set complement} \ \newline & x \notin A^c \cup B^c & & & \textrm{Rules for } \cup \ \newline & x \in (A \cap B)^c & & & \textrm{Proof by contradiction} \ \newline \hline \therefore & A^c \cup B^c \subseteq (A \cap B)^c & & & \textrm{Definition of subset} \end{array}$

The only way two sets can both be subsets of each other is if they're equal to each other. So by double containment, $(A \cap B)^c = A^c \cup B^c$.

Inductive Hypothesis: Let's assume our formula works for some number of $n$ sets. We want to show this works for $n+1$ sets. Fortunately most of our work was already done in the base case.

$ \begin{align} (A_1 \cap \cdots \cap A_n \cap A_{n+1})^c = ((A_1 \cap \cdots \cap A_n) \cap A_{n+1})^c & = (A_1 \cap \cdots \cap A_n)^c \cup A_{n+1}^c \\ & = A_1^c \cup \cdots \cup A_n^c \cup A_{n+1}^c \end{align} $

Since $(A_1 \cap \cdots \cap A_n)$ is just another set, we can first use the case for 2 sets in the 2nd equality, and the case for $n$ sets in the last equality.

The second of De Morgan's Laws is proved almost identically.

$\blacksquare$
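If you'd like to see both laws in action, here is a small Python sketch using built-in sets. It assumes complements are taken inside a finite universe `U` (an assumption of this example, since a computer can't hold "everything not in $A$"), and the helper names are hypothetical. Folding the sets together one at a time with `reduce` mirrors how the inductive step groups $(A_1 \cap \cdots \cap A_n) \cap A_{n+1}$.

```python
from functools import reduce


def complement(s, universe):
    """Complement of s relative to a finite universe."""
    return universe - s


def de_morgan_holds(sets, universe):
    """Check both of De Morgan's Laws for a list of subsets of `universe`."""
    intersection = reduce(lambda a, b: a & b, sets)
    union = reduce(lambda a, b: a | b, sets)
    complements = [complement(s, universe) for s in sets]
    union_of_complements = reduce(lambda a, b: a | b, complements)
    intersection_of_complements = reduce(lambda a, b: a & b, complements)
    return (complement(intersection, universe) == union_of_complements
            and complement(union, universe) == intersection_of_complements)


U = set(range(10))
A = [{0, 1, 2, 3}, {2, 3, 4, 5}, {3, 5, 7, 9}]
print(de_morgan_holds(A, U))
```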

Let's pull another example from linear algebra.

Claim: Let $A$ be a square matrix. The eigenvectors of distinct eigenvalues of $A$ are linearly independent.

As a reminder:

An eigenvector of a matrix $A$ is a non-zero vector $\textbf{v}$ with corresponding eigenvalue $\lambda$ such that $A \textbf{v} = \lambda \textbf{v}$
We call a set of vectors $\{\textbf{v}_1, \cdots, \textbf{v}_k \}$ linearly independent when $a_1 \textbf{v}_1 + \cdots + a_k \textbf{v}_k = 0$ if and only if $a_1 = a_2 = \cdots = a_k = 0$ for scalars $a_1, \cdots, a_k$

We'll induct on the number of eigenvectors.

Base Case: For one eigenvector, it's clearly true, as $\textbf{v}_1 \neq 0$ by definition. So $a_1 \textbf{v}_1 = 0$ if and only if $a_1 = 0$.

Inductive Hypothesis: Now suppose that the claim is true for fewer than $n$ eigenvectors of distinct eigenvalues. So suppose that

$a_1 \textbf{v}_1 + a_2 \textbf{v}_2 + \cdots + a_n \textbf{v}_n = 0 \ \ \ \ (*)$

We want to show $a_1 = a_2 = \cdots = a_n = 0$. Let's apply $A$ to both sides of $(*)$:

$\begin{align} A(a_1 \textbf{v}_1 + a_2 \textbf{v}_2 + \cdots + a_n \textbf{v}_n) & = A(0) \\ a_1 \lambda_1 \textbf{v}_1 + a_2 \lambda_2 \textbf{v}_2 + \cdots + a_n \lambda_n \textbf{v}_n & = 0 \ \ \ \ (1) \end{align}$

where the $\lambda_i$ are the associated eigenvalues. Now let's also multiply $(*)$ by $\lambda_1$:

$a_1 \lambda_1 \textbf{v}_1 + a_2 \lambda_1 \textbf{v}_2 + \cdots + a_n \lambda_1 \textbf{v}_n = 0 \ \ \ \ (2)$

Now let's subtract equation (2) from (1).

$ \begin{array}{ccccccccc} & a_1 \lambda_1 \textbf{v}_1 & + & a_2 \lambda_2 \textbf{v}_2 & + & \cdots & + & a_n \lambda_n \textbf{v}_n & = 0 \\ - & a_1 \lambda_1 \textbf{v}_1 & + & a_2 \lambda_1 \textbf{v}_2 & + & \cdots & + & a_n \lambda_1 \textbf{v}_n & = 0 \\ \hline & & & a_2(\lambda_2 - \lambda_1)\textbf{v}_2 & + & \cdots & + & a_n(\lambda_n - \lambda_1)\textbf{v}_n & = 0 \end{array} $

By the inductive hypothesis, we know the $n-1$ eigenvectors $\textbf{v}_2,\textbf{v}_3,\cdots,\textbf{v}_n$ are linearly independent, so all the coefficients $a_i(\lambda_i - \lambda_1) = 0$. But also remember, the $\lambda_i$ are distinct eigenvalues, so $\lambda_i - \lambda_1 \neq 0$ for $i=2,\cdots,n$. So the only way $a_i(\lambda_i - \lambda_1) = 0$ is if $a_i = 0$ for $i = 2,\cdots, n$. Plugging this back into $(*)$, we get that $a_1 \textbf{v}_1 = 0$, and since $\textbf{v}_1 \neq 0$, it must be that $a_1 = 0$.

Thus, all coefficients $a_1 = a_2 = \cdots = a_n = 0$, so the eigenvectors are linearly independent, which proves the inductive step.

$\blacksquare$
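For a concrete instance of the claim, here is a tiny pure-Python check on a hypothetical $2 \times 2$ matrix chosen for this example. It verifies two eigenpairs with distinct eigenvalues and then tests independence with a determinant, which is a shortcut that works for two vectors in the plane and is not the inductive argument above.

```python
def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return tuple(sum(a * x for a, x in zip(row, v)) for row in A)


def is_eigenpair(A, v, lam):
    """True when v is non-zero and A v equals lam * v."""
    return any(x != 0 for x in v) and mat_vec(A, v) == tuple(lam * x for x in v)


def det2(v1, v2):
    """Determinant of the 2x2 matrix whose columns are v1 and v2."""
    return v1[0] * v2[1] - v2[0] * v1[1]


A = [[2, 1], [0, 3]]        # example matrix, not from the post
v1, lam1 = (1, 0), 2        # A v1 = (2, 0) = 2 v1
v2, lam2 = (1, 1), 3        # A v2 = (3, 3) = 3 v2

print(is_eigenpair(A, v1, lam1), is_eigenpair(A, v2, lam2))
print(det2(v1, v2) != 0)    # non-zero determinant means v1, v2 are independent
```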

Now induction doesn't always have to prove a positive claim. In fact, it can be combined with a proof by contradiction to prove a negative claim.

Claim: $\cos(1^{\circ})$ is irrational.

Proof: First, let's note the angle sum formula for cosine.

$\cos(\alpha \pm \beta) = \cos(\alpha)\cos(\beta) \mp \sin(\alpha)\sin(\beta)$

We can combine these to get the identity

$\cos(\alpha + \beta) + \cos(\alpha - \beta) = 2\cos(\alpha)\cos(\beta)$

Now let's assume for contradiction that $\cos(1^\circ)$ is rational. Now from our identity we just derived, let's note that

$\cos(N^\circ + 1^\circ) + \cos(N^\circ - 1^\circ) = 2\cos(N^\circ)\cos(1^\circ)$

Then, if we let $N = 1$, we can get an expression for $\cos(2^\circ)$.

$\cos(2^\circ) = \cos(1^\circ + 1^\circ) = 2\cos(1^\circ)\cos(1^\circ) - \cos(1^\circ - 1^\circ) = 2\cos^2 (1^\circ) - 1$

By our assumption that $\cos(1^\circ)$ is rational, $2\cos^2 (1^\circ) - 1$ is also rational, thus $\cos(2^\circ)$ is rational. Then using our identity

$\cos(N^\circ + 1^\circ) = 2\cos(N^\circ)\cos(1^\circ) - \cos(N^\circ - 1^\circ)$

We get $\cos(3^\circ)$ is rational by letting $N=2$. Then from that we get $\cos(4^\circ)$ is rational by letting $N=3$, and so on. Since we've proven our base cases for $\cos(1^\circ)$ and $\cos(2^\circ)$ are rational, we get inductively that $\cos(n^\circ)$ is rational for all natural numbers $n$.

In particular, then according to our induction, $\cos(30^\circ)$ is rational. But $\cos(30^\circ) = \frac{\sqrt{3}}{2}$, which is irrational. Contradiction! Thus, our initial assumption that $\cos(1^\circ)$ is rational must have been wrong, or in other words, it must be that $\cos(1^\circ)$ is irrational.

$\blacksquare$
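The engine of this proof is the recurrence that builds $\cos((N+1)^\circ)$ out of the two previous values using only multiplication and subtraction. Here's a quick numerical sketch of that recurrence in Python (the function name is made up). Floating-point numbers can't tell rational from irrational, so this only confirms that the recurrence really does walk from $\cos(1^\circ)$ up to $\cos(30^\circ)$.

```python
import math


def cos_degrees_by_recurrence(limit):
    """Return [cos(0°), cos(1°), ..., cos(limit°)] using only the recurrence
    cos((N+1)°) = 2 cos(N°) cos(1°) - cos((N-1)°)."""
    c1 = math.cos(math.radians(1))
    values = [1.0, c1]  # the two starting values: cos(0°) and cos(1°)
    for n in range(1, limit):
        values.append(2 * values[n] * c1 - values[n - 1])
    return values


vals = cos_degrees_by_recurrence(30)
# Compare the recurrence's value at 30 degrees with sqrt(3)/2.
print(vals[30], math.sqrt(3) / 2)
```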

Choosing a Base Case

In our formulation of induction, we said in particular:

(PMI 1) $\Phi(0)$ is true

We specified our base case to be $n=0$. But if you look at our example proofs, we used $n=1$ and $n=2$ as base cases sometimes, such as in the proof of De Morgan's Laws. Usually the base case doesn't matter, as induction would just say that for a base case $k$, a claim holds for all $n \geq k$. But we have to be careful of our choice of base case. For example, here's a false inductive proof.

Claim: All horses are the same color.

Base Case: For $n=1$, obviously a horse is the same color as itself.

Inductive Hypothesis: Suppose all sets of $n$ horses are the same color. Suppose we have $n+1$ horses, enumerated in some way (we assign each horse a number $1$ through $n+1$ to identify them). By the inductive hypothesis, the horses $1$ through $n$ are the same color. By the same inductive hypothesis again, horses $2$ through $n+1$ are the same color too. So all the horses are the same color as horses $2$ through $n$, thus showing that all $n+1$ horses are the same color. $\ \blacksquare$

Clearly this proof is wrong: some horses are brown, some are black, some are white, and some are patterned. Yet our "proof" would suggest otherwise. The issue lies in our base case, since it cannot be used to prove the case for $n=2$ horses. The key idea in the "proof" was to use the set of "intermediary" horses $2$ through $n$ to give some way of comparing horse $1$ with horse $n+1$ indirectly. But if there are only 2 horses, there is no middle horse to compare them to. If you want to prove that all horses are the same color in this way, you'd have to show it initially for 2 horses to have any chance of a real argument (which of course, isn't true).

But as we saw in our proof that $\cos(1^\circ)$ is irrational, choosing a faulty base case can be useful. There, we purposely picked a bad base case (i.e. by assuming $\cos(1^\circ)$ is rational) to get to an observable contradiction to then prove a negative claim. The point still stands, though: you must be careful with how you choose and apply your base cases.
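The missing middle horse is easy to see if we compute the overlap between the two groups used in the inductive step. A short Python sketch, with a hypothetical helper name:

```python
def overlap(n):
    """Horses shared by the groups {1..n} and {2..n+1} from the inductive step."""
    first = set(range(1, n + 1))
    second = set(range(2, n + 2))
    return first & second


# For n = 1 the overlap is empty, so nothing links horse 1 to horse 2.
for n in (1, 2, 3):
    print(n, sorted(overlap(n)))
```

For every $n \geq 2$ the overlap is non-empty and the step goes through; it is only the jump from 1 horse to 2 horses that breaks.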

1954237295746 ≠ ∞

Similar to the importance of choosing a base case, we must also be aware of what exactly induction is saying. The conclusion we get from an inductive argument is that $\Phi(n)$ is true for all natural numbers $n$. Note this doesn't mean it's true in the limit as $n \to \infty$; it is only true for every finite $n$, however large. Consider the following faulty proof:

Claim: $\pi$ is rational.

Base Case: We'll induct on the number of decimal places in the expansion of $\pi$. For $n=0$, clearly $3$ is rational.

Inductive Hypothesis: Say the truncation of $\pi$ to the $n$th decimal place, call it $x_n$, is rational. Then $\pi$ to the $(n+1)$st decimal place can be written as $x_{n+1} = x_n + \frac{m}{10^{n+1}}$ for some $m \in \{0,1,\cdots,9\}$. So $x_{n+1}$ is rational. So all decimal truncations of $\pi$ are rational, so $\pi$ is rational. $\ \blacksquare$

Clearly this is also wrong, and the culprit is exactly the misinterpretation we just spelled out: there is a difference between being true at each individual step toward the limit and being true in the limit.
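To make the valid part of that argument concrete, here's a Python sketch that builds the truncations $x_n$ as exact fractions. The digit string is hard-coded for illustration and the function name is made up. Every $x_n$ comes out as an honest fraction, which is all the induction ever shows; no finite $n$ gives you $\pi$ itself.

```python
from fractions import Fraction

PI_DIGITS = "3.14159265358979"  # hard-coded digits of pi, enough for this demo


def truncation(n):
    """x_n: pi truncated to n decimal places (n <= 14 here), as an exact fraction."""
    whole, decimals = PI_DIGITS.split(".")
    return Fraction(int(whole + decimals[:n]), 10 ** n)


# Each step adds m / 10^(n+1) for a single digit m, so rationality is preserved.
for n in range(5):
    print(n, truncation(n))
```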

Weak Induction vs. Strong Induction

There's something else weird in our inductive proof of De Morgan's Laws. Again, let's look at our specific outline of induction:

(PMI 2) $\forall k \geq 0$, if $\Phi(k)$ is true, then $\Phi(k+1)$ is true

The inductive step says to show $\Phi(k)$ implies $\Phi(k+1)$. Yet, if you look at our proof of De Morgan's Laws, we had to use not only the case that $\Phi(k)$ was true, but also $\Phi(2)$ was true; we needed to assume that the law holds for not only $n$ sets, but for 2 sets as well. I mean, clearly if we assume that $\Phi(k)$ is true and $k \geq 2$, then $\Phi(2)$ should be true, but that's not what we explicitly allowed in our Principle of Mathematical Induction.

This is an instance of strong induction. The formulation of mathematical induction we gave before was weak induction, but the difference between them is minimal. If you can show that a theorem $\Phi$ holds for:

(SPMI 1) $\Phi(0)$ is true
(SPMI 2) $\forall k \geq 0$, if $\Phi(0)\wedge\Phi(1)\wedge\cdots\wedge\Phi(k)$ is true, then $\Phi(k+1)$ is true

then you can conclude that $\Phi(n)$ is true for all $n$. The difference between strong induction and weak induction is just that strong induction allows you to use multiple cases in your proof, while weak induction only specifies the previous case.

Here's a nice example of strong induction:

Claim: A country has $n$ cities. Suppose that any two cities are connected by a one-way road. Then there is a route that passes through every city.

Base Case: Clearly it is true for $n=2$: just take the one-way road that leads out from one city and into the other.

Inductive Hypothesis: Now suppose this holds up to and for $n$ cities. We want to show this holds for $n+1$ cities. Let's take the $(n+1)^{\textrm{th}}$ city, and divide the remaining cities into two groups: cities that have a road into city $n+1$ (call this group $A$) and cities that have roads coming out of city $n+1$ (group $B$).

Now obviously group $A$ has $n$ or fewer cities, so by the inductive hypothesis, there is a route that passes through all of the cities in group $A$ (if $A$ is empty or has a single city, there is nothing to show). Similarly, for the same reason, there is a route that passes through all the cities in group $B$.

Now, take a route that passes through all the cities in $A$, then stop at the $(n+1)^{\textrm{th}}$ city, then finally go to a city in $B$ and complete a route that passes through all those cities. We can do that since every city in $A$ has a road to $n+1$, and every city in $B$ has a road from $n+1$.

Hence, the inductive hypothesis holds, and proves our claim. $\ \blacksquare$

We needed strong induction above since we have no idea how many cities are in groups $A$ and $B$, so we need the claim to hold for all values up to $n$, not just $n$ itself, in the inductive hypothesis. The renewed inductive hypothesis given in strong induction is—obviously so—much more applicable than its weaker counterpart, just by sheer introduction of additional inductive instances.
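The proof is constructive, so it translates almost word for word into a recursive function. Below is a minimal Python sketch; the function `route`, the predicate `road`, and the particular road layout are all assumptions made for this example. The recursive calls are made on groups of any smaller size, which is exactly where strong induction shows up.

```python
def route(cities, road):
    """Return an ordering of `cities` in which each city has a road to the next.

    `road(a, b)` is True when the one-way road between a and b points from a to b.
    """
    if len(cities) <= 1:
        return list(cities)  # zero or one city: nothing to connect
    *rest, last = cities
    group_a = [c for c in rest if road(c, last)]      # cities with a road into `last`
    group_b = [c for c in rest if not road(c, last)]  # cities with a road out of `last`
    # Route through A, stop at `last`, then route through B.
    return route(group_a, road) + [last] + route(group_b, road)


def road(a, b):
    """Hypothetical layout on cities 0..4: a -> b when b is 1 or 2 steps ahead mod 5."""
    return (b - a) % 5 in (1, 2)


cities = list(range(5))
path = route(cities, road)
print(path, all(road(a, b) for a, b in zip(path, path[1:])))
```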

Despite the naming, though, they are actually equal in strength; anything you can prove with weak induction you can prove with strong induction, and anything you can prove with strong induction you can prove with weak induction.

Strong Induction Implies Weak Induction: It shouldn't be much of a surprise that if strong induction holds, then weak induction also holds, since weak induction is just strong induction that only ever uses the most recent case. Let's prove it anyway. Suppose that strong induction is a valid proof technique. We want to show that weak induction is also valid. That is, from assumptions

(WPMI 1) for some base case $k$ we know $\Phi(k)$ is true, and
(WPMI 2) $\forall m \geq k$, $\Phi(m)$ is true $\Rightarrow$ $\Phi(m+1)$ is true,

we want to show that $\Phi(n)$ is true for all $n \geq k$. Obviously, whenever (WPMI 1) is true, then so is (SPMI 1) since they are both the same clause. However, let's also note that whenever (WPMI 2) is true, so is (SPMI 2). To see this, let's suppose (WPMI 2) is true. For any $m \geq k$, if $\Phi(k)\wedge\Phi(k+1)\wedge\cdots\wedge\Phi(m)$ is true, then certainly $\Phi(m)$ alone is true. By (WPMI 2), we then also know $\Phi(m+1)$ is true, thus showing (SPMI 2) holds.

Having this fact in hand, we can complete the proof. Let's assume strong induction holds. Then, if the assumptions (WPMI 1) and (WPMI 2) for weak induction are true, we know the assumptions (SPMI 1) and (SPMI 2) for strong induction are also true. Then by the conclusion of strong induction, we know $\Phi(n)$ holds for all $n \geq k$. But this is the exact same conclusion that we would get from weak induction. Hence weak induction holds whenever strong induction does.

$\blacksquare$

Again, this shouldn't be that surprising, as every instance of weak induction is just an instance of strong induction with weaker hypotheses.

Weak Induction Implies Strong Induction: This direction will require a bit more work. Let's assume that weak induction is a valid proof technique. We want to show that strong induction is valid. That is, from the assumptions

(SPMI 1) for some base case $k$ we know $\Phi(k)$ is true, and
(SPMI 2) $\forall m \geq k$, $\Phi(k)\wedge\Phi(k+1)\wedge\cdots\wedge\Phi(m)$ is true $\Rightarrow$ $\Phi(m+1)$ is true,

we want to show that $\Phi(n)$ is true for all $n \geq k$. What we will do is induct on a meta-statement about $\Phi$. Let $S(m)$ be the statement "$\Phi(n)$ is true for all $k \leq n \leq m$". If we can show $S(n)$ is true for all $n \geq k$ (with regular induction), then we would also show $\Phi(n)$ is true for all $n \geq k$ as well.

From (SPMI 1), we get $\Phi(k)$ is true, and thus $S(k)$ is also true (this will be our base case). From (SPMI 2), we get that for all $m \geq k$, if $S(m)$ is true, then $\Phi(m+1)$ is true (since $S(m)$ is equivalent to the hypothesis of the conditional). But, if $\Phi(m+1)$ is true and $S(m)$ is true, then we also get that $S(m+1)$ is true by definition. Combining these two conditionals, we get that if $S(m)$ is true, then $S(m+1)$ is true (this is our inductive step).

Combining our base case and inductive hypothesis for $S(n)$, we can deduce via weak induction, that $S(n)$ is true for all $n \geq k$. And as described before, it must be the case that $\Phi(n)$ is also true for all $n \geq k$, which is precisely the conclusion that strong induction wants.

Here's a more intuitive version of this proof, but it relies on knowing some basic logic to justify a direct argument. The idea is the same, but the above is more "contained" in my opinion, while the link is more readily understandable.

$\blacksquare$

So whenever a proof looks like it requires strong induction, it could also be done with weak induction. For example, with De Morgan's Laws, you could imagine that, having shown the case for $n=2$, you could use weak induction to show $n=3$. Similarly,

So, perhaps a more appropriate name would just be "mathematical induction", for weak and strong induction are both cut from the same cloth.

"Proving" Induction

We showed that weak and strong induction are equivalent, meaning if one holds so does the other. But, we never showed that either form of induction actually holds. Intuitively, I think the analogy of dominos or climbing a ladder makes induction seem only natural, but as with all things in math, we do need to justify using induction rigorously somehow. Fortunately, a lot of the work in the proof is given by the axioms that define the natural numbers. We'll first prove weak induction.

Weak Induction: Let $S$ be a set of natural numbers with the following properties:

(1) $0 \in S$
(2) If $k \in S$, then $k+1 \in S$

Then $S$ is the set of all natural numbers.

Although this looks different to our above applications of induction, all we need to do is define our set such that $S = \{ n \in \mathbb{N} \ | \ \Phi(n) \textrm{ is true} \}$. Hopefully it is clear that this is equivalent to our previous uses.

Proof: We will prove this by contradiction. So suppose that the set $T$ of natural numbers that are not in $S$ is non-empty, $T \neq \varnothing$. By the well-ordering principle, $T$ has a smallest element. Let's call this least element $\alpha$.

Since $0 \in S$ by $(1)$, we have $0 < \alpha \notin S$
By the contrapositive of $(2)$, we also have $\alpha - 1 \notin S$
Since $\alpha$ is the least element of $T$, we have $\alpha - 1 \notin T$, or in other words, $\alpha - 1 \in S$
This is a contradiction!

So, our initial assumption that such a non-empty set $T$ exists must be false, and thus it must be that $S$ contains all natural numbers. $ \ \blacksquare$

Even though we've proven the equivalence of weak and strong induction already, we can also prove the validity of strong induction independently in an almost identical fashion.

Strong Induction: Let $S$ be a set of natural numbers with the following properties:

(1) $0 \in S$
(2) If $0,1,\cdots,k \in S$, then $k+1 \in S$

Then $S$ is the set of all natural numbers.

Proof: We'll also prove this by contradiction. Suppose the set $T$ of natural numbers not in $S$ is non-empty, so $T$ must, by well-ordering, have a least element $\alpha$.

Since $0 \in S$ by $(1)$, we have $0 < \alpha \notin S$.
By the contrapositive of $(2)$, there is at least one element of $\{0,1,\cdots,\alpha -1\}$ that's not in $S$
But since $\alpha$ is the least element, $0,1,\cdots,\alpha -1 \notin T$ and so $0,1,\cdots,\alpha -1 \in S$
But then there is at least one element from $\{0,1,\cdots,\alpha -1\}$ that is both in and not in $S$
Contradiction!

Since the only assumption we made was $T$ was non-empty, it must be wrong, and thus $S$ is the set of all natural numbers. $ \ \blacksquare$

Well-Ordering Principle

In both proofs, we relied on the well-ordering of the natural numbers. A well-order is a relation $\leq$ on a set $X$ with the following properties:

Reflexivity: $\forall x \in X \ x \leq x$
Antisymmetry: $\forall x,y \in X \ (x \leq y \wedge y \leq x \Rightarrow x = y)$
Transitivity: $\forall x,y,z \in X \ (x \leq y \wedge y \leq z \Rightarrow x \leq z)$
Connectedness: $\forall x,y \in X \ (x \leq y \vee y \leq x)$
Minimums: Every non-empty subset $\varnothing \neq S \subseteq X$ contains a least element, i.e., $\exists \alpha \in S$ such that $\forall x \in S \ \alpha \leq x$

The set of natural numbers $\mathbb{N}$ is well-ordered under the relation "less than or equal to" $\leq$. This is an intuitive fact that we take for granted when working with the natural numbers, but it is a very special property given by the axioms of the natural numbers, and it will become important to us later on. This simple fact, though, is actually equivalent to the principle of mathematical induction itself!

Induction Implies Well-Ordering: Suppose for contradiction that $\mathbb{N}$ is not well-ordered. That is, there is a subset $\varnothing \neq S \subseteq \mathbb{N}$ with no least element. Then $0 \notin S$ as it would then be minimal. Then $1 \notin S$ as $1$ would be minimal as $0 \notin S$. Suppose $0,1,\cdots,n \notin S$. Then $n+1 \notin S$ as it would be minimal. By induction, $\forall n \in \mathbb{N} \ n \notin S$. That is, $S = \varnothing$, which is a contradiction. Thus $\mathbb{N}$ is well-ordered. $ \ \blacksquare$

Well-Ordering Implies Induction: Assume $\mathbb{N}$ is well-ordered. We want to show that if a subset $S\subseteq \mathbb{N}$ has the properties that

(1) $0 \in S$
(2) If $k \in S$, then $k+1 \in S$

then $S = \mathbb{N}$.

So for a contradiction, suppose $S \neq \mathbb{N}$. That is, the set $T \subseteq \mathbb{N}$ of natural numbers not in $S$ is non-empty. By well-ordering, $T$ has a least element $\alpha$. From assumption $(1)$, we have $0 \in S$, so $0 \notin T$. Therefore, $\alpha = k + 1$ for some $k \in \mathbb{N}$. But if $\alpha = k + 1 \in T$, then $k + 1 \notin S$. And by the contrapositive of assumption $(2)$, if $k + 1 \notin S$, then $k \notin S$ and further $k \in T$. But $k < k+1 = \alpha$, contradicting the minimality of $\alpha$.

So if $T \neq \varnothing$, it contains no minimal element, contradicting the well-ordering of $\mathbb{N}$. Therefore, it must be that $T = \varnothing$, or in other words, $S = \mathbb{N}$, validating induction. $ \ \blacksquare$

That's enough metatheory. Let's get back to some applications.

Variations on Induction

The simple idea of induction is far more flexible than our statement of it may seem. With some clever combinations of inductive arguments, we can prove many things in much simpler ways than typical induction may suggest.

Forward-Backward Induction

At this point, it should hopefully be clear that induction works in a "forward" direction; we can climb the integer ladder up and up so long as we have one rung to start on. Similarly we can have a "backward" inductive (BI) argument: if you can show for some claim $\Phi$ that

(BI 1) For some base case $k$ we know $\Phi(k)$ is true
(BI 2) $\forall m \leq k$, $\Phi(m)$ is true $\Rightarrow$ $\Phi(m-1)$ is true

then, naturally, we should be able to conclude that $\Phi(n)$ is true for all $n \leq k$. But we can use this idea in tandem with our normal form of induction to get a nice combined approach with forward-backward induction. If you can show that

(F-B 1) $\Phi(n)$ is true for infinitely many $n$
(F-B 2) For all $k$, if $\Phi(k)$ is true, then $\Phi(k-1)$ is true

then you can conclude that $\Phi(n)$ is true for all positive $n$. The idea is that you only need to show that $\Phi$ is true for infinitely many $n$, not necessarily all $n$. Then using (F-B 2), we can fill in the gaps between our infinitely many proven cases by working backwards from the cases in (F-B 1). Here's a more specific, but common example of forward-backward induction.

For some base case $k$ we know $\Phi(k)$ is true
$\forall m \geq k$, $\Phi(m)$ is true $\Rightarrow$ $\Phi(2m)$ is true
$\forall m \geq k$, $\Phi(m)$ is true $\Rightarrow$ $\Phi(m-1)$ is true

Here we show that all the powers of 2 (scaled by our base case) satisfy $\Phi(n)$, and then fill in all the gaps with our third condition that works backwards.
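To see the gap-filling happen, here's a small Python sketch that starts from a base case and repeatedly applies the two moves, doubling and stepping back by one, recording which cases get "proven". The function name and the cutoff `limit` are assumptions of this example; it also assumes the base case is at least 1 and only keeps cases at or above the base.

```python
def reachable(base, limit):
    """Cases n with base <= n <= limit reachable from `base` by the moves
    m -> 2m (forward step) and m -> m - 1 (backward step). Assumes base >= 1."""
    proven = {base}
    frontier = [base]
    while frontier:
        m = frontier.pop()
        for nxt in (2 * m, m - 1):
            # Doubling may overshoot `limit`; allow up to 2 * limit so that
            # the backward step can come back down and fill the gaps.
            if base <= nxt <= 2 * limit and nxt not in proven:
                proven.add(nxt)
                frontier.append(nxt)
    return sorted(n for n in proven if n <= limit)


print(reachable(1, 10))
```

Every case from the base up to the limit shows up, even though the forward step alone only ever lands on the base case times a power of 2.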

So really, forward-backward induction is just two inductive arguments being put together, which is more common than it may seem. For example, if you can show that

$\Phi(0)$ and $\Phi(1)$ are true
For all $k \geq 0$, $\Phi(k)$ is true $\Rightarrow$ $\Phi(k+2)$ is true

then $\Phi(n)$ is true for all $n$ as we just showed that $\Phi(n)$ is true for all even numbers and odd numbers separately; $\Phi(0)$ with the inductive step gets you the evens, while $\Phi(1)$ gets the odds. Forward-backward induction is the same idea, only combining two other inductive proofs to "fill in the gaps".

Let's put it to practice.

Arithmetic Mean-Geometric Mean (AM-GM) Inequality: For non-negative real numbers $a_1,a_2,\cdots,a_n$,

$\sqrt[n]{a_1 a_2 \cdots a_n} \leq \large{\frac{a_1 + a_2 + \cdots + a_n}{n}}$

with equality given if and only if $a_1 = a_2 = \cdots = a_n$.

Base Case: Clearly $n=1$ holds, so we'll also show it for $n=2$. Note that $\left(\sqrt{a_1} - \sqrt{a_2} \right)^2 \geq 0$. So by simple algebra,

$ \begin{align} \left(\sqrt{a_1} - \sqrt{a_2} \right)^2 & \geq 0 \newline a_1 - 2\sqrt{a_1 a_2} + a_2 & \geq 0 \newline a_1 + a_2 & \geq 2\sqrt{a_1 a_2} \newline \frac{1}{2}(a_1 + a_2) & \geq \sqrt{a_1 a_2} \end{align} $

Forward-Inductive Hypothesis: Let's assume the AM-GM inequality holds for some $n$. We'll show it also holds for $2n$.

$ \begin{align} \frac{a_1 + a_2 + \cdots + a_{2n}}{2n} & = \frac{\frac{a_1 + a_2 + \cdots + a_{n}}{n} + \frac{a_{n+1} + a_{n+2} + \cdots + a_{2n}}{n}}{2} \newline & \geq \frac{\sqrt[n]{a_1 a_2 \cdots a_n} + \sqrt[n]{a_{n+1} a_{n+2} \cdots a_{2n}}}{2} \newline & \geq \sqrt{\sqrt[n]{a_1 a_2 \cdots a_n} \sqrt[n]{a_{n+1} a_{n+2} \cdots a_{2n}}} \newline & = \sqrt[2n]{a_1 a_2 \cdots a_n a_{n+1} a_{n+2} \cdots a_{2n}} \end{align} $

Hence proving the forward inductive claim using cases for $2$ and $n$ non-negative numbers.

Backward-Inductive Hypothesis: Again we suppose the AM-GM inequality holds for $n$ numbers. We will show it holds for $n-1$ numbers. First, though, since AM-GM holds for $n$ numbers, it will also hold when we let $a_n = \frac{a_1 + \cdots + a_{n-1}}{n-1}$, the average of the previous $n-1$ numbers.

$ \begin{align} \frac{a_1 + a_2 + \cdots + a_{n}}{n} & \geq \sqrt[n]{a_1 a_2 \cdots a_n} \newline \frac{a_1 + a_2 + \cdots + \frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1}}{n} & \geq \sqrt[n]{a_1 a_2 \cdots a_{n-1} \frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1}} \newline \frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1} & \geq \sqrt[n]{a_1 a_2 \cdots a_{n-1} \frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1}} \newline \left(\frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1} \right)^n & \geq a_1 a_2 \cdots a_{n-1} \frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1} \newline \left(\frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1} \right)^{n-1} & \geq a_1 a_2 \cdots a_{n-1} \newline \frac{a_1 + a_2 + \cdots + a_{n-1}}{n-1} & \geq \sqrt[n-1]{a_1 a_2 \cdots a_{n-1}} \end{align} $

Thus proving the backward-inductive hypothesis.

By forward-backward induction, we've shown that AM-GM holds for any number of non-negative real numbers.

$\blacksquare$
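As a sanity check on the inequality itself, here's a short Python sketch that compares the two means on some random lists. The helper names are made up, and a handful of random samples is of course no substitute for the proof. The last two lines illustrate the trick in the backward step: appending the average of a list to the list leaves the average unchanged.

```python
import math
import random


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))


def am_gm_gap(xs):
    """AM minus GM; the inequality says this is never negative."""
    return arithmetic_mean(xs) - geometric_mean(xs)


random.seed(0)  # fixed seed so the sample lists are reproducible
for n in range(1, 9):
    xs = [random.uniform(0, 10) for _ in range(n)]
    # Small tolerance to absorb floating-point rounding.
    print(n, am_gm_gap(xs) >= -1e-12)

xs = [1, 2, 6]
print(arithmetic_mean(xs), arithmetic_mean(xs + [arithmetic_mean(xs)]))
```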

Proof by Infinite Descent

Building on this idea of using backwards induction, Fermat used backward induction in a very neat way. Because, well, there's something weird going on in our forward-backward induction, notably the backwards step:

(F-B 2) For all $k$, if $\Phi(k)$ is true, then $\Phi(k-1)$ is true

This doesn't seem all that problematic, but while the natural numbers are unbounded upwards, they are bounded downwards by 0. From $\Phi(k)$, we get $\Phi(k-1)$, then further $\Phi(k-2)$, and so on until we get $\Phi(0)$. But from $\Phi(0)$ can we conclude $\Phi(-1)$? The backwards inductive step would certainly suggest so, but that's not how natural numbers work. In the AM-GM inequality above, what would it mean for it to hold for "-1 numbers"? It just doesn't make sense.

The difference here is that we do have an understanding that backwards induction terminates eventually at 0. In particular, what our backwards inductive statement really should say is that

(F-B 2) For all $k\in \mathbb{N}$, if $\Phi(k+1)$ is true, then $\Phi(k)$ is true

We first assume that $k \in \mathbb{N}$ before we apply it to our claim we're proving. But what if we didn't assume that?

Proof by infinite descent exploits this assumption to make a very nice proof by contradiction. If we remove this condition assuming that one of our numbers is already a natural number, we could prove things by contradiction by showing that is a consequence. That is,

(Suppose for contradiction that) $\Phi$ is true
If $m \in \mathbb{N}$ and $\Psi(m)$ is true, then $m-1 \in \mathbb{N}$ and $\Psi(m-1)$ is true

Then you could conclude that $\Phi$ is false, since there is no infinitely descending sequence of natural numbers. Here $\Psi(n)$ could be any consequence of $\Phi$ that would imply an infinitely decreasing sequence of natural numbers.

This is best done with an example.

Claim: If $k$ is a positive integer and $\sqrt{k}$ is not an integer, then $\sqrt{k}$ is irrational.

Proof: Suppose for a contradiction that there is a positive integer $k$ such that $\sqrt{k}$ is not an integer and not irrational, i.e. $\sqrt{k}$ is rational. So there are natural numbers $m,n \in \mathbb{N}$ such that $\sqrt{k} = \frac{m}{n}$. Now we can do some algebra on this expression:

$ \begin{align} \sqrt{k} & = \frac{m}{n} \newline & = \frac{m(\sqrt{k} - \lfloor \sqrt{k} \rfloor)}{n(\sqrt{k} - \lfloor \sqrt{k} \rfloor)} \newline & = \frac{m\sqrt{k} - m\lfloor \sqrt{k} \rfloor}{n\sqrt{k} - n\lfloor \sqrt{k} \rfloor} \newline & = \frac{(n\sqrt{k})\sqrt{k} - m\lfloor \sqrt{k} \rfloor}{n(\frac{m}{n}) - n\lfloor \sqrt{k} \rfloor} \newline \sqrt{k} & = \frac{nk - m\lfloor \sqrt{k} \rfloor}{m - n\lfloor \sqrt{k} \rfloor} \end{align} $

By multiplying by $\frac{\sqrt{k} - \lfloor \sqrt{k} \rfloor}{\sqrt{k} - \lfloor \sqrt{k} \rfloor} = 1$, we ended up with a new rational expression for $\sqrt{k}$. But notice, by definition of the floor function $\lfloor \cdot \rfloor$, we multiplied the numerator and denominator by a number strictly between 0 and 1: $0 < \sqrt{k} - \lfloor \sqrt{k} \rfloor < 1$, where the left inequality holds because $\sqrt{k}$ is not an integer. So $m > m(\sqrt{k} - \lfloor \sqrt{k} \rfloor) = nk - m\lfloor \sqrt{k} \rfloor > 0$ and similarly for the denominator. And the new numerator and denominator are still natural numbers, since $m$, $n$, $k$, and $\lfloor \sqrt{k} \rfloor$ are all integers, so the combinations above are integers too, and we just saw they are positive!

So from a rational expression for $\sqrt{k} = \frac{m}{n}$, we found another equivalent rational expression with a strictly smaller numerator and denominator. So we could repeat this process again multiplying through by $\frac{\sqrt{k} - \lfloor \sqrt{k} \rfloor}{\sqrt{k} - \lfloor \sqrt{k} \rfloor}$ to find another smaller rational expression ad infinitum. But there are no infinitely decreasing sequences of natural numbers for the numerator and denominator to be a part of, giving us a contradiction.

So it must be that our initial assumption was wrong that there is a $\sqrt{k}$ that is neither an integer nor irrational. In other words, for all natural numbers $k$, either $\sqrt{k}$ is an integer or it is irrational.

$\blacksquare$
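We can watch the descent step run on actual numbers. The Python sketch below applies the map $(m, n) \mapsto (nk - m\lfloor \sqrt{k} \rfloor, \ m - n\lfloor \sqrt{k} \rfloor)$ from the proof to the fraction $\frac{99}{70}$, which only approximates $\sqrt{2}$. The function names are made up for this example. Because the starting fraction is not exactly $\sqrt{2}$, the chain of shrinking fractions eventually runs out; if some fraction were exactly $\sqrt{2}$, the chain would have to keep shrinking forever, which is the contradiction.

```python
import math


def descent_step(m, n, k):
    """One step of the descent: the new numerator and denominator from the proof."""
    f = math.isqrt(k)  # floor of sqrt(k)
    return n * k - m * f, m - n * f


def descent_chain(m, n, k):
    """Apply descent_step while it still produces a smaller positive fraction."""
    chain = [(m, n)]
    while True:
        m2, n2 = descent_step(m, n, k)
        if not (0 < n2 < n and m2 > 0):
            return chain  # the descent has bottomed out
        m, n = m2, n2
        chain.append((m, n))


for m, n in descent_chain(99, 70, 2):
    print(m, n, m / n)
```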

Here's a geometric application of infinite descent applied not to natural numbers directly, but rather integral multiples of a length of a line.

Multi-Induction

Let's revisit the very first inductive proof we gave. We showed that the formula for the sum of the first $n$ natural numbers is

$\sum_{i=1}^n i = \large{\frac{n(n+1)}{2}}$

There's a similar formula for a related double sum:

$\sum_{j=1}^m \sum_{i=1}^n (i + j) = \large{\frac{mn(m+n+2)}{2}}$

You might try to prove this with induction, but there's a sizeable difference between our original problems and this one: it's a formula of multiple variables.

As such, there is also a natural extension to what we've been doing in multi-induction. A way to think of it is like the "product" of inductive arguments; if we were "inducting linearly" before, we'll now "induct in a grid".

For our case of double induction, we'll write $\Phi(m,n)$ for the statement that our formula is true for the pair $(m,n)$, just like what we did before with one variable. We'll then use the following formulation for double induction (DI): we'll show that

(DI 1) $\Phi(1,1)$ is true
(DI 2) For all $m \geq 1$, show $\Phi(m,1)$ is true
(DI 3) For each fixed $m_0$ and for all $n \geq 1$, show that $\Phi(m_0, n)$ is true

to prove that $\Phi(m,n)$ is true for all integers $m,n \geq 1$. If you think of $(m,n)$ as coordinates in space, (DI 2) is the "horizontal" induction, proving the claim along the $x$-axis and giving us the base cases for (DI 3), which fills in the columns along the $y$-axis.
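If the grid picture helps, here is a rough Python sketch of the order in which the three steps reach the cells of a finite grid. The function name `double_induction_order` and the $3 \times 3$ grid are just my choices for illustration.

```python
def double_induction_order(max_m, max_n):
    """Return grid cells (m, n) in the order the three steps establish them."""
    order = [(1, 1)]                      # (DI 1): the single base case
    for m in range(2, max_m + 1):         # (DI 2): walk along the row n = 1
        order.append((m, 1))
    for m0 in range(1, max_m + 1):        # (DI 3): for each fixed m0 ...
        for n in range(2, max_n + 1):     # ... climb the column above (m0, 1)
            order.append((m0, n))
    return order

cells = double_induction_order(3, 3)
print(cells)
```

Every cell shows up exactly once, and each one appears after the cell it is derived from: $(m-1, 1)$ along the bottom row, or $(m_0, n-1)$ inside a column.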

And as you can probably guess, we'll show (DI 2) and (DI 3) with induction.

Claim: $\sum_{j=1}^m \sum_{i=1}^n (i + j) = \large{\frac{mn(m+n+2)}{2}}$

Base Case: We'll check it for the case $m = n = 1$. $\sum_{j=1}^1 \sum_{i=1}^1 (i + j) = 2 = \frac{1 \cdot 1(1+1+2)}{2}$.

Horizontal Inductive Hypothesis: We'll induct on $m$ and prove that the formula holds for all $m \geq 1$ while fixing $n=1$. So suppose the formula holds for some $m$, that is, $\sum_{j=1}^{m} \sum_{i=1}^1 (i + j) = \frac{m(m+3)}{2}$. We'll show that it also holds for $m+1$:

$ \begin{align} \sum_{j=1}^{m+1} \sum_{i=1}^1 (i + j) = \sum_{j=1}^{m+1} (1 + j) & = \sum_{j=1}^{m} (1 + j) + (1 + m + 1) \newline & = \frac{m(m+3)}{2} + (m + 2) \newline & = \frac{m^2 + 5m + 4}{2} \newline & = \frac{(m+1)(m+4)}{2} \newline & = \frac{(m+1)((m+1) + 3)}{2} \end{align} $

Since $\Phi(m,1) \Rightarrow \Phi(m+1,1)$, by induction, $\Phi(m,1)$ is true for all $m \geq 1$.

Vertical Inductive Hypothesis: Horizontal induction now gives us the base cases we need to induct on $n$. We now know that for a fixed $m_0$, $\Phi(m_0, 1)$ is true, so that will act as our base case. Now assume that $\Phi(m_0,n)$ is true for some $n$. That is,

$\sum_{j=1}^{m_0} \sum_{i=1}^n (i + j) = \large{\frac{m_{0} n(m_0+n+2)}{2}}$

We will show that $\Phi(m_0, n+1)$ is also true.

$ \begin{align} \sum_{j=1}^{m_0} \sum_{i=1}^{n+1} (i + j) & = \sum_{j=1}^{m_0} \left( \sum_{i=1}^{n} (i + j) + (n+1 + j) \right) \newline & = \sum_{j=1}^{m_0} \sum_{i=1}^{n} (i + j) + \sum_{j=1}^{m_0} (n+1 + j) \newline & = \frac{m_{0} n(m_0+n+2)}{2} + \sum_{j=1}^{m_0} (n+1 + j) \newline & = \frac{m_{0} n(m_0+n+2)}{2} + \frac{m_0(m_0 + 1)}{2} + m_0(n+1) \newline & = \frac{m_{0}(m_0 n + n^2 + 2n)}{2} + \frac{m_0(m_0 + 1 + 2(n + 1))}{2} \newline & = \frac{m_{0}(m_0 n + n^2 + 2n + m_0 + 1 + 2(n + 1))}{2} \newline & = \frac{m_0 (n + 1)(m_0 + (n+1) + 2)}{2} \end{align} $

We get the 3rd line equality by the inductive hypothesis, the 4th line is just an arithmetic series, and the rest follow from algebra. Thus, $\Phi(m_0, n) \Rightarrow \Phi(m_0, n+1)$, and by induction, $\Phi(m_0,n)$ is true for a fixed $m_0$ and all $n \geq 1$.

But $m_0$ was arbitrary. So $\Phi(m,n)$ is true for all $m,n \geq 1$. In other words, for all $m,n \geq 1$,

$\sum_{j=1}^m \sum_{i=1}^n (i + j) = \large{\frac{mn(m+n+2)}{2}}$

$\blacksquare$
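A proof is a proof, but it never hurts to sanity-check a closed form by brute force. Here is a short Python sketch that compares the double sum against $\frac{mn(m+n+2)}{2}$ on a $30 \times 30$ grid (the range is arbitrary, and of course a finite check proves nothing on its own).

```python
def double_sum(m, n):
    """Compute the double sum of (i + j) directly from its definition."""
    return sum(i + j for j in range(1, m + 1) for i in range(1, n + 1))

def closed_form(m, n):
    """The claimed formula; m*n*(m+n+2) is always even, so // is exact."""
    return m * n * (m + n + 2) // 2

mismatches = [
    (m, n)
    for m in range(1, 31)
    for n in range(1, 31)
    if double_sum(m, n) != closed_form(m, n)
]
print(mismatches)  # an empty list means the two agree everywhere we looked
```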

Transfinite Induction

In our original proofs of weak and strong induction, we relied primarily on the well-ordering of the natural numbers. Actually, that's really about all we cared about. Let's look again at our statement of (strong) induction from that proof:

Strong Induction: Let $S$ be a set of natural numbers with the following properties:

$0 \in S$
If $0,1,\cdots,k \in S$, then $k+1 \in S$

Then $S$ is the set of all natural numbers.

Let's rewrite this with a bit more notation.

Strong Induction Again: Let $S \subseteq \mathbb{N}$ be a subset with the following properties:

$0 \in S$
If $\forall m \leq k \ \ m \in S$, then $k+1 \in S$

Then $S = \mathbb{N}$.

If all that really mattered was the idea of $\mathbb{N}$ being well-ordered (every non-empty subset has a least element, which is what gives us a "next" element to climb to), what's stopping us from replacing $\mathbb{N}$ with any set $A$ that is well-ordered under a relation $\leq$ (that is not necessarily "less than or equal")? There is one issue though, that comes in the second clause for strong induction:

If $\forall m \leq k \ \ m \in S$, then $k+1 \in S$

Sometimes we can have well-ordered sets (so there is always a "next" element to move up to) that contain elements which are not the "next", or $+1$, element of anything below them. I know this sounds contradictory, but it will help if we talk it out from the start. So far we've been working with the natural numbers, so that seems like a good place to start. But first, we need to understand what the natural numbers are.

Finite Ordinals

In general, we use the natural numbers to count things, like the sizes of sets. So oftentimes, we do just that! Let's look at a construction of the natural numbers via sets that does exactly that (due to von Neumann):

von Neumann's Construction of $\mathbb{N}$: We will define them recursively.

$0 = \varnothing$
$n+1 = n \cup \{n\}$ (this is the successor of a number)

So the first few natural numbers with this definition would be

$ \begin{align} 0 & = \varnothing \newline 1 & = 0 \cup \{0\} = \varnothing \cup \{\varnothing\} = \{\varnothing\} \newline 2 & = 1 \cup \{1\} = \{\varnothing\} \cup \{\{\varnothing\}\} = \{\varnothing, \{\varnothing\}\} \newline 3 & = 2 \cup \{2\} = \{\varnothing, \{\varnothing\}\} \cup \{\{\varnothing, \{\varnothing\}\}\} = \{\varnothing, \{\varnothing\}, \{\varnothing, \{\varnothing\}\} \} \end{align} $

If it's easier, you can also think of each number being defined as the set of all previous numbers, meaning that $n+1 = \{0, 1, \cdots, n\}$. So the number, say $4$, could be thought of as how many numbers are in the set $\{0,1,2,3\}$. If you're interested, you can look into how to do all the standard arithmetic with these set operations, but the only point that we care about is how we can compare them.
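The construction is easy to play with in code. Here is a minimal Python sketch that uses `frozenset` to stand in for sets (the helper names `successor` and `von_neumann` are mine) and checks both descriptions of a number against each other.

```python
def successor(n):
    """n + 1 = n ∪ {n}, with sets represented as frozensets."""
    return n | frozenset([n])

def von_neumann(count):
    """Return the first `count` von Neumann naturals: 0, 1, ..., count - 1."""
    numbers, current = [], frozenset()   # 0 is the empty set
    for _ in range(count):
        numbers.append(current)
        current = successor(current)
    return numbers

nats = von_neumann(6)
for value, number in enumerate(nats):
    assert len(number) == value                  # the set for n has n elements
    assert number == frozenset(nats[:value])     # n is the set of all previous numbers
print([len(number) for number in nats])
```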

Since we're working with sets, the idea of "less than" doesn't obviously make sense, but fortunately, we built a nice counterpart into our definition. Our ordering relation is membership $\in$: for two numbers, if $a \in b$, then we have $a < b$. This is easiest to see with the second definition of a number being the set of all previous numbers. We call sets that relate to each other in this set-theoretic way ordinals. In particular,

Ordinals are well-ordered under the relation $\in$, and
Every ordinal $S$ also has the property that whenever $x \in S$, it's also the case that $x \subseteq S$.

Ordinals generalize the way we would normally count things to infinite sizes. The natural numbers are also known as the finite ordinals.
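Both properties can be checked directly on the first few finite ordinals. This is a small Python sketch under the same `frozenset` encoding as von Neumann's definition; the name `finite_ordinals` and the cutoff of 7 are my own choices.

```python
from itertools import product

def finite_ordinals(count):
    """Build 0, 1, ..., count - 1 as nested frozensets."""
    ordinals, current = [], frozenset()
    for _ in range(count):
        ordinals.append(current)
        current = current | frozenset([current])   # successor: n ∪ {n}
    return ordinals

ords = finite_ordinals(7)

# Membership behaves exactly like strict "less than" on the indices 0..6.
membership_is_order = all(
    (a in b) == (i < j)
    for (i, a), (j, b) in product(enumerate(ords), repeat=2)
)

# Every element of an ordinal is also a subset of that ordinal.
elements_are_subsets = all(x <= s for s in ords for x in s)

print(membership_is_order, elements_are_subsets)
```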

So every natural number can be identified with the set of all natural numbers before it. And the set of all these finite ordinals is well-ordered, and we can do induction on them like we have been doing before. Great!

Infinite Ordinals

But look at our definition of an ordinal. We're defining them in this recursive manner, but there's nothing saying that they have to be of finite size. So what happens if we just make an infinite ordinal? So far all of our ordinals have been finite natural numbers. What's stopping us from extending one of these sets to infinity?

Well, suppose we did take the limit of the set of natural numbers. That is the set, let's call it $\omega$, defined as

$\omega = \{0,1,2,3,4,\cdots\}$

This set is certainly greater than any natural number under the relation $\in$. In fact, it is the supremum of the natural numbers, being the smallest ordinal greater than all of the natural numbers. Also, by the way we defined the natural numbers, every element of $\omega$ is also a subset of $\omega$, so $\omega$ fits the definition of an ordinal too.

But think about that for a second. If we take the set of all natural numbers alongside this infinite ordinal in the set $\mathbb{N} \cup \{\omega\}$, we definitely have a well-ordered set, but how could we ever do induction on this set? Because $\omega$ is greater than all natural numbers, it definitely does not have the form of $k+1$ for some natural number $k$. Yet, according to our definition of induction

(SPMI 2) If $\forall m \leq k \ \ m \in S$, then $k+1 \in S$

those are the only statements we could build off of. No matter how long you let our induction go for, it will never be able to prove a statement is true for the ordinal $\omega$, while still being in a well-ordered set.

This is the purpose of transfinite induction. When working to prove things for any well-ordered set, we can run into these limiting cases ($\omega$ is aptly called a limit ordinal) that don't fit the notion of being the "next element" after anything; you can imagine labelling the elements of a well-ordered set with 0, 1, 2, etc. as we have been doing before, and then eventually running into an element $\omega$ that throws your induction off. It's always infinity that makes things a little more difficult.

The Principle of Transfinite Induction

The principle of transfinite induction will look extremely similar to what we've already been doing.

Transfinite Induction: Let $A$ be a well-ordered set, and $\Phi(x)$ be a proposition that has domain $A$ with the following properties:

(TI 1) $\Phi(0)$ is true
(TI 2) For every $a \in A$, if $\Phi(b)$ is true for all $b < a$, then $\Phi(a)$ is true

Then you can conclude that $\Phi(x)$ is true $\forall x \in A$.

$0$ just means the least element of $A$. The only real difference again is in the inductive step. In (TI 2), we replace the idea of having $k+1$ with just a notion of "greater than" so as to deal with possible limiting cases. In fact, transfinite induction is often presented as normal induction with an additional condition:

Transfinite Induction: Let $A$ be a well-ordered set, and $\Phi(x)$ be a proposition that has domain $A$ with the following properties:

(TI 1: Base Case) $\Phi(0)$ is true
(TI 2a: Successor Case) If $\Phi(\beta)$ is true for all $\beta \leq \alpha$, then $\Phi(\alpha+1)$ is true
(TI 2b: Limiting Case) If $\alpha$ is a limit element (not $0$ and not of the form $\beta + 1$) and $\Phi(\beta)$ is true for all $\beta < \alpha$, then $\Phi(\alpha)$ is true

Then you can conclude that $\Phi(x)$ is true $\forall x \in A$.

(TI 2a) is essentially just the normal form of (strong) induction we've come to know and love, with the only difference being that we can have successor ordinals that are not natural numbers, like $\omega + 1$. So it really is just the limiting cases that can throw a wrench into our proofs.

Let's now put it to work.

Examples

Claim: There is a subset $A \subset \mathbb{R}^2$ that intersects every line in exactly two points.

This is not clear to me at all. I mean, just by the sheer number of lines there are, it seems hard to nudge points around so that no three are ever collinear, while still somehow covering every line. We can uniquely identify every non-vertical line with an equation $y = ax + b$, and so with an ordered pair $(a,b)$; the vertical lines $x = c$ only add one more real number's worth of lines. Thinking of $(a,b)$ as coordinates, it should be clear that the number of lines is equal to the size of $\mathbb{R}^2$, which is more than the number of natural numbers. So if we want to prove this via induction, we'll need to do so with transfinite induction.

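Proof: Let $\{ L_\alpha \}$ be a labelling of every line (i.e. assign every line an ordinal $\alpha$). There are $2^{\aleph_0}$ lines, and by the well-ordering theorem (which relies on the axiom of choice) we can choose the labelling so that every label $\alpha$ has fewer than $2^{\aleph_0}$ labels before it. We'll inductively construct a sequence $\{A_\alpha\}$ of subsets of $\mathbb{R}^2$ with the following properties for each $\alpha$: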

(Instantiation) $A_\alpha$ has at most two points
(Preservation) $\bigcup_{\beta \leq \alpha} A_\beta$ contains no three collinear points
(Diagonal) $\bigcup_{\beta \leq \alpha} A_\beta$ contains exactly two points of $L_\alpha$

Then $A = \bigcup A_\alpha$ over all ordinals $\alpha$ will have the desired property. This is because the preservation property will ensure every line has at most two points in $A$, while the diagonal property will ensure that every line has at least two points in $A$.

Base Case: $\alpha = 0$. $L_0$ would be our first line, and all we need to do to satisfy those 3 properties above is let $A_0$ be any two points in $L_0$.

Successor Case: Suppose that for some ordinal $\alpha$, the sequence $\{A_\beta\}_{\beta \leq \alpha}$ satisfies the 3 properties above. We'll show that we can construct a satisfactory $A_{\alpha + 1}$ to extend the sequence.

Let $B = \bigcup_{\beta \leq \alpha} A_{\beta}$, and $C$ be the set of all lines that contain two points from $B$. By the inductive hypothesis, $B$ has the preservation property, so line $L_{\alpha + 1}$ has at most 2 points in $B$, i.e. $L_{\alpha + 1} \cap B$ has at most 2 points. Now we have two cases to consider.

Case 1: If $L_{\alpha + 1} \cap B$ has exactly 2 points, we just let $A_{\alpha+1} = \varnothing$ to immediately satisfy the 3 properties.

Case 2: If $L_{\alpha + 1} \cap B$ has fewer than 2 points, then $L_{\alpha+1}$ is not itself in $C$, so it intersects every line in $C$ in at most 1 point (if it intersected a line in more than 1 point, it would be that line itself), i.e. for all lines $L \in C \ \ |L_{\alpha+1} \cap L| \leq 1$. To satisfy the preservation and diagonal property, we obviously want to pick a point in $L_{\alpha + 1} \backslash \bigcup C$ (i.e. a point on our new line but not on any of the lines through two previous points). We can do this for sure since we can show $L_{\alpha + 1} \backslash \bigcup C \neq \varnothing$.

The set of all points in a line has the same cardinality as the reals (it's just a tilted number line, after all). So $|L_{\alpha + 1}| = |\mathbb{R}| = 2^{\aleph_0}$.
Every non-vertical line can be uniquely written as $y = ax+b$, and thus be identified by the ordered pair $(a,b)$; the vertical lines $x = c$ add only $|\mathbb{R}|$ more. So the number of lines is $|\mathbb{R}^2| = 2^{\aleph_0}$.
$B$ is a union of at most two points for each label up to $\alpha$. As long as the labelling was chosen so that every label has fewer than $2^{\aleph_0}$ labels before it (which the well-ordering theorem allows), this gives $|B| < 2^{\aleph_0}$. Each line in $C$ is determined by a pair of points of $B$, so $|C| \leq |B \times B| < 2^{\aleph_0}$ as well.
Using the fact that for all lines $L \in C \ \ |L_{\alpha+1} \cap L| \leq 1$: $|L_{\alpha + 1} \cap \bigcup C| = |\bigcup_{L \in C} L_{\alpha + 1} \cap L| \leq |C| < 2^{\aleph_0}$
So $L_{\alpha + 1}$ and $\bigcup C$ share fewer than $2^{\aleph_0}$ points, while $L_{\alpha + 1}$ itself has $2^{\aleph_0}$ points, meaning that $L_{\alpha + 1} \backslash \bigcup C \neq \varnothing$.

Now we're free to pick a subset $A_{\alpha + 1} \subset L_{\alpha + 1} \backslash \bigcup C$. If $L_{\alpha + 1} \cap B$ has one point, just pick 1 point for $A_{\alpha + 1}$, and if $L_{\alpha + 1} \cap B$ has no points, then let $A_{\alpha + 1}$ be 2 points. This will satisfy our preservation and diagonal properties.

Limit Case: Suppose that for some limit ordinal $\alpha$, and for all $\gamma < \alpha$, the sequence $\{A_\beta\}_{\beta \leq \gamma}$ satisfies the 3 properties above. We want to show that we can construct an $A_{\alpha}$ to extend the sequence. Notice in our successor argument, we never really needed the fact that we were proving it for $A_{\alpha + 1}$, but just the fact that we had a series of already constructed previous cases of $A_\beta$ with $\beta < \alpha + 1$. This makes our limit case almost identical to the successor case. Here's the outline:

Let $B = \bigcup_{\gamma < \alpha} A_\gamma$, and $C$ be the set of all lines that contain two points from $B$
Now consider the set $L_\alpha \cap B$
$L_\alpha \cap B$ contains at most 2 points because each $\{A_\beta\}_{\beta \leq \gamma}$ with $\gamma < \alpha$ has the preservation property (any three points of $B$ already appear together at some earlier stage $\gamma$)
Case 1: If $L_\alpha \cap B$ contains 2 points, let $A_{\alpha} = \varnothing$
Case 2: If $L_\alpha \cap B$ contains fewer than 2 points, then $L_{\alpha}$ intersects with every line in $C$ at most once
- By the same reasoning as before we can pick a subset $A_{\alpha} \subset L_{\alpha} \backslash \bigcup C \neq \varnothing$
- If $L_\alpha \cap B$ has 1 point, let $A_{\alpha}$ be 1 point
- If $L_\alpha \cap B$ has 0 points, let $A_{\alpha}$ be 2 points

Now again let $A = \bigcup A_\alpha$ over all ordinals $\alpha$. By transfinite induction, we've shown there is a subset $A \subset \mathbb{R}^2$ that intersects every line exactly twice.

$\blacksquare$

Outside of set theory, because there are more reals than natural numbers, transfinite induction lends itself to these types of weird geometry problems.

Though, I have to admit I did lie a little bit here. We technically used transfinite recursion as opposed to transfinite induction. We didn't necessarily prove a claim for all ordinals $\alpha$, but rather constructed something inductively/recursively over the ordinals that satisfied some properties. The difference is subtle, but they are technically different methods from one another. Maybe you could argue that we were proving the claim that "there is something constructible" for all ordinals, so I find the difference nitpicky.

Examples of transfinite induction can be tedious (especially if you're uncomfortable with ordinals and working with infinities), but you can find more in this book. Some interesting examples include:

There is a countable partition of $\mathbb{R}^2$ so that the distance between any two different points in the same set is irrational
$\mathbb{R}^2$ is not a union of disjoint circles
$\mathbb{R}^3$ is a union of disjoint circles

Structural Induction

The idea of well-ordering—a key idea in "climbing a ladder"—naturally led us to generalize induction to transfinite induction. But there's something else in our idea of induction that, to me at least, seems more characteristic to the proof technique: recursion.

We've already seen recursion a little bit, but this idea of defining things within themselves is what induction really screams to me. In that second clause of our principle,

(PMI 2) $\forall k \geq 0$, if $\Phi(k)$ is true, then $\Phi(k+1)$ is true

we rely on proving the validity of $\Phi(k+1)$ by already assuming the "simpler" idea $\Phi(k)$. How do we know $\Phi(k)$ is valid? We show it from the "simpler" $\Phi(k-1)$. And we keep going until we hit the simplest version of our claim with our base case $\Phi(0)$.

Building up examples from a simplest case is what I think of when I use induction. But, there are many other times we build something from a base or simple case, too. Trees, in computer science and graph theory, follow this exact pattern of construction.

We have a root at the top that branches out into different leaves and paths. While we may not have a definite "next" element going down a tree, we certainly have this recursive nature where at each step going down a tree, we have a smaller, "simpler", subtree. So perhaps, if we wanted to prove something about trees, there's a type of inductive argument we could use that modifies our original idea. For a proposition (like a property) $\Phi$ about a recursively defined structure like a tree, we could have something like

Show $\Phi$ is true for the base case of the structure
If $\Phi$ holds for all substructures, then $\Phi$ also holds for the recursively defined structure

then we should be able to conclude that $\Phi$ holds for the entire structure. This type of structural induction is something that should make sense based on our previous, regular versions of induction, but it's something so general we'll have to adapt it to our structures.

Here's a simple example building on trees. First, we need some definitions.

A vertex is a dot in our tree.
An edge is a line that connects two vertices

We define a tree recursively:

(Base Case) A tree has a root vertex with no edges, or
(Recursive Definition) A tree has a root vertex with edges connecting to some number of other trees

Claim: For our tree, let $V$ be the set of vertices and $E$ the set of edges. Then $|V| = |E| + 1$.

Base Case: Our base case is a tree that is just a root vertex with no edges. So $|V| = 1$, and $|E| = 0$, so the claim holds.

Inductive Hypothesis: Suppose that we have a tree with $n$ subtrees, and the claim holds for all subtrees. We will use $V_i$ and $E_i$ to denote the set of vertices and edges respectively for subtree $i=1,\cdots,n$.

Then by definition, the number of vertices of the whole tree

$|V| = 1 + |V_1| + |V_2| + \cdots + |V_n|$

as the tree is built from a root vertex connected to the subtrees. Similarly,

$|E| = n + |E_1| + |E_2| + \cdots + |E_n|$

as the number of edges would be the sum of the edges of the subtrees plus the additional $n$ edges that connect the subtrees to the root. Now it's just a matter of algebra with the inductive hypothesis.

$ \begin{align} |V| & = 1 + |V_1| + |V_2| + \cdots + |V_n| \newline & = 1 + (|E_1| + 1) + (|E_2| + 1) + \cdots + (|E_n| + 1) \newline & = 1 + (n + |E_1| + |E_2| + \cdots + |E_n|) \newline & = 1 + |E| \end{align} $ $\blacksquare$
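The recursive definition translates almost word for word into code, and so do the two counting formulas from the proof. Here is a small Python sketch; the `Tree` class and the example tree are my own, made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tree:
    """A root vertex joined by one edge to each of its subtrees."""
    children: List["Tree"] = field(default_factory=list)

def count_vertices(tree):
    # |V| = 1 + |V_1| + ... + |V_n|
    return 1 + sum(count_vertices(child) for child in tree.children)

def count_edges(tree):
    # |E| = n + |E_1| + ... + |E_n|
    return len(tree.children) + sum(count_edges(child) for child in tree.children)

example = Tree([Tree(), Tree([Tree(), Tree()]), Tree([Tree([Tree()])])])
print(count_vertices(example), count_edges(example))
```

Notice that the two functions recurse in exactly the way the structural induction does: a base case with no children, and a step that combines the answers for the subtrees.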

Here's another one with recursively defined sets.

Claim: Define the set $S$ as follows:

(Base) $6,21 \in S$
(Recursive) If $x,y \in S$, then $x + y \in S$

Show that every element in $S$ is a multiple of 3.

Base Case: Clearly $6 = 3 \cdot 2$ and $21 = 3 \cdot 7$ are both multiples of 3

Inductive Hypothesis: Assume that $x,y \in S$ and both are multiples of 3 i.e. $x = 3n$ and $y = 3m$ for integers $n,m \in \mathbb{Z}$. Then $x + y = 3(n + m)$ which is also a multiple of 3. So clearly all elements are multiples of 3. $\ \blacksquare$
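We can also just build a chunk of $S$ and look at it. Here is a rough Python sketch that applies the recursive rule until no new elements below a cutoff appear; the function name `elements_up_to` and the cutoff are assumptions of mine, not part of the claim.

```python
def elements_up_to(bound):
    """Elements of S that are <= bound, built from 6 and 21 by repeated addition."""
    found = {6, 21}        # the base elements
    frontier = {6, 21}     # elements added in the most recent round
    while frontier:
        # Add each newly found element to everything found so far.
        new = {x + y for x in frontier for y in found if x + y <= bound} - found
        found |= new
        frontier = new
    return found

small_elements = sorted(elements_up_to(60))
print(small_elements)
print(all(x % 3 == 0 for x in small_elements))
```

Because every element is positive, cutting off sums above the bound never hides a smaller element, so this really is every element of $S$ up to the cutoff.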

Finally, one more from propositional logic. In propositional logic, we define sentences recursively with connectives. These and only these are sentences:

(Base) Sentence letters $P, Q, R, \cdots$ are sentences
(Recursive) If $\phi$ and $\psi$ are sentences then the following are also sentences:
- $\neg \phi$ (negation)
- $(\phi \wedge \psi)$ (and)
- $(\phi \vee \psi)$ (or)
- $(\phi \rightarrow \psi)$ (material implication)
- $(\phi \leftrightarrow \psi)$ (biconditional)

Claim: All sentences have an equal number of left and right parentheses.

Base Case: Sentence letters have no parentheses, and $0=0$, so they have an equal number of left and right parentheses.

Inductive Step: Suppose $\phi$ and $\psi$ are sentences with an equal number of left and right parentheses. We check the claim for each connective:

$\neg \phi$: Since we add no new parentheses and $\phi$ has an equal amount of left and right, the claim holds.
$(\phi \wedge \psi)$: If $\phi$ has $n$ left and right parentheses, and $\psi$ has $m$, then $\phi \wedge \psi$ has $n+m$ left and right parentheses, still being equal. Adding parentheses around it keeps the equality, just adding 1 to each side.
The remaining connectives are identical to the $\wedge$ case.

$\blacksquare$
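Here is one way to watch the claim in action: a Python sketch that builds random sentences by following the recursive definition and then counts parentheses. The generator, its probabilities, and the depth limit are all arbitrary choices of mine.

```python
import random

LETTERS = ["P", "Q", "R"]
BINARY = ["∧", "∨", "→", "↔"]

def random_sentence(depth, rng):
    """Build a sentence using only the base and recursive rules."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LETTERS)                       # base: a sentence letter
    if rng.random() < 0.25:
        return "¬" + random_sentence(depth - 1, rng)     # negation adds no parentheses
    left = random_sentence(depth - 1, rng)
    right = random_sentence(depth - 1, rng)
    return "(" + left + " " + rng.choice(BINARY) + " " + right + ")"

rng = random.Random(0)  # fixed seed so the run is repeatable
sentences = [random_sentence(5, rng) for _ in range(200)]
balanced = all(s.count("(") == s.count(")") for s in sentences)
print(balanced)
```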

Structural induction is a fairly straightforward adaptation of induction, and, as a proof technique, is helpful to formalize what might otherwise just be obvious facts (like the above).

Induction was something we had limited to well-ordered sets. Structural induction allows us to use it on well-founded partial orderings too. Well-founded is similar to well-ordered: a relation $R$ on a set $X$ is well-founded if every non-empty subset of $X$ has a minimal element. A partial order is a relation $R$ on a set $X$ that relates elements of $X$ to one another, but without necessarily comparing all elements to each other. Going back to the example of trees, we can relate elements by being in the same chain of edges, so different branches would be incomparable to one another.

Continuous Induction

Up until now, we've treated induction as an inherently discrete method. I mean, how can it not be? We started by first describing a method that iterates over the natural numbers, and then further generalized it to any well-ordered set. But nonetheless, we only applied induction to inherently discrete cases. What would it even mean to climb a "continuous" ladder? What would a "next step" look like? For any real number $x$ and some incremental value $\delta > 0$, there's always a number $y$ with $x < y < x + \delta$, so the real numbers don't give us this nice "next step" to prove (with this, it shouldn't be hard to see that the real numbers aren't even well-ordered).

Let's think about what we're doing when we use induction. It sort of feels like an almost constructive method, right? We first show there is at least one element in our set with our base case. Then from our base case, we add another element through the inductive step. From that new element, we add another element with another inductive step. And so on and so on. We treat induction like a method in which we add more and more elements to our set.

This is inherently rooted in thinking in terms of an indexed claim $\Phi(n)$. If we want to continue this analogy, we really need a way of thinking that gives us a new "next step". Let's revisit our definition of induction when we were proving it from well-ordering:

Principle of Mathematical Induction: Let $S \subseteq \mathbb{N}$ be a subset with the following properties:

$0 \in S$
If $k \in S$, then $k+1 \in S$

Then $S = \mathbb{N}$.

Again, we could define $S = \{ n \in \mathbb{N} \ | \ \Phi(n) \ \textrm{is true} \}$, but the clear issue is still our inductive step. What will be our "next step", or $k+1$? Well, if there's no good indexing of the reals and we can't single out any one inductive step, why don't we just include a bunch? As in, why don't we just include a new subset of numbers at each inductive step? What could be wrong with that? Our induction then would be that if we can show that some number $x \in S$, then we can show that some further interval is also in $S$, as opposed to just a single other number. That way we can just keep adding more and more intervals until we reach as many reals as we want!

Real Induction Attempt 1: Let $S \subseteq \mathbb{R}_{\geq 0}$ be a subset with the following properties:

$0 \in S$
For any $x \geq 0$, if $x \in S$ then $[x, z) \subseteq S$ for some $z > x$

Then $S = \mathbb{R}_{\geq 0}$.

Instead of dominoes, our analogy would be like improvising a rope: so long as you have smaller threads that have some length, you can eventually tie a rope as long as you want.

There is one problem with our formulation, though: it doesn't ensure a certain size for our intervals. So if our intervals are shrinking in length, they might converge onto only some smaller part of the real line. Say we start at 0, but at each step $n$ only add an interval of length $(\frac{1}{2})^n$. Then we'll only cover the reals from $[0,1)$. So we need at least one more requirement for this to work.
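To see the shrinking-interval problem in numbers, here is a tiny Python sketch that tracks how far we've gotten after adding intervals of length $(\frac{1}{2})^n$ for $n = 1, 2, \ldots$ (counting steps from 1 is my assumption about what "step $n$" means here). It uses exact fractions so that rounding can't fool us.

```python
from fractions import Fraction

def right_endpoint(steps):
    """Right end of the covered region after adding lengths (1/2)^n for n = 1..steps."""
    return sum(Fraction(1, 2) ** n for n in range(1, steps + 1))

for steps in (1, 2, 5, 10, 20):
    reached = right_endpoint(steps)
    print(steps, reached, reached < 1)
```

After $n$ steps we sit at exactly $1 - (\frac{1}{2})^n$, which creeps toward 1 but never gets there, no matter how many steps we take.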

And actually, we have run into this issue before. Recall for transfinite induction, we had limit ordinals that blocked our path, and had to prove the limiting cases in addition to the inductive step. We will do the same here.

Real Induction: Let $S \subseteq \mathbb{R}_{\geq 0}$ be a subset with the following properties:

(RI 1) $0 \in S$
(RI 2) For any $x \geq 0$, if $[0,x] \subseteq S$ then $[x, z] \subseteq S$ for some $z > x$
(RI 3) For any $x \geq 0$, if $[0,x) \subseteq S$ then $x \in S$

Then $S = \mathbb{R}_{\geq 0}$.

So now if we run into a limit case like before, we say it must be in our set, and that gives us a new case to continue performing real induction on. One other thing to note is that I tweaked our second condition (RI 2) to be closer to the strong form of induction from before to highlight there is an equivalent analogy in real induction. I also changed (RI 2) to be a closed interval, as our new (RI 3) allows us to do that. If a set $S$ satisfies (RI 1–3), we say $S$ is inductive.

So real induction isn't all that different from what we're used to already. Instead of natural numbers being our cases, intervals are. We start with a base case, and slowly show our claim holds on new intervals until we end up showing that from combining all those intervals, the claim actually holds for all real numbers.

You might be wondering how this is at all useful. Particularly because of (RI 2): we no longer have a concrete $k+1$ we can work with in our proofs. Instead, we're stuck having to find some $z$ whose existence doesn't seem tangible at all in practice. That's what I, at least, thought at first. Fortunately, much of real analysis has built into its definitions this same ambiguous existence-of-a-number that we can leverage (as you'll see).

Earlier I pointed out how we might run into the "issue" of only proving a claim for the interval $[0,1)$, but sometimes that's exactly what we want as opposed to showing something for all reals. We can slightly modify our method of induction just by limiting our choices of possible elements:

Interval Induction: Let $S \subseteq [a,b]$ be a subset with the following properties:

(RI 1) $a \in S$
(RI 2) For any $a \leq x < b$, if $[a,x] \subseteq S$ then $[x, z] \subseteq S$ for some $z > x$
(RI 3) For any $a < x \leq b$, if $[a,x) \subseteq S$ then $x \in S$

Then $S = [a,b]$.

All we did is change our base case, and add a cap to our inductive step.

While we're here, we might as well check beyond our intuition that this should work.

Proof of Real Induction: Suppose a set $S \subseteq [a,b]$ is inductive, and (RI 1–3) hold. We want to show that $S = [a,b]$. So—just like in our proof of regular induction—suppose for contradiction, $S \neq [a,b]$, so there is a non-empty set $T = [a,b] \backslash S \neq \varnothing$. Since $T$ is bounded, and by the completeness of the reals, $T$ has an infimum (greatest lower bound).

Case 1: $\mathbf{\inf (T) = a}$. By (RI 1), $a \in S$. By (RI 2), there is a $z > a$ such that $[a,z] \subseteq S$. So $z$ is a lower bound for $T$ (the elements not in $S$). But $z > a = \inf (T)$, contradicting the definition of an infimum.
Case 2: $\mathbf{a < \inf(T) \in S}$. $\inf(T) \neq b$, since $T$ is non-empty and all of its elements lie strictly above $\inf(T)$. Everything in $[a,b]$ below $\inf(T)$ is in $S$, so $[a, \inf(T)] \subseteq S$. Then by (RI 2), there is a $z > \inf(T)$ such that $[\inf(T), z] \subseteq S$. Like before, this would suggest $z > \inf(T)$ is a lower bound for $T$, contradicting the definition of an infimum.
Case 3: $\mathbf{a < \inf(T) \in T}$. Then by definition of the infimum, no point below $\inf(T)$ is in $T$, so we would have $[a,\inf(T)) \subseteq S$. By (RI 3), then $\inf(T) \in S$, which is again a contradiction by definition of $T$.

Since these 3 cases are exhaustive for the infimum of $T$, and all lead to a contradiction, it must be the case that our initial assumption, $S \neq [a,b]$, is wrong. In other words, if (RI 1–3) hold on our desired interval, it must be that $S = [a,b]$.

$\blacksquare$

So now we have this method of showing a claim holds for more and more subintervals that build on each other, until it completes the whole interval, allowing us to prove theorems in a now more familiar way than before.

Here's a nice intuitive example.

Intermediate Value Theorem: Suppose $f:[a,b] \rightarrow \mathbb{R} \backslash \{0\}$ is continuous, and $f(a) > 0$. Then $f(b) > 0$.

If $f(x)$ is continuous and only maps numbers to non-zero real numbers, then it should make sense that if we have a point $a$ where $f(a) > 0$, then we should not be able to reach any number where $f(x) \leq 0$ without "lifting up our pen"; since $f(x) \neq 0$, we have this barrier at $0$, and a continuous function that starts out positive has no way of jumping over it to become negative.

The idea of this proof is to use continuity to inductively string together intervals on which $f$ is positive until we cover the whole length of $[a,b]$. Just for consistency, the set we're secretly working with (as from our statement of real induction) is $S = \{x \in [a,b] \ | \ f(x) > 0\}$.

Base Case: From our assumption, we're already given that $f(a) > 0$.

Inductive Hypothesis: Say that for some $a \leq x < b$ that $f(t) > 0$ for all $t \in [a,x]$. We want to show that there is some $z > x$ such that $f(t) > 0$ for all $t \in [x,z]$.

Since $f$ is continuous at $x$, there is a small neighborhood centered at $x$ with also positive outputs i.e. $\exists \delta > 0$ such that for all $t \in (x-\delta, x+\delta)$ we have $f(t)>0$. You can just check this with the definition of continuity. A function $f:E \rightarrow \mathbb{R}$ is continuous at $x$ if:

$\forall \epsilon > 0 \ \exists \delta > 0 \ \forall p \in E \ \ \left( |x-p| < \delta \Rightarrow |f(x) - f(p)| < \epsilon \right)$

So if we let $\epsilon < f(x)$, we get some corresponding $\delta$ so that $f(p)>0$ for $p$ in that neighborhood. Thus, if we let $z = x + \frac{\delta}{2}$ (shrinking it if necessary so that $z \leq b$), we've proven our inductive step: there is a number $z > x$ such that $f(t) > 0$ for all $t \in [x,z]$. It might be a small interval, but good enough nonetheless.
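
To spell out why a small $\epsilon$ forces positive outputs, here is the computation with the particular choice $\epsilon = \frac{f(x)}{2}$ (a convenient assumption; any $\epsilon < f(x)$ works the same way). For every $p$ with $|x - p| < \delta$:

$|f(x) - f(p)| < \frac{f(x)}{2} \ \Rightarrow \ f(p) > f(x) - \frac{f(x)}{2} = \frac{f(x)}{2} > 0$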

Limit Case: Now say for some $a < x \leq b$, we have that $f(t) > 0$ for all $t \in [a,x)$. We need to show that $f(x) > 0$ as well. Suppose it wasn't, i.e. $f(x) \leq 0$. By definition of our function, $f(x) \neq 0$, so the only remaining case is $f(x) < 0$. But by continuity (like in the inductive step) there would be a neighborhood such that $f(t) < 0$ for all $t \in (x - \delta, x + \delta)$. But that contradicts our assumption that $f(t) > 0$ for all $t \in [a,x)$, as $x - \delta < x$. So it must be the case that $f(x) > 0$.

By real induction, we have proven that if $f:[a,b] \rightarrow \mathbb{R} \backslash \{0\}$ is continuous and $f(a) > 0$, then $f(x) > 0$ for all $x \in [a,b]$. And in particular, $f(b) > 0$.

$\blacksquare$

For more examples, I'd read the instructional paper I linked above that goes quite in-depth in proving many famous theorems with real induction. Most of the proofs are relatively straightforward once you figure out how to state your set in a good way to induct on it.

Real induction is one of those tools that just doesn't seem like it should work by the way we treat normal induction. It's not a method one would probably think of, and given how little it's used, it's not surprising it's not even that well-known.

Conclusion

Induction is one of the most fundamental tools in math. It's inextricably linked to our modern conception of the natural numbers, and in many ways, is what sets apart our contemporary understanding of math from other logical systems. So, I'm not surprised it's as powerful a proof tool as it is, but there is so much more to it than the typical weak and strong induction that is primarily taught. How we apply induction gives us so many new and creative ways to prove claims, but more importantly, thinking inductively naturally guides our intuition toward alternate approaches. We saw it first with transfinite induction, then real induction, but who knows what other powerful variations on induction there may be.

Privacy Protection and Zero-Knowledge Proofs

Adi Mittal

Your password is not protected perfectly, but it's good enough.

Proofs, to me, are the cornerstones of all science. What separates math, physics, chemistry, and all other fields of science is the level of rigor we expect for professionals to conduct themselves. We don't take anything for fact until we are convinced irrefutably to call it so. We can conjecture all we want, but it only holds so much sway until we can use it as a definitively true claim. Mathematics is even more strict in how it's a purely deductive field, where the notion of a proof accepts nothing but a direct chain of deductive, causal reasoning. Sure, even in math sometimes we use empirical data to guide our judgements, but "proof by lack of a counterexample" isn't all that helpful.

But in the day-to-day, we rarely care for such rigor. If the weather man says it's going to rain today, I'm inclined to believe it since he's mostly been right before, and that's good enough for me. "Proof", to most people, usually just refers to any kind of persuasive evidence, as opposed to air-tight arguments. You want to be convinced with proof—nothing more, nothing less. If someone tells me they're a Michelin star chef, I won't need to see the certificate, but rather I'll probably be convinced if they even make one meal for me up to that standard.

Indirect proof to validate something, like with the case of the Michelin star chef, is what we usually rely on. We can't always provide hard evidence all the time, so convincing others with a demonstration of sorts is a nice alternative. But sometimes, it is the only option. If you had to prove citizenship to a country, sending over all your legal documents in one package, to one person, seems like a risky idea. If that package is lost or intercepted, your identity is gone. Or perhaps you are voting in an election, and want to prove that you did in fact vote. But ideally you could do that without having to disclose exactly who you voted for.

In these higher-stake situations with more sensitive information, any type of "direct" proof is a bit of a terrifying idea, for if something goes wrong between you and who you are interacting with, the consequences can be devastating. We want some way to prove something, without necessarily having to reveal any information about the proof claim.

Zero-Knowledge Proofs

Think about what we are asking for a second. We want to prove something to someone, without giving any particular facts; we want to endow someone with new information, necessarily without giving any new information.

These zero-knowledge proofs (ZKPs) that validate a claim without imparting new knowledge sound like an impossible feat. If someone has access to all the information already, and can't come to the conclusion themself, then how will they ever come to agree with the claim? Remarkably, there are strategies to create such a proof. An example does best here.

The Colorblind Man

Say your friend is extremely red-green colorblind. You give them two balls, one red and one green. Besides their color, they are completely identical (size, weight, mass distribution, etc.) to each other. To your friend, the balls do in fact look identical (due to the colors), but you want to prove to them they are truly not. Here's how you do it:

Hand them a ball in each hand.
Tell them to put it behind their back and choose to either swap the balls or leave them in the same hands.
Have them reveal the balls to you and ask you whether they swapped them or not.

You, who's not colorblind, can answer him no problem. Getting it right once may be a fluke, but the more times you get it right, the more confident your friend should become that there is something differentiating the balls, even if he can't see it. If your friend thinks you're only guessing randomly, even just getting it right 10 times in a row has probability $(\frac{1}{2})^{10} \approx 0.00098$, or a less than 0.1% chance of happening. You can keep repeating the test as long as your friend wants, but soon enough, he'll have to accept that you aren't just guessing, but rather seeing something he cannot.
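
To see how quickly the "fluke" explanation dies off, here is a minimal sketch that computes the chance of a pure guesser surviving a given number of rounds. The function name `fluke_probability` is made up for this illustration.

```python
from fractions import Fraction

def fluke_probability(rounds: int) -> Fraction:
    """Chance that a prover guessing at random answers every round correctly."""
    return Fraction(1, 2) ** rounds

# Each extra round halves the chance that a guesser is still standing.
for rounds in (1, 5, 10, 20):
    print(rounds, float(fluke_probability(rounds)))
```

Using exact fractions avoids any rounding questions: ten rounds gives exactly $\frac{1}{1024}$.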

In this, clearly we didn't change our friend's perception in any way; he's still colorblind. Moreover, we didn't do any additional tests, like using an RGB monitor to digitally record values for your friend to read. Yet, we were able to convince him with a probabilistic guarantee that he is lacking information, even if we can't give him that information directly. In essence, nothing was changed of the scenario (no new knowledge was given), yet we were still able to prove to the colorblind friend that he is in fact holding two differently colored balls, even if he can't verify that directly.

Inequitable Income

You recently joined a small company—no more than 5 people including yourself—and have recently begun to feel unvalued. You think you might be paid less than average, and you bring this up to the others. They try to explain to you that you are being paid fairly, despite the amount of work you put in. They seem confident enough, but they won't tell you their salary. How can you prove to them that you are paid less than average, without getting the data from your colleagues?

Let's say you, as person $A$, have a monthly salary of $ \$ a$, and the other four have salaries of $ \$ b, \$ c, \$ d, \$ e $. Now this is what we do:

You give a random, placeholder number $x$.
Then person $B$ computes the sum $x + b$, and tells person $C$ this value.
Person $C$ then adds their own salary to the sum to get $x + b + c$, and tells $D$ this number
People $D$ and $E$ do the same thing, until $E$ tells you the final number $x + b + c + d + e$.
Finally, since you know the value $x$, you can subtract it and add your own salary $a$ to get the value of $a + b + c + d + e$
You can then announce the average to the group $\large{\frac{a + b + c + d + e}{5}}$

Then, without any individual having to learn any other person's salary, everyone can confidently agree that this is the group average, and enjoy the ensuing chaos of wealth mismanagement. This all works because of the value $x$: no one except you knows the value, so at each step, no other person will have enough information to deduce others' salaries from just their own. So without the need for any specific piece of new information, we are still able to deduce averages from a set of secret numbers.
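
The passing-around steps can be sketched in a few lines. This is only an illustration: the function name `masked_average`, the salary figures, and the range of the mask are all assumptions made for the example.

```python
import secrets

def masked_average(my_salary, other_salaries, mask=None):
    """Pass a running total around the group, hidden behind a random mask x."""
    if mask is None:
        mask = secrets.randbelow(10**9)  # the placeholder x, known only to person A
    running_total = mask
    for salary in other_salaries:
        # Each colleague sees only the masked running total, then adds their salary.
        running_total += salary
    # Person A removes the mask and adds their own salary to get the true sum.
    total = running_total - mask + my_salary
    return total / (len(other_salaries) + 1)

print(masked_average(4000, [5200, 4800, 6100, 5500]))
```

Whatever mask is chosen, it cancels out at the end, so the announced average does not depend on it.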

These last two examples, particularly the former with the colorblind man, encapsulate the idea of zero-knowledge proofs. You, and possibly other parties, have secret information you want (or have) to keep secret, and somehow manipulate whatever knowledge you already have to deduce something meaningful. Even after I saw these two examples, it still was a surprising fact to me that not only does such a concept exist, but it also has super practical applications.

But we haven't actually specified exactly what makes a zero-knowledge proof, well, zero-knowledge. If we (the prover) are trying to prove some statement to a verifier with a ZKP, there are 3 properties we'd want to satisfy:

Completeness: if the statement is true, then an honest verifier will be convinced of it by an honest prover
- This ensures we can prove the statement
Soundness: if the statement is false, then there is no proof that can convince the verifier (up to some small probability)
- This ensures our proof method is trustworthy
Zero-Knowledge: if the statement is true, then no verifier learns anything except that the statement is true
- This protects our secret

The first two qualities are that of a general interactive proof system, which is a type of proof that relies on a conversation-esque exchange between the prover and verifier (like in the colorblindness example). These interactive proofs follow a protocol that dictates the flow of conversation, and in the context of our qualities, someone is honest when they legitimately follow this protocol in good faith.

Let's do a more practical example.

The Discrete Logarithm Problem

You and your friends set up your own little secret network to talk to each other privately. To make sure no one else can spy on you guys, you implement a password system: everyone picks a username that consists of three numbers, namely two integers $A$ and $B$ and a (usually large) prime $p$. Your password $x$ will be a number such that $B = A^{x} \bmod p$. Obviously, if you wanted to log in, you could just give your password every time, but that puts you at risk from eavesdroppers or anyone listening in. Ideally, we want to be able to prove to our network that we do actually know our password, without having to give the password. This way, if anyone wants to break into our account, they would have to brute force it, which isn't the best use of one's time.

One issue, for example, is that you could have multiple solutions to a modular arithmetic problem. Take the equation $13 = 3^x \bmod 17$. You can check that $x = 4$ is a solution. But, due to Fermat's little theorem, we also know that $1 = 3^{16} \bmod 17$. Then for any non-negative integer $k$, we know that

$3^{4 + 16k} \bmod 17 = 3^4 \cdot (3^{16})^k \bmod 17 = 13 \cdot 1^k \bmod 17 = 13 \bmod 17$

so even if you have a solution, it doesn't necessarily mean you found the particular solution that is our password.
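
A quick check of this family of solutions, using Python's built-in three-argument `pow` for modular exponentiation. The variable names are chosen just for this sketch.

```python
p, A, B = 17, 3, 13

# Every exponent of the form 4 + 16k solves 3^x = 13 (mod 17).
solutions = [4 + 16 * k for k in range(5)]
for x in solutions:
    assert pow(A, x, p) == B

# Within one full cycle 0..p-2 there is exactly one solution.
smallest = [x for x in range(p - 1) if pow(A, x, p) == B]
print(solutions, smallest)
```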

Finding $x$ given $A,B,p$ is known as the discrete logarithm problem, and due to the difficulty of brute forcing, can actually be used as a method of identification (like a password). An additional thing to note, to maximize the search space for people brute forcing your $x$ value, we choose $A$ to be a generator, meaning that for every value $m = 1,2,\cdots,p-1$, there is a value $n$ such that $m = A^{n} \bmod p$. You tell everyone your $A,B,p$, and no one but you ever has to know your $x$. But you can still confidently prove to people that you do know it, and that you are in fact actually you.
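
For small primes, the generator condition can be checked by brute force. The helper name `is_generator` is hypothetical, and this approach is only sensible for toy-sized $p$.

```python
def is_generator(A: int, p: int) -> bool:
    """True if the powers of A hit every value 1, 2, ..., p - 1 modulo the prime p."""
    return {pow(A, n, p) for n in range(1, p)} == set(range(1, p))

print(is_generator(3, 17))  # 3 cycles through all 16 nonzero residues
print(is_generator(2, 17))  # 2 only reaches 8 of them, since 2^8 = 1 (mod 17)
```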

After we (the prover) have distributed our values of $A,B,p$ (to any potential verifier) such that $B = A^x \bmod p$, this will be our protocol:

The prover picks a random number $r$, calculates the value $C = A^r \bmod p$, and sends $C$ to the verifier.
The verifier now flips a coin. If the coin is heads they give the prover $c = 0$, and if tails $c=1$.
The prover sends the verifier the value of $cx + r$
The verifier now checks they get the expected value: $A^{(cx + r)} \bmod p = (A^{x})^c \cdot A^{r} \bmod p = B^c \cdot C \bmod p$
The prover and verifier repeat steps 1–4 until the verifier is satisfied.

We see parts of our previous ZKPs appear here: 1) we let the verifier do a random action to test the prover's knowledge just like in the colorblind friend example, and 2) we do a pseudo-encryption of our password $x$ by adding a random number $r$ to obscure it in a sum.
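
The protocol steps above can be sketched as a single function that plays both roles. The toy values $p = 17$, $A = 3$, $x = 4$ and the name `run_round` are assumptions for illustration; the response $cx + r$ is left unreduced to match the steps as written.

```python
import secrets

def run_round(A, B, p, x):
    """One round of the coin-flip protocol; True if the verifier's check passes."""
    r = secrets.randbelow(p - 1)   # prover's random exponent
    C = pow(A, r, p)               # prover's commitment, sent to the verifier
    c = secrets.randbelow(2)       # verifier's coin flip: 0 or 1
    response = c * x + r           # prover's reply
    # Verifier: A^(cx + r) should equal B^c * C modulo p.
    return pow(A, response, p) == (pow(B, c, p) * C) % p

p, A, x = 17, 3, 4                 # toy values; a real prime would be far larger
B = pow(A, x, p)
print(all(run_round(A, B, p, x) for _ in range(20)))
```

An honest prover passes every round regardless of the coin, which is the completeness half of the argument.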

Now we want to verify this actually works, both as a password-method and a zero-knowledge proof.

So say someone was trying to hack into our account. If they knew what our verifier was going to give for $c$, then theoretically they could cheat their way into our account. There are two possibilities they could guess, being the value of $c$:

$c=0$. Then the attacker just picks a random value $r$ and the verification process works as normal.
$c=1$. Then the attacker picks a random value $r'$ and instead of giving the value $C$ that the verifier requests, they give the value $C{'} = (A^{r{'}} \bmod p) \cdot B^{-1}$. Then when the verifier wants the value $x+r$, the attacker instead gives $r'$, so that $A^{r'} \bmod p = (A^{r'} \bmod p) \cdot B^{-1} \cdot B = C' \cdot B$, which would match the values the verifier expects.

But if the attacker guesses wrong, then because of how difficult it is to compute the discrete logarithm, they would not be able to produce a valid response for the value of $c$ they did not expect. So someone who doesn't know the password $x$ has only a 50% chance of passing each round, and the verifier just repeats the test enough times until they are convinced. At the very least, this works as an effective identity/password check. It also shows our proof is both sound and complete: if we don't know our password, it's nearly impossible to cheat through many rounds, and if we do know our password, a verifier should be convinced eventually.
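
Here is a rough simulation of the attacker's two strategies, assuming the same toy values as before ($p = 17$, $A = 3$, $B = 13$). The function name `cheating_round` is invented for this sketch, and `pow(B, -1, p)` for the modular inverse needs Python 3.8 or newer.

```python
import secrets

def cheating_round(A, B, p, guess):
    """An attacker who does not know x prepares for the coin value `guess`."""
    r = secrets.randbelow(p - 1)
    if guess == 0:
        C = pow(A, r, p)                         # honest-looking commitment
    else:
        C = (pow(A, r, p) * pow(B, -1, p)) % p   # C' = A^r * B^(-1)
    c = secrets.randbelow(2)                     # verifier's actual coin flip
    # The attacker can only ever reply with r, since x is unknown to them.
    passed = pow(A, r, p) == (pow(B, c, p) * C) % p
    return c, passed

p, A, B = 17, 3, 13
trials = [cheating_round(A, B, p, guess=1) for _ in range(1000)]
print(sum(passed for _, passed in trials) / len(trials))
```

The attacker passes exactly when the coin matches their guess, so the printed pass rate should hover around one half.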

This proof is also zero-knowledge since, well, we only worked with the public information. You can imagine building a simulator—like a computer—that can mimic exchanges between a prover and verifier (as if following the procedure of an attacker in our proofs of soundness and completeness) that operates entirely on its own. And I don't know about you, but a computer program is only working with what it's given, so if this interactive proof works within the confines of a computer, no new information is ever given.

$\blacksquare$

While this is a good proof of concept, it does require a lot of work in practice. If you want to try and wrap your head around these types of proofs before moving on, here's another simple one about proving knowledge of a vector. It's an interactive proof with a focus on succinctness, and I found it to be quite helpful to think about some of the related concepts in cryptography and zero-knowledge proofs. The important concept to take away from this article and our above example with the discrete logarithm is that randomness is an extremely useful way to instantly bound the error of our proofs; we don't need the proof to be 100% confident, we just need it to be confident enough such that the prover can't cheat if they tried.

Non-Interactive Zero-Knowledge Proofs

Before, in the example with the discrete logarithm problem and the colorblind man, we had this verifier act as an arbiter: they make a random decision that should prevent hacks (i.e. swapping the balls, sending $c=0$ or $c=1$), and after some arbitrary number of successes, they accept that the prover is genuine. Now imagine that you are the verifier, or are implementing this. Do you really want to sit through doing the same test over and over again, until you feel convinced? Do you really want to have to rely on some random mechanism every time that you carry out? Even if you make a computer to do this for you, imagine how annoying that must be to work with every time you want to log in with a password. Seems like a pain.

What we would ideally use is a non-interactive protocol: something that removes our need to have to always make a random choice and carry it out until some subjective level of confidence. But again, we don't want to compromise on zero-knowledge; we like our privacy. Fortunately, others have run into the same issues and have come up with zero-knowledge succinct non-interactive arguments of knowledge (zk-SNARKs) to solve many of our qualms. To keep it clear, let's go over exactly what each part of this obscenely long name actually means:

zero-knowledge: this is the same as before; we leak no new information and only prove that we know a claim is true without divulging anything specific about it.
succinct: our proof takes a constant amount of space and the same amount of time every time you go through it; we don't need to repeat the protocol until we are (subjectively) confident as we did with the previous proofs.
non-interactive: our verifier doesn't give any intermediate input, i.e. no need to decide swapping balls or picking $c=0$ or $c=1$.

So here's a super simple zk-SNARK:

Claim: I want to prove to my friend that two entrances to a building are connected.

Proof: All I need to do is have my friend watch from outside that I can in fact enter the one door and exit the other.

This is zero-knowledge since all my friend gets is knowledge that my claim was true and nothing more; they don't actually get to see how they are connected or anything more about the building.
It's succinct since I only need to do it once, since there is no room for doubt once I've shown it once (how do you prove something like this by a fluke?).
Finally, it's non-interactive since my friend doesn't need to actually do anything but observe.

That's the idea of zk-SNARK, at least. We'll build up to a pretty surprising zk-SNARK by the end of this, but to get there, we'll have to cover some more ground.

Now the remainder of this post will be pretty mathematically heavy in the worst way possible. The ideas are important and useful, and the overarching concept of a zero-knowledge proof is extremely interesting, but the mechanics behind them are fairly tedious, and that's just how it goes sometimes with cryptography. To ensure security, oftentimes the best (and first thought of) idea is to obscure the data behind operations that are impossibly hard to undo. The notation might get dense, but hopefully we build up to it in a comprehensible manner.

Honesty Checks

Our discrete logarithm example not only gave us a nice demonstration of zero-knowledge proofs, it also gave us an additional encryption method. The entire problem hinged on the fact that calculating discrete logarithms is hard to begin with. So we can use exponents, actually, to hide and encrypt lots of other pieces of data. And moreover, they can act as a safeguard to make sure no one is cheating us.

For example, say we're a verifier, and want to make sure that our prover follows the protocol of multiplying a number we give them. They can multiply whatever they want, but they have to multiply the number we give them. We can force this with exponents because the discrete logarithm is hard.

The verifier picks a generator $g$ (like from the discrete logarithm), a secret number $s$, and a secret shift $\alpha$.
The verifier gives the prover the values $g^s$ and $g^{\alpha \cdot s}$.
The prover with some number $c$, then gives back the values $(g^s)^c = g^{cs}$ and $(g^{\alpha \cdot s})^c = g^{c\alpha s}$.
The verifier checks that $(g^{cs})^\alpha = g^{c\alpha s}$.

Because of how hard it is to cheat this system by finding another exponent to get equal values that the verifier would expect, the prover is forced to use the encrypted value of $g^s$, and the shift of $g^{\alpha \cdot s}$ gives us insurance to check the prover's honesty at the end. And in the same way the prover doesn't learn the values of $s$ or $\alpha$, we don't learn the value of $c$ they want to multiply by. We only checked that they were honest.
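
A minimal sketch of this honesty check, assuming a toy prime modulus $n = 17$ with generator $g = 3$. The three function names are invented for the example, and in a group this small a cheater could simply search for matching values; the point is only to show the shape of the check.

```python
def verifier_challenge(g, n, s, alpha):
    """Verifier publishes g^s and g^(alpha*s); s and alpha stay secret."""
    return pow(g, s, n), pow(g, alpha * s, n)

def prover_reply(g_s, g_alpha_s, c, n):
    """Prover raises both values to the same private exponent c."""
    return pow(g_s, c, n), pow(g_alpha_s, c, n)

def verifier_check(reply, shifted_reply, alpha, n):
    """Accept only if the alpha-shift between the two replies is preserved."""
    return pow(reply, alpha, n) == shifted_reply

g, n, s, alpha = 3, 17, 5, 7
g_s, g_alpha_s = verifier_challenge(g, n, s, alpha)
reply, shifted = prover_reply(g_s, g_alpha_s, c=3, n=n)
print(verifier_check(reply, shifted, alpha, n))

# A prover who uses different exponents on the two values breaks the shift.
bad_shifted = pow(g_alpha_s, 4, n)
print(verifier_check(reply, bad_shifted, alpha, n))
```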

Polynomial Knowledge

Now the reason why that's important, is that exponents give us all the methods we need to check more complicated values, like that of polynomials. The question we'll aim to answer is:

Can we prove to a verifier we know a polynomial $p(x)$ of degree $d$ has roots at $r_1, r_2, \cdots, r_n$ without revealing our polynomial?

In other words, can we prove to a verifier that for some polynomial $h(x)$, we know that $p(x) = (x-r_1)(x-r_2)\cdots (x-r_n)\cdot h(x)$? For brevity and convenience, we'll call the target polynomial $t(x) = (x-r_1)(x-r_2)\cdots (x-r_n)$.

The reason why polynomials are interesting is because they are incredibly hard to cheat. Say the prover has a polynomial $f(x)$, and claims to know the exact polynomial $g(x)$ that the verifier has; that is, the prover claims $f(x) \equiv g(x)$. If both polynomials are of degree $d$, then either $f(x)=g(x)$ has at most $d$ solutions by the Fundamental Theorem of Algebra, or it has infinitely many solutions, which happens if and only if $f(x) \equiv g(x)$. So if the verifier picks a random value $s$ for the prover to evaluate, it is extremely unlikely that $f(s) = g(s)$ if they are not genuinely the same. If the prover does not know $g(x)$ and just guessed a random polynomial $f(x)$, and the verifier picks a random integer $s \in [1,10000]$, there is at most a $\frac{d}{10000}$ chance that $f(s) = g(s)$.
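
As a small experiment, we can count how often two different quadratics agree on $[1, 10000]$. Both polynomials here are made-up examples, and `evaluate` is a hypothetical helper.

```python
def evaluate(coeffs, x):
    """Evaluate a polynomial given as [c_0, c_1, ..., c_d] at x (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

f_coeffs = [2, -3, 1]    # f(x) = x^2 - 3x + 2
g_coeffs = [-2, 1, 1]    # g(x) = x^2 + x - 2, a different quadratic

# Collect every sample point where the two polynomials happen to agree.
agreements = [s for s in range(1, 10001)
              if evaluate(f_coeffs, s) == evaluate(g_coeffs, s)]
print(agreements, len(agreements) / 10000)
```

The two only agree where $f(x) - g(x) = -4x + 4$ vanishes, so a randomly chosen $s$ almost never lets the wrong polynomial slip through.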

First Attempt

Here's a naive way for a prover to show a verifier that they know such a polynomial:

Protocol Attempt 1:

The verifier picks a random number $s$ and gives this to the prover along with $t(s)$.
The prover calculates $h(s) = \frac{p(s)}{t(s)}$ and passes that along to the verifier with $p(s)$.
The verifier checks that $h(s) \cdot t(s) = p(s)$.

This, though, has the issues of:

The prover can just pick an arbitrary number $h$ and calculate $p = h \cdot t(s)$ and that will satisfy the verifier.
Since we give the prover access to the number $s$, the prover can just make a new polynomial that happens to have the value of $p(s) = t(s) \cdot h$.
There's no verification of the degree of the polynomial.

The first two issues are purely because we don't encrypt anything; we just give the values of $s$ and $t(s)$. We need a way to obscure them so that computations with them are still feasible, but the prover can't use them to cheat the protocol.
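
To make the first issue concrete, here is a sketch of a prover with no polynomial at all passing the check. The target $t(x) = (x-1)(x-2)$, the challenge point, and the fake value are all hypothetical choices.

```python
def t(x):
    """Public target polynomial with (hypothetical) roots 1 and 2."""
    return (x - 1) * (x - 2)

def verifier_accepts(s, h_value, p_value):
    """The verifier's only check in the first attempt: h(s) * t(s) == p(s)."""
    return h_value * t(s) == p_value

s = 7                       # verifier's challenge, sent in the clear
fake_h = 123456             # arbitrary number, not derived from any polynomial
fake_p = fake_h * t(s)      # reverse-engineered so the check passes
print(verifier_accepts(s, fake_h, fake_p))
```

Because $s$ and $t(s)$ are handed over unencrypted, nothing ties the two submitted numbers to an actual polynomial.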

Homomorphic Encryption

The way we can do this is in line with our honesty checks and the discrete logarithm, and exploiting the properties of exponents. Say our polynomial was a quadratic $f(x) = x^2 - 3x + 2$. We can encrypt $f(x)$ the same way we have been doing before by using a generator $g$ taken modulo $n$ (ideally a prime):

$g^{f(x)} \bmod n = g^{x^2 - 3x + 2} \bmod n = (g^{x^2})^1 \cdot (g^{x^1})^{-3} \cdot (g^{x^0})^2 \bmod n$

This type of encryption is known as homomorphic encryption, as our encryption method has a nice structure to it that allows us to do our arithmetic operations on encrypted values without having to decrypt them. In particular, if we let $E(s) = g^s \bmod n$ be our encryption as we have been, the structure we get is that $E(a + b) = E(a) \cdot E(b)$ by exponent rules.

One issue, though, is that we cannot multiply two encrypted values together with this homomorphic encryption scheme. If you have two numbers $a,b$ that have been encrypted as $E(a) = g^a$ and $E(b) = g^b$, we can easily find $E(a + b) = E(a)E(b)$ as we stated above. But only given $E(a), E(b)$, you cannot find an expression for $E(ab)$. Similarly, we also can't find an expression for $E(a^b)$. We'll address this later.
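
Here is a small sketch of the additive structure, and of evaluating the quadratic $x^2 - 3x + 2$ from encrypted powers alone. It assumes a toy prime $n = 17$ with generator $g = 3$; the names `E` and `encrypted_eval` are chosen for this example. Exponents are reduced modulo $n - 1$, which is valid by Fermat's little theorem and lets us handle the negative coefficient.

```python
n, g = 17, 3

def E(value):
    """Encrypt by exponentiation; exponents can be reduced modulo n - 1."""
    return pow(g, value % (n - 1), n)

def encrypted_eval(encrypted_powers, coeffs):
    """Combine [E(x^0), E(x^1), ...] with [c_0, c_1, ...] into E(c_0 + c_1 x + ...)."""
    result = 1
    for enc, c in zip(encrypted_powers, coeffs):
        result = (result * pow(enc, c % (n - 1), n)) % n
    return result

a, b = 5, 9
print(E(a + b) == (E(a) * E(b)) % n)

x = 6                                        # hidden from whoever evaluates
powers = [E(x**0), E(x**1), E(x**2)]         # all the evaluator gets to see
print(encrypted_eval(powers, [2, -3, 1]) == E(x**2 - 3*x + 2))
```

Note that every step multiplies or exponentiates by a known number; at no point are two encrypted values multiplied to get the encryption of their product.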

As we've discussed, reverse engineering an exponent modulo a number is especially hard because of how the modulo operation cycles, leaving room for many options to satisfy the equation and thus making finding a specific solution hard. So we can update our protocol:

Protocol Attempt 2:

The verifier picks a random number $s$, and sends encrypted powers of $s$ to the prover $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$.
The prover calculates $h(x) = \frac{p(x)}{t(x)}$. With the encrypted values, they calculate and send to the verifier $g^{p(s)}$ and $g^{h(s)}$.
- i.e. $g^{p(s)} = g^{c_d s^d + \cdots + c_1 s^1 + c_0 s^0} = (g^{s^d})^{c_d} \cdots (g^{s^1})^{c_1} \cdot (g^{s^0})^{c_0}$ for the coefficients $c_d, \cdots, c_0$
The verifier checks that $g^{p(s)} = (g^{h(s)})^{t(s)} = g^{h(s) \cdot t(s)}$.

Although I've left it out for convenience, remember that all these encrypted values $g^x$ are taken to mean $g^x \bmod n$. It's the discrete logarithm that is difficult, not the normal logarithm.
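
The second attempt can be sketched end to end for an honest prover. Everything concrete here is an assumption for illustration: the prime modulus $2^{31} - 1$ with base $7$, the target $t(x) = (x-1)(x-2)$, the prover's $h(x) = x + 5$, and the helper names.

```python
import secrets

N, G = 2147483647, 7       # toy prime modulus and base (illustrative choices)
T = [2, -3, 1]             # t(x) = (x - 1)(x - 2), coefficients c_0..c_2
H = [5, 1]                 # h(x) = x + 5, known only to the prover
P = [10, -13, 2, 1]        # p(x) = t(x) * h(x) = x^3 + 2x^2 - 13x + 10

def evaluate(coeffs, x):
    """Plain polynomial evaluation (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def encrypted_eval(enc_powers, coeffs):
    """Compute g^{poly(s)} from [g^{s^0}, g^{s^1}, ...] without knowing s."""
    result = 1
    for enc, c in zip(enc_powers, coeffs):
        result = result * pow(enc, c % (N - 1), N) % N
    return result

def attempt_2(s):
    enc_powers = [pow(G, s**i, N) for i in range(len(P))]   # verifier -> prover
    g_p = encrypted_eval(enc_powers, P)                     # prover -> verifier
    g_h = encrypted_eval(enc_powers, H)
    return g_p == pow(g_h, evaluate(T, s) % (N - 1), N)     # verifier's check

print(attempt_2(secrets.randbelow(N)))
```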

So by encrypting $s$ and just not giving the value of $t(s)$, we've fixed our issues, right?

The first two are in fact fixed, but we still need a way to enforce the degree requirement. In a way, we already do have a type of restriction since the verifier only gives the prover powers of $s$ encrypted with $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$. But this restriction is only in place if the prover uses these and only these values. There's nothing stopping the prover from just not using those encrypted values of $s$, and using their own constructed numbers to cheat. Here's a more concrete way of seeing this:

What the verifier wants at the end of the day is for the equation $g^{p(s)} = (g^{h(s)})^{t(s)}$ to hold, and is trusting that $g^{p(s)}$ and $g^{h(s)}$ are provided truthfully by the prover. But if we have a prover that does not know a polynomial, all they need to do to fool the verifier is to give values $z_p$ and $z_h$ in place of $g^{p(s)}$ and $g^{h(s)}$ that pass the check. In other words, the prover just needs to find a solution to $z_p = (z_h)^{t(s)}$. Which, unfortunately for our protocol, is quite easy.

Let $z_h = g^r$ for some chosen random $r$
Then we just need $z_p = (g^r)^{t(s)} = (g^{t(s)})^r$
Since the target polynomial $t(x)$ is public, the prover can compute $g^{t(s)}$ with the given values $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$
So finding $z_p$ is done

So finding a solution is no harder than essentially picking a random number. We want some way to ensure that the prover uses and only uses the values the verifier gives them from $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$ in finding their $z_p,z_h$. If we can do that, the only way they'll be able to calculate $z_p,z_h$ is probably with a polynomial they have.
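
The forgery steps can be sketched as follows. The modulus, base, target polynomial, and function names are assumptions made for the example; the forger only uses the public target and the encrypted powers it was sent.

```python
N, G = 2147483647, 7       # toy prime modulus and base (illustrative choices)
T = [2, -3, 1]             # public target t(x) = (x - 1)(x - 2)

def evaluate(coeffs, x):
    """Plain polynomial evaluation (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def encrypted_eval(enc_powers, coeffs):
    """Compute g^{poly(s)} from encrypted powers of s."""
    result = 1
    for enc, c in zip(enc_powers, coeffs):
        result = result * pow(enc, c % (N - 1), N) % N
    return result

def forge(enc_powers, r):
    """Build (z_p, z_h) with z_p == z_h^{t(s)} from a random r alone."""
    g_t = encrypted_eval(enc_powers, T)   # g^{t(s)} from public t and given powers
    z_h = pow(G, r, N)
    z_p = pow(g_t, r, N)                  # (g^{t(s)})^r
    return z_p, z_h

s = 999                                              # known only to the verifier
enc_powers = [pow(G, s**i, N) for i in range(4)]     # what the prover receives
z_p, z_h = forge(enc_powers, r=424242)
print(z_p == pow(z_h, evaluate(T, s) % (N - 1), N))  # verifier's check passes
```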

Readily enough, we have done this with our honesty checks above: we provide some arbitrary shift $\alpha$ only the verifier knows, to make sure the prover didn't leave anything out or cheat.

Protocol Attempt 3:

The verifier picks random numbers $s$ and $\alpha$. They send the encrypted powers of $s$ and the shifts to the prover: $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$ and $\{g^{\alpha s^0}, g^{\alpha s^1}, g^{\alpha s^2}, \cdots, g^{\alpha s^d}\}$.
The prover calculates $h(x) = \frac{p(x)}{t(x)}$. With the encrypted values, they calculate and send to the verifier $g^{p(s)}$, $g^{\alpha p(s)}$, and $g^{h(s)}$.
The verifier checks that $g^{p(s)} = (g^{h(s)})^{t(s)} = g^{h(s) \cdot t(s)}$, and that $(g^{p(s)})^\alpha = g^{\alpha p(s)}$.

The first check the verifier does is to see the values match like before, and the second check is to make sure there was no cheating and only the encrypted values of $s$ were used; the prover had to give a polynomial of degree $d$ and had to evaluate it at the $s$ the verifier gives them, as that is the only way to preserve the $\alpha$ shift.
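
Adding the $\alpha$ shift to the earlier sketch gives the third attempt. As before, the modulus, base, polynomials, and function names are illustrative assumptions, and only the honest prover's path is shown.

```python
import secrets

N, G = 2147483647, 7       # toy prime modulus and base (illustrative choices)
T = [2, -3, 1]             # t(x) = (x - 1)(x - 2)
H = [5, 1]                 # h(x) = x + 5
P = [10, -13, 2, 1]        # p(x) = t(x) * h(x)

def evaluate(coeffs, x):
    """Plain polynomial evaluation (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def encrypted_eval(enc_powers, coeffs):
    """Compute g^{poly(s)} (or its shifted version) from encrypted powers."""
    result = 1
    for enc, c in zip(enc_powers, coeffs):
        result = result * pow(enc, c % (N - 1), N) % N
    return result

def attempt_3(s, alpha):
    # Verifier -> prover: encrypted powers and their alpha-shifted twins.
    enc = [pow(G, s**i, N) for i in range(len(P))]
    enc_shift = [pow(G, alpha * s**i, N) for i in range(len(P))]
    # Prover -> verifier: the same coefficients applied to both sets.
    g_p = encrypted_eval(enc, P)
    g_alpha_p = encrypted_eval(enc_shift, P)
    g_h = encrypted_eval(enc, H)
    # Verifier: root check, then the shift check.
    roots_ok = g_p == pow(g_h, evaluate(T, s) % (N - 1), N)
    shift_ok = pow(g_p, alpha, N) == g_alpha_p
    return roots_ok and shift_ok

print(attempt_3(secrets.randbelow(N), secrets.randbelow(N)))
```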

For example, if the prover claimed they knew a quadratic with $d=2$, they claim they have a polynomial looking like $p(x)=c_2x^2+c_1x^1 + c_0$. But if they were really trying to sneak in that they had a cubic or quartic or any polynomial of degree higher than 2, they would not be able to calculate the terms needing $x^3$ or $x^4$ since the encrypted and shifted values could not be calculated and thus preserved. But that is only true if the prover used values from $\{g^{s^0},g^{s^1},g^{s^2}\}$. Now by using the $\alpha$ shift, the prover has no choice but to use these values, since any deviation from them will show up when compared against the $\alpha$-shifted values.

Further, we can do more than just ensure the degree of the polynomial we're checking, but also which specific powers are used in the polynomial. If, say, the polynomial was claimed to be a cubic, but only used powers $x^3$ and $x^1$, the verifier can choose to only send in the encrypted and shifted values of $s^3$ and $s^1$. If the prover needed the other powers to evaluate their polynomial, they would be stuck since they can only evaluate the terms with power 3 and 1.

Making It Zero-Knowledge

So far, our protocol has gone through a few iterations, and to be fair it's actually pretty robust. But we kind of forgot about making it zero-knowledge. I mean, yes we've encrypted the data to prevent any cheating from the prover, but the verifier can theoretically use the values the prover gives to brute force their way to finding the polynomial since they are the ones that generate the secret values $s$ and $\alpha$ from the beginning. For example, ideally our protocol should be secure for even a 1-degree polynomial, and even brute forcing that is just a matter of iterating through a series of numbers.

This is easily enough done in the same way that we've been doing before: we have the prover introduce a random parameter $\delta$ that obscures the data.

Zero-Knowledge Protocol:

The verifier picks random numbers $s$ and $\alpha$. They send the encrypted powers of $s$ and the shifts to the prover: $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$ and $\{g^{\alpha s^0}, g^{\alpha s^1}, g^{\alpha s^2}, \cdots, g^{\alpha s^d}\}$.
The prover calculates $h(x) = \frac{p(x)}{t(x)}$. With the encrypted values, they calculate $g^{p(s)}$, $g^{\alpha p(s)}$, and $g^{h(s)}$. With a random number $\delta$, they send the verifier $(g^{p(s)})^\delta$, $(g^{\alpha p(s)})^\delta$, and $(g^{h(s)})^\delta$.
The verifier checks that $(g^{p(s)})^\delta = ((g^{h(s)})^\delta)^{t(s)} = g^{\delta \cdot h(s) \cdot t(s)}$, and that $((g^{p(s)})^\delta)^\alpha = (g^{\alpha p(s)})^\delta$.
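
The blinded version differs from the third attempt only in that the prover raises every outgoing value to a private $\delta$. A sketch under the same illustrative assumptions (toy modulus and base, made-up polynomials, hypothetical function names):

```python
import secrets

N, G = 2147483647, 7       # toy prime modulus and base (illustrative choices)
T = [2, -3, 1]             # t(x) = (x - 1)(x - 2)
H = [5, 1]                 # h(x) = x + 5
P = [10, -13, 2, 1]        # p(x) = t(x) * h(x)

def evaluate(coeffs, x):
    """Plain polynomial evaluation (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def encrypted_eval(enc_powers, coeffs):
    """Compute g^{poly(s)} (or its shifted version) from encrypted powers."""
    result = 1
    for enc, c in zip(enc_powers, coeffs):
        result = result * pow(enc, c % (N - 1), N) % N
    return result

def zk_protocol(s, alpha, delta):
    enc = [pow(G, s**i, N) for i in range(len(P))]
    enc_shift = [pow(G, alpha * s**i, N) for i in range(len(P))]
    # Prover blinds each value with the private exponent delta before sending.
    g_p = pow(encrypted_eval(enc, P), delta, N)
    g_alpha_p = pow(encrypted_eval(enc_shift, P), delta, N)
    g_h = pow(encrypted_eval(enc, H), delta, N)
    # The verifier's checks are unchanged; delta cancels on both sides.
    roots_ok = g_p == pow(g_h, evaluate(T, s) % (N - 1), N)
    shift_ok = pow(g_p, alpha, N) == g_alpha_p
    return roots_ok and shift_ok

print(zk_protocol(secrets.randbelow(N), secrets.randbelow(N), secrets.randbelow(N)))
```

The verifier never needs to know $\delta$: both sides of each check carry it as a common exponent.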

Making it Non-Interactive

Our protocol is very similar to our discrete logarithm problem. I mean, we based it off of what we did there, only generalizing what we did with polynomials instead of specific numbers. So this doesn't really show us anything we haven't already seen before. We want to try and make it non-interactive so our verifier doesn't have to constantly monitor our verification. And more importantly, so our protocol is trustworthy: due to the nature of the interactive parts, verifiers could collude with provers, making each run of the protocol a one-time check. Even better would be if we could make it also succinct, and have each call of the protocol take a consistent amount of time.

One way we can remove the need for interactivity is by, well, replacing the interactive parts with some constant, reliable parameters to always use as opposed to the ones suggested by the verifier. In this case, the verifier has to pick a value $t(s_0)$ (since the target polynomial $t(x)$ is known, really we just need to fix an $s_0$) as well as a fixed shift $\alpha_0$. But we need these to be trustworthy, and unable to be leaked.

We could just try encrypting these values like before by exponenitating modulo $n$ like before with $g^{t(s_0)}$ and $g^{\alpha_0}$. But the problem is, we've also encrypted the other values like $p(s_0)$ and $h(s_0)$, and as we said, we can't multiply two encrypted values together, which is exactly what the checks the verifier needs at the end of the protocol; with our encryption, if we have $E(\alpha_0)$ and $E(p(s_0))$, we have no way of getting $E(\alpha_0 p(s_0))$.

There's one extra piece of machinery we'll need: elliptic curves. Elliptic curves are a class of implicit functions of the form $y^2 = x^3 + ax + b$ and their relation to the whole of cryptography is a bit too wide to encapsulate in this post, so perhaps we'll come back to them another day. The important thing about them though is that we can establish cryptographic pairings with these curves that can get around our multiplication issue.

A cryptographic pairing is a function $e(\cdot, \cdot)$ that takes two encrypted numbers, and outputs the product of those two numbers in a different representation (i.e. outputs the product of the numbers encrypted in a new way). Because the output space is a different "encryption scheme" from before, it makes it irreversible, and a "one-time operation"; you can't use the output of $e(\cdot, \cdot)$ in another cryptographic pairing in a meaningful way. The key properties of this pairing are:

$e(g^a, g^b) = e(g^b, g^a) = e(g^{ab}, g^1) = e(g^a, g^1)^b = e(g^1, g^b)^a = e(g^1,g^1)^{ab}$

I know this is a very rough sketch as to how we are overcoming the problem of multiplying encrypted values, but for the sake of brevity, we'll come back to it sometime later (if you can't wait, check this out). The important idea to keep in mind is that we have a way of multiplying encrypted values of $t(s)$ and $\alpha$ in a useable way.

So now, to make our proof non-interactive, we fix our values $\alpha, t(s)$, and then encrypt them $g^\alpha, g^{t(s)}$. These values will be the ones used by every prover and verifier moving forward, and to carry out our operations, we will use our cryptographic pairing function, which we'll see below.

Also, as an aside, we no longer need multiplicative group generators like we have used before. $g^n$ can now mean adding the generator of the elliptic curve $g$ to itself $n$ times. It's essentially the same (and acts the same for our purposes), but removes an extra component from our process.

The Final Protocol

We are now ready to put together our final protocol for knowledge of a polynomial of degree $d$ with the same roots as $t(x)$. The keys below are the necessary elements each party needs on their side of the proof beforehand.

Set-Up:

Fix random values $s$ and $\alpha$
Establish a cryptographic pairing $e(\cdot,\cdot)$ and a generator $g$
Find encryptions $g^\alpha$, $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$, $\{g^{\alpha s^0}, g^{\alpha s^1}, g^{\alpha s^2}, \cdots, g^{\alpha s^d}\}$

Now we distribute to the prover and verifier their respective information they are allowed to work with:

Proof key: $\left(\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}, \{g^{\alpha s^0}, g^{\alpha s^1}, g^{\alpha s^2}, \cdots, g^{\alpha s^d}\}\right)$
Verification key: $(g^\alpha, g^{t(s)})$

Proof:

Wants to prove knowledge of $p(x) = t(x) \cdot h(x) = c_d x^d + \cdots + c_0 x^0$
Let $h(x) = \frac{p(x)}{t(x)}$
Calculate $g^{p(s)}, g^{h(s)}$ using $\{g^{s^0}, g^{s^1}, g^{s^2}, \cdots, g^{s^d}\}$
Calculate $g^{\alpha p(s)}$ using shifted values $\{g^{\alpha s^0}, g^{\alpha s^1}, g^{\alpha s^2}, \cdots, g^{\alpha s^d}\}$
Pick a random shift $\delta$
Send to the verifier our proof $\pi = \left(g^{\delta p(s)}, g^{\delta h(s)}, g^{\delta \alpha p(s)} \right)$

Verification:

With $\pi = \left(g^{\delta p(s)}, g^{\delta h(s)}, g^{\delta \alpha p(s)} \right)$, we do our two checks for satisfiability and degree of the polynomial
Does the polynomial work? Check that $e(g^{\delta p(s)}, g) = e(g^{t(s)}, g^{\delta h(s)})$
Did the prover cheat? Check that $e(g^{\delta \alpha p(s)}, g) = e(g^{\alpha}, g^{\delta p(s)})$

And that's our protocol from start to finish. Without ever needing to reveal what $p(x)$ is, we can confirm with high probability that our prover's polynomial has the desired roots matching $t(x)$. We can add other requirements to the polynomial, like only including certain powers as we discussed, or others such as it must be a square polynomial.
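
As a quick sanity check that an honest prover passes both checks, we can pull every exponent out of the pairing using the properties listed earlier (assuming $\delta \neq 0$ and that $e(g,g)$ is not the identity):

$$e(g^{\delta p(s)}, g) = e(g,g)^{\delta p(s)}, \qquad e(g^{t(s)}, g^{\delta h(s)}) = e(g,g)^{\delta \cdot t(s) \cdot h(s)}$$

so the first check holds when $p(s) = t(s) \cdot h(s)$. Likewise,

$$e(g^{\delta \alpha p(s)}, g) = e(g,g)^{\delta \alpha p(s)} = e(g^{\alpha}, g^{\delta p(s)})$$

so the second check holds when the third element of $\pi$ really is the first one shifted by $\alpha$.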

General zk-SNARKs

So far, we've spent roughly 6000 words talking about polynomials and homomorphic encryption, and that's with skipping explanations of elliptic curve cryptography thrown in there too. But in practice, when was the last time you saw someone work with a polynomial directly? If I had to think of when someone would want to prove knowledge of something in practice, it would probably be something less direct, like knowing the output of a program, or a secret input (like a password). Those don't seem related to polynomials at all. For example, if I had a computer program and a given output, I'd like to show I have the corresponding input without revealing that input.

But of course, in the weirdest ways, polynomials are the current backbone of zk-SNARKs. And unfortunately, like elliptic curves before, they require a huge amount of literature to even scratch the surface of how they work. Ultimately, the idea is that given any program, we can make the problem of proving knowledge-of-input into a question of knowledge-of-polynomial. Here are the rough steps (as outlined in this more in-depth post):

Computation: The actual code for the program itself.
Flattening: We turn the code into a sequence of expressions involving only $=, +, -, \times, \div$. These arithmetic operations essentially correspond to different types of gates in a circuit. This is probably the hardest, least obvious step, and even now it is not clear to me what the restrictions are for programs that can be converted to these arithmetic circuits. I'd recommend reading this.
R1CS: With the arithmetic circuit, we now convert it to a rank-1 constraint system. An R1CS is a set of groups of 3 vectors $(a,b,c)$, that has a solution $x$ such that $(a \cdot x) \times (b\cdot x) = c\cdot x$ where $\cdot$ is the standard dot product. Each group of these 3 vectors $(a,b,c)$ represents some kind of constraint on our solution. In our case, we will have a triple of vectors $(a,b,c)$ for each gate/operation we have in our arithmetic circuit. So if our arithmetic circuit has 4 steps in it, we'll have 4 triplets of vectors constraining our solution $x$ that it must all satisfy. The length of the vectors will be equivalent to the number of variables needed in the circuit. Here's an example conversion of an arithmetic gate to R1CS constraint vectors.
QAP: We do yet another conversion from R1CS to a quadratic arithmetic program. The idea is to encode our vector constraints in polynomials. If, for example, we had 5 constraints with vectors of length 7, we would have 5 triples of $(a,b,c)$ constraints where $a,b,c$ are all length 7 vectors. Then, we could encode these in 3 groups of 7 polynomials (in this case, they would be of degree $5-1=4$). Each polynomial represents a coordinate in the vector, and each group represents whether that vector is the $a$, $b$, or $c$ vector in the constraint. Since we have 5 constraints, we then retrieve each constraint vector by plugging in $x=1,2,3,4,5$ to each polynomial and reading off the values by group and coordinate. We can create these polynomials via Lagrange interpolation, or your favorite way to fit polynomials to specific values. This is why the degree of the polynomials is 1 less than the number of constraints: you need exactly $d+1$ points to determine a polynomial of degree $d$.

The point of putting our program into a QAP is that we can now do our R1CS verification a lot more compactly. Instead of checking all the dot products individually between our solution vector $x$ and the constraint vectors, we can dot product the solution once with our polynomials. Dot products are just combinations of addition and multiplication, so the result of $(x \cdot A(t)) \times (x \cdot B(t)) - x\cdot C(t) = F(t)$ will just be another polynomial in $t$. Our solution vector is genuine if $F(t) = 0$ for $t=1,2,3,4,5$ in our above example, since plugging in those values corresponds to checking a different R1CS constraint (that represents one of our arithmetic logic gates).
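
To make the R1CS and QAP steps concrete, here is a small sketch. Everything specific in it is an assumption chosen for illustration: the program ($x^3 + x + 5 = 35$ with secret input $x = 3$), the layout of the solution vector, and the helper names. It uses 4 constraints rather than 5, so the polynomials have degree $3$ and $F(t)$ should vanish at $t = 1, 2, 3, 4$.

```python
from fractions import Fraction

# Hypothetical flattened program: sym1 = x*x, y = sym1*x, sym2 = y + x, out = sym2 + 5
# Assumed solution vector layout: [1, x, out, sym1, y, sym2]
solution = [1, 3, 35, 9, 27, 30]

# One (a, b, c) triple per gate; each needs (a . s) * (b . s) == c . s
A = [[0, 1, 0, 0, 0, 0],   # sym1 = x * x
     [0, 0, 0, 1, 0, 0],   # y    = sym1 * x
     [0, 1, 0, 0, 1, 0],   # sym2 = (x + y) * 1
     [5, 0, 0, 0, 0, 1]]   # out  = (5 + sym2) * 1
B = [[0, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0]]
C = [[0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 1, 0, 0, 0]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def r1cs_holds(s):
    """Check every constraint individually."""
    return all(dot(a, s) * dot(b, s) == dot(c, s) for a, b, c in zip(A, B, C))

def lagrange_eval(values, t):
    """Evaluate at t the polynomial through (1, values[0]), (2, values[1]), ..."""
    total = Fraction(0)
    for i, yi in enumerate(values, start=1):
        term = Fraction(yi)
        for j in range(1, len(values) + 1):
            if j != i:
                term *= Fraction(t - j, i - j)
        total += term
    return total

def qap_vector(M, t):
    """One polynomial per coordinate: column k of M lists its values at t = 1, 2, ..."""
    return [lagrange_eval([row[k] for row in M], t) for k in range(len(M[0]))]

def F(s, t):
    """F(t) = (s . A(t)) * (s . B(t)) - s . C(t)"""
    return dot(s, qap_vector(A, t)) * dot(s, qap_vector(B, t)) - dot(s, qap_vector(C, t))

print(r1cs_holds(solution), [F(solution, t) for t in range(1, 5)])
```

Changing any entry of `solution` so that a gate no longer holds makes $F(t)$ nonzero at the corresponding value of $t$.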

But now, we're mostly at the point in which our zk-SNARK protocol for polynomials is starting to look like a more useable tool for general programs. There are a few additional steps outlined in some of the links above, but the core idea is here, of encoding the computation of the program in a polynomial that we later can check the knowledge of via a zk-SNARK.

Conclusion

There are a lot of uses that zero-knowledge proofs can find themselves in. Basically, whenever you want any level of privacy. From passwords, to graphs, to polynomials, or to nuclear disarmament. And its most popular use, blockchains and cryptocurrencies (good way of checking valid transactions without needing to reveal people's currency balances). The underlying theory of zero-knowledge proofs, though, is simultaneously easy to understand, and difficult to implement. The theory is rich, but dense, so perhaps one of these days we'll fill in the gaps of elliptic curves (which definitely deserve their own post) and fully flesh out how we can convert computer programs to polynomials.

For more reading and types of proofs, here's a nice simple example for proving knowledge of coloring a graph that can be found here or here (and a nice little demo to go with it). Interestingly, as an aside, since we can reduce any NP problem to the 3-coloring problem, this actually gives us a way of generating zero-knowledge proofs for any problem in NP.

For even more details on zk-SNARKs and zero-knowledge ideas on the whole, see this article that informed much of this post.

e, π, and Irrational Numbers

Adi Mittal

A classic fact and its 10000 word Wikipedia spiral.

Irrational numbers are a bit strange to think about. In a sense, they were the first "new" type of number to really challenge early and young mathematicians. We have the natural numbers $\mathbb{N}$ like $\{0,1,2,3,\cdots\}$. From there, we then include the negative numbers $\{ \cdots,-2,-1,0,1,2,\cdots \}$ to have the integers $\mathbb{Z}$, that can be used to represent absences of quantities and additive inverses. From there, we can consider the in-between quantities of fractions, also known as the rationals $\mathbb{Q}$. For a while, this is all we thought there was to numbers. Pythagoras and the Ancient Greeks famously thought there was no number that wasn't rational. They had this notion of increasing inclusion of numbers $\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q}$, and that was the upper limit.

As we now know, this clearly isn't the case. We've talked about irrational numbers a little bit before, specifically on how $\sqrt{2}$ is irrational. It might be worthwhile going over the proof again:

Claim: $\sqrt{2}$ is irrational.
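
Proof (the standard argument by contradiction, sketched here for reference): Suppose $\sqrt{2} = \frac{p}{q}$ for integers $p, q$ with $q \neq 0$ and the fraction in lowest terms. Squaring both sides gives

$$p^2 = 2q^2$$

so $p^2$ is even, and hence $p$ is even. Write $p = 2k$. Substituting,

$$4k^2 = 2q^2 \Rightarrow q^2 = 2k^2$$

so $q$ is even as well. But then $p$ and $q$ share a factor of 2, contradicting that the fraction was in lowest terms.

$\blacksquare$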

This proof is pretty well-known, but to me, it's not necessarily obvious. Sure, in retrospect after knowing the proof, it might seem like a good direction to go in and get a contradiction within that rational assumption. But even then, it's a pretty clever proof.

I mention this, since I was recently thinking about an arguably more famous irrational number: $\pi$. A number studied forever that manages to appear in random sums, integrals, and expressions it has no right being in, it's no surprise it is one of the few symbols of math that transcends pop culture. And despite that, before writing this, I couldn't explain or prove why it is irrational. $\pi$ is geometric at heart, and translating to number-theoretic contexts just didn't seem to make sense. But along the way of finding out its proof (that stumped many before me), there is much to unpack about not just $\pi$, but irrational numbers on the whole.

If you care for only the proof that $e$ and $\pi$ are irrational, skip [here]. Otherwise, follow through the table of contents for whatever you're looking for.

Decimal Expansions

The existence of irrational numbers actually implies a lot. For one, there are infinitely many more irrational numbers than rationals. With the classic diagonal argument, we can see that there are not any "more" rationals than integers or naturals. The irrationals are what make the real numbers uncountable. A way we can see that is through decimal expansions: all irrational numbers have non-repeating decimal expansions. So now if you imagine constructing a random number $0.142346\dots$ by picking a random number 0–9 at each digit, landing on a repeating decimal is extremely unlikely.

Claim: A number is rational if and only if it has a terminating or repeating decimal expansion.

Proof: $(\Rightarrow)$ The easiest way to see this is through long division. If we have a rational number $\frac{p}{q}$ with $q \neq 0$, then at every step in the long division, we have $q$ possible remainders (being $0,1,2,\cdots,q-1$). If the remainder is 0 at any point, then we are done with the division algorithm and the decimal expansion terminates (this is the case of 0 being the repeating portion of the decimal). If the remainder is never 0, after $q+1$ steps in the division algorithm, we will get a remainder we have already seen before, and so the division algorithm will keep producing the same digits in the same order we have seen before from that remainder. Hence the decimal expansion repeats.

$(\Leftarrow)$ Say we have a number $x = a.d_{1}d_{2}\cdots d_{n} \overline{d_{n+1} \cdots d_{n+m}}$ that has integer part $a$, $n$ pre-repeating decimals (each $d_i$ is a digit of the number), and $m$ repeating decimals (indicated by the overline). Then we can left-shift the non-repeating number by multiplying by a power of 10:

$10^n x = 10^n a + d_{1}d_{2}\cdots d_{n}.\overline{d_{n+1} \cdots d_{n+m}} \ \ \ \ (1)$

Since the decimals repeat after an additional $m$ digits, we can shift it further to get another similar looking number:

$10^{n + m} x = 10^{n + m} a + d_{1}d_{2}\cdots d_{n}d_{n+1} \cdots d_{n+m}.\overline{d_{n+1} \cdots d_{n+m}} \ \ \ \ (2)$

Now we can subtract equation $(1)$ from $(2)$ to eliminate the repeating decimal portion:

$10^{n + m} x - 10^n x = (10^{n + m} a + d_{1}d_{2}\cdots d_{n}d_{n+1} \cdots d_{n+m}) - (10^n a + d_{1}d_{2}\cdots d_{n})$

Now, see that since we have removed the decimal/fractional part of the numbers in that subtraction, the righthand side of the equation is an integer, call it $N$. Then we can solve for $x$ and see that

$x = \large{\frac{N}{10^{n+m} - 10^n}}$ $=$ $\large{\frac{N}{(10^m - 1)10^n}}$

is a quotient of two integers, and so rational.

$\blacksquare$
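
The long-division half of the proof translates almost directly into code. Here is a minimal sketch (the function name and return format are my own choices): it tracks which remainders have already appeared, and the first repeated remainder marks where the repeating block begins.

```python
def decimal_expansion(p, q):
    """Long division of p/q for p >= 0, q > 0.

    Returns (integer part, non-repeating digits, repeating digits).
    """
    integer, r = divmod(p, q)
    digits = []
    seen = {}  # remainder -> index of the digit it produces
    while r != 0 and r not in seen:
        seen[r] = len(digits)
        r *= 10
        digits.append(r // q)
        r %= q
    if r == 0:
        return integer, digits, []   # terminating expansion
    start = seen[r]                  # first time we saw this remainder
    return integer, digits[:start], digits[start:]

print(decimal_expansion(1, 7))
print(decimal_expansion(1, 8))
```

Since there are only $q$ possible remainders, the loop can never run more than $q$ times.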

So if we want to use real numbers with decimals as we normally do, we need our non-repeating decimals to correspond to something. These are, of course, our irrationals. But let's be clear exactly what non-repeating means. It's frequently unjustifiably said that since $\pi$ is irrational, it contains every single string of numbers ever conceived, and hence contains exactly the time and date you were born, you will die, and a copy of every book ever written. This is unknown. Just consider the non-repeating decimal that only contains 0s and 1s

$0.10110111011110\cdots \overset{\textrm{n ones}}{111\cdots 1} 0 \overset{\textrm{n+1 ones}}{111\cdots 1}0\cdots$

Clearly this does not repeat, but certainly also does not contain all possible strings of numbers; it does not even contain all strings of 0s and 1s.

In a sense, too, the irrationals are also necessary for us. Without the irrational numbers, obviously we do not have all the real numbers, but the irrationals are necessary to fill in the gaps and holes left by the rationals to make the real numbers $\mathbb{R}$ as useful as they are. For example, a key property of the reals is that every non-empty subset of the reals that has an upper bound, in fact has a least upper bound. That is, for a non-empty subset $A \subset \mathbb{R}$ such that $\exists x \in \mathbb{R} \ \forall a \in A \ a \leq x$ (i.e. $x$ is an upper bound), there exists $\ell \in \mathbb{R}$ that is itself an upper bound of $A$ and is less than or equal to all possible upper bounds $x$. The rationals don't have this property: $\{ x \in \mathbb{Q} \ | \ x^2 < 2 \}$ has upper bounds in $\mathbb{Q}$, but not a least upper bound; we need irrationals and $\sqrt{2} \in \mathbb{R}$ to do this.

Completeness and the Variety of Irrationals

The irrationals, in this way, are integral to the real numbers. Not just in the sense that we have to include them, but in that they complete the real numbers. That least upper bound property we discussed above is what is referred to as the completeness of the real numbers (which can be an axiom or property derivative of other axioms, but either way is what characterizes the reals).

And yet, despite their utility, irrational numbers are still pretty mysterious. We know all rational numbers take on a certain form of $\frac{p}{q}$. Even complex numbers all reduce to some $a + bi$. But irrational numbers can take on any number of forms.

The square root of any non-square natural number is irrational
The golden ratio $\varphi$ is irrational
$e = \lim\limits_{n\to\infty} \left(1+\frac{1}{n} \right)^n = \sum_{n = 0}^{\infty} \frac{1}{n!} \approx 2.71828\cdots \ $ is irrational
$\pi$ as the ratio of a circle's circumference to its diameter is irrational

Irrational numbers can pop up almost anywhere. Though, we might be getting a bit ahead of ourselves. We have a sort of "standard form" of irrational numbers given by our decimal expansions and our above claim: a number is irrational if and only if it has a non-repeating decimal part. But it's not like we can a priori check that a decimal expansion never repeats to prove a number is irrational; it's usually shown that non-repeating decimal expansions are a consequence of irrationality.

Continued, Infinite, and Reasonable Fractions

So, while we showed that we in theory need irrational numbers, in practice representing them and using them seem far more tedious and annoying to work with.

And further, if you're just an engineer, astronomer, or even a mathematician that needs some concrete number to work with, rational numbers get us most of the way there for what we need. Why not just take some arbitrary decimal cut off? Most people only know $\pi$ up to $3.14$ anyway.

$3.14$ is pretty good, but an arguably better approximation is the famous $\frac{22}{7} \approx 3.14286$. It is in fact slightly closer to $\pi$ than $3.14$ (off by about $0.0013$ rather than $0.0016$), and it is a rather simple fraction that can make calculations easier and more convenient. You could go even further and get $\frac{355}{113} \approx 3.1415929$ which is good up to 6 decimals. Can we get any better?
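
A quick numerical comparison of these three candidates, as a sketch: it uses `math.pi` (itself a floating-point approximation) as the reference value and prints the absolute error of each one so they can be compared directly.

```python
import math
from fractions import Fraction

candidates = {
    "3.14": Fraction(314, 100),
    "22/7": Fraction(22, 7),
    "355/113": Fraction(355, 113),
}

# Absolute error of each candidate against the floating-point value of pi
errors = {name: abs(float(value) - math.pi) for name, value in candidates.items()}

for name, err in errors.items():
    print(f"{name:>8}: |pi - approx| = {err:.2e}")
```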

So before we get any further in directly discussing irrationals, let's take a minute to talk more about rational numbers and their relationship with irrationals. From this, we'll see another nice property of irrational numbers that will help us along our way in a similar, but stronger way compared to decimal expansions.

Repeating Decimals $\Leftrightarrow$ Rationals

We already proved that repeating and terminating decimals correspond to some rational number. So as a first step, it might be worth thinking about how we might go to and from fractions and these nice decimals before considering how we might do something with the weird and infinite decimals of irrationals.

Finite decimals are easy: just take the least power of 10 in its decimal expansion and put it in the denominator of the digits. $.5 = \frac{5}{10^1} = \frac{1}{2}$, and $1.562782 = \frac{1562782}{10^6}$. Not particularly helpful.

If we have a repeating decimal, it's slightly trickier, but our proof above showing they are rational essentially tells us how to do the conversion.

Consider the number $x = 2.642857 \overline{142857}$
Using the above notation, $a=2$, $n=6$, $m=6$
So $N = 10^{6+6}\cdot 2 + 642857142857 - (10^6\cdot 2 + 642857)$
All together, we get that $x = \frac{N}{(10^6 - 1)10^6} = \frac{37}{14}$
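
The same conversion as a short sketch, using the notation $a$, $n$, $m$, $N$ from the proof (the function name and the choice to pass digits as strings are mine, and the repeating block is assumed to be non-empty):

```python
from fractions import Fraction

def repeating_to_fraction(a, pre, rep):
    """Convert a.pre(rep repeating) to a fraction.

    a   -- integer part
    pre -- string of the n non-repeating digits (may be empty)
    rep -- string of the m repeating digits (must be non-empty)
    """
    n, m = len(pre), len(rep)
    # N = (10^(n+m) a + d_1...d_(n+m)) - (10^n a + d_1...d_n)
    N = (10**(n + m) * a + int(pre + rep)) - (10**n * a + int(pre or "0"))
    return Fraction(N, (10**m - 1) * 10**n)

print(repeating_to_fraction(2, "642857", "142857"))
print(repeating_to_fraction(0, "", "142857"))
```

`Fraction` reduces to lowest terms automatically, which is how the large $N$ and denominator collapse to $\frac{37}{14}$.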

You can also use geometric series to get the same answer, using the fact that the repetitions in the decimal are all separated by a ratio of $10^{-m}$ (for example, $0.\overline{142857} = 10^{-6} \cdot \sum_{n=0}^{\infty}142857\cdot 10^{-6n}$, and since $10^{-6}<1$, we get that it's equal to $10^{-6} \cdot 142857\cdot \frac{1}{1 - 10^{-6}} = \frac{142857}{999999} = \frac{1}{7}$).

Though, we should expect that we can get these exact rationals since we already proved that finite and repeating decimals all are rational.

Rational Approximations

Now irrational numbers, on the other hand, by definition, will not have a rational form. So now we will actually need rational approximations. But finding those approximations is not particularly easy. In particular, for irrational $x$, we want to find a rational $\frac{p}{q}$ such that $\left| x - \frac{p}{q} \right| < \epsilon$. At the very least, we know such rationals exist for any $\epsilon > 0$, since the rationals are dense in $\mathbb{R}$; there is always a(n infinite amount of) rational(s) in the interval $\left( x - \epsilon, x + \epsilon \right)$.

But we don't want to waste our time with "bad" approximations. Notice that for any denominator $q$, we can find a $p$ such that $\left| x - \frac{p}{q} \right| \leq \frac{1}{2q}$. This is just a result that if we divide the real numbers into intervals of size $\frac{1}{q}$, our number $x$ will fall in one of them and be closer to one side of the interval than the other.

It turns out though, we can find approximations that are far more "efficient" and get us much closer to our irrational number than just $\frac{1}{2q}$ for their denominator $q$.

Dirichlet's Approximation Theorem: $\forall x \in \mathbb{R}$ and $\forall N \in \mathbb{N}$, there are integers $p$ and $q$ such that $0 < q \leq N$ and $\left|qx - p \right| < \frac{1}{N}$.

As a consequence, we then have rational approximations $\frac{p}{q}$ that get us within $\frac{1}{q^2}$ of our irrational $x$ since $q\leq N$ implies

$$\left|qx - p \right| < \frac{1}{N} \Rightarrow \left|x - \frac{p}{q} \right| < \frac{1}{qN} \leq \frac{1}{q^2}$$

Proof: Pick $x\in \mathbb{R}$ and $N \in \mathbb{N}$. We will prove this using the Pigeonhole Principle. Note that any integer $n$ can be written as a fraction with any denominator $k$ as $\frac{nk}{k}$. So all we care about is approximating the fractional part of $x$, call it $\{x\} \in [0,1)$. Now divide the interval $[0,1)$ into $N$ equal subintervals of length $\frac{1}{N}$ i.e.

$[0,1) = \left[0, \frac{1}{N} \right) \cup \left[\frac{1}{N}, \frac{2}{N} \right) \cup \cdots \cup \left[\frac{N-1}{N}, 1 \right)$

Now consider the set of $N+1$ numbers $kx$ for $0\leq k \leq N$, and in particular their fractional parts $\{kx\}$. By the Pigeonhole Principle, at least two of these fractional parts $\{k_1 x\}, \{k_2 x\}$ lie in the same $\frac{1}{N}$-subinterval of $[0,1)$ (without loss of generality, we can pick $k_1 > k_2$).

Note, we can write $\{kx\} = kx - \lfloor kx \rfloor = kx - j$ for some integer $j$. So if two fractional parts lie in the same subinterval, then what we have is

$$\left|\{k_1 x\} - \{k_2 x\}\right| = \left|(k_1 x - j_1) - (k_2 x - j_2)\right| = \left|(k_1 - k_2)x - (j_1 - j_2)\right| < \frac{1}{N}$$

So if we let $q = k_1 - k_2$, and $p = j_1 - j_2$, then we have $0 < q \leq N$ and

$$\left|\{k_1 x\} - \{k_2 x\}\right| = \left|qx - p \right| < \frac{1}{N}$$

which is exactly what we wanted to show.

$\blacksquare$
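
The pigeonhole argument is constructive enough to run. Below is a sketch of it (the function name is made up, and it works in floating point, so it is only reliable for modest $N$): it drops the fractional parts $\{kx\}$ into $N$ buckets and stops at the first collision.

```python
import math

def dirichlet_pair(x, N):
    """Find integers (p, q) with 0 < q <= N and |q*x - p| < 1/N."""
    seen = {}  # subinterval index -> multiplier k whose {k x} landed there
    for k in range(N + 1):
        frac = k * x - math.floor(k * x)
        bucket = min(int(frac * N), N - 1)  # guard against float round-up
        if bucket in seen:
            k2 = seen[bucket]               # earlier multiplier, so k > k2
            q = k - k2
            p = math.floor(k * x) - math.floor(k2 * x)
            return p, q
        seen[bucket] = k
    raise AssertionError("unreachable: N + 1 values in N buckets must collide")

print(dirichlet_pair(math.pi, 10))
print(dirichlet_pair(math.sqrt(2), 10))
```

For $x = \pi$ and $N = 10$, the first collision is between $k = 1$ and $k = 8$, which gives $q = 7$ and $p = 22$.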

Also, it's worth noting that for irrational $x$ there are infinitely many of these approximations, despite this just being an existence proof of at least one such pair $p,q$ that bound $\left|x - \frac{p}{q} \right| < \frac{1}{q^2}$. This can be quickly noticed since for a given $N_0$, Dirichlet's theorem says that there is at least one pair of $p_0,q_0$ such that $\left| q_0x - p_0 \right| < \frac{1}{N_0}$. But then, if we increase $N_0$ to $N_1$ large enough such that $\left| q_0x - p_0 \right| \geq \frac{1}{N_1}$ (possible since $x$ is irrational, so $\left| q_0x - p_0 \right| > 0$), then Dirichlet's theorem ensures that there is some new pair of numbers such that $\left| q_1x - p_1 \right| < \frac{1}{N_1}$. And we can continue this forever.

As an interesting aside, this can also be used to show that rationals are in some ways worse than irrationals at being approximated by other rationals. Of course, if we "approximate" a rational with itself, it trivially holds that it is the "best" approximation. But concretely, if we have a rational $x = \frac{a}{b}$ and another rational $\frac{p}{q} \neq x$, then we get that

$\frac{p}{q} \neq x = \frac{a}{b} \Rightarrow pb - qa \neq 0$

In particular, since $p,q,a,b$ are all integers, so would that difference above be, and so we can conclude that $|pb - qa| \geq 1$. So all together,

$\left|qx - p \right|$ $ = \left| \large{\frac{qa - pb}{b}} \right|$ $\geq \large{\frac{1}{b}}$

which is bounded away from 0, regardless of $q$. Which shouldn't be that surprising thinking through that $p,q$ are integers so that difference in distance should only have a non-integer part that is exactly proportional to the denominator $\frac{1}{b}$.

Best Rational Approximations

Dirichlet's Approximation Theorem tells us that we have stronger candidates than others for rationally approximating irrationals. Which is great and all, but can we give a more specific criterion to single out these "better" approximations? Or at the very least, approximations that we would prefer using?

We won't use the approximation that Dirichlet's Theorem would suggest above, as in using something "efficient" that gets us closer to the number we are approximating more than we'd expect it to. And besides, our proof above doesn't actually tell us how to find these good approximations since we have no control over our denominator $q$, but rather only an upper bound for $q$ with our choice of $N$. In particular, it also doesn't tell us how good our approximations are compared to other Dirichlet approximations; it could be the case that one approximation from Dirichlet's theorem is better than another despite having a smaller denominator (and so is in some way preferable).

Probably the easiest metric of a best rational approximation of a number in this case would be that it is the best approximation for the smallest denominator found yet. Ideally, we want to work with "simple" fractions (i.e. with small denominators). So we say a rational approximation $\frac{n}{d}$ for a real number $x$ is best if $\frac{n}{d}$ is closer to $x$ than any other fraction with denominator less than $d$. If you want something more concrete, $\frac{n}{d}$ is a best approximation for $x$ if

$\left| x - \large{\frac{n}{d}} \right|$ $= \min\left\{ \left| x - \frac{p}{q} \right| \ : \ 1 \leq q \leq d \right\}$

To really get the point across, $\frac{p}{q}$ is a best rational approximation if it is not possible to get a better approximation using a smaller denominator. This is part of the reason why $\frac{22}{7}$ is so convenient to approximate $\pi$ as it is still a rather simple and easy-to-work-with fraction, but is in a sense better than any possibly simpler fraction. Again, this is great to have, but we still don't have a way of finding these approximations any better than just brute force checking every other simpler fraction.
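
For what it's worth, the brute-force search is easy to write down. This is a sketch with a made-up function name, working in floating point: for each denominator it takes the closest numerator and keeps the fraction only if it strictly beats everything with a smaller denominator.

```python
import math
from fractions import Fraction

def best_approximations(x, max_denominator):
    """List every n/d with d <= max_denominator that beats all simpler fractions."""
    best = []
    best_error = float("inf")
    for d in range(1, max_denominator + 1):
        n = round(x * d)            # closest numerator for this denominator
        error = abs(x - n / d)
        if error < best_error:      # strictly better than every smaller d
            best.append(Fraction(n, d))
            best_error = error
    return best

print(best_approximations(math.pi, 120))
```

Both $\frac{22}{7}$ and $\frac{355}{113}$ show up in the output for $\pi$, but the search costs one check per denominator, which is exactly the inefficiency we'd like to avoid.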

To find these approximations, we'll have to take a slight detour that perhaps is the closest link between representing any real number with rationals that you might have seen before.

Continued Fractions

Consider one of our favorite irrational numbers that we might want to approximate: $\sqrt{2}$. Before trying to find approximations or anything, notice that $\sqrt{2} - 1 = \frac{1}{1 + \sqrt{2}}$.

Now hold on. We just found that $\sqrt{2} = 1 + \frac{1}{1 + \sqrt{2}}$. If $\sqrt{2}$ equals that fraction on the righthand side, then we can just replace an instance of $\sqrt{2}$ with that same fraction:

$$ \sqrt{2} = 1 + \large{\frac{1}{1 + (1 + \frac{1}{1 + \sqrt{2}})}} $$

And again:

$$ \sqrt{2} = 1 + \large{\frac{1}{2 + \frac{1}{1 + (1 + \frac{1}{1 + \sqrt{2}})}}} $$

And we can keep doing this forever:

$$ \sqrt{2} = 1 + \large{\frac{1}{2 + \frac{1}{2 + \frac{1}{2 + \frac{1}{\ddots}}}}} $$

This is a continued fraction for $\sqrt{2}$. This particular one is the canonical or simple continued fraction for $\sqrt{2}$ since all the numerators are 1. The generalized continued fraction allows for any numerator. Perhaps an even more famous continued fraction is that for the golden ratio $\varphi$:

$$ \varphi = \frac{1 + \sqrt{5}}{2} = 1 + \large{\frac{1}{1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{\ddots}}}}} $$

As shorthand, simple continued fractions are sometimes written as $\sqrt{2} = [1;2,2,2,2,\ldots]$ to indicate the coefficients. So likewise, $\varphi = [1;1,1,1,1,\ldots]$.
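
We can watch these expansions close in on their targets by cutting them off after finitely many coefficients and collapsing from the bottom up. A small sketch (the helper name `cf_value` is mine):

```python
from fractions import Fraction

def cf_value(coeffs):
    """Collapse a finite simple continued fraction [a0; a1, ..., an] to one fraction."""
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value   # work outward from the innermost denominator
    return value

for k in range(1, 6):
    approx = cf_value([1] + [2] * k)
    print(approx, float(approx))   # compare against sqrt(2) = 1.41421356...
```

Replacing `[1] + [2] * k` with `[1] * k` gives the corresponding truncations for $\varphi$, which turn out to be ratios of consecutive Fibonacci numbers.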

Properties of Continued Fractions

Working with things that look like rational numbers should clue us in that this is closer to where we want to be looking. And like decimals, continued fractions give us a nice dichotomy between the rational and irrational.

Claim: A number is rational if and only if its simple continued fraction representation is finite.

Proof: $(\Leftarrow)$ If the continued fraction is finite i.e.

$x = a_1 + $ $\large{ \frac{1}{a_2 + \frac{1}{a_3 + \frac{1}{\frac{\ddots}{a_{n-1}+\frac{1}{a_n}}}}}}$

Then we can just collapse the continued fraction by creating common denominators and simplifying with normal fraction rules, showing $x$ is rational.

$(\Rightarrow)$ The idea is we'll continuously separate our fraction into its integer and fractional parts, and continue to reduce the fractional part. As an example,

$\frac{19}{7} = 2 + \frac{5}{7} = 2 + \frac{1}{\frac{7}{5}} = 2 + \frac{1}{1 + \frac{2}{5}} = 2 + \frac{1}{1 + \frac{1}{\frac{5}{2}}} = 2 + \frac{1}{1 + \frac{1}{2 + \frac{1}{2}}}$

If $x = \frac{p}{q}$ is rational, then we can repeatedly apply the division algorithm to create our continued fraction. So write $p = a_1 q + r_1$ for $0\leq r_1 < q$. Hence, we can write $\frac{p}{q} = a_1 + \frac{r_1}{q} = a_1 + \frac{1}{\frac{q}{r_1}}$. Again, we can repeat the division algorithm and write $q = a_2 r_1 + r_2$ for $0 \leq r_2 < r_1$.

To put it more neatly, write the following division algorithm steps:

$ \begin{array}{cc} p = a_1 q + r_1 & 0 \leq r_1 < q \\ q = a_2 r_1 + r_2 & 0 \leq r_2 < r_1 \\ r_1 = a_3 r_2 + r_3 & 0 \leq r_3 < r_2 \\ \vdots & \vdots \\ r_{n-3} = a_{n-1} r_{n-2} + r_{n-1} & 0 \leq r_{n-1} < r_{n-2} \\ r_{n-2} = a_n r_{n-1} \end{array} $

Note since $r_1 > r_2 > \cdots > r_{n-1}$ form a strictly decreasing sequence of non-negative integers, this process must eventually terminate with a remainder of 0 after a finite number of steps (this is what justifies using the division algorithm to find the GCD of two numbers). Once our algorithm terminates in a finite number of steps (this is what keeps our continued fraction finite), it's just a matter of writing out the continued fraction:

$x = \frac{p}{q} = a_1 + \frac{r_1}{q} = a_1 + $ $\large{ \frac{1}{a_2 + \frac{1}{a_3 + \frac{1}{\frac{\ddots}{a_{n-1}+\frac{1}{a_n}}}}}}$

and we're done.

$\blacksquare$
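
The $(\Rightarrow)$ direction is just the Euclidean algorithm with the quotients written down, so it fits in a few lines. A sketch, assuming $q > 0$ (the function name is mine):

```python
def continued_fraction(p, q):
    """Simple continued fraction coefficients of p/q via repeated division."""
    coeffs = []
    while q != 0:
        a, r = divmod(p, q)   # p = a*q + r with 0 <= r < q
        coeffs.append(a)
        p, q = q, r           # next step divides q by the remainder
    return coeffs

print(continued_fraction(19, 7))
print(continued_fraction(37, 14))
```

The loop is guaranteed to stop because the remainders strictly decrease, matching the argument in the proof.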

So this actually gives us a direct proof that $\sqrt{2}$ is irrational with its continued fraction: its simple continued fraction is infinite, and so it must be irrational.

More on Irrational Continued Fractions

Let's go back to our proof that all rational numbers have finite continued fraction expansions. Another way we can see this fact is by constructing continued fractions for irrational numbers, and we do so in almost the exact same way we did for rationals. The idea in using the division algorithm is that we separate our number $x$ into its integer and fractional component, and then invert the fractional component (which is less than 1, so inverting it gives us a number greater than one to keep separating into integer and fractional parts). I.e. first write

$x = \lfloor x \rfloor + x - \lfloor x \rfloor = \lfloor x \rfloor +$ $\large{\frac{1}{\frac{1}{x - \lfloor x \rfloor}}}$

and then repeat this exact process on $\frac{1}{x - \lfloor x \rfloor}$. Since $x$ is irrational, $x - \lfloor x \rfloor$ is also irrational, and so is every denominator we produce at each later step. The only way this process terminates is if at some step the fractional part $x - \lfloor x \rfloor = 0$, which would make the number at that step rational, and hence can't happen.

Claim: Let $x$ be irrational. Let $x_0 = x, \ a_k = \lfloor x_k \rfloor, \ x_{k+1} = \frac{1}{x_k - a_k}$. Then for all $k$

$x = [a_0; a_1, a_2, \ldots, a_{k-1}, x_k] = a_0 + $ $\large{ \frac{1}{a_1 + \frac{1}{a_2 + \frac{1}{\frac{\ddots}{a_{k-1}+\frac{1}{x_k}}}}}}$

Proof: This follows immediately from our construction above: separate the integer part and invert the fractional. It clearly holds for $k=0$, since $x = x_0$. Say this holds for $k$. Then $x_{k+1} = \frac{1}{x_k - a_k} \Rightarrow x_k = a_k + \frac{1}{x_{k+1}}$ and so

$x = [a_0; a_1, a_2, \ldots, x_k] = [a_0; a_1, a_2, \ldots, a_k + \frac{1}{x_{k+1}}] = [a_0; a_1, a_2, \ldots, a_k, x_{k+1}]$. $\blacksquare$

Then it is not too hard to show that we can extend this to the infinite continued fraction in the way we'd want to.

Claim: $[a_0; a_1, a_2, \ldots] = \lim\limits_{n\to\infty} [a_0; a_1, a_2, \ldots, a_n] = x$

Before we continue into the infinite, we'll need some tools about the finite first that we will extend with limits. The truncated continued fractions of $x$ are called the convergents of the continued fraction. The $n$th convergent of a continued fraction is

$c_n = \frac{p_n}{q_n} = [a_0; a_1, a_2, \ldots, a_n] = a_0 + $ $\large{ \frac{1}{a_1 + \frac{1}{a_2 + \frac{1}{\frac{\ddots}{a_{n-1}+\frac{1}{a_n}}}}}}$

A convergent is just the initial segment of the continued fraction up to the $n$th denominator. One useful fact we will need is the recursive formula for the numerator $p_n$ and denominator $q_n$ of these convergents (since the direct formula gets very complicated once you try to simplify it).

Lemma: For an infinite continued fraction $[a_0; a_1, a_2, \ldots]$ with $a_i > 0 \ \forall i \geq 1$, let $c_n = \frac{p_n}{q_n}$ be the $n$th convergent. We then have the following recurrence (the last row holding for $n \geq 2$):

$ \begin{array}{c|c} p_0 = a_0 & q_0 = 1 \\ p_1 = a_1 a_0 + 1 & q_1 = a_1 \\ p_n = a_n p_{n-1} + p_{n-2} & q_n = a_n q_{n-1} + q_{n-2} \\ \end{array} $

Proof: We'll prove this inductively. Note when I equate two fractions here, I mean to literally equate the numerators and denominators just as shorthand to show the two recursions at once (and simultaneously link it back to the convergent directly).

Base Case: We just quickly check this $c_0 = \frac{a_0}{1} = \frac{p_0}{q_0}$. $c_1 = a_0 + \frac{1}{a_1} = \frac{a_1 a_0 + 1}{a_1} = \frac{p_1}{q_1}$.

Inductive Step: Say this is true for $c_n = \frac{p_n}{q_n} = \frac{a_n p_{n-1} + p_{n-2}}{a_n q_{n-1} + q_{n-2}}$. Now consider $c_{n+1} = [a_0, a_1, \ldots, a_n, a_{n+1}]$. Note we can obtain $c_{n+1}$ from $c_n$ by replacing $a_n$ with $a_n + \frac{1}{a_{n+1}}$. That is, we can do a substitution trick like we did before, and write $c_{n+1} = [a_0, a_1, \ldots, a_n + \frac{1}{a_{n+1}}]$. Fortunately, by the inductive hypothesis, we have our numerator and denominator of $c_n$ in a form that is dependent on $a_n$ and makes this substitution possible (note that $p_{n-1}$, $p_{n-2}$, $q_{n-1}$, $q_{n-2}$ don't depend on $a_n$, so they remain unaffected by this substitution).

$\begin{align} c_{n+1} & = \frac{\left(a_n + \frac{1}{a_{n+1}}\right) p_{n-1} + p_{n-2}}{\left(a_n + \frac{1}{a_{n+1}}\right) q_{n-1} + q_{n-2}} \\ & = \frac{\left(a_n p_{n-1} + p_{n-2} \right) + \frac{1}{a_{n+1}} p_{n-1}}{\left(a_n q_{n-1} + q_{n-2} \right) + \frac{1}{a_{n+1}} q_{n-1}} \\ & = \frac{p_n + \frac{1}{a_{n+1}} p_{n-1}}{q_n + \frac{1}{a_{n+1}} q_{n-1}} \\ & = \frac{a_{n+1} p_n + p_{n-1}}{a_{n+1} q_n + q_{n-1}} \end{align}$

Giving us exactly what we wanted.

$\blacksquare$

These recursions give us some very useful ways of characterizing convergents, since without them, it isn't hard to imagine how complicated the numerators and denominators might become when trying to collapse the continued fraction.
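Here is a minimal sketch of the recurrence in Python, checked against collapsing the fraction directly. The names `convergents` and `collapse` are mine, and the seed values $p_{-1} = 1$, $q_{-1} = 0$ are a standard convention I'm assuming so that the first two rows of the table fall out of the same loop.

```python
from fractions import Fraction

def convergents(coeffs):
    """Return [(p_0, q_0), (p_1, q_1), ...] for [a_0; a_1, a_2, ...]."""
    p_prev, q_prev = 1, 0        # assumed seeds p_{-1} = 1, q_{-1} = 0
    p, q = coeffs[0], 1          # p_0 = a_0, q_0 = 1
    result = [(p, q)]
    for a in coeffs[1:]:
        p_prev, p = p, a * p + p_prev   # p_n = a_n p_{n-1} + p_{n-2}
        q_prev, q = q, a * q + q_prev   # q_n = a_n q_{n-1} + q_{n-2}
        result.append((p, q))
    return result

def collapse(coeffs):
    """Evaluate a finite continued fraction from the bottom up."""
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value

pi_coeffs = [3, 7, 15, 1, 292]
print(convergents(pi_coeffs))
# [(3, 1), (22, 7), (333, 106), (355, 113), (103993, 33102)]
```

Each pair agrees with `collapse` applied to the matching initial segment, without ever having to simplify a nested fraction.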

Claim: $p_{n-1} q_n - q_{n-1} p_n = (-1)^n$

Proof: As expected, we'll use induction:

Base Case: For $n=1$, we get $p_0 q_1 - q_0 p_1 = a_0 a_1 - 1 \cdot (a_1 a_0 + 1) = -1$. Inductive Step: Suppose this is true for $n$. Then using our recursions, we can show this holds for $n+1$:

$\begin{align} p_{n} q_{n+1} - q_{n} p_{n+1} & = p_n (a_{n+1} q_{n} + q_{n-1}) - q_n (a_{n+1} p_{n} + p_{n-1}) \\ & = p_n q_{n-1} - q_n p_{n-1} \\ & = (-1) \cdot (p_{n-1} q_n - q_{n-1} p_n) \\ & = (-1) \cdot (-1)^n \\ & = (-1)^{n+1} \end{align}$
$\blacksquare$

This fact also then immediately tells us that $\gcd (p_n, q_n) = 1$ (any common divisor of $p_n$ and $q_n$ would have to divide $\pm 1$), so $c_n = \frac{p_n}{q_n}$ is always in lowest terms. It also gives us a nice relation between adjacent convergents:

$\begin{align} c_n - c_{n-1} = \frac{p_n}{q_n} - \frac{p_{n-1}}{q_{n-1}} = \frac{p_n q_{n-1} - q_n p_{n-1}}{q_n q_{n-1}} = \frac{(-1)^{n+1}}{q_n q_{n-1}} \end{align}$

This fact that $c_n - c_{n-1} = \frac{(-1)^{n+1}}{q_n q_{n-1}}$ gets arbitrarily small also shows that the convergents form a Cauchy sequence (the formal proof just combines this bound with the triangle inequality), and hence $\lim\limits_{n\to\infty} c_n$ exists. Below we'll show that they converge to the limit we expect.
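As a quick sanity check (not a proof), we can test the identity $p_{n-1} q_n - q_{n-1} p_n = (-1)^n$ and the coprimality of $p_n, q_n$ on the first few coefficients of $\pi$. The helper names below are my own, and the seeds $p_{-1} = 1$, $q_{-1} = 0$ are an assumed convention.

```python
from math import gcd

def convergents(coeffs):
    """Return [(p_0, q_0), (p_1, q_1), ...] using the recurrence."""
    p_prev, q_prev = 1, 0        # assumed seeds p_{-1} = 1, q_{-1} = 0
    p, q = coeffs[0], 1
    result = [(p, q)]
    for a in coeffs[1:]:
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
        result.append((p, q))
    return result

def check_determinant_identity(coeffs):
    """True if p_{n-1} q_n - q_{n-1} p_n == (-1)^n and gcd(p_n, q_n) == 1 for all n >= 1."""
    cs = convergents(coeffs)
    for n in range(1, len(cs)):
        (p_prev, q_prev), (p, q) = cs[n - 1], cs[n]
        if p_prev * q - q_prev * p != (-1) ** n:
            return False
        if gcd(p, q) != 1:
            return False
    return True

print(check_determinant_identity([3, 7, 15, 1, 292, 1, 1, 1, 2]))  # True
```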

Taking a minute to give some terminology and the above recurrence will help us in the future.

Claim: Using the $a_i$, $x_i$ defined above, $[a_0; a_1, a_2, \ldots] = \lim\limits_{n\to\infty} [a_0; a_1, a_2, \ldots, a_n] = x$

Proof: As with any question about infinity, we look at the partial, in-between steps and consider the formal limit. Here those steps are our convergents $c_n$, i.e. we want to show $\left| x - c_n \right| = \left| x - \frac{p_n}{q_n} \right| \rightarrow 0$.

First, note that $x_n$ is irrational for all $n$. Second, we'll show $a_n > 0$ for all $n \geq 1$ so we can use our recursive formula for convergents above. This follows because our algorithm essentially takes the fractional part of our number (which is less than 1), and inverts it (making it greater than 1) at the next step: since $a_k = \lfloor x_k \rfloor$, we have $a_k < x_k < a_k + 1$ (we get a strict lower inequality by the irrationality of $x_k$). So $0 < x_k - a_k < 1$. Hence, $x_{k+1} = \frac{1}{x_k - a_k} > 1$, and therefore $a_{k + 1} = \lfloor x_{k+1} \rfloor \geq 1$.

So we can use our recursive formulas for convergents we proved above. Also using our "finite" continued fraction from before, we can write

$x = [a_0; a_1, a_2, \ldots, a_n, x_{n+1}] =$ $\large{\frac{x_{n+1} p_n + p_{n-1}}{x_{n+1} q_n + q_{n-1}}}$

Therefore,

$\begin{align} x - \frac{p_n}{q_n} & = \frac{x_{n+1} p_n + p_{n-1}}{x_{n+1} q_n + q_{n-1}} - \frac{p_n}{q_n} \\ & = \frac{x_{n+1} p_n q_n + p_{n-1} q_n - x_{n+1} q_n p_n - q_{n-1} p_n}{(x_{n+1} q_n + q_{n-1})q_n} \\ & = \frac{p_{n-1} q_n - q_{n-1} p_n}{(x_{n+1} q_n + q_{n-1})q_n} \\ & = \frac{(-1)^n}{(x_{n+1} q_n + q_{n-1})q_n} \\ \end{align}$

The last equality we showed above with our recursions. Taking absolute values,

$\left| x - \frac{p_n}{q_n} \right| =$ $\left| \large{\frac{1}{(x_{n+1} q_n + q_{n-1}) q_n}} \right|$

Also, $x_{n+1} > \lfloor x_{n+1} \rfloor = a_{n+1}$, so $x_{n+1} q_n + q_{n-1} > a_{n+1} q_n + q_{n-1} = q_{n+1}$:

$\left| x - \frac{p_n}{q_n} \right| =$ $\left| \large{\frac{1}{(x_{n+1} q_n + q_{n-1}) q_n}} \right|$ $ < $ $\large{\frac{1}{q_{n+1} q_n}}$

Next, note that $q_n \geq n$ for $n \geq 1$:

Lemma: $q_n \geq n$ for $n \geq 1$.

Proof: $q_1 = a_1 \geq 1$ as shown earlier, so we have a base case established (and recall $q_0 = 1$). Say $q_k \geq k$ for $1 \leq k\leq n$. Since $q_{n-1} \geq 1$ in every case, $q_{n+1} = a_{n+1}q_n + q_{n-1} \geq 1 \cdot n + 1 = n+1$. $\blacksquare$

Finally,

$\left| x - \frac{p_n}{q_n} \right| < $ $\large{\frac{1}{q_{n+1} q_n}}$ $\leq$ $\large{\frac{1}{(n+1)n}}$

Giving the limit

$\lim\limits_{n\to\infty} \left| x - \frac{p_n}{q_n} \right| = 0$

$\blacksquare$

So we're actually justified in using these infinite continued fractions in the way we want to, and generate them in a semi-algorithmic way:

$\begin{align} \pi & = 3 + .141592 \ldots \\ & = 3 + \frac{1}{\frac{1}{.141592\ldots}} = 3 + \frac{1}{7 + .062513\ldots} \\ & = 3 + \frac{1}{7 + \frac{1}{\frac{1}{.062513\ldots}}} = 3 + \frac{1}{7 + \frac{1}{15 + .996594\ldots}} \\ & = 3 + \frac{1}{7 + \frac{1}{15 + \frac{1}{1 + \frac{1}{\ddots}}}} \end{align}$

I say semi-algorithmic since the way I presented above already relies on having the exact value of $\pi$ in decimal, which can be arguably harder to find. It's good enough to get us as many coefficients as we'll practically need, but we'll address this problem later.
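The separate-and-invert process is easy to sketch in Python if we accept floating point as our source of decimals. This is only a sketch: `cf_from_float` is a name I made up, and because rounding error grows with every inversion, only the first handful of coefficients can be trusted.

```python
import math

def cf_from_float(x, terms):
    """Separate the integer part and invert the fractional part, `terms` times."""
    coeffs = []
    for _ in range(terms):
        a = math.floor(x)
        coeffs.append(a)
        frac = x - a
        if frac == 0:        # only happens if the float is exactly an integer
            break
        x = 1 / frac         # fractional part < 1, so the inverse is > 1
    return coeffs

print(cf_from_float(math.pi, 5))   # [3, 7, 15, 1, 292]
```

Asking for many more terms will eventually return coefficients of the floating-point number rather than of $\pi$ itself, which is exactly the precision problem described above.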

Best Rational Approximations

Now the reason we went through all this trouble is because continued fractions have the surprising ability to generate best rational approximations systematically. Remember, we say a rational approximation $\frac{n}{d}$ for a real number $x$ is best if $\frac{n}{d}$ is closer to $x$ than any other fraction with denominator less than $d$.

Definition: We say $\frac{n}{d}$ is a best rational approximation for $x$ if

$\left| x - \frac{n}{d} \right| = \min\left\{ \left| x - \frac{p}{q} \right| \ : \ 1 \leq q \leq d \right\}$

Now here's the remarkable fact:

Theorem: Write an irrational number $x = [a_0; a_1, a_2, \ldots]$ as its continued fraction expansion. Then for any convergent $c_n = \frac{p_n}{q_n}$ and any rational $\frac{a}{b}$ with $0 < b < q_{n+1}$, we have $\left| q_n x - p_n \right| \leq \left| bx - a \right|$.

This is actually a stronger statement, since it's not only a best rational approximation, but it is also a better rational approximation than some fractions with denominators greater than $q_n$. It should not be surprising that $q_{n+1} > q_n$ for all $n \geq 1$ (just think about the process involved in collapsing a continued fraction, or show it by induction with the recursion formulas). What our claim says is that $\frac{p_n}{q_n}$ beats not only rational approximations with denominator less than $q_n$, but also those with some denominators greater than $q_n$. This is surprising, since it means that even if we divide the real numbers into smaller boxes, sometimes our number will be closer to the edges of the bigger box. It's like saying that a steak knife can get you a more precise cut than some X-Acto knives.

Proof: Let $\frac{a}{b}$ be a rational number in lowest terms such that $0 < b < q_{n+1}$. We'll also assume $\frac{a}{b} \neq \frac{p_n}{q_n}$ (since otherwise we clearly have equality). We'll rewrite $a,b$ in terms of new numbers $r,s \in \mathbb{R}$ that we define by the following system of equations:

$\begin{align} a & = r p_n + s p_{n+1} \\ b & = r q_n + s q_{n+1} \\ \end{align}$

$r,s$ exist since that system of equations is equivalent to the matrix system

$ \begin{bmatrix} p_n & p_{n+1} \\ q_n & q_{n+1} \end{bmatrix} \begin{bmatrix} r \\ s \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} $

and we showed that the determinant $p_n q_{n+1} - q_n p_{n+1} = (-1)^{n+1} \neq 0$, so the matrix is invertible. Now if we solve for $r,s$ (either with matrices or directly), we see that

$ \begin{bmatrix} r \\ s \end{bmatrix} = \begin{bmatrix} p_n & p_{n+1} \\ q_n & q_{n+1} \end{bmatrix}^{-1} \begin{bmatrix} a \\ b \end{bmatrix} = \pm \begin{bmatrix} q_{n+1} & -p_{n+1} \\ -q_n & p_n \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \pm \begin{bmatrix} q_{n+1}a - p_{n+1}b \\ -q_n a + p_n b \end{bmatrix} $

So $r,s$ are integers.

Claim: $r,s \neq 0$.

Proof: If $r = 0$, then $q_{n+1}a = p_{n+1}b$. Since $p_{n+1}$ and $q_{n+1}$ are coprime (as we proved earlier), $q_{n+1}$ must divide $b$, contradicting the assumption that $0 < b < q_{n+1}$. If $s=0$, then $a = r p_n$ and $b = r q_n$, so $\frac{a}{b} = \frac{p_n}{q_n}$, which we already excluded in our assumptions. $\blacksquare$

Claim: $r$ and $s$ have opposite signs.

Proof: We'll just check the two cases.

If $s > 0$, then $s \geq 1$. Therefore $sq_{n+1} \geq q_{n+1} > b$, and thus $b = rq_n + sq_{n+1}$ implies that $rq_n = b - sq_{n+1} < 0$. Since $q_n > 0$, we must have $r<0$.
If $s < 0$, then $-sq_{n+1} > 0$. Hence we have $rq_n = b - sq_{n+1} > 0$. Thus $r > 0$. $\blacksquare$

Claim: $q_n x - p_n$ and $q_{n+1} x - p_{n+1}$ have opposite signs.

Proof: Note we saw earlier that the differences between consecutive convergents alternate in sign:

$\begin{align} c_n - c_{n-1} = \frac{(-1)^{n+1}}{q_n q_{n-1}} \end{align}$

That is, the convergents take turns being greater or lesser than the previous convergent. What this means is that our convergents alternately undershoot and overshoot their limit $x$. We can read this off directly from the formula in the convergence proof, $x - \frac{p_n}{q_n} = \frac{(-1)^n}{(x_{n+1} q_n + q_{n-1})q_n}$, whose denominator is positive: the even convergents undershoot (starting with $c_0 = \lfloor x \rfloor < x$) and the odd ones overshoot. So in particular, for all $n$, we have

$\begin{align} \frac{p_n}{q_n} < x < \frac{p_{n+1}}{q_{n+1}} \ \ \textrm{or} \ \ \frac{p_{n+1}}{q_{n+1}} < x < \frac{p_n}{q_n} \end{align}$

If we take the first case, then the first inequality gives $0 < q_n x - p_n$ while the second one gives $q_{n+1} x - p_{n+1} < 0$, so they're of opposing signs. Similarly the second case gives the same. $\blacksquare$

Claim: $r(q_n x - p_n)$ and $s(q_{n+1} x - p_{n+1})$ have the same sign.

Proof: This is just something you can check by cases. The opposite signs of $r,s$ and of $(q_n x - p_n), (q_{n+1} x - p_{n+1})$ cancel each other out. $\blacksquare$

Finally, putting this last fact to work, we get

$\begin{align} |bx - a| & = |(r q_n + s q_{n+1})x - (r p_n + s p_{n+1})| \\ & = |r(q_n x - p_n) + s(q_{n+1} x - p_{n+1})| \\ & = |r||q_n x - p_n| + |s||q_{n+1} x - p_{n+1}| \\ & > |r||q_n x - p_n| \\ & \geq |q_n x - p_n| \end{align}$

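We can separate the absolute values in the 3rd equality since the two terms have the same sign. The strict inequality holds because $s \neq 0$ and $q_{n+1} x - p_{n+1} \neq 0$ ($x$ is irrational), and the last inequality comes from the fact that $r$ is a nonzero integer, so $|r| \geq 1$.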

$\blacksquare$

For completeness sake,

Corollary: Each convergent $c_n = \frac{p_n}{q_n}$ is a best rational approximation for $x$.

Proof: Suppose otherwise, so that there exists $\frac{a}{b}$ with $b \leq q_n$ and

$\begin{align} \left| x - \frac{a}{b} \right| < \left| x - \frac{p_n}{q_n} \right| \end{align}$

Multiplying the left side by $b$ and the right side by $q_n \geq b$ gives

$\begin{align} |bx - a| < |q_n x - p_n| \end{align}$

But this contradicts our theorem as $b \leq q_n < q_{n+1}$. $\ \blacksquare$

The Most Irrational Number

If you want to approximate $\pi$, we now have a good way of doing so: take some convergent of the continued fraction. Looking at the first couple,

$\begin{align} \pi \approx c_0 & = 3 \\ \pi \approx c_1 & = 3 + \frac{1}{7} = \frac{22}{7} \approx 3.1429 \\ \pi \approx c_2 & = 3 + \frac{1}{7 + \frac{1}{15}} = \frac{333}{106} \approx 3.141509 \end{align}$

which are some of the more famous approximations of $\pi$. Note, if you do want to get really good approximations, making the last term in a convergent large is ideal. Notice how much closer we got with $c_2$ compared to $c_1$, and that is precisely because 15 is (relatively) much bigger than 7. These large coefficients get multiplicatively bigger as we simplify the convergent, giving us a rather large denominator $q_n$ that more finely dices the real numbers for us to approximate with. $\pi$ has a lot of these large numbers in its continued fraction expansion, giving us good convergents to create these surprisingly accurate best rational approximations.

$\pi = [3; \color{red}{7}, \color{red}{15}, 1, \color{red}{292}, 1, 1, 1, 2, 1, 3, 1, \color{red}{14}, 2, 1, 1, 2, 2, 2, 2, ...]$

This also means, if you care about small denominators, it's worth stopping one before a large number (i.e. at 7 instead of 15, or 1 instead of 292): if you need such a large coefficient to generate the next best rational approximation (and get a really big denominator), then the approximation before it must already have been really good for you to have to zoom in that much on the next step.

NOTE: we proved that all convergents are best rational approximations, but the converse is not true! $\frac{13}{4}$ is also a best rational approximation for $\pi$, but is not one of our convergents.
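To see this concretely, here is a brute-force sketch that lists every best rational approximation up to a given denominator straight from the definition. The function name `best_approximations` is my own, and it uses floating point, so it should only be trusted for small denominators.

```python
import math

def best_approximations(x, max_den):
    """Fractions n/d (d <= max_den) closer to x than every fraction with a smaller denominator."""
    best = []
    smallest_error = float("inf")
    for d in range(1, max_den + 1):
        n = round(x * d)                 # nearest numerator for this denominator
        error = abs(x - n / d)
        if error < smallest_error:       # strictly better than everything before it
            smallest_error = error
            best.append((n, d))
    return best

print(best_approximations(math.pi, 7))
# [(3, 1), (13, 4), (16, 5), (19, 6), (22, 7)]
```

Only $\frac{3}{1}$ and $\frac{22}{7}$ in that list are convergents; $\frac{13}{4}$, $\frac{16}{5}$ and $\frac{19}{6}$ are best rational approximations that the convergents skip over.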

So, what if we minimize all the coefficients then? If we give no large numbers to stop at for our convergents, we'll be left with a number that is not easy to approximate at all. We know all the numbers are positive integers, so let's just set them all to 1 to minimize them:

$1 +$ $\large{\frac{1}{1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{\ddots}}}}}$

We know this converges as we showed that the difference between consecutive convergents gets small (and so forms a Cauchy sequence), and as we saw before this is the Golden Ratio $\varphi$. We can also solve this directly if you wanted: if we set the value of the fraction to $x$

$1 + \large{\frac{1}{ \fbox{$1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{\ddots}}} $} }} = x$

we can see that the boxed portion of the fraction is equivalent to the original, whole fraction, so we can substitute $x$ for the box.

$1 + \frac{1}{x} = x$
$x^2-x-1=0$

This is precisely the quadratic that has roots at $\frac{1 \pm \sqrt{5}}{2}$ (we only admit the positive solution since all our convergents are positive, and limits preserve weak inequalities, so the limit is non-negative).

So if we look at some of the convergents of $\varphi$, they're quite bad at approximating it (for reference, $\varphi \approx 1.618034$):

$\begin{align} \varphi \approx c_0 & = 1 \\ \varphi \approx c_1 & = 1 + \frac{1}{1} = 2 \\ \varphi \approx c_2 & = 1 + \frac{1}{1 + \frac{1}{1}} = \frac{3}{2} = 1.5 \\ \varphi \approx c_3 & = 1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{1}}} = \frac{5}{3} = 1.\overline{6} \\ \varphi \approx c_4 & = 1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{1}}}} = \frac{8}{5} = 1.6 \\ \end{align}$

In fact, it takes until $c_9$ to get 3 decimals correct. Remember, for $\pi$ the convergent $c_1 = \frac{22}{7}$ was already correct to 2 decimals, and $c_2$ to 4.

$\varphi \approx c_9 =$ $\large{\frac{89}{55}}$ $= 1.618\overline{18}$

Since $\varphi$ is so bad—in fact as we've demonstrated, the worst—to approximate rationally, it is often referred to as the most irrational number.
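A short sketch makes the slow crawl visible. With every coefficient equal to 1, the recurrence just adds the previous two numerators and denominators, so the convergents are ratios of consecutive Fibonacci numbers. The function name and the seeds $p_{-1} = 1$, $q_{-1} = 0$ are my own assumptions.

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def convergents_of_ones(count):
    """First `count` convergents (p_n, q_n) of [1; 1, 1, 1, ...]."""
    p_prev, q_prev = 1, 0      # assumed seeds p_{-1} = 1, q_{-1} = 0
    p, q = 1, 1                # c_0 = 1/1
    out = [(p, q)]
    for _ in range(count - 1):
        p_prev, p = p, p + p_prev   # a_n = 1, so p_n = p_{n-1} + p_{n-2}
        q_prev, q = q, q + q_prev
        out.append((p, q))
    return out

for n, (p, q) in enumerate(convergents_of_ones(10)):
    print(n, f"{p}/{q}", f"{p / q:.6f}", f"error = {abs(PHI - p / q):.6f}")
```

The last row printed is $c_9 = \frac{89}{55}$, the first convergent to begin with $1.618$.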

Relation to Dirichlet's Approximation Theorem

Recall, in our first little discussion about approximating irrational numbers, we spoke about Dirichlet's Approximation Theorem.

Dirichlet's Approximation Theorem: $\forall x \in \mathbb{R}$ and $\forall N \in \mathbb{N}$, there are integers $p$ and $q$ such that $0 < q \leq N$ and $\left|qx - p \right| < \frac{1}{N}$.

The more concrete result we got out of it was that we could have rational approximations $\frac{p}{q}$ that get us within $\frac{1}{q^2}$ of our irrational $x$, which is much better than our initial bound of being within $\frac{1}{2q}$ of $x$.

In our proof of the convergence of continued fractions, we actually saw that our convergents satisfy this bound too. Look above in the proof, and we deduced that

$\begin{align}\left| x - \frac{p_n}{q_n} \right| < \frac{1}{q_{n+1} q_n} < \frac{1}{q_n^2} \end{align}$

and the final inequality comes from the fact that $q_{n+1} > q_n$. So, as we'd hope, our best rational approximations from continued fractions are also just outright good and efficient approximations.

In fact, we can usually do a little bit better.

Corollary: For irrational $x$ and for all $n \in \mathbb{N}$, at least one of the convergents $\frac{p_n}{q_n}$ or $\frac{p_{n+1}}{q_{n+1}}$ satisfies the inequality

$\begin{align}\left| x - \frac{p}{q} \right| < \frac{1}{2q^2} \end{align}$

Proof: Suppose this isn't true. Then

$\begin{align} \frac{1}{2q_n^2} + \frac{1}{2q_{n+1}^2} \leq \left|x - \frac{p_n}{q_n} \right| + \left|x - \frac{p_{n+1}}{q_{n+1}}\right| = \left|\frac{p_n}{q_n} - \frac{p_{n+1}}{q_{n+1}}\right| = \frac{1}{q_n q_{n+1}} \end{align}$

The second equality follows since $\frac{p_n}{q_n} < x < \frac{p_{n+1}}{q_{n+1}}$ or $\frac{p_{n+1}}{q_{n+1}} < x < \frac{p_n}{q_n}$. Say we're in the first case, then

$\begin{align} \left|x - \frac{p_n}{q_n} \right| + \left|x - \frac{p_{n+1}}{q_{n+1}}\right| = \left(x - \frac{p_n}{q_n} \right) + \left(\frac{p_{n+1}}{q_{n+1}} - x \right) \end{align}$

and the $x$ terms cancel (similarly for the second case). The last equality is precisely the difference between two adjacent convergents we found earlier.

$\begin{align} \frac{1}{2q_n^2} + \frac{1}{2q_{n+1}^2} \leq \frac{1}{q_n q_{n+1}} \Rightarrow \frac{q_{n+1}^2 + q_n^2}{2q_n^2 q_{n+1}^2} \leq \frac{1}{q_n q_{n+1}} \Rightarrow (q_{n+1} - q_{n})^2 \leq 0 \end{align}$

But we already know that $q_{n+1} > q_n$, so this can't hold, giving us our contradiction. $ \ \blacksquare$
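We can watch this corollary in action on the first few convergents of $\pi$. This is a numerical sketch in floating point, with the convergents typed in by hand; the names are my own.

```python
import math

# First five convergents of pi, (p_n, q_n)
PI_CONVERGENTS = [(3, 1), (22, 7), (333, 106), (355, 113), (103993, 33102)]

def within_half_bound(x, p, q):
    """True if |x - p/q| < 1 / (2 q^2)."""
    return abs(x - p / q) < 1 / (2 * q * q)

flags = [within_half_bound(math.pi, p, q) for p, q in PI_CONVERGENTS]
print(flags)   # [True, True, False, True, False]

# In every adjacent pair, at least one convergent satisfies the bound
print(all(a or b for a, b in zip(flags, flags[1:])))   # True
```

Notice that $\frac{333}{106}$ and $\frac{103993}{33102}$ miss the sharper bound on their own, but each has a neighbor that makes it.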

It turns out the converse to this is in fact true:

Legendre's Theorem: If $x \in \mathbb{R}$, and $\frac{p}{q} \in \mathbb{Q}$ such that $\left| x - \frac{p}{q} \right| < \frac{1}{2q^2}$, then $\frac{p}{q} = c_n$ is a convergent of the continued fraction for $x$.

We won't prove this, but the proof can be found on the Wikipedia page.

Finding Continued Fractions

One problem I had with continued fractions originally was that there didn't seem to be a particularly nice way of actually finding them. It's nice that we have this very straightforward way of doing it by separating integer parts and inverting the fractional, but that already requires us to have calculated the decimal expansion of whatever number we're expanding. This makes it seem not as elegant and a bit clunky, so when I found this paper outlining another method, I had to look through it.

The accuracy of this method will still rely on having some level of floating point precision, but otherwise, I think it is a much nicer way of going about finding continued fractions, one that weaves in another natural approximation method. The idea lies within finding roots of functions in a way similar to Newton's method.

Say we want to find the continued fraction of the number $\alpha$. Further, assume that there is a twice differentiable function $f(x)$ such that $f(\alpha) = 0$. To get a rough idea of the algorithm, there are two facts we first must take note of.

1) The Mean Value Theorem: Given a function $f(x)$ that is differentiable on the interval $(a,b)$, the Mean Value Theorem states that $\exists c \in (a,b)$ such that

$\begin{align} f'(c) = \frac{f(b) - f(a)}{b-a} \end{align}$

That is, there is a point that has a tangent with the same slope as the secant line between endpoints of the graph. So if we consider the interval between $(\alpha, t)$ where $f(\alpha) = 0$, we get that

$\begin{align} f'(c) = \frac{f(t) - f(\alpha)}{t - \alpha} = \frac{f(t)}{t - \alpha}\end{align}$

Rearranging a little bit and letting $t= \frac{p_n}{q_n}$ be a convergent of $\alpha$, we can see that

$\begin{align} \left| \frac{p_n}{q_n} - \alpha \right| = \left| \frac{f(\frac{p_n}{q_n})}{f'(c)} \right| \end{align}$

2) Relating convergents and remainder coefficients: Recall in our proof that simple infinite continued fractions converge, we found that

$\begin{align} \left| \frac{p_n}{q_n} - \alpha \right| = \frac{1}{ (\alpha_{n+1} q_n + q_{n-1})q_n } \end{align}$

where $\alpha_{n+1}$ is the "remainder" coefficient we get when calculating continued fractions i.e.

$\alpha = [a_0; a_1, a_2, \cdots, a_n, \alpha_{n+1}] = a_0 + \frac{1}{a_1 + \frac{1}{a_2 + \frac{1}{\frac{\ddots}{ a_n + \frac{1}{\alpha_{n+1}}}}}}$

If we combine these two equalities, we get that

$\begin{align} \frac{1}{ (\alpha_{n+1} q_n + q_{n-1})q_n } = \left| \frac{p_n}{q_n} - \alpha \right| = \left| \frac{f(\frac{p_n}{q_n})}{f'(c)} \right| \end{align}$

If we solve for $\alpha_{n+1}$:

$\begin{align} \alpha_{n+1} = \frac{\left| f'(c) \right|}{q_n^2 \left| f(\frac{p_n}{q_n} )\right|} - \frac{q_{n-1}}{q_n} \end{align}$

This is great, since now we have a direct way of computing the coefficients $a_{n+1} = \lfloor \alpha_{n+1} \rfloor$. Further, the computation only requires evaluating the function $f(x)$ (and its derivative) and knowing the previous two convergents! This is the basis for the algorithm: we repeatedly compute values of $a_{n+1}$ by computing the remainder $\alpha_{n+1}$ with our convergents.

However, equality only holds for the specific value of $c$ between $\alpha$ and $\frac{p_n}{q_n}$ given by the Mean Value Theorem, which we don't know. So what we do instead is approximate $c$ with a reasonably close value. One value that works just fine is $\frac{p_n}{q_n}$ itself. So what originally could be a hard problem of finding decimal expansions of $\alpha$ to do our separate-and-invert algorithm now becomes a much more tractable problem of computing decimals when evaluating a function. Also, this process is memoryless: it does not need anything more than the previous two convergents to compute the next. Whereas if we wanted to refine our continued fraction with the old method, we would have to start from the very beginning and recompute everything.

Again, this still requires calculating something else to an arbitrary precision to get the correct continued fraction (and we still need to get the first two convergents manually as seed values). So in the case of $\pi$, we are calculating instances of $\sin(x)$, which is in some ways easier. This gives a wide range of numbers we can now compute continued fractions for:

For $\sqrt{2}$, we use $f(x) = x^2 - 2$
For $\ln(n)$, we use $f(x) = e^x - n$.
For $\pi^2$, we use $f(x) = \sin(\sqrt{x})$
For $e^\pi$, we use $f(x) = \sin(\log x)$

Many other function compositions can lend to many other numbers that would seem otherwise hard to get continued fractions of.

For more details and a coded demo, check out this Python notebook that is set up to calculate the first 50 coefficients (as well as the convergents and their numerators and denominators) of the simple continued fraction for $\pi$.
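For a feel of how this works without opening the notebook, here is a small self-contained sketch of the same idea for $\sqrt{2}$ with $f(x) = x^2 - 2$. It is not the notebook's code: the function name `cf_via_root` is hypothetical, the seeds $a_0 = 1$, $a_1 = 2$ are supplied by hand, and I use exact fractions so that no floating-point error sneaks in (which only works because this $f$ is a polynomial).

```python
from fractions import Fraction
from math import floor

def cf_via_root(f, df, a0, a1, terms):
    """Continued fraction of a root of f, using c ~ p_n/q_n in the formula for alpha_{n+1}."""
    coeffs = [a0, a1]
    p_prev, q_prev = a0, 1             # p_0, q_0
    p, q = a1 * a0 + 1, a1             # p_1, q_1
    while len(coeffs) < terms:
        c = Fraction(p, q)
        # alpha_{n+1} ~ |f'(c)| / (q_n^2 |f(c)|) - q_{n-1} / q_n
        alpha = abs(df(c)) / (q * q * abs(f(c))) - Fraction(q_prev, q)
        a = floor(alpha)
        coeffs.append(a)
        p_prev, p = p, a * p + p_prev  # only the last two convergents are kept
        q_prev, q = q, a * q + q_prev
    return coeffs

print(cf_via_root(lambda x: x * x - 2, lambda x: 2 * x, 1, 2, 10))
# [1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

For functions like $\sin(x)$ you would swap the exact fractions for a high-precision decimal type, which is where the precision caveat above comes back in.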

Applications of Best Rational Approximations

Music Tuning

In most cases, approximating irrationals is useful for nothing more than convenience. But here's something that genuinely requires this irrational approximation.

We hear sound and music intervals on an exponential scale. Given a note with frequency $f$, its octave is the note with frequency $2f$. So obviously, they have a frequency ratio of 2:1. Since this, in a sense, is the simplest ratio we can have between notes' frequencies, the octave forms the basic ratio for end points in our musical scale. The more interesting notes are the ones between an octave with more complex ratios to the base frequency. Common intervals include a major third with a frequency ratio of 5:4 to the base note, or a perfect fourth with 4:3. But perhaps the most famous and most recognizable is the perfect fifth with a ratio of 3:2. The theme to Star Wars famously opens with a perfect fifth.

As nice as these rational intervals are, they pose a slightly annoying problem: they are not (geometrically) evenly spaced, so they don't divide an octave into even steps. If you wanted to build a piano, this would make tuning each note very tedious, and make our notes feel a bit random without this consistent spacing.

If we wanted to divide an octave evenly into $n$ notes, that means we want to be able to find the corresponding frequency ratio of $2^\frac{1}{n}$ (remember, we hear sound exponentially, so hearing the next note up is equivalent to multiplying our base frequency by our desired ratio). But this number is irrational (the proof is identical to our proof of why $\sqrt{2} = 2^\frac{1}{2}$ is irrational)! So this runs into a different problem: we'll never be able to exactly get our rational frequency ratios for our intervals like the fourth or the fifth.

So we're at a crossroads: either we keep our rational frequency ratios for nice intervals and the octave isn't evenly divided, or we divide our octave evenly and never have a rational interval. So we compromise: we find a value of $2^\frac{m}{n}$ that closely approximates our ratios, and that value of $n$ tells us how many notes we divide our octave into.

Here's where we can use continued fractions and our best rational approximations. Since the perfect fifth is probably the most universally harmonious interval, we'll try and approximate that first. Then, we note that

$\begin{align} 2^{\log_2 (3/2)} = \frac{3}{2} \end{align}$

So if we can approximate $\log_2 (3/2)$ the best we can, we'll get a reasonable rational exponent that closely solves $2^\frac{m}{n} = \frac{3}{2}$. We can write out the first few numbers of the continued fraction by hand using our algorithm from before:

$\begin{align} \log_2(\frac{3}{2}) = 0+\frac{1}{1+\frac{1}{1+\frac{1}{2+\frac{1}{2+\frac{1}{3+\frac{1}{\ddots}}}}}} \end{align}$

Looking at the first few convergents, we see that

$\begin{align} \log_2(\frac{3}{2}) & = 0.584962500721 \ldots \\ & \approx c_0 = 0 \\ & \approx c_1 = 0 + \frac{1}{1} = 1 \\ & \approx c_2 = 0 + \frac{1}{1 + \frac{1}{1}} = \frac{1}{2} = 0.5 \\ & \approx c_3 = 0 + \frac{1}{1 + \frac{1}{1 + \frac{1}{2}}} = \frac{3}{5} = 0.6 \\ & \approx c_4 = 0 + \frac{1}{1 + \frac{1}{1 + \frac{1}{2 + \frac{1}{2}}}} = \frac{7}{12} = 0.58\overline{3} \\ & \approx c_5 = 0 + \frac{1}{1 + \frac{1}{1 + \frac{1}{2 + \frac{1}{2 + \frac{1}{3}}}}} = \frac{24}{41} = 0.\overline{58536} \\ \end{align}$

Looking at our denominators, we get options of $n = 1,2,5,12,41,\cdots$. Using $n \leq 5$ gives too few notes in a scale, and $n = 41$ might be too many. So $n=12$ notes feels like a good in-between, and in fact is what we actually use on most pianos today: 7 white keys and 5 black per octave. (One interesting choice not found here is $n=19$, which some might find reasonable too.)
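These denominators are easy to reproduce with a quick sketch: expand $\log_2(3/2)$ by separate-and-invert in floating point, then run the denominator recurrence. The helper names are my own, and the seed $q_{-1} = 0$ is an assumed convention. The last line shows how close 7 steps out of 12 gets to a true 3:2 fifth.

```python
import math

def cf_from_float(x, terms):
    """Separate-and-invert expansion; trustworthy only for the first few terms."""
    coeffs = []
    for _ in range(terms):
        a = math.floor(x)
        coeffs.append(a)
        x = 1 / (x - a)
    return coeffs

def denominators(coeffs):
    """Denominators q_0, q_1, ... of the convergents."""
    q_prev, q = 0, 1               # assumed seed q_{-1} = 0, and q_0 = 1
    out = [q]
    for a in coeffs[1:]:
        q_prev, q = q, a * q + q_prev
        out.append(q)
    return out

coeffs = cf_from_float(math.log2(3 / 2), 6)
print(coeffs)                  # [0, 1, 1, 2, 2, 3]
print(denominators(coeffs))    # [1, 1, 2, 5, 12, 41]
print(2 ** (7 / 12))           # the equal-tempered fifth, a little under 1.5
```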

Cryptography and Wiener's Attack

RSA is one of the most widely used cryptographic protocols, protecting most of the internet's traffic today.

Here's a brief rundown how two people, Alice and Bob, would share secret messages (i.e. can only be read by the recipient) using RSA:

Bob chooses two (ideally large) prime numbers $p,q$ and calculates $N = pq$, and an integer $e$ (called the encryption exponent). He then releases to everyone the public key $(N, e)$. This is how people will hide and encode their messages to Bob.
To decrypt messages, we need the decryption exponent $d$ that satisfies $ed \equiv 1 \bmod \varphi(N)$. Here $\varphi(n)$ is Euler's totient function, which returns the number of positive integers less than $n$ that are coprime to $n$. For $d$ to exist, we also need $\gcd(e, \varphi(N)) = 1$.
The factorization of $N$ and $d$ are kept secret, and form the private key. This is what allows Bob and only Bob to decrypt messages.
To encrypt a message $M$, Alice sends the ciphertext $C = M^e \bmod N$
To decrypt it, Bob computes $C^d \equiv (M^e)^d \equiv M^{ed} \equiv M \bmod N$

The decryption equivalence follows by Fermat's little theorem, and is not too hard to verify (you can also check the Wikipedia page). Also, note that sometimes instead of $\varphi(n)$ people will also use the Carmichael function $\lambda(n)$. In our case, $\lambda(pq) = \mathrm{lcm}(p-1,q-1) \leq (p-1)(q-1)$ which maintains all the properties needed.
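To make the rundown concrete, here is a toy round trip in Python. The primes, exponent, and message are small illustrative values I picked (real keys use primes hundreds of digits long), and `pow(e, -1, phi)` for the modular inverse needs Python 3.8 or newer.

```python
from math import gcd

# Toy key generation (Bob)
p, q = 61, 53
N = p * q                      # 3233, public
phi = (p - 1) * (q - 1)        # 3120, secret
e = 17                         # public encryption exponent
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # secret decryption exponent with e*d = 1 mod phi

def encrypt(message):
    """Alice: C = M^e mod N."""
    return pow(message, e, N)

def decrypt(ciphertext):
    """Bob: M = C^d mod N."""
    return pow(ciphertext, d, N)

M = 65
C = encrypt(M)
print(d, C, decrypt(C))        # decrypt(C) gives back 65
```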

The security of RSA relies on the surprising asymmetry in factorization: given $p$ and $q$, calculating $N = pq$ isn't hard, but given an $N$ finding two numbers such that $N = pq$ is difficult. Without being able to factor $N$, the private key remains private and hence messages can't be stolen. At least, that's our idea—it's secure because we don't know how to factor numbers efficiently (at least, not without quantum computers). No one does. So we use RSA everywhere for that reason.

So if we could break it, that would be quite bad.

Wiener's attack exploits one thing that we didn't put much effort specifying: the decryption exponent $d$. If $d$ is chosen in such a way that it's small enough (specifically when $d < \frac{1}{3} N^{\frac{1}{4}}$) Wiener's attack can break through RSA. This, though, is a really small range of $d$. If we let $N$ be a number with 20 digits, $d$ can at most have 5 digits.

Some Preliminaries

First, note for distinct primes $p,q$, we have

$\varphi(pq) = (p-1)(q-1) = pq - p - q + 1 = N - (p+q) + 1$.

This follows from a simple counting argument: the positive numbers less than $pq$ that are not coprime to $pq$ are the multiples of $p$ and the multiples of $q$, i.e. $1p, 2p, 3p, \cdots, (q-1)p$ and $1q, 2q, 3q, \cdots, (p-1)q$. So there are $(p-1) + (q-1) = p + q - 2$ numbers not coprime to $pq$. There are $pq-1$ positive numbers less than $pq$. Hence there are

$(pq - 1) - (p + q - 2) = pq - (p+q) + 1 = (p-1)(q-1)$

numbers coprime to $pq$. (This also follows because $\varphi$ is multiplicative, so $\varphi(pq) = \varphi(p)\varphi(q)$ if $p,q$ are coprime; think Chinese Remainder Theorem.)
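If you'd like to sanity-check the counting argument, a brute-force count is easy to write. This is just a sketch; `phi_bruteforce` is my own helper name, and the prime pairs are arbitrary small examples.

```python
from math import gcd

def phi_bruteforce(n: int) -> int:
    """Count the integers in 1..n-1 that are coprime to n (intended for n >= 2)."""
    return sum(1 for k in range(1, n) if gcd(k, n) == 1)

# Each entry: (p, q, brute-force count, value of the formula (p-1)(q-1)).
checks = [(p, q, phi_bruteforce(p * q), (p - 1) * (q - 1))
          for p, q in [(3, 5), (7, 11), (239, 379)]]
```

The last two entries of each tuple should agree for every pair of distinct primes.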

So if we know $p+q$ then we know $\varphi(N)$ and vice versa. If you're familiar with Vieta's formulas, these look a lot like expressions that appear in quadratics:

$(x-p)(x-q) = x^2 - (p+q)x + pq = x^2 - (N - \varphi(N) + 1)x + N$

By the quadratic formula,

$\begin{align} p,q = \frac{(N - \varphi(N) + 1) \pm \sqrt{(N - \varphi(N) + 1)^2 - 4(1)(N) }}{2(1)} \end{align}$

So if we know $N$ and $\varphi(N)$, we can recover $p,q$ without ever needing to go through the problem of factoring. $N$ is already public information in RSA. To find $\varphi(N)$, all we know about $\varphi(N)$ is $ed \equiv 1 \bmod \varphi(N)$. In other words, $ed = 1 + k \cdot \varphi(N) = 1 + k \cdot (p-1)(q-1)$ for some integer $k$. Therefore,

$\begin{align} \varphi(N) = \frac{ed - 1}{k} \end{align}$

If we want to find $\varphi(N)$, what we really want to find is $k$ and $d$.
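The quadratic-formula step is short enough to sketch in Python. The function name `factor_from_phi` is my own, and it assumes $N$ really is a product of two primes; it returns `None` when the supplied $\varphi(N)$ guess does not lead to an integer factorization.

```python
from math import isqrt

def factor_from_phi(N: int, phi: int):
    """Return (p, q) with p >= q if N = p*q and phi = (p-1)(q-1), else None."""
    s = N - phi + 1                 # s = p + q
    disc = s * s - 4 * N            # discriminant, equal to (p - q)^2 for a correct phi
    if disc < 0:
        return None
    r = isqrt(disc)
    if r * r != disc or (s + r) % 2:
        return None                 # roots of the quadratic are not integers
    p, q = (s + r) // 2, (s - r) // 2
    return (p, q) if p * q == N else None
```

For instance, `factor_from_phi(90581, 89964)` gives back the pair `(379, 239)`, while a wrong guess for $\varphi(N)$ simply gives `None`.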

Second, we make the following observation:

Let $N=pq$, $e$, and $d$ be given.
Note $\varphi(N) = (p-1)(q-1)$.
Also, $ed = 1 + k \cdot \varphi(N) = 1 + k \cdot (p-1)(q-1)$ for some integer $k$
Dividing by $dpq$, we get $\frac{e}{pq} = \frac{k}{d} \cdot \frac{pq - p - q + 1 + \frac{1}{k}}{pq} = \frac{k}{d}(1-\delta)$ where $\delta = \frac{p + q - 1 - \frac{1}{k}}{pq}$
Note $0 < \delta < 1$, since $p + q - 1 - \frac{1}{k} > 0$ and $pq > p+q$ for distinct primes $p,q$. So $\frac{e}{pq}$ is smaller than $\frac{k}{d}$ by a small amount.

So we have $\frac{e}{N} \approx \frac{k}{d}$, so if we can guess approximations (should start ringing a bell) of $\frac{e}{N}$, we might be able to guess $\frac{k}{d}$.

Wiener's idea then is to use the convergents of the continued fraction for $\frac{e}{N}$ to get potential values of $\frac{k}{d}$, as those would give us nice rational approximations.

Example: Let $(N,e) = (90581, 17993)$ be our public key.

$\begin{align} \frac{e}{N} = \frac{1}{5 + \frac{1}{29 + \frac{1}{\ddots + \frac{1}{3}}}} = [0; 5,29,4,1,3,2,4,3] \end{align}$

The first convergent $\frac{0}{1}$ does not give $k,d$ values to factor $N$ ($k=0$ is the real problem). But it can be checked that the next convergent $\frac{1}{5} = \frac{k}{d}$ does work, and gives the correct guess for $\varphi(N)$:

$\begin{align} \varphi(N) = \frac{17993 \cdot 5 - 1}{1} = 89964 \end{align}$

If we plug this into our quadratic from before, we get $p,q = 379, 239$ which correctly factors $N$.
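Here is a minimal Python sketch of the procedure in this example: walk through the convergents of $\frac{e}{N}$, treat each one as a guess for $\frac{k}{d}$, and test whether the resulting $\varphi(N)$ guess factors $N$. The function names are my own, and this is an illustration of the idea rather than a hardened implementation.

```python
from math import isqrt

def convergents(num: int, den: int):
    """Yield the convergents (numerator, denominator) of the continued fraction of num/den."""
    h0, h1 = 0, 1          # numerators   h_{n-2}, h_{n-1}
    k0, k1 = 1, 0          # denominators k_{n-2}, k_{n-1}
    while den:
        a, r = divmod(num, den)
        h0, h1 = h1, a * h1 + h0
        k0, k1 = k1, a * k1 + k0
        yield h1, k1
        num, den = den, r

def wiener_attack(N: int, e: int):
    """Return (d, p, q) if some convergent k/d of e/N factors N, else None."""
    for k, d in convergents(e, N):
        if k == 0 or (e * d - 1) % k:
            continue                       # phi = (ed - 1)/k must be an integer
        phi = (e * d - 1) // k
        s = N - phi + 1                    # candidate for p + q
        disc = s * s - 4 * N
        if disc < 0:
            continue
        r = isqrt(disc)
        if r * r == disc and (s + r) % 2 == 0:
            p, q = (s + r) // 2, (s - r) // 2
            if p * q == N:
                return d, p, q
    return None

result = wiener_attack(90581, 17993)       # (5, 379, 239)
```

The first convergent $\frac{0}{1}$ is skipped by the `k == 0` check, and the second one, $\frac{1}{5}$, already succeeds.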

Here are all the conditions necessary for this attack.

Wiener's theorem: Let $N=pq$ with $q < p < 2q$, and $d < \frac{1}{3} N^{\frac{1}{4}}$ such that $ed \equiv 1 \bmod \varphi(N)$. Given $(N,e)$, one can recover $d$.

Proof: First note that since $q < p < 2q$, we have

$(p+q-1)^2 < (p+q)^2 < (3q)^2 < 9pq$

Recall that $\varphi(N) = (p-1)(q-1) = N - p - q + 1$. Combining this with our above inequality gives $\left| N - \varphi(N) \right| < 3\sqrt{pq} = 3\sqrt{N}$. Now also remember that $ed - k\varphi(N) = 1$, so we get that

$\begin{align} \left| \frac{e}{N} - \frac{k}{d} \right| & = \left| \frac{ed - kN}{Nd} \right| \\ & = \left| \frac{ed - k\varphi(N) - kN + k \varphi(N) }{Nd} \right| \\ & = \left| \frac{1 - k(N - \varphi(N)) }{Nd} \right| \\ & \leq \left| \frac{3k\sqrt{N} }{Nd} \right| \\ \end{align}$

We have $k\varphi(N) = ed - 1 < ed$. Since $e < \varphi(N)$ (the encryption exponent can be, and usually is, chosen this way, because only its value modulo $\varphi(N)$ matters), this gives $k\varphi(N) < ed < \varphi(N) d$, and so $k < d$. Therefore,

$\begin{align} \left| \frac{e}{N} - \frac{k}{d} \right| \leq \left| \frac{3k\sqrt{N} }{Nd} \right| < \left| \frac{3d\sqrt{N} }{Nd} \right| = \left| \frac{3}{\sqrt{N}} \right| \\ \end{align}$

Since $d < \frac{1}{3} N^{\frac{1}{4}}$, we have that $9d^2 < N^{\frac{1}{2}}$. Hence,

$\begin{align} \left| \frac{e}{N} - \frac{k}{d} \right| < \left| \frac{3}{\sqrt{N}} \right| < \left| \frac{3}{9d^2} \right| < \frac{1}{2d^2} \\ \end{align}$

By Legendre's theorem that we mentioned above, $\frac{k}{d}$ is equivalent to a convergent of the continued fraction for $\frac{e}{N}$. Also, $ed - k\varphi(N) = 1$, so $\gcd(k,d) = 1$, so $\frac{k}{d}$ is already in lowest terms. So not only is $\frac{k}{d}$ equivalent to a convergent, it is in fact the convergent with the same numerator and denominator.

$\blacksquare$
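As a quick check that the earlier example really falls under the theorem, the sketch below verifies each hypothesis and the final inequality for $(N, e) = (90581, 17993)$ with $k = 1$, $d = 5$, $p = 379$, $q = 239$. Exact fractions are used so no rounding is involved; the variable names are my own.

```python
from fractions import Fraction

N, e = 90581, 17993   # public key from the example
k, d = 1, 5           # convergent that worked in the example
p, q = 379, 239       # factors recovered in the example

ratio_ok = q < p < 2 * q                      # hypothesis q < p < 2q
size_ok = 81 * d ** 4 < N                     # same as d < N^(1/4) / 3, without floats
gap = abs(Fraction(e, N) - Fraction(k, d))    # |e/N - k/d|
legendre_ok = gap < Fraction(1, 2 * d * d)    # condition needed for Legendre's theorem
```

All three flags come out `True`, with `gap` equal to $\frac{616}{452905}$, comfortably below $\frac{1}{2d^2} = \frac{1}{50}$.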

Note that the bound $d < \frac{1}{3}N^{\frac{1}{4}}$ was picked precisely to satisfy Legendre's theorem, ensuring that $\frac{k}{d}$ would be a convergent. Others have improved on this bound, and one improvement is easy to see: since we only need $\frac{3}{\sqrt{N}} < \frac{1}{2d^2}$, we can allow $d < \frac{1}{\sqrt{6}}N^{\frac{1}{4}}$.

Solving Pell's Equation

Pell's equation is the Diophantine equation (that is, we want integer solutions for it)

$x^2 - Dy^2 = 1$

where $D$ is a non-square positive integer. (For negative or perfect-square $D$ there are only finitely many solutions, usually just $(\pm 1, 0)$, as can be seen by considering the signs of the terms or by factoring a difference of squares.) Continued fractions and solutions to Pell's equation are surprisingly intertwined.

The key lies in the fact that there is a 1-to-1 correspondence between quadratic irrationals and repeating continued fractions. That is, if the coefficients in a continued fraction ever cycle, then the continued fraction can be written as $\frac{p + \sqrt{q}}{r}$ where $q$ is a non-square.

We won't get into it here, but here are some notes and a paper that fill in the details for quadratic irrationals and Pell's equation.

Pell's equation seems innocuous at first, just a generic equation that people historically studied for one reason or another, but it is precisely this general form that makes it show up in so many problems and makes it worth studying.

Question: The triangular numbers are $1, 3, 6, 10, \cdots, \frac{n(n+1)}{2}, \cdots $. Are any of these a perfect square?

The first 6 triangular numbers. Credit: Wikipedia

Solution: We are essentially solving $\frac{1}{2}n(n+1) = m^2$. Rewriting a little bit, we want to solve $(2n+1)^2 - 8m^2 = 1$. Let $x = 2n+1$ and $y = m$, we just need to find solutions to $x^2 - 8y^2 =1$. From the above results (that we did not cover), we can do that systematically with the continued fraction for $\sqrt{8}$.
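To see this in action, here is a rough Python sketch. It finds the smallest solution of $x^2 - Dy^2 = 1$ from the convergents of $\sqrt{D}$, then generates further solutions by the standard recurrence $(x, y) \mapsto (x_1 x + D y_1 y, \ x_1 y + y_1 x)$. We haven't derived that recurrence here, so treat it as an assumption taken from the usual theory; the function names are my own.

```python
from math import isqrt

def pell_fundamental(D: int):
    """Smallest (x, y) with y >= 1 solving x^2 - D*y^2 = 1, via convergents of sqrt(D)."""
    a0 = isqrt(D)
    if a0 * a0 == D:
        raise ValueError("D must not be a perfect square")
    m, d, a = 0, 1, a0
    h0, h1 = 1, a0         # convergent numerators   h_{n-1}, h_n
    k0, k1 = 0, 1          # convergent denominators k_{n-1}, k_n
    while h1 * h1 - D * k1 * k1 != 1:
        m = d * a - m                  # standard recurrence for the
        d = (D - m * m) // d           # continued fraction of sqrt(D)
        a = (a0 + m) // d
        h0, h1 = h1, a * h1 + h0
        k0, k1 = k1, a * k1 + k0
    return h1, k1

def square_triangular(count: int):
    """First `count` triples (n, m, T_n) with T_n = n(n+1)/2 = m^2."""
    x1, y1 = pell_fundamental(8)
    x, y = x1, y1
    out = []
    for _ in range(count):
        n, m = (x - 1) // 2, y         # undo the substitution x = 2n + 1, y = m
        out.append((n, m, n * (n + 1) // 2))
        x, y = x1 * x + 8 * y1 * y, x1 * y + y1 * x
    return out

triples = square_triangular(4)
# [(1, 1, 1), (8, 6, 36), (49, 35, 1225), (288, 204, 41616)]
```

So $T_1 = 1$, $T_8 = 36$, $T_{49} = 1225$, and $T_{288} = 41616$ are all perfect squares, and the list never ends.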

For an absurd historical example of Pell's equation, here's one attributed to Archimedes.

The Original Proof of the Irrationality of $\pi$

We've gone through a few properties and interesting tidbits that characterize irrational numbers, especially when it comes to computing and approximating them. These tools, continued fractions in particular, can also be used to deduce the irrationality of numbers. In fact, the first proofs that $e$ and $\pi$ are irrational came from them. But I wanted to talk about them specifically because I want it to be clear just how little we actually know about irrational numbers and how difficult it can be to show that a number is irrational (as we'll discuss below). Continued fractions, though, seem to be a more universal way out. We've already shown that $\sqrt{2}$ is irrational, but that proof is quite specialized; we didn't use any of the characteristics of irrational numbers. We then showed it was irrational again purely from its continued fraction, which not only made the proof direct but essentially shortened it to one line: just show the continued fraction.

Before we go onto some of the more modern proofs $\pi$ is irrational, we're now somewhat familiar with continued fractions, so let's look at J. H. Lambert's (1761) original proof using them.

Theorem: $\pi$ is irrational.

Proof: I'm adapting the proof found on this Stack Exchange post, as it only uses what's necessary. For some more rigor, the Wikipedia page or this paper provide good details.

There are a few steps we'll take along the way.

Step 1: A continued fraction for $\tan(x)$.

This is by far the most tedious step, and I'll only sketch it because of how annoying it is. The way to derive it is essentially to consider the quotient of the power series for $\sin(x)$ and $\cos(x)$.

$\begin{align} \tan(x) = \frac{\sin(x)}{\cos(x)} = \frac{\sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{(2n+1)!}}{\sum_{n=0}^{\infty} (-1)^n \frac{x^{2n}}{(2n)!}} = \frac{x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots}{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots} \end{align}$

We'll now do a series of manipulations on this last fraction:

$\begin{align} \frac{x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots}{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots} & = x \cdot \frac{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots}{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots} \\ & = x \cdot \frac{1}{\frac{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots}} \end{align}$

Let's look at the new sub-fraction of power series we have in the denominator. We'll now add and subtract the denominator of the sub-fraction as a special form of 0 to its numerator:

$\begin{align} \frac{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} & = \frac{\color{red}{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots} + \overset{= \ 0}{\overline{(1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots) \color{red}{- (1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots)}}}}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} \end{align}$

Combining the terms in red:

$\begin{align} \frac{(1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots) \color{red}{- \frac{2x^2}{3!} + \frac{4x^4}{5!} - \frac{6x^6}{7!} + \cdots}}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} & = 1 + \frac{\color{red}{- \frac{2x^2}{3!} + \frac{4x^4}{5!} - \frac{6x^6}{7!} + \cdots}}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} \\ & = 1 - x^2 \cdot \frac{\color{red}{\frac{2}{3!} - \frac{4x^2}{5!} + \frac{6x^4}{7!} - \cdots}}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} \\ & = 1 - x^2 \cdot \frac{\color{red}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots}}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} \\ \end{align}$

So, currently, we have

$\begin{align} \tan(x) & = x \cdot \frac{1}{\frac{1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots}} \\ & = \frac{x}{1 - x^2 \cdot \frac{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots}} \\ & = \frac{x}{1 - \frac{x^2}{\frac{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots}}} \end{align}$

We can then continue to play the same game, adding and subtracting a multiple of the denominator of our sub-fraction to its numerator:

$\begin{align} \frac{\color{red}{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots} + 3(\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots) \color{red}{- 3(\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots)}}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots} \end{align}$

We multiplied by 3 so that the constant terms cancel. Combining the terms in red again:

$\begin{align} \frac{3(\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots) \color{red}{- \frac{x^2}{3!} \frac{2}{5} + \frac{x^4}{5!} \frac{4}{7} - \cdots}}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots} & = 3 - x^2 \cdot \frac{\color{red}{\frac{1}{3} \frac{1}{5} - \frac{x^2}{3!} \frac{1}{5 \cdot 7} + \cdots}}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots} \end{align}$

So then our expansion is

$\begin{align} \tan(x) & = \frac{x}{1 - \frac{x^2}{\frac{1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots}}} \\ & = \frac{x}{1 - \frac{x^2}{3 - x^2 \cdot \frac{\frac{1}{3} \frac{1}{5} - \frac{x^2}{3!} \frac{1}{5 \cdot 7} + \cdots}{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots} }} \\ & = \frac{x}{1 - \frac{x^2}{3 - \frac{x^2}{\frac{\frac{1}{3} - \frac{x^2}{3!} \frac{1}{5} + \frac{x^4}{5!} \frac{1}{7} - \cdots}{\frac{1}{3} \frac{1}{5} - \frac{x^2}{3!} \frac{1}{5 \cdot 7} + \cdots} }}} \\ \end{align}$

If we continue doing this, we see the pattern

$\begin{align} \tan(x) & = \frac{x}{1 - \frac{x^2}{3 - \frac{x^2}{ 5 - \frac{x^2}{7 - \frac{x^2}{\ddots}} }}} \\ \end{align}$

Obviously, this isn't fully rigorous as we're playing with infinite series pretty carelessly, so see the above links for more details.
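Even without the rigor, we can at least check numerically that the pattern is right. The sketch below truncates the continued fraction after a fixed number of levels (the `depth` parameter and the function name are my own choices) and compares the result against `math.tan`.

```python
import math

def tan_cf(x: float, depth: int = 20) -> float:
    """Evaluate x / (1 - x^2 / (3 - x^2 / (5 - ...))) truncated after `depth` levels."""
    x2 = x * x
    value = 2 * depth - 1                  # innermost denominator, e.g. 39 for depth 20
    for odd in range(2 * depth - 3, 0, -2):
        value = odd - x2 / value           # build the fraction from the inside out
    return x / value

error_at_1 = abs(tan_cf(1.0) - math.tan(1.0))
error_at_quarter_pi = abs(tan_cf(math.pi / 4) - 1.0)
```

For moderate $x$ the truncated fraction agrees with the built-in tangent to within floating-point noise, which is a good sign that the pattern $1, 3, 5, 7, \ldots$ is the right one.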

Step 2: A condition for irrationality.

Lemma: For a continued fraction

$\begin{align} \frac{a_1}{b_1 - \frac{a_2}{b_2 - \frac{a_3}{b_3 - \frac{a_4}{b_4 - \frac{a_5}{\ddots}} }}} \\ \end{align}$

assume that the $a_n$ and $b_n$ are positive integers with $1 + a_n \leq b_n$ for all $n$. If $1+a_n < b_n$ infinitely often, then the fraction is irrational.

Proof: For contradiction, say it is rational.

$\begin{align} \frac{\lambda_1}{\lambda_0} = \frac{a_1}{b_1 - \frac{a_2}{b_2 - \frac{a_3}{b_3 - \frac{a_4}{b_4 - \frac{a_5}{\ddots}} }}} \\ \end{align}$

The conditions $1 + a_n \leq b_n$ and $1+a_n < b_n$ infinitely often ensure that our continued fraction is between 0 and 1 (analyze the convergents), so we have $\lambda_1 < \lambda_0$. To make our life simpler, let's call a lower portion of the fraction $\rho_1$:

$\begin{align} \rho_1 = \frac{a_2}{b_2 - \frac{a_3}{b_3 - \frac{a_4}{b_4 - \frac{a_5}{\ddots}} }} \\ \end{align}$

Then we have

$\begin{align} \frac{\lambda_1}{\lambda_0} = \frac{a_1}{b_1 - \rho_1} \Rightarrow \rho_1 = \frac{b_1 \lambda_1 - a_1 \lambda_0}{\lambda_1} \end{align}$

So $\rho_1 = \frac{\lambda_2}{\lambda_1}$ is rational. But also, since $\rho_1$ is essentially the same "type" of continued fraction as the original, it too is between 0 and 1. Hence $\rho_1 < 1$ and therefore $\lambda_2 < \lambda_1$. We can keep repeating this process, producing a sequence of strictly descending positive integers $\cdots < \lambda_2 < \lambda_1 < \lambda_0$. But clearly this is impossible! So our continued fraction is irrational. $\ \blacksquare$

Step 3: Showing $\pi$ is irrational.

Say $\pi$ is rational. Then $\frac{\pi}{4} = \frac{a}{b}$ is also rational, and $\tan(\frac{\pi}{4}) = 1$. But we also have our continued fraction for $\tan(x)$ :

$\begin{align} \tan(\frac{\pi}{4}) & = \tan(\frac{a}{b}) = \frac{\frac{a}{b}}{1 - \frac{(\frac{a}{b})^2}{3 - \frac{(\frac{a}{b})^2}{ 5 - \frac{(\frac{a}{b})^2}{7 - \frac{(\frac{a}{b})^2}{\ddots}} }}} = \frac{a}{1b - \frac{a^2}{3b - \frac{a^2}{ 5b - \frac{a^2}{7b - \frac{a^2}{\ddots}} }}} \end{align}$

But clearly, $a^2$ is constant, so for a large enough $n$, we'll have $1 + a^2 < nb$. So at some point, we'll have a sub-fraction

$\begin{align} \frac{a^2}{(2n+1)b - \frac{a^2}{(2n+3)b - \frac{a^2}{ (2n+5)b - \frac{a^2}{(2n+7)b - \frac{a^2}{\ddots}} }}} \end{align}$

that satisfies our Step 2 claim, and hence is irrational. If that sub-fraction is irrational, then the whole fraction is irrational too, since it is built from the sub-fraction by finitely many arithmetic steps involving only integers. But $\tan(\frac{\pi}{4}) = 1$ is rational. Contradiction!

So it must be that $\pi$ is irrational.

$\blacksquare$

In case you were interested, since we spent all that time with approximation theorems from before, it's interesting to note that for all integers $p$ and $q \geq 2$ we have

$\large \left|\pi - \frac{p}{q} \right|$ $ > \large \frac{1}{q^{42}}$

so $\pi$ gets pretty close to some rational numbers (as we've seen with our convergents before), but it's interesting seeing the lower bound given. As always, paper attached.

The Simpler Irrationality Proofs

This proof is important for its history alone: it was the first sound proof that $\pi$ is irrational. In fact, $e$ was also first proven irrational in a similar way (Euler, 1737) by writing out the continued fraction for $e$ and showing it was infinite.

But Fourier had a much better proof that $e$ is irrational:

Claim: $e$ is irrational.

Proof: Suppose $e = \frac{a}{N}$ was rational. Now consider $e\cdot N!$ which is clearly an integer by assuming $e = \frac{a}{N}$. We can evaluate this using the series expansion $e = \sum_{k=0}^\infty \frac{1}{k!}$.

$\begin{align} e \cdot N! & = \frac{N!}{0!} + \frac{N!}{1!} + \cdots + \frac{N!}{(N-1)!} + \frac{N!}{N!} + \frac{N!}{(N+1)!} + \frac{N!}{(N+2)!} + \cdots \\ & = \color{blue}{N! + N! + \frac{N!}{2!} + \cdots + N + 1} + \color{red}{\frac{1}{N+1} + \frac{1}{(N+1)(N+2)} + \cdots} \end{align}$

Now the terms in blue clearly add to an integer. But the parts in red:

${\scriptsize \begin{align} \frac{1}{N+1} + \frac{1}{(N+1)(N+2)} + \frac{1}{(N+1)(N+2)(N+3)} + \cdots & < \frac{1}{N+1} + \frac{1}{(N+1)(N+1)} + \frac{1}{(N+1)(N+1)(N+1)} + \cdots \\ & = \frac{1}{N+1} \cdot \frac{1}{1 - \frac{1}{N+1}} \\ & = \frac{1}{N+1} \cdot \frac{N+1}{N} \\ & = \frac{1}{N} \leq 1 \end{align} }$

We bounded the red series by a geometric series, and were able to show that it must be strictly less than 1. It is also positive, so the series in red is not an integer. Therein lies our contradiction: if we assume $e$ is rational, then $e \cdot N!$ must be an integer, but the series expansion for $e$ says that it's not.

$\begin{align} e \cdot N! & = \overset{\in \mathbb{Z}}{\overline{\color{blue}{N! + N! + \frac{N!}{2!} + \cdots + N + 1}}} + \overset{\notin \mathbb{Z}}{\overline{\color{red}{\frac{1}{N+1} + \frac{1}{(N+1)(N+2)} + \cdots}}} \notin \mathbb{Z} \end{align}$ $\blacksquare$
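The bound on the red series is easy to check with exact arithmetic. In the sketch below, `tail_partial` (my own name) adds up the first few terms of the red series as exact fractions; since every term is positive, any partial sum sits strictly between $0$ and the full series, which in turn is below $\frac{1}{N}$.

```python
from fractions import Fraction

def tail_partial(N: int, terms: int = 30) -> Fraction:
    """Partial sum of 1/(N+1) + 1/((N+1)(N+2)) + ... using the first `terms` terms."""
    total, prod = Fraction(0), 1
    for j in range(1, terms + 1):
        prod *= N + j                  # (N+1)(N+2)...(N+j)
        total += Fraction(1, prod)
    return total

# For each N, check that the partial sum lies strictly between 0 and 1/N.
bounds_hold = all(0 < tail_partial(N) < Fraction(1, N) for N in range(1, 11))
```

With 30 terms the partial sums are already extremely close to the true value of $e \cdot N!$ minus its integer part, so this is a decent picture of the "not an integer" leftover.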

Continued fractions are still useful today, but for irrationality proofs, the above is so much more attractive than finding infinite continued fractions. This makes me think that Lambert's proof that $\pi$ is irrational, albeit important, is an unwieldy relic compared to what else is out there.

Claim: $\pi$ is irrational.

I. Niven's (1947) proof that $\pi$ is irrational is actually quite short, and only relies on some calculus and a helper function. Notably, the key fact that it relies on is that $\pi$ is the smallest positive zero of $\sin(x)$.

Lemma: For positive integers $a, b$ and all $n \geq 1$, consider the function

$\begin{align} f(x) = \frac{x^n(a-bx)^n}{n!} \end{align}$

Then the following are true:

1. $f(x)$ is a polynomial of the form $\frac{1}{n!} \sum_{k=n}^{2n} c_k x^k$ with integer coefficients $c_k$.
2. For $0 < x < \frac{a}{b}$, we have $0 < f(x) < \frac{1}{n!} \left(\frac{a}{b}\right)^n a^n$.
3. For all $k \geq 0$, the derivatives $f^{(k)}(0)$ and $f^{(k)}(\frac{a}{b})$ are integers.

Proof: Claims 1. and 2. are straightforward (consider binomial expansion, and the factors in the numerator). For 3., note by 1., we've shown $f(x)$ is a polynomial consisting of terms with degree between $n$ and $2n$. So $f^{(k)}(0) = 0$ unless $n\leq k\leq 2n$. When $n\leq k\leq 2n$, then $f^{(k)}(0) = \frac{k!}{n!} c_k$. Since $k \geq n$ and $c_k$ is an integer, then $\frac{k!}{n!} c_k$ is also an integer. By the symmetry in $f(x) = f(\frac{a}{b}-x)$, we get $f^{(k)}(x) = (-1)^k f^{(k)}(\frac{a}{b}-x)$. Therefore $f^{(k)}(\frac{a}{b}) = (-1)^k f^{(k)}(0)$ is also an integer. $ \ \blacksquare$
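The third claim is the least obvious, so here is a small sketch that checks it with exact arithmetic. It expands $f(x)$ with the binomial theorem, differentiates the coefficient list repeatedly, and evaluates at $0$ and $\frac{a}{b}$. The helper names and the sample values $n = 4$, $a = 22$, $b = 7$ are my own choices for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def niven_poly(n: int, a: int, b: int):
    """Coefficients (index = degree) of f(x) = x^n (a - b x)^n / n!."""
    coeffs = [Fraction(0)] * (2 * n + 1)
    for j in range(n + 1):
        # x^j term of (a - b x)^n, shifted up by the factor x^n
        coeffs[n + j] = Fraction(comb(n, j) * a ** (n - j) * (-b) ** j, factorial(n))
    return coeffs

def derivative(coeffs):
    """Coefficient list of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def evaluate(coeffs, x):
    """Value of the polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def derivative_values(n: int, a: int, b: int):
    """Pairs (f^(k)(0), f^(k)(a/b)) for k = 0..2n."""
    coeffs, end, out = niven_poly(n, a, b), Fraction(a, b), []
    for _ in range(2 * n + 1):
        out.append((evaluate(coeffs, 0), evaluate(coeffs, end)))
        coeffs = derivative(coeffs)
    return out

values = derivative_values(4, 22, 7)
```

Every entry of `values` has denominator 1, and the two entries in the $k$-th pair differ exactly by the sign $(-1)^k$, matching the symmetry argument in the proof.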

Now we can prove $\pi$ is irrational.

Proof: Assume $\pi = \frac{a}{b}$ is rational, and consider $f(x)$ from the above lemma. Now define the new function

$F(x) = f(x) - f^{(2)}(x) + f^{(4)}(x) - \cdots + (-1)^n f^{(2n)}(x)$

If we take the derivative of this, we find

$\begin{align} F'(x) & = f^{(1)}(x) - f^{(3)}(x) + f^{(5)}(x) - \cdots + (-1)^n f^{(2n+1)}(x) \\ F''(x) & = f^{(2)}(x) - f^{(4)}(x) + f^{(6)}(x) - \cdots + (-1)^n f^{(2n+2)}(x) \\ \end{align}$

$F''(x)$ looks a lot like $F(x)$. In particular, we have this relation:

$\begin{align} -F''(x) + f(x) & = F(x) \\ \end{align}$

Note this holds precisely because $f^{(2n+2)}(x) = 0$. With this in mind, we can see that the following derivative is

$\begin{align} \frac{\mathrm{d}}{\mathrm{d}x} \left( F'(x)\sin(x) - F(x)\cos(x) \right) = F''(x) \sin(x) + F(x) \sin(x) = f(x)\sin(x) \end{align}$

Thus, if we integrate the right-hand side,

$\begin{align} \int_{0}^{\pi} f(x)\sin(x) \ dx = \left[ F'(x)\sin(x) - F(x)\cos(x) \right]_{0}^{\pi} = F(\pi) + F(0) \end{align}$

By part 3. of our lemma $f^{(k)}(\pi)$ and $f^{(k)}(0)$ are integers for all $k$, so $F(\pi) + F(0)$ is an integer.

Note now for $0 < x < \pi$, we have

$\begin{align} 0 < f(x) \sin(x) = \frac{x^n(a-bx)^n}{n!} \sin(x) = \frac{x^n a^n (1-\frac{b}{a}x)^n}{n!} \sin(x) < \frac{\pi^n a^n}{n!} \end{align}$

But the RHS can be made arbitrarily small, i.e. $\frac{\pi^n a^n}{n!} < \frac{1}{\pi}$ for a big enough $n$. So we can bound our integral by

$\begin{align} 0 < \int_{0}^{\pi} f(x)\sin(x) \ dx < \pi \cdot \frac{\pi^n a^n}{n!} < 1 \end{align}$

However, we showed that $\int_{0}^{\pi} f(x)\sin(x) \ dx = F(\pi) + F(0)$ is an integer. But there is no integer strictly between 0 and 1! So our assumption that $\pi$ was rational must be wrong, and so $\pi$ is irrational.

$\blacksquare$
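To get a feel for how large "big enough $n$" is, here is a small sketch that finds where the bound $\pi \cdot \frac{\pi^n a^n}{n!}$ first drops below 1 for good. The value $a = 22$ is purely hypothetical (as if $\pi$ were $\frac{22}{7}$), and the computation goes through logarithms via `math.lgamma` because the intermediate numbers are far too large to compute directly in floating point.

```python
import math

def niven_bound(n: int, a: int) -> float:
    """Value of pi * pi**n * a**n / n!, computed through logarithms to avoid overflow."""
    log_value = (n + 1) * math.log(math.pi) + n * math.log(a) - math.lgamma(n + 1)
    return math.exp(log_value)

def first_small_n(a: int) -> int:
    """Smallest n with n >= pi*a (where the bound is already decreasing) that makes it < 1."""
    n = math.ceil(math.pi * a)
    while niven_bound(n, a) >= 1:
        n += 1
    return n

a = 22                      # hypothetical numerator, as if pi were 22/7
n = first_small_n(a)
```

The bound grows enormously before the factorial takes over, but once $n$ exceeds $\pi a$ each step multiplies it by a factor below 1, so after it drops under 1 it stays there.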

This proof is "simpler" in that the mathematical machinery required is relatively low-level. It might be that the actual contradiction we were looking for is less obvious, but we didn't have to prove as many obscure, auxiliary lemmas as we did with continued fractions in Lambert's proof.

Proofs from THE BOOK generalizes this proof strategy to show a few more interesting irrationality results.

Lemma: For all $n \geq 1$, consider the function

$\begin{align} f(x) = \frac{x^n(1-x)^n}{n!} \end{align}$

Then the following are true:

1. $f(x)$ is a polynomial of the form $\frac{1}{n!} \sum_{k=n}^{2n} c_k x^k$ with integer coefficients $c_k$.
2. For $0 < x < 1$, we have $0 < f(x) < \frac{1}{n!}$.
3. For all $k \geq 0$, the derivatives $f^{(k)}(0)$ and $f^{(k)}(1)$ are integers.

Proof follows exactly as you'd expect from before.

Theorem: $e^r$ is irrational for all rational $r \neq 0$.

Proof: We can reduce this to showing that $e^p$ is irrational for all positive integers $p$: if $e^{\frac{p}{q}}$ were rational, then $(e^{\frac{p}{q}})^q = e^p$ would also be rational, and negative exponents are handled by taking reciprocals. The key idea here is using that $\frac{\mathrm{d}}{\mathrm{d}x} e^x = e^x$.

Let $f(x) = \frac{x^n(1-x)^n}{n!}$ refer to the one from the lemma above. As usual, assume the contrary and that $e^p = \frac{a}{b}$. Now consider the new function

$F(x) = p^{2n} f(x) - p^{2n-1}f'(x) + p^{2n-2}f''(x) \mp \cdots + f^{(2n)}(x)$

This looks similar to the one Niven used to show $\pi$ is irrational. Taking its derivative:

$F'(x) = p^{2n} f'(x) - p^{2n-1}f''(x) + p^{2n-2}f'''(x) \mp \cdots + f^{(2n+1)}(x)$

Since $f$ has degree $2n$, we have $f^{(2n+1)}(x) = 0$. Hence, we get the relation

$-\frac{1}{p}F'(x) + p^{2n} f(x) = F(x)$

Now we take the derivative of a particularly nice function:

$\begin{align} \frac{\mathrm{d}}{\mathrm{d}x}(e^{px} F(x)) = pe^{px}F(x) + e^{px}F'(x) = p^{2n+1}e^{px} f(x) \end{align}$

Integrating, we then get

$\begin{align} \int_{0}^{1} p^{2n+1}e^{px} f(x) \ dx = \left[ e^{px} F(x) \right]_{0}^{1} = e^p F(1) - F(0) \end{align}$

Recalling that $e^p = \frac{a}{b}$, we conclude that

$\begin{align} b \int_{0}^{1} p^{2n+1}e^{px} f(x) \ dx = aF(1) - bF(0) \end{align}$

is an integer. As before, the integrand is positive on $(0,1)$ (taking $p, b > 0$), so the integral is positive, and we can bound it from above:

$\begin{align} 0 < b \int_{0}^{1} p^{2n+1}e^{px} f(x) \ dx < bp^{2n+1}e^p \frac{1}{n!} = \frac{ap^{2n+1}}{n!} < 1 \end{align}$

We get the last inequality by just taking a large enough $n$. There are no integers between 0 and 1, giving us our contradiction and completing the proof.

$\blacksquare$

Open Questions and More Strangeness

Irrational Powers

But not much about irrationals is well understood. It's quite simple to show that combining a rational with another rational number via addition, subtraction, multiplication, and division (by a nonzero number) only results in more rational numbers. Further, it's not too bad to show that combining an irrational with a nonzero rational in the same way only leads to more irrationals. But combining irrationals is a little strange: $(1 - \sqrt{2}) + \sqrt{2} = 1$ is a sum of two irrational numbers that results in a rational. You can even find two irrational numbers $a,b$ such that $a^b$ is rational.

Claim: There exist irrational numbers $a$ and $b$ such that $a^b$ is rational.

Proof: We know that $\sqrt{2}$ is irrational, so consider the number $\sqrt{2}^\sqrt{2}$. If this is rational, we're done. If this is irrational, then consider the number $(\sqrt{2}^\sqrt{2})^\sqrt{2} = \sqrt{2}^2 = 2$, which is rational, and hence we are done.
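The proof never tells us which case we are in, but the arithmetic in the second case is easy to check numerically. This is only a floating-point illustration, not a proof; the variable names are my own.

```python
import math

root2 = math.sqrt(2)
a = root2 ** root2        # about 1.6325; the proof does not need to know if this is rational
value = a ** root2        # (sqrt(2)^sqrt(2))^sqrt(2) = sqrt(2)^2, which should be 2
```

Up to rounding error, `value` comes out as 2, matching $(\sqrt{2}^\sqrt{2})^\sqrt{2} = \sqrt{2}^2$.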

In fact, it is possible to show for a lot of positive rational numbers $r$, there exists an irrational $a$ such that $a^a = r$.

Theorem: For every rational number $r \in \left( (1/e)^{1/e}, \infty \right)$, either $r = a^a$ for an irrational $a$, or $r \in \{1, 4, 27, 256, \cdots, n^n, \cdots \}$.

Proof: Consider the function $f(x) = x^x$ on the interval $I = (1/e, \infty)$. $f(x)$ is continuous on $I$, since it is differentiable, and injective as it is monotonic (check derivative is strictly positive), so $f(I) = \left( (1/e)^{1/e}, \infty \right)$. So given a rational $r \in f(I)$, let $a \in I$ be the corresponding value such that $a^a = r$.

We'll now go and prove the contrapositive: if $a$ is rational and not an integer, then $a^a$ is irrational. In other words, if $r = \frac{p}{q}$, and $a = \frac{n}{m}$ are in lowest terms i.e. $\gcd(p,q) = \gcd(n,m) = 1$, and

$\large \left( \frac{n}{m} \right)^\frac{n}{m}$ $= \large \frac{p}{q}$

then $m = 1$ and $a$ must be an integer. So, for a contradiction, assume that $m > 1$. Rearranging our equation, we get that

$n^n q^m = p^m m^n$

Since $\gcd(p,q) = 1$ and $\gcd(n,m) = 1$, a prime power divides the factor $q^m$ on the left-hand side if and only if it divides the factor $m^n$ on the right-hand side. So $q^m = m^n$. Since $m > 1$, we can write $q = \alpha^i k$ and $m = \alpha^j l$ for some prime $\alpha$, exponents $i, j \geq 1$, and integers $k,l$ not divisible by $\alpha$. Since $q^m = m^n$, we must also have $im = jn$ for the exponent of $\alpha$ to match. Thus $i(\alpha^j l) = jn$, and $\alpha^j$ divides $jn$.

But $\gcd(m,n) = 1$, and $\alpha^j$ is a factor of $m$, hence $\alpha^j$ divides $j$. So, $\alpha^j \leq j$. Since $\alpha$ is a prime, $2 \leq \alpha$, and so $2^j \leq j$. But it can easily be shown that $2^j > j$ for all integers $j$, giving us the contradiction.

$\blacksquare$
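The theorem says such an $a$ exists, and since $x^x$ is increasing on $(1/e, \infty)$ we can actually hunt it down by bisection. Here is a sketch (the function name and tolerance are my own choices); it compares logarithms, $a \ln a$ against $\ln r$, to avoid overflow for large $r$.

```python
import math

def solve_x_to_x(r: float, tol: float = 1e-14) -> float:
    """Find a > 1/e with a**a == r by bisection, for r > (1/e)**(1/e)."""
    target = math.log(r)
    lo = 1 / math.e
    hi = max(2.0, r)                       # hi * log(hi) >= log(r) for this choice
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * math.log(mid) < target:   # compares mid**mid with r on a log scale
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a = solve_x_to_x(2.0)      # roughly 1.5596, irrational by the theorem
three = solve_x_to_x(27.0) # lands on 3, one of the exceptional values n^n
```

For $r = 2$ the theorem guarantees the solution is irrational, while for $r = 27$ the search just returns the integer $3$ (up to the tolerance).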

Transcendental Numbers

That $\pi$ is irrational is a well-known fact. A more surprising fact, at least to the more mathematically inclined, is that $\pi$ is transcendental.

Not all irrational numbers are equal. $\sqrt{2}$ is, in some ways, defined by the fact that it is the (positive) root to $x^2 - 2 = 0$. Despite being, well, just a number that seems to exist as its own object, $\sqrt{2}$ has a decidedly algebraic quality to it which allows us to precisely specify it in much simpler terms (i.e. integers and arithmetic). Lots of numbers can be characterized as solutions to polynomials. Even imaginary numbers, which many find difficult to grasp, can be simply defined in terms of simple polynomials: $\pm i$ are the solutions to $x^2 + 1 = 0$.

But not every number can be defined via a polynomial. In fact, it's not too hard to show that there are only countably many of these algebraic numbers, so certainly the uncountable $\mathbb{R}$ contains some non-algebraic or transcendental numbers, two of which we are already familiar with: $\pi$ and $e$.

Proving that numbers, especially $\pi$ and $e$, are transcendental is not straightforward by any means, and even simple statements about them were huge discussion points throughout math's history (see Hilbert's 7th Problem), resulting in major theorems like the Lindemann–Weierstrass theorem. The first number shown to be transcendental was constructed by Liouville and is aptly called Liouville's constant:

$\begin{align} \sum_{n=1}^\infty \frac{1}{10^{n!}} = 0.11000100000000000000000100\ldots \end{align}$

and you can find a proof of its transcendence here.
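The decimal expansion of Liouville's constant is simple to generate: the digit in position $n!$ after the decimal point is 1 and every other digit is 0. A tiny sketch (the function name is my own):

```python
from math import factorial

def liouville_digits(num_digits: int) -> str:
    """First `num_digits` decimal digits of the sum over n >= 1 of 10**(-n!)."""
    digits = ["0"] * num_digits
    n = 1
    while factorial(n) <= num_digits:
        digits[factorial(n) - 1] = "1"     # the n!-th digit after the point is 1
        n += 1
    return "0." + "".join(digits)

expansion = liouville_digits(26)           # ones at positions 1, 2, 6 and 24
```

The gaps between the ones grow factorially, which is exactly what makes the number so well approximated by rationals.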

Maybe we'll come back to all this later, but for now, we've done enough and can save it for another day.