**Hello, World**! I’m Adi Mittal, a student at Terman Middle School. I enjoy food, math, running, martial arts, and music of memes.

My main intent in starting this blog is to share my thoughts, ideas, and outlooks on cool math. My primary goal is to create content that's interesting and to share my thoughts on the world around me. I hope my ideas and thoughts can appeal to you, just as they did to me when they showed me how interesting math can be!

**UPDATE (AUGUST 1ST, 2021)**

As the years have gone by, I have moved on from my lowly middle school self and am now a rising senior at Henry M. Gunn High School who is still obsessing over math, but I am also actively participating in track and field, the concert and jazz bands, and Model United Nations, and acting as a key organizer in my school's TEDx organization. My first year of high school was primarily spent situating myself in the new social dynamics and in all of these great clubs, so blogging was put on an indefinite hiatus as it floated to the back of my mind. A couple of years ago, in my sophomore year, I took a research class (which you can read about here) in which we were asked to summarize our work in a blog post, and it seemed like a great time to restart the site and continue writing. Since then, I've been writing during school vacations, a couple of times a month at my most efficient, and I try to get a post out at least once every couple of months.

**Coins!** As some of you may relate, I love to take a coin and just spin it on a flat surface or table. It's just satisfying, though it is being put to shame by the so-called "Fidget Spinner". I was spinning a coin a few days ago, and was thoroughly bored at the time, so I decided to ask myself a simple question: **What information can be taken from a spinning coin?** This then turned into: is there a **correlation**, or ratio, between the **rate at which the coin rotates** and the **rate at which it "wobbles"**? With a goal in mind, I picked up my pencil and started to work things out.

First things first, defining and finding all of our givens:

This diagram represents what the coin might look like at a given instance.

What we know about the coin:

The **radius** of our coin $= R$

The **circumference** of our coin $C = 2 \pi R$

Now for the circle our coin rotates and wobbles upon:

The **radius** of this circle is entirely dependent on the angle of the coin to the horizontal (the table or flat surface), which we will define as $\theta$. Using the diagram, we can find that $r = R \cos (\theta)$

The **circumference** is $c = 2 \pi R \cos (\theta)$

With all the givens that we need out of the way, on to the application.

A thing to note is that as the coin completes a **full rotation** around the smaller circle, the **original placement of the coin moves by a certain amount**. You can easily demonstrate this by drawing an arrow on a quarter, and guiding it through a rotation on a circle smaller than the quarter.

This extra distance it covers can easily be worked out to be $C - c$.

$2 \pi R (1 - \cos(\theta)) =$ the distance our coin rotates per revolution around the smaller circle.

The distance per revolution our coin covers while "wobbling" is the same as the circumference of the circle our coin moves around upon, which we know is $2 \pi R \cos(\theta)$

Putting these two together as a ratio of the rate of rotation to the rate of "wobbling", we get:

$\large \frac{2 \pi R (1 - \cos (\theta))}{2 \pi R \cos (\theta)} = \frac{1}{\cos (\theta)} - 1$

This expression says that at any given moment, the ratio between how fast the coin is spinning and how fast the coin is "wobbling" (which can be measured as a frequency in hertz) will be $\frac{1}{\cos (\theta)} - 1$. This also means that if you multiply the frequency of "wobbling" by this expression, it will output how fast the coin should be spinning at the given value of theta. For example, let's say the coin is wobbling at a frequency of $5\,hertz$ at an angle of $\frac{\pi}{4}\,radians$ (because radians are cool). The ratio is $\frac{1}{\cos (\pi / 4)} - 1 = \sqrt{2} - 1 \approx 0.414$, so the coin would have to be rotating at about $2.07\,revolutions\,a\,second$ to maintain that angle to the horizontal at that "wobbling" frequency (because hertz measure $cycles\,per\,second$, the cycles translate into revolutions for the output).
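Here is a minimal sketch of that calculation in Python. The function names are my own for illustration, and the angle has to be in radians because that is what `math.cos` expects (feeding it degrees by accident gives a very different number).

```python
import math

def spin_to_wobble_ratio(theta):
    """Ratio of spin rate to wobble rate for a tilt angle theta, in radians."""
    return 1.0 / math.cos(theta) - 1.0

def spin_rate(wobble_hz, theta):
    """Spin rate (revolutions per second) for a given wobble frequency in hertz."""
    return wobble_hz * spin_to_wobble_ratio(theta)

# The example from the text: wobbling at 5 hertz, tilted at pi/4 radians.
print(spin_rate(5.0, math.pi / 4))
```

A coin lying almost flat (theta near 0) gives a ratio near 0, and the ratio grows without bound as the coin stands up toward $\frac{\pi}{2}$.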

Of course, this is all theoretical. In practice, the coin may slip. Wind may change the local air pressure, thus changing the air resistance. Everything needs to stay **constant**, with no disturbances or changes occurring during the coin's movement. But this is still a neat thing if you were to ask me!

Just to recap, we took the basics and givens of our coin and its environment, and used those to get the generalized rates of the coin spinning and wobbling. We then used those to calculate the ratio between the two at a given instant. Not bad!

If you have any questions or comments, send me an email or leave a comment!

**How do you... How do you even...** This is when you know the problem you are about to be shown will be annoying. When a friend of mine first introduced this problem, I thought it would be very, VERY simple to solve. Use some angle properties, use the given similar triangles, and soon enough, a solution will be found. Of course this didn't work. I tried a few other things, same result. I showed it to my family, but not much help was gained. They tried what I did. Then it hit me. The goal is to find a specific **measure** of the **triangle**s. **Trigonometry** $=$ The **Measurement of Triangles**. The solution quickly followed using a specific property, and I was like, "Meh. That was quite obvious." Enough ranting, it's time you got a look at the problem itself.

*In the diagram below,* $\angle ABC = \angle ACB = \angle DEC = \angle CDE$, $\,\overline {BC} = 8$, *and* $\,\overline {DB} = 2$. *Find* $\,\overline {AB}$

When drawing everything out...

Before you continue reading, I highly encourage you to attempt this geometry problem. It's an interesting problem, and once you have found the concept you need to use to solve it, it's all an easy ride down from there. Following this warning will be the full solution and my thoughts on how I solved this myself. I know I already talked about my thoughts and how I solved this a little in the beginning of this post, but from here on will be

The property I thought of (after my 45 minutes of trial-and-error) that we can use to solve for $\,\overline {AB}$ is the **Law of Sines**:

$\large \frac{a}{\sin A} = \frac{b}{\sin B} = \frac{c}{\sin C}$

Where $A$ is the angle opposite of $a$, $\,B$ is the angle opposite of $b$, and $C$ is the angle opposite of $c$.

We can rewrite this using the diagram:

So now that we have this written out, we can start solving for $\,\overline {AB}$. For convenience, I'm going to refer to the angles equivalent to $\,\angle ABC$ as $\,\theta$.

Using the **Angle Sum Theorem**, $\,\angle BAC = 180 - 2 \theta$. Using this, we can find an expression equal to $\,\overline {AB}$.

Doing some substitution...

$\large \frac{8}{\sin 2 \theta} = \frac{\overline {AB}}{\sin \theta}$

For where the $\sin 2 \theta$ came from: $\sin (180 - 2 \theta)$, when evaluated, is the same as $\sin 2 \theta$. Now for some expansion (using $\sin 2 \theta = 2 \sin \theta \cos \theta$) and evaluation...

$\overline {AB} \sin \theta \cos \theta = 4 \sin \theta$

$\overline {AB} = \large \frac{4}{\cos \theta}$

Now that we have an **expression** for $\overline {AB}$, we just need to find a value of $\cos \theta$, and that will give us the length of $\overline {AB}$! So now, what can we do? What I first thought (based on the information we were given) was that if we find two expressions representing the value of the **same side length**, we can set those two expressions equal to one another to find a value that makes that equation true. That equation will most likely output a value of a function of an angle, as we know very few side lengths and know no angles (we're hoping it will output a value of $\cos \theta$). Again, this is only what I was thinking when solving this problem at the time. The only reason I thought this is that I noticed two triangles, similar to $\triangle ABC$, contained within $\triangle ABC$.

We have similar triangles $\triangle DEC$ and $\triangle BEC$. We know that they are similar because both triangles share the same angles ($\theta$, $\theta$, and $180 - 2 \theta$) as the original triangle $\triangle ABC$. And remember that side length I mentioned earlier that we could find two expressions for, and use those to solve for its length (that's a mouthful)? That length is $\overline {CE}$! It is a side of both $\triangle DEC$ and $\triangle BEC$, and we can find our two expressions by solving for the length of $\overline {CE}$ once in $\triangle DEC$, and again in $\triangle BEC$. Again, we're hoping for a value of $\cos \theta$. Starting by solving for $\triangle BEC$...

We are given the length of $\overline {BC} = 8$, which simplifies our job quite a bit. We can do the same thing we did to find an expression for $\overline {AB}$: Use the **Law of Sines**!

$\large \frac{8}{\sin \theta} = \frac{\overline {CE}}{2 \sin \theta \cos \theta}$

$\overline {CE} \sin \theta = 16 \sin \theta \cos \theta$

$\overline {CE} = 16 \cos \theta$

We now have a value of $\overline {CE}$ from $\triangle BEC$, time to solve $\overline {CE}$ for $\triangle DEC$...

First off, although it's not stated, we know the length of $\overline {DE}$. $\triangle BEC$ is isosceles, where $\angle BEC = \angle ECB$, which also means $\overline {BC} = \overline {BE}$. As $\overline {BC} = 8$, therefore $\overline {BE} = 8$. Since we were told $\overline {DB} = 2$, we can solve $\overline {BE} - \overline {BD} = \overline {DE} = 6$. Now back to the almighty **Law of Sines**...

Substitution and expansion...

$\large \frac{3}{\sin \theta \cos \theta} = \frac{\overline {CE}}{\sin \theta}$

$\overline {CE} \sin \theta \cos \theta = 3 \sin \theta$

$\overline {CE} = \large \frac{3}{\cos \theta}$

Great! We're lucky that it came out as a value of $\cos \theta$, but anyways, we have our two expressions, now just to set them equal to one another...

$16 \cos \theta = \large \frac{3}{\cos \theta}$

$16 \cos^2 \theta = 3$

$\cos^2 \theta = \large \frac{3}{16}$

$\cos \theta = \large \frac{\sqrt{3}}{4}$

Now that we have our value of $\cos \theta$, we can just substitute this into our original expression for $\overline {AB}$...

$\overline {AB} = \large \frac{4}{(\frac{\sqrt{3}}{4})}$

$ = \large \frac{16}{\sqrt{3}}$

And there it would be, our solution! Although it might have seemed quite lengthy to get to $\frac{16}{\sqrt{3}}$, it all just revolved around the one concept of the **Law of Sines**, so not too bad.
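If you want to sanity-check the algebra numerically, here is a small Python sketch. It is only a check of the values derived above, not a general triangle solver, and the variable names are mine.

```python
import math

# The value of cos(theta) found above.
cos_theta = math.sqrt(3) / 4
theta = math.acos(cos_theta)

# The two expressions for CE should agree at this angle.
ce_from_bec = 16 * cos_theta   # from triangle BEC
ce_from_dec = 3 / cos_theta    # from triangle DEC

# AB from the first expression we found.
ab = 4 / cos_theta

# Law of Sines in triangle ABC: BC / sin(180 - 2*theta) = AB / sin(theta).
lhs = 8 / math.sin(math.pi - 2 * theta)
rhs = ab / math.sin(theta)

print(ce_from_bec, ce_from_dec)
print(ab, 16 / math.sqrt(3))
print(lhs, rhs)
```

Each printed pair should match up to floating-point rounding.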

Although this is one way to obtain the solution, I'm sure there are other ways to tackle this problem, and I found another way which completely negates our first step, to find an expression for $\overline {AB}$, but adds an extra step to the end.

With our value of $\cos \theta = \frac{\sqrt{3}}{4}$, we can draw a right triangle with $\theta$ as one of its angles, with a bit of moving around.

We can do this, because as we stated earlier $\theta = any\,angle\,equivalent\,to\, \angle ABC$ (and that's the exact angle we're working with). We also **bisected** $\overline {BC}$ at $F$ to form the 2 right triangles within our isosceles triangle, so the length of $\overline {BF} = 4$. We can then use some basic trigonometry and evaluation to solve for $\overline {AB}$.

$\cos \theta = \large \frac {4}{\overline {AB}}$

$\cos (\arccos \large \frac {\sqrt{3}}{4}) = \large \frac {4}{\overline {AB}}$

$\large \frac {\sqrt{3}}{4} = \large \frac {4}{\overline {AB}}$

${\large \frac {\sqrt{3}}{4}} \overline {AB} = 4$

$\overline {AB} = \large \frac{16}{\sqrt{3}}$

Just another simple way of getting to the exact same answer.

If you have any questions or comments, send me an email or leave a comment!

This specific solution is one of my favorites that I have seen. One of my initial attempts was to use the dimensions of the similar triangles and find the common ratio between the side length and the base of the triangle. I knew it could be done, but could never put my finger on it. However, when a friend of mine took a look at this problem, after a bit of thought, he managed to come up with this. It's really quite a spectacular solution, and this is credited entirely to him (no use of name for privacy reasons). Oh, and I'll be speaking in first person, just so I don't cause any confusion, or make it seem like I'm taking it as mine. Just to be clear.

So the first step is to take the three triangles we know to be similar to one another: $\triangle ABC$, $\triangle BEC$, and $\triangle CED$. We know that they are similar due to the fact that they all share two common angles, which forces them to have a common ratio between the base and a leg of the triangle (this will be important to remember later). We will 0-index them from the original triangle through the following divisions within one another, so $\triangle ABC$ is $\triangle 0$, $\triangle BEC$ is $\triangle 1$, and $\triangle CED$ is $\triangle 2$. I will also now be referring to the triangles by their respective index numbers.

Now using the fact that every triangle is similar, and that each progressive triangle was formed by using the base length of the previous triangle as the leg of the next triangle, we can find a ratio between a dimension (say, the base) of a triangle and that of its previous/next triangle, and use that to find the length of $\overline{AB}$. I know that is kind of confusing right now, but trust me, it will make more sense the more I go on.

So we know the base lengths of two of the triangles ($\triangle 0$ and $\triangle 2$). Since we know that they should share a common ratio, we can write them as a ratio between one another, and hence find said ratio.

$\large \frac{\overline{BC}}{\overline{DE}} = \frac{8}{6} = \frac{4}{3}$

So we have a ratio, but the problem with this ratio is that it's for two divisions. It's for going between $\triangle 0$ and $\triangle 2$. We want one between $\triangle 0$ and $\triangle 1$, or $\triangle 1$ and $\triangle 2$. But this is easy, since each division scales the previous triangle by the same factor. This means if we take some dimension _a_ of a triangle and divide it by our one-division ratio **once**, we will obtain the dimension _a_ of the next division's triangle. For example, if we have triangle-base $\overline{BC}$ and divide it by our ratio, we should get the length of triangle-base $\overline{EC}$. Take a look at the diagram if that helps. Essentially, if we scale the base length of $\triangle 0$ by some ratio, we will get the base length of $\triangle 1$, and if we do that again, we will get the base length of $\triangle 2$. Now if you see, we had to scale **twice** to get from $\triangle 0$ to $\triangle 2$. A.K.A., take the square of the ratio. To undo a square, you take the **square root**. So we can undo our two-division ratio by taking its square root to get our one-division ratio.

$\large \sqrt{\frac{4}{3}} = \frac{2}{\sqrt{3}}$

So that's our ratio for a single triangle division. Now we need to find the length of $\overline{AB}$. We can do what we did originally with the base lengths, only with the legs of the triangles. Larger triangle over the divided triangle. In this case, $\triangle 0$ over $\triangle 1$.

$\large \frac{\overline{AB}}{\overline{BC}} = \frac{2}{\sqrt{3}}$

$\large \frac{\overline{AB}}{8} = \frac{2}{\sqrt{3}}$

$\overline{AB} = \large \frac{2 \cdot 8}{\sqrt{3}}$

$\overline{AB} = \large \frac{16}{\sqrt{3}}$
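The ratio argument is short enough to check in a few lines of Python. This is just a sketch of the arithmetic above; the variable names are mine, and it assumes, as described, that the leg of $\triangle 1$ is the base of $\triangle 0$.

```python
import math

base_0 = 8.0   # BC, base of triangle 0
base_2 = 6.0   # DE, base of triangle 2

two_step = base_0 / base_2        # ratio across two divisions, 4/3
one_step = math.sqrt(two_step)    # ratio across one division, 2/sqrt(3)

base_1 = base_0 / one_step        # CE, base of triangle 1
leg_1 = base_0                    # BE = BC, the leg of triangle 1
ab = leg_1 * one_step             # AB, the leg of triangle 0

print(base_1)   # should equal 4*sqrt(3)
print(ab)       # should equal 16/sqrt(3)
```

Dividing `base_1` by `one_step` again lands back on 6, the base of $\triangle 2$, which is a nice consistency check.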

The Earth has a diameter of approximately 12742000 meters. Most people of course wouldn't travel that far, but what if you did? How fast can you get across with nothing but yourself? That's essentially what people have asked in the form of the question: How long will it take to fall through the center of the Earth?

Following our standard procedure, let's list all the givens:

The force of gravity is $F = \large \frac{G m M}{r^2}$

Where...

$G$ = the gravitational constant

$m$ = the mass of object 1 (in this case, us)

$M$ = the mass of object 2 (in this case, Earth)

$r$ = the distance between $m$ and $M$.

So now we are just trying to find as many values or expressions for the variables within that equation of force as we can. We can leave $m$ as is, because that's the mass of our human/us. So what we really need is $r$ and $M$.

One thing we have to worry about, though, is that as we fall $r$ will change. As we fall we will get closer to Earth's center of mass, eventually pass it, and then get farther from it. So we will call our current distance from Earth's center of mass $x$. And what's great about this is that if we are any distance into our fall, we can just ignore any mass above us. Using the diagram as an example, if we are $R-x$ deep into our fall, we can ignore any mass of Earth contained between $R$ and $x$. Some of you may think, "But wait! Wouldn't the mass above us have its own force of gravity acting upon you, and therefore slowing you down as you fall?" The answer is technically yes, but the shell of mass farther from the center than you surrounds you on all sides, and its pulls in every direction balance out. All these forces cancel out, making that shell not affect you at all. So, all we really care about is the amount of mass below us, and the distance between us and the Earth's center of mass (which would be the radius $x$ as we have been discussing). So we have one variable filled.

Now we need $M$. The formula for mass is $M = volume \times density$. The volume of the part of the Earth below us $= \frac{4 \pi x^3}{3}$ (we are using $x$ again as the mass affecting us changes over our fall). And we can represent density with $\rho$ (treated as the same everywhere inside the Earth). So $M$ equals:

$M = \large \frac{4 \pi x^3 \rho}{3}$

Putting this all together, the force of gravity acting upon us during this fall equals:

$F = \large \frac{G m}{x^2} \cdot \frac{4 \pi x^3 \rho}{3} = \large \frac{4 \pi G m \rho}{3} x$

If we let $\frac{4 \pi G m \rho}{3} =$ say, $k$, we get $F = -k x$. It's negative because the force always pulls us back toward the center. This is actually an **oscillating system**. To represent this, I've made a mock graph to show how gravity affects us over time starting from the top of "Earth". The graph is just a representation.

If the x-axis is time, and we fell from the top of Earth (and there is NO air resistance), as you can see, we would just continuously bounce back and forth between the top and bottom of the Earth. Now we need to find the **period** of our oscillating system. The period is the time it takes for one cycle to be completed. To be more precise, we need *half* of the period. That is because one cycle (in this case) is falling all the way down, and coming all the way back. We only want the time it takes to fall down, so that's why the half.

The equation for the period of a simple oscillating system (also called simple harmonic motion) is:

$Period = 2 \pi \sqrt {\large \frac {m}{k}}$

Here $k$ is the constant multiplying $x$ in the force of our oscillating system, and $m$ is our mass. But since we want half of that, the time to fall through the Earth is...

$Time = \pi \sqrt {\large \frac {m}{k}}$

Doing some substitution...

$Time = \pi \sqrt {\large \frac {m}{\large \frac{4 \pi G m \rho}{3}}}$

$ = \pi \sqrt {\large \frac {3 m }{4 \pi G m \rho}}$

$ = \sqrt {\large \frac {3 m \pi^2}{4 \pi G m \rho}}$

$ = \sqrt {\large \frac {3 \pi}{4 G \rho}}$

Now all we need to do is put in $G$ as the Gravitational Constant, and $\rho$ as the density ($\rho = \frac{mass}{volume}$) of Earth (I did some Googling...)!

$= \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot s^{-2} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ = \sqrt {\large \frac {3 \pi \cdot s^{2}}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ = s \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

So, I don't know about you, but when I have something like this, I just straight up put it into *Wolfram Alpha* , or a similar calculator, as I am just lazy and it's a pain to evaluate. So, letting it be computed by the calculator...

$ \large = 2530.5\,seconds$

This, funnily enough is also the answer to the universe and all of its questions. $2530.5\,seconds = 42\,minutes\,(+10.5\,seconds)$. Quite a coincidence if I say so!
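For anyone who would rather not lean on an online calculator, here is a rough Python sketch that evaluates the same formula and then cross-checks it by stepping the motion forward in time. The constants are the ones used above; the function names, the time step, and the choice of a velocity Verlet integrator are my own assumptions for illustration.

```python
import math

G = 6.67408e-11        # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24     # mass of Earth, kg
R_EARTH = 6_371_000.0  # radius of Earth, m

def fall_time_closed_form():
    """Half the period: sqrt(3*pi / (4*G*rho)), with a uniform density rho."""
    rho = M_EARTH / (4.0 / 3.0 * math.pi * R_EARTH**3)
    return math.sqrt(3.0 * math.pi / (4.0 * G * rho))

def fall_time_simulated(dt=0.1):
    """Step x'' = -(k/m) x from the surface until the velocity turns around."""
    k_over_m = G * M_EARTH / R_EARTH**3   # same as 4*pi*G*rho/3
    x, v, t = R_EARTH, 0.0, 0.0
    a = -k_over_m * x
    while True:
        # Velocity Verlet: update position, then acceleration, then velocity.
        x_new = x + v * dt + 0.5 * a * dt * dt
        a_new = -k_over_m * x_new
        v_new = v + 0.5 * (a + a_new) * dt
        t += dt
        # Velocity going from negative back to non-negative means we have
        # come to rest on the far side of the Earth.
        if v < 0.0 and v_new >= 0.0:
            return t
        x, v, a = x_new, v_new, a_new

print(fall_time_closed_form())
print(fall_time_simulated())
```

The two numbers should agree to within about one time step, which is a decent sign that the period formula really does describe this fall.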

Now what's great about the equation we used ($ = \sqrt {\frac {3 \pi}{4 G \rho}}$) is that it's quite easy to apply to other objects, as most of it is constant! 3 is, well, a constant. So is 4. $\pi$ has been universally agreed upon for its value. And as far as we can tell in the universe, the Gravitational Constant holds true. The only thing that determines the fall time is the density. So you could have two planets, one with $x$ as its radius, and the other with $100 x$. If they are just as dense as one another, you will fall through them (across the diameter) in the same time.
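A quick way to convince yourself of the "only density matters" claim is to compare two made-up planets of equal density. In this sketch the second planet is hypothetical: it has 100 times Earth's radius and therefore $100^3$ times its mass. The function name is mine.

```python
import math

G = 6.67408e-11  # gravitational constant, m^3 kg^-1 s^-2

def fall_time(mass, radius):
    """Time to fall through a uniform sphere: sqrt(3*pi / (4*G*rho))."""
    rho = mass / (4.0 / 3.0 * math.pi * radius**3)
    return math.sqrt(3.0 * math.pi / (4.0 * G * rho))

earth_like = fall_time(5.972e24, 6_371_000.0)
# Hypothetical planet: 100x the radius, same density, so 100**3 times the mass.
giant = fall_time(5.972e24 * 100**3, 6_371_000.0 * 100)

print(earth_like, giant)
```

Both times come out the same (up to floating-point rounding), even though the second trip is 100 times longer.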

Now just as a random fact that I thought was amusing, was the top speed you would attain. We know that acceleration due to gravity on Earth is $\frac{9.807\,m}{s^2}$. The top speed would be when you reach the center of the Earth, which is 6,371,000 meters from the surface (aka, the radius of Earth). Using this, we can calculate the speed at which we would be at in meters per second at the center. Just to be sure, we can calculate acceleration due to gravity, using our original formula, where we ignore our mass: $g = \frac{G \cdot M_{Earth}}{R^2}$

$g = \frac{6.67408 \cdot 10^{-11} \cdot m^3 \cdot kg^{-1} \cdot s^{-2} \cdot 5.972 \cdot 10^{24}\, kg }{6371000^2\,m^2}$

$g = \frac{6.67408 \cdot 5.972 \cdot 10^{13} \cdot m}{6371000^2 \cdot s^2}$

Thanks to a calculator...

$\approx \large\frac{9.82m}{s^2}$

Of course this is not the same as what others have put on the internet; values will differ from here to there. I trust the value of $\frac{9.807m}{s^2}$, as I think the values used to calculate it would be more accurate. Back to the top speed now. Because the force grows in proportion to the distance from the center, the speed $v$ at the center satisfies $v^2 = g R$, where $R$ is the radius of Earth:

$v^2 = \large \frac{9.807 \cdot 6371000\,m^2}{s^2}$

$v = \large \frac{\sqrt{9.807 \cdot 6371000}\,m}{s}$

$\approx \large \frac{7904.454251 m}{s}$

$\approx \large \frac{17681.760583\,miles}{hour}$

That's about 23.23 times the speed of sound! This literally means you can't yell during this fall, as you would be going faster than the sound could travel through the air around you. It will be a silent fall. That is, if there was air, and the terminal velocity of a human wasn't $\frac{53m}{s}$.
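The numbers in this section can be reproduced with a few lines of Python. This is a sketch under two assumptions of mine: the miles-per-hour conversion uses 1609.344 meters per mile, and the speed of sound is taken as 340.29 m/s, which is the sea-level figure that reproduces the 23.23 above.

```python
import math

G = 6.67408e-11          # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24       # mass of Earth, kg
R_EARTH = 6_371_000.0    # radius of Earth, m
SPEED_OF_SOUND = 340.29  # m/s at sea level (assumed value)

# Surface gravity from the original force formula, ignoring our own mass.
g_computed = G * M_EARTH / R_EARTH**2

# Top speed at the center, using the trusted value of g: v = sqrt(g * R).
g_standard = 9.807
v_top = math.sqrt(g_standard * R_EARTH)   # meters per second
v_mph = v_top * 3600.0 / 1609.344         # miles per hour
mach = v_top / SPEED_OF_SOUND

print(g_computed)
print(v_top, v_mph, mach)
```

Swapping `g_standard` for `g_computed` nudges the top speed up by a few meters per second, which shows how little the choice between the two values of $g$ matters here.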

So that would be it for this post! We found out we can cross Earth in under 45 minutes, and break the sound barrier 23 times over!

I plan on following it up with another post showing how you can use integration to find the time to fall through Earth (and that equation to find the period of an oscillating system/simple harmonic motion that kind of came out of nowhere. The $2 \pi \sqrt{\frac{m}{k}}$), and to show some other cool properties and interesting things about falling, pendulums, and oscillating systems in general.

Now here's an extra challenge for you: How long will it take to fall through Earth, 500 kilometers above the surface?

If you have any questions or comments, send me an email or leave a comment!

There is just no introduction needed here. The problem at hand is probably one of the hardest, most controversial topics in computer science: the $P$ versus $NP$ problem.

In case this is not clear (or you have never heard of this problem before), it is to show either that every $NP$ problem is also a $P$ problem, or that the two classes are not equal. $P$ stands for polynomial time: the problems that can be solved in polynomial time. $NP$ stands for non-deterministic polynomial time: the problems whose solutions can be checked in polynomial time, even if finding a solution might take much longer. An $NP-hard$ problem is one that is at least as hard as every problem in $NP$, and no polynomial-time algorithm is known for any of them. Polynomial time is time that can be represented as a function of the size of the input (input being whatever you need to achieve/solve for in the problem), where the function is a simple polynomial function. For example, the following function representing the time it takes to solve some problem,

..., this would classify the problem as a $P$ problem, as we can represent the time it takes to solve the problem as a simple polynomial function. An example for how long some $NP-hard$ problem might take to solve would be such as...

This is bad for computation time: since $x$ is our input, our values would explode the greater the amount of input we have. That's the kind of time the best known methods take for an $NP-hard$ problem. We would essentially have to brute force our way through, and check every possible scenario (within the allotted values for our problem) to solve this.
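To make the "check every possible scenario" idea concrete, here is a small Python sketch. The subset-sum problem is my own choice of example (it is not one from this post): given a list of numbers, is there a subset that adds up to a target? Checking a proposed answer is quick, but the brute-force search below may have to try all $2^n$ subsets.

```python
from itertools import combinations

def brute_force_subset_sum(numbers, target):
    """Try every subset; return (matching subset or None, subsets checked)."""
    checked = 0
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            checked += 1
            if sum(subset) == target:
                return subset, checked
    return None, checked

def verify(subset, target):
    """Checking a proposed answer only takes one pass over it."""
    return sum(subset) == target

numbers = [3, 34, 4, 12, 5, 2]
found, tried = brute_force_subset_sum(numbers, 9)
print(found, tried, verify(found, 9))

# A target that cannot be reached forces the search through all 2**6 subsets.
missing, tried_all = brute_force_subset_sum(numbers, 1000)
print(missing, tried_all)
```

Each extra number in the list doubles the worst-case count, which is exactly the kind of explosion described above.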

So now the reason why this problem is so controversial: if we can show that $P = NP$ is true, we can then theoretically solve ANY problem in $NP$ with an algorithm that runs in polynomial time. It would cut so much time off of solving all the crazy hard, unsolved problems.

And I know, some of you may be thinking, "But, hey! Wouldn't most problems need a completely different approach to solve, than another problem?" Well, my response to this, would be yes, but, there are some $NP-hard$ problems that

Okay, now with that all out of the way, here is the reason why I started discussing $NP-problems$: the $P$ versus $NP$ problem. This problem bothers me so much, for a few reasons. **ONE**: This seems a lot easier to solve than it actually is, and this just intuitively bothers me more than other problems do. It seems like such a simple statement to show, but it's just not. **TWO**: The way people are approaching this problem seems all too *awkward* and incorrect to me. It seems that they are overcomplicating this quite a bit. But this is computer science, so I don't have much say. And the person who proved *Fermat's Last Theorem* did so in more or less 150 pages (I think), so this could very well be the case here as well.

My attempts haven't been successful (well, if one had been, I would be too excited to write this up), but I do have a few thoughts on the matter. My first attempt was rather bleak. Take a generalized form of the time it takes to solve an $NP-hard$ problem, and just try to work it down to some representation of polynomial time. This, obviously, did not work. What ended up happening was that I was trying to put the wrong variable into a polynomial-time representation, and couldn't find a way to expand it onto the variable that I needed to express. So, that idea was gone.

The second idea would be a bit more practical. Take some $NP-complete$ problem, look at how its time behaves in its NP form, then try to find some algorithm that results in the same solution, but in polynomial time. The reason why I would do this is because you can link any $NP-problem$ to one of the $NP-complete$ problems. Using this, we can create a map, linking every $NP-complete$ problem to another. That way, if we can solve one, we have then technically shown it for every $NP-problem$. We can do that, or somehow generalize our $NP-complete$ problem, and show it from there.

My last idea on the matter is to think of the consequences of this statement ($P=NP$) being **true** or **false**. If this is **true**, I feel that this would create a paradox, because finding the polynomial-time function of an $NP-problem$ is $NP-hard$ in itself. But that cannot happen, as we said that $P=NP$, so we have a contradiction. So you would then have to show that finding a P function of an NP function is in P. But that is also $NP-hard$. Then you would have to show that is also in P. But that's also $NP-hard$, so we have to show that it's in P, etc., etc. So we end up having to continuously prove that something that is in NP is in P, to show that the smaller $NP-hard$ problem of P versus NP (showing the conversion of NP to P in a given $NP-problem$) is also in P (that was a bit long and a mouthful. Essentially you get a recursive $NP-problem$, and each iteration of this recursive problem is slightly different from the last iteration, but with the same goal of showing that that iteration of an $NP-problem$ takes P time to actually solve). If it was **false**, we would stay where we are computationally, and nothing would have changed.

Personally, based on what I have done so far, I think $P \neq NP$. But don't think that is my final decision. People have found polynomial-time algorithms for problems that were long thought to need far more time, so based on my second idea of mapping $NP-complete$ problems, there is still some possibility. Expect some updates, and future posts, as this is one of many other problems (I'll just say them: The Millennium Problems) that have gotten me thinking in almost no way I have done before (that's probably because they are not all math-based, and I'm a math-based guy, so math-based + not-math-based-problem = new type of thinking). Actually, don't expect future updates and posts, just know there will be future updates and posts.

If you have any questions or comments, send me an email or leave a comment!

Not much for an introduction this post. Found this problem when looking for interesting problems for myself. Shoutout to Harvard's Problem of the Week (from 2002 to 2004). The problem at hand is:

(a) What is the expected value of your winnings when playing the game?

(b) Play the same game, except let your earnings be $2^{n-1}$, where $n$ is the number of flips. What do you expect to win now? Does it make sense?

**(a)**: Expected value is the amount you win, multiplied by the probability of it occurring, added up over all the possible outcomes.

You have a 50% chance to win 1 dollar. 25% chance to win 2 dollars. 12.5% chance to win 3 dollars...

$\large \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + ...$

$= \large \sum _{n=1}^{\infty }\: \frac{n}{2^n}$

$= \large 2$

**OR**

$\large \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + ...$

$\large =(\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...) + (\frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...) + (\frac{1}{8} + \frac{1}{16} + ...) + ...$

$= (1) + \large (\frac{1}{2}) + (\frac{1}{4}) + (\frac{1}{8}) + (\frac{1}{16}) +...$

$\large = 2$
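If the infinite sum feels like a leap, here is a small Python sketch that adds up the first few terms exactly, using fractions instead of decimals. The function name is mine.

```python
from fractions import Fraction

def expected_value_partial(terms):
    """Sum of n / 2**n for n = 1 .. terms, as an exact fraction."""
    return sum(Fraction(n, 2**n) for n in range(1, terms + 1))

for terms in (1, 2, 5, 10, 20):
    partial = expected_value_partial(terms)
    print(terms, partial, float(partial))
```

The partial sums creep up toward 2 and never pass it. In fact, the gap after $N$ terms is exactly $\frac{N + 2}{2^N}$, which shrinks very quickly.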

So you can expect to win 2 dollars.

This is where the fun is at.

**(b)**: We have 50% chance to win 1 dollar. We have a 25% chance to win 2 dollars. We have a 12.5% chance to win 4 dollars...

$\large \frac{1}{2} + \frac{2}{4} + \frac{4}{8} + \frac{8}{16} +...$

If you don't mind, since I like to write things in sigma notation, I would like to write the simplified version of this sum in sigma notation.

$=\large \sum _{n=1}^{\infty}\: \frac{1}{2}$

$\large = \infty$

This is why I picked this problem. The first part is quite simple, but this part creates quite a dilemma. What can we do now? How should we interpret this for the expected value of our game? No one would ever put up a game in which the player is expected to win an infinite amount of money, since no one has an infinite amount of money!
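One way to see the blow-up is to cut the game off after a maximum number of flips and watch what happens as that cap grows. A minimal sketch (the cap is a device of mine, not part of the original problem):

```python
from fractions import Fraction

def truncated_expectation(max_flips):
    """Sum of 2**(n-1) / 2**n for n = 1 .. max_flips; every term is 1/2."""
    return sum(Fraction(2**(n - 1), 2**n) for n in range(1, max_flips + 1))

for cap in (10, 100, 1000):
    print(cap, truncated_expectation(cap))
```

Each allowed flip adds another half dollar, so a cap of $N$ flips gives exactly $\frac{N}{2}$ dollars, and removing the cap removes any ceiling on the sum.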

The following explanation is a jumble between what I thought, and Harvard's. I recommend looking at what they said specifically.

The solution is that our game (which would be known as the *experiment* in our scenario) doesn't agree with the exact definition of **expected value**. Expected value is defined as an average over an *infinite* number of attempts/trials (this can be viewed at least as the limit towards an infinite number of attempts/trials). The thing is, you'll never be able to play an infinite number of games. Essentially, our experiment (game) doesn't agree with our calculated expected value, as the experiment has nothing to do whatsoever with the precise definition of expected value. Just as an example, if you were to (somehow) play an infinite number of games, your earnings would indeed average an infinite amount. This whole idea of expecting to win an infinite amount, and it "not working/making sense/not being possible", arises when we try to make expected value something it isn't.

Okay, I like math, but from this point onward I didn't have much. And what I did wasn't cohesive, as 25% was written down, the other 75% was in my head. The problem is, that 75% *was* in my head. I would try to go through and get my complete explanation, but I feel that Harvard's solution is already quite nice. So the rest is all Harvard's explanation. Only credit I get here is for the fact I formatted it for this page. Here you go.

"*This might not be a very satisfying explanation, so let us get a better feeling for the problem by looking at a situation where someone plays $N = 2^n$ games. How much money would a “reasonable” person be willing to put up front for the opportunity to play these $N$ games? Well, in about $2^{n-1}$ games he will win one dollar; in about $2^{n-2}$ he will win two dollars; in about $2^{n-3}$ games he will win four dollars; etc., until in about one game he will win $2^{n-1}$ dollars. In addition, there are the “fractional” numbers of games where he wins much larger quantities of money (for example, in half a game he will win $2^n$ dollars, etc.), and this is indeed where the infinite expectation value comes from, in the calculation above. But let us forget about these for the moment, in order to just get a lower bound on what a reasonable person should put on the table. Adding up the above cases gives the total winnings as: $2^{n-1}(1) + 2^{n-2}(2) + 2^{n-3}(4) + \cdots + 1(2^{n-1}) = 2^{n-1}n$. The average value of these winnings in the $N = 2^n$ games is therefore $\frac{2^{n-1}n}{2^n} = \frac{n}{2} = \frac{(\log_2 N)}{2}$. A reasonable person should therefore expect to win at least $\frac{(\log_2 N)}{2}$ dollars per game. (By “expect”, we mean that if the player plays a very large number of sets of $N$ games, and then takes an average over these sets, he will win at least $2^{n-1}n$ dollars per set.) This clearly increases with $N$, and goes to infinity as $N$ goes to infinity. It is nice to see that we can obtain this infinite limit without having to worry about what happens in the infinite number of “fractional” games. Remember, though, that this quantity, $\frac{(\log_2 N)}{2}$, has nothing to do with a true expectation value, which is only defined for $N \to \infty$. Someone may still not be satisfied and want to ask, “But what if I play only $N$ games? I will never ever play another game. 
How much money do I expect to win?” The proper answer is that the question has no meaning. It is not possible to define how much one expects to win, if one is not willing to take an average over a arbitrarily large number of trials.*"
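The sum in the quoted explanation is easy to check by brute force. Here is a minimal sketch; the function names are my own and are not part of Harvard's solution. It adds up the non-fractional cases for $N = 2^n$ games and compares the average per game against $\frac{n}{2}$.

```python
def truncated_winnings(n: int) -> int:
    """Total winnings over N = 2**n games, ignoring the 'fractional' games.

    About 2**(n - k) of the games pay 2**(k - 1) dollars, for k = 1..n.
    """
    return sum(2 ** (n - k) * 2 ** (k - 1) for k in range(1, n + 1))


def average_per_game(n: int) -> float:
    """Lower bound on the average winnings per game: n / 2 = log2(N) / 2."""
    return truncated_winnings(n) / 2 ** n


for n in (1, 5, 10, 20):
    # Each row shows n, the truncated total, and the per-game average.
    print(n, truncated_winnings(n), average_per_game(n))
```

Every term in the sum equals $2^{n-1}$, which is why the total collapses to $2^{n-1}n$ and the average grows like $\frac{\log_2 N}{2}$.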

Neat little problem, if I do say so myself. Some of my work, some of Harvard's; I hope it was cohesive and clear who was writing what and when. I wish I could have included my last piece of explanation, but it would have taken a bit too long for something I need to redo. Moral of the story: take complete notes.

If you have any questions or comments, send me an email or leave a comment!

**Over 7 million students** across the United States missed 15+ days of school in the 2015-16 school year (US Department of Education, 2012). These *chronically absent* students, who miss 10% of their academic year, cause over 40 *billion* instructional minutes to go to waste. Even more jarring, the same report cited inconsistent student attendance as a better indicator than test scores of whether a student will drop out of school. In case it wasn't clear enough, student absenteeism is a major issue haunting the education system, and the need to find a solution to it is greater than ever. Over the past 8 months, I have conducted a small series of experiments to try to alter the state of this issue in one public high school in Palo Alto, California.

**In the case of Palo Alto's Henry M. Gunn High School**, only 5% of the school's 2000 students are in this category of the chronically absent. That means on average, 1 student is missing in every class on campus. What makes this statistic concerning though is when one considers the demographic of Palo Alto. Here are two maps of the United States: one of the median household income (New York Public Radio), and another of chronic absentee rates by district (US Department of Education, 2012).

Although Palo Alto is in one of the most affluent neighborhoods in the nation, its chronic absentee rate is equal to that of some areas with not even half its median household income. This led to the idea that absenteeism may be fueled by different motivators across the country. For example, in some less fortunate neighborhoods, kids may be active contributors to their family income, leading to conflicts with their academic commitment. Similarly, because of Palo Alto's wealth and greater access to resources, academic competitiveness may fuel absenteeism. This may seem counterintuitive at first: if someone wants to do well in class, why would they skip it? These *strategic absentees* skip class not because they want to, but rather as a necessary evil: they feel the need to skip a class as a means to prepare for another one. These are the students that can most affect the absentee rate, as not only are they motivated to go to class, but their absences are likely more sporadic than those of the standard chronically absent, meaning that they are more likely to be in class to be influenced by a teacher or administrator, which nicely leads into the next section.

**Students can't be forced to go to class**. Especially as Henry M. Gunn High School sports an open-campus policy, it is impossible to force a substantial number of absentees to go to class beyond the already instated measures. So, instead, we tried to **persuade** the students: if they decide for themselves that attending class is the right thing to do, they are more likely to act on it. To do so, we used specially engineered social measures, known as **nudges**, to try to convince the students as best as possible.

Nudges, at their core, are suggestions. They don't affect one's ability to choose, but they utilize the person's experience to guide them to pick an option. The most commonly used nudge utilizes comparisons: let's say a restaurant has a dish that doesn't sell super well, and is ultimately costing them money having it listed on the menu. To boost its sales, they can list a slightly cheaper dish that has no intention of actually being sold in large quantities. This "fake" dish provides a reference frame for the buyer, making what originally seemed overpriced suddenly look like a great deal. This type of persuasion was first formalized in a book of the same name: *Nudge: Improving Decisions About Health, Wealth, and Happiness*, written by Cass Sunstein and Richard Thaler (a highly recommended read). It should be reiterated that this doesn't change one's ability to pick what food item they want, and that's what makes a nudge so effective. It allows the person to convince themselves without feeling as if the choice was imposed on them (e.g. by removing everything on the menu except one dish). This is a form of *indirect persuasion*: we're not explicitly saying what we want the affected person to do, but we are adding information to guide one choice over another. The counterpart to this would be *direct persuasion*: explicitly stating which option we want chosen.

**This project employed 2 types of persuasion tactics that were each disseminated via 2 types of mediums**. The two types of mediums you are already familiar with: direct and indirect persuasion. This was examined by having the teacher give information by describing a set of data (see below) for direct persuasion. Indirect persuasion was tested by having the environment give the information, meaning instead of having the teacher give information on data, the students take in the information for themselves by noticing the data as a poster in the classroom. Now, I have been purposely vague about what data was shown to the students as there were actually *two* different sets of data that were shown to different classes, but they communicated the same idea.

If these curves look the same, that's because they are; they are the same set of data, but one is presented with a positive connotation (attendance is good) and the other with a negative connotation (absenteeism is bad). These are the two sets of data presented, to see whether *how* one presents data affects one's ability to influence.

So, in total, there are 5 classes: $\mathrm{i})$ teacher presents positive connotation data; $\mathrm{ii})$ teacher presents negative connotation data; $\mathrm{iii})$ classroom presents positive connotation data; $\mathrm{iv})$ classroom presents negative connotation data; and lastly, all compared to $\mathrm{v})$ a control with no intervention. (Presenting the same data with a different connotation is formally known as framing.)

As you can see, there are 5 lines shown on the graph: 4 lines, 2 blue and 2 red, to represent data collected and a green line that represents the aforementioned Gunn's 5% chronic absentee rate. The blue lines represent tardy and absentee data from January, to collect pre-experimental data. Red lines represent data collected in February, the month in which the experiment was set in motion.

This graph may be a bit intimidating to read, but it helps to realize that the x-axis does not represent time like in most line graphs, but rather the individual class models. This allows for easy comparison between months: if there were, say, a virus moving throughout the classes that caused a 3% increase in absences in all classes, it would look as if one line was shifted upwards, with more or less the same structure as the other.

These results are really surprising, as there was *no* improvement at all from any implementation of our stimulus. If anything, it got marginally *worse*, with a slight increase in tardies in all classes ($\approx$3-4% increase). This makes some sense when you consider that whenever someone commands you to do something or says you're doing something wrong, the first instinct is to disagree and defend your actions. This feeling is known as **reactance**, and it is likely what caused this mild increase in tardies.

What is extremely concerning, however, is the massive increase in absences seen in the "-/teacher" class (negative connotation data presented by teacher), which observed an astonishing **85% increase** in absences. That is equivalent to another **90 students** absent at Gunn, and an **additional 560 across all of the Palo Alto Unified School District**. This incites an interesting thought: **subconscious effects, such as reactance, can be amplified by other effects as well**. In this case, it was the *negativity bias* -- the idea that negative connotations tend to have an outsized impact compared to positive ones -- that amplified the reactance. For instance, say you are an avid fan of apples. If someone says oranges are better than apples, reactance will be incurred and you'll likely disagree. If someone says apples are worse than oranges, however, now there is this feeling of losing an opinion as well, amplifying the disagreement. Here, it is the difference between saying attending class is better than skipping it, and saying skipping class is worse than attending it.

**These results are highly specific**. Before you try to apply these ideas beyond the realm of this project, you should consider who was subject to this experiment. In fact, I witnessed this exact concept during my trials: my experimental model was inspired by Moore (2004), who ran an almost identical version of my experiment and found it to be effective for *university* students, while mine proved not to be effective for high schoolers. Further testing is needed, but the greatest takeaway from this experiment is that communication and persuasion can only be achieved when they are very specifically tailored to a specific audience.

If you want to read to a higher degree of depth on the matter, everything cited here and more can be found in my original paper.

**Originally, this project was never supposed to be about attendance**. Originally, I was intending on studying voting theory as I was (and still am) super interested in how individual behavior affects the collective, and how that can be leveraged. I was looking at networks and graph theory, synchronization, and other related topics, but I was especially looking at behavioral economics and prospect theory. Realizing I was about a year too early to be able to study any recent elections or voting processes, I honed in on the behavioral economics aspect of the research, and started to look for a new problem to address. I talked to Gunn High School's principal on school issues that could be examined, and attendance was a recurring theme. This was only corroborated by the *California Healthy Kids Survey*, a questionnaire that surveyed 9th and 11th graders each year, and it reported that 7% of freshmen and 11% of juniors cut class to prepare for another class alone in the month the survey was distributed (California Department of Education, 2017-18). So, I started directing my attention to different studies and papers that were already conducted to learn what ideas have been tried and tested, such as what Moore (2004) and Self (2012) did.

Very quickly, however, I realized I was probably going to have the same issue that I was going to have with voting theory: gathering original experimental data. Not that there is anything wrong with using pre-existing data, but I personally wanted to collect my own original data to analyze, especially as I wasn't sure if something like attendance would be internalized the same way in a high school population as in the more frequently studied college population (and as stated previously, there was in fact a discrepancy between Moore's and my findings due to the different populations studied). So, I began reaching out to different teachers to see who would be willing to help run the experiment in their classes. Doing so proved to be very difficult, as I needed a teacher who taught at least 5 classes of 20+ students with some absentees in each class, and who also had the class time to explain the necessary information to 2 classes. I was able to properly contact 2 teachers: for one I was able to collect data, and for the other I was not, due to the sudden COVID-19 outbreak.

Another thing that made this project difficult was that I had an incredible workload set out for this year. Between the 5 required academic classes, 2 electives, an after-school elective, club commitments, and a sport for two-thirds of the year, I just didn't have the time in my schedule to devote another 2+ hours of class time, which didn't even include time needed outside of class to research and write. Not to mention any extracurriculars I had in place as well. I had to schedule almost all of my meetings at 7am or earlier, or worse, do them entirely across email threads, which made communication ambiguous and difficult at times.

Regardless, this whole experience taught me so much about academia beyond the scope of what any high school classroom could have, and taught me that no matter how simple a question one has, being willing to ask it can lead to incredible results.

As stated at the top, this was done as part of the **A**dvanced **A**uthentic **R**esearch program that PAUSD provides to its two high schools as a means to introduce students to formal research and academic writing, well beyond what a standard English class essay or chemistry lab write-up teaches. Providing students with community mentors, experts, and connections, it fosters student growth via the students' own motivation to learn, creating an environment where projects, such as the former, can be created out of curiosity, and not by seeking a letter on a report card.

This post looks to describe an interesting property intrinsic to any and all quartic functions, and it has to do with the relationship between the functions' inflection points. Below is a Desmos graph of the general quartic $f(x)=Ax^4+Bx^3+Cx^2+Dx+E$, labeled along with its 2 inflection points $P$ and $Q$. A third point $R$ is also labeled: the point where the line through $P$ and $Q$ intersects $f(x)$ again, beyond $Q$. What we are interested in today is the ratio $\frac{PR}{PQ}$. Play with the graph below to vary $f(x)$ and see what happens to that ratio.

Quickly you will notice that for (most) non-zero $A$ values, $\frac{PR}{PQ}$ always remains at the rather famous constant, $\varphi=1.61803...$, the golden ratio. This may seem coincidental, but there is a rather nice way of proving that this ratio is *exactly* equal to the golden ratio.

The classic definition of $\varphi$ comes from a specific geometric construction of a rectangle.

In this *golden rectangle*, there are two rectangles to focus on: the large one with aspect ratio $\frac{a+b}{a}$, and the smaller red one with aspect ratio $\frac{a}{b}$. The golden ratio is given by $\frac{a}{b}$ when the small red rectangle has the *same* aspect ratio as the larger rectangle (made up of the blue square and red rectangle). Letting $\varphi=\frac{a}{b}$, and setting the ratios equal to each other nets us:

$\large{\varphi = \large{\frac{1 \pm \sqrt{5}}{2}}}$

The positive solution to this quadratic is the more well known value $\varphi$. Taking some variations of the previous equations can net other interesting relationships that $\varphi$ pertains to. For example, taking the third line from the derivation of $\varphi$ nets a recursive, cyclic definition of the golden ratio. Expanding out the relation gives another famous definition of $\varphi$.

An infinite descending fraction solely containing 1s. Taking a variation of the fourth line also gives an interesting appearance of $\varphi$.

An infinitely nested radical solely containing 1s. Notice, however, that the solution to the golden ratio has a negative counterpart as well: $1-\varphi=-0.61803...$. Although it may seem nonsensical to assign a negative value to many of the expressions we used in defining $\varphi$, this value holds many of the same properties that $\varphi$ holds on its own as well, and the reason we don't see it as often has to do with the volatility of the value in these iterated scenarios, but that's for another time.
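Both iterated forms described above (the fraction made only of 1s and the radical made only of 1s) can be checked numerically. Here is a minimal sketch; the function names and the truncation depth are my own choices for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # positive root of x**2 - x - 1 = 0


def continued_fraction(depth: int) -> float:
    """Evaluate 1 + 1/(1 + 1/(1 + ...)) truncated after `depth` levels."""
    x = 1.0
    for _ in range(depth):
        x = 1 + 1 / x
    return x


def nested_radical(depth: int) -> float:
    """Evaluate sqrt(1 + sqrt(1 + ...)) truncated after `depth` levels."""
    x = 1.0
    for _ in range(depth):
        x = math.sqrt(1 + x)
    return x


print(continued_fraction(40), nested_radical(40), PHI)
```

Both loops just apply one of the rearranged forms of $\varphi^2-\varphi-1=0$ over and over, namely $\varphi = 1 + \frac{1}{\varphi}$ and $\varphi = \sqrt{1+\varphi}$, and both settle on the positive root.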

First, let's look at how to find the inflection points of a quartic. An inflection point is a point along a function where its concavity changes. I.e., if you look at the tangent lines along a curve as you vary the input $x$, the tangent lines' slopes will change. The inflection points are found where the slopes' behavior switches between decreasing and increasing. Take the function $x^3$, for example.

Notice how as we let our value $a$ increase, the slope of our tangent line — the first derivative of $f(x)$ — decreases from $-1.5$ to $0$. But from $0$ to $1.5$, the slope begins to increase. This is all visualized in our graph $f'(x)$ which plots every point $x$ and the value of its slope at $f(x)$. One can clearly see $f'(x)$ tends in a downward manner initially, before rising again. And for $f'(x)$ to have a slope that's first negative (decreasing) then positive (increasing), it must have a slope of zero in between. So, our point where our concavity changes is when the slope of $f'(x)$ equals 0. In other words, when the second derivative $f''(x)=0$. Here we can see it clearly visualized at the solution $x=0$, which confirms all of our previous observations. Doing so for any general quartic nets us:

As this is a degree 2 polynomial, the quadratic formula quickly gives our two solutions for $x$ in general, which we will call $p$ and $q$.
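Written out, the two steps above look like the following. This is a reconstruction from the surrounding text, assuming the general quartic $f(x)=Ax^4+Bx^3+Cx^2+Dx+E$ from the start of the post:

$$
\begin{aligned}
f'(x) &= 4Ax^3 + 3Bx^2 + 2Cx + D \\
f''(x) &= 12Ax^2 + 6Bx + 2C = 0 \\
p,\ q &= \frac{-6B \pm \sqrt{36B^2 - 96AC}}{24A} = \frac{-3B \pm \sqrt{9B^2 - 24AC}}{12A}
\end{aligned}
$$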

This also explains why only most values of our constants have inflection points, as if the $9B^2-24AC$ term is negative, it results in an imaginary solution, meaning no inflection point is found within the real plane. With valid constants giving us solutions for our inflection points $P$ and $Q$ respectively, the line through them can quickly be written as:

The intersection point $R$ can be found solving for when $f(x)=g(x)$, or in other words, when $f(x)-g(x)=0$

$Ax^4+Bx^3+Cx^2+Dx+E-\frac{f(q)-f(p)}{q-p}(x-p)-f(p) = 0$

One can try to factor and work this out, but there is a much nicer approach that avoids working with this messy equation.

If we limit our transformations to purely scaling and translating our graph, all of our ratios will remain equivalent. So if we can find a set of transformations to make our work easier, we will still be able to prove our initial proposition, but in a much easier way. To (re)start, we're going to define a new function $h(x)$ that takes $f(x)$, and scales and moves it around as follows:

This may seem arbitrary, but keeping in mind what $p$ and $q$ mean, this transformation alters the graph in a rather specific and useful way. First notice these two key components in our transformation:

This results in shifting the graph over to the left $p$ units and down $f(p)$ units. Or more clearly, it takes our first inflection point $(p,f(p)) \rightarrow (0,0)$, the origin. We'll refer to the origin as $P'$. Now, let's look at the remaining components of the transformation:

Multiplying $x$ by $q-p$ results in *compressing* the $x$-axis by a factor of $q-p$. So, the x coordinate distance between our inflection points is condensed from a length of $q-p$ to a length $\frac{q-p}{q-p} = 1$. Just to keep our scaling consistent throughout $f(x)$, we also scale the $y$-axis down by a factor $q-p$, so we add an extra factor of $\frac{1}{q-p}$. This factor is almost purely for aesthetic purposes, as you will see it will preserve the structure of our graphs and make it easier to see our scaled copy of $f(x)$ in $h(x)$. So, as the difference in $x$ coordinates between $P'$ and $Q'$ is 1, $Q'$ will be at $(1,h(1))$. $R'$ will retain its same definition as $R$, differing only in that it is on our newly transformed function.

Notice how the two inflection lines are parallel. That is due to that extra factor of $\frac{1}{q-p}$ in $h(x)$, but note that the math that follows is not dependent on it.

It's worth noting that we don't actually know any of the constants that shape our new quartic $h(x)=ax^4+bx^3+cx^2+dx+e$ as they don't change according to our scaling factors (notice the change in capitalization; these new constants for $h(x)$ are separate from those of $f(x)$). However, we do know the solutions to $h''(x)=0$. Instead of using our function to find its second derivative like we did in our original approach, we are working backwards from our second derivative to narrow in on our function. Since we know where our inflection points are, we can rewrite our $h''(x)$ as a product of factors.

The factor of $12a$ comes from the leading term when taking the second derivative of any general quartic, as we saw in the original attempt to prove this. Expanding this expression and integrating twice gives us:

Notice I didn't add a new constant after the second integration, as that is equivalent to the $y$-intercept, which we know to be at $(0,0)$. Now that we have $h(x)$ in terms of itself, separated from $f(x)$, we can easily find the coordinates of $Q'$ and find $h(1)$.

Now we can create a new secant line $g(x)$ to pass through our two inflection points, $P':(0,0)$ and $Q':(1,b-a)$.

Now we can continue using our original method, which is to find all solutions to $h(x)-g(x)=0$. Only this time, our transformations should net a cleaner equation.

That $ax$ we factored out is our solution at $x=0$, or $P'$, which we used to construct the line in the first place. Similarly, because we used $Q'$ to construct the line as well at $x=1$, we can factor out an $x-1$ as well.

That last factor is the exact quadratic that we derived to define the golden ratio. Knowing that, we now have all of our solutions to the intersection points between our quartic and secant line.
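Collecting the steps of this second approach in one place may help. The following is a reconstruction from the surrounding text rather than a copy of the original equations. In particular, it assumes $b$ names the constant picked up in the first integration, which is what makes $Q'$ land at $(1, b-a)$.

$$
\begin{aligned}
h(x) &= \frac{f\big((q-p)x+p\big)-f(p)}{q-p} \\
h''(x) &= 12a\,x(x-1) = 12ax^2 - 12ax \\
h'(x) &= 4ax^3 - 6ax^2 + b \\
h(x) &= ax^4 - 2ax^3 + bx \\
g(x) &= (b-a)\,x \\
h(x)-g(x) &= ax^4 - 2ax^3 + ax = ax\,(x-1)\,(x^2-x-1) = 0 \\
x &\in \left\{0,\ 1,\ \varphi,\ 1-\varphi\right\}
\end{aligned}
$$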

The negative solution to the golden ratio here is the fourth point of intersection at $S:(s,h(s))$ with $s<0$. Now the last thing to note is that our 3 points of interest, $P'$, $Q'$, and $R'$, are all collinear. So, they can be thought of as a projection of the $x$-axis to a sloped line that scales how far they are spaced apart. However, since this is multiplicative, the ratios will be the same, so we only need to look at the ratios between their $x$ coordinates.

You can also quickly find other ratios of different lengths and find other interesting connections. Take $\frac{PQ}{QR}$, for example.

If you look at our defining quadratic $\varphi^2-\varphi-1=0$, it can be rewritten as $\varphi(\varphi-1)=1 \rightarrow \varphi=\frac{1}{\varphi-1}$. Completing our expression gives us:

Just as our golden rectangle previously foretold.
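For anyone who would rather test the claim than trust the algebra, here is a small numerical sketch that works directly on the original, untransformed quartic. The function name and the sample coefficients are my own. It finds $p$ and $q$ from $f''(x)=0$, divides the two known roots out of $f(x)-g(x)$, and measures $\frac{PR}{PQ}$ from the $x$ coordinates.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def inflection_ratio(A, B, C, D, E):
    """Return PR/PQ for the quartic f(x) = A x^4 + B x^3 + C x^2 + D x + E.

    D and E tilt and shift the curve and the line PQ together, so they do
    not affect the ratio; they are accepted only to mirror f(x).
    """
    disc = 9 * B * B - 24 * A * C
    if A == 0 or disc <= 0:
        raise ValueError("need A != 0 and 9B^2 - 24AC > 0 for two inflection points")

    # Inflection points: roots of f''(x) = 12A x^2 + 6B x + 2C, sorted so p < q.
    root = math.sqrt(disc)
    p, q = sorted(((-3 * B - root) / (12 * A), (-3 * B + root) / (12 * A)))

    # f(x) - g(x) = A (x - p)(x - q)(x^2 + u x + v), where g is the line PQ.
    # The line only touches the x^1 and x^0 terms, so matching the x^3 and
    # x^2 coefficients is enough to recover u and v.
    u = B / A + (p + q)
    v = C / A + (p + q) * u - p * q

    # R is the remaining intersection that lies beyond Q (the larger root).
    r = (-u + math.sqrt(u * u - 4 * v)) / 2
    return (r - p) / (q - p)


print(inflection_ratio(1, 0, -6, 0, 0), PHI)
print(inflection_ratio(2, -3, -5, 7, 1), PHI)
```

For $f(x)=x^4-6x^2$, the inflection points sit at $x=\pm 1$ and the leftover quadratic is $x^2-5$, so $R$ is at $x=\sqrt{5}$ and the ratio is $\frac{\sqrt{5}+1}{2}$.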

They are surfaces not covered in flat mirrors, but rather tessellated with the mirrored corners of cubes. Why is that? To find out, we first need to talk about Fermat's principle, and $90^\circ$ angles.

Fermat's principle, or the principle of least time, was an idea coined in 1662 by the mathematician of the same name, and it states that the path taken by any given ray of light is always the quickest one. Although this may seem obvious, it allows for many properties of light and optics to be derived from it. The one it helps demonstrate for us is the *Law of Reflection*: the angle at which light approaches a surface is the same angle at which it reflects.

Let's say we have a light source $S$, and we're reflecting it off a mirror (black) at point $R$, to have our ray reach an end point $E$. To show that the angle of incidence must equal the angle of reflection, we are going to create a mirrored copy of our end point, $E'$ (points $P$ and $Q$ are exclusively reference points). As $E'$ is a reflection of $E$ across the mirror, they are both equidistant from $R$, so we end up with two orange lines of equal length, $\overline{RE}$ and $\overline{RE'}$. Because $\overline{RE} = \overline{RE'}$, our original path of reflection $SRE$ has the same length as the new path $SRE'$. Note that the speed of the light isn't changing throughout our model, so the quickest path is simply the shortest one, and we only need to find the shortest path from $S$ to $E'$, which is clearly just a straight line (blue). We already knew that $\angle{ERQ} = \angle{E'RQ}$ by definition of the reflection $E \rightarrow E'$, and now that we know $\overline{SE'}$ is a straight line, $\angle{SRP} = \angle{E'RQ}$ as they are vertical angles. Combining these two equalities nets us $\angle{SRP} = \angle{E'RQ} = \angle{ERQ}$, which was what we wanted to show.
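The same conclusion can be reached by brute force: try every reflection point along the mirror, keep the one with the shortest total path, and compare the two angles. Here is a minimal sketch under my own assumed setup. The mirror lies along the line $y=0$, the source and end point are sample coordinates above it, and a ternary search stands in for "trying every point".

```python
import math


def path_length(x, source, end):
    """Length of the path source -> (x, 0) -> end, with the mirror on y = 0."""
    (sx, sy), (ex, ey) = source, end
    return math.hypot(x - sx, sy) + math.hypot(ex - x, ey)


def fastest_reflection_point(source, end, iterations=200):
    """Ternary search for the x on the mirror that minimizes the path length."""
    lo, hi = min(source[0], end[0]), max(source[0], end[0])
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if path_length(m1, source, end) < path_length(m2, source, end):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def angles_with_mirror(x, source, end):
    """Angles (degrees) that the incoming and outgoing rays make with the mirror."""
    (sx, sy), (ex, ey) = source, end
    incoming = math.degrees(math.atan2(sy, abs(x - sx)))
    outgoing = math.degrees(math.atan2(ey, abs(ex - x)))
    return incoming, outgoing


S, E = (0.0, 3.0), (8.0, 1.0)  # sample source and end point
x = fastest_reflection_point(S, E)
print(x, angles_with_mirror(x, S, E))
```

With these sample points, the straight line from $S$ to the mirrored point $E'=(8,-1)$ crosses the mirror at $x=6$, so the search should land very close to that, with the two angles agreeing.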

Although this seems like an obvious fact, knowing why it is true helps to understand how we will apply it to our bike reflector and corner cubes.

To understand why corner cubes are chosen as the structure of bike reflectors, looking at simpler cases always helps. Instead of looking at corners of cubes to see how light interacts with them, we can first work from the corner of a square and see what happens.

Notice how regardless of what angle the light is hitting the corner, the light reflected from the corner is always parallel to the ray entering it. We can prove this remains true for any angle $\alpha$ quite simply using some basic geometry.

We want to show that ray $\overrightarrow{M}$ is parallel to $\overrightarrow{N}$ given $\overrightarrow{M}$ intersects the corner at an angle $\alpha$ and that we have a true square corner that is a right angle. Filling in the givens, the rest follows nicely. The Law of Reflection gives the angle congruent to the initial $\alpha$, and the idea that all triangles' angles sum to $180^\circ$ gives the $90-\alpha$. The trick in proving this involves adding an auxiliary line as such and the rest follows.

We add another line parallel to one of the sides of our corner. This creates another right angle. Since we know that $90-\alpha$ makes part of the right angle, we know that $\alpha$ must make up the rest of the right angle, as $90-\alpha+\alpha=90$. By Law of Reflection we then know that there is a symmetrical angle of measure $\alpha$. Now since $\overrightarrow{M}$ and $\overrightarrow{N}$ both are attached to parallel lines at congruent angles, the only way that can happen is if $\overrightarrow{M}$ was parallel to $\overrightarrow{N}$ as well. Hence, a ray $\overrightarrow{M}$ has a reflected path $\overrightarrow{N}$ that exits parallel to its ray of incidence.

Moreover, we can show this only holds true for right angles using very similar logic. Setting our once right angle to $\theta$...

From our diagram, it's clear that for $\overrightarrow{M}$ to be parallel to $\overrightarrow{N}$, $\alpha=\alpha+\theta-90$, which when solving for $\theta$ gives $\theta=90$, our previous right angle.
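Both results (parallel for a right angle, not parallel otherwise) can be checked with a few lines of vector arithmetic. This is a minimal sketch under my own assumed setup: the corner sits at the origin, one mirror lies along the positive $x$-axis, the other is rotated from it by the corner angle, and the ray is assumed to hit the first mirror and then the second. Only directions are tracked, not positions.

```python
import math


def reflect(direction, normal):
    """Reflect a 2D direction vector off a mirror with the given unit normal."""
    dx, dy = direction
    nx, ny = normal
    dot = dx * nx + dy * ny
    return (dx - 2 * dot * nx, dy - 2 * dot * ny)


def corner_bounce(alpha_degrees, corner_degrees=90.0):
    """Return (incoming, outgoing) directions for a two-bounce path.

    The ray meets the first mirror at angle `alpha_degrees`, then bounces
    off the second mirror, which is rotated by `corner_degrees`.
    """
    a = math.radians(alpha_degrees)
    t = math.radians(corner_degrees)
    incoming = (-math.cos(a), -math.sin(a))
    after_first = reflect(incoming, (0.0, 1.0))
    # Unit normal of the second mirror, pointing into the corner.
    outgoing = reflect(after_first, (math.sin(t), -math.cos(t)))
    return incoming, outgoing


def cross(u, v):
    """2D cross product; zero when the two directions are parallel."""
    return u[0] * v[1] - u[1] * v[0]


for corner in (90.0, 80.0):
    for alpha in (10.0, 35.0, 60.0):
        incoming, outgoing = corner_bounce(alpha, corner)
        print(corner, alpha, round(cross(incoming, outgoing), 6))
```

Two reflections in a row amount to a rotation by twice the corner angle, so the outgoing ray is only turned around by exactly $180^\circ$ when the corner is $90^\circ$.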

All of the previous arguments can be applied to the 3-dimensional case by decomposing the ray of light into two other rays, and by showing that those two rays are parallel to the initial, that the composite ray is as well. With all of this together, it makes perfect sense why bike reflectors are corners of cubes: they send light back to its source. If you had a standard mirror, no light would return back to where it came from unless looking perfectly perpendicular to the mirror.
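In coordinates, the 3-dimensional case is short enough to write down directly. This is a sketch that assumes the three mirrored faces are lined up with the planes $x=0$, $y=0$, and $z=0$, and that the ray strikes all three of them. The function names are my own.

```python
def reflect3(direction, normal):
    """Reflect a 3D direction vector off a plane with the given unit normal."""
    dot = sum(d * n for d, n in zip(direction, normal))
    return tuple(d - 2 * dot * n for d, n in zip(direction, normal))


def corner_cube_exit(direction):
    """Bounce a ray off the three mutually perpendicular faces of a cube corner."""
    for normal in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
        direction = reflect3(direction, normal)
    return direction


print(corner_cube_exit((-1, -2, -3)))
```

Each face flips the sign of exactly one component of the direction, so after all three bounces every component has been flipped once and the ray heads straight back the way it came.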

If no light goes back to its source, say, a car's headlights, no light will hit the driver's eye to indicate that there is a bike up ahead (for exactly this reason, most reflectors actually have angles slightly larger than $90^\circ$, so that most light returns to its source while some scatters to an observer slightly above, below, left, or right of the source). These reflectors actually have a specific name: *retroreflectors*, literally meaning to reflect backwards. This concept has been leveraged to aid satellites, and indirectly the military. There's a reason why no stealth-based aerial technology has right angles: designers want to avoid creating an accidental retroreflector that can return radio waves.

Hopefully this gave insight into a seemingly arbitrary design choice in one of the most common bike accessories used today.

You should also watch it as these essays assume: a) that you know the movie so you don't walk into thousands of words of spoilers, and b) that you are familiar with some of the terminology introduced in the film as well.

To start, let's look at the end.

The ending could be a sort of allusion to the western genre as a whole (after all, the film is commonly referred to as a “western on wheels”). Having found his name and fulfilled his moral duties to the extent he feels is needed, Max leaves as the lone ranger he started as, moving on to his next wandering adventure. This fits with how Max blends into the crowd, illustrating how he is just like anyone in that crowd: seeking a purpose. Having now fulfilled that, he walks in the opposite direction from everyone else, no longer looking at the Citadel or Furiosa *for* hope, but walking away as he has already been *given* hope by them. This is furthered by the final quote presented by The First History Man at the film’s conclusion: “Where must we go, we who wander this wasteland, in search of our better selves?” This implies that the place to seek to better oneself doesn’t exist, or is unknown at the very least, because one of the central themes of the movie is that redemption is a self-realizing process, not one induced by a place or material object, but rather *conducted through* places, objects, and/or people. They can be mediums, but not the incitation of being redeemed. Nux, having abandoned his technological faith in Immortan Joe, is a key example of the process, as he takes the phrase, “Witness me!,” to be one not of sacrificing himself, but rather of remembrance, asking the wives he’s befriended to remember him for the person he was rather than for his death for them.

Returning to Max’s extraneous departure from the Citadel: having found redemption and reacquainted himself with his past, Max feels no obligation to stay. While the Citadel, the Wives, Furiosa, and Nux were his mediums of redemption, they weren’t a part of what he was redeeming: his quest was purely for himself and his past, ideas and memories unassociated with the movie’s setting; it’s the parallels that he inevitably sees in the other characters that draw him to help. In contrast, while the other characters, such as the aforementioned Nux, do have their own redemption arcs, these tie in directly to the immediate setting, which gives them motive beyond personal reason to stay; they’re not only redeeming themselves, but the land itself (note that these should be treated as separate actions; intertwined as they are, for Nux, Furiosa, and the Wives, redeeming the Citadel is more of a byproduct of their personal growth that was conducted through their reclaiming of the stronghold). While Max and the rest of the cast are all given hope from the present and their actions through the movie, what differentiates them is how they leverage this hope to recontextualize their lives: Max feels he has atoned and repented for his guilt-ridden past, while the rest have their futures recontextualized, now knowing that they no longer lead a life of forced repression. Their repression likely wasn’t what induced their guilt; rather, knowing that they had little hope of a better life at all is likely what made them feel powerless, and guilty that they hadn’t attempted to advocate for a more moral society.
But, because they only ever had the thought and never the physical opportunity, there isn’t anything in their previous lives that they would be able to perceive differently or acknowledge: they did what they could, and they only have the future to look forward to. We see this time and time again in the intermittent sequences in which the cast breaks from the intensity of the chase scenes: Furiosa is seeking to return to her dearly beloved home, the Green Place; the Wives are motivated, almost haunted, by the prospect of a haven away from Immortan Joe’s abuse (I phrase the Green Place like this as they have no concrete image of it, and are seeking the concept, not necessarily a specific place); Nux is seeking technological salvation, and to make his half-life existence meaningful (similarly, Valhalla is phrased like this, as he’s motivated by the concept, not the place; he’s only ever had it described to him and has never seen it). These are all motives that they envision in the future for themselves, not something they are gripping on to from the past. It just so happens that in the narrative, all of these instances of their individual sanctum sanctorum converge on the Citadel over the course of their character growth, as they realize that they can only seek comfort in reforming their personas; so while the direct action they take switches from materialistic to impersonal, the motivation remains the same: a prospect of the future. This is why everyone but Max staying is so crucial, as it maintains the consistency of the messaging and the individual character plots we’ve been presented with throughout the course of the movie. This film’s ending is merely a means to help individualize Max’s personal journey from the thematic and character development that underpins the work’s message and structure.

My last thought on the ending is that it could hint that this film wasn’t necessarily completely true, but rather a myth or legend passed on. The idea of there being a *first* history man implies there are others, too, all of whom pass on and carry stories such as the one told in *Fury Road*, similar to how in *Mad Max 2: The Road Warrior*, the narrator of the film wasn’t Max at all, but rather a survivor who was helped by him. So perhaps the ending quote explains not only why Max left, but potentially how everyone interpreted why he left after all they had endured: Max is a legend looking for his next outpost to wander into, to better another untouched aspect of his life that has yet to be revealed to us. Just as old cliché westerns reiterate time and time again the wandering, lone hero who’s been mythologized and solidified in glorious memory, perhaps *Fury Road* intended for Max to represent just that: a memory to inspire. This would still maintain all of the aforementioned messaging, as whether he was diegetically existent as the movie presents or just a fable, his story of proposing redemption still maintains the inspirational quality that he is remembered for anyway. It’s only the number of people it reaches that changes.

There is definitely a feminist connotation and tone throughout the film, but feminism is not the complete term based on its etymology. To be concrete from the start, according to Merriam-Webster, the term “feminism” is used to describe “the theory of political, economic, and social equality of the sexes; organizing activity on behalf of women’s rights and interests.” Even though the first definition is gender neutral, based on the word’s etymology and the more popularly associated second definition, this response will be referring to the second definition throughout, and it’s that gender-specific terminology that prevents classifying this film as a feminist one. While it’s perfectly reasonable to say that this film exhibits feminist messaging, it’s more justified to argue that this is a specific case, tailored to our own social context, being read into the film, rather than the broader, more powerful message at hand. Given our perspective as the audience on our own society’s development and history with gender inequality, a male lead antagonist opposed by extremely powerful women protagonists makes for prime conditions to further a feminist message. The issue, however, is that these exact messages function almost equivalently with the genders swapped: if Immortan Joe were a woman and the Wives and other protagonists were primarily male, the film could still be classified as feminist, which may seem contradictory until one acknowledges that the innate messaging surrounds not just women, but humanity as a whole. 
Take the standard woman in Immortan Joe’s world: they’re treated essentially as rape victims, sexually abused solely for their fertility; they’re reduced to objects with a single purpose, to forcibly bring life to Immortan’s subjects and the Citadel (this association of fertility and women is supported by the Vuvalini’s own Keeper of the Seeds, who seeks to plant and sprout a new life for the Green Place). So, the contrast of extremely powerful women in the nomadic Vuvalini, the Wives, and Furiosa, who end up conquering Immortan Joe, coupled with one of the final shots of the women releasing the “aqua-cola” to the people, makes for a very strong feminist message, and in that sense it is understandable that some critics characterize this film as feminist. However, it’s hard to ignore the parallels in the male characters’ portrayals throughout the film. Even before the title card has played, Max is similarly bound and exploited for his social benefit, which instead of fertility is a healthy supply of O-negative, universally donatable blood for Immortan’s War Boys, who themselves are exploited in their own ways. The War Boys, instead of being exploited for their ability to give life, are exploited for their ability to take it and execute Immortan’s visions of destruction. More important than the roles themselves, however, is their relation to the characters, as these roles even reverse for Furiosa and Max by the end: Max becomes inducted into the Vuvalini and even heals Furiosa, performing a blood transfusion on her of his own volition, while Furiosa is the one to ultimately kill Immortan Joe. 
This intermixing of the movie’s established gender roles further blurs the lines around which gender is being empowered more, by essentially neglecting the need for that category to exist at all. What defined women and their exploitation, and hence their motive to revolt in an empowered fashion, gets reassociated with the opposite gender, making it hard to define as a distinct trait; it is hence more apt to look at these as role-specific retaliations and revolts that are part of a greater collective body of both genders. Similar parallels can be found between major sacrifices in the movie: Nux, of the War Boys, sacrifices himself specifically to give the women a chance to lead a longer, more fulfilling life (his care for Capable implies this more than a wish to end Immortan’s soldiers’ lives), while many of the Vuvalini give their lives to end Immortan’s army, which stands diametrically opposed to their own morals of community (see the scene where Max convinces the group to redirect themselves towards attacking the Citadel). There’s a constant defying of expectations to mirror and reverse supposed gender-specific roles within the film to unify the genders as one. So while there is in fact a large component that empowers women, due to the parallels that the men’s stories provide, it’s more accurate to say that this is a movie about empowering humanity as a whole above the tyranny and unethical practices that Immortan Joe embodies (it’s not surprising that Immortan Joe’s name mimics the word “immoral” in addition to “immortal”). Even the name of the stronghold the protagonists reclaim, the Citadel, echoes not just the despot they’re overthrowing, but the entire symbol of oppression: it’s not just a fortress dominating people, it is an empire dominating the humanity of the people.

One fact that might make the message a bit more women-centric, though, is the naming of the Wives: Angharad (a Welsh name meaning “much loved one”), The Dag (Australian slang for “funny/amusing”), Toast the Knowing, Cheedo the Fragile, and Capable, each seemingly named after a core value they embody: charismatic/leader-like, comedic, wise, fragile, and confident, respectively. With each embodying an extremely distinct human quality, it’s hard not to see how they personify humanity as a group. Combined with their distinct light, (relatively) elegant clothing, the group is also clearly hope incarnate, which can be seen most distinctly during the nighttime Green Place car chase scene with the Bullet Farmer, where they are the ones holding the only light source among the crew of the War Rig, contrasting the deep, unsettling emptiness of the exhausted, corrupted Green Place. However, when interpreted like this, it’s worth noting how the *other* characters perceive the Wives: they are the ultimate prize. Immortan Joe seeks them to bear healthy children, so in a literal sense, the Wives are the ultimate material possession for him. Max, Furiosa, Nux, and the other aiding protagonists, however, are drawn to the Wives not for their physical traits but for what they represent; they want to protect the Wives because the Wives are the characteristics of humanity they all seek to restore in themselves, which further removes the need for gender as a category of the theme. When perceived as symbols, the Wives have a much greater presence in the film as the “MacGuffin” that everyone seeks to find and protect as a medium to try and redeem some humanity within themselves; the core values they represent end up being a universally sought-after set of appreciable qualities.

Furiosa was born into the Green Place as the daughter of Mary Jabassa of the tribe Swaddle Dog of the Vuvalini, and was taught and trained by her “initiate mother” K.T. Concannon. There, in her matriarchal society, she was taught to value her relationships and who she is in this tribe of mothers. However, she was abducted – stolen – from her home by Immortan Joe and taken to the Citadel along with her mother, who died within 3 days of captivity. Immortan Joe took in Furiosa as one of his new wives, seeking for her to bear him a healthy son. Unable to successfully impregnate her with a possible heir, Immortan Joe had no use for her as a breeder in his vault. Unwilling to watch a possible “resource” go underutilized, Immortan Joe gave Furiosa to one of his Imperators, the high-ranking military officers who command his invaluable War Rigs. Constantly exposed to war and automotive technology, Furiosa became an experienced and newly indispensable asset, going from one unable to give life to others to one able to quickly and efficiently take it. But experience only provides so much benefit without repercussions: she lost an arm in combat, forcing her to create a prosthetic extension to be able to continue serving. Once her half-life mentor had passed, she took his place and claimed the title of Imperator for herself as a leader in Immortan Joe’s ranks, becoming one of his most trusted commanding officers and couriers. Due to his trust and surplus of willing warriors, Immortan Joe assigned Furiosa to watch over the group she was once almost inducted into: his prized breeders, the Five Wives. Because she could relate to their physical abuse, the Wives were the first people Furiosa connected with since her capture. The abuse and effort she had to commit to, though, took a toll on Furiosa across her 7000 days of imprisonment. 
On many occasions, she had considered defecting and escaping in search of her lost Green Place to take refuge with her family. Holding as much power as she did with her War Rig, she saw an opportunity, a clear one no less, to retrace her path, escape Immortan’s grasp, and find her lost home among the barren deserts of a once fruitful land. She tries to leave, but not alone: Furiosa smuggles the Five Wives with her, knowing they need the Green Place just as much as she does.

The first bit of basic information is given directly through her identity speech upon regrouping with the Vuvalini for the first time. We learn about her compassion and love of her people through it: she never refers to herself by her name, but rather by what that name was associated with, and she does so with very specific tenses. She *was* once part of Swaddle Dog, but *is* one of the Vuvalini. Her initiate mother *was* K.T. Concannon, but she still *is* the daughter of Mary Jabassa. She talks as if she has outgrown her childhood culture of Swaddle Dog – she had to for the sake of survival – but she still talks as one of the Many Mothers, and as if she still wishes to be associated with and accepted into this group she still cares for (present tense phrases). Seeking reaffirmation, she hopes to show that she is still selfless in the cause of the group and the people within it, because without them, Furiosa’s name means nothing to her. As for her combat experience, one can only imagine that her inability to bear children is why she was selected to eventually become the driver of the War Rig, which is also how she lost her arm. *Fury Road*, within the first 5 minutes, before the title card, makes sure to establish the norms and social constructs that govern Immortan Joe’s Citadel, and from the very start, women have been boiled down to a single purpose: fertility. Her being assigned under the command of an existing Imperator would explain a lot: losing an arm in battle, driving the War Rig in the first place, and how she is able to be so prepared for combat. 
Take the scene where Max, still muzzled and chained to Nux, fights her: Furiosa is able to take Max down, be more than dominant in close-quarters combat with a wrench, disarm Max of his shotgun, and pull out a hidden handgun on him, while also preemptively pulling the kill switches on the War Rig so that even if Max proved victorious in their small skirmish, he couldn’t steal the resources. Not to mention the number of firearms Furiosa stashes in the War Rig, which Max reveals immediately after their scuffle, and her experience with a sniper rifle, which she uses to take out the Bullet Farmer later during the night chase. There’s no way that in her 20 years she could have advanced nearly as far as she did without already being valued by a highly ranked member of society, as we see by the number of War Boys forced to conform to their initial ranking and die in battle, or rise only as far as the military’s drum corps. It is difficult to connect the Wives and Furiosa, other than the fact that Furiosa was at one point almost inducted into the group of prized breeders. We know Immortan Joe trusted Furiosa immensely to execute his water, bullet, and resource runs that extended beyond the Citadel, hauling upwards of 3000 gallons of the prized resource “guzzoline” plus a surplus of water at a time. For how much of the Citadel he constantly monitored, to send someone beyond his immediate control, and even hand some power and influence to someone else, speaks immensely of the trust he placed in Furiosa. So, the fact that she’s a powerful commanding officer with a non-insignificant connection to the Wives suggests that this is how she got in touch with them, and how she communicated her plan to smuggle them out as well. Once she escapes the Citadel, we are well into the film’s plot, which concludes the biography.

While the above looked at three specific presences within the film, the three below are a series of more opinionated pieces discussing some of my favorite sticking points.

My favorite shot in *Mad Max: Fury Road* occurs for only a few frames during the Rock Riders chase scene after Furiosa’s exchange with them goes south (this shot is one that I don’t think I consciously took in until my 2nd or 3rd viewing of the movie). They are well into the chase at this point, with the Rock Riders’ signature bikers attacking the War Rig from all angles: explosives from the side, bikes jumping over the War Rig, armed bikers firing rounds whenever they can, all while Immortan Joe in his Dodge Fargo 1940 “BigFoot” monster truck is catching up to them from the rear. In a moment of panic and without a weapon, Furiosa lunges into her arsenal, grabs some kind of pistol from the bag, and in a single movement lines up her shot alongside Max at a Rock Rider angling along the side of their vehicle. This shot is only a few frames in length, but communicates so much about the character development of both Max and Furiosa and the relationship between them. We can actually map out their entire relationship visually, from strangers, to foes, to forced allies, to tight-knit friends. You can see that they are strangers because they never share a frame; for the first part of the movie, there is usually a cut or a shift in lens focus to direct the audience’s attention to either Max *or* Furiosa, never both at the same time. They become foes at their first interaction after the sandstorm, and this is clearly indicated via their exchange of weapons. Furiosa attempts to shoot Max with his own shotgun, and moments later, Max threatens to shoot Furiosa with her own pistol. This helps emphasize the turbulent power dynamic between the two: they are both just as capable, but perceive each other as threats, so they continue to try and disarm each other, and inevitably turn one’s own weapon against them. This continues in the following scenes where they become reluctant partners, driving the War Rig together. 
Since Furiosa is the only one who can drive the War Rig, she seems to be in control of the situation. But the first thing Max does before the War Rig departs is take and hold hostage every last gun stashed throughout the vehicle, bringing the power dynamic back into his favor. It’s similar to the back and forth they had as enemies, but instead of a constant duel of competitors, it’s now more of a dance of adversaries: they want to accomplish different goals and have different ideas in mind, but both are tense, as each is a potential threat to the other’s success if they choose to act on it, almost like mutually assured destruction, where one party stopping the other forces them to stop themselves as well. This is a change from the previous relationship, where they saw each other as direct impediments competing for the same goal; they now realize that their ideas aren’t mutually exclusive. So, as unlikely companions, Max doesn’t kill Furiosa, but completely disarms her. This tension eventually gets alleviated as the two are both placed in precarious situations, and Max ultimately returns a weapon to her to help defend the crew, visually showing the trust building. Now, the shot I picked as my favorite is where the two are finally visually cued as equals to one another: in capability, in trust, and as allies. In those few frames, showing the two aiming at a single target together in the same shot is all it took for this film, with minimal dialogue between the characters, to graphically establish their relationship. I also really like it *because* of how short it is: it forces us, the audience, to passively take in the shot, which helps us understand that the two characters are now strongly bonded without the film forcing their ever-changing connection on us and feeding it to us directly. 
These cues allow us to take in a lot of information very quickly, and this is one of the best examples the film offers of how it does so with a topic as nuanced as character interaction.

My favorite line in *Mad Max: Fury Road* is one of Furiosa’s final lines of the film, during the final action sequence in which the protagonists end Immortan Joe’s corrupt rule. As Furiosa forces her way up to the driver’s side window to finally kill Immortan Joe, she delivers a final send-off before he lets out one last scream: “Remember me.” It twists the War Boys’ fabled motto, “Witness me!”, which they call out before attempting a suicidal act in an effort to be permitted into Valhalla under Immortan Joe’s servitude by giving their life in a noble act of war. This phrase, “Witness me!”, evokes a tone of acknowledgement of the action; it is a phrase that asks those who know the speaker is as good as dead to acknowledge the act. It holds a connotation that what is being seen is merely recognized, but not appreciated, and that’s due in part to what the purpose of “witness” is: it’s an act to better oneself into the salvation of Valhalla, which they celebrate by normalizing the expendability of their current life in order to carry on into an uncertain next. Furiosa’s spin on the phrase reverses that tone completely by saying “remember” instead of “witness”, which, instead of asking for acknowledgement and self-betterment, asks for gratitude. While Furiosa isn’t the one who physically dies, and Immortan Joe doesn’t physically survive, it makes sense to view the scene symbolically as if that were how it was framed. Perceived this way, it turns the messaging around by asking “Immortal” Joe to live on forever with full knowledge that she is the one who restored the world, and of the sacrifices she had to endure to do so; she’s asking not to “die” in vain, but to live on in the memory of those she has worked so hard to help. 
While it has the exact same physical outcome that the old, willing sacrifices that were “witnessed” had, it changes the mentality and respect surrounding those who may not have had a choice in their death, to truly internalize the impact it had on those who have borne the trauma of the death in question.

This expression also immediately sets the ideology she seeks to replace Immortan Joe’s society with: people aren’t objects to watch be expended, but intelligent beings who deserve to live on forever through their accomplishments and the compassion they have shared with others beyond their self-interest. What makes this specific sentiment especially powerful is that the only memory Immortan Joe would have to remember Furiosa by is *her own* exploitation that he leveraged, so executing him with that memory is an extremely clear indicator that she wants those memories and experiences to die with him, so that no other person has to suffer through them, and that she is putting her own abuse behind her. It’s Furiosa’s climactic developmental moment: just as Max gave his name to Furiosa to accept his *past*, Furiosa buries her past to accept her new *future*. She wants Immortan Joe to explicitly know that he was the one who “killed the world”, by destroying, along with his life, the very world he built upon abuse and denigration. The final component that makes this line so inspiring is how it couples with the next important line, which Nux so emotionally delivers: “Witness me.” The very phrase that Furiosa has reformed dies gently with Nux; the transitory character, once a War Boy and now Vuvalini, takes the cursed connotation of the phrase “witness” and completely transforms it to accommodate his developed moral enlightenment. Nux, for his whole life, has been told that the world doesn’t care for him, but he cares for the world; the Citadel will continue to run with the powerful machines that the War Boys worship, and they are merely the small, insignificant cogs in a social machine. Even though they are aware they are “kamakrazee”, they know that their life has no more meaning than what Immortan Joe provides. 
But by the end, after all he’s undergone and tolerated, he pours all of the newfound emotion Furiosa has modelled throughout their adventure into his delivery, completely dropping the trochaic-accented meter of the War Boys’ chants, reminiscent of the famous Gregorian chant of death, Dies irae, before burying the last of the phrase along with the remaining scraps of Immortan Joe’s convoy, influence, and power. He knew he would die, and that there was no Valhalla for him to seek, but he hopes that accepting a willing death for those he loves will be remembered by the group as more than just a glorified suicide: as the act of a friend. That last utterance of “witness” gives the very contrast needed to emphasize the importance of Furiosa’s choice of wording.

“Remember me!” is such a powerful line because it turns the very sentiment that Immortan Joe so terribly exploited back onto him, while also building a new tone and society within just those two words: it breaks down the phrase that we, the audience, have been conditioned to hear so frequently, one that normalized death, and instead respects death’s consequences and implications beyond a single use case. Furiosa, in just a few syllables, was able to take an entire society, destroy it, rebuild it, and accept her sacrifices for it.

*Mad Max: Fury Road* contains many broad, strongly supported topics of discussion, one of which includes a central thematic subject that revolves around how redemption is a self-realizing process that cannot be induced forcibly by a person, place, or material object, but can be conducted through one as a medium to enlighten oneself beyond past perception.

Similar to the analysis of the ending above, every major character follows an arc in which their persona is fleshed out such that they are able to newly perceive their life and how they fit into the events that they have been witness to. Continuously, the film, almost forcefully, imposes redemption as a subject matter. The first instance in which redemption appears as a focus is within some of Immortan Joe’s first lines: “I am your redeemer! It is by my hand you will rise from the ashes of this world!” Many of his War Boys buy into this false ideology and into the notion that their salvation will be presented as an action that Immortan Joe provides directly to them. Max and Nux learn this first hand: literally being abandoned in the aftermath of a sandstorm, and symbolically “rising from the ashes” that Immortan Joe has incited with his chase of Furiosa. They then go on through their redemption arcs (Nux’s arc won’t enter a phase of redemption until further into the film), but this scene helps to recontextualize Immortan Joe’s previous dialogue to fit the theme described. It wasn’t Immortan Joe who presented them with their redemption, but it was his hand that presented the *opportunity* to redeem themselves; he was merely the objective that awarded redemption, not salvation himself. Another clear instance of the film’s portrayal of redemption is via Furiosa, in her deeply intimate and revealing conversation with Max, moments before her reunion with the Vuvalini. It’s here where she unveils not only why she’s seeking the Green Place for herself, but why the others are as well:

“And [the Wives]?”

“They’re looking for hope.”

“And you?”

“...Redemption.”

This distinction is important as it helps differentiate the purpose of the Green Place for the Wives and for Furiosa. The Wives specifically are holding onto something material, something tangible that directly impacts their lives and separates them from Immortan Joe’s exploitation. They need that promise to believe in a life worth living. Furiosa, on the other hand, has a much deeper connection to the Green Place. It was the home she was taken from, enslaved at the hands of the very force she was trained to avoid: men. Knowing the trauma her people had to endure, along with the acts she deeply laments committing in service of Immortan Joe, Furiosa desires to repent and atone, which she tries to do by helping the Wives and restoring herself to the Vuvalini. In the end, Furiosa does find acceptance and redemption, but it doesn’t come without cost: nearly all of the Vuvalini die in her and the Wives’ names. This reiterates that there aren’t manifestations of redemption that one can possess or acquire to enter a new state, but there are certain people and objects that can help guide one through their self-solace. If that weren’t true, the ending of *Fury Road* would have almost no impact; Furiosa’s story wouldn’t have a chance to resolve without the Vuvalini present, and Max’s leaving would make even less sense, as he would then be literally abandoning his redemption, the overarching plot motivator that has guided him through the film.

This is even reiterated in the film’s score. Take the track labelled “Redemption”, which plays during the previously described scene of Furiosa’s personal conversation with Max on their way to the Vuvalini and the Green Place. In it, a certain leitmotif is established along with a very distinct tone. The following scene’s track, “Many Mothers”, takes the leitmotif that “Redemption” established and emboldens it with a fuller orchestration, connecting the theme to the scene. Similarly, during the blood transfusion scene where Max desperately tries to save Furiosa from dying of blood loss, the associated track, “My Name is Max”, also contains the same leitmotif, this time emphasizing rests and silence to allow room for the music to breathe and react to Max’s dialogue, especially the key line the track is named for. If we were to take this connection in musical theme and apply it to the idea that redemption is something that can be contained within a person or group of people, like Max and the Vuvalini (characters who either left or died), then it would only reiterate the aforementioned point that *Fury Road*, along with many of its characters, would remain unresolved, and the conclusion would have none of the significance or impact that the movie clearly tried to convey. It’s more reasonable that the leitmotif shared between “Redemption”, “Many Mothers”, and “My Name is Max” represents the action that transpired *through or from* the relevant characters, permeating long after their presence with the others. The characters conducted and induced their own redemption; albeit via different mediums, they are the ones who incited it themselves, instead of being given it or attaining it via a sudden acquisition of someone or something.

While there are many other examples, especially with Nux and his relationships with Capable and Max, that exemplify the versatility and omnipresent nature of this theme, these are some of the most distinct ones that elucidate Miller’s conveyance of the nature of redemption as an intrinsically self-realizing and growing process.

Let me propose a question to start. Try to solve the following:

An infinite power tower which supposedly equals 2? Seems unlikely, but those familiar with these infinite-operation type problems likely know the strategy to solve this. Notice how there's a copy of our equation stacked on top of itself.

Since we know that equation in the box is equal to 2 because it's a duplicate of our original equation, we can easily reduce the problem down to something much more manageable.
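Written out, and assuming the problem is the tower $x^{x^{x^{\cdot^{\cdot^{\cdot}}}}} = 2$ as the description suggests (the typesetting here is a reconstruction, and I'm only considering $x > 0$), the substitution goes roughly like this:

$$
x^{x^{x^{\cdot^{\cdot^{\cdot}}}}} = 2
\quad\Longrightarrow\quad
x^{\left(x^{x^{\cdot^{\cdot^{\cdot}}}}\right)} = x^{2} = 2
\quad\Longrightarrow\quad
x = \sqrt{2}
$$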

So, raising $\sqrt{2}$ to itself over and over again equals 2. What other equations can we solve? Let's try this one.

Using the same strategy as before, this one is trivial.
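Spelled out, assuming the tower in question is now $x^{x^{x^{\cdot^{\cdot^{\cdot}}}}} = 4$ (again a reconstruction of the typesetting, with $x > 0$), replacing the exponent with 4 gives:

$$
x^{x^{x^{\cdot^{\cdot^{\cdot}}}}} = 4
\quad\Longrightarrow\quad
x^{4} = 4
\quad\Longrightarrow\quad
x = 4^{1/4} = \sqrt{2}
$$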

Which is… the same answer as before? How can $f(x) = \sqrt{2}^x$ iterated over itself equal both 2 and 4 at the same time? When in doubt, we can ask our calculator for some confirmation.

With some simple Python, we can get a pretty good approximation quickly.

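    import math

    def f(x):
        # x is the seed value sitting at the very top of the tower
        temp = x
        for i in range(1000):
            # add one more layer underneath: sqrt(2) ** (tower so far)
            temp = math.sqrt(2)**temp
        return temp

    print(f(1))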

The above code creates and evaluates a power tower 1000 numbers tall, giving us an approximation of `2.0000000000000004`, which is pretty close to 2. So, is 4 anywhere to be seen? Actually, yeah; our solution wasn't *completely* false. Notice that at the end of the script it says `f(1)`. That 1 is our *seed value*. Since our power tower can't be infinite if we want a calculable approximation, we need to cut it off after some height (in this case, 1000 numbers high). In order to do that, though, there has to be some number sitting at the top of that power tower. In this case it was 1, but it can be anything, since we constantly plug our output back into our input; in the case of an infinitely stacked power tower, that seed value is negligible. Let's see what happens if that is changed to `f(4)`.

    print(f(4))

Due to rounding, our script actually blows up to infinity with `f(4)` (in Python, this surfaces as an `OverflowError`), but we can reason this out by hand. If we start with 4, then our first iteration outputs $\sqrt{2}^4 = 4$. Since 4 is our output, that's our new input. But since 4 was also our seed value, it'll just constantly output 4 at every iteration. So 4 *is* a convergent value (as we can only calculate finite approximations) to the infinite power tower of $\sqrt{2}$, but only when 4 itself is the seed value. To better understand this, we can use a tool known as a *cobweb plot*.
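Before moving on, the rounding issue mentioned above can be seen directly. This is only a rough sketch: `iterate` is a helper name made up for this illustration (it is not part of the original script), and the 40-layer cutoff is an arbitrary choice that keeps the numbers from overflowing.

```python
import math

def iterate(seed, steps):
    """Return [seed, f(seed), f(f(seed)), ...] for f(x) = sqrt(2) ** x."""
    values = [seed]
    for _ in range(steps):
        values.append(math.sqrt(2) ** values[-1])
    return values

# Hypothetical helper, used here only to watch the seed 4 drift.
drift = iterate(4, 40)
print(drift[1])    # first iterate: already slightly above 4 in floating point
print(drift[-1])   # after 40 layers the gap has grown, though it is still tiny
```

By hand, $\sqrt{2}^4$ is exactly 4, but the floating-point value of `math.sqrt(2)` is a hair too large, so the very first iterate lands just above 4 and each later layer pushes it further away.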

Cobweb plots are a simple, elegant method to model iterative functions in the Cartesian plane by utilizing a seemingly mundane auxiliary function: $y = x$. What is probably one of the first graphs people are taught in school turns out to be one of the most helpful in modeling these complicated and otherwise impossible to view functions. Here's how to make a cobweb plot:

1) Plot the function to be iterated on (in this case, $f(x) = \sqrt{2}^x$) and $y = x$ together.
2) Pick a seed value to start iterating on.
3) Alternately draw vertical and horizontal lines between the two graphs for as many iterations as one needs.

Steps 1 and 2 should be clear enough as they're fairly similar to what we did above, but Step 3 might need a visual to go along with it.

Here's the first step's resulting plot:

Nothing too crazy. The green graph is our $f(x) = \sqrt{2}^x$, while the red graph is our $y = x$. For Step 2 we'll pick $x = 1$ as our seed value as we did before. This is where the magic of Step 3 comes in: from $x = 1$, we'll draw a vertical line from the red graph until it intersects at the green graph.

Now we have a line segment with points $(1,1)\rightarrow(1,f(1))$. This step is equivalent to plugging 1 into the top of our power tower, geometrically doing the operation of $f(x)$. Since we just drew a vertical line, we now draw a horizontal one from the green graph $f(x)$ until it intersects the red one $y = x$.

Now we have a new line segment from $(1,f(1))\rightarrow(f(1),f(1))$. You can probably see where this is going. Now that we have a new point at $x = f(1)$, we can draw a new vertical line until it hits the green graph, geometrically finding the value of $f(f(1))$, performing our repeated operation! We can do this series of horizontal to vertical lines as many times as we want to get as many iterations of our repeated function as we want!

Now you can probably see why this is called a cobweb plot, as we weave back and forth creating a net-like shape between the graphs (and it only gets more wild looking with different iterative functions!). Even in the previous graph where I set the seed value to be $x=-1$, our graph still quickly homes in on $x = 2$ for the $\sqrt{2}$ power tower, which happens to be exactly where our two plots intersect. This is a pretty narrow scope of our graph, though; let's zoom out and see more of this plot.
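The drawing procedure can be sketched in a few lines. This only computes the corner points of the cobweb (the helper name `cobweb_path` is hypothetical); passing the x and y coordinates to any plotting library would draw the web.

```python
import math

def cobweb_path(f, seed, iterations):
    """Return the corner points of a cobweb plot for f, starting at (seed, seed)."""
    x = seed
    points = [(x, x)]              # start on the line y = x
    for _ in range(iterations):
        y = f(x)
        points.append((x, y))      # vertical move to the graph of f
        points.append((y, y))      # horizontal move back to the line y = x
        x = y
    return points

f = lambda x: math.sqrt(2) ** x
path = cobweb_path(f, seed=1.0, iterations=50)
print(path[:3])    # (1, 1) -> (1, f(1)) -> (f(1), f(1))
print(path[-1])    # the web has closed in on the intersection near (2, 2)
```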

There's also an intersection at $x=4$! Even with all of this, I don't think it would be wrong to feel that $x=4$ should *not* be a solution to some extent. Even though it clearly shows a lot of the same characteristics that $x=2$ does, it still feels weird for this to be considered an answer, or at least to the same extent that $x=2$ is. For any seed $x<4$, our iteration converges to $x=2$, and for any seed $x>4$, it diverges. Only at exactly $x=4$ does our repeated power tower equal 4. To properly understand this, we'll need to utilize derivatives.

The classic definition of the derivative $f'(x)$ is a function that returns the slope of $f(x)$ at every point $x$. While this definition of the derivative isn't wrong, it is fairly limiting when only considered in the context of slopes. We can reframe the idea of a derivative not as the slope of a function at a point $(a,f(a))$ but rather as how *sensitive* the function is at the point $(a,f(a))$. This will be more apparent if we plot our $f(x)=\sqrt{2}^x$ in a new way.

You can generate the above plot with the following Python:

    import numpy as np
    import matplotlib.pyplot as plt

    def f(x):
        return np.sqrt(2)**x

    inp = np.linspace(-5, 5, 40)   # preimage points x
    out = [f(n) for n in inp]      # their images under f
    d = 10                         # vertical gap between the two number lines

    fig = plt.figure(figsize=(20, 4))
    axes = plt.gca()
    axes.set_xlim([-5.3, 5.3])
    axes.set_ylim([-6, 6])

    # inputs on the top line, outputs on the bottom line, joined by green segments
    plt.scatter(inp, [d/2 for n in range(len(inp))])
    plt.scatter(out, [-d/2 for n in range(len(out))])
    for n in range(len(inp)):
        plt.plot([inp[n], out[n]], [d/2, -d/2], color='green')

    plt.show()

This basically just took the $y$-axis of our Cartesian graph and rotated it $90^\circ$. The blue dots represent the preimage points $x$, while the orange dots represent their associated transformations under $f(x)$, with green lines connecting them. Just looking at it, it's consistent with our Cartesian graph as $f(x)$ never goes below 0, which makes sense as an exponential is always positive. The reason we want this graph is that it guides the intuition behind this idea of sensitivity and the derivative.

Notice the dots around $x=-3$ in the preimage (blue) points. They all get mapped and squished down near $.354$ under $f(x)$; they get tightly pressed together. But just *how* tightly pressed together are they? That's exactly what the derivative tells us! For a small change $dx$ in the input, we want to know how much the output changes, $df$. In this case, $f(x)=\sqrt{2}^x \rightarrow f'(x)=\sqrt{2}^x\cdot\ln{\sqrt{2}}$, and plugging in gives $f'(-3)=.1225$. This means that around $x=-3$, the ratio of how far apart nearby points end up under $f(x)$ to how far apart they started is $.1225$; in other words, the area around $x=-3$ appears to have shrunk *inward* by a factor of $.1225$. In the context of slopes, this ratio would be the slope of our tangent line, telling us how tall $df$ would be relative to $dx$. Since the derivative $f'(-3)$ is small, we can say that $f(x)$ is not very sensitive around $x=-3$, as a small change in input from $-3$ will still evaluate to about the same value.

Now let's look at the right half of the graph. Since $f'(4.5)=1.6486$, our previous logic implies that we'd expect points to stretch *away* from $x=4.5$ by a factor of $1.6486$. Just by looking at our plot, that's not so hard to believe. This means that our $f(x)$ is fairly sensitive around $x=4.5$, as a small difference in input from $4.5$ can lead to a big difference in evaluating $f(x)$.
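As a rough numerical check of this "stretch factor" reading, the sketch below compares the formula for $f'(x)$ with how far apart two nearby inputs actually land. The helper `local_stretch` and the spacing `dx` are my own illustrative choices.

```python
import math

LN_SQRT2 = math.log(math.sqrt(2))

def f(x):
    return math.sqrt(2) ** x

def f_prime(x):
    return f(x) * LN_SQRT2         # derivative of sqrt(2)^x

def local_stretch(a, dx=1e-6):
    """Ratio of output spacing to input spacing for two points dx either side of a."""
    return abs(f(a + dx) - f(a - dx)) / (2 * dx)

# Both columns should agree: below 1 is a shrink, above 1 is a stretch.
for a in (-3, 4.5):
    print(a, round(f_prime(a), 4), round(local_stretch(a), 4))
```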

So now we know that for a given $a$, if $|f'(a)| < 1$, it's a shrink, and if $|f'(a)| > 1$, it's a stretch (a negative derivative implies there's also a flip occurring, but we care only about magnitude). You can now kind of imagine what effects these have when we iterate over $f(x)$ for a long time: points will gravitate towards numbers that shrink the area around them, and be repelled away from numbers that stretch them. Now, relating this back to our original Cartesian plot, let's highlight the areas in which $|f'(a)| > 1$.

Well, look at that! Our $x=4$ solution is in our blue $|f'(x)|>1$ region, while our $x=2$ solution is not!

Connecting this all together now, we had two solutions to an iterative function, but only one of them was appearing in practically every case. When graphing its respective cobweb plot, we see that one solution lies in a non-sensitive region ($f'(2) = .6931$), while the other lies in a sensitive one ($f'(4) = 1.3863$). So what can we say about either solution? Since we know $f(x)$ is not sensitive to small changes around $x=2$ and moreover shrinks space around it, we know that $x=2$ is a **stable fixed point** of the iterative function $f(x) = \sqrt{2}^x$. It's stable in the sense that, because $f(x)$ isn't sensitive to small changes in the neighborhood of $x=2$, each iteration maps points closer and closer to $x=2$ due to the squishing effect of the derivative. But around $x=4$, where $f(x)$ is sensitive, each iteration tends to stretch and repel points away from $x=4$, even though it too is an intersection in our cobweb plot and analytically solves the equation. Hence, we call $x=4$ an **unstable fixed point** of the system. Just like we've described, while $x=4$ is valid for its own seed value, the slightest discrepancy pushes numbers away from it to either start approaching $x=2$ or diverge to infinity (like the rounding error in the Python script before!). If we quickly go back to our graph style with 2 number lines and perform the function iteratively there, we can really see what these pulls and pushes of numbers look like. Here's what the first 10 iterations of $\sqrt{2}^x$ look like:

You can really see how tight the points coil around $x=2$, and split away from $x=4$. Even with an initial value that starts so close to $x=4$, you can still see it slightly drift away from it at each iteration. This is why thinking of derivatives as measures of sensitivity is so important: the value of the derivative tells you how strong of a pull or push certain numbers have. Consistent with our findings, $x=2$ has a pulling effect around it with a small derivative, while $x=4$ has a pushing effect with its large derivative.
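The stable/unstable test boils down to checking the size of the derivative at a fixed point. Here is a small sketch of that check; `classify_fixed_point` and its tolerance are hypothetical names of mine, and the borderline case $|f'(x)| = 1$ is simply lumped in with "unstable" here even though it needs a closer look in general.

```python
import math

def f(x):
    return math.sqrt(2) ** x

def f_prime(x):
    return f(x) * math.log(math.sqrt(2))

def classify_fixed_point(x, tol=1e-9):
    """Label x as a stable or unstable fixed point of f using |f'(x)|."""
    if abs(f(x) - x) > tol:
        return "not a fixed point"
    return "stable" if abs(f_prime(x)) < 1 else "unstable"

for x in (2, 3, 4):
    print(x, round(f_prime(x), 4), classify_fixed_point(x))
```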

This is why we were also able to use cobweb plots: they are a geometric algorithm for solving $f(x)=x$, which makes sense, since if something is a fixed point, it should remain the same no matter how many times we apply the function to it. So when solving $\sqrt{2}^x = x$, you'll get the intersections we found earlier at $x=2,4$ (if you want to try and actually solve this equation, it requires the clever use of the Lambert W-function). That's why we were able to analytically solve for two different solutions, but only one kept popping up everywhere. This isn't limited to just power towers, though.

This type of relationship between stable and unstable fixed points is everywhere. Take the well-known infinite fraction below:

$1+\cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1+\cdots}}}$

By setting this equal to $x$, we can solve it just like we did before with the power towers.

$1 + \frac{1}{x} = x$

$x^2-x-1=0$

Using the quadratic formula, we once again get two solutions:

$x = \frac{1+\sqrt{5}}{2} = \varphi \quad\text{and}\quad x = \frac{1-\sqrt{5}}{2} = 1-\varphi$

The famous Golden ratio $\varphi$ and its underrated second solution. Still, it raises the question: how can a completely positive infinite fraction equate to something negative? Illustrating this with our cobweb and sensitivity regions will make this clear once again. Setting $f(x)=1+\frac{1}{x}$, we get…

A lot like $x=4$ when iterating $\sqrt{2}^x$, $1-\varphi$ is the unstable fixed point in the sensitive region, with numbers getting pushed away at every iteration, while $\varphi$ is the stable one which we quickly spiral down towards. We can quickly verify that $1-\varphi$ is a "valid" solution by plugging it into $1+\frac{1}{x}$ just like we did with $x=4$ into $\sqrt{2}^x$. Since $\varphi^2=\varphi+1$, we have $\varphi(1-\varphi)=-1$, so $\frac{1}{1-\varphi}=-\varphi$ and

$1+\frac{1}{1-\varphi} = 1-\varphi$

For its own seed value, $1-\varphi$ is valid, but I guess that's up to you if you want to equate a negative value to a positive infinite fraction.
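A quick numerical sanity check of both claims, as a sketch (the helper `iterate` and the seed $3$ are arbitrary choices of mine): iterating $1+\frac{1}{x}$ lands on $\varphi$, while $1-\varphi$ maps to itself but sits where the derivative is large.

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def f(x):
    return 1 + 1 / x

def f_prime(x):
    return -1 / x ** 2

def iterate(func, seed, steps):
    """Apply func to seed `steps` times."""
    x = seed
    for _ in range(steps):
        x = func(x)
    return x

print(iterate(f, 3.0, 100), PHI)                   # the iteration settles on phi
print(f(1 - PHI), 1 - PHI)                         # 1 - phi is still a fixed point
print(abs(f_prime(PHI)), abs(f_prime(1 - PHI)))    # shrink at phi, stretch at 1 - phi
```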

For those who are interested, try setting your seed value to a number of the form $-\frac{F_n}{F_{n+1}}$ where $F_n$ represents the $n$th Fibonacci number. The Golden ratio is closely tied to the Fibonacci numbers, so it may not be too surprising that they show up here. If you iterate over any number of this form, you'll eventually hit a point where evaluating the function becomes undefined. Try plugging in a few and watch the strange cascading effect happen.
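If you'd rather let the computer do the plugging in, here is a sketch using exact fractions so rounding can't hide the cascade. It assumes the convention $F_1 = F_2 = 1$; the function names are my own. Each step turns $-\frac{F_n}{F_{n+1}}$ into $-\frac{F_{n-1}}{F_n}$ until the value hits $0$, where $1+\frac{1}{x}$ is undefined.

```python
from fractions import Fraction

def fib(n):
    """n-th Fibonacci number with F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def steps_until_undefined(seed):
    """Iterate x -> 1 + 1/x exactly and count the steps until x hits 0.

    Only meant for seeds of the form -F_n / F_(n+1); other seeds may never stop.
    """
    x, steps = seed, 0
    while x != 0:
        x = 1 + 1 / x
        steps += 1
    return steps

for n in range(1, 7):
    seed = Fraction(-fib(n), fib(n + 1))
    print(seed, steps_until_undefined(seed))
```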

There are a whole host of functions that have interesting iterations as well. Let's try $f(x) = \cos(x)$

Since $f'(x) = -\sin(x)$, $|f'(x)|$ is always less than or equal to 1, so none of its fixed points can repel nearby points. In this case, we get a solution of $\approx .73909$, sometimes referred to as the Dottie number, which has its own set of interesting properties (for one, it's a transcendental number like $\pi$ and $e$!). If you are interested in a bit of why this has a fixed point, allow me to point you towards the Banach Fixed-Point Theorem for an interesting perspective that guarantees this fixed point. Let's try another function. What happens if we scale $f(x)$? Let's try $5f(x) = 5\cos(x)$.
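For the plain $\cos(x)$ iteration, a few lines are enough to watch the Dottie number appear. This is only a sketch; the seed of $1$ and the 200 steps are arbitrary choices.

```python
import math

def iterate(func, seed, steps):
    """Apply func to seed `steps` times."""
    x = seed
    for _ in range(steps):
        x = func(x)
    return x

dottie = iterate(math.cos, 1.0, 200)
print(dottie)                      # the fixed point of cos(x)
print(abs(-math.sin(dottie)))      # |f'| there is below 1, so it attracts
```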

We have not one, not two, but three different intersection points where $5\cos(x) = x$. But notice, all three of them lie within the sensitive region where the derivative of $5\cos(x)$ has magnitude greater than 1; they're all unstable. You can probably tell just by looking at it: it's a very chaotic diagram. This might not be unexpected for some of you though. If it doesn't converge to anything, but also doesn't diverge, why wouldn't it just randomly jump around ad infinitum? Well, let me just present another function to explain why that isn't a given. Let's make a cobweb plot for $f(x) = 3.2x(1-x)$.

Here we have 2 intersection points, both of which are in the sensitive region, so points should not converge to either of them (excluding the fixed points' own values), and that's exactly what we see, with no definite attraction to any one fixed point. Yet, it's not like our iterations are randomly moving. In fact, just looking at the diagram, it's quite predictably going in a cycle between two $x$-values of $\approx .513$ and $\approx .8$. The difference between $5\cos(x)$ and $3.2x(1-x)$ is how each interacts with our seed value. The former has a quality known as *sensitive dependence on initial conditions*, more commonly referred to as the Butterfly effect: a small change in the seed value can produce wildly different outputs in the long run, just like how a butterfly's wings can produce a hurricane years later halfway across the globe. This is a common property of what is aptly deemed *chaotic behavior*. The latter function, while it may not have a convergent value, exhibits neither Butterfly effect-esque behavior nor chaos under iteration, and instead settles into this cycle. As a kickstarter for those interested, $3.2$ in the latter function was not an arbitrary choice: it comes from a family of iterative functions of the form $rx(1-x)$ known as the logistic map. There's so much to talk about there that it will likely be its own post later, but that's for another day.
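The two-value cycle of $3.2x(1-x)$ is easy to reproduce. In the sketch below, the seed $0.3$ and the 1000 warm-up steps are arbitrary choices of mine; the closed form used for comparison is the standard formula for the period-2 points of $rx(1-x)$.

```python
import math

def logistic(x, r=3.2):
    return r * x * (1 - x)

x = 0.3
for _ in range(1000):              # discard the transient
    x = logistic(x)
cycle = sorted([x, logistic(x)])   # the two values the iteration bounces between
print(cycle)

# Period-2 points of r*x*(1-x): (r + 1 -/+ sqrt((r - 3)(r + 1))) / (2r)
r = 3.2
root = math.sqrt((r - 3) * (r + 1))
print([(r + 1 - root) / (2 * r), (r + 1 + root) / (2 * r)])
```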

I want to go back to the Golden ratio problem, as there's a neat extension to a more general iterative approximation technique that can be more applicable to problem solving. It is known as the **Newton-Raphson Method**, which can (usually) home in on roots of a polynomial quite efficiently.

The idea is fairly similar to what we did before, but since it's catered to finding roots of polynomials, its iterations have a modified step as we're looking for intersections with the $x$-axis instead of the line $y=x$. Here's the basic idea: 1) Pick an initial seed value $x_0$. 2) Draw a vertical line (like we did with the cobweb) until we hit the function $f(x)$. 3) Draw the tangent line of $f(x)$ at $x_0$, and see where it hits the $x$-axis. Call this new point $x_1$. 4) Repeat the process as many times as you'd like for as accurate an approximation as you'd like up to some $x_n$. Here's an example geometric interpretation for this method with $f(x) = x^2 - 13$.

I had to zoom in extremely close for this graph because, as you can see, just two iterations from a seed value of $x_0=5$ already give a really accurate approximation of one of the roots of $f(x)$, and you wouldn't be able to see those lines unless magnified by this much. Let's work out a general iterative formula for this method. We first start with some $f(x)$. Just by using derivatives and the definition of a line passing through the point $(x_n,f(x_n))$ for our tangent, we can solve the equation

$f'(x_n)(x-x_n) + f(x_n) = 0$

to find the next point $x_{n+1}$ to continue iterating on (as it should be the $x$-intercept of that line like the instructions describe). Doing some basic algebra shows that:

$f'(x_n)(x-x_n) = -f(x_n)$

$x = x_n - \frac{f(x_n)}{f'(x_n)}$

So, tidying things up, for a given (continuous and differentiable) function $f(x)$, we can approximate its roots by iterating with some initial $x_0$:

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
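The recurrence translates almost word for word into code. This is a bare-bones sketch (the name `newton_raphson` and the fixed iteration count are my own choices) with no protection against a zero derivative.

```python
def newton_raphson(f, f_prime, x0, iterations=10):
    """Return the list of iterates x_0, x_1, ..., x_iterations."""
    xs = [x0]
    for _ in range(iterations):
        x = xs[-1]
        xs.append(x - f(x) / f_prime(x))   # x-intercept of the tangent line at x
    return xs

# f(x) = x^2 - 13 with seed x_0 = 5
xs = newton_raphson(lambda x: x**2 - 13, lambda x: 2 * x, x0=5.0)
print(xs[:3])                # the first two iterations already get close
print(xs[-1], 13 ** 0.5)
```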

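Trying this out with our $f(x) = x^2 - 13$, our recurrence relation after some simplifying becomes

$x_{n+1} = x_n - \frac{x_n^2-13}{2x_n} = \frac{1}{2}\left(x_n + \frac{13}{x_n}\right)$

Or if you liked our previous notation, we can rewrite this as a function and iterate over

$g(x) = \frac{1}{2}\left(x + \frac{13}{x}\right)$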

Since this is in function form, we can use our old friend the cobweb to solve this for us.

It nicely finds $\sqrt{13}$ as a solution, just as we would expect. However, notice that there are two intersection points that lie *outside* of the sensitive region. One we found at $x=\sqrt{13}$, and the other is actually the second solution to $x^2-13=0$ at $x=-\sqrt{13}$. Our seed value matters significantly more in this case, as our iteration will now target only one solution depending on which zero of $f(x)$ the seed is closer to, and this only becomes more important the more zeroes our function contains.

Even with all those caveats, notice what we just made! Our iterative function $g(x)$ is essentially a square root estimator, but with no exponents! While it's nice and convenient just to use exact answers, having decimal approximations is just as useful, especially for computers that don't have unlimited memory to store exact answers. For any positive number $n$, we can calculate $\sqrt{n}$ as accurately as we'd like by iterating over the function

$g(x) = \frac{1}{2}\left(x + \frac{n}{x}\right)$

as many times as we want. There are some exceptions where certain seeds can infinitely cycle or actually result in no subsequent $x_{n+1}$ (imagine a horizontal tangent line), but this method is incredibly useful, as it doesn't just extend to square roots, but to any function whose roots you want to approximate using the aforementioned formula

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Here are a few other iterative functions for other roots of $n$:

$\sqrt[3]{n} \rightarrow \frac{1}{3}(2x+\frac{n}{x^2})$

$\sqrt[4]{n} \rightarrow \frac{1}{4}(3x+\frac{n}{x^3})$

$\sqrt[p]{n} \rightarrow \frac{1}{p}((p-1)x+\frac{n}{x^{p-1}})$
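The general $p$-th root formula makes for a compact estimator. In this sketch the function name `nth_root`, the seed of $1$, and the 60 iterations are my own assumptions, and it is only meant for positive $n$.

```python
def nth_root(n, p, seed=1.0, iterations=60):
    """Approximate the p-th root of a positive n via Newton's map for x**p - n."""
    x = seed
    for _ in range(iterations):
        x = ((p - 1) * x + n / x ** (p - 1)) / p
    return x

print(nth_root(13, 2))   # square root of 13
print(nth_root(27, 3))   # cube root of 27
print(nth_root(16, 4))   # fourth root of 16
```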

Going back to our Golden ratio iteration, we can rewrite it under the fixed point formula $f(x)=x\rightarrow 1+\frac{1}{x}=x$. If you multiply that through by $x$ and rearrange, we get a quadratic $x^2-x-1=0$. That's a quadratic we can solve with the Newton-Raphson Method! Plugging it into the formula, we get a function to iterate over as

$g(x) = x - \frac{x^2-x-1}{2x-1} = \frac{x^2+1}{2x-1}$

And sure enough, it works! The advantage of using the Newton-Raphson Method in this case is that we no longer have to worry about unstable fixed points, as all of our solutions lie outside the sensitive region. So even if we lose some insight into the nature of each solution, we consistently find each of the solutions $\varphi$ and $1-\varphi$ to an accurate decimal expansion with the right seed.
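Here is a sketch of that iteration, using the Newton-Raphson map for $x^2-x-1$, which simplifies to $\frac{x^2+1}{2x-1}$. The seeds $1$ and $-1$ are arbitrary picks of mine on either side of $x=\frac{1}{2}$, where the tangent line is horizontal and the map is undefined.

```python
import math

def g(x):
    """Newton-Raphson map for x**2 - x - 1."""
    return (x * x + 1) / (2 * x - 1)

def iterate(func, seed, steps=20):
    """Apply func to seed `steps` times."""
    x = seed
    for _ in range(steps):
        x = func(x)
    return x

phi = (1 + math.sqrt(5)) / 2
print(iterate(g, 1.0), phi)          # a seed right of 1/2 finds phi
print(iterate(g, -1.0), 1 - phi)     # a seed left of 1/2 finds 1 - phi
```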

Iteration and fixed points are among the prime topics in dynamical systems and in describing much of the world around us. We discussed the Newton-Raphson Method of root finding, but there are many other recurrence relations for approximating roots of functions, each catered to its own purpose with different convergence rates and failure cases. Moreover, root finding is just a single *use* of the Newton-Raphson Method, for it is also well known in optimization as an alternative to gradient descent. Solving systems of differential equations comes down to finding the equivalent of a higher-dimensional fixed point, or in other words, an eigenvector: a vector (which is just an object that can encode more than one number and hence dimension) which doesn't change direction under the transformation describing the system of equations. Markov chains are another extremely important occurrence of fixed points under iteration: after a long series of transitions between states, we can make an overarching statement about the system as a whole reaching an *equilibrium state* where the probabilities of being in each state are expected to remain the same (going back to that idea of eigenvectors!). Synchronization is a prime example of a fixed point under iteration: even if a group of fireflies begin out of phase with one another, their coupling over time will pull them into a single large group with one cyclic, uniform behavior. The Mandelbrot set (and all of the Julia sets, for that matter) arise out of the fact that some complex numbers remain bounded under repeated iteration of functions $f(z)=z^n+c$ (sometimes being bounded to multiple values at once!). There are even entire studies dedicated to this. *Invariant theory* studies mathematical groups and polynomials to see how they remain unchanged under transformations.
Almost all of chaos theory is about stability (or the lack thereof) over long periods of time (Nicky Case has a great introduction to attractors), especially when what should be simple, predictable equations are not (we already talked about the logistic map, but see it illustrated in the bifurcation diagram; it is particularly interesting for it appears in the most unlikely of places). We saw some chaotic behavior earlier, and the way I deduced it was chaotic was with a quantity all iterative functions and maps have known as the Lyapunov exponent, which is itself interesting to look at for how a function's behavior changes along with its Lyapunov exponent. For fixed points alone, there are hundreds of theorems dedicated to analyzing them (the most notable of them being Brouwer's Fixed-Point Theorem).

If you are interested in anything covered here, popular math YouTube channel 3Blue1Brown made not one but two videos discussing this idea of derivatives and infinitely stacked operations with the exact puzzle I posed at the start of this post. Their first video is what originally inspired me to look into these objects more when I first saw it a couple years back. Their animations do wonders compared to what any text post can do, so please do check them out if you want a more visual approach to these processes along with some additional justification for solutions to iterative processes.

Fixed points appear everywhere, and I hope this shared a few insights into how they can appear, deceive, and approximate even the most out there of expressions.

Brief summaries are at the bottom of each section if you want a quick refresher for anything above, but first, some review.

This is also all written more formally with other examples in this paper.

**Markov chains**, in essence, are a way to model a process that randomly jumps between different outputs, where each output is said to have some probability to jump to other outputs. They're sort of like rolling dice, but the likelihood you roll any number is only dependent on the number you rolled last. It might help to describe this with an example. Let's say you want to know what the weather will be in 5 days: will it be sunny or rainy? Fortunately, the weather doesn't vary too much, so if it's sunny one day, it's likely to be sunny again the next day with 80% chance. If it's rainy, it will likely be rainy again too, with, say, 60% chance. This can be shown quite succinctly in a little diagram:

This is our actual Markov chain, showing the two **transition states**, S(unny) and R(ainy), with their associated transition probabilities. However, we can't actually *do* much with just a picture alone. So, we can rewrite these probabilities and encode them in a matrix:

$M = \begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix}$

You can think of each row as a different state for current weather, and the columns as probabilities for different states of tomorrow's weather. In this case, I have written row 1 and column 1 to indicate sunny days, and row 2 and column 2 to be rainy days. That's why entry $a_{1,1}$ in row 1, column 1 shows 80%, because if it is sunny today (row 1), we expect an 80% chance for it to be sunny tomorrow (column 1). Similarly $a_{2,2}=.6$, as if it's rainy today, we expect a 60% chance for rain again. $a_{1,2}=.2$ means that if today is sunny, then there is a 20% chance of rain tomorrow, and for completeness sake, $a_{2,1}=.4$ indicates a 40% chance for it to be sunny given today is rainy.

What we've built here is known as a **transition matrix**, as, well, it's a matrix that shows transition probabilities; it's a matrix that shows how likely we are to jump from one state to another. In this case, our states are the different weathers: sunny or rainy. So, how does this help us answer our original question of what the weather will be in 5 days? Well, let's first try to find the weather 2 days from now. We know how to model 1 day from now, and since these are probabilities, wouldn't it make sense just to multiply our matrix by itself?

$M^2 = \begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix}\begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix} = \begin{bmatrix} .72 & .28 \\ .56 & .44 \end{bmatrix}$

Our probabilities have changed a little bit. Now it's saying, if today is sunny, there is a 72% chance it will be sunny 2 days from now. The reason why multiplying our matrix by itself gives this result is that the mechanics of matrix multiplication essentially ask: "What is the probability of getting from one state to another in two steps?" For example, sunny-to-sunny in two days is $.8\cdot.8 + .2\cdot.4 = .72$: either it stays sunny both days, or it rains tomorrow and clears up the day after. If you work out the multiplication itself, it might be clearer, but the way I like to think about it is in terms of transformations of space. For those familiar with a bit of linear algebra, we can think of our matrix $M$ as a collection of basis vectors that scale space (where our vectors in space can be thought of as a collection of starting states, i.e. the initial observed proportion of sunny days to rainy days). Applying $M$ once transforms space, and we can then take that as a new "default" or "unit". If we apply $M$ again to our basis vectors, it has the effect of transforming space once again. This can be thought of as our standard, independent probability multiplication, but instead of changing a singular probability (i.e. dice value), we are changing two (likelihood of sunny *and* likelihood of rainy days).

With this in mind, our question is easy. It boils down to what $M^5$ is.

$M^5 = \begin{bmatrix} .67008 & .32992 \\ .65984 & .34016 \end{bmatrix}$
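If you'd like to check these powers yourself, here is a dependency-free sketch. The helpers `mat_mul` and `mat_pow` are written just for this example; a library such as NumPy would do the same job.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, power):
    """Raise m to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result

# Row/column 1 is sunny, row/column 2 is rainy.
M = [[0.8, 0.2],
     [0.4, 0.6]]

for days in (1, 2, 5):
    print(days, [[round(p, 5) for p in row] for row in mat_pow(M, days)])
```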

So if today is sunny, we look at row 1 and can expect a 67.008% chance of sunny weather in 5 days, and if it's rainy, row 2 shows a 65.984% chance for sunny weather. Nice! But you might be looking at that matrix and notice that row 1 and row 2 are *almost* the same. Watch what happens if, instead of checking just 5 days into the future, we look an infinite number of days ahead:

$\lim_{n\to\infty} M^n = \begin{bmatrix} .\overline{6} & .\overline{3} \\ .\overline{6} & .\overline{3} \end{bmatrix}$

The rows *do* become the same. So, if we were to pick a random day far, far into the future, we can expect it to be twice as likely to be sunny as rainy regardless of today's weather. There are two important interpretations of this fact. First, going back to our transformation of space idea, this **equilibrium state** is an eigenvector (specifically for $\lambda=1$) of our transition matrix $M$. Meaning, it is the solution to the matrix equation $vM = v$ where $v$ is a row vector (here, $v=\begin{bmatrix} .\overline{6} & .\overline{3} \end{bmatrix}$). The second, and more important, way to think of this equilibrium state is that it is the final, or **stationary**, distribution of sunny and rainy days. That is, if you took the fraction of $\frac{\textrm{Sunny Days}}{\textrm{Total Days}}$, you'd expect it to approach $\frac{2}{3}$ as time went on, and $\frac{\textrm{Rainy Days}}{\textrm{Total Days}}$ to likewise approach $\frac{1}{3}$.
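The equation $vM = v$ can also be checked by brute force: start from any distribution and keep multiplying by $M$. A minimal sketch, where the helper name `step` and the 100 days are arbitrary choices:

```python
def step(v, m):
    """One day of weather: multiply the row vector v by the transition matrix m."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

M = [[0.8, 0.2],
     [0.4, 0.6]]

for start in ([1.0, 0.0], [0.0, 1.0]):   # certainly sunny / certainly rainy today
    v = start
    for _ in range(100):
        v = step(v, M)
    print(start, v)                       # both end up near [2/3, 1/3]
```

Both starting points forget where they began, which is the "regardless of today's weather" part of the claim.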

To summarize, here are a few important concepts about Markov chains:

- A Markov chain is a random process that describes the ability to switch between multiple states.
- A Markov chain's probability for any future state depends only on the current state (this is also known as the Markov property).
- Each row of a Markov chain's transition matrix must sum to 1 (something has to occur at each time step for each state, even if that means not changing states).
- Well-behaved Markov chains (like our weather example, where every state can be reached from every other and the chain isn't locked into a fixed cycle) will eventually reach an equilibrium state that describes the final distribution of states over a long time.

Markov chains are extremely powerful tools to model dynamics with multiple states due to their above properties, but some of their uses from chaos to disease modeling deserve their own post another day.

If you understood this so far, you've got the hardest part of Markov chain Monte Carlo methods under your belt. That being said, we are still missing the second MC of MCMC.

**Monte Carlo simulations** are probably the closest you'll ever get to the scientific version of guess-and-check. The idea is that if there is something that's too hard to calculate, you do a bunch of mini, random experiments to obtain data that can give you numerical approximations. It's very akin to Bayesian thinking: the more data you give to your approximation, the better you can "update" your approximation to be more accurate and confident. As with all things, let's do a quick example.

If I hand you a coin, you probably would assume it's a fair coin: 50/50 chance for either heads or tails. But how could you verify that it is indeed a fair coin? Well you could flip it and see what it turns up as. Heads! "It must be an unfair coin as it flips heads 100% of the time!" said no one ever. Of course a single data point isn't nearly enough to draw any conclusions, so you need to flip it again. Heads again! Definitely weighted, right? Even if you get only heads twice in a row, that still isn't conclusive. You need to flip the coin a lot of times. By a lot, upwards of hundreds for a reasonable guess at the balance of the coin, and upwards of thousands for an ideal approximation. For all you know, those first 2 heads could be in a much larger sequence of flips you have yet to unfold:

`H-H-T-H-T-T-H-T-T-T-H-T-H`

Just like that, our coin gets significantly closer to that 50/50 split within just a few additional flips.

Each one of our data points was a flip in this case, and we call those data points **samples**. The important part to note, though, is that there is a sense of randomness in each sample. The idea behind a Monte Carlo simulation is that even if our sampling method is random, the more samples we take, the closer their average gets to the true value (think the Law of Large Numbers). This is why the more samples we take, the more accurate our estimations become. This is a lot like unbiased sampling in research studies: you can't reasonably survey everyone in a population, so you take a smaller, random sample in the hopes that it will be representative *enough* to make reasonable conclusions about the larger population.
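The coin experiment is simple to simulate. In this sketch the function name, the fixed random seed, and the flip counts are all my own choices; the point is just to watch the estimate tighten as the number of samples grows.

```python
import random

def estimate_heads_probability(flips, p_heads=0.5, seed=0):
    """Flip a simulated coin `flips` times and return the observed fraction of heads."""
    rng = random.Random(seed)                       # fixed seed so runs are repeatable
    heads = sum(rng.random() < p_heads for _ in range(flips))
    return heads / flips

for flips in (10, 100, 10_000, 1_000_000):
    print(flips, estimate_heads_probability(flips))
```

Passing a different `p_heads` simulates a weighted coin, and the same estimate recovers its bias.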

Again, just to summarize a few details:

- Monte Carlo simulations use random sampling to get numerical estimations for results that are otherwise hard to calculate.
- The more samples/trials we take, the more accurate our results.
- While taking more samples is more accurate, it also becomes less efficient to compute and gather results, so you have to strike a balance between more accurate results and quicker results.

With all that out of the way, let's put it all together into one cool algorithm.

So far, we've sampled from things that are relatively easy to run trials on. Flipping a coin and rolling a die are nice to run trials on as they can both be modelled by a nice uniform distribution (even for weighted dice/coins, by partitioning the uniformness). This is due to the niceness of a **discrete** distribution, where there is only a finite number of results our black box can output. Often, though, we have a **continuous distribution** where we don't have probabilities for individual results, but rather for a range of results. To get the gist of it, take the uniform probability distribution on $[0,1]$. What's the probability that you pick $0.235326…$? Out of an infinite number of possibilities, picking a single, specific number has probability 0. BUT, the probability of picking a number in $[.25,.75]$ is exactly $.5$, as we're picking from half of our total range. This is the idea of **probability density**. So, you can imagine that more complicated distributions (especially those taken from real-life data) can be a lot more difficult to get samples from, or to properly know the densities of. Here's where MCMC comes in.

**Markov chain Monte Carlo** methods combine two important aspects of the two concepts the name implies: a Markov chain's equilibrium distribution and Monte Carlo simulation's random sampling. Here, we make a Markov chain whose stationary distribution is *equal* to our hard-to-model probability distribution by doing a random walk around the distribution (for the sake of notation, we'll call the "target" distribution we're trying to model $\pi(x)$). In this case, we do so with the Metropolis-Hastings algorithm, which is extremely simple:

1. Pick a starting point $x_0$. This is the start of our "walk": an initial sample, if you will, that we provide ($x_t$ means our current sample at time $t$).
2. Now pick a new, *random* point $y$. Call $y$ the "proposed state" for $x_{t+1}$.
3. See how "good" $y$ is compared to $x_t$.

   i. If $y$ is "better", we let $x_{t+1}=y$.

   ii. If $y$ is "worse", we *might* let $x_{t+1}=y$, but not always.
4. For $t=1,2,3,…$, repeat steps 2 and 3.
5. Profit.

This is extremely vague, but I intentionally left it as such, because oftentimes the formulas can obscure the idea. In essence, this is what Metropolis-Hastings does to generate samples. We take a sample $x_t$ at a time $t$ that "traces" our distribution, and as $t$ gets larger, our "trace" of the curve we walk around gets more accurate. Let's put some of the formulas back into the instructions above and go at it one step at a time.

**Step 1** is easy enough: we give any number for our algorithm to start with. Literally anything. You can give smart guesses that speed up the process, but that will be clear in a second.

**Step 2** we don't actually perform, but rather design. Unlike Step 1 where we gave some determined number of our choosing, in Step 2 we implement a **transition kernel** to pick a step for us. This kernel is a function $Q$ that takes a current spot $x$ and with some probability outputs a new spot $y$. That is, $Q$ is a distribution that randomly generates a new point $y$ *given* a current one $x$, which we will write $Q(y|x)$. This is how we make our "proposed state" and how we actually implement our walk. You may be wondering though, "What *actually* is $Q$?" Well, that's up to you to decide! Since $Q$ itself is a distribution around our current state $x$, you can shape $Q$ in whatever way you want! In general, the exact choice is not too important, but spending time to design a specific kernel can optimize and speed up the process.

**Step 3** is our "goodness" check. Once we have a proposed state generated by $Q$, we need to see if this proposed state is in a more "likely" or dense spot on our distribution $\pi(x)$. The idea is that we want to generate samples representative of $\pi(x)$, so we should mostly visit the probabilistically denser spots, a.k.a. the spots the distribution says are more likely. Geometrically, such a spot is a point *higher* on our distribution curve.

But remember, just because $y$ is not better doesn't mean that we outright reject it. We instead accept it with a probability that shrinks the worse it is. If $y$ is half as high as our current spot $x$, we flip a coin and accept it with 50% probability. If $y$ is a third as high as $x$, we flip a weighted coin and accept it with probability $\frac{1}{3}$. In other words, we can write our acceptance probability as $A=\min(1, \frac{\pi(y)}{\pi(x)})$. If $y$ is higher than $x$, or $\pi(y)>\pi(x)$, then $\frac{\pi(y)}{\pi(x)} > 1$ and we accept it outright. If $\frac{\pi(y)}{\pi(x)} < 1$, then we accept it with probability equal to that fraction.

This acceptance probability is also what makes this algorithm so good: we only need to know our target distribution $\pi(x)$ up to a constant! If $\pi(x) = c\cdot P(x)$, then our acceptance probability would be $A=\min(1, \frac{c\cdot P(y)}{c\cdot P(x)})$ which simplifies to $\min(1, \frac{P(y)}{P(x)})$, making the constant irrelevant. This is ideal for real life experiments as perfectly measuring constants from observation can be very difficult.
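To make the cancellation concrete, here is a minimal sketch. It assumes a Laplace-shaped $P(x) = e^{-|x|}$ with $c = \frac{1}{2}$ purely for illustration, and the function names are my own placeholders. The acceptance probability comes out the same whether or not the constant is included.

```python
import math

def unnormalized(x):
    # P(x): the shape of the distribution with its constant dropped (assumed example)
    return math.exp(-abs(x))

def normalized(x):
    # pi(x) = c * P(x) with c = 1/2
    return 0.5 * unnormalized(x)

def acceptance(density, x, y):
    # Metropolis acceptance probability A = min(1, density(y) / density(x))
    return min(1.0, density(y) / density(x))

a_norm = acceptance(normalized, 1.0, 2.5)
a_unnorm = acceptance(unnormalized, 1.0, 2.5)
print(a_norm, a_unnorm)  # the two values agree, since c cancels in the ratio
```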

**Steps 4 and 5** are pretty self-explanatory, so just to rewrite it more formally, here is the whole algorithm one more time:

1. Pick a starting point $x_0$.
2. Sample a new proposal state $y$ with probability $Q(y|x_t)$.
3. Compute $A=\min(1, \frac{\pi(y)}{\pi(x_t)})$.

    i. With probability $A$, accept our proposed state and let $x_{t+1}=y$. Otherwise, stay put and let $x_{t+1}=x_t$.

4. For $t=1,2,3,…$, repeat steps 2 and 3.
5. Profit.

However, I must admit, I did lie to you, but only a *little* bit. The acceptance probability I gave is actually for the Metropolis algorithm, not the Metropolis-Hastings algorithm. The acceptance probability for the Metropolis-Hastings algorithm is $A=\min(1, \frac{\pi(y)Q(x_t|y)}{\pi(x_t)Q(y|x_t)})$. The Metropolis algorithm only works when $Q$ is a symmetric distribution, meaning that $Q(y|x_t)=Q(x_t|y)$, in which case the $Q$ terms cancel and we return to our familiar fraction from before. MH allows asymmetric kernels to speed up the algorithm, but otherwise the concept is the same.
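As a rough sketch of what an asymmetric kernel looks like in practice, the example below assumes a target shaped like $e^{-x}$ on $x>0$ and a multiplicative log-normal proposal. Both are my own choices for illustration, not anything from the algorithm itself. For this proposal $Q(y|x) \neq Q(x|y)$, so the extra $Q$ ratio in the acceptance probability actually matters.

```python
import math
import random

def target(x):
    # Unnormalized target on x > 0 (assumed example: an exponential shape)
    return math.exp(-x)

def propose(x, sigma=0.5):
    # Asymmetric kernel Q(y|x): scale x by a log-normal factor, so y stays positive
    return x * math.exp(random.gauss(0.0, sigma))

def q_density(y, x, sigma=0.5):
    # Density of Q(y|x) for the log-normal proposal above
    z = (math.log(y) - math.log(x)) / sigma
    return math.exp(-0.5 * z * z) / (y * sigma * math.sqrt(2 * math.pi))

def mh_acceptance(x, y):
    # A = min(1, pi(y) Q(x|y) / (pi(x) Q(y|x)))
    return min(1.0, (target(y) * q_density(x, y)) / (target(x) * q_density(y, x)))

def metropolis_hastings(iterations, x0=1.0, seed=0):
    random.seed(seed)  # fixed seed only so the run is repeatable
    current, states = x0, []
    for _ in range(iterations):
        states.append(current)
        proposal = propose(current)
        if random.random() < mh_acceptance(current, proposal):
            current = proposal
    return states

samples = metropolis_hastings(20000)
print(sum(samples) / len(samples))  # sample mean; this target has mean 1
```

Dropping the `q_density` terms here would quietly bias the samples, which is exactly the case the Hastings correction exists for.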

With 5 very simple steps, we are able to take samples from continuous distributions just like that! The Monte Carlo aspect is pretty obvious: it's the random generation of "proposal states" $y$ in **Step 2**. The Markov chain might be a bit more concealed, as we never actually explicitly define it. But look at **Step 3** again, as it resembles something very close to our transition probabilities from before. Step 3 is actually our Markov chain *implicitly* defined! Since there are an infinite number of states/values to pick and another infinite number of states to transition to, we can't write down an infinitely sized transition matrix. So, instead, we define transition probabilities as needed with our kernel $Q$ and our acceptance probability $A$. And notice, the chain maintains the Markov property, as each proposed state only relies on the current one. This works because we sort of reversed the way we defined our Markov chain! In our weather example with sunny and rainy days from above, we defined transition probabilities and the stationary distribution followed suit, almost like a property or characteristic of our Markov chain. Here, our Markov chain is instead defined by the fact that we want our stationary distribution to mimic $\pi(x)$. This is why we don't outright reject states that are less "good", but rather accept them with a probability that reflects how much less "good" they are, as that will reproduce our distribution's shape.

But just like in our original Markov chain example, it's not perfect immediately. Notice that in our original weather example with sunny and rainy days, 2 iterations with $M^2$ was nowhere near our stationary distribution, and while 5 iterations at $M^5$ was closer, it still was far from ideal. You have to *burn in* some states before proper, accurate samples can be generated.

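Here's some short Python that implements the algorithm (with a symmetric normal proposal, so the simpler Metropolis acceptance probability applies) to estimate the following Laplace distribution:

$\pi(x) = \frac{1}{2}e^{-|x|}$

Here it is in only 20 lines of code: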

```python
import numpy as np
import matplotlib.pyplot as plt

def target(x):
    return .5 * np.exp(-abs(x))  # Target distribution π(x)

def accept(p):
    flip = np.random.uniform(0, 1)
    return p >= flip

def metropolis(iterations):
    states = []  # Samples generated by the algorithm
    # Step 1 --> initialize an x0
    current = 1
    for i in range(iterations):
        states.append(current)
        # Step 2 --> Q generates a proposal (normal distribution)
        proposal = np.random.normal(current, 1)
        # Step 3 --> Check how good our proposal is
        goodness = min(1, target(proposal) / target(current))
        if accept(goodness):
            current = proposal  # If we like the proposal state, we jump there!
    return states
```

Here is the scatter plot of our algorithm walking all around $\pi(x)$ across 10000 iterations...

...and here is the corresponding histogram that fits almost too perfectly to our target distribution.

We can now generate discrete samples proportional to our continuous distribution!
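If you'd rather check the samples numerically than by eye, here is a small sketch that reruns the same idea with a burn-in cutoff. The seed, the cutoff of 1000 and the 50,000 iterations are all arbitrary choices of mine. A Laplace distribution with scale 1 has mean 0 and variance 2, so the sample statistics should land near those values.

```python
import numpy as np

def target(x):
    return 0.5 * np.exp(-abs(x))  # Laplace target pi(x), as in the post

def metropolis(iterations, rng, x0=1.0):
    states = np.empty(iterations)
    current = x0
    for i in range(iterations):
        states[i] = current
        proposal = rng.normal(current, 1.0)  # symmetric normal proposal Q
        if rng.uniform(0, 1) <= min(1.0, target(proposal) / target(current)):
            current = proposal
    return states

rng = np.random.default_rng(0)  # fixed seed, assumed only for repeatability
burn_in = 1000                  # hypothetical cutoff; tune it for your own problem
samples = metropolis(50_000, rng)[burn_in:]

# Compare against the known mean (0) and variance (2) of the target
print("mean:", samples.mean(), "variance:", samples.var())
```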

The algorithm aside, an extremely important concept is shown here: reframing questions and objects and asking them from a different perspective can lead to extremely powerful tools and thoughts. We take a Markov chain, and instead of letting its equilibrium state arise as a property, we turn our definition inside out and use the equilibrium state itself to define the Markov chain. This pattern of rethinking concepts has always been useful, be it for building intuition while learning or for defining tools in all of math. From connecting the Mandelbrot set to its cardioid and cycloids, to what encoding parameters in 4-dimensional space means, to even Fourier rebuilding functions from sine waves, the most impactful question one can ask is usually in the form of, "What if?"

]]>Try typing the fraction $\frac{1}{98}$ into your calculator and see what you get. Don't have one on hand? Here's a calculator ready and waiting for you.

Next try $\frac{100}{9899}$. See if anything stands out to you. Even with the few decimals this displays, you might notice some patterns appearing. $\frac{1}{98}$ expanded as a decimal appears to contain the powers of 2! The second fraction might require a more robust calculator, but with enough decimals it's clear that it too has a hidden sequence: the Fibonacci numbers are in its decimal expansion!

You can try and guess at other fractions with unique expansions, but there is a systematic way to generate these fractions to show not just simple sequences like this, but any sequence you want! It's all a byproduct of one of the most powerful tools in discrete math and combinatorics: the generating function.

First, some terminology and context. A **generating function** may look complicated, but its essence is actually very simple. If you have some sequence of numbers, say, $A = \{ a_0, a_1, a_2, a_3, \cdots \}$, its corresponding generating function is the power series $A(x) = a_0 + a_1x^1 + a_2x^2 + a_3x^3 + \cdots$. That's all a generating function is! If you like fancy math notation, we can write this more concisely as $A(x) = \sum_{n=0}^{\infty} a_nx^n$. Something important to note, though, is that the powers of $x$ in the series don't actually *mean* anything. We only really care about the coefficients, and we happen to be using the series to **encode** our sequence $A$. Herbert S. Wilf put this best in his aptly named book, generatingfunctionology: "A generating function is a clothesline on which we hang up a sequence of numbers for display." Basically, our generating function is purely a convenient way to place all of our sequence terms into a single object. That's why it's not just any power series, but a *formal power series*: it extends on towards an infinite number of terms, and we don't really care about convergence, but rather just the representation itself. What's also great about generating functions is that they turn questions about sequences and integers into questions about functions, and over the course of centuries, we've learned to do a *lot* with functions. You'll see quickly why we use a power series specifically, as exponent properties play very nicely into the types of problems and tricks generating functions can help us out with. Knowing this, you shouldn't let the notation of a generating function ever scare you! They're truly a simple object obscured by harsh notation, so always focus on what they represent instead of how they are written.

So, for our powers of 2, the sequence would be $P=\{ 2^n \}_{n=0}^{\infty}$ and the corresponding generating function would be $P(x) = 1+2x+4x^2+8x^3+\cdots$. If you're familiar with your series, this is a geometric series and we can condense it into the following formula: $P(x) = 1+(2x)^1+(2x)^2+(2x)^3 + \cdots = \frac{1}{1-2x}$. Remember how I said the powers of $x$ don't really mean anything? This is a case where we can actually leverage the fact that our generating function is in fact a "function" (this is a specific use case as we normally don't treat them as standard functions). Consider the general generating function $A(x)=a_0+a_1x+a_2x^2+a_3x^3+\cdots$. Watch what happens if we plug in $x=.1$: $A(.1) = a_0+a_1(.1)+a_2(.01)+a_3(.001)+\cdots$. This may not look like much, but since we use a base 10 counting system, multiplying by $.1$ is the same thing as moving a decimal point to the left one spot. So, we can rewrite that infinite sum as the nice decimal $a_0.a_1a_2a_3\ldots$ Each number in our sequence becomes a digit in our final number!

But, this can become a problem if a number in our sequence $a_n$ is more than one digit long, so we can change the value we plug in to get more precise decimals with more numbers from our sequence: $A(.01) = a_0.0a_10a_20a_3\ldots$ and just like that we have buffer 0s in between numbers. So, doing this for our generating function for the powers of 2, we get that $P(.01) = \frac{1}{1-2(.01)} = \frac{1}{.98} = 1.0204081632\ldots$ Just for aesthetic pleasure, I like to multiply the final fraction by the value of $x$ we plugged in to shift that initial $a_0$ after the decimal point, and get a nicer looking fraction at the end: $\frac{1}{.98}\cdot .01 = \frac{1}{98}$, giving the familiar fraction from the start and the nice decimal of $.010204081632\ldots$
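Your calculator will run out of digits pretty quickly, so here is a small sketch using Python's exact `Fraction` type to plug a value into the closed form and print as many decimals as you like. The helper names `plug_in` and `decimal_digits` are just placeholders of mine.

```python
from fractions import Fraction

def decimal_digits(frac, places):
    # First `places` digits after the decimal point (assumes 0 <= frac < 1)
    scaled = frac.numerator * 10**places // frac.denominator
    return str(scaled).zfill(places)

def plug_in(closed_form, x):
    # Evaluate the closed form exactly, then multiply by x to shift a_0 past the point
    return closed_form(x) * x

def powers_of_two(x):
    return 1 / (1 - 2 * x)  # P(x) = 1/(1 - 2x)

two_digit_buffer = plug_in(powers_of_two, Fraction(1, 100))
three_digit_buffer = plug_in(powers_of_two, Fraction(1, 1000))

print(two_digit_buffer, "= 0." + decimal_digits(two_digit_buffer, 16))
print(three_digit_buffer, "= 0." + decimal_digits(three_digit_buffer, 24))
```

Plugging in $.001$ instead of $.01$ gives each power of 2 three digits of room, so the pattern survives longer before the carries start to smear it.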

As cool as this may be, this relied on the fact that we recognized what kind of series the generating function was (for the powers of 2, it was geometric). Let's take a look at a slightly more complicated sequence: the Fibonacci sequence. Unlike the powers of 2 where we knew a nice closed formula off the bat for each term, we don't have one (shhh) for the Fibonacci numbers. Instead, we can define the sequence by relating each term to other terms. We'll call the Fibonacci sequence $F = \{ f_0, f_1, f_2, f_3, \cdots \}$ and its associated generating function $F(x) = \sum_{n=0}^{\infty} f_nx^n$ where $f_n$ is the nth Fibonacci number. By definition of the Fibonacci numbers, we also know that

$f_{n+2} = f_{n+1} + f_n, \quad f_0 = 0, \quad f_1 = 1$

This equation is known as a **recurrence relation**, as, well, it's a recursive relationship; any given term in the sequence can be expressed in some form related to other terms. What's useful about having an equation like this is that we can relate this to our generating function! If we can solve for the generating function, we might be able to get a function that can get us our cool fraction with the sequence embedded in the decimals again! If we multiply through by $x^n$, we get…

$f_{n+2}x^n = f_{n+1}x^n + f_nx^n$

…and then if we sum over $n$ from $0$ to $\infty$ we end up with…

$\sum_{n=0}^{\infty} f_{n+2}x^n = \sum_{n=0}^{\infty} f_{n+1}x^n + \sum_{n=0}^{\infty} f_nx^n$

We're now starting to have a set of terms that awfully resemble our generating function $F(x)$. Let's look at the left-hand side and see if we can make any sense of it. Just writing it out can tell us a lot, so let's do that.

$\sum_{n=0}^{\infty} f_{n+2}x^n = f_2 + f_3x + f_4x^2 + f_5x^3 + \cdots$

It looks like our original generating function, but offset! Remember, we want the subscript of each coefficient to equal the power of the $x$ it is attached to. We can multiply through by $x^2$ to easily fix that.

$x^2\sum_{n=0}^{\infty} f_{n+2}x^n = f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots$

However, we don't want to actually change the value of our recurrence and add extra factors to both sides. To counter the effects of the multiplication, we just put a factor of $\frac{1}{x^2}$ in front of it, since $\frac{1}{x^2} \cdot x^2 = 1$, negating the effects of our multiplication.

$\sum_{n=0}^{\infty} f_{n+2}x^n = \frac{1}{x^2}\left(f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots\right)$

Now look at that right-hand side: it's our generating function $F(x)$ missing the first two terms, $f_0$ and $f_1x$!

$F(x) \color{red}{- f_0 - f_1x} = f_2x^2 + f_3x^3 + f_4x^4 + f_5x^5 + \cdots$

Finally, after plugging it all back in, we end up with

$\sum_{n=0}^{\infty} f_{n+2}x^n = \frac{F(x) - f_0 - f_1x}{x^2}$

You can do a similar process with the other terms on the right-hand side of our original equation to finally get an expression in terms of the generating function, instead of the recurrences.

$\frac{F(x) - f_0 - f_1x}{x^2} = \frac{F(x) - f_0}{x} + F(x)$

Now we just need to turn the wheel and solve for $F(x)$!

$F(x) = \frac{f_0 + (f_1 - f_0)x}{1-x-x^2}$

Remember, we had initial values $f_0=0$ and $f_1=1$, so we can plug those in to further simplify our fraction.

$F(x) = \frac{x}{1-x-x^2}$

And sure enough, $F(.01) = \frac{100}{9899} = 0.0101020305081321 \ldots$
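If you want to double-check the algebra, one option is to expand $\frac{x}{1-x-x^2}$ back into a power series and see whether the Fibonacci numbers fall out. The sketch below does the series division directly; `series_coefficients` is a helper I made up for this, with polynomials stored as coefficient lists, lowest power first.

```python
from fractions import Fraction

def series_coefficients(numerator, denominator, terms):
    # Power-series coefficients of numerator(x) / denominator(x).
    # Assumes denominator[0] != 0; polynomials are lists, lowest power first.
    coeffs = []
    for n in range(terms):
        total = Fraction(numerator[n]) if n < len(numerator) else Fraction(0)
        for k in range(1, min(n, len(denominator) - 1) + 1):
            total -= denominator[k] * coeffs[n - k]
        coeffs.append(total / denominator[0])
    return coeffs

# F(x) = x / (1 - x - x^2)
fib = [int(c) for c in series_coefficients([0, 1], [1, -1, -1], 12)]
print(fib)

# Plugging in x = 1/100 exactly recovers the fraction from the start of the post
x = Fraction(1, 100)
value = x / (1 - x - x**2)
print(value)
```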

But why stop there? Although we just solved that $F(x) = \frac{x}{1-x-x^2}$, don't forget our original definition that $F(x) = \sum_{n=0}^{\infty}f_nx^n$. These equations imply that if we can find a power series $\sum_{n=0}^{\infty}f_nx^n = \frac{x}{1-x-x^2}$, we should get a closed form for the nth Fibonacci number!

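First, we need to decompose our function into its partial fractions. Let $\phi = \frac{1+\sqrt{5}}{2}$ and $\varphi = \frac{1-\sqrt{5}}{2}$, so that $1-x-x^2 = (1-\phi x)(1-\varphi x)$.

$F(x) = \frac{x}{(1-\phi x)(1-\varphi x)} = \frac{1}{\sqrt{5}}\left(\frac{1}{1-\phi x} - \frac{1}{1-\varphi x}\right)$

Note our final result mimics the closed form of two different geometric series!

$F(x) = \frac{1}{\sqrt{5}}\left(\sum_{n=0}^{\infty}\phi^nx^n - \sum_{n=0}^{\infty}\varphi^nx^n\right) = \sum_{n=0}^{\infty}\frac{\phi^n-\varphi^n}{\sqrt{5}}x^n$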

So, to wrap it all up:

$\large{f_n = \frac{\phi^n-\varphi^n}{\sqrt{5}}}$

Just like that, we've found a formula for the nth Fibonacci number (this is known as Binet's formula)! This is only a *sliver* of the power of generating functions: being able to turn a recurrence relation into a closed form solution, barely even interacting with the sequence at all!
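Here is a quick sanity check of the formula against the plain recurrence. This is only a sketch: it uses floating point, so I round the result, and it will eventually drift for very large $n$.

```python
import math

def binet(n):
    phi = (1 + math.sqrt(5)) / 2
    psi = (1 - math.sqrt(5)) / 2  # the post's varphi
    # Rounding absorbs the small floating-point error in the powers
    return round((phi**n - psi**n) / math.sqrt(5))

def fib_iterative(n):
    # The recurrence f_{n+2} = f_{n+1} + f_n with f_0 = 0, f_1 = 1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([binet(n) for n in range(12)])
print(all(binet(n) == fib_iterative(n) for n in range(40)))
```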

Now, let's try a different type of problem generating functions can help us out with.

Say you're visiting an aviary with some friends. Well-respected, the aviary has a vast number of birds, but they've noticed some interesting patterns in the behavior of their avifauna: their hummingbirds always fly solo; blue jays tend to nest in threes; toucans perch in pairs; and cassowaries chill in fives. How many ways can you see a total of 20 birds?

This may seem like an odd spot for generating functions, but we'll see a nice property of exponents that allows us to use them. Here, a generating function $A(x) = \sum_{n=0}^{\infty}a_nx^n$ is an encoding such that each coefficient $a_n$ denotes how many ways there are to see $n$ birds. Let's write the generating function for hummingbirds:

$H(x) = 1 + x + x^2 + x^3 + x^4 + \cdots$

So, if we want to see any number of birds, there is exactly one way to see that many birds by only seeing hummingbirds. That makes sense! What about blue jays?

$J(x) = 1 + x^3 + x^6 + x^9 + \cdots$

Jays come in groups of 3, so it would make sense we could only see total birds in multiples of 3. If we want to see a group of 6 birds with only jays, there is one way we can do that (that is by seeing two groups of jays), but 0 ways to see 5 birds of only jays. Similar generating functions can be written for the other birds.

The surprising thing is that now, if we want the number of ways to see $n$ birds through a combination of different birds, all we have to do is multiply the generating functions together! But why would this ever work? Well, let's think of what our exponents mean in each function: they are the total number of birds we see from a group. So, if our giant product results in a term of, say, $x^{14}$, we know that is one way to see 14 birds. Why? Because exponents turn multiplication into addition: $x^a \cdot x^b = x^{a+b}$. So, if we get multiple copies of $x^{14}$, they'll all accumulate in the coefficient of that term, giving us the different ways to see a total of 14 birds! This is why using a power series specifically for generating functions is so helpful: not only do the exponents have a clear meaning when applied, they also carry over the nice exponent properties we can leverage in counting. In general, to count the number of ways to see $n$ birds, we look for the coefficient in front of $x^n$.

$(1+x+x^2+\cdots)(1+x^3+x^6+\cdots)(1+x^2+x^4+\cdots)(1+x^5+x^{10}+\cdots)$

Expanding that out seems like a terrible idea, so we won't… We'll let Python do it instead! It's totally doable to do this by hand to systematically *extract the coefficient* of $x^{20}$ (especially with the series we've selected, involving many binomial coefficients with its partial fraction decomposition), but the algebra along with it can get annoyingly tedious. I'm sure there are clever ways to go about keeping track of which terms you're multiplying, but that's out of the scope of this post.
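One way to do that expansion is sketched below: store each series as a list of coefficients, throw away everything past $x^{20}$ (higher powers can never contribute to the coefficient we want), and multiply. The function names are my own, and this is just one straightforward approach.

```python
def multiply_truncated(a, b, max_degree):
    # Multiply two polynomials (coefficient lists, lowest power first),
    # dropping every term above x^max_degree
    product = [0] * (max_degree + 1)
    for i, a_i in enumerate(a):
        if a_i == 0:
            continue
        for j, b_j in enumerate(b):
            if i + j > max_degree:
                break
            product[i + j] += a_i * b_j
    return product

def group_series(group_size, max_degree):
    # 1 + x^g + x^(2g) + ... truncated at x^max_degree
    return [1 if n % group_size == 0 else 0 for n in range(max_degree + 1)]

def count_sightings(group_sizes, total):
    result = [1] + [0] * total  # start from the constant polynomial 1
    for g in group_sizes:
        result = multiply_truncated(result, group_series(g, total), total)
    return result[total]        # the coefficient of x^total

# Group sizes taken from the product above: 1, 3, 2 and 5
print(count_sightings([1, 3, 2, 5], 20))
```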

So, if you go to the aviary, we know there are 91 different ways to see a total of 20 birds. If you're interested in seeing the entire mathematical crank turn, the great YouTuber Mathologer made an excellent video answering a similar question, counting the number of ways to make change for a dollar, in which he spends much more time detailing the algebra needed to solve such a problem analytically. Regardless, I hope this gave insight into how great generating functions are as a combinatorial tool for counting, in addition to their utility as a discrete tool.

Before we end, I want to show you one more cool use case of generating functions that involve probability distributions.

A staple of tabletop gaming has always been the pair of six-sided dice. Notably, it's respected for being a considerably fair distribution, with the most likely outcome being the middle value of 7 at $\frac{1}{6}$, and the two extreme values of 2 and 12 being the least likely, both at probability $\frac{1}{36}$. This makes it great for board games, with extraordinarily good, high values being just as likely as their low, unlucky counterparts. But, these dice are boring: for centuries our dice have remained a simple numbering from 1–6, but is there a different numbering that we can use to maintain our fair play?

If we use our familiar friend the generating function, we can find out with little thinking required! We can represent our die as a generating function $P(x) = \sum_{n=0}^{\infty}p_nx^n$ where $p_n$ is the probability of rolling a value $n$. So, for a standard die its generating function would be

$P(x) = \frac{1}{6}x + \frac{1}{6}x^2 + \frac{1}{6}x^3 + \frac{1}{6}x^4 + \frac{1}{6}x^5 + \frac{1}{6}x^6$

as you have an equal chance of rolling any number 1–6, and no other possible number (so all of their coefficients are 0 and those terms vanish). If we had a die with sides 1,2,2,7,7,7, its generating function would look like

$P(x) = \frac{1}{6}x + \frac{2}{6}x^2 + \frac{3}{6}x^7$

So, the generating function of the sum of *two* dice is just their product, like last time (to see why this is true, think about what the product means: multiplying powers adds their exponents, so the product counts the number of ways to reach each sum from rolling two dice, and the two factors of $\frac{1}{6}$ normalize those counts by $\frac{1}{36}$ into probabilities). So, if we had two new dice (we'll call them die A and die B) with new generating functions $A(x)$ and $B(x)$, their product should equal the product of the normal dice!

$A(x)B(x) = \left(\frac{1}{6}x + \frac{1}{6}x^2 + \frac{1}{6}x^3 + \frac{1}{6}x^4 + \frac{1}{6}x^5 + \frac{1}{6}x^6\right)^2$

So, now what? The right-hand side is currently packaged as two factors: two copies of the normal die's generating function. That means that if we can find a way to re-factor that right-hand side into two new generating functions, we should get the labelling for two new dice that are still just as fair as our ordinary dice!

$A(x)B(x) = \frac{1}{36}x^2(1+x+x^2)^2(1+x)^2(1-x+x^2)^2$

Now, we just need to figure out how to repackage this into 2 terms and we should have our dice! Some things to note: 1) All the coefficients in $A(x)$ and $B(x)$ need to be nonnegative multiples of $\frac{1}{6}$, as a probability can't be negative and each die has 6 sides. 2) $A(x)$ and $B(x)$ both need to have at least one factor of $x$, as otherwise we might end up with 0s on our dice (which can make for some very boring dice). So, right now we have $A(x) = \frac{1}{6}x$ and $B(x) = \frac{1}{6}x$. Now we need to distribute the remaining factors $(1+x+x^2)^2(1+x)^2(1-x+x^2)^2$. Since we only have six sides on our dice, it follows that, setting the $\frac{1}{6}$ aside, the coefficients of both $A(x)$ and $B(x)$ must sum to 6 (how can we put 7 numbers on a six-sided die?). Since our factors have coefficient sums of 3, 2, and 1 respectively, it follows immediately that both $A(x)$ and $B(x)$ need exactly one factor of $(1+x+x^2)$ and one factor of $(1+x)$. So, what do we do with the two factors of $(1-x+x^2)$? We can either give both to die A (or B, whichever you like thanks to symmetry), or one to A and one to B. If we do the latter, we get:

$A(x) = B(x) = \frac{1}{6}x(1+x+x^2)(1+x)(1-x+x^2) = \frac{1}{6}(x+x^2+x^3+x^4+x^5+x^6)$

Which is just our normal dice from before, labelling both dice 1,2,3,4,5,6. But if we try the former option...

$A(x) = \frac{1}{6}x(1+x+x^2)(1+x)(1-x+x^2)^2 = \frac{1}{6}(x+x^3+x^4+x^5+x^6+x^8)$

$B(x) = \frac{1}{6}x(1+x+x^2)(1+x) = \frac{1}{6}(x+2x^2+2x^3+x^4)$

Now we get two very unique dice: label die A 1,3,4,5,6,8 and die B 1,2,2,3,3,4. Of course, multiplying these two generating functions together will verify their fairness, as we didn't actually change any of the factors that go into the product, but you can also draw these dice's summation table and verify that all the numbers 2–12 appear as often as they should. If you want to mess with your friends a bit, making a pair of these dice for your next occasion is definitely an easy project to do in a day.
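If drawing the summation table by hand sounds tedious, a few lines of Python can tally it instead. This sketch simply brute-forces all 36 face pairs for each pair of dice and compares the counts; `sum_distribution` is a name I picked for the helper.

```python
from collections import Counter
from itertools import product

def sum_distribution(die_a, die_b):
    # Count how many of the 36 face pairs produce each total
    return Counter(a + b for a, b in product(die_a, die_b))

standard = [1, 2, 3, 4, 5, 6]
die_a = [1, 3, 4, 5, 6, 8]
die_b = [1, 2, 2, 3, 3, 4]

normal_sums = sum_distribution(standard, standard)
new_sums = sum_distribution(die_a, die_b)

for total in range(2, 13):
    print(total, normal_sums[total], new_sums[total])
```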

Hopefully this has shown you just how powerful generating functions are and how wide an application they have in discrete problems. From sequences, to counting, to probability, these are just a fraction of the potential generating functions have, and they should always be kept in the back of your mind as not just a tool, but really as a symbol of an ongoing theme in problem solving: always look for out-of-the-box perspectives. I've spoken about duality a bit before (and it definitely warrants its own post), but just how powerful alternative representations can be can't be overstated. Generating functions turned seemingly impossible questions about discrete sequences and indistinguishable counting into questions about functions and series, and required at most a bit of high school algebra to manipulate some of the equations.

While I left links to resources for relevant techniques and tools that I didn't explain, I do want to talk briefly on how I determined distributing factors for our polynomial coefficients in the dice problem as it's not completely obvious if you haven't seen it before. In the dice problem, I said that we need our final polynomial's coefficients to sum to 6. To ensure they summed to 6, I said that they both must be the product of a polynomial with coefficient sum of 3 and a polynomial with a coefficient sum of 2. This is because of the nice property that the product of two polynomials' coefficient sum is equal to the coefficient sum of their polynomial product. In other words: let $C$ be a function that takes a polynomial $f(x)$ as an argument, $C(f(x))$ returns the sum of the coefficients of $f(x)$. I want to show you that $C(f(x)) \cdot C(g(x)) = C(f(x)g(x))$.

Let's first do an example. Let $f(x) = x^2 - 3x + 2$ and $g(x) = 2x - 4$. The coefficient sum of $f(x)$ is $C(f(x)) = 1-3+2 = 0$. Similarly, for $C(g(x)) = 2-4 = -2$. Therefore, $C(f(x))\cdot C(g(x)) = 0\cdot -2 = 0$. So, we'd then expect $C(f(x)g(x)) = 0$ as well.

Now, let's see what the product of the two functions is and what its coefficient sum is. Let $h(x) = f(x)g(x) = 2x^3 - 10x^2 + 16x - 8$. Then, $C(h(x)) = 2-10+16-8 = 0$, just as we foresaw.

Why is this true? It comes down to a clever way of viewing the coefficient sums. Note that for any polynomial $f(x)$, $C(f(x)) = f(1)$. This is because plugging in 1 to any polynomial completely removes the powers of $x$, as $1^n = 1$ and $m\cdot 1 = m$, leaving us only with the coefficients. This allows us to rewrite $C(f(x)) \cdot C(g(x)) = f(1)\cdot g(1)$. Now, what about $C(f(x)g(x))$? Well, remember we defined $h(x) = f(x)g(x)$, so that means $C(f(x)g(x)) = C(h(x)) = h(1) = f(1)\cdot g(1)$, which is exactly what we got before! So this means that whenever a polynomial with coefficient sum $n$ is written as the product of two smaller polynomials with coefficient sums $a$ and $b$, we must have $ab = n$. That's how I knew in the dice problem that each die's generating function needed one factor with coefficient sum 3 and one with coefficient sum 2, since $3\cdot 2 = 6$.
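The same check can be run in a few lines of Python. This is a sketch with polynomials stored as coefficient lists, lowest power first (my own convention here), reusing the example above and then the dice factors.

```python
def poly_multiply(f, g):
    # Multiply two polynomials given as coefficient lists, lowest power first
    product = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            product[i + j] += a * b
    return product

def coefficient_sum(poly):
    # Summing the coefficients is the same as evaluating the polynomial at x = 1
    return sum(poly)

f = [2, -3, 1]  # x^2 - 3x + 2
g = [-4, 2]     # 2x - 4
h = poly_multiply(f, g)
print(h, coefficient_sum(f) * coefficient_sum(g), coefficient_sum(h))

# Two of the dice factors: (1 + x + x^2) and (1 + x)
die_part = poly_multiply([1, 1, 1], [1, 1])
print(die_part, coefficient_sum(die_part))
```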

]]>Today I want to talk about a type of geometry I think is grossly overlooked, especially when compared to the popularity of its Euclidean brother. In a world where translations, rotations, and dilations are the norm, sometimes it's hard to see anything but them as the workhorse geometric tools. However, there is an additional transformation that takes us from the familiar world of linear transforms to a kind of circular transform that may seem novel at first, but even extends into complex analysis. Today I want to talk about **inversive geometry**. Inversive geometry takes the standard plane we know and quite literally flips it inside out. By the end of this post, you will be familiar with not only what in the world an inversion is, but also a very cool theorem, behind the animation above, that relates tangent circles to one another. But before we can get there, we first need to learn *how* to flip our world inside out.

As you can imagine, inversive geometry is geometry that relies on something called *inversions*. You can think of an inversion as a function that takes a point $P$ and spits out a transformed point $P'$. But, what exactly is our function? It's not a standard $f(x)$ as we're giving *two* coordinates, not one. So maybe it's a 2-by-2 matrix, as we're giving a 2D vector and outputting another 2D vector? Not a bad idea, but it will quickly become clear why we don't want to do that. So, what *is* our functional object? It's actually a *circle*. As weird as that sounds, just hear it out. Given a circle $Ø$ with center $O$ and radius $r$, the point $P$ is inverted to $P'$ based on the following equation:

$|OP| \cdot |OP'| = r^2$

...where $P'$ lies on the ray $\overrightarrow{OP}$. Try dragging the points below to get a handle on this idea.

Here we have the green circle $Ø$ which we are inverting the point $P$ over. Try dragging $P$, $O$, and $R$ around to see where its image $P'$ goes under inversion.

The numbers above each point represent their distance from $O$, so you can verify the distances satisfy the inversion equation. As a nice little double entendre, this mapping is called an inversion for both algebraic and geometric reasons. The equation itself is an inverse relationship between $|OP|$ and $|OP'|$ (this is why we can't use matrices in the standard sense to represent the transform), but better yet, I'm sure that within just a few seconds you can much more intuitively see the geometric reason: every point $P$ on the inside of the circle gets mapped to the outside of the circle, and every point outside the circle gets mapped to the inside (and every point *on* the circle stays on the circle; we say the circle itself is *invariant* under inversion). We're taking our plane and flipping it inside out centered around the circle.

As such, this specific inversion is known as a **circle inversion** or **plane inversion**.

However, there might be a glaring issue to some of you: what if the point $P$ we're inverting is the center of the circle $O$ itself? Then we get $|OP|=0$, and how can 0 times anything equal anything but 0? To get around this issue, we have to formally introduce a *point at infinity*. That way, if we try to invert the center of our inverting circle, we have a place for it to go.
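In code, an inversion is only a few lines. Here is a minimal sketch, where representing the point at infinity as `None` is purely my own convention: since $P'$ sits on the ray from $O$ through $P$ at distance $\frac{r^2}{|OP|}$, we just rescale the vector from $O$ to $P$ by $\frac{r^2}{|OP|^2}$.

```python
import math

def invert(point, center, radius):
    # Invert `point` over the circle with the given center and radius.
    # Returns None for the center itself, standing in for the point at infinity.
    dx, dy = point[0] - center[0], point[1] - center[1]
    dist_sq = dx * dx + dy * dy
    if dist_sq == 0:
        return None
    scale = radius**2 / dist_sq  # |OP'| / |OP| = r^2 / |OP|^2
    return (center[0] + scale * dx, center[1] + scale * dy)

O, r = (0.0, 0.0), 2.0
P = (0.5, 0.0)
P_inv = invert(P, O, r)

print(P_inv)                                  # lands on the same ray from O
print(math.dist(O, P) * math.dist(O, P_inv))  # should equal r^2
print(invert(P_inv, O, r))                    # inverting again returns P
```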

Now that we can invert points, we can now easily invert *shapes*. All we have to do is invert the collection of points individually, and remember the order to connect them. We could try basic polygons like squares and triangles, but the one that is most interesting (and will be most helpful) is inverting other circles. Below, we'll again invert over the green circle with center $O$, but now instead of a point, we'll invert the blue circle with center $C$ to the red circle.

Again, we have the green circle $Ø$ as our inversion circle, but instead of just a point $P$ that we'll invert, we'll invert the entire blue circle with center $C$. Drag the different points around to see where our blue circle inverts to along with its center $C$ to $C'$.

A lot of our rules with inverting points can easily give us an intuition for how our circles might invert. Points on a circle that are *inside* of our green inversion circle get flipped to be outside of it, and vice versa, and points *on* the inversion circle stay on the inversion circle (see how the red circle passes through the intersections of the green and blue circles). But, since we are looking at a group of points, some discrepancies between points obviously exist. For instance: distance. I have drawn both the center of our blue circle $C$ as well as its inversion $C'$. But just by looking at it, it's obvious that there's no way $C'$ can be the center of the blue circle's inversion! That's due to one key aspect of inversions: **they do not preserve distances**.

That should be apparent due to the actual inversion equation $|OP| \cdot |OP'| = r^2$. This inverse relationship between the lengths of $OP$ and $OP'$ is what really exaggerates inversions with very small or very big values of $|OP|$. In fact, this is why inverting a circle is so interesting: even though the distances get all messed up, a circle will still always invert into another circle. Try inverting a square and you'll almost definitely get something that doesn't look like a square (*almost* definitely as if you align the center of the square with the center of the inversion circle, then that will result in another square; just try drawing it out and it will all fall into place). Even when it doesn't *look* like a circle, it really is! Try dragging the blue circle such that it passes through the center of our green inverting circle. You'll get something that looks like a line. While it acts as a line, we formally say that this is a really really big circle. Specifically, a circle with an infinitely large radius. It's a lot like how in calculus they say that if you zoom in super far on a curve it looks like a line: a super big circle locally looks like a line from our perspective.
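Here's a rough numerical check of both claims. It leans on a standard formula for the image of a circle that I haven't derived in this post, so treat it as an assumption: for a circle with center $C$ and radius $\rho$ that doesn't pass through $O$, the image is the original circle scaled about $O$ by $s = \frac{r^2}{|OC|^2 - \rho^2}$. The specific circles are arbitrary picks.

```python
import math

def invert(point, center, radius):
    # Point inversion (assumes the point is not the inversion center)
    dx, dy = point[0] - center[0], point[1] - center[1]
    scale = radius**2 / (dx * dx + dy * dy)
    return (center[0] + scale * dx, center[1] + scale * dy)

def invert_circle(c, rho, center, radius):
    # Image of the circle (center c, radius rho), assuming it misses the inversion center
    dx, dy = c[0] - center[0], c[1] - center[1]
    scale = radius**2 / (dx * dx + dy * dy - rho**2)
    return (center[0] + scale * dx, center[1] + scale * dy), abs(scale) * rho

O, r = (0.0, 0.0), 1.0    # inversion circle
C, rho = (3.0, 0.0), 1.0  # the circle being inverted
new_center, new_radius = invert_circle(C, rho, O, r)

# Invert points around the circle and measure how far each lands from the predicted circle
deviation = 0.0
for k in range(16):
    angle = 2 * math.pi * k / 16
    p = (C[0] + rho * math.cos(angle), C[1] + rho * math.sin(angle))
    q = invert(p, O, r)
    deviation = max(deviation, abs(math.dist(q, new_center) - new_radius))

print(new_center, new_radius, deviation)  # deviation should be essentially zero
print(invert(C, O, r))                    # C' is not the center of the image circle
```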

Ok, that was a lot, but what does this tell us? Well, just experimenting with inverting one circle tells us much about inversions and some properties they have. Let's write them out.

- Inversions do not preserve distances. We saw this with how a circle's center may not invert to the center of the inverted circle.
- Every point $P$ has a unique inversion $P'$ for any given circle of inversion $Ø$ with center $O$ and radius $r$. This may seem obvious, but it's important to be aware of as it leads to the next very important characteristic...
- **Intersections and tangency between two or more shapes are preserved during an inversion**. This fact is the one you want to hold onto the most for the upcoming sections. This should make sense as if two or more shapes share some point in common such as an intersection or tangent point $P$, that singular point has only one unique inverse $P'$ which they must also share. And if all the points need to be connected to that point after the inversion, then we should expect to see that intersection/tangent point remain after an inversion.
- Lastly, just as a neat fact, performing the same inversion twice results in nothing changing (the identity). You can think of this sort of like what happens when turning a shirt inside out twice: the first time the seams come out, but the second time it just goes back to how it started. Just going back to our formula $|OP| \cdot |OP'| = r^2$, we know that $OP'$ has some length that corresponds with $OP$ to keep that product the same. So, if we let $P \rightarrow P'$ represent our first inversion, $P'$ needs to go back to $P$ to keep the formula the same for the second inversion. This makes the function of an inversion a special function called an involution.

With the basics of inversion down, we are now ready to explore that animation from the very top of this post.

Steiner's Porism can sound a lot more complicated than it is, but I promise you the animation at the top explains everything. Let's break it down step by step. First, draw two non-intersecting circles with one inside the other. Second, draw a third circle that is tangent to the inside and outside circle. Third, draw as many circles as you can, each tangent to the inside, outside, and last circle you drew until no more can fit (we call this chain of circles a *Steiner chain*). **Steiner's Porism** states that if the last circle is tangent to the first circle you drew, then the chain closes up no matter where you place that first circle, giving infinitely many closed chains (and if it is not tangent, then no starting position will close the chain). So in the GIF above, since the chain of black circles closes up, the black circles are free to rotate like ball bearings in between the outer blue and inner red bounding circles and the chain will always link up to make one of the infinite configurations. It's pretty interesting, but how would someone ever prove this? That's where our friend inversion comes in.

Before we go any further though, let's quickly see if there's an easier version of this scenario. One of the main issues I first had when looking at this was the fact that the black circles rolling around didn't have to be the same size. Fortunately, there's an obvious case where we don't have to worry about that: when the two bounding circles are concentric.

Try dragging the red point to change the red circle's radius, and try rotating the black point to see the symmetry in the chains of circles.

When the circles share the same center, then Steiner's Porism becomes obvious: our set up now becomes symmetrical, so you can think of any starting point as a rotation of the original chain of circles. Since this is an easy fact to see, we can now use a key property of inversions to prove the general case for *any* pair of bounding circles:

Intersections and tangency between two or more shapes are preserved during an inversion

This is great for us, as then if we can find an inversion that turns our non-concentric circles into concentric circles, we can then use the fact that our tangents of the chain circles are preserved and use the obvious concentric circles case to close out the theorem. It may sound complicated, but think of it as a way to work backwards: if we can show we can turn any non-concentric circles into concentric ones, then the reverse is also true where there is some corresponding pair of concentric circles that invert to our original, non-concentric ones. Since the rules of tangency remain true between inversions, the rules for our circle chains remain as well (since they are only governed by tangents).

To find our desired circle of inversion, I'll present it as a series of steps that might not make sense immediately, but will definitely make sense retroactively. So, for now, I ask that you follow along with the steps and we'll discuss it at the end.

The **radical axis** of a pair of circles is the line (or axis, I guess) along which every point $P$ has tangent segments of equal length to the two circles. This sounds more complicated than it is and is much easier to see with a picture. Fortunately, it's not too hard to find with some simple geometry. We'll draw the radical axis in green.

Although we can drag the green point anywhere, it always allows us to find a purple orthogonal circle.

Of course, the point $P$ in question has to be *outside* the two circles to be able to find tangents, but that's only a worry for intersecting circles (which we don't care about). I drew a purple circle centered at $P$ to show that the tangents are in fact equal in length. This purple circle, however, has one notable property due to the 4 tangent segments it has as its radii: it is **orthogonal** to the red and blue circles, meaning that it intersects the red and blue circles at right angles. This follows from the fact that a circle's radius is perpendicular to its tangent. Hold up a corner of a piece of paper and you'll see the right angles clearly. Speaking of orthogonal circles, that brings us to our next step.

This step is easy enough since we've basically already done half of it. We just need to draw another purple orthogonal circle as we've done before, and then find their intersection. Our space will start to get cluttered quickly, so I'll remove the purple tangency lines, but just know that those are what determine our purple circles. We'll draw this intersection point in black.

For any pair of red and blue circles, their orthogonal circles always intersect in the same two locations.

Here I've selected the outermost intersection for clarity, but we'll see in just a second that either of the two intersections work just fine. First, it's worth noting that for a given configuration of outer blue and inner red bounding circles, the intersection points remain constant. No matter how you may slide those green points, the intersection point doesn't change. That should help cue you into its importance.

As a separate, interesting fact (that I haven't looked into enough), the centers of the red and blue circles are collinear with the two intersection points of the purple circles. Quirk aside, though, we can move onto the third and final step of this inversion circle finding process.

Notice how I said "any" circle. A circle with any radius will suffice as our desired circle. This will be our circle to invert over! We're going to invert a total of four circles: of course, the red and blue ones, but we'll also invert the two purple helper circles. The diagram might look a bit busy, but just remember that this is building off the same diagrams from before; look for what's new in the graphic, and it will be less overwhelming.

Finally, we are able to invert our red and blue circles into concentric ones based on the black intersection point.

And just like that, we've obtained our concentric circles just as we desired! Just to reiterate, because tangencies are preserved through our inversion, we can then draw our chain of tangent circles in the original blue and red bounding circles and know for a fact that they'll remain tangent after our inversion as well. Moreover, since the inversion turns our circles into concentric ones which is the nice symmetric case from before, Steiner's Porism is nicely proved as we know, once again, tangencies are preserved during an inversion.

Ok, but why does this even work? I mean, yeah, it produces concentric circles, but our steps seemed so arbitrary. Why should it work? It has to do with our purple orthogonal circles. Remember, these circles are orthogonal meaning that they intersect our blue and red circles at right angles. Moreover, remember that by definition of our construction in **Step 2**, these orthogonal circles pass through the center of our black inversion circle. As we saw before, a circle passing through the center of the inversion circle means that these purple circles will invert into circles of infinitely-large radii (or lines, if you prefer). Lastly, we also know that, in addition to tangencies, intersections are preserved during an inversion. So, not only do we know that our inverted purple lines must intersect, but the intersections between the red and blue circles as well as the purple circles are also maintained.

So, we have two lines that intersect that need to be orthogonal to two other circles. What configuration allows this? The only way that a pair of intersecting lines can both be orthogonal to a circle is if those lines are radial lines of the circle, meaning they pass through its center! So, both circles must have their center at the intersection of the lines, and by definition of sharing a center, they must be concentric! Isn't that neat?

Also, this explains why our black intersection points of the orthogonal circles are invariant: regardless of what pair of orthogonal circles we use, there is precisely one center of inversion that maps both circles to have the same center (two, technically, but that just flips what circle is on the outside).
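To double check the whole construction with numbers, here is a rough sketch. It assumes two standard facts that I'm using as shortcuts rather than following the widget's steps exactly: the two fixed intersection points sit on the line through both centers, and a circle that misses the center of inversion inverts to a circle whose center and radius have a closed form. The function names and the sample red and blue circles are made up for illustration.

```python
import math


def invert_circle(c: complex, rho: float, o: complex, k: float):
    """Image (center, radius) of the circle |z - c| = rho under inversion across
    the circle with center o and radius k (the circle must not pass through o)."""
    d = c - o
    s = abs(d) ** 2 - rho ** 2           # power of o with respect to the circle
    return o + k * k * d / s, k * k * rho / abs(s)


def limiting_points(c1, r1, c2, r2):
    """The two points shared by every circle orthogonal to both given circles
    (assumes the circles do not intersect and are not already concentric)."""
    d = abs(c2 - c1)
    u = (c2 - c1) / d                            # unit vector along the line of centers
    x0 = (d * d + r1 * r1 - r2 * r2) / (2 * d)   # where the radical axis crosses that line
    h = math.sqrt(x0 * x0 - r1 * r1)             # tangent length from that crossing point
    return c1 + (x0 - h) * u, c1 + (x0 + h) * u


blue = (0 + 0j, 5.0)        # hypothetical outer circle (center, radius)
red = (1 + 0.5j, 2.0)       # hypothetical inner circle (center, radius)
for o in limiting_points(*blue, *red):
    (cb, rb), (cr, rr) = invert_circle(*blue, o, 1.0), invert_circle(*red, o, 1.0)
    print(abs(cb - cr), rb, rr)   # first number is the gap between the two new centers
```

For either choice of center the gap between the new centers comes out as zero up to rounding, which is the concentric configuration we were after. The radius of the inversion circle (here `1.0`) only rescales the picture.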

One thing worth noting, though, is that we get a solution even when the two circles are *not* contained within one another. If the two circles are non-intersecting and are completely separated from one another, we can still follow our procedure from before: we can find a radical axis of the two circles, which leads to our two purple orthogonal circles, that finally intersect at the center of our inversion circle. However, we now get a reversed solution with the red circle becoming the outer concentric circle instead of the inner one (this only happens as a result of the choice of intersection point of our orthogonal circles).

Inversive geometry has all sorts of interesting quirks and facts to explore, and should be better known than it is. Maybe one day I'll touch on its connection to polar curves. But anyways, this post wouldn't be complete if you couldn't build a Steiner chain of circles of your own, so below there is one last widget to experiment with tangent circles. I have left the special black inversion circle on the canvas just so you can see how all of our work to get concentric circles relates to any pair of nested, bounding circles. There's so much I had to gloss over to keep this short, such as the hidden conics in the path of the tangent circles, so I highly recommend skimming other articles such as Wolfram MathWorld's and even Wikipedia's discussions on Steiner chains. As with all things in math, this story is never over: Steiner's Porism has a projective geometry cousin known as Poncelet's Porism, but that deserves its own post entirely some day. Inversive geometry is a simple yet powerful tool, and even just knowing the concept alone is useful to keep in your back pocket as you never know when you may come across something that has an uncanny resemblance to it. Nevertheless, I hope you can at least leave this page with not just an appreciation of a cool bit of math, but a nice animation as well. As before, don't forget to try separating the circles to be outside of one another to get some strange, but special solutions to Steiner's Porism (if you're having trouble seeing the animation clearly, try reducing the radius of the black inversion circle).

]]>We are all familiar with the idea of a grid. From making up the small pixels on our screen, to the compact city maps of New York, grids pop up everywhere due to the kind nature of the innate squares built into them; grids are extremely space efficient, packing in squares above and below each other while still maintaining a sense of order. But why do grids love squares so much? Today, we'll look at a nice proof for why the square is the only regular polygon that can fit in a grid.

First, let's define what a grid is for us. A **grid** is a set of lattice points whose cardinal neighbors (up, down, left, right) are all equidistant from the given point. That's a lot more complicated than it sounds, but all you need to think of is a generic, **square grid** like you would find on a piece of graph paper.

No need to worry about any triangular or hexagonal grids (thank you organic chemistry). Obviously, squares fit in our grid, but how can any other regular polygon possibly fit? Well, remember, we don't necessarily need to only draw horizontal or vertical lines: we can easily draw tilted squares too.

Now that you know about tilted squares, here's a nice puzzle to think about: given an $n \times n$ square grid, how many different squares can you draw? Check the footnote below if you want a solution, but just drawing it out will likely give you the intuition you need. Anyways, this tilted square reveals an important property of grids: rotating a lattice point by 90° around another lattice point gives you a new, different lattice point. You can see this nicely with complex numbers. If you have a lattice point $a+bi$, a 90° rotation is equivalent to multiplying our number by $i$: $i \cdot (a+bi) = -b + ai$. The coefficients remain integers, so if $(a,b)$ is a lattice point, so is $(-b,a)$.
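As a tiny sketch of that property (the function name `rotate90` and the sample points are just for illustration), rotating about an arbitrary lattice point only ever adds and subtracts integers:

```python
def rotate90(point, pivot):
    """Rotate an integer lattice point 90° counterclockwise about another lattice point."""
    (x, y), (px, py) = point, pivot
    dx, dy = x - px, y - py          # translate so the pivot sits at the origin
    return (px - dy, py + dx)        # (a, b) -> (-b, a), then translate back


print(rotate90((3, 1), (0, 0)))   # (-1, 3)
print(rotate90((3, 1), (1, 1)))   # (1, 3)
```

Since no division or square root ever shows up, integer inputs always give integer outputs.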

Another (less relevant for us) property is that if you know a line segment defined by 2 lattice points, and you are given a 3rd lattice point, you can find a 4th one by drawing a second line segment from your 3rd point (think of it like vector addition: if we know a vector and a point, we can find a new point by adding that vector to the point). For the purpose of this post, though, just remember the former property.

The proof that the only regular polygon a grid can define is a square is very simple, but very clever. As an example, we'll use a pentagon. Let's assume that our regular $p$-gon (in our case, pentagon; I use $p$-gon due to poor variable naming later) exists in the grid.

If these 5 points that define our pentagon exist in the grid, then we should be able to generate 5 new, totally valid grid points by rotating each of them 90° around its neighbor.

Notice, though, that we just made another, smaller regular pentagon! ...Or did we? We can prove this quite simply geometrically (trust me, drawing it out and symmetry will guide you all the way through), but I don't want to draw anything right now so instead I'll show you a much more needlessly complicated, linear algebra approach to it (this will, though, give us specific numbers at the end of it). If you can accept this red pentagon is in fact a regular pentagon, just skip ahead, but for now I'll present the proof.

If we can show that the new red pentagon lies on a circle, we can then show that the 5 angles that generate the original black pentagon map onto the new red pentagon. The way we generated our red pentagon was by taking a black point $v$, rotating it around its neighbor $t$ by 90° to land at $v'$ as seen above. We can write this transformation as a product of 3 matrices: translating by $-t$, rotating 90°, then translating back by $+t$ (in a linear transformation, the origin remains fixed so the translations are our way to rotate about any point we want). If $v$ is a point of the form $(\cos\frac{2\pi n}{5}, \sin\frac{2\pi n}{5})$, then $t$ is the point $(\cos\frac{2\pi (n-1)}{5}, \sin\frac{2\pi (n-1)}{5})$, just by definition of being a pentagon on the unit circle with $v$ and $t$ being neighboring points. So, our matrix equation of going from $v\rightarrow v'$ is

$ \begin{bmatrix} 0 & -1 & \sin\frac{2\pi (n-1)}{5} + \cos\frac{2\pi (n-1)}{5} \\ 1 & 0 & \sin\frac{2\pi (n-1)}{5} - \cos\frac{2\pi (n-1)}{5} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\frac{2\pi n}{5} \\ \sin\frac{2\pi n}{5} \\ 1 \end{bmatrix} = v' $

Before we really dig into the matrix computations, take a look at the final column of that last matrix: it has something that looks like $\sin(x) + \cos(x)$ and $\sin(x) - \cos(x)$. These seem too nice not to have a formula for this sum and difference. So, before we go on, it will be worthwhile to see if we can condense those into nicer formulas. In fact, when you plot these functions, you do get what looks like nice sine waves.

Let's say we think it is some type of cosine curve.

$A$ is the amplitude of this new curve, and $\phi$ is the phase offset. Now, we fortunately have a well known angle addition formula for cosine.

This may seem hard to solve, but all we need to do now is match our coefficients (highlighted in red). For the right hand side to equal the left hand side

$-A\sin(\phi) = 1$

That way the $\cos(x)$ and $\sin(x)$ terms will be equal on either side. We can now square both equations and add them together to get

Remembering that $\cos^2(\phi) + \sin^2(\phi) = 1$, we get that $A = \sqrt{2}$. Now we can solve for $\phi$ fairly quickly, too. Though, we'll have to be careful about range restrictions on $\cos^{-1}(x)$ and $\sin^{-1}(x)$, so we should corroborate them to make sure we get a value that satisfies both equations.

The only (reduced) angle that satisfies both equations is $\phi = -\frac{\pi}{4}$. Putting it all together now, we can see that

In a similar manner,

but we can make this more akin to our previous equation recalling that $\cos(x) = \sin(\frac{\pi}{2} - x)$.

Working with one term instead of the sum of two will make our life easier moving forward.
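Written out in one place, the steps above run as follows. This is my reconstruction from the sentences around it (the guess $A\cos(x+\phi)$, matching coefficients, squaring and adding, and the two final identities), so take the layout as a sketch rather than the definitive version.

```latex
\begin{aligned}
\sin(x) + \cos(x) &= A\cos(x + \phi) = A\cos(\phi)\cos(x) - A\sin(\phi)\sin(x) \\
A\cos(\phi) &= 1, \qquad -A\sin(\phi) = 1 \\
A^2\cos^2(\phi) + A^2\sin^2(\phi) &= 2 \;\Rightarrow\; A = \sqrt{2}, \quad \phi = -\tfrac{\pi}{4} \\
\sin(x) + \cos(x) &= \sqrt{2}\cos\left(x - \tfrac{\pi}{4}\right) \\
\sin(x) - \cos(x) &= -\sqrt{2}\cos\left(x + \tfrac{\pi}{4}\right) = \sqrt{2}\sin\left(x - \tfrac{\pi}{4}\right)
\end{aligned}
```

The last line uses $\cos(x) = \sin(\frac{\pi}{2} - x)$ to trade the cosine for a sine with the same $-\frac{\pi}{4}$ offset as the line before it.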

Unfortunately, this transformation matrix alone can't show us our net transform is purely a rotation and scaling due to that third column (which indicates a translation). So, we will have to look at the individual components of $v'$.

$ \begin{bmatrix} -\sin\frac{2\pi n}{5} + \sqrt{2}\cos(\frac{2\pi (n-1)}{5} - \frac{\pi}{4}) \\ \cos\frac{2\pi n}{5} + \sqrt{2}\sin(\frac{2\pi (n-1)}{5} - \frac{\pi}{4}) \\ 1 \end{bmatrix} = v' $

If $v'$ truly is just a rotated and scaled version of a vertex of our original pentagon, then it follows that any $v'$ should lie on a circle, just as $v$ does. So, we can use the Pythagorean identity that $(r\cos\theta)^2 + (r\sin\theta)^2 = r^2$ which implies that if we square the $x$ and $y$ components of $v'$ and add them together, we should get a constant. For simplicity in writing, we'll use $\alpha = \frac{2\pi (n-1)}{5} - \frac{\pi}{4}$.

$ \cos^2 \frac{2\pi n}{5} + \sin^2 \frac{2\pi n}{5} + 2\cos^2 \alpha + 2\sin^2 \alpha + 2\sqrt{2}(\cos\frac{2\pi n}{5}\sin \alpha - \sin\frac{2\pi n}{5}\cos \alpha) $

This may look like a pain to work with, but just grouping like terms and trigonometric identities clean this up real fast.
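Here is a sketch of that cleanup, using the shorthand $\theta_n = \frac{2\pi n}{5}$ (my notation, introduced only for this block), so that $\alpha = \theta_n - \frac{2\pi}{5} - \frac{\pi}{4}$:

```latex
\begin{aligned}
x^2 + y^2 &= \underbrace{\cos^2\theta_n + \sin^2\theta_n}_{1} + \underbrace{2\cos^2\alpha + 2\sin^2\alpha}_{2} + 2\sqrt{2}\,(\cos\theta_n\sin\alpha - \sin\theta_n\cos\alpha) \\
&= 3 + 2\sqrt{2}\,\sin(\alpha - \theta_n) \\
&= 3 - 2\sqrt{2}\,\sin\left(\tfrac{2\pi}{5} + \tfrac{\pi}{4}\right) \approx 0.479852979
\end{aligned}
```

The key step is recognizing the bracket as the angle subtraction formula for sine; once $\alpha - \theta_n$ appears, every $n$ cancels.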

So it does simplify to a constant! This constant represents the radius$^2$ of the new circle the red pentagon lies on (based on the black pentagon's unit circle). Meaning, the radius of the circle the red pentagon lies on is $\sqrt{0.479852979} \approx 0.692714211$. To find the angle it is rotated by, we just find how far the first vertex ($n=0$) is rotated:

What's great about our linear algebra approach too is that it quickly generalizes for any regular $p$-gon! The only part of the result impacted by our choice of a pentagon is any appearance of $\frac{2\pi}{5}$. So if you wanted to do it for any regular $p$-gon, all you do is replace $\frac{2\pi}{5}$ with $\frac{2\pi}{p}$.

To get back to the original point though, we have shown that a vertex $v$ on the unit circle under our specific transformation maps to a vertex $v'$ on a scaled, rotated circle (and because it was all linear transformations, the scalings and rotations are uniform around the origin), which implies that our black pentagon maps to another regular pentagon in red. Since those vertices in red are valid points in the grid since we found them with 90° rotations of other, valid lattice points, we can find yet *another* set of 5 valid lattice points by doing the same operation again of rotating each vertex 90° around its neighbor.
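If you'd like to see this without any algebra, here is a small numerical check. It uses complex numbers instead of matrices (rotating by an angle is multiplying by $e^{i\theta}$), and the function name is my own choice for illustration.

```python
import cmath
import math


def rotate_about_neighbor(p: int, theta: float = math.pi / 2):
    """Rotate each vertex of a unit regular p-gon by theta about its previous neighbor."""
    verts = [cmath.exp(2j * math.pi * n / p) for n in range(p)]
    # verts[n - 1] wraps around to the last vertex when n = 0.
    return [verts[n - 1] + cmath.exp(1j * theta) * (verts[n] - verts[n - 1])
            for n in range(p)]


red = rotate_about_neighbor(5)
radii = [abs(z) for z in red]
# Angle between consecutive new vertices, as seen from the origin.
steps = [cmath.phase(red[(n + 1) % 5] / red[n]) for n in range(5)]

print(radii)                                # all five agree (about 0.6927)
print([math.degrees(s) for s in steps])     # all five are 72 degrees, up to rounding
print(math.degrees(cmath.phase(red[0])))    # direction of the image of the n = 0 vertex
```

Equal radii plus equal 72° steps is exactly what it means for the five red points to form a regular pentagon centered at the origin.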

With our previous logic, we know that this too is a regular pentagon, which again, allows us to find 5 *new* lattice points in our grid by doing the same operation of rotating around a neighbor by 90°... And again... And again... For as many new lattice points and pentagons as we want.

Here are 15 nested pentagons; try dragging any of the vertices to zoom in and you can see they are just smaller, rotated regular pentagons.

Some might look at this graph and see an obvious flaw: what about the case of the equilateral triangle though ($p=3$)? Yes, technically it has an $r>1$ and this does actually result in it expanding out.

If the points expand out, we can't really say much about it existing in the grid or not since we only have a minimum bound on the distance between lattice points. But, there was a second property I glossed over regarding grids:

If you know a line segment defined by 2 lattice points, and you are given a 3rd lattice point, you can find a 4th one by drawing a second line segment from your 3rd point.

This property allows the equilateral triangle to be turned into the equivalent case of a hexagon, leading it to a case where $p>4$ and hence an $r<1$, indicating that an equilateral triangle, too, is impossible to draw within the grid.

Another neat little fact is the bend at $p=8$; an octagon has the smallest bounding radius for the 90° neighbor rotations (since that makes the $\sin$ term go to 1). But after that, $r(p)$ starts to increase. How can we know it won't equal or exceed 1? Well, no matter how big $p$ is, $\frac{2\pi}{p} + \frac{\pi}{4}$ will always be greater than $\frac{\pi}{4}$. Just thinking about it though leads to an interesting thought: what about at the limit of $p\to\infty$?

Which as we saw, would lead to an $r$ value of 1. So, according to this limit, an infinite sided $p$-gon, better known as the circle, is possible... *at the limit*. You'll get better and better approximations of the circle the more sides you add, but this essentially turns our grid into lattice points that are infinitely close together, which ruins the point of the grid in my opinion. So, it's up to you if you think a circle can exist in a grid, but an interesting thought nonetheless.
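To see these numbers directly, here is a short sketch that tabulates the bounding radius. It assumes the closed form $r(p) = \sqrt{3 - 2\sqrt{2}\sin(\frac{2\pi}{p} + \frac{\pi}{4})}$, which is my reading of the pentagon result with $\frac{2\pi}{5}$ swapped for $\frac{2\pi}{p}$; the function name is hypothetical.

```python
import math


def shrink_radius(p: int) -> float:
    """Radius of the new p-gon after 90° neighbor rotations of a unit regular p-gon
    (assumed closed form, see the note above)."""
    return math.sqrt(3 - 2 * math.sqrt(2) * math.sin(2 * math.pi / p + math.pi / 4))


for p in (3, 4, 5, 6, 8, 12, 100, 10_000):
    print(p, round(shrink_radius(p), 6))
```

The triangle comes out above 1, the square at 1, the octagon is the smallest, and large $p$ creeps back up toward 1 without reaching it.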

Earlier I mentioned that we didn't need to worry about triangular or hexagonal grids, but what if we did?

The triangular (left) and hexagonal (right) grids are less obvious for what other regular polygons they can fit.

Fortunately, this just requires tweaking our matrix equation from before a little bit: instead of rotating by 90°, we now rotate by 60° and 120° for the triangular and hexagonal grids respectively. In general, if we want to rotate by $\theta$ radians (easy conversion from degrees) around a neighboring point for a regular $p$-gon, the vector for $v'$ in terms of $v$ is

Finding $r=\sqrt{x^2 + y^2}$ again reveals that

...which is in fact a constant. So rotations of *any* angle around neighboring points output more regular polygons.
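Here is a numerical sketch of the same claim for an arbitrary angle. It sidesteps the closed form entirely: it just rotates every vertex about its neighbor with complex numbers and checks that all the results are equally far from the origin. The function name and the list of $p$ values are my own choices.

```python
import cmath
import math


def neighbor_rotation_radius(p: int, theta: float) -> float:
    """Common distance from the origin of the points produced by rotating each
    vertex of a unit regular p-gon by theta about its previous neighbor."""
    verts = [cmath.exp(2j * math.pi * n / p) for n in range(p)]
    new = [verts[n - 1] + cmath.exp(1j * theta) * (verts[n] - verts[n - 1])
           for n in range(p)]
    radii = [abs(z) for z in new]
    assert max(radii) - min(radii) < 1e-12   # every new vertex is equally far out
    return radii[0]


grids = (("square", math.pi / 2), ("triangular", math.pi / 3), ("hexagonal", 2 * math.pi / 3))
for name, theta in grids:
    print(name, [round(neighbor_rotation_radius(p, theta), 4) for p in (3, 4, 5, 6, 8, 12)])
```

The angles 60° and 120° for the triangular and hexagonal grids are the ones named above; any other angle can be passed in the same way.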

Try rotating one of the red dots to watch the regular pentagon grow and shrink according to the angle you rotate around its neighbor.

As before, we can plot this function to see what regular polygons have a bounding radius less than one for each grid:

It would appear that no hexagons can be made in the triangular grid, as they have a bounding radius $r_{\frac{\pi}{3}}(6)=0$. However, the point they actually collapse into is a valid vertex of the triangular grid. This is because that's one way to define the triangular grid: draw a hexagonal grid and place extra points in the center of each hexagon. So again, the only regular polygons that can be contained in the triangular and hexagonal grids unfortunately appear to be none other than the equilateral triangle and regular hexagon themselves.

This post was inspired by a Mathologer video discussing an application of shrinking polygons and this out-of-the-box thinking that is so cool. Part way through the video, he glosses over the reasoning behind why the shrinking polygons are similar, so this is my own take on that portion of the video.

As I said, drawing it out is your best bet. Let's start with how many $1 \times 1$ squares there can be in an $n \times n$ grid. Well, it's just $(n-1)^2$ by definition of the size of the grid (remember, $n$ refers to the number of dots, so there are only $(n-1) \times (n-1)$ squares). How about $2 \times 2$ squares? Well, we have eliminated a possible row and column from which we can place the square, so there are $(n-2)^2$ total $2 \times 2$ squares. You might be tempted to say that there are $\sum_{i=1}^{n-1} (n-i)^2$ total squares (where we are individually counting every $i\times i$ square up to $n-1$), but you can't forget that there are tilted squares too. The trick here is that now that you have all the non-tilted squares, you just need to find how many possible tilted squares can be contained in a non-tilted square. In a $1 \times 1$ square, there is no room to tilt a square in it, so we move on. In a $2 \times 2$ square, there is exactly one extra lattice point to tilt on, so we add a factor of $2$ to our count of $2\times 2$ squares $\rightarrow 2(n-2)^2$. Similarly for the $3 \times 3$ squares, we have 2 new lattice points we can tilt to, tripling our count $\rightarrow 3(n-3)^2$ squares. So finally, we can write it as a final sum of $\sum_{i=1}^{n-1} i(n-i)^2 = \frac{1}{12}(n^2)(n^2-1)$ total squares for an $n \times n$ grid.
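If you don't trust the counting argument, a brute-force check is easy to sketch. The approach below (my own, not part of the argument above) picks a corner and an edge vector, walks around the square by rotating that edge 90° each time, and keeps the square if all four corners land on dots.

```python
def count_squares(n: int) -> int:
    """Count all squares (tilted ones included) with corners on an n-by-n grid of dots."""
    total = 0
    for x in range(n):
        for y in range(n):
            # Edge vector (dx, dy) with dx > 0 and dy >= 0: exactly one of the four
            # counterclockwise edges of any square points this way, so no double counting.
            for dx in range(1, n):
                for dy in range(0, n):
                    corners = [(x, y), (x + dx, y + dy),
                               (x + dx - dy, y + dy + dx), (x - dy, y + dx)]
                    if all(0 <= cx < n and 0 <= cy < n for cx, cy in corners):
                        total += 1
    return total


for n in range(2, 8):
    print(n, count_squares(n), n * n * (n * n - 1) // 12)
```

The two columns printed for each $n$ are the brute-force count and the closed form, and they agree for every size in the loop.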

]]> http://xperimex.github.io/blog/grid-polygons

COVID-19 is one of those events that will likely define not just the way people interact with each other, but entire societies. I wouldn't even be surprised to see this pop up in an AP US History textbook in a decade just for how long the pandemic has been drawn out for. So it should be no surprise that from the first month of the pandemic, a vaccine was the only thing on people's minds. I mean, just look at this graph depicting the number of total coronavirus cases in the U.S. *alone*.

Data as reported by the CDC (updated as of July 15th)

It took around 2 months to hit a million cases in the U.S., and another 3 months to reach 5 million cases. That's the issue with pandemics: they explode at a rate faster than we can realize. So, today, I want to talk a little bit about different types of disease models, and how these can educate ourselves on the right preventative courses of action.

The **S**usceptible **I**nfected **R**ecovered Model of disease spread groups individuals into 3 boxes and relates them to show how each box grows or diminishes over time: susceptible means that you are currently healthy but are vulnerable to the infection; infected means just that and indicates a current illness; and while recovered doesn't necessarily mean you overcame the disease, it means you are unable to spread the disease anymore (be it through proper recovery and developed immunity, or the unfortunate case of death, since in both cases you are no longer a disease vector). This sounds like the perfect use for a Markov chain (here's a refresher if you need one)! Our Markov chain will have 3 states, the aforementioned susceptible, infected, and recovered, and will follow a transition model as such:

The SIR Markov chain model

Let's say everyone starts out as susceptible; we can write our initial population as $N$, and the initial distribution of those $N$ people as $P = \begin{bmatrix} N & 0 & 0 \end{bmatrix}$ for $N$ susceptible, 0 infected, and 0 recovered.

With that out of the way, let's think about the above Markov chain. If you are currently susceptible, each day there's a chance you might become infected. Let's call this probability $\beta$ as the average chance of getting infected. At the same time though, if you are smart and act carefully, you might not become infected, and this will be $1-\beta$. Similarly if you're infected, there's a chance you might (positively or otherwise) overcome the disease! We'll call this probability $\gamma$, and the chance of not recovering $1-\gamma$ (you can think of $\gamma$ by taking its inverse: $\frac{1}{\gamma}$ is the average number of days for a recovery to occur). Finally, if you're recovered, that's the end of your journey, as once recovered you're always recovered. This can be condensed into the simple transition matrix below.
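The chain described above can be sketched in a few lines of Python. The state order S, I, R, the function names, and the use of plain lists instead of a matrix library are my own choices for illustration.

```python
def transition_matrix(beta: float, gamma: float):
    """Rows and columns are ordered S, I, R; each row sums to 1."""
    return [
        [1 - beta, beta,      0.0],    # susceptible: stay healthy or get infected
        [0.0,      1 - gamma, gamma],  # infected: stay sick or recover
        [0.0,      0.0,       1.0],    # recovered: stays recovered
    ]


def step(P, M):
    """One day: multiply the row vector P by the matrix M."""
    return [sum(P[i] * M[i][j] for i in range(3)) for j in range(3)]


def simulate(N: float, beta: float, gamma: float, days: int):
    """Daily [S, I, R] counts, starting with everyone susceptible."""
    M = transition_matrix(beta, gamma)
    history = [[float(N), 0.0, 0.0]]
    for _ in range(days):
        history.append(step(history[-1], M))
    return history
```

Calling `simulate(N, beta, gamma, days)` returns one `[S, I, R]` row per day, which is all you need to plot the three curves.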

With that, let's run a trial with $N=1000$ people, infectivity $\beta=.4$, and $\gamma=\frac{1}{5}$ (we expect an infection to take 5 days to recover). So on Day 0, everyone is healthy and (ironically) susceptible to our disease. The following Day 1, people are interacting and enjoying themselves, unbeknown to them that they are spreading a new contagion. Day 1 results in $PM = \begin{bmatrix} 600 & 400 & 0 \end{bmatrix}$, which makes sense as we expected 40% to become infected. We can track each state of S, I, and R and plot them accordingly across the span of a month.

Evolution of the SIR Markov chain as percent of the population

For such a simple model, it's not bad, but there are some obvious flaws. For starters, the infected population doesn't really infect others; all infections stem from random appearances in the susceptible population. This is why by the end of the 30 days, we have 0 susceptible and only recovered, since infections don't require infected to be present which is kind of odd (notice how our initial population was *only* susceptible and yet infected pop up). We can do better.

While Markov chains were appealing due to the nature of having 3 states, what we really want to focus on is the relationships between the 3 states: if there are lots of infected people, that should increase the rate at which others get infected. Since we're looking at how the value of one box (infected people) affects the rate at which the other boxes change (how fast susceptible and recovered decrease/increase), perhaps we should try a system of differential equations. Instead of states, we'll now have 3 different functions: $S(t)$, $I(t)$, and $R(t)$, which return the number of susceptible, infected, and recovered at a time $t$. If you're not familiar with differential equations, don't worry, this section will be brief. The idea behind differential equations is that sometimes it's hard to exactly quantify a function or value, but we know how the function *changes* relative to another value. See the first few minutes of this video if this idea intrigues you and for some nice opening examples. Besides, you already know more or less what the equations look like, since we know how each function should behave.

That's really all there is to them. The specific math behind it looks more complicated than it is, but keeping the above in mind will make it much clearer.

I like to think of it as the fraction of vulnerable people $\frac{S}{N}$ having a $\beta$ chance of being infected, which is scaled up by the number of infected people $I$. That makes sense for the number of new infected people. Similarly, if the average chance of recovery is $\gamma$, then you should expect a proportion $\gamma$ of infected people to recover ($\gamma I$). The last important fact to note is what happens when we add these equations together:

No matter how the 3 different categories of people evolve over time, the net change between all of them will be 0, meaning our total population remains the same over time (which is good since we don't want people appearing and disappearing out of nowhere). Let's watch the scenario unfold once more with $N=1000$ with a distribution of $P = \begin{bmatrix} 999 & 1 & 0 \end{bmatrix}$ (*someone* has to start the pandemic), $\beta=.4$, and $\gamma=\frac{1}{5}$.

Evolution of the SIR differential equations as percent of the population

That's more like that famous curve. It may not be as drastic as the 10 day everyone-is-infected model as the Markov chain we tried earlier, but it's definitely much more realistic. A single person was able to infect about 800 people across only a couple of months, which is still scary fast.
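For anyone who wants to reproduce a curve like this, here is a minimal sketch. It assumes the standard SIR equations $\frac{dS}{dt} = -\beta \frac{S}{N} I$, $\frac{dI}{dt} = \beta \frac{S}{N} I - \gamma I$, $\frac{dR}{dt} = \gamma I$ (which is how I read the description above) and steps them forward with a simple Euler method; the step size `dt`, the 160-day window, and the function name are arbitrary choices of mine.

```python
def simulate_sir(S, I, R, beta, gamma, days, dt=0.01):
    """Forward-Euler integration of the SIR equations; returns one (S, I, R) tuple per day."""
    N = S + I + R
    steps_per_day = round(1 / dt)
    daily = [(S, I, R)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * S * I / N * dt
            new_recoveries = gamma * I * dt
            # Whatever leaves one box enters the next, so S + I + R never changes.
            S -= new_infections
            I += new_infections - new_recoveries
            R += new_recoveries
        daily.append((S, I, R))
    return daily


history = simulate_sir(999, 1, 0, beta=0.4, gamma=0.2, days=160)
S_end, I_end, R_end = history[-1]
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(round(R_end), peak_day, round(history[peak_day][0]))
```

The three printed values are the final recovered count, the day infections peak, and how many people were still susceptible on that day. A proper ODE solver would be more accurate, but Euler with a small step is enough to see the shape.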

As famous as this curve may be, you likely have heard of the idea of $R$ and $R_0$ too. $R$ is the **reproduction number** of a disease/virus that tells you, on average, how many people an infected person will spread the disease to. If $R>1$, then you get the epidemic issue, where the disease is spreading exponentially as each person is giving more people than just themselves the disease. When $R=1$, you have an *endemic* where the disease is neither spreading nor being contained. When $R<1$, then you have a contained virus that is decaying throughout a community. $R_0$ is just the $R$ value at the start of the outbreak, and can be found with the formula $R_0 = \frac{\beta}{\gamma}$. In our simulation, $R_0 = \frac{.4}{.2} = 2$. So if we wanted to contain this disease, we want to find how many infections we need to prevent for $R<1$. Since $R_0 \cdot .5 = 1$, we only need to prevent 50% of infections to contain the outbreak!

You can see that as soon as less than 50% of the population is left to infect, infections start to decline (not end, but decline). You can also reduce infectivity by taking preventative measures (e.g. wearing masks) or getting the appropriate immunization against the disease (e.g. vaccines). If you want to read more about this model and more of its intricacies, Nicky Case wrote a very nice interactive article about it.

While we did find a nice model to represent disease spread, the SIR model is only really nice for analyzing very *big* communities. With the SIR model, we assume everyone interacts with enough people so that infected people can always target a susceptible person if they need to in this very dense network of people.

Most people, though, only regularly interact with, and really care about, those immediately in their social circles. At this small scale, such as, say, a neighborhood or even a friend group, people do *not* interact with others equally. Some are more introverted and interact with only a few people, and some are extroverted beacons that interact with most of the group. This changes the dynamics of disease spread a lot, as those who are more outgoing and meet more people will be more at risk of being infected than someone who only talks to a few people. As such, we'll need a different approach than the SIR model.

In order to model connections between people we'll use a **graph**. A graph in this case is not the standard parabola of $f(x) = x^2$, but rather it is a visual tool consisting of two parts: there are **nodes** which are dots (to represent people here), and **edges** that are lines that connect nodes (to represent interactions or friendships).

Example of a graph that may represent the individual friendships in a clique

As in most groups, we have a couple of people at the center connected to quite a few people, and some on the edge of the circle regularly interacting with as few as 2 people. This is a really helpful representation as now not only can we watch *where* the virus spreads, but we also have a direct way to implement small-scale prevention tactics as we can see specifically who is infected. Ideally, we would vaccinate everyone in this group and make sure no (consequential) disease spread can occur, but that might not happen. So, **if we could only vaccinate, say, 25% of this group, who would you vaccinate to minimize a disease vector?** There are some obvious candidates such as the nodes with the most connections, but what is the *best* way? Let's leave our susceptible people in green and infected in red, but we'll add a new purple node to indicate vaccinated.

Randomly vaccinating 5 of the 20 people in the network. Can we do better?

Before we talk strategies, let's be clear about some of the assumptions made for simplicity's sake.

- Vaccinations are 100% effective both ways, completely negating the possibility of infection of and transmission from the vaccinated person as if fully immune.
- Edge lengths do not matter; if someone is connected to another person, we assume that that edge will act the exact same way as all other interactions and edges. This ensures a constant $\beta$.
- Dying and recovering from the disease will again be treated the same under an overarching "Recovered" state, in which the affected person becomes fully immune.
- An infected person "recovers" after only a single day ($\gamma = 1$) since a) it will ensure our simulation can spit out some numbers at the end, as it's hard to *completely* protect an entire network of people, and b) our program wasn't written the most efficiently.

As before, we want to find ways to reduce $R_0$. But, what is $R_0$ in this case? Well, remember $R_0$ is just the average number of transmissions a single infected person will instigate. So, we can approximate this with the average number of edges an infected node has, multiplied by the infectivity. Since we can't change the infectivity rate ($\beta$), the only way we can lower $R_0$ is by removing possible edges for infected people to transmit across. So, let's look at different strategies that are both really good and really bad at removing edges from our graph.

- **Random**: Realistically, vaccinating people randomly seems like a bad idea. In a simulation, though, it's never a bad starting point.
- **Random then neighbors**: Here, we start with a single random person to vaccinate, then only vaccinate people connected to an already vaccinated person. If your friend got vaccinated, why shouldn't you?
- **Most connected**: This is the obvious solution. If we want to reduce edges efficiently, why don't we vaccinate the nodes with the most edges connected to the most people?
- **Most connected non-neighbors**: This is a spin on the last one. We want to vaccinate the most connected people, but why not spread them out a bit? We pick the highest connected node to vaccinate, and then proceed to vaccinate the next highest node that is not connected to a vaccinated person.
- **Least connected**: This one is more of a why not scenario. It's like the previous two strategies but reversed: find the loners of the group and vaccinate them.

Again, we'll vaccinate $\approx$25% of our population (50/100, 100/200, 200/400, and 400/800 people/total connections) according to our strategies above. We'll also infect 10% of our population with the contagion of your choosing with the same infectivity of $\beta = .4$ as before. We'll do 10 trials per population size, and average them to get a rough estimate of how each strategy fared.
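As a rough illustration of this experiment, here is a minimal sketch that implements only three of the five strategies (random, most connected, least connected) on a graph built by dropping edges between random pairs of nodes. All function names are hypothetical, and the details (rounding, picking the initially infected 10% from unvaccinated people only) are assumptions rather than the original project's code.

```python
import random


def random_graph(n, m, rng):
    """Place m distinct edges uniformly at random among n nodes."""
    edges = set()
    while len(edges) < m:
        a, b = rng.sample(range(n), 2)
        edges.add((min(a, b), max(a, b)))
    neighbors = {v: set() for v in range(n)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def vaccinate_random(neighbors, k, rng):
    return set(rng.sample(sorted(neighbors), k))


def vaccinate_most_connected(neighbors, k, rng):
    ranked = sorted(neighbors, key=lambda v: len(neighbors[v]), reverse=True)
    return set(ranked[:k])


def vaccinate_least_connected(neighbors, k, rng):
    ranked = sorted(neighbors, key=lambda v: len(neighbors[v]))
    return set(ranked[:k])


def run_outbreak(neighbors, vaccinated, beta, rng, seed_fraction=0.1):
    """Return the fraction of unvaccinated people who were ever infected."""
    unvaccinated = [v for v in neighbors if v not in vaccinated]
    n_seeds = max(1, round(seed_fraction * len(neighbors)))
    infected = set(rng.sample(unvaccinated, min(n_seeds, len(unvaccinated))))
    ever_infected = set(infected)
    while infected:
        newly_infected = set()
        for person in infected:
            for friend in neighbors[person]:
                susceptible = friend not in vaccinated and friend not in ever_infected
                if susceptible and rng.random() < beta:
                    newly_infected.add(friend)
        ever_infected |= newly_infected
        infected = newly_infected  # gamma = 1: yesterday's cases have recovered
    return len(ever_infected) / len(unvaccinated)


def average_infected(strategy, n=200, m=400, trials=10, beta=0.4, seed=0):
    """Average the infected fraction over several random graphs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        graph = random_graph(n, m, rng)
        vaccinated = strategy(graph, n // 4, rng)
        total += run_outbreak(graph, vaccinated, beta, rng)
    return total / trials


for strategy in (vaccinate_random, vaccinate_most_connected, vaccinate_least_connected):
    print(strategy.__name__, round(average_infected(strategy), 3))
```

The exact numbers depend on the random seed and on the assumptions above, so they are only meant for comparing the strategies against each other.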

Unsurprisingly, vaccinating the most connected nodes without restriction fared the best; closing off as many routes of infection as fast as possible led to only about 10% of non-vaccinated people getting infected. Following close behind is the most connected non-neighbors strategy. We tried spreading out vaccinations since the edge between two vaccinated people doesn't require two doses of the same vaccine, but a popular node is, by nature, likely connected to other popular nodes too. So in reality, we are removing fewer edges than we could by just vaccinating the most popular nodes. It's for this reason that our orange strategy of randomly vaccinating one person and then their neighbors was such a bad strategy: not only do we not know if we're vaccinating a well connected person, but we're wasting vaccines by doubly protecting edges between neighbors. Since reducing the edge count is our only form of reducing $R_0$, this is a very bad strategy. What is surprising, though, is that vaccinating the least connected people was almost as good as random. Since unpopular people can only be connected to so many people, they tend to be pretty spread out, removing a fair number of edges between all the vaccinated people.

Honestly, our graph model isn't all that bad considering how simple it is, but let's look at some nice upgrades we can hand over to it.

This was originally made as a school project, so we generated our graph by taking some number of edges and placing them randomly between nodes for simplicity in presentation. However, these obviously aren't the only types of networks. A prominent one for modelling interactions between people is the **Erdős-Rényi** graph. For $n$ nodes, there are $n \choose 2$ possible pairings of nodes for edges. In an ER graph, we flip a weighted coin to accept each individual edge, and the final result is just the collection of edges we added. If our acceptance probability is $p$, then we would expect the graph to have a total of ${n \choose 2}p$ edges, and each node to have on average $p(n-1)$ connections. In this network, people are approximately as popular as each other (since our coin flip method leads to a binomial distribution with few super popular or super unpopular people).

An Erdős-Rényi graph with an average connection of 4.8. Look how nice and even this graph is.
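The coin-flip construction is short enough to sketch directly. The function name and the example values of `n` and `p` below are illustrative assumptions, not the settings used for the figure.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng=None):
    """Keep each of the n-choose-2 possible edges independently with probability p."""
    rng = rng or random.Random()
    return [pair for pair in combinations(range(n), 2) if rng.random() < p]


n, p = 20, 0.25
edges = erdos_renyi(n, p, random.Random(0))

expected_edges = n * (n - 1) / 2 * p   # (n choose 2) * p
expected_degree = p * (n - 1)          # average connections per node
actual_degree = 2 * len(edges) / n     # each edge touches two nodes

print(f"edges: {len(edges)} (expected {expected_edges})")
print(f"average degree: {actual_degree:.2f} (expected {expected_degree})")
```

On any single run the actual counts will wobble around the expected values, which is the binomial spread mentioned above.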

Another type of graph is the **Barabási-Albert** type graph. The idea with this network is that people tend to be connected to a distinct, central "hub" of a few people that make up most of a network. You start with a small network of $m$ people/nodes. At each step, you add a new node and connect it to an existing person $i$ with probability $p_i = \frac{E(i)}{\sum_jE(j)}$, where $E(x)$ is the number of edges for node $x$ and $j$ ranges over all existing nodes. If $E(i)$ is really big (i.e. a super popular person), then you'll have a much bigger probability to connect to that person over someone who only has a few connections. Wouldn't you want to hang out with the cool kids too? You can give each new node as many connections as you want, so long as it's no more than $m$ (think about the first new node: how can someone have 10 different friends in a group of 5?).

A Barabási-Albert graph where each new node gets 2 connections to the existing network. Can you see the clusters of the graph?
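Here is a minimal sketch of that growth process. It assumes the starting $m$ nodes are all connected to each other (so every starting node has a non-zero $E(i)$) and that each new node gets `k` distinct connections; both choices, and the function name, are assumptions for illustration.

```python
import random


def barabasi_albert(n, m, k, rng=None):
    """Grow a graph to n nodes: start from m fully connected nodes, then attach
    each new node to k distinct existing nodes chosen in proportion to degree."""
    if m < 2 or not 1 <= k <= m:
        raise ValueError("need m >= 2 and 1 <= k <= m")
    rng = rng or random.Random()
    edges = [(a, b) for a in range(m) for b in range(a + 1, m)]
    degree = [m - 1] * m
    for new in range(m, n):
        targets = set()
        while len(targets) < k:
            # choices() draws node i with probability E(i) / sum_j E(j)
            targets.add(rng.choices(range(new), weights=degree)[0])
        for target in targets:
            edges.append((target, new))
            degree[target] += 1
        degree.append(k)
    return edges, degree


edges, degree = barabasi_albert(n=30, m=3, k=2, rng=random.Random(0))
print("most connected nodes:", sorted(degree, reverse=True)[:5])
print("least connected nodes:", sorted(degree)[:5])
```

Sorting the degrees usually shows a few hubs far above the rest, in contrast to the even spread of the coin-flip graph.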

If you want to explore this more, the inspiring article that led to this project analyzes these 2 exact graphs in the exact problem we discussed above, going a bit more in depth with the idea of the counterintuitive friendship paradox.

We looked at a very specific set of vaccination parameters, namely relying on all edges being treated equally and on the number of edges a node has being constant. In reality, people don't interact with each other equally, so instead of having a generic edge, we could implement an edge **weighting**. This quality would add a number between 0 and 1 to each edge to represent how "friendly" two people are with each other. We could then simulate the difference between best friends and mere colleagues, whose interactions may have more or less of a chance to spread a virus. The other aspect to consider is updating edge counts. Ideally we would only want to count a vaccine candidate's number of edges to *vulnerable* people, not to vaccinated or infected people too, with counts updating after every vaccination. Then we could focus on protecting people specifically.

The final problem I wanted to talk about is what the goal of vaccines is in a network. As we said, it's about removing edges efficiently, but some edges are definitely worth more than others. Imagine you have two friend circles, with exactly one mutual person between them. Even if that mutual friend has only two edges, vaccinating him would close that bridge between the two friend circles, isolating the virus into only one of them.

Vaccinating the literal middle-man isolates the two groups

Searching for a way to split our graph into two parts with the fewest number of edges removed is called the **sparsest cut problem**. If we can find the (usually approximate) sparsest cut, we can divide our graph in two and attempt to isolate the virus in only a single bubble. Before, we were looking to remove as many edges as possible so that the virus has few paths from any one node to another (a **Hamiltonian path**, for reference, is a path that visits every node exactly once). Here, we are trying to isolate the virus into a smaller bubble.

Sparsest cut of a random ER graph that approximately cuts the graph in half

So here we would want to vaccinate everyone on the border of the yellow-purple divide to make the sparsest cut a reality. Of course, this only works if you have enough vaccines AND if your sparsest cut splits the graph in half evenly. This strategy can be used recursively on the two sub-graphs, so the more vaccines you have, the better this strategy works.

I hope this brought some light to the epidemic math that goes on behind the scenes of so many news articles, but more importantly showed you the power of modelling situations not just in different ways, but creative ones as well. Graph theory in particular has such widespread applications; often just thinking of something simply in terms of connections can allow you to borrow from its variety of tools and techniques.

]]>As with all puzzles, drawing something always helps.

Label the ends of the stick to be 0 and 1, and we'll make the first break at point $x$.

The key to this puzzle is to use the **triangle inequality**: no side of a triangle can be longer than the sum of the other two.

The triangle inequality visualized.

So, that means that no segment can be longer than half the length of our stick. Now, we can start solving the puzzle. Without loss of generality, make the first break at point $x$ between 0 and .5 to ensure our first break is on the left side (if it's not within this range, it'll be some symmetric case on the right half of the stick). Our left piece of the stick is already less than .5, so that's good, but that means our second break *must* be on the right piece, as otherwise the right piece will be a leg greater than .5, and that's no good. The probability our second break will be on the right piece is equal to the length of that segment over the total line (since it's uniformly random): $\frac{1-x}{1}$. Let's analyze that longer leg now.

The feasibility region in red shows the location of a valid cut to ensure it does not make a segment longer than .5 units.

We can't make a cut at or further than the .5 mark on the segment for the obvious reason: it would make a leg longer than .5, breaking the triangle inequality. For a similar reason, we can't make a cut at $1-x-.5$ or earlier, as that will make a leg on the right side longer than .5 too. We need a cut in between those two boundaries. Just like we found the probability for making a cut on this longer leg to begin with, if we can find the length of that feasible region and divide that by the length of the leg, that will give us the probability for making a good cut here. The length of the red feasible region is $|.5-(1-x-.5)| = |1-1+x| = x$, and the total length is just $1-x$. So the probability our second cut is valid is $\frac{x}{1-x}$.

Putting it all together, we get that the probability our second cut is valid given a first cut at $x$ is $\frac{1-x}{1} \cdot \frac{x}{1-x} = x$. Kind of neat that the probability is exactly equal to the length of the first piece. But of course, $x$ isn't constant. It too is a random variable, so we can average it to get an expected probability across a large number of trials.
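Written out, and assuming $x$ is uniformly distributed on $[0, .5]$ as set up above, that average is

$\mathbb{E}[x] = \frac{1}{.5 - 0}\int_0^{.5} x \, dx = 2 \cdot \frac{(.5)^2}{2} = \frac{1}{4}$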

As a gut check, this should make sense as .25 is exactly the midpoint between 0 and .5, the boundaries we set at the start of the solution.

So, if you break a stick at 2 uniformly random points into 3 segments, the probability those 3 segments can form a triangle is exactly $\frac{1}{4}$.
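If you'd rather not trust the algebra, a quick simulation makes a nice sanity check. This is a minimal sketch with made-up function names; it breaks a unit stick at two uniformly random points and counts how often every piece is shorter than .5.

```python
import random


def forms_triangle(a, b):
    """True if breaking a unit stick at points a and b gives three pieces that form a triangle."""
    left, right = min(a, b), max(a, b)
    pieces = (left, right - left, 1 - right)
    # Triangle inequality on a unit stick: no piece may reach half the length.
    return max(pieces) < 0.5


def estimate_triangle_probability(trials=200_000, seed=0):
    rng = random.Random(seed)
    hits = sum(forms_triangle(rng.random(), rng.random()) for _ in range(trials))
    return hits / trials


print(estimate_triangle_probability())
```

The estimate should hover close to $\frac{1}{4}$, getting closer as `trials` grows.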

]]>Obviously, we can't just make an infinite number of each wearable. We have some constraints between available cloth, printing, and effort. Say you have 35 sheets of cloth, and each shirt requires $\frac{1}{2}$ of one while each hat only requires $\frac{1}{5}$ of one. For printing, maybe you only have enough ink for 100 prints total. Finally, hats are hard to make, so you cap yourself at only making 70 and no more. We have a list of constraints which can be nicely written as a set of inequalities:

$\begin{align} \frac{1}{2}s + \frac{1}{5}h & \leq 35 \ \newline s + h & \leq 100 \ \newline h & \leq 70 \end{align}$

This isn't an easy system to solve as, well, we're not solving anything in the traditional sense by plugging equations into one another. It also isn't a standard maximization problem where we can take some form of a gradient: every equation here is a line, so we would only get constants that give us no new information.

This type of problem, where we want to maximize a linear equation that is also bound by a set of linear constraints, is called a **linear program**. These problems are called programs specifically because, as you'll see, we have analytic ways to find solutions, but finding the *best* solution requires some well-crafted algorithms to make solving them faster and more efficient. Before we can understand the 2, and later the general $n$, variable set up, let's try an easier case.

As with all problems, let's try and cut back on degrees of freedom so we can really focus on what's going on here. Instead of maximizing a function of 2 variables, let's do a function of 1 variable.

$\begin{align} 5x & \leq 50 \ \newline x & \geq 6 \end{align}$

This isn't all that interesting, as I'm sure the answer is popping out almost immediately: $\max(3x)=30$, since we can only have at most 10 of variable $x$ from the first constraint. What will become an important concept later is how we visualize these solutions to linear programs, so let's graph our inequalities on a number line to represent the quantity of variable $x$.

Feasibility region graphed for our 1-variable linear program.

In blue, we have our first constraint showing all values of $x$ that satisfy $5x \leq 50$, and in red we have our second constraint showing all $x$ that satisfy $x \geq 6$. Where those two regions overlap in the purple-ish area is the **feasibility region**, as, well, it's the area where all of our constraints are satisfied and where *feasible* values of $x$ lie. The feasibility region defines a *polytope*, a generalization of the 2D polygon and 3D polyhedron. The key takeaway is that our solution of $x=10$ is an *edge case*: it lies on the furthest possible boundary of our feasibility region.

This makes sense for two reasons. 1) We want to always be looking to maximize or minimize our **objective function** ($3x$, in this case) so if we can add more or less of $x$ to our solution, we want to do that depending on our goal. This is a more intuitive way to show reason 2) our function is linear, so as long as we go in one direction, it will always be increasing or decreasing. Let's add the $y$-axis into this graph and let $y=3x$ to see the value of the function we're maximizing at different values of $x$.

Objective function $y=3x$ plotted with an extra dimension.

Since our objective function is linear, it is directly proportional to $x$, so as $x$ increases or decreases, $y$ also has to only increase or decrease with no chance of weird curves or bends in its function. So if we want to maximize our function, we just want to walk in the direction of $x$ that does so until we hit the edge of our feasibility region. Similarly, if we wanted to minimize it, we'd walk the other way along $x$ that decreases $y$.

With this in mind, we can now go back to our original 2 variable case.

Recall what our original setup was:

$\begin{align} \frac{1}{2}s + \frac{1}{5}h & \leq 35 \ \newline s + h & \leq 100 \ \newline h & \leq 70 \end{align}$

Let's plot our feasibility region like we did before. Only now, it'll be a 2-dimensional region instead of just the number line. We'll put $s$ for shirts on the $x$-axis and $h$ for hats on the $y$-axis.

5 vertices map out our 2D linear program's feasible region, meaning there are 5 possible points that can be our optimal solution.

Just as in the 1-variable case, our solution for maximizing our objective function should lie at the edge of our feasible region. I've marked the feasible region's defining vertices in purple. It's worth pointing out that I have added an additional vertex not defined by our original constraints, and that's the vertex $(0,0)$. It comes from what are aptly called the non-negativity constraints as they, well, constrain our variables to be non-negative (shocking, right). These tend to be a requirement for some minimization cases just so that we don't end up with the problem of running into a pair of negatively infinite numbers as our solution, which obviously doesn't make sense.

Although what I said about our solution being on the edge of the region is true, I can tell you more about our maximum solution: it will (usually) be one of the vertices of our feasible region. If you want a more math-based explanation, check this out, but there's a very intuitive way to think of this, especially with a 2-variable linear program.

Remember how in the 1D case, our objective function $y=3x$ could be represented as a line? We can do a similar idea and add a third $z$-axis to our 2D problem and get $z=15s+10h$. This equation gives us a plane in 3 dimensions as we encode the number of shirts, hats, and the amount of money they profit. Our constraints are also planes, acting as curtain-like walls extending forever vertically in the $z$-axis (as all values of $z$ will satisfy whatever values of $s$ and $h$ that lie on the line). Now, imagine our objective function plane of $z=15s+10h$ to be like a tilted table, and we place a ball on it to roll. What does the ball do? Well, the ball will just roll down the incline to the lowest point gravity will pull it to. Normally, the ball will roll down forever and ever as our table extends infinitely in all directions, but we have our constraining walls to stop the ball. As soon as the ball hits a wall, it'll continue to slide down that wall as long as the ground tilts downward. It'll only stop if another wall pushes it into a corner, which is where our constraints form a vertex. Isn't that neat? The only time this won't be true is if one of our constraints forms a wall that is exactly perpendicular to the direction the ball is rolling, in which case you will have a line segment's worth of solutions (which includes the two vertices at the ends of it). This specific logic applies to finding minimums of our objective function, but you can also think of the reverse for maximums with a tilted ceiling and a balloon instead of a ball. It's basically a streamlined version of gradient descent since the gradient of the plane is always constant.

So if you manually check all 5 vertices, you'll find that a combination of $(50 \textrm{ shirts}, 50 \textrm{ hats})$ gives the optimal profit of \$1250. Not bad.
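The manual check can also be handed to a few lines of code. This is a minimal sketch, not a general solver: it intersects every pair of boundary lines (including the non-negativity constraints), keeps the intersections that satisfy every constraint, and evaluates the profit $15s + 10h$ at each. The helper names are made up for this example.

```python
from fractions import Fraction as F
from itertools import combinations

# Each constraint is (a, b, c), meaning a*s + b*h <= c.
constraints = [
    (F(1, 2), F(1, 5), F(35)),   # cloth
    (F(1), F(1), F(100)),        # printing
    (F(0), F(1), F(70)),         # hat cap
    (F(-1), F(0), F(0)),         # s >= 0
    (F(0), F(-1), F(0)),         # h >= 0
]


def intersection(c1, c2):
    """Point where two boundary lines cross (Cramer's rule), or None if parallel."""
    (a1, b1, r1), (a2, b2, r2) = c1, c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)


def feasible(point):
    s, h = point
    return all(a * s + b * h <= c for a, b, c in constraints)


def profit(point):
    s, h = point
    return 15 * s + 10 * h


vertices = set()
for c1, c2 in combinations(constraints, 2):
    point = intersection(c1, c2)
    if point is not None and feasible(point):
        vertices.add(point)

for s, h in sorted(vertices):
    print(f"({s}, {h}) -> profit {profit((s, h))}")

best = max(vertices, key=profit)
```

This brute-force approach checks every pair of constraints, which is fine for 5 constraints but grows quickly as constraints and variables are added.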

Not to dismiss the wonders of guess-and-check, but it only worked well for us due to the small number of variables and constraints we had. As we tackle more complex and intricate problems, we want to be able to solve linear programs much quicker and more efficiently. There are many algorithms that have been developed for LP optimization with varying motives, but the first one was George Dantzig's **simplex algorithm**.

The simplex algorithm is essentially a systematic way for us to narrow our guess-and-check quickly. We start at some vertex, and then travel along the edges of the feasible region between vertices until we land on the optimal solution. The idea is that by turning our constraints into a matrix, we can use Gaussian elimination to move between possible candidate solutions until no improvement to our objective function can be made. Let's use our original 2-variable problem to try it out.

$\begin{align} \frac{1}{2}x + \frac{1}{5}y & \leq 35 \ \newline x + y & \leq 100 \ \newline y & \leq 70 \end{align}$

To start, we're going to turn these inequalities into equations by adding **slack variables**. To avoid confusing variable names later, I have replaced $s$ with $x$ and $h$ with $y$. The idea is that if the left-hand side of an inequality is anything less than the right-hand side, we should be able to cover that gap by adding an extra variable to make the two sides equal. Rewriting all of our inequalities (including the objective function $z = 15x + 10y$ with a new objective variable $z$) into equations, we get

$\begin{align} \frac{1}{2}x + \frac{1}{5}y + s_1 & = 35 \ \newline x + y + s_2 & = 100 \ \newline y + s_3 & = 70 \ \newline -15x - 10y + z & = 0 \end{align}$

This is a system of linear equations we can encode as an augmented matrix!
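Laid out in the same format as the later tableaus, and assuming the same column order of $x$, $y$, $s_1$, $s_2$, $s_3$, $z$, the starting matrix would look like this (a reconstruction from the constraints and the objective $z = 15x + 10y$):

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{1}{2} & \frac{1}{5} & 1 & 0 & 0 & 0 & 35 \\ 1 & 1 & 0 & 1 & 0 & 0 & 100 \\ 0 & 1 & 0 & 0 & 1 & 0 & 70 \\ \hline -15 & -10 & 0 & 0 & 0 & 1 & 0 \\ \end{array} $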

Here are some things to identify in our **simplex tableau** above. The first row is more a convenience than anything, keeping track of which column corresponds to each variable. The second, third, and fourth rows all correspond to some constraint we have rewritten as equations with slack. The fifth and final row is our objective function which we are trying to maximize. So, by default, we assume we start at the point $(0,0)$ in our feasibility region. Yes, it is a vertex and therefore a candidate solution, but obviously it's not the maximum solution we want, giving us a whopping $z=0$. How do we find the next candidate point? We first want to identify which of our variables would increase $z$ the quickest. Notice that $x$ has a coefficient of $-15$ in the final (objective function) row, while $y$ has only a value of $-10$ (we call these numbers **indicators**). This means for every unit of $x$, we gain an additional \$15, compared with an additional \$10 from a single unit of $y$. Since $x$ increases $z$ faster than $y$ does, let's focus on column 1.

If we're going to increase $x$ to increase $z$, how do we know how much to increase it by? We can use our handy constraints to tell us exactly that. What we do is take each value in the $x$ column and divide its associated row's constraint by that value. So, for the $x$ variable we have

$\begin{align} R_1 &: \frac{35}{1/2} = 70 \ \newline R_2 &: \frac{100}{1} = 100 \ \newline R_3 &: \frac{70}{0} = \textrm{undefined} \end{align}$

Now why is this helpful? These divisions tell us the maximum amount of $x$ we can have according to each constraint (as we assume $y$ can be 0). So, if $\frac{1}{2}x + \frac{1}{5}y \leq 35$, then $x$ can be at most 70 without breaking that constraint. Similarly, $x$ can be at most 100 for the second constraint, and there's no limit for $x$ on the third constraint since it doesn't impact that inequality. So, this tells us we should use the first row as our guiding row! This is because if we used the second row with a maximum of 100, we would be violating our first constraint that $x \leq 70$. So, we call $\frac{1}{2}$ our **pivot** as we use that value to shift our focus from one vertex solution to another. We use our pivot to do Gaussian elimination, and create 0s in that column's other rows to land us at a new vertex.

$\downarrow$

$R_2 - 2R_1$

$R_3 - 0R_1$

$R_4 + 30R_1$

$\downarrow$

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{1}{2} & \frac{1}{5} & 1 & 0 & 0 & 0 & 35 \\ 0 & \frac{3}{5} & -2 & 1 & 0 & 0 & 30 \\ 0 & 1 & 0 & 0 & 1 & 0 & 70 \\ \hline 0 & -4 & 30 & 0 & 0 & 1 & 1050 \\ \end{array} $

This new vertex we have arrived at is exactly the solution you get only focusing on achieving a maximum $x$ at the point $(70,0)$, which is in fact, the solution with the most units of $x$.

The first step in our simplex algorithm visualized as we walk the edge between $(0,0)$ to $(70,0)$. This first step is definitely a better solution than before, but how can we tell if it is the *best* one?

\$1050 is definitely much better than netting \$0, but how do we know if we can make more? Going back to our objective function in the last row of our tableau, there's a $-4$ in the $y$ column, which as before should mean there's a potential to increase profit by adding some $y$ component to our solution. Before we do that though, notice how some of our constraints have changed. $R_2$ used to denote $x+y+s_2=100$, but now it shows $\frac{3}{5}y - 2s_1 + s_2 = 30$, or as an inequality, $\frac{3}{5}y \leq 30$. What caused our constraint to change? It has to do with the fact we added rows together. Let's analyze the two original constraints in question, $\color{red}{R_1}$ (that is, $\frac{1}{2}x + \frac{1}{5}y \leq 35$) and $\color{blue}{R_2}$ (that is, $x + y \leq 100$).

We can then treat these as a system of equations like we did with Gaussian elimination, and perform the same operation as we did before with $R_2 - 2R_1$.

The important part to notice is that we're subtracting *from* the blue region $\color{blue}{R_2}$ and expecting a *positive* result (since both $x$ and $y$ must be greater than 0). This can only happen for our feasible region where the blue area $\color{blue}{R_2}$ is strictly greater than the red area $\color{red}{R_1}$. This appears only when $\color{red}{R_1}$ is *contained* within $\color{blue}{R_2}$. The $y$-value we solved for of $\frac{3}{5}y \leq 30$ is that maximum $y$ value of which that holds true. Anything above $(50,50)$ and suddenly the blue region is contained by the red, but anything below is fair game.

Our constraints haven't actually changed; they have just narrowed in, given the new information about which boundaries we are at with our current candidate solution.

With that justified, we can now move on to repeating our pivoting process like before. In our new tableau, the $y$ column has a value of $-4$ in the indicators, meaning we have a possibility to increase $z$ by increasing $y$. Going through our process all over again to find and use the pivot:

$\begin{align} R_1 &: \frac{35}{1/5} = 175 \ \newline R_2 &: \frac{30}{3/5} = 50 \ \newline R_3 &: \frac{70}{1} = 70 \end{align}$

The smallest quotient is 50, so the $\frac{3}{5}$ in $R_2$ is our new pivot.

With a new pivot found...

$\downarrow$

$3R_1 - R_2$

$3R_3 - 5R_2$

$3R_4 + 20R_2$

$\downarrow$

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline \frac{3}{2} & 0 & 5 & -1 & 0 & 0 & 75 \\ 0 & \frac{3}{5} & -2 & 1 & 0 & 0 & 30 \\ 0 & 0 & 10 & -5 & 3 & 0 & 60 \\ \hline 0 & 0 & 50 & 20 & 0 & 3 & 3750 \\ \end{array} $

There are no more negative numbers in the objective function row's indicators, so this should be our optimal solution! This is because of the actual equation that row represents:

$3z = 3750 - 50s_1 - 20s_2$

Since every variable, including our slack variables $s_1$ and $s_2$, is non-negative, the maximum our objective function can be is $3z=3750$ with both $s_1$ and $s_2$ equal to 0. So, while we may be done computing our solution, we should simplify our tableau to make it more readable. Let's scale the rows so that $x$, $y$, $z$, and $s_3$ (our non-zero variables) have coefficients of 1 so we can easily read off their values.
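Carrying that out by hand means dividing $R_1$ by $\frac{3}{2}$, $R_2$ by $\frac{3}{5}$, and both $R_3$ and $R_4$ by 3. Worked out from the tableau above (so worth double-checking against your own arithmetic), the simplified tableau is:

$ \begin{array}{cccccc|c} {\bf x} & {\bf y} & {\bf s_1} & {\bf s_2} & {\bf s_3} & {\bf z} & \textbf{constraints} \\ \hline 1 & 0 & \frac{10}{3} & -\frac{2}{3} & 0 & 0 & 50 \\ 0 & 1 & -\frac{10}{3} & \frac{5}{3} & 0 & 0 & 50 \\ 0 & 0 & \frac{10}{3} & -\frac{5}{3} & 1 & 0 & 20 \\ \hline 0 & 0 & \frac{50}{3} & \frac{20}{3} & 0 & 1 & 1250 \\ \end{array} $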

Our second step in the simplex algorithm brought us to our optimal solution of $x=50$, $y=50$, $s_1=0$, $s_2=0$, $s_3=20$, and $z=1250$.

The second and final step in the simplex algorithm takes us from our last candidate solution $(70,0)$ to the optimal vertex of $(50,50)$.

Here's a summary of the simplex algorithm to solve linear programs:

1. Rewrite the objective function and constraints into equations with slack variables.
2. Create the initial simplex tableau using the newly written equations.
3. Identify the most negative indicator to find the pivot variable.
4. Calculate quotients to find upper bounds on the pivot variable, and select the smallest quotient; this is the pivot for this iteration.
5. Use Gaussian elimination and row operations to turn all other values in the column to 0 using the pivot.
6. If any negative indicators remain after all row operations, repeat steps 3–5.
7. If no negative indicators remain, we are done and at the optimal solution$^*$!
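The steps above can be sketched in a short function. This is a minimal illustration, not a production solver: it assumes a maximization problem already written as a tableau with the objective in the last row, it does nothing to guard against stalling or cycling, and unlike the hand computation above it divides the pivot row by the pivot so the pivot becomes 1 at every step. The function name is made up for this example.

```python
from fractions import Fraction as F


def simplex(tableau):
    """Run simplex pivots on a tableau whose last row is the objective and whose
    last column holds the constraint values. Returns the final tableau."""
    rows = [[F(value) for value in row] for row in tableau]
    while True:
        objective = rows[-1]
        # Step 3: the most negative indicator picks the pivot column.
        col = min(range(len(objective) - 1), key=lambda j: objective[j])
        if objective[col] >= 0:
            return rows  # Step 7: no negative indicators remain
        # Step 4: the smallest quotient picks the pivot row.
        quotients = [(row[-1] / row[col], i)
                     for i, row in enumerate(rows[:-1]) if row[col] > 0]
        if not quotients:
            raise ValueError("objective is unbounded")
        _, pivot_row = min(quotients)
        # Step 5: scale the pivot row, then zero out the rest of the column.
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [value / pivot for value in rows[pivot_row]]
        for i, row in enumerate(rows):
            if i != pivot_row:
                factor = row[col]
                rows[i] = [v - factor * p for v, p in zip(row, rows[pivot_row])]


# Columns: x, y, s1, s2, s3, z | constraint value
start = [
    [F(1, 2), F(1, 5), 1, 0, 0, 0, 35],
    [1, 1, 0, 1, 0, 0, 100],
    [0, 1, 0, 0, 1, 0, 70],
    [-15, -10, 0, 0, 0, 1, 0],
]
final = simplex(start)
for row in final:
    print([str(value) for value in row])
```

Using `Fraction` keeps values like $\frac{50}{3}$ exact, so the final tableau can be compared entry by entry with the one found by hand.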

While we only looked at 1- and 2-dimensional linear programs, remember that this can work for as many variables as you'd like, which is nice since I don't want to deal with imagining a ball rolling down a 12-dimensional tabletop to find my function minimum. There is a small asterisk though, since sometimes the simplex algorithm can "stall" or "cycle", resulting in no net improvements of the objective function. Fortunately, other algorithms based on other concepts have been built to not just be quicker, but also to avoid the degenerate cases of stalling and cycling.

What the values of $x$, $y$, and $z$ mean should be clear, as those correspond to the point on the feasible region and the maximum of the function, but what do the values of $s_1$, $s_2$, and $s_3$ mean? Recall that these are our **slack variables**: the variables that turned our constraining inequalities into equations by accounting for any *slack* in the constraints themselves. So, if we have 0 slack for one of our constraints, it means that we are using up as much of that constraint as we can; there is no slack to account for, and the corresponding slack variable is 0. If there is slack, it means we are not using up a constraint to its fullest potential. Recall our third constraint was $y \leq 70$, or with slack, $y + s_3 = 70$.

Also remember that our optimal solution included that $y=50$. We set a cap of $y=70$, but we are only using 50 of those possible 70 units. $s_3$ tells us that in the third constraint, we have an excess of 20 unused constraining units. In that same sense, $s_1$ and $s_2$ tell us we have 0 wasted resources for the first and second constraints. This also tells us that we are precisely at the vertex of where the first and second constraints graphically meet, since our solution is on the edge of both inequalities.

That's not the only information our slack variables tell us, though. You can also find how sensitive our result is in the final row of our tableau. The $\frac{50}{3}$ and $\frac{20}{3}$ tell us that for every additional unit we add to the constraints of $R_1$ or $R_2$, we will make an additional \$ $\frac{50}{3}$ or \$ $\frac{20}{3}$ respectively, since $\frac{50}{3}(s_1 + 1) = \frac{50}{3}s_1 + \frac{50}{3}$. These are called the **shadow prices** of our constraints. If you want to read more about shadow prices and other marginal analysis such as **reduced costs**, MIT OpenCourseWare has you covered, but for now there are a few cool extensions of linear programming I want to cover.
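One way to convince yourself of these numbers is to loosen a constraint by one unit and recompute the profit. The sketch below assumes the optimal vertex stays at the crossing of the cloth and printing constraints after such a small change; the function name and its parameters are hypothetical.

```python
from fractions import Fraction as F


def profit_at_corner(cloth=35, prints=100):
    """Profit 15x + 10y at the vertex where the cloth and printing constraints
    are both tight: (1/2)x + (1/5)y = cloth and x + y = prints."""
    cloth, prints = F(cloth), F(prints)
    # Substituting y = prints - x into the cloth equation leaves (3/10)x on the left.
    x = (cloth - prints / 5) / F(3, 10)
    y = prints - x
    return 15 * x + 10 * y


base = profit_at_corner()
extra_cloth = profit_at_corner(cloth=36) - base
extra_print = profit_at_corner(prints=101) - base

print(base, extra_cloth, extra_print)  # 1250 50/3 20/3
```

One more sheet of cloth is worth $\frac{50}{3}$ dollars and one more print is worth $\frac{20}{3}$ dollars, matching the final row of the tableau.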

Even though we have a systematic way to find our ideal solution to a linear program, we could have quickly found some facts about our solution before we started. Again, here is our previous, 2-variable linear program.

$\begin{align} \frac{1}{2}x + \frac{1}{5}y & \leq 35 \ \newline x + y & \leq 100 \ \newline y & \leq 70 \end{align}$

In our 1st constraint, we could have equivalently rewritten it as

$30x + 12y \leq 2100$

by multiplying both sides by 60. Furthermore, since both $x$ and $y$ are non-negative, we can also compare it to our objective function.

$15x + 10y \leq 30x + 12y \leq 2100$

So we now have an upper bound on our objective function: we know for sure it can be no greater than 2100. We can be even smarter about this and do a similar process with our second constraint. By multiplying both sides of the second constraint by 15, we can say

$15x + 10y \leq 15x + 15y \leq 1500$

which gives us an even tighter upper bound on our optimal solution. This is the heart of **duality**: instead of trying to directly solve our maximization problem, we can turn it into an equivalent minimization problem to find the lowest upper bound of our objective function, indirectly solving it.

We can generalize this by multiplying each of our constraints by its own non-negative scalar $a_i$.

$\begin{align} a_1\left(\frac{1}{2}x + \frac{1}{5}y\right) & \leq 35a_1 \ \newline a_2(x + y) & \leq 100a_2 \ \newline a_3 y & \leq 70a_3 \end{align}$

Adding these all together, we get a unifying inequality of

$\left(\frac{1}{2}a_1 + a_2\right)x + \left(\frac{1}{5}a_1 + a_2 + a_3\right)y \leq 35a_1 + 100a_2 + 70a_3$

Lastly, remember this is supposed to be an upper bound on our objective function $\color{red}{15x + 10y}$ (highlighted in red), so the left-hand side has to be at least as large as the objective function. In our first few tries to bound the problem, we had $(a_1, a_2, a_3)$ equal $(60,0,0)$ and $(0,15,0)$ respectively. We can summarize our goal of minimizing the right-hand side $35a_1 + 100a_2 + 70a_3$, while keeping all these inequalities true, as

$\begin{align} \frac{1}{2}a_1 + a_{2} & \geq 15 \ \newline \frac{1}{5}a_1 + a_2 + a_3 & \geq 10 \ \end{align}$

This looks like another linear program! We originally started with a **primal** problem with 2 variables and 3 constraints and used it to formulate its **dual** problem with 3 variables and 2 constraints!
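Here's a small sketch of that dual in action: pick multipliers, check the two dual constraints, and read off the upper bound they certify. I'm assuming the primal objective is $15x + 10y$ (read off from the right-hand sides of the dual constraints) and that the multipliers must be non-negative; `dual_bound` is a hypothetical helper, not part of any library.

```python
from fractions import Fraction as F

def dual_bound(a1, a2, a3):
    """Return the bound 35*a1 + 100*a2 + 70*a3, or None when the
    multipliers fail to dominate the objective 15x + 10y."""
    feasible = (
        min(a1, a2, a3) >= 0
        and F(1, 2) * a1 + a2 >= 15
        and F(1, 5) * a1 + a2 + a3 >= 10
    )
    return 35 * a1 + 100 * a2 + 70 * a3 if feasible else None

for multipliers in [(60, 0, 0), (0, 15, 0), (F(50, 3), F(20, 3), 0), (1, 1, 1)]:
    print(multipliers, dual_bound(*multipliers))
```

The first two tries give the bounds 2100 and 1500 from before, the shadow prices $(\frac{50}{3}, \frac{20}{3}, 0)$ give 1250, and $(1, 1, 1)$ is rejected because it doesn't cover the objective's coefficients.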

But what even is the purpose of this dual formulation? We already have a way to solve for an optimal solution, so why needlessly complicate it with an extra intermediary step? The dual problem is useful as it gives us ready access to a lot of the sensitivity analysis. Each variable $a_i$ we're solving for in the dual problem corresponds to the optimal *shadow price* (marginal utility of a resource) of the $i^\textrm{th}$ constraint in the primal problem.

Let's take a quick look back at our primal problem's solution. Recall that it had a shadow price of \$ $\frac{50}{3}$ for the first constraint of $\frac{1}{2}x + \frac{1}{5}y \leq \color{blue}{35}$, a shadow price of \$ $\frac{20}{3}$ for the second constraint of $x + y \leq \color{blue}{100}$, and a final shadow price of $0$ for the third constraint of $y \leq \color{blue}{70}$. If we value our total resources (highlighted in blue) at their marginal costs, we get that

$\frac{50}{3} \cdot \color{blue}{35} + \frac{20}{3} \cdot \color{blue}{100} + 0 \cdot \color{blue}{70} = 1250$

which is precisely the optimal profit we got originally from the primal problem. The leading principle of a dual problem is that if we can solve for the optimal marginal profits of each resource, then we know implicitly how much of that item we should buy given the total selling price of each primal variable.

We try to minimize the cost of our primal constraints, while trying to ensure our dual constraints satisfy the coefficients of the primal objective function. In the case of shirt and hat selling, we want our marginal profits to at least equal the price we want to sell our products at. Even more interestingly, just like how our dual variables are equal to the primal shadow prices, the dual shadow prices are equal to the primal variables! Everything has been switched around!

Curiosities aside, we still haven't talked about many of the reasons *why* analyzing dual problems is useful. Here's a rundown of some of the benefits of duality:

- **Sometimes it's just easier:** As you just saw, we turned a problem of 2 variables and 3 constraints into one of 3 variables and 2 constraints, and turned it from a question of maximization to one of minimization. For problems with few variables and lots of constraints, it's usually much easier to turn it into a problem with lots of variables and fewer constraints, as every constraint in the problem adds some number of extra vertices for algorithms like the simplex to check, making it less efficient to check every case.
- **Feasibility and boundedness:** Sometimes a linear program will have no solutions whatsoever to check (imagine constraints like $x \geq 1$ and $x \leq 0$; can't be both at the same time). If the primal is unbounded (think no non-negativity constraints), then the dual is infeasible, and likewise if the dual is unbounded, then the primal is infeasible.
- **Specialized algorithms and theorems:** Beyond optimizing business plans, duality has found its way into combinatorics, graph theory, and even into game theory as a means to prove the minimax theorem for zero-sum games. Many more results like these tend to stem from the Weak and Strong Duality Theorems (Weak Duality says any feasible dual solution sets an upper bound on the primal objective, and Strong Duality says the best such bound actually *equals* the optimal primal value).

Duality is a powerful concept not just in linear programming; it frequently pops up in other areas of math, and recognizing when you can represent one problem with another is a great tool to have in your back pocket.

Linear programming is only one small niche of optimization, but the depth and applicability of its simple premises are wildly effective. Starting out in the 1940s with Dantzig's original simplex algorithm, it has grown to affect much of computer science and math by systematically solving otherwise impossibly long computations. Even now, variations of the original linear programming formulation are still being researched, not only with new methods to solve them, but also with new fundamental restrictions added to the problem. If you were a car manufacturer that wanted to know the optimal distribution of models to produce, you can't just make 41237.7963 cars; you would only care about integer solutions, and thus **integer linear programming (ILP)** was born. Linear programming's innate utility in optimization has lent itself kindly to modern applications, and between ILP, the even crazier **mixed-integer linear programming** with a combination of integer and non-integer variables, and duality, LP has found itself touching every corner of math from combinatorics to graph theory as a simple multidimensional geometric encoding of a constraint function.

Today's post is one that's been months in the making. It originally started as one that only covered a single problem, but quickly branched off as I delved deep into dozens of papers and videos, with more and more questions coming up. We're going to be discussing one of the oldest mixed studies of algebra and geometry: *dynamical systems*. Today's post is going to be a long one, but should have a fair number of fun visuals to keep it worth the scroll. This is really two, maybe three posts in one, so I recommend reading this with breaks at each header to make it less overwhelming.

To begin, let's look at a type of problem you might have *experienced* before, rather than have read formally: **billiard problems**.

*For the best experience, avoid reading this in Safari; most other browsers should work and load the visuals correctly, but Safari breaks rendering a few of them.*

If you have ever played pool, or even laser tag, you might already be familiar with *billiard problems*. If you have a billiard ball you want to land in a pocket, but there are other balls in your way, where on the side of the pool table do you want to bounce your ball to land in the pocket? Alternatively, if you have a laser gun and an opponent by a mirror, where do you want to aim your laser to hit your opponent? To reduce this problem further (and for future reference), if you have a laser, a target, and a wall, where do you want to aim the laser to hit the target?

Can you make the light hit the target? The light automatically follows your cursor, but you can press the *1* key on your keyboard to lock the angle. Try dragging the points for different problem setups.

We covered how to solve this exact problem in a previous post, which also shows how light reflects and bounces (which is what we'll be using today). To recap that post:

- Our laser/billiards bounces must follow the **Law of Reflection**: the angle the laser strikes the mirror is the same angle it reflects.
- To solve for where to bounce the light off the wall, we create a "mirror world" to find the reflection point.

The idea is to reflect our target over the wall, draw a straight line from our laser to the reflected target, and the intersection point is the point of reflection. This seems arbitrary, but there's a good reason for it: the angle that our straight line creates in the "mirror world" is equal to the incident angle, and therefore ensures the angle the light bounces off at in the normal world is equal. I recommend following through with the previous post for more on this, but the following demonstration should suffice.

We reflect our target over the wall, and draw a line between the light and that reflection. Can you see why this finds us our desired reflection point?
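The mirror-world trick is short enough to write out. This is a minimal sketch under my own simplifying assumption that the wall lies along the x-axis and both points sit above it; `reflection_point` is a made-up name.

```python
def reflection_point(laser, target):
    """Where a wall along the x-axis must be hit so light from `laser` reaches `target`.
    Both points are (x, y) tuples with y > 0, i.e. on the same side of the wall."""
    lx, ly = laser
    tx, ty = target
    mirrored = (tx, -ty)                 # the target reflected into the "mirror world"
    # Walk along the straight line laser -> mirrored target until it reaches y = 0.
    t = ly / (ly + ty)
    return (lx + t * (mirrored[0] - lx), 0.0)

print(reflection_point((0.0, 1.0), (4.0, 3.0)))  # (1.0, 0.0)
```

In that example the laser drops 1 unit while moving 1 unit sideways, and the bounce climbs 3 units while moving 3 sideways, so both sides make the same angle with the wall.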

Simple enough, but this reflection technique is an invaluable tool for solving these types of problems, so put a pin in that for later. But now, let's look at a problem that throws that very easy technique out of the window.

Dynamical systems have been studied since forever at this point. One of the oldest (hard) billiard problems comes from Ptolemy in 150 AD:

**If you have a candle and a circular mirror, where on the wall do you have to aim to hit a target?**

This problem plagued the Greeks for centuries, primarily since they tried to solve it with their typical ruler-and-compass constructions. Dr. Peter Neumann proved that this is an impossible task, but other solutions and proofs have arisen over the years from some of the most famous mathematicians, including Huygens and l'Hôpital. The problem was named after Abu Ali al Hassan ibn al Hassan ibn Alhaitham, later retroactively given the mononym Alhazen, who discussed this in his book *Optics* in the early 11th century.

Now, with the same light and target, can you hit the target? Drag and drop points for different problem setups; if you want to lock the light's direction, hit the *2* key.

First, let's clear some details up.

- To make sense of a curved mirror, the reflection acts according to the *tangent* line at the point of reflection. So, you can think of the curved mirror as made up of infinite, infinitesimal straight-walled mirrors.
- The light is a pure **ray** like a laser. No point source or beams to get cheap answers like, "If I stand *about* here, some of the light will reflect on the target." This is a laser beam that can only blind us if it directly hits us.
- In a similar manner to the light assumption, the target has been reduced to a 0-dimensional **point**. The light must hit the target *exactly* in our solution.
- And, importantly, we assume a solution exists. If the light and target are on opposite sides of the circle, clearly no reflection will make it. So, to simplify, we'll just assume that they are in positions with a possible reflection point.
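The first assumption, bouncing off the tangent line, is easy to state in code. Here's a minimal sketch for the unit circle, where the normal at a point on the circle is just the point itself; the function name is hypothetical, and the formula is the usual "flip the normal component" reflection $d - 2(d \cdot n)n$.

```python
def reflect_off_circle(direction, hit_point):
    """Reflect a 2D direction off the unit circle at `hit_point`.
    Assumes hit_point = (x, y) satisfies x^2 + y^2 = 1, so it doubles as the unit normal."""
    dx, dy = direction
    nx, ny = hit_point
    dot = dx * nx + dy * ny              # component of the direction along the normal
    return (dx - 2 * dot * nx, dy - 2 * dot * ny)

print(reflect_off_circle((-1.0, -1.0), (1.0, 0.0)))  # (1.0, -1.0)
```

At the point $(1, 0)$ the tangent line is vertical, so the horizontal part of the ray flips while the vertical part carries on, exactly like bouncing off a straight wall.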

Before I show you my solution to the problem, I suggest you try this problem for yourself. This is one of the few, very simple-to-state problems that has caused me a lot of trouble deciphering a clean solution for it. Mathematicians have developed quite the dictionary to solve this one problem which we'll discuss a bit at the end, but only one way has appeared the most elegant (in the most stretched definition) to me.

The way I solved this problem was with the magnificence of complex numbers. Let's center our circular mirror as the unit circle $Ø$ at the origin $O$, and let's call the light and target $\color{red}{a}$ and $\color{green}{b}$ (if your circular mirror is not of radius 1, just scale all coordinates appropriately to make it so). We want to solve for the point $\color{purple}{z}$ on the mirror such that $\angle azO = \angle Ozb$.

The path our light takes can be modelled in two segments: from the candle $a$ to the mirror $z$ as $a-z$, and then from the mirror to the target as $b-z$ (you'll see why we pick these directions later). To make our lives easier, let's rotate our whole setup so that the mirror reflection point occurs at $z=1+0i$ by dividing our whole setup by $z$. So right now, we have two vectors representing our reflected ray of light as $\frac{a-z}{z}$ and $\frac{b-z}{z}$ (if you're not completely familiar with the geometry of complex numbers and why this division tactic works, here's a good introductory video).

Here we have the light at $\color{red}{a}$ and the target at $\color{green}{b}$ at some arbitrary spots, with a theoretical solution at $\color{purple}{z}$. Since we don't want to work with some random tilted axis, we divide everything by $\color{purple}{z}$ so that our setup is centered around the real axis with $\color{purple}{z} = 1$.

Since our vectors are now symmetrical about the real axis, we know that $\arg(\frac{a-z}{z}) = -\arg(\frac{b-z}{z})$, as the angle of reflection must equal the angle of incidence (see this previous post for a proof).
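Python's built-in complex numbers make this rotation easy to poke at. In this sketch I build a hypothetical setup backwards: pick a reflection point `z` on the unit circle, then place the light and target so that `z` really is a valid bounce, and check that the two rotated vectors have opposite arguments.

```python
import cmath

z = cmath.exp(1j * 0.7)      # hypothetical reflection point on the unit circle
a = z * (2 + 1j)             # light, placed so that z is a genuine bounce point
b = z * (2 - 1j)             # target, the mirror image of a across the line through O and z

incoming = (a - z) / z       # dividing by z rotates the picture so z sits at 1
outgoing = (b - z) / z

print(cmath.phase(incoming), cmath.phase(outgoing))   # opposite angles, about +/- pi/4
print((incoming * outgoing).imag)                     # essentially 0: the product is real
```

The angle 0.7 is arbitrary; any other choice just rotates the whole picture, and dividing by `z` undoes that rotation.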

Ok, so why is any of this helpful? Just as dividing two complex numbers subtracts their angles, multiplying them adds their angles together. If we multiply $(\frac{a-z}{z})(\frac{b-z}{z})$, we get that their angles sum to 0 since their arguments are opposite! If the argument of the product is 0, that means that it lies on the real axis, and therefore is a real number! We can then extract that

$\large{\operatorname{Im}\left(\frac{(a-z)(b-z)}{z^2}\right) = 0}$

where $\operatorname{Im}(c)$ denotes the imaginary part of a complex number $c$. Noting that $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$, we can further simplify this equation: since $\frac{1}{z} = \overline{z}$, the product expands to $ab\overline{z}^2 - (a+b)\overline{z} + 1$, and the $1$ contributes nothing to the imaginary part.

$\large{\operatorname{Im}((ab)\overline{z}^2) = \operatorname{Im}((a+b)\overline{z})}$

This might not seem like much, but this describes all possible points $\color{purple}{z}$ our reflection point can lie on! Let $\color{purple}{z} = x + yi$, $\color{red}{a} \color{green}{b} = p + qi$, and $\color{red}{a} + \color{green}{b} = r + si$, and we can rewrite our equation in terms of Cartesian coordinates $(x,y)$.

$\large{q(x^2 - y^2) - 2pxy = sx - ry}$

This is an equation for a hyperbola! And since we want $\color{purple}{z}$ to lie on the circumference of our circular mirror, the point $(x,y)$ must also be a solution to $x^2+y^2 = 1$ so that it lies on the unit circle as well. To find where our desired $\color{purple}{z}$ is, we just need to find the intersection between this hyperbola and the unit circle.
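If you don't trust the algebra that turned the complex condition into the Cartesian one, a quick numerical check helps. The sketch below evaluates both forms at a couple of arbitrary points for a hypothetical light and target of my choosing; the two functions should always agree, whether or not the point is an actual solution.

```python
def complex_form(a, b, z):
    """Im(ab * conj(z)^2) - Im((a + b) * conj(z))."""
    zc = z.conjugate()
    return (a * b * zc * zc).imag - ((a + b) * zc).imag

def cartesian_form(a, b, x, y):
    """q(x^2 - y^2) - 2pxy - (sx - ry), where ab = p + qi and a + b = r + si."""
    p, q = (a * b).real, (a * b).imag
    r, s = (a + b).real, (a + b).imag
    return q * (x * x - y * y) - 2 * p * x * y - (s * x - r * y)

a, b = 3 + 1j, 1 + 2j                      # hypothetical light and target
for z in (0.6 + 0.8j, 0.3 - 1.2j, -2 + 0.5j):
    print(complex_form(a, b, z), cartesian_form(a, b, z.real, z.imag))
```

Both columns match for every test point, which is all the Cartesian rewrite claims; a point is a candidate reflection point only when that shared value is 0 and $x^2 + y^2 = 1$.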

Given our previous light and target positions, we get this specific hyperbola which intersects our mirror in 4 locations, giving 4 possible reflection points. How can we compute their coordinates, and find the correct point?

It's no coincidence that this problem involves finding the intersection between a circle and a hyperbola. While yes, all of our complex number algebra does the job, there's a purely geometrical way of coming to the same conclusion involving isogonal conjugates. I wasn't very familiar with them, so I decided to present the complex number approach instead.

You could try and solve this system of equations, but the nature of the conic sections makes it a pretty tedious and gross task. Fortunately, we can actually reduce this system of 2 simultaneous equations to a single polynomial! Going back to one of our previous equations:

$\large{\operatorname{Im}\left(\frac{(a-z)(b-z)}{z^2}\right) = 0}$

The important thing to note is that we have a complex number whose imaginary component is equal to 0. This means that this expression is equal to its conjugate, since there is no imaginary component to flip the sign of: $x + 0i = x - 0i = x$.

$\large{\frac{(a-z)(b-z)}{z^2} = \frac{(\overline{a}-\overline{z})(\overline{b}-\overline{z})}{\overline{z}^2}}$

Again noting that $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$, we can substitute $\overline{\color{purple}{z}} = \frac{1}{\color{purple}{z}}$ and multiply both sides by $\color{purple}{z}^2$ to get that

$\large{\overline{ab}z^4 - \overline{(a+b)}z^3 + (a+b)z - ab = 0}$

All we need are the complex solutions $\color{purple}{z}$, and since this is a quartic equation, we *technically* have a closed form solution. Once we have the coordinates of our reflection points, all we do is graph $(\operatorname{Re}(\color{purple}{z}), \operatorname{Im}(\color{purple}{z}))$, and we are done.
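The closed form for a quartic is not something I'd wish on anyone, so in practice you'd find the roots numerically. Below is a rough sketch that assumes the quartic works out to $\overline{ab}z^4 - \overline{(a+b)}z^3 + (a+b)z - ab = 0$ (my own expansion of the conjugate condition) and solves it with the Durand-Kerner iteration, a generic root finder I'm swapping in here rather than anything specific to this problem. The light and target are hypothetical, and the starting guesses are arbitrary.

```python
def quartic_coeffs(a, b):
    """Coefficients (highest power first) of conj(ab) z^4 - conj(a+b) z^3 + (a+b) z - ab."""
    ab, s = a * b, a + b
    return [ab.conjugate(), -s.conjugate(), 0j, s, -ab]

def poly_eval(coeffs, z):
    """Evaluate the polynomial at z with Horner's rule."""
    value = 0j
    for c in coeffs:
        value = value * z + c
    return value

def durand_kerner(coeffs, iterations=500):
    """Approximate all roots at once; assumes the leading coefficient is nonzero."""
    monic = [c / coeffs[0] for c in coeffs]
    degree = len(monic) - 1
    roots = [complex(0.4, 0.9) ** k for k in range(degree)]  # arbitrary distinct guesses
    for _ in range(iterations):
        updated = []
        for i, r in enumerate(roots):
            denom = 1 + 0j
            for j, other in enumerate(roots):
                if j != i:
                    denom *= r - other
            updated.append(r - poly_eval(monic, r) / denom)
        roots = updated
    return roots

a, b = complex(2, 1), complex(2, -1)   # hypothetical light and target outside the mirror
for z in durand_kerner(quartic_coeffs(a, b)):
    print(f"z = {z:.4f}, |z| = {abs(z):.4f}")
```

For this symmetric setup the quartic is $5z^4 - 4z^3 + 4z - 5 = (z^2 - 1)(5z^2 - 4z + 5)$, so the roots can be checked by hand: $\pm 1$ and $0.4 \pm i\sqrt{0.84}$, all of modulus 1.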

Our complex quartic generates 4 possible solutions for $\color{purple}{z}$, all on the unit circle.

If these points look familiar, they should: they are precisely the points our hyperbola predicted before!

While not ideal to compute, our hyperbola did in fact find the same potential solutions.

Also, notice how we only used information about where the supposed solution $\color{purple}{z}$ lies to find our quartic; we never specified any conditions for where the light $\color{red}{a}$ or target $\color{green}{b}$ had to be! This means we can have our light and target on the *inside* of our circular mirror and find points that satisfy the Law of Reflection.

A valid bounce inside a circular mirror.

The best part is, all 4 of the possible solutions our quartic and hyperbola find are valid! No need to worry about the laser clipping through the mirror randomly; it all works out.

While we solved the problem, there are a few details to address that are not completely obvious about this approach using complex numbers.

Why are there 4 "solutions" according to our polynomial? Being a quartic equation, 4 complex solutions isn't unexpected, but they don't seem to have any physical significance for our bouncing laser. Obviously, only one looks like it can reflect our points correctly; how can *this* be considered a viable spot to aim your laser?

A supposed "solution" our quartic generates, despite the fact the laser has to phase through the mirror on its way to the target.

No mirror bounces like that, let alone allow the laser to move straight through it. Moreover, our Law of Reflection looks completely broken, too. What's happening here?

It lies in the direction of our vectors. Watch what happens as I extend the line segment from the target to the "solution".

If we extend the ray from the target to the "solution", we can get a "mirrored target", kind of like what we did for the straight wall case.

If we extend the ray, then it's clear that the Law of Reflection is satisfied, and this is true for the 3 other supposed "solutions": every "bad" reflection point is correct if the rays are extended far enough. So the 4 "solutions" correspond with how our vectors are lined up, since if we change the direction of our light bouncing we can get different points where the Law of Reflection is satisfied.

That doesn't answer, though, how we know which point to pick as the "correct" reflection point. If you read the previous post on retroreflectors, then you know light takes the fastest and (only in this case) shortest path. So, we just pick the point where the total distance of the light's path is minimized: $\min(|\color{purple}{z} - \color{red}{a}| + |\color{purple}{z} - \color{green}{b}|)$
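Picking the winner is a one-liner once you have the candidates. In this sketch the light and target are a hypothetical symmetric pair, and the four candidate points are ones I worked out by hand for that pair (each satisfies $\operatorname{Im}(ab\overline{z}^2) = \operatorname{Im}((a+b)\overline{z})$ on the unit circle); `best_reflection_point` is a made-up name.

```python
import math

def best_reflection_point(a, b, candidates):
    """Pick the candidate z that minimizes the path length |z - a| + |z - b|."""
    return min(candidates, key=lambda z: abs(z - a) + abs(z - b))

a, b = 2 + 1j, 2 - 1j                      # hypothetical light and target
h = math.sqrt(0.84)
candidates = [1 + 0j, -1 + 0j, 0.4 + h * 1j, 0.4 - h * 1j]

z = best_reflection_point(a, b, candidates)
print(z, abs(z - a) + abs(z - b))          # z = 1, with path length 2 * sqrt(2)
```

That matches intuition: with the light and target mirrored across the real axis, the nearest point of the circle on that axis is where the bounce should happen.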

Some of you might be wondering why this should produce *any* solutions on the unit circle, let alone 4 at that. That is in part due to the property we have mentioned a few times: $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$. If the geometry of this isn't obvious to you with the rotations and scalings, we can turn this into Cartesian coordinates: setting $\color{purple}{z}=x+yi$, we get that

$\large{x^2 + y^2 = 1}$

Which is precisely the equation for the unit circle. The property that $\color{purple}{z} \cdot \overline{\color{purple}{z}} = 1$ forces $z$ to be on the unit circle.

…for the most part. The proof I've highlighted is adapted from this paper. As it shows, if you have the light at $\color{red}{a}=.5+.5i$ and a target at $\color{green}{b}=.5+0i$, two of the supposed solutions for $\color{purple}{z}$ are completely off the unit circle.

When $\color{red}{a}=.5+.5i$ and $\color{green}{b}=.5+0i$, one supposed reflection point is on the inside of the circle, and the other is so far outside of the circle it's offscreen.

I'm not totally sure why this happens, but the previously linked paper proves that at least two of the generated solutions must be on the unit circle.

If you want to look at this problem more, there is also an algebraic solution here, there are ideas involving tangent ellipses in this paper, and there's even an approach discussed in Dörrie's *100 Great Problems of Elementary Mathematics*. These would be my recommended starting points. For even more depth, this is a solution involving origami (yes, the paper folding) and here's the same problem in hyperbolic space.

Now that we've seen where billiard problems started, let's see how far they've come with the main focus of today's post.

You and an assassin are trapped in a square room. With a single bullet, the assassin wants to do everything he can to take you out without wasting his shot. You, however, came prepared and hired a bodyguard to prevent direct line of sight between you and the assassin. But remember, you're trapped in a room. Without hesitation, the assassin flicks a shot to the side and ricochets off the wall and grazes your arm, avoiding the bodyguard completely. You might have been lucky this time, but who knows what happens next.

**If you (target) and an assassin are placed in a square room, can you hire a finite number of bodyguards to prevent any shot from hitting you (including ricochets)?**

At first glance, this might seem absurd. There are an infinite number of ways for the assassin to line up and bounce his shot, so how can anything less than an infinite number of bodyguards suffice? As one might anticipate, this wouldn't be a blog post if it didn't have an incredible answer.

Can you hit the target with the assassin's shot despite the bodyguards? Drag the assassin and target points to move them, and press the *3* key to lock the angle.

It's no coincidence I placed the bodyguards where I did in the above widget; not only does a finite number of bodyguards make do, you can prove you only need to hire **16** to ensure 100% protection!

I first found this problem through Tai-Danae Bradley's video and post, where she writes up the proof very well on her own. It was this problem that inspired me to look for other problems to extend this post, and moreover it was a fun programming challenge. Here, I want to outline the proof with the key insights Bradley utilizes, as well as pose a few other questions of my own.

Just as we did before, let's clarify some problem details:

- Just like before, the assassin's bullet is a pure **ray**, and the target has been reduced to a 0-dimensional **point**.
- Now, though, we have bodyguards, which are 0-dimensional points as well; if they are going to protect the target, they have to fully take the hit.
- Lastly, just to make it clear, this is a perfectly square room and the bullet ricochets at exact angles off the walls, so the assassin's shot can bounce forever if needed to hit its mark.

Let's look at an easy case: what if the assassin can only reflect off the left wall? Since this is a square room with straight walls, we can use our reflection trick from before. Now, though, we have to analyze the target's position in space *relative* to the room.

By reflecting the room, we create a "mirrored world" to track our reflection.

I've color-coded the left, right, top, and bottom walls to be yellow, gray, magenta, and cyan respectively. The reason is that if we want to track the target's location within the room even after the reflection, we have to reflect the room itself too, creating the actual "mirror world" that I referred to earlier.

Ok, so that isn't too different than what we've already been doing, so what's the point? The magic lies in modelling *multiple* bounces. Remember, we reflected over the yellow wall to say we wanted our bounce to be off that wall, but we can chain these reflections to give multiple instructions to our bounces. If we first reflect over the yellow wall and *then* the magenta wall, we create a doubly mirrored world with our straight line showing the path of what 2 bounces looks like.

Reflecting over a second wall gives us another straight-line intersection to find our first and therefore second reflection point too.

If you want to convince yourself this trick works for multiple bounces, I recommend finding the congruent angles within the mirrored world's straight line and the actual bounces within the room in question.

An important part of finding this mirrored world's straight line though is the fact that it intersects the colored walls in the order the assassin's bullet bounces. In the initial setup provided, the straight line hits the yellow wall first before intersecting the magenta wall, just as the beam's bounce path reflects off the yellow then magenta walls *in that order*. If you move the assassin and target, you can see this idea holds for a magenta then yellow wall bounce too.

So, if we wanted to model the assassin's hitting the target in more bounces, we just reflect our room more times and draw the straight line between the assassin and mirrored target.

Even with many more reflections of the room, our bullet still bounces off the colored walls in the order our straight line intersects them. Try dragging both the assassin, target, and mirrored target to see how the paths change. Note these are only paths that result in the assassin successfully hitting the target.

Moreover, since squares can tile the plane and maintain the same "silhouette" under reflection, we can infinitely tile the plane with reflected copies of our room. Since a line through this plane can represent any bounce shot from the assassin in the original room, we have successfully simplified our problem setup. Why? With straight lines, we can now use coordinate geometry to place our bodyguards and not worry about annoying reflected light patterns within our square.
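The tiling is simple to express in coordinates. This sketch assumes the room is the unit square $[0,1] \times [0,1]$ and indexes each copy of the room by how many times it was reflected horizontally (`i`) and vertically (`j`); all three function names are my own.

```python
def mirrored_target(target, i, j):
    """Copy of `target` in the room reached by i horizontal and j vertical reflections.
    The original room is the unit square; i and j may be negative."""
    x, y = target
    mx = i + (x if i % 2 == 0 else 1 - x)   # odd copies are flipped left-to-right
    my = j + (y if j % 2 == 0 else 1 - y)   # odd copies are flipped top-to-bottom
    return (mx, my)

def fold(u):
    """Map a coordinate in the tiled plane back into the original room."""
    u = u % 2
    return 2 - u if u > 1 else u

def point_in_room(assassin, mirrored, t):
    """Point a fraction t of the way along the straight shot, folded back into the room."""
    px = assassin[0] + t * (mirrored[0] - assassin[0])
    py = assassin[1] + t * (mirrored[1] - assassin[1])
    return (fold(px), fold(py))

assassin, target = (0.25, 0.25), (0.75, 0.5)
copy = mirrored_target(target, -1, 1)       # one bounce off the left wall, one off the top wall
print(copy)                                 # (-0.75, 1.5)
print(point_in_room(assassin, copy, 1.0))   # (0.75, 0.5), the real target again
```

Sweeping `t` from 0 to 1 traces the actual bouncing path inside the room, even though the computation only ever follows one straight line.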

In more math-y terms, we have turned our original room into what is known as a **flat torus** (yes, the thing that's equal to a coffee cup). Essentially, all this means is that our problem sort of exists in a world similar to the game of *Asteroids*: as you exit the top or left of one flat torus, you enter through the bottom or right of another one (and vice versa). This fact is what allows us to tile the plane consistently with our problem setup. If you look back at the 2-by-2 grid setup from before modelling the 2-bounce paths, you'll see that our top/bottom edges are both cyan and our left/right edges are both gray, showing that exact relationship we'd expect in a flat torus.

Connecting opposite edges of a square turns it into the equivalent of a torus.

This is the first key insight to solving this problem: turning our bouncing shots into straight lines in an infinitely tiled plane. Working with straight lines makes life so much easier than bent ones. With that, we can move on to the second epiphany to prove our result.

Even though we tile the plane infinitely, there aren't infinitely many types of rooms. Just looking at the 2-by-2 grid that makes our flat torus shows us everything we need to know: there are exactly 4 types of rooms that build our tiling: the original one, the one reflected over the yellow wall, the one reflected over the magenta wall, and the one reflected over both walls. This regularity is clear visually: watch what happens if I reflect the target into every mirrored room.

Having 4 "unique" rooms generates 4 unique lattices of mirrored targets.

Each one of the reflected rooms generates a lattice of that reflected target! I've colored the 4 different lattices in green, yellow, magenta, and cyan. Now, since each dot represents a way of hitting our target, we just need to block every line from the assassin to any one of these colored dots.

This is the second critical idea to finish out this proof: every reflected target falls into 1 of 4 possible lattices (each represented by a color). Dividing the mirrored targets into lattices is nice since it places all dots in a given lattice to be the same distance away from each other.

At this point, you have everything you need to finish this proof using the flat torus tiling and the 4 lattices. If you want to try and finish it through, I recommend doing so as it has some pretty satisfying reasoning throughout it. If you just want to keep reading, I'd recommend visiting Bradley's post where she completes the proof there.

Once you reach the end of the proof, you'll find that you need exactly 4 bodyguards to protect any given lattice, and since there are 4 lattices, we need $4 \cdot 4 = 16$ bodyguards total to completely protect the target.
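If you'd like to see the 16 positions without working through the whole proof, here's a sketch. It assumes the midpoint construction that I understand the standard proof to use: every straight shot from the assassin to a mirrored target passes through that segment's midpoint, so we station a guard wherever a midpoint lands once folded back into the room. The room is taken to be the unit square, and `bodyguards` and `fold` are names I made up.

```python
from fractions import Fraction as F
from itertools import product

def fold(u):
    """Map a coordinate in the tiled plane back into the unit-square room."""
    u = u % 2
    return 2 - u if u > 1 else u

def bodyguards(assassin, target):
    """Folded midpoints between the assassin and one mirrored target per lattice/parity class."""
    ax, ay = assassin
    tx, ty = target
    guards = set()
    # Mirrored targets look like (2m +/- tx, 2n +/- ty); only the parity of m and n matters.
    for sx, sy, m, n in product((1, -1), (1, -1), (0, 1), (0, 1)):
        mirrored = (2 * m + sx * tx, 2 * n + sy * ty)
        midpoint = ((ax + mirrored[0]) / 2, (ay + mirrored[1]) / 2)
        guards.add((fold(midpoint[0]), fold(midpoint[1])))
    return guards

assassin = (F(1, 5), F(3, 10))     # hypothetical positions, exact fractions to avoid rounding
target = (F(7, 10), F(4, 5))
guards = bodyguards(assassin, target)
print(len(guards))                 # 16
```

Exact fractions matter here: with floats, two guards that should be the same point might differ in the last digit and get counted twice.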

No matter where the assassin shoots, the target remains safe and sound. Try moving either of them around, and watch the bodyguards adapt and reduce the assassin's efforts to nil.

Since it is possible to protect the target from the assassin with a finite number of bodyguards, we can say that the square is a **secure polygon**.

This is one of the most surprising facts I've come across in a long time. But, there's more places to take these billiard problems and dynamical systems. One of the first extensions I thought of upon seeing this was other grids. As we know, there are also hexagonal and triangular grids in addition to the square one we analyzed today.

Examples of triangular (left) and hexagonal (right) grids. Are they secure polygons?

Not to mention, no other regular polygon tiles the plane, so modelling their bouncing paths will be even more difficult. What about non-regular polygons? Or concave ones? This is definitely something I'll revisit in the future to try to find the conditions for a polygon to be secure, but until then, we have just scraped the surface of billiard problems and dynamical systems.

Here are a few other related problems to consider. While we only looked at rays for light sources, others have considered other types of light sources. In an Illumination Problem, we consider point sources (i.e. a light source that produces light in every direction instead of one). Actually, we've already looked at one type of Illumination Problem: the secure square! It can be rephrased as the following: if a light bulb is placed in a mirrored room, is it possible to place a finite number of pillars such that a given spot is never illuminated? It's identical to our secure polygon question. One of my favorite Illumination Problems is the Art Gallery Problem, which is not only a readily applicable problem, but also has a wonderfully elegant proof that Steve Fisk conjured (it speaks volumes how nice this proof is for it to be in Martin Aigner's

Even outside of classic billiard problems, just knowing the simple reflection technique to model bounces is invaluable. Grant Sanderson of 3Blue1Brown fame used bouncing light as an analogy to solve a kinematics problem and bring in circles almost magically. I've said it before, and I'll say it again: duality and different perspectives are some of the most powerful problem solving tools you can have. This small reflection technique, or the complex number algebra with Alhazen's problem, might not mean much to you now, but it's another tool to stow away in your back pocket. Even though we only saw this ability to turn dynamics into geometry a couple of times today, I've seen it enough to know that these techniques and ideas are more than just an intriguing fact. You never know when you might be able to use such a tool, but when you do, who knows the new worlds that a new paradigm can unlock for you.

If you're interested in learning a bit behind today's graphics and widgets, see the follow up I wrote up detailing some seemingly innocuous math with some high-budget applications and cool patterns.

Last post, we looked at different types of billiard problems, a class of math problems analyzing how light bounces with different setups of mirrors. Notably, we saw how straight lines make for very simple, easy to compute mirrors, while others like circular ones, can be incredibly frustrating.

A large portion of last post's content, though, was made up of interactive graphics. While I went over much of the math that goes into *solving* these types of problems, we skipped over a large part of the math that goes into *simulating* them. Math is very nice in that many problems can be solved with nothing more than a pen, paper, and your mind, but oftentimes, that's only helpful if you are confident in how to approach the problem. What computers can do is help build our intuition to solve a problem by calculating, drawing, and modelling scenarios with precision and speed we can only wish to achieve.

So, today, we'll look at some of the clever math that goes into computer graphics (that we'll later extend), and to introduce such a topic, we'll look at a simple, fundamental problem in graphics: how do you find the intersection between a line and a circle?

Before we can even attempt this problem, we're going to have to start from scratch, since we have one *slight* issue: a computer has no idea what a line or a circle is! So before we can do anything, let's teach our computer how to draw a line.

At its core, computer graphics is displaying a set of pixels with certain colors. If we want to visualize anything on a computer screen, we just need to find all the relevant pixels (coordinates) to light up and color. Because we want to compute these individual coordinates of, say, a line or circle very quickly and easily, we will almost always use **vectors**. These can be the typical column or row vectors you see in linear algebra, or they can even take the form of complex numbers. The reason these tend to be helpful is that they give very easy ways to compute coordinates for lines, circles, and other shapes.

If we want to draw a line with slope, say 2, we need to ensure that it is constructed by a vector of slope 2. An easy one to find is the vector $v=\small{\begin{bmatrix} 1 \\ 2 \end{bmatrix}}$ since we know that will pass through the point $(1,2)$. So, to get other points beyond this vector, we can scale $v$ by a factor of $t$ to get other vectors (i.e. points) with the same slope. If $t=2$, we get the point $(2,4)$. If $t=1.5$, we get the point $(1.5,3)$. If $t=239470$, we get the point $(239470,478940)$. Whatever you choose $t$ to be, our vector $v$ will give us a point on the line $y=2x$.

However, this isn't super helpful, since we are still only restricted to lines that go through the origin at $(0,0)$. So, we can add a starting point $\color{red}{p}$ to our vector equation to offset the line by $\color{red}{p}$, guaranteeing our line goes through the point $\color{red}{p}$ (since that's the coordinate generated by $t=0$).

$l(t) = \color{red}{p} + tv$

Now we just plot every point for $t \in (-\infty, \infty)$, and we get a line with $v$ dictating the slope of our line (negative $t$ values give us coordinates *behind* $\color{red}{p}$)!

Our parametric line $l$ going through point $\color{red}{p}$. Drag the point to adjust its position.

We can do a similar process for a circle. To parameterize a circle, we'll have to pull from trigonometry. We know that a circle is defined by $x^2 + y^2 = r^2$. The Pythagorean identity tells us that $\cos^2(\theta) + \sin^2(\theta) = 1$, so we can quickly make the connection that $x=r\cos(\theta)$ and $y=r\sin(\theta)$ (which the geometry justifies). This precisely defines $x$ and $y$ in terms of the parameter $\theta$! Again, though, this is centered at the origin, so we can center the circle around a point $\color{blue}{q}$ by adding it to our parameterization.

$\small{\begin{bmatrix} x \\ y \end{bmatrix}} = \color{blue}{q} + r\small{\begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}}$

where $r$ is some real number for the radius of the circle, and $\theta \in [0, 2\pi)$. We can now easily draw both lines and circles!

Now we also have a circle centered at $\color{blue}{q}$ too. Drag the center point to change its position, and the radial point its radius.
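
To make the two parameterizations concrete, here is a minimal Python sketch that generates points on a line and on a circle. The function names `line_point` and `circle_point` are made up for illustration. This is not the code behind the widgets above, just one way to turn $\color{red}{p} + tv$ and $\color{blue}{q} + r(\cos\theta, \sin\theta)$ into coordinates.

```python
import math

def line_point(p, v, t):
    """Point p + t*v on the line through p with direction v."""
    return (p[0] + t * v[0], p[1] + t * v[1])

def circle_point(q, r, theta):
    """Point q + r*(cos(theta), sin(theta)) on the circle centered at q."""
    return (q[0] + r * math.cos(theta), q[1] + r * math.sin(theta))

# Sample a handful of points on each shape (a real renderer would use many more).
line_samples = [line_point((1.0, 1.0), (1.0, 2.0), t) for t in (-1, 0, 0.5, 2)]
circle_samples = [circle_point((3.0, 0.0), 2.0, k * math.pi / 4) for k in range(8)]
```

Every sampled circle point sits exactly a distance $r$ from the center, and $t=0$ on the line lands on $\color{red}{p}$, which is a quick sanity check that the offsets were added in the right place.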

Now that we have defined our line and circle for our computer to interpret, we can start thinking about how to detect collisions between a line and a circle.

A good place to start is by looking at how far away the line $l$ is from the center of the circle $\color{blue}{q}$. For reference, the distance from a point to a line is the shortest (i.e. perpendicular) distance from the point to the line. If $l$ is more than a distance of $r$ away from $\color{blue}{q}$, then we know that it's outside the circle and doesn't intersect, and if $l$ is less than a distance $r$ away from $\color{blue}{q}$, then we know it's inside the circle and does intersect.

$l_1$ is a distance less than $r$ away from the center, and clearly intersects the circle. $l_2$ is a distance greater than $r$ away, and clearly does not intersect the circle. $l_3$ is exactly a distance $r$ away, making it tangent to the circle (1 intersection point instead of 2).

Let's look at an individual line and see if we can draw any useful conclusions about this distance.

From a given point $\color{red}{p}$ on our line $l$, we can find a new vector between $\color{red}{p}$ and the circle's center $\color{blue}{q}$ as $\overrightarrow{\color{blue}{q} - \color{red}{p}}$. This will form some angle $\theta$ with $l$, more specifically its vector $v$. Recalling that $\color{green}{d}$ is the perpendicular distance between $\color{blue}{q}$ and $l$, we have a right triangle that gives us that $\color{green}{d} = |\overrightarrow{\color{blue}{q} - \color{red}{p}}| \sin \theta$.

If you're familiar with your linear algebra, this almost looks like the formula for the magnitude of the cross product: $|v \times u| = |v||u|\sin \theta$. So, writing our two relevant vectors and rearranging we can see that…

$|\overrightarrow{\color{blue}{q} - \color{red}{p}}| \sin \theta = |\overrightarrow{\color{blue}{q} - \color{red}{p}} \times \frac{v}{|v|}|$

So all we need to do to see if our line intersects our circle is check whether the magnitude of that cross product is less than or equal to the radius of our circle. (If you're concerned about the dimensionality of our vectors, since cross products only exist in dimensions 3 and 7, we can treat them as 3D vectors with z-component 0, which makes the calculation easier and equivalent to a 2×2 determinant.)

If it isn't apparent why this is true, it has to do with the geometric interpretation of the cross product: we're finding the area of the parallelogram that the two vectors span, and since the area of a parallelogram is $A=\textrm{base}\cdot\textrm{height}$, we're essentially finding the height of that parallelogram by dividing by its base.

Using the closest distance between the circle and line, we can successfully identify when the line intersects our circle.
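
A minimal sketch of this test in Python, assuming 2D points and vectors given as tuples (the names `distance_to_line` and `line_hits_circle` are hypothetical). With z-components of 0, the cross product collapses to a single number, so the whole check is a few multiplications and one division.

```python
import math

def distance_to_line(p, v, q):
    """Perpendicular distance from q to the line p + t*v (2D)."""
    wx, wy = q[0] - p[0], q[1] - p[1]      # the vector q - p
    cross_z = wx * v[1] - wy * v[0]        # z-component of (q - p) x v
    return abs(cross_z) / math.hypot(v[0], v[1])

def line_hits_circle(p, v, q, r):
    """True when the line comes within r of the circle's center q."""
    return distance_to_line(p, v, q) <= r
```

Dividing by the length of `v` is the "divide by the base" step, so the direction vector does not need to be normalized beforehand.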

We have a working condition! Using the cross product, we can identify line-circle intersections with a single line of computation. However, this simple solution does have its limitations. Mainly, this is only a **boolean** condition; this method only tells us whether or not an intersection occurs, but nothing else. We don't know where on the line it intersects, nor how many times. Sometimes this doesn't really matter, like when you want to approximate lines intersecting points (since then you can treat points as small circles). But for more complex tasks and graphics like raytracing, this won't cut it.

If we have a point $x$ on our circle, then the distance between $x$ and the center of the circle $\color{blue}{q}$ should be equal to the radius $r$. As an equation, the magnitude of the vector from $x$ to $\color{blue}{q}$ equals $r$.

$|x - \color{blue}{q}| = r$

Moreover, we want this point $x$ on our circle to also be on our line $l$. So, $x = \color{red}{p} + tv$ for some value of $t$. With this in mind, we can substitute $x$ in our previous equation.

$|\color{red}{p} + tv - \color{blue}{q}| = r$

Now, let's square both sides.

$|\color{red}{p} + tv - \color{blue}{q}|^2 = r^2$

This may seem pointless, but it helps us rewrite that left side of the equation. Generally, working with the magnitude of a vector as an operator isn't super helpful, but we can quickly rewrite the *square* of the magnitude in terms of the dot product, since for any vector $v \cdot v = |v|^2$.

$(\color{red}{p} + tv - \color{blue}{q}) \cdot (\color{red}{p} + tv - \color{blue}{q}) = r^2$

Expanding this out and collecting like terms gives us…

$t^2(v \cdot v) + 2t(v \cdot (\color{red}{p} - \color{blue}{q})) + (\color{red}{p} - \color{blue}{q}) \cdot (\color{red}{p} - \color{blue}{q}) - r^2 = 0$

Which is just a quadratic equation in $t$! With coefficients…

$a = v \cdot v, \quad b = 2v \cdot (\color{red}{p} - \color{blue}{q}), \quad c = (\color{red}{p} - \color{blue}{q}) \cdot (\color{red}{p} - \color{blue}{q}) - r^2$

…we can solve for $t$ using our trusted quadratic formula (note that $a$, $b$, and $c$ are all outputs of dot products, ensuring they are valid scalars to plug in).

$t = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

Remember, $t$ is the scalar that tells us where on our line we are, so if there are real solutions to $t$, then we will have the exact intersection points for our line and the circle!

Our quadratic formula now not only tells us when the line intersects the circle, but also *where* they intersect.

We can analyze this quadratic like any other to gain insight into our intersection points, specifically by using the discriminant. When $b^2 - 4ac > 0$, we get two solutions/intersection points. If $b^2 - 4ac < 0$, we get no real solutions and therefore no intersection points. Finally, if $b^2 - 4ac = 0$, we have exactly one intersection point, and can conclude our line is tangent to the circle.

Also, this quadratic can straight up replace our closest-distance method from before, since the point at which our line is closest to the circle's center corresponds to the vertex of the parabola at $t=\frac{-b}{2a}$.
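
Here is a minimal Python sketch of the quadratic approach, with coefficients $a = v \cdot v$, $b = 2v \cdot (\color{red}{p} - \color{blue}{q})$, and $c = (\color{red}{p} - \color{blue}{q}) \cdot (\color{red}{p} - \color{blue}{q}) - r^2$ read off from the expanded equation. The names `dot` and `line_circle_intersections` are hypothetical, and the sketch assumes $v$ is not the zero vector. Since it only uses dot products, the same function accepts tuples of any length.

```python
import math

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def line_circle_intersections(p, v, q, r):
    """Return the t values where p + t*v meets the circle (or sphere) |x - q| = r."""
    w = [pi - qi for pi, qi in zip(p, q)]   # the vector p - q
    a = dot(v, v)
    b = 2 * dot(v, w)
    c = dot(w, w) - r * r
    disc = b * b - 4 * a * c                # the discriminant decides the case
    if disc < 0:
        return []                           # no real solutions: the line misses
    if disc == 0:
        return [-b / (2 * a)]               # tangent: a single point at the vertex
    root = math.sqrt(disc)
    return [(-b - root) / (2 * a), (-b + root) / (2 * a)]
```

Plugging a returned $t$ back into $\color{red}{p} + tv$ gives the actual intersection coordinates. With floating point inputs an exact zero discriminant is rare, so a real renderer would likely compare against a small tolerance instead.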

Not to mention, notice how everything we did here was independent of the fact our line and circle exist in two dimensions; we can easily use this for 3D graphics, and even higher dimensions as well to find the intersections between lines and hyperspheres! Below is a raytraced scene I drew of 3 balls using this exact quadratic to compute lighting with shadows and reflections (a.k.a. my formal application to Pixar).

This raytraced scene is just thousands of uses of the quadratic formula.

And to think that we'd never use the quadratic formula in real life.

Before I end off this post, I want to include some other interesting circle facts since I don't know where else to put them.

**If you have a ray of light start from the circumference of the circle, after a total of $n$ reflections within the circle, the sum of all the angles of reflection will be $n^2$ times the initial angle.**

Between this and the Basel problem, circles and squares are just weirdly intertwined. The reason this particular statement is true is because of how much the angle with the horizontal increases after a single bounce. If your light starts at an angle $\alpha$, we can show that every additional bounce will add $2\alpha$ to the angle with respect to the horizontal.

With the help of some auxiliary lines, I hope the above picture makes this clear. Then, by the symmetry of the circle, we can see that each subsequent bounce will also add $2\alpha$ to the angle. Moreover, since our initial angle itself is $\alpha$, the angle at every bounce will just be an odd multiple of $\alpha$ (since odd numbers can be thought of as a multiple of 2 plus 1, which is precisely what our angle bounces mimic)! So, for a series of $n$ bounces, the sum of the angles of each reflection is equal to

$\alpha + 3\alpha + \dots + (2n-1)\alpha = \alpha\left(2\sum_{k=1}^{n} k - n\right) = \alpha\left(n(n+1) - n\right) = n^2\alpha$

(Yes, I am aware there is a formula for an arithmetic sequence with any initial term, but this is how I remember to solve them, okay?) I didn't know how to fit it in last post with the mention of circular mirrors there, but here seems like a good spot to mention it.

**The intersection points of two orthogonal parabolas lie on a common circle.**

To show this is true, we just need to crank out the algebra. To find our intersection points, we need to solve the system of equations

$y - \color{red}{y_1} = (x - \color{red}{x_1})^2$

$x - \color{blue}{x_2} = (y - \color{blue}{y_2})^2$

If these individual equations are true for our intersection points, then so is their sum.

$x^2 - x(2\color{red}{x_1}) + \color{red}{x_1}^2 + y^2 - y(2\color{blue}{y_2}) + \color{blue}{y_2}^2 = y - \color{red}{y_1} + x - \color{blue}{x_2}$

$x^2 - x(2\color{red}{x_1} + 1) + y^2 - y(2\color{blue}{y_2} + 1) = -\color{red}{y_1} - \color{red}{x_1}^2 - \color{blue}{x_2} - \color{blue}{y_2}^2$

$(x - (\color{red}{x_1} + \frac{1}{2}))^2 - (\color{red}{x_1} + \frac{1}{2})^2 + (y - (\color{blue}{y_2} + \frac{1}{2}))^2 - (\color{blue}{y_2} + \frac{1}{2})^2 = -\color{red}{y_1} - \color{red}{x_1}^2 - \color{blue}{x_2} - \color{blue}{y_2}^2$

$(x - (\color{red}{x_1} + \frac{1}{2}))^2 + (y - (\color{blue}{y_2} + \frac{1}{2}))^2 = (\color{red}{x_1} + \frac{1}{2})^2 + (\color{blue}{y_2} + \frac{1}{2})^2 -\color{red}{y_1} - \color{red}{x_1}^2 - \color{blue}{x_2} - \color{blue}{y_2}^2$

While that last line may seem a bit unruly, note that $\color{red}{x_1}$, $\color{red}{y_1}$, $\color{blue}{x_2}$, and $\color{blue}{y_2}$ are all constants, so the right-hand side of that last equation can be summarized as one big constant.

$(x - (\color{red}{x_1} + \frac{1}{2}))^2 + (y - (\color{blue}{y_2} + \frac{1}{2}))^2 = C$

That's precisely the equation of a circle with a center at $(\color{red}{x_1} + \frac{1}{2}, \color{blue}{y_2} + \frac{1}{2})$ and radius $\sqrt{C}$, and that's exactly what is plotted above.
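
As a quick numerical sanity check, here is a small Python sketch for one hand-picked pair of parabolas, $y + 2 = x^2$ and $x + 2 = y^2$. These particular vertices are my own example rather than the ones plotted, chosen because all four intersection points can be written down by hand.

```python
import math

# Hypothetical example: y - y1 = (x - x1)^2 and x - x2 = (y - y2)^2
x1, y1 = 0.0, -2.0
x2, y2 = -2.0, 0.0

# For this symmetric pair, two intersections lie on y = x and two on x + y = -1.
s = math.sqrt(5)
points = [
    (2.0, 2.0),
    (-1.0, -1.0),
    ((-1 + s) / 2, (-1 - s) / 2),
    ((-1 - s) / 2, (-1 + s) / 2),
]

# Center and squared radius read off from the completed-square equation.
cx, cy = x1 + 0.5, y2 + 0.5
C = cx**2 + cy**2 - y1 - x1**2 - x2 - y2**2

# Squared distance of each intersection point from the center.
squared_distances = [(x - cx) ** 2 + (y - cy) ** 2 for x, y in points]
```

All four squared distances agree with $C$ up to rounding, as the algebra predicts.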

I have a few more circle tidbits to share, but they have more to expand on in their own posts for another day.

Until then, hopefully you found this slight detour into the world of graphics interesting. There is (as you could imagine) a lot more to graphics I want to share, from image homography, to video textures, to even a more in-depth look into raytracing and rasterization, but we'll save those for later.

]]>First, let's take a look at the integration by parts shortcut.

This is by far my new favorite trick to pull out of my back pocket whenever I can. It leverages the product structure built into integration by parts and the nature of antiderivatives. Though, I'm sure some of us could use a refresher on integration by parts.

When given the product of two functions $f(x)g(x)$, the standard formula to compute its derivative is

$\frac{d}{dx}\left[f(x)g(x)\right] = f'(x)g(x) + f(x)g'(x)$

Now if you integrate both sides and rearrange a little bit we can conclude that

$\int f(x)g'(x) \ dx = f(x)g(x) - \int f'(x)g(x) \ dx$

Let $u = f(x)$ and $v = g(x)$ to get that

$\int u \ dv = uv - \int v \ du$

…which is precisely the integration by parts formula we've come to know. It really just is the opposite of the product rule. The main reason why it's such a useful technique is because if you have a function that's really hard or you don't know how to integrate, you can use it as your function $u$ and express its integral purely in terms of its derivative. Let's try this with an example:

$\int \ln x \ dx$

This doesn't seem like a particularly product-y integral that can leverage integration by parts, but it really is!

Since we don't want to deal with integrating $\ln x$ (besides, that's what we're trying to find anyway), we can set $u = \ln x$ and $dv = 1 \ dx$. Then working it through we get that

$\int \ln x \ dx = x \ln x - \int x \cdot \frac{1}{x} \ dx = x \ln x - x + C$

That's pretty neat! We were able to reduce a relatively hard integral into one that was much simpler by thinking of it in terms of what product of functions when differentiated would include our original integral.

However, this doesn't always work by itself, and this is where our integral trick comes into play.

Let's try a very similar integral from before:

$\int \ln(x+1) \ dx$

This doesn't look too bad, right? It's basically the same as before. Let's try the same choice of $u = \ln(x+1)$ and $dv = 1 \ dx$ again and see where it takes us.

$\int \ln(x+1) \ dx = x\ln(x+1) - \int x \cdot \frac{1}{x+1} \ dx$

Aaand there's our problem. Our supposed simplified integral of $\int v \ du$ ended up with something also annoying: $\int x \cdot \frac{1}{x+1} \ dx$. This isn't too bad if you're okay with polynomial division (with this one being relatively easy, too), but it isn't necessarily trivial. Since we want to avoid doing more work, we can do much better by realizing an overlooked aspect of integration by parts.

We rewrote our original integral in the form of $\int u \ dv$, and later found an antiderivative $v$ from that differential. In our case, we let $dv = 1 \ dx$ and deduced that $v = x$ by undoing the power rule of differentiation. This isn't *wrong*, per se, but it is incomplete. The antiderivative of $1\ dx$ isn't $x$, but $x \ \mathbf{+ \ C}$. Since, remember, the derivative of any constant goes to 0, we can add whatever constant we want to the end of our antiderivative and it'll still remain valid.

So how can this help us? Well, let's do the same integral with the same choice of $u$ and $dv$, but instead of letting $v = x$, let's make $v = x+1$.

$\int \ln(x+1) \ dx = (x+1)\ln(x+1) - \int (x+1) \cdot \frac{1}{x+1} \ dx$

Look at that! That last, previously annoying integral, has become much simpler! Instead of getting two polynomials dividing each other, our new choice of $v$ reduces it to $\int (x+1) \cdot \frac{1}{x+1} \ dx = \int 1 \ dx = x + C$. So, finally, we can conclude that

$\int \ln(x+1) \ dx = (x+1)\ln(x+1) - x + C$

In fact, for any choice of a constant $\alpha$, we can see that

$\int \ln(x+\alpha) \ dx = (x+\alpha)\ln(x+\alpha) - x + C$

It's such a simple trick, but an important reminder to remember the basics and fundamentals when attempting a problem. Besides, $+C$ being more than a formality is at least a little bit satisfying.
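
If you'd like to double-check the general result $\int \ln(x+\alpha) \ dx = (x+\alpha)\ln(x+\alpha) - x + C$ without redoing the algebra, a small Python sketch can differentiate the answer numerically and compare it with the integrand. The helper names and the sample values of $\alpha$ and $x$ are my own choices.

```python
import math

def antiderivative(x, alpha):
    """Claimed antiderivative of ln(x + alpha), with the constant C dropped."""
    return (x + alpha) * math.log(x + alpha) - x

def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The slope of the antiderivative should match ln(x + alpha) at any sample point.
alpha, x = 2.5, 1.0
estimate = slope(lambda s: antiderivative(s, alpha), x)
exact = math.log(x + alpha)
```

The two values agree to many decimal places, which is reassuring but of course not a proof.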

The last integral trick is a bit niche: it relies on the second integral in IBP turning into a quotient of two polynomials of matching degrees. This next shortcut uses the idea of polynomial degree a bit more cleverly, though it does not always work. When it does, it's certainly satisfying.

Let's find the following antiderivative:

$\int \frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}} \ dx$

This doesn't look particularly friendly, but we can make some observations about this function. The numerator of our function is a cubic, or a polynomial of degree 3. Similarly, our denominator is the square root of a quadratic, or a polynomial of degree 2. In a very loose sense of "degree", we can say that asymptotically, the denominator is closer to a degree 1 or linear polynomial (yes, I know that $\sqrt{x^2} = |x| \neq x$, but just play along for now). So, if we were to carry out all of the polynomial long division, we'd expect our original function to behave like a degree $3 - \frac{2}{2} = 2$ polynomial.

Ok, so what? With basic integration, antidifferentiating a polynomial increases its degree by one. This is just the power rule.

$\int x^n \ dx = \frac{x^{n+1}}{n+1} + C$

This fact implies that if our polynomial is loosely of "degree" 2, then integrating it should give us a function of degree $2+1=3$. So, let's make a guess at what our integral might look like.

$\int \frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}} \ dx = (ax^2 + bx + c)\sqrt{x^2 + 3}$

This guess should look somewhat reasonable, since we have a quadratic multiplied by the square root of another quadratic, which we loosely said was degree 1. And a polynomial of degree 2 multiplied by a polynomial of degree 1 gives us a polynomial of degree 3, which is what we wanted. However, you might wonder why we even wrote this as a product; why not just write this out as a pure cubic of $ax^3 + bx^2 + cx + d$? The main reason is that we expect a chain rule of some kind to occur. When composing functions, the derivative (and therefore the integral) tends to include the structure of these compositions, so it's not unreasonable to make a guess that keeps the square root from the denominator in the result.

Now here's the trick: let's differentiate both sides.

$\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}} = (2ax + b)\sqrt{x^2 + 3} + (ax^2 + bx + c) \cdot \frac{x}{\sqrt{x^2 + 3}}$

If we simplify this expression and expand the right side…

$x^3 + 2x^2 + 3 = (2ax + b)(x^2+3) + (ax^2 + bx + c) \cdot x$

$x^3 + 2x^2 + 3 = 3ax^3 + 2bx^2 + (6a + c)x + 3b$

For this last equation to hold, we need the coefficients to match.

$3a = 1, \quad 2b = 2, \quad 6a + c = 0, \quad 3b = 3$

Therefore,

$a = \frac{1}{3}, \quad b = 1, \quad c = -2$

Putting it all together, we can go back to our original guess of the antiderivative and find that

$\int \frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}} \ dx = \left(\frac{1}{3}x^2 + x - 2\right)\sqrt{x^2 + 3} + C$

There we go! A successful antiderivative found.
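
Matching coefficients in $x^3 + 2x^2 + 3 = 3ax^3 + 2bx^2 + (6a + c)x + 3b$ gives $a = \frac{1}{3}$, $b = 1$, and $c = -2$. A short Python sketch can confirm those values by differentiating the guess numerically and comparing it with the integrand $\frac{x^3 + 2x^2 + 3}{\sqrt{x^2 + 3}}$ implied by that equation. The function names and sample points are hypothetical.

```python
import math

def integrand(x):
    return (x**3 + 2 * x**2 + 3) / math.sqrt(x**2 + 3)

def guess(x, a=1 / 3, b=1.0, c=-2.0):
    """The guessed antiderivative (a*x^2 + b*x + c) * sqrt(x^2 + 3)."""
    return (a * x**2 + b * x + c) * math.sqrt(x**2 + 3)

def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Largest mismatch between the guess's slope and the integrand over a few points.
worst_error = max(abs(slope(guess, x) - integrand(x)) for x in (-2.0, 0.0, 1.5))
```

Note that the constant-term equation $3b = 3$ happens to agree with $2b = 2$. That agreement is exactly what can fail when the integrand changes slightly.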

This trick is a great first attempt at integrating rational functions, but it is also extremely sensitive to minute changes in the integrand. For example, if we change our integral to

our algorithm breaks. It's the cost associated with what makes this algorithm so convenient: we don't touch the numerator of the integrand at all. Our antiderivative guess only depended on the denominator, and as a result, the coefficients we tried to match at the end had no intrinsic tie to the numerator and thus polynomial we were matching.

This integral shortcut's convenience is definitely a double-edged sword, but the method behind making these educated guesses is a useful idea in its own right to take away. For more on this type of integration, I recommend reading up on the Risch algorithm, a standard in computing indefinite integrals. Here's also a very thorough synopsis on evaluating integrals on the Wolfram Blog.

This last integration stratagem comes from none other than the celebrated Richard Feynman of physics fame, and thus has been aptly coined as Feynman's Integral Trick.

…as it has been popularized. What I'm about to show has historically been known as the Leibniz Integral Rule, or differentiation under the integral sign. Not quite the same ring to it, but nonetheless good to know for accuracy.

Let's try the following integral:

What we're about to do might seem insane, but it will be immensely helpful in a second. What we're going to do is *generalize* this integral. Let

We've replaced the exponent of 2 with a $t$. So, in our new, generalized problem, we want to find $f(2)$. Another useful fact to note is that we know some values of $f(t)$. For example, we know that $f(0)=0$. How does this help? Well, now we can do what the name of this trick alludes to (or rather, outright says): we'll differentiate under the integral sign. Let's take the derivative of $f(t)$ with respect to $t$.

That last integral is super easy, only reversing the power rule to calculate.

Now that we know the derivative of $f(t)$, we can now integrate this simpler function in terms of known values and use the Fundamental Theorem of Calculus to find $f(2)$.

Note how I used the Fundamental Theorem of Calculus with clever bounds for our integral. You could instead solve the differential equation generally, but the FTC skips a few steps. So, after all of that, we can conclude

As counterintuitive as it may seem, solving a general problem can sometimes actually be easier to solve than its individual cases. The best part about this technique, though, is that we haven't just solved one integral, but a whole *family* of integrals. For any exponent $\alpha$, we can conclude that
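
Here is a rough numerical check in Python, under one loud assumption: that the integral in question is the classic $\int_{0}^{1} \frac{x^2 - 1}{\ln x} \ dx$, generalized to $f(t) = \int_{0}^{1} \frac{x^t - 1}{\ln x} \ dx$. That choice matches the description here (an exponent of 2 swapped for $t$, $f(0) = 0$, and a derivative $\int_{0}^{1} x^t \ dx = \frac{1}{t+1}$ that only needs the power rule reversed), and it would give $f(t) = \ln(t+1)$. The function names and step count are my own.

```python
import math

def f_numeric(t, steps=20000):
    """Midpoint-rule estimate of the integral of (x**t - 1) / ln(x) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h              # midpoints avoid x = 0 and x = 1
        total += (x**t - 1.0) / math.log(x)
    return total * h

def f_closed(t):
    """Closed form from integrating f'(t) = 1 / (t + 1) starting at f(0) = 0."""
    return math.log(t + 1.0)
```

Under that assumption, `f_numeric(2)` lands very close to $\ln 3$, and other exponents can be spot-checked the same way.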

You might have noticed something different with this integral compared to our previous approaches: this applies to *definite* integrals as opposed to *indefinite* integrals (or antiderivatives). That's because we have to integrate not once but twice in this method. So, at the following step,

if this was not a definite integral, we would end up with a $+C$ attached to the end of it that we would not be able to solve for.

Here's another application of Feynman's trick:

$\int_{0}^{1} \frac{\ln(x+1)}{x^2+1} \ dx$

Knowing Feynman's trick wins you the battle, but knowing *how* to use it wins you the war. Many times, you have to be creative in your choice of parameter when wanting to differentiate under the integral sign, so don't be discouraged if it doesn't work the first time. For this particular integral, we'll want to consider

$f(t) = \int_{0}^{1} \frac{\ln(tx+1)}{x^2+1} \ dx$

Now, we want to find $f(1)$, and we know that $f(0)=0$. Now let's differentiate both sides with respect to $t$.

$\frac{\partial f}{\partial t} = \int_{0}^{1} \frac{x}{(tx+1)(x^2+1)} \ dx$

Decomposing that last integral into its partial fractions yields

$\frac{\partial f}{\partial t} = \frac{1}{t^2+1} \int_{0}^{1} \left( \frac{x+t}{x^2+1} - \frac{t}{tx+1} \right) dx$

Now, with more elementary calculus, we can evaluate that integral.

$\large{\frac{\partial f}{\partial t}}$ $=\large{\frac{-4\ln(t+1) + 2\ln(2) + t\pi}{4(t^2 + 1)}}$

Now, we want $f(1)$, and know that $f(0)=0$, so let's integrate this function of $t$ from 0 to 1.

$f(1) = \int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $dt - \int_{0}^{1} \large{\frac{\ln(t+1)}{t^2 + 1}}$ $dt = \int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $dt - f(1)$

$f(1) = \large{\frac{1}{2}}$ $\int_{0}^{1} \large{\frac{2\ln(2) + t\pi}{4(t^2 + 1)}}$ $dt = \large{\frac{\pi \ln 2}{8}}$

Again, almost magically, by generalizing a hard integral, it became a much easier one to tackle.

To give you an idea how powerful this technique is, the above integral comes from the 2005 Putnam Exam. Not only does it come from one of the most difficult math tests, it's also the 5th problem of the first set of problems (with problem 1 being the "easiest" and 6 being near impossible). And, in only a few lines, Feynman and Leibniz had it beat.
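
For the skeptical, here is a small Python sketch that checks both the final value $\frac{\pi \ln 2}{8}$ and the intermediate derivative numerically. It assumes the parameterization $f(t) = \int_{0}^{1} \frac{\ln(tx+1)}{x^2+1} \ dx$, which is the one consistent with $f(0)=0$ and the derivative quoted above. The function names and step sizes are hypothetical.

```python
import math

def f_numeric(t, steps=2000):
    """Midpoint-rule estimate of the integral of ln(t*x + 1) / (x**2 + 1) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(
        math.log(t * (k + 0.5) * h + 1.0) / (((k + 0.5) * h) ** 2 + 1.0)
        for k in range(steps)
    )

def f_prime_closed(t):
    """The derivative of f quoted in the text."""
    return (-4 * math.log(t + 1) + 2 * math.log(2) + t * math.pi) / (4 * (t**2 + 1))

putnam_value = math.pi * math.log(2) / 8

# Finite-difference slope of the numeric f at t = 0.5, to compare with f_prime_closed.
numeric_slope = (f_numeric(0.5 + 1e-4) - f_numeric(0.5 - 1e-4)) / 2e-4
```

Both comparisons agree to several decimal places, so the few lines of calculus really do match a brute-force estimate.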

These are the three most recent integration techniques I have picked up and tucked away in my problem solving toolbox, but if you're interested in more advanced integral shortcuts and tricks, take a look at this MathStackExchange post I came across while doing this write-up. There are some genuinely mesmerizing ideas showcased there that just are out of the scope of my ability to explain, so do browse the forum if you're interested.

]]>This raytraced scene is just millions of uses of the quadratic formula.

But I have to admit, I sort of lied to you. While, yes, that image *does* use the quadratic formula millions of times, it doesn't *only* do that. To render shadows and reflections, the scene also had to compute lighting and the physics you'd expect with mirror-like objects. Without any of this, our scene would just look like, well, uh, this:

Now *this* is peak graphical performance. In a word: art.

Which one is better looking is in the eye of the beholder, but what can't be argued is that the second image is much cheaper to render. As I'm sure you could guess, having no shadows and reflections lets the scene be rendered in a fifth of the time. *A fifth*.

Intuitively, more stuff to compute should take a computer a longer amount of time to go through, but can we pinpoint this bottleneck? Let's quickly look at what it takes to compute some of these reflections. When light bounces off, say, a mirror, these calculations become much easier when we use the mirror or surface's *normal vector*: the vector perpendicular to the surface (or the point at the surface) in question.

How to reflect a ray over a normal vector.

The above formula for reflecting a ray works in general for reflecting any ray $\vec{R}$ over another vector $\vec{N}$ (even if they're not normal)… Under a small assumption: the vector $\vec{N}$ is *normalized* (yes, the naming scheme isn't ideal), or of unit length (denoted by a little hat $\hat{N}$). We can do this by just scaling the vector down by its own length:

$\hat{N} = \frac{N}{\lVert N \rVert}$

Recalling that the length of a 3D vector is $\lVert N \rVert = \lVert \langle x,y,z \rangle \rVert = \sqrt{x^2 + y^2 + z^2}$, we end up with

$\hat{N} = N \cdot \frac{1}{\sqrt{x^2 + y^2 + z^2}}$

And here lies our bottleneck. While we, as humans, treat division not too differently from multiplication in theory, computers can't work with "just in theory"; computers have to actually compute this arithmetic somehow. It turns out that, while multiplication is a bit more complicated than addition, we've had algorithms to accelerate the computation for *decades*. Division, on the other hand, has been so difficult to bring up to the speed of other operations that major companies like Intel have lots of research dedicated to this alone.
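
To see where the square root and division sneak in, here is a minimal Python sketch of normalizing a vector and reflecting a ray. It assumes the common convention $\vec{R} - 2(\vec{R} \cdot \hat{N})\hat{N}$ for bouncing an incoming direction off a surface; the formula pictured above may use a different sign convention. The names `normalize` and `reflect` are hypothetical.

```python
import math

def normalize(n):
    """Scale n down by its own length so it has unit length."""
    length = math.sqrt(sum(c * c for c in n))   # the square root...
    return tuple(c / length for c in n)         # ...and the division

def reflect(r, n):
    """Reflect direction r off a surface with normal n (assumed convention)."""
    n_hat = normalize(n)
    d = sum(a * b for a, b in zip(r, n_hat))    # r . n_hat
    return tuple(a - 2 * d * b for a, b in zip(r, n_hat))
```

For example, a ray heading down and to the right, `(1, -1, 0)`, bounces off a floor with normal `(0, 5, 0)` and comes back as `(1, 1, 0)`. Every single call pays for one inverse square root.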

So, what do we do?

Under pressure, people can do some amazing things. You can imagine that if someone was making anything that required lots of lighting calculations, say, a video game, then calculating $\frac{1}{\sqrt{x}}$ millions of times, and therefore also computing millions of divisions, won't really cut it.

The developers of the video game *Quake III*, an incredibly fast-paced shooter that definitely needed these optimizations, used a now infamous algorithm aptly called the *fast inverse square root*, because, well, it computes the inverse square root $\frac{1}{\sqrt{x}}$ fast and avoids the dreaded division. The history of the algorithm has been found to predate the game that made it so infamous, but pop culture assigns value to whatever it latches onto first. Without further ado, the original source code (along with all the original comments and annotations) for *Quake III* was released in 2005, and the program is right there for us to learn from:

    float Q_rsqrt( float number )
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    //  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

        return y;
    }

It's not a long algorithm by any means, but I think the comments themselves explain just how crazy this is; even the developers who USED it are impressed, but let's break it down line by line.
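
If you'd rather poke at the algorithm without a C compiler, here is a rough Python port. It is only a sketch: `q_rsqrt` is my own name, the `struct` module stands in for the pointer casts, the bits are read as an unsigned integer (equivalent to the `long` for positive inputs), and Python does the final multiplication in double precision, so the last digits will differ slightly from the C version.

```python
import struct

def q_rsqrt(number):
    """Approximate 1/sqrt(number) for a positive float, mimicking Q_rsqrt."""
    x2 = number * 0.5
    # Reinterpret the bits of the 32-bit float as an integer.
    i = struct.unpack("<I", struct.pack("<f", number))[0]
    i = 0x5F3759DF - (i >> 1)          # the magic constant minus half the bits
    # Reinterpret the resulting integer bits as a 32-bit float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    y = y * (1.5 - x2 * y * y)         # 1st iteration
    return y
```

Trying `q_rsqrt(4.0)` gives something very close to `0.5`, within a fraction of a percent, with no call to a square root or a division by a computed value.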

    long i;
    float x2, y;
    const float threehalfs = 1.5F;

Here we define four different numbers: `i`, `x2`, `y`, and `threehalfs`. But not all of these numbers are treated the same.

In our day-to-day routine, we (at least, most of us) use base 10, decimal, to represent our numbers. What this means is that each digit in a number corresponds to some power of 10 we add together. For example, the number 1409 can be grouped as 1 thousand, 4 hundreds, 0 tens, and 9 ones. You add these all together to get $1(10^3) + 4(10^2) + 0(10^1) + 9(10^0) = 1409$. This may seem obvious, but this is a really important idea in how we write numbers. Each digit represents a condensed shorthand for how many of a specific power of 10 is in our number. Computers do it similarly, but instead of base 10, they use base 2, or binary. If we wanted to represent 1409 in binary, we'd have

`10110000001`

Now if you went and added these together you could verify that $2^{10} + 2^{8} + 2^{7} + 2^{0} = 1409$. Now each digit (or decimal-igit) represents a power of 2. We call these binary-igits *bits*. That first line of code that defines a `long i` means we define a number with 32 bits that looks like

`00000000 00000000 00000000 00000000`

With 32 bits, we can write any number from 0 to $2^{32} - 1 = 4294967295$. Let's notice some nice properties with this format. In decimal, if we wanted to add a 0 to the end of our number like $1409 \rightarrow 14090$, this is the same as multiplying our number by 10, because now every digit has moved up into one bucket higher than before.

In the same way, we can remove zeroes on the right $14090 \rightarrow 1409$ by dividing by 10 since every digit will be shifted into one power lower. Binary works the same. If we want to add a 0 to the right end of our number, we now multiply our number by 2, but if we wanted to remove a 0, we divide by 2. This is known as **bit shifting**, and serves as one of the nice workarounds of division: if you want to divide specifically by a power of 2, just bit shift the number in binary by however many zeroes you need.
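
Here is a tiny Python sketch of both ideas: reading off which powers of 2 make up 1409, and using shifts to multiply and divide by 2. The variable names are just for illustration.

```python
n = 1409
bits = format(n, "b")            # the binary digits of n as a string

# Positions of the 1 bits, counted from the right, are the powers of 2 in the sum.
powers = [len(bits) - 1 - k for k, b in enumerate(bits) if b == "1"]

doubled = n << 1                 # append a 0 on the right: multiply by 2
halved = doubled >> 1            # remove that 0 again: divide by 2
```

Here `bits` is `"10110000001"` and `powers` is `[10, 8, 7, 0]`, matching $2^{10} + 2^{8} + 2^{7} + 2^{0}$. Shifting left by $k$ places multiplies by $2^k$ in the same way.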

But that leads to an issue: only even numbers can be wholly divided by 2, so what do we do if we want to divide an odd number? How would we write a decimal like .5? How would we write any rational number? We currently can only represent 32-bit *integers*, since we have no way of writing fractional parts. If we wanted to add decimals, why don't we just throw in a decimal point then?

`00000000 00000000 . 00000000 00000000`

(Remember, this decimal point is just for our convenience; the computer doesn't actually see anything here but the 32 bits.) If we use the left 16 bits for the integer part and the right 16 bits for the fractional part, we can now in fact write rational numbers. The downside is that the range of numbers we can represent has shrunk immensely, topping out at about 65535.9999847; the upside is that we can now represent numbers with a fractional part much more precisely. This is okay, but we can do better.
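
A minimal Python sketch of this 16.16 fixed-point idea, assuming non-negative inputs. The helper names `to_fixed` and `from_fixed` are hypothetical. Placing the point 16 bits from the right is the same as storing the number multiplied by $2^{16}$.

```python
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS       # 2**16 = 65536

def to_fixed(x):
    """Encode a non-negative number as a 32-bit 16.16 fixed-point integer."""
    return int(round(x * SCALE)) & 0xFFFFFFFF

def from_fixed(bits):
    """Decode a 16.16 fixed-point integer back into an ordinary number."""
    return bits / SCALE

half = to_fixed(0.5)             # only the first fractional bit is set
largest = from_fixed(0xFFFFFFFF) # all 32 bits set
```

`half` comes out as `0x00008000`, and `largest` is the ceiling mentioned above. Anything finer than $2^{-16}$ gets rounded away, which is the precision limit of this format.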

If you have ever taken a chemistry or physics class, you're probably all too familiar with *scientific notation*. We can write really big numbers and really small numbers in a much more condensed way by turning everything into a product of a power of ten:

This is how our calculators can do arithmetic beyond just the digits shown on the screen. By slapping on a power of 10, we can now represent a much wider range of values using the same number of digits as before. Binary works just the same, except with powers of 2.

So, let's do just that. Let's allocate 8 bits of our previous 32 to representing the exponent, and another 23 to our actual number.

`00000000`

`0.0000000000000000000000`

With 8 bits for the exponent, we can represent anything from 0 to 255, but we also want negative exponents, so we just subtract 127 to get the new range of exponents from -127 to 128. With our fractional number and its 1 whole bit and 22 decimal bits, we can represent numbers from 0 to 1.999999761581. We call this part of the number the mantissa. However, this is actually not the full extent of the potential precision we can get. In all of our examples of scientific notation, there was always a non-zero number before the decimal point, since if there was a leading 0, that's another power of our exponent we could factor out. In binary, there are only two values: 0 and 1. If we know the first digit is non-zero, then we know it has to be a 1! So we can actually shift our decimal point over and gain an extra bit.

`00000000`

`.00000000000000000000000`

Now all we have to do is affix a leading 1 and we're good to go. So, the number

`11001010`

`.01110100000000000000000`

would represent $1\color{blue}{.453125} \cdot \color{green}{2^{202-127}}$. In general, if we're given a 23-bit number $M$ representing the mantissa and an 8-bit number $E$ representing the exponent, we can write the number we're expressing as:

$$\left(1 + \frac{M}{2^{23}}\right) \cdot 2^{E-127}$$

We divide our mantissa by $2^{23}$ so that it is only the fractional part of our number like we want. However, doesn't part of this just *feel* wrong? Like, when we were defining an integer as a 32 bit `long`, we used every bit to denote a new power of 2. Here, we really are writing two *different* binary numbers side-by-side. If we wrote our number as

`.01110100000000000000000`

`11001010`

would it really make a difference? So, just for the sake of consistency, we'll put the exponent first, and we can then write our number's bit representation as a sum:

`11001010 00000000000000000000000` + `01110100000000000000000`

We just added 23 filler zeroes to our exponent to make sure it landed where we wanted it to in the final bit representation. That sounds like a bit shift! We can thus multiply our exponent by $2^{23}$ to give us our 23 extra zeroes. So, this final sum, our exponent $E$ and mantissa $M$ together, can be written as

$$E \cdot 2^{23} + M$$

Now, there are some flaws that do need to be addressed. If we assume our leading bit is non-zero, how _do_ we represent 0? That actually doesn't matter much for our intended use in lighting (we only call the fast inverse square root when we *have* to), and zero is a single edge case that can be handled separately. Yeah, I'll admit, it's not ideal, but it gives us more precision where we need it. The other issue is we haven't used all of our 32 bits! $8+23=31$, so where did our last bit go? In our set-up, we can only represent *positive* numbers. We can attach an additional *sign bit* at the front, where if it's a 0, we say the number is positive, and if it's a 1, the number is negative.

`0`

`11001010`

`.01110100000000000000000`

However, we want to know how to find *positive* square roots, not enter the complex world, so we always just assume it's positive. So if we don't use the bit, why don't we repurpose it? Conventions. The standard for binary fractional-part representation and arithmetic is known as **IEEE 754**, and for that reason we just have to abide by it.
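As a sanity check on this layout, here is a small Python sketch that decodes the running example by hand and compares it with what the machine makes of the same 32 bits. The helper names `decode` and `pack_bits` are mine, and the sketch only covers ordinary (normalized) numbers, not zero or the other special cases of IEEE 754.

```python
import struct


def decode(sign: int, exponent: int, mantissa: int) -> float:
    """Value of a normalized float from its sign bit, 8-bit exponent, and 23-bit mantissa."""
    value = (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return -value if sign else value


def pack_bits(sign: int, exponent: int, mantissa: int) -> int:
    """Lay the three fields side by side: sign, then exponent, then mantissa."""
    return (sign << 31) | (exponent << 23) | mantissa


E = 0b11001010                 # 202
M = 0b01110100000000000000000  # fractional part .453125
bits = pack_bits(0, E, M)

# Ask the machine to reinterpret the same 32 bits as a float.
hardware_value = struct.unpack("<f", struct.pack("<I", bits))[0]
print(decode(0, E, M), hardware_value)
```

Both printed values should agree, since the hand-written formula and the hardware are reading the same bits under the same convention.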

I've been describing this with terms like "decimal point" and "fractional part", but a decimal point seems wrong when we're working in binary. The type of number we've just formatted is called a *floating point* number, or a `float`, as we see in lines 2 and 3. While `float`s give us nice ways to represent a lot of numbers, they are a bit annoying compared to `long`s in the sense that **we can't bit shift or manipulate a float**, since the bits in a `float` represent multiple, different parts of the number in question, namely the exponent and the mantissa.

This may seem like a gross, unnecessary dive into how computers understand numbers, but understanding what binary, bits, and floats are will help us greatly in understanding the ingenuity behind the fast inverse square root. To recap, we've found that we can represent a number as a `float` with two parts, an 8-bit exponent $E$ and a 23-bit mantissa $M$, as if we were using scientific notation. To find the actual number our `float` represents, we use the formula

$$\left(1 + \frac{M}{2^{23}}\right) \cdot 2^{E-127}$$

Since we're working with two different binary numbers together, we combine them into one sequence of bits that we use as a shorthand to represent our `float`, with the formula

$$E \cdot 2^{23} + M$$

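We can now perform some mathematical magic. Let's take the logarithm of the actual number our `float` represents (note that by $\log(x)$, we mean $\log_2(x)$, since we're working in binary).

$$\log\left(\left(1 + \frac{M}{2^{23}}\right) \cdot 2^{E-127}\right) = \log\left(1 + \frac{M}{2^{23}}\right) + E - 127$$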

This may not seem that useful, but there's an important detail here: we're looking to *optimize* a program, not get exact results. So, a useful fact to note is that for $x$ between 0 and 1, $\log_2(1+x)\approx x$.

We can simplify $\log_2(1+x)$ by approximating it as $x$.

We can get an *even better* approximation by slightly offsetting our estimate; $\log_2(1+x)\approx x + \delta$ is a better approximation than $\log_2(1+x)\approx x$.

We can approximate $\log_2(1+x)$ better with a small shift.

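It turns out the best value is $\delta = 0.0430357$ (best in the sense of minimizing the worst-case error over that interval). By definition, $\frac{M}{2^{23}}$ is between 0 and 1, so we can use this approximation ourselves.

$$\log\left(1 + \frac{M}{2^{23}}\right) + E - 127 \approx \frac{M}{2^{23}} + \delta + E - 127$$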
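For the curious, here is a quick numerical check of how much that shift helps. This is only a sketch: it measures the largest error on an evenly spaced grid, and both the grid size and the function name `worst_error` are my own choices.

```python
import math

DELTA = 0.0430357  # offset quoted in the text


def worst_error(offset: float, samples: int = 10_000) -> float:
    """Largest |log2(1 + x) - (x + offset)| over a grid of x in [0, 1]."""
    return max(
        abs(math.log2(1 + k / samples) - (k / samples + offset))
        for k in range(samples + 1)
    )


print(worst_error(0.0))    # plain approximation log2(1 + x) ~ x
print(worst_error(DELTA))  # shifted approximation
```

With the shift, the largest error should come out at roughly half of what it is without it, because the offset splits the gap between the curve and the line evenly above and below.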

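If we rearrange this a bit,

$$\frac{M}{2^{23}} + \delta + E - 127 = \frac{1}{2^{23}}\left(E \cdot 2^{23} + M\right) + \delta - 127$$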

Okay, why did we do any of this? It definitely seems kinda random to not only take the $\log$ of our `float`, but also do all these approximations to then get rid of that $\log$ too. Why?

Look inside the parentheses in the above equation.

That's precisely the bit representation of our `float`! So, in a way, the $\log$ of our number is equal to the bit representation of our `float`, up to some scaling and shifting.
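We can watch this happen numerically. The sketch below reads a float's bits as an integer, applies the scaling and shifting, and compares the result with the true base-2 logarithm. The helper names are mine, and the test values are arbitrary positive numbers.

```python
import math
import struct

DELTA = 0.0430357


def float_bits(y: float) -> int:
    """The 32 bits of y (as a single-precision float) read as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", y))[0]


def log2_from_bits(y: float) -> float:
    """Estimate log2(y) by scaling and shifting the bit representation."""
    return float_bits(y) / 2**23 - 127 + DELTA


for y in (0.15625, 1.0, 3.0, 1000.0):
    print(y, log2_from_bits(y), math.log2(y))
```

The two columns should track each other to within a few hundredths, which is all the algorithm needs from this step.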

With this under our belt, we can finally start looking at the steps of the fast inverse square root algorithm.

First, we assign the number we want to find the inverse square root of to a `float` (a.k.a. a scientific-notation-type decimal number).

```
y = number;
```

Now, recall that a `float` isn't that compatible with bit shifting or many other operations, so here's the first clever part of the algorithm.

```
i = * ( long * ) &y;
```

What this does is take the exact bits of our number as a `float` and copy them into a `long`. That's it. Under the hood, it takes the number at the memory address of `y` and transfers the bits over to `i` exactly. This will make our life easier in the next step.

Since we now have the number we're trying to find the inverse square root of, `y`, as its bit representation, we have effectively stored approximately $\log({y})$ (up to scaling and shifting) in `i`.
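Python has no pointer casts, but the `struct` module can play the same trick: pack the number as a 4-byte float, then unpack those same bytes as a 4-byte integer. The function names below are hypothetical stand-ins for the C casts, not part of the original code.

```python
import struct


def float_to_long_bits(y: float) -> int:
    """Stand-in for `i = * ( long * ) &y;`: same 32 bits, read as a signed integer."""
    return struct.unpack("<i", struct.pack("<f", y))[0]


def long_bits_to_float(i: int) -> float:
    """The reverse read: same 32 bits, interpreted as a float again."""
    return struct.unpack("<f", struct.pack("<i", i))[0]


i = float_to_long_bits(0.15625)
print(hex(i), f"{i:032b}")      # sign bit, then 8 exponent bits, then 23 mantissa bits
print(long_bits_to_float(i))    # round trip back to the original number
```

Nothing is converted or rounded along the way; the bits are only relabeled, which is exactly why the integer ends up carrying the exponent and mantissa side by side.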

The fabled step that makes this algorithm so smart.

```
i = 0x5f3759df - ( i >> 1 );
```

Remember, at the end of all of this, we want to find a number, $\alpha = \frac{1}{\sqrt{{y}}}$, but we have been working almost exclusively in logarithms. So, let's take the $\log$ of both sides.

$$\log(\alpha) = \log\left(y^{-\frac{1}{2}}\right) = -\frac{1}{2}\log(y)$$

Wait, but we have a division in there! True, but it's a division by 2, and since `i` is a `long`, we can just bit shift to the right by 1 to divide by 2! That's precisely what `i >> 1` does: it bit shifts `i` once to the right.

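But what is the deal with that `0x5f3759df`? Well, remember that `i` is only an approximation of $\log(y)$ up to some constants. So, we have to account for those constants *somehow*. Let's go back to $\alpha$. We know that

$$\log(\alpha) = -\frac{1}{2}\log(y)$$

In terms of `float`s (writing $E_\alpha, M_\alpha$ and $E_y, M_y$ for the exponent and mantissa of each number)…

$$\log\left(\left(1 + \frac{M_\alpha}{2^{23}}\right) \cdot 2^{E_\alpha-127}\right) = -\frac{1}{2}\log\left(\left(1 + \frac{M_y}{2^{23}}\right) \cdot 2^{E_y-127}\right)$$

Fortunately we already know how to expand this from before.

$$\frac{1}{2^{23}}\left(E_\alpha \cdot 2^{23} + M_\alpha\right) + \delta - 127 \approx -\frac{1}{2}\left(\frac{1}{2^{23}}\left(E_y \cdot 2^{23} + M_y\right) + \delta - 127\right)$$

This looks pretty bad, but after some simplifying and rearranging…

$$E_\alpha \cdot 2^{23} + M_\alpha \approx \frac{3}{2} \cdot 2^{23}(127 - \delta) - \frac{1}{2}\left(E_y \cdot 2^{23} + M_y\right)$$

We know that anything of the form $E\cdot 2^{23} + M$ is just the bit representation of the number, and we know the bits of $y$ are just `i`, so

$$E_\alpha \cdot 2^{23} + M_\alpha \approx \frac{3}{2} \cdot 2^{23}(127 - \delta) - \frac{\texttt{i}}{2}$$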

That magic constant `0x5f3759df` is the hexadecimal (not totally sure why there are so many changes of base) approximation of that constant $\frac{3}{2}2^{23}(127 - \delta)$. So what we do in this line of code is bit shift `i` once to the right to halve it, and subtract that result from `0x5f3759df` to correct for all the constants that came with our approximations of $\log(y)$. Not totally sure why the developers felt the need to write a variable for `threehalfs` and not this number, but what can we do.
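It's worth checking how close the derived constant lands to the one in the code. This short sketch just evaluates the expression with the $\delta$ quoted earlier, and then runs the calculation backwards to see which offset would reproduce the magic number exactly (the variable names are mine).

```python
DELTA = 0.0430357      # offset quoted earlier in the text
MAGIC = 0x5F3759DF     # constant from the original code

# The constant predicted by the derivation: (3/2) * 2^23 * (127 - delta).
predicted = 1.5 * 2**23 * (127 - DELTA)

# Going the other way: which offset would reproduce MAGIC exactly?
implied_delta = 127 - MAGIC / (1.5 * 2**23)

print(hex(round(predicted)), hex(MAGIC))
print(implied_delta)
```

The two hex values should agree in their leading digits but not all the way down, so the constant in the code corresponds to a slightly different offset than the one used in the derivation. That is consistent with it being an approximation rather than an exact evaluation.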

But now note we are storing this value in `i`. So, from here on `i` no longer refers to the bits of $y$, but the bits of $\alpha$, our desired number. The bits, though, aren't particularly helpful since we want the `float` and decimal representation of $\alpha$, so we do just that:

```
y = * ( float * ) &i;
```

Just like how we cast the bits of a `float` $y$ into a `long i`, we now do the reverse and cast the bits of `i` into a `float` $y$.

At this point, we're technically done: $y$ currently stores an approximation of $\frac{1}{\sqrt{\texttt{number}}}$, using 0 steps of slow division! But we can do better for a marginal amount of extra computation.

Say we wanted to solve for the zeroes of the function

$$f(y) = \frac{1}{y^2} - C$$

where $C$ is an arbitrary positive constant. Solving for $y$…

$$\frac{1}{y^2} - C = 0 \implies y = \frac{1}{\sqrt{C}}$$

If we could find a way to approximate the roots of this function, we'd then in turn have a way to approximate the inverse square root of any number!

In a previous post, we discussed a technique to do precisely that: the Newton-Raphson Method (sometimes just called Newton's Method).

Let's say we have a random function $g(x)$. To find a solution, what can we do? Well, not a good idea, but *an* idea, we could just guess a random number $x_0$ as a solution. If $x_0$ is a solution, then obviously $g(x_0)=0$.

A pretty bad first guess.

As you'd imagine, the chances of guessing a root of $g(x)$ immediately are slim. So, the next step in Newton's Method tells us to draw the *tangent line* at our first guess $(x_0, g(x_0))$ and use the point where it crosses the $x$-axis as our next guess $x_1$.

A better, but still not ideal, approximation.

*Now* we're getting pretty close. That's the whole premise of the Newton-Raphson Method:

- Pick an initial guess $x_n$
- Draw the tangent line at $(x_n, g(x_n))$ and find where it intersects the $x$-axis
- Use that as your new guess $x_{n+1}$
- Repeat steps 1–3 as needed

So, if we do another iteration of our example above…

Now we're getting to a reasonable estimation.

There are some edge cases though where this obviously won't work, such as if our guess happens to hit an extremum.

In this case, there's no additional guess since our tangent line is parallel to the axis.

We could even get loops where we just continuously bounce back and forth between two guesses. Fortunately, we don't have to worry about that. If our first guess is already really accurate and near the actual solution, then our graph $g(x)$ begins to look like this:

Up close, smooth, continuous graphs look linear.

$g(x)$ starts to look like a line! And when a function locally looks like a line, it also locally looks like its *tangent line*.

Can't really beat that now.

This is important to us since we already have a good estimate from all of our bit manipulation from earlier, so we do one iteration of Newton's method to get an even *better* approximation.
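The tangent-line recipe above translates almost word for word into code. Here is a minimal sketch; the example function $g(x) = x^2 - 2$ and the starting guess are my own picks, chosen only because the root $\sqrt{2}$ is easy to recognize.

```python
from typing import Callable


def newton_step(g: Callable[[float], float], dg: Callable[[float], float], x: float) -> float:
    """Follow the tangent line at (x, g(x)) down to the x-axis."""
    slope = dg(x)
    if slope == 0:
        raise ZeroDivisionError("tangent line is parallel to the x-axis")
    return x - g(x) / slope


def g(x: float) -> float:       # example function (an assumption of this sketch)
    return x * x - 2


def dg(x: float) -> float:      # its derivative
    return 2 * x


guess = 3.0                     # a deliberately poor first guess
for n in range(5):
    print(n, guess)
    guess = newton_step(g, dg, guess)
```

The printed guesses settle down very quickly once they get near the root, which is the behavior the algorithm relies on: a good starting guess means one step already buys a lot of accuracy. The `ZeroDivisionError` branch is the extremum edge case from above.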

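To put this in terms of some equations to compute, we want to estimate the root of

$$f(y) = \frac{1}{y^2} - C$$

Given an initial guess $y_n$, our next guess $y_{n+1}$ is the solution to

$$0 = f(y_n) + f'(y_n)(y - y_n)$$

since this describes where our tangent line crosses the axis and generates our next guess. Solving for $y$ we get that

$$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}$$

Now it's just a matter of plugging everything in.

$$y_{n+1} = y_n - \frac{\frac{1}{y_n^2} - C}{-\frac{2}{y_n^3}} = y_n\left(\frac{3}{2} - \frac{C}{2}y_n^2\right)$$

With a small substitution of $x_2 = \frac{C}{2}$,

$$y_{n+1} = y_n\left(\frac{3}{2} - x_2 y_n^2\right)$$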

If we look at the line of code that entails this "1st iteration",

```
y = y * ( threehalfs - ( x2 * y * y ) );
```

That's precisely the formula they have right there. You might wonder if that $\frac{3}{2}$ poses an issue at all in terms of division, but it is of no concern since we know its decimal expansion to be 1.5 so we can just use floating point arithmetic from the start; division becomes an increasingly hard problem when we *don't* know what the decimal representation of the quotient in question is.
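To double-check that this line really is a Newton step, here is a small sketch that computes the update two ways: once with the general formula $y - f(y)/f'(y)$ for $f(y) = \frac{1}{y^2} - C$, divisions and all, and once in the division-free form from the code. The starting guess and the value of $C$ are arbitrary choices for the demo.

```python
def newton_step_generic(y: float, C: float) -> float:
    """y - f(y) / f'(y) for f(y) = 1/y**2 - C, written with the divisions left in."""
    f = 1 / y**2 - C
    df = -2 / y**3
    return y - f / df


def newton_step_code(y: float, number: float) -> float:
    """The same update in the form used by the code: no division anywhere."""
    threehalfs = 1.5
    x2 = number * 0.5
    return y * (threehalfs - (x2 * y * y))


C = 4.0          # we expect 1/sqrt(4) = 0.5
y = 0.4          # rough starting guess (an assumption of this sketch)
for _ in range(3):
    print(y, newton_step_generic(y, C), newton_step_code(y, C))
    y = newton_step_code(y, C)
```

The last two columns should match up to floating-point rounding, and the guess should creep toward 0.5 with each pass.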

Let's quickly recap what we've learned about the fast inverse square root algorithm and how it works:

```
float Q_rsqrt( float number ) {
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y; // evil floating point bit level hacking
    i = 0x5f3759df - ( i >> 1 ); // what the fuck?
    y = * ( float * ) &i;
    y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
    // y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

    return y;
}
```

- We first take our number as a `float` $y$ and store its bits in a `long i`.
- Noting that $\log_2(y) \approx \texttt{i}$, we approximate $\frac{1}{\sqrt{y}}$ using bit shifting and the magic number `0x5f3759df`.
- We then perform one iteration of Newton's method to get an even better approximation.
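Putting the three steps together, here is a rough Python port for experimenting. It is a sketch rather than a faithful reproduction: `struct` stands in for the pointer casts, it assumes a positive input, and the Newton step runs in Python's double precision instead of 32-bit floats. The `newton_iterations` parameter is my addition so the two stages can be compared.

```python
import math
import struct


def q_rsqrt(number: float, newton_iterations: int = 1) -> float:
    """Python port of Q_rsqrt for positive inputs; struct replaces the pointer casts."""
    x2 = number * 0.5
    # i = * ( long * ) &y;  -> read the float's 32 bits as an integer
    i = struct.unpack("<i", struct.pack("<f", number))[0]
    # i = 0x5f3759df - ( i >> 1 );
    i = 0x5F3759DF - (i >> 1)
    # y = * ( float * ) &i;  -> read those bits back as a float
    y = struct.unpack("<f", struct.pack("<i", i))[0]
    for _ in range(newton_iterations):
        y = y * (1.5 - (x2 * y * y))
    return y


for number in (0.25, 2.0, 10.0, 12345.0):
    exact = 1 / math.sqrt(number)
    rough = q_rsqrt(number, newton_iterations=0)
    refined = q_rsqrt(number, newton_iterations=1)
    print(number, abs(rough - exact) / exact, abs(refined - exact) / exact)
```

The two error columns show the relative error of the bit trick alone and of the bit trick followed by one Newton step; the second should be noticeably smaller than the first for every input.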

These three steps required a fair bit of knowledge to properly unpack, but it's incredibly insightful when thinking about the lengths programmers go to in order to optimize their code. Remember: this is all to avoid using any division! What we consider such a simple operation is almost never simple for a computer, and when it comes to teaching computers to do these things, they are as blind as a bat. But it's this difficulty and blank slate of a circuit board that lets computers teach us just as much as we teach them.

Even this struggles to capture how well the Grand Canyon lives up to its name.

Something about the extreme wide angle capturing more than what your eye can behold all at once makes it truly magical. And, if you have an Android phone, you can get even more incredible types of otherwise impossible photos with Google's staple Photo Spheres:

An "unwrapped" Photo Sphere; normally this would be an interactive literal sphere of the environment.

While these are cool, they do raise the question: how are these made? It's not like your phone is able to take them in a single snapshot; usually you have to spin or take multiple photos. How is your phone able to tie multiple images together into a single cohesive one? We'll explore a little bit of linear algebra to manipulate our photos exactly how we want to, and indulge in stylistic photos that others can only dream of.

Before anything else, let's first define what a panorama is. In computer graphics, a panorama is a type of **mosaic**, that is, a unification of two or more images. A panorama in particular, though, is a mosaic in which all the photos to be stitched together are all taken from the same camera position. When you take a panorama on your phone, all you're doing is spinning in place, so that's what we want to recreate.

So, if we're given two images from different angles,

how do we combine them into a single unified image? Thinking about it this way is not super helpful, since, well, that's what we already know about panoramas. What do we want to get out of a panorama? If our panorama is done well, then objects in one image should overlap properly with the same objects in the other image.

For example, take a look at the top left of my computer monitor:

In our panorama, these two points should overlap, since we obviously recognize these to be the same object in the real world. But to the two pictures, they are wildly different! In the first picture, the corner of my monitor is closer to the left side of the frame, while in the second picture, it is almost on the top-right edge of the image. So, if we try to manually align these two corners so that they overlap, we get:

While our single corner of the monitor is aligned between the two images, I don't think I have to try very hard to convince you that this isn't a great panorama. I mean, just look at the rest of the overlap.

The skew angles of the resting laptop and the monitor itself don't align at all, and while my cable management is bad, it's not *that* bad. This is the real challenge at the heart of making panoramas: images are *flat*, while the motion of a panorama is *cylindrical*. Ideally we would take a "cylindrical" photo and unwrap that into a rectangle, but we can't. Undoubtedly, we will have to warp our images somehow to align.

How do we find the right way to warp our image? The first thing we will need to do is get more data! Having one point line up between the photos is not great, but, say, 10 different data points might not be bad.

So, in our final panorama, we would like the same-numbered red and blue points to overlap. With data to use, the second thing we will need is a way to actually warp our image; how do we actually make our points between photos line up? For that, we turn to linear algebra.

If you had a random ordinary point, how might you describe its position to someone?

A lonely, solitary point living in the plane.

A common choice we're all familiar with in some way is by using a **coordinate system**. That is, we define a place to be $(0,0)$ and locate every point relative to that **origin** in terms of its $x$ and $y$ coordinates $(x,y)$.

A still lonely, solitary point living in the plane, but with more lines.

In the above choice of coordinate space, we might say the point is at $(3,2)$. But what exactly do we *mean* by the point being at $(3,2)$? What this really implies is that the point is 3 steps to the right of the origin, and 2 steps above the origin.

So, instead of thinking of this point in terms of separate coordinates, we can think of it in terms of these two **basis vectors**. Let's use $\color{blue}{i}$ to represent the blue, horizontal vector, and $\color{red}{j}$ to represent the red, vertical vector. So, our point is really the combination of $\color{blue}{3i}$ and $\color{red}{2j}$, or simply, $\color{blue}{3i} + \color{red}{2j}$, which itself represents another vector (the one pointing from the origin to the point $(3,2)$).

This might seem extra and unnecessary, since we just rewrote a vector as the sum of its horizontal and vertical components, which is what coordinates literally do in the first place. But the useful insight here is that there is nothing that says our basis vectors have to be in the unit directions! We can now rewrite points in multiple ways depending on our choice of $\color{blue}{i}$ and $\color{red}{j}$.

A new, quirky choice of basis vectors.

With a new choice of basis vectors, the vector $\color{blue}{3i} + \color{red}{2j}$ has a totally new position as that now encodes the coordinate $(5,3)$ since neither $\color{blue}{i}$ nor $\color{red}{j}$ represents horizontal or vertical steps anymore, but rather skew, diagonal steps.

But look at that! We've basically accomplished our goal of warping points! We've managed to transform the point $(3,2) \rightarrow (5,3)$ by manipulating $\color{blue}{i}$ and $\color{red}{j}$; both points are technically at $\color{blue}{3i} + \color{red}{2j}$, just for different basis vectors.

This is what linear algebra and matrices encode geometrically. If we write our basis vectors $\color{blue}{i}$ and $\color{red}{j}$ in a matrix and multiply that by the vector representing our initial point, we will get a new point representing our transformation (a.k.a., our warp). How do we write our basis vectors in a matrix? Each vector implicitly has coordinates associated with it! In the above picture, $\color{blue}{i}$ points at $\color{blue}{(1,-1)}$, since from its tail to its tip it moves one step to the right and one step down. $\color{red}{j}$, on the other hand, points at $\color{red}{(1,3)}$, and these are precisely the columns we see in our matrix:

$$\begin{bmatrix} \color{blue}{1} & \color{red}{1} \\ \color{blue}{-1} & \color{red}{3} \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \end{bmatrix}$$
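If it helps to see the arithmetic spelled out, here is a tiny Python sketch of "take $x$ steps along one basis vector and $y$ steps along the other". The function name `apply_basis` is made up for this example.

```python
def apply_basis(i_vec, j_vec, point):
    """Return x * i_vec + y * j_vec, i.e. the matrix with columns i_vec, j_vec times point."""
    x, y = point
    return (x * i_vec[0] + y * j_vec[0], x * i_vec[1] + y * j_vec[1])


standard = apply_basis((1, 0), (0, 1), (3, 2))   # unit basis vectors
skewed = apply_basis((1, -1), (1, 3), (3, 2))    # the skewed basis from the text
print(standard, skewed)
```

With the unit basis vectors the point stays where it was, and with the skewed pair it lands on the warped position, which is all a matrix-vector product is doing.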

For the unit basis vectors that point at $(1,0)$ and $(0,1)$, to differentiate them from any old pair of basis vectors, we call them $\color{blue}{\hat{\imath}}$ and $\color{red}{\hat{\jmath}}$ wearing a little hat, and their respective matrix the *identity matrix*, since it leaves vectors unchanged after multiplication (since that's what we used to define coordinates in the first place).

The underlying idea of linear transformations.

These $2 \times 2$ matrices represent *linear transformations*. They're transformations in the way that they move points from one coordinate to another (well, most of them at least), and they are linear in the sense that they keep all grid lines parallel, evenly spaced and, well, linear after the transformation. This is best seen through video and not stills. For this post you don't need to understand the mechanics of matrix-vector multiplication; just understand that it represents some transformation of a point.

So why should we care? Why is this helpful in any way? If we think of each pixel on our images as a coordinate, we can just apply our transformation to all pixels on that image, find where they land and color them, and generate a new image. Let's take this picture as an example.

We can scale it, rotate it, or even shear it by applying the same transformation matrix to every pixel by changing $\color{blue}{i}$ and $\color{red}{j}$ like before.

An example transformation acting on our image.

But we have a *big* problem here: the typical linear transformation does not allow for translations. By the nature of linear transformations, the origin cannot move, forcing the bottom-left corner of our images to always overlap! That's pretty restrictive in terms of the panoramas we can make and, for practical purposes, a complete nonstarter. If we want to continue making a panorama, we'll need to find a way to allow translations.

There's a very sneaky workaround to being confined to the origin, though it might seem a bit weird at first. Let's rewrite our 2D points with a *3rd* coordinate. For a given $(x,y)$, let's rewrite that with a $z$-coordinate as $(x,y,1)$. If $z \neq 1$, then we can just divide all the other coordinates by $z$ to make it equal to 1: $(x,y,w) \rightarrow (\frac{x}{w}, \frac{y}{w}, 1)$ (we generally use $w$ to represent the $z$-coordinate to indicate that there is no "real" $z$ value since everything is projected into 2D; we use $w$ as a "weight" to say how much we scale our projections down). This means multiple coordinates represent the same point. In this way, $(2,5,1)$ and $(4,10,2)$ and $(-3,-7.5,-1.5)$ and $(2w,5w,w)$ all represent the same point (we don't include points where $z=0$, as those represent points at infinity).
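Here is a short sketch of that bookkeeping in Python, using the same example triples as above. The function names are my own.

```python
def to_homogeneous(x: float, y: float, w: float = 1.0):
    """Lift a 2D point to one of its homogeneous representatives (x*w, y*w, w)."""
    if w == 0:
        raise ValueError("w = 0 is a point at infinity, not a point in the plane")
    return (x * w, y * w, w)


def from_homogeneous(point):
    """Divide by w to land back on the plane w = 1, then drop the last coordinate."""
    x, y, w = point
    if w == 0:
        raise ValueError("w = 0 is a point at infinity, not a point in the plane")
    return (x / w, y / w)


for p in [(2, 5, 1), (4, 10, 2), (-3, -7.5, -1.5)]:
    print(p, "->", from_homogeneous(p))
```

All three triples collapse to the same 2D point, which is the sense in which they are "the same" in homogeneous coordinates.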

This might seem arbitrary, but what we're doing here is not too different than our original, 2-coordinate system. When we look at a cross-section of the $xyz$ coordinate space, it looks exactly like the $xy$ plane. What we are doing here is projecting all of $xyz$ space onto the plane $z=1$.

The geometry of projecting points onto $z=1$ is equivalent to drawing a line through the origin and the point, and finding where it intersects that plane.

In fact, many of you are already familiar with **homogeneous coordinates** (representing 2D points with a 3rd scalar coordinate) and **projective planes**! When you take a photo on your phone, how does the camera know what's drawn in its frame? How does it take a 3-dimensional world and put it into a 2-dimensional picture? The many light rays that enter the camera lens (the origin) will intersect a plane ($z=1$) based on its focal length, and the camera colors each pixel based on that projection.

A photo is homogeneous coordinates in disguise. While there's many sides to the building, our camera only cares about what it sees in front of it (a.k.a., what gets projected onto the frame). Our worldview is contained to a small projection.

Using this analogy with photos, clearly translations should be possible! If you've seen any cat videos on the internet, clearly it is possible for the cat to enter and exit the frame freely without the camera necessarily moving, and that is precisely possible due to the fact the origin $(0,0,0)$ is not contained in our projective plane $z=1$, since all of our basis vectors have to stem out of the origin! (For those interested a translation matrix is equivalent to a shear along the $z$-axis.)

Think about what we're doing here: we're using a linear transformation in 3 dimensions to create special non-linear **affine transformations** in 2 dimensions! When I first learned this geometrically, awe can't encapsulate the total shock I felt. So, if we're given a point $p$, we can transform it with a matrix $M$ to get its image $p'$.

$$Mp = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = p'$$

(Notice how $M$ is now a $3 \times 3$ matrix, as we are working with 3D coordinates now.) If our matrix-vector product results in a $w$ value not equal to 1, then we just divide everything by $w$ to make it so, and get our coordinate in terms of our 2D plane.
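In code, the multiply-then-divide step might look like the sketch below. Both example matrices are made up for illustration: one is a plain translation, and the other has a bottom row that makes $w$ depend on $x$.

```python
def apply_homography(M, point):
    """Multiply the 3x3 matrix M by (x, y, 1), then divide by w to land back on w = 1."""
    x, y = point
    xw, yw, w = (row[0] * x + row[1] * y + row[2] for row in M)
    if w == 0:
        raise ValueError("point is sent to infinity")
    return (xw / w, yw / w)


# A translation by (4, -2): impossible with a 2x2 matrix, easy with a 3x3 one.
translate = [[1, 0, 4],
             [0, 1, -2],
             [0, 0, 1]]

# A made-up projective warp whose bottom row makes w depend on x.
perspective = [[1, 0, 0],
               [0, 1, 0],
               [0.5, 0, 1]]

print(apply_homography(translate, (0, 0)))    # the origin is finally allowed to move
print(apply_homography(perspective, (2, 3)))  # here w = 2, so both coordinates get halved
```

The translation is the payoff of the extra coordinate: the third column of the matrix shifts every point, including the origin.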

These projections with homogeneous coordinates are known as **homographies**. When we take one picture, and reproject it according to a matrix like this but *keeping the same camera center* (like the origin), we call it a homography. Again, like homogeneous coordinates, people have been leveraging homographies artistically for a while now. That weird, perspective street art you might have seen before? That's about as manual as using homographies gets: literally warping images according to the angle you look at them from, so that they appear in normal proportion.

Street artists have used the power of perspective for a long time.

What we're doing is computing a homography to build a **mosaic**. Just like the decorative tile art of the same name, we are taking tiles of photos that we transform to overlap, and then stitching them all together into one, broader image.

Moreover, our homographies have a really funny interpretation to them. Since we are reprojecting pictures, what it geometrically looks like is that we're taking two photos which should be rotated in space (as you would spin taking the panorama), and *taking a photo of the two photos*. Photo-ception.

If you take a photo of two existing photos, you get one photo that unites the two together. If we can find out the right way to take the photo such that the overlap is correct, we get a panorama.

There are other ways to reproject images to make other mosaics with their own benefits and downsides, but this is what we'll use for now. Benefits of this type of mosaic? They are (relatively) easy and fast(er) to compute. Downsides? Since we are projecting onto a plane like this, we can only take panoramas up to 180° wide (can you see why?).

While it's great that we are able to transform points with matrices, let me remind us what our goal is.

We have these two photos, where we want to transform one image's points to overlap with the other's. In terms of our matrix arithmetic from before, we have $p$ and $p'$, but no matrix $M$... Up to this point we have been finding our image points using our own matrices, but how do we find that intermediate matrix given a point and its image?

Like homogeneous coordinates, many of you will already be familiar with solving for the intermediate matrix given a $p$ and $p'$. Let's do it with a simpler example.

If you had the points $(1,2)$ and $(6,4)$, and I needed you to find the line $y=mx+b$ that went through them, most of you would be able to do that. We'd set up a system of linear equations

$$\begin{aligned} 2 &= m \cdot 1 + b \\ 4 &= m \cdot 6 + b \end{aligned}$$

and solve for $m$ and $b$ respectively. In this case, $m=\frac{2}{5}$ and $b=\frac{8}{5}$. Simple algebra with little to worry about here. What is important to note here is that we *could* solve for a unique $m$ and $b$, since two points define one unique line.

An ordinary line going between two points.

But what if I introduced a third point $(p_3,p_3')$? Or even a 4th point $(p_4,p_4')$? How do we draw a line through those 4 points? There might be a line that goes through all 4 points, but it's highly unlikely.

While there's no one line through all 4 points, what's the *closest* to a line we can get?

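We may not have *exact* values for $m$ and $b$, but what are the best values for both to get the *closest* solution to this system of equations?

$$\begin{aligned} p_1' &= m p_1 + b \\ p_2' &= m p_2 + b \\ p_3' &= m p_3 + b \\ p_4' &= m p_4 + b \end{aligned}$$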

That's a task as simple as plugging it into a spreadsheet and doing a **linear regression**. More specifically, we can use the common **least-squares regression** where we want to minimize not the sum of the errors, but the sum of the square of the errors (as the name would suggest). For those a little more comfortable working with matrices and linear algebra, here's a more in-depth explanation of what we're doing with our data when finding a regression.

To many, this might seem like an obvious thing to do; everyone from middle schoolers to office workers has been finding trend lines forever. But what we did here is pretty useful when we think more abstractly: given a system of linear equations that relates independent data $p$ to its dependent data $p'$, we were able to solve for the **best coefficients of that system of linear equations**, the ones that most closely solve the system (in the previous case, $m$ and $b$). Finding a line was a nice byproduct, but what we're really doing here is solving that system of linear equations.
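For anyone who would rather see the regression than trust a spreadsheet, here is a minimal sketch using the standard closed-form least-squares formulas for a line. The first call uses the two points from the example; the two extra points in `four_points` are invented purely for illustration.

```python
def fit_line(points):
    """Least-squares m and b for y = m*x + b, from the closed-form normal equations."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)
    b = (sum_y - m * sum_x) / n
    return m, b


two_points = [(1, 2), (6, 4)]
four_points = [(1, 2), (6, 4), (3, 3.5), (8, 4.2)]   # last two points are made up

print(fit_line(two_points))    # two points: the line passes through both exactly
print(fit_line(four_points))   # four points: the best compromise line
```

With two points the "regression" simply reproduces the unique line through them; with four, no line fits exactly, and the formulas return the one with the smallest sum of squared errors.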

Now I promise this will be helpful. Let's look at our original expanded matrix equation of $Mp = p'$.

$$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix}$$

Remember, we're working in homogeneous coordinates, so $p'$ might not land on the plane $z=1$, and we account for that with $w$ here. I also set $i=1$, since a) scaling the whole matrix doesn't change the transformation in the land of homogeneous coordinates, so $M$ is not unique, and as a result, more importantly, b) it gives us one less variable to solve for.

Here, we will need to actually do the matrix-vector multiplication, and carrying it out nets a system of linear equations! (I know I said you won't need to know the mechanics of these operations, but it's hard to avoid it now. If you can accept this fact, that's great, but I'd recommend looking here if you are unfamiliar.)

$$\begin{aligned} ax + by + c &= wx' \\ dx + ey + f &= wy' \\ gx + hy + 1 &= w \end{aligned}$$

Using the third equation in tandem with the first two…

$$\begin{aligned} ax + by + c - gxx' - hyx' &= x' \\ dx + ey + f - gxy' - hyy' &= y' \end{aligned}$$

Just like before, we can solve for $a$, $b$, $c$, $d$, $e$, $f$, $g$, and $h$ with a least-squares regression! Since we have 8 variables, at minimum we need 8 equations, or 4 pairs of $p$ and $p'$ (since each pair contributes two equations: one for $x'$ and one for $y'$). Still, just as we used 10 points, it is generally better to have more data and an overdetermined system than the bare minimum (we'd rather have a good fit on average than have just 4 points be *exactly* where we want them to be). It's weird to think of this geometrically, since what we're doing here is not finding the line between one independent variable and one dependent variable, but rather relating *two* independent variables $(x,y)$ to *two* corresponding dependent variables $(x',y')$; our regression exists in 4 dimensions!
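Here is one way the whole fit could be sketched in plain Python. Each pair of points contributes two rows, one per equation in the unknowns $a$ through $h$ with $i$ fixed to 1, and the least-squares solution comes from the normal equations solved by Gaussian elimination. That solver is my choice to keep the example dependency-free; a linear algebra library routine would be the more robust option. The matrix `true_M` and the sample points are made up so the result can be checked.

```python
def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_homography(pairs):
    """Least-squares a..h (with i fixed to 1) from [((x, y), (x', y')), ...]."""
    rows, rhs = [], []
    for (x, y), (xp, yp) in pairs:
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])   # equation for x'
        rhs.append(xp)
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])   # equation for y'
        rhs.append(yp)
    # Normal equations: (A^T A) h = A^T b.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    Atb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(8)]
    a, b, c, d, e, f, g, h = solve_linear(AtA, Atb)
    return [[a, b, c], [d, e, f], [g, h, 1.0]]


def warp(M, point):
    """Apply the 3x3 matrix M to (x, y, 1) and divide by w."""
    x, y = point
    xw, yw, w = (row[0] * x + row[1] * y + row[2] for row in M)
    return (xw / w, yw / w)


# Made-up ground truth, used only to generate matching point pairs for the demo.
true_M = [[1.2, 0.1, 5.0], [-0.2, 0.9, 3.0], [0.001, 0.002, 1.0]]
sources = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 2), (3, 8)]
pairs = [(p, warp(true_M, p)) for p in sources]
estimate = fit_homography(pairs)
print(estimate)
```

Because the demo pairs were generated from `true_M` without any noise, the estimate should reproduce it closely; with hand-picked points from real photos, the fit would instead be the best compromise across all pairs.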

Let's quickly reflect on what we've covered thus far.

- We've redefined coordinates purely with vectors, allowing us to nicely compact our image-warping transformations in matrices.
- Our original definition of coordinates failed to include translations—a key transformation. We described 2-dimensional points in 3-dimensions with homogeneous coordinates, resolving our worries.
- We then ran into ANOTHER problem in that while we knew how to warp images *given* the transformation matrix, we really wanted to be able to find the matrix given a starting point and an end point to map to.
- Using a least-squares regression, we were able to turn our unknown matrix equation into a system of linear equations that were much easier to work with to compute our homography (sort of, see the aside below).

Let's use this first photo to give our list of points $p$.

And we'll try to match those red points to these blue points on the second photo: our list of $p'$.

Having the computer compute the transformation matrix, we take that matrix, multiply every pixel by it (remember, treating pixels as coordinates/vectors), and warp the first image. Then, we can overlay the two images to see how closely our points line up! If our points were well selected, and our computed homography (from the least-squares regression) has minimal error, we should get a pretty decent attempt at a panorama.

Sure, the blending isn't great, and it didn't *completely* fix the overlap issue, but the seams and photo stitching are definitely much nicer! And honestly, it's pretty cool seeing how the image was transformed and seeing the outlines of the images cross like that.

With some simple masking and basic filtering (basically averaging every pixel's color with the pixels around it), suddenly it really begins to look clean.

While this is cool, it does reveal another unfortunate downside of our choice of mosaic: if we want a uniform picture, we have to sacrifice a lot of data.

Even so, it doesn't even look that bad. All in all, though, not a bad first attempt at building a panorama.

While we have a working prototype, we can do signficantly better. For one, I used only 10 labelled points to compute our homography, but if you use even more, it's not hard to get a better, and closer fit. With algorithms like LoFTR, finding lots of corresponding labelled points between multiple images is quick and easy.

Some really smart people made an algorithm specifically for finding high-quality matches between multiple photos. Credit: LoFTR Team

Also, since we are manually constructing our panorama, we can stitch and blend photos that have no right being together in a panorama.

Going from a well-lit to a dark photo makes for some artsy renditions (even more if you blend it a little nicer).

In a similar manner, we only conjoined two photos together, but we can easily extend this to as many photos as we want (but I can't say how well the photos towards the end will necessarily stretch).

We never really touched on our homographies, either. When we picked 10 initial points $p$ and 10 warped points $p'$, our $p'$ came from lining up 2 photos. What if we didn't want to line up multiple photos, but rather just creatively warp a single photo?

Something not quite lined up? A simple homography can fix that for us.

This is known as **rectification**, as it is a means to correct for mistakes we might have had in our photo.

Finally, the last improvement we can make to our mosaics is trying new projections and warpings. If we want something as simple as wider views, up to a full 360°, we'll need to find something more robust than our previous approach. Or what if we wanted to make something akin to a full photosphere like from before?

What we did today was simply **planar projection**, or just reprojection onto a plane. We did that with homogeneous coordinates. For wider, more complete mosaics, we'll need either **cylindrical** or **spherical projection**, which are exactly what they sound like. These have their own benefits like a wider field of view, but because of the nature of projecting onto a curved surface, the images being stitched together do tend to, well, curve. The type of mosaic one uses comes down to preference and artistic need.

And lastly, there are many optimizations and polishing details we could add to make our panoramas cleaner and run faster. For instance, we never mentioned the discontinuities that could be present in warping our images with matrix multiplication. While linear transformations keep lines before the warp as lines afterwards as well, that's only helpful if our line is *continuous*. Pictures are *not* continuous! They are discrete points! So, **forward warping** with our matrix multiplication and finding where pixels land can sometimes create holes in our images (albeit usually imperceptible ones), but they are there nonetheless. Instead, we can **reverse warp**: apply the inverse of our transformation matrix to each pixel of the output, and look up where it lands on our original image! Not to mention different blending and masking techniques, or even just algorithmic improvements to make the code run faster. Check out the Python notebook below for more details.
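As a rough illustration of reverse warping, here's a sketch that walks over every pixel of the *output* and asks the inverse matrix where that pixel came from. Everything here is an assumption for the sake of the example rather than the notebook's code: the name `inverse_warp`, nearest-neighbor rounding (real implementations often interpolate), and an `H` that is invertible and doesn't send any output pixel to infinity.

```python
import numpy as np

def inverse_warp(image, H, out_shape):
    """Warp `image` by homography H via reverse warping, nearest-neighbor sampling."""
    out_h, out_w = out_shape
    H_inv = np.linalg.inv(H)
    # Homogeneous coordinates (x, y, 1) of every pixel in the output image.
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    # Ask the inverse transform where each output pixel came from.
    src = H_inv @ coords
    src_x = np.rint(src[0] / src[2]).astype(int)
    src_y = np.rint(src[1] / src[2]).astype(int)
    # Keep only the output pixels whose source lies inside the original image.
    inside = (
        (src_x >= 0) & (src_x < image.shape[1])
        & (src_y >= 0) & (src_y < image.shape[0])
    )
    out = np.zeros((out_h, out_w) + image.shape[2:], dtype=image.dtype)
    out[ys.ravel()[inside], xs.ravel()[inside]] = image[src_y[inside], src_x[inside]]
    return out
```

Because every output pixel gets looked up individually, stretching an image can't leave gaps between pixels the way pushing source pixels forward can.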

For more like this and additional resources, I recommend reading these slides from UC Berkeley's introductory computer vision and computational photography class.

I hope this gave an interesting peek at the intersection of linear algebra and photography, and moreover, I hope this gave you an appreciation for the math your phone goes through every time you take a panorama.

If you're interested, here's a link to a Python notebook where you can see some of my experiments during my struggle and exploration with panoramas and homographies.

Okay, this previous section is really hard to describe without already knowing a fair amount of linear algebra, and it felt a little flat without having a more methodical procedure of solving a least-squares regression. I wasn't planning on including this section, but it felt incomplete otherwise. For those interested, feel free to look it over, but this is not necessary within the scope of this post; all you need to understand is what our regression is accomplishing, thinking of that "line of best fit" idea giving rise to optimal coefficients in an overdetermined system of linear equations (one with more equations than unknowns).

Let's go back to when we were trying to find a line between two points. If you have 2 points, $(p_1, p_1')$ and $(p_2, p_2')$ being fit to the line $y=mx+b$, we have a system of linear equations like before.

We can solve this just like we did before to find $m$ and $b$, but there's another, sly way we can approach this. If we look carefully at the structure of these equations, there's actually a secret matrix relationship embedded into this system.
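Spelled out for the two points above, here is a sketch of that system and the matrix relationship hiding in it. The layout (a column of ones to carry the intercept) is the usual convention and my assumption about what was intended:

$$
\begin{aligned}
p_1' &= m\,p_1 + b \\
p_2' &= m\,p_2 + b
\end{aligned}
\qquad\Longleftrightarrow\qquad
\begin{bmatrix} p_1 & 1 \\ p_2 & 1 \end{bmatrix}
\begin{bmatrix} m \\ b \end{bmatrix}
=
\begin{bmatrix} p_1' \\ p_2' \end{bmatrix}
$$

Multiplying out the left-hand side row by row gives back exactly the two original equations.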

In a sense, that's what a matrix is: a system of linear equations, and you can freely go between either a system of linear equations or a matrix via matrix multiplication. (I know I said you won't need to know the mechanics of these operations, but it's hard to avoid it now. If you can accept this fact, that's great, but I'd recommend looking here for more details.)

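If we write this in general terms, we are basically solving the equation $Ax = b$, where $A$ is a matrix, and $x$ and $b$ are vectors, and we are solving for $x$. (Careful: this vector $b$ is not the intercept $b$ of our line; the intercept now lives inside $x$ alongside $m$.) It might seem pointless to rewrite it, but what we're actually solving is $Ax - b = 0$.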

Since $Ax$ is *exactly* equal to $b$ in the 2-point case, we can solve this matrix equation fairly directly; when there's a unique, perfect solution, $Ax$ is the same vector as $b$. We were able to find a unique line with $m$ and $b$ through them, no? Just as we were able to solve the system of linear equations before, we can easily solve this with matrix inverses: $x = A^{-1}b$.

Now, let's add more points.

Now we turn this into a matrix equation like before.
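With four points, a sketch of the same construction (again assuming a column of ones for the intercept) gets two extra rows:

$$
\begin{bmatrix} p_1 & 1 \\ p_2 & 1 \\ p_3 & 1 \\ p_4 & 1 \end{bmatrix}
\begin{bmatrix} m \\ b \end{bmatrix}
=
\begin{bmatrix} p_1' \\ p_2' \\ p_3' \\ p_4' \end{bmatrix}
$$

That's four equations but still only two unknowns, which is why $A$ is now tall instead of square.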

We know that there's a good chance our four points don't all lie on the same line. So it's unlikely that $Ax - b = 0$. Moreover, now that our matrix $A$ isn't square, we can't just use inverses to solve for $x$. So instead, we want the line that gets $Ax - b$ *as close* to 0 as possible (0 being a perfect fit). So our goal is to minimize $||Ax - b||_2^2$ over all choices of $x$.

Here, the $||x||_2$ means we're looking at the Euclidean distance (a.k.a. straight line distance) as our error for our line of best fit, and we're squaring it to get a tighter fit since small errors are kept relatively small, while large errors are weighted more heavily. We know $A$ and $b$, with $x$ as our unknown, so this sort of looks like a parabola-y equation! When we minimize a single variable function, we do so with the derivative. We can do the same thing here except with the multivariable equivalent: the gradient. So, we know the minimum occurs where the gradient of this function is 0.

Even if you're not familiar with multivariable calculus, much of the following should still look vaguely familiar to the chain and power rules of single-variable calculus.
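Here's a sketch of those steps, treating $x$ and $b$ as column vectors. I'm assuming the standard matrix-calculus rules $\nabla_x (x^TMx) = 2Mx$ for a symmetric matrix $M$ and $\nabla_x (c^Tx) = c$, which play the roles of the power rule and the constant-multiple rule:

$$
\begin{aligned}
f(x) &= ||Ax - b||_2^2 = (Ax - b)^T(Ax - b) \\
     &= x^TA^TAx - 2b^TAx + b^Tb \\
\nabla f(x) &= 2A^TAx - 2A^Tb = 2A^T(Ax - b) \\
\nabla f(x) = 0 \quad &\Longrightarrow \quad A^TAx = A^Tb
\end{aligned}
$$

Notice how $2A^T(Ax - b)$ mirrors the single-variable derivative of $(ax - b)^2$, which is $2a(ax - b)$. As long as $A^TA$ is invertible, we can multiply both sides of the last line by its inverse.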

All finally simplifying to the very nice formula of $x = (A^TA)^{-1}A^Tb$.

I like this approach for its intuitive roots in the geometry of single-variable calculus, but if you want a more strictly linear algebra approach, here's this excerpt from Georgia Tech that explains another proof for the same formula:

**Theorem.** Let $A$ be a $m \times n$ matrix and let $b$ be a vector in $\mathbb{R}^m$. The following are equivalent:

- $Ax=b$ has a unique least-squares solution.
- The columns of $A$ are linearly independent.
- $A^TA$ is invertible.

In this case, the least-squares solution is $\hat{x} = (A^TA)^{-1}A^Tb$.

**Proof.** The set of least-squares solutions of $Ax = b$ is the solution set of the consistent equation $A^TAx = A^Tb$, which is a translate of the solution set of the homogeneous equation $A^TAx = 0$. Since $A^TA$ is a square matrix, the equivalence of [facts] 1 and 3 follows from the invertible matrix theorem. The set of least-squares solutions is also the solution set of the consistent equation $Ax=b_{\textrm{Col}(A)}$, which has a unique solution if and only if the columns of $A$ are linearly independent.

Basically, it says if the columns of $A$ are linearly independent (i.e. no column can be built out of the others; for our line of best fit, that just means the points don't all share the same $p$), we can turn our non-square matrix $A$ into a square one by multiplying by its transpose $A^T$, and solve our least squares the way we'd solve it before with inverses. In other words, if our matrix follows the criteria listed above, our minimizing solution comes from creating an equivalent equation with an invertible matrix: $A^TAx = A^Tb$, so $x = (A^TA)^{-1}A^Tb$.

Netting precisely the same formula as before.
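To see the formula in action, here's a small NumPy sketch that fits a line to four made-up points (the numbers are invented purely for illustration) by solving $A^TAx = A^Tb$, then compares against NumPy's built-in least-squares solver. I use `np.linalg.solve` rather than explicitly forming the inverse, which computes the same answer but is the more usual way to do it numerically.

```python
import numpy as np

# Four made-up points (p, p') that do not all sit on one line.
p = np.array([0.0, 1.0, 2.0, 3.0])
p_prime = np.array([1.0, 3.0, 4.0, 8.0])

# Each row of A is [p_i, 1], so A @ [m, b] gives m * p_i + b.
A = np.column_stack([p, np.ones_like(p)])

# Normal equations: (A^T A) x = A^T b, with x = [m, b].
m, b = np.linalg.solve(A.T @ A, A.T @ p_prime)

# NumPy's built-in least-squares solver should agree.
m_check, b_check = np.linalg.lstsq(A, p_prime, rcond=None)[0]

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

Both routes should land on the same slope and intercept, since they minimize the same squared error.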

Now, let's recall our matrix equation from before of the homography we wanted to solve.

Then, we expanded this into 3 linear equations, and further simplified them to the following two:

This can be rewritten as another, secret matrix equation:

Wait, we turned our original matrix equation into another one? As awful as that may look, this is much more useful than our original equation since now, all of our unknowns are in a vector instead of a matrix; it really is no different than our previous least-squares examples, and we're still solving for the vector $x$ in $Ax = b$.

So, we can still solve it like before, finding $x = (A^TA)^{-1}A^Tb$.
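Putting it together, here's a sketch of how a program might build that big matrix equation from labelled points and solve it. A few assumptions to flag loudly: I fix the bottom-right entry of the homography to 1 (leaving 8 unknowns), each pair of points contributes two rows, the names `estimate_homography`, `h`, `u`, and `v` are mine, and I lean on `np.linalg.lstsq`, which minimizes the same squared error as the formula above but is numerically gentler than forming $(A^TA)^{-1}$ by hand.

```python
import numpy as np

def estimate_homography(p, p_prime):
    """Least-squares 3x3 homography mapping points p to p' (needs >= 4 pairs)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(p, p_prime):
        # u = (h1*x + h2*y + h3) / (h7*x + h8*y + 1), rearranged to be linear in h.
        rows.append([x, y, 1, 0, 0, 0, -x * u, -y * u])
        # v = (h4*x + h5*y + h6) / (h7*x + h8*y + 1), rearranged the same way.
        rows.append([0, 0, 0, x, y, 1, -x * v, -y * v])
        rhs.extend([u, v])
    A = np.array(rows, dtype=float)
    b = np.array(rhs, dtype=float)
    # Solve the overdetermined system A h = b in the least-squares sense.
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Re-attach the fixed bottom-right entry and reshape into a matrix.
    return np.append(h, 1.0).reshape(3, 3)
```

With exactly 4 well-placed pairs the fit is exact; with 10 or more, the extra rows are what let the regression average out small labelling mistakes.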

And with that, we have now seen what our program is doing under the hood, and have gone through some of the tedium of justifying what a least-squares regression is from a linear algebra perspective.

]]>