**Hello, World**! I’m Adi Mittal, a student at Terman Middle School. I enjoy food, math, running, martial arts, and music of memes.

My main intent in starting this blog is to share my thoughts, ideas, and outlooks on cool math. My primary goal is to create content that's interesting and to share my thoughts on the world around me. I hope my ideas and thoughts can appeal to you, just as they appealed to me and showed me how interesting math can be!

**Coins!** As some of you may relate to, I love to take a coin and just spin it on a flat surface or table. It's just satisfying, even if it is being put to shame by the so-called "Fidget Spinner". I was spinning a coin a few days ago, and was bored at the time, so I decided to ask myself a simple question: **What information can be taken away from a spinning coin?** This then turned into: is there a **correlation**, or ratio, between the **rate at which the coin rotates** and the **rate at which it "wobbles"**? With a goal in mind, I picked up my pencil and started to work things out.

First things first, defining and finding all of our givens:

This diagram represents what the coin might look like at a given instance.

What we know about the coin:

The **radius** of our coin $= R$

The **circumference** of our coin $C = 2 \pi R$

Now for the circle our coin rotates and wobbles upon:

The **radius** of this circle is entirely dependent on the angle of the coin to the horizontal (the table or flat surface), which we will define as $\theta$. Using the diagram, we can find that $r = R \cos (\theta)$

The **circumference** is $c = 2 \pi R \cos (\theta)$

With all the givens that we need out of the way, on to the application.

A thing to note is that as the coin completes a **full rotation** around the smaller circle, the **original placement of the coin moves by a certain amount**. You can easily demonstrate this by drawing an arrow on a quarter, and guiding it through a rotation on a circle smaller than the quarter.

This extra distance it covers works out to $C - c$.

$2 \pi R (1 - \cos(\theta)) =$ the distance our coin rotates per revolution around the circle.

The distance per revolution our coin covers while "wobbling" is the same as the circumference of the circle our coin moves around, which we know is $2 \pi R \cos(\theta)$

Putting these two together as a ratio of the rate of rotation to the rate of "wobbling", we get:

$\large \frac{2 \pi R (1 - \cos(\theta))}{2 \pi R \cos(\theta)} = \frac{1}{\cos (\theta)} - 1$

This expression means that at any given moment, the ratio between how fast the coin is spinning and how fast the coin is "wobbling" (which can be seen as the frequency, in hertz, produced by the coin) will be $\frac{1}{\cos (\theta)} - 1$. This also means that if you multiply the frequency of "wobbling" by this expression, it will output how fast the coin should be spinning at the given value of theta. For example, let's say the coin is wobbling at a frequency of $5\,hertz$ at an angle of $\frac{\pi}{4}\,radians$ (because radians are cool). Then the coin would have to be rotating at about $5 \cdot (\sqrt{2} - 1) \approx 2.07\,revolutions\,a\,second$ to maintain that angle to the horizontal at that "wobbling" frequency (because hertz measure $cycles\,per\,second$, the cycles translate into revolutions for the output).
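If you would rather let a computer do the multiplying, here is a minimal Python sketch of the ratio above. The function names `spin_to_wobble_ratio` and `spin_rate` are made up for this illustration, and the angle is assumed to be in radians.

```python
import math

def spin_to_wobble_ratio(theta):
    """Ratio (C - c) / c = 1 / cos(theta) - 1, with theta in radians."""
    return 1.0 / math.cos(theta) - 1.0

def spin_rate(wobble_hz, theta):
    """Spin rate (revolutions per second) for a given wobble frequency in hertz."""
    return wobble_hz * spin_to_wobble_ratio(theta)

if __name__ == "__main__":
    # The example from the text: 5 hertz of wobble at an angle of pi/4 radians.
    print(spin_rate(5.0, math.pi / 4))
```

Note that the ratio goes to zero as the coin lies flat ($\theta = 0$) and blows up as the coin stands upright ($\theta$ near $\frac{\pi}{2}$), which matches what the formula says.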

Of course, this is all theoretical. In practice, the coin may slip. Wind may change the local air pressure, thus changing the air resistance. Everything needs to stay **constant**, with no disturbances or changes occurring during the coin's movement. But this is still a neat thing if you were to ask me!

Just to recap, we took the basics and givens of our coin and its environment, and used those to get the generalized rates of the coin spinning and wobbling. We then used those to calculate the ratio between the two at a given instant. Not bad!

If you have any questions or comments, send me an email or leave a comment!

**How do you... How do you even...** This is when you know the problem you are about to be shown will be annoying. When a friend of mine first introduced this problem, I thought this would be very, VERY simple to solve. Use some angle properties, use the given similar triangles, and soon enough, a solution will be found. Of course this didn't work. I tried a few other things, same result. Showed it to my family, not much help was gained. They tried what I did. Then it hit me. The goal is to find a specific **measure** of the **triangles**. **Trigonometry** $=$ The **Measurement of Triangles**. The solution quickly followed using a specific property, and I was like, "Meh. That was quite obvious." Enough ranting, it's time you got a look at the problem itself.

*In the diagram below,* $\angle ABC = \angle ACB = \angle DEC = \angle CDE$, $\,\overline {BC} = 8$, *and* $\,\overline {DB} = 2$. *Find* $\,\overline {AB}$

When drawing everything out...

Before you continue reading, I highly encourage you to attempt this geometry problem. It's an interesting problem, and once you find the concept you need to solve it, it's an easy ride from there. Following this warning will be the full solution and my thoughts on how I solved this myself. I know I already talked about my thoughts and how I solved this a little in the beginning of this post, but from here on will be

The property I thought of (after my 45 minutes of trial-and-error) that we can use to solve for $\,\overline {AB}$ is the **Law of Sines**:

$\large \frac{a}{\sin A} = \frac{b}{\sin B} = \frac{c}{\sin C}$

Where $A$ is the angle opposite of $a$, $\,B$ is the angle opposite of $b$, and $C$ is the angle opposite of $c$.

We can rewrite this using the diagram:

So now that we have this written out, we can start solving for $\,\overline {AB}$. For convenience, I'm going to refer to the angles equivalent to $\,\angle ABC$ as $\,\theta$.

Using the **Angle Sum Theorem**, $\,\angle BAC = 180 - 2 \theta$. Using this, we can find an expression equal to $\,\overline {AB}$.

Doing some substitution...

$\large \frac{8}{\sin 2 \theta} = \frac{\overline {AB}}{\sin \theta}$

As for where the $\sin 2 \theta$ came from: $\sin (180 - 2 \theta)$, when evaluated, is the same as $\sin 2 \theta$. Now for some expansion and evaluation...

$\overline {AB} \sin \theta \cos \theta = 4 \sin \theta$

$\overline {AB} = \large \frac{4}{\cos \theta}$

Now that we have an **expression** for $\overline {AB}$, we just need to find a value of $\cos \theta$, and that will give us the length of $\overline {AB}$! So now, what can we do? What I first thought (based on the information we were given) was that if we find two expressions representing the value of the **same side length**, we can set those two expressions equal to one another to find a value that makes that equation true. That equation will most likely output a value of a function of an angle, as we know very few side lengths and know no angles (we're hoping it will output a value of $\cos \theta$). Again, this is only what I was thinking when solving this problem at the time. The only reason I thought this is that I noticed two triangles, similar to $\triangle ABC$, contained within $\triangle ABC$.

We have similar triangles $\triangle DEC$ and $\triangle BEC$. We know that they are similar as both triangles share the same angles ($\theta$, $\theta$, and $180 - 2 \theta$) as the original triangle $\triangle ABC$. And remember that side length I mentioned earlier that we could find two expressions for, and use those to solve for its length (that's a mouthful)? That length is $\overline {CE}$! It is a side of both $\triangle DEC$ and $\triangle BEC$, and we can find our two expressions by solving for the length of $\overline {CE}$ once in $\triangle DEC$, and again in $\triangle BEC$. Again, we're hoping for a value of $\cos \theta$. Starting by solving for $\triangle BEC$...

We are given the length of $\overline {BC} = 8$, which simplifies our job quite a bit. We can do the same thing we did to find an expression for $\overline {AB}$: Use the **Law of Sines**!

$\large \frac{8}{\sin \theta} = \frac{\overline {CE}}{2 \sin \theta \cos \theta}$

$\overline {CE} \sin \theta = 16 \sin \theta \cos \theta$

$\overline {CE} = 16 \cos \theta$

We now have a value of $\overline {CE}$ from $\triangle BEC$, time to solve $\overline {CE}$ for $\triangle DEC$...

First off, although it's not stated, we know the length of $\overline {DE}$. $\triangle BEC$ is isosceles, where $\angle BEC = \angle ECB$, which also means $\overline {BC} = \overline {BE}$. As $\overline {BC} = 8$, therefore $\overline {BE} = 8$. Since we were told $\overline {DB} = 2$, we can solve $\overline {BE} - \overline {BD} = \overline {DE} = 6$. Now back to the almighty **Law of Sines**...

Substitution and expansion...

$\large \frac{3}{\sin \theta \cos \theta} = \frac{\overline {CE}}{\sin \theta}$

$\overline {CE} \sin \theta \cos \theta = 3 \sin \theta$

$\overline {CE} = \large \frac{3}{\cos \theta}$

Great! We're lucky that it came out as a value of $\cos \theta$, but anyways, we have our two expressions, now just to set them equal to one another...

$16 \cos^2 \theta = 3$

$\cos^2 \theta = \large \frac{3}{16}$

$\cos \theta = \large \frac{\sqrt{3}}{4}$

Now that we have our value of $\cos \theta$, we can just substitute this into our original expression for $\overline {AB}$...

$= \large \frac{4}{(\frac{\sqrt{3}}{4})}$

$ = \large \frac{16}{\sqrt{3}}$

And there it would be, our solution! Although it might have seemed quite lengthy to get to $\frac{16}{\sqrt{3}}$, it all just revolved around the one concept of the **Law of Sines**, so not too bad.
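As a sanity check on the algebra, here is a small Python sketch that plugs the numbers back into the Law of Sines. The variable names are my own for this illustration. It takes $\cos^2 \theta = \frac{\overline{DE}}{4 \cdot \overline{BC}}$ (the general form of $16 \cos^2 \theta = 3$) and confirms that both expressions for $\overline{CE}$ agree.

```python
import math

BC = 8.0          # given base of triangle ABC
DB = 2.0          # given
DE = BC - DB      # BE = BC because triangle BEC is isosceles, so DE = 8 - 2

# Setting 2 * BC * cos(theta) equal to DE / (2 * cos(theta)) gives cos^2(theta).
cos_theta = math.sqrt(DE / (4.0 * BC))
theta = math.acos(cos_theta)

# CE from triangle BEC: BC / sin(theta) = CE / sin(2 * theta)
ce_from_bec = BC * math.sin(2 * theta) / math.sin(theta)
# CE from triangle DEC: DE / sin(2 * theta) = CE / sin(theta)
ce_from_dec = DE * math.sin(theta) / math.sin(2 * theta)
# AB from triangle ABC: BC / sin(2 * theta) = AB / sin(theta)
ab = BC * math.sin(theta) / math.sin(2 * theta)

if __name__ == "__main__":
    print("cos(theta):", cos_theta)
    print("CE (from BEC):", ce_from_bec)
    print("CE (from DEC):", ce_from_dec)
    print("AB:", ab, "vs 16 / sqrt(3):", 16 / math.sqrt(3))
```

Both values of $\overline{CE}$ should print the same number, and $\overline{AB}$ should match $\frac{16}{\sqrt{3}}$ up to floating-point rounding.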

Although this is one way to obtain the solution, I'm sure there are other ways to tackle this problem, and I found another way which completely negates our first step, to find an expression for $\overline {AB}$, but adds an extra step to the end.

With our value of $\cos \theta = \frac{\sqrt{3}}{4}$, we can draw a right triangle with $\theta$ as one of its angles, with a bit of moving around.

We can do this because, as we stated earlier, $\theta$ is any angle equivalent to $\angle ABC$ (and that's the exact angle we're working with). We also **bisected** $\overline {BC}$ at $F$ to form the 2 right triangles within our isosceles triangle, so the length of $\overline {BF} = 4$. We can then use some basic trigonometry and evaluation to solve for $\overline {AB}$.

$\cos \theta = \large \frac {4}{\overline {AB}}$

$\cos (\arccos \large \frac {\sqrt{3}}{4}) = \large \frac {4}{\overline {AB}}$

$\large \frac {\sqrt{3}}{4} = \large \frac {4}{\overline {AB}}$

${\large \frac {\sqrt{3}}{4}} \overline {AB} = 4$

$\overline {AB} = \large \frac{16}{\sqrt{3}}$

Just another simple way of getting to the exact same answer.

If you have any questions or comments, send me an email or leave a comment!

This specific solution is one of my favorites that I have seen. One of my initial attempts was to use the dimensions of the similar triangles and find the common ratio between the side length and the base of the triangle. I knew it could be done, but never put my finger on it. However, when a friend of mine took a look at this problem, after a bit of thought, he managed to come up with this. It's really quite a spectacular solution, and it is credited entirely to him (no use of name for privacy reasons). Oh, and I'll be speaking in first person just so I don't cause any confusion, not to make it seem like I'm taking it as mine. Just to be clear.

So the first step is to take the three triangles we know to be similar to one another ($\triangle ABC$, $\triangle BEC$, and $\triangle CED$), and $0$-index them from the original triangle through the following divisions within one another. We know that they are similar because they all share two common angles, which forces them to have a common ratio between the base and a leg of the triangle. This will be important to remember later. I will also now be referring to the triangles by their respective index numbers.

Now using the fact that every triangle is similar, and that each progressive triangle was formed by using the base length of the previous triangle to form the leg of the next triangle, we can find a ratio between a dimension (say, the base) of a triangle and its previous/next triangle, and use that to find the length of $\overline{AB}$. I know that is kind of confusing right now, but trust me, it will make more sense the more I go on.

So we know the base lengths of two triangles ($\triangle 0$ and $\triangle 2$). Since we know that they should share a common ratio, we can write them as a ratio between one another, and hence find said ratio.

$\large \frac{\overline{BC}}{\overline{DE}} = \frac{8}{6} = \frac{4}{3}$

So we have a ratio, but the problem with this ratio is that it's for two divisions. It's for going between $\triangle 0$ and $\triangle 2$. We want one between $\triangle 0$ and $\triangle 1$, or $\triangle 1$ and $\triangle 2$. But this is easy, since each division scales the previous triangle by the same factor. This means if we take some dimension _a_ of a triangle and divide it by our one-division ratio **once**, we will obtain the dimension _a_ of the next division's triangle. For example, if we take triangle-base $\overline{BC}$ and divide it by our ratio, we should get the length of triangle-base $\overline{EC}$. Take a look at the diagram if that helps. Essentially, the base length of $\triangle 0$, divided by some ratio, gives the base length of $\triangle 1$, and doing that again gives the base length of $\triangle 2$. Now as you can see, we had to apply the ratio **twice** to get from $\triangle 0$ to $\triangle 2$. A.K.A., take the square of the ratio. To undo a square, you take the **square root**. So we can undo our two-division ratio by taking its square root, to get our one-division ratio.

$\large \sqrt{\frac{4}{3}} = \frac{2}{\sqrt{3}}$

So that's our ratio for a one-triangle division. Now we need to find the length of $\overline{AB}$. We can do what we did originally with the base lengths, only with the legs of the triangles. Larger triangle over the divided triangle. In this case, $\triangle 0$ over $\triangle 1$.

$\large \frac{\overline{AB}}{\overline{BC}} = \frac{2}{\sqrt{3}}$

$\large \frac{\overline{AB}}{8} = \frac{2}{\sqrt{3}}$

$\overline{AB} = \large \frac{2 \cdot 8}{\sqrt{3}}$

$\overline{AB} = \large \frac{16}{\sqrt{3}}$
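The ratio argument is short enough to check in a few lines of Python. This is only a sketch with variable names I picked for the illustration. It assumes the two known bases are $\overline{BC} = 8$ (for $\triangle 0$) and $\overline{DE} = 6$ (for $\triangle 2$), and that the leg of $\triangle 1$ is $\overline{BE} = \overline{BC} = 8$.

```python
import math

base_0 = 8.0   # base of triangle 0 (BC)
base_2 = 6.0   # base of triangle 2 (DE)

# Going from triangle 0 to triangle 2 applies the one-division ratio twice.
two_division_ratio = base_0 / base_2
one_division_ratio = math.sqrt(two_division_ratio)

# The base of triangle 1 (EC) sits one division below triangle 0.
base_1 = base_0 / one_division_ratio

# The leg of triangle 1 is BE = BC = 8, so the leg of triangle 0 (AB)
# is one division above it.
leg_1 = base_0
ab = leg_1 * one_division_ratio

if __name__ == "__main__":
    print("one-division ratio:", one_division_ratio)
    print("EC:", base_1)
    print("AB:", ab)
```

As a bonus, the value it gives for $\overline{EC}$ can be compared against the trigonometric solution, where $\overline{CE} = 16 \cos \theta$.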

The Earth has a diameter of approximately 12742000 meters. Most people of course wouldn't travel that far, but what if you did? How fast can you get across with nothing but yourself? That's essentially what people have asked in the form of the question: How long will it take to fall through the center of the Earth?

Following our standard procedure, let's list all the givens:

The force of gravity is $F = \large \frac{G m M}{r^2}$

Where...

$G$ = the gravitational constant

$m$ = the mass of object 1 (in this case, us)

$M$ = the mass of object 2 (in this case, Earth)

$r$ = the distance between $m$ and $M$.

So now we are just trying to find values or expressions for as many variables within that equation of force as we can. We can leave $m$ as is, because that's the mass of our human/us. So what we really need is $r$ and $M$.

One thing we have to worry about, though, is that as we fall, $r$ will change. As we fall we will get closer to Earth's center of mass, eventually pass it, and then get farther from it. So we will call our current distance relative to Earth's center of mass $x$. And what's great about this is that if we are any distance into our fall, we can just ignore any mass above us. Using the diagram as an example, if we are $R-x$ deep into our fall, we can ignore any mass of Earth contained between $R$ and $x$. Some of you may think, "But wait! Wouldn't the mass above us have its own force of gravity acting upon us, and therefore slow us down as we fall?" The answer is technically yes, but the pull from every part of that outer shell of mass (above us, *below* us, and to the *side of* us) balances out. All these forces cancel out, so the shell does not affect us at all. So, all we really care about is the amount of mass closer to the center than we are, and the distance between us and the Earth's center of mass (which would be the radius $x$ as we have been discussing). So we have one variable filled.

Now we need $M$. The formula for mass is $M = volume \times density$. The volume of the part of the Earth below us $= \frac{4 \pi x^3}{3}$ (we are using $x$ again as the mass affecting us changes over our fall). And we can represent density with $\rho$. So $M$ equals:

$M = \large \frac{4 \pi x^3 \rho}{3}$

Putting this all together, the force of gravity acting upon us during this fall equals:

$F = \large \frac{G m}{x^2} \cdot \frac{4 \pi x^3 \rho}{3} = \large \frac{4 \pi G m \rho}{3} x$

If we let $\frac{4 \pi G m \rho}{3} =$ say, $v$, we get $F = -v x$. It's negative because the force always pulls us back toward the center, opposite to our displacement $x$. This is actually an **oscillating system**. To represent this, I've made a mock graph to show how gravity affects us over time starting from the top of "Earth". The graph is just a representation.

If the x-axis is time, and we fell from the top of Earth (and there is NO air resistance), as you can see, we would just continuously bounce back and forth between the top and bottom of the Earth. Now we need to find the **period** of our oscillating system. The period is the time it takes for one cycle to be completed. To be more precise, we need *half* of the period. That is because one cycle (in this case) is falling all the way down, and coming all the way back. We only want the time it takes to fall down, so that's why the half.

The equation for the period of a simple oscillating system (also called simple harmonic motion) is:

$T = 2 \pi \sqrt{\large \frac{m}{k}}$

Here $k$ is the constant of our oscillating system (the factor multiplying $x$ in our force equation, $\frac{4 \pi G m \rho}{3}$), and $m$ is our mass. But since we want half of that, the time to fall through the Earth is...

$Time = \pi \sqrt{\large \frac{m}{k}}$

Doing some substitution...

$Time = \pi \sqrt {\large \frac {m}{\large \frac{4 \pi G m \rho}{3}}}$

$ = \pi \sqrt {\large \frac {3 m }{4 \pi G m \rho}}$

$ = \sqrt {\large \frac {3 m \pi^2}{4 \pi G m \rho}}$

$ = \sqrt {\large \frac {3 \pi}{4 G \rho}}$

Now all we need to do is put in $G$ as the Gravitational Constant, and $\rho$ as the density ($\rho = \frac{mass}{volume}$) of Earth (I did some Googling...)!

$= \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot s^{-2} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ = \sqrt {\large \frac {3 \pi \cdot s^{2}}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

$ = s \sqrt {\large \frac {3 \pi}{4 \cdot 6.67408 \cdot 10^{-11} \cdot \frac{5.972 \cdot 10^{24} \cdot 3}{4 \pi \cdot 6371000^3} }}$

So, I don't know about you, but when I have something like this, I just straight up put it into *Wolfram Alpha* , or a similar calculator, as I am just lazy and it's a pain to evaluate. So, letting it be computed by the calculator...

$ \large = 2530.5\,seconds$
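If you don't have a fancy calculator handy, the same number can be reproduced with a few lines of Python. This is just a sketch: the function names are my own, and the constants are the ones plugged into the formula above.

```python
import math

G = 6.67408e-11          # gravitational constant, m^3 kg^-1 s^-2
EARTH_MASS = 5.972e24    # kg
EARTH_RADIUS = 6371000   # m

def mean_density(mass, radius):
    """Density = mass / volume of a sphere."""
    return mass / (4.0 * math.pi * radius ** 3 / 3.0)

def fall_time(density):
    """Half the period of the oscillation: sqrt(3 * pi / (4 * G * rho))."""
    return math.sqrt(3.0 * math.pi / (4.0 * G * density))

if __name__ == "__main__":
    rho = mean_density(EARTH_MASS, EARTH_RADIUS)
    seconds = fall_time(rho)
    print("density (kg/m^3):", rho)
    print("fall time (s):", seconds)
    print("fall time (min):", seconds / 60.0)
```

Keep in mind this treats the Earth as a ball of uniform density, which is the same simplification the derivation makes.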

This, funnily enough is also the answer to the universe and all of its questions. $2530.5\,seconds = 42\,minutes\,(+10.5\,seconds)$. Quite a coincidence if I say so!

Now what's great about the equation we used ($\sqrt {\frac {3 \pi}{4 G \rho}}$) is that it's quite easy to apply to other objects, as most of it is constant! 3 is, well, a constant. So is 4. $\pi$ has been universally agreed upon for its value. And as far as we can tell, the Gravitational Constant holds throughout the universe. The only thing that determines the fall time is the density. So you could have two planets, one with $x$ as its radius, and the other with $100 x$. If they are just as dense as one another, you will fall through them (across the diameter) in the same time.

Now just as a random fact that I thought was amusing: the top speed you would attain. We know that acceleration due to gravity on Earth is $\frac{9.807\,m}{s^2}$. The top speed would be when you reach the center of the Earth, which is 6,371,000 meters from the surface (aka, the radius of Earth). Using this, we can calculate the speed we would be at, in meters per second, at the center. Just to be sure, we can calculate acceleration due to gravity, using our original formula, where we ignore our mass: $g = \frac{G \cdot M_{Earth}}{R^2}$
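The "radius doesn't matter" claim is easy to doubt, so here is a rough Python sketch that skips the period formula entirely and just simulates the fall step by step. Everything here is an assumption made for the illustration: the function name, the time step, the velocity Verlet integration scheme, and the uniform density value of about $5513\,kg/m^3$ taken from the Earth numbers used earlier.

```python
import math

G = 6.67408e-11  # gravitational constant, m^3 kg^-1 s^-2

def simulated_fall_time(radius_m, density, dt=0.1):
    """Step through a = -(4/3) * pi * G * rho * x from the surface to the
    center, then double the time (the second half mirrors the first)."""
    omega_sq = 4.0 * math.pi * G * density / 3.0
    x, v, t = float(radius_m), 0.0, 0.0
    while True:
        # One velocity Verlet step.
        a = -omega_sq * x
        x_new = x + v * dt + 0.5 * a * dt * dt
        a_new = -omega_sq * x_new
        v_new = v + 0.5 * (a + a_new) * dt
        if x_new <= 0.0:
            # Interpolate to find the moment we cross the center.
            fraction = x / (x - x_new)
            return 2.0 * (t + fraction * dt)
        x, v, t = x_new, v_new, t + dt

if __name__ == "__main__":
    rho = 5513.26  # assumed uniform density, kg/m^3
    print("Earth-sized planet:", simulated_fall_time(6371000, rho))
    print("100x the radius:   ", simulated_fall_time(637100000, rho))
```

Both runs should land on essentially the same time, which also agrees with the formula, even though the second planet is 100 times wider.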

$g = \frac{6.67408 \cdot 10^{-11} \cdot m^3 \cdot kg^{-1} \cdot s^{-2} \cdot 5.972 \cdot 10^{24}\, kg }{6371000^2\,m^2}$

$g = \frac{6.67408 \cdot 5.972 \cdot 10^{13} \cdot m}{6371000^2 \cdot s^2}$

Thanks to a calculator...

$\approx \large\frac{9.82m}{s^2}$
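Here is that calculator step written out as a tiny Python sketch, using the same constants as above (the function name `surface_gravity` is my own):

```python
G = 6.67408e-11          # gravitational constant, m^3 kg^-1 s^-2
EARTH_MASS = 5.972e24    # kg
EARTH_RADIUS = 6371000   # m

def surface_gravity(mass, radius):
    """g = G * M / R^2, in meters per second squared."""
    return G * mass / radius ** 2

if __name__ == "__main__":
    print(surface_gravity(EARTH_MASS, EARTH_RADIUS))
```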

Of course this is not the same as what others have put on the internet; values will differ from here to there. I trust the value of $\frac{9.807m}{s^2}$, as I think the values they used to calculate it would be more accurate. Back to the top speed now.

$v^2 = g R = \large \frac{9.807 \cdot 6371000\,m^2}{s^2}$

$v = \large \frac{\sqrt{9.807 \cdot 6371000}\,m}{s}$

$\approx \large \frac{7904.454251 m}{s}$

$\approx \large \frac{17681.760583\,miles}{hour}$

That's about 23.23 times the speed of sound! This means you couldn't yell during this fall, as you would be going faster than the sound you make can travel through the air around you. It would be a silent fall. That is, if there was air, and the terminal velocity of a human wasn't $\frac{53m}{s}$.
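The unit conversions are the easiest place to slip up, so here is a short Python sketch of the top-speed numbers. Two things in it are my assumptions rather than values from the post: the speed of sound is taken as $340.29\,m/s$ (a common sea-level figure), and a mile is taken as $1609.344$ meters.

```python
import math

g = 9.807                    # m/s^2, the value trusted in the text
EARTH_RADIUS = 6371000.0     # m
SPEED_OF_SOUND = 340.29      # m/s, assumed sea-level value
METERS_PER_MILE = 1609.344   # assumed conversion factor

# Top speed at the center of the Earth: v = sqrt(g * R)
top_speed = math.sqrt(g * EARTH_RADIUS)
top_speed_mph = top_speed * 3600.0 / METERS_PER_MILE
mach = top_speed / SPEED_OF_SOUND

if __name__ == "__main__":
    print("top speed (m/s):", top_speed)
    print("top speed (mph):", top_speed_mph)
    print("times the speed of sound:", mach)
```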

So that would be it for this post! We found out we can cross Earth in under 45 minutes, and break the sound barrier 23 times over!

I plan on following it up with another post showing how you can use integration to find the time to fall through Earth (and that equation to find the period of an oscillating system/simple harmonic motion that kind of came out of nowhere. The $2 \pi \sqrt{\frac{m}{k}}$), and to show some other cool properties and interesting things about falling, pendulums, and oscillating systems in general.

Now here's an extra challenge for you: How long will it take to fall through Earth, 500 kilometers above the surface?

If you have any questions or comments, send me an email or leave a comment!

There is just no introduction needed here. The problem at hand is probably one of the hardest, most controversial topics in computer science: does $P = NP$?

In case this is not clear (or you have never heard of this problem before), the goal is to show either that all NP problems are also P problems, or that the two classes are not equal. $P$ stands for polynomial time: problems that can be solved in polynomial time. $NP$ stands for non-deterministic polynomial time: problems whose solutions can be checked in polynomial time. An NP-hard problem is one that is at least as hard as every problem in NP, and no polynomial-time algorithm is known for any of them. Polynomial time is time that can be represented as a function of the input (input being whatever you need to achieve/solve for in the problem), where the function is a simple polynomial function. For example, the following function representing the time it takes to solve some problem,

..., this would classify the problem as a $P$ problem, as we can represent the time it takes to solve the problem as a simple polynomial function. An example of how long some $NP-hard$ problem might take to solve would be...

This is bad for computation time: since $x$ is our input, our values would explode the greater the amount of input we have. That's why this is an $NP-hard$ problem. We would essentially have to brute force our way through, and check every possible scenario (within the allotted values for our problem) to solve it.

So now, the reason why this problem is so controversial is that if we can show that $P = NP$ is true, we can then theoretically solve any NP problem with an algorithm that runs in polynomial time. It would cut so much time off of what it takes to solve all the crazy hard, unsolved problems.
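To make "check every possible scenario" concrete, here is a small Python sketch using the subset sum problem (given a list of numbers, is there a subset that adds up to a target?). Subset sum is my own choice of example and is not from the discussion above; it is a well-known NP-complete problem. Checking one proposed answer is quick, but the brute-force search may have to try up to $2^n - 1$ subsets for $n$ numbers.

```python
from itertools import combinations

def verify(subset, target):
    """Checking a proposed answer is fast: just add it up."""
    return sum(subset) == target

def brute_force_subset_sum(numbers, target):
    """Try every non-empty subset until one works.
    Returns (subset or None, number of subsets checked)."""
    checked = 0
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            checked += 1
            if verify(subset, target):
                return subset, checked
    return None, checked

if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    print(brute_force_subset_sum(numbers, 9))
    # With no valid subset, every one of the 2**6 - 1 subsets gets checked.
    print(brute_force_subset_sum(numbers, 1000))
```

Adding just one more number to the list doubles the worst-case number of subsets, which is the "explosion" described above.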

And I know, some of you may be thinking, "But, hey! Wouldn't most problems need a completely different approach to solve, than another problem?" Well, my response to this, would be yes, but, there are some $NP-hard$ problems that

Okay, now with that all out of the way, here is the reason why I started discussing $NP-problems$ and the $P$ versus $NP$ problem. This problem bothers me so much, for a few reasons. **ONE**: This seems a lot easier to solve than it actually is, and this just intuitively bothers me more than other problems do. It seems like such a simple statement to show, but it's just not. **TWO**: The way people are approaching this problem seems all too *awkward* and incorrect to me. It seems that they are overcomplicating this quite a bit. But this is computer science, so I don't have much say. And the person who proved *Fermat's Last Theorem* did so in more or less 150 pages (I think), so this could very well be so as well.

My attempts haven't been successful (well, if one had been, I would be too excited to write this up), but I do have a few thoughts on the matter.

My first attempt was rather bleak. Take a generalized form of the time it takes to solve an $NP-hard$ problem, and just try to work it down to some representation of polynomial time. This, obviously, did not work. What ended up happening was that I was trying to put the wrong variable into a polynomial-time representation, and I couldn't find a way to expand it onto the variable that I needed to express. So, that idea was gone.

The second idea would be a bit more practical. Take some $NP-complete$ problem, look at how its time behaves in its NP form, then try to find some algorithm that results in the same solution, but runs in polynomial time. The reason why I would do this is that you can link any $NP-problem$ to one of the $NP-complete$ problems. Using this, we can create a map linking every $NP-complete$ problem to another. That way, if we can solve one, we have technically shown it for every $NP-problem$. We can do that, or somehow generalize our $NP-complete$ problem, and show it from there.

My last idea on the matter is to think of the consequences of this statement ($P=NP$) being **true** or **false**. If it is **true**, I feel that this would create a paradox, because finding the polynomial-time function of an $NP-problem$ is $NP-hard$ in itself. But that cannot happen, as we said that $P=NP$, so we have a contradiction. So you would then have to show that finding a P function of an NP function is in P. But that is also $NP-hard$. Then you would have to show that is also in P. But that's also $NP-hard$, so we have to show that it's in P, etc., etc. So we end up having to continuously prove that something that is in NP is in P, to show that the smaller $NP-hard$ problem of P versus NP (showing the conversion of NP to P in a given $NP-problem$) is also in P. (That was a bit long and a mouthful. Essentially, you get a recursive $NP-problem$, where each iteration is slightly different from the last, but with the same goal of showing that that iteration takes P time to actually solve.) If it was **false**, we would stay where we are computationally, and nothing would have changed.

Personally, based on what I have done so far, I think $P \neq NP$. But don't think that is my final decision. Problems once believed to be intractable have later been shown to be computable in polynomial time, so based on my second idea of mapping $NP-complete$ problems, there is still some possibility. Expect some updates and future posts, as this is one of several problems (I'll just say them: the Millennium Problems) that have gotten me thinking in ways I almost never have before (that's probably because they are not all math-based, and I'm a math-based guy, so math-based + not-math-based-problem = new type of thinking). Actually, don't expect future updates and posts, just know there will be future updates and posts.

If you have any questions or comments, send me an email or leave a comment!

Not much for an introduction this post. Found this problem when looking for interesting problems for myself. Shoutout to Harvard's Problem of the Week (from 2002 to 2004). The problem at hand is:

(a) What is the expected value of your winnings when playing the game?

(b) Play the same game, except let your earnings be $2^{n-1}$, where $n$ is the amount of flips. What do you expect to win now? Does it make sense?

**(a)**: Expected value is the amount you win, multiplied by the probability of it occurring, added up over all the possible outcomes.

You have a 50% chance to win 1 dollar. 25% chance to win 2 dollars. 12.5% chance to win 3 dollars...

$\large \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + ...$

$= \large \sum _{n=1}^{\infty }\: \frac{n}{2^n}$

$= \large 2$

**OR**

$\large \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{4}{16} + ...$

$\large =(\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...) + (\frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...) + (\frac{1}{8} + \frac{1}{16} + ...) + ...$

$= (1) + \large (\frac{1}{2}) + (\frac{1}{4}) + (\frac{1}{8}) + (\frac{1}{16}) +...$

$\large = 2$

So you can expect to win **2 dollars**.
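If the infinite sum feels a bit hand-wavy, here is a small Python sketch that adds up the first few terms exactly, using fractions instead of decimals. The function name `partial_expected_value` is made up for this illustration.

```python
from fractions import Fraction

def partial_expected_value(max_flips):
    """Exact sum of n / 2**n for n = 1 .. max_flips."""
    return sum(Fraction(n, 2 ** n) for n in range(1, max_flips + 1))

if __name__ == "__main__":
    for flips in (5, 10, 20, 40):
        total = partial_expected_value(flips)
        print(flips, total, float(total))
```

The partial sums creep up toward 2 without ever passing it, and the gap left after $N$ terms is exactly $\frac{N + 2}{2^N}$.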

This is where the fun is at.

**(b)**: We have 50% chance to win 1 dollar. We have a 25% chance to win 2 dollars. We have a 12.5% chance to win 4 dollars...

$\large \frac{1}{2} + \frac{2}{4} + \frac{4}{8} + \frac{8}{16} +...$

If you don't mind, since I like to write things in sigma notation, I would like to write the simplified version of this sum in sigma notation.

$=\large \sum _{n=1}^{\infty}\: \frac{1}{2}$

$\large = \infty$
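To see the difference from part (a), here is the same kind of Python sketch for the doubling payout. Again, the function name is just something I made up for the illustration.

```python
from fractions import Fraction

def partial_expected_value_doubling(max_flips):
    """Exact sum of 2**(n - 1) / 2**n for n = 1 .. max_flips."""
    return sum(Fraction(2 ** (n - 1), 2 ** n) for n in range(1, max_flips + 1))

if __name__ == "__main__":
    for flips in (10, 100, 1000):
        print(flips, partial_expected_value_doubling(flips))
```

Every term is worth exactly half a dollar, so cutting the game off at $N$ flips gives $\frac{N}{2}$, which keeps growing no matter how far you go.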

This is why I picked this problem. The first part is quite simple, but this part creates quite a dilemma. What can we do now? How should we interpret this for the expected value of our game? No one would ever put up a game in which the player is expected to win an infinite amount of money, since no one has an infinite amount of money!

The following explanation is a jumble between what I thought, and Harvard's. I recommend looking at what they said specifically.

The solution is that our game (which would be known as the *experiment* in our scenario) doesn't agree with the exact definition of **expected value**. Expected value is defined as an average over an *infinite* number of attempts/trials (this can be viewed at least as the limit towards an infinite number of attempts/trials). The thing is, you'll never be able to play an infinite number of games. Essentially, our experiment (game) doesn't agree with our calculated expected value, as the experiment has nothing to do whatsoever with the precise definition of expected value. Just as an example, if you were to (somehow) play an infinite number of games, your earnings would indeed average an infinite amount. This whole idea of expecting to win an infinite amount, and it "not working/making sense/not being possible", arises when we try to make expected value something it isn't.

Okay, I like math, but from this point onward I didn't have much. And what I did wasn't cohesive, as 25% was written down, the other 75% was in my head. The problem is, that 75% *was* in my head. I would try to go through and get my complete explanation, but I feel that Harvard's solution is already quite nice. So the rest is all Harvard's explanation. Only credit I get here is for the fact I formatted it for this page. Here you go.

"*This might not be a very satisfying explanation, so let us get a better feeling for the problem by looking at a situation where someone plays $N = 2^n$ games. How much money would a “reasonable” person be willing to put up front for the opportunity to play these N games? Well, in about $2^{n−1}$ games he will win one dollar; in about $2^{n−2}$ he will win two dollars; in about $2^{n−3}$ games he will win four dollars; etc., until in about one game he will win $2^{n−1}$ dollars. In addition, there are the “fractional” numbers of games where he wins much larger quantities of money (for example, in half a game he will win $2^n$ dollars, etc.), and this is indeed where the infinite expectation value comes from, in the calculation above. But let us forget about these for the moment, in order to just get a lower bound on what a reasonable person should put on the table. Adding up the above cases gives the total winnings as: $2^{n−1}(1) + 2^{n−2}(2) + 2^{n−3}(4) + \cdots + 1(2^{n−1}) = 2^{n−1}n$. The average value of these winnings in the $N = 2^n$ games is therefore $\frac{2^{n−1}n}{2^n} = \frac{n}{2} = \frac{(\log_2 N)}{2}$. A reasonable person should therefore expect to win at least $\frac{(\log_2 N)}{2}$ dollars per game. (By “expect”, we mean that if the player plays a very large number of sets of $N$ games, and then takes an average over these sets, he will win at least $2^{n−1}n$ dollars per set.) This clearly increases with $N$, and goes to infinity as $N$ goes to infinity. It is nice to see that we can obtain this infinite limit without having to worry about what happens in the infinite number of “fractional” games. Remember, though, that this quantity, $\frac{(\log_2 N)}{2}$, has nothing to do with a true expectation value, which is only defined for $N \to \infty$. Someone may still not be satisfied and want to ask, “But what if I play only $N$ games? I will never ever play another game. 
How much money do I expect to win?” The proper answer is that the question has no meaning. It is not possible to define how much one expects to win, if one is not willing to take an average over a arbitrarily large number of trials.*"
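The sum in the quoted solution is easy to sanity-check by brute force. Here is a minimal Python sketch, assuming the same setup as the quote (about $2^{n-k}$ of the $2^n$ games pay $2^{k-1}$ dollars); the function names are purely illustrative and not part of the original solution.

```python
from fractions import Fraction

def truncated_winnings(n):
    """Total winnings over N = 2**n games, counting only the whole-game cases."""
    # About 2**(n-k) games pay 2**(k-1) dollars, for k = 1..n.
    return sum(2 ** (n - k) * 2 ** (k - 1) for k in range(1, n + 1))

def average_per_game(n):
    """Average winnings per game, i.e. the total divided by N = 2**n."""
    return Fraction(truncated_winnings(n), 2 ** n)

for n in (1, 5, 10, 20):
    # The total should match n * 2**(n-1), and the average should be n / 2.
    print(n, truncated_winnings(n), n * 2 ** (n - 1), average_per_game(n))
```

Every term of the sum is the same size, $2^{n-1}$, which is why the total is just $n$ copies of it.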

Neat little problem if I do say so myself. Some of my work, some of Harvard's; I hope it was cohesive and clear who was writing what when. I wish I could have gotten my last piece of explanation in, but it would have taken a bit too long for something I need to redo. Moral of the story: Take complete notes.

If you have any questions or comments, send me an email or leave a comment!

**Over 7 million students** across the United States missed 15+ days of school in the 2015-16 school year (US Department of Education, 2012). These *chronically absent* students, who miss about 10% of their academic year, cause over 40 *billion* minutes of instruction to go to waste. Even more jarring, the same report notes that inconsistent attendance is a better indicator than test scores of whether a student will drop out of school. In case it wasn't clear enough, student absenteeism is a major issue haunting the education system, and the need to find a solution to it is greater than ever. Over the past 8 months, I have conducted a small series of experiments to try to alter the state of this issue in one public high school in Palo Alto, California.

**In the case of Palo Alto's Henry M. Gunn High School**, only 5% of the school's 2000 students are in this category of the chronically absent. That means on average, 1 student is missing in every class on campus. What makes this statistic concerning though is when one considers the demographic of Palo Alto. Here are two maps of the United States: one of the median household income (New York Public Radio), and another of chronic absentee rates by district (US Department of Education, 2012).

Although Palo Alto is in one of the most affluent neighborhoods in the nation, its chronic absentee rate is equal to that of some areas with not even half its median household income. This led to the idea that absenteeism may be fueled by different motivators across the country. For example, in some less fortunate neighborhoods, kids may be active contributors to their family income, leading to conflicts with their academic commitments. Similarly, because of Palo Alto's wealth and greater access to resources, academic competitiveness may fuel absenteeism. This may seem counterintuitive at first: if someone wants to do well in class, why would they skip it? These *strategic absentees* skip class not because they want to, but rather as a necessary evil: they feel the need to skip one class as a means to prepare for another. These are the students who can most affect the absentee rate: not only are they motivated to go to class, but their absences are likely more sporadic than those of the standard chronically absent student, meaning that they are more likely to be in class to be influenced by a teacher or administrator, which nicely leads into the next section.

**Students can't be forced to go to class**. Especially as Henry M. Gunn High School sports an open-campus policy, it is impossible to force a substantial number of absentees to go to class beyond the measures already in place. So, instead, we tried to **persuade** the students: if they decide for themselves that attending class is the right thing to do, they are more likely to act on it. To do so, we used specially engineered social measures, known as **nudges**, to convince the students as best as possible.

Nudges, at their core, are suggestions. They don't affect one's ability to choose, but they utilize the person's experience to guide them to pick an option. The most commonly used nudge utilizes comparisons: let's say a restaurant has a dish that doesn't sell super well, and is ultimately costing them money by being listed on the menu. To boost its sales, they can list a slightly cheaper dish that has no intention of actually being sold in large quantities. This "fake" dish provides a reference frame for the buyer, making what originally seemed overpriced suddenly look like a great deal. This type of persuasion was first formalized in a book of the same name: *Nudge: Improving Decisions About Health, Wealth, and Happiness*, written by Cass Sunstein and Richard Thaler (a highly recommended read). It should be reiterated that this doesn't change one's ability to pick what food item they want, and that's what makes a nudge so effective. It allows the person to convince themselves without feeling as if the choice was imposed on them (as it would be if, say, everything were removed from the menu except one dish). This is a form of *indirect persuasion*: we're not explicitly saying what we want the affected person to do, but we are adding information to guide one choice over another. The counterpart to this would be *direct persuasion*: explicitly stating which option we want chosen.

**This project employed 2 types of persuasion tactics that were each disseminated via 2 types of mediums**. The two types of mediums you are already familiar with: direct and indirect persuasion. This was examined by having the teacher give information by describing a set of data (see below) for direct persuasion. Indirect persuasion was tested by having the environment give the information, meaning instead of having the teacher give information on data, the students take in the information for themselves by noticing the data as a poster in the classroom. Now, I have been purposely vague about what data was shown to the students as there were actually *two* different sets of data that were shown to different classes, but they communicated the same idea.

If these curves look the same, that's because they are; they are the same set of data, but one is presented with a positive connotation (attendance is good) and the other with a negative connotation (absenteeism is bad). These are the two sets of data presented, as to see whether *how* one presents data affects one's ability to influence.

So, in total, there are 5 classes: $\mathrm{i})$ teacher presents positive connotation data; $\mathrm{ii})$ teacher presents negative connotation data; $\mathrm{iii})$ classroom presents positive connotation data; $\mathrm{iv})$ classroom presents negative connotation data; and lastly, all compared to a $\mathrm{v})$ control with no intervention.

As you can see, there are 5 lines shown on the graph: 4 lines, 2 blue and 2 red, to represent data collected and a green line that represents the aforementioned Gunn's 5% chronic absentee rate. The blue lines represent tardy and absentee data from January, to collect pre-experimental data. Red lines represent data collected in February, the month in which the experiment was set in motion.

This graph may be a bit intimidating to read, but it helps to realize that the x-axis does not represent time as in most line graphs, but rather the individual class models. This allows for easy comparison between months: if there was, say, a virus moving through the classes that caused a 3% increase in absences across all of them, it would look as if one line was shifted upwards while keeping more or less the same structure as the other.

These results are really surprising, as there was *no* improvement at all from any implementation of our stimulus. If anything, it got marginally *worse*, with a slight increase in tardies in all classes ($\approx$3-4% increase). This makes some sense when you consider that whenever someone commands you to do something or says you're doing something wrong, the first instinct is to disagree and defend your actions. This feeling is known as **reactance**, and it is likely what caused this mild increase in tardies.

What is extremely concerning, however, is the massive increase in absences seen in the "-/teacher" class (negative connotation data presented by teacher), which observed an astonishing **85% increase** in absences. That is equivalent to roughly **90 additional students** absent at Gunn, and an **additional 560 across all of the Palo Alto Unified School District**. This prompts an interesting thought: **subconscious effects, such as reactance, can be amplified by other effects as well**. In this case, it was the *negativity bias* (the idea that negative connotations tend to carry more weight than positive ones) that amplified the reactance. For instance, say you are an avid fan of apples. If someone says oranges are better than apples, reactance will be incurred and you'll likely disagree. If someone says apples are worse than oranges, however, now there is this feeling of losing an opinion as well, amplifying the disagreement. Here, it is the difference between saying attending class is better than skipping it, and saying skipping class is worse than attending it.

**These results are highly specific**. Before you try to apply these ideas beyond the realm of this project, you should consider who was subject to this experiment. In fact, I witnessed this exact concept during my trials: my experimental model was inspired by Moore (2004), who ran an almost identical experiment and found it to be effective for *university* students, while mine proved not to be effective for high schoolers. Further testing is needed, but the greatest takeaway from this experiment is that communication and persuasion can only succeed when they are tailored very specifically to their audience.

If you want to read to a higher degree of depth on the matter, everything cited here and more can be found in my original paper.

**Originally, this project was never supposed to be about attendance**. Originally, I was intending on studying voting theory as I was (and still am) super interested in how individual behavior affects the collective, and how that can be leveraged. I was looking at networks and graph theory, synchronization, and other related topics, but I was especially looking at behavioral economics and prospect theory. Realizing I was about a year too early to be able to study any recent elections or voting processes, I honed in on the behavioral economics aspect of the research, and started to look for a new problem to address. I talked to Gunn High School's principal on school issues that could be examined, and attendance was a recurring theme. This was only corroborated by the *California Healthy Kids Survey*, a questionnaire that surveyed 9th and 11th graders each year, and it reported that 7% of freshmen and 11% of juniors cut class to prepare for another class alone in the month the survey was distributed (California Department of Education, 2017-18). So, I started directing my attention to different studies and papers that were already conducted to learn what ideas have been tried and tested, such as what Moore (2004) and Self (2012) did.

Very early on, however, I realized I was probably going to have the same issue that I was going to have with voting theory: gathering original experimental data. Not that there is anything wrong with using pre-existing data, but I personally wanted to collect my own original data to analyze, especially as I wasn't sure if something like attendance would be internalized the same way in a high school population as in the more frequently studied college population (and as stated previously, there was in fact a discrepancy between Moore's findings and mine due to the different populations studied). So, I began reaching out to different teachers to see who would be willing to help run the experiment in their classes. Doing so proved to be very difficult, as I needed a teacher who taught at least 5 classes of 20+ students with some absentees in each class, and who also had the class time to explain the necessary information to 2 classes. I was able to properly contact 2 teachers; I was able to collect data with one of them, but not with the other due to the sudden COVID-19 outbreak.

Another thing that made this project difficult was that I had an incredible workload set out for this year. Between the 5 required academic classes, 2 electives, an after-school elective, club commitments, and a sport for two-thirds of the year, I just didn't have the time in my schedule to devote another 2+ hours of class time, which didn't even include time needed outside of class to research and write. Not to mention any extracurriculars I had in place as well. I had to schedule almost all of my meetings at 7am or earlier, or worse, do them entirely across email threads, which made communication ambiguous and difficult at times.

Regardless, this whole experience taught me so much about academia beyond the scope of what any high school classroom could have, and taught me that no matter how simple a question one has, being willing to ask it can lead to incredible results.

As stated at the top, this was done as part of the **A**dvanced **A**uthentic **R**esearch program that PAUSD provides to its two high schools as a means to introduce students to formal research and academic writing, well beyond what a standard English class essay or chemistry lab write-up teaches. Providing students with community mentors, experts, and connections, it fosters student growth via the students' own motivation to learn, creating an environment where projects, such as the former, can be created out of curiosity, and not by seeking a letter on a report card.

This post looks to describe an interesting property intrinsic to any and all quartic functions, and it has to do with the relationship between the functions' inflection points. Below is a Desmos graph with $f(x)=Ax^4+Bx^3+Cx^2+Dx+E$, the general quartic equation, labeled, as well as its 2 inflection points $P$ and $Q$. A third point $R$ is labeled, which is where the line through $P$ and $Q$ intersects $f(x)$ again. What we are interested in today is the ratio $\frac{PR}{PQ}$. Play with the graph below to vary $f(x)$ and see what happens to that ratio.

Quickly you will notice that for (most) non-zero $A$ values, $\frac{PR}{PQ}$ always remains at the rather famous constant, $\varphi=1.61803...$, the golden ratio. This may seem coincidental, but there is a rather nice way of proving that this ratio is *exactly* equal to the golden ratio.

The classic definition of $\varphi$ comes from a specific geometric construction of a rectangle.

In this *golden rectangle*, there are two rectangles to focus on: the large one with aspect ratio $\frac{a+b}{a}$, and the smaller red one with aspect ratio $\frac{a}{b}$. The golden ratio is given by $\frac{a}{b}$ when the small red rectangle has the *same* aspect ratio as the larger rectangle (made up of the blue square and red rectangle). Letting $\varphi=\frac{a}{b}$, and setting the ratios equal to each other nets us:

$1+\frac{b}{a}=\frac{a}{b}$

$1+\frac{1}{\varphi}=\varphi$

$\varphi+1=\varphi^2$

$\varphi^2-\varphi-1=0$

$\varphi=\large{ \frac{1 \pm \sqrt{5}}{2} }$

The positive solution to this quadratic is the better-known value of $\varphi$. Taking some variations of the previous equations can net other interesting relationships that $\varphi$ satisfies. For example, the relation $\varphi=1+\frac{1}{\varphi}$ from the derivation nets a recursive, cyclic definition of the golden ratio. Expanding out the relation gives another famous definition of $\varphi$.

$\varphi=1+\cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1+\cdots}}}$

An infinite descending fraction solely containing 1s. Rearranging $\varphi^2=\varphi+1$ into $\varphi=\sqrt{1+\varphi}$ also gives an interesting appearance of $\varphi$.

$\varphi=\sqrt{1+\sqrt{1+\sqrt{1+\cdots}}}$

An infinitely nested radical solely containing 1s. Notice, however, that the quadratic for the golden ratio has a negative counterpart as well: $1-\varphi=-0.61803...$. Although it may seem nonsensical to assign a negative value to many of the expressions we used in defining $\varphi$, this value holds many of the same properties that $\varphi$ holds on its own as well, and the reason we don't see it as often has to do with the volatility of the value in these iterated scenarios, but that's for another time.
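Both infinite expressions can be approximated by cutting them off at a finite depth. Below is a minimal Python sketch of that idea; the function names, the depth of 40, and the starting value of 1 are illustrative choices, not anything special.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # positive root of x**2 - x - 1 = 0

def continued_fraction(depth, seed=1.0):
    """Approximate 1 + 1/(1 + 1/(1 + ...)) by iterating x -> 1 + 1/x."""
    x = seed
    for _ in range(depth):
        x = 1 + 1 / x
    return x

def nested_radical(depth, seed=1.0):
    """Approximate sqrt(1 + sqrt(1 + ...)) by iterating x -> sqrt(1 + x)."""
    x = seed
    for _ in range(depth):
        x = math.sqrt(1 + x)
    return x

print(continued_fraction(40), nested_radical(40), PHI)
```

Starting from a positive value, both iterations settle on the positive root rather than on $1-\varphi$.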

First, let's look at how to find the inflection points of a quartic. An inflection point is a point along a function where its concavity changes. That is, if you look at the tangent lines along a curve as you vary the input $x$, the tangent lines' slopes will change, and the inflection points are found where the slopes switch between decreasing and increasing. Take the function $x^3$, for example.

Notice how as we let our value $a$ increase from $-1.5$ to $0$, the slope of our tangent line (the first derivative of $f(x)$) decreases. But from $0$ to $1.5$, the slope begins to increase. This is all visualized in our graph of $f'(x)$, which plots, for every $x$, the slope of $f(x)$ at that point. One can clearly see $f'(x)$ trends downward initially, before rising again. And for $f'(x)$ to have a slope that's first negative (decreasing) then positive (increasing), it must have a slope of zero in between. So, the point where our concavity changes is where the slope of $f'(x)$ equals 0. In other words, where the second derivative $f''(x)=0$. Here we can see it clearly visualized at the solution $x=0$, which confirms all of our previous observations. Taking the second derivative of a general quartic and setting it to zero nets us:

$f'(x) = 4Ax^3+3Bx^2+2Cx+D$

$f''(x) = 12Ax^2+6Bx+2C = 0$

As this is a degree 2 polynomial, the quadratic formula quickly gives our two solutions for $x$ in general, which we will call $p$ and $q$.

$p, q = \frac{-6B \pm \sqrt{36B^2-96AC}}{24A} = \frac{-3B \pm \sqrt{9B^2-24AC}}{12A}$

This also explains why only most values of our constants give inflection points: if the $9B^2-24AC$ term (the discriminant of $f''(x)=0$, up to a constant factor) is negative, the solutions are complex, meaning no inflection point is found within the real plane. If it is exactly zero, $p=q$ and the concavity never actually changes. With valid constants giving us solutions for our inflection points $P:(p,f(p))$ and $Q:(q,f(q))$ respectively, the line through them can quickly be written as:

$g(x)=\frac{f(q)-f(p)}{q-p}(x-p)+f(p)$

The intersection point $R$ can be found solving for when $f(x)=g(x)$, or in other words, when $f(x)-g(x)=0$

$Ax^4+Bx^3+Cx^2+Dx+E-\frac{f(q)-f(p)}{q-p}(x-p)-f(p) = 0$

One can try to factor and work this out, but there is a much nicer approach that avoids working with this messy equation.

If we limit our transformations to purely scaling and translating our graph, all of our ratios will remain equivalent. So if we can find a set of transformations to make our work easier, we will still be able to prove our initial proposition, but in a much easier way. To (re)start, we're going to define a new function $h(x)$ that takes $f(x)$, and scales and moves it around as follows:

$h(x)=\frac{f((q-p)x+p)-f(p)}{q-p}$

This may seem arbitrary, but keeping in mind what $p$ and $q$ mean, this transformation alters the graph in a rather specific and useful way. First notice these two key components in our transformation: the $+p$ added to the input of $f$, and the $-f(p)$ subtracted from its output.

This results in shifting the graph over to the left $p$ units and down $f(p)$ units. Or more clearly, it takes our first inflection point $(p,f(p)) \rightarrow (0,0)$, the origin. We'll refer to the origin as $P'$. Now, let's look at the remaining components of the transformation: the factor of $q-p$ multiplying $x$, and the overall factor of $\frac{1}{q-p}$.

Multiplying $x$ by $q-p$ results in *compressing* the $x$-axis by a factor of $q-p$. So, the x coordinate distance between our inflection points is condensed from a length of $q-p$ to a length $\frac{q-p}{q-p} = 1$. Just to keep our scaling consistent throughout $f(x)$, we also scale the $y$-axis down by a factor $q-p$, so we add an extra factor of $\frac{1}{q-p}$. This factor is almost purely for aesthetic purposes, as you will see it will preserve the structure of our graphs and make it easier to see our scaled copy of $f(x)$ in $h(x)$. So, as the difference in $x$ coordinates between $P'$ and $Q'$ is 1, $Q'$ will be at $(1,h(1))$. $R'$ will retain its same definition as $R$, differing only in that it is on our newly transformed function.

Notice how the two inflection lines are parallel. That is due to that extra factor of $\frac{1}{q-p}$ in $h(x)$, but note that the math that follows is not dependent on it.

It's worth noting that we don't actually know any of the constants that shape our new quartic $h(x)$; all we will name for now is its leading coefficient, $a$ (notice the change in capitalization: the constants of $h(x)$ are separate from and different to those of $f(x)$). However, we do know the solutions to $h''(x)=0$. Instead of using our function to find its second derivative like we did in our original approach, we are working backwards from our second derivative to narrow in on our function. Since we know our inflection points are at $x=0$ and $x=1$, we can rewrite our $h''(x)$ as a product of factors.

$h''(x)=12a(x-0)(x-1)=12ax^2-12ax$

The factor of $12a$ comes from the leading term when taking the second derivative of any general quartic, as we saw in the original attempt to prove this. Expanding this expression and integrating twice gives us the following, where $b$ is the constant from the first integration:

$h'(x)=4ax^3-6ax^2+b$

$h(x)=ax^4-2ax^3+bx$

Notice I didn't add a new constant after the second integration, as that is equivalent to the $y$-intercept, which we know to be at $(0,0)$. Now that we have $h(x)$ written on its own terms, separated from $f(x)$, we can easily find the coordinates of $Q'$ by evaluating $h(1)$.

$h(1)=a-2a+b=b-a$

Now we can create a new secant line $g(x)$ to pass through our two inflection points, $P':(0,0)$ and $Q':(1,b-a)$.

$g(x)=(b-a)x$

Now we can continue using our original method, which is to find all solutions to $h(x)-g(x)=0$. Only this time, our transformations should net a cleaner equation.

$ax^4-2ax^3+bx-(b-a)x=0$

$ax^4-2ax^3+bx-bx+ax=0$

$ax^4-2ax^3+ax=0$

$ax(x^3-2x^2+1)=0$

That $ax$ we factored out is our solution at $x=0$, or $P'$, which we used to construct the line in the first place. Similarly, because we used $Q'$ to construct the line as well at $x=1$, we can factor out an $x-1$ as well.

$ax(x-1)(x^2-x-1)=0$

That last factor is the exact quadratic that we derived to define the golden ratio. Knowing that, we now have all of our solutions to the intersection points between our quartic and secant line.

$x=0, \quad x=1, \quad x=\frac{1 \pm \sqrt{5}}{2}$

The negative solution, $1-\varphi$, is the fourth point of intersection at $S:(s,h(s))$ with $s<0$, while the positive solution $x=\varphi$ is the $x$ coordinate of $R'$. Now the last thing to note is that our 3 points of interest, $P'$, $Q'$, and $R'$, are all collinear. So, they can be thought of as a projection of the $x$-axis to a sloped line that scales how far they are spaced apart. However, since this is multiplicative, the ratios will be the same, so we only need to look at the ratios between their $x$ coordinates.

$\frac{PR}{PQ}=\frac{P'R'}{P'Q'}=\frac{\varphi-0}{1-0}=\varphi$
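As a numerical sanity check, the ratio can also be computed directly from the original, untransformed quartic. The sketch below is only an illustration under a few assumptions: the function name is made up, it divides the two known roots $p$ and $q$ out of $f(x)-g(x)$ with synthetic division, and it takes $R$ to be the remaining intersection that lies beyond $Q$.

```python
import math

def inflection_ratio(A, B, C, D, E):
    """Return PR/PQ for f(x) = Ax^4 + Bx^3 + Cx^2 + Dx + E (numerically)."""
    disc = 9 * B**2 - 24 * A * C
    if A == 0 or disc <= 0:
        raise ValueError("no two distinct real inflection points")
    # Roots of f''(x) = 12Ax^2 + 6Bx + 2C, ordered so that p < q.
    p, q = sorted(((-3 * B - math.sqrt(disc)) / (12 * A),
                   (-3 * B + math.sqrt(disc)) / (12 * A)))

    def f(x):
        return (((A * x + B) * x + C) * x + D) * x + E

    m = (f(q) - f(p)) / (q - p)  # slope of the line through P and Q
    # Coefficients of f(x) - g(x), highest degree first,
    # where g(x) = m (x - p) + f(p).
    coeffs = [A, B, C, D - m, E - f(p) + m * p]
    # Divide out the known roots x = p and x = q (synthetic division).
    for root in (p, q):
        quotient = [coeffs[0]]
        for c in coeffs[1:-1]:
            quotient.append(c + root * quotient[-1])
        coeffs = quotient
    a2, b2, c2 = coeffs  # remaining quadratic a2 x^2 + b2 x + c2
    d = math.sqrt(b2**2 - 4 * a2 * c2)
    r = max((-b2 + d) / (2 * a2), (-b2 - d) / (2 * a2))  # the root beyond q
    return (r - p) / (q - p)

print(inflection_ratio(1, 0, -6, 0, 0))   # f(x) = x^4 - 6x^2
print(inflection_ratio(2, -3, -5, 7, 1))  # an arbitrary second example
print((1 + math.sqrt(5)) / 2)             # the golden ratio, for comparison
```

For $f(x)=x^4-6x^2$ this can be followed by hand: the inflection points are at $x=\pm 1$, the line through them is $y=-5$, and $x^4-6x^2+5=(x^2-1)(x^2-5)$, so $R$ sits at $x=\sqrt{5}$ and the ratio is $\frac{\sqrt{5}+1}{2}$.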

You can also quickly find other ratios of different lengths and find other interesting connections. Take $\frac{PQ}{QR}$, for example.

$\frac{PQ}{QR}=\frac{1-0}{\varphi-1}=\frac{1}{\varphi-1}$

If you look at our defining quadratic $\varphi^2-\varphi-1=0$, it can be rewritten as $\varphi(\varphi-1)=1 \rightarrow \varphi=\frac{1}{\varphi-1}$. Completing our expression gives us:

$\frac{PQ}{QR}=\frac{1}{\varphi-1}=\varphi$

Just as our golden rectangle previously foretold.

They are surfaces not covered in flat mirrors, but rather are tessellated with the corners of cubes that are mirrored. Why is that? To find out, we first need to talk about Fermat's principle, and $90^\circ$ angles.

Fermat's principle, or the principle of least time, was an idea coined in 1662 by the mathematician of the same name, and it states that the path taken by any given ray of light is always the quickest one. Although this may seem obvious, it allows many properties of light and optics to be derived from it. The one we need is the familiar *Law of Reflection*: the angle at which light approaches a surface is the same angle at which it reflects.

Let's say we have a light source $S$, and we're reflecting it off a mirror (black) at point $R$, to have our ray reach an end point $E$. To show that the angle of incidence must equal the angle of reflection, we are going to create a mirrored copy of our end point, $E'$ (points $P$ and $Q$ are exclusively reference points). As $E'$ is a reflection of $E$ across the mirror, both are equidistant from $R$, so we end up with two orange lines of equal length, $\overline{RE}$ and $\overline{RE'}$. Because $\overline{RE} = \overline{RE'}$, our original path of reflection $SRE$ has the same length as the new path $SRE'$. Note that the speed of the light isn't changing throughout our model, so the quickest path is the shortest one, and we only need to find the shortest path from $S$ to $E'$. That shortest path is clearly just a straight line (blue). We already knew that $\angle{ERQ} = \angle{E'RQ}$ by definition of the reflection $E \rightarrow E'$, and now that we know $\overline{SE'}$ is a straight line through $R$, we also get $\angle{SRP} = \angle{E'RQ}$, since they are vertical angles. Combining these two equalities nets us $\angle{SRP} = \angle{E'RQ} = \angle{ERQ}$, which was what we wanted to show.
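The same conclusion can be checked numerically by brute force: try every reflection point along the mirror, keep the one with the shortest total path, and compare the two angles. This is only a sketch under assumed coordinates (mirror along the $x$-axis, with $S$ and $E$ picked arbitrarily above it), and the function names are illustrative.

```python
import math

def path_length(x, s, e):
    """Length of the path S -> (x, 0) -> E, with the mirror along the x-axis."""
    return math.hypot(x - s[0], s[1]) + math.hypot(e[0] - x, e[1])

def best_reflection_point(s, e, steps=200):
    """Ternary search for the x that minimizes the path length.

    This works because the path length is a convex function of x.
    """
    lo, hi = min(s[0], e[0]), max(s[0], e[0])
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if path_length(m1, s, e) < path_length(m2, s, e):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

S, E = (0.0, 1.0), (3.0, 2.0)  # assumed positions of the source and end point
x = best_reflection_point(S, E)
# Angles between each ray and the mirror, measured at the reflection point.
angle_in = math.degrees(math.atan2(S[1], x - S[0]))
angle_out = math.degrees(math.atan2(E[1], E[0] - x))
print(x, angle_in, angle_out)
```

The search never uses the Law of Reflection, only path length, yet the two printed angles should agree up to floating-point error.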

Although this seems like an obvious fact, knowing why it is true helps to understand how we will apply it to our bike reflector and corner cubes.

To understand why corner cubes are chosen as the structure of bike reflectors, it helps to look at simpler cases first. Instead of looking at corners of cubes to see how light interacts with them, we can first work from the corner of a square and see what happens.

Notice how regardless of what angle the light is hitting the corner, the light reflected from the corner is always parallel to the ray entering it. We can prove this remains true for any angle $\alpha$ quite simply using some basic geometry.

We want to show that ray $\overrightarrow{M}$ is parallel to $\overrightarrow{N}$ given $\overrightarrow{M}$ intersects the corner at an angle $\alpha$ and that we have a true square corner that is a right angle. Filling in the givens, the rest follows nicely. The Law of Reflection gives the angle congruent to the initial $\alpha$, and the idea that all triangles' angles sum to $180^\circ$ gives the $90-\alpha$. The trick in proving this involves adding an auxiliary line as such and the rest follows.

We add another line parallel to one of the sides of our corner. This creates another right angle. Since we know that $90-\alpha$ makes part of the right angle, we know that $\alpha$ must make up the rest of the right angle, as $90-\alpha+\alpha=90$. By Law of Reflection we then know that there is a symmetrical angle of measure $\alpha$. Now since $\overrightarrow{M}$ and $\overrightarrow{N}$ both are attached to parallel lines at congruent angles, the only way that can happen is if $\overrightarrow{M}$ was parallel to $\overrightarrow{N}$ as well. Hence, a ray $\overrightarrow{M}$ has a reflected path $\overrightarrow{N}$ that exits parallel to its ray of incidence.

Moreover, we can show this only holds true for right angles using very similar logic. Setting our once right angle to $\theta$...

From our diagram, it's clear that for $\overrightarrow{M}$ to be parallel to $\overrightarrow{N}$, $\alpha=\alpha+\theta-90$, which when solving for $\theta$ gives $\theta=90$, our previous right angle.

All of the previous arguments can be applied to the 3-dimensional case by decomposing the ray of light into two component rays: by showing that each of those exits parallel to how it entered, the composite ray must as well. With all of this together, it makes perfect sense why bike reflectors are corners of cubes: they send light back to its source. If you had a standard mirror, no light would return to where it came from unless the source were aimed perfectly perpendicular to the mirror.
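In vector terms, the 3-dimensional case is short enough to sketch in Python. The snippet below assumes an idealized corner whose three mirrored faces lie along the coordinate planes, and assumes the ray actually bounces off all three faces (it tracks only the ray's direction, not where it hits); the function names are illustrative.

```python
def reflect(v, normal):
    """Reflect direction v off a mirror with unit normal n: v - 2 (v . n) n."""
    dot = sum(a * b for a, b in zip(v, normal))
    return tuple(a - 2 * dot * b for a, b in zip(v, normal))

def corner_bounce(v):
    """Bounce a direction off the three mutually perpendicular faces in turn."""
    for normal in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
        v = reflect(v, normal)
    return v

incoming = (0.3, -0.5, 0.8)  # an arbitrary incoming direction
outgoing = corner_bounce(incoming)
print(incoming, outgoing)
```

Each face flips exactly one component of the direction, so after all three bounces every component has been negated and the ray leaves anti-parallel to how it came in.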

If no light goes back to its source (say, a car's headlights), no light will hit the driver's eye to indicate that there is a bright, shining reflector and therefore a bike up ahead (for exactly this reason, most reflectors actually have angles slightly larger than 90$^\circ$ so that most light returns to its source, while some scatters to an observer slightly above/below/left/right of the source). These reflectors actually have a specific name: they're known as *retroreflectors*, literally meaning to reflect backwards. This concept has been leveraged to aid satellites, and indirectly the military. There's a reason why stealth-based aerial technology avoids right angles: the designers want to avoid creating an accidental retroreflector that can return radio waves.

Hopefully this gave insight into a seemingly arbitrary design choice in one of the most common bike accessories used today.

Let me propose a question to start. Try to solve the following:

$x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}=2$

An infinite power tower which supposedly equals 2? Seems unlikely, but those familiar with these infinite-operation type problems likely know the strategy to solve this. Notice how there's a copy of our equation stacked on top of itself.

Since we know that the expression in the box is equal to 2 because it's a duplicate of our original equation, we can easily reduce the problem down to something much more manageable.

$x^2=2 \rightarrow x=\sqrt{2}$

So, raising $\sqrt{2}$ to itself over and over again equals 2. What other equations can we solve? Let's try this one.

$x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}=4$

Using the same strategy as before, this one is trivial.

$x^4=4 \rightarrow x=\sqrt[4]{4}=\sqrt{2}$

Which is… the same answer as before? How can $f(x) = \sqrt{2}^x$ iterated over itself equal both 2 and 4 at the same time? When in doubt, we can ask our calculator for some confirmation.

With some simple Python, we can get a pretty good approximation quickly.

```python
import math

def f(x):
    temp = x
    # Apply sqrt(2)**temp 1000 times, starting from the seed x.
    for i in range(1000):
        temp = math.sqrt(2)**temp
    return temp

print(f(1))
```

The above code creates and evaluates a power tower 1000 numbers tall, giving us an approximation of `2.0000000000000004`, which is pretty close to 2. So, is 4 anywhere to be seen? Actually, yeah; our solution wasn't *completely* false. Notice that at the end of the script it says `f(1)`. That 1 is our *seed value*. Since our power tower can't be infinite if we want a calculable approximation, we need to cut it off after some amount (in this case, 1000 numbers high). In order to do that, though, there has to be some number there at the top of that power tower. In this case it was 1, but it can be anything: since we constantly plug our output back into our input, one would expect the seed value to be negligible in the case of an infinitely stacked power tower. Let's see what happens if that is changed to `f(4)`.

```python
print(f(4))
```

Due to rounding, our script actually blows up with `f(4)` (floating-point error nudges the first output slightly above 4, the values then grow without bound, and Python typically ends up raising an `OverflowError`), but we can reason this out by hand. If we start with 4, then our first output of iteration will be $\sqrt{2}^4 = 4$. Since 4 is our output, that's our new input. But since 4 was also our seed value, it'll just constantly output 4 at every iteration. So 4 *is* a convergent value (as we can only calculate finite approximations) to the infinite power tower of $\sqrt{2}$, but only when 4 itself is the seed value. To better understand this, we can use a tool known as a *cobweb plot*.
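To compare several seed values without the script crashing, the iteration can be wrapped so that an overflow is reported as infinity. This is a sketch rather than the script above: the name `tower`, the `height` and `base` parameters, and the choice to return `math.inf` on overflow are all assumptions made for illustration.

```python
import math

def tower(seed, height=1000, base=math.sqrt(2)):
    """Iterate x -> base**x `height` times; return math.inf if it overflows."""
    x = seed
    for _ in range(height):
        try:
            x = base ** x
        except OverflowError:
            # The values grew past what a float can hold: treat as divergence.
            return math.inf
    return x

for seed in (-1, 1, 3.9, 4.5):
    print(seed, tower(seed))
```

Seeds below 4 should settle near 2, while a seed above 4 should run away to infinity.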

Cobweb plots are a simple, elegant method to model iterative functions in the Cartesian plane by utilizing a seemingly mundane auxiliary function: $y = x$. What is probably one of the first graphs people are taught in school turns out to be one of the most helpful in modeling these complicated and otherwise hard-to-visualize functions. Here's how to make a cobweb plot: 1) Plot the function to be iterated on (in this case, $f(x) = \sqrt{2}^x$) and $y = x$ together. 2) Pick a seed value to start iterating on. 3) Alternately draw vertical lines to the function and horizontal lines to $y = x$ for as many iterations as one needs. Steps 1 and 2 should be clear enough as they're fairly similar to what we did above, but Step 3 might need a visual to go along with it.

Here's the first step's resulting plot:

Nothing too crazy. The green graph is our $f(x) = \sqrt{2}^x$, while the red graph is our $y = x$. For Step 2 we'll pick $x = 1$ as our seed value as we did before. This is where the magic of Step 3 comes in: from $x = 1$, we'll draw a vertical line from the red graph until it intersects at the green graph.

Now we have a line segment with points $(1,1)\rightarrow(1,f(1))$. This step is equivalent to plugging 1 into the top of our power tower, geometrically doing the operation of $f(x)$. Since we just drew a vertical line, we now draw a horizontal one from the green graph $f(x)$ until it intersects the red one $y = x$.

Now we have a new line segment from $(1,f(1))\rightarrow(f(1),f(1))$. You can probably see where this is going. Now that we have a new point at $x = f(1)$, we can draw a new vertical line until it hits the green graph, geometrically finding the value of $f(f(1))$, performing our repeated operation! We can do this series of horizontal to vertical lines as many times as we want to get as many iterations of our repeated function as we want!
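The vertical-then-horizontal recipe above can be sketched as a tiny function that just lists the corner points of the cobweb path. The name `cobweb_points` is my own, and this only computes the points; feeding them to a plotting library is left as an assumption.

```python
import math

def cobweb_points(f, seed, iterations):
    """Return the corner points of a cobweb path, starting on the line y = x."""
    points = [(seed, seed)]
    x = seed
    for _ in range(iterations):
        y = f(x)
        points.append((x, y))  # vertical move: up or down to the curve f
        points.append((y, y))  # horizontal move: across to the line y = x
        x = y
    return points

path = cobweb_points(lambda x: math.sqrt(2) ** x, seed=1.0, iterations=50)
print(path[:3])   # (1, 1) -> (1, f(1)) -> (f(1), f(1))
print(path[-1])   # where the path ends up after 50 iterations
```

Each pair of points is one application of $f$, so connecting consecutive points with line segments reproduces the web drawn by hand.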

Now you can probably see why this is called a cobweb plot, as we weave back and forth creating a net-like shape between the graphs (and it only gets more wild looking with different iterative functions!). Even in the previous graph where I set the seed value to be $x=-1$, our graph still quickly homes in on $x = 2$ for the $\sqrt{2}$ power tower, which happens to be exactly where our two plots intersect. This is a pretty narrow scope of our graph, though; let's zoom out and see more of this plot.

There's also an intersection at $x=4$! Even with all of this, I don't think it would be wrong to feel that $x=4$ should *not* be a solution to some extent. Even though it clearly shows a lot of the same characteristics that $x=2$ does, it still feels weird for this to be considered an answer, or at least to the same extent that $x=2$ is. For any seed $x<4$, our iteration converges to $x=2$, and for any seed $x>4$, it diverges. Only when the seed is exactly $x=4$ does our repeated power tower equal 4. To properly understand this, we'll need to utilize derivatives.

The classic definition of the derivative $f'(x)$ is a function that returns the slope of $f(x)$ at every point $x$. While this definition of the derivative isn't wrong, it is fairly limiting when only considered in the context of slopes. We can reframe the derivative not as the slope of a function at a point $(a,f(a))$, but rather as how *sensitive* the function is at the point $(a,f(a))$. This will be more apparent if we plot our $f(x)=\sqrt{2}^x$ in a new way.

You can generate the above plot with the following Python:

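```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return np.sqrt(2)**x

inp = np.linspace(-5, 5, 40)   # preimage points, drawn on the top number line
out = [f(n) for n in inp]      # their images under f, drawn on the bottom number line
d = 10                         # vertical distance between the two number lines

fig = plt.figure(figsize=(20, 4))
axes = plt.gca()
axes.set_xlim([-5.3, 5.3])
axes.set_ylim([-6, 6])

plt.scatter(inp, [d/2 for n in range(len(inp))])
plt.scatter(out, [-d/2 for n in range(len(out))])
# Connect every input to its output with a green line.
for n in range(len(inp)):
    plt.plot([inp[n], out[n]], [d/2, -d/2], color='green')
```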

This basically just took the $y$-axis of our Cartesian graph and rotated it $90^\circ$. The blue dots represent the preimage points $x$, while the orange dots represent their associated transformations under $f(x)$, with green lines connecting them. Just looking at it, it's consistent with our Cartesian graph as $f(x)$ never goes below 0, which makes sense as an exponential is always positive. The reason we want this graph is that it guides the intuition behind this idea of sensitivity and the derivative.

Notice the dots around $x=-3$ in the preimage (blue) points. They all get mapped and squished down near $.354$ under $f(x)$; they get tightly pressed together. But just *how* tightly pressed together are they? That's exactly what the derivative tells us! For a small change $dx$ in the input, we want to know how much the output changes, $df$. In this case, $f(x)=\sqrt{2}^x \rightarrow f'(x)=\sqrt{2}^x\cdot\ln{\sqrt{2}}$. Plugging in $x=-3$ gives $f'(-3)\approx.1225$. This means that around $x=-3$, the ratio between a small change in the output and the small change in the input that caused it is $.1225$; in other words, the area around $x=-3$ appears to have shrunk *inward* by a factor of $.1225$. In the context of slopes, this ratio would be the slope of our tangent line, telling us how tall $df$ would be relative to $dx$. Since the derivative $f'(-3)$ is small, we can say that $f(x)$ is not very sensitive around $x=-3$, as a small change in input from $-3$ will still evaluate to about the same value.

Now let's look at the right half of the graph. Since $f'(4.5)\approx1.6486$, our previous logic says we'd expect points to stretch *away* from $x=4.5$ by a factor of $1.6486$. Just by looking at our plot, that's not so hard to believe. This means that our $f(x)$ is fairly sensitive around $x=4.5$, as a small difference in input from $4.5$ can lead to a big difference in evaluating $f(x)$.
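If you'd like to see the "ratio of small changes" idea numerically, here is a quick sketch. The helper names `local_stretch` and `f_prime`, and the nudge size `dx=1e-6`, are my own choices for illustration.

```python
import math

def f(x):
    return math.sqrt(2) ** x

def local_stretch(x, dx=1e-6):
    """Ratio df/dx for a small nudge dx: how much f stretches or shrinks near x."""
    return (f(x + dx) - f(x)) / dx

def f_prime(x):
    """Analytic derivative: sqrt(2)**x * ln(sqrt(2))."""
    return f(x) * math.log(math.sqrt(2))

# The measured ratio and the analytic derivative should agree closely.
for x in (-3.0, 4.5):
    print(x, round(local_stretch(x), 4), round(f_prime(x), 4))
```

The nudged ratio and the analytic derivative agree to several decimal places at both points: a shrink near $x=-3$ and a stretch near $x=4.5$.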

So now we know that for a given $a$, if $|f'(a)| < 1$, it's a shrink, and if $|f'(a)| > 1$, it's a stretch (a negative derivative implies there's also a flip occurring, but we care only about magnitude). You can now kind of imagine what effects these have when we iterate over $f(x)$ for a long time: points will gravitate towards numbers that shrink the area around them, and be repelled away from numbers that stretch them. Now, relating this back to our original Cartesian plot, let's highlight the areas in which $|f'(a)| > 1$.

Well, look at that! Our $x=4$ solution is in our blue $|f'(x)|>1$ region, while our $x=2$ solution is not!
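The same check can be done without a picture. Below is a minimal sketch that labels a point as being in the shrink or stretch region and solves $f'(x)=1$ for where the boundary sits; `region` and `boundary` are names I made up for this example.

```python
import math

LN_SQRT2 = math.log(math.sqrt(2))

def f_prime(x):
    return math.sqrt(2) ** x * LN_SQRT2

def region(x):
    """Label x by whether f shrinks or stretches the points around it."""
    slope = abs(f_prime(x))
    if slope < 1:
        return "shrink"
    if slope > 1:
        return "stretch"
    return "borderline"

# Solve sqrt(2)**x * ln(sqrt(2)) = 1 for x to find where the regions meet.
boundary = math.log(1 / LN_SQRT2) / LN_SQRT2

print(region(2.0), region(4.0), boundary)
```

The boundary lands between the two intersections, so $x=2$ sits on the shrinking side and $x=4$ on the stretching side.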

Connecting this all together now, we had two solutions to an iterative function, but only one of them was appearing in practically every case. When graphing its cobweb plot, we see that one solution lies in a non-sensitive region ($f'(2) = .6931$), while the other lies in a sensitive one ($f'(4) = 1.3863$). So what can we say about each solution? Since we know $f(x)$ is not sensitive to small changes around $x=2$, and moreover shrinks space around it, we know that $x=2$ is a **stable fixed point** of the iterative function $f(x) = \sqrt{2}^x$. It's stable in the sense that, because $f(x)$ isn't sensitive to small changes in the neighborhood of $x=2$, each iteration maps points closer and closer to $x=2$ due to the squishing effect of its derivative. But for $x=4$, where $f(x)$ is sensitive, each iteration tends to stretch and repel points away from $x=4$, even though it too is an intersection in our cobweb plot and analytically solves the equation. Hence, we call $x=4$ an **unstable fixed point** of the system. Just like we've described, while $x=4$ is valid for its own seed value, the slightest discrepancy pushes numbers away from it, either to start approaching $x=2$ or to diverge to infinity (like the rounding error in our Python script before!). If we quickly go back to our graph style with 2 number lines and perform the function iteratively there, we can really see what these pulls and pushes of numbers look like. Here's what the first 10 iterations of $\sqrt{2}^x$ look like:

You can really see how tight the points coil around $x=2$, and split away from $x=4$. Even with an initial value that starts so close to $x=4$, you can still see it slightly drift away from it at each iteration. This is why thinking of derivatives as measures of sensitivity is so important: the value of the derivative tells you how strong of a pull or push certain numbers have. Consistent with our findings, $x=2$ has a pulling effect around it with a small derivative, while $x=4$ has a pushing effect with its large derivative.

This is why we were also able to use cobweb plots: they were the geometric algorithm to solve when $f(x)=x$, which makes sense as if something is a fixed point, no matter how many times we apply a function to it, it should remain the same. So when solving $\sqrt{2}^x = x$, you'll get the intersections we found earlier at $x=2,4$ (if you want to try and actually solve this equation, it requires the clever use of the Lambert W-function). That's why we were able to analytically solve for two different solutions, but only one kept popping up everywhere. This isn't limited to just power towers, though.

This type of relationship between stable and unstable fixed points is everywhere. Take the well-known infinite fraction below:

$1 + \cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{1+\cdots}}}$

By setting this equal to $x$, we can solve it just like we did before with the power towers.

$1 + \frac{1}{x} = x$

$x^2-x-1=0$

Using the quadratic formula, we once again get two solutions:

$x = \frac{1+\sqrt{5}}{2} = \varphi \qquad x = \frac{1-\sqrt{5}}{2} = 1-\varphi$

The famous Golden ratio $\varphi$ and its underrated second solution. Still, it raises the question: how can a completely positive infinite fraction equate to something negative? Illustrating this with our cobweb and sensitivity regions will make this clear once again. Setting $f(x)=1+\frac{1}{x}$, we get…

A lot like $x=4$ when iterating $\sqrt{2}^x$, $1-\varphi$ is the unstable fixed point in the sensitive region, with numbers getting pushed away at every iteration, while $\varphi$ is the stable one which we quickly spiral down towards. We can quickly verify that $1-\varphi$ is a "valid" solution by plugging it into $1+\frac{1}{x}$ just like we did with $x=4$ into $\sqrt{2}^x$.

$1 + \frac{1}{1-\varphi} = 1 - \varphi$

For its own seed value, $1-\varphi$ is valid, but I guess that's up to you if you want to equate a negative value to a positive infinite fraction.

For those who are interested, try setting your seed value to a number in the form of $-\frac{F_n}{F_{n+1}}$ where $F_n$ represents the nth Fibonacci number. The Golden ratio is closely tied to the Fibonacci numbers, so it may be a bit unsurprising why they may relate here. If you try to iterate over any number in this form, you'll eventually hit a point where evaluating the function becomes undefined. Try plugging in a few and watch the strange cascading effect happen.
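If you'd rather let a computer do the plugging in, here is a small sketch using exact fractions so that hitting 0 is detected without rounding trouble. The helpers `trajectory` and `fib` are names I picked for this example.

```python
from fractions import Fraction

def trajectory(seed, max_steps=50):
    """Iterate x -> 1 + 1/x exactly, stopping if x hits 0 (next step is undefined)."""
    xs = [Fraction(seed)]
    for _ in range(max_steps):
        if xs[-1] == 0:
            break
        xs.append(1 + 1 / xs[-1])
    return xs

def fib(n):
    """n-th Fibonacci number with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

# Seeds of the form -F_n / F_(n+1) cascade down to 0.
for n in range(1, 6):
    seed = Fraction(-fib(n), fib(n + 1))
    print(seed, "->", [str(x) for x in trajectory(seed)])
```

Each of these seeds steps back through the earlier ones until it lands on $-1$, then $0$, at which point $1+\frac{1}{x}$ has nowhere left to go. A positive seed like 1, by contrast, walks through ratios of consecutive Fibonacci numbers towards $\varphi$.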

There are a whole host of functions that have interesting iterations as well. Let's try $f(x) = \cos(x)$

Since $f'(x) = -\sin(x)$, $|f'(x)|$ is always less than or equal to 1, so all fixed points it has will not diverge. In this case, we get a solution of $\approx .73909$, sometimes referred to as the Dottie number, which has its own set of interesting properties (for one, it's a transcendental number of the likes of $\pi$ and $e$!). Let's try another function. What happens if we scale $f(x)$? Let's try $5f(x) = 5\cos(x)$
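Before scaling things up, the plain cosine iteration is easy to try yourself. This is only a sketch: `iterate` is a hypothetical helper, and the seed of 1 and the 200 steps are arbitrary choices.

```python
import math

def iterate(f, seed, steps):
    """Apply f to seed repeatedly, `steps` times."""
    x = seed
    for _ in range(steps):
        x = f(x)
    return x

dottie = iterate(math.cos, seed=1.0, steps=200)
print(dottie)                     # settles near the Dottie number
print(math.cos(dottie) - dottie)  # essentially zero: a fixed point of cos
```

This is the same thing as mashing the cosine button on a calculator (in radians) over and over.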

We have not one, not two, but three different intersection points where $5\cos(x) = x$. But notice, all three of them lie within the sensitive region where the magnitude of the derivative of $5\cos(x)$ is greater than 1; they're all unstable. You can probably tell just by looking at it that it's a very chaotic diagram. This might not be unexpected for some of you though. If it doesn't converge to anything, but also doesn't diverge, why wouldn't it just randomly jump around ad infinitum? Well, let me present another function to explain why. Let's make a cobweb plot for $f(x) = 3.2x(1-x)$

Here we have 2 intersection points, both of which are in the sensitive region, so nearby points should not converge to either of them (only the fixed points themselves stay put), and that's exactly what we see, with no definite attraction to any one fixed point. Yet, it's not like our iterations are randomly moving. In fact, just looking at the diagram, the iteration quite predictably settles into a cycle between two $x$-values of $\approx .513$ and $\approx .8$. The difference between $5\cos(x)$ and $3.2x(1-x)$ is how each interacts with our seed value. The former has a quality known as *sensitive dependence on initial conditions*, more commonly referred to as the Butterfly effect: a small change in the seed value can produce wildly different outputs in the long run, just like how a butterfly's wings can produce a hurricane years later halfway across the globe. This is a common property of what is aptly deemed *chaotic behavior*. The latter function, while it may not have a convergent value, exhibits neither Butterfly effect-esque behavior nor chaos while iterating over it, and instead settles into this cycle. As a kickstarter for those interested, $3.2$ in the latter function was not an arbitrary choice: it comes from a family of iterative functions of the form $rx(1-x)$ known as the logistic map. There's so much to talk about there that it will likely be its own post later, but that's for another day.
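Here is a rough sketch of that cycle for the $3.2x(1-x)$ case. The function names, the seed of 0.3, and the 1000 warm-up steps are all arbitrary choices of mine for illustration.

```python
def logistic(x, r=3.2):
    return r * x * (1 - x)

def orbit_tail(seed, burn_in=1000, keep=4):
    """Iterate past the transient, then record the next few values."""
    x = seed
    for _ in range(burn_in):
        x = logistic(x)
    tail = []
    for _ in range(keep):
        x = logistic(x)
        tail.append(x)
    return tail

tail = orbit_tail(0.3)
print(tail)  # alternates between two values
```

After the warm-up, the values simply alternate between roughly 0.51 and 0.80 instead of settling on either fixed point.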

I want to go back to the Golden ratio problem, as there's a neat extension to a more general iterative approximation technique that is more applicable to problem solving. It is known as the **Newton-Raphson Method**, and it can (usually) home in on the roots of a polynomial quite efficiently.

The idea is fairly similar to what we did before, but since it's catered to finding roots of polynomials, its iterations have a modified step as we're looking for intersections with the $x$-axis instead of the line $y=x$. Here's the basic idea: 1) Pick an initial seed value $x_0$. 2) Draw a vertical line (like we did with the cobweb) until we hit the function $f(x)$. 3) Draw the tangent line of $f(x)$ at $x_0$, and see where it hits the $x$-axis. Call this new point $x_1$. 4) Repeat the process as many times as you'd like for as accurate an approximation as you'd like up to some $x_n$. Here's an example geometric interpretation for this method with $f(x) = x^2 - 13$.

I had to zoom in extremely close for this graph because, as you can see, just two iterations from a seed value of $x_0=5$ already give a really accurate approximation of one of the roots of $f(x)$, and you wouldn't be able to see those lines unless magnified by this much. Let's work out a general iterative formula for this method. We first start with some $f(x)$. Just by using derivatives and the definition of a line passing through the point $(x_n,f(x_n))$ for our tangent, we can solve the equation

$0 = f'(x_n)(x-x_n) + f(x_n)$

to find the next point $x_{n+1}$ to continue iterating on (as it should be the $x$-intercept of that line like the instructions describe). Doing some basic algebra shows that:

$f'(x_n)(x-x_n) = -f(x_n)$

$x = x_n - \frac{f(x_n)}{f'(x_n)}$

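So, tidying things up, for a given (continuous and differentiable) function $f(x)$, we can approximate its roots by starting with some initial $x_0$ and iterating over:

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Trying this out with our $f(x) = x^2 - 13$, our recurrence relation after some simplifying becomes

$x_{n+1} = x_n - \frac{x_n^2-13}{2x_n} = \frac{1}{2}\left(x_n+\frac{13}{x_n}\right)$

Or if you liked our previous notation, we can rewrite this as a function and iterate over

$g(x) = \frac{1}{2}\left(x+\frac{13}{x}\right)$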

Since this is in function form, we can use our old friend the cobweb to solve this for us.

It nicely finds $\sqrt{13}$ as a solution, just as we would expect. However, notice that there are two intersection points that lie *outside* of the sensitive region. One we found at $x=\sqrt{13}$, and the other is actually the second solution to $x^2-13=0$ at $x=-\sqrt{13}$. Our seed value matters significantly more in this case, as now, depending on which zero of $f(x)$ is closer, our iteration will target only the closest solution, and this only becomes more important the more zeroes our function contains.
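Here is a minimal sketch of the method as described, applied to $f(x)=x^2-13$. The function name `newton`, the fixed step count, and the choice to raise an error on a horizontal tangent are my own assumptions, not anything prescribed by the method.

```python
def newton(f, f_prime, x0, steps=20):
    """Repeat x -> x - f(x)/f'(x), starting from the seed x0."""
    x = x0
    for _ in range(steps):
        slope = f_prime(x)
        if slope == 0:
            # A horizontal tangent never meets the x-axis.
            raise ZeroDivisionError("horizontal tangent: no next iterate")
        x = x - f(x) / slope
    return x

def f(x):
    return x * x - 13

def f_prime(x):
    return 2 * x

print(newton(f, f_prime, 5.0))   # lands on the positive root
print(newton(f, f_prime, -5.0))  # a negative seed finds the other root
```

Flipping the sign of the seed is enough to switch which root we land on, which is the seed dependence mentioned above.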

Even with all those caveats, notice what we just made! Our iterative function $g(x)$ is essentially a square root estimator, but with no exponents! While it's nice and convenient just to use exact answers, decimal approximations are just as useful, especially for computers, which don't have unlimited memory to store exact answers. For any number $n$, we can calculate $\sqrt{n}$ as accurately as we'd like by iterating over the function

$g(x) = \frac{1}{2}\left(x+\frac{n}{x}\right)$

as many times as we want. There are some exceptions where certain seeds can infinitely cycle or actually result in no subsequent $x_{n+1}$ (imagine a horizontal tangent line), but this method is incredibly useful, as it doesn't just extend to square roots, but to any function whose roots you want to approximate using the aforementioned formula

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Here are a few other iterative functions for other roots of $n$:

$\sqrt[3]{n} \rightarrow \frac{1}{3}(2x+\frac{n}{x^2})$

$\sqrt[4]{n} \rightarrow \frac{1}{4}(3x+\frac{n}{x^3})$

$\sqrt[p]{n} \rightarrow \frac{1}{p}((p-1)x+\frac{n}{x^{p-1}})$
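These all come from applying the Newton-Raphson step to $f(x)=x^p-n$, and the general formula is easy to sketch in code. The names `root_step` and `pth_root`, the default seed of 1, and the 100 steps are my own choices; the sketch assumes a positive $n$ and a positive seed.

```python
def root_step(x, n, p):
    """One iteration of (1/p) * ((p - 1) * x + n / x**(p - 1))."""
    return ((p - 1) * x + n / x ** (p - 1)) / p

def pth_root(n, p, seed=1.0, steps=100):
    """Approximate the p-th root of n by repeating root_step."""
    x = seed
    for _ in range(steps):
        x = root_step(x, n, p)
    return x

print(pth_root(13, 2))  # square root of 13
print(pth_root(27, 3))  # cube root of 27
print(pth_root(16, 4))  # fourth root of 16
```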

Going back to our Golden ratio iteration, we can rewrite it under the fixed point formula $f(x)=x\rightarrow 1+\frac{1}{x}=x$. If you multiply that through by $x$ and rearrange, we get a quadratic $x^2-x-1=0$. That's a quadratic we can solve with the Newton-Raphson Method! Plugging it into the formula, we get a function to iterate over as

$g(x) = x - \frac{x^2-x-1}{2x-1} = \frac{x^2+1}{2x-1}$

And sure enough, it works! The advantage of using the Newton-Raphson Method in this case is that we no longer have to worry about unstable fixed points, as all of our solutions lie outside the sensitive region. So even if we lose some insight into the nature of each solution, we can consistently find both $\varphi$ and $1-\varphi$ to an accurate decimal expansion with the right seed.
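As a quick check, applying the Newton-Raphson update to $x^2-x-1$ gives $x - \frac{x^2-x-1}{2x-1}$, which simplifies to $\frac{x^2+1}{2x-1}$. Here's a small sketch iterating that; the seeds of 2 and -1 and the helper name `iterate` are arbitrary picks of mine.

```python
def g(x):
    """Newton-Raphson update for x**2 - x - 1."""
    return (x * x + 1) / (2 * x - 1)

def iterate(f, seed, steps):
    x = seed
    for _ in range(steps):
        x = f(x)
    return x

phi = (1 + 5 ** 0.5) / 2
print(iterate(g, 2.0, 20), phi)        # positive seed finds phi
print(iterate(g, -1.0, 20), 1 - phi)   # negative seed finds 1 - phi
```

Unlike iterating $1+\frac{1}{x}$ directly, both solutions now attract nearby seeds.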

Iteration and fixed points are among the prime topics of dynamical systems, and they describe much of the world around us. We discussed the Newton-Raphson Method of root finding, but there are many other recurrence relations for approximating roots of functions, each catered to its own purpose with different convergence rates and fail cases. Moreover, this is just a single *use* of the Newton-Raphson Method; it is arguably better known in optimization, where it serves as an alternative to gradient descent. Solving systems of differential equations comes down to finding the equivalent of a higher-dimensional fixed point, or in other words, an eigenvector: a vector (which is just an object that can encode more than one number and hence dimension) which doesn't change direction under the transformation describing the system of equations. Markov chains are another extremely important occurrence of fixed points under iteration: after a long series of transitions between states, we can make an overarching statement about the system as a whole reaching an *equilibrium state* where the probabilities of being in each state are expected to remain the same (going back to that idea of eigenvectors!). Synchronization is a prime example of a fixed point under iteration: even if a group of fireflies begin out of phase with one another, their coupling over time will pull them into a single large group with one cyclic, uniform behavior. The Mandelbrot set (and all of the Julia sets, for that matter) arise out of the fact that some complex numbers remain bounded under repeated iteration of functions $f(z)=z^n+c$ (sometimes being bounded to multiple values at once!). There are even entire studies dedicated to this. *Invariant theory* studies mathematical groups and polynomials to see how they remain unchanged under transformations.

Almost all of chaos theory is about stability (or the lack thereof) over long periods of time (Nicky Case has a great introduction to attractors), especially when what should be simple, predictable equations are not (we already talked about the logistic map, but see it illustrated in the Bifurcation diagram. It is particularly interesting for it appears in the most unlikely of places). We saw some chaotic behavior earlier, and the way I deduced it was chaotic was with a quantity that all iterative functions and maps have, known as the Lyapunov exponent, which is itself interesting to look at to see how a function's behavior changes along with its Lyapunov exponent. For fixed points alone, there are hundreds of theorems dedicated to analyzing them (the most notable of them being Brouwer's Fixed-Point Theorem).

If you are interested in anything covered here, the popular math YouTube channel 3Blue1Brown made not one but two videos discussing this idea of derivatives and infinitely stacked operations with the exact puzzle I posed at the start of this post. Their first video is what originally inspired me to look into these objects more when I first saw it a couple of years back. Their animations do wonders compared to what any text post can do, so please do check them out if you want a more visual approach to these processes along with some additional justification for solutions to iterative processes.

Fixed points appear everywhere, and I hope this shared a few insights into how they can appear, deceive, and approximate even the most out there of expressions.

Brief summaries are at the bottom of each section if you want a quick refresher for anything above, but first, some review.

This is also all written more formally with other examples in this paper.

**Markov chains**, in essence, are a way to model a process that randomly jumps between different outputs, where each output is said to have some probability to jump to other outputs. They're sort of like rolling dice, but the likelihood you roll any number is only dependent on the number you rolled last. It might help to describe this with an example. Let's say you want to know what the weather will be in 5 days: will it be sunny or rainy? Fortunately, the weather doesn't vary too much, so if it's sunny one day, it's likely to be sunny again the next day with 80% chance. If it's rainy, it will likely be rainy again too, with, say, 60% chance. This can be shown quite succinctly in a little diagram:

This is our actual Markov chain, showing the two **transition states**, S(unny) and R(ainy), with their associated transition probabilities. However, we can't actually *do* much with just a picture alone. So, we can rewrite these probabilities and encode them in a matrix:

$M = \begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix}$

You can think of each row as a different state for the current weather, and the columns as probabilities for different states of tomorrow's weather. In this case, I have written row 1 and column 1 to indicate sunny days, and row 2 and column 2 to be rainy days. That's why entry $a_{1,1}$ in row 1, column 1 shows 80%, because if it is sunny today (row 1), we expect an 80% chance for it to be sunny tomorrow (column 1). Similarly $a_{2,2}=.6$, as if it's rainy today, we expect a 60% chance for rain again. $a_{1,2}=.2$ means that if today is sunny, then there is a 20% chance of rain tomorrow, and for completeness' sake, $a_{2,1}=.4$ indicates a 40% chance for it to be sunny given today is rainy.

What we've built here is known as a **transition matrix**, as, well, it's a matrix that shows transition probabilities; it's a matrix that shows how likely we are to jump from one state to another. In this case, our states are the different weathers: sunny or rainy. So, how does this help us answer our original question of what the weather will be in 5 days? Well, let's first try to find the weather 2 days from now. We know how to model 1 day from now, and since these are probabilities, wouldn't it make sense just to multiply our matrix by itself?

$M^2 = \begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix}\begin{bmatrix} .8 & .2 \\ .4 & .6 \end{bmatrix} = \begin{bmatrix} .72 & .28 \\ .56 & .44 \end{bmatrix}$

Our probabilities have changed a little bit. Now it's saying, if today is sunny, there is a 72% chance it will be sunny 2 days from now. The reason why multiplying our matrix by itself makes sense is that the mechanics of matrix multiplication essentially ask: "What is the probability of getting from one state to another in two steps?" If you work out the multiplication itself, it might be clearer, but the way I like to think about it is in terms of transformations of space. For those familiar with a bit of linear algebra, we can think of our matrix $M$ as a collection of basis vectors that scale space (where our vectors in space can be thought of as a collection of starting states, i.e. the initial observed proportion of sunny days to rainy days). Applying $M$ once transforms space, and we can then take the result as a new "default" or "unit". If we apply $M$ again to our basis vectors, it has the effect of transforming space once again. This can be thought of as our standard, independent probability multiplication, but instead of changing a singular probability (i.e. dice value), we are changing two (likelihood of sunny *and* likelihood of rainy days).

With this in mind, our question is easy. It boils down to what $M^5$ is.

$M^5 = \begin{bmatrix} .67008 & .32992 \\ .65984 & .34016 \end{bmatrix}$

So if today is sunny, we look at row 1 and can expect a 67.008% chance of sunny weather in 5 days, and if it's rainy, row 2 shows a 65.984% chance for sunny weather. Nice! But you might be looking at that matrix and notice that row 1 and row 2 are *almost* the same. Watch what happens if we don't just check 5 days in the future, but instead look towards an infinite number of days ahead:

$\lim_{n\to\infty} M^n = \begin{bmatrix} .\overline{666} & .\overline{333} \\ .\overline{666} & .\overline{333} \end{bmatrix}$

The rows *do* become the same. So, if we were to pick a random day far, far into the future, we can expect it to be twice as likely to be sunny as rainy, regardless of today's weather. There are two important interpretations of this fact. First, going back to our transformation of space idea, this **equilibrium state** is an eigenvector (specifically for $\lambda=1$) of our transition matrix $M$. Meaning, it is the solution to the matrix equation $vM = v$ where $v$ is a row vector (here, $v=\begin{bmatrix} .\overline{666} & .\overline{333} \end{bmatrix}$). The second (and more important) way to think of this equilibrium state is that it is the final, or **stationary**, distribution of sunny and rainy days. That is, if you took the fraction of $\frac{\textrm{Sunny Days}}{\textrm{Total Days}}$, you'd expect it to approach $\frac{2}{3}$ as time went on, and $\frac{\textrm{Rainy Days}}{\textrm{Total Days}}$ to likewise approach $\frac{1}{3}$.
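The matrix powers above are easy to reproduce. This is a bare-bones sketch using plain lists rather than a linear algebra library, and `mat_mul` and `mat_pow` are just names I chose; using 100 steps to stand in for "far into the future" is also an arbitrary choice.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Row 1 / column 1 = sunny, row 2 / column 2 = rainy.
M = [[0.8, 0.2],
     [0.4, 0.6]]

print(mat_pow(M, 2))    # weather 2 days from now
print(mat_pow(M, 5))    # weather 5 days from now
print(mat_pow(M, 100))  # both rows approach the same distribution
```

By 100 steps both rows agree to many decimal places, which is the equilibrium state showing up numerically.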

To summarize, here are a few important concepts about Markov chains:

- A Markov chain is a random process that describes the ability to switch between multiple states.
- A Markov chain's probability for any future state depends only on the current state (this is also known as the Markov property).
- Each row of a Markov chain's transition matrix must sum to 1 (something has to occur at each time step for each state, even if that means not changing states).
- Many Markov chains, including well-behaved ones like our weather example, will eventually reach an equilibrium state that describes the final distribution of states over a long time.

Markov chains are extremely powerful tools to model dynamics with multiple states due to their above properties, but some of their uses from chaos to disease modeling deserve their own post another day.

If you understood this so far, you've got the hardest part of Markov chain Monte Carlo methods under your belt. That being said, we are still missing the second MC of MCMC.

**Monte Carlo simulations** are probably the closest you'll ever get to the scientific version of guess-and-check. The idea is that if there is something that's too hard to calculate, you do a bunch of mini, random experiments to obtain data that can give you numerical approximations. It's very akin to Bayesian thinking: the more data you give to your approximation, the better you can "update" your approximation to be more accurate and confident. As with all things, let's do a quick example.

If I hand you a coin, you probably would assume it's a fair coin: 50/50 chance for either heads or tails. But how could you verify that it is indeed a fair coin? Well you could flip it and see what it turns up as. Heads! "It must be an unfair coin as it flips heads 100% of the time!" said no one ever. Of course a single data point isn't nearly enough to draw any conclusions, so you need to flip it again. Heads again! Definitely weighted, right? Even if you get only heads twice in a row, that still isn't conclusive. You need to flip the coin a lot of times. By a lot, upwards of hundreds for a reasonable guess at the balance of the coin, and upwards of thousands for an ideal approximation. For all you know, those first 2 heads could be in a much larger sequence of flips you have yet to unfold:

`H-H-T-H-T-T-H-T-T-T-H-T-H`

Just like that, our coin gets significantly closer to that 50/50 split within just a few additional flips.

Each one of our data points was a flip in this case, and we call those data points **samples**. The important part to note, though, is that there is a sense of randomness in each sample. The idea behind a Monte Carlo simulation is that even if our sampling method is random, the samples we take will average out to the true value as we take more of them (think the Law of Large Numbers). This is why the more samples we take, the more accurate our estimations become. This is a lot like unbiased sampling in research studies: you can't reasonably survey everyone in a population, so you take a smaller, random sample in the hopes that it will be representative *enough* to make reasonable conclusions about the larger population.
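Here is a tiny sketch of the coin experiment. The function name, the `p_heads` parameter for a possibly weighted coin, and the fixed random seed (only there to make runs repeatable) are my own assumptions.

```python
import random

def estimate_heads_probability(num_flips, p_heads=0.5, rng=None):
    """Flip a simulated coin num_flips times and return the fraction of heads."""
    rng = rng or random.Random()
    heads = sum(rng.random() < p_heads for _ in range(num_flips))
    return heads / num_flips

rng = random.Random(0)
# More flips should generally give an estimate closer to the true value.
for n in (10, 100, 10_000, 100_000):
    print(n, estimate_heads_probability(n, rng=rng))
```

With only 10 flips the estimate can be well off, while with many thousands it hugs the true value much more tightly.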

Again, just to summarize a few details:

- Monte Carlo simulations use random sampling to get numerical estimations for hard to otherwise calculate results.
- The more samples/trials we take, the more accurate our results.
- While taking more samples is more accurate, it also becomes less efficient to compute and gather results, so you have to strike a balance between more accurate results and quicker results.

With all that out of the way, let's put it all together into one cool algorithm.

So far, we've sampled from things that are relatively easy to run trials on. Flipping a coin and rolling a die are nice to run trials on, as they both can be modelled by a nice uniform distribution (even for weighted dice/coins, by partitioning the uniform range). This is due to the niceness of a **discrete** distribution, where there is only a finite number of results our black box can output. Often, though, we have a **continuous** distribution, where we don't have probabilities for individual results, but rather for a range of results. To get the gist of it, take the uniform probability distribution on $[0,1]$. What's the probability that you pick $0.235326…$? Out of an infinite number of possibilities, the probability of picking a single, specific number is 0. BUT, the probability of picking a number in $[.25,.75]$ is exactly $.5$, as we're picking from half of our total range. This is the idea of **probability density**. So, you can imagine that more complicated distributions (especially those taken from real life data) can be a lot more difficult to get samples from, or to properly know the densities of their regions. Here's where our MCMC comes in.
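The $[.25,.75]$ claim is itself something we can check Monte Carlo style. A small sketch, where `fraction_in_interval` is a made-up helper name and the sample count and seed are arbitrary:

```python
import random

def fraction_in_interval(low, high, num_samples, rng):
    """Fraction of uniform [0, 1) samples that land inside [low, high]."""
    hits = sum(low <= rng.random() <= high for _ in range(num_samples))
    return hits / num_samples

rng = random.Random(0)
print(fraction_in_interval(0.25, 0.75, 100_000, rng))  # about half the samples
```

Asking for one exact number instead of a range would essentially never register a hit, which is the probability 0 idea in practice.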

**Markov chain Monte Carlo** methods combine two important aspects of the two concepts the name implies: a Markov chain's equilibrium distribution and a Monte Carlo simulation's random sampling. Here, we make a Markov chain whose stationary distribution is *equal* to our hard-to-model probability distribution by doing a random walk around the distribution (for the sake of notation, we'll call the "target" distribution we're trying to model $\pi(x)$). In this case, we do so with the Metropolis-Hastings algorithm, which is extremely simple:

1. Pick a starting point $x_0$: this is the start of our "walk". An initial sample, if you will, that we provide ($x_t$ means our current sample at time $t$).
2. Now pick a new, *random* point $y$. Call $y$ the "proposed state" for $x_{t+1}$.
3. See how "good" $y$ is compared to $x_t$.
   i. If $y$ is "better", we let $x_{t+1}=y$.
   ii. If $y$ is "worse", we *might* let $x_{t+1}=y$, but not always.
4. For $t=1,2,3,…$, repeat steps 2 and 3.
5. Profit.

This is extremely vague, but I intentionally left it as such, because oftentimes the formulas can obscure the idea. In essence, this is what Metropolis-Hastings does to generate samples. We take a sample $x_t$ at a time $t$ that "traces" our distribution, and as $t$ gets larger, our "trace" of the curve we walk around gets more accurate. Let's put some of the formulas back into the instructions above and go at it one step at a time.

**Step 1** is easy enough: we give any number for our algorithm to start with. Literally anything. You can give smart guesses that speed up the process, but that will be clear in a second.

**Step 2** we don't actually perform, but rather design. Unlike Step 1, where we gave some determined number of our choosing, in Step 2 we implement a **transition kernel** to pick a step for us. This kernel is a function $Q$ that takes a current spot $x$ and with some probability outputs a new spot $y$. That is, $Q$ is a distribution that randomly generates a new point $y$ *given* a current one $x$, which we will write $Q(y|x)$. This is how we make our "proposed state" and how we actually implement our walk. You may be wondering though, "What *actually* is $Q$?" Well, that's up to you to decide! Since $Q$ itself is a distribution around our current state $x$, you can shape $Q$ in whatever way you want! In general the exact choice is not too important, but spending time to design a specific kernel can optimize and speed up the process.

**Step 3** is our "goodness" check. Once we have a proposed state generated by $Q$, we need to see if this proposed state is in a more "likely" or dense spot on our distribution $\pi(x)$. The idea is we want to generate samples representative of $\pi(x)$, so it should be obvious that we should visit the probabilistically more dense spots, a.k.a. visit the spots the distribution says are more likely. Geometrically, this is a point *higher* on our distribution curve.

But remember, just because $y$ is not better doesn't mean that we outright reject it. We instead accept it with probability equal to the ratio of its height to our current height. If $y$ is half as high as our current spot $x$, we flip a coin and accept it with 50% probability. If $y$ is a third as high as $x$, we flip a weighted coin and accept it with probability $\frac{1}{3}$. In other words, we can write our acceptance probability as $A=\min(1, \frac{\pi(y)}{\pi(x)})$. If $y$ is higher than $x$, or $\pi(y)>\pi(x)$, then $\frac{\pi(y)}{\pi(x)} > 1$ and we accept it outright. If $\frac{\pi(y)}{\pi(x)} < 1$, then we accept it with probability equal to that fraction.
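Here is a small sketch of that acceptance rule with the same numbers as above. The heights passed in are made up for illustration, and the weighted coin is simulated with a uniform random number.

```python
import numpy as np

def acceptance(pi_x, pi_y):
    """A = min(1, pi(y) / pi(x)) for current height pi_x and proposed height pi_y."""
    return min(1.0, pi_y / pi_x)

half = acceptance(pi_x=0.4, pi_y=0.2)    # proposal is half as high
third = acceptance(pi_x=0.3, pi_y=0.1)   # proposal is a third as high
higher = acceptance(pi_x=0.2, pi_y=0.4)  # proposal is higher, so A is capped at 1

# Flipping the weighted coin many times: accept whenever a uniform draw falls below A.
rng = np.random.default_rng(0)
flips = rng.random(100_000)
accepted_fraction = np.mean(flips < third)  # should land close to 1/3
```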

This acceptance probability is also what makes this algorithm so good: we only need to know our target distribution $\pi(x)$ up to a constant! If $\pi(x) = c\cdot P(x)$, then our acceptance probability would be $A=\min(1, \frac{c\cdot P(y)}{c\cdot P(x)})$ which simplifies to $\min(1, \frac{P(y)}{P(x)})$, making the constant irrelevant. This is ideal for real life experiments as perfectly measuring constants from observation can be very difficult.
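As a quick sanity check, the sketch below compares the acceptance probability computed from a normalized curve and from the same curve with its constant dropped. I'm assuming the curve $e^{-|x|}$ with $c=\frac{1}{2}$ here; the points `x` and `y` are arbitrary.

```python
import numpy as np

def P(x):
    """Unnormalized curve: e^{-|x|}."""
    return np.exp(-abs(x))

def pi(x, c=0.5):
    """Normalized target: c * P(x)."""
    return c * P(x)

def acceptance(f, x, y):
    """A = min(1, f(y) / f(x)) for whichever curve f we are given."""
    return min(1.0, f(y) / f(x))

x, y = 0.3, 1.7  # arbitrary current and proposed spots
a_normalized = acceptance(pi, x, y)
a_unnormalized = acceptance(P, x, y)  # same value, since c cancels in the ratio
```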

**Steps 4 and 5** are pretty self-explanatory, so just to rewrite it more formally, here is the whole algorithm one more time:

- Pick a starting point $x_0$.
- Sample a new proposal state $y$ with probability $Q(y|x_t)$.
- Compute $A=\min(1, \frac{\pi(y)}{\pi(x_t)})$.
  i. With probability $A$, accept our proposed state and let $x_{t+1}=y$.
  ii. Otherwise, reject it and stay put, letting $x_{t+1}=x_t$.
- For $t=1,2,3,…$, repeat steps 2 and 3.
- Profit.

However, I must admit, I did lie to you, but only a *little* bit. The acceptance probability I gave is actually for the Metropolis algorithm, not the Metropolis-Hastings algorithm. The acceptance probability for the Metropolis-Hastings algorithm is $A=\min(1, \frac{\pi(y)Q(x_t|y)}{\pi(x_t)Q(y|x_t)})$. This is because the Metropolis algorithm only works when $Q$ is a symmetric distribution, meaning that $Q(y|x_t)=Q(x_t|y)$, in which case the $Q$ terms cancel and we return to our familiar fraction from before. MH allows asymmetric kernels to speed up the algorithm, but otherwise the concept is the same.
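To see the extra $Q$ terms doing their job, here is a rough sketch with a deliberately lopsided kernel: a normal distribution that always leans to the right by a fixed `DRIFT`. The kernel, the constants, and the target $\frac{1}{2}e^{-|x|}$ are all assumptions picked for illustration. Without the correction the samples would be dragged to the right; with it, they should still center on 0.

```python
import numpy as np

rng = np.random.default_rng(1)
DRIFT, STEP = 0.5, 1.0  # hypothetical kernel settings

def target(x):
    """Assumed target pi(x) = 0.5 * e^{-|x|}."""
    return 0.5 * np.exp(-abs(x))

def q_density(y, x):
    """Q(y|x): normal centered at x + DRIFT, so Q(y|x) != Q(x|y)."""
    return np.exp(-0.5 * ((y - x - DRIFT) / STEP) ** 2) / (STEP * np.sqrt(2 * np.pi))

def metropolis_hastings(iterations, x0=1.0):
    states = np.empty(iterations)
    current = x0
    for t in range(iterations):
        states[t] = current
        proposal = rng.normal(current + DRIFT, STEP)
        # Full MH ratio: pi(y) Q(x|y) / (pi(x) Q(y|x))
        ratio = (target(proposal) * q_density(current, proposal)) / (
            target(current) * q_density(proposal, current)
        )
        if rng.uniform() < min(1.0, ratio):
            current = proposal
    return states

samples = metropolis_hastings(50_000)
```

For this target the true mean is 0 and the true variance is 2, so `samples.mean()` and `samples.var()` give a quick check that the correction works.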

With 5 very simple steps, we are able to take samples from continuous distributions just like that! The Monte Carlo aspect is pretty obvious: we generate random "proposal states" $y$ in **Step 2**. The Markov chain might be a bit more concealed, as we never actually explicitly define it. But look at **Step 3** again, as that resembles something very close to our transition probabilities from before. Step 3 is actually our Markov chain *implicitly* defined! Since there are an infinite number of states/values to pick and another infinite number of states to transition to, we can't write down an infinitely sized transition matrix. So, instead, we define transition probabilities as needed with our kernel $Q$ and our acceptance probability $A$. And notice, our kernel maintains the Markov property, as each proposed state only relies on the current one. The twist is that we sort of reversed the way we defined our Markov chain! In our weather example with sunny and rainy days from above, we defined transition probabilities and the stationary distribution followed suit, almost like a property or characteristic of our Markov chain. Here, our Markov chain is instead defined by the fact we want our stationary distribution to mimic $\pi(x)$. This is why we don't outright reject states that are less "good" in our acceptance probability, but rather accept them with a probability that reflects how much less "good" they are, as that will reflect our distribution's shape.
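One way to peek at this implicit chain is to shrink the problem until the transition matrix fits on a page. The sketch below is a simplification of my own: it assumes only 21 integer states from $-10$ to $10$, a kernel that proposes the left or right neighbor with probability $\frac{1}{2}$ each, and the curve $e^{-|x|}$ as the target. It then builds the matrix $M$ from the acceptance rule and checks that the target is its stationary distribution, just like in the weather example.

```python
import numpy as np

xs = np.arange(-10, 11)          # assumed toy state space
weights = np.exp(-np.abs(xs))    # unnormalized target heights
pi = weights / weights.sum()     # normalized target

n = len(xs)
M = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):     # symmetric kernel: propose a neighbor
        if 0 <= j < n:
            M[i, j] = 0.5 * min(1.0, weights[j] / weights[i])
    M[i, i] = 1.0 - M[i].sum()   # rejected proposals keep us where we are

one_step = pi @ M                             # should equal pi again
many_steps = np.linalg.matrix_power(M, 2000)  # every row should approach pi
```

Nobody told $M$ what its stationary distribution should be directly; it falls out of the acceptance rule.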

But just like in our original Markov chain example, it's not perfect immediately. Notice in our original weather example with sunny and rainy days, 2 iterations with $M^2$ was nowhere near our stationary distribution, and while 5 iterations with $M^5$ was closer, it still was nowhere near ideal. You have to *burn in* some states, meaning you run the chain for a while and throw those early states away, before proper, accurate samples can be generated.
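Here is a rough sketch of what burn-in looks like in practice. I'm assuming the target $\frac{1}{2}e^{-|x|}$, a normal kernel, a deliberately terrible starting point of 50, and a burn-in length of 1000; all of these are illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(2)

def target(x):
    """Assumed target pi(x) = 0.5 * e^{-|x|}."""
    return 0.5 * np.exp(-abs(x))

def metropolis(iterations, x0):
    states = np.empty(iterations)
    current = x0
    for t in range(iterations):
        states[t] = current
        proposal = rng.normal(current, 1.0)
        if rng.uniform() < min(1.0, target(proposal) / target(current)):
            current = proposal
    return states

chain = metropolis(30_000, x0=50.0)  # start far out in the tail on purpose
BURN_IN = 1_000                      # hypothetical number of states to discard
early = chain[:100]                  # still wandering in from the bad start
kept = chain[BURN_IN:]               # what we would actually use as samples
```

The mean of `early` sits far from 0 because the walk hasn't reached the bulk of the distribution yet, while the mean of `kept` should be close to 0.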

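Here's some short Python that implements the algorithm to sample from the Laplace distribution $\pi(x)=\frac{1}{2}e^{-|x|}$. The kernel used is a normal distribution, which is symmetric, so the simpler Metropolis acceptance probability is all we need.

Here it is in only 20 lines of code: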

```python
import numpy as np
import matplotlib.pyplot as plt

def target(x):
    return .5 * np.exp(-abs(x))  # Target distribution π(x)

def accept(p):
    flip = np.random.uniform(0, 1)
    return p >= flip

def metropolis(iterations):
    states = []  # Samples generated by the algorithm
    # Step 1 --> initialize an x0
    current = 1
    for i in range(iterations):
        states.append(current)
        # Step 2 --> Q generates a proposal (normal distribution)
        proposal = np.random.normal(current, 1)
        # Step 3 --> Check how good our proposal is
        goodness = min(1, target(proposal) / target(current))
        if accept(goodness):
            current = proposal  # If we like the proposal state, we jump there!
    return states
```

Here is the scatter plot of our algorithm walking all around $\pi(x)$ across 10000 iterations...

...and here is the corresponding histogram that fits almost too perfectly to our target distribution.
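If you'd rather check the fit with numbers than by eye, one option is to compare the histogram heights against the target at each bin center. This is a sketch under my own assumptions (a seeded generator, 20 bins on $[-5, 5]$, and a re-typed copy of the sampler so the block runs on its own), not the code that produced the plots.

```python
import numpy as np

rng = np.random.default_rng(3)

def target(x):
    """Target pi(x) = 0.5 * e^{-|x|}; works on scalars and arrays."""
    return 0.5 * np.exp(-np.abs(x))

def metropolis(iterations, x0=1.0):
    states = np.empty(iterations)
    current = x0
    for t in range(iterations):
        states[t] = current
        proposal = rng.normal(current, 1.0)
        if rng.uniform() < min(1.0, target(proposal) / target(current)):
            current = proposal
    return states

samples = metropolis(100_000)

# Histogram heights normalized to a density, compared bin by bin with pi(x).
density, edges = np.histogram(samples, bins=20, range=(-5, 5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_gap = np.max(np.abs(density - target(centers)))  # small if the fit is good
```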

We can now generate discrete samples proportional to our continuous distribution!

The algorithm aside, an extremely important concept is shown here: reframing questions and objects and asking them from a different perspective can lead to extremely powerful tools and thoughts. We take a Markov chain, and instead of letting its equilibrium state arise as a property, we turn our definition inside out and use the equilibrium state itself to define the Markov chain. This pattern of rethinking concepts has always been useful, be it for building intuition while learning or for defining tools in all of math. From asking why the Mandelbrot set connects to its cardioid and cycloids, to what encoding parameters in 4-dimensional space means, to even Fourier rebuilding functions from sine waves, the most impactful question one can ask is usually in the form of, "What if?"

]]>