How to (Randomly) Write Shakespeare

I.

Take an immortal monkey and have it bang on a typewriter from now until the end of time, and eventually the monkey will produce the complete works of Shakespeare. The saying goes something like that. Humans aren't really good at imagining things that are infinite, so while the saying appeals to some intuition we have about rare events eventually happening given a long enough time frame, it's hard to imagine how randomly typing on a keyboard would eventually produce a coherent work of art. Even after an infinite amount of time.

You can read Wikipedia about this theorem, and almost surely you'll find a bunch of gobbledy-gook and precise computations that are meaningless to the casual reader. "Okay, blah-blah-blah, the theorem is true, I'm getting a Starbucks." I can relate; much of my undergrad felt that way. Sorry, casual reader, this post is more gobbledy-gook, but I think it's intuitive gobbledy-gook.

We can reframe the infinite-monkey theorem as follows: Consider a random string of n letters, where n is a positive integer. Let p_n be the probability that the n letters contain a substring of consecutive letters that spells out some phrase we care about. For example, take n = 10, and let our phrase be the word "fun"; then p_{10} is the probability that 10 random letters spell out the word "fun" somewhere inside. We can see pretty clearly that if n < 3, then p_n = 0, since there aren't enough letters to actually spell our word "fun". But when n \geq 3, p_n is meaningful, since there is a non-zero probability this could happen: there is at least one way a random sequence of 10 letters could spell out "fun" somewhere in the middle. In other words, if you give your toddler a bunch of Scrabble tiles and ask her to spell out some random words for grandma, there is a non-zero probability your toddler is about to embarrass you.

So, this lets us conclude that for n \geq 3, we have p_n > 0. However, this probability is very small. To see this, let's think about how many different strings of length 3 we could make. Since there are 26 letters in the alphabet, we have 26 choices for the first letter, 26 for the second, and 26 for the third, giving 26^3 = 17576 total strings. But there is only one way to spell "fun", so for n = 3 the probability ends up being p_3 = \frac{1}{17576}. This is basically a lottery.

However, if n = 4, then we can spell "fun" in several ways. We could spell it as `Xfun', or `funX', where X is just some random letter. How many total ways can we do this? Well, for `Xfun', the first letter is irrelevant, so there are 26 possible choices there. Then there is only one choice for the second letter, one choice for the third letter, and one choice for the last letter (`f', `u', `n', respectively). So we end up with 26 \cdot 1 \cdot 1 \cdot 1 = 26 ways to do this. In the second case, where X comes last, we also have 26 ways to spell `fun'. These two cases can't overlap (the string can't spell "fun" starting at both the first and the second letter), so we can just add them: 26 + 26 = 52 possible ways to spell `fun' out of 26^4 possible strings. So the probability ends up being p_4 = \frac{52}{26^4} = \frac{2 \cdot 26}{26^4} = \frac{2}{26^3} = \frac{2}{17576} > p_3.

Huh.
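If you don't trust the counting, the numbers here are small enough to check by brute force. A quick sketch in Python (my own check, not part of the original argument), enumerating every string of length 3 and 4 over a 26-letter alphabet:

```python
from itertools import product
from string import ascii_lowercase

def p_exact(n, phrase="fun"):
    # exact probability that a random n-letter string contains `phrase`,
    # found by enumerating all 26**n strings (only feasible for tiny n)
    hits = sum(phrase in "".join(s) for s in product(ascii_lowercase, repeat=n))
    return hits / 26 ** n

print(p_exact(3))  # 1/17576  ~ 5.69e-05
print(p_exact(4))  # 52/456976 = 2/17576  ~ 1.14e-04
```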

So even though the increase in probability is minuscule, there is still an increase. Now if we had a monkey typing for an infinite amount of time, we are essentially letting n \rightarrow \infty. Does it now follow that p_n \rightarrow 1? In other words, if we let n get bigger and bigger, does the probability we see the word `fun' appear somewhere in the n-string get better and better?

Actually, yes.

Intuitively we might see why this is true. The larger n is, the more ways we can have the word `fun' appear somewhere in our string, which is pretty fun itself. But does it work for a phrase of arbitrary length? Can it? Hamlet is on the order of 130,000 letters; does it even make sense to talk about this?

Yes. We can prove this.

II.

We will prove that if we have some fixed string of m letters (we’ll call this a phrase), and a random string of n \geq m letters, then the probability the n letter string contains the fixed sub-string of m letters converges to 100% as n is allowed to grow to infinity. That is, we will show that p_n \rightarrow 1 as n \rightarrow \infty, where p_n is the aforementioned probability.

Let m be a positive integer and consider a fixed phrase of m letters in some order we care about. Define p_n as the probability that a random string of n \geq m letters contains a sub-string of m letters spelling out our desired phrase. We know that p_n > 0, at least, but we can also show that it is an increasing function of n: if we have n + 1 letters, the phrase could appear in the first n letters with probability p_n. If it does not appear in the first n letters, then the last letter of the phrase would have to be the (n + 1)th letter, which happens with some probability q_n > 0. Since these events are mutually exclusive, we add them and get p_{n+1} = p_n + q_n > p_n.

So, our sequence of probabilities \{ p_n \} is an increasing sequence, and it is bounded above by 1. From elementary real analysis, an increasing bounded sequence converges, so if we can find a sub-sequence contained in \{ p_n \} which converges to 1, then the entire sequence converges to 1.

What kind of sub-sequence does this? Consider p_{mn}, the probability of the desired phrase appearing in a string of length m \cdot n. If a is the probability that a randomly selected string of length m does not contain our desired phrase, then a < 1. We can split up our string of length m \cdot n into n blocks of length m, and if the phrase appears in at least one of these blocks, it certainly appears in the full string, so

1 - a^n \leq p_{mn} \leq 1.

Since a < 1 , letting n \rightarrow \infty implies a^n \rightarrow 0 . By the squeeze theorem p_{mn} \rightarrow 1 as n \rightarrow \infty , but \{ p_{mn} \} was a sub-sequence of \{ p_n \} , so p_n \rightarrow 1 .
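If you'd rather watch the convergence than prove it, here is a rough Monte Carlo sketch that estimates p_n for the phrase `fun' at a few values of n. The exact rate of convergence depends on the phrase, but the upward march toward 1 is easy to see:

```python
import random
from string import ascii_lowercase

def estimate_p(n, phrase="fun", trials=1_000, seed=0):
    # Monte Carlo estimate of the probability that a random n-letter
    # string contains `phrase` somewhere as a substring
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        hits += phrase in "".join(rng.choices(ascii_lowercase, k=n))
    return hits / trials

# the estimates creep upward toward 1 as n grows
# (the largest n takes a little while in pure Python)
for n in (100, 1_000, 10_000, 100_000):
    print(n, estimate_p(n))
```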

III.

Notice that we actually never specified m. It was left as some arbitrary integer. If m = 130,000 , the proof still works, so we’ve actually proved that our immortal typing monkey can write Shakespeare on a long enough time frame. In fact, the monkey can type any particular text of finite length (if m = \infty , then it’s not clear what happens next).

This proof is a generalization of an exercise I saw in the (excellent) book A Walk Through Combinatorics. Turns out counting is pretty fun. Who knew?

 

***

  1. Even though the book is eventually written in the limit, the probabilities for any finite n are extremely small. We probably can't expect it to happen, even if we had an astronomical number of monkeys typing until the heat death of the universe. Still neat, though.

Queer uses for Probability Theory, or, My Friend Might Be Sick So I Tried To Show Her Bayes’ Theorem

My friend tested positive for Tuberculosis. She is young, a vegetarian, and exercises every day. She has good grades and wants to go to medical school. And now she’s going to die.

Actually, probably not: TB is pretty treatable, but it certainly doesn't look like any fun. Still, her anxiety over the positive test is natural. I don't know if probability is good for soothing anxiety, but given her positive test, what is the probability she actually has Tuberculosis? How worried should she be?

Getting tested for Tuberculosis is apparently standard protocol for health care employees, and was a required test for my friend as she was just accepted to a medical internship. Required medical tests are a useful screening process, because obviously you can’t let a sick person work with people who may have compromised immune systems. Apparently Tuberculosis kills lots of people with compromised immune systems and can even live inside the body in a latent form, producing no symptoms until it “activates” and tries to kill you (how many biology classes do you think I’ve had?).

But, sometimes tests fail, and when you apply medical tests to people who are healthy, you occasionally get a false positive. Even low false positive rates can imply that, given a positive test, the probability you actually have the disease can be fairly low. This is a result of Bayes’ Theorem, which has the form P(A|B) = \frac{P(B|A)P(A)}{P(B)}. Let TB be the event of contracting Tuberculosis, and let \{ Positive \} be the event of a positive test. Then, Bayes’ Theorem looks like

P(TB|\text{Positive}) = \frac{P(\text{Positive}|TB)P(TB)}{P(\text{Positive})} = \frac{P(\text{Positive}|TB)P(TB)}{P(\text{Positive}|TB)P(TB) + P(\text{Positive} | \text{No TB}) P(\text{No TB})}

If we want to apply Bayes' Theorem to my friend's situation, then we need to know what probabilities to enter into the denominator of the above expression. This means we want to know the probability of a positive test result when someone has TB and when they don't, which essentially means we need to know the Sensitivity and Specificity of the test.

The test my friend was given here was apparently the QuantiFERON-TB Gold, which is a blood test for Tuberculosis. According to this resource for health professionals, the Specificity of the test is >99\% for “low risk individuals”, with a Sensitivity of 92\%. I.e., the test has a “true negative rate” of greater than 99% and a “true positive rate” of about 92%. That means, if you actually have the disease, 92% of the time it will be correct, and if you don’t have the disease, 99% of the time it will be correct.

I would probably categorize my friend as a “low risk individual”. She’s a white, US-born, middle class female, which is the demographic with the lowest risk for just about anything you can think of except, perhaps, buying Starbucks while wearing Uggs (an affliction she did not escape, sadly.)

All that remains is to know the probability of contracting TB in the US for her demographic. California seems to have a slightly higher rate of TB than the rest of the country, with a reported case rate of nearly 6 per 100,000 people. Case rates aren't probabilities, and my prior belief is that if there is a flat probability for contracting TB in California, it's higher than this. Also, apparently TB is complicated, and it can lie dormant inside people for years before activating. But we aren't necessarily interested in the probability of active TB, just TB or not TB. Apparently 1% of the world population is infected each year, P(TB) = 0.01, but the world is largely made up of developing countries, and people in those countries make up the majority of the population that gets TB. We're in the US, so our rate is likely lower and we need to adjust our prior accordingly.

If a probability of \frac{6}{100000} = 0.00006 is too low, but a probability of 0.01 is too high for the US, maybe the answer is somewhere in between, say 0.0005 to 0.005. For now, put P(TB) = 0.001.

So now we can use Bayes' Theorem and calculate the probability my friend has TB. We'll use a Sensitivity of 92% and a Specificity of 99.5% (as indicated by the statistics on the QuantiFERON test). So we have a True Positive Rate of 92% and a False Positive Rate of 0.5%. By Bayes' Theorem,

P(TB|+) = \frac{P(+|TB)P(TB)}{P(+|TB)P(TB) + P(+| \text{No TB}) P(\text{No TB})}  = \frac{ 0.92 \times 0.001}{0.92 \times 0.001 + 0.005 \times 0.999} \approx 15.6\%
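If you want to play with the numbers, here is a small sketch of the same calculation. The sensitivity, specificity, and prior are the ones assumed above, so swap in your own:

```python
def posterior(prior, sensitivity, specificity):
    # P(TB | positive test) via Bayes' Theorem
    false_positive_rate = 1.0 - specificity
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1.0 - prior))

print(posterior(0.001, 0.92, 0.995))   # ~0.156, the 15.6% above

# sweep the prior, since we aren't confident in any single P(TB)
for prior in (0.00006, 0.0005, 0.001, 0.005, 0.01):
    print(prior, round(posterior(prior, 0.92, 0.995), 3))
```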

I would probably bet about $50 that she does not have TB, but the above probability is high enough that I would bet $0 that she can get away with not going to a doctor.

Here’s a plot of the probability she has TB, as a function of the probability of TB. Since we are not confident in a particular probability of contracting TB, we might as well look at all of them.

[Figure: probability of TB given a positive test, plotted as a function of P(TB)]

The blue line and point indicate which value of P(TB) we used to calculate our example above. Note that if the probability of contracting TB is 1%, then my friend would have nearly a 65% chance of having TB!(!!)

That’s not a bet I would take.

In any case, she'll probably be fine, but if she does end up having TB, it does seem like she's exceedingly unlucky.

Book Review: Principles of Mathematical Analysis (Baby Rudin)

Also known as Rudin’s Infamous Tiny Torture Box.

Trigger Warning: Baby Rudin.

I.

For my undergrad, I majored in Applied Mathematics at a university that was fairly well known for its mathematics department. One of the constants in "Applied Mathematics" is the use of calculus for a variety of applications. "The real world". In statistics and machine learning, you might be trying to maximize a likelihood function to obtain parameters that describe your data. In economics, you might be trying to maximize your ever-coveted utils. I have no idea what is maximized or minimized in real physics, so I won't pretend to know, but I do know that at least in Newtonian physics, calculus is good at describing physical phenomena, since I am pretty sure people have exams that involve throwing balls and taking derivatives.

Since I was an Applied Mathematics major, I not only needed to know calculus, I needed to know the proofs behind calculus. Real Analysis is the study of the real number line from a mathematical point of view. I.e., limits, differentiation, integration. I.e., Calculus. I.e., Calculus where all you do is proofs. This implies that Real Analysis was a major milestone of my (and others') undergraduate mathematics career. As one of my favorite professors once said: "This class is serious. You're all adults now." Real Analysis is a trial that every mathematics major goes through in some form, and because it lays the foundations for much of the applied sciences, Physicists, Statisticians, and Economists often go through it as well. I think once in a while, even a computer scientist will take the course.

Analysis can come in many different flavors. Elementary Real Analysis usually refers to Real Analysis without the concept of a measure (as related to this blog’s title), and Elementary Real Analysis is usually what makes up the content of an undergraduate course. There are many, many books that teach Real Analysis, and many diverge in philosophy. Conversely, many of these books seem to converge to the same few topics in the same order.

The Real Line. Metric Spaces. Sequences. Continuity, Differentiation, Integration, Sequences of Functions. These topics in this order are practically canonical, and while some books may have a permutation of the above list, the list itself remains the same. No small part of that is due to a now infamous “introductory” book on Mathematical Analysis. A book that has earned the name “Baby Rudin” for being the smallest of the books published by the late Walter Rudin on the art and theory of Mathematical Analysis.

This isn’t a review on the book. This is a review on what it was like to learn analysis from the book.

II.

In hindsight, I was thrown into analysis perhaps a bit too quickly, and I struggled greatly with the subject. I went from non-rigorous calculus to a single 10 week proofs course to Baby Rudin. “A talented high schooler could handle Baby Rudin” you might think. Well it just so happens that I am not and was not a talented high schooler.

Rudin's first two chapters seemed impenetrable. Looking back now, it was straightforward, but there is a sudden jump in abstraction when going from calculus to real analysis. The limit of a sequence is something tangible, just a bunch of points getting closer together. A continuous function is just "a line you can draw without lifting your pencil". A differentiable function is simply "smooth". But these ideas almost seem to be thrown out the window, and instead you discuss things like supremums and infimums. Metric Spaces. What even is a metric space? If you ask this question you might be thrown examples which don't seem to answer the question. "Consider the trivial metric space in which every point has a distance of 0 with itself, or a distance of 1 with every other point. It satisfies the criteria for a metric space." Whoa man, slow down, I'm still on the triangle inequality.

"I have to teach myself all this stuff? Why does no one explain anything? Why is the class so fast paced?" Part of what makes mathematics difficult is not just the level of abstraction and the true difficulty of logical, coherent thought, but the pace of the material. People who are "good at math" aren't simply good at math. They digest new information quickly, and can feel logic almost intuitively. After an exam, I remember being told by a fellow classmate – "I have no idea what question 7 was about. But I knew that my answer was correct." He's taking graduate courses now.

It's not an accident. Part of developing mathematical maturity is learning how to swallow abstract definitions and theorems and provide proofs for things with no intuitive understanding of the objects in question. The idea is that one learns how to think completely logically, rigorously. Knowing what an object "really is" is not required, and might even be discouraged when one is learning the ins and outs of rigor. You might argue that intuition is crucial to the developing mathematician, but why are you so sure? Rudin specifically demands a kind of abstracted thinking – there are no pictures, there are no explanations on "how" to think about anything. There are just definitions, theorems, and proofs, and a couple of examples of special cases. What you are learning is not what continuity is, but how to prove things with it, and about it, while not understanding it. The ignorance of the objects here is not a bug, it's a feature. Over time, as one works with these objects repeatedly, one can determine their own way of thinking about things. I remember asking a friend of mine who was far more mathematically developed than I was for help on how to understand some Linear Algebra theorems. He shook his head and politely refused. "I have my own way of thinking about it. It's hard to explain." Indeed, math is hard to explain.

While every new, difficult course presents some growing pains, there really is nothing like seeing actual rigorous mathematical analysis for the first time. The rumor I’ve heard is that when Rudin wrote his little book, it was actually a set of lecture notes that were intended to bridge the gap between Calculus and serious mathematics. But “Calculus” as we know it now is nothing like Calculus “back then”. The rumor goes that when Rudin first started teaching out of his book, the students in his class were all pure mathematics majors, and had been doing delta-epsilon proofs already for a year, were familiar with logical statements and their proofs, and had some grasp of real line topology. So the abstraction of the real line topology to the metric space topology was no big deal. The ominous, unfamiliar statement “Consider the trivial metric space…” was taken in stride.

Meanwhile, you have no idea what an infimum is, and your professor starts every sentence with Let or (if he is cultured) Suppose that…

"Why do I put up with this? How do they get away with it? We don't learn calculus like that any more, and in fact I'm an engineering student! Our teacher is terrible and doesn't explain anything!" You're being hazed. Our similarity scores are probably closer than you think, but the real answer is that Rudin is going to teach you everything you need to know, while simultaneously not being intended as a lesson in metric space topology, sequences of functions, or continuity. While you will learn those things, somewhat, doing Rudin is more about teaching you how to be alone in a room with yourself, how to discover "tricks", and how to prove properties about things while not understanding them. Your professor is not actually a teacher, he is more of a guide who doesn't speak the native tongue, and the map is Rudin. It's really up to you to discover what is buried beneath its terse exposition.

III.

It can be pretty difficult to read mathematics. Personally, I can't do it the way some of my friends can, which is where they literally read the book like a novel and somehow glean the information they need. I suspect they have a higher IQ than I do, or at least fewer ADHD traits. I don't know how to check this, but if I had to guess, my friends are outliers in the distribution and the rest of us unfortunate souls are a little closer to the center.

Really, this is fine. Part of developing mathematical maturity is developing your own way of learning it, and thinking about it. The point of mathematical maturity isn’t to memorize all the definitions and theorems so you can call on them whenever you need to, but to learn how to learn math. Everyone has their own way, but I would argue that there are some very general ways to learn math that work for most people. Paul Halmos said something like “Don’t just read it, fight it!”, but what on Earth does that even mean?

There is a way to read Rudin that probably works for most people, and I think this is it. Certainly, I cannot take credit for this idea, which originally came from a rather hilarious Amazon review of Rudin, but I did try it and it did work, for the most part.

It goes something like this. “Open the book at page one. Read until you come to the word ‘theorem’. Write the statement of the theorem down on a piece of paper. Don’t read the proof. Now close the book. Prove the theorem. At least, try to prove the theorem. You will get stuck. Open the book and read the first line of the proof related to where you are stuck. Close the book and continue from there. Repeat until the theorem is eventually proved. Repeat this process for the first 8 chapters. Enjoy grad school.”

Here is a slight modification that can help if the above is too hard: read the proof once, quickly. Close the book and then reproduce it in full detail, omitting no steps, no matter how trivial. This may seem like cheating compared to the above, but Rudin is terse enough and skips enough steps that it actually still works. My priors indicate that a person's memory is not so good that they can reproduce a fully rigorous proof of a theorem after seeing the real proof only once. But it does sort of prime your brain so that you have at least seen what the main steps are, and can try to link each step together with meticulous detail.

Once you complete a chapter like this, if you have mostly understood what you have read and reproduced, you should be able to tackle some of the exercises. They will be too hard when you're starting out, but a good metric for whether you are making progress is to keep track of how many you can complete before you have to look something up. Personally, I could solve no exercises from Rudin until about chapter 3, at which point I could usually solve 1, or 2. Sometimes 3. By the time I got to chapter 7, I could usually solve the first 10 by myself if I worked pretty hard. By the time I got to chapter 11, I could do probably 3/4ths of the exercises in the chapter, with the larger constructive proofs still evading me. I would peg this at about a B-grade level of understanding.

Since you will likely need help on the exercises, try to approach them the way you approached the reading. Really do try to write them out without help, and spend a good amount of effort doing this. Learning math is about exposing yourself to the feeling of “Oh, NOW I see how that works”  as much as possible. Simply copying answers won’t do this for you, but getting stuck for a good amount of time on a problem, then looking at an answer, or asking for help usually does. “Ah, so I was supposed to use theorem 6.17 here”.

The result of all of this work is that you won't be scared of notation, even if it's unfamiliar, and you will understand how a proof should look, and when a step is valid or invalid. You might also learn how to think of your own proofs, but this is an ongoing skill. The point isn't so much to be able to prove theorems, it's to be able to understand them, and to understand how mathematics works as a formal system.

Oh, and do more exercises than assigned if at all possible. I didn't for a long time, until I realized that my peers who were better than me generally did all the exercises. They didn't work that hard on the supplementary exercises, but they at least did them and understood the intended solutions. This can be a nice "secret weapon" for exam preparation. Alternatively, solve exercises out of a different book at a slightly lower level (Rudin is hard enough; to go harder is to go graduate level, and graduate level means measure theory).

IV.

“Anyone who survives a year with Rudin is a mathematician”. Well, that is a nice badge of honor, but I am not sure I agree. Maybe a better statement is “Anyone who survives a year with Rudin will no longer be afraid of math, understand coherent thinking, and what it really means to prove something.” I would say that “mathematician” implies that someone can come up with original proofs to new problems, which Rudin probably won’t teach you how to do. That skill is ongoing. But, I agree that “Mathematician” certainly sounds better than a disjointed sentence of personal qualities.

At the end of my year with Rudin, I felt almost exactly this way. Math was no longer scary, just difficult. I could come up with proofs more quickly and easily than my peers who had not slaved away with me through a year of the blue book. Picking up a new mathematical skill came more quickly and easily, and statistics and probability were suddenly a lot more tractable, and not because I was suddenly a lot better at calculus.

Make no mistake, a year with Rudin is harrowing, for even if you try to cheat your way through it, there are always exams, and those do demand some understanding and originality (under time pressure, no less).

“THERE ARE BETTER BOOKS OUT THERE”.

Maybe. Abbott's book is pretty good. I am not sure why people still recommend Apostol (too old school and wordy). But terse definition-theorem-proof mathematics never goes out of style. And Rudin does teach you a lot of things. It teaches you how to think, how to prove, how to work hard, how to read a dense book, and how to ask questions. It teaches you whether mathematics is really for you. It teaches you focus, and dedication. It even teaches you how to be alone, the yellow hue of your desk lamp shining on your scratch paper, illuminating your room.

Principles of Mathematical Analysis (Walter Rudin): 4.5/5

-0.5: too much personal suffering

Your hypothesis test sucks inside the Garden of Forking Paths

Trigger warning: I don't do a lot of editing, since it's my least favorite thing, so I apologize for the stream-of-consciousness flow of this essay. Also, at nearly 5000 words, the essay is long. Also, if I do say so myself, I think sections I and II are weaker than III and IV.

 

I.

Andrew Gelman, a statistician and social science researcher at Columbia (but a statistician by training – check his CV), doesn't like hypothesis testing. And after reading some of what he has to say about the subject, coupled with my own experiences, I am inclined to agree.

When I took undergraduate mathematical statistics, the course primarily consisted of defining some notions from probability theory (such as expectation, distribution functions, and convergence in probability, which is just a fancy way of saying convergence in measure, with a particular measure), then giving a slightly too vague definition of an estimator, showing what consistency, (un)biasedness, and sufficiency were, and then deriving a bunch of simple statistical tests, such as the two-sample t-test, confidence intervals, some non-parametrics, etc.

I also remember mostly being fascinated with non-parametric statistics. Why? Few assumptions were needed in order to apply them. Consider the Mann-Whitney U-test (hereafter the rank-sum test) or the Wilcoxon signed rank test. In particular, the Mann-Whitney test has paper-thin assumptions regarding your data. You need independent observations from your groups (every statistical test ever) from an underlying continuous distribution (no problem, measurements are often of a continuous nature), you need ordinal observations (this is immediate from the assumption of a continuous distribution), and you need a well-defined null and alternative hypothesis, just like in every other statistical test you would use.

The rank-sum test is closely related to the independent sample t-test. Not only that, it is actually more *robust* than the t-test when the assumptions of the t-test are not adequately met. And the rank-sum test can also be used to test the homogeneity of distributions, because it is sensitive to a difference in variances between samples. While this information is perhaps less precise than a simple analysis of means between groups, it can be quite telling. If the distributions are significantly different between samples, then it's likely that the "treatment" had an effect. This test seemed so magical at the time that I couldn't help but wonder: why doesn't everyone use this test? Why is the t-test so popular? You don't need assumptions of normality, and you don't need a "sufficiently large sample" for the central limit theorem (CLT) to hold – whatever that means.

[NB: This actually does have a meaning, but, at my level of training, I haven't seen a formal argument for it. Here is a (the?) heuristic one: the CLT, in order to hold, requires that your sample size n \rightarrow \infty. This is actually *never* true, because we do not have infinite sample sizes. The big joke in statistics is that, as a rule of thumb, once n > ~30 or so, that's large enough for the CLT to hold, and you can just apply tests that require it. Does no one see the madness behind this? Moreover, the CLT approximation is not very robust to *highly skewed* data, which means that the "rule of thumb" actually doesn't even work if you use a data set that is, I dunno, from any actual population? If your data is from real-life people, or observations, chances are it will not be nice and normally distributed. Skew hurts every test, including the t-test, and if it's bad enough, no amount of praying to the central limit theorem will fix that.]
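To make the aside concrete, here is a rough simulation sketch (my own toy example, not from any textbook): test a true null hypothesis with a one-sample t-test at n = 30 on heavily skewed lognormal data, and check how often the nominal 5% test actually rejects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials, alpha = 30, 20_000, 0.05
true_mean = float(np.exp(0.5))   # the actual mean of a lognormal(0, 1)

rejections = 0
for _ in range(trials):
    x = rng.lognormal(mean=0.0, sigma=1.0, size=n)    # heavily right-skewed
    _, p = stats.ttest_1samp(x, popmean=true_mean)    # the null here is true
    rejections += p < alpha

# nominally this should be ~0.05; with this much skew and n = 30, it drifts
print(rejections / trials)
```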

Okay, so the argument is that if you *do* have normally distributed samples (ha!) then the t-test is a better hypothesis test. Also, if you do actually have a larger amount of data, then the CLT is “pretty good”. This will positively affect power calculations.

Oof, yeah. Turns out that non-parametrics have a fatal flaw: many of them have low power. Power is, informally, the ability to detect effect sizes: the larger the power, the smaller the effect a test can pick up. Large enough effect, and even statistical tests with shit-tier levels of power will pick up on it. Think of this as hearing. Someone who is hard of hearing has low "power" in the domain of detecting sound waves. But even a completely deaf person will "hear" a bomb going off down the street — the reverberations will rattle their bones. In particular, the rank-sum test is not consistent (its power does not necessarily march to 1 as the sample size grows) against every kind of difference between the two groups you are testing. And in the case that your samples actually are normally distributed, the two-sample t-test remains more powerful. In fact, it has a nice property: it is uniformly most powerful when its assumptions are met. In general, more power is good, but there is such a thing as having too much power, which often occurs when dealing with massive data sets. Typically, this won't happen with research, as funding is limited and you can't hook up the entire country to your MRI machine, but in the era of big data, sample sizes that result from web-scraping, or even from business analytics, actually do approach the critical property n \rightarrow \infty, and this can result in power that actually does converge to 1. Essentially this means that your hypothesis test *will* reject any null hypothesis, no matter how small the effect size is. This is actually a bad thing. Even minute differences in measurement quality between samples will result in a rejected hypothesis test with an arbitrarily small p-value. Again, this is generally not going to be an issue in academic research. Your data sets will simply not be large enough to make a difference, and very likely, your data is going to be so noisy that your power will be closer to 0 than to 1. (t. every social science study ever). However, this has practical implications for anyone who works with "big data". I probably need to read more, but inference on massive data sets is very tricky. *Every* effect will flag as significant, and when you have more features p than samples n, then good luck fitting a regression and having it make sense. "Feature selection, bro". Sure, but the moment you start selecting features, your p-values will fail to achieve their nominal value. Your estimates will be biased. This is made even worse if you use a bad feature selection technique, such as stepwise regression, or pretty much any univariate feature selection. You can throw out features using domain expertise and just model what "makes sense", but this still seems wrong to me. You have information that is not being used. But maybe that's beside the point when it comes to causal modeling. (Can someone educate me here?)
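The too-much-power point is easy to demonstrate with a toy simulation. The 0.01-standard-deviation shift below is an arbitrary stand-in for "minute differences in measurement quality":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tiny_shift = 0.01   # a practically meaningless difference, in SD units

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(tiny_shift, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>9,}   p = {p:.3g}")
# the trivial shift is invisible at modest n, but at n = 1,000,000 the test
# typically calls it "significant" with a vanishingly small p-value
```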

I digress; there are massive tomes written on non-parametric statistics — I know because I spend too much time in the library, but they seem to be seldom used by researchers. Perhaps the boost in power you get from a t-test with a moderate sample size is worth accepting, despite the flimsy assumption checking. I am not totally sure, but if you open up PubMed and read some studies, you will hardly ever see "non-parametric" or "bootstrapping". I am not sure if I am lacking in education regarding experimental design, or if I am actually missing something really big. Why do we focus on hypothesis tests anyway, when if your data is sufficiently large, or noisy, or doesn't meet the assumptions of the test, the results can't be trusted?

II.

This post isn’t really about non-parametrics vs normal-theory backed tests.

I never thought that a single undergraduate-level class could be transformative. But here I am. Transformed. It was a data analysis class, and part of what made it so interesting was that it was in the math department but open to anyone who knew a little bit of R or Python and some basic rules of probability. The class was rigorous in a sense — proofs were shown. Derivations were made. But for the first time for me, we stopped talking so much about how a test works under perfect theoretical assumptions and instead just started working with real data. Old data, yes, but real data, from real studies. With missing values, non-symmetric distributions, unequal sample sizes between groups, you name it. While the learning still took place in an academic safe space, the data and the class itself lacked structure, and that is what made it so compelling.

Here was the skeleton of the class: You received some data, you were given a list of a couple things you *had* to do that would earn you a B or a B+ or whatever, if you did them all perfectly. If you wanted a higher grade, you had to go above and beyond and ask additional questions about the data, and perform tests. This class, in the end, turned out to be the most illuminating class of my (short) statistical career. Since I only knew a couple things, and I wanted to impress the professor, I started reading a bit more. And, as the cliche goes, the more reading I did, the less I knew.

(NB: Anyone who is familiar with research might be aware that broadening your hypotheses as you explore data is actually a bad habit unless the intent of the data analysis is, in fact, exploration. This type of analysis is called an exploratory analysis and is used to generate hypotheses for further study — i.e., replication. This was, in fact, the title of the class. There is a reason I am noting this. It comes up later.)

I remember one particularly frustrating assignment, in which we had two groups with some continuous trait that I wanted to measure. The groups were pretty close to homogeneous, except in the observation of what I later learned to be a kind of treatment effect (in a loose sense) — one group consisted of smokers, and the other did not. This was not the problem. The problem was that the non-smoking group had many more samples than the smoking group (iirc, this might be flip-flopped), and the variances were obviously not equal. While the distribution of each group *looked* normal (I mean statistically normal), there were enough differences in even the histograms that one could not reasonably be sure that they came from the same distribution. I asked my professor if this was a problem: "It is." Later in lecture, she showed a simulation study that revealed that the t-test doesn't like the sample sizes between groups to differ too much when the variances are also unequal. See also this link. Before you start yelling, yes, the t-test is remarkably robust to unequal variances, but only when the samples are approximately equal in size. And even if you have equal sample sizes, how sure are you that the variances are equal? My (and probably your) statistics textbook recommended an F-test for equality of variances prior to the t-test. This is actually a bad thing to do. Whenever you predicate a statistical test on the results of another statistical test, you adversely affect the nominal p-value of the subsequent test, and you will bias your result. In the end, I had to do some reading outside of the assigned textbooks in order to figure out the correct procedure.

(The solution: there are two, actually. You can bootstrap subsamples from the larger group down to the size of the smaller group, compute a t-test statistic, then repeat the process. Do this many times and compute an estimated p-value from your bootstrapped t-test statistics. Or just go simple and do a bootstrapped difference-in-means test, which tests essentially the same thing and is easier to implement – you assume the variances are equal in the latter method, however, which doesn't really solve the original problem. If you draw without replacement, then you have a permutation test, which will reject the null if the variances are unequal *or* the means are unequal. I actually like this test a lot — but it's a little old fashioned. The second approach, which is correct: Welch's t-test. In fact, some people recommend *always* using Welch's t-test. When the assumptions of Student's t-test are met, the two perform about the same. When the assumptions of Student's t-test are not met, Welch is clearly better. Allegedly, it still has a problem with vastly different sample sizes, though some would disagree with that point.)
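For concreteness, here is a rough sketch of the two fixes, Welch's test and a label-permutation test, on hypothetical arrays. This is the general shape of the procedures, not the exact recipe from the class:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# hypothetical groups: unequal sizes and unequal variances
smokers = rng.normal(5.0, 2.0, size=40)
nonsmokers = rng.normal(5.0, 1.0, size=400)

# Welch's t-test is just scipy's ttest_ind with equal_var=False
t_welch, p_welch = stats.ttest_ind(smokers, nonsmokers, equal_var=False)

def perm_test(x, y, n_perm=10_000, rng=rng):
    # permutation test on the absolute difference in means: reshuffle
    # the group labels and see how extreme the observed gap really is
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        count += abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= observed
    return (count + 1) / (n_perm + 1)

print("Welch p:", p_welch, " permutation p:", perm_test(smokers, nonsmokers))
```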

Two weeks later, while reading a research paper written by actual researchers, I noticed that they had committed the *exact* same mistake I almost did – but an order of magnitude worse. They performed a two-sample t-test between two groups in which one group had only around 50 observations, and the other had well over a thousand. Variances between groups were evidently *not* equal when I looked at the data myself (they made it public, thankfully). What this means is that the type I error rate can be vastly inflated or deflated. Their statistically significant result effectively cannot be trusted. I remember being somewhat perplexed by this. The textbook I was learning from recommended a procedure that actually doesn't work (F-test before t-test), and even professional researchers seemed to make the same kinds of mistakes. Around this time, I started to wonder why we perform hypothesis tests at all if we can never fully meet their assumptions. We could almost just do something more primitive: plot the data. If you have two groups of samples, and you plot the histograms of the groups, and there is a severe shift in location based on just the plot, then of course a t-test will reject the null. Of course pretty much every other test for center of location will reject as well. "But how do you know what you're observing is actually significant?" If you overlay two histograms and they're almost disjoint, then it's significant. I wish I could formalize this. If you're testing for a difference in means, then what you're really doing is testing for a difference in center of location. If the center of location is so obviously different to the naked eye, your test will reject. However, good luck getting a couple of plots into an academic journal without an accompanying statistical test.

I can’t help but wonder: what is the point of applying hypothesis tests, carefully checking assumptions that never hold anyway, and then reporting a resultant p-value that you expected the moment you plotted the data?  If performing an F-ratio test prior to a t-test actually affects the nominal p-value of the t-test itself, then what does it mean if you plot some data, look at it, then decide on a hypothesis test?

Enter the garden of forking paths.

III.

From the abstract of the aforementioned paper: "Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss in the context of several examples of published papers where data analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data."

See also this.

Also from the paper: “Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y; \phi (y)), where the function \phi (\cdot) is observed in the observed case.”. Here T refers to a statistical test, and \phi (\cdot) refers to just some function dependent on the data. E.g., the choice of variables to be controlled in a study.

Gelman more or less argues that whenever you make a choice of hypothesis test that is contingent on the data, you are going to get p-values that are overconfident. Overconfident, because you would never intentionally choose to analyze something you expect to have no difference. This is a problem with null hypothesis testing in general: we are technically assuming the null hypothesis in order to apply some test, but we never actually believe in that null. We chose the analysis because we actually expect a rejection. This *can* include plotting the data, looking at it, then deciding to test a hypothesis based on what you see. For example, say group A is a control group and group B takes some drug, and we want to measure the effectiveness of that drug on some affliction. Let's say we are interested in how long someone is sick taking no drug, or taking the drug. Now, if the hypothesis you are interested in is the number of days someone is sick with or without the drug, there are many, many ways that hypothesis can be tested. Even if a researcher works very hard to not dredge their data, even plotting the distribution of sick days between groups can lead to overconfident results if the choice of test depends on that plotting. For example, you plot the distributions and see that the center of location is about the same, but that one group has fatter tails in its distribution. Deciding to test the variance after the fact will naturally give a p-value that rejects, and an overconfident result. Likewise, deciding to test center of location *after* seeing that the centers of location do not appear to be the same will result in the same overconfidence. As Gelman himself states, even though the researchers in his examples were trying to do a good job (they weren't intentionally trying to lie to us), their data analysis procedures were not well defined a priori, before seeing the data. The analysis ends up depending on the data, and loses its interpretability.

The more correct procedure, then, is to gather the data with a specific hypothesis in mind, with a specific testing procedure. "We suspect that the drug will reduce, on average, the number of days a person is sick. To investigate this, we will perform a test for the center of location of the data." That means you gather your data and you only test that result. If you also report a change in variance because the plot suggested it, or report that people feel optimistic taking the drug after you noticed something going on there after the fact, then you are in the garden of forking paths. The reason for the ominous-sounding name is that at any one moment, a researcher can apply one of various tests. Their paths are "forked". If you navigate this maze by choosing tests that "make sense given the data" (read: dependent on the data), then your results are probably biased.

See the aforementioned paper by Gelman for more examples of this actually happening. My favorite example is the idea that conservative-leaning men are physically stronger than left-leaning men — that analysis comes from a study by Michael Bang Petersen and colleagues (not the other famous Peterson). Essentially, the researchers wanted to look at the relationship between political stance and physical strength in men, but did not actually measure strength — they measured arm circumference. And they did not use men in general, they used male college students, so we are off to a bad start already. Gelman claims that arm circumference is actually a proxy for age, and so when age is included in the analysis, the association between "strength" and socioeconomic status evaporates. Not only that, since no main effect was found, the authors reported the interactions. But those interactions did not take into account the arm-circumference-as-a-proxy-for-age quibble. The result is that we get to read a statistically significant p-value, but it's meaningless. While it may yet be true that political leaning has some association with strength, this study doesn't actually show what it purports to.

Okay, fine. You’re a smart researcher. You’re critical, and well-trained, and rigorous. You know that a p-value is not the probability the null is true. You can figure out how to do research without accidentally dredging or ending up in the garden of forking paths. But can you do this? From the post:

“In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?”

Answer: I have no idea. Allegedly *none* of the statistics PhD students in Gelman’s class got this right either (this is at Columbia by the way). In fact, they couldn’t even start it. If statistics PhD students, who have extensive training in statistical theory and methodology couldn’t get this question right, then what hope do social science researchers have? “Psychology is a rigorous science! We use bayesian statistics!” I am unmoved.
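For what it's worth, the question is easier to poke at by simulation than to answer in your head. Here is a rough sketch under one set of assumptions of my own choosing (unit-variance normal outcomes, a main effect sized for roughly 80% power, an interaction half that size, an even split of men and women, and a simple z-test on the interaction estimate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n_per_arm, alpha, n_studies = 100, 0.05, 2_000
main = 2.8 * np.sqrt(2.0 / n_per_arm)   # (1.96 + 0.84) * SE gives ~80% power
inter = main / 2.0                      # interaction is half the main effect
half = n_per_arm // 2                   # half men, half women in each arm

n_sig, exaggeration = 0, []
for _ in range(n_studies):
    t_men = rng.normal(main + inter / 2, 1.0, half)   # treated men
    t_wom = rng.normal(main - inter / 2, 1.0, half)   # treated women
    c_men = rng.normal(0.0, 1.0, half)                # control men
    c_wom = rng.normal(0.0, 1.0, half)                # control women
    est = (t_men.mean() - c_men.mean()) - (t_wom.mean() - c_wom.mean())
    se = np.sqrt(4.0 / half)                          # unit variance per cell
    if 2 * stats.norm.sf(abs(est) / se) < alpha:
        n_sig += 1
        exaggeration.append(abs(est) / inter)

print("estimated power for the interaction:", n_sig / n_studies)
print("mean |estimate| / true interaction among significant studies:",
      round(float(np.mean(exaggeration)), 2))
```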

I want to prevent this post from getting too long, so here is the main takeaway: hypothesis testing requires a metric ton of assumptions about your data. It requires very, very strict and careful methodology, and is often extremely difficult to get right. There is a mistaken view among people (including myself when I was younger) that statistical tests are like a calculator which spits out a yes-no or significant-insignificant in response to some data you feed it. This isn't true. Statistics isn't computer science, unfortunately (ba-dum-tss), and is difficult to get right. While there are some things that can be done in order to prevent dredging your data all over the place, they aren't ideal. Splitting your data into exploratory and confirmatory sets will always hurt your statistical power, and in many situations, power isn't that high to begin with. While there are problems with arbitrarily high power in a test, the opposite problem, too little power, is the more common one in actual real-world research. I am also skeptical of simply applying a correction to p-values. Anyone who has done data analysis has probably seen p-values so small that the software just reports them as a flat 0.000. What correction can you apply? "Isn't a p-value that's arbitrarily small actually more evidence of a statistically significant result, dredging or no?" Maybe. It's true that extremely small p-values are "more convincing", but in the face of dredging, I am not so sure. If you gave me a data set, any data set, I am sure that I could dig up an arbitrarily small p-value given enough time and beer. P-values can never actually be zero, since there is always a positive probability that your observations were due to chance. However, a p-value is, technically, a random variable. That means it has a distribution, and even your very small, very pleasing p-value that got you published could be entirely due to chance.

 

IV.

"If there are so many problems with p-values, what do we do instead?" This is an answer I can't readily give, because it remains a topic of hot debate within the statistical community. Some suggest simply reducing the "standard" threshold for significance to \alpha = 0.005, but not everyone finds this to be a useful substitution. Bayesians, of course, advocate for bayesian statistics instead, though that itself is not without its problems. For researchers, there is the more immediate problem of employers and PIs expecting p-values. Old habits die hard, and in order to get published, you have to reject the hell out of some hypotheses. It is not good enough to say "We thought maybe these two things are associated, but it turns out they are not. P>0.05". While you won't get published with such a claim, it is actually fallacious to think that the non-rejection of a null gives no information. Add in the fact that scientific research is built on "discovery", and you have a recipe for accidental data dredging in order to achieve significant results.

However, here are a couple of ideas for non-statisticians who like to read studies. If the study hasn't been replicated many times, just take it as exploratory. The fact of the matter is, most studies fall into this realm. Hypotheses aren't well defined; someone just had some data and wanted to see if the data said anything interesting. In this area, results shouldn't be trusted beyond piquing your interest that an association might be there. As I was explaining earlier, exploratory studies are studies designed to give ideas for more rigorous testing later. The results from exploratory studies need replication before they can be trusted. We can be reasonably sure, for example, that smoking causes cancer, since that has been studied again and again, but any time you read an article titled "Why science says you shouldn't…", just ignore it. Especially if it's a new study being posted on Huffington Post or Buzzfeed, or some other clickbait site. The more I learn about statistics, the more I feel that someone like Nassim Nicholas Taleb has the right idea: heuristics that have been handed down for generations are useful, because their existence and propagation were empirical. Maybe mom really does know best. I'm not saying to abandon statistics and reading studies as an avenue for self-learning, but studies in themselves are pretty useless to a lay-person unless you have domain knowledge in either the field of interest or in the methodology. This is because you have to be able to sift through the garbage. I hesitate to make this a necessary and sufficient condition, but maybe it can be a heuristic itself: if you don't know the field, then you shouldn't be relying on studies in the field. The first step to acquiring domain knowledge involves textbooks, not studies. If you want to know how the brain works, buy a human physiology textbook! Don't read stuff on PubMed until you know how to differentiate the bullshit from the novel. By extension, be careful of blogs, including this one. There is usually only one incomplete and flawed person behind the screen, and errors abound.

For the statisticians and researchers: I am fascinated by the idea of statistics without hypothesis testing. Visualization is simple and an extremely useful tool for understanding what is going on with your data. Parameter estimation is also useful, and has a huge, robust grounding in theory. While confidence intervals suffer many of the same problems as hypothesis tests, they are, I think, very useful in describing what's going on. Classical confidence intervals have some problems, but using the bootstrap to obtain confidence intervals is a pretty good procedure. It's entirely non-parametric, your coverage will be pretty good, and if the intervals are disjoint, you can still claim some kind of hypothesis test for your employer/PI. A good quote from a random commentator on Gelman's blog: "Just say something reasonable and useful about your data". Understandably, researchers already think they are doing this, but for every statistically significant result, what is the significance to the reader? Can we go from p < 0.05 to a decision? I think maybe not, but with causal modeling, probably yes.
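For the bootstrap-CI suggestion, here is a minimal sketch: a percentile interval for a difference in means, on hypothetical data standing in for two groups.

```python
import numpy as np

def bootstrap_diff_ci(x, y, n_boot=10_000, level=0.95, seed=0):
    # percentile bootstrap confidence interval for mean(x) - mean(y)
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        diffs[i] = xb.mean() - yb.mean()
    tail = (1 - level) / 2
    return tuple(np.quantile(diffs, [tail, 1 - tail]))

rng = np.random.default_rng(42)
a = rng.normal(10.0, 3.0, size=80)   # hypothetical group A
b = rng.normal(9.0, 5.0, size=50)    # hypothetical group B
print(bootstrap_diff_ci(a, b))       # if the interval excludes 0, there's your "test"
```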

The idea of simply "modeling the problem directly" is attractive. While I don't know much about this methodology other than that it involves the use of "multi-level models" (a kind of fancy linear regression), my (limited) statistical training is about as classical as you can get, and so this idea is exciting to me. It is very interesting that computer/data scientists focus entirely on model building and are able to achieve extremely accurate predictions while violating almost every assumption a classically trained statistician relies on. Even statisticians use predictive accuracy as an assessment of model fit, so what does it mean if computer scientists are so dominant at this? Yeah, the models are biased, but they have ROC scores of > 0.90 on the hold-out set. "Machine learning is just a black box!" Maybe, but there is work being done in that direction. See also this. And causality aside, prediction is the other half of the reason people want to hire statisticians. "But how do we actually learn from the data if we are just focusing on out-of-sample predictions?" I'm not sure. But accurate out-of-sample predictions are very, very useful, and marketable. This seems like a good thing to learn how to do well.


Should you invest in Google or Bitcoin? (VaR Risk and Utility)

Disclaimer: I am not a financial expert. I am a student in statistics and probability. What follows is some exposition on an exercise found in one of my textbooks. It was not a trading book, but rather a book on actuarial models for insurance. I thought it was interesting, and decided to write about it.

I.

On finance.yahoo.com, stock price data can be easily downloaded and analyzed. Here, we consider a stock and a cryptocurrency, Google and Bitcoin, which are words that have passed through the lips of pretty much every person who knows what an investment is. Bitcoin, in particular, was the source of many a meme when the lovable cryptocurrency topped out in value at $19345.49. Google, on the other hand, is an actual company, with investors and products, and likely was what you used in order to find this blog. Now, if you had some fixed amount of money, where should you invest it? Can we make a decision based only on historical data and some very elementary risk measures? (NB: No, we cannot. If such elementary machinery provided any substantial predictive power for return on investment, then nearly anyone comprehending basic calculus and probability theory would be wildly rich. However, this makes an interesting exercise.)

 

II.

From finance.yahoo, we consider historical stock price data from 2017-05-05 to 2018-05-05 (essentially present day, at the time of this writing). In this case, we will look at the Close price.  Here is the time series of the Bitcoin closing price (366 observations):

[Figure: Bitcoin daily closing price, 2017-05-05 to 2018-05-05]

And we also have the Google price data (255 observations – fewer than Bitcoin, because the stock market is closed on weekends and holidays while Bitcoin trades every day. Ideally, we would want data with the same number of observations, but since this isn't a "serious" exercise, we can say that this is good enough).

[Figure: Google daily closing price, 2017-05-05 to 2018-05-05]

We can see that both Bitcoin and Google saw big increases in price around the start of 2018, but the closing prices for Bitcoin appear to be dropping as 2018 continues on, while Google has remained relatively high. Because stock prices are effectively stochastic, i.e., random, we really cannot draw any conclusions from these charts alone. Though, some speculate that Bitcoin prices are going to take off again at some point (“To the moon!”). If all I knew were these charts, it would be difficult to discern which asset will make you more money. Perhaps I would be tempted to go with Bitcoin, as its apparently high volatility may hint at another large spike in price in the future. In fact, what can we say about the returns of these two assets?

Return is usually defined by the formula every business major knows: \frac{\text{Close Price Today} - \text{Close Price Yesterday}}{\text{Close Price Yesterday}}\times 100. We will use a modified version of this formula that is simply the ratio of today’s price to yesterday’s price. That is, we will define a daily return V as the ratio

V = \frac{\text{Today's Closing Price}}{\text{Yesterday's Closing Price}} .

So it follows that when V > 1, we saw a gain, and when V < 1, we saw a loss. If V = 1, then naturally we didn’t lose or gain anything. However, here V is essentially a continuous random variable, so the probability that V = 1 exactly is 0. We can thus expect to either gain or lose money when considering the returns in a 24 hour period. While we could extend our time interval to beyond 24 hours, we consider only the returns on a single day – we are pretending to be day traders here.
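As a rough sketch of how one might compute these returns in Python (the file names here are hypothetical stand-ins for CSVs downloaded from finance.yahoo with a Close column):

import pandas as pd

# Hypothetical file names; any CSV from finance.yahoo with a "Close" column will do.
btc = pd.read_csv("BTC-USD.csv")["Close"]
goog = pd.read_csv("GOOG.csv")["Close"]

# Daily return as a ratio: today's close divided by yesterday's close.
btc_returns = (btc / btc.shift(1)).dropna()
goog_returns = (goog / goog.shift(1)).dropna()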

[Figure: daily returns for Bitcoin and Google]

The graphs of the daily returns from both assets appear to resemble white noise, which suggests the returns are roughly uncorrelated; we will go further and treat them as approximately normally distributed. (A lie. Keep going.) This is actually a key assumption in proceeding with our analysis, so it is pertinent to take a glance at the quantile-quantile plots as well:

[Figure: normal quantile-quantile plots of the Bitcoin and Google daily returns]

Both data sets are probably close enough. There is a bit of fuzziness in the tails here, and some skewness (in particular, Google has some right tail skew), but it’s not totally unreasonable to assume that the data are approximately normally distributed.
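For reference, one way to produce these quantile-quantile plots, reusing the return series from the sketch above:

import matplotlib.pyplot as plt
from scipy import stats

# Compare the empirical quantiles of each return series against a normal distribution.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(btc_returns, dist="norm", plot=ax1)
ax1.set_title("Bitcoin daily returns")
stats.probplot(goog_returns, dist="norm", plot=ax2)
ax2.set_title("Google daily returns")
plt.show()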

 

(NB: We are fibbing a bit. Because the definition of returns we are using is simply a ratio, they are actually not normally distributed, as the return is strictly greater than zero. In The Black Swan, Nassim Nicholas Taleb argues heavily against the prolific use of the normal distribution in the financial sector – and for good reason. Bad assumptions can be disastrous, and normality is a generally unwarranted assumption, for actual return data has fatter tails than a normal distribution. However, since it is better to beg for forgiveness than to ask permission, we will take our returns data as approximately normal and continue with the exercise. Our aim here is to explore the concept of risk and utility, not to argue about the distribution of returns.)

It is relatively common to treat daily returns as random variables, so we let X  denote the random return of the Bitcoin asset and Y  denote the random return of the Google asset. Using the data we have on the daily returns, we can estimate the mean \mu  and standard deviation \sigma  of each using the arithmetic mean and sample standard deviation. We have:

\hat{\mu}_X = 1.0040

\hat{\sigma}_X = 0.0547

\hat{\mu}_Y = 1.0009

\hat{\sigma}_Y = 0.0140

On inspection, it appears that the Bitcoin asset gives a higher average return than the Google asset, however, it suffers from a higher standard deviation, which is indicative of more volatility – you are more likely to get larger spikes of gains, as well as larger spikes of losses. Perhaps this is to be expected.
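In code, these estimates are one-liners, continuing the earlier sketch (pandas’ .std() is the sample standard deviation by default):

mu_x, sigma_x = btc_returns.mean(), btc_returns.std()    # roughly 1.0040 and 0.0547
mu_y, sigma_y = goog_returns.mean(), goog_returns.std()  # roughly 1.0009 and 0.0140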

III.

(NB: The following discussion may be skipped by those who are well acquainted with VaR and its concepts.)

The Value-at-Risk (VaR) criterion, as defined in Rotar’s Actuarial Models: The Mathematics of Insurance, is the “smallest value x for which P(X \leq x)  > \gamma .” More formally, for a random variable X and \gamma\in [0,1), we write

q_\gamma(X) = \inf \{x: P(X \leq x) > \gamma\} .

This is a definition of the \gamma -quantile of a random variable X .

This can also be called the value at the \gamma\times 100\% level of risk. For example, if we set \gamma = 0.05  and X  is a random variable denoting income, then q_{0.05}(X) is the level that our income exceeds with probability 95%. That is, our income falls below q_{0.05}(X)  with only 5% probability. This is how we can use a notion of “risk” in finance. We are, in essence, trying to calculate how much money we stand to lose at certain probabilities. There are several ways of doing this. We can either place the VaR criterion on a distribution of losses, or we can place the VaR criterion on a distribution of income, and thereby calculate our losses as a function of our income at some level of risk. For example, if we bet $10,000 on a particular asset and calculate the VaR of our income, we can see what our losses will be as a function of our bet. If the value at 5% risk (on our income) ends up being $7,000, then 5% of the time we stand to lose at least $3,000. (This does not seem like a very good bet.)

We can also place the VaR criterion on a distribution of losses, where the random variable X takes on negative values. In this case, we read off our losses directly: in our example, the value at 5% risk is −$3,000. By calculating our “risk” in this manner, we can obtain information about what is at stake in our financial bets.

Since we can choose \gamma , we can essentially estimate our income at whatever level of risk we want. In this definition, if q_\gamma  is negative, then this corresponds to losses.

We say that we prefer X  over Y when q_\gamma(X) \geq q_\gamma(Y) , and in this case we write X \succeq Y .

Now for the relevant part. From Actuarial Models: suppose we have a normally distributed random variable X  with mean \mu  and standard deviation \sigma , so that \Phi (\frac{x - \mu}{\sigma}) is the distribution function of X . The \gamma -quantile of X  is then a solution to the equation \Phi(\frac{q - \mu}{\sigma}) = \gamma . If we denote by q_{\gamma s}  the \gamma -quantile of the standard normal distribution, so that \Phi (q_{\gamma s} ) = \gamma , then we can rewrite the equation as \frac{ q - \mu }{ \sigma } = q_{\gamma s } . Thus, we get:

q_\gamma (X) = \mu + q_{\gamma s}\sigma .

Since multiplying a normally distributed random variable by a constant results in a normally distributed random variable, we can actually compute the value at risk of our investments in the Bitcoin and Google assets, as the daily returns are (approximately) normally distributed. If we think of the amount of money we invested as a weight or scalar (say n dollars), we can define a new random variable X^* = nX that denotes our income from the investment. Then the expected value of X^*  is E[X^*] = nE[X] = n\mu  and the standard deviation is SD(X^*) = nSD(X) = n\sigma .  Thus, we get for the VaR criterion,

q_\gamma(X^*) = n\mu + nq_{\gamma s}\sigma 

Since we are generally interested in small probabilities (as we are primarily interested in this notion of risk), we will take \gamma \leq 0.05 , which implies that q_{\gamma s} \leq -1.64 . Essentially, we want the expected return on investment to be as high as possible, while the standard deviation can be seen as a measure of riskiness. Since we are looking at small probabilities and assuming a normal distribution, the quantile q_{\gamma s} will always be negative, and so the higher the standard deviation, the higher the level of riskiness (as we are in a position to actually lose money if the prices drop too much unexpectedly).
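A minimal sketch of this computation in Python, under the normality assumption (var_normal is a hypothetical helper name, not something from the textbook):

from scipy.stats import norm

def var_normal(n, mu, sigma, gamma=0.05):
    """Value of the income n*X at the gamma level of risk, assuming X ~ Normal(mu, sigma)."""
    q_gamma_s = norm.ppf(gamma)  # standard normal gamma-quantile, about -1.645 at gamma = 0.05
    return n * mu + n * q_gamma_s * sigma

# e.g. var_normal(10_000, 1.0040, 0.0547, gamma=0.025) is roughly 8968,
# matching the hand calculation in the next section up to rounding of the quantile.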

IV.

By the preceding discussion, suppose that we are willing to invest n dollars into either Bitcoin or Google. We will assume that the better choice of investment is the asset with the smaller risk. Under the VaR criterion, we want to see which asset will not only make us the most money, but also carries the smallest penalty should something bad happen – like an unexpected drop in prices. Since this is based on our definition of return as a ratio, the absolute price level shouldn’t matter from a mathematical point of view.

Define X^* = nX  as the income from the Bitcoin investment and define Y^* = nY  as the income from the Google investment. We will fix n for now. Also set \gamma = 0.025. Now we have, by our estimates from the price datasets,

q_\gamma(X^*) = nE[X] + n q_{\gamma s}SD(X) = n(1.0040) - n(1.96)(0.0547)     (1)

q_\gamma(Y^*) = nE[Y] + n q_{\gamma s}SD(Y) = n(1.0009) - n(1.96)(0.0140)      (2)

We now have a function of our investment n which we can plot in Python:

[Figure: VaR at the 2.5% level as a function of the amount invested n, for Bitcoin and Google]
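Something like the following could generate that plot, assuming the hypothetical var_normal helper sketched earlier:

import numpy as np
import matplotlib.pyplot as plt

n = np.linspace(0, 100_000, 200)  # amounts invested, in dollars
plt.plot(n, var_normal(n, 1.0040, 0.0547, gamma=0.025), label="Bitcoin")
plt.plot(n, var_normal(n, 1.0009, 0.0140, gamma=0.025), label="Google")
plt.xlabel("Amount invested n (dollars)")
plt.ylabel("Value at 2.5% risk (dollars)")
plt.legend()
plt.show()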

Both VaR lines grow with the amount invested, but the more money we invest at a time, the larger the gap between the VaR for the Google asset and the VaR for the Bitcoin asset. As a concrete example, if we compare an investment of $10,000 in Google and Bitcoin, then our value at 2.5% risk for each asset is

q_{0.025}(X^*) = (1.004)(10,000) - (10,000)(1.96)(0.0547) = 8,967.88

q_{0.025}(Y^*) = (1.0009)(10,000) - (10,000)(1.96)(0.014) = 9,734.6

By definition of VaR, this means that 2.5% of the time, we stand to lose at least $1,032.12 on Bitcoin, but only $265.40 on Google. Bitcoin is a riskier bet, which is intuitively obvious just from looking at the larger standard deviation: the investor is in a position to lose (or gain) much more money than in a bet with a lower variance. Based on this analysis, it seems that the choice of investment should be Google. Since X denotes the income from Bitcoin and Y  denotes the income from Google, and the value-at-risk is higher for Google, we say that we prefer Google and write Y \succeq X .

But is this result generalizable to all humans? Should every investor make this choice? Perhaps, but human beings are not always rational; nevertheless, their behavior can sometimes be modeled as maximizing some kind of utility function. I don’t know what my utility function is, but for whatever reason, the utility I get out of typing up this blog post is higher than the utility I get from doing homework.

V.

Utility functions are, in essence, some way of characterizing human behavior. Human beings all want something, and that something is often fairly intangible. Yet, one utility that seems “modelable” is the utility of money. Money gives us utility almost however you define it, since money can serve as a proxy for the things we actually want — be it power, sex, or even a new video game. (I may not value money at all as an axiom, but if I value playing video games, then I value free time so that I can play video games, and also the money required to purchase them. Ergo, I have a job that gives me weekends off. This property of valuation seems to hold almost everywhere in adults that like video games, except on the set of video-game pirating NEETs, e.g., 4channers, which is a set of small measure.)

If we take it as an axiom that every human has some innate, unknown utility function,  and that human beings are, in general, utility maximizers, then we can use some relatively simple mathematics to model a person’s behavior. There are caveats. One can never assume that a model is true, as true behavior (of anything) is inherently unknowable (empiricism helps, but only serves to make models more accurate. Not more true).

Here we denote a utility function by u. A classical example of a utility function is the natural logarithm u(x) = log(x) , where x  is some number representing capital.

[Figure: the natural logarithm utility function u(x) = log(x)]

The expected utility maximization criterion states that a utility maximizer prefers an investment strategy X over Y  if and only if the expected utility payoff of X  is at least as high as the expected utility payoff of Y . That is,

X \succeq Y \iff E[u(X)] \geq E[u(Y)] .

Perhaps the simplest utility function is the dollar amount itself, but that doesn’t capture the decision-making process used by actual humans. How people characterize their investments has a lot to do with how they feel about them, their expected return, and the potential risks involved in the strategy. That’s where a utility function comes in: it is an attempt to capture some of that decision-making process.

We will avoid some details here, but essentially investors (or gamblers) can be described as risk averse, risk taking, or risk neutral. An investor who is risk averse is one who, naturally, prefers stability from their assets. Since the VaR for the Google asset is higher than the VaR for Bitcoin, a risk-averse investor would be more inclined to invest in Google over Bitcoin. However, Bitcoin has a larger variance, and so there is some probability of a large spike in price – which means someone potentially makes a lot of money. An investor who still prefers Bitcoin despite its lower VaR could be said to be a risk-taker. He or she is willing to invest the money and take the risk of losing it for the (likely small) probability of a large payoff.

Utility functions don’t have to be positive (note that the natural log is not strictly positive). And since we only care about maximizing our “utils”, we can actually use the natural logarithm to model investment behavior in a “rational” human being. The logarithm is increasing, so an investor is always interested in getting more capital, as that gives him greater utility. But the function is concave, so after amassing a large enough amount of capital, the gains in utility slow down considerably. This makes sense: once you have $1,000,000, how much more utility do you get out of a couple more bucks? Once you’re Bill Gates, how much more utility do you get out of any amount of money? (Technically, since the natural logarithm is unbounded, there is some amount of money Bill Gates would eventually want to acquire. This perhaps does not reflect reality, in that Bill Gates has so much money that it’s pretty much incomprehensible. I cannot imagine what it must be like to be so rich that you couldn’t spend it all — even if you tried.)

But on the other hand, since the natural logarithm diverges to - \infty  as our capital decreases to zero, we are essentially “very afraid” of being ruined. We want as much money (utility) as possible, but not so badly that we are willing to go completely broke. (This function ignores the key fact that many Americans are actually in debt, negative capital, but that is perhaps a post for another time). This seems to provide a reasonable utility function for actual human behavior. Again, this function is not underlying truth, but rather an attempt to model something. Models are incorrect, since they are estimators, but can be made more accurate.

Passing the daily returns of both Bitcoin and Google through u(x) = \log(x) and taking the mean gives an estimated expected “utility” of 0.0025 for Bitcoin and 0.0008 for Google – so under this utility, we actually prefer to invest in Bitcoin.

(This procedure may seem ad hoc, but it has some justification. Let X_1, ..., X_n be random variables denoting the one-day returns for n days. The Expected Utility Maximization (EUM) criterion says that we compare investment plans by their expected utility. If X denotes the (day-trading) investment plan pertaining to the returns X_1, ..., X_n , then we can estimate the expected utility by passing the n returns through a utility function u and taking the mean. That is,

\frac{u(X_1) + u(X_2) + ... + u(X_n)}{n} \approx E[u(X)] 

by the law of large numbers when n is large.)
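In code, this estimate is just the mean of the log-returns, reusing the return series from the earlier sketch:

import numpy as np

# Estimated expected log-utility of each day-trading plan.
print(np.mean(np.log(btc_returns)))   # roughly 0.0025, the value reported above
print(np.mean(np.log(goog_returns)))  # roughly 0.0008, the value reported above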

What about a different utility function? Consider u(x) = 2ax - x^2 . This is called a quadratic utility function, and its expectation is actually a function of only the mean and variance, so, heuristically, it may be a more “objective” criterion for assessing assets, as we really only care about our expected return and our possibility of a loss. For a simple proof that the expected quadratic utility really is a function of the mean and variance, recall that Var(X) = E[X^2] - E[X]^2 . Therefore, E[X^2] = Var(X) + E[X]^2 , and

E[u(X)] = E[2aX - X^2] = 2aE[X] - E[X^2] = 2aE[X] -(Var(X) + E[X]^2)

And this simply becomes 2a\mu - (\sigma^2 + \mu^2) , which completes the proof. Now for the scalar a : one of the requirements for a utility function is that it be non-decreasing. We can strengthen this slightly and make it an increasing function (so that more capital, or a higher return, means a higher utility, matching our intuition), and so we need to pick a so that x < a  for every value x our return can take. In other words, we want P(X < a ) = 1 . Our returns for both data sets never exceed 1.25, so we can pick a = 1.25, and our utility function is then u(x) = (2)(1.25)x - x^2 = 2.5x - x^2. Like before, we can estimate the expected utility of our strategies by passing the returns through this utility function and taking the mean. So, recalling that X is our Bitcoin strategy and Y is our Google strategy,

E[u(X)] \approx 1.4989

E[u(Y)] \approx 1.5003

So our expected utility is now slightly higher for the Google strategy, which matches our intuition from the VaR criterion. (However, the difference between the two is extremely small, and increasing a results in us preferring Bitcoin again. Utility seems to be quite finicky.)
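For completeness, the same estimate for the quadratic utility, again reusing the earlier return series (quad_utility is a hypothetical helper name):

def quad_utility(x, a=1.25):
    """Quadratic utility u(x) = 2ax - x^2, increasing for x < a."""
    return 2 * a * x - x ** 2

print(quad_utility(btc_returns).mean())   # roughly 1.4989, as above
print(quad_utility(goog_returns).mean())  # roughly 1.5003, as above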

 

VI.

So, which is it? Google or Bitcoin? I’m sorry to have to give a non-answer, but the non-answer is that it depends on your utility function. If you decide that the VaR criterion is “good enough” to decide where to invest, then perhaps you’re very risk averse and your utility function is concave. In that case, Google seems like a safe bet. Even safer – a mutual fund. If, however, you’re willing to shoulder the risk of losing everything at the prospect of big gains, then obviously Bitcoin is the answer here, as it has been for many other individuals. However, modeling risk seems like a generally good idea. It’s why traders seem to espouse having diversified portfolios (there is a very simple mathematical proof for why this reduces the variance, and thus the risk, of your investments – sketched below – but even heuristic reasoning may suffice: if all your eggs are in one basket, and that basket gets hit with a financial nuclear bomb, then you lose all your eggs. If you have many baskets, and one egg in each, then losing one basket means you only lost one egg — that is, one investment.)
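To make the variance claim precise: if the capital is split equally across k assets whose returns X_1, \ldots, X_k are uncorrelated, each with the same variance \sigma^2 , then

Var\left(\frac{1}{k}\sum_{i=1}^k X_i\right) = \frac{1}{k^2}\sum_{i=1}^k Var(X_i) = \frac{\sigma^2}{k},

which shrinks as the number of baskets k grows. (Correlated assets weaken, but do not eliminate, this effect.)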

 

****

Caveats and further questions:

i) As expressed earlier, the distribution of returns is mostly not normal. Normality is a very strong, simplifying assumption, which is why it is made, but by assuming the returns are normal and then calculating the VaR at a particular level analytically, the actual chance of the risky event occurring is higher than the nominal level – perhaps much higher. There are non-parametric methods that make no distributional assumptions and work entirely from historical data; this is called historical simulation (a small sketch appears after these caveats). There is another method that uses Monte Carlo simulation, but I don’t know much about that yet. These methods are likely more robust, in that they use fewer assumptions.

ii) Utility functions as a means to optimize a portfolio seem weird to me. A utility function essentially describes the behavior of an individual by attaching “utils” to some notion of having more or less of something. This criterion is indeed used for portfolio optimization, but I don’t know enough about it to comment deeply. Maybe another time.

iii) There is a critical flaw of VaR which Taleb occasionally critiques. When you fix a probability level, say 1%, and then calculate VaR, you implicitly behave as though events with probability less than 1% do not occur. This is perhaps certifiable madness in finance. If you take the stock market as entirely stochastic, which many people do thanks to the Efficient Market Hypothesis, then stock price movements are a random walk. Since a random walk does not depend on the past at all, and markets are here to stay, eventually a “rare event”, one that occurs with probability less than 1%, is bound to happen, for someone, somewhere. And here’s the problem – value at risk assumes that it doesn’t. This might not be a big deal if losses scaled linearly with the risk level, but there isn’t evidence that they do. In fact, as Taleb argues, it is these rare events, these Black Swans, that are most impactful, most disastrous. It’s possible that while you can manage the loss of a 1% event, a 0.5% event could liquidate your company. That said, all is not lost for VaR. Part of managing risk is having a “risk measure”, in which multiple risk criteria are considered and given relative weights depending on the concerns of the individual or the firm. There is also Conditional VaR, or “expected shortfall”, which tries to estimate the impact of extremely rare events. Moreover, there are many other risk criteria than those two, enough to fill a textbook on the subject.

iv) I am not a trader, but I am curious about a trading strategy for the individual: I believe one could come up with a coherent risk measure that appeals to them, evaluate many stocks on that risk measure, and then pick stocks so as to obtain the smallest amount of risk for some ratio of highest expected return. This is done in portfolio management in the investment industry, for which there are ample resources (google “risk parity”), but I wonder whether it could be specified as a coherent strategy for an individual with regular-person resources. I am likely missing something here, as this seems obvious to me, but I am not a trader and thus have not read any trading books. It’s possible that this is a strategy considered in chapter 1 of some book somewhere, and I also get the feeling that this method would be too computationally intensive for the general public, though it would appeal to the analytically minded. If anyone knows, let me know. I might try to explore this idea.

v) The data at the time of this writing is already outdated. I figure that is probably just okay.

vi) Future project: investigate the Efficient Market Hypothesis by trying some machine learning techniques on stock data. Does the past model the future? Economists say no, but economists also failed to predict the 2008 crash, and the Efficient Market Hypothesis precludes the idea of a bubble and the event of a crash, so it is definitely worth looking into. In fact, Bitcoin has been used as a counterexample to the EMH. Since the EMH is a mathematical derivation that relies on a mathematical assumption, precisely one counterexample is enough to refute it.
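As for the historical simulation mentioned in caveat i), a minimal sketch might look like the following: take the empirical \gamma-quantile of the observed returns and scale by the amount invested, with no distributional assumption (var_historical is a hypothetical helper name, and btc_returns is the return series from the earlier sketch):

import numpy as np

def var_historical(n, returns, gamma=0.025):
    """Historical-simulation VaR: n times the empirical gamma-quantile of past returns."""
    return n * np.quantile(returns, gamma)

# e.g. var_historical(10_000, btc_returns) uses only the observed 2017-2018 returns,
# with no normality assumption.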

 

Proving Monotone Convergence Theorem with Fatou’s Lemma

The statement of the Monotone Convergence Theorem reads:

Suppose that the set E is measurable, and let (f_n)_{n=1}^\infty be a sequence of non-negative, measurable functions such that, for x \in E,

0 \leq f_1(x) \leq f_2 (x) \leq \cdots

Let f be defined by f_n(x) \rightarrow f(x) as n \rightarrow \infty. Then,

\int_E f_n d\mu \rightarrow \int_E f d\mu, \quad (n \rightarrow \infty)

In this proof, Fatou’s lemma will be assumed.

Notice that \lim_{n\rightarrow \infty} f_n = f implies that

\int_E \lim_{n\rightarrow \infty } f_n = \int_E f

and so by Fatou’s lemma,

 \int_E \lim_{n\rightarrow \infty } f_n = \int_E f \leq \liminf_{n\rightarrow \infty} \int_E f_n, \quad (1)

since \lim_{n \rightarrow \infty } f_n = f = \liminf_{n \rightarrow \infty} f_n. \quad (2)

Now, since f_{n} \leq f_{n+1} \leq \cdots \leq f for every integer n, and the f_n's are bounded below by 0, we have

0 \leq \cdots \leq \int_E f_n  \leq \cdots \leq \int_E f, for every n \in \mathbb{N}. And so, passing to the limit superior gives

\limsup_{n \rightarrow \infty} \int_E f_n \leq \int_E f, \quad (3).

Now, combining (3) with (1) and (2) yields:

\limsup_{n\rightarrow \infty } \int_E f_n \leq \int_E f = \int_E \liminf_{n\rightarrow \infty} f_n \leq \liminf_{n\rightarrow \infty} \int_E f_n

hence

\limsup_{n\rightarrow \infty} \int_E f_n d\mu \leq \int_E f d\mu \leq \liminf_{n\rightarrow \infty} \int_E f_n d\mu,

therefore

\lim_{n \rightarrow \infty} \int_E f_n d\mu = \int_E f d\mu

which proves everything that was promised.
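For a concrete example of the theorem in action, take E = (0,1) and the increasing partial sums f_n(x) = \sum_{k=1}^n \frac{x^k}{k}, which converge pointwise (and monotonically, since every term is non-negative) to f(x) = -\log(1-x). The theorem then justifies integrating term by term:

\int_E f_n \, d\mu = \sum_{k=1}^n \frac{1}{k(k+1)} \rightarrow \sum_{k=1}^\infty \left( \frac{1}{k} - \frac{1}{k+1} \right) = 1 = \int_0^1 -\log(1-x) \, dx = \int_E f \, d\mu.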

The Outer Lebesgue Measure is Translation Invariant and Scales Nicely with Dilation.

Define the sets A+t = \{ x+t: x\in A \} and \delta A = \{\delta x: x\in A\}. Let m^* be, of course, the outer Lebesgue measure. Suppose that we have a set A \subset \mathbb{R}. Then it is possible to cover A with a sequence (A_n)_{n=1}^\infty of elementary sets, that is, A \subset \bigcup A_n. Note that since each A_n is elementary, it may be written as a finite union of disjoint intervals I_k, A_n= \bigcup I_k, for every integer n.

By definition, m(I_k) = (b_k - a_k), where a_k, b_k are the end points of the interval I_k. Now, clearly A+t \subset \bigcup (A_n + t) \iff A \subset \bigcup A_n.

Moreover, A_n +t = \bigcup (I_k + t). Note that m(I_k + t) = (b_k +t) - (a_k+t) = b_k- a_k. Hence, m(A_n + t) = \sum_k^N m(I_k + t) = \sum_k^N m(I_k) = m(A_n), by the additivity of m.

Fix \epsilon > 0 . By the definition of m^*, we may choose an elementary covering (A_n) of A so that

m^*(A) \leq \sum_n^\infty m(A_n) \leq m^*(A) + \epsilon,

and, separately, an elementary covering (B_n) of A+t so that

m^*(A+t) \leq \sum_n^\infty m(B_n) \leq m^*(A+t) + \epsilon.

But it was shown above that translating an elementary set does not change its measure, so m(A_n + t) = m(A_n) and m(B_n - t) = m(B_n).

Hence, since the sets B_n - t cover A,

m^*(A) \leq \sum_n^\infty m(B_n - t) = \sum_n^\infty m(B_n) \leq m^*(A+t) + \epsilon

And, since the sets A_n + t cover A+t,

m^*(A+t) \leq \sum_n^\infty m(A_n + t) = \sum_n^\infty m(A_n) \leq m^*(A) + \epsilon

which then implies that m^*(A+t) = m^*(A) as \epsilon was arbitrary.

Dilation follows in a similar way: A\subset \bigcup A_n \iff \delta A \subset \delta \bigcup A_n \iff \delta A \subset \bigcup \delta A_n, where \delta > 0. Now, writing each A_n as a finite union of disjoint intervals reveals

\delta A_n = \bigcup^N_k \delta I_k which implies that

m(\delta A_n ) = \sum_k^N m(\delta I_k) = \sum_k^N (\delta b_k - \delta a_k) = \sum_k^N \delta (b_k - a_k) = \delta m(A_n),

by the additivity of m on disjoint sets.

Thus, for a fixed \epsilon > 0, we can choose the elementary sets A_n covering A so that the dilated sets \delta A_n form a near-optimal covering of \delta A, in the sense that

\delta m^*(A) \leq \delta \sum_{n=1}^\infty m(A_n) = \sum_{n=1}^\infty m(\delta A_n) \leq m^*(\delta A) + \epsilon

\implies \delta m^*(A) \leq m^*(\delta A),

as \epsilon was arbitrary. Likewise,

m^*(\delta A) \leq \delta m^*(A) + \epsilon can be obtained by instead choosing a covering that is near-optimal for A, which gives

\delta m^*(A) = m^*(\delta A)

by combining the two inequalities, as desired.

A dense open set in the reals, with small size

It will be shown that in \mathbb{R}, for any \epsilon > 0, there is a dense open set O such that m^*(O) < \epsilon, where m^* is the outer Lebesgue measure.

Consider the set of rationals, \mathbb{Q} \subset \mathbb{R}, which is known to be countable. So, the elements of the rational numbers may be enumerated so that

\mathbb{Q} = \{ q_1, q_2, ... \}. Now, fix \epsilon >0 and around each q_n, n\in \mathbb{N}, center an interval of length \frac{\epsilon}{2^n}. That is, form the interval

I_n = (q_n - 2^{-(n+1)}\epsilon, q_n + 2^{-(n+1)}\epsilon).

Taking the union of all such intervals I_n forms an open set, which will be called O. It follows that \mathbb{Q} \subset O = \bigcup I_n, and since \mathbb{Q} is dense in \mathbb{R}, so is O. Now, the outer Lebesgue measure is countably subadditive, so it follows that

m^*(O) = m^*(\bigcup I_n) \leq \sum_{n=1}^\infty m^*(I_n) = \sum_{n=1}^\infty 2^{-n}\epsilon = \epsilon.

Since \epsilon >0 was arbitrary (running the construction with \epsilon/2 gives m^*(O) \leq \epsilon/2 < \epsilon), the proof is complete.

 

NB: This is my first post. Readers who stumble here are free (and encouraged) to provide feedback and corrections. This blog is mostly going to consist of interesting math tricks and thoughts on subjects I find interesting.