Causal Factor Investing

0. Prerequisites & Preps

Study with NotebookLM
Deconstruct abstract concepts with Gemini or Claude
The Book of Why by Judea Pearl and maybe his other work
Linear Algebra
PHIL 101

1. Intro

To be scientific is to declare the falsifiable mechanism responsible for a phenomenon, and financial economics, as a science, is at a disadvantage in doing so.

For the past five decades, the factor investing literature has mostly published associations, not causes. These claims have never been put through falsification, so they should be admitted as spurious.

My conclusion: the factor investing literature is immature, which means investing based on factors is immature too. Dr. Prado is here to give all factor investors a wake-up call.

Key takeaways:

Know the difference between association and causation (which means being logical, and learning some PHIL 101);
It is possible to estimate causal effects through natural experiments and simulations.
Chap 5 & 6: Causal confusion in econometrics and factor investing, and why factor investing is immature
Chap 7: Proliferation of spurious claims as a “factor zoo”
Chap 8: Scientific discipline

Author’s contributions:

Denial or ignorance of the causal content of authors’ models, which makes the factor investing literature logically inconsistent.
A and B: two types of spurious claims. They have different roots and implications, so it is important to be able to distinguish the two. B falsely claims that risk premia are time-varying.
Reasoning from A and B to the likely causes that make the claims spurious.
Monte Carlo simulations to show how disastrous B claims can be.
New explanation for factor investing: the time-varying nature of risk premia reported in journals is a likely consequence of under-controlling.
Proposal to make factor investing scientific again.

2. Association vs Causation: Math, Distinguishing, Nuances

My conclusions and action items:

Engrave into my grey matter the definitions of, and distinctions among, “interventions”, “observations”, “causation”, “association”, and “independence” from a probabilistic and statistical standpoint.
Understand the do-operator (a.k.a. an intervention).
The goal for this chapter is to:
1. Distinguish, differentiate, and be able to clarify these concepts. Do some mock quizzes with NotebookLM.
2. Confine the assumptions and the scope of the matter.
3. Derive further investigations and experiments based on the ideas above.

Definitions and clarifications

Keynote:

$X$ and $Y$ are statistically independent if and only if: $$ P[X = x, Y = y] = P[X = x]P[Y = y], \forall x, y $$
$X$ and $Y$ are associated when, for some $(x,y)$: $$ P[X = x, Y = y] \ne P[X = x]P[Y = y] $$
$X$ is the cause of $Y$ when: $$ Y = f(X) $$
Equivalently:
$X$ is said to cause $Y$ probabilistically when setting $X$ to a value $x$ increases the likelihood that $Y$ takes the value $y$
Causation formal definition: $$ X = x \text{ causes } Y = y \iff P[Y = y | do[X = x]] > P[Y = y] $$
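
To make the first two definitions concrete, here is a minimal sketch (my own illustration, not from the book) that checks independence on a joint probability table. The helper name `is_independent` and both 2x2 tables are hypothetical; rows index $x$ and columns index $y$.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """True if P[X=x, Y=y] == P[X=x] P[Y=y] holds for every cell of the table."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal P[X = x] (sum over columns)
    p_y = joint.sum(axis=0)  # marginal P[Y = y] (sum over rows)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Hypothetical joint tables for binary X and Y.
independent_joint = np.outer([0.3, 0.7], [0.4, 0.6])  # built as a product of marginals
associated_joint = np.array([[0.4, 0.1],
                             [0.1, 0.4]])             # mass concentrated on x == y

print(is_independent(independent_joint))  # True
print(is_independent(associated_joint))   # False
```

Note that this only tells independence and association apart. Nothing in a joint table alone can establish the causal definition, which needs $do[X=x]$.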

Gemini’s deconstruction of the do-operator and the definition of causation, explained as if I were in junior high:


## The Core Problem: Passive Observing vs. Active Smashing

Before we look at the math, think about this:

* **Scenario A (Observing):** You stand on a street corner in the rain. You notice that almost everyone walking by has an open **umbrella ($X$)**, and the **ground is wet ($Y$)**.
* **Scenario B (Intervening):** You walk up to someone on a perfectly sunny day and force them to open an umbrella ($X$). Does the ground suddenly become wet ($Y$)? Of course not.

Standard statistics and econometrics are great at **Scenario A**. They just look at data that already exists. But they traditionally lack a clean mathematical symbol to represent **Scenario B**, where you actively step in and force a change.

That is where Judea Pearl’s **$do$-operator** comes in.

---

## 1. What is the $do$-operator?

In standard probability, you write $P[Y|X]$, which reads: *"The probability of $Y$ happening, **given that we observe** $X$ happening."* If $X$ is open umbrellas and $Y$ is wet ground, $P[\text{Wet Ground} | \text{Umbrellas}]$ is very high. But that doesn't mean umbrellas *cause* wet ground.

When we write:


$$do[X = x]$$

We are saying: **"We don't care what the world was doing naturally. We are physically stepping in, clearing out any outside factors, and forcing $X$ to equal $x$."**

* **$P[Y | X = x]$**: You look through a pair of binoculars at the market or the world and count what happens.
* **$P[Y | do[X = x]]$**: You run a controlled experiment. You isolate $X$ and force it to happen, severing it from any hidden, underlying causes (like the weather).

---

## 2. Deconstructing the Definition of Causation

The text highlights a formal definition:

$$X = x \text{ causes } Y = y \iff P[Y = y | do[X = x]] > P[Y = y]$$

Let’s break this inequality down piece by piece:

* **$\iff$**: This just means "if and only if." It's a strict mathematical two-way street.
* **$P[Y = y]$**: This is the **baseline probability**. What is the normal chance of $Y$ happening out in the wild on any random day?
* **$P[Y = y | do[X = x]]$**: This is the **experimental probability**. What is the chance of $Y$ happening *after* we actively step in and force $X$ to happen?

### The Rule

If you force $X$ to happen, and the probability of $Y$ happening **increases** compared to its normal baseline, then **$X$ causes $Y$**.

If forcing $X$ to happen does absolutely nothing to the baseline probability of $Y$, then $X$ does *not* cause $Y$—no matter how highly correlated they looked when you were just passively watching.

---

## Why Lopez de Prado Cares About This for Investing

In quantitative finance, backtests are often just passive observations of historical data ($P[Y|X]$). You see that a certain factor $X$ and stock returns $Y$ went up together in the past.

But when you actually deploy real money into production, you are executing an **intervention**—you are doing $do[X=x]$ by trading. If the relationship wasn't truly *causal*, your strategy will fall apart in production because you mistook a statistical coincidence (or a shared hidden driver) for a real lever you could pull.


Where:

$P$ is the probability
$X=x$ means the variable $X$ takes the value $x$
$f(X)$ is a causal mechanism
An “intervention” is actively setting $X$ to the value $x$ in an experimental environment, denoted by Judea Pearl as $do[X=x]$ (the do-operator).
An “observation” is passively recording that $X$ took the value $x$.

Fact:

The data-generating process (the entire picture of the system in which the causal relations among events and variables live) is important yet elusive. Sad but true, due to its complexity and invisibility.

Causal graphs

$X$ and $Y$ can be a part of a more complex system.

What it definitely does:

Visualizes the cause-effect relations for a determined/observed subset of the system, represents do-operations, and ties them to the ceteris paribus (comparing apples to apples) assumption.
Declares the variables in this system, the influences these variables exert on each other, and the direction of causality.
Provides the topology needed to estimate causal effects.

What it does not necessarily do:

Represent the whole system, or give the entire picture of the system that $X$ and $Y$ are part of.

graph LR
    %% Subgraph 1: Before the do-operation
    subgraph Before [Before do-operation]
        Z1((Z)) -->|1| X1((X))
        Z1((Z)) -->|2| Y1((Y))
    end

    %% Invisible link forcing "Before" to be on the left of "After"
    Y1 ~~~ Z2

    %% Subgraph 2: After the do-operation
    subgraph After [After do-operation]
        X2((X))
        Z2((Z)) -->|2| Y2((Y))
    end

In this causal graph:

$Z$ causes $X$ and $Y$.
$Z$ is a confounder for $X$ and $Y$, and because of this, $X$ and $Y$ can be associated without being causally related.
Confounder $Z$ does not necessarily induce or represent the entire picture of the associations between $X$ and $Y$.
$do[X=x]$ removes arrow 1 and has no effect on the probability of $Y=y$: $P[Y=y|do[X=x]] = P[Y=y]$

⭐️⭐️⭐️ For all of the above, refer to Chap 2, pages 3-4 of the book.
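
As a sanity check of the confounder graph ($Z \to X$, $Z \to Y$), here is a small simulation sketch. It is my own illustration, not taken from the book: the probabilities (0.5, 0.9/0.1, 0.8/0.2), the sample size, the seed, and the `simulate` helper are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_x=None):
    """Draw (X, Y) from the graph Z -> X, Z -> Y; if do_x is given, force X = do_x."""
    z = rng.random(n) < 0.5                        # confounder Z (e.g. "it rains")
    if do_x is None:
        x = rng.random(n) < np.where(z, 0.9, 0.1)  # arrow 1: Z -> X
    else:
        x = np.full(n, bool(do_x))                 # intervention: arrow 1 is cut
    y = rng.random(n) < np.where(z, 0.8, 0.2)      # arrow 2: Z -> Y (X plays no role)
    return x, y

x_obs, y_obs = simulate()        # passive observation
_, y_do = simulate(do_x=1)       # intervention do[X=1]

p_y = y_obs.mean()                   # baseline P[Y=1]
p_y_given_x = y_obs[x_obs].mean()    # observational P[Y=1 | X=1]
p_y_do_x = y_do.mean()               # interventional P[Y=1 | do[X=1]]

print(f"P[Y=1]          ~ {p_y:.3f}")
print(f"P[Y=1 | X=1]    ~ {p_y_given_x:.3f}")
print(f"P[Y=1 | do[X=1]] ~ {p_y_do_x:.3f}")
```

Under these assumed parameters the population values are $P[Y=1]=0.5$, $P[Y=1|X=1]=0.74$ and $P[Y=1|do[X=1]]=0.5$: conditioning on $X$ moves the probability of $Y$, forcing $X$ does not.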

Author’s conclusions:

Causality goes beyond statistics. It is tied to interventions and complex systems, and it is distinct from association. It is inappropriate to describe causality in the associational language of conditional probabilities.
Association does not imply causation, but causation does imply association, because setting $X=x$ through an intervention like $do[X=x]$ is associated with the outcome $Y=y$.
Causation is directional (vector-like): the statement “$X$ causes $Y$” implies $P[Y=y|do[X=x]]>P[Y=y]$, but it does not imply $P[X=x|do[Y=y]]>P[X=x]$.
Causation is sequential: “$X$ causes $Y$” means the value of $X$ is set first, and only after that $Y$ adapts.
The ceteris paribus assumption simulates an intervention, and its implications can only be understood with a causal graph.
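
The directional and sequential points can be sketched with a toy simulation of the graph $X \to Y$. This is my own illustration under assumed numbers (the 0.5 and 0.8/0.2 probabilities, the seed, and the `simulate` helper are hypothetical), not the author's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def simulate(do_x=None, do_y=None):
    """Sample from the graph X -> Y, optionally forcing X or Y to a fixed value."""
    # X is set first ...
    if do_x is None:
        x = rng.random(n) < 0.5
    else:
        x = np.full(n, bool(do_x))
    # ... and only then Y adapts to X, unless Y itself is forced.
    if do_y is None:
        y = rng.random(n) < np.where(x, 0.8, 0.2)
    else:
        y = np.full(n, bool(do_y))
    return x, y

x_obs, y_obs = simulate()
_, y_after_do_x = simulate(do_x=1)
x_after_do_y, _ = simulate(do_y=1)

p_y = y_obs.mean()                  # P[Y=1]
p_x = x_obs.mean()                  # P[X=1]
p_y_do_x = y_after_do_x.mean()      # P[Y=1 | do[X=1]]
p_x_do_y = x_after_do_y.mean()      # P[X=1 | do[Y=1]]
p_x_given_y = x_obs[y_obs].mean()   # P[X=1 | Y=1], purely observational

print(f"P[Y=1]={p_y:.3f}  P[Y=1|do[X=1]]={p_y_do_x:.3f}")
print(f"P[X=1]={p_x:.3f}  P[X=1|do[Y=1]]={p_x_do_y:.3f}  P[X=1|Y=1]={p_x_given_y:.3f}")
```

With these assumed parameters, forcing $X$ lifts the probability of $Y$ (about 0.8 versus 0.5), while forcing $Y$ leaves $X$ at about 0.5. The observational $P[X=1|Y=1]$ is still about 0.8, which is why association alone cannot reveal the direction.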

3. The Three Steps of Scientific Discovery

What is Science?

“The systematic organization of knowledge in the form of testable explanations of natural observations.”

To mature in science is to identify the causations and the mechanisms behind them.

For knowledge to be scientific, it has to follow three steps:

Observe (phenomenology): observe patterns of associated events, or exceptions to such patterns (anomalies);
Theorize: propose a causal mechanism and explanation;
Falsify: run experiments against each component of the causal mechanism.

3.1 The Phenomenological Step

The goal of this step is to:

State a “problem situation”, or observe an anomaly

What to do at this step:

Observe associated events (an anomaly): $P[X=x,Y=y] \ne P[X=x]P[Y=y]$;
May model the joint distribution $P[X=x,Y=y]$
Derive conditional probabilities
Make associational statements (or predictions) of the form $E[Y|X=x]=y$ using machine learning
May produce empirical evidence of a causal effect
Draw an inductive conclusion: "For some unknown reason, the anomaly will reoccur."
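
The checklist above can be sketched on synthetic data. Everything here is an assumption for illustration: the sample is simulated, the 0.6/0.4 probabilities are made up, and a plain conditional frequency table stands in for the machine learning model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical sample: x is a binary signal, y a binary outcome associated with it.
x = rng.integers(0, 2, size=n)
y = (rng.random(n) < np.where(x == 1, 0.6, 0.4)).astype(int)

# Model the joint distribution P[X=x, Y=y] as an empirical 2x2 frequency table.
joint = np.zeros((2, 2))
np.add.at(joint, (x, y), 1)   # count each observed (x, y) pair
joint /= n

# Observe the association: compare the joint with the product of marginals.
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
gap = joint - np.outer(p_x, p_y)        # nonzero cells signal association

# Derive conditional probabilities and an associational prediction.
cond_y_given_x = joint / p_x[:, None]   # row x holds P[Y=y | X=x]
e_y_given_x = cond_y_given_x[:, 1]      # E[Y | X=x] for a 0/1 outcome

print("P[X,Y] - P[X]P[Y]:\n", gap.round(3))
print("E[Y | X=0], E[Y | X=1]:", e_y_given_x.round(3))
```

The output only says that $Y$ is more likely when $X=1$. It says nothing about why, which is exactly where this step is supposed to stop.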

What not to do at this step:

Explore, provide, or conclude the cause of or reasons for the observed association

3.2 Theorization

The goal of this step is to: