How Validation Works

Every StockTwits caller on VERTEX is graded by the same honest, forward-return method — no cherry-picking, no fabricated results, no hidden assumptions.

⚖ The Honest Method

We grade a caller by looking at what the market actually did after they made a public call. No subjective entry/stop/target judgments. No retroactive cherry-picking. The market outcome is the only judge. If the edge isn't there, we flag it honestly.

Forward-return grading of public StockTwits Bullish / Bearish cashtag tags. Entry = next session open after the post (no look-ahead); held 1 / 5 / 20 trading days; direction-adjusted (Bullish = up is correct, Bearish = down is correct). Complete over the pulled stream; realized horizons only; small samples flagged. Hypothetical/past, not advice.

How Grading Works — Step by Step

  1. Pull the stream. We fetch the caller's public StockTwits feed: every message they've posted. We pull enough history to give 20-day forward windows a fair sample.
  2. Find calls. We keep only posts tagged Bullish or Bearish with a cashtag (like $QQQ). A post with a basket of 3+ symbols is treated as spray-tagging and skipped; those aren't real directional calls.
  3. Set entry price. The entry is the next session's open price: the first trade price on the next market day after the post. You can't act on a call before you see it, so this is the earliest honest entry.
  4. Measure forward returns. We hold the position for 1, 5, and 20 trading days and measure the actual price change. Direction-adjusted: Bullish wants the price up, Bearish wants it down. If price goes the caller's way, it's a "correct" call at that horizon.
  5. Realized horizons only. If a call was made 3 trading days ago, we only report the 1-day result. The 5-day and 20-day columns show "—" (not yet available). We never fabricate forward-looking results.
  6. Wilson 95% confidence interval. We compute the call accuracy (hits / total) and wrap it in a Wilson score confidence interval. Read it this way: if we repeated this process many times, about 95% of the intervals built this way would contain the caller's true hit rate. It stays well-behaved for small samples, where simpler intervals break down.
  7. Flag the result. Based on the 95% CI vs the 50% coin-flip baseline, we assign one of four flags, from "edge" to "insufficient sample."
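The entry and forward-return steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not VERTEX's actual code: the `Bar` record and the `forward_return` helper are hypothetical names, and exiting at the close of the N-th held session is an assumption (the method fixes the entry at the next session's open but does not name the exit price).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Bar:
    """One daily session (hypothetical shape for this sketch)."""
    day: date
    open: float
    close: float


def forward_return(bars: Sequence[Bar], post_day: date, sentiment: str,
                   horizon: int) -> Optional[float]:
    """Direction-adjusted return over `horizon` sessions, or None if unrealized.

    Assumptions: `bars` is sorted by day; entry is the open of the first
    session strictly after `post_day`; exit is the close of the
    `horizon`-th held session, counting the entry session as the first.
    """
    if sentiment not in ("Bullish", "Bearish"):
        raise ValueError("sentiment must be 'Bullish' or 'Bearish'")
    entry_idx = next((i for i, b in enumerate(bars) if b.day > post_day), None)
    if entry_idx is None:
        return None  # no session has opened after the post yet
    exit_idx = entry_idx + horizon - 1
    if exit_idx >= len(bars):
        return None  # horizon not realized: the bars do not exist yet
    raw = bars[exit_idx].close / bars[entry_idx].open - 1.0
    # Bullish wants the price up, Bearish wants it down.
    return raw if sentiment == "Bullish" else -raw


bars = [
    Bar(date(2024, 1, 2), 100.0, 101.0),
    Bar(date(2024, 1, 3), 102.0, 104.0),
    Bar(date(2024, 1, 4), 104.0, 103.0),
]
one_day = forward_return(bars, date(2024, 1, 2), "Bullish", 1)
five_day = forward_return(bars, date(2024, 1, 2), "Bullish", 5)
```

Here the post lands on January 2, so the entry is the January 3 open; the 1-day result is positive for a Bullish tag, and the 5-day result is `None` because those sessions have not happened yet. A call counts as correct at a horizon when its direction-adjusted return is above zero.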

Edge Results Shown First

When you look at the leaderboard, callers are ranked in a specific order: callers with a statistically-supported edge (95% CI entirely above the 50% coin-flip) appear first. Then callers whose results are negative (95% CI entirely below 50%). Then callers with insufficient data or no significant edge.

This isn't bias; it's the honest way to surface signal. If a caller has only 15 calls with a 73% win rate, they'll be ranked below a caller with 200 calls and a 60% win rate. The first caller falls under the 20-call minimum and is flagged insufficient sample, while the second has a 95% CI of roughly [53%, 67%], entirely above 50%.

Why this matters: A small sample that looks amazing is often just noise. By ranking edge results first, we surface the callers with enough data to say something meaningful — not the ones who happened to get lucky on their first 10 calls.

(1) Significance Flag Explanation

Every horizon gets one of four flags. These aren't opinions — they're computed from the statistics:

| Flag | Meaning | What It Looks Like |
| --- | --- | --- |
| edge | The caller's hit rate is statistically above 50%: the lower bound of the 95% CI is above 50%. This means we can say with 95% confidence that the caller is better than a coin flip. | CI: [58.2%, 74.1%] → lower bound 58.2% > 50% |
| negative | The caller's hit rate is statistically below 50%: the upper bound of the 95% CI is below 50%. This caller is worse than random at this horizon. | CI: [31.3%, 46.8%] → upper bound 46.8% < 50% |
| not significant | The 95% CI overlaps 50%. The caller's results could be from random chance; we can't rule out a coin flip. | CI: [42.5%, 61.2%] → 50% is inside the range |
| insufficient sample | Fewer than 20 calls at this horizon. No statistical test is reliable with such a small sample. We simply refuse to call it anything. | n = 12 → below minimum gate of 20 |

Key rule: A caller with 4/5 correct calls (80%) shows "insufficient sample," not "edge." The minimum sample of 20 is a hard gate with no exceptions. Even 19/19 (100%) would be flagged insufficient because nineteen calls aren't enough to trust the statistics.
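The four flags reduce to a short decision rule. The sketch below is one way to write it; the function name `significance_flag` and the constants are hypothetical, and the sample sizes in the usage lines are made up for illustration since the table rows do not state them.

```python
MIN_CALLS = 20  # hard sample gate described above
BASELINE = 0.5  # coin-flip hit rate


def significance_flag(n: int, ci_low: float, ci_high: float) -> str:
    """Map a horizon's call count and 95% CI bounds to one of the four flags."""
    if n < MIN_CALLS:
        # Checked first: the gate applies no matter how good the CI looks.
        return "insufficient sample"
    if ci_low > BASELINE:
        return "edge"
    if ci_high < BASELINE:
        return "negative"
    return "not significant"


# CI bounds taken from the table rows; n=60 is an illustrative value.
flag_edge = significance_flag(60, 0.582, 0.741)
flag_negative = significance_flag(60, 0.313, 0.468)
flag_overlap = significance_flag(60, 0.425, 0.612)
flag_small = significance_flag(12, 0.600, 0.900)
```

Checking the sample gate before the interval is what makes a 19/19 record come out as "insufficient sample" rather than "edge".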

(2) 95% Confidence Interval — Explained

The 95% confidence interval is a range that tells you how reliable the observed hit rate is. It answers: "If we could watch this caller make calls forever, where would their true hit rate likely fall?"

    from math import sqrt

    # Wilson score interval (robust for small sample sizes)
    p = hits / n                  # observed hit rate
    z = 1.96                      # z-score for 95% confidence
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    CI = [center - half_margin, center + half_margin]

Intuition: A wider CI = less certainty. A narrower CI = more confidence in the observed hit rate.
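For readers who want to reproduce the intervals quoted on this page, the formula can be wrapped in a small function. This is a sketch; the name `wilson_interval` and its signature are assumptions, not a VERTEX API.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a hit rate; z=1.96 gives 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_margin, center + half_margin


low, high = wilson_interval(120, 200)
print(f"{low:.1%} to {high:.1%}")
```

For 120 hits in 200 calls this gives an interval of roughly 53% to 67%, entirely above the 50% baseline.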

📏

Wide CI = Small Sample

A caller with 15/25 (60%): the 95% CI is roughly [41%, 77%], a 36-point range. We can't say much.

CI width ≈ 36pp → very uncertain
🎯

Narrow CI = Large Sample

A caller with 120/200 (60%): the 95% CI might be [53%, 67%] — a 14-point range. The true rate is probably near 60%.

CI width ≈ 14pp → reasonably certain

Why Wilson, not the simple "p ± 1.96×SE" formula? The simple formula can give impossible results near 0% or 100% (like a lower bound below 0%). The Wilson interval handles edge cases correctly and is especially reliable for the small-ish sample sizes we deal with in caller grading (20–500 calls).

(3) Why Sample Size Matters

🔬

Small Sample = Noise

With only 10 calls, a 7-3 record (70%) looks amazing, but the 95% Wilson CI is roughly [40%, 89%], which overlaps 50%. This caller could easily just be lucky.

n=10 · 70% hit rate · CI overlaps 50% → NOISE
📊

Large Sample = Signal

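With 200 calls, a 110-90 record (55%) gives a 95% CI of [48%, 62%]. The interval still narrowly includes 50%, so the flag is not significant, but at about 14 points wide it is a much more reliable read on skill than the small-sample case.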

n=200 · 55% hit rate · CI: [48%, 62%]

The minimum sample gate of 20 is a hard floor. Below 20 calls at any horizon, we don't run the significance test at all; the horizon is flagged insufficient sample. This prevents the "hot streak" fallacy: a caller who went 5/5 in their first week isn't crowned a genius, they're marked as needing more data.

Even above 20, the CI width shrinks slowly. To cut the CI width in half, you need roughly 4× the sample size. That's why a caller with 80 calls has roughly half the CI width of a caller with 20 calls — it takes real track record to narrow the uncertainty.
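The 4× rule can be checked numerically. The sketch below compares the Wilson interval width for the same 60% hit rate at 20 and at 80 calls; `wilson_width` is a hypothetical helper written for this example.

```python
import math


def wilson_width(hits: int, n: int, z: float = 1.96) -> float:
    """Full width (upper minus lower bound) of the Wilson score interval."""
    p = hits / n
    denominator = 1 + z**2 / n
    half_margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return 2 * half_margin


# Same 60% hit rate at 20 calls and at 4x the sample, 80 calls.
width_20 = wilson_width(12, 20)
width_80 = wilson_width(48, 80)
ratio = width_80 / width_20
print(f"n=20: {width_20:.1%}  n=80: {width_80:.1%}  ratio: {ratio:.2f}")
```

The ratio comes out slightly above one half rather than exactly 0.5, because the z² correction terms in the Wilson formula still matter at small n; the square-root rule becomes more exact as samples grow.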

(4) Why "Edge" Results Come First

The leaderboard isn't sorted by raw hit rate. It's sorted by statistical significance first, then hit rate. Here's the ranking priority:

  1. Edge callers: those whose 95% CI is entirely above 50%. These are the only callers we can say with confidence are outperforming a coin flip.
  2. Negative callers: those whose 95% CI is entirely below 50%. These underperform random chance (which is itself useful information).
  3. Not significant / Insufficient: callers whose CI overlaps 50%, or who haven't made enough calls. You can't conclude anything from their results.

Why not just sort by hit rate? Because a 10/10 caller (100%) is far less meaningful than a 120/200 caller (60%). The 10/10 caller's 95% CI runs from about 72% to 100%, a 28-point range built on ten observations, and ten calls is below the 20-call gate, so the result is flagged insufficient sample. The 120/200 caller has a tight CI of roughly [53%, 67%], entirely above 50%. The second caller has more evidence of actual skill, even though their raw number looks worse.

This ranking system is designed to surface signal, not noise. It's the honest way to grade callers.
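The ranking priority can be expressed as a two-part sort key. This is a sketch only: the row shape, the `TIER` mapping, and the choice to order by descending hit rate inside every tier are assumptions made for the example.

```python
from typing import Dict, List

# Lower number sorts first. "not significant" and "insufficient sample"
# share the last tier, following the ranking priority above.
TIER = {"edge": 0, "negative": 1, "not significant": 2, "insufficient sample": 2}


def rank_callers(rows: List[Dict]) -> List[Dict]:
    """Sort by significance tier first, then by hit rate (highest first)."""
    return sorted(rows, key=lambda r: (TIER[r["flag"]], -r["hit_rate"]))


rows = [
    {"caller": "A", "flag": "insufficient sample", "hit_rate": 1.00},
    {"caller": "B", "flag": "edge", "hit_rate": 0.60},
    {"caller": "C", "flag": "negative", "hit_rate": 0.39},
    {"caller": "D", "flag": "not significant", "hit_rate": 0.55},
]
ranked = [r["caller"] for r in rank_callers(rows)]
```

Caller A's perfect raw hit rate does not lift it above B or C, because the tier is compared before the hit rate.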

Basket Spray-Tags Are Skipped

StockTwits callers sometimes tag 5, 10, or even 20 symbols in a single post. These "basket" or "spray-tag" posts are not genuine directional calls — they're fishing expeditions. We skip any post that tags 3 or more symbols. The number of skipped spray-tags is always reported in the audit so you can see how much content was excluded.

Why skip them? If a caller posts "$QQQ $SPY $IWM $TLT $GLD all Bullish" and one symbol happens to go up, that's not a call — it's coverage. Including spray-tags would inflate win rates artificially. Honesty demands we exclude them.
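A spray-tag filter along these lines could look like the sketch below. It assumes cashtags are parsed from the raw post body with a simple regular expression (which ignores symbols containing dots or digits); the names `cashtags`, `is_spray_tag`, and `SPRAY_THRESHOLD` are hypothetical.

```python
import re
from typing import List

CASHTAG = re.compile(r"\$([A-Za-z]+)")  # simplified: letters only
SPRAY_THRESHOLD = 3  # posts tagging 3 or more symbols are skipped


def cashtags(body: str) -> List[str]:
    """Distinct cashtags in a post body, upper-cased, in order of appearance."""
    seen: List[str] = []
    for match in CASHTAG.finditer(body):
        symbol = match.group(1).upper()
        if symbol not in seen:
            seen.append(symbol)
    return seen


def is_spray_tag(body: str) -> bool:
    """True when the post tags enough distinct symbols to count as a basket."""
    return len(cashtags(body)) >= SPRAY_THRESHOLD


posts = ["$QQQ $SPY $IWM $TLT $GLD all Bullish", "$QQQ breaking out"]
kept = [p for p in posts if not is_spray_tag(p)]
skipped_count = len(posts) - len(kept)  # the figure reported in the audit
```

Counting distinct symbols means a post that repeats the same cashtag several times is not mistaken for a basket.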

Realized Horizons Only

This is one of the most important honesty rules. If a call was made 3 trading days ago, we only report the 1-day forward return. The 5-day and 20-day columns show "—" because not enough market sessions have passed yet.

We never extrapolate, never simulate, never assume. If the bars don't exist on the chart, we don't report a number. This means a caller's scorecard fills in over time: older calls have all three horizons, recent calls only have 1-day.
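One way to picture the rule is a scorecard row that only keeps a horizon once its holding period has fully elapsed. In this sketch `scorecard_row` is a hypothetical helper, `sessions_since_entry` is assumed to count completed sessions including the entry session, and `None` stands in for the placeholder shown in the unrealized columns.

```python
from typing import Dict, Optional

HORIZONS = (1, 5, 20)  # holding periods in trading days


def scorecard_row(sessions_since_entry: int,
                  measured: Dict[int, float]) -> Dict[int, Optional[float]]:
    """Keep a horizon's return only if its holding period has fully elapsed."""
    return {
        h: measured.get(h) if sessions_since_entry >= h else None
        for h in HORIZONS
    }


recent = scorecard_row(3, {1: 0.012})
older = scorecard_row(25, {1: 0.012, 5: -0.004, 20: 0.031})
```

The recent call carries only its 1-day figure, while the older call has all three horizons filled in.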

⚠ Decision-Support Only

This is decision-support, not investment advice. Past results do not guarantee future performance. The validation system grades historical calls using a disclosed, consistent method — it does not predict future market behavior, recommend trades, or certify any caller's future accuracy. StockTwits caller data is public and read-only; VERTEX does not post, interact, or trade on behalf of any user.