46,890 Solana meme tokens analyzed

Predict meme token winners in the first 10 minutes.

A machine learning dataset and model comparison built on real on-chain data. XGBoost achieves 40% precision on genuine runners, 87x better than random selection.

46,890
Solana tokens
40%
precision on winners
87x
better than random
8
ML models compared
99.5%
overall accuracy

One-time purchase. Instant download. No subscription.

Everything you need to find the edge.

A complete, reproducible analysis pipeline. Not a black box.

46,890-Token Dataset

192 columns · 8 time snapshots per token (0–60 min) · labeled with actual outcome data. CSV format, ready to analyze.

2 Jupyter Notebooks

Step-by-step EDA (9 sections) and full 8-model comparison (9 sections). Runs in VS Code. Markdown explanations throughout.

8 ML Models Compared

XGBoost, Random Forest, LightGBM, CatBoost, Balanced RF, Stacking Ensemble, Logistic Regression, SVM. All tuned and benchmarked.

Full Documentation

192 variables documented: type, description, time point, value range. Python utilities included. Config-driven setup.

How the model works.

Three steps. One 10-minute window. Reproducible results.

01

Capture

Data is collected at 8 timestamps (t0, t5, t10, t20, t30, t40, t50, t60 minutes) after every token launch on Solana. 192 on-chain metrics per snapshot — wallet count, trade volume, early buyer positions, whale concentration, and more.
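The capture schedule above can be sketched as a small helper. The snapshot offsets come from the text; the function and metric names here are illustrative, not the package's actual code:

```python
# Snapshot offsets in minutes after launch, as described above.
SNAPSHOT_OFFSETS_MIN = [0, 5, 10, 20, 30, 40, 50, 60]

def snapshot_columns(metric_names):
    """Expand base metric names into one column per time snapshot,
    e.g. wallet_count -> wallet_count_t0 ... wallet_count_t60."""
    return [f"{m}_t{t}" for m in metric_names for t in SNAPSHOT_OFFSETS_MIN]

# 3 example metrics x 8 snapshots = 24 columns
cols = snapshot_columns(["wallet_count", "daily_trade_volume", "whale_concentration"])
print(len(cols))  # 24
```

With all 192 metrics expanded this way, the wide per-token layout of the dataset follows naturally.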

02

Predict

XGBoost uses only the first 10 minutes of data (t0, t5, t10 snapshots). 27 features — wallet count, early buyer activity, trade volume, whale concentration. A probability score is generated at t10.

03

Act

At threshold 0.60: 40% hit rate on real runners. Without the model, 10 random picks yield ~0 winners. With the model, 10 picks yield ~4 genuine runners. Tune the threshold to your risk profile.
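The thresholding step itself is a one-liner; a sketch, with 0.60 as the cutoff quoted above:

```python
import numpy as np

def flag_tokens(scores, threshold=0.60):
    """Return indices of tokens whose model score clears the cutoff."""
    return np.flatnonzero(np.asarray(scores) >= threshold)

picks = flag_tokens([0.12, 0.74, 0.58, 0.91, 0.60])
print(picks)  # [1 3 4]
```

Raising the threshold trades recall for precision; lowering it does the reverse, which is what "tune the threshold to your risk profile" means in practice.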

What the data reveals.

Findings that only emerge from large-scale on-chain analysis across 46,890 tokens.

++

More wallets at launch = stronger demand signal

Runners had a higher initial holder count (median 6 vs 2 for failures). The runner rate climbs monotonically — tokens with 50+ wallets at launch were 9.5× more likely to succeed. Genuine distributed demand, not a ghost token or single-wallet accumulation.

++

Early buyer activity is a positive sign

early_buyer_positions (early buyer wallets) is a strong positive predictor. Early buyers target tokens with genuine momentum. Their presence means informed capital has already entered.

++

Volume and transactions dominate

daily_trade_volume and daily_transactions are the two strongest model features. High early trading activity reliably predicts whether momentum builds or collapses.

+

Elevated ATH at launch signals potential

peak_valuation is non-linear. Runners launched with higher ATH (median $9,150 vs $7,000). Tokens in the top quartile had a 2.3× lift over base rate. The mediocre mid-range — the typical unremarkable launch — is where failures concentrate.

+

Longer bio correlates with success

Tokens with a real description (bio_word_length) show better outcomes. Lazy deployers who skip the bio rarely build community.

Tested against 8 models. XGBoost wins.

Out-of-sample results on a held-out 20% test set (9,378 tokens).

Precision-Recall Curves — All Models

Precision-Recall curves. XGBoost clearly outperforms all others. Random baseline ≈ 0.0046 — nearly flat.

SHAP Feature Importance — XGBoost

SHAP feature importance. Exactly which on-chain signals drive XGBoost's predictions — transparency into the model.

Time Evolution of Key Metrics

Wallet count, trade volume, and whale concentration from t0 → t5 → t10. Orange (winners) and blue (losers) diverge sharply. The 10-minute window is real.

Feature Signal Summary

Feature signal summary. Green = strong predictor, red = weak or negative. Based on EDA across all 46,890 tokens.

Daily Trade Volume by Class

Daily trade volume at launch by outcome class. Winners (orange) are measurably elevated from minute 0.

| Model | AUC-PR | AUC-ROC | Precision | Recall | Accuracy |
|---|---|---|---|---|---|
| XGBoost ★ | 0.099 | 0.929 | 40% | 9.3% | 99.5% |
| Stacking Ensemble | 0.092 | 0.961 | 7.9% | 74.4% | 95.9% |
| Random Forest | 0.090 | 0.956 | 13.5% | 11.6% | 99.3% |
| Balanced Random Forest | 0.087 | 0.960 | 9.0% | 48.8% | 97.5% |
| Logistic Regression | 0.085 | 0.909 | 8.3% | 60.5% | 96.7% |
| CatBoost | 0.060 | 0.903 | 6.7% | 25.6% | 98.0% |
| LightGBM | 0.058 | 0.844 | 14.8% | 9.3% | 99.3% |
| SVM (RBF) | 0.030 | 0.803 | n/a | n/a | 99.5% |

Random baseline AUC-PR = 0.0046. Dataset: 46,890 tokens, ~215:1 class imbalance.

A glimpse into the dataset.

10 real tokens from the dataset. Last column shows the actual return from minute 10 to minute 40.

| Wallet Count | Early Buyers | Daily Transactions | Daily Trade Volume | Whale Concentration | Peak Valuation | Return (t10→t40) |
|---|---|---|---|---|---|---|
| 0 | 2 | 4 | $714 | 0% | $7,000 | −82.3% |
| 12 | 12 | 32 | $2,083 | 8% | $9,800 | −16.9% |
| 12 | 16 | 27 | $3,232 | 25% | $9,300 | −45.7% |
| 2 | 1 | 2 | $626 | 7% | $5,000 | −91.4% |
| 11 | 19 | 81 | $9,602 | 14% | $10,400 | −33.4% |
| 0 | 4 | 4 | $879 | 0% | $7,600 | +48.1% |
| 0 | 7 | 14 | $5,067 | 0% | $11,500 | −67.8% |
| 5 | 3 | 18 | $1,840 | 31% | $8,200 | −28.9% |
| 8 | 6 | 44 | $4,380 | 19% | $12,100 | −55.1% |
| 9 | 10 | 21 | $834 | 8% | $7,900 | +3,731.8% |

Real tokens from the dataset. Token identifiers removed. Return = (usd_rate at t40 − usd_rate at t10) / usd_rate at t10 × 100

Case study

4 wallets.
8 early buyers.
+62% in 40 minutes.

At launch, this token had just 4 unique wallets. To most scanners, it looked dead on arrival — no meaningful volume, near-zero market cap, easy to dismiss.

But 8 early buyer wallets had already positioned. Daily transaction count was accelerating from t0 to t5. Whale concentration sat at 71% — a small group with conviction, not a retail dump. The model flagged it at t10 with a confidence score of 0.74.

By t40, the token had returned over 50% from its t10 price. The model catches ~4 tokens like this per 10 flags. Without the model: ~0 winners per 10 random picks.

Launch Signal Card
wallet count 4
early buyer positions 8
daily transactions (t0) 847
daily trade volume (t0) $12,400
whale concentration 71%
peak valuation (t0) $290
model score (t10) 0.74 ✓
outcome (t40) +62% ↑

Choose your package.

One-time purchase. No subscription. Payment via USDC.

Common
$300 USD

Everything you need to run the analysis.

  • Dataset: 46,890 tokens × 192 columns
  • 8 ML models trained and evaluated
  • 2 Jupyter notebooks (EDA + model comparison)
  • Python utilities & config-driven setup
  • Full variable documentation (192 columns)
  • Threshold strategy guide
Most Popular Premium
$500 USD

Direct access to the creator.

  • Everything in Common
  • 1-hour live consultation call
  • Walk through the model results together
  • Custom threshold strategy for your trading style
  • Q&A on methodology and data collection

Payment via USDC on Solana. Delivery within 24 hours of payment confirmation. Consultation scheduled at your convenience.

On-chain verified data 46,890 real tokens Reproducible results

Questions.

Do I need to be a data scientist to use this?

No. The notebooks are designed for technical users comfortable with Python and VS Code. If you can run pip install and open a Jupyter notebook, you can run the full pipeline. The notebooks include explanatory markdown throughout every section.

What's the actual target variable — what counts as a "winner"?

A token is labeled as a winner if its price grew more than 50% from its 10-minute price to its 40-minute price. This is a conservative definition — many tokens pump more aggressively. The label uses the t10 snapshot as baseline and t40 as the outcome window.
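A minimal pandas sketch of that labeling rule, assuming snapshot columns named usd_rate_t10 and usd_rate_t40 (the column names are an assumption about the schema):

```python
import pandas as pd

df = pd.DataFrame({
    "usd_rate_t10": [0.001, 0.002, 0.004],
    "usd_rate_t40": [0.0016, 0.001, 0.009],
})

# Winner = price grew more than 50% from the t10 snapshot to the t40 snapshot.
growth = (df["usd_rate_t40"] - df["usd_rate_t10"]) / df["usd_rate_t10"]
df["winner"] = (growth > 0.50).astype(int)
print(df["winner"].tolist())  # [1, 0, 1]
```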

Why is precision 40% and not higher?

The dataset has a 215:1 class imbalance — only 0.46% of tokens are winners. 40% precision at threshold 0.60 means 4 out of every 10 model-flagged tokens are real runners. Without any model, you'd expect roughly 0 winners from 10 random picks. The 87x improvement over baseline is the meaningful number.
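The arithmetic behind those numbers, using the base rate quoted elsewhere on this page (≈0.46%, i.e. AUC-PR baseline 0.0046):

```python
base_rate = 0.0046   # fraction of tokens that are winners (~215:1 imbalance)
precision = 0.40     # model precision at threshold 0.60

print(round(precision / base_rate))  # lift over random picking: 87
print(round(10 * base_rate, 2))      # expected winners in 10 random picks: 0.05
print(round(10 * precision))         # expected winners in 10 model picks: 4
```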

Can I build a trading bot from this?

The dataset and model give you a strong statistical foundation, but production deployment requires your own real-time data pipeline to collect on-chain metrics at t0, t5, and t10. All 27 features used by the model map to publicly available on-chain data, queryable through standard Solana RPC endpoints and open-source indexing tools — building that pipeline is standard engineering work. This package covers the ML side — data collection infrastructure is outside scope.

What Python version and dependencies are required?

Python 3.9+. All dependencies are in requirements.txt: scikit-learn, xgboost, lightgbm, catboost, imbalanced-learn, shap, pandas, numpy, matplotlib, seaborn, pyyaml. Install with pip install -r requirements.txt.

What does the consultation call cover (Premium)?

A 1-hour session covering: understanding your model results, tuning the decision threshold to your risk tolerance, feature interpretation via SHAP analysis, and discussing how to adapt the model to your specific trading strategy. Scheduled at your convenience after purchase.

Why not just build this myself?

You could. Here's what that looks like in practice.

The infrastructure ran 24/7 for over 2 months — September through November — capturing on-chain metrics at 8 precise time intervals after every Solana token launch. That's 387,000+ raw data points across 16 snapshot files, each requiring real-time polling, error handling, deduplication, and storage. Not a weekend project.

Then comes the pipeline: parsing numeric strings in every format blockchain APIs return (K/M/B suffixes, percentage strings, scientific notation), merging 8 datasets by token address, engineering cross-snapshot features, constructing 12 binary targets, and producing a clean 46,890-token × 192-column dataset ready for analysis.
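As one illustration of the parsing work just described, here is a small helper that normalizes the mixed numeric formats blockchain APIs return. A sketch under stated assumptions, not the package's actual utility:

```python
def parse_numeric(s):
    """Convert strings like '1.2M', '45K', '8%', '$2,083', '3.1e-5' to floats."""
    s = s.strip().replace(",", "").replace("$", "")
    if s.endswith("%"):
        return float(s[:-1]) / 100          # percentage string -> fraction
    suffixes = {"K": 1e3, "M": 1e6, "B": 1e9}
    if s and s[-1].upper() in suffixes:
        return float(s[:-1]) * suffixes[s[-1].upper()]
    return float(s)                          # plain or scientific notation

print(parse_numeric("1.2M"))    # 1200000.0
print(parse_numeric("8%"))      # 0.08
print(parse_numeric("3.1e-5"))  # 3.1e-05
```

Multiply this by 192 columns, 8 snapshot files, and two months of polling, and the scale of the cleaning effort becomes clear.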

Then the analysis: a 10-section statistical EDA, 8 ML models trained and compared, threshold optimization, SHAP explainability, and validation against a 215:1 imbalanced dataset.

Training a model takes an afternoon. Building the dataset takes months. That's what you're buying.

Who built this

About Me

Nicolás Tursi

I'm a data analyst and ML practitioner studying at the Faculty of Exact and Natural Sciences, UBA (Buenos Aires).

Professionally, I lead an AI laboratory at a cybersecurity company, where I design and deploy machine learning systems on real-world, high-stakes data.

I've been active in crypto and meme token markets for years — long enough to recognize patterns that most traders dismiss as noise. This project is the intersection of both worlds: rigorous ML methodology applied to on-chain data at a scale that isn't feasible to collect manually.

Built with Python · scikit-learn · XGBoost · SHAP · 46,890 tokens · 2 months of data collection

Get in Touch

Contact

Questions about the data, methodology, or packages? Reach out directly and we'll get back to you promptly.

ntursisol@gmail.com

We typically respond within 24 hours.