
Size Doesn't Matter | Who Has the Best Data?


The Daily Newsletter for Intellectually Curious Readers

  • We scour 100+ sources daily

  • Read by CEOs, scientists, business owners and more

  • 3.5 million subscribers

xAI is Crushing it

It feels like the puck keeps shifting in terms of the ‘best’ AI. First, it was ChatGPT. Then Claude seemed to take over. Now Grok from xAI, which launched recently, has become ‘best-in-class’ on many standardized tests.

Why is it that a company with an order of magnitude less fundraising, launching just this year, can beat out OpenAI and Anthropic?


Data is the New New Oil

Last week, I had a chat with my dad about who has the 'best data.' I thought it would make for a great newsletter post, so I'm sharing it here:

1. More data ≠ best data

In 2006, British mathematician Clive Humby coined the phrase “Data is the new oil.” Entire industries were built off the data captured, better targeting ad spend and ultimately increasing the average basket. Even better, some platforms successfully got consumers to put their data in themselves (age, gender, interests, birthday, etc.). People were literally handing you free money!

Because available data underpins the AI models, there are levels to data value.

You need new data, edge cases, and data that has an economic arbitrage. There are five buckets of data value:

  • New Data: Is this data new, and if so, how new? This minute, week, month, year? Ken Griffin will spend $1B to get data 5 milliseconds faster than the world…

  • Meta-Data vs. Raw Data: Meta-data is data inferred from information. Apple can determine how many people wake up before 6am, average heart rate, etc. Raw data, on the other hand, can almost be classified as people’s willingness to share information. Taking a survey about ‘how likely you are to go to the gym’ is somewhat helpful, but knowing what people actually do is more so. The problem is that people are driven more by status and by sounding correct than by reality. There’s also often fake data (see below).

  • Edge Cases: New data that’s truly rare is considered an edge case. This is why having a lot of data is still important. In self-driving, this was considered something like ‘deer runs in front of your car.’ That’s important because if the edge case can’t be solved, you hit the deer. Now, edge cases in self-driving are insanely rare because the systems have seen so much, but when we do find edge cases we need to classify them, and quickly.

  • Structured vs. Unstructured Data: Unstructured data requires cleaning and structuring. This is costlier, but AI’s primary value proposition is turning unstructured data into structured data. For example: take 500k websites, ping each one, come up with a few interesting topics, and then send an email that feels highly individualized. That requires a browser bot, inference, and some email infrastructure.

  • Economic Arbitrage: This is the most important. If I learn that people are consuming less Coca-Cola, I personally can only profit by shorting Coca-Cola, and my upside is capped by my bank account and how much I’m willing to invest. On the flip side, Coca-Cola itself really does benefit from this: they can save billions by knowing the trend, adjusting their supply chain, etc. So the primary beneficiaries need to be insanely large multinationals.

There’s one other factor…

  • True Data: This seems obvious, but there’s a ton of fake data. A Russian bot farm focused on spreading misinformation needs to be removed from the feedback loop. If, somehow, an AI tool starts to trust bad data, it will completely destroy the credibility of that system. This has happened, most notably, with Google.
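As a heavily simplified sketch of the unstructured-to-structured email pipeline described above: in practice the fetch would be a browser bot and the topic extraction an LLM inference call; here both are stubbed with stdlib string handling, and all names (`extract_topics`, `draft_email`, the sample page) are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"with", "that", "this", "from", "your", "their"}

def extract_topics(page_text: str, k: int = 3) -> list[str]:
    """The 'structuring' step: turn raw page text into a ranked topic list."""
    words = re.findall(r"[a-z]{4,}", page_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

def draft_email(company: str, topics: list[str]) -> str:
    """Produce a lightly personalized outreach email from the structured topics."""
    topic_line = ", ".join(topics)
    return (f"Hi {company} team,\n"
            f"I noticed your site focuses on {topic_line} -- "
            f"we work with teams in exactly that space.\n")

# One hypothetical site out of the 500k; at scale this loop runs per domain.
page = ("Acme builds pipeline automation. Pipeline tooling and "
        "automation for sales pipeline teams.")
email = draft_email("Acme", extract_topics(page))
```

The expensive parts in a real deployment are exactly the ones stubbed out here: crawling (browser bot), topic inference (model calls), and deliverability (email infrastructure).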

My thoughts:

Anthropic and OpenAI

I don't think Anthropic or OpenAI have a data moat. They're quite screwed, as they’re running on large sets of (mostly) purchased data. Sure, they can buy data from Reddit, but that includes a lot of fake data. They also rely heavily on data aggregators and web crawlers, which are arguably regurgitating their own content: Claude crawls a blog about ‘10 Ways to Improve Sales’ which it literally wrote! That’s not exactly useful. The partnerships are costly, and the data is often old and tarnished.

They do have a leg up on chatbot Q&A data. People ask Claude/ChatGPT more questions than any other tool - but this is a small % of all questions asked and is ever-fleeting. Early adopters want to use ‘the best’ AI chatbot and switch immediately; laggards tend not to use AI tools at all yet.

In this sense, I think X has the purest data source in terms of people's thoughts, but it lacks meta-data and a lot of economic data. The ideal social media channel has the most diverse and widespread set of thoughts and opinions. This goes beyond politics: you want a well-argued case from small rabbit holes and technical wonks. You want someone to complain about a leading indicator before the news does. Twitter/X does that best.

Reddit, although in a similar space of ideas, lacks user identification, which makes it hard to run a 'did this actually happen' check.

Meta-Data:

Arguably Meta, Google, and Apple have the best data; however, a lot of it is meta-data (the average person wakes up at 8am, texts 9 people per day, goes to the bathroom at this time, etc.). Meta-data isn’t worse than any other data; it’s perhaps better. If I say ‘I’m going to the gym’ and then actually don’t go:

The meta-data is far superior to the stated signal. You can then predict how much space is needed for gyms in a city, the supply of equipment, and other pertinent inputs.

However, this data probably won’t change much over time. The value is not in the data itself but in the delta. If last year 2% of people went to the gym and this year 1.9% do, the economic value is limited. Selling ‘access to proprietary data’ will then often fail unless you can find someone who truly benefits from the upside of that 0.1%. If you’re selling the data for $100k/year to a local gym in Wichita, it probably won’t do much good.
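The gym example can be put in rough numbers. This is a back-of-envelope sketch, not a model: every figure below (population sizes, revenue per member, the share of the shift each actor could capture) is a hypothetical placeholder, chosen only to show why the same delta is worth millions to a large actor and almost nothing to a small one.

```python
def delta_value(population: int, old_rate: float, new_rate: float,
                revenue_per_person: float, capture_share: float) -> float:
    """Rough economic value of a behavior-rate shift to an actor
    who can capture `capture_share` of the shifted revenue."""
    people_shifted = population * abs(new_rate - old_rate)
    return people_shifted * revenue_per_person * capture_share

# The same 0.1-percentage-point drop in gym attendance, two very different actors
# (all inputs hypothetical):
national_chain = delta_value(330_000_000, 0.020, 0.019, 600, 0.05)
wichita_gym = delta_value(400_000, 0.020, 0.019, 600, 0.01)
```

The delta is identical in both calls; only the addressable population and capture share change, which is the whole argument for why the primary beneficiaries of this kind of data need to be very large.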

Only Google has a clear edge in having so much under its ecosystem: Gmail, Maps, YouTube, Files, Calendar, Android, and Search combined cover a really extensive slice of the data value chain. But if that’s the case, why has Google seemingly done the worst of all the major FAANG companies at deploying that edge?

Economic Arbitrage:

Data across individuals is technically big data, but each individual datum has small value. Alternatively, ERP & CRM companies, Microsoft, and banks/payment processors have great corporate and economic data but struggle with edge cases and economic arbitrage. If Salesforce realizes companies are shuttering more often, that's not easy to arbitrage (maybe they can lay off early or cut some production, but that's it). American Express may be constantly watching things like default rates, but they certainly have no idea when the levee breaks. If anything, they’re likely too conservative to allocate capital when it’s a blinking KPI in their Executive Dashboard.

Microsoft is unique like Google in that they own multiple large brands, but they also have a terrible search console and a lot of corporate risk in creating something themselves. They don’t capture much data via phones or tablets, or from the many developers/creatives who choose Mac. They also don’t have a social media business, so I can’t exactly find their edge. This is why they’ve partnered with Anthropic and OpenAI!

Computer Vision Value:

Tesla currently has the most computer vision data, reviewing millions of miles each day. But someone like Waymo, agnostic to car brands, could overtake them with ease. Tesla cannot make a deal with Ford because Ford is a primary competitor. Waymo, which doesn’t produce cars, can (and will). They just signed a deal with Uber, which will likely put them ahead of Tesla in driving miles within 12 months.

Waymo unfortunately doesn't have a value ramp toward non-car use cases, which Tesla does (FSD and then Optimus). Waymo is owned by Google, which is notoriously bad with hardware.

Robots do seem like the surefire way to unlock value via computer vision, but there are tons of other, easier ways to capture that value today (vs. building a robot that might deploy in 2030).

Amazon and Nvidia:

Amazon has a ton of data that’s hard to analyze. The sleeper in Amazon is AWS, which is probably a global top-3 source of corporate data. The problem is that AWS cannot capture everything, since they are just the provider and are mostly limited to meta-data.

Amazon(.)com seems great, but it’s mostly economic data. While that’s not bad, they are mostly just the platform for supply chains. The data Amazon has is only useful for producers to understand trends. If Amazon owned the products, they could influence decisions directly. But they don’t own 99% of the things sold on Amazon. And while they can share that data with producers, they’re sharing it with (mostly) Chinese manufacturing companies. There’s the rub.

Nvidia is probably the sleeper pick here, but they work with such large tech companies that those companies are (hopefully) restricting the data flow.

That’s all from me, stay nimble.

Justin