6. Steve Huffman Looked at Google and OpenAI and Said "We're Not Giving Our Data Away for Free Anymore." Every SaaS Founder Should Read That Sentence Again. TechCrunch

published on 29 March 2026

Your data is your business's hidden revenue stream. Reddit's CEO, Steve Huffman, made headlines by halting free access to Reddit's data for companies like Google and OpenAI. Here's why this matters to SaaS founders:

  • Reddit's Licensing Success: By 2024, Reddit secured $203 million in licensing deals from AI companies, generating $66.4 million in projected annual revenue.
  • The Problem: AI companies scrape data to train their tools, often bypassing content owners. This reduces web traffic by up to 75% and creates competition using your own resources.
  • The Solution: Reddit introduced paid API access, charging major companies like Google ($60M/year) and OpenAI ($70M/year) while enforcing privacy and compliance safeguards.

Key takeaway for SaaS founders: Stop giving your data away for free. Treat it as a monetizable asset, protect it with technical and legal measures, and establish structured licensing agreements to create new revenue streams.

The Problem: Giving Away Data for Free Hurts Your Business

There was a time when SaaS companies believed that offering free access to their data would naturally drive traffic. Search engines would index their content and funnel users to their platforms. But the landscape has shifted. Now, AI companies scrape this data to develop competing tools that keep users locked into their ecosystems. For SaaS founders focused on building lasting value, understanding this change is crucial.

How AI Companies Use Your Data Without Paying

AI companies deploy bots to scrape massive amounts of data from SaaS platforms, forums, and websites. This data becomes the training material for Large Language Models (LLMs), which power billion-dollar AI tools. And here’s the kicker: the data owners don’t see a dime.

Some AI firms even disregard robots.txt, a standard tool meant to block unwanted crawlers. For example, in June 2025, Reddit filed a lawsuit against Anthropic, accusing the company’s bots of scraping Reddit over 100,000 times, despite Reddit explicitly blocking them [3]. Steve Huffman, Reddit’s CEO, summed up the issue:

"We've had Microsoft, Anthropic, and Perplexity treat all online content as free raw material." - Steve Huffman, CEO, Reddit [8]

By sidestepping data licensing, AI companies save billions of dollars. If they had to pay for access, developers would face costs running into hundreds of billions annually [7]. That’s immense value pulled directly from your platform, with no financial return for you.

What SaaS Companies Lose When Data Is Free

The consequences go far beyond missed licensing revenue. When AI companies use your data to train their models, they create tools capable of answering user questions directly. This drastically reduces the need for users to visit your site. A study by The Atlantic found that AI-enhanced search engines could resolve 75% of user queries without requiring a click-through to the original source [2].

This shift erodes the fundamental "give-and-take" that the web was built on. David Pierce, Editor at The Verge, voiced his concerns plainly:

"The basic social contract of the web is falling apart." - David Pierce, Editor, The Verge [5]

Unrestricted access to your data also weakens your competitive edge and impacts your company’s valuation. When rivals use your data to create derivative products, they’re essentially leveraging your intellectual property to compete against you. Reddit’s Chief Legal Officer, Ben Lee, didn’t hold back when addressing Anthropic’s alleged scraping:

"We will not tolerate profit-seeking entities like Anthropic commercially exploiting Reddit content for billions of dollars without any return for redditors or respect for their privacy." - Ben Lee, Chief Legal Officer, Reddit [3]

To put things into perspective, Google’s licensing deal with Reddit is reportedly worth $60 million annually [5]. These challenges highlight the urgency for SaaS companies to adopt strategies that ensure their data generates value for them - not just for others.

Reddit's Approach: Charging for Data Access

In mid-2023, Reddit made a major shift by moving away from its open-access model. It introduced a Public Content Policy and transitioned its API to a contract-based, premium service for commercial use [1][4]. This change means commercial partners now pay for higher usage limits, ensuring that those benefiting from Reddit's data contribute financially to the platform.

Licensing Deals with OpenAI and Google

OpenAI

By January 2024, Reddit had secured licensing agreements worth $203 million over two to three years [2]. These deals give AI companies structured, real-time access to Reddit's Data API, which houses over 1 billion posts and more than 16 billion comments [2].

Google's share of this agreement amounts to $60 million annually [9], while OpenAI's deal is estimated at $70 million per year [6]. By early 2025, these licensing agreements made up about 10% of Reddit's total revenue [9][6]. Reddit's COO, Jen Wong, highlighted the importance of this revenue:

"It's a small part of our revenue - I'll call it 10%. For a business of our size, that's material, because it's valuable revenue." – Jen Wong, COO, Reddit [9]

These deals aren't just about data access - they're "two-way" partnerships. AI companies gain Reddit's training data, while Reddit benefits from access to their large language models, which it uses to develop new tools for its community and moderators [11]. OpenAI even became an advertising partner on Reddit as part of its agreement [10].

To protect user privacy, these contracts enforce strict rules. Partners must honor user decisions to delete content, ensuring deletions are reflected in their models [1][3]. Reddit has also put in place technical and legal safeguards to ensure compliance with these agreements.

Technical Protections to Control Data Access

Reddit has reinforced its platform with technical measures to tightly regulate data access. It actively monitors and blocks "known bad actors" who attempt to scrape public content at scale without authorization [1]. Additionally, certain data types, like "mature content", are now restricted from API access [4].

Beyond technical barriers, Reddit also takes legal action when necessary. For example, it prohibits using its data for purposes like facial recognition, government surveillance, or identifying individuals [1][3]. If technical defenses fall short, legal consequences follow.

For researchers, Reddit created the r/reddit4researchers subreddit, offering controlled, free access for academics and those conducting research in good faith [1][4]. This approach strikes a balance, supporting legitimate research while safeguarding Reddit's commercial interests and maintaining control over how its data is used.

Since implementing these licensing deals, Reddit has seen a 450% year-over-year increase in non-advertising revenue [11]. This demonstrates how treating data as a monetizable resource can significantly impact a company's financial outlook.

What SaaS Founders Can Learn About Data Protection and Monetization

Reddit's decision to shift from free data access to a paid licensing model wasn't random - it was a calculated move involving policy updates, technical safeguards, and, when necessary, legal action. SaaS founders can take inspiration from this approach to secure their data assets and explore new revenue opportunities.

Create Licensing Agreements for Your Data

Start by identifying the data your platform generates that could hold commercial value. Reddit discovered that its vast repository of 1 billion posts and 16 billion comments was a goldmine for AI companies seeking real-time conversational data to train their models effectively [2]. Similarly, your platform might produce valuable data, such as user activity patterns, market trends, or industry-specific insights.

In May 2024, Reddit introduced its Public Content Policy, which allowed free access to non-commercial researchers via the r/reddit4researchers subreddit. However, commercial entities were required to sign licensing agreements for any use of Reddit's data to "power, augment, or enhance" their products [1].

When drafting these agreements, prioritize user privacy. Reddit's contracts, for instance, mandate that partners remove any data tied to user-deleted posts from their AI training datasets. They also forbid using the data for purposes like facial recognition, government surveillance, or individual identification [1]. These measures demonstrate that protecting user trust can go hand-in-hand with monetizing data.

Of course, licensing agreements are only effective if you can enforce them.

Protect Data Quality with Verification Systems

A licensing framework is meaningless without the ability to ensure compliance. Reddit has put technical safeguards in place to prevent unauthorized data scraping. For example, it uses robots.txt files to signal that automated crawling is off-limits without proper authorization. Additionally, Reddit actively monitors bot activity to catch and address violations [3].

When technical measures fail, legal action becomes the next step. A notable example is Reddit's June 2025 lawsuit against Anthropic for unauthorized scraping. This legal move sent a clear warning to others. As Chief Legal Officer Ben Lee stated:

"We will not tolerate profit-seeking entities like Anthropic commercially exploiting Reddit content for billions of dollars without any return for redditors or respect for their privacy." [3]

This approach highlights that enforcing data policies requires both technical vigilance and a willingness to take legal steps when necessary.

Focus on Long-Term Value, Not Quick Wins

While securing contracts and setting up technical defenses are crucial, thinking about the bigger picture is just as important. Reddit's strategy for data licensing emphasizes sustainable, long-term revenue over chasing short-term profits. The company's COO, Jen Wong, explained that although data licensing accounts for only about 10% of Reddit's revenue, it represents "valuable revenue" for a business of their scale [9].

Moreover, Reddit structured its licensing agreements as mutually beneficial partnerships. As CEO Steve Huffman explained:

"It didn't make sense for Reddit to continue to give 'all of that value to some of the largest companies in the world for free.'" [1][2]

The takeaway for SaaS founders? Your data isn't just a byproduct - it’s a growing asset. By treating it as a core revenue source, you can strengthen your business, create new income streams, and enhance your company’s overall valuation.

How to Build Your Data Monetization Strategy: 4 Steps

4-Step Data Monetization Strategy for SaaS Founders

4-Step Data Monetization Strategy for SaaS Founders

Reddit's transformation from offering free data access to generating $203 million through licensing wasn't a spur-of-the-moment decision. It required a deliberate strategy - identifying valuable data, setting up access controls, and forming partnerships. Here's how you can create your own data monetization plan.

Step 1: Identify Which Data You Can Monetize

Start with a thorough audit of your data. Separate public, monetizable content from private data. For example, public-facing content like user posts, comments, or discussions can be monetized, but private messages, emails, and browsing history should remain off-limits [1][12].

Certain types of data are especially appealing to AI companies. Content that reflects real-time trends - like news, fashion, or current events - holds high value because AI models rely on staying updated [2]. Additionally, data with rich conversational depth, such as Reddit's 1 billion posts and 16 billion comments, becomes attractive because it mirrors authentic human interaction on a massive scale [2].

Ask yourself: does your data have the potential to "power, augment, or enhance" a third-party product? If yes, you’ve found an asset worth monetizing. Just ensure that sharing it complies with privacy regulations and avoids exposing personal information [1].

Once you've pinpointed valuable data, the next step is to design a structure for controlled access.

Step 2: Set Up Paid API Access and Licensing Tiers

Create clear access tiers to differentiate between user types. For example, offer free access to non-commercial users - like researchers, moderators, or tools that improve user experience - while charging commercial entities that use your data for profit or AI development [1][3][4]. Reddit implemented this model in May 2024 with its Public Content Policy, allowing researchers to access data through the r/reddit4researchers subreddit, while requiring commercial users to sign licensing agreements [1].

For commercial users, establish tiered pricing based on usage and rights. Reddit’s deals with Google ($60 million per year) and OpenAI (around $70 million per year) highlight the potential of these partnerships, collectively contributing 10% of Reddit's $1.3 billion annual revenue in 2025 [6].

This structured approach ensures you can manage access while paving the way for partnerships with trustworthy AI companies.

Step 3: Choose the Right AI Partners

Partner only with AI companies that respect technical protocols and privacy standards. Avoid those who bypass standard practices, such as ignoring robots.txt files or using unofficial API channels. Reddit’s decision to reject Anthropic is a prime example of prioritizing trust [3][4]. As Steve Huffman put it:

"For any relationship, there has to be a foundation of trust. And so yes, there will be companies we simply won't work with because I don't think we will be able to come to an agreement, let alone enforce it." [12]

Ensure your contracts include strict privacy clauses. Prohibit partners from using data for identifying individuals, ad targeting, or activities like government surveillance, facial recognition, or background checks [1]. Additionally, partners must have the technical ability to honor user deletions and sync with your platform [1].

Once partnerships are in place, you can further increase your data's appeal through internal analysis.

Step 4: Use AI Tools to Increase Your Data's Value

Leverage AI tools internally to analyze and refine your data, making it more attractive to potential partners. AI-powered analytics can uncover patterns and trends, helping you identify which content types drive the most engagement. This insight allows you to price different data segments more strategically.

You can also use compliance tools to monitor how partners use your data and ensure they adhere to licensing agreements. By showcasing your data’s enhanced value and tracking its use, you’ll be better positioned to negotiate favorable terms and enforce compliance.

The key is to treat your data as a valuable resource and prepare thoroughly before approaching potential licensees. This groundwork not only boosts your earning potential but also strengthens your ability to secure better partnerships.

Conclusion: Your Data Is an Asset - Treat It Like One

Steve Huffman turned proprietary content into a $130 million annual revenue stream, making up 10% of Reddit's total revenue [6]. This highlights what’s possible when you treat your data as a real asset.

For SaaS founders, the takeaway is straightforward: if your platform generates authentic, real-time content valuable for AI training, there’s untapped revenue waiting for you. Reddit’s groundbreaking deals illustrate this potential. Beyond financial gains, safeguarding your data enhances privacy, controls how partners use it, and prevents unauthorized scraping [1][3].

The landscape has changed. AI-powered search engines now answer up to 75% of user queries without directing traffic back to the original source [2]. That means fewer clicks and less revenue - unless you secure licensing agreements that ensure fair compensation for the value AI companies derive. As Huffman aptly stated:

"We don't need to give all of that value to some of the largest companies in the world for free." [4]

This shift underscores the urgency for SaaS founders to rethink how they manage and monetize their data.

The time to act is now. Begin by auditing your data, implementing tiered API access, and partnering with reliable organizations. Those who define clear terms, enforce protections, and monetize effectively will create steady revenue streams. Meanwhile, others risk inadvertently financing their competitors’ growth.

Your data is an asset - secure it, price it, and profit.

FAQs

What data can my SaaS legally license without violating user privacy?

Your SaaS can use publicly available data, like Reddit content, for commercial purposes - but only if you obtain the appropriate licensing agreements. Reddit now mandates contracts for accessing and utilizing its data, particularly for AI training or other commercial applications. Make sure you adhere to user privacy laws and Reddit's content policies when working with licensed data.

How do I stop AI bots from scraping my site if they ignore robots.txt?

To keep AI bots that bypass robots.txt from accessing your site, you can use strategies like rate limiting, IP blocking, and CAPTCHAs.

  • Rate limiting: Restrict the number of requests allowed per IP address over a specific time period. This helps prevent bots from overloading your server.
  • IP blocking: Identify and block IP addresses associated with known scraping bots or suspicious activity.
  • CAPTCHAs: Use CAPTCHAs to verify users on critical pages, ensuring that only humans can proceed.

Additionally, server-side monitoring can help detect unusual patterns, such as rapid or excessive requests, allowing you to respond quickly. Combining these technical measures with legal actions can further protect your data from unauthorized scraping.

How should I price a paid API and data licensing tiers for AI companies?

When setting prices for paid APIs or data licensing tiers, it's essential to focus on the value your product delivers, its impact on the broader ecosystem, and the risks tied to mispricing. Striking the right balance ensures both profitability and long-term sustainability.

Flexible pricing models tend to work best. These could include:

  • Usage-based pricing: Charging based on API call volume or data consumption.
  • Subscription plans: Offering predictable, recurring revenue streams.
  • Outcome-based pricing: Aligning costs with the results or benefits delivered to the customer.

Tiered pricing is another effective approach. For instance, you can create tiers based on factors like:

  • API call volume: Higher usage tiers for enterprise clients and lower tiers for startups or smaller developers.
  • Data sensitivity: Premium pricing for access to highly sensitive or specialized datasets.
  • Use case: Adjusting prices based on whether the API is used for simple applications or more complex, high-value integrations.

This structure allows you to offer premium options for enterprise clients while still providing affordable entry points for smaller developers, ensuring a healthy balance between revenue generation and ecosystem growth.

Related Blog Posts

Read more

Built on Unicorn Platform