Scaling Reddit’s Media Metadata: Building a Unified, High-Performance Store
Discover the architecture and optimizations that power Reddit’s ultra-fast metadata store, handling massive scale with millisecond-level latency
Reddit hosts billions of posts, most of them images, videos, GIFs, and other embedded media, and faces the immense challenge of managing and analyzing this data effectively. Every piece of media needs to be searchable, with context like thumbnails, playback URLs, bitrates, and more. However, Reddit's metadata was initially scattered across multiple systems, each with its own formats and query patterns, leading to inefficiencies and inconsistencies.
The Core Problem: Fragmented Metadata
As Reddit's user base and content volume grew, the platform struggled with a decentralized metadata setup: metadata for the various media objects was spread across multiple databases, each with its own schema and query patterns. This fragmentation led to significant issues:
- Inconsistent Metadata: Different databases storing metadata in varied formats resulted in inconsistencies.
- Inefficient Queries: Varying query patterns across databases made data retrieval slow and complex.
- Scalability Issues: The distributed nature of metadata storage hindered Reddit's ability to scale and manage increasing data loads effectively.
To address these challenges, Reddit needed a unified metadata store—a single source of truth for all media-related metadata.
The Solution: A Unified Metadata Store
To solve the metadata fragmentation problem, Reddit architected a unified metadata store built on AWS Aurora Postgres. This store became the central repository for all media-related metadata, ensuring consistency and efficiency across the platform.
Architecture Overview:
- Raw Data → API (Service Layer) → PgBouncer → AWS Aurora Postgres
- API (Service Layer): The interface through which clients interact with the metadata store.
- PgBouncer: A connection pooler that manages connections to Aurora Postgres, reducing connection overhead on the database.
- AWS Aurora Postgres: The unified metadata store that holds all media metadata in a consistent format.
This architecture ensured that all media metadata was centralized, providing a single point of access for all queries and updates. The harder problem was getting the existing data out of the many different source databases and into the new store.
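To make that read/write path concrete, here is a minimal sketch of what such a service layer could look like in Python with psycopg2, assuming a hypothetical media_metadata table with a JSONB column and a PgBouncer endpoint in front of Aurora; the names, schema, and connection details are illustrative assumptions, not Reddit's actual code:

```python
import json
import os

import psycopg2  # pip install psycopg2-binary

# The service connects to PgBouncer (which pools connections to Aurora
# Postgres), not to the database directly. Host, port, and credentials
# here are placeholders.
conn = psycopg2.connect(
    host="pgbouncer.internal",
    port=6432,
    dbname="media",
    user="media_svc",
    password=os.environ.get("DB_PASSWORD", ""),
)

def put_metadata(post_id: int, metadata: dict) -> None:
    """Upsert the denormalized JSON blob for one media post
    (assumes a unique/primary key on post_id)."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO media_metadata (post_id, metadata)
            VALUES (%s, %s)
            ON CONFLICT (post_id) DO UPDATE SET metadata = EXCLUDED.metadata
            """,
            (post_id, json.dumps(metadata)),
        )

def get_metadata(post_id: int) -> dict | None:
    """Key-value style lookup: one primary-key read returns the whole blob."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT metadata FROM media_metadata WHERE post_id = %s",
            (post_id,),
        )
        row = cur.fetchone()
        return row[0] if row else None
```

Storing the whole blob against post_id keeps a read down to a single primary-key lookup, which is what makes the key-value access pattern described later so fast.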
Data Migration: A Herculean Task
One of the biggest challenges Reddit faced was migrating several terabytes of metadata from multiple databases into the new unified store, all while keeping the platform operational and serving roughly 100,000 requests per second.
Data Migration Blueprint:
1. Enable Dual Writes: Clients began writing new data to both the source databases and the new metadata store (see the sketch after this list). This ensured that any new data was captured in the unified store while older data remained in the original databases.
2. Backfill Data: Older data from the source databases was backfilled into the new unified store. During this process, Reddit implemented checks to compare the old and new data and fix any inconsistencies.
3. Enable Dual Reads: Clients started reading from both the source databases and the new metadata store to ensure that the new system was functioning correctly.
4. Ramp Up Traffic: Slowly, Reddit began directing more traffic to the new metadata store, eventually phasing out the old databases.
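The dual-write step is conceptually simple, but the failure mode matters: the legacy database stays the source of truth, and a failed write to the new store is logged rather than failing the user's request. The helper below is a hypothetical illustration of that idea, not Reddit's code:

```python
import logging

logger = logging.getLogger("dual_write")

def dual_write(post_id: int, metadata: dict, legacy_db, metadata_store) -> None:
    # The legacy database remains the source of truth during migration,
    # so its write must succeed for the request to succeed.
    legacy_db.write(post_id, metadata)
    try:
        # Best-effort write to the new unified store. A failure here must
        # not fail the user request; it leaves a gap that the CDC/Kafka
        # pipeline (described below) later detects and reconciles.
        metadata_store.write(post_id, metadata)
    except Exception:
        logger.exception("unified-store write failed for post %s", post_id)
```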
Addressing the Key Challenges
Reddit faced two significant challenges during this migration:
1. Write Inconsistencies: There was a risk that writes to the new metadata store would succeed while writes to the source databases failed, leading to data discrepancies.
2. Data Overwrite Risk: During the backfill process, there was a risk that older data from the source databases could overwrite newer data in the metadata store.
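A common way to guard against the second risk is to make backfill writes conditional on a version or timestamp, so a stale row from the source database can never replace a newer row written by the dual-write path. A sketch under that assumption (the updated_at column and table name are hypothetical):

```python
import json

# Conditional upsert: the ON CONFLICT ... WHERE clause turns a stale
# backfill write into a no-op instead of an overwrite of newer data.
BACKFILL_UPSERT = """
INSERT INTO media_metadata (post_id, metadata, updated_at)
VALUES (%s, %s, %s)
ON CONFLICT (post_id) DO UPDATE
SET metadata = EXCLUDED.metadata,
    updated_at = EXCLUDED.updated_at
WHERE media_metadata.updated_at <= EXCLUDED.updated_at
"""

def backfill_row(cur, post_id, metadata, updated_at):
    cur.execute(BACKFILL_UPSERT, (post_id, json.dumps(metadata), updated_at))
```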
The Kafka Stream Solution

[Diagram: Reddit system design for the Kafka-based migration]
To mitigate these risks, Reddit implemented a Kafka stream as part of its migration strategy. Here's how it worked:
- Change Data Capture (CDC): As clients performed dual writes, CDC tracked changes in the source databases and ingested them into Kafka.
- Consumers and Validation: Consumers picked up events from Kafka, performed validation checks to ensure consistency, and then ingested the data into the new metadata store. Any inconsistencies were logged in a separate table for further investigation.
This approach allowed Reddit to maintain data integrity throughout the migration process, ensuring that the new metadata store was accurate and consistent.
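A rough sketch of what such a validation consumer could look like, assuming a CDC topic, a simple event schema, and helper objects for the unified store and the inconsistency table (all of these names are hypothetical, not Reddit's actual code):

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "media-metadata-cdc",                      # hypothetical CDC topic
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="metadata-migration-validator",
)

def validate_and_ingest(event, metadata_store, inconsistency_log):
    post_id = event["post_id"]
    source_row = event["after"]                # row state captured by CDC
    stored_row = metadata_store.get(post_id)   # current unified-store state
    if stored_row is not None and stored_row != source_row:
        # Mismatch between the source database and the unified store:
        # record it for investigation instead of silently overwriting.
        inconsistency_log.record(post_id, source_row, stored_row)
    else:
        metadata_store.put(post_id, source_row)

# for message in consumer:
#     validate_and_ingest(message.value, metadata_store, inconsistency_log)
```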
Optimizing for High Performance
With billions of media posts to manage, Reddit optimized its metadata store for high performance:
- Denormalized JSON Storage: Although AWS Aurora Postgres is a relational database, Reddit chose to store metadata in denormalized JSON fields, effectively using it as a NoSQL key-value store. This approach allowed them to achieve high read performance.
- Partitioning with pg_partman: To manage the large volumes of data, Reddit partitioned the metadata store using pg_partman, an extension that creates and manages partitions based on time or number of entries. For instance, they partitioned data by post_id, creating a new partition for every 90 million IDs.
- Scheduled Maintenance: Using pg_cron, Reddit scheduled regular maintenance jobs to create new partitions and drop old ones per their retention policy (a setup sketch follows this list).
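Putting the last two bullets together, a one-time setup might look roughly like the following; the schema, the pg_partman arguments (which differ across versions), and the cron schedule are assumptions for illustration, not Reddit's actual configuration:

```python
import psycopg2  # pip install psycopg2-binary

# One-time setup statements, run by a migration script with sufficient
# privileges. pg_cron additionally has to be preloaded via
# shared_preload_libraries, which is a separate Postgres config change.
SETUP_SQL = [
    "CREATE SCHEMA IF NOT EXISTS partman",
    "CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman",
    "CREATE EXTENSION IF NOT EXISTS pg_cron",
    # Parent table: one denormalized JSONB blob per post, range-partitioned
    # on post_id (schema is illustrative).
    """
    CREATE TABLE IF NOT EXISTS media_metadata (
        post_id    BIGINT      NOT NULL,
        metadata   JSONB       NOT NULL,
        updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        PRIMARY KEY (post_id)
    ) PARTITION BY RANGE (post_id)
    """,
    # pg_partman manages a new child partition for every 90 million IDs.
    # (Named arguments shown for pg_partman 4.x; newer versions differ.)
    """
    SELECT partman.create_parent(
        p_parent_table := 'public.media_metadata',
        p_control      := 'post_id',
        p_type         := 'native',
        p_interval     := '90000000'
    )
    """,
    # Nightly maintenance via pg_cron: pre-create upcoming partitions and
    # drop expired ones according to the configured retention policy.
    "SELECT cron.schedule('0 3 * * *', 'SELECT partman.run_maintenance()')",
]

def setup_partitioning(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for statement in SETUP_SQL:
            cur.execute(statement)
```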
Performance Metrics
Reddit’s optimized metadata store achieved impressive performance metrics:
- p50 (50th percentile latency): 2.6 ms
- p90 (90th percentile latency): 4.7 ms
- p99 (99th percentile latency): 17 ms
These numbers were achieved without using a read-through cache, demonstrating the efficiency of their architecture.
Conclusion
Reddit’s journey to a unified, high-performance metadata store is a testament to the power of thoughtful architecture and careful planning. By consolidating fragmented systems into a single, optimized store, Reddit not only solved its immediate scalability challenges but also laid the groundwork for future innovation. The combination of a robust migration strategy, real-time data validation with Kafka, and cutting-edge optimization techniques has positioned Reddit to continue growing and evolving its platform, all while maintaining the lightning-fast performance that its users expect.
Topics for upcoming Newsletter:
How the Hotstar Application Scaled to 25 Million Concurrent Users….
Stay tuned.
"Enjoyed the insights? If you're looking for someone with a solid foundation in data science and DevOps, who can tackle complex challenges like building scalable architectures and optimizing performance, let's connect! I'm ready to bring the same level of expertise to your team. Reach out and let’s discuss how I can contribute to your next big project."