How we built the infrastructure for a large-scale social media field experiment during the 2024 US election

What we built

During the 2024 US election, we ran one of the first large-scale field experiments where academic researchers, not a platform company, controlled the ranking algorithm on a live social network. For eight weeks around the 2024 US presidential election, we built and tested different ways of organizing people’s social media feeds on Bluesky and studied how that changed what they saw and how they felt about politics. It turns out, a lot of how you see the world is narrowly defined by what social media companies choose (and don’t choose) to put in your feeds! This work was published in Nature, one of the most prestigious publications in science.

What we built was the machinery to make all this research possible:

Pipelines that turned Bluesky’s public event stream into a queryable post corpus.
Classifiers that labeled content for toxicity, politics, and constructiveness.
Recommendation algorithms that created curated feeds based on different interventions for how to improve the social media experience.
An API layer that exposed those feeds to any user on Bluesky.

Users experienced custom feeds in the normal Bluesky app. Researchers experienced something rarer: logged exposure to algorithmically curated political content during a national election, with experimental control over every detail of what was shown to users.

Why did we need to build this in the first place?

Social media algorithms shape what people see, which shapes what they believe about others and about politics. Our team had run experiments showing those effects, but mostly in labs or on platforms where we could not touch the ranking layer itself.

Facebook and Twitter controlled their algorithms. They had little incentive to let academics randomly assign users to different ranking policies during a national election. So if we wanted to test causal effects of algorithm design, we needed a platform where we could actually implement and serve those policies to real users.

That requirement cascades quickly. Once you commit to live intervention, you cannot stop at “train a model.” You need:

A continuous view of platform activity
A way to label and store posts at scale
A ranking engine that can encode experimental conditions
A serving layer users can actually access

This sort of “end-to-end pipeline for academic field experiments” rarely exists outside industry; such a setup is generally reserved for large tech companies. Therefore, we had to build it from scratch.

How did Bluesky make this possible?

Bluesky was the rare platform where three things were true at once:

Outsiders could host custom feeds.
Live platform-scale data was publicly available.
Those feeds could be exposed to real users in the native app.

That combination is what made a field experiment possible outside Facebook and Twitter. Twitter open-sourced its ranking algorithm in 2023, but it did not open its data stream or let you serve alternative feeds to its users. Bluesky did both. This meant we could populate feeds with live posts from the actual platform and serve them to real users.

The election timing also amplified everything. Elon’s Twitter changes pushed a wave of users onto Bluesky just as political content was becoming the dominant form of engagement on the platform. That made the study substantively interesting in a way a quieter period would not have: we were not testing ranking algorithms on a static corpus, but on a live network during a national election, as the user base and content mix were both shifting rapidly. Doing that credibly still required building the full data and ML stack (ingestion, preprocessing, classification, feed generation, serving), but Bluesky gave us something closed platforms never would: the ability to run those algorithms on real users, in the native app, during the political moment we actually cared about.

Architecture at a glance

flowchart LR
  B[Bluesky firehose and APIs] --> S[Sync pipelines]
  S --> P[Preprocessing]
  P --> C[Fan-out to integrations]
  C --> M[ML classifiers]
  C --> SP[Superposter calculation]
  C --> V[Offline FAISS embeddings]
  M --> U[Unify integrations]
  SP --> U
  V --> U
  U --> R[Generate feeds]
  R --> A[Feed API]
  A --> U2[Bluesky users]

At a high level, the system is organized around the following components:

Data ingestion: connect to the real-time Bluesky data stream and capture live records.
Data processing: filter invalid records and enqueue valid ones for downstream services.
Integrations fan-out: run enrichment integrations (ML classifiers, superposter calculation, vector embedding generation) and label the latest batches of posts.
Consolidate enriched posts: aggregate labels from enrichment integrations into a single representation per post.
Feed generation: given the pool of available posts and the algorithm implementation, generate feeds for all study users and persist them to S3.
Serving: expose feeds through a FastAPI backend registered with Bluesky. Study participants subscribe in the Bluesky app; the client pings the backend, fetches the latest feed, and renders it.
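
To make the serving step concrete, here is a minimal sketch of the response a custom feed backend hands back to the Bluesky client: a list of post URIs plus an optional cursor for the next page. The function name, the offset-style cursor, and the example URIs are my own illustrative assumptions, not the study’s actual code.

```python
from typing import Optional


def build_feed_skeleton(post_uris: list[str], limit: int = 50,
                        cursor: Optional[str] = None) -> dict:
    """Return one page of a precomputed feed in feed-skeleton shape.

    Assumption: the cursor is simply a stringified offset into the
    precomputed list of post URIs for this user.
    """
    start = int(cursor) if cursor else 0
    page = post_uris[start:start + limit]
    response = {"feed": [{"post": uri} for uri in page]}
    next_offset = start + len(page)
    if next_offset < len(post_uris):
        # Only emit a cursor while there are more posts to page through.
        response["cursor"] = str(next_offset)
    return response
```

The client only needs URIs from the feed backend; the post content itself is hydrated by Bluesky, which is why the feeds persisted to S3 can stay small.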

I was the sole engineer on the project. In the sections below, I use we when describing what the study required or what users experienced, and I when describing implementation choices, experiments, and operational tradeoffs that were mine alone.

Deeper dive into the architecture

Data ingestion

For the data ingestion module, I connected to the Bluesky real-time data stream (firehose). Bluesky publishes events that happen on the platform (“user X likes post Y”, “user Z made a new post”) in real time. In an unprecedented move, Bluesky chose to make this data publicly available.

What alternatives did I consider?

I considered, but eventually moved away from, a few alternatives.

Bluesky API

Bluesky maintains a public API, which I used for small-scale tasks (e.g., getting the current profile for a user). Its endpoints return much more fully hydrated records (e.g., the GET /posts endpoint has post text, engagement data, etc.), but the throughput was too low for our scale. I forget the exact rate limits, but I think the total data I could have collected was on the order of low thousands of records per day at most, when the pipeline was easily collecting 1,000× that. In addition, with manual polling I would have had no way of knowing when to fetch data for a given user. Most users weren’t particularly active, so I could likely have pinged the API for their data every few days, but a few power users posted and engaged with content every single day and would have needed much more frequent polling. Finally, the study required like data, and likes aren’t readily available through the API.

Web scraping

I tried a few experiments with Selenium to develop web scrapers that crawled a user’s feed to get their posts and liked posts. This did not work particularly well, as it both failed to capture content from power users and also was prone to false negatives. I retried this when AI computer agents were first introduced. Those experiments yielded perhaps 30 posts in 30 minutes of waiting for the AI agent to scroll the page. They’ve very likely gotten much better, but I anticipate that doing it at scale would’ve gotten the AI agent session throttled by Bluesky (plus racked up quite a tab for myself).

PDS backfills

The Bluesky data stream is a live firehose of real-time data. However, if one isn’t able to gather data in real-time, Bluesky provides ways to traverse the PDSes which store the data publicly.

I used this approach to backfill records that I did not have. For example, the pipeline tracked like records from the real-time firehose. However, these like records were very sparse, only having the like ID and the user who liked the post. To analyze anything about the posts that were liked, I had to figure out how to backfill records by traversing the PDSes.

The way to do this wasn’t documented at all online when I was building it out, and I had to piece it together from checking the Discord chats and asking questions to the Bluesky devs. Frankly I’m unsure if I truly understand how it works, as I still don’t 100% understand Bluesky’s underlying architecture, but I implemented a solution to backfill records.

This approach worked OK but had low throughput and often disconnected. It did not capture data at the scale the study needed.

Bluesky Jetstream

Bluesky recently introduced Jetstream, a streamlined way to backfill records. It did not exist during the study, though we’ve experimented with it in the lab for other use cases. It is far more practical than the manual backfill I was doing before and supports a much higher QPS. Backfills are no longer as grueling, and I would likely couple Jetstream with the firehose for a hybrid ingestion architecture with both real-time ingestion and batch backfills.

Tradeoffs from using the data stream

The choice to use the data stream came with a few tradeoffs.

App behavior coupled to spikes/crashes from the Bluesky side

The expected load on the system was closely coupled to the load from the data stream. When there was heavy traffic on Bluesky, the pipeline experienced heavy traffic as well. When Bluesky crashed or throttled, ingestion often stalled with it.

I managed this in the following ways:

Decouple data persistence from live ingestion: during times of high throughput, the ingestion job was rate-limited by write throughput. I resolved this by splitting ingestion into 2 parallel jobs: a job that writes the data stream in-memory to a temp/ output as .json files and a job that takes those records and writes them to permanent .parquet storage. This also had the additional benefit of reducing the number of .parquet files I had to write (thereby better leveraging parquet’s columnar format).
Decouple downstream services from data ingestion: ingested records were both written to permanent parquet storage as well as enqueued for downstream services. Rather than triggering downstream services after the data ingestion pipeline or creating an event-driven architecture, I chose to instead explicitly decouple the runtimes for the data ingestion and data processing services. I formalized this through the use of separate orchestration DAGs for both layers, meaning that the data processing and downstream jobs could just run on whatever data, if any, were enqueued from the upstream data ingestion layer.
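
The first decoupling above can be sketched roughly as follows. This is a simplified illustration, not the study’s code: the function names and file layout are hypothetical, and the permanent writer is passed in as a callable (in the real pipeline it would write a .parquet file) so the sketch runs with the standard library alone.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable


def spool_records(records: Iterable[dict], spool_dir: Path, chunk_id: int) -> Path:
    """Job 1: dump a chunk of stream records to disk as quickly as possible."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    tmp = spool_dir / f"chunk-{chunk_id:06d}.json.tmp"
    final = spool_dir / f"chunk-{chunk_id:06d}.json"
    with tmp.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    # Atomic rename, so the compaction job never sees a half-written chunk.
    os.replace(tmp, final)
    return final


def compact_spool(spool_dir: Path, write_batch: Callable[[list[dict]], None]) -> int:
    """Job 2: merge all finished chunks into one batch and persist it once."""
    chunks = sorted(spool_dir.glob("chunk-*.json"))
    batch: list[dict] = []
    for chunk in chunks:
        with chunk.open(encoding="utf-8") as fh:
            batch.extend(json.loads(line) for line in fh if line.strip())
    if batch:
        write_batch(batch)  # e.g. one large Parquet write instead of many small ones
    for chunk in chunks:
        chunk.unlink()
    return len(batch)
```

Because the two jobs only share a directory, the stream reader is never blocked on the slower permanent write, and each compaction produces one large file rather than many small ones.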

Have to know upfront what data you want

Using the data stream requires that you know upfront which records you want. If I had to collect a new data type, I would have to restart the connection to the data stream and manually add in a filter for that data type, and then backfill manually if the study needed records from before that filter went live.

Inevitable downtime periods

I needed long-lived persistent servers, but this wasn’t available on HPC. I ran the longest jobs I could, 7-day persistent jobs, and then manually restarted them. I set up the jobs to email me whenever they finished, and I also created a small cronjob that polled the job scheduler to see if the job was finished or not, and automatically submitted a new job once one ended.

Inevitably, there would be a few minutes of downtime between when a data ingestion job finished and when the job scheduler enqueued and ran the next one, so the pipeline may have missed some activity during those minutes. This likely had a negligible impact, as (1) it amounted to under 10 minutes per week and (2) I scheduled restarts for the afternoon in Asia (I was traveling abroad at the time), which mapped to the early hours of the morning in the US. I doubted anyone would have meaningful social media activity at 3am (and if they do, they need to sleep!). Even with this downtime, the pipeline still had three nines of availability.
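
As a quick check on the “three nines” figure, the arithmetic can be written out. The only input is the upper bound of 10 minutes of downtime per week stated above; the function name is mine.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def availability(downtime_minutes: float,
                 period_minutes: float = MINUTES_PER_WEEK) -> float:
    """Fraction of the period during which ingestion was running."""
    return 1.0 - downtime_minutes / period_minutes


print(f"{availability(10):.4%}")  # prints 99.9008%
```

Ten minutes per week sits almost exactly at the 99.9% threshold (which allows about 10.08 minutes), so the claim holds as long as each restart gap stays under that bound.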

Preprocessing

I developed a preprocessing service that did basic cleaning (e.g., standardizing text) and filtering (e.g., removing non-English text and removing posts from accounts on the study’s excludelist).

Simple, straightforward service. The only complications were around estimating how many records I could comfortably hold in memory. I had a memory-inefficient approach at first (turns out Pandas indexes are expensive!) and had to be careful about duplicating dataframes in memory, especially as I began filtering multiple millions of posts at a time. I more aggressively implemented inplace transformations and prioritized using the vectorized functionalities offered by Pandas.

Given what I know now, I likely would’ve explored some of the following options:

Explore predicate pushdown: rather than doing the filters in-memory, I would’ve checked to see if I could’ve pushed some of the filters to the DuckDB + SQL layer. For example, the study’s basic excludelist filtering could’ve been done as a SQL query.
Using Polars instead of pandas: I’ve seen some convincing online content and I’ve done some light experiments myself demonstrating Polars improvements over Pandas. I’m unsure how well it would do for this specific task, but it would be something to explore.
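
To illustrate the first option, here is a rough sketch of expressing the excludelist filter as a SQL anti-join instead of an in-memory dataframe filter. Two loud assumptions: the table and column names are hypothetical, and sqlite3 stands in for DuckDB purely so the example runs with the standard library. In the real setup the posts table would be the stored data queried by DuckDB, so excluded rows would never be loaded into Python at all.

```python
import sqlite3


def filter_excluded(posts: list[tuple[str, str, str]],
                    excluded_authors: list[str]) -> list[tuple[str, str, str]]:
    """Return (uri, author, text) rows whose author is not on the excludelist."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (uri TEXT, author TEXT, text TEXT)")
    conn.execute("CREATE TABLE excludelist (author TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", posts)
    conn.executemany("INSERT OR IGNORE INTO excludelist VALUES (?)",
                     [(author,) for author in excluded_authors])
    # Anti-join: keep only posts with no matching excludelist row.
    rows = conn.execute(
        """
        SELECT p.uri, p.author, p.text
        FROM posts AS p
        LEFT JOIN excludelist AS e ON e.author = p.author
        WHERE e.author IS NULL
        ORDER BY p.uri
        """
    ).fetchall()
    conn.close()
    return rows
```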

Enrichment

ML integrations

For enrichment, the study relied on a variety of ML integrations:

Google’s Perspective API

I used the Perspective API from Google (set to be deprecated end of 2026), which has been a popular tool in the social science community for classifying things like toxicity. For the study, we used it to score posts on a variety of attributes such as constructiveness and moral outrage. I actually helped develop a moral outrage classifier in undergrad, but Google’s Perspective API was more performant. I write more about what I was doing in this blog post. Integrating the API was straightforward: it was free and consistently reliable. I was getting up to 100 QPS (which, at ~360,000 posts/hr, was OK enough for our scale), but if we wanted to scale further, we might have to consider either (1) getting multiple API keys or (2) distilling our own models based on the Perspective API labels and running them side-by-side.
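
Staying under a QPS ceiling like this is mostly a matter of pacing requests on the client side. The sketch below is a generic single-threaded pacer, not the study’s code and not part of any Perspective client library; the class name and the injectable clock are my own, the latter so the behavior can be checked without real sleeping.

```python
import time
from typing import Callable

POSTS_PER_HOUR_AT_100_QPS = 100 * 60 * 60  # 360,000, matching the figure above


class RateLimiter:
    """Space calls at least 1 / max_qps seconds apart (not thread-safe)."""

    def __init__(self, max_qps: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `limiter.wait()` before each API request keeps a worker at or below the configured rate; with multiple API keys, one limiter per key would be the natural extension.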

LLM-based classification

I developed LLM-based classifiers for questions like “does this post have political content?”. I wrote about this in a series of blog posts:

Investigating JSON vs. YAML for prompts.
Exploring a quasi-RAG architecture, which I eventually scrapped.
Experimenting with how many posts can be batched into a single prompt without reducing accuracy.
Initial experiments with LLMs as a political classifier.

These experiments gave me in-depth hands-on exposure with LLM providers, tradeoffs between them, and concepts like structured output and TTFT (time-to-first-token) that had meaningful impacts on pipeline performance. For the study I largely used gpt-4o-mini, which did well enough on a local test set (having a recall >= 0.8, as per these results), though when gpt-5-nano became available I pivoted to that as well.

I did more rigorous experiments later on, as I fleshed out an LLM-based intergroup classifier, introducing Opik for LLM telemetry and more closely investigating if prompt batching had any meaningful impact as compared to threaded concurrency.

Batch size	Current (1 req/post)	Prompt batched (20 conc, 10/post)
1	7.60	2.05
10	12.33	10.86
50	14.63	13.81
100	18.38	18.76
200	19.43	17.90
400	31.21	30.26
600	36.13	37.24
800	46.12	54.95
1,000	53.01	64.46
1,200	71.51	76.70
1,500	93.41	117.11
2,000	124.29	136.62
2,500	141.85	162.76
3,000	155.56	178.98
4,000	238.66	273.86
5,000	260.22	319.32

I was surprised to learn that, at larger volumes, sending 1 post per request and maximizing the number of concurrent requests was actually faster than batching 20 posts into a single request. In the table above, the batched version is faster or roughly comparable up to about 400 posts, but from 600 posts onward the one-post-per-request version consistently finishes sooner. I knew the batched version would be handicapped by token generation time, but I had expected the tail latency (p95) across many more requests to matter more, so that 20 single-post requests would take longer in total than 1 batched request. The experiments suggest otherwise.
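
For readers who want to reproduce the comparison, the two strategies can be sketched as below. The `classify` and `classify_batch` callables are placeholders for whatever function wraps the LLM call, and the default worker count and batch size are assumptions rather than the exact settings used in the experiment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def classify_one_per_request(posts: Sequence[str],
                             classify: Callable[[str], bool],
                             max_workers: int = 20) -> list[bool]:
    """Threaded concurrency: one request per post, many requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, posts))  # map preserves input order


def classify_prompt_batched(posts: Sequence[str],
                            classify_batch: Callable[[Sequence[str]], list[bool]],
                            batch_size: int = 20,
                            max_workers: int = 20) -> list[bool]:
    """Prompt batching: several posts per request, fewer requests overall."""
    batches = [posts[i:i + batch_size] for i in range(0, len(posts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(classify_batch, batches)
        return [label for batch_labels in results for label in batch_labels]
```

Timing both functions over the same posts with `time.perf_counter()` gives a comparison of the same kind as the table, though absolute numbers will depend on the provider, the model, and the rate limits in effect.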

Vector-based embeddings

I also generated vectors for the posts. I did this in an offline batch pipeline and used them in the study’s personalization layer (for example, by upranking posts similar to those a user had previously liked). This was more expensive and time-consuming than other parts of the pipeline, so I also explored a sampling approach, doing this for subsets of posts at a time.
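
A minimal sketch of that upranking idea: average the embeddings of posts a user has liked into a profile vector, then order candidate posts by cosine similarity to it. The function names are hypothetical and this is plain Python for illustration; the study’s pipeline used FAISS for the offline embeddings, and averaging the liked vectors is my simplifying assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def user_profile(liked_vectors: list[list[float]]) -> list[float]:
    """Mean of the embeddings of posts the user previously liked."""
    if not liked_vectors:
        raise ValueError("need at least one liked post to build a profile")
    n = len(liked_vectors)
    return [sum(column) / n for column in zip(*liked_vectors)]


def rank_by_similarity(candidates: dict[str, list[float]],
                       liked_vectors: list[list[float]]) -> list[str]:
    """Candidate post URIs, most similar to the user's likes first."""
    profile = user_profile(liked_vectors)
    return sorted(candidates,
                  key=lambda uri: cosine(candidates[uri], profile),
                  reverse=True)
```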

Generating vector embeddings for this use case did slightly improve feed quality, though a much more thorough implementation would’ve developed two-tower models. I did not have the bandwidth to build that nor the dataset sizes available to do so, but it would be an interesting follow-up project that I’d like to investigate.

Recommendation algorithms

The recommendation algorithm we developed for the paper was one of its key innovations. We created a new algorithm that took a generic engagement-based algorithm as its base and explicitly downranked toxic content and upranked constructive content. The engagement signal was a weighted linear combination of content personalized to that user (using embedding similarity) and content generally engaging on the platform (as measured by, for example, likes).
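
The shape of such a scoring function can be sketched as follows. Everything numeric here is a placeholder: the weights, the additive form of the toxicity penalty and constructiveness boost, and the field names are illustrative assumptions, not the values or formula from the paper.

```python
from dataclasses import dataclass


@dataclass
class Post:
    uri: str
    similarity: float        # personalization signal (embedding similarity), 0..1
    popularity: float        # platform-wide engagement, normalized to 0..1
    toxicity: float          # classifier probability, 0..1
    constructiveness: float  # classifier probability, 0..1


def engagement(post: Post, w_personal: float = 0.5) -> float:
    """Weighted linear combination of personalized and general engagement."""
    return w_personal * post.similarity + (1.0 - w_personal) * post.popularity


def score(post: Post, w_personal: float = 0.5,
          toxic_penalty: float = 1.0, constructive_boost: float = 1.0) -> float:
    """Engagement base, downranked for toxicity, upranked for constructiveness."""
    return (engagement(post, w_personal)
            - toxic_penalty * post.toxicity
            + constructive_boost * post.constructiveness)


def rank(posts: list[Post]) -> list[str]:
    """Post URIs ordered from highest to lowest score."""
    return [p.uri for p in sorted(posts, key=score, reverse=True)]
```

With the penalty and boost both set to 0 this reduces to the plain engagement ranking, which makes it straightforward to express different experimental conditions as different parameter settings.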

Serving

I developed a FastAPI backend and deployed it in a persistent EC2 instance. In hindsight I wish I had used something like Railway or AWS App Runner, which would’ve saved me the multiple days required to set up networking and permissions. When serving the posts, I had first tried to introduce a Redis cache layer, but the traffic was infrequent enough that maintaining a persistent external cache server didn’t make sense.

Next, I tried a “serverless Redis” solution that I found, but its latency was much too high due to the cold start problem (in fact, the Bluesky client sometimes timed out).

Instead, I went with an in-memory cache, keyed on user ID. Requests would pull from the cache first, and then if there were no cache hits, the API would pull from S3. However, cache misses were pretty rare. Periodically, the server would check S3 for new feeds; I tried to align that schedule with feed-generation cron jobs, though the two weren’t strictly coupled — so there were times when a check ran while generation was still in progress and found nothing new. That lossiness was acceptable because the study didn’t need real-time feed updates. I increased the polling frequency to compensate, and that was a good-enough fix.
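
The cache-then-S3 lookup described above can be sketched like this. The class and method names are hypothetical, and the S3 read is abstracted as a `load_feed` callable (which in practice would wrap a boto3 `get_object` call) so the sketch has no external dependencies.

```python
from typing import Callable, Iterable, Optional


class FeedCache:
    """In-memory feed cache keyed on user ID, with a slower fallback loader."""

    def __init__(self, load_feed: Callable[[str], Optional[list[str]]]) -> None:
        self._load_feed = load_feed  # e.g. reads the user's latest feed from S3
        self._feeds: dict[str, list[str]] = {}

    def get(self, user_id: str) -> list[str]:
        """Serve from memory; on a miss, fall back to the loader."""
        feed = self._feeds.get(user_id)
        if feed is None:
            feed = self._load_feed(user_id)
            if feed is None:
                return []  # no feed generated yet; don't cache the miss
            self._feeds[user_id] = feed
        return feed

    def refresh(self, user_ids: Iterable[str]) -> int:
        """Periodic poll: replace cached feeds that changed. Returns the count."""
        updated = 0
        for user_id in user_ids:
            latest = self._load_feed(user_id)
            if latest is not None and latest != self._feeds.get(user_id):
                self._feeds[user_id] = latest
                updated += 1
        return updated
```

If a refresh runs while feed generation is still in progress, it simply finds nothing new and the previous feed keeps being served until the next poll, which is the lossiness described above.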

Observability and DevOps

Observability and DevOps were late additions on my part.

I had a naive logging solution where I would append logs to a text file and then grep or tail that file for inspection. It worked OK enough, though it was by no means proactive or intuitive (plus it led to me anxiously logging into the HPC to see if anything broke). At minimum I could’ve codified this in a formalized runbook as well, but I was just doing it solo so I was going off memory.

I did invest in extensive unit testing along the way, which was greatly helped by AI agents. But even 100% unit test coverage doesn’t mean you have no bugs; it only tells you that you’ve covered the errors you already knew to look for. I could’ve invested more deeply in load testing, integration testing, and smoke testing. I adopted these along the way for very specific tasks (e.g., doing extensive load testing of the live backend API).

I had somewhat of a split between dev (my local computer) and prod (the on-prem cluster). It would’ve been great to have a continuous deployment setup where I could’ve deployed to prod directly. I know that HPCs support Singularity, their equivalent of Docker, but I wasn’t able to figure out how to set that up nor could I figure out a continuous deployment setup. I also didn’t have access to a persistent server anyways. “Shipping to prod” meant updating the main branch, as I would always keep the HPC version of the code set to main.

What I would’ve done differently

I would’ve made observability a more upfront concern. This would’ve given me more insight into not only bugs, but also load, throughput, and other metrics measuring the liveness of the system. I would’ve introduced OpenTelemetry and pushed those logs to S3. Hosting Grafana or Prometheus would’ve likely been overkill, but I could’ve done a much simpler version. I didn’t need real-time results; a batch cadence would’ve been OK enough, and it would’ve been a low-footprint, cheap solution.

I imagine each record could look something like this:

{"ts": "...", "dag": "ingestion", "run_id": "...", "stage": "parquet_write", "records": 1200000, "duration_s": 340, "status": "ok", "host": "..."}

Whenever a job completes, I’d sync those logs to an S3 bucket. Then I could use Glue to register partitions, query with Athena, and then connect Grafana to Athena to query the data and build dashboards.
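
A small helper along these lines could emit those records. This is a sketch using plain JSON lines from the standard library rather than the OpenTelemetry SDK; the field names follow the example record above, while the function names and the local log path are hypothetical.

```python
import json
import socket
from datetime import datetime, timezone
from typing import Optional


def build_run_record(dag: str, run_id: str, stage: str, records: int,
                     duration_s: float, status: str = "ok",
                     host: Optional[str] = None,
                     now: Optional[datetime] = None) -> dict:
    """Build one structured log record for a finished pipeline stage."""
    timestamp = now or datetime.now(timezone.utc)
    return {
        "ts": timestamp.isoformat(),
        "dag": dag,
        "run_id": run_id,
        "stage": stage,
        "records": records,
        "duration_s": duration_s,
        "status": status,
        "host": host or socket.gethostname(),
    }


def append_record(path: str, record: dict) -> None:
    """Append the record as one JSON line; the file is synced to S3 later."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

One JSON object per line keeps the files directly queryable from Athena once they land in S3, without a separate parsing step.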

Infra and deployment

The app was deployed on a hybrid AWS + on-prem layer. I first investigated and proposed an AWS-only build, but the hundreds of dollars per month that it would cost seemed unnecessary when compared to the plentiful, freely available compute and storage at Northwestern.

Of course, using Northwestern’s HPC meant no always-on jobs and no running managed services like Kafka or Prometheus on-prem. The pipeline runtimes were defined by Prefect orchestration DAGs, which were themselves triggered by cron jobs scheduled as 7-day jobs on the cluster.

I used AWS where the study needed durability or a user-facing endpoint:

Serving: the FastAPI backend ran on EC2.
Blob storage: S3 held generated feeds, backups, and archival data.
Analysis: large-scale queries ran more effectively in Athena than in ad hoc Python scripts on the cluster.
DynamoDB: some user and job metadata had to be available to both on-prem pipelines and the EC2 API, so I stored it there.

Conclusion

We built the infrastructure for a field experiment testing how recommendation algorithms shape political exposure during the 2024 US election. The app existed to answer a question closed platforms make nearly impossible to ask causally: if you change what the ranking algorithm optimizes for, what happens to what people actually see? Bluesky let us implement and serve those ranking policies inside the native client, on live platform content, to real users — something Facebook and Twitter have never offered to outside researchers at this scale.

We built the infrastructure to support such an ambitious research direction, and in hindsight that infrastructure was the core part of the project: without ingestion at scale, enrichment pipelines, feed generation, and a serving layer users could actually reach, the experiment would have stayed theoretical. What emerged was not a miniature Twitter, but a research-grade system shaped by HPC constraints, election-season load, and the need to keep real users on working feeds for months. It was messy in places and good enough everywhere that mattered, which, for a live field experiment during a national election, is exactly the bar that counts.


Mark Torres