We Were Wrong About DOC vs PDF: How Sample Size Fooled Us
Correction Notice
In Part 3 of this series, we reported that DOC files (437 avg downloads) outperform PDF files (383 avg downloads) on TPT. We recommended offering DOC formats. That finding was a confound. At 23,000+ products, we discovered the DOC "advantage" was entirely explained by the age of DOC files, not their format. This post explains what went wrong and what the data actually says.
This post is a correction. Our original analysis, run on 6,000 products, reported that DOC files outperform PDFs by 14%. At 23,000+ products, that lead turned out to be a confound created by the age of DOC files and the small size of the DOC sample. Here is the full story.
What We Originally Reported
In Part 3, working from a smaller dataset, we wrote:
Our Original Claim (Part 3)
"DOC files average 436 downloads vs. PDF at 383 — teachers want editable content. Offer both formats."
The numbers were accurate for that sample. DOC files did average more downloads than PDFs. The conclusion — that DOC format drives more downloads — was wrong. We were comparing apples to oranges and did not realize it until the dataset grew large enough to reveal the confound.
The Confound: Age
When we expanded to 23,000+ products, we ran a simple check that should have been in the original analysis: what is the average posting year for each file type?
| File Type | Count | Avg Posting Year | Avg Downloads |
|---|---|---|---|
| DOC/DOCX | 225 | 2014.7 | 437 |
| PDF | 13,513 | 2021.4 | 383 |
The average DOC file was posted in 2014.7. The average PDF was posted in 2021.4. That is a 6.7-year gap.
As we showed in Part 4, products compound over time. A product from 2014 has had over a decade to accumulate downloads. A product from 2021 has had roughly 5 years. DOC files were not outperforming PDFs because teachers prefer DOC. They appeared to outperform because they were older and had more time to accumulate downloads.
The Statistical Trap
We compared 225 DOC files (avg age: 12 years) against 13,513 PDF files (avg age: 5 years) and concluded DOC was better. This is like comparing retirement savings of 60-year-olds to 30-year-olds and concluding that older people are better at saving. The variable is time, not format.
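A rough way to see how much of the gap is just time is to divide each average by the approximate average age quoted above. This is a back-of-envelope sketch, not the analysis we ran: dividing one average by another is not the same as averaging per-product rates, and the ages (12 and 5 years) are the rounded figures from this section.

```python
# Back-of-envelope: downloads per year on the platform, using the rounded
# figures quoted in this post. Dividing averages like this only approximates
# a true per-product rate.
formats = {
    "DOC/DOCX": {"avg_downloads": 437, "avg_age_years": 12},
    "PDF": {"avg_downloads": 383, "avg_age_years": 5},
}


def downloads_per_year(stats):
    """Average downloads divided by average age in years."""
    return stats["avg_downloads"] / stats["avg_age_years"]


rates = {name: downloads_per_year(stats) for name, stats in formats.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} downloads per year")
```

On this crude measure the ordering flips: roughly 36 downloads per year for DOC against roughly 77 for PDF.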
Controlling for Seller Size
To test whether DOC genuinely outperforms PDF, we controlled for seller size — the strongest predictor of downloads (Part 2). If DOC format itself drives downloads, the advantage should persist within each seller tier.
| Seller Size | DOC Avg DL | DOC Count | PDF Avg DL | PDF Count |
|---|---|---|---|---|
| 0–99 followers | 296 | ~80 | 96 | ~5,400 |
| 100–999 | 911 | 49 | 271 | ~3,200 |
| 1K–10K | 45 | 3 | 924 | ~2,800 |
| 10K+ | 198 | 1 | 2,788 | ~2,100 |
Look at those DOC sample sizes. Three DOC files in the 1K–10K tier. One DOC file in the 10K+ tier. These are not data points — they are anecdotes.
In the two tiers with enough DOC files to be somewhat meaningful (0–99 and 100–999 followers), DOC does appear to outperform PDF. But even there, the DOC files are dramatically older than the PDFs. The 80 DOC files in the small-seller tier are disproportionately from the 2012–2016 era, when DOC was more common and when those products have had a decade to accumulate downloads.
In the tiers where DOC sample sizes are laughably small (3 files, 1 file), PDF crushes DOC. The 1K–10K tier shows PDF at 924 avg downloads vs. DOC at 45. The 10K+ tier shows PDF at 2,788 vs. DOC at 198. But with sample sizes of 3 and 1, these numbers are meaningless noise.
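The tier-by-tier reading above can be sketched as a comparison that refuses to report a ratio when either group is too small. The numbers are copied from the table; the cutoff of 30 files is an assumption chosen for illustration, not a threshold we used in the analysis, and the "~" counts are treated as exact.

```python
# Tier-level figures copied from the table above.
# Columns: tier, DOC avg downloads, DOC count, PDF avg downloads, PDF count.
tiers = [
    ("0-99 followers", 296, 80, 96, 5400),
    ("100-999", 911, 49, 271, 3200),
    ("1K-10K", 45, 3, 924, 2800),
    ("10K+", 198, 1, 2788, 2100),
]

MIN_N = 30  # assumed minimum group size; illustrative only


def compare_tiers(rows, min_n=MIN_N):
    """Return the DOC/PDF ratio per tier, or None when a group is too small."""
    results = {}
    for tier, doc_avg, doc_n, pdf_avg, pdf_n in rows:
        if min(doc_n, pdf_n) < min_n:
            results[tier] = None
        else:
            results[tier] = doc_avg / pdf_avg
    return results


results = compare_tiers(tiers)
for tier, ratio in results.items():
    if ratio is None:
        print(f"{tier}: sample too small to compare")
    else:
        print(f"{tier}: DOC/PDF = {ratio:.1f}x")
```

Note that the ratios reported for the two small-seller tiers are still not controlled for age, so they do not show a format effect on their own.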
The Real Finding: PDF Dominates at Scale
Once you account for age and seller size, the case for DOC falls apart. PDF is the dominant format on TPT by volume (13,513 vs. 225), and the DOC lead that remains in the small-seller tiers cannot be separated from the age of those files.
This makes practical sense. PDF is the universal format for education resources:
- Every device can open a PDF without special software
- Formatting is preserved exactly as the creator intended
- Schools can distribute PDFs through any LMS or email system
- PDF is the expected default — buyers know what they are getting
DOC files were more common in TPT's early years (2010–2015) when the platform was smaller and sellers were often sharing resources they had already created in Word for their own classrooms. As TPT professionalized, sellers shifted to PDF as the publication format, often creating in Canva, Google Slides, or specialized design tools that export to PDF.
But Wait: "Editable" Is a Real Signal
The DOC finding was wrong. But there is a related finding that holds up: the word "editable" in a product title correlates with higher downloads.
| Title Contains "Editable" | Avg Downloads | Multiplier |
|---|---|---|
| Yes | 690 | 2.3x |
| No | 305 | 1.0x (baseline) |
Products with "editable" in the title average 690 downloads vs. 305 without — a 2.3x difference. That looks significant. But before we make the same mistake twice, let us control for seller size.
| Seller Size | "Editable" Avg DL | Without Avg DL | Multiplier |
|---|---|---|---|
| 0–99 followers | 87 | 91 | 0.96x |
| 100–999 | 295 | 268 | 1.1x |
| 1K–10K | 1,102 | 887 | 1.2x |
| 10K+ | 6,789 | 2,206 | 3.1x |
The pattern is striking. For small sellers (under 1,000 followers), "editable" in the title makes essentially no difference. The multiplier is 0.96x for the smallest sellers and 1.1x for mid-small sellers — both within noise range.
But for large sellers (10K+ followers), "editable" in the title correlates with a 3.1x download multiplier — 6,789 avg downloads vs. 2,206. That is enormous.
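The multipliers in both "editable" tables are plain ratios of the two averages. A minimal sketch that recomputes them from the published figures, so the pooled number and the per-tier numbers can be compared side by side (the variable names are ours, chosen for illustration):

```python
# Averages copied from the two "editable" tables above.
# Each tuple is (avg downloads with "editable", avg downloads without).
overall_avgs = (690, 305)
editable_by_tier = {
    "0-99 followers": (87, 91),
    "100-999": (295, 268),
    "1K-10K": (1102, 887),
    "10K+": (6789, 2206),
}


def multiplier(with_avg, without_avg):
    """Ratio of average downloads with the keyword to without it."""
    return with_avg / without_avg


overall = multiplier(*overall_avgs)
per_tier = {tier: multiplier(*avgs) for tier, avgs in editable_by_tier.items()}

print(f"pooled: {overall:.1f}x")
for tier, ratio in per_tier.items():
    print(f"{tier}: {ratio:.2f}x")
```

The pooled ratio comes out near 2.3x while three of the four tiers sit at or below about 1.2x, which is why the pooled figure alone is misleading.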
Why "Editable" Only Works for Big Sellers
"Editable" is a trust-and-premium signal, not a magic keyword. When a seller with 50,000 followers labels a product "editable," buyers trust that it genuinely is editable, well-formatted, and worth customizing. They have social proof. When a seller with 12 followers labels a product "editable," there is no trust backing the claim. The word amplifies existing authority — it does not create it.
How Small Samples Fooled Us
This is worth unpacking because it is a mistake anyone doing data analysis can make.
At 6,000 products, we had roughly 80–100 DOC files in our sample. Those 80–100 files were disproportionately old (published 2010–2016), disproportionately from sellers who had been on the platform for a decade, and disproportionately had accumulated downloads over many years. Against 4,000+ PDFs that were on average 7 years younger, the DOC average was higher.
We saw the higher average and concluded: format matters, DOC wins. We did not check for the confound because we were not looking for it.
At 23,000 products, the confound became visible because:
- The DOC sample grew only modestly (to ~225), confirming it was a rare format
- The age gap became impossible to ignore (2014.7 vs 2021.4)
- Controlling for seller size did not rescue the format effect: in the tiers with adequate DOC representation, the DOC files were still dramatically older than the PDFs
- The tiny DOC counts in large-seller tiers (3 files, 1 file) revealed the original comparison was meaningless
The lesson: 225 vs. 13,513 is not a fair comparison. When one group is 60x smaller than the other, a handful of outliers can pull its average in any direction. We needed to control for confounding variables (age, seller size) before attributing the difference to format. We did not do that in Part 3. We should have.
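To make the outlier point concrete, here is a small simulation sketch. It draws synthetic "downloads" from a heavy-tailed lognormal distribution and measures how much the average moves from one sample to the next at each group size. The distribution and its parameters are invented for illustration and are not fitted to TPT data; only the two group sizes come from this post.

```python
import random
import statistics


def sample_mean_spread(n, trials=100, seed=0):
    """Standard deviation of the sample mean across repeated draws of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Heavy-tailed synthetic downloads; mu and sigma are assumptions.
        draws = [rng.lognormvariate(5.0, 1.5) for _ in range(n)]
        means.append(statistics.fmean(draws))
    return statistics.stdev(means)


small_spread = sample_mean_spread(225)      # size of the DOC group
large_spread = sample_mean_spread(13_513)   # size of the PDF group

print(f"spread of the mean at n=225:    {small_spread:.1f}")
print(f"spread of the mean at n=13,513: {large_spread:.1f}")
```

Under these assumptions the average of the 225-item group swings several times more between samples than the average of the 13,513-item group, so a gap between the two averages says less than it appears to.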
What the Corrected Data Says for Sellers
- PDF is the right format. It is the standard, the expectation, and the best-performing format at scale. If you can only offer one format, make it PDF.
- Offering "editable" versions is valuable — but only after you have an audience. For sellers under 1,000 followers, "editable" adds no measurable download advantage. Focus on building your follower base first.
- Once you have 10K+ followers, "editable" becomes a powerful differentiator. The 3.1x multiplier at that tier is by far the largest in the table, although it is a correlation that we have controlled only for seller size. At that point, offering editable versions (Google Slides, PowerPoint, or editable PDFs) is worth the extra effort.
- The word "editable" in the title is a trust signal, not a search optimization hack. Adding it to your title when you have no audience and no reviews will not move the needle. It works because it amplifies trust that already exists.
The Meta-Lesson
Small samples lie. Even in the full dataset, DOC is 225 files against 13,513 PDFs, a 60:1 ratio, and in Part 3 we drew a format comparison from an even smaller DOC sample. The small sample overrepresented outliers and hid confounding variables. This is why we kept scraping to 23,000+. The extra data did not just give us more precision; it revealed that our original conclusion was wrong. If we had stopped at 6,000 products, we would never have caught the error.
Corrections to Part 3
We have added an edits section to Part 3 noting this correction. The specific claims that should be read with this correction in mind:
- Original: "DOC files average 436 downloads vs. PDF at 383 — teachers want editable content." Corrected: The DOC advantage was a confound caused by DOC files being 7 years older on average. PDF is the dominant and best-performing format at scale.
- Original: "Offer both formats." Corrected: Offer PDF as the primary format. Add editable versions only after establishing an audience (1K+ followers) where the "editable" trust signal has value.
- Original: "Editable formats win." Corrected: "Editable" in titles amplifies existing seller authority. It has no measurable effect for sellers under 1,000 followers.
Methodology Notes
File type classification uses TPT's reported file type metadata. Products listing multiple file types are counted under each type. "Editable" detection uses case-insensitive substring matching on product titles. Seller size tiers use follower count at the time of data collection, not at the time of product posting.
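The two classification rules above can be sketched as follows. The record layout (`title`, `file_types`) and the sample rows are hypothetical, made up for illustration; they are not our scraper's actual schema or real products.

```python
from collections import Counter


def has_editable(title):
    """Case-insensitive substring match on the product title."""
    return "editable" in title.lower()


def count_by_file_type(products):
    """Count each product once under every file type it lists."""
    counts = Counter()
    for product in products:
        for file_type in set(product["file_types"]):
            counts[file_type] += 1
    return counts


# Made-up rows, not real TPT products.
products = [
    {"title": "EDITABLE Weekly Planner", "file_types": ["PDF", "DOCX"]},
    {"title": "Fractions Worksheet", "file_types": ["PDF"]},
    {"title": "Uneditable Poster Set", "file_types": ["PDF"]},
]

counts = count_by_file_type(products)
flags = [has_editable(p["title"]) for p in products]
print(counts)
print(flags)
```

Two side effects are worth knowing. A product listing both PDF and DOCX is counted in both groups, so per-type counts can sum to more than the number of products. And plain substring matching also flags titles such as "Uneditable", as the third row shows.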
The age confound analysis compares average posting year by file type. Ideally, we would control for posting year directly (comparing DOC vs. PDF products from the same year), but DOC sample sizes within individual years are too small for reliable comparison (often fewer than 10 per year). This is itself evidence that DOC is not a viable analysis category — the sample is simply too small for robust conclusions.
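The same-year comparison described above could look something like this sketch. The rows are toy values invented to show the mechanics, and the minimum of 10 files per group is an assumption borrowed from the "fewer than 10 per year" remark, not a tested threshold.

```python
from collections import defaultdict
from statistics import fmean


def same_year_comparison(rows, min_n=10):
    """Compare DOC and PDF averages within each posting year.

    rows: iterable of (posting_year, file_type, downloads).
    Returns {year: (doc_avg, pdf_avg)} or {year: None} when either
    group has fewer than min_n products.
    """
    groups = defaultdict(list)
    for year, file_type, downloads in rows:
        groups[(year, file_type)].append(downloads)

    comparison = {}
    for year in sorted({year for year, _ in groups}):
        doc = groups.get((year, "DOC"), [])
        pdf = groups.get((year, "PDF"), [])
        if len(doc) >= min_n and len(pdf) >= min_n:
            comparison[year] = (fmean(doc), fmean(pdf))
        else:
            comparison[year] = None  # too few files in at least one group
    return comparison


# Toy rows, invented for illustration only.
rows = (
    [(2014, "DOC", 400)] * 12
    + [(2014, "PDF", 420)] * 15
    + [(2021, "DOC", 50)] * 2
    + [(2021, "PDF", 90)] * 40
)

comparison = same_year_comparison(rows)
print(comparison)
```

With real data, most years would fall into the `None` branch for exactly the reason given above: there are too few DOC files per year to compare.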
Up Next in This Series
Part 6 tackles the deepest question in the dataset: do followers actually download your products, or is follower count just a trust signal? Five tests point to a surprising answer. Read Part 6: The Three-Part Flywheel
This is Part 5 of an ongoing series on TPT marketplace research. Part 1: the big picture data | Part 2: the follower multiplier | Part 3: file formats and pricing | Part 4: the power law | Part 6: the three-part flywheel. Follow my journey as I learn new skills and build tools with Brian at Actyra.