
Sowing the Seed Data

  March 6th 2025

Coming at you here with another Cat Search update! Since last time the new design has been implemented (see below), authentication is mostly done (still need to send emails), and many bugs have been squashed 🪳.

However, I got a little sidetracked over the last couple of days by a “simple” item on my todo list that I didn’t even mention last time: Update Seed Data.

What is Seeding Data

I’m not entirely sure where the term came from, since your data doesn’t typically sprout the way the seeds in your garden do, but for those unfamiliar with it, seeding data just means pre-filling your database. Oftentimes it’s generated fake data used only for testing.

In my case though it’s not just test data! The seed data will be used as the initial categorized sites when Cat Search launches.

And Why Should You Bother?

One of the worst first impressions I can imagine for a user (which isn’t difficult, having experienced this many times myself) is landing on a page full of empty states and placeholders. It’s a ghost town of the forgotten web where they built it and no one came, clearly destined to decay.

Chinese Ghost City

On the flip side, if the app is bustling with activity that shows off the product’s value, it’s now a happening place you don’t want to miss out on.

A user’s first impression is a critical moment in forming an opinion of what your product does, how it’s valuable, and why anyone should care.

If you’re not able to get real data into your app when it launches, the next best thing is a sandbox demo environment for customers to try first, which of course requires a seeding script. Depending on your app, this may take the form of an ephemeral demo environment or just one that gets reset every so often.

While the above is the primary reason to seed in my mind, there’s another benefit.

Let’s face it, production data is generally messier than a kid’s bedroom after a sugar rush. There are all kinds of unexpected edge cases the developers never thought of, like really long names, tables that have way more data than expected, and the general bucket of “non-conforming” data from partially deleted records or bugs that were never fixed. So it’s a good habit to update your seed script to include this weird but possible data, so that the devs see it every day while working and consider how to handle all the edge cases. This also means having a realistic amount of data compared to production, so the devs get annoyed by any scaling and performance problems and just go and fix them. The bugs won’t be out of sight, out of mind if you’re testing with realistic data.
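To make that concrete, here is a minimal sketch of a seed script that mixes a few deliberately awkward rows in with a realistic volume of ordinary ones. Everything in it is an assumption for illustration: the `sites` table, its columns, the `.example` domains, and the use of SQLite are hypothetical, not Cat Search’s actual schema or stack.

```python
import sqlite3

# Hypothetical edge cases worth keeping in the seed data so they show up
# during everyday development instead of only in production.
EDGE_CASE_SITES = [
    ("ordinary.example", "Ordinary Site"),   # a normal row for comparison
    ("long-name.example", "A" * 300),        # a really long name
    ("unicode.example", "Café 猫 Search"),    # non-ASCII characters
    ("orphan.example", None),                # "non-conforming" row with no name
]


def seed(conn, bulk_rows=10_000):
    """Fill the (assumed) sites table with edge cases plus bulk filler rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sites (domain TEXT PRIMARY KEY, name TEXT)"
    )
    # INSERT OR REPLACE keeps the script safe to re-run against the same database.
    conn.executemany(
        "INSERT OR REPLACE INTO sites (domain, name) VALUES (?, ?)",
        EDGE_CASE_SITES,
    )
    # Bulk rows give the table a realistic size so slow queries get noticed.
    bulk = ((f"site{i}.example", f"Site {i}") for i in range(bulk_rows))
    conn.executemany(
        "INSERT OR REPLACE INTO sites (domain, name) VALUES (?, ?)", bulk
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    seed(connection)
    total = connection.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
    print(f"seeded {total} sites")
```

The `bulk_rows` number is a knob to turn until local data feels about as heavy as production does.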

Sources

With Cat Search my challenge was finding good sources to pull website categories from. There are a number of companies out there if you search “website category api” that provide APIs for categorizing domains but they’re all exorbitantly expensive and it’s probably against their terms of service to use the data to put them out of business ;-)

So publicly available lists it had to be.

I started with the Ahrefs top sites list, which includes categories, but I had to do a bit of cleaning to make it usable and to get the categories to make sense. For example, whatsapp.com is not really “Internet and Telecom”.

Another great source I’m exploring more is good ol’ Wikipedia. There are list pages like List of Internet forums and List of Video Game Websites that list out a bunch of the popular sites. It does take a bit of scripting to get the actual domains of all the sites, though, since the links on these pages point to other Wikipedia pages.
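One small piece of that scripting is turning whatever site URL you eventually find into a bare domain. The sketch below covers only that step and assumes the official URLs have already been collected some other way; `domain_from_url` and `unique_domains` are hypothetical helper names, not part of my actual script.

```python
from urllib.parse import urlparse


def domain_from_url(url):
    """Return the lowercase hostname of url without a leading 'www.', or None."""
    # urlparse only fills in the hostname when a scheme or '//' is present,
    # so bare inputs like 'example.com/forum' get a '//' prefix first.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def unique_domains(urls):
    """Normalize a list of URLs into a sorted list of distinct domains."""
    domains = {domain_from_url(u) for u in urls}
    domains.discard(None)  # drop inputs that had no usable hostname
    return sorted(domains)
```

For example, `unique_domains(["https://www.Example.com/forum/", "example.com"])` collapses both entries into a single `example.com`.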

There are a few open source projects that helped too. Kagi Small Web is the basis for my Indie Blogs category and DevDocs is used for the Official Documentation category.

The last major source has been uBlacklist subscription lists. Many of these are sus and go overboard but with a little bit of grooming they provided good lists of spam and AI sites.
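As a rough idea of what that grooming can start from, here is a sketch that pulls plain domains out of a subscription list. It assumes the rules are written as match patterns such as `*://*.example.com/*`, with `#` for comments, `/.../` for regular expression rules, and `@` for unblock rules. Anything that is not a simple host pattern is skipped rather than guessed at, and `domains_from_rules` is a hypothetical helper name.

```python
import re

# Matches simple host rules like '*://*.example.com/*' or 'https://example.com/*'
# and captures the host without the optional '*.' subdomain wildcard.
MATCH_PATTERN = re.compile(r"^(?:\*|https?)://(?:\*\.)?([a-z0-9.-]+)/")


def domains_from_rules(text):
    """Return the sorted, de-duplicated domains found in a subscription list."""
    domains = set()
    for line in text.splitlines():
        rule = line.strip()
        # Skip blanks, comments, unblock rules, and regular expression rules.
        if not rule or rule.startswith(("#", "@", "/")):
            continue
        match = MATCH_PATTERN.match(rule.lower())
        if match:
            domains.add(match.group(1))
    return sorted(domains)
```

The output is only a candidate list. Deciding which domains really belong in a spam or AI category is still the manual part.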

If you know of any other good sources please let me know in the comments!

Results

With all that seeding work, the results speak for themselves. Across 2 pages of Google results, many popular queries (😉) only have a couple of Uncategorized results.

Cat Search Seeded Data

I’m sure there are a number of search queries I never make that will turn up far more uncategorized results, but I’ll have to launch the thing to figure out what those are!
