A little bit of history
Back in 2018, I decided to build a Crawler and Website Audit tool that would be easy to use and extremely powerful, with built-in analytics. I had a grand vision where the data generated by the Crawler would be easily digested by my data visualization app, and I would figure out a way to integrate the two apps into one comprehensive package.
The grand vision: SEOs and web developers can simply download the package to their computers or self-host it on any of the many cloud services. You can also host it on AWS and run a million-page audit if you wish. SEO data scientists will no longer have to extract data into Tableau or other Business Intelligence tools. This is the vision for the CrawlSpider Audit Tool.
Naming ceremony for the SEO Crawler and Audit tool
I remember logging into Google AdWords and running several queries for SEO- and crawl-related keywords. Using that data, I tried various combinations, and luckily “CrawlSpider” was available for just 9 bucks.
I grabbed it.
I felt lucky that the name was so cool and perfect for the product I was envisioning.
Deciding on the Product features and planning
I knew building a crawler and a robust SEO auditing tool is not a small task that can be done over a weekend. I wanted a thorough understanding of the industry and all the pain points.
So I kept deferring any planning and feature lists.
Instead, I decided to become an SEO myself.
I enrolled in Glen Allsopp’s MarketingInc course and SEO Blueprint.
I started going through all the modules on lead generation, client acquisition, and other SEO topics.
I decided to offer SEO audits as a service. Glen even kept sending audit leads to all his members, including myself. I joined Upwork and offered an audit service. I became good at using ScreamingFrog and doing all the analysis.
Using my dashboard software, I even built an SEO log analysis tool in about a week. The dashboard app InfoCaptor is very robust; it can consume CSV files and create tables for you. All I had to do was slice and dice the Apache log data and publish it into various dashboards for log analysis. I even called it LogTiger or something 🙂
A log analyzer helps in doing deep technical audits. Currently I don’t know how I am going to integrate that with CrawlSpider, so I will leave it at that.
Feeling the pain
When you do typical SEO audits, you provide a list of issues along with actions that the client can work on immediately.
Repeating an audit and comparing the results with the previous one became an Excel juggling act. I noted this pain. I wanted SEO change capture and compare to be the core functionality of my auditing tool. Back then ScreamingFrog did not have a compare feature, but I guess a lot of their customers must have felt the pain too.
Distribution and Marketing approach
Early on (around 2010), when I had built InfoCaptor, I tried various marketing and distribution methods, including download sites. I was getting wary of that approach and was leaning towards marketplaces as a distribution channel. I got familiar with marketplaces such as CodeCanyon, Shopify, and the WordPress ecosystem.
I had some familiarity with WordPress plugin development, but that was 10 years ago.
I decided that the WordPress ecosystem is perfect for CrawlSpider, as everything is PHP-based and all my apps are as well.
I was not aware of all the limitations and intricacies involved in building and releasing a plugin. So I quickly learned and published a WooCommerce Reporting plugin and a simple Table Builder for WordPress.
Why is there no ScreamingFrog plugin for WordPress?
I looked everywhere but was not able to locate any SEO plugin within the WordPress ecosystem that did SEO audits the way ScreamingFrog does. I thought this could be an opportunity, or perhaps there are no takers, or there is some technical limitation.
Now, having published two plugins for WordPress and WooCommerce, I think I know the reason behind this.
All you have is 30 seconds
WordPress sites are typically hosted on third-party hosting services. These hosting providers keep strict controls on the CPU time and the amount of memory allocated to PHP processes. The common default is 30 seconds of execution time.
Now if you think about it, that is plenty of time for simple database fetches and sending the results to the client. A website crawler, however, is very resource intensive. It can spawn multiple threads and keep a lot of data in memory during the crawl phase. An SEO crawler needs to crawl all the pages, posts, and product pages; the next big task is extracting the page content for each of those crawled links; and once extracted, you need to spend some CPU cycles on the analysis.
This, I think, is why a different approach and architecture are needed if you want to build a crawler plugin for WordPress.
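To make the constraint concrete, here is a minimal sketch, in PHP, of the kind of time-budget guard such a plugin needs. The function names are hypothetical, not from any actual plugin: read the host’s limit, reserve a safety margin, and stop work before the process gets killed.

```php
<?php
// Minimal sketch of a time-budget guard (hypothetical function names).

function crawlspider_time_budget(): float {
    $limit = (int) ini_get('max_execution_time'); // 0 means unlimited (e.g. CLI)
    if ($limit <= 0) {
        $limit = 30; // assume the common shared-hosting default
    }
    return $limit - 5.0; // reserve a few seconds to save state and respond
}

function crawlspider_out_of_time(float $started): bool {
    return (microtime(true) - $started) >= crawlspider_time_budget();
}
```

Every unit of work checks `crawlspider_out_of_time()` before starting, so the plugin stops itself cleanly instead of being killed mid-check.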
SEO Change Capture – Pain becomes Idea
The majority of SEO audit tools are online SaaS products where you log in and kick off the audits. I don’t intend to build a complete crawler and audit tool as a plugin, but I wanted to test the market and see what is possible.
I did not intend to invest heavily in an SEO auditor right away. I thought: what if I made just the SEO change capture and compare into the plugin? This way I can test the market and the demand. This functionality will eventually roll into the SEO auditor.
SEO Monitor – The Idea
The idea is very basic and easy to understand.
Imagine you have a website, and assume it is a WordPress or WooCommerce site. You have invested heavily in SEO and content marketing, and your website brings in a good amount of Google traffic thanks to those investments.
Let’s say it has a good amount of content on it, say 100 pages (blogs + pages + products).
The SEO Monitor’s job is to scan each of your pages and alert you whenever something changes. A change could be as simple as the following:
- Modified title tag, where you or someone else added or removed a keyword
- Modified H1, H2, H3 … tag content
- Changed anchor on an internal or external link
- Modified JavaScript code
- Entire paragraphs modified
You get the idea. A page is made up of several elements, and the placement of these elements decides the overall SEO structure of your page. If any words within any tag change, or if links start pointing elsewhere, the SEO structure of the page changes. This could be intentional, a mistake, or a side effect of another plugin.
As a site owner and a business owner, you are entitled to know about all the changes that occur on your website. This plugin’s job is to do exactly that: it provides an extra set of eyes that alerts you when something changes on your website.
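To sketch what “capturing” a page could look like, here is an illustrative PHP snippet that fingerprints a page’s SEO-relevant elements using the standard DOMDocument parser. The function name and fingerprint layout are my placeholders, not the plugin’s actual code:

```php
<?php
// Illustrative sketch: reduce a page's SEO-relevant elements to hashes.

function crawlspider_fingerprint(string $html): array {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $elements  = [
        'title' => $titleNode ? trim($titleNode->textContent) : '',
        'h1'    => [],
        'links' => [],
    ];
    foreach ($doc->getElementsByTagName('h1') as $h1) {
        $elements['h1'][] = trim($h1->textContent);
    }
    foreach ($doc->getElementsByTagName('a') as $a) {
        $elements['links'][] = [$a->getAttribute('href'), trim($a->textContent)];
    }

    // One hash per element group; a hash that differs between two scans
    // means that group changed.
    return array_map(fn ($v) => md5(serialize($v)), $elements);
}
```

Comparing the hashes from the previous scan with the current ones tells you which group changed; storing the raw values alongside the hashes would let the plugin show exactly what changed.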
Architecture
Earlier I mentioned the 30-second limitation. On some private hosting this could be higher, say 60 seconds, but there is always going to be a max execution time limit. This limit does not exist for desktop tools.
So how do you architect the crawling, extraction and analysis?
The idea is again very simple.
You pick one URL at a time, extract its content, and then mark it, say, ‘Crawled’.
There are some 20 to 30 different scans and checks on each page. For every single check, you do the check and mark it “Checked”. Let’s say you did Check1 and Check2, and in the middle of Check3 the time runs out. The system has already saved the state for URL1: Check1 and Check2 completed.
The next time execution begins, the system knows which checks were done and resumes from that point onwards. Once all the checks are done, it picks the next URL in the queue.
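Here is a condensed sketch of that resume loop, assuming WordPress options as the state store. The option names and run_check() are placeholders, and the 25-second cutoff stays safely under the 30-second limit:

```php
<?php
// Sketch of the resumable check loop (option names and run_check() are hypothetical).

$checks  = ['check_title', 'check_h1', 'check_meta_desc' /* ... ~45 in total */];
$urls    = get_option('crawlspider_queue', []);
$state   = get_option('crawlspider_state', ['url_index' => 0, 'check_index' => 0]);
$started = microtime(true);

while ($state['url_index'] < count($urls)) {
    $url = $urls[$state['url_index']];
    while ($state['check_index'] < count($checks)) {
        if (microtime(true) - $started > 25) {          // time is about to run out
            update_option('crawlspider_state', $state); // persist progress, stop cleanly
            return;
        }
        run_check($checks[$state['check_index']], $url); // hypothetical helper
        $state['check_index']++;
        update_option('crawlspider_state', $state);      // mark this check "Checked"
    }
    $state['check_index'] = 0; // all checks done: move to the next URL in the queue
    $state['url_index']++;
}
```

Because the state is written after every check, a run that dies in the middle of Check3 restarts at Check3, exactly as described above.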
What are the different kinds of SEO or On-Page Checks?
This is straight from the database. Check the image below.
This is roughly a set of 45 checks; some are performed at the root-domain level, and the majority are page level.
In addition, there are custom checks you can define for each URL.
Every URL goes through the checks listed above, and the system keeps a history. If the outcome of any check changes, it makes a note of it. Later, in the analysis phase, it decides which checks failed and alerts the owner.
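The history part can be as simple as keeping the last outcome of each check per URL and noting whenever it differs. A sketch using WordPress post meta, where the meta key and record_change() are hypothetical:

```php
<?php
// Sketch: compare a check's outcome against its last recorded outcome.

function crawlspider_note_if_changed(int $post_id, string $check_id, string $outcome): void {
    $key      = "crawlspider_last_{$check_id}";
    $previous = get_post_meta($post_id, $key, true); // '' when no history exists yet

    if ($previous !== '' && $previous !== $outcome) {
        // Outcome differs from the last run: queue it for the analysis phase.
        record_change($post_id, $check_id, $previous, $outcome); // hypothetical
    }
    update_post_meta($post_id, $key, $outcome); // becomes the new baseline
}
```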
Crawl List
Since this is a WordPress plugin, the crawl list of URLs is already available. There is no need to start with a seed URL. The list of pages, posts, product pages, document-type pages, etc. is readily available through the WordPress API.
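Building the queue is therefore a plain query instead of a discovery crawl. A rough sketch, where the 'product' post type assumes WooCommerce is active and the option name matches the earlier sketch:

```php
<?php
// Sketch: build the crawl queue straight from WordPress content.

$ids = get_posts([
    'post_type'   => ['post', 'page', 'product'], // 'product' assumes WooCommerce
    'post_status' => 'publish',
    'numberposts' => -1,    // all published items
    'fields'      => 'ids', // IDs only, to keep memory low
]);

$queue = array_map('get_permalink', $ids); // turn IDs into URLs
update_option('crawlspider_queue', $queue); // hypothetical queue option
```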
This plugin is the first step towards building CrawlSpider, and it is coming along well. Once the plugin is released, it will become part of the CrawlSpider functionality. Stay tuned for more updates, as I intend to document the entire development and marketing process. If you are interested in testing this plugin, please drop me a message.