Webshrinker Domain Categorization: How It Works
The beginning of Webshrinker
Before we get into how domain categorization works, let’s take you back to the beginning. In 2012, Adam Spotton started a screenshot-as-a-service company that would eventually become Webshrinker. It automatically took screenshots of websites, rather than capturing them manually, which could then be used as educational resources.
Domain categorization was added to Webshrinker in 2015, and soon after, it became the main driver of the product along with threat intelligence.
Today, Webshrinker still offers a screenshot API alongside two ways to access its domain categorization data.
Breaking down domain categorization
We’ve defined website categorization in the past. Here, we’ll focus just on how Webshrinker as a domain categorization tool works.
Domain processing begins one of three ways:
- New domain ingestion—Webshrinker’s AI scans the web, crawling new domains and re-indexing previously categorized domains
- External feeds—External data sources that are given to Webshrinker to process
- Customer triggers—Customers initiate domain processing by requesting domain categorization
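The three triggers above can be sketched as a single ingestion queue with one entry point per source. This is a hypothetical illustration; the class and method names are invented, not Webshrinker's actual internals.

```python
from dataclasses import dataclass, field

@dataclass
class IngestionQueue:
    """Illustrative queue of domains awaiting categorization."""
    pending: list = field(default_factory=list)

    def from_crawler(self, domain):
        # New-domain ingestion: the AI crawler finds a new or stale domain.
        self.pending.append(("crawler", domain))

    def from_external_feed(self, domain):
        # External feeds: third-party data sources handed to Webshrinker.
        self.pending.append(("feed", domain))

    def from_customer(self, domain):
        # Customer triggers: a customer requests categorization directly.
        self.pending.append(("customer", domain))

queue = IngestionQueue()
queue.from_crawler("example.com")
queue.from_external_feed("feed-domain.net")
queue.from_customer("customer-site.org")
```

Whichever path a domain arrives through, it lands in the same processing pipeline described next.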
Determining the category
After domain ingestion, a site gets categorized. Since Webshrinker has two domain taxonomies, it determines categories for both, but only returns data for the taxonomy it is asked to retrieve.
For instance, if a customer uploads a list of domains to be processed by Webshrinker and asks for IAB categories, they will only receive those categories. However, Webshrinker has both sets of data.
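The "compute both, return one" behavior can be sketched as follows. The function names and the stubbed category values are illustrative assumptions, not Webshrinker's real API.

```python
def categorize(domain):
    # In reality this would run Webshrinker's classifiers; here we stub
    # the result with invented categories for illustration.
    return {
        "webshrinker": ["entertainment"],
        "iab": ["Hobbies & Interests", "Card Games"],
    }

def get_categories(domain, taxonomy="iab"):
    all_categories = categorize(domain)  # both taxonomies are computed and stored
    return all_categories[taxonomy]      # only the requested taxonomy is returned

print(get_categories("example-cards.com", taxonomy="iab"))
# → ['Hobbies & Interests', 'Card Games']
```

Asking for the "webshrinker" taxonomy instead would return that taxonomy's categories from the same stored result, with no re-processing needed.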
There are over 40 categories in Webshrinker’s unique taxonomy, and over 400 categories (and subcategories) in the IAB taxonomy. Webshrinker’s categorization is ideal for internet filtering and security applications, while the IAB categories are a better fit for anyone integrating domain data with AdTech.
Additionally, a single domain can have multiple categories per taxonomy. Under the IAB, a site might be categorized as:
- Hobbies & Interests
- Card Games
Under Webshrinker’s unique taxonomy, the same site would receive its own, typically broader, set of categories.
Detecting threats
To detect threats, Webshrinker uses advanced machine learning algorithms. Three components of Webshrinker’s threat detection make it incredibly good at detecting zero-day attacks:
- Checking for threat markers
- Browser simulation
- Image analysis
When Webshrinker checks for threat markers, it looks for more than 20 different indicators to determine whether a site is malicious. If the number of malicious markers reaches a certain threshold, the domain is categorized as a threat. If you request categorization for a threat domain, you might get a traditional category like “Gambling” alongside “Deceptive.”
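A threshold over a set of markers can be sketched like this. The markers and threshold below are invented for illustration; Webshrinker's actual indicators and cutoff are not public.

```python
# Hypothetical subset of threat markers (Webshrinker checks 20+).
SUSPICIOUS_MARKERS = {
    "obfuscated_javascript",
    "newly_registered_domain",
    "credential_form_on_non_https",
    "known_phishing_kit_fingerprint",
}
THREAT_THRESHOLD = 2  # illustrative; the real threshold is not published

def classify(page_markers, base_category):
    """Return the traditional category, plus 'Deceptive' if enough markers hit."""
    hits = len(SUSPICIOUS_MARKERS & set(page_markers))
    categories = [base_category]
    if hits >= THREAT_THRESHOLD:
        categories.append("Deceptive")
    return categories

print(classify({"obfuscated_javascript", "newly_registered_domain"}, "Gambling"))
# → ['Gambling', 'Deceptive']
```

Note how the threat label is returned alongside, not instead of, the traditional category, matching the "Gambling" plus "Deceptive" example above.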
The browser simulation is unique to Webshrinker, and it’s part of what makes it so accurate. Hackers know that websites are being scanned and that theirs might be categorized as a threat and taken down. One way they work around this is by deceiving known scanner bots into thinking the site serves some other purpose. If a site detects that an automated scanner is looking at it, it might serve completely different content to throw the scanner off its trail.
It’s really doubly deceptive when you think about it.
But Webshrinker simulates a browser session that perfectly mimics one a human would open. A malicious website that cloaks itself from bots doesn’t recognize the simulated session as a scanner, so it serves its real content, and Webshrinker can accurately categorize the site as a threat.
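The cloaking problem that browser simulation defeats can be sketched as a site that sniffs the user agent. Everything here, the signatures, the decoy content, the headers, is an invented illustration of the general technique, not Webshrinker's implementation.

```python
BOT_SIGNATURES = ("bot", "crawler", "spider")

def site_response(user_agent):
    # A cloaking site hides its real payload from anything that looks
    # like an automated scanner.
    if any(sig in user_agent.lower() for sig in BOT_SIGNATURES):
        return "harmless recipe blog"   # decoy shown to known scanners
    return "fake bank login page"       # real (malicious) content

# A naive scanner announces itself and gets the decoy.
naive_scan = site_response("CategorizerBot/1.0")
# A simulated human browser session gets the real content.
simulated_scan = site_response(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

print(naive_scan)      # → harmless recipe blog
print(simulated_scan)  # → fake bank login page
```

Real cloaking also keys on IP ranges, timing, and JavaScript execution, which is why a full simulated browser session, rather than just a spoofed user agent, is needed in practice.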
Webshrinker’s image analysis is what makes it so good at spotting phishing sites. It’s been well-trained in the art of logo detection. For every site that Webshrinker scans, it does a certain level of image analysis.
It evaluates the content and assigns a score. If the score is low, Webshrinker is more confident that the site is not a threat. If the score is high, Webshrinker will flag the site as deceptive.
But if the score is smack-dab in the middle, Webshrinker has a few more tricks up its sleeve to get more information out of that site. It performs a higher level of image analysis that gives it a more definitive answer.
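This tiered scoring can be sketched as three bands: low scores pass, high scores flag, and the ambiguous middle band triggers a deeper, more expensive analysis. The thresholds and names below are invented for illustration.

```python
LOW, HIGH = 0.3, 0.7  # illustrative cutoffs; real values are not public

def evaluate(score, deep_analysis):
    """Cheap scoring first; fall back to deeper analysis only when ambiguous."""
    if score < LOW:
        return "not a threat"
    if score > HIGH:
        return "deceptive"
    # Middle band: run a higher level of image analysis for a definitive answer.
    return deep_analysis()

print(evaluate(0.1, lambda: "deceptive"))    # → not a threat
print(evaluate(0.9, lambda: "not a threat")) # → deceptive
print(evaluate(0.5, lambda: "deceptive"))    # → deceptive
```

The design keeps the expensive step off the hot path: most sites are resolved by the cheap score, and only ambiguous ones pay for the deeper pass.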
Checking against public data
Webshrinker is heavily reliant on its own database and how it’s been programmed, but it’s still important to check publicly available data to make sure everything is accurate.
A huge reason for this is to reduce the number of false positives. Malicious sites are usually only malicious for a relatively small amount of time when considering the life of a website. Marking a site as deceptive when it’s no longer a threat can cause serious problems for those using domain categorization data. Webshrinker avoids these problems by checking against public data, and rechecking malicious websites periodically to see if they’ve been cleaned.
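The recheck logic above can be sketched as follows: a domain flagged as deceptive is rescanned after an interval and cleared if it no longer trips the threat checks. The record shape, interval, and function names are all assumptions for illustration.

```python
import datetime

RECHECK_INTERVAL = datetime.timedelta(days=7)  # illustrative interval

def maybe_recheck(record, now, rescan):
    """record: {'domain', 'deceptive', 'last_checked'};
    rescan(domain) returns True if the site is still malicious."""
    if record["deceptive"] and now - record["last_checked"] >= RECHECK_INTERVAL:
        # Re-run the threat checks; clear the flag if the site was cleaned.
        record["deceptive"] = rescan(record["domain"])
        record["last_checked"] = now
    return record

rec = {"domain": "cleaned-site.com", "deceptive": True,
       "last_checked": datetime.datetime(2024, 1, 1)}
rec = maybe_recheck(rec, datetime.datetime(2024, 1, 15), rescan=lambda d: False)
print(rec["deceptive"])  # → False
```

Periodic rechecking is what keeps a once-malicious, since-cleaned site from being flagged as deceptive forever, which is exactly the false-positive problem described above.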
Interested in learning more of the ins and outs of Webshrinker? Get a demo.