Alexa Top 1M is no longer. Replacement? #1
Comments
Cisco Umbrella 1 Million
Data: http://s3-us-west-1.amazonaws.com/umbrella-static/index.html Looks very promising! |
What's the current status here? @pmeenan are we just using a stale Alexa list? |
@pmeenan yes. afaik Amazon is not updating the rankings any more, but the list is still there (as a download). |
Hi, |
@hsbahri You are asking at the wrong place - you need to ask Alexa (or Amazon which owns it). We just use the public 1M list they used to provide. |
Google recently released the Chrome User Experience report, which includes 908K distinct domains from 1.2M distinct origins. Looking at the diff between these domains and Alexa's, CrUX seems to be higher quality in some ways. For example, it excludes the t.co link shortener and microsoftonline.com, which is just the domain for logged-in users. Neither of these domains is useful for HA to crawl. It's also updated monthly. One big drawback is that CrUX isn't ranked, so we wouldn't necessarily know the relative popularity of domains. Something to consider though. |
Correction: The 908K number was for distinct domains excluding subdomains. The number of distinct domains including subdomains is more like 1.2M. I've also been exploring a newer version of the Alexa list than the

One goal is to prune out the URLs in the current crawls that are low quality, eg not in CrUX. According to the following query, that would be about half of our dataset :-/

SELECT
  url
FROM
  `httparchive.runs.2017_12_01_pages`
WHERE
  NET.HOST(url) IN (
  SELECT
    DISTINCT NET.HOST(origin)
  FROM
    `chrome-ux-report.all.201711`)

217K URLs out of 470K (46%) are in CrUX. |
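For a quick local check of the same idea, the membership test can be sketched in Python. This is a minimal sketch rather than the query above: the sample URLs and origins are made up, `host_of` and `share_in_crux` are hypothetical helper names, and `urlsplit(...).hostname` is only a rough stand-in for BigQuery's `NET.HOST`.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or None if it has none."""
    return urlsplit(url).hostname


def share_in_crux(crawl_urls, crux_origins):
    """Return (matched_urls, fraction) for crawl URLs whose host is in CrUX."""
    crux_hosts = {host_of(o) for o in crux_origins}
    crux_hosts.discard(None)
    matched = [u for u in crawl_urls if host_of(u) in crux_hosts]
    fraction = len(matched) / len(crawl_urls) if crawl_urls else 0.0
    return matched, fraction


# Made-up sample data, for illustration only.
crawl_urls = [
    "http://www.example.com/",
    "http://t.co/",
    "http://news.example.org/",
    "http://unknown.example.net/",
]
crux_origins = [
    "https://www.example.com",
    "http://news.example.org",
]

matched, fraction = share_in_crux(crawl_urls, crux_origins)
print(len(matched), "of", len(crawl_urls), "URLs are in CrUX")
```

Building the host set once keeps the check linear in the number of crawl URLs, which matters little here but mirrors how the `IN (SELECT DISTINCT ...)` subquery behaves conceptually.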
Instead of joining the 470k that we currently pull, what does the intersection of the full 1M look like? |
Going by the latest Alexa list, much worse actually.

#standardSQL
SELECT
  rank,
  domain
FROM
  `httparchive.urls.20171221`
WHERE
  domain IN (
  SELECT
    domain
  FROM (
    SELECT
      domain
    FROM
      `httparchive.urls.20171221`) AS alexa
  JOIN (
    SELECT
      DISTINCT NET.HOST(origin) AS domain
    FROM
      `chrome-ux-report.all.201711`) AS crux
  USING (domain))
ORDER BY
  rank

134,768 domains. If we use the |
I kind of like the idea of using the intersection of the older list and the CrUX report. We were using the top 500k from the older list anyway so joining with CrUX would let us filter it to just the actual domains that serve pages (and puts us in the same ballpark). Sounds like the newer domain lists are pretty much useless. |
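The filtering step described here could look roughly like the following. It is a sketch under assumptions: `filter_ranked_domains` is a hypothetical name, the ranked list and origins are made-up samples, and matching is exact (a bare `example.net` does not match a CrUX host of `www.example.net`), which is the same limitation the SQL join above has.

```python
from urllib.parse import urlsplit


def filter_ranked_domains(ranked_domains, crux_origins, limit=None):
    """Keep (rank, domain) rows whose domain is a CrUX host, in rank order."""
    crux_hosts = {urlsplit(o).hostname for o in crux_origins}
    kept = [
        (rank, domain)
        for rank, domain in sorted(ranked_domains)
        if domain.lower() in crux_hosts
    ]
    return kept[:limit] if limit is not None else kept


# Made-up sample data, for illustration only.
ranked = [(3, "example.net"), (1, "example.com"), (2, "example.org")]
crux_origins = [
    "https://example.com",
    "https://example.org",
    "https://www.example.net",  # no exact match for the bare "example.net"
]

print(filter_ranked_domains(ranked, crux_origins))
print(filter_ranked_domains(ranked, crux_origins, limit=1))
```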
If they increased the granularity and extended the list to ~5M I think it'd be very valuable. But instead they watered it down with 350K uninteresting CDN variations. :( Using the older list SGTM. We would need to figure out the issue of taking an Alexa domain (with no subdomain or protocol) and picking the "canonical" origin from CrUX. In some cases this could be "prefer the origin with www and/or https" but it gets weird in other cases, like "for live.com prefer https://outlook.live.com". One brutish approach could be to run a test crawl on the domains themselves and update the list with wherever the initial URL redirected to. |
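The "prefer www and/or https" heuristic might be sketched as below. The preference order (https before http, then www before bare) is an assumption, since the thread does not settle it, and `pick_canonical_origin` and the sample origins are hypothetical.

```python
from urllib.parse import urlsplit


def pick_canonical_origin(domain, crux_origins):
    """Pick one CrUX origin for a bare domain, or None if nothing matches.

    Assumed preference order (a guess, not an agreed rule):
    https before http, then "www." + domain before the bare domain.
    """
    www_host = "www." + domain
    candidates = []
    for origin in crux_origins:
        parts = urlsplit(origin)
        if parts.hostname in (domain, www_host):
            # Lower tuples sort first: https (0) beats http (1),
            # and the www host (0) beats the bare domain (1).
            score = (
                0 if parts.scheme == "https" else 1,
                0 if parts.hostname == www_host else 1,
            )
            candidates.append((score, origin))
    return min(candidates)[1] if candidates else None


# Made-up sample data, for illustration only.
crux_origins = [
    "http://example.com",
    "https://example.com",
    "https://www.example.com",
    "https://outlook.live.com",
]

print(pick_canonical_origin("example.com", crux_origins))
print(pick_canonical_origin("live.com", crux_origins))
```

With this sample the second call returns `None`, which is the live.com case from the comment: the only CrUX origin sits on a different subdomain, so a www/https rule alone cannot find it.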
Couldn't we use *.live.com from CrUX and cover all navigated domains instead of trying to limit it to 1? Granted, that may change the counts. Basically look for an exact match as well as *. |
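A sketch of the "exact match as well as *." lookup, assuming origins are plain strings and using a hypothetical `origins_for_domain` helper with made-up data. The leading dot in the suffix is what keeps `notlive.com` from matching `live.com`.

```python
from urllib.parse import urlsplit


def origins_for_domain(domain, crux_origins):
    """Return CrUX origins whose host is `domain` or any subdomain of it."""
    suffix = "." + domain
    matches = []
    for origin in crux_origins:
        host = urlsplit(origin).hostname
        if host and (host == domain or host.endswith(suffix)):
            matches.append(origin)
    return matches


# Made-up sample data, for illustration only.
crux_origins = [
    "https://outlook.live.com",
    "https://login.live.com",
    "https://live.com",
    "https://notlive.com",
    "https://live.com.example.org",
]

print(origins_for_domain("live.com", crux_origins))
```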
To maintain the ranking integrity we'd need a 1:1 map of domain to origin. If we wanted to have unranked origins as well, I'd be ok with grabbing as many subdomains as we can accommodate. |
By the sounds of it, their new list is reporting top requested origins, which is a change from top navigated origins, and is thus far less interesting or useful for us. The corollary here is that the new ranks are also of little value to us moving forward. With that in mind... Long term, I don't see what value we get out of the Alexa list anymore: the intersection is small and we don't trust the ranks. As such, it seems that we can sunset our use of Alexa sometime in 2018. In the short term, we don't have the capacity to crawl the full CrUX list, and we need some signal to help pick out the "high value" origins. For that, intersecting old Alexa (e.g. 20170315) with CrUX sgtm. Also, given that we don't trust ranks moving forward, I'd suggest we stop surfacing them as well. As soon as we have enough capacity we can drop the requirement for the Alexa list and bootstrap from CrUX. Does that map to what you guys are thinking? |
sgtm |
One thing that comes to mind here is that there would be two distinct shifts in the data: replacing the Alexa 500K with the Alexa+CrUX ~500K, then expanding to a pure CrUX ~1M. I wonder if it would be worth waiting for the capacity improvements and skipping over the Alexa+CrUX hybrid. These changes will have a big effect on the continuity of many if not all metrics and it may be preferable to have one dramatic shift than two. |
Do we have a good guesstimate for when we could do the "full" migration? |
sgtm, particularly since we have a near-term plan for the capacity increase. |
Probably a couple of months on the infrastructure side depending on the hardware order going through and migration of the existing server. |
https://twitter.com/Alexa_Support/status/800755671784308736
Alternatives
Quantcast Top 1M US sites
Majestic Million CSV
ahrefs
Another crawl based service. I don't see any free ranking dumps though?
SimilarWeb Top sites
Top 50 for free; a paid account is required to see more.
Others we could or should consider?