Skip to content

Latest commit

 

History

History

1236

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 

Given a url startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl

Return all urls obtained by your web crawler in any order.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all urls from a webpage of given url.
  • Do not crawl the same link twice.
  • Explore only the links that are under the same hostname as startUrl.

As shown in the example url above, the hostname is example.org. For simplicity sake, you may assume all urls use http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while urls http://example.org/test and http://example.com/abc are not under the same hostname.

The HtmlParser interface is defined as such: 

interface HtmlParser {
  // Return a list of all urls from a webpage of given url.
  public List<String> getUrls(String url);
}

Below are two examples explaining the functionality of the problem, for custom testing purposes you'll have three variables urlsedges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.

Note: Consider the same URL with the trailing slash "/" as a different URL. For example, "http://news.yahoo.com", and "http://news.yahoo.com/" are different urls.

 

Example 1:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]

Example 2:

Input: 
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.

 

Constraints:

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl is one of the urls.
  • Hostname label must be from 1 to 63 characters long, including the dots, may contain only the ASCII letters from 'a' to 'z', digits  from '0' to '9' and the hyphen-minus character ('-').
  • The hostname may not start or end with the hyphen-minus character ('-'). 
  • See:  https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
  • You may assume there're no duplicates in url library.

Companies: Bloomberg, Microsoft, Dropbox, Facebook

Related Topics:
String, Depth-First Search, Breadth-First Search, Interactive

Similar Questions:

Hints:

  • Use DFS/BFS to search start from the startURL. Remember to get rid of duplicate URLs.

Solution 1.

// OJ: https://leetcode.com/problems/web-crawler
// Author: github.com/lzl124631x
// Time: O(N)
// Space: O(N)
class Solution {
    string getOrigin(string url) {
        int p = 0;
        for (int i = 0; i < 3; ++i) {
            p = url.find('/', p);
            if (i < 2) ++p;
        }
        return url.substr(0, p);
    }
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        unordered_set<string> seen{{startUrl}};
        queue<string> q{{startUrl}};
        auto origin = getOrigin(startUrl);
        while (q.size()) {
            auto url = q.front();
            q.pop();
            auto urls = htmlParser.getUrls(url);
            for (auto &next : urls) {
                if (seen.count(next) == 0 && next.starts_with(origin)) {
                    seen.insert(next);
                    q.push(next);
                }
            }
        }
        return vector<string>(begin(seen), end(seen));
    }
};