In this article, I would like to share a piece of code that might be useful to some developers.
There is plenty of C# code available that parses the http urls in a given string. But it is harder to find code that will:
- Accept a url as an argument and parse the site content
- Fetch all urls in the site content and parse the site content of each url
- Repeat the above process until all urls are fetched
Scenario
Taking the website http://valuestocks.in (a stock market site) as an example, I would like to get all the urls inside the website recursively.
Design
The main class is SpiderLogic, which contains all the necessary methods and properties.
![Url1.gif]()
The GetUrls() method parses the website and returns the urls. There are two overloads for this method.

The first one takes two arguments: the url and a Boolean indicating whether recursive parsing is needed.

E.g.: GetUrls("http://www.google.com", true);

The second one takes three arguments: the url, the base url, and the recursive Boolean. It is intended for cases where the url is a sub level of the base url and the web page contains relative paths. In order to construct valid absolute urls, the base url argument is necessary.

E.g.: GetUrls("http://www.whereincity.com/india-kids/baby-names/", "http://www.whereincity.com/", true);
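To show how the two overloads are called from client code, here is a minimal console sketch. The SpiderLogic class name and the GetUrls() overloads come from the article; the surrounding Program scaffolding and the parameterless constructor are only assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var spider = new SpiderLogic();

        // Overload 1: url + recursive flag (non-recursive here)
        IList<string> googleUrls = spider.GetUrls("http://www.google.com", false);

        // Overload 2: url + base url + recursive flag, for pages with relative paths
        IList<string> kidUrls = spider.GetUrls(
            "http://www.whereincity.com/india-kids/baby-names/",
            "http://www.whereincity.com/",
            true);

        foreach (string url in kidUrls)
            Console.WriteLine(url);
    }
}
```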
Method Body of GetUrls()
public IList<string> GetUrls(string url, string baseUrl, bool recursive)
{
    if (recursive)
    {
        _urls.Clear();
        RecursivelyGenerateUrls(url, baseUrl);
        return _urls;
    }
    else
        return InternalGetUrls(url, baseUrl);
}
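RecursivelyGenerateUrls() itself is not listed in the article body. The following is a rough sketch of how it could work against the _urls field and the InternalGetUrls() method shown here; the actual implementation in the download may differ.

```csharp
// Hypothetical sketch: walk each page, collect urls not yet seen, and recurse into them.
private void RecursivelyGenerateUrls(string url, string baseUrl)
{
    foreach (string childUrl in InternalGetUrls(url, baseUrl))
    {
        // Skip urls we have already visited to avoid infinite loops on cyclic links
        if (_urls.Contains(childUrl))
            continue;

        _urls.Add(childUrl);
        RecursivelyGenerateUrls(childUrl, baseUrl);
    }
}
```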
InternalGetUrls()
Another method of interest is InternalGetUrls(), which fetches the content of the url, parses the urls inside it, and constructs the absolute urls.
private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
{
    IList<string> list = new List<string>();

    Uri uri = null;
    if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
        return list;

    // Get the http content
    string siteContent = GetHttpResponse(baseUrl);

    var allUrls = GetAllUrls(siteContent);
    foreach (string uriString in allUrls)
    {
        uri = null;
        if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
        {
            if (uri.IsAbsoluteUri)
            {
                // If different domain / javascript: urls are needed, exclude this check
                if (uri.OriginalString.StartsWith(absoluteBaseUrl))
                {
                    list.Add(uriString);
                }
            }
            else
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
        else
        {
            if (!uriString.StartsWith(absoluteBaseUrl))
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
    }

    return list;
}
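The helpers used above, GetHttpResponse(), GetAllUrls() and GetAbsoluteUri(), belong to SpiderLogic but are not reproduced in the article body. The sketches below show one possible shape for them, assuming a WebClient download, a simple href regex, and an Action<Exception> for the OnException delegate; the partial class wrapper is purely for illustration and the code in the download may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public partial class SpiderLogic
{
    // Assumed shape of the OnException delegate mentioned later in the article.
    public Action<Exception> OnException;

    // Downloads the raw html of a page; returns an empty string on failure.
    private string GetHttpResponse(string url)
    {
        try
        {
            using (var client = new WebClient())
                return client.DownloadString(url);
        }
        catch (Exception ex)
        {
            if (OnException != null)
                OnException(ex);
            return string.Empty;
        }
    }

    // Extracts href values from anchor tags with a simple regex.
    private IList<string> GetAllUrls(string siteContent)
    {
        var result = new List<string>();
        var regex = new Regex("href\\s*=\\s*[\"']([^\"'#]+)[\"']", RegexOptions.IgnoreCase);
        foreach (Match match in regex.Matches(siteContent))
            result.Add(match.Groups[1].Value);
        return result;
    }

    // Turns a relative uri into an absolute one using the base url.
    // The uri parameter is kept only to mirror the signature used in the article.
    private string GetAbsoluteUri(Uri uri, string absoluteBaseUrl, string uriString)
    {
        Uri absolute;
        if (Uri.TryCreate(new Uri(absoluteBaseUrl), uriString, out absolute))
            return absolute.ToString();
        return string.Empty;
    }
}
```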
Handling Exceptions
An OnException delegate is exposed so that the caller can receive any exceptions that occur while parsing.
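The exact signature of OnException is not shown in the article; assuming it is a simple delegate that takes the exception (for example an Action<Exception>), wiring it up from the calling code could look like this:

```csharp
var spider = new SpiderLogic();

// Assumed Action<Exception>-style delegate; adjust to the actual signature in the download.
spider.OnException = ex => Console.WriteLine("Error while parsing: " + ex.Message);

var urls = spider.GetUrls("http://valuestocks.in", true);
```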
Tester Application
A tester Windows application is included with the source code of the article; you can try executing it.
The form accepts a base url as input, and clicking the Go button parses the content of the url and extracts all the urls in it. If you need recursive parsing, check the Is Recursive check box.
![Url2.gif]()
Next Part
In the next part of the article, I would like to create a url verifier website that verifies all the urls in a website. I agree that a quick search will turn up free providers that already do this; my aim is to learn and to develop custom code that is extensible and reusable by the community across multiple projects.