Mr. Cluey : How To ...? : Finding Hyperlinks


How can I find out how many hyperlinks there are on a page?

This question is addressed in segue.com, volume 2, number 1. Unfortunately, the solution presented there is not correct.

Breaking a FOR loop

A for loop is a construct with a particular purpose: to do repetitive work some fixed number of times. Any time you find a for loop that gives out somewhere in the middle, you should be immediately suspicious. There are occasions where this style is a good idea. This particular instance isn't one of them.

Another hint that something is wrong is the print statement after the loop. The routine here is taking advantage of the fact that the counter is still set. Were we coding in C, the for loop would probably look like

for ( int i = 1; i <= 1000; i++ ) { /* check for link */ }

and the counter would only exist within the scope of the block.

When you find a for loop with a break in the middle, and the counter is referenced after the loop ends, you have two strong clues that the wrong control statement is being used. 4Test permits you to do this, but it is easy to introduce a bug in your code this way.

Hint: how many links will be reported if there are 1500 links on the page?

A much better approach is to use a while loop. Unlike the for statement, which is designed to run through n iterations and then stop, the while statement is intended to loop until some condition is reached. The while version would look something like this:

int nLinkCount = 0 ;

while ( SegueNet.HtmlLink("#{nLinkCount+1}").Exists() )
{
	Print( "FOUND ONE" ) ;
	nLinkCount++ ;
}

Print( "There are {nLinkCount} links" ) ;

Where the original code reported an incorrect result after a mere 1000 links, this version is robust enough to count links until an int overflow error occurs. The second version also makes the intent clearer; it will be easier for people maintaining the code to see what is supposed to be going on.
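The difference between the two loops can be tried out without a browser. The sketch below is a Python model, not 4Test: link_exists() is a hypothetical stand-in for SegueNet.HtmlLink("#n").Exists(), and the capped for loop is only an assumed reconstruction of the original routine, which is not reproduced here.

```python
def make_page(total_links):
    """Return a hypothetical Exists() check for a page with total_links links."""
    def link_exists(n):
        return 1 <= n <= total_links
    return link_exists


def count_with_capped_for(link_exists, cap=1000):
    # Assumed shape of the original: a fixed-length loop with a break.
    count = 0
    for i in range(1, cap + 1):
        if not link_exists(i):
            break
        count = i
    return count


def count_with_while(link_exists):
    # Mirrors the while version: loop until the next link is missing.
    count = 0
    while link_exists(count + 1):
        count += 1
    return count


page = make_page(1500)
print(count_with_capped_for(page))  # 1000
print(count_with_while(page))       # 1500
```

With 1500 links the capped loop never reaches its break, so it runs out of iterations and reports the cap rather than the true count.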

For extra credit, time the two different implementations. The for loop is about 2% faster than the while loop, but both are quick enough that the time required to check the links will drown out the difference.

Counting all the links

The second, and more significant, flaw in the code is that it doesn't recognize that links can be contained in other windows. Tables are often used to give the author control of the layout of a web page. It is not uncommon to find tables with links to files in one column and descriptions of the files in the other.

So what happens when the original algorithm is applied? The windows are organized something like this

HtmlPage("ThePage")
	HtmlLink("A")
	HtmlLink("B")
	HtmlTable("#1")
		HtmlColumn("#1")
			HtmlLink("C")
			HtmlLink("D")
			HtmlLink("E")

So the counter starts chugging along... HtmlLink("#1") matches to "A", and HtmlLink("#2") matches to "B", so Exists() returns true in these cases. But now we check for HtmlLink("#3"), expecting it to find "C", but Exists() returns false.

And this is correct, as Silk sees windows in the Html world. "C" isn't HtmlLink("#3"). It is actually HtmlTable("#1").HtmlColumn("#1").HtmlLink("#1"). Yuck.
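To see why the index lookup stops early, here is a small Python model of the hierarchy above. It is a sketch under assumptions: the tuples and the direct_link() helper are hypothetical, and they only imitate the rule that an index such as "#3" is resolved among the direct children of the window it is applied to.

```python
# Each node is (class_name, tag, children), mirroring the hierarchy listed above.
PAGE = ("HtmlPage", "ThePage", [
    ("HtmlLink", "A", []),
    ("HtmlLink", "B", []),
    ("HtmlTable", "#1", [
        ("HtmlColumn", "#1", [
            ("HtmlLink", "C", []),
            ("HtmlLink", "D", []),
            ("HtmlLink", "E", []),
        ]),
    ]),
])


def direct_link(parent, index):
    """Return the index-th HtmlLink that is a direct child of parent, or None."""
    links = [child for child in parent[2] if child[0] == "HtmlLink"]
    return links[index - 1] if 1 <= index <= len(links) else None


def flat_count(parent):
    # Same idea as the while loop: stop at the first index that is missing.
    count = 0
    while direct_link(parent, count + 1) is not None:
        count += 1
    return count


column = PAGE[2][2][2][0]          # HtmlTable("#1").HtmlColumn("#1")
print(flat_count(PAGE))            # 2: only A and B are direct children
print(direct_link(column, 1)[1])   # C
```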

This problem is a little bit harder to deal with, but the algorithm is straightforward, and has many useful applications. What we need to do is iterate through all of the windows in the hierarchy, counting links as we go. Conceptually, the easiest way to do this is with recursion.

The general approach will be to iterate through all of the children of a window. If the child is a link, add one to the link count. If the child is not a link, count the number of links this child has (by calling the same function again), and add that to the count for the current window.

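integer CountLinks( window wCrnt )
{
	// Start the count at zero before adding to it.
	integer ntempLinks = 0 ;
	window wChild ;

	for each wChild in wCrnt.GetChildren()
	{
		// A link counts as one; anything else contributes the links beneath it.
		if WindowIsOfClass( wChild, HtmlLink )
			ntempLinks += 1 ;
		else
			ntempLinks += CountLinks( wChild ) ;
	}

	return ( ntempLinks ) ;
}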

The table below shows the result of calling CountLinks() with some of the windows described.

Window                                                              CountLinks()  Comments
HtmlPage("ThePage").HtmlTable("#1").HtmlColumn("#1").HtmlLink("C")  0
HtmlPage("ThePage").HtmlTable("#1").HtmlColumn("#1")                3             1 each for C, D, and E
HtmlPage("ThePage").HtmlTable("#1")                                 3             none of its own, but 3 from HtmlColumn("#1")
HtmlPage("ThePage").HtmlLink("A")                                   0
HtmlPage("ThePage")                                                 5             1 each for A and B, plus another 3 for HtmlTable("#1")

The function is now correct (it returns the right answer). Also, the for each statement actually visits every window in the list (compare this with the original for loop, which would give out in the middle).
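The numbers in the table can be checked with a Python model of the same recursion. This is a sketch, not 4Test: the Window class and its cls, tag and children fields are hypothetical stand-ins for the window object, WindowIsOfClass() and GetChildren().

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    cls: str                                  # e.g. "HtmlLink", "HtmlTable"
    tag: str
    children: list = field(default_factory=list)


def count_links(window):
    """Count the links anywhere beneath window (the window itself is not counted)."""
    total = 0
    for child in window.children:
        if child.cls == "HtmlLink":
            total += 1
        else:
            total += count_links(child)
    return total


link_c = Window("HtmlLink", "C")
column = Window("HtmlColumn", "#1",
                [link_c, Window("HtmlLink", "D"), Window("HtmlLink", "E")])
table = Window("HtmlTable", "#1", [column])
link_a = Window("HtmlLink", "A")
page = Window("HtmlPage", "ThePage", [link_a, Window("HtmlLink", "B"), table])

for w in (link_c, column, table, link_a, page):
    print(w.cls, w.tag, count_links(w))
```

The printed counts are 0, 3, 3, 0 and 5, in the same order as the rows of the table.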

