Extracting webpage content using Qt4

Sometimes we want to check a web page, or even extract its content, without needing everything on the page such as the images and the formatting; we want only the essential information. In my case, I needed an application that checks a forum for new posts and displays only the important parts to me (no ads or other clutter). This particular forum has no mechanism to notify a member when new posts appear in a topic. The program works without using an account.

This can be done very easily in Qt using the QWebView class. An object of this class, which in my code is declared as

    QWebView *view;

can display a web page. The important feature that I use is not displaying the page but querying its content: instead of using regular expressions or other techniques to extract the information, I use a JavaScript-style query (a CSS selector passed to findAllElements) on the page.

    // get the main frame
    QWebFrame *frame=view->page()->mainFrame();
    if(frame==NULL)
        return "NULL";

    // get the elements I need
    QWebElementCollection elements=frame->findAllElements("*.tr_list_f");

My application formats this data, stores it, and displays it. Here is the code:

    #include <QSettings>
    #include <QWebElement>
    #include <QWebFrame>
    #include <QWebView>

    QString RL_HtmlParser::onPageLoad(bool ok)
    {
        if(!ok)
            throw 11; // throw an exception; create a nice exception class later

        // get the main frame
        QWebFrame *frame=view->page()->mainFrame();
        if(frame==NULL)
            return "NULL";

        // get the elements I need
        QWebElementCollection elements=frame->findAllElements("*.tr_list_f");
        int length=elements.count();
        if(0==length)
            return "NO MATCH FOUND";

        QString sb("<html><head></head><body><table border='3'>");
        for(int i=0;i<length;i++)
        {
            // skip the even-indexed elements, keep only the odd-indexed ones
            if(0==i%2) continue;
            sb+="<tr><td> "+elements.at(i).toPlainText()+"</td></tr>";
        }
        sb+="</table></body></html>";

        // oldText contains the data that was read previously,
        // so if we have an update we store the new data
        if(sb!=oldText)
        {
            oldText=sb;
            settings->setValue("oldtext",oldText);
        }

        // we display the current data
        this->view->setHtml(sb);

        // and return it as a string
        return sb;
    }
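The formatting and update check in onPageLoad can be tried out on their own, without Qt. What follows is only a minimal sketch in standard C++, not the application's code: the names escapeHtml, buildTable and storeIfChanged are made up for illustration, std::string stands in for QString, and a plain string reference stands in for the QSettings storage. It also adds one step the original does not have, escaping the extracted text before putting it into the table (in Qt 4 this could probably be done with Qt::escape), and it leaves out the "NO MATCH FOUND" case.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Escape the characters that would otherwise be read as HTML markup.
// (Illustrative helper; the original code inserts the text unescaped.)
std::string escapeHtml(const std::string &text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        switch (text[i]) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        default:  out += text[i];
        }
    }
    return out;
}

// Build the one-column table from the odd-indexed rows only,
// mirroring the `if(0==i%2) continue;` test in onPageLoad.
std::string buildTable(const std::vector<std::string> &rows)
{
    std::string sb("<html><head></head><body><table border='3'>");
    for (std::size_t i = 1; i < rows.size(); i += 2)
        sb += "<tr><td> " + escapeHtml(rows[i]) + "</td></tr>";
    sb += "</table></body></html>";
    return sb;
}

// Remember the new page only when it differs from the stored one.
// Returns true when an update was detected.
bool storeIfChanged(const std::string &page, std::string &oldText)
{
    if (page == oldText)
        return false;
    oldText = page;
    return true;
}
```

Calling buildTable on the extracted texts and then storeIfChanged on the result gives the same "new posts or not" answer that the comparison with oldText gives in the real handler.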

This is what the web page looks like

and this is the application

It is possible to use a proxy and authentication to log in and get the page. I downloaded the page to a temporary file and then loaded it into the web view, but it is also possible to load it directly from the web.
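Loading the page directly through a proxy might look roughly like the sketch below. It is an assumption-laden example rather than the code of my application: the proxy host, port, credentials and the forum URL are placeholders, and it covers only proxy authentication, not logging in to the forum itself. It needs Qt 4 built with the webkit and network modules.

```cpp
#include <QApplication>
#include <QNetworkProxy>
#include <QUrl>
#include <QWebView>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);

    // Placeholder proxy host, port and credentials; replace them with real
    // values, or drop these lines when no proxy is needed.
    QNetworkProxy proxy(QNetworkProxy::HttpProxy, "proxy.example.com", 8080,
                        "user", "password");
    QNetworkProxy::setApplicationProxy(proxy);

    QWebView view;
    // Placeholder address; the view emits loadFinished(bool) when it is done.
    view.load(QUrl("http://forum.example.com/topic"));
    view.show();

    return app.exec();
}
```

A handler with the signature of onPageLoad(bool) would then typically be connected to the view's loadFinished(bool) signal, so that the extraction runs once the page has arrived.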
