WebDevelopersJournal.com: Tips on Web Page Design, HTML and Graphics
DIY User Profiling (2)

by Dave Cartwright

Build Your Own Content Customizer


The Whole Hog

One thing we didn't mention about subject-tagging is that to produce related links, the system needs to know about the nature of every page on the server. Each page's tags must be registered in a database so that the script building the "related links" section can easily pick out the URLs that most closely match the tags in the page the person is viewing.
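
As a rough illustration of the lookup, here is a minimal Python sketch that ranks other pages by how many subject tags they share with the page being viewed. The in-memory dictionary stands in for the page database, and the function name, URLs and tags are all invented for the example.

```python
from collections import Counter

def related_pages(current_url, tag_index, limit=5):
    """Rank other pages by how many subject tags they share with current_url.

    tag_index maps URL -> set of subject tags (a stand-in for the page database).
    """
    current_tags = tag_index[current_url]
    scores = Counter()
    for url, tags in tag_index.items():
        if url == current_url:
            continue
        shared = len(current_tags & tags)
        if shared:
            scores[url] = shared
    # Highest overlap first; ties broken alphabetically so the order is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:limit]]

# Hypothetical sample data.
tag_index = {
    "/travel/paris.html": {"travel", "france", "cities"},
    "/travel/lyon.html": {"travel", "france", "food"},
    "/music/jazz.html": {"music", "jazz"},
    "/food/cheese.html": {"food", "france"},
}
print(related_pages("/travel/paris.html", tag_index))
```

On a real site the same ranking would more likely be done by a query against the database than by looping in the page-building script.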

You'll therefore need to write a script to crawl over all the pages on the server, either by crawling over the files on the disks or by HTTP spidering. The latter is preferable, as it will only catch pages that are active, i.e. the ones with links pointing at them. The program will pull out the metatags you've defined, and store them in the subject-to-page lookup database.
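
The tag-extraction step might look something like the sketch below, which uses Python's standard `html.parser` module. It assumes the subject tags live in a metatag of the form `<meta name="subject" content="a, b">`; that tag name is our invention for the example, so substitute whatever you defined. Fetching the pages and following links is left out.

```python
from html.parser import HTMLParser

class SubjectTagParser(HTMLParser):
    """Collect subject tags from <meta name="subject" content="a, b">.

    The metatag name "subject" is an assumption made for this example.
    """

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "subject":
            content = attr.get("content") or ""
            self.tags.update(
                t.strip().lower() for t in content.split(",") if t.strip()
            )

def extract_subject_tags(html):
    parser = SubjectTagParser()
    parser.feed(html)
    return parser.tags

# A hypothetical page as the spider might have fetched it.
page = ('<html><head><META NAME="subject" CONTENT="Travel, France">'
        '</head><body></body></html>')
page_db = {"/travel/paris.html": extract_subject_tags(page)}
print(sorted(page_db["/travel/paris.html"]))
```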

But hold on. If we're spidering pages, why not simply read the text of the documents? It's very easy to scan a page and read the text. Although you'll need a fair amount of disk space and some carefully planned indexing in the database, you could easily devise a system that (say) counts the ten most popular nouns in a page, perhaps adding extra weight to the words in the <TITLE>…</TITLE> and <Hn>…</Hn> tags, and stores this information in the page database.

All you're doing is letting the machine figure out the subject matter of the page (albeit rather coarsely) rather than having someone read the pages and define subject tags by hand.
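
Here is one way the counting could be sketched in Python. Picking out nouns properly needs a part-of-speech tagger, so this version simply counts every word that isn't in a small stop list, which is a cruder technique than the one described above. The weights and the stop list are arbitrary assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed values: tune the stop list and the weights for your own content.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "you"}
WEIGHTS = {"title": 5, "h1": 3, "h2": 2, "h3": 2}

class KeywordCounter(HTMLParser):
    """Count words in a page, weighting those inside <title> and headings."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.open_tags = []  # weighted tags we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Text outside any weighted tag counts once.
        weight = max((WEIGHTS[t] for t in self.open_tags), default=1)
        for word in re.findall(r"[a-z]+", data.lower()):
            if len(word) > 2 and word not in STOPWORDS:
                self.counts[word] += weight

def top_keywords(html, n=10):
    counter = KeywordCounter()
    counter.feed(html)
    return [word for word, _ in counter.counts.most_common(n)]

sample = ("<html><head><title>Paris hotels</title></head><body>"
          "<h1>Hotels in Paris</h1><p>Cheap hotels and museums.</p>"
          "</body></html>")
print(top_keywords(sample, 2))
```

Note that this sketch will also count any text inside script and style blocks, so a production version would want to skip those.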

Ecommerce And Advertising Integration

Although so far we've mainly looked at the idea of building "favourite" and "related topic" links - content-based profile exploitation - it's worth mentioning integration between content and the other important bits of the site.

When a normal site with static pages serves 'targeted' ads, this is usually achieved by putting tags in the HTML of the pages themselves. The ad serving engine uses these tags to decide which ads to serve to the user, and the tags are usually related to sections of the site (the 'travel section', the 'music section', and so on). If you have a system that can decide on the fly what the subject matter of a page is, it can insert these tags automatically, generally in a far better targeted way than coding them into the page source by hand. And better ad targeting equals charging your advertisers more.
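
To make the idea concrete, the Python sketch below picks an ad section from a page's keywords and drops a tag into the page as it is served. Everything here is hypothetical: the section names, the keyword sets and especially the `<!-- ad-section: ... -->` comment format, since every ad server has its own tag syntax.

```python
# Hypothetical mapping from ad sections to the keywords that suggest them.
AD_SECTIONS = {
    "travel": {"hotels", "flights", "paris", "holiday"},
    "music": {"jazz", "guitar", "album", "concert"},
}

def pick_ad_section(page_keywords, sections=AD_SECTIONS, default="general"):
    """Return the section whose keyword set overlaps the page's keywords most."""
    keywords = set(page_keywords)
    best, best_score = default, 0
    for name in sorted(sections):  # sorted so ties resolve predictably
        score = len(sections[name] & keywords)
        if score > best_score:
            best, best_score = name, score
    return best

def insert_ad_tag(html, section):
    """Insert an (invented) ad tag just before the closing </head>."""
    marker = "<!-- ad-section: %s -->" % section
    return html.replace("</head>", marker + "</head>", 1)

section = pick_ad_section(["hotels", "paris", "cheap"])
print(insert_ad_tag("<html><head></head><body></body></html>", section))
```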

Likewise with your online shop, assuming you have one. If you spider the shop just like you did your content, and store the nature of the shop's contents in the page database, then it's just as easy to flash a targeted special offer on the front page as it is to produce the user's "favourite links".

No right answer

So how exactly do you build it? Well, because no two sites are the same, there's no generic answer that will cover all sites. The level to which you take the profiling and custom page building is a function of your funding, disk space, processor power, expertise and development resources.

The first thing you must do is try to forecast what your page impression levels will be, and build your system with this in mind. Remember that using a decent profiling system should make your impressions increase faster than they used to, as users will stay longer and look at more pages.

By all means use something low-end like dbm or hashed files on a low-end Linux box if your usage predictions are modest, but give serious thought to commercial SQL databases, even clusters of database servers, big Unix boxes and loads of disk and RAM if your site is likely to grow fast.
Remember that the speed with which people see the pages depends mostly on how fast the database queries run, and it's best to get it right from day one rather than have to upgrade.

When recording information, add new rows with the SQL INSERT command rather than changing existing ones with UPDATE. An UPDATE has to find the matching record before it can change it, which can be slow. An INSERT just bangs the record on the end and maybe updates a few indexes, so it's usually much faster.
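
A minimal sketch of this insert-only style of logging, using Python's built-in `sqlite3` module purely for convenience. The table and column names are made up for the example, and the totals are worked out when the data is read rather than being kept up to date on every hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
conn.execute(
    "CREATE TABLE page_views ("
    " user_id TEXT, subject TEXT,"
    " viewed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

def record_view(conn, user_id, subject):
    # One new row per page view; no existing row is ever modified.
    conn.execute(
        "INSERT INTO page_views (user_id, subject) VALUES (?, ?)",
        (user_id, subject),
    )
    conn.commit()

def favourite_subjects(conn, user_id, limit=3):
    # The counting happens at read time instead of on every page view.
    rows = conn.execute(
        "SELECT subject, COUNT(*) AS views FROM page_views"
        " WHERE user_id = ? GROUP BY subject"
        " ORDER BY views DESC, subject LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [subject for subject, _ in rows]

record_view(conn, "u1", "travel")
record_view(conn, "u1", "music")
record_view(conn, "u1", "travel")
print(favourite_subjects(conn, "u1"))
```

The trade-off is that the table grows with every hit, so you would probably want to summarise or archive old rows from time to time.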

Think carefully about what information you want to pull out of the system, and make sure you index the databases. Also check how your database system works 'under the hood' - for example, it may be more efficient to use an SQL 'view' to extract data than to do a complex query over a number of tables.
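
As an illustration, the Python sketch below creates two indexes and a view with the built-in `sqlite3` module. The schema is invented for the example. Whether a view actually runs faster depends entirely on your database: in SQLite a view is just a stored query, so here it only tidies up the code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_subjects (url TEXT, subject TEXT);
CREATE TABLE page_views (user_id TEXT, url TEXT);

-- Index the columns that the lookups filter and join on.
CREATE INDEX idx_subjects_url ON page_subjects (url);
CREATE INDEX idx_views_user ON page_views (user_id);

-- The view hides the join, so page scripts can query it like a table.
CREATE VIEW user_subject_counts AS
    SELECT v.user_id, s.subject, COUNT(*) AS views
    FROM page_views AS v
    JOIN page_subjects AS s ON s.url = v.url
    GROUP BY v.user_id, s.subject;
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO page_subjects VALUES (?, ?)", [
    ("/travel/paris.html", "travel"),
    ("/travel/paris.html", "france"),
    ("/music/jazz.html", "music"),
])
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("u1", "/travel/paris.html"),
    ("u1", "/travel/paris.html"),
    ("u1", "/music/jazz.html"),
])

rows = conn.execute(
    "SELECT subject, views FROM user_subject_counts"
    " WHERE user_id = ? ORDER BY views DESC, subject",
    ("u1",),
).fetchall()
print(rows)
```
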
You'll also need to decide how you want to identify returning users: do you want to use a login/password box, or are you happy just to use cookies and not worry about the relatively few browsers that can't (or won't) accept them?
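
If you go the cookie route, the logic might be sketched like this in Python, using the standard `http.cookies` and `uuid` modules. The cookie name `visitor_id` and the one-year lifetime are assumptions, and how you read the incoming Cookie header and send the Set-Cookie value back depends on your server setup.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name

def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    set_cookie_value is None for a returning visitor; for a new visitor it
    is the text to send back in a Set-Cookie response header.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if COOKIE_NAME in cookie:
        return cookie[COOKIE_NAME].value, None

    # New visitor: make up a random ID and ask the browser to keep it.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # assumed: one year
    return visitor_id, out[COOKIE_NAME].OutputString()

first_id, set_cookie = identify_visitor(None)          # first visit
second_id, nothing = identify_visitor("visitor_id=" + first_id)  # return visit
print(first_id == second_id, nothing)
```

A browser that refuses the cookie will simply look like a new visitor on every page, which is the trade-off mentioned above.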

Aspiring to greatness

Finally, remember that the commercial packages cost significant amounts of money, so you're not going to be able to emulate them completely with just a few man-days or man-weeks of work. What you can do, though, even if you only scrape the surface of the subject, is give the people visiting your site a significantly better experience than they had with that old bunch of static pages. So even with a basic profiling system you're one up on most of the sites out there.



Dave Cartwright's first proper job was running NetWare 2.0a servers for a defence contractor and fighting with a digital phone switch (one of the first of its kind). Having graduated with a boringly technical degree in theoretical computer science, he became a Unix systems and network manager at UEA, Norwich, UK. While there, the Internet came to UK academia and later Mr. Berners-Lee came up with this Web thing (an excellent excuse to 'research new technology' rather than doing boring support stuff). Before disappearing into journalism in '95 (as technical editor of Network Week) Dave did a lot of work back-ending Web servers with databases. Having earned an easy living for a couple of years as a techie writer, he then went back into the real world as IT and Telecomms Manager at CMP UK. He's now Chief Technology Officer at Vavo.com.