Main content extraction

Extract the essential information from a webpage.

Why is it useful?

Extracting information from a website is something every user needs to do, but it is not always easy. Webpages are full of noisy elements that get in the way of the content you came for.

The main content of a webpage is the content that is relevant to the user. It is usually composed of text, images, and other multimedia, and it is typically surrounded or even interrupted by irrelevant information such as headers, footers, menus, banners, and advertisements.

The main content of a webpage can be useful for:

  • Accessibility tools, because they can automatically start reading the actual content of the page.
  • Other systems and tools, such as indexers or wrappers, as a preliminary stage to avoid banners and unnecessary content in later phases of the analysis.

One important advantage of this tool is that it extracts not only the main content text from the webpage, but also images, videos, and any other multimedia.

Two kinds of extractors

A page-level technique only takes into account the elements, DOM nodes, and text of the URL given as input. Its main benefit is that it needs to load and analyze only a single webpage to detect the main content, which makes it faster than site-level techniques.
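To make the page-level idea concrete, here is a minimal sketch in Python. It is not the algorithm this tool uses. It assumes a simple heuristic (the block with the most text that is not inside links is the main content) and well-formed HTML. The names `BlockScorer`, `main_block`, and `BLOCK_TAGS` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

# Container elements treated as candidate blocks (an assumption of this sketch).
BLOCK_TAGS = {"article", "aside", "div", "footer", "header", "main", "nav", "section"}


class BlockScorer(HTMLParser):
    """Counts, for each block, the text it directly contains, split into
    plain text and link text. Assumes well-formed, properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []      # indices of the blocks that are currently open
        self.blocks = []     # one record per block seen so far
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            label = dict(attrs).get("id") or tag
            self.blocks.append({"label": label, "text": 0, "link": 0})
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        size = len(data.strip())
        if size and self.stack:
            block = self.blocks[self.stack[-1]]
            block["link" if self.link_depth else "text"] += size


def main_block(html):
    """Returns the id (or tag name) of the block with the most non-link text."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return None
    best = max(scorer.blocks, key=lambda b: b["text"] - b["link"])
    return best["label"]


PAGE = """
<body>
  <div id="menu"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
  <div id="story"><p>The university was founded long ago and has grown steadily ever since.</p>
  <p>Read the <a href="/history">full history</a> for details.</p></div>
  <div id="footer">Contact <a href="/legal">Legal notice</a></div>
</body>
"""

print(main_block(PAGE))  # story
```

Only the single input page is parsed, which is why this kind of technique is fast. The price is that it has to rely on heuristics such as link density, since it has no other pages to compare against.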

A site-level technique, on the other hand, goes beyond analyzing a single page. In addition to the given URL, it loads and examines other pages from the same website to identify recurring patterns, which helps in accurately extracting the main content. Although this approach is slower, it enhances the reliability of the extraction by using insights from multiple pages within the same site.
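The site-level idea can be sketched in a similar way. The Python example below is an illustration under stated assumptions, not this tool's implementation. It assumes the other pages of the site have already been downloaded, and it treats any text block that recurs on more than half of them as part of the site template. The names `text_blocks`, `site_level_extract`, and `max_share` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collects every non-empty run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.blocks.append(text)


def text_blocks(html):
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def site_level_extract(target_html, other_pages_html, max_share=0.5):
    """Keeps the target's text blocks that appear on at most `max_share`
    of the other pages; blocks seen more often are treated as template."""
    seen_in = Counter()
    for html in other_pages_html:
        seen_in.update(set(text_blocks(html)))  # count each page once per block
    limit = max_share * len(other_pages_html)
    return [block for block in text_blocks(target_html) if seen_in[block] <= limit]


def page(body):
    # Hypothetical site template: same menu and footer on every page.
    return (
        "<nav><a>Home</a><a>Partners</a></nav>"
        f"<main>{body}</main>"
        "<footer>Contact us</footer>"
    )


target = page("<h1>Our partners</h1><p>These companies support the project.</p>")
others = [
    page("<h1>Download</h1><p>Choose an edition.</p>"),
    page("<h1>About</h1><p>Meet the team.</p>"),
]

print(site_level_extract(target, others))
# ['Our partners', 'These companies support the project.']
```

The menu and footer are dropped because they repeat on every page, without any guess about what a menu looks like. This is the reliability gain of the site-level approach, paid for with the extra pages that must be loaded.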

When you use CESY, you can choose the kind of extractor you prefer.

FAQ

  • Extract Content: It automatically extracts the main content of a webpage.
  • Format Output: It extracts the main content in HTML, XML, JSON, and plain text formats.
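As a rough illustration of what offering the same extracted content in several formats can look like, here is a small Python sketch. The field names (`title`, `paragraphs`), the XML element names, and the `render` function are assumptions made for this example. They do not describe the tool's actual output schema.

```python
import json
from html import escape
from xml.etree import ElementTree as ET


def render(title, paragraphs, fmt):
    """Serializes extracted content as plain text, JSON, HTML, or XML."""
    if fmt == "text":
        return "\n\n".join([title, *paragraphs])
    if fmt == "json":
        return json.dumps({"title": title, "paragraphs": paragraphs})
    if fmt == "html":
        body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
        return f"<article><h1>{escape(title)}</h1>{body}</article>"
    if fmt == "xml":
        root = ET.Element("content")
        ET.SubElement(root, "title").text = title
        for p in paragraphs:
            ET.SubElement(root, "paragraph").text = p
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


for fmt in ("text", "json", "html", "xml"):
    print(render("History", ["Founded long ago.", "Still growing."], fmt))
```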

This software has been designed and implemented in the computer science labs of the UPV.

Yes, MEW can work both synchronously and asynchronously.

Examples

Each example below shows a webpage before and after extracting its main content. In the extracted view, all other elements are hidden away.

1. New York University’s History (original)

2. United Nations, News & Media, French (original)

3. Linux Mint Partners Page (original)

4. Industry Congress, Digital Twins news (original)
