Converting a webpage to something usable

Not sure if it can be done

javascript, markdown

Published May 30, 2024

I want to get the text from a website and turn it into markdown. The short answer is that this doesn't work all that well, but we get something. The libraries explored here are Turndown, Readability, Trafilatura, and Boilerpy3.

I was basically looking for what Instapaper does, but I think that uses a whole different method than looking at the text.

Turndown/Javascript

  npm i jsdom turndown
  // turndown.js
  import TurndownService from 'turndown';

  // Fetch a page and convert its entire HTML to Markdown.
  export default async function extractMarkdown(url) {
      const response = await fetch( url );
      const html = await response.text();

      // Turndown accepts an HTML string directly, so jsdom isn't needed here.
      const turndownService = new TurndownService();

      return turndownService.turndown(html);
  }

  const markdown = await extractMarkdown( "https://willschenk.com/fragments/2024/i_need_a_trigger_warning/" )
  console.log( markdown )

This ends up looking really good, but it does include all of the navigation and other junk on the page.
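One way to cut down on that junk is to drop the obvious non-article elements before handing the HTML to Turndown. The following is a minimal sketch, not something tested against the page above: `stripChrome` and `CHROME_TAGS` are hypothetical names, and it assumes the junk lives in `nav`, `footer`, `aside`, `script`, and `style` elements that are not nested inside an element of the same name. A regex is a crude stand-in for a real HTML parser.

```javascript
// strip-chrome.js
// Hypothetical helper (not part of Turndown): removes whole elements that
// usually hold page chrome rather than article text.
// Assumption: these tag names cover the junk on the target page. <header> is
// left out on purpose because article titles often live inside one.
const CHROME_TAGS = ['nav', 'footer', 'aside', 'script', 'style'];

function stripChrome(html) {
    let cleaned = html;
    for (const tag of CHROME_TAGS) {
        // Match <tag ...> through the nearest </tag>, across newlines.
        const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, 'gi');
        cleaned = cleaned.replace(pattern, '');
    }
    return cleaned;
}

// Small made-up page to show the effect.
const sample = `
<html><body>
  <nav><a href="/">Home</a></nav>
  <article><h1>Title</h1><p>Body text.</p></article>
  <footer>Previously / Next</footer>
</body></html>`;

console.log(stripChrome(sample));
```

The cleaned string could then be passed to `turndownService.turndown(...)` in place of the raw HTML. Turndown also has a `remove` method that takes tag names, for example `turndownService.remove(['nav', 'footer'])`, which may be the simpler route.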

Readability/Javascript

  npm i jsdom turndown @mozilla/readability
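  // readability.js
  import TurndownService from 'turndown';
  import { Readability } from '@mozilla/readability';
  import { JSDOM } from 'jsdom';

  // Extract the main article with Readability, then convert it to Markdown.
  export default async function extractText(url) {
      const dom = await JSDOM.fromURL(url);
      // Note: classesToPreserve takes CSS class names, so 'BLOCKQUOTE' only
      // keeps a class with that name; it does not affect <blockquote> tags.
      const reader = new Readability(dom.window.document, {keepClasses:false, classesToPreserve: ['BLOCKQUOTE']});
      const article = reader.parse();

      // parse() returns null when no article content is found.
      if (!article) {
          throw new Error(`Could not extract an article from ${url}`);
      }

      // Debug: print the extracted HTML before conversion.
      console.log( article.content );

      const turndownService = new TurndownService();
      return turndownService.turndown(article.content);
  }

  const text = await extractText( "https://willschenk.com/fragments/2024/i_need_a_trigger_warning/" )

  console.log( text )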

This doesn't get the block quote right (it comes out inline) and fails to return the attribution.
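For comparison, here is a rough sketch of what a correct result might look like in Markdown: every quoted line prefixed with `>` and the attribution on its own line at the end. `quoteToMarkdown` is a hypothetical helper, not part of Readability or Turndown, and the `-- name` attribution style is just one assumed convention.

```javascript
// quote.js
// Hypothetical helper: renders a quote and an optional attribution as a
// Markdown block quote. Assumes plain-text input, one paragraph per line.
function quoteToMarkdown(quote, attribution) {
    // Prefix each line with "> "; blank lines become a bare ">".
    const lines = quote.trim().split('\n').map((line) => `> ${line}`.trimEnd());
    if (attribution) {
        lines.push('>');
        lines.push(`> -- ${attribution}`);
    }
    return lines.join('\n');
}

// Placeholder text, not the quote from the page above.
console.log(quoteToMarkdown('First line of the quote.\nSecond line.', 'Some Author'));
```

Getting there would still need the attribution to survive extraction, which is the part that fails here.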

Trafilatura/Python

Installation:

  pip install trafilatura

Usage:

  trafilatura -u https://willschenk.com/fragments/2024/i_need_a_trigger_warning/

Similar to Readability: it doesn't get the block quote right (it comes out inline) and fails to return the attribution. It also returns all the navigation junk.

Boilerpy3/Python

  pip install boilerpy3
  from boilerpy3 import extractors

  extractor = extractors.ArticleExtractor()

  # From a URL
  content = extractor.get_content_from_url('https://willschenk.com/fragments/2024/i_need_a_trigger_warning/')

  print( content )

Conclusion

Based on this limited testing, I'd go with straight Turndown.
