Summarizing webpages with language models
Similar to (and perhaps more simply than) analyzing Youtube video transcripts with language models, I wanted to apply a similar approach to webpages like articles, primarily for the purposes of determining the subject content of lengthy pieces and experimenting to see if this is useful at all.
The html2text
script is good at
extracting content from html.
When combined with a few other CLIs, we can prompt
the language model to create a summary for the cleaned HTML page.
This was my first attempt:
curl -s "<url>" | html2text | llm "summarize this article"
which gave me the following error
Traceback (most recent call last): File "/opt/homebrew/bin/llm", line 8, in <module> sys.exit(cli()) ^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/llm/cli.py", line 268, in prompt prompt = read_prompt() ^^^^^^^^^^^^^ File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/ site-packages/llm/cli.py", line 156, in read_prompt stdin_prompt = sys.stdin.read() ^^^^^^^^^^^^^^^^ File "<frozen codecs>", line 322, in decodeUnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 640:invalid start byte
I solved that with this modification
curl -s "<url>" | html2text | iconv -f ISO-8859-1 -t UTF-8 | llm"summarize this article"
Recommended
Playing with Language Models
Since the launch of GPT3, and more notably ChatGPT, I’ve had a ton of fun learning about and playing with emerging tools in the language model space.
Making Your Vision Real with Models
Using models for various different purposes daily has been a satisfying endeavor for me because they can be used as tools to help make your vision...
Using Nix to Install `llm` with Plugins
I use Simon's llm to quickly run LLM prompts. This package is easily installed with brew or pip, so if you want to use it, I recommend those...