Fast and robust date extraction from web pages, with Python or on the command-line
Home Page: https://htmldate.readthedocs.io
License: Apache License 2.0
The change brings both a speed-up and better maintainability: adbar/trafilatura#41
See issue adbar/trafilatura#216.
Extracting the date from the same web page multiple times shows that the module is leaking memory; this doesn't appear to be related to extensive_search:
import os
import psutil
from htmldate import find_date

# read the page once, then extract repeatedly to watch resident memory grow
with open('test.html', 'rb') as inputf:
    html = inputf.read()

process = psutil.Process(os.getpid())
for i in range(10):
    result = find_date(html, extensive_search=False)
    # resident set size in MiB after each extraction
    print(i, ":", process.memory_info().rss / 1024 ** 2)
tracemalloc doesn't give any clue.
Hi @adbar, I'm trying to develop a small script with Python in Anaconda to use htmldate, and when I try to run it I get an error:
The code is this one:
import htmldate as hd
hd.find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
And the error is ImportError: cannot import name etree. I have checked with pip list that lxml is installed and I can see I have version 4.5.2, so what is happening? Thanks so much.
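Since pip sees lxml 4.5.2 but the import fails, this often points to an environment mismatch, i.e. the Anaconda interpreter running the script is not the one pip installed into (an assumption, not a confirmed diagnosis). A minimal check to run from the same interpreter as the script:
from lxml import etree  # fails with the same ImportError in a broken install
print(etree.LXML_VERSION)  # e.g. (4, 5, 2, 0)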
Is there a way to force htmldate to look for datetime and not just date, or to prioritise specific extractors over others, e.g. OpenGraph over URL extraction? Let me give you an example:
from htmldate import find_date
url = "https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/"
find_date(url, outputformat='%Y-%m-%d %H:%M:%S', verbose=True)
INFO:htmldate.utils:URL detected, downloading: https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/
DEBUG:urllib3.connectionpool:Resetting dropped connection: www.ot.gr
DEBUG:urllib3.connectionpool:https://www.ot.gr:443 "GET /2022/03/23/apopseis/daimonopoiisi/ HTTP/1.1" 200 266737
DEBUG:htmldate.extractors:found date in URL: /2022/03/23/
'2022-03-23 00:00:00'
But if you look at the article you can find: <meta property="article:published_time" content="2022-03-23T06:15:58+00:00">
Doctypes like <!DOCTYPE html … /> are now a problem, see https://bugs.launchpad.net/lxml/+bug/1955915 (Line 74 in 4cfc156).
When installing trafilatura in Google Colab (08-Sep-2020), htmldate throws an error due to a version conflict, because the standard Docker image of Google Colab currently uses requests==2.23.0 for Python 3.6.9.
So far only the logs provide info on this. It would be nicer to be able to pinpoint the type (header, element, or text) or even the exact location of the result.
To do so, the location info has to be propagated back along with the result.
Does htmldate recognize DublinCore (DC) meta tags? I see you call out OpenGraph, so I was wondering if DC is recognized explicitly? Or if it scans through ALL meta tags to find dates?
Thanks!
So far find_date() returns a string containing the result. To add context another format is required; JSON is a good candidate. An optional parameter like as_json=True could allow for the following additional info:
url: https://securityintelligence.com/new-banking-trojan-icedid-discovered-by-ibm-x-force-research/
but the true date is November 13, 2017.
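A hypothetical sketch of what the proposed as_json output could look like; apart from url, the field names here are assumptions for illustration, not an existing API:
result = find_date(html, as_json=True)  # proposed parameter, not yet implemented
# {
#   "url": "https://example.org/post",          # hypothetical example value
#   "date": "2017-11-13",
#   "source": "header",                         # header, element, or text
#   "location": "meta[article:published_time]"  # hypothetical location info
# }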
I have a problem with extracting the date from a website.
date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')
I get the following error:
ValueError Traceback (most recent call last)
in ()
----> 1 date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731');
1 frames
/usr/local/lib/python3.7/dist-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
598 if verbose is True:
599 logging.basicConfig(level=logging.DEBUG)
--> 600 tree = load_html(htmlobject)
601 find_date.extensive_search = extensive_search
602 min_date, max_date = get_min_date(min_date), get_max_date(max_date)
/usr/local/lib/python3.7/dist-packages/htmldate/utils.py in load_html(htmlobject)
165 # log the error and quit
166 if htmltext is None:
--> 167 raise ValueError("URL couldn't be processed: %s", htmlobject)
168 # start processing
169 tree = None
ValueError: ("URL couldn't be processed: %s", 'https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')
I will be grateful for any support and help with this.
While working on #13 I noticed that in extractors.py some newlines (added for code readability) in regex expressions are problematic. Better to check and test those.
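A minimal illustration of the pitfall, not the actual expressions from extractors.py: without re.VERBOSE, a newline inserted for readability becomes a literal part of the pattern.
import re

# the newline kept for "readability" is matched literally
broken = re.compile("([0-9]{4})[/.-]\n([0-9]{2})")
intended = re.compile("([0-9]{4})[/.-]([0-9]{2})")

print(broken.search("2017-11"))    # None: a literal newline is expected
print(intended.search("2017-11"))  # <re.Match object; span=(0, 7), ...>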
Hello,
I'm unable to find a reason why lxml is pinned here: https://github.com/adbar/htmldate/blob/master/setup.py#L121-L122
It's breaking my local dev environment on macOS, since the pinned version doesn't compile for Python 3.12 (later versions of lxml compile just fine).
In the meantime I've forked this repo and removed the pinning.
Please advise, happy to make any required PRs!
Thanks.
Dear all,
Htmldate is now widely used, and it has become apparent that the GPL license is not prevalent among Python packages; its potential implications are also not easily understood. To better align with community standards and promote ongoing development, I plan to transition the software license to Apache 2.0.
As you have contributed lines that are still in use, your agreement is needed to move forward on this. For the sake of simplicity I use the list of current contributors as provided by GitHub, so you are listed below even if your contributions focus on documentation or evaluation files.
If you agree, please add a corresponding message to this thread. If not, please contact me or open a public discussion.
Configuration arguments are available for the Python functions; it would be nice to make them available as command-line arguments as well:
Some images are replaced by alt text as they are missing from the PyPI README. This could be fixed by
I'm not sure I understand how to use this library when I already have a bs4 object. Thanks in advance!
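find_date() accepts an HTML string as input, so one straightforward option, sketched here under the assumption that the soup object holds the full page, is to serialize the bs4 tree back to HTML:
from bs4 import BeautifulSoup
from htmldate import find_date

html = "<html><head><meta property='article:published_time' content='2022-03-23T06:15:58+00:00'></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# hand the serialized markup to htmldate
print(find_date(str(soup)))  # expected: '2022-03-23'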
In extractors.py the function try_date_expr() does not check that the string argument is of the right type and raises an exception if it is not.
To reproduce: when processing a meta node of the form <meta itemprop="dateCreated" datetime=""> in examine_header(), the line
attempt = tryfunc(elem.get("datetime") or elem.get("content"))
calls try_date_expr() with string=None, which raises AttributeError: 'NoneType' object has no attribute 'strip'.
Suggestion: add in try_date_expr():
if not string:
    return None
Some articles include the full publication time, with timezone, in HTML meta tags or JavaScript config. Does this library parse and handle those timezones? Relatedly, how does it internally store dates with regard to timezone: are they all returned in machine-local time, held in GMT, or something else?
For instance, this Guardian article includes the article:published_time meta tag with a timezone included. Does this library recognize that timezone and return the date as it would be in GMT? Same for this article on CNN, which includes the datePublished meta tag.
Description of the Issue:
When using the htmldate library to extract both the original publication date and the most recent update date from web pages, the function find_date returns the same date for both, even though the HTML source of the pages clearly contains different dates for the original publication and last modification.
Steps to Reproduce:
Expected Behavior:
The function should return distinct dates for the original publication and the most recent update (if available), based on the webpage's metadata.
Actual Behavior:
The function returns the same date for both the original publication and the most recent update.
Example Code:
from htmldate import find_date
# Example URL
url = 'https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/'
# Attempt to extract original publication date
original_date = find_date(url, original_date=True)
# Attempt to extract most recent update date
updated_date = find_date(url, original_date=False)
print(f'Original Date: {original_date}')
print(f'Updated Date: {updated_date}')
Possible Causes:
The library might not be parsing certain HTML meta tags correctly.
There could be an issue with the heuristic approach used to differentiate between the dates.
I hope this information helps in diagnosing and resolving the issue. Thank you for your assistance!
For the following MWE:
from htmldate import find_date
print(find_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))
htmldate outputs 2022-01-01 instead of the expected 2022-10-19.
I've traced the execution of the above call and I believe the bug is in the search_page function. It doesn't seem to catch the above date pattern as a valid date and only grabs onto the 2022 part of the date string (which autocompletes the rest to 1st Jan).
I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.
Hi Adrien,
here are a few test cases where the extraction gave a wrong answer:
https://www.gardeners.com/how-to/vegetable-gardening/5069.html
https://www.almanac.com/vegetable-gardening-for-beginners
Somewhat related, this one 'hangs':
https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854
If you don't pass in a max_date, htmldate will use the LATEST_POSSIBLE constant (in get_max_date in validators.py). This constant is initialized to datetime.now().
This is an issue when htmldate is used in a long running process, such as a server which runs 24/7, or even in workers that are not restarted often. After one day of uptime, htmldate will still use the max_date of the previous day.
This can be overridden from the calling code (by passing in the appropriate parameters), but I think it would be nicer if htmldate initialized max_date with datetime.now() on every call to the get_max_date function.
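A sketch of the calling-side workaround mentioned above, assuming max_date accepts a datetime object as the find_date signature suggests:
from datetime import datetime
from htmldate import find_date

# pass a fresh upper bound on every call so a long-running worker
# doesn't keep using a stale LATEST_POSSIBLE computed at import time
result = find_date(html, max_date=datetime.now())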
Whenever I call the find_date() function in a Jupyter Notebook, it always seems to return None (the second line is the same link as in the README). There's no module error and I believe I've installed it correctly via pip and git. Is there a solution to this in Jupyter or should I just switch over to regular Python?
TypeError as mentioned in adbar/trafilatura#61, originating in IllegalMonthError within python-dateutil.
Hi @adbar,
Great package! I really like the evaluation suite. The script is clear and the webpage is a fair description. I ran the evaluation myself and get some minor discrepancies in the numbers from what is published on the webpage. In particular, I get the following:
number of documents: 225
nothing
{'true_positives': 0, 'false_positives': 0, 'true_negatives': 0, 'false_negatives': 225, 'time': 0}
htmldate extensive
{'true_positives': 196, 'false_positives': 29, 'true_negatives': 0, 'false_negatives': 0, 'time': 5.767443418502808}
precision: 0.871 recall: 1.000 accuracy: 0.871 f-score: 0.931
htmldate fast
{'true_positives': 186, 'false_positives': 20, 'true_negatives': 0, 'false_negatives': 19, 'time': 5.7627551555633545}
precision: 0.903 recall: 0.907 accuracy: 0.827 f-score: 0.905
newspaper
{'true_positives': 87, 'false_positives': 11, 'true_negatives': 0, 'false_negatives': 127, 'time': 108.73073124885559}
precision: 0.888 recall: 0.407 accuracy: 0.387 f-score: 0.558
newsplease
{'true_positives': 130, 'false_positives': 28, 'true_negatives': 0, 'false_negatives': 67, 'time': 111.8387098312378}
precision: 0.823 recall: 0.660 accuracy: 0.578 f-score: 0.732
articledateextractor
{'true_positives': 125, 'false_positives': 28, 'true_negatives': 0, 'false_negatives': 72, 'time': 7.127204179763794}
precision: 0.817 recall: 0.635 accuracy: 0.556 f-score: 0.714
date_guesser
{'true_positives': 110, 'false_positives': 26, 'true_negatives': 0, 'false_negatives': 89, 'time': 31.155170917510986}
precision: 0.809 recall: 0.553 accuracy: 0.489 f-score: 0.657
goose
{'true_positives': 94, 'false_positives': 12, 'true_negatives': 0, 'false_negatives': 119, 'time': 18.64737319946289}
precision: 0.887 recall: 0.441 accuracy: 0.418 f-score: 0.589
as output from comparison.py. Specifically, the F-1 score for htmldate extensive should be 0.944 according to the website, but I get 0.931. Any idea where the discrepancy comes from?
By default, dates before 1995 are considered implausible; however, changing the minimum date does not fix the issue.
CLI:
htmldate -u "https://web.archive.org/web/20201205182452/https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083" -vv -min "1990-01-01"
Python:
Here is the debugging output without min_date:
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
With min_date set to "1990-01-01":
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
Bug originally posted by @kinoute in #8 (comment)
Hello and thank you for this great lib!
htmldate gives "# ERROR no valid result for url" for articles on martech.org which nonetheless display the publication date in the header,
e.g. https://martech.org/why-testing-is-strategic-experimentation-for-sustainable-growth/
where "on April 4, 2022 at 10:43 am" is shown at the beginning of the article.
A short version of the documentation is available straight from GitHub (README.rst), while a more exhaustive one is present in the docs folder and online at htmldate.readthedocs.io.
Several problems could arise:
As of now the coverage is at 90%; some portions of the code are not covered by the tests. The unit_tests.py file in the tests/ directory features a series of web pages as well as code loops; this is where further tests could be added.
Please refer to the contributing guidelines.
Hi @adbar,
Would it be possible to add Portuguese (PT) month names and abbreviations in the next release?
Thanks.
See discussion in #56.
Add a requirements-dev.txt file with the following dependencies. Update the CI workflow (.github/workflows/tests.yml) accordingly, i.e. remove the development packages listed there and use the new file instead.
It yields:
"errorMessage": "cannot import name 'etree' from 'lxml' (/var/task/lxml/__init__.py)",
Trying to figure out what the issue is and how to fix it.
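The /var/task path points to AWS Lambda; a frequent cause, stated here as an assumption rather than a confirmed diagnosis, is that lxml's compiled extension was built for the wrong platform when the deployment package was assembled. Forcing a Linux wheel when building the package usually helps:
pip install --platform manylinux2014_x86_64 --only-binary=:all: --target package/ lxml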
I'm getting the error below when trying to install trafilatura on the mcr.microsoft.com/playwright/python:v1.32.1-focal Docker image. I tried many versions with no luck. Is there a way to fix this without adding a lot to the image size?
Building wheels for collected packages: sentence-transformers, typing, uuid, backports-datetime-fromisoformat, lit
Building wheel for sentence-transformers (setup.py): started
Building wheel for sentence-transformers (setup.py): finished with status 'done'
Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125926 sha256=85fbd76a2c8311631cab1cf9611cf0ef12e43e06c26bbaaca0a0ad9ab4323f63
Stored in directory: /root/.cache/pip/wheels/5e/6f/8c/d88aec621f3f542d26fac0342bef5e693335d125f4e54aeffe
Building wheel for typing (setup.py): started
Building wheel for typing (setup.py): finished with status 'done'
Created wheel for typing: filename=typing-3.7.4.3-py3-none-any.whl size=26305 sha256=ec7f26377d7304b784c9a15bf2152e785604f05a42b5e5467060b10f282f16d5
Stored in directory: /root/.cache/pip/wheels/5e/5d/01/3083e091b57809dad979ea543def62d9d878950e3e74f0c930
Building wheel for uuid (setup.py): started
Building wheel for uuid (setup.py): finished with status 'done'
Created wheel for uuid: filename=uuid-1.30-py3-none-any.whl size=6478 sha256=42f6b14e52efa4385e0e1d94a2aa9481407fa95875859e5090f7c7cc64dd5465
Stored in directory: /root/.cache/pip/wheels/1b/6c/cb/f9aae2bc97333c3d6e060826c1ee9e44e46306a178e5783505
Building wheel for backports-datetime-fromisoformat (setup.py): started
Building wheel for backports-datetime-fromisoformat (setup.py): finished with status 'error'
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-38
creating build/lib.linux-x86_64-cpython-38/backports
copying backports/__init__.py -> build/lib.linux-x86_64-cpython-38/backports
creating build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
copying backports/datetime_fromisoformat/__init__.py -> build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
running build_ext
building 'backports._datetime_fromisoformat' extension
creating build/temp.linux-x86_64-cpython-38
creating build/temp.linux-x86_64-cpython-38/backports
creating build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.8 -c backports/datetime_fromisoformat/_datetimemodule.c -o build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat/_datetimemodule.o
error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for backports-datetime-fromisoformat
Running setup.py clean for backports-datetime-fromisoformat
Building wheel for lit (pyproject.toml): started
Building wheel for lit (pyproject.toml): finished with status 'done'
Created wheel for lit: filename=lit-16.0.6-py3-none-any.whl size=93584 sha256=7eb1709c8fb581da100e3f4309e4d214a3e1db491afcc2f3aa2d8e092360fa61
Stored in directory: /root/.cache/pip/wheels/05/ab/f1/0102fea49a41c753f0e79a1a4012417d5d7ef0f93224694472
Successfully built sentence-transformers typing uuid lit
Failed to build backports-datetime-fromisoformat
Installing collected packages: uuid, tokenizers, sentencepiece, safetensors, pytz, playwright-stealth, mpmath, lit, lambda-warmer-py, cmake, backports-datetime-fromisoformat, asyncio, urllib3, typing-extensions, typing, tqdm, tld, threadpoolctl, tabulate, sympy, soupsieve, sniffio, six, simplejson, regex, pyyaml, python-json-logger, python-dotenv, pyspellchecker, pluggy, pillow, packaging, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, numpy, networkx, MarkupSafe, lxml, langcodes, joblib, jmespath, jellyfish, idna, h11, fsspec, fastapi-events, exceptiongroup, click, charset-normalizer, certifi, backports.zoneinfo, uvicorn, tzlocal, segtok, scipy, requests, python-dateutil, pydantic, nvidia-cusolver-cu11, nvidia-cudnn-cu11, nltk, mangum, justext, jinja2, filelock, courlan, beautifulsoup4, awslambdaric, anyio, yake, starlette, scikit-learn, rake-nltk, pandas, huggingface-hub, dateparser, botocore, transformers, s3transfer, htmldate, fastapi, trafilatura, boto3, triton, torch, torchvision, sentence-transformers
Running setup.py install for backports-datetime-fromisoformat: started
Running setup.py install for backports-datetime-fromisoformat: finished with status 'error'
error: subprocess-exited-with-error
× Running setup.py install for backports-datetime-fromisoformat did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
running install
/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-38
creating build/lib.linux-x86_64-cpython-38/backports
copying backports/__init__.py -> build/lib.linux-x86_64-cpython-38/backports
creating build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
copying backports/datetime_fromisoformat/__init__.py -> build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
running build_ext
building 'backports._datetime_fromisoformat' extension
creating build/temp.linux-x86_64-cpython-38
creating build/temp.linux-x86_64-cpython-38/backports
creating build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.8 -c backports/datetime_fromisoformat/_datetimemodule.c -o build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat/_datetimemodule.o
error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> backports-datetime-fromisoformat
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python -m pip install --upgrade pip
The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1
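For what it's worth, the underlying failure in the log is the missing C compiler (error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory), so a plausible fix, an assumption based on the log rather than a tested recipe, is to install gcc in the image before pip runs, e.g. RUN apt-get update && apt-get install -y gcc in the Dockerfile; alternatively, a backports-datetime-fromisoformat release that ships a prebuilt wheel for the target platform (if one exists) would avoid compiling altogether.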
The tests didn't pass, so the lxml version has been pinned in the last release. Fix or bypass the changes introduced by the newer libxml version.
In our testing the current code produces unreliable results on Wikipedia articles: sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud this produces more useful results (for our open-web news analysis context).
In terms of implementation, we could just copy the filter_url_for_undateable function from date_guesser and use it as is, to include the other checks it does for undateable domains. We'd call it early on in guess_date.
If many potential dates are present on the page the extensive search gets too greedy and needs too much time to find a potential candidate, example: http://www.historicalkits.co.uk/Leeds_United/Leeds_United.htm
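A possible mitigation from the calling side, assuming the fast heuristics are good enough for such pages, is to disable the extensive search:
from htmldate import find_date

# skip the greedy fallback scan on pages with many candidate dates
result = find_date("http://www.historicalkits.co.uk/Leeds_United/Leeds_United.htm",
                   extensive_search=False)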
Hello there,
Thanks for this great project! I encountered a problem while crawling different websites and trying to extract dates with this package, especially on this URL: https://osmh.dev
Here is the error, using IPython and Python 3.8.12:
# works
In [3]: from htmldate import find_date
In [4]: find_date("https://osmh.dev")
Out[4]: '2020-11-29'
# doesn't work
In [6]: find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')
The last example throws an error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-6-9988648ad55b> in <module>
----> 1 find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
653
654 # try time elements
--> 655 time_result = examine_time_elements(
656 search_tree, outputformat, extensive_search, original_date, min_date, max_date
657 )
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in examine_time_elements(tree, outputformat, extensive_search, original_date, min_date, max_date)
389 return attempt
390 else:
--> 391 reference = compare_reference(reference, elem.get('datetime'), outputformat, extensive_search, original_date, min_date, max_date)
392 if reference > 0:
393 break
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in compare_reference(reference, expression, outputformat, extensive_search, original_date, min_date, max_date)
300 attempt = try_expression(expression, outputformat, extensive_search, min_date, max_date)
301 if attempt is not None:
--> 302 return compare_values(reference, attempt, outputformat, original_date)
303 return reference
304
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/validators.py in compare_values(reference, attempt, outputformat, original_date)
110 def compare_values(reference, attempt, outputformat, original_date):
111 """Compare the date expression to a reference"""
--> 112 timestamp = time.mktime(datetime.datetime.strptime(attempt, outputformat).timetuple())
113 if original_date is True and (reference == 0 or timestamp < reference):
114 reference = timestamp
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
566 """Return a class cls instance based on the input string and the
567 format string."""
--> 568 tt, fraction, gmtoff_fraction = _strptime(data_string, format)
569 tzname, gmtoff = tt[-2:]
570 args = tt[:6] + (fraction,)
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime(data_string, format)
331 if not format_regex:
332 try:
--> 333 format_regex = _TimeRE_cache.compile(format)
334 # KeyError raised when a bad format is found; can be specified as
335 # \\, in which case it was a stray % but with a space after it
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in compile(self, format)
261 def compile(self, format):
262 """Return a compiled re object for the format string."""
--> 263 return re_compile(self.pattern(format), IGNORECASE)
264
265 _cache_lock = _thread_allocate_lock()
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in compile(pattern, flags)
250 def compile(pattern, flags=0):
251 "Compile a regular expression pattern, returning a Pattern object."
--> 252 return _compile(pattern, flags)
253
254 def purge():
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in _compile(pattern, flags)
302 if not sre_compile.isstring(pattern):
303 raise TypeError("first argument must be string or compiled pattern")
--> 304 p = sre_compile.compile(pattern, flags)
305 if not (flags & DEBUG):
306 if len(_cache) >= _MAXCACHE:
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_compile.py in compile(p, flags)
762 if isstring(p):
763 pattern = p
--> 764 p = sre_parse.parse(p, flags)
765 else:
766 pattern = None
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in parse(str, flags, state)
946
947 try:
--> 948 p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
949 except Verbose:
950 # the VERBOSE flag was switched on inside the pattern. to be
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
441 start = source.tell()
442 while True:
--> 443 itemsappend(_parse(source, state, verbose, nested + 1,
444 not nested and not items))
445 if not sourcematch("|"):
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
829 group = state.opengroup(name)
830 except error as err:
--> 831 raise source.error(err.msg, len(name) + 1) from None
832 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
833 not (del_flags & SRE_FLAG_VERBOSE))
error: redefinition of group name 'm' as group 5; was group 2 at position 116
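For context, the outputformat string itself appears to be the trigger: '%Y-%m-%d %H:%m:%S' uses %m (month) twice where %M (minutes) was presumably intended, and strptime compiles each format directive into a named regex group, hence the "redefinition of group name 'm'" error. A quick check independent of htmldate:
from datetime import datetime

# duplicate %m raises re.error: redefinition of group name 'm' ...
# datetime.strptime("2020-11-29 10:00:00", "%Y-%m-%d %H:%m:%S")

# with %M for the minutes, the same string parses fine
print(datetime.strptime("2020-11-29 10:00:00", "%Y-%m-%d %H:%M:%S"))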
In order to help new contributors it would be nice to add pre-commit hooks to the repository with the following checks:
The CONTRIBUTING.md file could get updated accordingly.
Version display is usually expected from a command-line interface, see adbar/trafilatura#145
Due to an upcoming version, compatibility has to be checked.
If the tests pass, we can reference it explicitly in the setup file.
Hello @adbar,
I just stumbled upon an issue when extracting contents from this html file (an article from LeMonde): https://gist.github.com/Yomguithereal/de4457a421729c92a976b506268631d7
It returns 2021-01-31 (which was a date in the future at the time the HTML was downloaded, i.e. more than one year ago) because it latches onto something which is an expiry date in a JavaScript string literal.
I don't really know how trafilatura tries to extract a date from HTML pages, but I guess here it was found by a regex scanning the whole text? In which case maybe a condition checking that the found dates are not in the future could help (this could also be tedious, because one would need to pass the "present" date when extracting data collected in the past).
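A sketch of that idea using the existing max_date parameter, with a hypothetical crawl date standing in for the "present" of the collection:
from htmldate import find_date

# bound the search by the crawl date so future expiry dates are rejected
result = find_date(html, max_date="2019-12-01")  # hypothetical crawl date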
Only support Python versions 3.6+ in the future and see if the code can be improved or cleaned on the way.
I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.
Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].
Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).
Thanks!
Hello,
Is it possible to customize the output pattern of the find_date function? We need a specific format that retrieves the date and time jointly.
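find_date already exposes an outputformat parameter that takes strftime-style patterns (it is used elsewhere in this document), which covers the combined date-and-time case:
from htmldate import find_date

# strftime-style pattern combining date and time
result = find_date(url, outputformat='%Y-%m-%d %H:%M:%S')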
I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered within lxml's clean_html functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.
The main problem is in the design: because lxml's clean_html functionality is based on a blocklist, it is hard to keep it up to date with all the new possibilities in HTML and JS.
Two viable alternatives worth considering are bleach and nh3. Here's why:
bleach:
nh3:
We'll probably move the cleaning part of lxml to a distinct project first, so it will still remain usable, but it would be better to find a suitable alternative sooner rather than later.
Let me know if we can help you with this transition anyhow and have a nice day.