Fast and robust date extraction from web pages, with Python or on the command-line
Home Page: https://htmldate.readthedocs.io
License: Apache License 2.0
The change brings both a speed-up and better maintainability: adbar/trafilatura#41
See issue adbar/trafilatura#216.
Extracting the date from the same web page multiple times shows that the module is leaking memory; this doesn't appear to be related to extensive_search:
import os
import psutil
from htmldate import find_date

# read the page once, then extract repeatedly to watch resident memory grow
with open('test.html', 'rb') as inputf:
    html = inputf.read()

process = psutil.Process(os.getpid())
for i in range(10):
    result = find_date(html, extensive_search=False)
    # resident set size in MiB after each extraction
    print(i, ":", process.memory_info().rss / 1024 ** 2)
tracemalloc doesn't give any clue.
Hi @adbar, I'm trying to develop a small script with Python in Anaconda to use htmldate, and when I try to run it I get an error:
The code is this one:
import htmldate as hd
hd.find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
And the error is ImportError: cannot import name etree. I have checked with pip list that lxml is installed and I can see I have version 4.5.2, so what is happening? Thanks so much.
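Since pip sees lxml 4.5.2 but the import fails, this often points to an environment mismatch, i.e. the Anaconda interpreter running the script is not the one pip installed into (an assumption, not a confirmed diagnosis). A minimal check to run from the same interpreter as the script:
from lxml import etree  # fails with the same ImportError in a broken install
print(etree.LXML_VERSION)  # e.g. (4, 5, 2, 0)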
Is there a way to force htmldate to look for datetime and not just date, or to prioritise specific extractors over others, e.g. OpenGraph over URL extraction? Let me give you an example:
from htmldate import find_date
url = "https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/"
find_date(url, outputformat='%Y-%m-%d %H:%M:%S', verbose=True)
INFO:htmldate.utils:URL detected, downloading: https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/
DEBUG:urllib3.connectionpool:Resetting dropped connection: www.ot.gr
DEBUG:urllib3.connectionpool:https://www.ot.gr:443 "GET /2022/03/23/apopseis/daimonopoiisi/ HTTP/1.1" 200 266737
DEBUG:htmldate.extractors:found date in URL: /2022/03/23/
'2022-03-23 00:00:00'
But if you look at the article you can find: <meta property="article:published_time" content="2022-03-23T06:15:58+00:00">
Doctypes like <!DOCTYPE html … /> are now a problem, see https://bugs.launchpad.net/lxml/+bug/1955915 (Line 74 in 4cfc156).
When installing trafilatura in Google Colab (08-Sep-2020), htmldate throws an error due to a version conflict, because the standard Docker image of Google Colab currently uses requests==2.23.0 for Python 3.6.9.
So far only the logs provide info on this. It would be nicer to be able to pinpoint the type (header, element, or text) or even the exact location of the result.
To do so, the location info has to be propagated back along with the result.
Does htmldate recognize DublinCore (DC) meta tags? I see you call out OpenGraph, so I was wondering if DC is recognized explicitly? Or if it scans through ALL meta tags to find dates?
Thanks!
So far find_date() returns a string containing the result. To add context another format is required; JSON is a good candidate. An optional parameter like as_json=True could allow for the following additional info:
url: https://securityintelligence.com/new-banking-trojan-icedid-discovered-by-ibm-x-force-research/
but the true date is November 13, 2017.
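A hypothetical sketch of what the proposed as_json output could look like; apart from url, the field names here are assumptions for illustration, not an existing API:
result = find_date(html, as_json=True)  # proposed parameter, not yet implemented
# {
#   "url": "https://example.org/post",          # hypothetical example value
#   "date": "2017-11-13",
#   "source": "header",                         # header, element, or text
#   "location": "meta[article:published_time]"  # hypothetical location info
# }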
I have a problem with extracting the date from a website.
date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')
I get the following error:
ValueError Traceback (most recent call last)
in ()
----> 1 date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731');
1 frames
/usr/local/lib/python3.7/dist-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
598 if verbose is True:
599 logging.basicConfig(level=logging.DEBUG)
--> 600 tree = load_html(htmlobject)
601 find_date.extensive_search = extensive_search
602 min_date, max_date = get_min_date(min_date), get_max_date(max_date)
/usr/local/lib/python3.7/dist-packages/htmldate/utils.py in load_html(htmlobject)
165 # log the error and quit
166 if htmltext is None:
--> 167 raise ValueError("URL couldn't be processed: %s", htmlobject)
168 # start processing
169 tree = None
ValueError: ("URL couldn't be processed: %s", 'https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')
I will be grateful for any support and help with this.
While working on #13 I noticed that in extractors.py some newlines (added for code readability) in regex expressions are problematic. Better to check and test those.
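A minimal illustration of the pitfall, not the actual expressions from extractors.py: without re.VERBOSE, a newline inserted for readability becomes a literal part of the pattern.
import re

# the newline kept for "readability" is matched literally
broken = re.compile("([0-9]{4})[/.-]\n([0-9]{2})")
intended = re.compile("([0-9]{4})[/.-]([0-9]{2})")

print(broken.search("2017-11"))    # None: a literal newline is expected
print(intended.search("2017-11"))  # <re.Match object; span=(0, 7), ...>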
Hello,
I'm unable to find a reason why lxml is pinned here: https://github.com/adbar/htmldate/blob/master/setup.py#L121-L122
It's breaking my local dev environment on macOS, since the pinned version doesn't compile for Python 3.12 (later versions of lxml compile just fine).
In the meantime I've forked this repo and removed the pinning.
Please advise, happy to make any required PRs!
Thanks.
Dear all,
Htmldate is now widely used, and it has become apparent that the GPL license is not prevalent among Python packages; its potential implications are also not easily understood. To better align with community standards and promote ongoing development, I plan to transition the software license to Apache 2.0.
As you have contributed lines that are still in use, your agreement is needed to move forward on this. For the sake of simplicity I use the list of current contributors as provided by GitHub, so you are listed below even if your contributions focus on documentation or evaluation files.
If you agree, please add a corresponding message to this thread. If not, please contact me or open a public discussion.
Configuration arguments are available for the Python functions; it would be nice to make them available as command-line arguments as well:
Some images are replaced by alt text as they are missing from the PyPI README. This could be fixed by
I'm not sure I understand how to use this library when I already have a bs4 object. Thanks in advance!
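find_date() accepts an HTML string as input, so one straightforward option, sketched here under the assumption that the soup object holds the full page, is to serialize the bs4 tree back to HTML:
from bs4 import BeautifulSoup
from htmldate import find_date

html = "<html><head><meta property='article:published_time' content='2022-03-23T06:15:58+00:00'></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# hand the serialized markup to htmldate
print(find_date(str(soup)))  # expected: '2022-03-23'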
In extractors.py the function try_date_expr() does not check that the string argument is of the right type and raises an exception if it is not.
To reproduce: when processing a meta node of the form <meta itemprop="dateCreated" datetime=""> in examine_header(), the line
attempt = tryfunc(elem.get("datetime") or elem.get("content"))
calls try_date_expr() with string=None, which raises AttributeError: 'NoneType' object has no attribute 'strip'.
Suggestion: add in try_date_expr():
if not string:
    return None
Some articles include the full publication time, with timezone, in HTML meta tags or JavaScript config. Does this library parse and handle those timezones? Relatedly, how does it internally store dates with regard to timezone: are they all returned in machine-local time, held in GMT, or something else?
For instance, this Guardian article includes the article:published_time meta tag with a timezone included. Does this library recognize that timezone and return the date as it would be in GMT? Same for this article on CNN, which includes the datePublished meta tag.
Description of the Issue:
When using the htmldate library to extract both the original publication date and the most recent update date from web pages, the function find_date returns the same date for both, even though the HTML source of the pages clearly contains different dates for the original publication and last modification.
Steps to Reproduce:
Expected Behavior:
The function should return distinct dates for the original publication and the most recent update (if available), based on the webpage's metadata.
Actual Behavior:
The function returns the same date for both the original publication and the most recent update.
Example Code:
from htmldate import find_date
# Example URL
url = 'https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/'
# Attempt to extract original publication date
original_date = find_date(url, original_date=True)
# Attempt to extract most recent update date
updated_date = find_date(url, original_date=False)
print(f'Original Date: {original_date}')
print(f'Updated Date: {updated_date}')
Possible Causes:
The library might not be parsing certain HTML meta tags correctly.
There could be an issue with the heuristic approach used to differentiate between the dates.
I hope this information helps in diagnosing and resolving the issue. Thank you for your assistance!
For the following MWE:
from htmldate import find_date
print(find_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))
htmldate outputs 2022-01-01 instead of the expected 2022-10-19.
I've traced the execution of the above call and I believe the bug is in the search_page function. It doesn't seem to catch the above date pattern as a valid date and only grabs onto the 2022 part of the date string (which autocompletes the rest to 1st Jan).
I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.
Hi Adrien,
here are a few test cases where the extraction gave a wrong answer:
https://www.gardeners.com/how-to/vegetable-gardening/5069.html
https://www.almanac.com/vegetable-gardening-for-beginners
Somewhat related, this one 'hangs':
https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854
If you don't pass in a max_date, htmldate will use the LATEST_POSSIBLE constant (in get_max_date in validators.py). This constant is initialized to datetime.now().
This is an issue when htmldate is used in a long running process, such as a server which runs 24/7, or even in workers that are not restarted often. After one day of uptime, htmldate will still use the max_date of the previous day.
This can be overridden from the calling code (by passing in the appropriate parameters), but I think it would be nicer if htmldate initialized max_date with datetime.now() on every call to the get_max_date function.
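A sketch of the calling-side workaround mentioned above, assuming max_date accepts a datetime object as the find_date signature suggests:
from datetime import datetime
from htmldate import find_date

# pass a fresh upper bound on every call so a long-running worker
# doesn't keep using a stale LATEST_POSSIBLE computed at import time
result = find_date(html, max_date=datetime.now())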
Whenever I call the find_date() function in a Jupyter Notebook, it always seems to return None (the second line is the same link as in the README). There's no module error and I believe I've installed it correctly via pip and git. Is there a solution to this in Jupyter or should I just switch over to regular Python?
TypeError as mentioned in adbar/trafilatura#61, originating in IllegalMonthError within python-dateutil.
Hi @adbar,
Great package! I really like the evaluation suite. The script is clear and the webpage is a fair description. I ran the evaluation myself and get some minor discrepancies in the numbers from what is published on the webpage. In particular, I get the following:
number of documents: 225
nothing
{'true_positives': 0, 'false_positives': 0, 'true_negatives': 0, 'false_negatives': 225, 'time': 0}
htmldate extensive
{'true_positives': 196, 'false_positives': 29, 'true_negatives': 0, 'false_negatives': 0, 'time': 5.767443418502808}
precision: 0.871 recall: 1.000 accuracy: 0.871 f-score: 0.931
htmldate fast
{'true_positives': 186, 'false_positives': 20, 'true_negatives': 0, 'false_negatives': 19, 'time': 5.7627551555633545}
precision: 0.903 recall: 0.907 accuracy: 0.827 f-score: 0.905
newspaper
{'true_positives': 87, 'false_positives': 11, 'true_negatives': 0, 'false_negatives': 127, 'time': 108.73073124885559}
precision: 0.888 recall: 0.407 accuracy: 0.387 f-score: 0.558
newsplease
{'true_positives': 130, 'false_positives': 28, 'true_negatives': 0, 'false_negatives': 67, 'time': 111.8387098312378}
precision: 0.823 recall: 0.660 accuracy: 0.578 f-score: 0.732
articledateextractor
{'true_positives': 125, 'false_positives': 28, 'true_negatives': 0, 'false_negatives': 72, 'time': 7.127204179763794}
precision: 0.817 recall: 0.635 accuracy: 0.556 f-score: 0.714
date_guesser
{'true_positives': 110, 'false_positives': 26, 'true_negatives': 0, 'false_negatives': 89, 'time': 31.155170917510986}
precision: 0.809 recall: 0.553 accuracy: 0.489 f-score: 0.657
goose
{'true_positives': 94, 'false_positives': 12, 'true_negatives': 0, 'false_negatives': 119, 'time': 18.64737319946289}
precision: 0.887 recall: 0.441 accuracy: 0.418 f-score: 0.589
as output from comparison.py. Specifically, the F-1 score for htmldate extensive should be 0.944 according to the website, but I get 0.931. Any idea where the discrepancy comes from?
By default, dates before 1995 are considered implausible; however, changing the minimum date does not fix the issue.
CLI:
htmldate -u "https://web.archive.org/web/20201205182452/https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083" -vv -min "1990-01-01"
Python:
Here is the debugging output without min_date:
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
With min_date set to "1990-01-01":
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
Bug originally posted by @kinoute in #8 (comment)
Hello and thank you for this great lib!
htmldate gives "# ERROR no valid result for url" for articles on martech.org which nonetheless display the publication date in the header,
e.g. https://martech.org/why-testing-is-strategic-experimentation-for-sustainable-growth/
where "on April 4, 2022 at 10:43 am" is shown at the beginning of the article.
A short version of the documentation is available straight from GitHub (README.rst), while a more exhaustive one is present in the docs folder and online at htmldate.readthedocs.io.
Several problems could arise:
As of now the coverage is at 90%; some portions of the code are not covered by the tests. The unit_tests.py file in the tests/ directory features a series of web pages as well as code loops; this is where further tests could be added.
Please refer to the contributing guidelines.
Hi @adbar,
Would it be possible to add Portuguese (PT) month names and abbreviations in the next release?
Thanks.
See discussion in #56.
Add a requirements-dev.txt file with the following dependencies. Update the CI workflow (.github/workflows/tests.yml) accordingly, i.e. remove the development packages listed there and use the new file instead.
It yields:
"errorMessage": "cannot import name 'etree' from 'lxml' (/var/task/lxml/__init__.py)",
Trying to figure out what the issue is and how to fix it.
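The /var/task path points to AWS Lambda; a frequent cause, stated here as an assumption rather than a confirmed diagnosis, is that lxml's compiled extension was built for the wrong platform when the deployment package was assembled. Forcing a Linux wheel when building the package usually helps:
pip install --platform manylinux2014_x86_64 --only-binary=:all: --target package/ lxml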
I'm getting the error below when trying to install trafilatura on the mcr.microsoft.com/playwright/python:v1.32.1-focal Docker image. I tried many versions with no luck. Is there a way to fix this without adding a lot to the image size?
Building wheels for collected packages: sentence-transformers, typing, uuid, backports-datetime-fromisoformat, lit
Building wheel for sentence-transformers (setup.py): started
Building wheel for sentence-transformers (setup.py): finished with status 'done'
Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125926 sha256=85fbd76a2c8311631cab1cf9611cf0ef12e43e06c26bbaaca0a0ad9ab4323f63
Stored in directory: /root/.cache/pip/wheels/5e/6f/8c/d88aec621f3f542d26fac0342bef5e693335d125f4e54aeffe
Building wheel for typing (setup.py): started
Building wheel for typing (setup.py): finished with status 'done'
Created wheel for typing: filename=typing-3.7.4.3-py3-none-any.whl size=26305 sha256=ec7f26377d7304b784c9a15bf2152e785604f05a42b5e5467060b10f282f16d5
Stored in directory: /root/.cache/pip/wheels/5e/5d/01/3083e091b57809dad979ea543def62d9d878950e3e74f0c930
Building wheel for uuid (setup.py): started
Building wheel for uuid (setup.py): finished with status 'done'
Created wheel for uuid: filename=uuid-1.30-py3-none-any.whl size=6478 sha256=42f6b14e52efa4385e0e1d94a2aa9481407fa95875859e5090f7c7cc64dd5465
Stored in directory: /root/.cache/pip/wheels/1b/6c/cb/f9aae2bc97333c3d6e060826c1ee9e44e46306a178e5783505
Building wheel for backports-datetime-fromisoformat (setup.py): started
Building wheel for backports-datetime-fromisoformat (setup.py): finished with status 'error'
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-38
creating build/lib.linux-x86_64-cpython-38/backports
copying backports/__init__.py -> build/lib.linux-x86_64-cpython-38/backports
creating build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
copying backports/datetime_fromisoformat/__init__.py -> build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
running build_ext
building 'backports._datetime_fromisoformat' extension
creating build/temp.linux-x86_64-cpython-38
creating build/temp.linux-x86_64-cpython-38/backports
creating build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.8 -c backports/datetime_fromisoformat/_datetimemodule.c -o build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat/_datetimemodule.o
error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for backports-datetime-fromisoformat
Running setup.py clean for backports-datetime-fromisoformat
Building wheel for lit (pyproject.toml): started
Building wheel for lit (pyproject.toml): finished with status 'done'
Created wheel for lit: filename=lit-16.0.6-py3-none-any.whl size=93584 sha256=7eb1709c8fb581da100e3f4309e4d214a3e1db491afcc2f3aa2d8e092360fa61
Stored in directory: /root/.cache/pip/wheels/05/ab/f1/0102fea49a41c753f0e79a1a4012417d5d7ef0f93224694472
Successfully built sentence-transformers typing uuid lit
Failed to build backports-datetime-fromisoformat
Installing collected packages: uuid, tokenizers, sentencepiece, safetensors, pytz, playwright-stealth, mpmath, lit, lambda-warmer-py, cmake, backports-datetime-fromisoformat, asyncio, urllib3, typing-extensions, typing, tqdm, tld, threadpoolctl, tabulate, sympy, soupsieve, sniffio, six, simplejson, regex, pyyaml, python-json-logger, python-dotenv, pyspellchecker, pluggy, pillow, packaging, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, numpy, networkx, MarkupSafe, lxml, langcodes, joblib, jmespath, jellyfish, idna, h11, fsspec, fastapi-events, exceptiongroup, click, charset-normalizer, certifi, backports.zoneinfo, uvicorn, tzlocal, segtok, scipy, requests, python-dateutil, pydantic, nvidia-cusolver-cu11, nvidia-cudnn-cu11, nltk, mangum, justext, jinja2, filelock, courlan, beautifulsoup4, awslambdaric, anyio, yake, starlette, scikit-learn, rake-nltk, pandas, huggingface-hub, dateparser, botocore, transformers, s3transfer, htmldate, fastapi, trafilatura, boto3, triton, torch, torchvision, sentence-transformers
Running setup.py install for backports-datetime-fromisoformat: started
Running setup.py install for backports-datetime-fromisoformat: finished with status 'error'
error: subprocess-exited-with-error
× Running setup.py install for backports-datetime-fromisoformat did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
running install
/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-38
creating build/lib.linux-x86_64-cpython-38/backports
copying backports/__init__.py -> build/lib.linux-x86_64-cpython-38/backports
creating build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
copying backports/datetime_fromisoformat/__init__.py -> build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
running build_ext
building 'backports._datetime_fromisoformat' extension
creating build/temp.linux-x86_64-cpython-38
creating build/temp.linux-x86_64-cpython-38/backports
creating build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.8 -c backports/datetime_fromisoformat/_datetimemodule.c -o build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat/_datetimemodule.o
error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> backports-datetime-fromisoformat
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python -m pip install --upgrade pip
The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1
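For what it's worth, the underlying failure in the log is the missing C compiler (error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory), so a plausible fix, an assumption based on the log rather than a tested recipe, is to install gcc in the image before pip runs, e.g. RUN apt-get update && apt-get install -y gcc in the Dockerfile; alternatively, a backports-datetime-fromisoformat release that ships a prebuilt wheel for the target platform (if one exists) would avoid compiling altogether.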
The tests didn't pass, so the lxml version has been pinned in the last release. Fix or bypass the changes introduced by the newer libxml version.
In our testing the current code produces unreliable results on Wikipedia articles: sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud this produces more useful results (for our open-web news analysis context).
In terms of implementation, we could just copy the filter_url_for_undateable function from date_guesser and use it as is, to include the other checks it does for undateable domains. We'd call it early on in guess_date.
If many potential dates are present on the page the extensive search gets too greedy and needs too much time to find a potential candidate, example: http://www.historicalkits.co.uk/Leeds_United/Leeds_United.htm
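A possible mitigation from the calling side, assuming the fast heuristics are good enough for such pages, is to disable the extensive search:
from htmldate import find_date

# skip the greedy fallback scan on pages with many candidate dates
result = find_date("http://www.historicalkits.co.uk/Leeds_United/Leeds_United.htm",
                   extensive_search=False)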
Hello there,
Thanks for this great project! I encountered a problem while crawling different websites and trying to extract dates with this package, especially on this URL: https://osmh.dev
Here is the error, using IPython and Python 3.8.12:
# works
In [3]: from htmldate import find_date
In [4]: find_date("https://osmh.dev")
Out[4]: '2020-11-29'
# doesn't work
In [6]: find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')
The last example throws an error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-6-9988648ad55b> in <module>
----> 1 find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
653
654 # try time elements
--> 655 time_result = examine_time_elements(
656 search_tree, outputformat, extensive_search, original_date, min_date, max_date
657 )
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in examine_time_elements(tree, outputformat, extensive_search, original_date, min_date, max_date)
389 return attempt
390 else:
--> 391 reference = compare_reference(reference, elem.get('datetime'), outputformat, extensive_search, original_date, min_date, max_date)
392 if reference > 0:
393 break
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in compare_reference(reference, expression, outputformat, extensive_search, original_date, min_date, max_date)
300 attempt = try_expression(expression, outputformat, extensive_search, min_date, max_date)
301 if attempt is not None:
--> 302 return compare_values(reference, attempt, outputformat, original_date)
303 return reference
304
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/validators.py in compare_values(reference, attempt, outputformat, original_date)
110 def compare_values(reference, attempt, outputformat, original_date):
111 """Compare the date expression to a reference"""
--> 112 timestamp = time.mktime(datetime.datetime.strptime(attempt, outputformat).timetuple())
113 if original_date is True and (reference == 0 or timestamp < reference):
114 reference = timestamp
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
566 """Return a class cls instance based on the input string and the
567 format string."""
--> 568 tt, fraction, gmtoff_fraction = _strptime(data_string, format)
569 tzname, gmtoff = tt[-2:]
570 args = tt[:6] + (fraction,)
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime(data_string, format)
331 if not format_regex:
332 try:
--> 333 format_regex = _TimeRE_cache.compile(format)
334 # KeyError raised when a bad format is found; can be specified as
335 # \\, in which case it was a stray % but with a space after it
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in compile(self, format)
261 def compile(self, format):
262 """Return a compiled re object for the format string."""
--> 263 return re_compile(self.pattern(format), IGNORECASE)
264
265 _cache_lock = _thread_allocate_lock()
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in compile(pattern, flags)
250 def compile(pattern, flags=0):
251 "Compile a regular expression pattern, returning a Pattern object."
--> 252 return _compile(pattern, flags)
253
254 def purge():
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in _compile(pattern, flags)
302 if not sre_compile.isstring(pattern):
303 raise TypeError("first argument must be string or compiled pattern")
--> 304 p = sre_compile.compile(pattern, flags)
305 if not (flags & DEBUG):
306 if len(_cache) >= _MAXCACHE:
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_compile.py in compile(p, flags)
762 if isstring(p):
763 pattern = p
--> 764 p = sre_parse.parse(p, flags)
765 else:
766 pattern = None
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in parse(str, flags, state)
946
947 try:
--> 948 p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
949 except Verbose:
950 # the VERBOSE flag was switched on inside the pattern. to be
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
441 start = source.tell()
442 while True:
--> 443 itemsappend(_parse(source, state, verbose, nested + 1,
444 not nested and not items))
445 if not sourcematch("|"):
/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
829 group = state.opengroup(name)
830 except error as err:
--> 831 raise source.error(err.msg, len(name) + 1) from None
832 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
833 not (del_flags & SRE_FLAG_VERBOSE))
error: redefinition of group name 'm' as group 5; was group 2 at position 116
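For context, the outputformat string itself appears to be the trigger: '%Y-%m-%d %H:%m:%S' uses %m (month) twice where %M (minutes) was presumably intended, and strptime compiles each format directive into a named regex group, hence the "redefinition of group name 'm'" error. A quick check independent of htmldate:
from datetime import datetime

# duplicate %m raises re.error: redefinition of group name 'm' ...
# datetime.strptime("2020-11-29 10:00:00", "%Y-%m-%d %H:%m:%S")

# with %M for the minutes, the same string parses fine
print(datetime.strptime("2020-11-29 10:00:00", "%Y-%m-%d %H:%M:%S"))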
In order to help new contributors it would be nice to add pre-commit hooks to the repository with the following checks:
The CONTRIBUTING.md file could get updated accordingly.
Version display is usually expected from a command-line interface, see adbar/trafilatura#145
Due to an upcoming version, compatibility has to be checked.
If the tests pass, we can reference it explicitly in the setup file.
Hello @adbar,
I just stumbled upon an issue when extracting contents from this html file (an article from LeMonde): https://gist.github.com/Yomguithereal/de4457a421729c92a976b506268631d7
It returns 2021-01-31 (which was a date in the future at the time the HTML was downloaded, i.e. more than one year ago) because it latches onto something which is an expiry date in a JavaScript string literal.
I don't really know how trafilatura tries to extract a date from HTML pages, but I guess here it was found by a regex scanning the whole text? In which case maybe a condition checking that the found dates are not in the future could help (this could also be tedious, because one would need to pass the "present" date when extracting data collected in the past).
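A sketch of that idea using the existing max_date parameter, with a hypothetical crawl date standing in for the "present" of the collection:
from htmldate import find_date

# bound the search by the crawl date so future expiry dates are rejected
result = find_date(html, max_date="2019-12-01")  # hypothetical crawl date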
Only support Python versions 3.6+ in the future and see if the code can be improved or cleaned on the way.
I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.
Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].
Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).
Thanks!
Hello,
Is it possible to customize the output pattern of the find_date function? We need a specific format that retrieves the date and time jointly.
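find_date already exposes an outputformat parameter that takes strftime-style patterns (it is used elsewhere in this document), which covers the combined date-and-time case:
from htmldate import find_date

# strftime-style pattern combining date and time
result = find_date(url, outputformat='%Y-%m-%d %H:%M:%S')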
I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered within lxml's clean_html functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.
The main problem is in the design: because lxml's clean_html functionality is based on a blocklist, it is hard to keep it up to date with all the new possibilities in HTML and JS.
Two viable alternatives worth considering are bleach and nh3. Here's why:
bleach:
nh3:
We'll probably move the cleaning part of lxml to a distinct project first, so it will still remain usable, but it would be better to find a suitable alternative sooner rather than later.
Let me know if we can help you with this transition anyhow and have a nice day.