christabor / plantstuff

Warning! messy/unstable! 🌿 🌲 🍁 🍃 🌺 Utilities for retrieving, computing, organizing, and creating plant/horticulture data from various sources.

License: MIT License

Python 98.96% HTML 0.98% Makefile 0.06%

plants biology horticulture datasets

plantstuff's Introduction

plantstuff

🌿 🌲 🍁 🍃 🌺 Utilities for retrieving, computing, organizing, and creating plant/horticulture data from various sources.

DISCLAIMER

This repo is a work in progress. The ultimate goals are not yet defined, so everything is still very messy and NOT production ready in any capacity (whatever that means here).

LEGAL COPYRIGHT DISCLAIMER

No scraped data is stored here until content copyright is verified.

Scraping spiders

The following Scrapy spiders have been created and tested to generate real, uniform data. All spiders are under scraping.scrapers.spiders:

theplantlist
springhillnursery
provenwinners
wikipedia (basic categorical lists for now)

Note: unless otherwise noted, these are not considered exhaustive, but they typically retrieve most URLs and handle pagination.

Works in progress

Monrovia
Perennials.com
Plantlust

plantstuff's Issues

Setup converted schema and test in neo4j

Find a good graph modeling software

This will be used to visually model the domain, but we also need to seamlessly export the model and test it using a graph database.

The data file also needs to be stored in git, hence the need for readable text in a common format.

Requirements

Needs to be free (no trial)
Performant (at least 1000s of nodes, edges, labels per graph)
Allows exporting/importing to a parseable, non-binary format (e.g. JSON)

Nice to have

Supports exporting to .dot graphviz specification
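As a rough illustration of the export requirements above, here is a minimal sketch that writes one small graph both as diff-friendly JSON and as Graphviz DOT. The node ids, edge types, and the `to_json`/`to_dot` helpers are all hypothetical and are not part of this repo or of any modeling tool.

```python
import json

# Hypothetical toy graph: ids, labels, and edge types are illustrative only.
graph = {
    "nodes": [
        {"id": "tree", "label": "Tree"},
        {"id": "evergreen", "label": "Evergreen"},
        {"id": "broadleaf", "label": "Broadleaf"},
    ],
    "edges": [
        {"source": "tree", "target": "evergreen", "type": "CAN_BE"},
        {"source": "evergreen", "target": "broadleaf", "type": "HAS_SUBTYPE"},
    ],
}


def to_json(g):
    """Serialize with indentation and sorted keys so git diffs stay readable."""
    return json.dumps(g, indent=2, sort_keys=True)


def to_dot(g, name="plants"):
    """Render the same graph as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for node in g["nodes"]:
        lines.append(f'  "{node["id"]}" [label="{node["label"]}"];')
    for edge in g["edges"]:
        lines.append(
            f'  "{edge["source"]}" -> "{edge["target"]}" [label="{edge["type"]}"];'
        )
    lines.append("}")
    return "\n".join(lines)


print(to_json(graph))
print(to_dot(graph))
```

Keeping JSON as the stored format and generating DOT from it would satisfy both the git-friendly requirement and the nice-to-have, assuming whichever tool is chosen can read or write a similar node/edge structure.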

References/research

https://www.quora.com/What-are-the-best-database-design-tools-for-graph-databases
http://grakn.ai/ (alternative to neo4j)

http://openrefine.org/

For organizing the disparate generated datasets

Create initial proof-of-concept for basic plant categories using mind maps

@nicolesimon

E.g.

https://www.google.com/search?biw=960&bih=529&tbm=isch&sa=1&ei=gjvIWv7dCazdjwTV-LPYCg&q=graph+database+design&oq=graph+database+design&gs_l=psy-ab.3..0i30k1j0i24k1l4.47272.48014.0.48154.16.7.0.0.0.0.124.748.3j4.7.0....0...1c.1.64.psy-ab..14.1.124....0.q7BL_GghzUg#imgrc=Xq_unDoCS9XA5M:

From notes (disorganized)

database growing like a plant - or structured like a plant

FEW CASE STUDIES FROM BROAD CATEGORIES FIRST

e.g.

TREE
SHRUB
GRASS
VINE

deciduous
evergreen
    broadleaf

perennial
    herbaceous

grasses
vines
    lianas

ferns
rhizomes
bulbs
corms 
annual
biennial

CONSTRAINTS
    ZONE 8 CLIMATE

This is probably worth reading as well, since if we did decide to go with a graph database, this is the only viable one I'm aware of.
https://neo4j.com/developer/guide-data-modeling/

Determine a way to deal with source attribution in schema/datasets

Discussed with @alyjak

Find a way to represent plant data over time

Noticed this page has an interesting characteristic:

http://www.coniferkingdom.com/acer-pictum-ssp-mono-usugumo/

HXW@10YRS: 8'x4'

This concept is easy to represent over time in more regular intervals e.g.

HXW@2YRS: 3'x1'
HXW@5YRS: 4'x2'
HXW@10YRS: 8'x4'
Etc

And if a plant doesn't typically live as long, it would have a null value:

HXW@10YRS: null

Which makes it easy to search where matches are in a time range AND height range.

This would be separate from something like average or max height, or could be used that way as well.
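A minimal sketch of how the size-at-age entries could be parsed and searched by time range and height range. The `parse_hxw` and `matches` helpers and the regular expression are assumptions for illustration; only the `HXW@NYRS: H'xW'` notation comes from the page linked above.

```python
import re
from typing import Dict, Optional, Tuple

# Matches entries like "HXW@10YRS: 8'x4'" or "HXW@10YRS: null".
_PATTERN = re.compile(r"HXW@(\d+)YRS:\s*(?:(\d+)'x(\d+)'|null)")

# (height_ft, width_ft), or None when no size is recorded for that age.
Size = Optional[Tuple[int, int]]


def parse_hxw(line: str) -> Tuple[int, Size]:
    """Parse one entry into (age_in_years, size)."""
    match = _PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized size entry: {line!r}")
    years = int(match.group(1))
    if match.group(2) is None:  # the "null" branch matched
        return years, None
    return years, (int(match.group(2)), int(match.group(3)))


def matches(sizes: Dict[int, Size], min_years: int, max_years: int,
            min_height: int, max_height: int) -> bool:
    """True if any recorded age in the time range has a height in the height range."""
    return any(
        size is not None and min_height <= size[0] <= max_height
        for years, size in sizes.items()
        if min_years <= years <= max_years
    )


entries = ["HXW@2YRS: 3'x1'", "HXW@5YRS: 4'x2'", "HXW@10YRS: 8'x4'"]
sizes = dict(parse_hxw(entry) for entry in entries)

# Is the plant between 6 and 10 feet tall at some point from year 5 to year 10?
print(matches(sizes, 5, 10, 6, 10))
```

Null entries are skipped by the search rather than treated as zero height, which keeps short-lived plants from matching ranges they never reach.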

Pipfile / requirements.txt

Would be quite useful to have an authoritative list of the required packages - I see pytesseract, PIL, amongst others are imported, but I don't see a list anywhere.

Cheers!

Create document describing query requirements for db

I'm currently working on a plant database and one of the important factors in getting the SCHEMA right is making sure it can accommodate various types of questions.

There are three criteria I would use to judge "correctness":

Does it answer the question fully, assuming the data is present? If the data is not available, no schema would work, so I would ignore that as a degenerate case.
Assuming the previous, does it do so in a reasonable timeframe? If the query is too slow, it's not useful.
Is it organized in such a way that it doesn't require complex querying? E.g. if SQL, is it not so normalized that common query scenarios require a complex jumble of joins?

With those criteria in mind, it is important to enumerate all the likely scenarios. Many of them would be categorical. The few categories I have so far are below. Note that these should eventually be made into questions.

horticulture

landscape design

companion planting
xericulture
permaculture

genetics

general

climate change

what plants are most susceptible to fluctuation changes above x?
what plants have the broadest zone range?
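To make the second question concrete, here is a small sketch of what "broadest zone range" could mean as a query. The `ZoneRange` record, its field names, and the placeholder plants are hypothetical; no real hardiness data is implied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ZoneRange:
    """Hypothetical record: the lowest and highest zone a plant tolerates."""
    name: str
    zone_min: int
    zone_max: int

    @property
    def span(self) -> int:
        # Number of zones covered, counting both endpoints.
        return self.zone_max - self.zone_min + 1


def broadest(plants: List[ZoneRange]) -> List[ZoneRange]:
    """Return every plant tied for the widest zone span."""
    widest = max(p.span for p in plants)
    return [p for p in plants if p.span == widest]


# Placeholder rows, not real data.
plants = [
    ZoneRange("plant_a", 4, 9),
    ZoneRange("plant_b", 7, 8),
    ZoneRange("plant_c", 3, 8),
]

print([p.name for p in broadest(plants)])
```

If zones are stored as a min/max pair per plant, this question stays a single pass over the data, which fits the "no complex querying" criterion above.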

@nicolesimon After some discussion, I think we'll move forward with a graph model database, at least in terms of initial exploration. We can leverage document or relational databases if we feel the graph model is inappropriate, but it seems highly advantageous for the complex relationships and idiosyncrasies we have to support.

We will need to solve #6, #7, #8, and #4 as well, and possibly look at #5 to achieve this prototype.

Look into using dataclasses for native data support

There has been a lot of back and forth with defining a more abstract schema and making it work in Python. It seems like dataclasses might be the perfect solution, but they need to support complex list/object dependencies. Possibly this could use the typing module to achieve something Pythonic and native.
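A minimal sketch of how nested list/object dependencies might look with `dataclasses` and `typing`. The class and field names (`Plant`, `Cultivar`, `Source`) are hypothetical and do not reflect the schema in this repo; the example plant name is taken from the coniferkingdom URL referenced in another issue.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Source:
    """Hypothetical: where a piece of data came from."""
    name: str
    url: Optional[str] = None


@dataclass
class Cultivar:
    name: str
    # default_factory gives each instance its own list instead of a shared one.
    sources: List[Source] = field(default_factory=list)


@dataclass
class Plant:
    scientific_name: str
    common_names: List[str] = field(default_factory=list)
    cultivars: List[Cultivar] = field(default_factory=list)


plant = Plant(
    scientific_name="Acer pictum ssp. mono",
    cultivars=[Cultivar("Usugumo", [Source("coniferkingdom")])],
)

# asdict recurses through nested dataclasses and lists, giving plain dicts
# that can be dumped to JSON or loaded into another store.
record = asdict(plant)
print(record)
```

Note that type hints on dataclass fields are not enforced at runtime, so validation would still need to be handled separately.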

Document scraping and schema assembly process and requirements

Leveraging scraped data is vital to the project, but it poses many problems:

Inconsistent schema
Varying degrees of trustworthiness
Middleware required to perform ETL for later use

We need to document this in detail so we can define clear paths to mitigate or address each one.

For example, when gathering scraped data, source A may contain one set of data, while source B contains a subset or a different set entirely.

In this scenario, problems emerge:

The need to merge data into one unified schema
The need to verify data
The need to clean and transform data from one ontology to another (since all sources of data represent it in different ways for their respective needs)

There is also a possible need to provide attribution so that data trust can be assessed at a later point in time.
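One possible shape for the merge step, sketched under the assumption that each source yields a flat dict of fields. The `merge_with_attribution` and `conflicts` helpers, the field names, and the sample values are all hypothetical. Every value keeps the name of the source that reported it, so disagreements remain visible for later verification.

```python
from typing import Any, Dict, List

Claims = Dict[str, List[Dict[str, Any]]]


def merge_with_attribution(records: Dict[str, Dict[str, Any]]) -> Claims:
    """Merge per-source records into one record that keeps every claim.

    `records` maps a source name to the fields that source reported.
    Each merged field holds a list of {"value", "source"} claims, so
    conflicting values are preserved instead of silently overwritten.
    """
    merged: Claims = {}
    for source, fields in records.items():
        for key, value in fields.items():
            merged.setdefault(key, []).append({"value": value, "source": source})
    return merged


def conflicts(merged: Claims) -> List[str]:
    """Return the field names where sources reported different values."""
    return sorted(
        key
        for key, claims in merged.items()
        if len({repr(claim["value"]) for claim in claims}) > 1
    )


# Placeholder input, not real scraped data.
scraped = {
    "source_a": {"height_ft": 8, "zone_min": 5, "foliage": "deciduous"},
    "source_b": {"height_ft": 10, "foliage": "deciduous"},
}

merged = merge_with_attribution(scraped)
print(conflicts(merged))
```

This only covers merging and attribution; mapping each source's own vocabulary onto a shared ontology would need a separate transform step before records reach this function.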

Look into planteome for ontology inspiration

http://browser.planteome.org

Convert schema to graph model

Partially depends on #6