plantstuff's Introduction

plantstuff

Utilities for retrieving, computing, organizing, and creating plant/horticulture data from various sources.

DISCLAIMER

This repo is a work in progress. The ultimate goals are not yet defined, so everything is still very messy and NOT production ready in any capacity (whatever that means here).

LEGAL COPYRIGHT DISCLAIMER

No scraped data is stored here until content copyright is verified.

Scraping spiders

The following Scrapy spiders have been created and tested to generate real, uniform data. All spiders are under scraping.scrapers.spiders:

  • theplantlist
  • springhillnursery
  • provenwinners
  • wikipedia (basic categorical lists for now)

Note: unless otherwise noted, these are not considered exhaustive, but they typically retrieve most URLs and handle pagination.

Works in progress

  • Monrovia
  • Perennials.com
  • Plantlust

plantstuff's People

Contributors

christabor

plantstuff's Issues

Find a good graph modeling software

This will be used to visually model the domain, but we also need to seamlessly export the model and test it using a graph database.

The data file also needs to be stored in git, hence the need for readable text in a common format.

Requirements

  • Needs to be free (no trial)
  • Performant (at least 1000s of nodes, edges, labels per graph)
  • Allows exporting/importing to a parseable, non-binary format (e.g. JSON)

Nice to have

  • Supports exporting to .dot graphviz specification
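
To make the export requirements concrete, here is a minimal sketch of what a git-friendly text export could look like, using only the Python standard library. The node names, the HAS_HABIT relation, and both function names are assumptions made up for illustration; they are not taken from any tool under evaluation.

```python
import json

# Hypothetical toy graph: node ids map to labels,
# edges are (source, target, relation) triples.
nodes = {"tree": "TREE", "deciduous": "deciduous", "evergreen": "evergreen"}
edges = [("tree", "deciduous", "HAS_HABIT"), ("tree", "evergreen", "HAS_HABIT")]


def to_json(nodes, edges):
    """Serialize the graph as sorted, indented JSON so git diffs stay readable."""
    doc = {
        "nodes": [{"id": k, "label": v} for k, v in sorted(nodes.items())],
        "edges": [{"source": s, "target": t, "relation": r} for s, t, r in edges],
    }
    return json.dumps(doc, indent=2, sort_keys=True)


def to_dot(nodes, edges):
    """Render the same graph in Graphviz DOT syntax."""
    lines = ["digraph plants {"]
    for node_id, label in sorted(nodes.items()):
        lines.append(f'  "{node_id}" [label="{label}"];')
    for s, t, r in edges:
        lines.append(f'  "{s}" -> "{t}" [label="{r}"];')
    lines.append("}")
    return "\n".join(lines)


print(to_json(nodes, edges))
print(to_dot(nodes, edges))
```

Sorting the nodes and keys keeps the output stable between runs, which matters if the file is going to be versioned.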

References/research

Create initial proof-of-concept for basic plant categories using mind maps

@nicolesimon

E.g.

https://www.google.com/search?biw=960&bih=529&tbm=isch&sa=1&ei=gjvIWv7dCazdjwTV-LPYCg&q=graph+database+design&oq=graph+database+design&gs_l=psy-ab.3..0i30k1j0i24k1l4.47272.48014.0.48154.16.7.0.0.0.0.124.748.3j4.7.0....0...1c.1.64.psy-ab..14.1.124....0.q7BL_GghzUg#imgrc=Xq_unDoCS9XA5M:

From notes (disorganized)

  1. database growing like a plant - or structured like a plant

FEW CASE STUDIES FROM BROAD CATEGORIES FIRST

e.g.

TREE
SHRUB
GRASS
VINE

deciduous
evergreen
    broadleaf

perennial
    herbaceous

grasses
vines
    lianas

ferns
rhizomes
bulbs
corms 
annual
biennial

CONSTRAINTS
    ZONE 8 CLIMATE

This is probably worth reading as well, since if we did decide to go with a graph database, this is the only viable one I'm aware of.
https://neo4j.com/developer/guide-data-modeling/

Find a way to represent plant data over time

Noticed this page has an interesting characteristic:

http://www.coniferkingdom.com/acer-pictum-ssp-mono-usugumo/

HXW@10YRS: 8'x4'

This concept is easy to represent over time in more regular intervals e.g.

HXW@2YRS: 3'x1'
HXW@5YRS: 4'x2'
HXW@10YRS: 8'x4'
Etc

And if a plant doesn't typically live as long, it would have a null value:

HXW@10YRS: null

This makes it easy to search for matches that fall within both a time range AND a height range.

This would be separate from something like average or max height, though it could be used that way as well.
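
A rough sketch of how this could look in Python, assuming sizes are stored per age in years with None standing in for the null case. GrowthRecord, find_by_height, and the second plant are hypothetical names and data for illustration; the first plant reuses the example figures above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Size = Tuple[float, float]  # (height in feet, width in feet)


@dataclass
class GrowthRecord:
    name: str
    # Age in years -> size, or None when the plant typically does not live that long.
    size_at: Dict[int, Optional[Size]] = field(default_factory=dict)


def find_by_height(records: List[GrowthRecord], min_age: int, max_age: int,
                   min_height: float, max_height: float) -> List[str]:
    """Return names of plants with a recorded height in range at some age in range."""
    matches = []
    for rec in records:
        for age, size in rec.size_at.items():
            if size is None:
                continue  # null entry: no size at this age
            if min_age <= age <= max_age and min_height <= size[0] <= max_height:
                matches.append(rec.name)
                break
    return matches


records = [
    GrowthRecord("example maple", {2: (3, 1), 5: (4, 2), 10: (8, 4)}),
    GrowthRecord("example short-lived plant", {2: (2, 1), 5: (3, 2), 10: None}),
]

# Plants between 6 and 10 feet tall at exactly 10 years.
print(find_by_height(records, 10, 10, 6, 10))
```

Null entries are skipped, so a short-lived plant never matches a query for an age it does not reach.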

Pipfile / requirements.txt

Would be quite useful to have an authoritative list of the required packages - I see pytesseract, PIL, amongst others are imported, but I don't see a list anywhere.

Cheers!

Create document describing query requirements for db

I'm currently working on a plant database and one of the important factors in getting the SCHEMA right is making sure it can accommodate various types of questions.

There are three criteria I would use to judge "correctness":

  • Does it answer the question fully, assuming the data is present? Obviously, if the data is not available, no schema would work, so I would ignore that as a degenerate case.

  • Assuming the previous, does it do so in a reasonable timeframe? If the query is too slow, it's not useful.

  • Is it organized in such a way that it doesn't require complex querying? E.g. if SQL, is it not so normalized that common query scenarios require a complex jumble of joins?

With those criteria in mind, it is important to enumerate all the likely scenarios. Many of them would be categorical. The few categories I have so far are below. Note that these should eventually be made into questions.

horticulture

landscape design

  • companion planting
  • xericulture
  • permaculture

genetics

general

climate change

  • what plants are most susceptible to fluctuation changes above x?
  • what plants have the broadest zone range?
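
As a rough illustration of how one of these questions could translate into a query, the sketch below answers "broadest zone range" over an in-memory mapping. The data and the function name are hypothetical, and a real schema would of course run this against the database rather than a dict.

```python
from typing import Dict, List, Tuple

# Hypothetical data: plant name -> (lowest hardiness zone, highest hardiness zone).
zone_ranges: Dict[str, Tuple[int, int]] = {
    "example plant A": (3, 9),
    "example plant B": (5, 8),
    "example plant C": (2, 8),
}


def broadest_zone_range(zone_ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return every plant tied for the widest span between lowest and highest zone."""
    if not zone_ranges:
        return []
    widest = max(high - low for low, high in zone_ranges.values())
    return sorted(
        name for name, (low, high) in zone_ranges.items() if high - low == widest
    )


print(broadest_zone_range(zone_ranges))
```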

@nicolesimon After some discussion, I think we'll move forward with a graph model database, at least in terms of initial exploration. We can fall back on document or relational databases if a graph model proves inappropriate, but it seems highly advantageous for the complex relationships and idiosyncrasies we have to support.

We will need to solve #6, #7, #8, and #4 as well, and possibly look at #5, to achieve this prototype.

Look into using dataclasses for native data support

There has been a lot of back and forth over defining a more abstract schema and making it work in Python. It seems like dataclasses might be the perfect solution, but they need to support complex list/object dependencies. Possibly this could use the typing module to achieve something Pythonic and native.
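
A minimal sketch of the idea, assuming nested and self-referencing structures are what "complex list/object dependencies" means here. The class names, fields, and values are hypothetical and not part of any agreed schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ZoneRange:
    low: int
    high: int


@dataclass
class Plant:
    botanical_name: str
    common_names: List[str] = field(default_factory=list)
    zones: Optional[ZoneRange] = None
    # Forward reference as a string so a Plant can list other Plants.
    companions: List["Plant"] = field(default_factory=list)


# Placeholder values for illustration only.
plant_b = Plant("Exemplum minor", common_names=["example plant B"])
plant_a = Plant(
    "Exemplum major",
    common_names=["example plant A"],
    zones=ZoneRange(low=5, high=8),
    companions=[plant_b],
)

# asdict() recurses into nested dataclasses and lists, giving plain dicts
# that can be dumped to JSON or loaded into a database.
print(asdict(plant_a))
```

Using default_factory for the list fields avoids the shared mutable default problem, which dataclasses would otherwise reject with an error.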

Document scraping and schema assembly process and requirements

Leveraging scraped data is vital to the project, but it poses many problems:

  • Inconsistent schema
  • Varying degrees of trustworthiness
  • Middleware required to perform ETL for later use

We need to document this in detail so we can define clear paths to mitigate or address each one.

For example, when gathering scraped data, source A may contain one set of data, while source B contains a subset of it, or a different set entirely.

In this scenario, problems emerge:

  • The need to merge data into one unified schema
  • The need to verify data
  • The need to clean and transform data from one ontology to another (since all sources of data represent it in different ways for their respective needs)

There may also be a need to provide attribution so that data trust can be assessed at a later point in time.
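
One possible shape for the merge step is sketched below, assuming each source record has already been mapped onto unified field names. Every value keeps the name of the source it came from, so disagreements stay visible and trust can be judged later. The function names, field names, and records are all hypothetical.

```python
from typing import Any, Dict, List


def merge_records(sources: Dict[str, Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Merge per-source records into field -> list of {value, source} claims."""
    merged: Dict[str, List[Dict[str, Any]]] = {}
    for source_name, record in sources.items():
        for field_name, value in record.items():
            if value is None:
                continue  # the source had nothing to say about this field
            merged.setdefault(field_name, []).append(
                {"value": value, "source": source_name}
            )
    return merged


def conflicting_fields(merged: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """List the fields where sources disagree and verification is needed."""
    return [
        name
        for name, claims in merged.items()
        if len({repr(claim["value"]) for claim in claims}) > 1
    ]


source_records = {
    "source_a": {"botanical_name": "Example plant", "height_ft": 8, "width_ft": 4},
    "source_b": {"botanical_name": "Example plant", "height_ft": 10, "bloom_color": None},
}

merged = merge_records(source_records)
print(merged["height_ft"])
print(conflicting_fields(merged))
```

Keeping every claim instead of picking a winner at merge time defers the trust decision, at the cost of a more verbose record.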
