The following links are helpful for the project:
- 10 minutes to Pandas
- Beautiful Soup
- Requests
- plotly
- MovieLens Dataset
- OMDB API
- Markdown Quick Tutorial
dataset01.csv and dataset02.csv consist of 27000 entries.
For the project, we filtered the dataset to the years 1990-2014, country USA, and language English, which gives 10060 entries.
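The filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in filteringDataset.ipynb: the column names (id, year, country, language) and the function name filter_movies are assumptions, and the sketch uses only the standard library so it runs without extra dependencies.

```python
import csv
import io


def filter_movies(rows, first_year=1990, last_year=2014,
                  country="USA", language="English"):
    """Keep rows in the year range with the given country and language.

    Column names are hypothetical; adjust them to the real CSV header.
    Rows whose ID was already seen are dropped.
    """
    seen_ids = set()
    kept = []
    for row in rows:
        try:
            year = int(row["year"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric year
        if not (first_year <= year <= last_year):
            continue
        if row.get("country") != country or row.get("language") != language:
            continue
        movie_id = row.get("id")
        if movie_id in seen_ids:
            continue  # duplicate ID: keep only the first occurrence
        seen_ids.add(movie_id)
        kept.append(row)
    return kept


# Tiny in-memory sample standing in for the real CSV files.
sample = io.StringIO(
    "id,title,year,country,language\n"
    "1,A,1995,USA,English\n"
    "1,A,1995,USA,English\n"
    "2,B,1985,USA,English\n"
    "3,C,2001,France,French\n"
    "4,D,2014,USA,English\n"
)
filtered = filter_movies(csv.DictReader(sample))
print([row["id"] for row in filtered])
```

In the sample, the duplicate of ID 1, the 1985 entry, and the French entry are dropped, so only IDs 1 and 4 remain.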
- Run filteringDataset.ipynb to filter the dataset and remove duplicate IDs. After executing it, we get datasetWithoutBoxOffice.csv.
- Run extractBoxOffice.ipynb to extract the box office using the WebCrawl class in webcrawl.py. After executing it, we get datasetWithBoxOffice.csv.
  Optional (but suggested): we made 10 copies of extractBoxOffice.ipynb with 1000 entries each, and then merged all the CSVs with mergeCSV.ipynb to get datasetWithBoxOffice.csv.
  Alternatively, you can run extractBoxOfficeAllEntries.ipynb to extract the box office for all entries, but this takes a long time (hours).
- Run extractTicketInflationPrice.ipynb to extract the table of ticket price inflation by year. After executing it, we get ticketPriceInflation.csv.
- Run adjustTicketPriceInflation.ipynb. After executing it, we get finalDataset.csv.
- Run plotDataset1.ipynb and plotDataset2.ipynb to visualise the dataset.
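Two of the steps above, merging the per-chunk CSVs and adjusting the box office for ticket price inflation, can be sketched as follows. This is only an illustration under stated assumptions: the column names (id, year, box_office), the function names, and the ticket prices are made up, and the adjustment formula (scaling by the ratio of a reference year's ticket price to the release year's ticket price) is a common approach that the notebooks may or may not use.

```python
import csv
import io


def merge_csv_chunks(chunks):
    """Concatenate CSV chunks that share the same header into one row list."""
    merged = []
    for chunk in chunks:
        merged.extend(csv.DictReader(chunk))
    return merged


def adjust_for_inflation(rows, ticket_price_by_year, reference_year):
    """Scale each box office figure to reference-year ticket prices.

    Assumed formula: adjusted = gross * price[reference_year] / price[year].
    """
    reference_price = ticket_price_by_year[reference_year]
    adjusted = []
    for row in rows:
        year = int(row["year"])
        factor = reference_price / ticket_price_by_year[year]
        gross = float(row["box_office"])
        adjusted.append({**row, "adjusted_box_office": round(gross * factor, 2)})
    return adjusted


# In-memory stand-ins for two of the per-chunk output files.
chunk_a = io.StringIO("id,year,box_office\n1,1990,100.0\n")
chunk_b = io.StringIO("id,year,box_office\n2,2014,100.0\n")

# Made-up ticket prices, for illustration only.
ticket_prices = {1990: 4.0, 2014: 8.0}

merged = merge_csv_chunks([chunk_a, chunk_b])
final = adjust_for_inflation(merged, ticket_prices, reference_year=2014)
for row in final:
    print(row["id"], row["adjusted_box_office"])
```

With these illustrative prices, the 1990 figure doubles because the assumed ticket price doubled between 1990 and 2014, while the 2014 figure is unchanged.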
On Windows, use UTF-8 encoding when converting to CSV.
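As a minimal sketch of what this means in practice, the encoding can be passed explicitly when writing and reading a CSV file. The file name and the sample row below are hypothetical. If the notebooks write CSVs through pandas, the corresponding option is the encoding parameter of DataFrame.to_csv.

```python
import csv
import os
import tempfile

# Sample rows containing a non-ASCII character.
rows = [["id", "title"], ["1", "Amélie"]]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# encoding="utf-8" avoids relying on the platform default encoding,
# which on Windows is often a legacy code page rather than UTF-8.
# newline="" lets the csv module control line endings.
with open(path, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Read the file back with the same explicit encoding.
with open(path, "r", encoding="utf-8", newline="") as handle:
    read_back = list(csv.reader(handle))

print(read_back == rows)
```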
Images
- Snapshot of the final dataset
- One of the plots of the dataset