
gpalfy / socialnetworkcomments


Text Data Analysis & Machine Learning on a supermarket's non-English Social Network Comments, using limited language resources via Java, Scala & Apache Spark.

License: MIT License

Java 97.71% Shell 2.29%
social-network-analysis language-resources limited-language-resources data-analysis text-processing text-analysis nlp natural-language-processing java scala container docker-container machine-learning nlp-machine-learning zeppelin-notebook hadoop apache-spark spark spark2

socialnetworkcomments's Introduction

Social Network Comments

Data Analysis & Machine Learning on the Social Network Comments of a popular German supermarket, published by Slovak users in the Slovak language. The project uses Java Natural Language Processing (NLP) tools and the Apache Spark environment to gain insight into the topics most in demand among Slovak users. Most NLP tools target the English language, so the hardest part was finding a path through the field of limited language resources (e.g. non-English or less widely used languages) in order to analyze real Slovak language data. The project is not runnable on its own, because the aim of the published work is to give hints for processing a language with limited NLP resources, as well as for using Java / Scala on a computing cluster (a common setup in the field of Big Data).

Basic Data Analysis & Machine Learning

  • Text Analysis, e.g. word frequencies
  • Sentiment Classification, e.g. does a given user comment express positive / neutral / negative sentiment (i.e. emotion)?
  • Graph Analysis, e.g. how to discover hidden connections between words?
  • Latent Dirichlet Allocation, e.g. use unsupervised learning to explore clusters of social network comments
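
The Graph Analysis item above asks how to discover connections between words. The README does not say which method the notebooks use, so the following is only a minimal sketch of one common approach, a word co-occurrence graph: two words are linked whenever they appear in the same comment, and the edge weight counts how many comments contain both. It is written in plain Java without Spark; the class name `CooccurrenceGraph`, the `a--b` edge key format, and the sample comments are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository.
class CooccurrenceGraph {
    // Runs of Unicode letters, so Slovak letters with diacritics stay inside a word.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Returns undirected edges keyed as "wordA--wordB" (wordA < wordB),
    // weighted by the number of comments that contain both words.
    static Map<String, Integer> edges(List<String> comments) {
        Map<String, Integer> weights = new TreeMap<>();
        for (String comment : comments) {
            // Distinct words of this comment, in sorted order.
            TreeSet<String> words = new TreeSet<>();
            Matcher m = WORD.matcher(comment.toLowerCase(Locale.ROOT));
            while (m.find()) {
                words.add(m.group());
            }
            List<String> sorted = new ArrayList<>(words);
            // Every unordered pair of distinct words adds one to its edge weight.
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    weights.merge(sorted.get(i) + "--" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up sample comments, not taken from the collected data.
        List<String> comments = List.of("cena obchod", "Obchod, cena, kvalita");
        System.out.println(edges(comments));
        // prints {cena--kvalita=1, cena--obchod=2, kvalita--obchod=1}
    }
}
```

Heavier edges mark word pairs that users tend to mention together. On real data, very frequent function words would probably need to be filtered out first, otherwise they connect to almost everything.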

The Analysis & Machine Learning is developed in Zeppelin Notebooks (you can find the *.json files in the src/ directory). If you don't have Zeppelin installed on your local computer, please look at the *.pdf files in the doc/ directory.

Data Acquisition

A Python script collects comments from the most popular social network page in Slovakia (the script is not included).

Data Preprocessing

Only part of the Java data preprocessing program, which prepares non-English language data for analysis & machine learning tasks, is published. The shared Java program computes word frequencies in the Spark cluster computing environment. If you are interested in the whole program, please drop me a line.
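
To illustrate what computing word frequencies involves, here is a minimal sketch in plain Java. It deliberately leaves out Spark and Jsoup so that it runs without a cluster, which means it is not the published program; the class name `WordFrequencies`, the tokenization rule, and the sample comments are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository.
class WordFrequencies {
    // Runs of Unicode letters, so Slovak letters with diacritics stay inside a word.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Counts lower-cased words across all comments, in order of first appearance.
    static Map<String, Integer> count(List<String> comments) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String comment : comments) {
            Matcher m = WORD.matcher(comment.toLowerCase(Locale.ROOT));
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample comments, not taken from the collected data.
        List<String> comments = List.of("Super obchod, super cena", "Cena je super");
        System.out.println(count(comments));
        // prints {super=3, obchod=1, cena=2, je=1}
    }
}
```

On a Spark cluster the same idea is usually expressed as a map step that emits (word, 1) pairs followed by a reduce step that sums the counts per word.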

Used Technologies

  • Python
  • Java
  • Scala
  • Zeppelin Notebooks
  • Apache Hadoop / Apache Spark
  • Docker Containers

Dependencies

  • Data Acquisition: Python script with libraries (random, pickle, time & requests).
  • Data Preprocessing: Java program with packages (Jsoup, Spark) & local Apache Spark cluster
  • Basic Data Analysis & Machine Learning: Scala with Zeppelin Notebooks on local Apache Spark cluster

Run

Docker containers are used to run parts of the project. See the Docker/ directory, where the individual Bash scripts use different containers. I'd like to share the collected social network comments so that you can run parts of the project, but I'm not sure whether this is legal (see the General Data Protection Regulation for more detail).

License

Licensed under the MIT License.

Help

As always, improvements, issues (socialnetworkanalysis/issues) & suggestions are welcome. Feel free to connect with me ;)

Roadmap

  • add legally shareable or simulated data to make the project runnable
  • interpret results for non-technical users, e.g. the word frequencies, cluster labels, comment sentiment classes, ... are almost self-explanatory for some readers, but a precise visual representation has the potential to add more insight
  • develop unit tests
