Technical aspects of Data Science
A place to discuss the modelling and programming side of data science
Consider a scenario where you're expected to solve a very tricky probability problem but don't know how to solve it, or one where the problem requires specific domain knowledge in which you're not an expert. Monte Carlo simulation will come to your rescue in such scenarios: it is a method in which we simulate the random experiment using computational algorithms. It is usually a much simpler way to find the required probability than the theoretical (or mathematical) methods; however, it gives only an approximate answer, whose error shrinks as the number of simulated trials grows, and it can be slow and computationally expensive. Let's understand Monte Carlo simulation using examples! Let's start with one of the simplest and most commonly cited examples of Monte Carlo simulation, and once we get the hang of it, we'll solve a tricky problem using the same technique. https://www.linkedin.com/posts/fazil-mohammed-4062711b2_monte-carlo-simulations-activity-6945626961729712128-tcgU?utm_source=linkedin_share
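The post above does not say which example it has in mind, so as an assumption here is a minimal sketch of one classic case: estimating π by throwing random points at the unit square and counting how many land inside the quarter circle. The function name `estimate_pi` and the fixed seed are illustrative choices, not taken from the linked write-up.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by Monte Carlo simulation (illustrative sketch).

    Points are drawn uniformly from the unit square. The fraction that
    falls inside the quarter circle of radius 1 approximates pi / 4.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    # More samples give a tighter estimate, at the cost of more compute.
    for n in (1_000, 100_000):
        print(n, estimate_pi(n))
```

The trade-off mentioned in the post shows up directly: the estimate only tightens slowly as `n_samples` grows, roughly in proportion to one over the square root of the number of samples.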
❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️ Research Collection Thread. Thread of the coolest papers in forecasting ⛄️🌨️🌨️ Add any hot forecasting research you come across! 🔥🔥🔥 Please add a TLDR, a link to the paper, and ideally any relevant code base. ❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️❄️
The First Principal Component. PCA is the most popular dimensionality reduction technique (IMHO it is basically a data transformation technique: viewing the same data using a better choice of coordinate system). In this write-up, we're going to clear up a common misconception about the very first principal component. Almost every lecture or article explaining PCA refers to the first principal component as the 'line of best fit' for the data (not that I completely disagree). The statement is usually made as a passing remark, but 'line of best fit' is a phrase we commonly use in the context of OLS. Do they actually mean it in that sense? (Some people do take it in the sense of the OLS method, which is wrong on so many levels.) The statement warrants a clear explanation, without which it can lead to serious misconceptions. Let's first see where this particular idea stems from. You can find the full write-up and notebook here: https://www.linkedin.com/posts/fazil-mohammed-4062711b2
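To make the distinction concrete, here is a small sketch (not taken from the linked notebook) comparing the two lines on 2-D data. OLS minimises squared vertical distances, while the first principal component minimises squared perpendicular distances, so the slopes generally differ. The helper names are hypothetical, and the closed-form angle for the principal axis applies to the 2-D case only.

```python
import math


def covariance_terms(xs, ys):
    """Return (var_x, var_y, cov_xy) using population (1/n) normalisation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return var_x, var_y, cov_xy


def ols_slope(xs, ys):
    """Slope of the OLS regression of y on x (vertical residuals)."""
    var_x, _, cov_xy = covariance_terms(xs, ys)
    return cov_xy / var_x


def first_pc_slope(xs, ys):
    """Slope of the first principal component (perpendicular residuals).

    For a 2x2 covariance matrix the leading eigenvector makes the angle
    0.5 * atan2(2 * cov_xy, var_x - var_y) with the x axis. The slope is
    undefined (very large) if that direction is vertical.
    """
    var_x, var_y, cov_xy = covariance_terms(xs, ys)
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.tan(theta)


if __name__ == "__main__":
    xs = [0, 1, 2, 3, 4, 5]
    ys = [0.5, 1.2, 4.9, 5.1, 8.8, 9.5]  # made-up noisy data
    print("OLS slope:     ", ols_slope(xs, ys))
    print("First PC slope:", first_pc_slope(xs, ys))
```

The two slopes coincide only when the points lie exactly on a line; with noise, the OLS slope of y on x is pulled towards zero relative to the principal axis.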
Hey, so in R I have a tibble that is 1 x 9, and what I want to do is basically take those 9 variables and move them out of the tibble to become their own, free, variables. Aka I want to turn something like:

    df <- tibble::tibble(a = 1, b = 2, c = 3)

into

    a <- 1
    b <- 2
    c <- 3

without having to individually put each variable into a new one with

    a <- df$a
    b <- df$b
    c <- df$c

Things I have tried:

    tibble::deframe(df)
    unlist(df)

Is there a single function somewhere that can do this, or do I have to faff about?
It’s always interesting and insightful to learn how other people think about solving data science problems, and a big part of this is the tools we like to use. This thread is for a discussion of useful models, algorithms and techniques -- whatever they may be, and whatever they may be for.
Bias-Variance Decomposition is one of the most important concepts in ML. It helps in understanding the performance of a machine learning algorithm and the issue of overfitting. Here's my take on it; hopefully it will be a good refresher for you too! #machinelearning #ml #bias #variance https://www.linkedin.com/posts/fazil-mohammed-4062711b2_my-notes-on-bias-variance-decomposition-activity-6930915814606852096-Vw39?utm_source=linkedin_share&utm_medium=member_desktop_web
I’m wondering if anyone has done any research into, or had any success in, accounting for inflation in their price elasticity modelling? Given the sharp rise in fuel, energy and food prices over the past few months, an item that was once quite elastic may now be very elastic, especially when considering luxury items, i.e. a small price increase on a luxury product in today’s financial climate may result in a greater drop in demand than it would have previously. If you build an elasticity model on many years’ worth of historical sales data, should you weight recent data more favourably to account for this sort of behaviour? Is there an alternative way to account for inflation?
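As a rough illustration of the two ideas raised in the question, here is a sketch, assuming a constant-elasticity (log-log) demand model. It deflates nominal prices by a price index, and it fits the elasticity with exponentially decaying weights so that recent periods count more. All function names, the half-life parameter and the choice of weighting scheme are assumptions for illustration, not an established recipe.

```python
import math


def deflate(nominal_prices, price_index):
    """Convert nominal prices to real prices (index = 1.0 in the base period)."""
    return [p / i for p, i in zip(nominal_prices, price_index)]


def recency_weights(n_periods, half_life):
    """Weights for periods ordered oldest to newest.

    The newest period gets weight 1.0 and the weight halves every
    `half_life` periods going back in time.
    """
    return [0.5 ** ((n_periods - 1 - t) / half_life) for t in range(n_periods)]


def weighted_elasticity(prices, quantities, weights):
    """Weighted least-squares slope of log(quantity) on log(price)."""
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    return cov / var


if __name__ == "__main__":
    # Synthetic data: elasticity shifts from -1 to -2 halfway through.
    prices = [1.0 + 0.05 * (t % 20) for t in range(40)]
    demand = [100.0 * p ** (-1.0 if t < 20 else -2.0) for t, p in enumerate(prices)]
    print("pooled:  ", weighted_elasticity(prices, demand, [1.0] * 40))
    print("weighted:", weighted_elasticity(prices, demand, recency_weights(40, 2)))
```

A short half-life tracks recent behaviour more closely but leaves fewer effective observations, so the estimate becomes noisier. The half-life is something to tune against held-out recent data rather than fix in advance.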
Interesting paper on a Netflix case study of recommender systems: "Deep Learning for Recommender Systems: A Netflix Case Study". Even though many deep-learning models can be understood as extensions of existing (simple) recommendation algorithms, significant improvements in performance over well-tuned non-deep-learning approaches were not observed. Only when numerous features of heterogeneous types were added to the input data did deep-learning models start to shine. Netflix have a technical blog with some interesting articles at https://netflixtechblog.medium.com/
Seems like to read data from Redshift I have (at least) 2 options: DBI::dbReadTable(connection, 'table-name') vs dplyr::tbl(connection, 'table-name'). The latter doesn’t read the whole table into memory, so you can perform all sorts of dplyr goodness on it before using collect() to read it into memory. Why would you ever use dbReadTable() in that case?
Are neural networks or deep networks inherently better than classical machine learning algorithms? Before joining Peak as a Data Scientist, my answer would have been: definitely yes, 100%. That opinion was shaped by the fact that my previous experience was only in deep neural networks, and I only learnt about classical machine learning algorithms in my Masters in Machine Learning. It was also heavily influenced by the fact that the news I read on machine learning advancements was about AlphaGo, DeepFakes, and Natural Language Processing. I thought neural network algorithms were king, and even used one in my data challenge for the Graduate Scheme Assessment day. Since starting at Peak, I have learnt that neural networks are not all that in the current landscape of Decision Intelligence. Many of my colleagues have built incredible solutions for customers, from Inventory Optimisation and Price Optimisation to Forecasting Demand, and much more. Some have even tried to implement deep neural network algori
Wait, am I the only one who didn't know scikit-learn and XGBoost can do continual learning?! Continual learning is where you update a model as new data comes in, rather than retraining the model from scratch (old data + new data). This is great if you're dealing with huge data where retraining on the entire dataset would be computationally painful. Has anyone used this before? What cool uses could this unlock in your projects?
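In scikit-learn, incremental updates are exposed through the `partial_fit` method on estimators such as `SGDClassifier`. To show what such an update does without depending on any library, below is a hand-rolled, stdlib-only stand-in: a tiny logistic regression whose `partial_fit` takes one pass of gradient steps over each new batch. The class name and hyperparameters are hypothetical, and this is a sketch of the idea rather than how scikit-learn or XGBoost implement it.

```python
import math
import random


class OnlineLogisticRegression:
    """Minimal logistic regression that learns batch by batch."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) > 0.5 else 0

    def partial_fit(self, X, y):
        """Update the existing weights using only the new batch."""
        for x, target in zip(X, y):
            error = self.predict_proba(x) - target
            self.weights = [
                w - self.learning_rate * error * xi
                for w, xi in zip(self.weights, x)
            ]
            self.bias -= self.learning_rate * error
        return self


def make_batch(rng, size):
    """Synthetic batch: one feature in [-1, 1], label 1 when it is positive."""
    X = [[rng.uniform(-1.0, 1.0)] for _ in range(size)]
    y = [1 if row[0] > 0 else 0 for row in X]
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    model = OnlineLogisticRegression(n_features=1)
    for _ in range(10):  # each new batch updates the model; old data is never revisited
        model.partial_fit(*make_batch(rng, 200))
    X_test, y_test = make_batch(rng, 500)
    accuracy = sum(model.predict(x) == t for x, t in zip(X_test, y_test)) / 500
    print("accuracy:", accuracy)
```

The appeal is that memory and compute per update depend on the batch size, not on the full history. The catch is that the model can drift towards whatever the most recent batches look like.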
Most time series models assume that (conditional on the model) the “future will resemble the past”. In slightly more specific terms, this involves assuming that the “properties” of the training data relevant to the model will be the same as those encountered by the model during production (or equivalently that “the data generating mechanism” won’t change in ways relevant to the model between encountering training and production data). An important preprocessing step when performing any kind of time series modelling is consequently to handle time periods in the training data which are in some relevant way “not representative”. This could mean removing certain dates from the data entirely, using indicator variables to help the model distinguish these non-representative periods, or using some kind of imputation technique. The most appropriate way of dealing with such non-representative data will depend on the structure of the data itself (sampling frequency, distribution of valu
I’m having some trouble printing the logger in jupyter. I tried

    import logging
    import sys  # needed for sys.stdout below
    from orion.utils.logging import IO_LOG

    IO_LOG.setLevel(logging.DEBUG)
    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

and

    logging.basicConfig(filename='example.log', filemode='w', level=logging.DEBUG)

but I don’t get anything!
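One common cause worth ruling out (an assumption, since the post does not show the rest of the notebook): `logging.basicConfig` silently does nothing if the root logger already has handlers, which is often the case once something else in the session has configured logging. From Python 3.8 onwards, passing `force=True` removes the existing root handlers first. A minimal sketch, with a hypothetical helper name:

```python
import logging
import sys


def enable_notebook_logging(level=logging.DEBUG):
    """Send all log records at `level` or above to stdout.

    force=True (Python 3.8+) drops any handlers already attached to the
    root logger, so this call takes effect even if logging was
    configured earlier in the session.
    """
    logging.basicConfig(
        stream=sys.stdout,
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,
    )
    return logging.getLogger()


if __name__ == "__main__":
    enable_notebook_logging()
    # "demo" is a placeholder logger name for illustration.
    logging.getLogger("demo").debug("logging is now visible")
```

If a library logger still stays silent after this, it may have `propagate` set to `False` or carry its own handlers, in which case the root configuration never sees its records.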
I’m interested in knowing how people typically save their plots from RStudio. Are there any ways you’ve found to be super quick and easy, or alternatively anything that should be avoided? E.g. maybe a certain way always skews the dimensions. Would be great if you could vote on the poll below so I can see what the most commonly used method is!
A common tactic you hear in data science is: “start by building a simple model, then build a more complicated one and see if it improves performance”. For example, start by building a logistic classifier, then perhaps see if a Random Forest performs better. But how much improvement can we expect? If our logistic regression gets 60% accuracy, and our random forest gets 90% accuracy, is that normal or has something gone wrong? I’m interested in everyone’s experiences: when you’ve done this “simple model → complex model” tactic, how big a performance boost did you see? Did you see any at all?!
Welcome to the Programming in Data Science part of the Community 👋 Some ideas for discussion for this section: new technologies that people are using and finding useful in their data science work; questions on how to use a particular programming language or technology, for example how to use ggplot in R for making charts or how to use Docker; sharing of resources you find elsewhere that might be of use to the community; and anything else relevant to programming in data science!