Marcin Kosiński
a biostatistical background,
working in IT and keen on
FSelectorRcpp on CRAN
Mar 14, 2017 • Marcin Kosiński
FSelectorRcpp, an Rcpp (free of Java/Weka) implementation of FSelector's entropy-based feature selection algorithms with sparse matrix support, has finally arrived on CRAN after a year of development. It also comes with a parallel backend.
3rd Bday of Warsaw R Enthusiasts Group
This Thursday the Warsaw R Enthusiasts Group (in Polish: Spotkania Entuzjastów R, SER for short, which means cheese) celebrated its 3rd birthday! Check this post to find out what we prepared for this special occasion.
Use switch() instead of ifelse() to return a NULL
Have you ever tried to return a NULL with the ifelse() function? This function is a simple vectorized workflow for conditional statements. However, one can’t just return a NULL value as a result of this evaluation. Check a tricky workaround solution in this post.
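A minimal sketch of the problem and of one possible workaround, not necessarily the exact trick from the post. The helper name `pick_value()` and the string values are made up for illustration.

```r
# ifelse() fills a result vector of the same length as `test`,
# so a zero-length value such as NULL cannot be placed into it.
res <- tryCatch(
  ifelse(TRUE, NULL, "fallback"),
  error = function(e) "error"   # ifelse() fails with "replacement has length zero"
)

# switch() returns the selected branch unchanged, so NULL survives.
# pick_value() is a hypothetical helper; `flag` must be a single TRUE or FALSE.
pick_value <- function(flag) {
  switch(as.character(flag),
         "TRUE"  = NULL,
         "FALSE" = "custom")
}

is.null(pick_value(TRUE))   # TRUE
pick_value(FALSE)           # "custom"
```

For a single condition, a plain `if (flag) NULL else "custom"` gives the same result; `switch()` becomes handy when there are more than two branches.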
Comparing (Fancy) Survival Curves with Weighted Log-rank Tests
We have just added weighted Log-rank tests to the survminer package, thanks to survMisc::comp. What are they and why are they useful? Read this blog post to find out. I used ggthemr to make the presentation a little bit more bizarre.
How successful can an R meetup be? meet(R) in Tricity! – RSelenium and Big Data processing
On Thursday (12.01.2017) we had a chance to attend the very first TriCity R Users Group (Pomerania, Poland) meeting. The meetup was unexpectedly successful! The success can be measured by the time attendees spent on lively comments and questions after each of the two superb presentations. After every 20-25 minute presentation we could observe a 30-minute lively discussion! It is amazing that the questions lasted longer than the presentations. Is it thanks to the climate? Is it due to the nature of the Pomeranian community? Perhaps it is due to excellent organization? In this post I present a summary of the meeting, describe the presentations and reveal the organizers’ identities.
When You Went too Far with Survival Plots During the survminer 1st Anniversary
We are celebrating the 1st anniversary of survminer’s release on CRAN! To mark the occasion I have prepared the most customized (uber platinum) survival plot that I could imagine. I went too far, because it took over thirty parameters to create the graph.
Entropy Based Image Binarization with imager and FSelectorRcpp
Image processing and computer vision have gained significant interest in the last two decades. Image analysis can be used to detect objects or people in photos and videos. It is widely used in medicine to detect cancer tissues and to improve the diagnostics of brain, lung and heart diseases. Computer automation has made it possible to analyze terabytes of image data, which helps improve our quality of life and provides insights for business decisions. In this post I present basic operations that can be applied to a simple image, all thanks to the imager package, which truly impressed me. I also present a quick entropy-based approach to image binarization, which converts greyscale images to black-and-white output.
Controlling Expenses on Ali Express with RSelenium
Due to the increasing interest in the Internet and its rising number of users, one can notice a surprising growth in the demand for analyzing data and information left on the Internet by users and for users. Many companies and institutions base their business decisions on extensive research of social media portals and Internet forums, where users leave reviews of various products and brands. Not only the analysis itself, but also the ability to obtain data from the Internet, is a key part of the puzzle…
Determine optimal cutpoints for numerical variables in survival plots
A frequent request in biostatistical research is to group patients according to continuous explanatory variables. In some cases the requirement is to test the overall survival of subjects who carry a mutation in a specific gene and have high expression (overexpression) of another given gene. To visualize differences in the Kaplan-Meier estimates of survival curves between groups, the continuous variable must first be discretized. Problems caused by the categorization of continuous variables are well known and widespread (Harrell, 2015), but in this case a simple discretization is required. In this post I present the maxstat (maximally selected rank statistics) statistic for determining the optimal cutpoint for continuous variables, which was provided in the survminer package by Alboukadel Kassambara (kassambara).
News from archivist 2.0 at the eRum2016 conference
Ten days ago the eRum2016 conference (European R Users Meeting 2016) finished. It was a big event that attracted over 250 attendees, both from academia and business. The meeting was a superb chance to listen to amazing keynotes by speakers like Heather Turner, Katarzyna Stapor, Rasmus Bååth, Jakub Glinka, Ulrike Grömping, Przemyslaw Biecek, Romain Francois, Marek Gagolewski, Matthias Templ and Katarzyna Kopczewska. A big thank you goes to the entire organizing committee, and especially to dr Maciej Beręsewicz (head)! There were ten workshops, two packages sessions, two data workflow sessions, three methodology sessions, one BioR session, two business sessions, lightning talks, a poster session and of course a superb welcome paRty. I could not miss the chance to present news from the latest release (ver 2.0) of our archivist package.
Mini R Quizzes
Data Science is sometimes a stressful and responsible job… so we spontaneously organize mini R quizzes in our R team at the company!
The Beauty Face of R – Warsaw R-Ladies
The very first Warsaw R-Ladies workshops were held yesterday! Over 100 R-Ladies registered for the event! In this post I present notes from this meeting and provide some pictures of the beautiful side of R. You can also learn a bit about the gender gap in the R community.
Rocker – explanation and motivation for Docker containers usage in applications development
"What is R?" I was asked at the end of my presentation at the 10th Cracow R Users Meetup, held last Friday (30.09.2016). I felt strange, but confirmed that R is the language of Data Science and is designed for statistical data analysis. Later I found out that a few of the listeners had come to the meetup to hear more about Docker than about R, as my topic was Rocker – explanation and motivation for Docker containers usage in applications development. In this post I present an overview of my presentation. If you are not familiar with using Docker in R application development, then this is a must-read for you!
Warsaw R Enthusiast Meetups Season Finale
The Warsaw R and Data Analytics Enthusiast group is an effort that aims at integrating users of the R language in Warsaw, Poland. Our group has over 970 members on its meetup page. In this post I provide a summary of our group and of our last two meetups before the end of this season, and I present plans for the future. Check who we are, what we talk about, what our future meetings are about and how you can become a member or a co-organizer of Warsaw R Enthusiasts events.
Monitoring R Applications with RZabbix
As R users we mostly perform analyses, produce reports and create interactive shiny applications. Those are rather one-off tasks. Sometimes, however, the R developer enters the world of real software development, where R applications should be distributed and maintained on many machines. Then one truly appreciates the value of decent application monitoring. In this post I present my latest package, called RZabbix, the R interface to the Zabbix API data.
What Every R Package Must (Truly) Contain? An Example on the eRum2016 Package
R package development is an elaborate process of creating (mostly) useful software that will (most likely) be used by other users. This means the provided tool should be robust, reliable, well tested and decently documented. Developers working in many different languages have invented various approaches to improve software development, documentation and package testing. R users have adopted a few of them, and mostly we use travis for continuous integration, roxygen2 for documentation, devtools for testing and knitr / rmarkdown for writing manuals, tutorials, vignettes and package websites. This software development kit makes the R package structure rather broad, especially since many of us (R developers) put source code written in other languages into the package root to speed up the performance of the tools we create. Moreover, we build our software libraries on top of other packages, which complicates the NAMESPACE of the finished package and requires understanding the difference between dependent, imported and suggested packages. In this entire ecosystem of development tools and requirements for a decent package structure I’ve been asked: What Every R Package Must Contain? You wouldn’t guess how easy the answer was.
Extending sparklyr to Compute Cost for K-means on YARN Cluster with Spark ML Library
Machine and statistical learning wizards are becoming ever more eager to perform analyses with the Spark ML library whenever possible. It’s fancy, posh, spicy and gives the feeling of doing state-of-the-art machine learning and being up to date with the newest computational trends. It is even more sexy and powerful when computations can be performed on an extraordinarily large computation cluster: say, 100 machines on a YARN Hadoop cluster make you a real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will turn you from a regular R user into a supa! data scientist who can invoke Scala code to run machine learning algorithms on a YARN cluster straight from RStudio! Moreover, I present how I have extended the interface to the K-means procedure, so that it is now also possible to compute the cost for that model, which might be helpful in determining the number of clusters in segmentation problems. Thought about learning Scala? Leave it and use sparklyr!
BioC 2016 Conference Overview and Few Ways of Downloading TCGA Data
A few weeks ago I had the great pleasure of attending the BioC 2016: Where Software and Biology Connect conference at Stanford, where I learned a lot! It wouldn’t have been possible without the scholarship that I received from Bioconductor (the organizers), which I deeply appreciate. It was an excellent place for software developers, statisticians and biologists to exchange their practices and to better explain their work, as understanding between collaborators in interdisciplinary teams is essential. In this post I present my thoughts and feelings about the event and share the knowledge that I gained there, i.e. the many ways of downloading The Cancer Genome Atlas data.
LDAvis Show Case on R-Bloggers
Text mining is a new challenge for machine learning practitioners. The increased interest in text mining is caused by the growing number of internet users and by the rapid growth of internet data, 80% of which is said to be text. Extracting information from articles, news, posts and comments has become a desirable skill, but what is needed even more are tools for diagnostics and visualizations of text mining models. Such visualizations make it easier to understand the insights from a model and provide an easy interface for presenting your research results to a wider audience. In this post I present the Latent Dirichlet Allocation text mining model for classifying text into topics, and the great LDAvis package for interactive visualizations of topic models. All this on R-Bloggers posts!
Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms
Feature selection is a process of extracting valuable features that have a significant influence on the dependent variable. This is still an active field of research and machine learning. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta and the entropy-based filter from the FSelectorRcpp (free of Java/Weka) package. Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages.
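To illustrate what an entropy-based filter measures, here is a minimal base R sketch of information gain for categorical features. It is a hand-written illustration, not the FSelectorRcpp implementation; the function names `entropy()` and `info_gain()`, the toy data and the choice of base-2 logarithms (bits) are assumptions made for this example.

```r
# Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain: H(target) - H(target | feature).
info_gain <- function(feature, target) {
  groups <- split(target, feature)
  conditional <- sum(sapply(groups, function(g) {
    length(g) / length(target) * entropy(g)   # group weight * entropy within group
  }))
  entropy(target) - conditional
}

# Toy data (made up): one feature determines the target, the other does not.
target      <- c("yes", "yes", "no", "no")
informative <- c("a", "a", "b", "b")
noise       <- c("a", "b", "a", "b")

info_gain(informative, target)  # 1
info_gain(noise, target)        # 0
```

A filter of this kind ranks features by such a score and keeps the top ones, which is why it is cheap compared with wrapper methods that refit a model many times.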
R Hero saves Backup City with archivist and GitHub
Have you ever suffered because of the impossibility of reproducing graphs, tables or analysis results in R? Have you ever been bothered by not being able to share R objects (i.e., plots or final analysis models) within your reports, posters or articles? Or maybe you simply have too many objects to store in a convenient and handy way? Now you can share partial results of an analysis, provide hooks to valuable R objects within articles, manage analysis results and restore objects’ pedigree with the archivist package and its extension archivist.github, all automatically through GitHub without closing RStudio. If you are tired of archiving results by yourself, then read how you can become an R Hero with the power of the archivist.github package.
Survival plots have never been so informative
Hadley Wickham’s ggplot2 version 2.0 revolution at the end of 2015 triggered many crashes in dependent R packages, which eventually led to the removal of a few packages from The Comprehensive R Archive Network. It turned out that the survMisc package was removed from CRAN on 27th of January 2016, and the R world was left defenseless when it came to elegant visualizations of survival analysis. Then a new tool, the survminer package created by Alboukadel Kassambara, appeared on the R survival scene to fill the gap in visualizing the Kaplan-Meier estimates of survival curves in an elegant, grammar-of-graphics-like way. This blog post presents the main features of the core ggsurvplot() function from the survminer package, which creates the most informative, elegant and flexible survival plots that I have seen!
R 3.3.0 is another motivation for Docker
Have you ever encountered R package versioning issues, when one application required different versions of dependent packages than another? Have you ever got stuck with your project because of the wrong pre-installed software versions on the machine on which you should run your code? Or maybe you had quite an adventure installing R software on a new machine because you couldn’t recall all the installation steps, like: what did I do two years ago so that RCurl works on my local machine, but I can’t install it now on my virtual machine with Windows? Or maybe installing your R project on a new machine was easy for you, but the admin couldn’t manage the process, as he’s not a regular R user? If you ever find it problematic to move your R applications to other machines, then this Docker guide post is for you!
RTCGA factory of R packages – Quick Guide
Yesterday we were given a new version of R: R 3.3.0 (codename Supposedly Educational). This enabled Bioconductor (yes, not all packages are distributed on CRAN) to release its new version 3.3. This means that all packages held on Bioconductor that were under rapid and vivid development have been moved to stable-release versions and can now be easily installed. This happens once or twice a year. On that date I finished work on the RTCGA package and released, on Bioconductor, the RTCGA Factory of R Packages. Read this quick guide to find out more about this R Toolkit for Biostatistics that uses data from The Cancer Genome Atlas study.
Improve your shiny dashboard with Disqus panel
Getting user feedback is always a pleasant moment. In most cases in the World of Open Source we create tools and applications for people, and we love to hear that someone thinks our (generally pet) project is useful. Mostly this moment is nicer than any paycheck. Check this post to find out how to provide an easy interface for collecting user feedback!
Answers to FAQ about SparkR for R users
Many people keep asking me whether I have tried SparkR, whether it is worth using, whether it is sexy or WHAT it is at all. I felt that creating a list of frequently asked questions (FAQ) on the topic of WHAT is that Spark/SparkR? would help many R Scientists to understand this Big Data buzz-tool. I have gathered information from the documentation and some code from stackoverflow questions to prepare the list below.