snap-research / arielai_youtube_3d_hands
A dataset for 3D hand reconstruction in the wild.
License: Other
Thanks for sharing the great dataset and related work.
I have some doubts about the 3D mesh format. In load_db.py, the function just plots the projected mesh points using the mesh vertices' x and y, e.g. plt.plot(vertices[:, 0], vertices[:, 1], 'o', color='green', markersize=1) in line 56, and the paper also indicates that
"meshes in the image coordinate system is better than pretraining in the canonical frame and estimating camera parameters."
So, does the annotation JSON store the vertices in (u, v, d) format, where d is a scaled depth, instead of (x, y, z) in the camera coordinate system? If so, does the network output follow the same (u, v, d) format? And could you please share the method you use to normalize the ground-truth meshes during training? (A sketch of the kind of normalization I mean is below.)
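For context, this is the kind of normalization I had in mind (purely my own assumption about a common scheme, not necessarily what the paper does): express (u, v) relative to the hand crop and make the depth root-relative and range-normalized.

```python
import numpy as np

def normalize_uvd(vertices_uvd, bbox, crop_size=224, root_idx=0):
    """vertices_uvd: (N, 3) vertices as (u, v, d); bbox: (x_min, y_min, x_max, y_max) of the hand crop."""
    u, v, d = vertices_uvd[:, 0], vertices_uvd[:, 1], vertices_uvd[:, 2]
    x_min, y_min, x_max, y_max = bbox
    scale = crop_size / max(x_max - x_min, y_max - y_min)
    u = (u - x_min) * scale                                 # pixels in the resized crop
    v = (v - y_min) * scale
    d = (d - d[root_idx]) / (d.max() - d.min() + 1e-8)      # root-relative, range-normalized depth
    return np.stack([u, v, d], axis=1)
```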
Thank you again!
Hi, the link to the dataset request form (https://forms.gle/U385D7b7Qfrig9NR9) is not available for me on Chrome. Please verify it again, thank you!
I'm currently working with your dataset.
In the annotation files, only some of the frames in each video are annotated, and the vertex coordinates are stored as 2.5D coordinates (x, y in image space).
So I'm wondering whether I can get a complete annotation file that addresses both issues (annotations for all frames, and 3D coordinates for the mesh vertices or MANO parameters).
I have followed the guidelines provided on GitHub and submitted the dataset request form as instructed. However, I have not received a response regarding my request. It seems the email address is not valid now. Could you please provide an update on the status of my request or any additional information on the process?
Should I fit the MANO model to the provided mesh vertices and extract the resulting 3D joint locations? Or do you already have this data? Thanks!
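In case it helps, here is one way I imagine extracting joints without a full fit (my own assumption: the annotated meshes follow the standard MANO topology with 778 vertices, so the sparse joint regressor shipped with the official MANO model file can be applied directly; fingertips, if needed, are usually taken from fixed vertex indices):

```python
import pickle
import numpy as np

# MANO_RIGHT.pkl comes from https://mano.is.tue.mpg.de/ (not part of this repo);
# unpickling it requires scipy and chumpy to be installed.
with open("MANO_RIGHT.pkl", "rb") as f:
    mano = pickle.load(f, encoding="latin1")

J_regressor = np.asarray(mano["J_regressor"].todense())  # (16, 778) sparse regressor

def vertices_to_joints(vertices):
    """vertices: (778, 3) mesh vertices -> (16, 3) 3D joint locations."""
    return J_regressor @ vertices
```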
As per the title.
Hi, when I run your program "python download_images.py --vid VIDEO_ID", the error "regex_search: could not find match for (?:v=|/)([0-9A-Za-z_-]{11})" occurs. It seems to be related to pytube, but after I modified it according to the suggestions on GitHub (pytube/pytube#312 (comment)), it still doesn't work, so I cannot get the dataset. On the other hand, "python download_images.py --set train" seems to work. The pytube version I use is 9.6.4. Can you give me some help? Thank you very much!
Hello, I am very interested in your work. However, the dataset request link appears to be invalid. Could you please share the dataset link?
I hope for your reply.
best
Qiu
Hi, I had some questions regarding the iterative fitting performed to create the dataset. After reading the paper, my understanding is that it is split into two separate parts: first the camera parameters and hand orientation are optimized, and then the remaining parameters (pose and shape). I have the following questions:
1. In Section 3 you explain that you optimize the pose, shape, camera translation, and camera scaling. Specifically for the camera parameters, you explain that you initialize them similarly to SMPLify-X. For clarity, does this mean you are estimating the extrinsic parameters (R, t), or just the camera translation (t)? In SMPLify-X the camera translation is initialized under the assumption that the person is standing upright, and similar triangles are used to estimate the depth. Is the equivalent done in your case, but with just the palm joints (the non-MCP joints and the wrist)? (A rough sketch of the initialization I have in mind follows these questions.)
2. Is the camera translation equivalent to the translation of the mesh away from the camera? That is, is T_delta how you translate the mesh away from the camera, and s how you scale to world coordinates? If so, why treat it as a camera translation rather than a mesh translation, and does the difference even matter?
3. How do you initialize the camera's intrinsic parameters? Are the camera center and focal length assumed to be known? How can this give a good estimate for unconstrained in-the-wild images, or are you using a weak-perspective camera model?
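For reference, here is the similar-triangles initialization I have in mind for question 1 (my own understanding of the SMPLify-style scheme, not necessarily what the authors do; the joint selection and focal length are placeholders):

```python
import numpy as np

def init_camera_depth(joints_3d, joints_2d, focal_length=5000.0):
    """Similar-triangles initialization of the camera depth (z translation).

    joints_3d: (K, 3) model joints of the rest-pose hand (e.g. the palm joints)
    joints_2d: (K, 2) corresponding detected keypoints, in pixels
    Assumes the selected joints are roughly fronto-parallel to the camera.
    """
    d3 = np.linalg.norm(joints_3d[:, None] - joints_3d[None, :], axis=-1)  # metric distances
    d2 = np.linalg.norm(joints_2d[:, None] - joints_2d[None, :], axis=-1)  # pixel distances
    valid = d2 > 1e-6
    return focal_length * np.mean(d3[valid] / d2[valid])
```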
Great work on this and I appreciate the help!
Hi,
Thanks for the awesome work. I have another healthcare dataset that has a large domain shift compared to the datasets available in the wild.
I am wondering how to generate the GT meshes for my video to train the network as you did.
Are the steps the following?
1) Detect the 2D keypoints using OpenPose and crop the hand image.
2) Use a GitHub repo like https://hassony2.github.io/obman.html to estimate the shape and pose parameters from the RGB images.
3) Pass these to MANO to get an initial GT mesh, which will not be good.
4) Keep varying the shape and pose parameters manually, with the above as the initial estimates, until we are satisfied? (Or perhaps replace the manual tuning with a small optimization loop; see the sketch after this list.)
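To make step 4 concrete, here is a rough sketch of what I imagine instead of manual tuning (my own assumption, not the authors' exact pipeline; it uses the manopth MANO layer from https://github.com/hassony2/manopth, which is not part of this repo, and the pose/shape from step 2 would replace the zero initialization):

```python
import torch
from manopth.manolayer import ManoLayer

mano = ManoLayer(mano_root="mano/models", use_pca=False)   # needs the MANO model files

pose = torch.zeros(1, 48, requires_grad=True)    # global rotation + per-joint axis-angle
shape = torch.zeros(1, 10, requires_grad=True)   # MANO shape coefficients
scale = torch.ones(1, requires_grad=True)        # weak-perspective scale
trans = torch.zeros(1, 2, requires_grad=True)    # 2D translation in the image

keypoints_2d = torch.zeros(1, 21, 2)             # replace with OpenPose hand keypoints (pixels)
optimizer = torch.optim.Adam([pose, shape, scale, trans], lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    verts, joints = mano(pose, shape)              # mesh vertices and 21 joints (mm)
    joints_2d = scale * joints[..., :2] + trans    # weak-perspective projection to pixels
    loss = ((joints_2d - keypoints_2d) ** 2).mean()
    loss.backward()
    optimizer.step()
```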
Is this how it should be done, or am I missing something obvious?
I am new to graphics and any help will be greatly appreciated.
Thanks a lot
I think this repo is no longer maintained by the authors, but I hope somebody can help me.
I saw this issue, but I still cannot understand the definition of the mesh vertices in the annotation file.
(x, y) are clearly in image pixel space, but what is the definition of the z coordinate? What unit is it in (mm? cm?)?
It seems the z coordinates are normalized, but how?
Thanks for sharing the great YouTube 3D dataset.
Where do the camera parameters come in when projecting the YouTube 3D vertices from 3D to the pixel plane?
As per the title.
The viz_sample function in load_db.py accepts an optional faces parameter; how do I obtain the faces to pass to the function?
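One possible way to get them (my assumption: the annotated meshes follow the standard MANO topology with 778 vertices, so the triangle list from the official MANO model file can be reused; the model file is not part of this repo):

```python
import pickle
import numpy as np

# MANO_RIGHT.pkl comes from https://mano.is.tue.mpg.de/ and needs scipy/chumpy
# installed to unpickle.
with open("MANO_RIGHT.pkl", "rb") as f:
    mano = pickle.load(f, encoding="latin1")

faces = np.asarray(mano["f"], dtype=np.int64)   # (1538, 3) vertex indices per triangle
# e.g. viz_sample(..., faces=faces)             # hypothetical call matching load_db.py
```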
Nice work. I found that some of the video clips fail to download:
kmtmR5nC0S4
B-0aiNk9bXk
2wtgc5Pl8bA
d8LtOm2cZpk
rzZadl9uy8I
SCJYNApRo08
cheers
yangang
In the JSON file there are two keys, "images" and "annotations". I found that the 'image_id' values in "annotations" are not consistent with "images"; that is, not all entries in "images" are actually referenced, so the JSON file could be processed to drop the unused entries.
Another problem is that not all hands in the given frames have annotations; most frames annotate only one hand. Would it be possible to provide labels for all hands in all frames?
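A quick way to see the mismatch (a small sketch assuming the COCO-style layout described above; the file name is just a placeholder):

```python
import json

with open("youtube_train.json") as f:   # placeholder annotation file name
    db = json.load(f)

image_ids = {img["id"] for img in db["images"]}
annotated = {ann["image_id"] for ann in db["annotations"]}

print("images listed:          ", len(image_ids))
print("images with annotation: ", len(annotated & image_ids))
print("images never referenced:", len(image_ids - annotated))
```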
Hello, great project. I submitted the Dataset Request Form for academic research about a week ago but did not receive any email. I hope for your reply. Email: [email protected]
Hi, thanks for kindly sharing the YouTube 3D dataset.
Regarding the data format you provide, the annotations do not contain hand bounding box information, so I am curious how you cropped the hand region from the original YouTube videos. Did you just use OpenPose to detect keypoints and take the tightest bounding box enclosing all hand joints, or did you use a hand detection algorithm to crop the hand region?
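For what it's worth, the heuristic I have in mind looks like the sketch below (my assumption, not necessarily the authors' method): take the tightest box around the detected 2D keypoints and pad it before cropping.

```python
import numpy as np

def hand_bbox_from_keypoints(keypoints_2d, margin=0.2):
    """keypoints_2d: (21, 2) OpenPose hand keypoints in pixels.

    Returns (x_min, y_min, x_max, y_max) of a padded crop box around the hand.
    """
    x_min, y_min = keypoints_2d.min(axis=0)
    x_max, y_max = keypoints_2d.max(axis=0)
    pad = margin * max(x_max - x_min, y_max - y_min)
    return x_min - pad, y_min - pad, x_max + pad, y_max + pad
```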